Local LLM benchmarking platform for ICT Project Management tasks. Reproducible by design, sustainability-aware, fully open-source.
Empirical evidence shows that ICT project management suffers from recurring inefficiencies in triage, estimation, and learning, and that these inefficiencies hinder effective decision-making.
PUMA is a local-first benchmarking platform for open-weight language models on
ICT Project Management tasks. PUMA runs entirely on your hardware via
Ollama; it never calls an external inference API and
never needs an account or token to evaluate a model. The platform exercises
two production scenarios end to end — issue triage (multi-class
classification on the Jira Social Repository dataset) and effort estimation
(story-point regression on the TAWOS dataset) — plus an experimental
backlog-prioritisation scenario. Every run reports both quality metrics
(F1-macro, accuracy, MAE, MdAE, calibration / ECE, confusion matrix) and a
full sustainability footprint (CO2 grams, energy kWh, tracking mode) via
CodeCarbon. Results are persisted to a local SQLite
database with a bi-temporal schema, so historical runs can be reproduced
bit for bit. Users who want to share their evaluations can publish to the
companion data hub at
pumacp/puma-community with a
single CLI command.
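To make the bi-temporal idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. A bi-temporal table tracks two time axes: valid time (when a measurement applies) and transaction time (when the database recorded it). The table and column names below (`metric_versions`, `valid_at`, `recorded_at`, `superseded_at`) and both helper functions are illustrative assumptions, not PUMA's actual schema, which is managed by Alembic and may look quite different. The point is only that a correction appends a new row instead of overwriting the old one, so an earlier state of the database can still be queried.

```python
import sqlite3

# Hypothetical table for illustration; not PUMA's real schema.
SCHEMA = """
CREATE TABLE metric_versions (
    run_id        TEXT NOT NULL,
    name          TEXT NOT NULL,
    value         REAL NOT NULL,
    valid_at      TEXT NOT NULL,  -- valid time: when the run was measured
    recorded_at   TEXT NOT NULL,  -- transaction time: when this row was written
    superseded_at TEXT            -- transaction time: when a newer row replaced it
);
"""


def record_metric(conn, run_id, name, value, valid_at, now):
    """Append a new version and close the previous one; never overwrite."""
    conn.execute(
        "UPDATE metric_versions SET superseded_at = ? "
        "WHERE run_id = ? AND name = ? AND superseded_at IS NULL",
        (now, run_id, name),
    )
    conn.execute(
        "INSERT INTO metric_versions (run_id, name, value, valid_at, recorded_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (run_id, name, value, valid_at, now),
    )


def metric_as_known_at(conn, run_id, name, known_at):
    """Return the value the database held at the instant `known_at`."""
    row = conn.execute(
        "SELECT value FROM metric_versions "
        "WHERE run_id = ? AND name = ? AND recorded_at <= ? "
        "AND (superseded_at IS NULL OR superseded_at > ?)",
        (run_id, name, known_at, known_at),
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# The run was measured once; its stored metric was corrected a day later.
record_metric(conn, "run-1", "f1_macro", 0.61,
              "2026-01-01T09:00:00", "2026-01-01T10:00:00")
record_metric(conn, "run-1", "f1_macro", 0.63,
              "2026-01-01T09:00:00", "2026-01-02T10:00:00")
```

Querying with a `known_at` between the two writes returns the original value, while a later `known_at` returns the corrected one. ISO 8601 timestamps in a fixed format compare correctly as plain strings, which is why no date parsing is needed here.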
- Local-first execution via Ollama. CPU-only and GPU configurations supported on Linux; native Apple Silicon support on macOS.
- Two production scenarios: triage_jira (issue classification) and effort_tawos (story-point estimation), plus experimental prioritization_jira.
- Multi-strategy prompting: zero-shot, zero-shot-CoT, few-shot (k=3 / k=5 / k=8), CoT few-shot, RCOIF, contextual anchoring, EGI, self-consistency.
- Multi-dimensional metrics: F1-macro, accuracy, MAE, MdAE, ECE, per-class breakdown, confusion matrix, Wilcoxon signed-rank pairwise tests.
- Sustainability tracking via CodeCarbon with chip-aware tracking modes on Apple Silicon and Linux.
- 15 hardware profiles spanning CPU-only, GPU-equipped, and Apple Silicon M3 / M4 / M5 generations; 17 supported model tags in the catalog.
- Streamlit dashboard for browsing runs, comparing models, exploring metrics, and publishing results to PUMA Community.
- Reproducible by design: deterministic seed, temperature 0.0, Ollama logprobs API for calibration, predictions-hash integrity check.
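For readers less familiar with the metrics named in the list above, the following dependency-free sketch shows how F1-macro, MAE, MdAE, and ECE are commonly defined. It is not PUMA's metrics engine: the function names are hypothetical, and the equal-width, 10-bin ECE is an assumption, since the binning PUMA uses is not stated here.

```python
from statistics import median


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mae(y_true, y_pred):
    """Mean absolute error, e.g. over story-point estimates."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mdae(y_true, y_pred):
    """Median absolute error; less sensitive to a few large misses than MAE."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


def ece(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins (assumed)."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        error += len(members) / total * abs(avg_conf - accuracy)
    return error
```

In this framing, the triage scenario would feed `f1_macro` and `ece`, while the effort scenario would feed `mae` and `mdae`. A model that reports 0.9 confidence and is always right in that bin contributes a gap of 0.1 to ECE, weighted by the share of predictions in the bin.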
pip install puma-cp # from PyPI
docker pull ghcr.io/pumacp/puma:latest # from GitHub Container Registry

git clone https://github.com/pumacp/puma.git
cd puma
docker compose up -d

Run a benchmark:
docker compose run --rm puma_runner puma run \
--scenario triage_jira \
--model qwen2.5:3b \
--strategy zero_shot \
  --instances 50

Open the dashboard:
docker compose run --rm -p 8501:8501 puma_runner \
streamlit run src/puma/dashboard/app.py
# Then open http://localhost:8501

git clone https://github.com/pumacp/puma.git
cd puma
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
# Install Ollama separately: https://ollama.com/download
puma --help

puma auth login github # store a Personal Access Token (one-off)
puma share-results --dry-run --run-id <id> # preview the payload as a local JSON file
puma share-results --run-id <id> # fork, branch, commit, and open the PR

The tool builds the payload from your local SQLite results, scans for
personal data, signs the integrity hash, and opens the Pull Request on
your behalf against pumacp/puma-community.
The first official community submission is documented end to end in the first-submission write-up.
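How the integrity hash in the publishing flow is computed is not specified here. One common approach, shown purely as an assumption, is to serialize the predictions to canonical JSON (sorted rows, sorted keys, fixed separators) and take a SHA-256 digest, so that the same predictions always yield the same hash regardless of ordering. The function names and the `instance_id` field are hypothetical.

```python
import hashlib
import json


def predictions_hash(predictions):
    """SHA-256 over a canonical JSON encoding of the predictions.

    Sorting rows by instance id and keys alphabetically makes the digest
    independent of row order and of dict insertion order.
    """
    ordered = sorted(predictions, key=lambda p: p["instance_id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(predictions, expected_hash):
    """True when the predictions still match the hash recorded at run time."""
    return predictions_hash(predictions) == expected_hash


# Hypothetical payload rows, for illustration only.
original = [
    {"instance_id": "A-1", "label": "bug"},
    {"instance_id": "A-2", "label": "task"},
]
recorded = predictions_hash(original)
```

With a scheme like this, changing a single label in the payload changes the digest, so a reviewer who recomputes the hash can detect edits made after the run finished.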
The puma entry point exposes a Typer-based hierarchy of commands. The most
useful top-level commands:
- puma preflight: detect hardware capabilities and select an execution profile.
- puma models: read-only sub-group for inspecting the models Ollama already has locally (list / show <name> / recommended). Pulling is delegated to ollama pull <tag> (or docker compose exec puma_ollama ollama pull <tag> in the Compose flow).
- puma run: execute a benchmark for a given scenario / model / strategy.
- puma compare: compare two runs side by side.
- puma validate-baseline: verify reproducibility against a published baseline.
- puma list-runs: show the runs stored in the local SQLite database.
- puma prepare-datasets: fetch and pre-process the supported datasets.
- puma wilcoxon: Wilcoxon signed-rank pairwise comparison.
- puma bias-analysis: gendered-prefix robustness sweep.
- puma generate-plots: render result plots (Sustainability Frontier, reliability diagrams, etc.).
- puma db: inspect or migrate the local results database (migrate, downgrade, history, status).
- puma auth: manage credentials for community publishing (login, status, logout).
- puma share-results: publish a run to PUMA Community.
- puma dashboard: launch the Streamlit dashboard.
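If you want to script several `puma run` invocations, a small wrapper can generate the command lines. The sketch below uses only the four flags shown in the quick-start example (`--scenario`, `--model`, `--strategy`, `--instances`); the helper names `build_run_command` and `build_sweep` are hypothetical and not part of PUMA. It prints the commands rather than executing them.

```python
import itertools
import shlex


def build_run_command(scenario, model, strategy, instances):
    """Return the argv list for one `puma run` invocation."""
    return [
        "puma", "run",
        "--scenario", scenario,
        "--model", model,
        "--strategy", strategy,
        "--instances", str(instances),
    ]


def build_sweep(scenario, models, strategies, instance_counts):
    """One command per (model, strategy, instance count) combination."""
    return [
        build_run_command(scenario, model, strategy, count)
        for model, strategy, count in itertools.product(
            models, strategies, instance_counts
        )
    ]


# Values taken from the quick-start example; only the instance count varies.
commands = build_sweep("triage_jira", ["qwen2.5:3b"], ["zero_shot"], [50, 100])
for argv in commands:
    print(shlex.join(argv))
    # To execute for real: subprocess.run(argv, check=True)
```

Passing an argv list to `subprocess.run` avoids shell-quoting problems with model tags that contain colons or dots. In the Compose flow, the same list would need the `docker compose run --rm puma_runner` prefix shown earlier.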
The platform is organised in layered modules under src/puma/:
- Orchestrator schedules instances against the model under test, applies the chosen prompting strategy, and records per-prediction latency.
- Inference cache keeps runs deterministic by caching (prompt, seed, model) results when the user explicitly opts in.
- Scenarios are pluggable task modules (triage, effort, prioritization), each owning its prompt templates and label space.
- Metrics engine computes performance, calibration, sustainability, and pairwise-test metrics on top of the stored predictions.
- Storage is a SQLite database with a bi-temporal schema (runs, instances, predictions, metrics, emissions, profile_snapshots) managed by SQLAlchemy + Alembic.
- Dashboard is a Streamlit app with eight views (Overview, Model Comparison, Reliability, Robustness, Fairness, Sustainability Frontier, Instance Drill-down, and PUMA Community).
- Community integration composes the data-layer modules with a credential store, a local rate limiter, and a narrow PyGithub wrapper to open Pull Requests against pumacp/puma-community.
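The inference cache described in the list above can be pictured as a lookup keyed on the (prompt, seed, model) triple. The sketch below is an in-memory illustration under that assumption; the class name, method names, and the `generate` callback are hypothetical, and PUMA's real cache may be persistent and structured differently.

```python
import hashlib
import json


class InferenceCache:
    """Opt-in cache keyed on (prompt, seed, model). Hypothetical API."""

    def __init__(self, enabled=False):
        self.enabled = enabled  # off by default: the user must opt in
        self._store = {}
        self.hits = 0

    @staticmethod
    def key(prompt, seed, model):
        """Stable digest of the triple, usable as a dict or database key."""
        payload = json.dumps(
            {"prompt": prompt, "seed": seed, "model": model}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt, seed, model, generate):
        """Return the cached result, or call `generate` and remember it."""
        if not self.enabled:
            return generate(prompt, seed, model)
        cache_key = self.key(prompt, seed, model)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]
        result = generate(prompt, seed, model)
        self._store[cache_key] = result
        return result


calls = []


def fake_generate(prompt, seed, model):
    """Stand-in for a real model call; records each invocation."""
    calls.append((prompt, seed, model))
    return "bug"


cache = InferenceCache(enabled=True)
cache.get_or_compute("Classify: login fails", 42, "qwen2.5:3b", fake_generate)
cache.get_or_compute("Classify: login fails", 42, "qwen2.5:3b", fake_generate)
```

The second call is served from the cache, so `fake_generate` runs only once. Changing the seed or the model produces a different key and therefore a fresh call, which is what keeps cached results from leaking across configurations.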
puma/
├── .github/workflows/ # CI: lint-and-test, smoke, release
├── alembic/ # Database migrations
├── assets/img/ # Logo and visual assets
├── config/ # Hardware profiles and model catalog
├── data/ # SQLite database and cache (gitignored)
├── docs/ # Internal documentation
├── scripts/ # Helper scripts
├── src/puma/ # Python source
│ ├── cli.py # Top-level CLI entry point
│ ├── community/ # PUMA Community submission flow
│ ├── dashboard/ # Streamlit dashboard and views
│ ├── orchestrator/ # Run scheduling and run-spec parsing
│ ├── scenarios/ # Task modules (triage, effort, prioritization)
│ ├── metrics/ # Metric computation
│ ├── sustainability/ # CodeCarbon integration
│ ├── preflight/ # Hardware detection and profile selection
│ └── storage/ # SQLite ORM (SQLAlchemy + Alembic)
├── tests/ # pytest suite (unit, integration, smoke, community)
├── CODE_OF_CONDUCT.md # Contributor Covenant v2.1
├── CONTRIBUTING.md # Development guide
├── docker-compose.yml # Docker stack definition
├── Dockerfile # Runner image
├── LICENSE # MIT
├── pyproject.toml # Package metadata and dependencies
└── README.md # This file
- Contributing guide: development setup, tests, commit conventions, PR process. The canonical procedural reference is docs/development-workflow.md (also at https://pumacp.github.io/puma/development-workflow/).
- Technical reference: consolidated architecture + configuration + JSON Schema + ORM + CLI overview + glossary + decisions timeline (also at https://pumacp.github.io/puma/technical_reference/).
- Code of Conduct: Contributor Covenant v2.1.
- PUMA Community: public hub for community-contributed benchmark results.
- Wiki: extended documentation (when populated).
- Releases: semantic-versioned releases and changelogs.
- PUMA benchmark tool — https://github.com/pumacp/puma — local-LLM evaluation engine for ICT Project Management tasks
- PUMA Community — https://github.com/pumacp/puma-community — public archive of community-contributed benchmark results
- PUMA Vault — https://github.com/pumacp/puma-vault — knowledge-management graph of the project
- PUMA docs — https://pumacp.github.io/puma/
- PUMA Vault — https://pumacp.github.io/puma-vault/
- PUMA Community — https://pumacp.github.io/puma-community/ (in setup — Sprint 12 Phase C)
- Wiki (benchmark tool) — https://github.com/pumacp/puma/wiki
- Wiki (community hub) — https://github.com/pumacp/puma-community/wiki
- Organization — https://huggingface.co/pumaproject
- Dataset of community submissions — https://huggingface.co/datasets/pumaproject/puma-community-submissions
- Leaderboard (public Gradio Space) — https://huggingface.co/spaces/pumaproject/puma-leaderboard
- Verifier (private endpoint) — https://huggingface.co/spaces/pumaproject/puma-verifier
- Personal namespace (project datasets) — https://huggingface.co/pumacp
- Zenodo community (production) — https://zenodo.org/communities/pumacp
- Zenodo community (sandbox, for pipeline validation) — https://sandbox.zenodo.org/communities/pumacp
- Source dataset — Jira Social Repository — https://doi.org/10.5281/zenodo.5901893
- Kaggle dataset — https://www.kaggle.com/datasets/pumacp/puma-community-submissions
- Discord — https://discord.gg/fVhcpHREJv
- GitHub Discussions — https://github.com/pumacp/puma-community/discussions
- Contact email — pumacapstoneproject@gmail.com
- Zotero library — https://www.zotero.org/pumacp/library
- Google Drive (PDF repository) — https://drive.google.com/drive/folders/1TKbYhYqLIrq7liAPISF7ztS2Bv0l7vZS?usp=sharing
- ResearchRabbit map 1 — https://app.researchrabbit.ai/folder-shares/d8244f17-47f7-4f6c-a589-473876578b54
- ResearchRabbit map 2 — https://app.researchrabbit.ai/folder-shares/b6c00471-2f28-4c66-85f5-ab5399470228
- Mastodon — @pumacp@fosstodon.org (account creation pending)
- Bluesky — @pumacp.bsky.social (account creation pending)
- Telegram — deferred pending phone-number policy decision
- PUMA Community — companion data repository for community submissions, with auto-validation and outward mirrors to Hugging Face, Zenodo, and Kaggle.
- Ollama — local LLM runtime that PUMA delegates to for all model execution.
- CodeCarbon — sustainability tracking library PUMA uses for energy and emissions reporting.
- Datasets used — Jira Social Repository (Zenodo DOI 5901893) and TAWOS.
If you use PUMA in your work, please cite the repository:
@software{puma_project,
author = {{PUMA Project contributors}},
title = {PUMA: PUMA Understanding & Management w Agents},
url = {https://github.com/pumacp/puma},
version = {2.7.0},
year = {2026}
}

Update the version field to match the tag you used.
PUMA is released under the MIT License. See LICENSE for the
full text. Third-party dependencies retain their own licenses; the
canonical list lives in pyproject.toml.
This project follows the
Contributor Covenant v2.1. Conduct concerns can be
reported privately to pumacapstoneproject@gmail.com.
