PUMA

Local LLM benchmarking platform for ICT Project Management tasks. Reproducible by design, sustainability-aware, fully open-source.

PUMA Platform
Wiki · Contribute · Issues PUMA · PUMA Community · PUMA Vault

PUMA Info
Youtube · PUMA Wiki · PUMA Community Wiki · NotebookLM · Drive (info)

PUMA Contact
Reddit · Discord · GitHub Discussions · Twitter/X

Following empirical evidence, ICT project management faces triage, estimation, and learning inefficiencies.
Observed widely, these persist despite abundant historical data.
Laying a rigorous foundation requires reproducible benchmarking.
Leveraging labeled datasets enables systematic evaluation of LLM performance.
Outcomes are compared using quantitative metrics and statistical analysis.
With an incremental design, a minimal viable benchmark is defined.
Through open-source release, results become reproducible and verifiable.
Hence, the framework supports extensibility across models and tasks.
Eventually, it enables integration into real organizational settings.

Within ICT environments, recurring inefficiencies hinder effective decision-making.
Heterogeneous data sources complicate prioritization and estimation processes.
In response, this work builds a reproducible LLM-based benchmark.
The focus is on issue triage and story-point estimation tasks.
Evaluation follows controlled experiments with statistical validation.
Protocols ensure reproducibility through fixed parameters and configurations.
Using carbon tracking, the framework measures energy impact.
Moreover, the MVP delivers a valid and original contribution.
All artefacts are released as open source for replication and extension.

PUMA Community
HF Organization · HF Submissions · HF Leaderboard · Zenodo · Kaggle · Zotero

PUMA Code
PUMA Project · PUMA Community · PUMA Vault

Overview

PUMA is a local-first benchmarking platform for open-weight language models on ICT Project Management tasks. PUMA runs entirely on your hardware via Ollama; it never calls an external inference API and never needs an account or token to evaluate a model. The platform exercises two production scenarios end to end — issue triage (multi-class classification on the Jira Social Repository dataset) and effort estimation (story-point regression on the TAWOS dataset) — plus an experimental backlog-prioritisation scenario. Every run reports both quality metrics (F1-macro, accuracy, MAE, MdAE, calibration / ECE, confusion matrix) and a full sustainability footprint (CO2 grams, energy kWh, tracking mode) via CodeCarbon. Results are persisted to a local SQLite database with a bi-temporal schema so historical runs are reproducible bit-exact. Users who want to share their evaluations can publish to the companion data hub at pumacp/puma-community with a single CLI command.
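As a rough illustration of the quality metrics named above, the following pure-Python sketch computes macro-averaged F1 for a classification scenario and MAE / MdAE for a regression scenario. The function names and sample inputs are hypothetical; this is not PUMA's metrics engine, which may compute these values differently.

```python
from collections import Counter
from statistics import median


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mae(y_true, y_pred):
    """Mean absolute error, e.g. over story-point estimates."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mdae(y_true, y_pred):
    """Median absolute error; less sensitive to a few large misses than MAE."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))
```

Reporting MdAE alongside MAE is useful for story points because a single badly misestimated issue can dominate the mean while leaving the median unchanged.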

Features

Local-first execution via Ollama. CPU-only and GPU configurations supported on Linux; native Apple Silicon support on macOS.
Two production scenarios: triage_jira (issue classification) and effort_tawos (story-point estimation), plus experimental prioritization_jira.
Multi-strategy prompting: zero-shot, zero-shot-CoT, few-shot (k=3 / k=5 / k=8), CoT few-shot, RCOIF, contextual anchoring, EGI, self-consistency.
Multi-dimensional metrics: F1-macro, accuracy, MAE, MdAE, ECE, per-class breakdown, confusion matrix, Wilcoxon signed-rank pairwise tests.
Sustainability tracking via CodeCarbon with chip-aware tracking modes on Apple Silicon and Linux.
15 hardware profiles spanning CPU-only, GPU-equipped, and Apple Silicon M3 / M4 / M5 generations; 17 supported model tags in the catalog.
Streamlit dashboard for browsing runs, comparing models, exploring metrics, and publishing results to PUMA Community.
Reproducible by design: deterministic seed, temperature 0.0, Ollama logprobs API for calibration, predictions-hash integrity check.
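The calibration metric (ECE) in the list above can be sketched as follows. This is a minimal illustration that assumes equal-width confidence bins and a default of 10 bins; the function name, the bin count, and the way confidences are derived from logprobs are all assumptions, not PUMA's confirmed implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece
```

A value of 0.0 means that, within every bin, the stated confidence matches the observed accuracy; larger values indicate over- or under-confidence.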

Quick start

Released packages (available with the v4.0.0 release)

pip install puma-cp                          # from PyPI
docker pull ghcr.io/pumacp/puma:latest       # from GitHub Container Registry

Docker (recommended)

git clone https://github.com/pumacp/puma.git
cd puma
docker compose up -d

Run a benchmark:

docker compose run --rm puma_runner puma run \
  --scenario triage_jira \
  --model qwen2.5:3b \
  --strategy zero_shot \
  --instances 50

Open the dashboard:

docker compose run --rm -p 8501:8501 puma_runner \
  streamlit run src/puma/dashboard/app.py
# Then open http://localhost:8501

Manual install (advanced)

git clone https://github.com/pumacp/puma.git
cd puma
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
# Install Ollama separately: https://ollama.com/download
puma --help

Share your results with the community (optional)

puma auth login github                       # store a Personal Access Token (one-off)
puma share-results --dry-run --run-id <id>   # preview the payload as a local JSON file
puma share-results --run-id <id>             # fork, branch, commit, and open the PR

The tool builds the payload from your local SQLite results, scans for personal data, signs the integrity hash, and opens the Pull Request on your behalf against pumacp/puma-community.
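To illustrate the idea behind an integrity hash over predictions, here is a minimal sketch that hashes a canonical JSON encoding with SHA-256. The function name, the record fields, and the choice of canonical form are hypothetical; PUMA's actual payload format and its signing step are not shown here.

```python
import hashlib
import json


def predictions_hash(predictions):
    """SHA-256 hex digest over a canonical JSON encoding of the predictions."""
    # sort_keys and fixed separators make the encoding independent of
    # dict key order and whitespace, so equal data gives an equal digest.
    canonical = json.dumps(
        predictions, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the digest changes when any prediction changes, a reviewer can recompute it from the submitted predictions and compare it against the value recorded in the payload.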

The first official community submission is documented end to end in the first-submission write-up.

CLI overview

The puma entry point exposes a Typer-based hierarchy of commands. The most useful top-level commands:

puma preflight — detect hardware capabilities and select an execution profile.
puma models — read-only sub-group inspecting the models Ollama already has locally (list / show <name> / recommended). Pulling is delegated to ollama pull <tag> (or docker compose exec puma_ollama ollama pull <tag> in the Compose flow).
puma run — execute a benchmark for a given scenario / model / strategy.
puma compare — compare two runs side by side.
puma validate-baseline — verify reproducibility against a published baseline.
puma list-runs — show the runs stored in the local SQLite database.
puma prepare-datasets — fetch and pre-process the supported datasets.
puma wilcoxon — Wilcoxon signed-rank pairwise comparison.
puma bias-analysis — gendered-prefix robustness sweep.
puma generate-plots — render result plots (Sustainability Frontier, reliability diagrams, etc.).
puma db — inspect or migrate the local results database (migrate, downgrade, history, status).
puma auth — manage credentials for community publishing (login, status, logout).
puma share-results — publish a run to PUMA Community.
puma dashboard — launch the Streamlit dashboard.
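For readers unfamiliar with the test behind puma wilcoxon, the sketch below computes the two signed rank sums that the Wilcoxon signed-rank test is built on, for paired per-instance scores from two runs. It is a simplified illustration: the function name and inputs are hypothetical, it stops at the rank sums without deriving a p-value, and it says nothing about how the PUMA command pairs or scores instances.

```python
def wilcoxon_rank_sums(scores_a, scores_b):
    """Return (W_plus, W_minus) for paired samples, dropping zero differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next absolute difference is tied.
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of 1-based ranks in the tie group
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    w_plus = sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)
    w_minus = sum(rank for rank, diff in zip(ranks, diffs) if diff < 0)
    return w_plus, w_minus
```

The test statistic is usually taken as the smaller of the two sums; when it is small relative to the number of non-zero pairs, one run is consistently ahead of the other.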

Architecture

The platform is organised in layered modules under src/puma/:

Orchestrator schedules instances against the model under test, applies the chosen prompting strategy, and records per-prediction latency.
Inference cache keeps runs deterministic by caching (prompt, seed, model) results when the user explicitly opts in.
Scenarios are pluggable task modules — triage, effort, prioritization — each owning its prompt templates and label space.
Metrics engine computes performance, calibration, sustainability, and pairwise-test metrics on top of the stored predictions.
Storage is a SQLite database with a bi-temporal schema (runs, instances, predictions, metrics, emissions, profile_snapshots) managed by SQLAlchemy + Alembic.
Dashboard is a Streamlit app with eight views (Overview, Model Comparison, Reliability, Robustness, Fairness, Sustainability Frontier, Instance Drill-down, and PUMA Community).
Community integration composes the data-layer modules with a credential store, a local rate limiter, and a narrow PyGithub wrapper to open Pull Requests against pumacp/puma-community.
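The opt-in inference cache described above can be pictured with the following sketch, which keys stored results on (prompt, seed, model). The class name, method names, and in-memory dictionary are hypothetical and chosen for illustration only; the real module under src/puma/ may persist entries and structure its keys differently.

```python
import hashlib
import json


class InferenceCache:
    """In-memory cache keyed on (prompt, seed, model); disabled unless opted in."""

    def __init__(self, enabled=False):
        self.enabled = enabled
        self._store = {}

    @staticmethod
    def key(prompt, seed, model):
        # Hash a canonical encoding so long prompts give fixed-size keys.
        raw = json.dumps({"prompt": prompt, "seed": seed, "model": model}, sort_keys=True)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt, seed, model, generate):
        """Return the cached result, calling generate(prompt) only on a miss."""
        if not self.enabled:
            return generate(prompt)
        cache_key = self.key(prompt, seed, model)
        if cache_key not in self._store:
            self._store[cache_key] = generate(prompt)
        return self._store[cache_key]
```

Including the seed and model tag in the key means a cached answer is only reused for an identical request, so changing either one triggers a fresh inference.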

Repository structure

puma/
├── .github/workflows/    # CI: lint-and-test, smoke, release
├── alembic/              # Database migrations
├── assets/img/           # Logo and visual assets
├── config/               # Hardware profiles and model catalog
├── data/                 # SQLite database and cache (gitignored)
├── docs/                 # Internal documentation
├── scripts/              # Helper scripts
├── src/puma/             # Python source
│   ├── cli.py            # Top-level CLI entry point
│   ├── community/        # PUMA Community submission flow
│   ├── dashboard/        # Streamlit dashboard and views
│   ├── orchestrator/     # Run scheduling and run-spec parsing
│   ├── scenarios/        # Task modules (triage, effort, prioritization)
│   ├── metrics/          # Metric computation
│   ├── sustainability/   # CodeCarbon integration
│   ├── preflight/        # Hardware detection and profile selection
│   └── storage/          # SQLite ORM (SQLAlchemy + Alembic)
├── tests/                # pytest suite (unit, integration, smoke, community)
├── CODE_OF_CONDUCT.md    # Contributor Covenant v2.1
├── CONTRIBUTING.md       # Development guide
├── docker-compose.yml    # Docker stack definition
├── Dockerfile            # Runner image
├── LICENSE               # MIT
├── pyproject.toml        # Package metadata and dependencies
└── README.md             # This file

Documentation

Contributing guide — development setup, tests, commit conventions, PR process. The canonical procedural reference is docs/development-workflow.md (also at https://pumacp.github.io/puma/development-workflow/).
Technical reference — consolidated architecture + configuration + JSON Schema + ORM + CLI overview + glossary + decisions timeline (also at https://pumacp.github.io/puma/technical_reference/).
Code of Conduct — Contributor Covenant v2.1.
PUMA Community — public hub for community-contributed benchmark results.
Wiki — extended documentation (when populated).
Releases — semantic-versioned releases and changelogs.

Project resources

Code repositories

PUMA benchmark tool — https://github.com/pumacp/puma — local-LLM evaluation engine for ICT Project Management tasks
PUMA Community — https://github.com/pumacp/puma-community — public archive of community-contributed benchmark results
PUMA Vault — https://github.com/pumacp/puma-vault — knowledge-management graph of the project

Documentation sites (GitHub Pages)

PUMA docs — https://pumacp.github.io/puma/
PUMA Vault — https://pumacp.github.io/puma-vault/
PUMA Community — https://pumacp.github.io/puma-community/ (in setup — Sprint 12 Phase C)
Wiki (benchmark tool) — https://github.com/pumacp/puma/wiki
Wiki (community hub) — https://github.com/pumacp/puma-community/wiki

Hugging Face Hub

Organization — https://huggingface.co/pumaproject
Dataset of community submissions — https://huggingface.co/datasets/pumaproject/puma-community-submissions
Leaderboard (public Gradio Space) — https://huggingface.co/spaces/pumaproject/puma-leaderboard
Verifier (private endpoint) — https://huggingface.co/spaces/pumaproject/puma-verifier
Personal namespace (project datasets) — https://huggingface.co/pumacp

Persistent archives & DOIs

Zenodo community (production) — https://zenodo.org/communities/pumacp
Zenodo community (sandbox, for pipeline validation) — https://sandbox.zenodo.org/communities/pumacp
Source dataset — Jira Social Repository — https://doi.org/10.5281/zenodo.5901893

Planned channels (post-Sprint-12 activation)

Mastodon — @pumacp@fosstodon.org (account creation pending)
Bluesky — @pumacp.bsky.social (account creation pending)
Telegram — deferred pending phone-number policy decision

Related projects

PUMA Community — companion data repository for community submissions, with auto-validation and outward mirrors to Hugging Face, Zenodo, and Kaggle.
Ollama — local LLM runtime that PUMA delegates to for all model execution.
CodeCarbon — sustainability tracking library PUMA uses for energy and emissions reporting.
Datasets used — Jira Social Repository (Zenodo DOI 5901893) and TAWOS.

Citation

If you use PUMA in your work, please cite the repository:

@software{puma_project,
  author  = {{PUMA Project contributors}},
  title   = {PUMA: PUMA Understanding \& Management w Agents},
  url     = {https://github.com/pumacp/puma},
  version = {2.7.0},
  year    = {2026}
}

Update the version field to match the release tag you used.

License

PUMA is released under the MIT License. See LICENSE for the full text. Third-party dependencies retain their own licenses; the canonical list lives in pyproject.toml.

Code of Conduct

This project follows the Contributor Covenant v2.1. Conduct concerns can be reported privately to pumacapstoneproject@gmail.com.
