How to verify what this repo actually does, starting from benchmark data.
Tool-calling reliability with checked-in benchmark evidence. The work is benchmarked, bounded, and you can verify it yourself.
- Benchmark artifact: docs/benchmarks/stagepilot-latest.json
- API endpoints: /v1/runtime-brief, /v1/summary-pack, /v1/benchmark-summary
- These docs: this guide, executive-one-pager.md, solution-architecture.md
- Static site / SVG: site/, docs/summary-pack.svg
If docs and API disagree, trust the benchmark artifact.
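That precedence rule can be sketched as a tiny resolver. This is an illustrative helper, not code from the repo; the source names and the `resolve` function are hypothetical.

```typescript
// Precedence per this guide: the checked-in benchmark artifact wins over
// the API, which wins over the docs. Names here are illustrative.
type Source = "artifact" | "api" | "docs";

function resolve(values: Partial<Record<Source, number>>): number | undefined {
  // Return the highest-precedence value that is present.
  return values.artifact ?? values.api ?? values.docs;
}

// If all three disagree, the artifact's number is the one to report.
const reported = resolve({ docs: 95, api: 98.5, artifact: 100 }); // 100
```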
pnpm install
pnpm review:proof
pnpm api:stagepilot
# open http://127.0.0.1:8080/demo
- docs/benchmarks/stagepilot-latest.json
- docs/tool-call-reliability-case-study.md
- docs/executive-one-pager.md
- docs/solution-architecture.md
- docs/summary-pack.svg
- docs/STAGEPILOT.md for runtime/operator details
- GET /v1/runtime-brief -- readiness, integrations
- GET /v1/summary-pack -- benchmark validation data
- GET /v1/benchmark-summary -- success-rate lift, weakest strategy
- GET /v1/developer-ops-pack -- dev workflow / release posture
- GET /v1/workflow-run-replay -- replay after execution
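A minimal client for one of these endpoints might look like the sketch below. The response field names (`successRateLift`, `weakestStrategy`) are assumptions based on the endpoint description above, not a documented schema; check the live payload before relying on them.

```typescript
// Hypothetical shape of the /v1/benchmark-summary response.
interface BenchmarkSummary {
  successRateLift: number; // percentage points vs baseline (assumed field)
  weakestStrategy: string; // assumed field
}

// Validate an unknown JSON body before trusting it.
function parseBenchmarkSummary(body: unknown): BenchmarkSummary {
  const o = body as Record<string, unknown>;
  if (typeof o?.successRateLift !== "number" || typeof o?.weakestStrategy !== "string") {
    throw new Error("unexpected /v1/benchmark-summary payload");
  }
  return { successRateLift: o.successRateLift, weakestStrategy: o.weakestStrategy };
}

// Uses the global fetch available in Node 18+.
async function fetchBenchmarkSummary(base = "http://127.0.0.1:8080"): Promise<BenchmarkSummary> {
  const res = await fetch(`${base}/v1/benchmark-summary`);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return parseBenchmarkSummary(await res.json());
}
```

Parsing is split from fetching so the validation logic can be exercised without a running server.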
Checked-in benchmark snapshot:
- baseline: 29.17%
- parser middleware: 87.50%
- bounded retry: 100.00% on the 24-case snapshot
- delta vs baseline: +58.33pp middleware, +70.83pp loop
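The snapshot numbers above can be recomputed from raw pass counts. The counts here are inferred (29.17% of 24 cases corresponds to 7 passes, 87.50% to 21), and the helper names are illustrative, not part of the repo.

```typescript
// Recompute the snapshot's success rates and percentage-point deltas.
const CASES = 24; // size of the checked-in snapshot

// Success rate as a percent, rounded to two decimals.
function rate(passed: number, total: number): number {
  return Math.round((passed / total) * 10000) / 100;
}

const baseline = rate(7, CASES);    // 29.17 (inferred 7/24)
const middleware = rate(21, CASES); // 87.5  (inferred 21/24)
const retry = rate(24, CASES);      // 100

// Deltas in percentage points vs baseline, rounded to two decimals.
const middlewareLift = Math.round((middleware - baseline) * 100) / 100; // +58.33pp
const retryLift = Math.round((retry - baseline) * 100) / 100;           // +70.83pp
```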
- Strong: parser middleware + benchmark discipline + reviewable orchestration
- Reasonable: runtime/API shape is practical and Cloud Run-friendly
- Don't oversell: static site and docs are supporting material, not proof of live traffic
- pnpm review:proof
- docs/benchmarks/stagepilot-latest.json
- /v1/summary-pack
- docs/solution-architecture.md