
Validation guide

How to verify what this repo actually does, starting from benchmark data.

What this is

Tool-calling reliability with checked-in benchmark evidence. The work is benchmarked and bounded, and you can verify it yourself.

Trust order

Benchmark artifact: docs/benchmarks/stagepilot-latest.json
API endpoints: /v1/runtime-brief, /v1/summary-pack, /v1/benchmark-summary
These docs: this guide, executive-one-pager.md, solution-architecture.md
Static site / SVG: site/, docs/summary-pack.svg

Sources are listed from most to least trusted. If the docs and the API disagree, trust the benchmark artifact.
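
The trust order can be read as a simple precedence rule. The sketch below is illustrative only: the source labels and the Reading shape are hypothetical names, not taken from this repo's code.

```typescript
// Hypothetical source labels mirroring the trust order above, highest trust first.
const TRUST_ORDER = ["benchmark-artifact", "api", "docs", "static-site"] as const;
type Source = (typeof TRUST_ORDER)[number];

// A value reported by one source, e.g. a success rate in percent.
interface Reading {
  source: Source;
  value: number;
}

// Return the reading from the most trusted source present, or undefined if none.
function mostTrusted(readings: Reading[]): Reading | undefined {
  for (const source of TRUST_ORDER) {
    const hit = readings.find((r) => r.source === source);
    if (hit !== undefined) return hit;
  }
  return undefined;
}

// List the sources whose value differs from the most trusted reading.
function disagreements(readings: Reading[]): Source[] {
  const winner = mostTrusted(readings);
  if (winner === undefined) return [];
  return readings.filter((r) => r.value !== winner.value).map((r) => r.source);
}
```

When a number in the docs or on the static site differs from the artifact, disagreements() names the source to fix, and mostTrusted() gives the value to keep.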

Quick path

Local

pnpm install
pnpm review:proof
pnpm api:stagepilot
# open http://127.0.0.1:8080/demo

Reading order

docs/benchmarks/stagepilot-latest.json
docs/tool-call-reliability-case-study.md
docs/executive-one-pager.md
docs/solution-architecture.md
docs/summary-pack.svg
docs/STAGEPILOT.md for runtime/operator details

API evidence

GET /v1/runtime-brief -- readiness, integrations
GET /v1/summary-pack -- benchmark validation data
GET /v1/benchmark-summary -- success-rate lift, weakest strategy
GET /v1/developer-ops-pack -- dev workflow / release posture
GET /v1/workflow-run-replay -- replay after execution
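
A minimal sketch for pulling all five endpoints in one pass. It assumes the API is running locally as in the quick path and that responses are JSON. Function names are hypothetical. Some endpoints (workflow-run-replay in particular) may need parameters this guide does not document.

```typescript
// Endpoint paths copied from the list above.
const EVIDENCE_PATHS = [
  "/v1/runtime-brief",
  "/v1/summary-pack",
  "/v1/benchmark-summary",
  "/v1/developer-ops-pack",
  "/v1/workflow-run-replay",
] as const;

// Join a base URL with each path, tolerating trailing slashes on the base.
function evidenceUrls(baseUrl: string): string[] {
  const base = baseUrl.replace(/\/+$/, "");
  return EVIDENCE_PATHS.map((path) => `${base}${path}`);
}

// Fetch every endpoint with a caller-supplied getter and key the results by path.
// The getter is injected so the sketch does not depend on a specific HTTP client.
async function collectEvidence(
  baseUrl: string,
  getJson: (url: string) => Promise<unknown>,
): Promise<Record<string, unknown>> {
  const base = baseUrl.replace(/\/+$/, "");
  const out: Record<string, unknown> = {};
  for (const path of EVIDENCE_PATHS) {
    out[path] = await getJson(`${base}${path}`);
  }
  return out;
}
```

On a Node version with a global fetch, a getter such as (url) => fetch(url).then((r) => r.json()) with base http://127.0.0.1:8080 should work once pnpm api:stagepilot is running.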

Current numbers

Checked-in benchmark snapshot:

baseline: 29.17%
parser middleware: 87.50%
bounded retry: 100.00% on the 24-case snapshot
delta vs baseline: +58.33pp parser middleware, +70.83pp bounded retry loop
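
The deltas are plain percentage-point subtraction and can be rechecked by hand. The sketch below uses only the rates listed above. The implied pass counts are an arithmetic inference from the 24-case size, not values read from the artifact, and the function names are hypothetical.

```typescript
// Rates from the checked-in snapshot above, in percent.
const rates = { baseline: 29.17, parserMiddleware: 87.5, boundedRetry: 100.0 };

// Percentage-point delta against the baseline, rounded to two decimals.
function ppDelta(rate: number, baseline: number): number {
  return Math.round((rate - baseline) * 100) / 100;
}

// Passing-case count implied by a rate on a suite of `cases` cases.
// This is an inference from the published rate, not artifact data.
function impliedPasses(ratePercent: number, cases: number): number {
  return Math.round((ratePercent / 100) * cases);
}

console.log(ppDelta(rates.parserMiddleware, rates.baseline)); // 58.33
console.log(ppDelta(rates.boundedRetry, rates.baseline)); // 70.83
console.log(impliedPasses(rates.baseline, 24)); // 7
console.log(impliedPasses(rates.parserMiddleware, 24)); // 21
console.log(impliedPasses(rates.boundedRetry, 24)); // 24
```

The rates are consistent with 7, 21, and 24 passing cases out of 24. Confirm the actual counts in docs/benchmarks/stagepilot-latest.json.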

Honest framing

Strong: parser middleware + benchmark discipline + reviewable orchestration
Reasonable: runtime/API shape is practical and Cloud Run-friendly
Don't oversell: static site and docs are supporting material, not proof of live traffic

Two-minute version

pnpm review:proof
docs/benchmarks/stagepilot-latest.json
/v1/summary-pack
docs/solution-architecture.md