Evaluate your agentic RAG sessions in under 10 minutes.
Prerequisites:

- Python 3.12 or later
- uv package manager

Install RAKI:

    uv pip install raki

For HTML reports, install the html extra (quoted so the shell does not treat the brackets as a glob pattern):

    uv pip install 'raki[html]'

For development or analytical metrics (Ragas), install all extras:

    git clone https://github.com/decko/raki.git
    cd raki
    uv sync --python 3.12 --all-extras

Note: The `ragas` extra pulls `scikit-network`, which requires a C++ compiler (g++).

Verify the install:

    uv run raki --help

RAKI organizes metrics into three tiers, each adding more depth:

| Tier | What you need | What you get |
|---|---|---|
| Operational | Nothing (zero config) | First-pass success rate, rework cycles, cost, severity, latency, tokens, self-correction |
| Knowledge | `--docs-path ./docs` | Knowledge gap rate, knowledge miss rate |
| Analytical | `--judge` + LLM credentials | Faithfulness, answer relevancy, context precision, context recall |

Start with operational metrics and add tiers as needed.
Run operational metrics immediately -- no API keys, no docs, no ground truth:

    uv run raki run --manifest raki.yaml

This is the default mode. It computes all seven operational metrics from session transcript data alone:

- First-pass success rate -- % sessions with no rework cycles
- Rework cycles -- mean review-fix iterations per session
- Severity score -- weighted severity of review findings
- Cost / session -- mean USD cost per session
- Self-correction rate -- ratio of rework findings resolved
- Phase execution time -- mean phase time in seconds
- Tokens / phase -- mean tokens per phase
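
To make the first two definitions concrete, here is a minimal Python sketch of how first-pass success rate and mean rework cycles could be derived from per-session rework counts. The `rework_counts` input and both function names are hypothetical illustrations, not RAKI's actual data model or API.

```python
from statistics import mean


def first_pass_success_rate(rework_counts: list[int]) -> float:
    """Fraction of sessions that needed no rework cycles."""
    if not rework_counts:
        raise ValueError("need at least one session")
    return sum(1 for n in rework_counts if n == 0) / len(rework_counts)


def mean_rework_cycles(rework_counts: list[int]) -> float:
    """Mean number of review-fix iterations per session."""
    if not rework_counts:
        raise ValueError("need at least one session")
    return mean(rework_counts)


# Hypothetical sample: four sessions, one of which needed a single rework cycle.
sample = [0, 0, 1, 0]
print(first_pass_success_rate(sample))  # 0.75
print(mean_rework_cycles(sample))       # 0.25
```
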

List the available metrics:

    uv run raki metrics          # table of all metrics
    uv run raki metrics --json   # machine-readable

Run only selected metrics:

    uv run raki run --manifest raki.yaml --metrics cost_efficiency,rework_cycles

Point RAKI at your project documentation to activate knowledge metrics:

    uv run raki run --manifest raki.yaml --docs-path ./docs

Or configure it in your manifest:

    docs:
      path: ./docs
      extensions: [".md", ".rst", ".txt"]

This adds two metrics:

- Knowledge gap rate -- how often rework happens in domains not covered by your docs
- Knowledge miss rate -- how often the agent fails despite having relevant docs
See Knowledge Metrics Reference for details.
Enable LLM-judged retrieval quality metrics with `--judge`:

    # Vertex AI Anthropic (default)
    uv run raki run --manifest raki.yaml --judge

    # Direct Anthropic API
    uv run raki run --manifest raki.yaml --judge --judge-provider anthropic

    # Google AI
    uv run raki run --manifest raki.yaml --judge --judge-provider google

    # LiteLLM (any model via the LiteLLM proxy, e.g. OpenAI)
    uv run raki run --manifest raki.yaml --judge \
      --judge-provider litellm --judge-model gpt-4o

This adds four Ragas-backed metrics:

- Faithfulness -- is the output grounded in retrieved context?
- Answer relevancy -- does the output address the question?
- Context precision -- is the retrieved context relevant? (requires ground truth)
- Context recall -- was all needed context retrieved? (requires ground truth)

Set `ANTHROPIC_API_KEY` for the direct Anthropic API, or configure Google Cloud credentials for Vertex AI.
For LiteLLM, set the appropriate provider credentials (e.g. `OPENAI_API_KEY`) and install the extra:

    uv pip install 'raki[litellm]'

See Analytical Metrics Reference for details.

Instead of passing `--judge-provider` and `--judge-model` on every run, you can
save them in your manifest:

    judge:
      provider: vertex-anthropic
      model: claude-sonnet-4-6

The `--judge` flag is still required to enable analytical metrics; the manifest
just persists which provider and model to use.

RAKI resolves judge provider and model using a 4-tier priority chain:

- CLI flags (`--judge-provider`, `--judge-model`) (highest priority)
- Manifest (`judge.provider`, `judge.model`)
- Environment variables (`RAKI_JUDGE_PROVIDER`, `RAKI_JUDGE_MODEL`)
- Built-in defaults (`vertex-anthropic`, `claude-sonnet-4-6`)

This means a value set in your manifest takes precedence over environment variables, environment variables fill in whatever the manifest leaves unset, and command-line flags override everything.
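
The lookup order listed above can be sketched in a few lines of Python. This is only an illustration of the priority chain as described, not RAKI's implementation: the function name `resolve_judge_setting` and the flattened `manifest` dict are assumptions, while the environment variable names and defaults are taken from the list above.

```python
import os
from typing import Mapping, Optional

# Built-in defaults and environment variable names as listed in the text.
DEFAULTS = {"provider": "vertex-anthropic", "model": "claude-sonnet-4-6"}
ENV_VARS = {"provider": "RAKI_JUDGE_PROVIDER", "model": "RAKI_JUDGE_MODEL"}


def resolve_judge_setting(
    key: str,
    cli_value: Optional[str] = None,
    manifest: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the first value found: CLI flag, manifest, env var, default."""
    env = os.environ if env is None else env
    candidates = (
        cli_value,                  # 1. CLI flag
        (manifest or {}).get(key),  # 2. manifest `judge:` section (hypothetical dict)
        env.get(ENV_VARS[key]),     # 3. environment variable
    )
    for value in candidates:
        if value:
            return value
    return DEFAULTS[key]            # 4. built-in default


manifest = {"provider": "anthropic"}
env = {"RAKI_JUDGE_PROVIDER": "google", "RAKI_JUDGE_MODEL": "gpt-4o"}
print(resolve_judge_setting("provider", None, manifest, env))       # anthropic
print(resolve_judge_setting("model", None, manifest, env))          # gpt-4o
print(resolve_judge_setting("provider", "litellm", manifest, env))  # litellm
print(resolve_judge_setting("model", None, {}, {}))                 # claude-sonnet-4-6
```

In this sketch the manifest's `provider` wins over the environment variable, while `model`, which the manifest does not set, falls through to the environment.
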

Check your manifest and session data without running metrics:

    uv run raki validate --manifest raki.yaml

For a deeper smoke test (adapter loading, ground truth wiring, metric computation against one sample):

    uv run raki validate --manifest raki.yaml --deep

A typical operational run produces:

    Operational Health
    First-pass success rate    0.75
    Rework cycles              0.2
    Severity score             0.39
    Cost / session             $10.93
    Self-correction rate       N/A (no rework findings)
    Phase execution time       142.3s
    Tokens / phase             3,241

Reports are saved as JSON (always) and HTML (when jinja2 is installed). Re-render anytime:

    uv run raki report results/raki-report-20260410T120000.json
    uv run raki report results/raki-report-20260410T120000.json --html report.html

To compare two runs and see metric deltas, direction indicators, and per-session verdict transitions, use the `--diff` option of `raki report`:

    uv run raki report --diff results/before.json results/after.json

For the full before/after comparison workflow (including how to scope manifests, what the diff output shows, and how to gate CI on regressions), see Comparing Runs.
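
As a rough illustration of what a metric delta with a direction indicator means, the sketch below compares two plain dictionaries of metric values. It does not read RAKI's report files, whose JSON schema is not described here. The function name `metric_deltas`, the dict inputs, and the choice of which metrics count as lower-is-better are all assumptions.

```python
# Assumption: for these metrics a lower value is the better one.
LOWER_IS_BETTER = {"rework_cycles"}


def metric_deltas(
    before: dict[str, float], after: dict[str, float]
) -> dict[str, tuple[float, str]]:
    """Map each metric present in both runs to (delta, direction)."""
    rows = {}
    for name in sorted(before.keys() & after.keys()):
        delta = after[name] - before[name]
        if delta == 0:
            direction = "unchanged"
        else:
            # A drop is good for lower-is-better metrics, a rise for the rest.
            improved = (delta < 0) == (name in LOWER_IS_BETTER)
            direction = "improved" if improved else "regressed"
        rows[name] = (delta, direction)
    return rows


# Hypothetical values for two runs.
before = {"first_pass_success_rate": 0.75, "rework_cycles": 1.0}
after = {"first_pass_success_rate": 0.5, "rework_cycles": 0.5}
for name, (delta, direction) in metric_deltas(before, after).items():
    print(f"{name}: {delta:+.2f} ({direction})")
# first_pass_success_rate: -0.25 (regressed)
# rework_cycles: -0.50 (improved)
```
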

Use `--gate` for per-metric quality gates:

    uv run raki run --manifest raki.yaml \
      --gate 'first_pass_success_rate>0.85' \
      --gate 'rework_cycles<1.5' \
      --quiet

See CI Integration Guide for `--gate` syntax, `--fail-on-regression`, exit codes, and full GitHub Actions / GitLab CI examples.
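
To show how a gate expression such as `rework_cycles<1.5` reads, here is a small Python sketch that evaluates it against a dictionary of metric values. It covers only the `>` and `<` forms shown above and is not RAKI's parser; the CI Integration Guide remains the authority on the full syntax. The function name `gate_passes` and the `metrics` dict are hypothetical.

```python
import operator
import re

# Matches "<metric_name><op><number>" where op is ">" or "<".
_GATE = re.compile(r"^\s*([A-Za-z_]\w*)\s*([<>])\s*([0-9]*\.?[0-9]+)\s*$")
_OPS = {">": operator.gt, "<": operator.lt}


def gate_passes(expr: str, metrics: dict[str, float]) -> bool:
    """Return True if the metric named in `expr` satisfies the threshold."""
    match = _GATE.match(expr)
    if match is None:
        raise ValueError(f"unsupported gate expression: {expr!r}")
    name, op, threshold = match.groups()
    if name not in metrics:
        raise KeyError(f"unknown metric: {name}")
    return _OPS[op](metrics[name], float(threshold))


# Values borrowed from the sample operational output earlier in this guide.
metrics = {"first_pass_success_rate": 0.75, "rework_cycles": 0.2}
print(gate_passes("first_pass_success_rate>0.85", metrics))  # False
print(gate_passes("rework_cycles<1.5", metrics))             # True
```
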
- Operational Metrics Reference -- all seven operational metrics in detail
- Knowledge Metrics Reference -- knowledge gap and miss rates
- Analytical Metrics Reference -- Ragas-backed retrieval quality metrics
- CI Integration Guide -- quality gates, regression detection, CI examples
- Results Interpretation Reference -- zone tables and common patterns
- Ground Truth Curation Guide -- writing ground truth for context precision/recall
- Adapter Guide -- integrating custom session formats