Releases: decko/raki
v0.15.0 — Second Glass
What's Changed
- feat(run): incremental evaluation — skip sessions seen in prior runs by @decko in #328
- feat(metrics): split first_pass_success_rate into review-rework vs patch dims by @decko in #333
- fix(deps): pin langchain-community<0.4.2 for ragas ChatVertexAI import by @decko in #337
- fix(cli): JSON serialization default raises TypeError for unexpected types by @decko in #338
- fix(cli): show --fail-on-regression warning for N>2 groups by @decko in #339
- fix(cli): raise UsageError for --until + --group-by before session loading by @decko in #340
- test(report): strengthen assertion in test_tool_call_count_shown_when_present by @decko in #343
- docs(comparing-runs): fix incorrect y-axis description for lower_is_better by @decko in #344
- misc(changelog): correct --before → --since in fragment for #260 by @decko in #345
- docs(comparing-runs): remove duplicate --until section by @decko in #347
- chore(cli): replace bare list annotation with list[RegressionResult] by @decko in #348
- feat(history): match sparkline trends by manifest name field instead of filename by @decko in #349
- fix(report): align CLI and HTML score color thresholds at 0.85 by @decko in #350
- test(report): fix inverted SparklineData direction for lower-is-better metrics by @decko in #353
- fix(report): superseded phases missing and gen-N sorts after submit by @decko in #356
- fix(report): phase status dot color reflects verdict not execution status by @decko in #359
- chore(report): remove dead no-op Jinja2 block in report.html.j2 by @decko in #360
- fix(report): extract phase_dot_class() to eliminate vacuous test assertions by @decko in #361
- test(272): add skipped phase dot coloring coverage to TestPhaseTimelineDotColoring by @decko in #363
- release: v0.15.0 by @decko in #366
Full Changelog: v0.14.0...v0.15.0
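For readers unfamiliar with the behavior referenced in #338, the following is a minimal sketch of a JSON fallback serializer that raises TypeError for unexpected types. It is an illustration of the general pattern, not raki's actual code: the function name json_default and the set of handled types are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def json_default(value: object) -> object:
    """Hypothetical fallback passed to json.dumps(default=...)."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, Path):
        return str(value)
    # Refuse anything unexpected instead of silently stringifying it.
    raise TypeError(
        f"Object of type {type(value).__name__} is not JSON serializable"
    )


payload = {
    "created": datetime(2025, 1, 1, tzinfo=timezone.utc),
    "docs": Path("docs"),
}
encoded = json.dumps(payload, default=json_default)
```

Raising from the fallback keeps unknown objects from leaking into a report as opaque strings; the caller sees the failure at serialization time.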
v0.14.0
What's Changed
- docs: add v0.13.0 gotchas and agent pitfalls to AGENTS.md by @decko in #279
- feat(metrics): persist per-sample metric scores in JSON report by @decko in #290
- fix(report): use exit_code over passed for verify ✓/✗ in HTML by @decko in #292
- feat(cli): detect duplicate raki run and warn before re-evaluation by @decko in #294
- fix(report): hide empty 'Worst N Sessions' and 'Recurring Failures' sections by @decko in #296
- fix(report): severity distribution chart rounding and zone coloring by @decko in #297
- feat(report): show metric context in score cards — thresholds, breakdowns, tooltips by @decko in #301
- fix(report): truncate long transcripts to reduce HTML report file size by @decko in #302
- feat(report): add navigation TOC and sort controls to HTML report by @decko in #303
- feat(report): inline sparklines and delta indicators in HTML score cards by @decko in #304
- feat(trends): HTML trend report with SVG dot charts by @decko in #306
- feat(cohort): date-based session split and diff within a single report by @decko in #310
- feat(cohort): manifest cohort tags and --group-by for multi-cohort comparison by @decko in #314
- fix(report): nav bar alignment, trend indicators, sticky context by @decko in #322
- docs: inline trend indicators and sticky nav context (Tech Preview) by @decko in #324
- release: v0.14.0 by @decko in #326
Full Changelog: v0.13.0...v0.14.0
v0.13.0 — Louche
The louche effect — what was clear turns opaque and reveals hidden structure.
This release is about revealing the signal that was always dissolved in raw output. Session drill-downs now render structured HTML instead of raw JSON dumps. Two new triage metrics measure whether the agent knows what it's doing before it starts writing code.
Features
- triage_calibration metric — predicted complexity vs actual cost (#251)
- file_prediction_accuracy metric — triage files vs actual changes, mean F1 score (#252)
- Structured drill-down sections in HTML report — phases, findings, and metrics in collapsible blocks (#250)
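As a rough illustration of the "mean F1 score" wording for file_prediction_accuracy, here is a minimal sketch that compares predicted and actually changed file sets per session and averages the result. The function names, the handling of empty sets, and the sample paths are assumptions; the shipped metric may differ.

```python
def file_prediction_f1(predicted: set[str], actual: set[str]) -> float:
    """F1 between files named at triage time and files actually changed."""
    if not predicted and not actual:
        return 1.0  # assumption: nothing predicted, nothing changed counts as a match
    overlap = len(predicted & actual)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(sessions: list[tuple[set[str], set[str]]]) -> float:
    """Average the per-session F1 scores; 0.0 when there are no sessions."""
    if not sessions:
        return 0.0
    return sum(file_prediction_f1(p, a) for p, a in sessions) / len(sessions)


sessions = [
    ({"a.py", "b.py"}, {"b.py", "c.py", "d.py"}),  # one of two predictions correct
    ({"cli.py"}, {"cli.py"}),                      # perfect prediction
]
score = mean_f1(sessions)
```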
Bug Fixes
- Phase timeline ordering — canonical pipeline order instead of alphabetical (#249)
- Rework phases highlighted with amber dots instead of green (#249)
- output_structured rendered as formatted HTML: triage approach/risks, plan task list, files changed with M/A/D prefixes, verify verdict with command results, review findings with severity badges (#277)
- Chronological phase sorting with rework interleaving (implement→verify→implement→verify instead of grouped) (#277)
Documentation
- New Comparing Runs guide for raki report --diff workflow (#258)
Stats
- 1581 tests passing
- SODA pipeline cost for this milestone: ~$23
- 16 metrics total (9 operational, 2 knowledge, 4 analytical, 1 experimental)
v0.12.0 — Clear Signals
Make evaluation output transparent and judge configuration effortless. Enrich report headers with project identity and context, persist judge config in the manifest, fix provider bugs, and add Alcove pipeline export support.
Highlights
- Manifest judge config: persist judge.provider and judge.model in your manifest YAML with 4-tier priority resolution (CLI > manifest > env vars > defaults)
- Report header enrichment: project name, session formats, docs path, and judge annotation now appear in HTML reports
- Alcove pipeline adapter: new alcove-pipeline adapter loads multi-step Alcove pipeline exports (run.json + steps/)
- Google provider fixes: async client mismatch (#233), VERTEXAI_PROJECT env var fallback (#231), embeddings SDK path (#245)
- SDK migration: dropped deprecated vertexai._model_garden dependency, removed 28 transitive packages
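The 4-tier priority resolution (CLI > manifest > env vars > defaults) can be pictured with a minimal sketch like the one below. The function name, the RAKI_JUDGE_PROVIDER variable, and the provider values are hypothetical placeholders chosen for illustration, not raki's real identifiers.

```python
import os
from collections.abc import Mapping
from typing import Optional


def resolve_judge_setting(
    cli_value: Optional[str],
    manifest_value: Optional[str],
    env_var: str,
    default: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Return the first non-empty value in CLI > manifest > env var > default order."""
    for candidate in (cli_value, manifest_value, environ.get(env_var)):
        if candidate:
            return candidate
    return default


# Hypothetical names and values, for illustration only.
provider = resolve_judge_setting(
    cli_value=None,
    manifest_value="manifest-provider",
    env_var="RAKI_JUDGE_PROVIDER",
    default="default-provider",
    environ={"RAKI_JUDGE_PROVIDER": "env-provider"},
)
```

Here the manifest value wins because no CLI value was given; the environment variable is consulted only when both higher tiers are empty.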
Full changelog
See CHANGELOG.md for the complete list of changes.
Install / Upgrade
pip install raki==0.12.0
v0.11.0 - Full Recall
What's Changed
- docs: add gotchas #25-#27 and gitignore soda temp files by @decko in #217
- chore: remove stale ty: ignore comments in llm_setup.py by @decko in #224
- fix: increase Alcove adapter DETECT_READ_SIZE to 32KB by @decko in #225
- chore: add SODA session test fixture for adapter integration tests by @decko in #226
- fix: derive rework_cycles from SODA phase generation metadata by @decko in #227
- feat: extend session-schema adapter for full SODA phase coverage by @decko in #228
- feat: add raki import-history command for backfilling evaluation history by @decko in #229
- chore: release v0.11.0 by @decko in #232
Full Changelog: v0.10.0...v0.11.0
v0.10.0
What's Changed
- feat: show judge model name in HTML and CLI report headers (#207) by @decko in #209
- feat: add LiteLLM provider adapter for judge metrics (#103) by @decko in #210
- fix: knowledge_miss_rate on SODA sessions — skip synthesized context (#183) by @decko in #211
- feat: show phase output/transcript in HTML session drill-down (#194) by @decko in #212
- feat: synthesize review findings from Alcove transcripts (#186) by @decko in #213
- feat: wire LiteLLM provider into CLI and embeddings (#208) by @decko in #214
- feat: RAKI as pipeline quality gate (#184) by @decko in #215
- chore: bump version to 0.10.0 by @decko in #216
Full Changelog: v0.9.1...v0.10.0
v0.9.1
What's Changed
- docs: update AGENTS.md with v0.9.0 learnings by @decko in #196
- chore: remove deprecated --no-llm flag (#195) by @decko in #198
- chore: rename skip_llm to skip_judge in report config (#178) by @decko in #199
- feat: add pipeline/orchestrator metadata to session adapter (#175) by @decko in #200
- refactor: extract shared scoring loop from Ragas metrics (#182) by @decko in #201
- fix: Alcove detect() accepts sessions with 'id' instead of 'session_id' (#197) by @decko in #202
- feat: distinguish agent model from judge model in reports (#179) by @decko in #203
- feat: track judge cost per report (#174) by @decko in #204
- feat: metric health checks — detect degenerate and dead metrics (#162) by @decko in #205
- chore: bump version to 0.9.1 by @decko in #206
Full Changelog: v0.9.0...v0.9.1
v0.9.0
What's Changed
- docs: add CI workflow guide with test ticket pattern by @decko in #167
- docs: add diff use cases to CI workflow guide by @decko in #168
- chore: commit soda pipeline config to main by @decko in #177
- fix(ragas): detect and skip instructor#1658 silent-zero scores from Google provider by @decko in #180
- docs: update AGENTS.md with v0.8.0 learnings by @decko in #181
- chore(soda): improve pipeline prompts and commit SODA config by @decko in #185
- feat(report): serialize judge config fields into report JSON (#173) by @decko in #188
- feat(report): warn when judge configs differ in --diff comparison (#187) by @decko in #189
- feat(history): JSONL history log for cross-run tracking (#170) by @decko in #190
- fix(adapters): alcove adapter rework cycle and phase detection (#176) by @decko in #191
- feat(trends): add raki trends command for metric trajectories (#171) by @decko in #192
- chore: bump version to 0.9.0 by @decko in #193
Full Changelog: v0.8.0...v0.9.0
v0.8.0
What's Changed
- docs: add release gating rule — agents must not tag releases by @decko in #128
- chore: cherry-pick v0.7.1 fixes to main by @decko in #154
- fix(gates): round actual values to 4dp in --gate output by @decko in #155
- feat(adapters): support bridge/alcove session format by @decko in #163
- fix(cli): validate --gate metric names early, exit 2 for unknown metrics by @decko in #156
- fix(cli): include knowledge metrics in raki metrics output by @decko in #157
- fix(cli): three-tier section headers and progression nudges by @decko in #158
- fix(cli): use CWD as project root for --docs-path guard by @decko in #159
- fix(cli): add --gate and --require-metric flags to report subcommand by @decko in #160
- fix(metrics): rename first_pass_verify_rate to first_pass_success_rate by @decko in #161
- fix(knowledge): replace loose word overlap with path+word hybrid matcher by @decko in #164
- docs(metrics): rationale & interpretation guide for all metrics by @decko in #165
- chore: bump version to 0.8.0 by @decko in #166
Full Changelog: v0.7.0...v0.8.0
v0.7.1
What's Changed
- fix(deps): pin instructor>=1.0 in ragas extra by @decko in #142
- fix(gates): handle missing metrics in --require-metric gracefully by @decko in #143
- fix(metrics): store N/A metrics as null in JSON instead of 0.0 by @decko in #144
- fix(metrics): use per-domain matching for knowledge miss/gap rates by @decko in #145
- fix(metrics): wire doc chunks as reference_contexts for precision/recall by @decko in #146
- fix(ragas): truncate synthesized contexts and handle max_tokens errors by @decko in #147
- fix(ragas): comprehensive truncation for Ragas text inputs by @decko in #148
- fix(ragas): set max_tokens=4096 on llm_factory calls by @decko in #149
- chore: bump version to 0.7.1 by @decko in #153
Full Changelog: v0.7.0...v0.7.1
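The change in #144 (N/A metrics stored as null instead of 0.0) matters for anything that aggregates scores downstream. A minimal sketch, assuming a simplified report layout with a "metrics" mapping (the layout and the values are made up; only the metric names appear in these notes):

```python
import json
from statistics import mean

# Hypothetical report fragment: None marks a metric that did not apply.
report = {
    "metrics": {
        "first_pass_success_rate": 0.75,
        "knowledge_miss_rate": None,
    }
}

encoded = json.dumps(report)   # None is written as JSON null
decoded = json.loads(encoded)  # and read back as None

# Skip N/A metrics so they do not drag the average toward zero.
applicable = [v for v in decoded["metrics"].values() if v is not None]
average = mean(applicable)
```

Had the N/A metric been stored as 0.0, the same average would have been halved, which is the kind of distortion the null representation avoids.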