Merged
93 changes: 93 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,96 @@
## [0.15.0] — 2026-05-28

### Breaking Changes

- ``eval-manifest.yaml`` is no longer auto-discovered by ``raki run``. Only
``raki.yaml`` is recognized. Rename any existing ``eval-manifest.yaml`` files
to ``raki.yaml``.

### Features

- Add incremental evaluation mode (``raki run --incremental``).

``raki run --incremental`` (short: ``-i``) now skips sessions that were
already evaluated in a prior run, based on the ``session_ids`` field written
to ``history.jsonl`` after each run. Exit code 2 is returned when there are
no new sessions to evaluate.

``raki run --rerun-all`` evaluates all sessions regardless of history and
suppresses the new deprecation warning that fires when prior session history
exists and neither flag is provided.

Implementation details:

- ``HistoryEntry`` gains a ``session_ids: list[str]`` field (schema-backwards-compatible).
- ``append_history_entry()`` now populates ``session_ids`` from ``report.sample_results``.
- New ``load_seen_session_ids(path, *, manifest=None)`` helper in ``history.py``.
- New ``raki.report.incremental`` module exposes ``filter_new_samples(dataset, seen_ids)``.

(#293)
- Split ``first_pass_success_rate`` into review-rework vs corrective-patch dimensions.

New ``patch_cycles`` field on :class:`SessionMeta` (default ``0``) tracks the
number of verify/CI-triggered corrective iterations — a subset of ``rework_cycles``.

New ``ReviewReworkRate`` metric (``review_rework_rate``) measures the fraction
of sessions that avoided *review-triggered* rework, ignoring CI/verify
corrective patches. Unlike ``FirstPassSuccessRate`` (which counts any rework),
this metric focuses on the human-review feedback loop.

- :class:`SessionMeta` gains ``patch_cycles: int = 0`` (ticket #295).
- :class:`SessionSchemaAdapter` reads ``patch_cycles`` from ``meta.json``.
- :class:`AlcovePipelineAdapter` counts ``verify.Failed`` / ``await-ci.Failed``
triggered corrective steps as ``patch_cycles`` (review-triggered corrective steps are excluded).
- ``ReviewReworkRate`` is registered in ``ALL_OPERATIONAL``, ``METRIC_METADATA``,
and ``OPERATIONAL_METRICS`` so it appears in HTML and CLI reports.

(#295)
- History entries now use the manifest ``name:`` field as the match key for sparkline trends and incremental filtering, instead of the manifest filename. Projects that set ``name: my-project`` in their manifest YAML get a stable, rename-proof identity across runs. Projects without a ``name:`` field continue to use the filename (backward-compatible). (#320)
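
As a rough illustration of the incremental mode described above, the sketch below collects previously seen session IDs from ``history.jsonl``-style lines and filters a candidate list against them. It is not the code in ``history.py`` or ``raki.report.incremental``: the function names ``seen_ids_from_history`` and ``keep_new_sessions`` and the ``run`` key are hypothetical, and the only field taken from the entry above is ``session_ids``.

```python
import json
from collections.abc import Iterable


def seen_ids_from_history(lines: Iterable[str]) -> set[str]:
    """Collect every session ID recorded across JSON Lines history entries."""
    seen: set[str] = set()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the file
        entry = json.loads(line)
        # Entries written before session_ids existed simply contribute nothing.
        seen.update(entry.get("session_ids", []))
    return seen


def keep_new_sessions(session_ids: list[str], seen: set[str]) -> list[str]:
    """Return only the sessions that no prior run has recorded, in order."""
    return [sid for sid in session_ids if sid not in seen]


# Hypothetical history content: one new-style entry, one older entry.
history = [
    '{"run": 1, "session_ids": ["s-001", "s-002"]}',
    '{"run": 0}',
]
seen = seen_ids_from_history(history)
new = keep_new_sessions(["s-001", "s-002", "s-003"], seen)
```

When the filtered list comes back empty, that corresponds to the case where ``raki run --incremental`` exits with code 2.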

### Bug Fixes

- Extract ``phase_dot_class()`` from Jinja2 template into a Python function so
that dot-color logic is unit-tested directly rather than through full-HTML
string matching, which previously matched CSS class definitions in the
``<style>`` block (vacuous assertions). The vacuous ``test_superseded_phase_css_rule_defined``
assertion (``".phase-status-superseded" in content``) is replaced with a
line-level check that confirms the CSS rule body includes ``opacity``. (#271)
- Align CLI and HTML score color thresholds to eliminate the green/yellow inconsistency at the 0.80–0.85 boundary. ``color_for_score()`` in ``cli_summary.py`` now reads from the shared ``ZONE_THRESHOLDS`` constant (green ≥ 0.85) instead of a hard-coded 0.80 cutoff, matching the HTML report's coloring exactly. (#300)
- Fix inverted ``SparklineData`` direction semantics for lower-is-better metrics in the ``_make_report_with_sparklines`` test helper. (#305)
- Replace inline JSON serialization lambdas in the cohort command with a named ``_json_default`` helper that raises ``TypeError`` for unexpected types instead of silently passing them through. (#313)
- Fix ``--fail-on-regression`` notice never being shown when ``--group-by`` produces more than 2 cohorts; the dead ``group_count == 2`` guard has been removed so users always see the "only supported with 2 cohorts" warning. (#315)
- Fix ``--until`` being silently ignored when combined with ``--group-by`` in ``raki cohort``. The mutual-exclusivity check is now performed before session loading so that an empty sessions directory correctly produces exit code 2 (usage error) rather than exit code 1 (no sessions found). (#318)
- HTML report now correctly displays timed-out (superseded) phases. Sessions where a phase
was interrupted by a timeout and restarted at a higher generation (timeout-resume pattern)
now show a synthesised ``superseded`` phase entry in the timeline with a distinct status
dot. The phase timeline is also sorted correctly: post-superseded gen-1 phases (verify,
review, submit) appear after the replacement generation rather than before it. (#319)
- Phase status dots in the HTML report now reflect the structured verdict for verify and review phases. A verify phase with verdict ``FAIL`` shows a red dot even when its execution status is ``completed``; a review phase with verdict ``REWORK`` shows a yellow dot; and ``approve``/``pass``/``pass-with-follow-ups`` show a green dot. Hard execution failures (``failed``, ``skipped``, ``superseded``) still take priority over any verdict. (#325)
- Add ``jinja2`` to the ``dev`` extra so ``ty check`` can resolve the deferred
import in ``html_report.py`` without relying on transitive dependencies.
- Pin ``langchain-community>=0.4,<0.4.2`` in the ragas extra to work around
a broken ``ChatVertexAI`` import in ragas 0.4.3
(upstream: `ragas#2745 <https://github.com/explodinggradients/ragas/issues/2745>`_).
- Review phase detail in HTML reports now shows findings inline when
``output_structured.findings`` is stripped, falling back to the session-level
findings list with severity badges and file locations.
- ``--rerun-all`` now bypasses the duplicate-run detection warning from ``--force``.
Previously it only silenced the incremental deprecation warning.

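For the ``_json_default`` change (#313), a helper of this shape might look like the sketch below. The set of handled types (dates and paths) is an assumption, as is the name ``json_default``; the part taken from the entry is that unexpected types raise ``TypeError`` rather than passing through silently.

```python
import json
from datetime import date, datetime
from pathlib import PurePath


def json_default(value: object) -> str:
    """Fallback serializer for json.dumps(default=...)."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, PurePath):
        return str(value)
    # Fail loudly so unsupported values never end up silently mangled.
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


payload = {"generated": date(2026, 5, 28), "manifest": PurePath("raki.yaml")}
encoded = json.dumps(payload, default=json_default)
```

Passing an unsupported value such as a ``set`` makes ``json.dumps`` surface the ``TypeError``, which is the behavior a named helper makes easy to unit-test.
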
### Documentation

- Fixed incorrect y-axis description for lower_is_better metrics in the comparing-runs doc; higher values always map to higher dot positions. (#307)
- Removed duplicate 'Filtering the compare cohort with --until' section from comparing-runs.md. (#316)

### Internal Changes

- Add skipped-phase dot coloring coverage to ``TestPhaseTimelineDotColoring``: ``test_skipped_phase_has_skipped_dot`` and ``test_verify_pass_verdict_but_skipped_status_gives_skipped_dot`` verify that skipped phases render with the muted ``phase-status-skipped`` CSS class. (#272)
- Strengthen weak assertion in ``test_tool_call_count_shown_when_present`` to verify class and count appear together in the correct HTML element. (#274)
- Remove dead no-op Jinja2 block in ``report.html.j2`` that used a broken ``selectattr`` filter on ``session_id``; the correct ``namespace(value=false)`` loop was already in place. (#275)
- Correct changelog entry for #260: replace '--before DATE' with '--since DATE' in the cohort command description. (#311)
- Replace bare ``list`` annotation with ``list[RegressionResult]`` in ``gate_check`` command; adds ``RegressionResult`` to the ``TYPE_CHECKING`` import block for full generic annotation. (#317)


## [0.14.0] — 2026-05-24

### Breaking Changes
6 changes: 0 additions & 6 deletions changes/271.fix

This file was deleted.

1 change: 0 additions & 1 deletion changes/272.misc

This file was deleted.

1 change: 0 additions & 1 deletion changes/274.misc

This file was deleted.

1 change: 0 additions & 1 deletion changes/275.misc

This file was deleted.

17 changes: 0 additions & 17 deletions changes/293.feature

This file was deleted.

16 changes: 0 additions & 16 deletions changes/295.feature

This file was deleted.

1 change: 0 additions & 1 deletion changes/300.fix

This file was deleted.

1 change: 0 additions & 1 deletion changes/305.fix

This file was deleted.

1 change: 0 additions & 1 deletion changes/307.doc

This file was deleted.

1 change: 0 additions & 1 deletion changes/311.misc

This file was deleted.

1 change: 0 additions & 1 deletion changes/313.fix

This file was deleted.

1 change: 0 additions & 1 deletion changes/315.fix

This file was deleted.

1 change: 0 additions & 1 deletion changes/316.doc

This file was deleted.

1 change: 0 additions & 1 deletion changes/317.misc

This file was deleted.

1 change: 0 additions & 1 deletion changes/318.fix

This file was deleted.

5 changes: 0 additions & 5 deletions changes/319.fix

This file was deleted.

1 change: 0 additions & 1 deletion changes/320.feature

This file was deleted.

1 change: 0 additions & 1 deletion changes/325.fix

This file was deleted.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "raki"
-version = "0.14.0"
+version = "0.15.0"
description = "Retrieval Assessment for Knowledge Impact — evaluate agentic RAG quality"
requires-python = ">=3.12"
license = "Apache-2.0"
2 changes: 1 addition & 1 deletion src/raki/__init__.py
@@ -1 +1 @@
-__version__ = "0.14.0"
+__version__ = "0.15.0"