
feat: v0.2.0 — ROUGE-L, EvalMode tracking, optional LlamaIndex, JSON export#1

Merged

cortexark merged 1 commit into main from feat/v0.2.0-rouge-l-evalmode-export on Apr 8, 2026
Conversation

@cortexark (Owner)

Summary

  • Add pure-Python ROUGE-L metric (LCS-based, zero deps, 14 tests)
  • Add EvalMode enum (llm_judge | heuristic | none) tracked in every GenerationMetrics result
  • Make LlamaIndex an optional dependency (pip install rageval[llm]) — core needs only pydantic/structlog/duckdb
  • Per-metric error isolation in LLM-as-Judge (partial results instead of total failure)
  • JSON export from ResultStore (export_json, list_runs)
  • More query filters: min_faithfulness, min_rouge_l, eval_mode
  • Bump to v0.2.0: 126 tests passing, ruff clean, mypy strict clean
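
Making LlamaIndex optional as described above is typically done with a PEP 621 extras group. A plausible pyproject.toml fragment (the exact version pins are illustrative, not taken from the repo):

```toml
[project]
name = "rageval"
version = "0.2.0"
# core dependencies only — LlamaIndex is not required here
dependencies = ["pydantic", "structlog", "duckdb"]

[project.optional-dependencies]
# installed via: pip install rageval[llm]
llm = ["llama-index"]
```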

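The ROUGE-L metric above is LCS-based with zero dependencies. A minimal pure-Python sketch of the idea (not the package's actual implementation) computes the longest common subsequence of the token sequences, then the F1 of LCS precision and recall:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over whitespace tokens; returns 0.0 on empty input."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Identical strings score 1.0; disjoint token sets score 0.0.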
Test plan

  • 126/126 tests passing (1.68s)
  • ruff lint clean
  • mypy strict clean (14 source files)
  • New test files: test_rouge.py (14 tests), extended test_generation_metrics.py, test_models.py, test_storage.py
  • All existing E2E customer scenarios still pass

…, JSON export

- Add pure-Python ROUGE-L implementation (LCS-based, zero deps, 14 tests)
- Add EvalMode enum (llm_judge | heuristic | none) tracked in every result
- Make LlamaIndex optional: core needs only pydantic/structlog/duckdb
- Per-metric error isolation in LLM-as-Judge (partial results > nothing)
- JSON export from ResultStore (export_json, list_runs)
- More query filters (faithfulness, rouge_l, eval_mode)
- Bump to v0.2.0, 126 tests passing, ruff + mypy strict clean
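
Per-metric error isolation ("partial results > nothing") can be sketched as below. The function name and signature are hypothetical; the point is that each metric runs in its own try/except so one judge failure does not discard the others:

```python
from typing import Callable


def run_metrics(
    metric_fns: dict[str, Callable[[], float]],
) -> tuple[dict[str, float], dict[str, str]]:
    """Run each metric independently; collect scores and per-metric errors."""
    scores: dict[str, float] = {}
    errors: dict[str, str] = {}
    for name, fn in metric_fns.items():
        try:
            scores[name] = fn()
        except Exception as exc:  # isolate the failure, keep evaluating
            errors[name] = f"{type(exc).__name__}: {exc}"
    return scores, errors
```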
@cortexark cortexark merged commit 4d67b83 into main Apr 8, 2026
3 of 4 checks passed
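
The export_json and list_runs methods named above might be exercised roughly like this. This is a simplified in-memory stand-in (the real ResultStore is DuckDB-backed per the dependency list), so method bodies are illustrative:

```python
import json
from pathlib import Path


class ResultStore:
    """In-memory sketch of a result store with JSON export."""

    def __init__(self) -> None:
        self._runs: dict[str, list[dict]] = {}

    def add(self, run_id: str, record: dict) -> None:
        self._runs.setdefault(run_id, []).append(record)

    def list_runs(self) -> list[str]:
        # sorted for deterministic output
        return sorted(self._runs)

    def export_json(self, run_id: str, path: str) -> None:
        Path(path).write_text(json.dumps(self._runs.get(run_id, []), indent=2))
```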