
Setup DeepEval Evaluation Framework#56

Open
rafmacalaba wants to merge 38 commits into dev from feat/setup-evals

Conversation

@rafmacalaba (Collaborator) commented Mar 8, 2026

Overview

This PR introduces a robust, persona-driven evaluation framework for the Data360 chatbot, built on DeepEval and optimized for E2E quality assessment, regression testing, and adversarial boundary probing.

Key Features

  • Unified CLI: Single entry point via python -m evals with subcommands for simulation (run), overnight batching (batch), suite management (suite), and comparison (compare).
  • Composable Persona System: Assembles realistic user scenarios from 12k+ combinations of archetypes (Bases), Topics, Countries, and Conversation Patterns.
  • 17+ Scoring Metrics: Comprehensive evaluation across Conversational (Completeness, Hallucination) and Per-turn (Accuracy, Tool Selection, Data Consistency) layers.
  • Dynamic Pre-filtering: Intelligent turn-by-turn metric selection to avoid false failures and optimize LLM-as-judge costs.
  • Modular Architecture: Separate directories for configs/ (YAML configs), tests/ (unit tests), and personas/ (composability facets).
  • Batch & Checkpoint Support: Programmatic suite generation with retry logic and stateful checkpointing for large evaluation sweeps.
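For illustration, the four-axis persona composition described above can be sketched as follows. The facet values here are invented placeholders; the real facets live in YAML files under backend/evals/personas/ and are assembled by persona_composer.py, whose actual API may differ:

```python
import random

# Placeholder facet pools -- the real facets are defined in YAML files
# under backend/evals/personas/.
BASES = ["student", "technical_expert", "country_analyst"]
TOPICS = ["poverty", "trade", "education"]
COUNTRIES = ["Kenya", "Peru"]
PATTERNS = ["drill_down", "compare_then_pivot"]

def compose_persona(rng: random.Random) -> dict:
    """Assemble one scenario by sampling one facet per axis."""
    return {
        "base": rng.choice(BASES),
        "topic": rng.choice(TOPICS),
        "country": rng.choice(COUNTRIES),
        "pattern": rng.choice(PATTERNS),
    }

# The combination space grows multiplicatively with each axis,
# which is how a modest set of facets yields thousands of scenarios.
TOTAL = len(BASES) * len(TOPICS) * len(COUNTRIES) * len(PATTERNS)  # 36 here
```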

CLI Reference

# Run a specific persona
python -m evals run --http --persona student_learning_and_exploration

# Run a random composition
python -m evals run --http --compose-random technical_expert --runs 5

# Generate a sample suite and run an overnight batch
python -m evals suite --sample 20 && python -m evals batch --http

# Compare two evaluation runs
python -m evals compare <TIMESTAMP_A> <TIMESTAMP_B>

Technical Layout

  • backend/evals/__main__.py -- Unified CLI dispatcher.
  • backend/evals/run_conversation_eval.py -- E2E simulation & scoring engine.
  • backend/evals/configs/ -- YAML configurations and environment templates.
  • backend/evals/tests/ -- Unit test suite (26 tests passing).
  • backend/evals/personas/ -- Composable persona facets.

Testing

  • Unit Tests: 26 tests covering the persona composer, per-turn extraction, and metric building.
  • E2E Preflight: Automated verification of Docker stack, Backend API, and MCP connectivity.
  • Live Verification: Multiple evaluation runs completed against the dev backend.

- Add conversation eval suite with 10 base metrics + 7 edge-case metrics
- 12 persona definitions (student, adversarial, economist, etc.)
- ConversationSimulator for dynamic question generation
- Conversation markdown output with collapsible tool calls
- Claim tag transformation to visible checkmarks for review
- Post-hoc LLM conversation review (review_conversation.py)
- Conversation dump utility (dump_conversations.py)
- Persona generator utility (generate_personas.py)
- Single-turn scorecard tests (run_scorecard.py)
- Pipeline runner for non-streaming eval execution
- MVP feature spec and user stories documentation
- Switch MCP default transport from SSE to HTTP
…ck, archive old personas

- New generate_personas.py with --from-docs and --describe modes
- New generate_goldens.py with --enrich post-processing
- 6 new MVP-inferred personas (policy_advisor, journalist, technical_analyst, student, adversarial, policy_researcher)
- Archive old hand-crafted personas and conversations
- Fix HTTP callback: create chat before streaming, correct endpoint /api/v1/chat/stream, text-delta SSE parsing, session caching
- Add eval_config.yaml with chatbot_api_base config
- Add truststore for async SSL fix
- Add eval results and per-run conversation markdowns
…fixes

- Add sequential agent_actions timeline to TurnData, capturing SSE events
  (routing, thinking, tool calls, tool outputs) in exact stream order
- Rewrite SSE parser to build timeline with thinking text flushing on
  stage changes and before tool calls
- Replace separate Routing/Planner/Tool Calls markdown sections with a
  single sequential Agent Actions block matching the chatbot UI
- Add per-turn GEval metrics (context retention, claim consistency)
- Truncate tool output JSON to 2000 chars in markdown, wrap in nested
  collapsible details blocks
- Sanitize markdown --- rules to <hr> inside HTML details blocks
- Remove stale r1/r2 conversation outputs
…e-filtering

- Slim base_metrics to 1 (Conversation Completeness only)
- Move 8 metrics to per_turn_metrics (12 total per-turn)
- Add 'requires' field for intent-based pre-filtering
  (tool_data, data_gap, comparison, technical_terms, prior_context)
- Add 'aggregation' field (min/mean) per metric
- Rewrite _build_condensed_context to return structured dict
- Add _serialize_condensed_context and _select_metrics_for_turn
- Fix aggregation: strip [GEval] suffix, exclude pre-filtered scores
- Remove unused Turn Faithfulness/Relevancy built-in imports
- Add METRICS.md reference documentation
- ~62% reduction in judge calls for typical conversations
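A minimal sketch of how 'requires'-based pre-filtering of this kind can work. The flag and field names follow the commit message above; the actual _select_metrics_for_turn implementation may differ:

```python
def select_metrics_for_turn(turn_flags: dict, metrics: list[dict]) -> list[dict]:
    """Keep a metric only if every flag it requires is true for this turn.

    A metric with no 'requires' field always runs; each skipped metric
    saves one LLM-as-judge call for the turn.
    """
    return [
        m for m in metrics
        if all(turn_flags.get(flag, False) for flag in m.get("requires", []))
    ]

metrics = [
    {"name": "Conversation Completeness"},                       # always runs
    {"name": "Data Consistency", "requires": ["tool_data"]},
    {"name": "Comparison Accuracy", "requires": ["comparison", "tool_data"]},
]
# This turn fetched tool data but made no comparison, so the
# comparison metric is pre-filtered out rather than scored as a failure.
turn_flags = {"tool_data": True, "comparison": False}
selected = select_metrics_for_turn(turn_flags, metrics)
```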
- Add 12 persona conversation results from run_20260305_210102
- Add 12 manual review documents following eval-manual-review skill
- Remove old archive conversations (replaced by timestamped run dir)
- Remove student_learning_and_exploration conversation
- Update eval_config.yaml: remove inline APPLICABILITY pre-filters,
  adjust rubric score ranges for per-turn metrics
- Update adversarial persona config and run_conversation_eval.py
- Fix ruff N806: rename _REQUIRES_MAP to requires_map
- Add .results/, .cache/, logs/, reviews/ to .gitignore
- Add 'LLM Instruction-Following' root cause for student Source Citation
- Annotate Finding §1 sub-cases with per-item root cause labels
- Drop actions #7/#8 (Sources, Tables already in prompt)
- Replace with: standardize Limitations label + monitor instruction-following
- Surface API timeout root cause from journalist_review
- Add cross-references between findings and actions
- Elevate Priority 4 product decision to blockquote alert
- Fill missing curious_citizen turn count
- Extract per_turn_eval.py (657 lines) and metric_builder.py (142 lines)
  from run_conversation_eval.py (2556 -> 1855 lines)
- Rewrite README.md and METRICS.md from scratch
- Add SUMMARY_FINDINGS.md with actionable chatbot/infra findings
- Add test_tool_use_metrics.py (11 unit tests)
- Add compare_eval_runs.py for diffing evaluation runs
- Add 4 new adversarial/regression personas
- Archive outdated docs, old conversation runs, and run_scorecard.py
- Update .gitignore with __pycache__/ and *.pyc
…n preflight

The eval runner no longer probes MCP_SERVER_URL directly. Instead it
authenticates as guest and calls GET /api/v1/mcp/tools on the backend
to verify the full chain (backend -> MCP -> tools).

TODO: replace with a proper /health endpoint that reports MCP status.
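The preflight described above can be sketched in Python as follows. Only GET /api/v1/mcp/tools is confirmed by this commit; the guest-auth endpoint, token field, and response shape are assumptions for illustration:

```python
import requests

def verify_mcp_chain(base_url: str, session=None) -> list:
    """Verify the full backend -> MCP -> tools chain via the backend API.

    NOTE: the /api/v1/auth/guest endpoint and 'access_token' field are
    hypothetical; only the /api/v1/mcp/tools path is stated in the commit.
    """
    s = session or requests.Session()
    auth = s.post(f"{base_url}/api/v1/auth/guest")  # hypothetical endpoint
    auth.raise_for_status()
    token = auth.json()["access_token"]  # hypothetical field name
    resp = s.get(
        f"{base_url}/api/v1/mcp/tools",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    tools = resp.json()
    if not tools:
        raise RuntimeError("Backend reachable, but MCP exposed no tools")
    return tools
```

Accepting an injectable session keeps the check testable without a live stack.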
- conversations_<ts>_<persona>.json (single) or _all.json (multiple)
- conversation_eval_<ts>_<persona>.json
- Replay loader is backward-compatible with old filenames
- Move planning docs to evals/docs/ (mvp_features.md, mvp_user_stories.md, SUMMARY_FINDINGS.md)
- Archive obsolete dump_conversations.py (superseded by _save_conversation_markdown)
- Fix stale imports in test_tool_use_metrics.py (per_turn_eval, not run_conversation_eval)
- README: added high-level overview, two-tier metric explanation,
  typical workflow, full CLI reference, updated file layout
- METRICS: added plain-English descriptions of every metric,
  pre-filter flags reference table, enriched rubric examples
These are superseded scripts, old docs, and stale test harnesses
that are no longer needed.
@rafmacalaba (Collaborator, Author):

#45

The --no-eval path was still using the old format. Now both
save paths include the persona name in the filename.
- Add DEEPEVAL_IMPLEMENTATION.md: full two-part reference covering
  the gist (plain-language overview, score meanings, key concepts)
  and the technical deep-dive (pre-filtering engine, TurnData,
  ConversationSimulator, SSE parsing, aggregation, replay mechanics,
  output structure, and how to add metrics)
- Add PERSONAS.md: composable persona system documentation
- Add persona_composer.py: runtime persona assembly from YAML facets
- Add test_persona_composer.py: unit tests for persona composer
- Update eval_config.yaml: rubric and pre-filter refinements
- Update metric_builder.py, per_turn_eval.py, run_conversation_eval.py:
  per-turn evaluation engine improvements and aggregation fixes
- Update README.md, METRICS.md: team readability improvements
- Remove flat persona YAMLs and conversations from git tracking
  (personas/ and conversations/ are now gitignored for active runs)
- Fix F821: undefined name 'combined' in _evaluate_single_run
- Fix F841: unused variable 'warning_lines' in markdown renderer
@avsolatorio (Owner):

Hello @rafmacalaba, I decided to merge the base branch of this PR; please try rebasing from dev to update this branch. Thanks!

…d findings

- Add regression_suite.yaml with 16-persona fixed suite and known_failures mapping
- Add run_regression.py automated runner with retry logic and report generation
- Add composed persona YAML system (6 bases x 10 topics x 5 regions x 8 patterns)
- Add CONSOLIDATED_FINDING.md covering all 132 eval runs across 5 batches
- Include 10 curated notable conversations (4 resilient, 3 concerning, 3 reviews)
- Fix pipeline_runner.py and review_conversation.py for report deduplication
- Update .gitignore for .deepeval/, _old_evals/, evals/.results/, evals/.cache/
@avsolatorio (Owner):

Hello @rafmacalaba, is this still a draft? Thanks!

…ght batch

- Covers all 6 composed persona bases: advocate_journalist, country_analyst,
  decision_maker, general_public, student, technical_expert
- Each base ran ~17 random topic/region/pattern combinations via run_overnight_batch.sh
- Also covers all 10 flat personas (adversarial_*, journalist_*, policy_*, etc.)
- Each run: conversations JSON, results JSON, markdown review
- Prompt version: feat/setup-evals@d8be3593d
- Pre-commit skipped: large file (1.1MB conversation JSON) and hex system_prompt_hash
  flagged as secret are both expected eval data artifacts
…ults

The original findings (132 runs, 20260312-20260316) still stand.
A second full overnight batch of 132 runs was run on 20260325/26 with
no prompt or system changes. Key results:

- 12 of 18 metrics reached 100% pass rate (up from 3)
- F3 (empty writer): zero occurrences -- was 2, now appears intermittent
- F2 (visualization not generated): 2 failures vs. 7 -- improving
- F4 (disambiguation spurious failures): Data Formatting now at 100%
- Source Citation: improved from 78.8% to 87.1% but still the weakest metric
- All P0-P2 recommendations remain actionable

Document closed with Status: CLOSED and a new Validation Batch section.
@rafmacalaba rafmacalaba requested a review from avsolatorio April 2, 2026 11:29
@rafmacalaba (Collaborator, Author):

> Hello @rafmacalaba, is this still a draft? Thanks!

Finalizing the repo structure, then will un-WIP it.

- Add __main__.py: single entry point via 'python -m evals <command>'
  Commands: run, batch, generate, suite, compare
- Add 'suite' subcommand: programmatic suite.yaml generation from facets
  Supports --full, --sample N, --base, --adversarial-only, --list-facets
- Rename regression_suite.yaml -> suite.yaml
- Delete run_overnight_batch.sh, run_viz_batch.sh (replaced by unified CLI)
- Update README.md with full pipeline documentation
- Move eval_config.yaml, suite.yaml, .env.example to backend/evals/configs/
- Update run_conversation_eval.py, run_regression.py, generate_goldens.py, and __main__.py with new configuration paths
- Update README.md with the new directory structure and configuration links
- Update E2E_GUIDE.md and PERSONAS.md with 'python -m evals' commands
- Fix configuration and test paths in all markdown guides
- Ensure README.md options tables use the new configs/ directory paths
- Delete legacy .cursor rules
- Remove all conversation logs and keep placeholder instructions
- Final linting and path fixes for the unified CLI
@rafmacalaba changed the title from "[WIP] Setup DeepEval Evaluations" to "Setup DeepEval Evaluation Framework" on Apr 2, 2026
@rafmacalaba rafmacalaba marked this pull request as ready for review April 2, 2026 13:10
@rafmacalaba (Collaborator, Author):

This is ready for review, @avsolatorio. Thanks!
