
Setup DeepEval Evaluation Framework#56

Open
rafmacalaba wants to merge 38 commits into dev from feat/setup-evals

Conversation

@rafmacalaba (Collaborator) commented Mar 8, 2026

Overview

This PR introduces a robust, persona-driven evaluation framework for the Data360 chatbot, built on DeepEval and optimized for E2E quality assessment, regression testing, and adversarial boundary probing.

Key Features

  • Unified CLI: Single entry point via python -m evals with subcommands for simulation (run), overnight batching (batch), suite management (suite), and comparison (compare).
  • Composable Persona System: Assembles realistic user scenarios from 12k+ combinations of archetypes (Bases), Topics, Countries, and Conversation Patterns.
  • 17+ Scoring Metrics: Comprehensive evaluation across Conversational (Completeness, Hallucination) and Per-turn (Accuracy, Tool Selection, Data Consistency) layers.
  • Dynamic Pre-filtering: Intelligent turn-by-turn metric selection to avoid false failures and optimize LLM-as-judge costs.
  • Modular Architecture: Separate directories for configs/ (YAML configs), tests/ (unit tests), and personas/ (composability facets).
  • Batch & Checkpoint Support: Programmatic suite generation with retry logic and stateful checkpointing for large evaluation sweeps.
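For illustration, the four-axis persona composition described above can be sketched as follows. The facet values here are invented placeholders; the real facets live in YAML files under backend/evals/personas/ and are assembled by persona_composer.py, whose actual API may differ:

```python
import random

# Placeholder facet pools -- the real facets are defined in YAML files
# under backend/evals/personas/.
BASES = ["student", "technical_expert", "country_analyst"]
TOPICS = ["poverty", "trade", "education"]
COUNTRIES = ["Kenya", "Peru"]
PATTERNS = ["drill_down", "compare_then_pivot"]

def compose_persona(rng: random.Random) -> dict:
    """Assemble one scenario by sampling one facet per axis."""
    return {
        "base": rng.choice(BASES),
        "topic": rng.choice(TOPICS),
        "country": rng.choice(COUNTRIES),
        "pattern": rng.choice(PATTERNS),
    }

# The combination space grows multiplicatively with each axis,
# which is how a modest set of facets yields thousands of scenarios.
TOTAL = len(BASES) * len(TOPICS) * len(COUNTRIES) * len(PATTERNS)  # 36 here
```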

CLI Reference

# Run a specific persona
python -m evals run --http --persona student_learning_and_exploration

# Run a random composition
python -m evals run --http --compose-random technical_expert --runs 5

# Generate a sample suite and run an overnight batch
python -m evals suite --sample 20 && python -m evals batch --http

# Compare two evaluation runs
python -m evals compare <TIMESTAMP_A> <TIMESTAMP_B>

Technical Layout

  • backend/evals/__main__.py -- Unified CLI dispatcher.
  • backend/evals/run_conversation_eval.py -- E2E simulation & scoring engine.
  • backend/evals/configs/ -- YAML configurations and environment templates.
  • backend/evals/tests/ -- Unit test suite (26 tests passing).
  • backend/evals/personas/ -- Composable persona facets.

Testing

  • Unit Tests: 26 tests covering the persona composer, per-turn extraction, and metric building.
  • E2E Preflight: Automated verification of Docker stack, Backend API, and MCP connectivity.
  • Live Verification: Multiple evaluation runs completed against the dev backend.

- Add conversation eval suite with 10 base metrics + 7 edge-case metrics
- 12 persona definitions (student, adversarial, economist, etc.)
- ConversationSimulator for dynamic question generation
- Conversation markdown output with collapsible tool calls
- Claim tag transformation to visible checkmarks for review
- Post-hoc LLM conversation review (review_conversation.py)
- Conversation dump utility (dump_conversations.py)
- Persona generator utility (generate_personas.py)
- Single-turn scorecard tests (run_scorecard.py)
- Pipeline runner for non-streaming eval execution
- MVP feature spec and user stories documentation
- Switch MCP default transport from SSE to HTTP
…ck, archive old personas

- New generate_personas.py with --from-docs and --describe modes
- New generate_goldens.py with --enrich post-processing
- 6 new MVP-inferred personas (policy_advisor, journalist, technical_analyst, student, adversarial, policy_researcher)
- Archive old hand-crafted personas and conversations
- Fix HTTP callback: create chat before streaming, correct endpoint /api/v1/chat/stream, text-delta SSE parsing, session caching
- Add eval_config.yaml with chatbot_api_base config
- Add truststore for async SSL fix
- Add eval results and per-run conversation markdowns
…fixes

- Add sequential agent_actions timeline to TurnData, capturing SSE events
  (routing, thinking, tool calls, tool outputs) in exact stream order
- Rewrite SSE parser to build timeline with thinking text flushing on
  stage changes and before tool calls
- Replace separate Routing/Planner/Tool Calls markdown sections with a
  single sequential Agent Actions block matching the chatbot UI
- Add per-turn GEval metrics (context retention, claim consistency)
- Truncate tool output JSON to 2000 chars in markdown, wrap in nested
  collapsible details blocks
- Sanitize markdown --- rules to <hr> inside HTML details blocks
- Remove stale r1/r2 conversation outputs
…e-filtering

- Slim base_metrics to 1 (Conversation Completeness only)
- Move 8 metrics to per_turn_metrics (12 total per-turn)
- Add 'requires' field for intent-based pre-filtering
  (tool_data, data_gap, comparison, technical_terms, prior_context)
- Add 'aggregation' field (min/mean) per metric
- Rewrite _build_condensed_context to return structured dict
- Add _serialize_condensed_context and _select_metrics_for_turn
- Fix aggregation: strip [GEval] suffix, exclude pre-filtered scores
- Remove unused Turn Faithfulness/Relevancy built-in imports
- Add METRICS.md reference documentation
- ~62% reduction in judge calls for typical conversations
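A minimal sketch of how 'requires'-based pre-filtering of this kind can work. The flag and field names follow the commit message above; the actual _select_metrics_for_turn implementation may differ:

```python
def select_metrics_for_turn(turn_flags: dict, metrics: list[dict]) -> list[dict]:
    """Keep a metric only if every flag it requires is true for this turn.

    A metric with no 'requires' field always runs; each skipped metric
    saves one LLM-as-judge call for the turn.
    """
    return [
        m for m in metrics
        if all(turn_flags.get(flag, False) for flag in m.get("requires", []))
    ]

metrics = [
    {"name": "Conversation Completeness"},                       # always runs
    {"name": "Data Consistency", "requires": ["tool_data"]},
    {"name": "Comparison Accuracy", "requires": ["comparison", "tool_data"]},
]
# This turn fetched tool data but made no comparison, so the
# comparison metric is pre-filtered out rather than scored as a failure.
turn_flags = {"tool_data": True, "comparison": False}
selected = select_metrics_for_turn(turn_flags, metrics)
```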
- Add 12 persona conversation results from run_20260305_210102
- Add 12 manual review documents following eval-manual-review skill
- Remove old archive conversations (replaced by timestamped run dir)
- Remove student_learning_and_exploration conversation
- Update eval_config.yaml: remove inline APPLICABILITY pre-filters,
  adjust rubric score ranges for per-turn metrics
- Update adversarial persona config and run_conversation_eval.py
- Fix ruff N806: rename _REQUIRES_MAP to requires_map
- Add .results/, .cache/, logs/, reviews/ to .gitignore
- Add 'LLM Instruction-Following' root cause for student Source Citation
- Annotate Finding §1 sub-cases with per-item root cause labels
- Drop actions #7/#8 (Sources, Tables already in prompt)
- Replace with: standardize Limitations label + monitor instruction-following
- Surface API timeout root cause from journalist_review
- Add cross-references between findings and actions
- Elevate Priority 4 product decision to blockquote alert
- Fill missing curious_citizen turn count
- Extract per_turn_eval.py (657 lines) and metric_builder.py (142 lines)
  from run_conversation_eval.py (2556 -> 1855 lines)
- Rewrite README.md and METRICS.md from scratch
- Add SUMMARY_FINDINGS.md with actionable chatbot/infra findings
- Add test_tool_use_metrics.py (11 unit tests)
- Add compare_eval_runs.py for diffing evaluation runs
- Add 4 new adversarial/regression personas
- Archive outdated docs, old conversation runs, and run_scorecard.py
- Update .gitignore with __pycache__/ and *.pyc
…n preflight

The eval runner no longer probes MCP_SERVER_URL directly. Instead it
authenticates as guest and calls GET /api/v1/mcp/tools on the backend
to verify the full chain (backend -> MCP -> tools).

TODO: replace with a proper /health endpoint that reports MCP status.
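The preflight described above can be sketched in Python as follows. Only GET /api/v1/mcp/tools is confirmed by this commit; the guest-auth endpoint, token field, and response shape are assumptions for illustration:

```python
import requests

def verify_mcp_chain(base_url: str, session=None) -> list:
    """Verify the full backend -> MCP -> tools chain via the backend API.

    NOTE: the /api/v1/auth/guest endpoint and 'access_token' field are
    hypothetical; only the /api/v1/mcp/tools path is stated in the commit.
    """
    s = session or requests.Session()
    auth = s.post(f"{base_url}/api/v1/auth/guest")  # hypothetical endpoint
    auth.raise_for_status()
    token = auth.json()["access_token"]  # hypothetical field name
    resp = s.get(
        f"{base_url}/api/v1/mcp/tools",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    tools = resp.json()
    if not tools:
        raise RuntimeError("Backend reachable, but MCP exposed no tools")
    return tools
```

Accepting an injectable session keeps the check testable without a live stack.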
- conversations_<ts>_<persona>.json (single) or _all.json (multiple)
- conversation_eval_<ts>_<persona>.json
- Replay loader is backward-compatible with old filenames
- Move planning docs to evals/docs/ (mvp_features.md, mvp_user_stories.md, SUMMARY_FINDINGS.md)
- Archive obsolete dump_conversations.py (superseded by _save_conversation_markdown)
- Fix stale imports in test_tool_use_metrics.py (per_turn_eval, not run_conversation_eval)
- README: added high-level overview, two-tier metric explanation,
  typical workflow, full CLI reference, updated file layout
- METRICS: added plain-English descriptions of every metric,
  pre-filter flags reference table, enriched rubric examples
These are superseded scripts, old docs, and stale test harnesses
that are no longer needed.
@rafmacalaba (Collaborator, Author):

#45

The --no-eval path was still using the old format. Now both
save paths include the persona name in the filename.
- Add DEEPEVAL_IMPLEMENTATION.md: full two-part reference covering
  the gist (plain-language overview, score meanings, key concepts)
  and the technical deep-dive (pre-filtering engine, TurnData,
  ConversationSimulator, SSE parsing, aggregation, replay mechanics,
  output structure, and how to add metrics)
- Add PERSONAS.md: composable persona system documentation
- Add persona_composer.py: runtime persona assembly from YAML facets
- Add test_persona_composer.py: unit tests for persona composer
- Update eval_config.yaml: rubric and pre-filter refinements
- Update metric_builder.py, per_turn_eval.py, run_conversation_eval.py:
  per-turn evaluation engine improvements and aggregation fixes
- Update README.md, METRICS.md: team readability improvements
- Remove flat persona YAMLs and conversations from git tracking
  (personas/ and conversations/ are now gitignored for active runs)
- Fix F821: undefined name 'combined' in _evaluate_single_run
- Fix F841: unused variable 'warning_lines' in markdown renderer
@avsolatorio (Owner):

Hello @rafmacalaba, I decided to merge the base branch of this PR; please try rebasing from dev to update this branch. Thanks!

…d findings

- Add regression_suite.yaml with 16-persona fixed suite and known_failures mapping
- Add run_regression.py automated runner with retry logic and report generation
- Add composed persona YAML system (6 bases x 10 topics x 5 regions x 8 patterns)
- Add CONSOLIDATED_FINDING.md covering all 132 eval runs across 5 batches
- Include 10 curated notable conversations (4 resilient, 3 concerning, 3 reviews)
- Fix pipeline_runner.py and review_conversation.py for report deduplication
- Update .gitignore for .deepeval/, _old_evals/, evals/.results/, evals/.cache/
@avsolatorio (Owner):

Hello @rafmacalaba, is this still a draft? Thanks!

…ght batch

- Covers all 6 composed persona bases: advocate_journalist, country_analyst,
  decision_maker, general_public, student, technical_expert
- Each base ran ~17 random topic/region/pattern combinations via run_overnight_batch.sh
- Also covers all 10 flat personas (adversarial_*, journalist_*, policy_*, etc.)
- Each run: conversations JSON, results JSON, markdown review
- Prompt version: feat/setup-evals@d8be3593d
- Pre-commit skipped: large file (1.1MB conversation JSON) and hex system_prompt_hash
  flagged as secret are both expected eval data artifacts
…ults

The original findings (132 runs, 20260312-20260316) still stand.
A second full overnight batch of 132 runs was run on 20260325/26 with
no prompt or system changes. Key results:

- 12 of 18 metrics reached 100% pass rate (up from 3)
- F3 (empty writer): zero occurrences -- was 2, now appears intermittent
- F2 (visualization not generated): 2 failures vs. 7 -- improving
- F4 (disambiguation spurious failures): Data Formatting now at 100%
- Source Citation: improved from 78.8% to 87.1% but still the weakest metric
- All P0-P2 recommendations remain actionable

Document closed with Status: CLOSED and a new Validation Batch section.
@rafmacalaba rafmacalaba requested a review from avsolatorio April 2, 2026 11:29
@rafmacalaba (Collaborator, Author):

> Hello @rafmacalaba, is this still a draft? Thanks!

Finalizing the repo structure, then will un-WIP it.

- Add __main__.py: single entry point via 'python -m evals <command>'
  Commands: run, batch, generate, suite, compare
- Add 'suite' subcommand: programmatic suite.yaml generation from facets
  Supports --full, --sample N, --base, --adversarial-only, --list-facets
- Rename regression_suite.yaml -> suite.yaml
- Delete run_overnight_batch.sh, run_viz_batch.sh (replaced by unified CLI)
- Update README.md with full pipeline documentation
- Move eval_config.yaml, suite.yaml, .env.example to backend/evals/configs/
- Update run_conversation_eval.py, run_regression.py, generate_goldens.py, and __main__.py with new configuration paths
- Update README.md with the new directory structure and configuration links
- Update E2E_GUIDE.md and PERSONAS.md with 'python -m evals' commands
- Fix configuration and test paths in all markdown guides
- Ensure README.md options tables use the new configs/ directory paths
- Delete legacy .cursor rules
- Remove all conversation logs and keep placeholder instructions
- Final linting and path fixes for the unified CLI
@rafmacalaba changed the title from "[WIP] Setup DeepEval Evaluations" to "Setup DeepEval Evaluation Framework" on Apr 2, 2026
@rafmacalaba rafmacalaba marked this pull request as ready for review April 2, 2026 13:10
@rafmacalaba (Collaborator, Author):

This is ready for review, @avsolatorio. Thanks!
