- Add conversation eval suite with 10 base metrics + 7 edge-case metrics
- 12 persona definitions (student, adversarial, economist, etc.)
- ConversationSimulator for dynamic question generation
- Conversation markdown output with collapsible tool calls
- Claim tag transformation to visible checkmarks for review
- Post-hoc LLM conversation review (review_conversation.py)
- Conversation dump utility (dump_conversations.py)
- Persona generator utility (generate_personas.py)
- Single-turn scorecard tests (run_scorecard.py)
- Pipeline runner for non-streaming eval execution
- MVP feature spec and user stories documentation
- Switch MCP default transport from SSE to HTTP
…ck, archive old personas
- New generate_personas.py with --from-docs and --describe modes
- New generate_goldens.py with --enrich post-processing
- 6 new MVP-inferred personas (policy_advisor, journalist, technical_analyst, student, adversarial, policy_researcher)
- Archive old hand-crafted personas and conversations
- Fix HTTP callback: create chat before streaming, correct endpoint /api/v1/chat/stream, text-delta SSE parsing, session caching
- Add eval_config.yaml with chatbot_api_base config
- Add truststore for async SSL fix
- Add eval results and per-run conversation markdowns
…fixes
- Add sequential agent_actions timeline to TurnData, capturing SSE events (routing, thinking, tool calls, tool outputs) in exact stream order
- Rewrite SSE parser to build the timeline, with thinking text flushed on stage changes and before tool calls
- Replace separate Routing/Planner/Tool Calls markdown sections with a single sequential Agent Actions block matching the chatbot UI
- Add per-turn GEval metrics (context retention, claim consistency)
- Truncate tool output JSON to 2000 chars in markdown, wrap in nested collapsible details blocks
- Sanitize markdown --- rules to <hr> inside HTML details blocks
- Remove stale r1/r2 conversation outputs
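The flush-on-boundary behavior of the rewritten SSE parser can be sketched roughly like this (event shapes here are illustrative, not the actual stream schema):

```python
# Sketch: accumulate streamed thinking text and flush it as a single
# timeline entry whenever a non-thinking event (stage change, tool
# call, tool output) arrives, preserving exact stream order.

def build_timeline(events):
    timeline, thinking = [], []
    for ev in events:
        if ev["type"] == "thinking":
            thinking.append(ev["text"])
        else:
            if thinking:  # flush buffered thinking before the boundary
                timeline.append({"type": "thinking", "text": "".join(thinking)})
                thinking = []
            timeline.append(ev)
    if thinking:  # flush any trailing thinking text
        timeline.append({"type": "thinking", "text": "".join(thinking)})
    return timeline

events = [
    {"type": "thinking", "text": "Need GDP "},
    {"type": "thinking", "text": "data."},
    {"type": "tool_call", "name": "search_indicators"},
]
print([e["type"] for e in build_timeline(events)])
# → ['thinking', 'tool_call']
```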
…e-filtering
- Slim base_metrics to 1 (Conversation Completeness only)
- Move 8 metrics to per_turn_metrics (12 total per-turn)
- Add 'requires' field for intent-based pre-filtering (tool_data, data_gap, comparison, technical_terms, prior_context)
- Add 'aggregation' field (min/mean) per metric
- Rewrite _build_condensed_context to return a structured dict
- Add _serialize_condensed_context and _select_metrics_for_turn
- Fix aggregation: strip [GEval] suffix, exclude pre-filtered scores
- Remove unused Turn Faithfulness/Relevancy built-in imports
- Add METRICS.md reference documentation
- ~62% reduction in judge calls for typical conversations
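For reviewers, the intent-based pre-filtering can be sketched roughly like this (metric names and flag values are illustrative, not the exact eval_config.yaml contents):

```python
# Sketch: a metric with a 'requires' flag is only judged on turns
# whose detected intent flags include it; metrics without 'requires'
# always run. This is what drives the reduction in judge calls.

def select_metrics_for_turn(metrics, turn_flags):
    """Return metric configs applicable to a turn's intent flags."""
    selected = []
    for m in metrics:
        required = m.get("requires")
        if required is None or required in turn_flags:
            selected.append(m)
    return selected

metrics = [
    {"name": "Claim Consistency"},                        # always runs
    {"name": "Tool Correctness", "requires": "tool_data"},
    {"name": "Data Gap Handling", "requires": "data_gap"},
]
print([m["name"] for m in select_metrics_for_turn(metrics, {"tool_data"})])
# → ['Claim Consistency', 'Tool Correctness']
```

A turn with no tool activity and no data gap would run only the unconditional metric, which is where the judge-call savings come from.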
- Add 12 persona conversation results from run_20260305_210102
- Add 12 manual review documents following the eval-manual-review skill
- Remove old archive conversations (replaced by timestamped run dir)
- Remove student_learning_and_exploration conversation
- Update eval_config.yaml: remove inline APPLICABILITY pre-filters, adjust rubric score ranges for per-turn metrics
- Update adversarial persona config and run_conversation_eval.py
- Fix ruff N806: rename _REQUIRES_MAP to requires_map
- Add .results/, .cache/, logs/, reviews/ to .gitignore
- Add 'LLM Instruction-Following' root cause for student Source Citation
- Annotate Finding §1 sub-cases with per-item root cause labels
- Drop actions #7/#8 (Sources, Tables already in prompt)
- Replace with: standardize Limitations label + monitor instruction-following
- Surface API timeout root cause from journalist_review
- Add cross-references between findings and actions
- Elevate Priority 4 product decision to blockquote alert
- Fill in missing curious_citizen turn count
- Extract per_turn_eval.py (657 lines) and metric_builder.py (142 lines) from run_conversation_eval.py (2556 -> 1855 lines)
- Rewrite README.md and METRICS.md from scratch
- Add SUMMARY_FINDINGS.md with actionable chatbot/infra findings
- Add test_tool_use_metrics.py (11 unit tests)
- Add compare_eval_runs.py for diffing evaluation runs
- Add 4 new adversarial/regression personas
- Archive outdated docs, old conversation runs, and run_scorecard.py
- Update .gitignore with __pycache__/ and *.pyc
…often MCP preflight
…n preflight

The eval runner no longer probes MCP_SERVER_URL directly. Instead it authenticates as guest and calls GET /api/v1/mcp/tools on the backend to verify the full chain (backend -> MCP -> tools). TODO: replace with a proper /health endpoint that reports MCP status.
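A rough sketch of that preflight, with the HTTP client injected so it can be exercised without a live backend (the response shape below is an assumption, not the actual API schema):

```python
# Sketch: verify the backend -> MCP -> tools chain by listing tools
# through the backend rather than probing MCP_SERVER_URL directly.
# 'get' is an injected client standing in for e.g. httpx/requests.

def mcp_preflight(get, base_url, token):
    """Return True when the backend can reach MCP and list tools."""
    resp = get(f"{base_url}/api/v1/mcp/tools",
               headers={"Authorization": f"Bearer {token}"})
    return resp.get("status") == 200 and bool(resp.get("tools"))

# Stubbed client for demonstration (hypothetical tool name):
def fake_get(url, headers=None):
    return {"status": 200, "tools": ["search_indicators"]}

print(mcp_preflight(fake_get, "http://localhost:8000", "guest-token"))
# → True
```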
- conversations_<ts>_<persona>.json (single) or _all.json (multiple)
- conversation_eval_<ts>_<persona>.json
- The replay loader is backward-compatible with old filenames
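The new naming convention can be matched with a pattern like this (the timestamp format is assumed from run directory names such as run_20260305_210102, so treat it as illustrative):

```python
import re

# Sketch: parse the per-persona conversation filename convention.
# Timestamp format (YYYYMMDD_HHMMSS) is an assumption.
NEW_PATTERN = re.compile(
    r"conversations_(?P<ts>\d{8}_\d{6})_(?P<persona>\w+)\.json$"
)

name = "conversations_20260305_210102_student.json"
m = NEW_PATTERN.search(name)
print(m.group("ts"), m.group("persona"))
# → 20260305_210102 student
```

A backward-compatible loader would fall back to the old filename shapes when this pattern does not match.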
- Move planning docs to evals/docs/ (mvp_features.md, mvp_user_stories.md, SUMMARY_FINDINGS.md)
- Archive obsolete dump_conversations.py (superseded by _save_conversation_markdown)
- Fix stale imports in test_tool_use_metrics.py (per_turn_eval, not run_conversation_eval)
- README: added high-level overview, two-tier metric explanation, typical workflow, full CLI reference, and updated file layout
- METRICS: added plain-English descriptions of every metric, a pre-filter flags reference table, and enriched rubric examples
These are superseded scripts, old docs, and stale test harnesses that are no longer needed.
Collaborator
Author
The `--no-eval` path was still using the old format. Now both save paths include the persona name in the filename.
- Add DEEPEVAL_IMPLEMENTATION.md: full two-part reference covering the gist (plain-language overview, score meanings, key concepts) and the technical deep dive (pre-filtering engine, TurnData, ConversationSimulator, SSE parsing, aggregation, replay mechanics, output structure, and how to add metrics)
- Add PERSONAS.md: composable persona system documentation
- Add persona_composer.py: runtime persona assembly from YAML facets
- Add test_persona_composer.py: unit tests for the persona composer
- Update eval_config.yaml: rubric and pre-filter refinements
- Update metric_builder.py, per_turn_eval.py, run_conversation_eval.py: per-turn evaluation engine improvements and aggregation fixes
- Update README.md, METRICS.md: team readability improvements
- Remove flat persona YAMLs and conversations from git tracking (personas/ and conversations/ are now gitignored for active runs)
- Fix F821: undefined name 'combined' in _evaluate_single_run
- Fix F841: unused variable 'warning_lines' in markdown renderer
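A minimal sketch of what facet composition in persona_composer.py might look like (the facet keys and name-joining scheme here are hypothetical; the real facets live in the personas/ YAML files):

```python
# Sketch: assemble a runtime persona by merging facet dicts, with a
# deterministic composed name derived from the facet identifiers.

def compose_persona(base, topic, region, pattern):
    """Merge facet dicts into a single persona definition."""
    persona = {**base, **topic, **region, **pattern}
    persona["name"] = "_".join(
        (base["role"], topic["topic"], region["region"], pattern["pattern"])
    )
    return persona

base = {"role": "student", "tone": "curious"}
topic = {"topic": "education"}
region = {"region": "east_asia"}
pattern = {"pattern": "follow_up_heavy"}
print(compose_persona(base, topic, region, pattern)["name"])
# → student_education_east_asia_follow_up_heavy
```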
Owner
Hello @rafmacalaba, I decided to merge the base branch of this PR; try rebasing from dev to update this branch. Thanks!
…d findings
- Add regression_suite.yaml with a 16-persona fixed suite and known_failures mapping
- Add run_regression.py automated runner with retry logic and report generation
- Add composed persona YAML system (6 bases x 10 topics x 5 regions x 8 patterns)
- Add CONSOLIDATED_FINDING.md covering all 132 eval runs across 5 batches
- Include 10 curated notable conversations (4 resilient, 3 concerning, 3 reviews)
- Fix pipeline_runner.py and review_conversation.py for report deduplication
- Update .gitignore for .deepeval/, _old_evals/, evals/.results/, evals/.cache/
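The composed persona space above multiplies out as follows (facet values are placeholders for the actual YAML entries):

```python
import itertools

# Sketch: enumerate every base x topic x region x pattern combination
# in the composed persona system described above.
bases = [f"base_{i}" for i in range(6)]
topics = [f"topic_{i}" for i in range(10)]
regions = [f"region_{i}" for i in range(5)]
patterns = [f"pattern_{i}" for i in range(8)]

combos = list(itertools.product(bases, topics, regions, patterns))
print(len(combos))  # → 2400 possible composed personas
```

A fixed 16-persona regression suite is then a small, repeatable sample of this 2400-combination space.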
Owner
Hello @rafmacalaba, is this still a draft? Thanks!
…ght batch
- Covers all 6 composed persona bases: advocate_journalist, country_analyst, decision_maker, general_public, student, technical_expert
- Each base ran ~17 random topic/region/pattern combinations via run_overnight_batch.sh
- Also covers all 10 flat personas (adversarial_*, journalist_*, policy_*, etc.)
- Each run: conversations JSON, results JSON, markdown review
- Prompt version: feat/setup-evals@d8be3593d
- Pre-commit skipped: the large file (1.1MB conversation JSON) and the hex system_prompt_hash flagged as a secret are both expected eval data artifacts
…ults

The original findings (132 runs, 20260312-20260316) still stand. A second full overnight batch of 132 runs was run on 20260325/26 with no prompt or system changes. Key results:
- 12 of 18 metrics reached 100% pass rate (up from 3)
- F3 (empty writer): zero occurrences -- was 2, now appears intermittent
- F2 (visualization not generated): 2 failures vs. 7 -- improving
- F4 (disambiguation spurious failures): Data Formatting now at 100%
- Source Citation: improved from 78.8% to 87.1%, but still the weakest metric
- All P0-P2 recommendations remain actionable

Document closed with Status: CLOSED and a new Validation Batch section.
Collaborator
Author
Finalizing the repo structure, then I'll un-WIP it.
- Add __main__.py: single entry point via 'python -m evals <command>'; commands: run, batch, generate, suite, compare
- Add 'suite' subcommand: programmatic suite.yaml generation from facets; supports --full, --sample N, --base, --adversarial-only, --list-facets
- Rename regression_suite.yaml -> suite.yaml
- Delete run_overnight_batch.sh and run_viz_batch.sh (replaced by the unified CLI)
- Update README.md with full pipeline documentation
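A minimal sketch of the subcommand dispatch, assuming plain argparse (the real __main__.py wires these subcommands to the eval runners, and likely carries more flags than shown here):

```python
import argparse

# Sketch: 'python -m evals <command>' dispatcher with the subcommands
# listed above and a few of the 'suite' flags.

def build_parser():
    parser = argparse.ArgumentParser(prog="python -m evals")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("run", "batch", "generate", "compare"):
        sub.add_parser(name)
    suite = sub.add_parser("suite")
    suite.add_argument("--full", action="store_true")
    suite.add_argument("--sample", type=int)
    suite.add_argument("--adversarial-only", action="store_true")
    suite.add_argument("--list-facets", action="store_true")
    return parser

args = build_parser().parse_args(["suite", "--sample", "5"])
print(args.command, args.sample)  # → suite 5
```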
- Move eval_config.yaml, suite.yaml, and .env.example to backend/evals/configs/
- Update run_conversation_eval.py, run_regression.py, generate_goldens.py, and __main__.py with the new configuration paths
- Update README.md with the new directory structure and configuration links
- Update E2E_GUIDE.md and PERSONAS.md with 'python -m evals' commands
- Fix configuration and test paths in all markdown guides
- Ensure README.md options tables use the new configs/ directory paths
- Delete legacy .cursor rules
- Remove all conversation logs and keep placeholder instructions
- Final linting and path fixes for the unified CLI
Collaborator
Author
This is ready for review, @avsolatorio. Thanks!
Overview
This PR introduces a robust, persona-driven evaluation framework for the Data360 chatbot, built on DeepEval and optimized for E2E quality assessment, regression testing, and adversarial boundary probing.
Key Features
- `python -m evals` with subcommands for simulation (`run`), overnight batching (`batch`), suite management (`suite`), and comparison (`compare`).
- `configs/` (YAML configs), `tests/` (unit tests), and `personas/` (composability facets).

CLI Reference
Technical Layout
- `backend/evals/__main__.py` -- Unified CLI dispatcher.
- `backend/evals/run_conversation_eval.py` -- E2E simulation & scoring engine.
- `backend/evals/configs/` -- YAML configurations and environment templates.
- `backend/evals/tests/` -- Unit test suite (26 tests passing).
- `backend/evals/personas/` -- Composable persona facets.

Testing
Tested against the `dev` backend.