You changed a prompt. Swapped a model. Updated a tool. Did anything break? Run EvalView. Know for sure.
```bash
pip install evalview && evalview demo   # No API key needed
```

Like it? Give us a ⭐ – it helps more devs discover EvalView.
| Status | What it means | What you do |
|---|---|---|
| ✅ PASSED | Agent behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Agent is calling different tools | Review the diff |
| ⚠️ | Same tools, output quality shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |
Simple workflow (recommended):

```bash
# 1. Your agent works correctly
evalview snapshot   # 📸 Save current behavior as baseline

# 2. You change something (prompt, model, tools)
evalview check      # Detect regressions automatically

# 3. EvalView tells you exactly what changed
# ✅ All clean! No regressions detected.
# ⚠️ TOOLS_CHANGED: +web_search, -calculator
# ❌ REGRESSION: score 85 → 71
```

Advanced workflow (more control):
```bash
evalview run --save-golden   # Save specific result as baseline
evalview run --diff          # Compare with custom options
```

That's it. Deterministic proof, no LLM-as-judge required, no API keys needed.
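Locally, the same check can guard your pushes. A minimal git pre-push hook sketch, assuming `evalview check --fail-on REGRESSION` exits non-zero when a regression is found (the same behavior the CI integration below relies on):

```bash
#!/usr/bin/env bash
# .git/hooks/pre-push (sketch) – make it executable with: chmod +x .git/hooks/pre-push
# Assumes `evalview check --fail-on REGRESSION` returns a non-zero exit code on regression.
set -euo pipefail

if ! evalview check --fail-on REGRESSION; then
  echo "EvalView detected a regression – push blocked. Run 'evalview check' to see the diff."
  exit 1
fi
```

Drop it into `.git/hooks/pre-push` and pushes stop at the first detected regression.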
EvalView now tracks your progress and celebrates wins:

```bash
evalview check
# Comparing against your baseline...
# ✨ All clean! No regressions detected.
# 🎯 5 clean checks in a row! You're on a roll.
```

Features:
- 🔥 Streak tracking – Celebrate consecutive clean checks (3, 5, 10, 25+ milestones)
- Health score – See your project's stability at a glance
- Smart recaps – "Since last time" summaries to stay in context
- Progress visualization – Track improvement over time
Some agents produce valid variations. Save up to 5 golden variants per test:

```bash
# Save multiple acceptable behaviors
evalview snapshot --variant variant1
evalview snapshot --variant variant2

# EvalView compares against ALL variants, passes if ANY match
evalview check
# ✅ Matched variant 2/3
```

Perfect for LLM-based agents with creative variation.
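If your agent has several known-good behaviors, you can capture them in one pass. A small sketch, assuming variant names are free-form labels (the names below are placeholders):

```bash
# Save three acceptable behaviors as golden variants (names are arbitrary labels)
for v in concise detailed tool-heavy; do
  evalview snapshot --variant "$v"
done

# Later checks pass if the agent matches ANY of the saved variants
evalview check
```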
- Install EvalView: `pip install evalview`
- Try the demo (zero setup, no API key): `evalview demo`
- Set up a working example in 2 minutes: `evalview quickstart`
- Want LLM-as-judge scoring too? `export OPENAI_API_KEY='your-key'`, then `evalview run`
- Prefer local/free evaluation? `evalview run --judge-provider ollama --judge-model llama3.2` (see the sketch below)

Full getting started guide →
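For the local/free path, a rough end-to-end sketch, assuming Ollama is installed and serving on your machine:

```bash
# Pull a local judge model once (requires a local Ollama install)
ollama pull llama3.2

# Evaluate with a fully local LLM-as-judge – no API key required
evalview run --judge-provider ollama --judge-model llama3.2
```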
- Automatic regression detection – Know instantly when your agent breaks
- 📸 Golden baseline diffing – Save known-good behavior, compare every change
- Works without API keys – Deterministic scoring, no LLM-as-judge needed
- 💸 Free & open source – No vendor lock-in, no SaaS pricing
- Works offline – Use Ollama for fully local evaluation
| | Observability (LangSmith) | Benchmarks (Braintrust) | EvalView |
|---|---|---|---|
| Answers | "What did my agent do?" | "How good is my agent?" | "Did my agent change?" |
| Detects regressions | ❌ | ❌ | ✅ Automatic |
| Golden baseline diffing | ❌ | ❌ | ✅ |
| Works without API keys | ❌ | ❌ | ✅ |
| Free & open source | ❌ | ❌ | ✅ |
| Works offline (Ollama) | ❌ | ❌ | ✅ |
Use observability tools to see what happened. Use EvalView to prove it didn't break.
Talk to your tests. Debug failures. Compare runs.

```
evalview chat

You: run the calculator test
🤖 Running calculator test...
✅ Passed (score: 92.5)

You: compare to yesterday
🤖 Score: 92.5 → 87.2 (-5.3)
   Tools: +1 added (validator)
   Cost: $0.003 → $0.005 (+67%)
```

Slash commands: `/run`, `/test`, `/compare`, `/traces`, `/skill`, `/adapters`
Practice agent eval patterns with guided exercises.

```bash
evalview gym
```

| Agent | E2E Testing | Trace Capture |
|---|---|---|
| Claude Code | ✅ | ✅ |
| OpenAI Codex | ✅ | ✅ |
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Custom (any CLI/API) | ✅ | ✅ |
Also works with: AutoGen • Dify • Ollama • HuggingFace • Any HTTP API
```bash
evalview init --ci   # Generates workflow file
```

Or add manually:

```yaml
# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.2.5
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          command: check          # Use the check command
          fail-on: 'REGRESSION'   # Block PRs on regressions
          json: true              # Structured output for CI
```

Or use the CLI directly:

```yaml
- run: evalview check --fail-on REGRESSION --json
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

PRs with regressions get blocked. Add a PR comment showing exactly what changed:

```yaml
- run: evalview ci comment
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

| Feature | Description | Docs |
|---|---|---|
| 📸 Snapshot/Check Workflow | Simple `snapshot` → `check` commands for regression detection | → |
| 🔥 Streak Tracking | Habit-forming celebrations for consecutive clean checks | → |
| Multi-Reference Goldens | Save up to 5 variants per test for non-deterministic agents | → |
| Chat Mode | AI assistant: `/run`, `/test`, `/compare` | → |
| 🏷️ Tool Categories | Match by intent, not exact tool names | → |
| Statistical Mode | Handle flaky LLMs with `--runs N` and pass@k | → |
| 💰 Cost & Latency | Automatic threshold enforcement | → |
| HTML Reports | Interactive Plotly charts | → |
| 🧪 Test Generation | Generate 1000 tests from 1 | → |
| Suite Types | Separate capability vs regression tests | → |
| 🎯 Difficulty Levels | Filter by `--difficulty hard`, benchmark by tier | → |
| Behavior Coverage | Track tasks, tools, paths tested | → |
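Most of these features are plain CLI flags. A hedged sketch combining two of them, using the flag spellings from the table above (exact semantics may vary by version):

```bash
# Run each test multiple times and score with pass@k to smooth out flaky LLM behavior
evalview run --runs 5

# Only exercise the hard cases
evalview run --difficulty hard
```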
Test that your agent's code actually works, not just that the output looks right. Best for teams maintaining SKILL.md workflows for Claude Code or Codex.
```yaml
tests:
  - name: creates-working-api
    input: "Create an express server with /health endpoint"
    expected:
      files_created: ["index.js", "package.json"]
      build_must_pass:
        - "npm install"
        - "npm run lint"
      smoke_tests:
        - command: "node index.js"
          background: true
          health_check: "http://localhost:3000/health"
          expected_status: 200
          timeout: 10
      no_sudo: true
      git_clean: true
```

```bash
evalview skill test tests.yaml --agent claude-code
evalview skill test tests.yaml --agent codex
evalview skill test tests.yaml --agent langgraph
```
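To run the same suite against every adapter you target, a small convenience sketch built from the commands above:

```bash
# Run the same skill tests across several agents (agent names taken from the commands above)
for agent in claude-code codex langgraph; do
  evalview skill test tests.yaml --agent "$agent"
done
```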
| Check | What it catches |
|---|---|
| `build_must_pass` | Code that doesn't compile, missing dependencies |
| `smoke_tests` | Runtime crashes, wrong ports, failed health checks |
| `git_clean` | Uncommitted files, dirty working directory |
| `no_sudo` | Privilege escalation attempts |
| `max_tokens` | Cost blowouts, verbose outputs |
Getting Started • CLI Reference • Golden Traces • CI/CD Integration • Tool Categories • Statistical Mode • Chat Mode • Evaluation Metrics • Skills Testing • Debugging • FAQ
Guides: Testing LangGraph in CI • Detecting Hallucinations
| Framework | Link |
|---|---|
| Claude Code (E2E) | examples/agent-test/ |
| LangGraph | examples/langgraph/ |
| CrewAI | examples/crewai/ |
| Anthropic Claude | examples/anthropic/ |
| Dify | examples/dify/ |
| Ollama (Local) | examples/ollama/ |
Node.js? See @evalview/node
Shipped: Golden traces • Snapshot/check workflow • Streak tracking & celebrations • Multi-reference goldens • Tool categories • Statistical mode • Difficulty levels • Partial sequence credit • Skills validation • E2E agent testing • Build & smoke tests • Health checks • Safety guards (no_sudo, git_clean) • Claude Code & Codex adapters • Opus 4.6 cost tracking • MCP servers • HTML reports • Interactive chat mode • EvalView Gym

Coming: Agent Teams trace analysis • Multi-turn conversations • Grounded hallucination detection • Error compounding metrics • Container isolation
- Questions? GitHub Discussions
- Bugs? GitHub Issues
- Want setup help? Email hidai@evalview.com – happy to help configure your first tests
- Contributing? See CONTRIBUTING.md
License: Apache 2.0
Don't miss out on future updates! Star the repo and be the first to know about new features.
Proof that your agent still works.
Get started →
EvalView is an independent open-source project, not affiliated with LangGraph, CrewAI, OpenAI, Anthropic, or any other third party.
