You changed a prompt. Swapped a model. Updated a tool. Did anything break? Run EvalView. Know for sure.
```bash
pip install evalview && evalview demo   # No API key needed
```

Like it? Give us a ⭐ – it helps more devs discover EvalView.
| Status | What it means | What you do |
|---|---|---|
| ✅ PASSED | Agent behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Agent is calling different tools | Review the diff |
| ⚠️ | Same tools, output quality shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |
Simple workflow (recommended):

```bash
# 1. Your agent works correctly
evalview snapshot   # 📸 Save current behavior as baseline

# 2. You change something (prompt, model, tools)
evalview check      # Detect regressions automatically

# 3. EvalView tells you exactly what changed
# ✅ All clean! No regressions detected.
# ⚠️ TOOLS_CHANGED: +web_search, -calculator
# ❌ REGRESSION: score 85 → 71
```

Advanced workflow (more control):
```bash
evalview run --save-golden   # Save specific result as baseline
evalview run --diff          # Compare with custom options
```

That's it. Deterministic proof, no LLM-as-judge required, no API keys needed.
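Locally, the same check can guard your pushes. A minimal git pre-push hook sketch, assuming `evalview check --fail-on REGRESSION` exits non-zero when a regression is found (the same behavior the CI integration below relies on):

```bash
#!/usr/bin/env bash
# .git/hooks/pre-push (sketch) – make it executable with: chmod +x .git/hooks/pre-push
# Assumes `evalview check --fail-on REGRESSION` returns a non-zero exit code on regression.
set -euo pipefail

if ! evalview check --fail-on REGRESSION; then
  echo "EvalView detected a regression – push blocked. Run 'evalview check' to see the diff."
  exit 1
fi
```

Drop it into `.git/hooks/pre-push` and pushes stop at the first detected regression.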
EvalView now tracks your progress and celebrates wins:

```bash
evalview check
# Comparing against your baseline...
# ✨ All clean! No regressions detected.
# 🎯 5 clean checks in a row! You're on a roll.
```

Features:
- 🔥 Streak tracking – Celebrate consecutive clean checks (3, 5, 10, 25+ milestones)
- Health score – See your project's stability at a glance
- Smart recaps – "Since last time" summaries to stay in context
- Progress visualization – Track improvement over time
Some agents produce valid variations. Save up to 5 golden variants per test:

```bash
# Save multiple acceptable behaviors
evalview snapshot --variant variant1
evalview snapshot --variant variant2

# EvalView compares against ALL variants, passes if ANY match
evalview check
# ✅ Matched variant 2/3
```

Perfect for LLM-based agents with creative variation.
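If your agent has several known-good behaviors, you can capture them in one pass. A small sketch, assuming variant names are free-form labels (the names below are placeholders):

```bash
# Save three acceptable behaviors as golden variants (names are arbitrary labels)
for v in concise detailed tool-heavy; do
  evalview snapshot --variant "$v"
done

# Later checks pass if the agent matches ANY of the saved variants
evalview check
```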
- Install EvalView: `pip install evalview`
- Try the demo (zero setup, no API key): `evalview demo`
- Set up a working example in 2 minutes: `evalview quickstart`
- Want LLM-as-judge scoring too? `export OPENAI_API_KEY='your-key'`, then `evalview run`
- Prefer local/free evaluation? `evalview run --judge-provider ollama --judge-model llama3.2` (see the sketch below)

Full getting started guide →
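For the local/free path, a rough end-to-end sketch, assuming Ollama is installed and serving on your machine:

```bash
# Pull a local judge model once (requires a local Ollama install)
ollama pull llama3.2

# Evaluate with a fully local LLM-as-judge – no API key required
evalview run --judge-provider ollama --judge-model llama3.2
```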
- Automatic regression detection – Know instantly when your agent breaks
- 📸 Golden baseline diffing – Save known-good behavior, compare every change
- Works without API keys – Deterministic scoring, no LLM-as-judge needed
- 💸 Free & open source – No vendor lock-in, no SaaS pricing
- Works offline – Use Ollama for fully local evaluation
| | Observability (LangSmith) | Benchmarks (Braintrust) | EvalView |
|---|---|---|---|
| Answers | "What did my agent do?" | "How good is my agent?" | "Did my agent change?" |
| Detects regressions | ❌ | ❌ | ✅ Automatic |
| Golden baseline diffing | ❌ | ❌ | ✅ |
| Works without API keys | ❌ | ❌ | ✅ |
| Free & open source | ❌ | ❌ | ✅ |
| Works offline (Ollama) | ❌ | ❌ | ✅ |
Use observability tools to see what happened. Use EvalView to prove it didn't break.
Talk to your tests. Debug failures. Compare runs.

```
evalview chat

You: run the calculator test
🤖 Running calculator test...
✅ Passed (score: 92.5)

You: compare to yesterday
🤖 Score: 92.5 → 87.2 (-5.3)
   Tools: +1 added (validator)
   Cost: $0.003 → $0.005 (+67%)
```

Slash commands: `/run`, `/test`, `/compare`, `/traces`, `/skill`, `/adapters`
Practice agent eval patterns with guided exercises.

```bash
evalview gym
```

| Agent | E2E Testing | Trace Capture |
|---|---|---|
| Claude Code | ✅ | ✅ |
| OpenAI Codex | ✅ | ✅ |
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Custom (any CLI/API) | ✅ | ✅ |
Also works with: AutoGen • Dify • Ollama • HuggingFace • Any HTTP API
```bash
evalview init --ci   # Generates workflow file
```

Or add manually:

```yaml
# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.2.5
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          command: check          # Use the check command
          fail-on: 'REGRESSION'   # Block PRs on regressions
          json: true              # Structured output for CI
```

Or use the CLI directly:

```yaml
- run: evalview check --fail-on REGRESSION --json
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

PRs with regressions get blocked. Add a PR comment showing exactly what changed:

```yaml
- run: evalview ci comment
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

| Feature | Description | Docs |
|---|---|---|
| 📸 Snapshot/Check Workflow | Simple `snapshot` → `check` commands for regression detection | → |
| 🔥 Streak Tracking | Habit-forming celebrations for consecutive clean checks | → |
| Multi-Reference Goldens | Save up to 5 variants per test for non-deterministic agents | → |
| Chat Mode | AI assistant: `/run`, `/test`, `/compare` | → |
| 🏷️ Tool Categories | Match by intent, not exact tool names | → |
| Statistical Mode | Handle flaky LLMs with `--runs N` and pass@k | → |
| 💰 Cost & Latency | Automatic threshold enforcement | → |
| HTML Reports | Interactive Plotly charts | → |
| 🧪 Test Generation | Generate 1000 tests from 1 | → |
| Suite Types | Separate capability vs regression tests | → |
| 🎯 Difficulty Levels | Filter by `--difficulty hard`, benchmark by tier | → |
| Behavior Coverage | Track tasks, tools, paths tested | → |
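Most of these features are plain CLI flags. A hedged sketch combining two of them, using the flag spellings from the table above (exact semantics may vary by version):

```bash
# Run each test multiple times and score with pass@k to smooth out flaky LLM behavior
evalview run --runs 5

# Only exercise the hard cases
evalview run --difficulty hard
```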
Test that your agent's code actually works, not just that the output looks right. Best for teams maintaining SKILL.md workflows for Claude Code or Codex.
```yaml
tests:
  - name: creates-working-api
    input: "Create an express server with /health endpoint"
    expected:
      files_created: ["index.js", "package.json"]
      build_must_pass:
        - "npm install"
        - "npm run lint"
      smoke_tests:
        - command: "node index.js"
          background: true
          health_check: "http://localhost:3000/health"
          expected_status: 200
          timeout: 10
      no_sudo: true
      git_clean: true
```

```bash
evalview skill test tests.yaml --agent claude-code
evalview skill test tests.yaml --agent codex
evalview skill test tests.yaml --agent langgraph
```
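To run the same suite against every adapter you target, a small convenience sketch built from the commands above:

```bash
# Run the same skill tests across several agents (agent names taken from the commands above)
for agent in claude-code codex langgraph; do
  evalview skill test tests.yaml --agent "$agent"
done
```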
| Check | What it catches |
|---|---|
| `build_must_pass` | Code that doesn't compile, missing dependencies |
| `smoke_tests` | Runtime crashes, wrong ports, failed health checks |
| `git_clean` | Uncommitted files, dirty working directory |
| `no_sudo` | Privilege escalation attempts |
| `max_tokens` | Cost blowouts, verbose outputs |
Getting Started • CLI Reference • Golden Traces • CI/CD Integration • Tool Categories • Statistical Mode • Chat Mode • Evaluation Metrics • Skills Testing • Debugging • FAQ
Guides: Testing LangGraph in CI • Detecting Hallucinations
| Framework | Link |
|---|---|
| Claude Code (E2E) | examples/agent-test/ |
| LangGraph | examples/langgraph/ |
| CrewAI | examples/crewai/ |
| Anthropic Claude | examples/anthropic/ |
| Dify | examples/dify/ |
| Ollama (Local) | examples/ollama/ |
Node.js? See @evalview/node
Shipped: Golden traces • Snapshot/check workflow • Streak tracking & celebrations • Multi-reference goldens • Tool categories • Statistical mode • Difficulty levels • Partial sequence credit • Skills validation • E2E agent testing • Build & smoke tests • Health checks • Safety guards (no_sudo, git_clean) • Claude Code & Codex adapters • Opus 4.6 cost tracking • MCP servers • HTML reports • Interactive chat mode • EvalView Gym

Coming: Agent Teams trace analysis • Multi-turn conversations • Grounded hallucination detection • Error compounding metrics • Container isolation
- Questions? GitHub Discussions
- Bugs? GitHub Issues
- Want setup help? Email hidai@evalview.com – happy to help configure your first tests
- Contributing? See CONTRIBUTING.md
License: Apache 2.0
Don't miss out on future updates! Star the repo and be the first to know about new features.
Proof that your agent still works.
Get started →
EvalView is an independent open-source project, not affiliated with LangGraph, CrewAI, OpenAI, Anthropic, or any other third party.
