Complete, production-ready examples demonstrating how to integrate Geval with popular evaluation frameworks for AI release enforcement.
This repository shows you how to use Geval in real-world scenarios by providing two complete end-to-end examples:
- Promptfoo Integration (Main directory) - Learn the basics with a simplified workflow
- LangSmith Integration (langsmith-example/) - Production-ready example with real API calls
Each example demonstrates the complete flow:
AI Agent → Evaluation Framework → Geval Decision → CI/CD Enforcement
Start here: Follow this learning path to understand Geval step-by-step:
1. Read the architecture (ARCHITECTURE.md) - Understand how Geval fits in the AI workflow
2. Run the Promptfoo example (main directory):
   npm install
   npm run workflow
   This shows you the basics without needing API keys.
3. Explore the LangSmith example (langsmith-example/):
   cd langsmith-example
   npm install
   npm run geval:check # Demo mode (no API keys needed)
   See how Geval works with real evaluation frameworks.
4. Read the integration guides:
   - INTEGRATION_EXPLAINED.md - Promptfoo integration details
   - langsmith-example/README.md - LangSmith integration guide
Choose the example that matches your eval framework:
- Using Promptfoo? → Start with the main directory
- Using LangSmith? → Go to langsmith-example/
- Using other frameworks? → Both examples show the pattern you can adapt
What it shows: Basic Geval integration with custom eval results
Best for: Learning Geval concepts, quick prototyping
Architecture:
┌─────────────┐
│ Agent │ Simple FAQ bot answering customer questions
└──────┬──────┘
│
▼
┌─────────────┐
│ Promptfoo │ Runs evals, produces results
└──────┬──────┘
│
▼
┌─────────────┐
│ Geval │ Decision layer: PASS / REQUIRES_APPROVAL / BLOCK
└──────┬──────┘
│
▼
┌─────────────┐
│ CI/CD │ Enforces decision (deploy or block)
└─────────────┘
Quick Start:
npm install
npm run workflow # See complete flow in action
Key Files:
- agent/src/bot.ts - Simple FAQ bot
- evals/generate-results.ts - Custom eval generator (Promptfoo-compatible)
- contracts/production.yaml - Quality gates
- scripts/workflow.ts - Complete orchestration
Note: Uses a custom eval generator instead of Promptfoo directly due to a Node.js version compatibility issue. See FIXES.md.
What it shows: Production-ready integration with real LangSmith SDK and OpenAI API
Best for: Teams using LangSmith, production implementations
Architecture:
┌─────────────┐
│ GPT-4o-mini │ Real OpenAI API calls
└──────┬──────┘
│
▼
┌─────────────┐
│ LangSmith │ LLM-as-judge evaluation + platform tracing
└──────┬──────┘
│
▼
┌─────────────┐
│ Geval │ Decision layer with quality gates
└──────┬──────┘
│
▼
PASS / BLOCK / REQUIRES_APPROVAL
Quick Start:
cd langsmith-example
npm install
# Demo mode (no API keys needed)
npm run geval:check
# Real evaluation (requires OpenAI + LangSmith API keys)
cp .env.example .env # Add your keys
npm run workflow
Key Features:
- ✅ Real LangSmith SDK integration
- ✅ Actual OpenAI GPT-4o-mini calls
- ✅ LLM-as-judge evaluation pattern (sketched after this list)
- ✅ Platform tracing in LangSmith UI
- ✅ Production-ready error handling
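The judge pattern itself looks roughly like the following minimal sketch using the OpenAI SDK. The example's actual evaluator is built with OpenEvals and the LangSmith SDK in src/run-eval.ts; the prompt wording and 0-to-1 scoring scale here are assumptions for illustration:

import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask a judge model to grade an answer against a reference; returns 0..1.
async function judgeCorrectness(
  question: string,
  answer: string,
  reference: string
): Promise<number> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content:
          `Question: ${question}\n` +
          `Reference answer: ${reference}\n` +
          `Candidate answer: ${answer}\n` +
          `Reply with only a number between 0 and 1 grading correctness.`,
      },
    ],
  });
  return parseFloat(res.choices[0].message.content ?? "0");
}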
Key Files:
- src/run-eval.ts - LangSmith evaluation runner
- src/workflow.ts - Complete E2E workflow
- contract.yaml - Quality gates (70% pass rate, 3s latency)
- eval-results/example-results.json - Pre-generated demo data
Documentation:
- README.md - Complete guide
- QUICKSTART.md - Fast start
- INTEGRATION_SUMMARY.md - Technical deep dive
Geval capabilities demonstrated by these examples:
- Eval-based contracts - Quality gates on eval metrics
- Policy-based contracts - Signal-driven decisions
- Baseline comparisons - Regression detection against previous runs
- Signal integration - Human reviews, risk flags, approval workflows (see the example after this list)
- Environment-aware - Different rules for dev/staging/production
- Decision records - Auditable decision artifacts with cryptographic hashes
- CI/CD integration - Exit codes, JSON output for automation
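For instance, a human-approval signal could look something like this. This is illustrative only; the field names are assumptions, so see signals/approval.json for the real schema:

{
  "type": "approval",
  "approved_by": "release-manager@example.com",
  "reason": "Accuracy dip reviewed and accepted for this release",
  "timestamp": "2024-01-15T10:30:00Z"
}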
Both examples show how to handle:
- Performance regression - Block releases when latency increases
- Quality degradation - Require approval when accuracy drops
- Safety concerns - Block immediately on toxicity/harmful content
- Human override - Manual approval for edge cases
- Multi-metric evaluation - Combined quality gates (accuracy + latency + safety)
# 1. Install all dependencies
npm install
# 2. Run the complete workflow
npm run workflow
# 3. Or run individual steps
npm run agent:test # Test FAQ bot directly
npm run evals:run # Generate evaluation results
npm run geval:check # Run Geval decision check
npm run geval:explain # Get detailed decision explanation
# 4. Try different environments
npm run workflow -- --env=staging # More lenient rules
npm run workflow -- --env=development # Very permissive

cd langsmith-example
# Demo mode (no API keys needed)
npm install
npm run geval:check # Uses pre-generated example results
# Real evaluation (requires API keys)
cp .env.example .env
# Edit .env and add OPENAI_API_KEY and LANGSMITH_API_KEY
npm run workflow
# Different environments
npm run workflow -- --env=production # Strictest (default)
npm run workflow -- --env=staging # Moderate
npm run workflow -- --env=development # Lenient

Repository structure:
geval-examples/
├── agent/ # Simple FAQ bot implementation
│ ├── src/
│ │ └── bot.ts # Bot logic
│ └── package.json
├── evals/ # Promptfoo eval configuration
│ ├── promptfoo.yaml # Eval config
│ ├── generate-results.ts # Custom eval results generator
│ └── outputs/ # Eval results (generated)
├── langsmith-example/ # 🆕 Real LangSmith integration
│ ├── src/
│ │ ├── run-eval.ts # LangSmith eval runner
│ │ └── workflow.ts # Complete workflow
│ ├── contract.yaml # Quality gates
│ ├── eval-results/ # LangSmith results
│ └── README.md # Full documentation
├── contracts/ # Geval contracts
│ ├── production.yaml # Production quality gates
│ ├── staging.yaml # Staging (more lenient)
│ └── development.yaml # Development (permissive)
├── signals/ # Signal examples
│ ├── approval.json # Human approval signal
│ └── risk-flag.json # Risk flag signal
├── scripts/
│ └── workflow.ts # Complete workflow script
├── .github/
│ └── workflows/
│ └── eval-check.yml # CI/CD integration example
└── README.md
A simple customer support FAQ bot that:
- Answers common questions about products/services
- Uses a knowledge base (in-memory for simplicity)
- Returns structured responses
Example:
const bot = new FAQBot(knowledgeBase);
const response = bot.answer("What is your return policy?");
// Returns: { answer: "...", confidence: 0.95, latency: 120 }
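For reference, here is a minimal sketch of the shape of agent/src/bot.ts (illustrative; the repository's implementation may differ in detail):

interface BotResponse {
  answer: string;
  confidence: number; // 0..1 match confidence
  latency: number;    // milliseconds
}

export class FAQBot {
  constructor(private knowledgeBase: Record<string, string>) {}

  answer(question: string): BotResponse {
    const start = Date.now();
    // Naive lookup: find a knowledge-base key mentioned in the question.
    const entry = Object.entries(this.knowledgeBase).find(([key]) =>
      question.toLowerCase().includes(key.toLowerCase())
    );
    return {
      answer: entry ? entry[1] : "Sorry, I don't have an answer for that.",
      confidence: entry ? 0.95 : 0.1,
      latency: Date.now() - start,
    };
  }
}

Using Promptfoo to evaluate: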
- Accuracy - Correctness of answers
- Latency - Response time
- Toxicity - Safety checks
- Relevance - Answer relevance to question
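A Promptfoo config covering these checks might look roughly like this. It is a sketch: the expected value and thresholds are assumptions, so see evals/promptfoo.yaml for the real config:

prompts:
  - "Answer the customer's question: {{question}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "What is your return policy?"
    assert:
      - type: contains # accuracy: expected keyword present
        value: "30 days"
      - type: latency # response-time budget in milliseconds
        threshold: 500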
Run evals:
cd evals
npx promptfoo@latest eval
Geval contracts define release gates:
# contracts/production.yaml
version: 1
name: production-quality-gate
environment: production
policy:
  environments:
    production:
      default: require_approval
      rules:
        - when:
            eval:
              metric: accuracy
              operator: ">="
              threshold: 0.90
          then:
            action: pass
            reason: "Accuracy meets production threshold"
        - when:
            eval:
              metric: latency_p95
              operator: ">"
              threshold: 500
          then:
            action: block
            reason: "Latency exceeds acceptable threshold"

The development workflow:
- Develop - Make changes to agent
- Eval - Run Promptfoo evals
- Decide - Geval evaluates against contract
- Enforce - CI/CD blocks or allows deployment
# Complete workflow
npm run workflow
# Output:
# ✓ Agent tested
# ✓ Evals completed
# ✓ Geval decision: PASS
# → Deployment allowed

See .github/workflows/eval-check.yml for a complete GitHub Actions example; a trimmed sketch follows the key points below.
Key points:
- Runs evals on every PR
- Geval checks against contract
- Blocks merge if contract violated
- Requires approval for REQUIRES_APPROVAL status
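A trimmed-down sketch of such a workflow (the real eval-check.yml may differ in detail):

name: eval-check
on: pull_request
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install
      - name: Run Evaluations
        run: npm run evals:run
      - name: Check Quality Gates
        run: npm run geval:check # non-zero exit code blocks the merge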
Each example has comprehensive documentation:
- ARCHITECTURE.md - System design and component interaction
- INTEGRATION_EXPLAINED.md - How Geval integrates with Promptfoo
- FIXES.md - Technical decisions and workarounds
- contracts/README.md - Contract syntax and examples
- langsmith-example/README.md - Complete guide (350+ lines)
- langsmith-example/QUICKSTART.md - 5-minute start guide
- langsmith-example/INTEGRATION_SUMMARY.md - Technical deep dive
Geval automatically detects and parses results from different evaluation frameworks:
- Promptfoo - results array with success, score, latencyMs (sample below)
- LangSmith - results with feedback scores, execution_time
- OpenEvals - Direct LLM-as-judge format
- Generic - Custom JSON with metrics
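For example, a Promptfoo-style results file exposes exactly those fields (abbreviated for illustration):

{
  "results": [
    { "success": true, "score": 0.92, "latencyMs": 120 },
    { "success": false, "score": 0.40, "latencyMs": 310 }
  ]
}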
Contracts define quality gates with two approaches:
Eval-based:
required_evals:
  - name: my-eval
    rules:
      - metric: pass_rate
        operator: ">="
        threshold: 0.80

Policy-based:
policy:
  environments:
    production:
      default: require_approval
      rules:
        - when:
            eval:
              metric: pass_rate
              operator: ">="
              threshold: 0.90
          then:
            action: pass

The decision flow:
- Run Evaluation - Your eval framework produces results (JSON/CSV)
- Geval Parses - Auto-detects format, extracts metrics
- Contract Check - Evaluates rules against metrics
- Decision - Returns PASS (exit 0) / BLOCK (exit 1) / REQUIRES_APPROVAL (exit 2)
- CI/CD - Uses the exit code to allow or block deployment (see the shell sketch below)
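In a shell or CI step, that maps onto ordinary exit-code handling (a sketch; npm run propagates the underlying exit code):

npm run geval:check
case $? in
  0) echo "PASS - safe to deploy" ;;
  1) echo "BLOCK - failing the build"; exit 1 ;;
  2) echo "REQUIRES_APPROVAL - waiting for human sign-off"; exit 1 ;;
esac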
A fixed-threshold gate:
# contracts/production.yaml
required_evals:
  - name: quality-check
    rules:
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.85
        description: "Minimum 85% accuracy required"

A conservative policy that only auto-passes excellent results:
policy:
  environments:
    production:
      default: require_approval # Safe default
      rules:
        - when:
            eval:
              metric: accuracy
              operator: ">="
              threshold: 0.95
          then:
            action: pass # Only auto-pass if excellent

An environment-aware policy:
policy:
  environments:
    production:
      default: require_approval
      rules: [strict rules]
    staging:
      default: pass
      rules: [moderate rules]
    development:
      default: pass
      rules: [lenient rules]

Both examples show the pattern:
- Run your eval framework
- Export results to JSON
- Ensure it has metrics (pass_rate, latency, etc.)
- Point Geval at the results file (a minimal example follows)
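As a minimal illustration, a generic results file only needs the metrics your contract references. The exact schema Geval accepts is documented in the Geval docs; this shape is an assumption:

{
  "metrics": {
    "pass_rate": 0.93,
    "latency_p95": 310
  }
}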
See .github/workflows/eval-check.yml for a complete GitHub Actions example. The pattern works for any CI system:
- name: Run Evaluations
  run: npm run evals:run
- name: Check Quality Gates
  run: npm run geval:check # Exits with code 1 if blocked, failing the build

| Aspect | Main Example (Promptfoo) | LangSmith Example |
|---|---|---|
| Eval Framework | Promptfoo | LangSmith SDK |
| LLM Integration | Simulated FAQ bot | Real OpenAI GPT-4o-mini |
| Evaluation Type | Keyword matching | LLM-as-judge (OpenEvals) |
| API Keys Required | No | Yes (OpenAI + LangSmith) |
| Complexity | Beginner-friendly | Production-ready |
| Setup Time | < 5 minutes | ~10 minutes |
| Best For | Learning Geval basics | Real-world integration |
| Documentation | ARCHITECTURE.md | README.md (350+ lines) |
| CI/CD Ready | Yes (.github/workflows) | Yes (same pattern) |
| Contract Type | Policy-based (multi-env) | Required evals (simple) |
When to use each:
- Main Example: You're new to Geval, want to understand the flow without API keys, need quick validation
- LangSmith Example: You're integrating with LangSmith, need production patterns, want real LLM evaluation
- Geval Core Repository - Main monorepo with CLI and core library
- Geval Documentation - Official docs
- LangSmith Documentation - LangSmith platform docs
- Promptfoo Documentation - Promptfoo eval framework
- OpenEvals - LLM-as-judge evaluators
The main example uses a custom eval results generator:
npm run workflow
See INTEGRATION_EXPLAINED.md for a detailed explanation.
🆕 Real LangSmith integration with actual OpenAI API calls:
cd langsmith-example
npm install
cp .env.example .env # Add your OPENAI_API_KEY
npm run workflow
Key differences:
- ✅ Real LLM calls (GPT-4o-mini)
- ✅ Actual latency measurements
- ✅ Production-ready evaluation
- ✅ LangSmith-format export
See langsmith-example/README.md for complete documentation.
- Geval Documentation
- Promptfoo Documentation
- LangSmith Documentation
- Example Contracts
- Workflow Script
MIT