Geval Examples

Complete, production-ready examples demonstrating how to integrate Geval with popular evaluation frameworks for AI release enforcement.

🎯 Purpose of This Repository

This repository shows you how to use Geval in real-world scenarios by providing two complete end-to-end examples:

  1. Promptfoo Integration (Main directory) - Learn the basics with a simplified workflow
  2. LangSmith Integration (langsmith-example/) - Production-ready example with real API calls

Each example demonstrates the complete flow:

AI Agent → Evaluation Framework → Geval Decision → CI/CD Enforcement

📚 How to Use This Repository

For First-Time Geval Users

Start here: Follow this learning path to understand Geval step-by-step:

  1. Read the architecture (ARCHITECTURE.md) - Understand how Geval fits in the AI workflow

  2. Run the Promptfoo example (main directory):

    npm install
    npm run workflow

    This shows you the basics without needing API keys.

  3. Explore the LangSmith example (langsmith-example/):

    cd langsmith-example
    npm install
    npm run geval:check  # Demo mode (no API keys needed)

    See how Geval works with real evaluation frameworks.

  4. Read the integration guides: INTEGRATION_EXPLAINED.md (Promptfoo) and langsmith-example/README.md (LangSmith)

For Teams Implementing Geval

Choose the example that matches your eval framework:

  • Using Promptfoo? → Start with the main directory
  • Using LangSmith? → Go to langsmith-example/
  • Using other frameworks? → Both examples show the pattern you can adapt

🏗️ Repository Structure

Example 1: Promptfoo Integration (Main Directory)

What it shows: Basic Geval integration with custom eval results

Best for: Learning Geval concepts, quick prototyping

┌─────────────┐
│   Agent     │  Simple FAQ bot answering customer questions
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Promptfoo  │  Runs evals, produces results
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Geval     │  Decision layer: PASS / REQUIRES_APPROVAL / BLOCK
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   CI/CD     │  Enforces decision (deploy or block)
└─────────────┘

Quick Start:

npm install
npm run workflow  # See complete flow in action

Key Files:

  • agent/src/bot.ts - Simple FAQ bot
  • evals/generate-results.ts - Custom eval generator (Promptfoo-compatible)
  • contracts/production.yaml - Quality gates
  • scripts/workflow.ts - Complete orchestration

Note: Uses a custom eval generator instead of calling Promptfoo directly, due to Node.js version compatibility issues. See FIXES.md.

Example 2: LangSmith Integration (langsmith-example/)

What it shows: Production-ready integration with real LangSmith SDK and OpenAI API

Best for: Teams using LangSmith, production implementations

Architecture:

┌─────────────┐
│ GPT-4o-mini │  Real OpenAI API calls
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ LangSmith   │  LLM-as-judge evaluation + platform tracing
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Geval     │  Decision layer with quality gates
└──────┬──────┘
       │
       ▼
   PASS / BLOCK / REQUIRES_APPROVAL

Quick Start:

cd langsmith-example
npm install

# Demo mode (no API keys needed)
npm run geval:check

# Real evaluation (requires OpenAI + LangSmith API keys)
cp .env.example .env  # Add your keys
npm run workflow

Key Features:

  • ✅ Real LangSmith SDK integration
  • ✅ Actual OpenAI GPT-4o-mini calls
  • ✅ LLM-as-judge evaluation pattern
  • ✅ Platform tracing in LangSmith UI
  • ✅ Production-ready error handling

Key Files:

  • src/run-eval.ts - LangSmith evaluation runner
  • src/workflow.ts - Complete E2E workflow
  • contract.yaml - Quality gates (70% pass rate, 3s latency)
  • eval-results/example-results.json - Pre-generated demo data

Documentation: see langsmith-example/README.md.

🎓 What You'll Learn

Geval Capabilities Covered

  • Eval-based contracts - Quality gates on eval metrics
  • Policy-based contracts - Signal-driven decisions
  • Baseline comparisons - Regression detection against previous runs
  • Signal integration - Human reviews, risk flags, approval workflows
  • Environment-aware - Different rules for dev/staging/production
  • Decision records - Auditable decision artifacts with cryptographic hashes
  • CI/CD integration - Exit codes, JSON output for automation

Real-World Scenarios Demonstrated

Both examples show how to handle:

  • Performance regression - Block releases when latency increases
  • Quality degradation - Require approval when accuracy drops
  • Safety concerns - Block immediately on toxicity/harmful content
  • Human override - Manual approval for edge cases
  • Multi-metric evaluation - Combined quality gates (accuracy + latency + safety)

🚀 Running the Examples

Main Example (Promptfoo Integration)

# 1. Install all dependencies
npm install

# 2. Run the complete workflow
npm run workflow

# 3. Or run individual steps
npm run agent:test          # Test FAQ bot directly
npm run evals:run           # Generate evaluation results
npm run geval:check         # Run Geval decision check
npm run geval:explain       # Get detailed decision explanation

# 4. Try different environments
npm run workflow -- --env=staging      # More lenient rules
npm run workflow -- --env=development  # Very permissive

LangSmith Example

cd langsmith-example

# Demo mode (no API keys needed)
npm install
npm run geval:check  # Uses pre-generated example results

# Real evaluation (requires API keys)
cp .env.example .env
# Edit .env and add OPENAI_API_KEY and LANGSMITH_API_KEY
npm run workflow

# Different environments
npm run workflow -- --env=production  # Strictest (default)
npm run workflow -- --env=staging     # Moderate
npm run workflow -- --env=development # Lenient

📂 Project Layout

The full directory structure, followed by a closer look at the main example's components:

geval-examples/
├── agent/                   # Simple FAQ bot implementation
│   ├── src/
│   │   └── bot.ts           # Bot logic
│   └── package.json
├── evals/                   # Promptfoo eval configuration
│   ├── promptfoo.yaml       # Eval config
│   ├── generate-results.ts  # Custom eval results generator
│   └── outputs/             # Eval results (generated)
├── langsmith-example/       # 🆕 Real LangSmith integration
│   ├── src/
│   │   ├── run-eval.ts      # LangSmith eval runner
│   │   └── workflow.ts      # Complete workflow
│   ├── contract.yaml        # Quality gates
│   ├── eval-results/        # LangSmith results
│   └── README.md            # Full documentation
├── contracts/               # Geval contracts
│   ├── production.yaml      # Production quality gates
│   ├── staging.yaml         # Staging (more lenient)
│   └── development.yaml     # Development (permissive)
├── signals/                 # Signal examples
│   ├── approval.json        # Human approval signal
│   └── risk-flag.json       # Risk flag signal
├── scripts/
│   └── workflow.ts          # Complete workflow script
├── .github/
│   └── workflows/
│       └── eval-check.yml   # CI/CD integration example
└── README.md

The Agent

A simple customer support FAQ bot that:

  • Answers common questions about products/services
  • Uses a knowledge base (in-memory for simplicity)
  • Returns structured responses

Example:

const bot = new FAQBot(knowledgeBase);
const response = bot.answer("What is your return policy?");
// Returns: { answer: "...", confidence: 0.95, latency: 120 }
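
For orientation, here is a minimal sketch of what such a bot could look like; the actual implementation lives in agent/src/bot.ts, and the knowledge-base shape and confidence values shown here are assumptions.

// Hypothetical sketch only; see agent/src/bot.ts for the real bot.
interface BotResponse {
  answer: string;
  confidence: number;
  latency: number; // milliseconds
}

class FAQBot {
  // Assumed shape: a map from question keyword to canned answer
  constructor(private knowledgeBase: Record<string, string>) {}

  answer(question: string): BotResponse {
    const start = Date.now();
    // Naive lookup: first entry whose keyword appears in the question
    const match = Object.entries(this.knowledgeBase).find(([keyword]) =>
      question.toLowerCase().includes(keyword.toLowerCase())
    );
    return {
      answer: match ? match[1] : "Sorry, I don't have an answer for that.",
      confidence: match ? 0.95 : 0.2,
      latency: Date.now() - start,
    };
  }
}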

The Evals

Using Promptfoo to evaluate:

  • Accuracy - Correctness of answers
  • Latency - Response time
  • Toxicity - Safety checks
  • Relevance - Answer relevance to question

Run evals:

cd evals
npx promptfoo@latest eval

The Contracts

Geval contracts define release gates:

# contracts/production.yaml
version: 1
name: production-quality-gate
environment: production

policy:
  environments:
    production:
      default: require_approval
      rules:
        - when:
            eval:
              metric: accuracy
              operator: ">="
              threshold: 0.90
          then:
            action: pass
            reason: "Accuracy meets production threshold"
        
        - when:
            eval:
              metric: latency_p95
              operator: ">"
              threshold: 500
          then:
            action: block
            reason: "Latency exceeds acceptable threshold"

The Workflow

  1. Develop - Make changes to agent
  2. Eval - Run Promptfoo evals
  3. Decide - Geval evaluates against contract
  4. Enforce - CI/CD blocks or allows deployment

# Complete workflow
npm run workflow

# Output:
# ✓ Agent tested
# ✓ Evals completed
# ✓ Geval decision: PASS
# → Deployment allowed

CI/CD Integration

See .github/workflows/eval-check.yml for a complete GitHub Actions example.

Key points:

  • Runs evals on every PR
  • Geval checks against contract
  • Blocks merge if contract violated
  • Requires approval for REQUIRES_APPROVAL status

📖 Deep Dive Documentation

Each example has comprehensive documentation:

Main Example Documentation: ARCHITECTURE.md, INTEGRATION_EXPLAINED.md, and FIXES.md

LangSmith Example Documentation: langsmith-example/README.md

🔑 Key Concepts

Evaluation Adapters

Geval automatically detects and parses results from different evaluation frameworks:

  • Promptfoo - results array with success, score, latencyMs (a sketch follows this list)
  • LangSmith - results with feedback scores, execution_time
  • OpenEvals - Direct LLM-as-judge format
  • Generic - Custom JSON with metrics
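
As a rough illustration (not an authoritative schema), a Promptfoo-style results file built from the fields listed above could be modeled in TypeScript like this:

// Illustrative shape only; consult Promptfoo's own docs for the real schema.
interface PromptfooCaseResult {
  success: boolean;   // did this test case pass?
  score: number;      // quality score in the 0..1 range
  latencyMs: number;  // response time for the call
}

const exampleResults: { results: PromptfooCaseResult[] } = {
  results: [
    { success: true, score: 0.92, latencyMs: 180 },
    { success: false, score: 0.4, latencyMs: 650 },
  ],
};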

Contracts

Contracts define quality gates with two approaches:

1. Required Evals (Simple)

required_evals:
  - name: my-eval
    rules:
      - metric: pass_rate
        operator: ">="
        threshold: 0.80

2. Policy-based (Advanced)

policy:
  environments:
    production:
      default: require_approval
      rules:
        - when:
            eval:
              metric: pass_rate
              operator: ">="
              threshold: 0.90
          then:
            action: pass

Decision Flow

  1. Run Evaluation - Your eval framework produces results (JSON/CSV)
  2. Geval Parses - Auto-detects format, extracts metrics
  3. Contract Check - Evaluates rules against metrics
  4. Decision - Returns: PASS (0) / BLOCK (1) / REQUIRES_APPROVAL (2)
  5. CI/CD - Uses exit code to allow/block deployment
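
Because the decision surfaces as a process exit code (0 / 1 / 2), any script can branch on it. A minimal sketch, assuming npm run geval:check propagates that exit code when invoked from Node:

// Hypothetical wrapper around the check step; branches on the exit codes above.
import { spawnSync } from "node:child_process";

const check = spawnSync("npm", ["run", "geval:check"], { stdio: "inherit" });

switch (check.status) {
  case 0:
    console.log("PASS: safe to deploy");
    break;
  case 2:
    console.log("REQUIRES_APPROVAL: waiting for human sign-off");
    break;
  default:
    console.error("BLOCK: contract violated; failing the build");
    process.exit(1);
}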

💡 Common Use Cases

Use Case 1: Block on Quality Regression

# contracts/production.yaml
required_evals:
  - name: quality-check
    rules:
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.85
        description: "Minimum 85% accuracy required"

Use Case 2: Require Human Approval for Edge Cases

policy:
  environments:
    production:
      default: require_approval  # Safe default
      rules:
        - when:
            eval:
              metric: accuracy
              operator: ">="
              threshold: 0.95
          then:
            action: pass  # Only auto-pass if excellent

Use Case 3: Different Rules per Environment

policy:
  environments:
    production:
      default: require_approval
      rules: [strict rules]
    
    staging:
      default: pass
      rules: [moderate rules]
    
    development:
      default: pass
      rules: [lenient rules]

🛠️ Adapting to Your Setup

Using a Different Eval Framework?

Both examples show the pattern:

  1. Run your eval framework
  2. Export results to JSON (one plausible shape is sketched after this list)
  3. Ensure it has metrics (pass_rate, latency, etc.)
  4. Point Geval at the results file
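
Under those assumptions, a generic results file could be as simple as the object below; the exact schema Geval's generic adapter expects is defined by Geval itself, so treat this as a shape to adapt rather than a spec. Paths and field names here are hypothetical.

// Hypothetical generic results, written out as JSON for Geval to read.
import { writeFileSync } from "node:fs";

const results = {
  metrics: {
    pass_rate: 0.87,   // fraction of eval cases that passed
    latency_p95: 420,  // milliseconds
  },
  results: [
    { name: "refund-policy-question", success: true, latencyMs: 150 },
    { name: "toxicity-probe", success: true, latencyMs: 210 },
  ],
};

// Step 4: point Geval (npm run geval:check) at this file.
writeFileSync("eval-results/results.json", JSON.stringify(results, null, 2));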

Integrating with Your CI/CD?

See .github/workflows/eval-check.yml for a complete GitHub Actions example. The pattern works for any CI system:

- name: Run Evaluations
  run: npm run evals:run

- name: Check Quality Gates
  run: npm run geval:check
  # Exits with code 1 if blocked, failing the build

📊 Comparison: Main vs LangSmith Example

Aspect             Main Example (Promptfoo)  LangSmith Example
Eval Framework     Promptfoo                 LangSmith SDK
LLM Integration    Simulated FAQ bot         Real OpenAI GPT-4o-mini
Evaluation Type    Keyword matching          LLM-as-judge (OpenEvals)
API Keys Required  No                        Yes (OpenAI + LangSmith)
Complexity         Beginner-friendly         Production-ready
Setup Time         < 5 minutes               ~10 minutes
Best For           Learning Geval basics     Real-world integration
Documentation      ARCHITECTURE.md           README.md (350+ lines)
CI/CD Ready        Yes (.github/workflows)   Yes (same pattern)
Contract Type      Policy-based (multi-env)  Required evals (simple)

When to use each:

  • Main Example: You're new to Geval, want to understand the flow without API keys, need quick validation
  • LangSmith Example: You're integrating with LangSmith, need production patterns, want real LLM evaluation

🌟 Learn More

Examples

1. Promptfoo Integration (Current Directory)

The main example, using a custom eval results generator:

npm run workflow

See INTEGRATION_EXPLAINED.md for detailed explanation.

2. LangSmith Integration (langsmith-example/)

🆕 Real LangSmith integration with actual OpenAI API calls:

cd langsmith-example
npm install
cp .env.example .env  # Add your OPENAI_API_KEY and LANGSMITH_API_KEY
npm run workflow

Key differences:

  • ✅ Real LLM calls (GPT-4o-mini)
  • ✅ Actual latency measurements
  • ✅ Production-ready evaluation
  • ✅ LangSmith-format export

See langsmith-example/README.md for complete documentation.

License

MIT
