Geval Examples

Complete, production-ready examples demonstrating how to integrate Geval with popular evaluation frameworks for AI release enforcement.

🎯 Purpose of This Repository

This repository shows you how to use Geval in real-world scenarios by providing two complete end-to-end examples:

  1. Promptfoo Integration (Main directory) - Learn the basics with a simplified workflow
  2. LangSmith Integration (langsmith-example/) - Production-ready example with real API calls

Each example demonstrates the complete flow:

AI Agent → Evaluation Framework → Geval Decision → CI/CD Enforcement

📚 How to Use This Repository

For First-Time Geval Users

Start here: Follow this learning path to understand Geval step-by-step:

  1. Read the architecture (ARCHITECTURE.md) - Understand how Geval fits in the AI workflow

  2. Run the Promptfoo example (main directory):

    npm install
    npm run workflow

    This shows you the basics without needing API keys.

  3. Explore the LangSmith example (langsmith-example/):

    cd langsmith-example
    npm install
    npm run geval:check  # Demo mode (no API keys needed)

    See how Geval works with real evaluation frameworks.

  4. Read the integration guides: INTEGRATION_EXPLAINED.md (Promptfoo) and langsmith-example/README.md (LangSmith)

For Teams Implementing Geval

Choose the example that matches your eval framework:

  • Using Promptfoo? → Start with the main directory
  • Using LangSmith? → Go to langsmith-example/
  • Using other frameworks? → Both examples show the pattern you can adapt

🏗️ Repository Structure

Example 1: Promptfoo Integration (Main Directory)

What it shows: Basic Geval integration with custom eval results

Best for: Learning Geval concepts, quick prototyping

┌─────────────┐
│   Agent     │  Simple FAQ bot answering customer questions
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Promptfoo  │  Runs evals, produces results
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Geval     │  Decision layer: PASS / REQUIRES_APPROVAL / BLOCK
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   CI/CD     │  Enforces decision (deploy or block)
└─────────────┘

Quick Start:

npm install
npm run workflow  # See complete flow in action

Key Files:

  • agent/src/bot.ts - Simple FAQ bot
  • evals/generate-results.ts - Custom eval generator (Promptfoo-compatible)
  • contracts/production.yaml - Quality gates
  • scripts/workflow.ts - Complete orchestration

Note: Uses a custom eval generator instead of calling Promptfoo directly, due to Node.js version compatibility issues. See FIXES.md.

Example 2: LangSmith Integration (langsmith-example/)

What it shows: Production-ready integration with real LangSmith SDK and OpenAI API

Best for: Teams using LangSmith, production implementations

Architecture:

┌─────────────┐
│ GPT-4o-mini │  Real OpenAI API calls
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ LangSmith   │  LLM-as-judge evaluation + platform tracing
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Geval     │  Decision layer with quality gates
└──────┬──────┘
       │
       ▼
   PASS / BLOCK / REQUIRES_APPROVAL

Quick Start:

cd langsmith-example
npm install

# Demo mode (no API keys needed)
npm run geval:check

# Real evaluation (requires OpenAI + LangSmith API keys)
cp .env.example .env  # Add your keys
npm run workflow

Key Features:

  • ✅ Real LangSmith SDK integration
  • ✅ Actual OpenAI GPT-4o-mini calls
  • ✅ LLM-as-judge evaluation pattern
  • ✅ Platform tracing in LangSmith UI
  • ✅ Production-ready error handling

Key Files:

  • src/run-eval.ts - LangSmith evaluation runner
  • src/workflow.ts - Complete E2E workflow
  • contract.yaml - Quality gates (70% pass rate, 3s latency)
  • eval-results/example-results.json - Pre-generated demo data

Documentation: see langsmith-example/README.md.

🎓 What You'll Learn

Geval Capabilities Covered

  • Eval-based contracts - Quality gates on eval metrics
  • Policy-based contracts - Signal-driven decisions
  • Baseline comparisons - Regression detection against previous runs
  • Signal integration - Human reviews, risk flags, approval workflows
  • Environment-aware - Different rules for dev/staging/production
  • Decision records - Auditable decision artifacts with cryptographic hashes
  • CI/CD integration - Exit codes, JSON output for automation

Real-World Scenarios Demonstrated

Both examples show how to handle:

  • Performance regression - Block releases when latency increases
  • Quality degradation - Require approval when accuracy drops
  • Safety concerns - Block immediately on toxicity/harmful content
  • Human override - Manual approval for edge cases
  • Multi-metric evaluation - Combined quality gates (accuracy + latency + safety)

🚀 Running the Examples

Main Example (Promptfoo Integration)

# 1. Install all dependencies
npm install

# 2. Run the complete workflow
npm run workflow

# 3. Or run individual steps
npm run agent:test          # Test FAQ bot directly
npm run evals:run           # Generate evaluation results
npm run geval:check         # Run Geval decision check
npm run geval:explain       # Get detailed decision explanation

# 4. Try different environments
npm run workflow -- --env=staging      # More lenient rules
npm run workflow -- --env=development  # Very permissive

LangSmith Example

cd langsmith-example

# Demo mode (no API keys needed)
npm install
npm run geval:check  # Uses pre-generated example results

# Real evaluation (requires API keys)
cp .env.example .env
# Edit .env and add OPENAI_API_KEY and LANGSMITH_API_KEY
npm run workflow

# Different environments
npm run workflow -- --env=production  # Strictest (default)
npm run workflow -- --env=staging     # Moderate
npm run workflow -- --env=development # Lenient

📂 Project Layout

The full directory structure, followed by a closer look at the main example's components:

geval-examples/
├── agent/                   # Simple FAQ bot implementation
│   ├── src/
│   │   └── bot.ts           # Bot logic
│   └── package.json
├── evals/                   # Promptfoo eval configuration
│   ├── promptfoo.yaml       # Eval config
│   ├── generate-results.ts  # Custom eval results generator
│   └── outputs/             # Eval results (generated)
├── langsmith-example/       # 🆕 Real LangSmith integration
│   ├── src/
│   │   ├── run-eval.ts      # LangSmith eval runner
│   │   └── workflow.ts      # Complete workflow
│   ├── contract.yaml        # Quality gates
│   ├── eval-results/        # LangSmith results
│   └── README.md            # Full documentation
├── contracts/               # Geval contracts
│   ├── production.yaml      # Production quality gates
│   ├── staging.yaml         # Staging (more lenient)
│   └── development.yaml     # Development (permissive)
├── signals/                 # Signal examples
│   ├── approval.json        # Human approval signal
│   └── risk-flag.json       # Risk flag signal
├── scripts/
│   └── workflow.ts          # Complete workflow script
├── .github/
│   └── workflows/
│       └── eval-check.yml   # CI/CD integration example
└── README.md

The Agent

A simple customer support FAQ bot that:

  • Answers common questions about products/services
  • Uses a knowledge base (in-memory for simplicity)
  • Returns structured responses

Example:

const bot = new FAQBot(knowledgeBase);
const response = bot.answer("What is your return policy?");
// Returns: { answer: "...", confidence: 0.95, latency: 120 }
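
For orientation, here is a minimal sketch of what such a bot could look like; the actual implementation lives in agent/src/bot.ts, and the knowledge-base shape and confidence values shown here are assumptions.

// Hypothetical sketch only; see agent/src/bot.ts for the real bot.
interface BotResponse {
  answer: string;
  confidence: number;
  latency: number; // milliseconds
}

class FAQBot {
  // Assumed shape: a map from question keyword to canned answer
  constructor(private knowledgeBase: Record<string, string>) {}

  answer(question: string): BotResponse {
    const start = Date.now();
    // Naive lookup: first entry whose keyword appears in the question
    const match = Object.entries(this.knowledgeBase).find(([keyword]) =>
      question.toLowerCase().includes(keyword.toLowerCase())
    );
    return {
      answer: match ? match[1] : "Sorry, I don't have an answer for that.",
      confidence: match ? 0.95 : 0.2,
      latency: Date.now() - start,
    };
  }
}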

The Evals

Using Promptfoo to evaluate:

  • Accuracy - Correctness of answers
  • Latency - Response time
  • Toxicity - Safety checks
  • Relevance - Answer relevance to question

Run evals:

cd evals
npx promptfoo@latest eval

The Contracts

Geval contracts define release gates:

# contracts/production.yaml
version: 1
name: production-quality-gate
environment: production

policy:
  environments:
    production:
      default: require_approval
      rules:
        - when:
            eval:
              metric: accuracy
              operator: ">="
              threshold: 0.90
          then:
            action: pass
            reason: "Accuracy meets production threshold"
        
        - when:
            eval:
              metric: latency_p95
              operator: ">"
              threshold: 500
          then:
            action: block
            reason: "Latency exceeds acceptable threshold"

The Workflow

  1. Develop - Make changes to agent
  2. Eval - Run Promptfoo evals
  3. Decide - Geval evaluates against contract
  4. Enforce - CI/CD blocks or allows deployment

# Complete workflow
npm run workflow

# Output:
# ✓ Agent tested
# ✓ Evals completed
# ✓ Geval decision: PASS
# → Deployment allowed

CI/CD Integration

See .github/workflows/eval-check.yml for a complete GitHub Actions example.

Key points:

  • Runs evals on every PR
  • Geval checks against contract
  • Blocks merge if contract violated
  • Requires approval for REQUIRES_APPROVAL status

📖 Deep Dive Documentation

Each example has comprehensive documentation:

Main Example Documentation: ARCHITECTURE.md, INTEGRATION_EXPLAINED.md, and FIXES.md

LangSmith Example Documentation: langsmith-example/README.md

🔑 Key Concepts

Evaluation Adapters

Geval automatically detects and parses results from different evaluation frameworks:

  • Promptfoo - results array with success, score, latencyMs (a sketch follows this list)
  • LangSmith - results with feedback scores, execution_time
  • OpenEvals - Direct LLM-as-judge format
  • Generic - Custom JSON with metrics
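
As a rough illustration (not an authoritative schema), a Promptfoo-style results file built from the fields listed above could be modeled in TypeScript like this:

// Illustrative shape only; consult Promptfoo's own docs for the real schema.
interface PromptfooCaseResult {
  success: boolean;   // did this test case pass?
  score: number;      // quality score in the 0..1 range
  latencyMs: number;  // response time for the call
}

const exampleResults: { results: PromptfooCaseResult[] } = {
  results: [
    { success: true, score: 0.92, latencyMs: 180 },
    { success: false, score: 0.4, latencyMs: 650 },
  ],
};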

Contracts

Contracts define quality gates with two approaches:

1. Required Evals (Simple)

required_evals:
  - name: my-eval
    rules:
      - metric: pass_rate
        operator: ">="
        threshold: 0.80

2. Policy-based (Advanced)

policy:
  environments:
    production:
      default: require_approval
      rules:
        - when:
            eval:
              metric: pass_rate
              operator: ">="
              threshold: 0.90
          then:
            action: pass

Decision Flow

  1. Run Evaluation - Your eval framework produces results (JSON/CSV)
  2. Geval Parses - Auto-detects format, extracts metrics
  3. Contract Check - Evaluates rules against metrics
  4. Decision - Returns: PASS (0) / BLOCK (1) / REQUIRES_APPROVAL (2)
  5. CI/CD - Uses exit code to allow/block deployment
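
Because the decision surfaces as a process exit code (0 / 1 / 2), any script can branch on it. A minimal sketch, assuming npm run geval:check propagates that exit code when invoked from Node:

// Hypothetical wrapper around the check step; branches on the exit codes above.
import { spawnSync } from "node:child_process";

const check = spawnSync("npm", ["run", "geval:check"], { stdio: "inherit" });

switch (check.status) {
  case 0:
    console.log("PASS: safe to deploy");
    break;
  case 2:
    console.log("REQUIRES_APPROVAL: waiting for human sign-off");
    break;
  default:
    console.error("BLOCK: contract violated; failing the build");
    process.exit(1);
}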

💡 Common Use Cases

Use Case 1: Block on Quality Regression

# contracts/production.yaml
required_evals:
  - name: quality-check
    rules:
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.85
        description: "Minimum 85% accuracy required"

Use Case 2: Require Human Approval for Edge Cases

policy:
  environments:
    production:
      default: require_approval  # Safe default
      rules:
        - when:
            eval:
              metric: accuracy
              operator: ">="
              threshold: 0.95
          then:
            action: pass  # Only auto-pass if excellent

Use Case 3: Different Rules per Environment

policy:
  environments:
    production:
      default: require_approval
      rules: [strict rules]
    
    staging:
      default: pass
      rules: [moderate rules]
    
    development:
      default: pass
      rules: [lenient rules]

🛠️ Adapting to Your Setup

Using a Different Eval Framework?

Both examples show the pattern:

  1. Run your eval framework
  2. Export results to JSON (one plausible shape is sketched after this list)
  3. Ensure it has metrics (pass_rate, latency, etc.)
  4. Point Geval at the results file
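
Under those assumptions, a generic results file could be as simple as the object below; the exact schema Geval's generic adapter expects is defined by Geval itself, so treat this as a shape to adapt rather than a spec. Paths and field names here are hypothetical.

// Hypothetical generic results, written out as JSON for Geval to read.
import { writeFileSync } from "node:fs";

const results = {
  metrics: {
    pass_rate: 0.87,   // fraction of eval cases that passed
    latency_p95: 420,  // milliseconds
  },
  results: [
    { name: "refund-policy-question", success: true, latencyMs: 150 },
    { name: "toxicity-probe", success: true, latencyMs: 210 },
  ],
};

// Step 4: point Geval (npm run geval:check) at this file.
writeFileSync("eval-results/results.json", JSON.stringify(results, null, 2));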

Integrating with Your CI/CD?

See .github/workflows/eval-check.yml for a complete GitHub Actions example. The pattern works for any CI system:

- name: Run Evaluations
  run: npm run evals:run

- name: Check Quality Gates
  run: npm run geval:check
  # Exits with code 1 if blocked, failing the build

📊 Comparison: Main vs LangSmith Example

Aspect             Main Example (Promptfoo)  LangSmith Example
Eval Framework     Promptfoo                 LangSmith SDK
LLM Integration    Simulated FAQ bot         Real OpenAI GPT-4o-mini
Evaluation Type    Keyword matching          LLM-as-judge (OpenEvals)
API Keys Required  No                        Yes (OpenAI + LangSmith)
Complexity         Beginner-friendly         Production-ready
Setup Time         < 5 minutes               ~10 minutes
Best For           Learning Geval basics     Real-world integration
Documentation      ARCHITECTURE.md           README.md (350+ lines)
CI/CD Ready        Yes (.github/workflows)   Yes (same pattern)
Contract Type      Policy-based (multi-env)  Required evals (simple)

When to use each:

  • Main Example: You're new to Geval, want to understand the flow without API keys, need quick validation
  • LangSmith Example: You're integrating with LangSmith, need production patterns, want real LLM evaluation

🌟 Learn More

Examples

1. Promptfoo Integration (Current Directory)

The main example, using a custom eval results generator:

npm run workflow

See INTEGRATION_EXPLAINED.md for detailed explanation.

2. LangSmith Integration (langsmith-example/)

🆕 Real LangSmith integration with actual OpenAI API calls:

cd langsmith-example
npm install
cp .env.example .env  # Add your OPENAI_API_KEY and LANGSMITH_API_KEY
npm run workflow

Key differences:

  • ✅ Real LLM calls (GPT-4o-mini)
  • ✅ Actual latency measurements
  • ✅ Production-ready evaluation
  • ✅ LangSmith-format export

See langsmith-example/README.md for complete documentation.

License

MIT
