Thomas-TyTech/prompt-evaluation

Resident Assistant Evaluation Pipeline

An evaluation system for testing and comparing AI chatbot prompts, designed for the South Carolina Resident Assistant chatbot. It provides command-line tools that automate evaluation runs and pause at defined points in the terminal so you can switch prompts manually.

Features

Core Evaluation Pipeline

  • Multi-Prompt Testing: Compare multiple prompt versions side-by-side
  • Intelligent Question Generation: AI-powered test question creation for South Carolina state services
  • Enhanced Link Validation: Comprehensive URL testing with retry logic and government site optimization
  • LLM-Based Grading: Claude-powered response evaluation with detailed rubrics
  • Interactive Dashboards: Both HTML and native UI result visualization

Terminal Interface

  • Real-Time Progress Tracking: Live console updates during evaluation runs
  • Terminal Pause Points: Manual prompt switching with clear instructions
  • Structured Output: Organized results with detailed metrics and summaries

Advanced Analytics

  • 5-Dimensional Scoring: Accuracy, Completeness, Relevance, Clarity, Link Quality
  • Statistical Analysis: Performance trends and comparative metrics
  • Export Options: HTML dashboards, JSON data, and structured reports

Quick Start

Prerequisites

  • Python 3.8+
  • Anthropic API key for LLM grading and question generation
  • Access to your AI agent's API endpoint

Installation

Option A: Using uv (Recommended - Fast & Modern)

# Clone the repository
git clone <repository-url>
cd prompt-evaluation

# Install with uv (creates virtual environment automatically)
uv sync

# Activate the environment
source .venv/bin/activate  # Linux/macOS
# or
.venv\Scripts\activate     # Windows

# Optional: Install with development dependencies
uv sync --extra dev

Option B: Using pip (Traditional)

# Clone the repository
git clone <repository-url>
cd prompt-evaluation

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or
.venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

Configuration Setup

  1. Configure your credentials. Copy the example configuration file, then edit the copy:
cp evaluation_config.example.json src/evaluation_config.json

Edit src/evaluation_config.json:

{
  "api_endpoint": "https://your-agent-endpoint.com/sync_query",
  "auth_header": "Basic your-auth-token-here", 
  "default_questions_count": 5,
  "auto_grade": true,
  "generate_html_backup": true
}
  2. Set your Anthropic API key:
export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
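
Before a long run it can help to confirm that both configuration steps took effect. The following is a minimal sketch, not part of the repository: the helper names `validate_config`, `load_config`, and `require_api_key` are hypothetical, and treating `api_endpoint` and `auth_header` as the required keys is an assumption based on the example file above.

```python
import json
import os
from pathlib import Path

# Assumption: these two keys are mandatory; the rest fall back to the
# values shown in the example configuration.
REQUIRED_KEYS = ("api_endpoint", "auth_header")
DEFAULTS = {
    "default_questions_count": 5,
    "auto_grade": True,
    "generate_html_backup": True,
}


def validate_config(raw: dict) -> dict:
    """Return the config merged over the defaults, or raise if keys are missing."""
    missing = [key for key in REQUIRED_KEYS if not raw.get(key)]
    if missing:
        raise ValueError(f"Missing config keys: {', '.join(missing)}")
    return {**DEFAULTS, **raw}


def load_config(path: str = "src/evaluation_config.json") -> dict:
    """Read the JSON config file and validate it."""
    return validate_config(json.loads(Path(path).read_text(encoding="utf-8")))


def require_api_key(env=os.environ) -> str:
    """Return the Anthropic API key from the environment, or raise if unset."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")
    return key
```

Calling `load_config()` and `require_api_key()` at the top of a script fails fast with a readable message instead of partway through an evaluation.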

Usage

Complete Evaluation Pipeline (Recommended)

# Run the complete pipeline with terminal pause points
python3 src/run_full_evaluation.py \
  --endpoint "https://your-agent-endpoint.com/sync_query" \
  --auth "Basic your-auth-token" \
  --evaluation-name "Production vs Enhanced Prompts" \
  --num-questions 25

Step-by-Step Command Line Interface

# Generate test questions
python3 src/test_question_generator.py --count 10 --output questions.json

# Run multi-prompt evaluation with terminal pause points
python3 src/multi_prompt_evaluator.py \
  --endpoint "https://your-agent.com/sync_query" \
  --auth "Basic your-token" \
  --questions questions.json \
  --prompt1-name "Production" \
  --prompt2-name "Experimental"

# Grade responses with LLM
python3 src/llm_grader.py --evaluation-file results.json --output graded.json

# Generate interactive dashboard
python3 src/dashboard_generator.py --graded-results graded.json --output dashboard.html

Project Structure

resident-assistant-evaluator/
├── src/                          # Core application code
│   ├── api_test_harness.py      # Direct API testing and database management
│   ├── enhanced_link_validation.py  # Advanced URL validation
│   ├── test_question_generator.py   # AI-powered question generation
│   ├── multi_prompt_evaluator.py   # Multi-prompt comparison pipeline
│   ├── llm_grader.py            # Claude-based response grading
│   ├── dashboard_generator.py    # HTML dashboard creation
│   ├── run_full_evaluation.py   # Complete pipeline orchestrator
│   ├── flet_evaluation_ui.py    # Modern GUI application
│   ├── launch_ui.py             # GUI launcher with deployment options
│   ├── setup_ui.py              # Setup and validation script
│   ├── test_ui.py               # Component testing
│   └── evaluation_config.json   # Configuration file
├── docs/                        # Documentation
│   ├── FLET_UI_README.md       # GUI interface guide
│   └── README_prompt_evaluation.md  # CLI evaluation guide
├── examples/                    # Sample question sets
│   ├── comprehensive_questions.json
│   └── basic_services_questions.json
├── data/                        # Generated data and results (gitignored)
└── requirements.txt             # Python dependencies

Evaluation Workflow

Terminal Workflow

  1. Generate Questions: Create test question sets
  2. Run Evaluations: Test each prompt version with terminal pause points for manual prompt switching
  3. Grade Responses: AI-powered evaluation with detailed rubrics
  4. Generate Dashboards: Create interactive HTML reports
  5. Analyze Results: Compare prompt performance metrics

Evaluation Metrics

Scoring Rubrics (1-5 scale)

  • Accuracy: Factual correctness and precision
  • Completeness: Coverage of all aspects of the question
  • Relevance: Direct applicability to the user's query
  • Clarity: Clear, understandable language and structure
  • Link Quality: Validity and usefulness of provided URLs
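
To make the rubric concrete, here is a minimal sketch of how five 1-5 dimension scores might be combined into a single number. The equal weighting, the dictionary keys, and the function name `overall_score` are all assumptions for illustration; the grader in this repository may aggregate differently.

```python
from statistics import mean

# Assumed key names for the five rubric dimensions.
DIMENSIONS = ("accuracy", "completeness", "relevance", "clarity", "link_quality")


def overall_score(scores: dict) -> float:
    """Average the five dimension scores with equal weight (an assumption)."""
    for name in DIMENSIONS:
        if name not in scores:
            raise KeyError(f"Missing dimension: {name}")
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {scores[name]}")
    return round(float(mean(scores[name] for name in DIMENSIONS)), 2)
```

An unweighted mean treats a broken link as seriously as a factual error. If accuracy matters more for your use case, a weighted mean is a natural variation.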

Question Categories

  • Government Services: DMV, vital records, licenses
  • Business Registration: Tax IDs, permits, regulations
  • Healthcare: Mental health services, providers, programs
  • Employment: Workers' compensation, benefits, rights
  • Education: Higher education, financial aid, resources

Configuration

API Endpoints

Configure your agent's API endpoint in src/evaluation_config.json:

{
  "api_endpoint": "https://stateofsc-staging.chatbot.socrata.com/sync_query",
  "auth_header": "Basic VHlsZXJUZWNoOlR5bGVyVGVjaA==",
  "default_questions_count": 5,
  "auto_grade": true,
  "generate_html_backup": true
}

Environment Variables

# Required for AI grading and question generation
export ANTHROPIC_API_KEY="sk-ant-your-key-here"

# Optional: Database configuration
export SUPABASE_URL="your-supabase-url"
export SUPABASE_KEY="your-supabase-key"

Terminal Workflow Features

Real-Time Progress Tracking

  • Live console updates during evaluation runs
  • Detailed question-by-question progress with timing
  • Link validation status and summary statistics
  • Clear success/failure indicators
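
The link validation status reported during a run depends on how each URL is checked. As a rough illustration of validation with retry logic, the sketch below retries on network errors and 5xx responses with exponential backoff. It uses only the standard library; the function names, the retry policy, and the result fields are assumptions and do not describe the actual code in enhanced_link_validation.py.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a HEAD request to `url`."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is still a usable answer.
        return err.code


def validate_link(
    url: str,
    fetch: Callable[[str], int] = fetch_status,
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> dict:
    """Check a URL, retrying on network errors and 5xx responses."""
    status = None
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            status = fetch(url)
            last_error = None
            if status < 500:
                break  # definitive answer, no need to retry
        except OSError as err:  # URLError and socket timeouts subclass OSError
            status = None
            last_error = str(err)
        if attempt < max_attempts:
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return {
        "url": url,
        "status": status,
        "valid": status is not None and 200 <= status < 400,
        "attempts": attempt,
        "error": last_error,
    }
```

Passing `fetch` as a parameter keeps the retry policy testable without network access. Some servers reject HEAD requests, so a real validator would probably fall back to GET.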

Manual Prompt Switching

  • Clear terminal prompts with step-by-step instructions
  • Pause points between prompt versions for manual configuration
  • Option to skip prompt versions or quit evaluation
  • Confirmation messages for successful prompt changes
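
A pause point of the kind described above can be sketched with a blocking read from the terminal. This is an illustrative outline only: `pause_for_prompt_switch` and `run_all` are hypothetical names, and the key bindings (Enter, s, q) are assumptions rather than the evaluator's actual interface.

```python
from typing import Callable, Iterable, List


def pause_for_prompt_switch(prompt_name: str, ask: Callable[[str], str] = input) -> str:
    """Block until the operator responds; return 'continue', 'skip', or 'quit'."""
    print(f"\nPAUSE: switch your agent to the '{prompt_name}' prompt now.")
    print("  [Enter] continue   [s] skip this prompt   [q] quit evaluation")
    choices = {"": "continue", "s": "skip", "q": "quit"}
    while True:
        answer = ask("> ").strip().lower()
        if answer in choices:
            return choices[answer]
        print("Unrecognized choice; press Enter, s, or q.")


def run_all(
    prompt_names: Iterable[str],
    evaluate: Callable[[str], None],
    ask: Callable[[str], str] = input,
) -> List[str]:
    """Evaluate each prompt version in turn, pausing before every one."""
    completed = []
    for name in prompt_names:
        action = pause_for_prompt_switch(name, ask)
        if action == "quit":
            break
        if action == "skip":
            continue
        evaluate(name)
        print(f"Finished evaluating '{name}'.")
        completed.append(name)
    return completed
```

Injecting `ask` instead of calling `input` directly lets the flow be exercised in automated tests with scripted answers.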

Results Analysis

  • Comprehensive terminal summaries comparing all prompt versions
  • Performance metrics including response times and success rates
  • Link validation statistics with detailed breakdowns
  • Structured JSON output for further analysis
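
The structured JSON output lends itself to further processing outside the pipeline. The sketch below computes a success rate and mean response time per prompt version. The record shape (`prompt`, `success`, `response_time`) is an assumed schema chosen for illustration; check the actual results file and adjust the field names before using it.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def summarize(records: List[dict]) -> Dict[str, dict]:
    """Group result records by prompt and compute simple comparison metrics."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["prompt"]].append(record)

    summary = {}
    for prompt, rows in grouped.items():
        succeeded = [row for row in rows if row["success"]]
        summary[prompt] = {
            "questions": len(rows),
            "success_rate": len(succeeded) / len(rows),
            # Only successful calls contribute to timing; None if there were none.
            "mean_response_time": (
                mean(row["response_time"] for row in succeeded) if succeeded else None
            ),
        }
    return summary
```

With real data you would load the records with `json.load` and pass the resulting list to `summarize`.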

Troubleshooting

Common Issues

Import Errors

# Reinstall dependencies inside the activated virtual environment
pip install -r requirements.txt

API Connection Issues

  • Verify endpoint URL and authentication tokens
  • Check network connectivity and firewall settings
  • Test with direct API call tools (curl, Postman)
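
When verifying authentication tokens, keep in mind that an HTTP Basic `auth_header` value is just the Base64 encoding of `username:password` (RFC 7617). The helpers below are a small sketch for building such a header and for decoding one to check that it holds the credentials you expect. The function names are hypothetical and not part of this repository.

```python
import base64
from typing import Tuple


def basic_auth_header(username: str, password: str) -> str:
    """Build an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def decode_basic_auth(header: str) -> Tuple[str, str]:
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, _, token = header.partition(" ")
    if scheme != "Basic" or not token:
        raise ValueError("Not a Basic authorization header")
    # Split on the first colon only: passwords may contain colons.
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password
```

Because Base64 is an encoding and not encryption, anyone who can read the header can recover the password. Treat configuration files that contain it as secrets.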

Pipeline Issues

# Test individual components
python3 src/test_question_generator.py --help
python3 src/multi_prompt_evaluator.py --help
python3 src/llm_grader.py --help
python3 src/dashboard_generator.py --help

Performance Issues

  • Reduce question count for testing
  • Increase delays between questions and prompts
  • Check console output for specific bottlenecks

Advanced Usage

Custom Question Generation

Extend the question generator for different domains:

from src.test_question_generator import ResidentAssistantQuestionGenerator

generator = ResidentAssistantQuestionGenerator()
questions = generator.generate_questions(
    num_questions=20,
    categories=["government", "business"],
    complexity_levels=["basic", "intermediate"]
)

API Integration

Call the test harness and grader directly from your own evaluation pipeline:

from src.api_test_harness import APITester, DatabaseManager
from src.llm_grader import ResidentAssistantGrader

# Setup
db = DatabaseManager()
tester = APITester("https://your-api.com", "auth-token", db)
grader = ResidentAssistantGrader()

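# Run evaluation (`questions` is a list of test questions,
# for example the output of the question generator shown above)
results = tester.run_test_suite(questions, "Test Run")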
graded_results = grader.grade_evaluation_results(results)

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes with tests
  4. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built for evaluating the South Carolina Resident Assistant chatbot
  • Powered by Anthropic's Claude for intelligent grading
  • UI framework provided by Flet for cross-platform compatibility
  • Designed for government service chatbot evaluation and optimization

Ready to start evaluating your AI assistant?

python3 src/run_full_evaluation.py \
    --endpoint "https://stateofsc-staging.chatbot.socrata.com/sync_query" \
    --auth "Basic VHlsZXJUZWNoOlR5bGVyVGVjaA==" \
    --evaluation-name "Production vs Enhanced Prompt" \
    --num-questions 25
