A comprehensive evaluation system for testing and comparing AI chatbot prompts, specifically designed for the South Carolina Resident Assistant chatbot. Features command-line tools for automated evaluation with manual prompt switching via terminal pause points.
- Multi-Prompt Testing: Compare multiple prompt versions side-by-side
- Intelligent Question Generation: AI-powered test question creation for South Carolina state services
- Enhanced Link Validation: Comprehensive URL testing with retry logic and government site optimization
- LLM-Based Grading: Claude-powered response evaluation with detailed rubrics
- Interactive Dashboards: Both HTML and native UI result visualization
- Real-Time Progress Tracking: Live console updates during evaluation runs
- Terminal Pause Points: Manual prompt switching with clear instructions
- Structured Output: Organized results with detailed metrics and summaries
- 5-Dimensional Scoring: Accuracy, Completeness, Relevance, Clarity, Link Quality
- Statistical Analysis: Performance trends and comparative metrics
- Export Options: HTML dashboards, JSON data, and structured reports
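The retry logic mentioned under Enhanced Link Validation can be pictured roughly as follows. This is a minimal sketch and not the code in `enhanced_link_validation.py`: the names `fetch_status` and `check_url_with_retries`, the attempt count, and the linear backoff are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a HEAD request to `url`."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth reporting.
        return err.code


def check_url_with_retries(
    url: str,
    fetch: Callable[[str], int] = fetch_status,
    max_attempts: int = 3,          # hypothetical default
    backoff_seconds: float = 1.0,   # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Try `url` up to `max_attempts` times and report the outcome."""
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        try:
            status = fetch(url)
            if status < 400:
                return {"url": url, "valid": True, "status": status, "attempts": attempt}
            last_error = f"HTTP {status}"
        except OSError as err:
            # URLError and TimeoutError are both subclasses of OSError.
            last_error = str(err)
        if attempt < max_attempts:
            sleep(backoff_seconds * attempt)  # linear backoff between attempts
    return {"url": url, "valid": False, "error": last_error, "attempts": max_attempts}
```

Passing `fetch` and `sleep` as parameters keeps the retry loop testable without network access or real waiting.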
- Python 3.8+
- Anthropic API key for LLM grading and question generation
- Access to your AI agent's API endpoint
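Before running anything, it may help to confirm the prerequisites above programmatically. The sketch below is illustrative only: `missing_prerequisites` is a hypothetical helper, not part of this repository. It checks the Python version and the `ANTHROPIC_API_KEY` environment variable; reachability of the agent endpoint is not checked.

```python
import os
import sys
from typing import List, Mapping, Tuple


def missing_prerequisites(
    env: Mapping[str, str] = os.environ,
    version: Tuple[int, int] = sys.version_info[:2],
) -> List[str]:
    """Return a human-readable list of unmet prerequisites (empty if all good)."""
    problems = []
    if version < (3, 8):
        problems.append(f"Python 3.8+ required, found {version[0]}.{version[1]}")
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    return problems
```

Calling `missing_prerequisites()` with no arguments inspects the current interpreter and environment.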
# Clone the repository
git clone <repository-url>
cd prompt-evaluation
# Install with uv (creates virtual environment automatically)
uv sync
# Activate the environment
source .venv/bin/activate # Linux/macOS
# or
.venv\Scripts\activate # Windows
# Optional: Install with development dependencies
uv sync --extra dev

# Clone the repository
git clone <repository-url>
cd prompt-evaluation
# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# or
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
- Configure your credentials: copy and edit the configuration file:
cp evaluation_config.example.json src/evaluation_config.json
Edit src/evaluation_config.json:
{
"api_endpoint": "https://your-agent-endpoint.com/sync_query",
"auth_header": "Basic your-auth-token-here",
"default_questions_count": 5,
"auto_grade": true,
"generate_html_backup": true
}
- Set Anthropic API key
export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
# Run the complete pipeline with terminal pause points
python3 src/run_full_evaluation.py \
--endpoint "https://your-agent-endpoint.com/sync_query" \
--auth "Basic your-auth-token" \
--evaluation-name "Production vs Enhanced Prompts" \
--num-questions 25

# Generate test questions
python src/test_question_generator.py --count 10 --output questions.json
# Run multi-prompt evaluation with terminal pause points
python3 src/multi_prompt_evaluator.py \
--endpoint "https://your-agent.com/sync_query" \
--auth "Basic your-token" \
--questions questions.json \
--prompt1-name "Production" \
--prompt2-name "Experimental"
# Grade responses with LLM
python src/llm_grader.py --evaluation-file results.json --output graded.json
# Generate interactive dashboard
python src/dashboard_generator.py --graded-results graded.json --output dashboard.html

resident-assistant-evaluator/
├── src/ # Core application code
│ ├── api_test_harness.py # Direct API testing and database management
│ ├── enhanced_link_validation.py # Advanced URL validation
│ ├── test_question_generator.py # AI-powered question generation
│ ├── multi_prompt_evaluator.py # Multi-prompt comparison pipeline
│ ├── llm_grader.py # Claude-based response grading
│ ├── dashboard_generator.py # HTML dashboard creation
│ ├── run_full_evaluation.py # Complete pipeline orchestrator
│ ├── flet_evaluation_ui.py # Modern GUI application
│ ├── launch_ui.py # GUI launcher with deployment options
│ ├── setup_ui.py # Setup and validation script
│ ├── test_ui.py # Component testing
│ └── evaluation_config.json # Configuration file
├── docs/ # Documentation
│ ├── FLET_UI_README.md # GUI interface guide
│ └── README_prompt_evaluation.md # CLI evaluation guide
├── examples/ # Sample question sets
│ ├── comprehensive_questions.json
│ └── basic_services_questions.json
├── data/ # Generated data and results (gitignored)
└── requirements.txt # Python dependencies
- Generate Questions: Create test question sets
- Run Evaluations: Test each prompt version with terminal pause points for manual prompt switching
- Grade Responses: AI-powered evaluation with detailed rubrics
- Generate Dashboards: Create interactive HTML reports
- Analyze Results: Compare prompt performance metrics
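The first four workflow steps map onto the CLI scripts shown earlier. As a rough sketch, the commands could be assembled in Python like this. `build_pipeline_commands` is a hypothetical helper, the flags are copied from the commands in this README, and the assumption that the evaluator writes `results.json` (the file the grader reads) is not confirmed here.

```python
from typing import List


def build_pipeline_commands(
    endpoint: str,
    auth: str,
    num_questions: int = 10,
    python: str = "python3",
) -> List[List[str]]:
    """Return the argv lists for steps 1-4 of the workflow, in order."""
    return [
        # 1. Generate questions
        [python, "src/test_question_generator.py",
         "--count", str(num_questions), "--output", "questions.json"],
        # 2. Run the evaluation (pauses in the terminal between prompts)
        [python, "src/multi_prompt_evaluator.py",
         "--endpoint", endpoint, "--auth", auth,
         "--questions", "questions.json",
         "--prompt1-name", "Production", "--prompt2-name", "Experimental"],
        # 3. Grade responses (assumes step 2 produced results.json)
        [python, "src/llm_grader.py",
         "--evaluation-file", "results.json", "--output", "graded.json"],
        # 4. Generate the dashboard
        [python, "src/dashboard_generator.py",
         "--graded-results", "graded.json", "--output", "dashboard.html"],
    ]
```

Each list can be handed to `subprocess.run(cmd, check=True)`. Because step 2 waits for keyboard input at its pause points, it needs to run in an interactive terminal.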
- Accuracy: Factual correctness and precision
- Completeness: Coverage of all aspects of the question
- Relevance: Direct applicability to the user's query
- Clarity: Clear, understandable language and structure
- Link Quality: Validity and usefulness of provided URLs
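To make the five dimensions concrete, here is one way a single overall score could be derived from them. This is a sketch under assumptions: the key names, the equal weighting, and the idea of averaging at all are illustrative and may differ from what `llm_grader.py` actually does.

```python
from statistics import mean
from typing import Dict

# Hypothetical key names for the five dimensions listed above.
DIMENSIONS = ("accuracy", "completeness", "relevance", "clarity", "link_quality")


def overall_score(scores: Dict[str, float]) -> float:
    """Average the five dimension scores with equal weight."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return mean(scores[d] for d in DIMENSIONS)
```

Rejecting incomplete score sets, rather than averaging whatever is present, keeps prompt comparisons on equal footing.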
- Government Services: DMV, vital records, licenses
- Business Registration: Tax IDs, permits, regulations
- Healthcare: Mental health services, providers, programs
- Employment: Workers' compensation, benefits, rights
- Education: Higher education, financial aid, resources
Configure your agent's API endpoint in src/evaluation_config.json:
{
"api_endpoint": "https://stateofsc-staging.chatbot.socrata.com/sync_query",
"auth_header": "Basic VHlsZXJUZWNoOlR5bGVyVGVjaA==",
"default_questions_count": 5,
"auto_grade": true,
"generate_html_backup": true
}

# Required for AI grading and question generation
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
# Optional: Database configuration
export SUPABASE_URL="your-supabase-url"
export SUPABASE_KEY="your-supabase-key"

- Live console updates during evaluation runs
- Detailed question-by-question progress with timing
- Link validation status and summary statistics
- Clear success/failure indicators
- Clear terminal prompts with step-by-step instructions
- Pause points between prompt versions for manual configuration
- Option to skip prompt versions or quit evaluation
- Confirmation messages for successful prompt changes
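A pause point with continue, skip, and quit options can be sketched as below. This is an illustration of the behavior described above, not the evaluator's actual code: the function name, the single-letter choices, and the wording of the messages are assumptions.

```python
from typing import Callable


def pause_for_prompt_switch(
    prompt_name: str,
    ask: Callable[[str], str] = input,
) -> str:
    """Block until the operator chooses to continue, skip, or quit."""
    choices = {"c": "continue", "s": "skip", "q": "quit"}
    print(f"\nSwitch your agent to the '{prompt_name}' prompt now.")
    while True:
        answer = ask("[c]ontinue, [s]kip this prompt, or [q]uit: ").strip().lower()
        if answer in choices:
            return choices[answer]
        # Unrecognized input: explain and ask again.
        print("Please enter c, s, or q.")
```

The `ask` parameter defaults to the built-in `input`, so the function blocks in a real terminal, while tests can supply canned answers.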
- Comprehensive terminal summaries comparing all prompt versions
- Performance metrics including response times and success rates
- Link validation statistics with detailed breakdowns
- Structured JSON output for further analysis
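The comparison metrics named above (success rate and response time per prompt version) could be computed from per-question records along these lines. The record fields `prompt`, `success`, and `seconds` are hypothetical; the real JSON output may use a different shape.

```python
from statistics import mean
from typing import Dict, List


def summarize_runs(runs: List[dict]) -> Dict[str, dict]:
    """Group per-question records by prompt and compute summary metrics."""
    by_prompt: Dict[str, List[dict]] = {}
    for run in runs:
        by_prompt.setdefault(run["prompt"], []).append(run)

    summary = {}
    for prompt, items in by_prompt.items():
        successes = [r for r in items if r["success"]]
        summary[prompt] = {
            "questions": len(items),
            "success_rate": len(successes) / len(items),
            # Only successful calls count toward the mean response time.
            "mean_response_seconds": (
                mean(r["seconds"] for r in successes) if successes else None
            ),
        }
    return summary
```

Excluding failed calls from the timing average is a design choice: timeouts would otherwise dominate the mean and hide how fast successful answers are.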
Import Errors
pip install -r requirements.txt

API Connection Issues
- Verify endpoint URL and authentication tokens
- Check network connectivity and firewall settings
- Test with direct API call tools (curl, Postman)
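If curl or Postman is not at hand, a direct call can also be assembled with the Python standard library. The sketch below only builds the request. The JSON body with a `query` field is an assumption, since the payload format expected by your endpoint is not documented here, and `build_query_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_query_request(
    endpoint: str, auth_header: str, question: str
) -> urllib.request.Request:
    """Build a POST request carrying the auth header and one question."""
    # Assumption: the endpoint accepts a JSON body with a "query" field.
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Authorization": auth_header, "Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(request, timeout=30)` and inspect the response status and body. A 401 or 403 points to the auth token, while a connection error points to the URL or the network.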
Pipeline Issues
# Test individual components
python3 src/test_question_generator.py --help
python3 src/multi_prompt_evaluator.py --help
python3 src/llm_grader.py --help
python3 src/dashboard_generator.py --help

Performance Issues
- Reduce question count for testing
- Increase delays between questions and prompts
- Check console output for specific bottlenecks
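Spacing out requests and timing each one covers two of the suggestions above. A minimal sketch, assuming a hypothetical `ask` callable that sends one question to your agent and returns its answer. The function name and the default delay are illustrative, not taken from the project.

```python
import time
from typing import Callable, Iterable, List


def ask_with_delay(
    questions: Iterable[str],
    ask: Callable[[str], str],
    delay_seconds: float = 2.0,  # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Ask each question in turn, pausing between requests and timing each."""
    results = []
    for index, question in enumerate(questions):
        if index > 0:
            sleep(delay_seconds)  # pause between requests, not before the first
        started = time.perf_counter()
        answer = ask(question)
        results.append({
            "question": question,
            "answer": answer,
            "seconds": time.perf_counter() - started,
        })
    return results
```

The per-question `seconds` values make it easier to tell whether slowness comes from a few specific questions or from the endpoint in general.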
Extend the question generator for different domains:
from src.test_question_generator import ResidentAssistantQuestionGenerator
generator = ResidentAssistantQuestionGenerator()
questions = generator.generate_questions(
    num_questions=20,
    categories=["government", "business"],
    complexity_levels=["basic", "intermediate"]
)

Direct integration with your evaluation pipeline:
from src.api_test_harness import APITester, DatabaseManager
from src.llm_grader import ResidentAssistantGrader
# Setup
db = DatabaseManager()
tester = APITester("https://your-api.com", "auth-token", db)
grader = ResidentAssistantGrader()
# Run evaluation
results = tester.run_test_suite(questions, "Test Run")
graded_results = grader.grade_evaluation_results(results)

- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes with tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built for evaluating the South Carolina Resident Assistant chatbot
- Powered by Anthropic's Claude for intelligent grading
- UI framework provided by Flet for cross-platform compatibility
- Designed for government service chatbot evaluation and optimization
**Ready to start evaluating your AI assistant?**
python3 src/run_full_evaluation.py \
--endpoint "https://stateofsc-staging.chatbot.socrata.com/sync_query" \
--auth "Basic VHlsZXJUZWNoOlR5bGVyVGVjaA==" \
--evaluation-name "Production vs Enhanced Prompt" \
--num-questions 25