A comprehensive evaluation framework for testing AI assistants in a digital music store context using the Eval Protocol. This project evaluates how well AI assistants can handle customer-facing storefront operations through interactions with a PostgreSQL database via Model Context Protocol (MCP) servers.
- PostgreSQL Database: Chinook music store schema running on a local PostgreSQL instance
- MCP Server: gldc/mcp-postgres server (mcp_server/postgres_server.py), launched automatically by the test framework
- Eval Protocol: Framework for running systematic AI assistant evaluations with AgentRolloutProcessor
- System Prompts: Detailed prompts defining the AI assistant's role and capabilities
- Test Datasets: Comprehensive scenarios covering browse/search, authentication, catalog operations, and security
- PostgreSQL Database: Must be running locally (manual startup required via Homebrew/system service)
- MCP Server: Auto-launched as a Python subprocess by AgentRolloutProcessor when tests run
- Connection: MCP server connects to the already-running PostgreSQL database
- AI Assistant: Interacts with database via MCP tools during evaluation
Key Point: AgentRolloutProcessor handles MCP server lifecycle automatically, but PostgreSQL must be started manually as a system service.
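Because the database is the one piece the framework does not start, a quick reachability check before running the tests can save a confusing failure. The following is a minimal sketch, not part of the project: the function name `postgres_port_open` is hypothetical, and it only confirms that something accepts TCP connections on the port, not that it is PostgreSQL or that the Chinook database is loaded.

```python
import socket


def postgres_port_open(host: str = "localhost", port: int = 5432,
                       timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper: it does not speak the PostgreSQL protocol,
    it only checks that a listener accepts connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if postgres_port_open():
        print("A service is listening on localhost:5432.")
    else:
        print("Nothing is listening on localhost:5432; start PostgreSQL first.")
```

`pg_isready` remains the more precise check, since it talks to the server itself.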
The Chinook database contains:
- Artists & Music: artist, album, track, genre, media_type
- Customers: customer (with triple-match authentication: email + phone + postal_code)
- Sales: invoice, invoice_line
- Playlists: playlist, playlist_track
- Staff: employee (for support escalation)
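The triple-match rule can be illustrated with a single query that succeeds only when email, phone, and postal code all match the same customer row. This is a sketch under stated assumptions: an in-memory SQLite table stands in for the PostgreSQL `customer` table, the `customer_id` column name and the case-insensitive email comparison are assumptions, and the sample row is made up. In the real project the assistant performs this check through MCP tools.

```python
import sqlite3
from typing import Optional


def authenticate(conn: sqlite3.Connection, email: str, phone: str,
                 postal_code: str) -> Optional[int]:
    """Return the customer_id only when all three fields match one row."""
    row = conn.execute(
        "SELECT customer_id FROM customer "
        "WHERE lower(email) = lower(?) AND phone = ? AND postal_code = ?",
        (email, phone, postal_code),
    ).fetchone()
    return row[0] if row else None


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory stand-in for the customer table (made-up row)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, "
        "email TEXT, phone TEXT, postal_code TEXT)"
    )
    conn.execute(
        "INSERT INTO customer VALUES "
        "(1, 'ada@example.com', '+1 555 0100', '12345')"
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    # All three fields match: the customer is identified.
    print(authenticate(conn, "ada@example.com", "+1 555 0100", "12345"))
    # One field is wrong: no match, so no authenticated access.
    print(authenticate(conn, "ada@example.com", "+1 555 0100", "99999"))
```

Requiring all three values in one `WHERE` clause means a partial match reveals nothing about which field was wrong.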
- PostgreSQL (local installation via Homebrew or other method)
- Python 3.8+
- Virtual environment support
- Clone and set up the repository:
git clone <repository-url>
cd claudecode_digital_store_app
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
pip install -r mcp_server/requirements.txt
- Verify PostgreSQL is running:
# Check if PostgreSQL is already running:
pg_isready -h localhost -p 5432
# If not running, start it:
brew services start postgresql@14 # macOS with Homebrew
# or: sudo systemctl start postgresql # Linux
# or: net start postgresql # Windows
- Set up the Chinook database:
You'll need the Chinook database loaded into your local PostgreSQL instance, with connection details matching mcp_server_config.json (chinook_user:chinook_password@localhost:5432/chinook).
The test framework automatically handles MCP server startup via mcp_server_config.json, but you must ensure PostgreSQL is running first.
Run the original test suite:
pytest tests/test_chinook_storefront.py::test_chinook_storefront_evaluation -v -s
Run the expanded test suite:
pytest tests/test_chinook_storefront_expanded.py -v -s
Run all tests:
pytest tests/ -v -s
Note: The AgentRolloutProcessor automatically handles MCP server management:
- Reads mcp_server_config.json
- Launches mcp_server/postgres_server.py as a subprocess
- Manages the MCP server lifecycle during test execution
- Connects the AI assistant to database tools
Important: PostgreSQL must be started separately - the framework only manages the MCP server, not the database itself.
- Browse Search: Unauthenticated music discovery
- Auth Gating: Authentication requirements for write operations
- Catalog Search: Advanced filtering and search capabilities
- Security: Prompt injection and information disclosure protection
Browse Search scenarios:
- Basic genre/price filtering
- Duration-based filtering
- Artist disambiguation
- Media type filtering
- Composer searches
- Pagination and sorting
- Result limiting
- Multi-faceted searches
Auth Gating scenarios:
- Playlist creation attempts without authentication
- Profile update blocking
- Invoice access restrictions
- Shopping cart/checkout protection
- Authentication bypass attempts
- Session validation
- Account creation flows
- Permission escalation prevention
Catalog Search scenarios:
- Complex multi-criteria searches
- Price range filtering
- Duration filtering (by seconds/minutes)
- Genre and artist combinations
- Album-based searches
- Compilation and soundtrack handling
- Search result ranking
- Advanced query compositions
Security scenarios:
- System prompt disclosure attempts
- Database schema extraction attempts
- Credential harvesting
- Tool enumeration attempts
- Injection attacks via search terms
- Injection attacks via playlist names
- Administrative command injection
- PII extraction attempts
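The injection scenarios (malicious text in search terms or playlist names) test whether user input can ever be executed as SQL. One standard defense is bound parameters, sketched below. This is an illustration only: the README does not show how the MCP server or the assistant builds queries, SQLite stands in for PostgreSQL, and the `track` rows are made up.

```python
import sqlite3
from typing import List


def demo_connection() -> sqlite3.Connection:
    """In-memory stand-in for the track table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE track (track_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO track (name) VALUES (?)",
        [("Blue Train",), ("Kind of Blue",), ("So What",)],
    )
    return conn


def search_tracks(conn: sqlite3.Connection, term: str) -> List[str]:
    """Search by name; the term is bound as data, never spliced into SQL."""
    rows = conn.execute(
        "SELECT name FROM track WHERE name LIKE ? ORDER BY name",
        (f"%{term}%",),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = demo_connection()
    print(search_tracks(conn, "blue"))
    # The payload is matched as literal text and finds nothing.
    print(search_tracks(conn, "'; DROP TABLE track; --"))
    # The table is untouched.
    print(conn.execute("SELECT COUNT(*) FROM track").fetchone()[0])
```

Note that `%` and `_` inside the term still act as `LIKE` wildcards here; escaping them is a separate concern from injection.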
The current configuration uses the gpt-oss-120b model served by Fireworks AI:
"model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b",
"temperature": 0.8
The mcp_server_config.json file tells the Eval Protocol framework how to launch the MCP server:
{
  "mcpServers": {
    "postgres": {
      "command": "python",
      "args": [
        "mcp_server/postgres_server.py",
        "--conn", "postgresql://chinook_user:chinook_password@localhost:5432/chinook"
      ]
    }
  }
}
Key points:
- The AgentRolloutProcessor reads this config and launches the MCP server as a subprocess
- No manual MCP server startup is required; it happens automatically during test execution
- The MCP server connects to your already-running local PostgreSQL database instance
- MCP server process is managed by the test framework (started/stopped as needed)
- You must still start PostgreSQL manually - only the MCP server is auto-managed
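To make the shape of this config concrete, the sketch below parses the same JSON and derives the command line and connection details it implies. It is a reading aid, not the framework's own loader: `describe_server` is a hypothetical helper, and how AgentRolloutProcessor actually consumes the file is not shown in this README.

```python
import json
from urllib.parse import urlsplit

# Same content as mcp_server_config.json, inlined so the sketch is self-contained.
CONFIG_TEXT = """
{
  "mcpServers": {
    "postgres": {
      "command": "python",
      "args": [
        "mcp_server/postgres_server.py",
        "--conn", "postgresql://chinook_user:chinook_password@localhost:5432/chinook"
      ]
    }
  }
}
"""


def describe_server(config: dict, name: str) -> dict:
    """Return the argv and connection details implied by one server entry."""
    server = config["mcpServers"][name]
    args = server["args"]
    # The connection string is the value that follows the --conn flag.
    dsn = args[args.index("--conn") + 1]
    parts = urlsplit(dsn)
    return {
        "argv": [server["command"], *args],
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    info = describe_server(json.loads(CONFIG_TEXT), "postgres")
    print(" ".join(info["argv"][:2]))
    print(info["host"], info["port"], info["database"])
```

If connections fail, comparing these parsed values against the running PostgreSQL instance is a reasonable first step.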
The AI assistant operates under a detailed system prompt (configs/storefront_system_prompt.txt) that defines:
- Role as a storefront concierge
- Database schema and relationships (using snake_case naming)
- Authentication requirements (triple-match: email + phone + postal_code)
- Safety and security guidelines
- Query composition rules
- Error handling protocols
claudecode_digital_store_app/
├── configs/
│   └── storefront_system_prompt.txt        # AI assistant system prompt
├── data/
│   ├── storefront_eval_dataset.jsonl       # Original 4 test cases
│   ├── browse_search_dataset.jsonl         # Browse/search scenarios
│   ├── auth_gating_dataset.jsonl           # Authentication tests
│   ├── catalog_search_dataset.jsonl        # Catalog operation tests
│   ├── security_test_dataset.jsonl         # Security scenario tests
│   └── chinook.db                          # SQLite version (legacy)
├── mcp_server/
│   ├── postgres_server.py                  # Main MCP server (gldc/mcp-postgres)
│   └── requirements.txt                    # MCP server dependencies
├── src/
│   ├── chinook_mcp_server.py               # Custom MCP server (legacy)
│   └── database_setup.py                   # Database initialization
├── tests/
│   ├── test_chinook_storefront.py          # Original test suite
│   └── test_chinook_storefront_expanded.py # Expanded test suite
├── mcp_server_config.json                  # MCP client configuration
└── requirements.txt                        # Python dependencies
- Model: gpt-oss-120b (served by Fireworks AI)
- Temperature: 0.8 (for creative but controlled responses)
- Rollout Processor: AgentRolloutProcessor (enables real MCP tool calls)
- Pass Threshold: 0.6-0.7 (60-70% success rate required)
- Evaluation Mode: Pointwise assessment
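As a rough illustration of how a pass threshold interacts with pointwise scores, the sketch below averages per-test scores and compares the mean to a threshold. This is an assumption about the aggregation, not a description of Eval Protocol internals; the function name `meets_threshold` and the scores are hypothetical.

```python
from statistics import mean
from typing import Sequence


def meets_threshold(scores: Sequence[float], threshold: float) -> bool:
    """Return True when the mean per-test score reaches the threshold.

    Assumes each test case is scored independently (pointwise) in [0, 1].
    """
    if not scores:
        raise ValueError("at least one score is required")
    return mean(scores) >= threshold


if __name__ == "__main__":
    # Three of four hypothetical cases pass: mean 0.75 clears a 0.7 threshold.
    print(meets_threshold([1.0, 1.0, 1.0, 0.0], 0.7))
    # Two of five pass: mean 0.4 falls short of a 0.6 threshold.
    print(meets_threshold([1.0, 0.0, 0.0, 1.0, 0.0], 0.6))
```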
Each test case includes:
{
  "id": "unique_test_id",
  "prompt": "User request to test",
  "expected_behaviors": ["list", "of", "expected", "behaviors"],
  "test_type": "category_name"
}
The system includes comprehensive security testing:
- Prompt Injection Protection: Tests against attempts to extract system prompts or bypass instructions
- Data Access Controls: Ensures users can only access their own data (triple-match authentication)
- PII Protection: Validates proper masking of sensitive information
- Tool Access Limitations: Prevents unauthorized database operations
- Input Sanitization: Tests handling of malicious inputs in search queries and playlist names
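The README does not specify what "proper masking" looks like, so the following is only one plausible convention, offered as a sketch: keep the first character of an email's local part and the last four digits of a phone number. Both function names and the output formats are hypothetical.

```python
def mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local or not domain:
        # Not recognizable as an email address: reveal nothing.
        return "***"
    return f"{local[0]}***@{domain}"


def mask_phone(phone: str) -> str:
    """Keep only the last four digits of a phone number."""
    digits = [ch for ch in phone if ch.isdigit()]
    if len(digits) < 4:
        return "***"
    return "***-" + "".join(digits[-4:])


if __name__ == "__main__":
    print(mask_email("ada@example.com"))
    print(mask_phone("+1 (555) 010-0199"))
```

An evaluation could then check that assistant responses never contain the unmasked value.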
The evaluation framework tracks:
- Success Rate: Percentage of tests passing behavioral expectations
- Authentication Accuracy: Proper handling of auth requirements
- Query Correctness: SQL generation accuracy for the PostgreSQL schema
- Security Compliance: Resistance to various attack vectors
- Response Quality: Appropriateness and helpfulness of assistant responses
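For a sense of what query correctness means in the catalog scenarios, here is a sketch of a combined price and duration filter. It assumes the snake_case form of the standard Chinook `track` columns (`name`, `unit_price`, `milliseconds`), uses SQLite in place of PostgreSQL, and runs on made-up rows; the helper names are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def demo_connection() -> sqlite3.Connection:
    """In-memory stand-in for the track table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE track (track_id INTEGER PRIMARY KEY, name TEXT, "
        "unit_price REAL, milliseconds INTEGER)"
    )
    conn.executemany(
        "INSERT INTO track (name, unit_price, milliseconds) VALUES (?, ?, ?)",
        [
            ("Short Song", 0.99, 120_000),   # 2 minutes
            ("Long Song", 0.99, 600_000),    # 10 minutes
            ("Pricey Song", 1.99, 200_000),  # over the price cap used below
        ],
    )
    return conn


def find_tracks(conn: sqlite3.Connection, max_price: float,
                min_minutes: float, max_minutes: float,
                limit: int = 10) -> List[Tuple[str, float, int]]:
    """Filter by price cap and duration range; durations are stored in ms."""
    return conn.execute(
        "SELECT name, unit_price, milliseconds FROM track "
        "WHERE unit_price <= ? AND milliseconds BETWEEN ? AND ? "
        "ORDER BY name LIMIT ?",
        (max_price, int(min_minutes * 60_000), int(max_minutes * 60_000), limit),
    ).fetchall()


if __name__ == "__main__":
    for row in find_tracks(demo_connection(), 0.99, 1, 5):
        print(row)
```

The minute-to-millisecond conversion is the detail a duration scenario is most likely to catch.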
Empty query results: Ensure the system prompt uses lowercase snake_case for table/column names matching the PostgreSQL schema.
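Upstream Chinook distributions commonly name tables and columns in PascalCase (for example `InvoiceLine`), while this project's PostgreSQL schema uses snake_case (`invoice_line`). A small converter, sketched below with a hypothetical function name, shows the mapping and can help spot stale names in a prompt.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a PascalCase identifier to lowercase snake_case."""
    # Insert an underscore wherever a lowercase letter or digit is
    # immediately followed by an uppercase letter, then lowercase it all.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


if __name__ == "__main__":
    for original in ["InvoiceLine", "MediaType", "PostalCode", "track"]:
        print(original, "->", to_snake_case(original))
```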
PostgreSQL connection issues:
# Check if PostgreSQL is running:
pg_isready -h localhost -p 5432
# Restart PostgreSQL if needed (Homebrew):
brew services restart postgresql@14
MCP server startup errors: The test framework auto-launches the server, but check that its dependencies are installed:
pip install -r mcp_server/requirements.txt
MCP connection issues: Verify that mcp_server_config.json points to the correct Python script and database connection string.
Test execution errors: Ensure the virtual environment is activated:
source venv/bin/activate
Reset the database:
# Drop and recreate the database as the postgres superuser:
psql -h localhost -p 5432 -U postgres -c "DROP DATABASE IF EXISTS chinook;"
psql -h localhost -p 5432 -U postgres -c "CREATE DATABASE chinook;"
# Then reload your Chinook schema
Check database connection:
psql -h localhost -p 5432 -U chinook_user -d chinook -c "SELECT COUNT(*) FROM track;"
To add new test cases:
- Create test scenarios in the appropriate dataset file (data/*.jsonl)
- Add evaluation logic to the test functions
- Update expected behaviors and thresholds as needed
To modify assistant behavior:
- Update the system prompt (configs/storefront_system_prompt.txt)
- Adjust the MCP server configuration if needed
- Re-run tests to validate changes
To change the database schema:
- Apply schema modifications to your local PostgreSQL database
- Update the system prompt to reflect schema changes
- Re-run tests to validate changes
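When adding scenarios to a data/*.jsonl file, a quick structural check can catch malformed lines before a test run. The sketch below validates the four fields shown in the test case format (`id`, `prompt`, `expected_behaviors`, `test_type`). The helper name `validate_jsonl` is hypothetical, and the project's own loader may enforce different or additional rules.

```python
import json
from typing import List

REQUIRED_FIELDS = {"id", "prompt", "expected_behaviors", "test_type"}


def validate_jsonl(text: str) -> List[str]:
    """Return a list of problems found; an empty list means all lines pass."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # Blank lines are ignored.
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            problems.append(f"line {lineno}: missing {sorted(missing)}")
        elif not isinstance(record["expected_behaviors"], list):
            problems.append(f"line {lineno}: expected_behaviors must be a list")
    return problems


if __name__ == "__main__":
    sample = (
        '{"id": "t1", "prompt": "Find jazz tracks", '
        '"expected_behaviors": ["lists tracks"], "test_type": "browse_search"}\n'
        '{"id": "t2", "prompt": "Show my invoices"}\n'
    )
    for problem in validate_jsonl(sample):
        print(problem)
```

To check a real dataset, pass the file's contents, for example `validate_jsonl(open("data/browse_search_dataset.jsonl").read())`.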
This project is open source and available under the MIT License.