Build a fully generative chatbot using the OpenRouter API that handles order cancellation and tracking requests while enforcing company policies (e.g., only orders placed less than 10 days ago are eligible for cancellation).
- LLM Integration: OpenRouter API
- Programming Language: Python
- Testing: pytest
- Evaluation Framework: AISI's Inspect framework
- API Framework: FastAPI (for mock endpoints)
customer-service-bot-mockup/
├── api/
│   ├── __init__.py
│   ├── models.py            # Data models for orders
│   ├── mock_data.py         # Seed data for the in-memory order store
│   └── order_api.py         # FastAPI endpoints
├── bot/
│   ├── __init__.py
│   ├── bot.py               # Main bot logic
│   ├── tools.py             # Tool definitions for LLM
│   ├── policies.py          # Policy enforcement logic
│   └── llm_client.py        # OpenRouter API wrapper
├── tests/
│   ├── __init__.py
│   ├── test_api.py          # API endpoint tests
│   ├── test_bot.py          # Bot behavior tests
│   └── test_policies.py     # Policy enforcement tests
├── eval/
│   ├── __init__.py
│   ├── scenarios.py         # Test scenarios/cases
│   ├── evaluator.py         # Evaluation logic using Inspect
│   └── metrics.py           # Metric calculations
├── scripts/
│   └── generate_report.py   # Report generation script
├── requirements.txt
└── README.md                # Setup and usage instructions
api/models.py
Purpose: Define data structures for orders and operations
- Order Model (class)
- Attributes:
- order_id: str
- customer_id: str
- order_date: datetime
- status: str (e.g., "pending", "shipped", "delivered", "cancelled")
- items: List[dict]
- total_amount: float
- shipping_address: str
- OrderStatus Model (class)
- Attributes:
- order_id: str
- status: str
- tracking_number: str (optional)
- estimated_delivery: datetime (optional)
- last_update: datetime
- CancellationRequest Model (class)
- Attributes:
- order_id: str
- customer_id: str
- reason: str
- timestamp: datetime
- Methods:
- validate() -> bool
- CancellationResponse Model (class)
- Attributes:
- success: bool
- order_id: str
- message: str
- refund_amount: float (optional)
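A minimal sketch of these models as they might appear in api/models.py, assuming Pydantic (FastAPI's native modeling library); the validate_request name is an assumption, renamed to avoid clashing with Pydantic's own validate():

```python
# Sketch of api/models.py using Pydantic models.
from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel


class Order(BaseModel):
    order_id: str
    customer_id: str
    order_date: datetime
    status: str  # "pending", "shipped", "delivered", "cancelled"
    items: List[dict]
    total_amount: float
    shipping_address: str


class OrderStatus(BaseModel):
    order_id: str
    status: str
    tracking_number: Optional[str] = None
    estimated_delivery: Optional[datetime] = None
    last_update: datetime


class CancellationRequest(BaseModel):
    order_id: str
    customer_id: str
    reason: str
    timestamp: datetime

    def validate_request(self) -> bool:
        # Basic sanity check; renamed from validate() to avoid shadowing
        # Pydantic's own classmethod of that name.
        return bool(self.order_id and self.customer_id)


class CancellationResponse(BaseModel):
    success: bool
    order_id: str
    message: str
    refund_amount: Optional[float] = None
```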
api/mock_data.py
Purpose: Seed the in-memory database with sample orders
- get_orders() -> List[Order]
- get_order_by_id(order_id: str) -> Optional[Order]
- search_orders(customer_id: str) -> List[Order]
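A sketch of api/mock_data.py, assuming a module-level list as the in-memory store; the specific orders are illustrative, with ages on both sides of the 10-day rule:

```python
# Sketch of api/mock_data.py: a module-level list acts as the in-memory DB.
from datetime import datetime, timedelta
from typing import List, Optional

from api.models import Order

_ORDERS: List[Order] = [
    Order(order_id="ORD-001", customer_id="CUST-1",
          order_date=datetime.now() - timedelta(days=3),   # within cancellation window
          status="pending", items=[{"sku": "A1", "qty": 1}],
          total_amount=49.99, shipping_address="1 Main St"),
    Order(order_id="ORD-002", customer_id="CUST-1",
          order_date=datetime.now() - timedelta(days=30),  # too old to cancel
          status="delivered", items=[{"sku": "B2", "qty": 2}],
          total_amount=120.00, shipping_address="1 Main St"),
]

def get_orders() -> List[Order]:
    return list(_ORDERS)

def get_order_by_id(order_id: str) -> Optional[Order]:
    return next((o for o in _ORDERS if o.order_id == order_id), None)

def search_orders(customer_id: str) -> List[Order]:
    return [o for o in _ORDERS if o.customer_id == customer_id]
```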
api/order_api.py
Purpose: FastAPI endpoints for order operations
- Endpoints:
  - GET /health
  - GET /orders/{order_id}
  - GET /orders/{order_id}/tracking
  - POST /orders/{order_id}/cancel
  - GET /customers/{customer_id}
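A sketch of how the health and cancel endpoints might look in api/order_api.py, assuming the models and seed data sketched above; the status codes (404, 409) are assumptions:

```python
# Sketch of api/order_api.py (health and cancel endpoints only).
from datetime import datetime

from fastapi import FastAPI, HTTPException

from api import mock_data
from api.models import CancellationResponse

app = FastAPI()

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

@app.post("/orders/{order_id}/cancel", response_model=CancellationResponse)
def cancel_order(order_id: str) -> CancellationResponse:
    order = mock_data.get_order_by_id(order_id)
    if order is None:
        raise HTTPException(status_code=404, detail="Order not found")
    if order.status == "cancelled":
        raise HTTPException(status_code=409, detail="Order already cancelled")
    # Enforce the 10-day cancellation window
    if (datetime.now() - order.order_date).days >= 10:
        return CancellationResponse(success=False, order_id=order_id,
                                    message="Order is older than 10 days")
    order.status = "cancelled"
    return CancellationResponse(success=True, order_id=order_id,
                                message="Order cancelled",
                                refund_amount=order.total_amount)
```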
tests/test_api.py
Test Cases:
- test_health_endpoint()
- test_get_order_success()
- test_get_order_not_found()
- test_get_tracking_success()
- test_cancel_order_within_policy()
- test_cancel_order_outside_policy()
- test_cancel_nonexistent_order()
- test_cancel_already_cancelled_order()
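A sketch of two of these tests using FastAPI's TestClient (requires httpx); the order IDs refer to the assumed seed data above:

```python
# Sketch of tests/test_api.py: exercise the mock API in-process.
from fastapi.testclient import TestClient

from api.order_api import app

client = TestClient(app)

def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200

def test_cancel_order_outside_policy():
    # ORD-002 is 30 days old in the assumed seed data, so the
    # 10-day policy should block cancellation.
    response = client.post("/orders/ORD-002/cancel")
    assert response.status_code == 200
    assert response.json()["success"] is False
```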
bot/llm_client.py
Purpose: Wrapper for the OpenRouter API
- LLMClient (class)
- Attributes:
- api_key: str
- model: str (e.g., "anthropic/claude-haiku-4.5"; we'll use Claude Haiku 4.5 as the test model in this repo)
- base_url: str = "https://openrouter.ai/api/v1/chat/completions"
- Methods:
- __init__(api_key: str, model: str)
- chat(messages: List[dict], tools: List[dict], temperature: float = 0.7) -> dict
- format_function_calls(tool_calls: List[dict]) -> str
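A sketch of the wrapper using plain requests against OpenRouter's OpenAI-compatible REST endpoint; the timeout and default model are assumptions:

```python
# Sketch of bot/llm_client.py: a thin wrapper over OpenRouter's
# chat completions endpoint.
from typing import List, Optional

import requests

class LLMClient:
    def __init__(self, api_key: str, model: str = "anthropic/claude-haiku-4.5"):
        self.api_key = api_key
        self.model = model
        self.base_url = "https://openrouter.ai/api/v1/chat/completions"

    def chat(self, messages: List[dict], tools: Optional[List[dict]] = None,
             temperature: float = 0.7) -> dict:
        payload = {"model": self.model, "messages": messages,
                   "temperature": temperature}
        if tools:
            payload["tools"] = tools
        response = requests.post(
            self.base_url,
            headers={"Authorization": f"Bearer {self.api_key}"},
            json=payload,
            timeout=60,
        )
        response.raise_for_status()
        return response.json()
```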
bot/policies.py
Purpose: Centralized policy logic for decision-making
- PolicyChecker (class)
- Methods:
- can_cancel_order(order: Order) -> tuple[bool, str]
- get_order_status_message(order: Order) -> str
- validate_customer_ownership(order: Order, customer_id: str) -> bool
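A sketch of the 10-day rule in bot/policies.py; treating delivered orders as non-cancellable is an assumption:

```python
# Sketch of bot/policies.py: centralized 10-day cancellation rule.
from datetime import datetime

from api.models import Order

CANCELLATION_WINDOW_DAYS = 10

class PolicyChecker:
    def can_cancel_order(self, order: Order) -> tuple[bool, str]:
        if order.status == "cancelled":
            return False, "Order is already cancelled."
        if order.status == "delivered":
            return False, "Delivered orders cannot be cancelled."
        age_days = (datetime.now() - order.order_date).days
        if age_days >= CANCELLATION_WINDOW_DAYS:
            return False, (f"Order was placed {age_days} days ago; only orders "
                           f"under {CANCELLATION_WINDOW_DAYS} days old can be cancelled.")
        return True, "Order is eligible for cancellation."

    def validate_customer_ownership(self, order: Order, customer_id: str) -> bool:
        return order.customer_id == customer_id
```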
bot/tools.py
Purpose: Define tools/functions for the LLM to call
- Tool Definitions (dictionaries):
  - track_order_tool
  - cancel_order_tool
  - lookup_customer_orders_tool
- tool_executor() (function)
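A sketch of one tool definition, using the same OpenAI function-calling schema as the example at the end of this doc; the parameter set is an assumption:

```python
# Sketch of bot/tools.py: one tool definition in function-calling format.
cancel_order_tool = {
    "type": "function",
    "function": {
        "name": "cancel_order",
        "description": "Cancel a customer's order if it is within the "
                       "10-day cancellation window",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The ID of the order to cancel"
                },
                "reason": {
                    "type": "string",
                    "description": "The customer's reason for cancelling"
                }
            },
            "required": ["order_id"]
        }
    }
}
```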
bot/bot.py
Purpose: Orchestrate bot interactions with the LLM
- CustomerServiceBot (class)
- Attributes:
- llm_client: LLMClient
- api_base_url: str
- policy_checker: PolicyChecker
- conversation_history: List[dict]
- Methods:
- __init__(api_key: str, model: str, api_base_url: str)
- process_message(user_message: str, customer_id: str = None) -> str
  # The LLM reply is either:
  # - Tool calls (execute them)
  # - Text response (return to user)
  # For tool calls: execute each one, append tool results to the
  # conversation, then call the LLM again with the results for the final response
- get_conversation_history() -> List[dict]
- reset_conversation()
- Helper Methods:
- execute_tool_calls(tool_calls: List[dict], customer_id: str) -> List[dict]
  Runs tool_executor for each call. Tool calls do not need to be parsed from text; they arrive as structured data in the API response (see the tool-calling example at the end of this doc).
- format_context_for_llm(tool_results: List[dict]) -> str
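A sketch of the process_message tool loop described above; ALL_TOOLS is an assumed module-level list of the tool definitions from bot/tools.py, and the response parsing follows the OpenAI-compatible shape OpenRouter returns:

```python
# Sketch of the tool loop in bot/bot.py.
class CustomerServiceBot:
    def process_message(self, user_message: str, customer_id: str = None) -> str:
        self.conversation_history.append({"role": "user", "content": user_message})
        response = self.llm_client.chat(self.conversation_history, tools=ALL_TOOLS)
        message = response["choices"][0]["message"]

        # Keep looping while the model requests tools
        while message.get("tool_calls"):
            self.conversation_history.append(message)
            results = self.execute_tool_calls(message["tool_calls"], customer_id)
            self.conversation_history.extend(results)  # role: "tool" messages
            response = self.llm_client.chat(self.conversation_history, tools=ALL_TOOLS)
            message = response["choices"][0]["message"]

        self.conversation_history.append(message)
        return message["content"]
```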
tests/test_bot.py
Note: user questions and the LLM/API responses are mocked in these tests.
Test Cases:
- test_bot_tracks_order_successfully()
- test_bot_cancels_recent_order()
- test_bot_rejects_old_order_cancellation()
- test_bot_handles_invalid_order()
- test_bot_handles_conversation_history()
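A sketch of one such test, stubbing the LLM client so no network calls happen; the canned response dict mimics the OpenRouter chat-completions shape:

```python
# Sketch of tests/test_bot.py: replace the LLM client with a stub.
from unittest.mock import MagicMock

from bot.bot import CustomerServiceBot

def test_bot_rejects_old_order_cancellation():
    bot = CustomerServiceBot(api_key="test", model="test",
                             api_base_url="http://localhost:8000")
    # Canned final response with no tool calls
    bot.llm_client = MagicMock()
    bot.llm_client.chat.return_value = {
        "choices": [{"message": {
            "role": "assistant",
            "content": "Sorry, that order is older than 10 days and "
                       "cannot be cancelled."
        }}]
    }
    reply = bot.process_message("Cancel order ORD-002", customer_id="CUST-1")
    assert "cannot be cancelled" in reply
```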
eval/scenarios.py
Purpose: Define test cases for evaluation
- Scenario (dataclass)
- Attributes:
- scenario_id: str
- description: str
- customer_id: str
- messages: List[str] # Conversation turns
- expected_behavior: str # What bot should do
- expected_policy_enforcement: bool # Should policy be enforced?
- category: str # "cancellation", "tracking", "policy_violation"
- get_test_scenarios() -> List[Scenario]
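A sketch of the dataclass and two illustrative scenarios, one on each side of the 10-day rule:

```python
# Sketch of eval/scenarios.py.
from dataclasses import dataclass
from typing import List

@dataclass
class Scenario:
    scenario_id: str
    description: str
    customer_id: str
    messages: List[str]
    expected_behavior: str
    expected_policy_enforcement: bool
    category: str  # "cancellation", "tracking", "policy_violation"

def get_test_scenarios() -> List[Scenario]:
    return [
        Scenario("S-001", "Cancel a 3-day-old order", "CUST-1",
                 ["Please cancel order ORD-001"],
                 "Bot calls cancel_order and confirms cancellation",
                 expected_policy_enforcement=False, category="cancellation"),
        Scenario("S-002", "Attempt to cancel a 30-day-old order", "CUST-1",
                 ["Cancel order ORD-002"],
                 "Bot refuses, citing the 10-day policy",
                 expected_policy_enforcement=True, category="policy_violation"),
    ]
```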
eval/metrics.py
Purpose: Calculate performance metrics
- Metrics (class)
- Attributes:
- total_tokens: int
- total_conversations: int
- tool_calls_count: int
- policy_enforcement_rate: float
- correct_action_rate: float
- average_tokens_per_conversation: float
- tokens_per_user_message: float
- Methods:
- add_conversation(tokens: int, tools_used: int, policy_correct: bool)
- calculate_averages() -> dict
- get_report() -> str
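A sketch of the running-total approach, assuming rates and averages are derived on demand rather than stored:

```python
# Sketch of eval/metrics.py: accumulate totals, derive averages on demand.
class Metrics:
    def __init__(self):
        self.total_tokens = 0
        self.total_conversations = 0
        self.tool_calls_count = 0
        self._policy_correct = 0

    def add_conversation(self, tokens: int, tools_used: int, policy_correct: bool):
        self.total_tokens += tokens
        self.total_conversations += 1
        self.tool_calls_count += tools_used
        self._policy_correct += int(policy_correct)

    def calculate_averages(self) -> dict:
        n = max(self.total_conversations, 1)  # avoid division by zero
        return {
            "average_tokens_per_conversation": self.total_tokens / n,
            "policy_enforcement_rate": self._policy_correct / n,
        }
```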
eval/evaluator.py
Purpose: Run evaluation using AISI's Inspect framework
- Evaluator (class)
- Attributes:
- bot: CustomerServiceBot
- scenarios: List[Scenario]
- metrics: Metrics
- Methods:
- __init__(bot: CustomerServiceBot)
- evaluate_scenario(scenario: Scenario) -> dict
  # For each scenario turn, call bot.process_message()
  # Track: tokens used, tools called, response quality
- evaluate_all_scenarios() -> List[dict]
- generate_summary() -> dict
Integration Strategy:
- Use AISI's Inspect framework to:
- Track decision paths: Record when bot decides to call tools vs. respond directly
- Monitor policy adherence: Log policy checks and outcomes
- Measure token efficiency: Record tokens per interaction
- Analyze tool usage: Count and categorize tool calls
- Custom Inspectors:
- PolicyInspector: Check if the bot correctly applies the 10-day rule
- ToolUsageInspector: Track which tools are used and when
- TokenInspector: Monitor token consumption patterns
- ErrorInspector: Detect and classify errors
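A heavily simplified sketch of how scenarios might map onto Inspect primitives, assuming the inspect_ai package (pip install inspect-ai); generate() is a placeholder solver here, and wiring the bot in as a custom solver is left out:

```python
# Sketch of eval/evaluator.py using inspect_ai primitives.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

from eval.scenarios import get_test_scenarios

@task
def customer_service_eval():
    dataset = [
        Sample(
            input=scenario.messages[0],
            target=scenario.expected_behavior,
            metadata={
                "category": scenario.category,
                "expected_policy_enforcement": scenario.expected_policy_enforcement,
            },
        )
        for scenario in get_test_scenarios()
    ]
    return Task(dataset=dataset, solver=generate(), scorer=includes())
```

This would be run with `inspect eval eval/evaluator.py` from the project root.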
scripts/generate_report.py
Purpose: Generate a comprehensive evaluation report by querying Inspect logs
- generate_report(evaluation_results: List[dict], output_path: str) -> None
  # Metrics to include:
  # - Total conversations: N
  # - Tokens per conversation: average
  # - Policy enforcement rate: percentage correct
  # - Tool usage: breakdown by tool type
  # - Response quality: qualitative assessment
  # Report structure:
  # - Executive Summary
  # - Methodology
  # - Quantitative Results (tables, charts)
  # - Qualitative Analysis (scenario-by-scenario)
  # - Key Findings
  # - Recommendations
- Visualization (optional)
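A sketch of the report writer, assuming evaluation result dicts carry a policy_correct key; the Markdown layout is illustrative:

```python
# Sketch of scripts/generate_report.py: write a Markdown summary.
from typing import List

def generate_report(evaluation_results: List[dict], output_path: str) -> None:
    total = len(evaluation_results)
    enforced = sum(1 for r in evaluation_results if r.get("policy_correct"))
    lines = [
        "# Customer Service Bot Evaluation Report",
        "",
        "## Executive Summary",
        f"- Total conversations: {total}",
        f"- Policy enforcement rate: {enforced / max(total, 1):.0%}",
    ]
    with open(output_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```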
Implementation Plan:

Phase 1: Mock API
- Create project structure
- Implement api/models.py with data structures
- Implement api/mock_data.py with seed data
- Implement api/order_api.py with FastAPI endpoints
- Write tests/test_api.py with comprehensive tests
- Run tests to verify the API works

Phase 2: Bot
- Implement bot/llm_client.py for OpenRouter integration
- Implement bot/policies.py for policy enforcement
- Implement bot/tools.py for tool definitions
- Implement bot/bot.py for main bot logic
- Write tests/test_bot.py for bot behavior
- Run tests to verify the bot works

Phase 3: Evaluation
- Implement eval/scenarios.py with test cases
- Implement eval/metrics.py for metric calculations
- Implement eval/evaluator.py with AISI Inspect integration
- Run evaluation on all scenarios
- Verify metrics are being collected correctly

Phase 4: Reporting
- Implement scripts/generate_report.py
- Generate report with metrics and insights
- Review and refine the report
- Add visualizations if needed
Success Criteria:
- API: All endpoints work, tests pass (100% coverage on critical paths)
- Bot: Bot correctly interprets queries, calls appropriate tools, enforces policies
- Evaluation: Metrics show bot follows policies correctly (target: 90%+ accuracy)
- Report: Clear documentation of approach, results, and insights
- Documentation: Code is well-documented, README explains setup and usage
Key Design Decisions:
- Policy Enforcement: Centralized in the PolicyChecker class to ensure consistency
- Tool-Based Architecture: Use function calling to give the LLM structured API access
- Conversation History: Maintain full context for multi-turn interactions
- Evaluation: Use AISI inspect to get detailed insights into bot behavior
- Metrics: Focus on policy adherence and token efficiency
- Mock Data: Generate realistic scenarios with varying ages and statuses
After this plan is approved:
- Set up project structure
- Install dependencies (fastapi, openai, pytest, inspect-ai; see the requirements.txt sketch below)
- Get OpenRouter API key
- Begin Phase 1 implementation
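A plausible requirements.txt for this stack (versions omitted; uvicorn and httpx are assumptions, for serving the mock API and running TestClient respectively):

```
fastapi
uvicorn        # serve the mock API locally
httpx          # required by FastAPI's TestClient
openai         # OpenRouter is OpenAI-API compatible
requests
pytest
inspect-ai
```

Tool-calling example (referenced above):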
```python
import json
import os

from openai import OpenAI

# Initialize the OpenRouter client (OpenRouter exposes an OpenAI-compatible API)
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY"),
    default_headers={
        "HTTP-Referer": "http://localhost:5000",
        "X-Title": "Tool Calling Example"
    }
)

# Define the tool/function
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and country, e.g. London, UK"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Initial API call with tools
response = client.chat.completions.create(
    model="openai/gpt-4o",  # Use a model that supports tool calling
    messages=[
        {"role": "user", "content": "What's the weather like in Paris?"}
    ],
    tools=tools
)

# Check if the model wants to call a tool
message = response.choices[0].message
if message.tool_calls:
    # The model requested a tool call
    tool_call = message.tool_calls[0]
    print(f"Model wants to call: {tool_call.function.name}")
    print(f"With arguments: {tool_call.function.arguments}")

    # Simulate the tool execution (in reality, you'd call your actual function)
    args = json.loads(tool_call.function.arguments)

    # Mock weather data
    weather_result = {
        "location": args["location"],
        "temperature": 18,
        "unit": args.get("unit", "celsius"),
        "conditions": "Partly cloudy"
    }

    # Send the tool result back to the model for the final answer
    final_response = client.chat.completions.create(
        model="openai/gpt-4o",
        messages=[
            {"role": "user", "content": "What's the weather like in Paris?"},
            message,  # Include the assistant's tool-call message
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(weather_result)
            }
        ],
        tools=tools
    )

    print(f"Final response: {final_response.choices[0].message.content}")
```