
Implementation Plan: Generative Customer Service Bot with Policy Enforcement

Overview

Build a fully generative chatbot, using the OpenRouter API, that handles order cancellation and tracking requests while enforcing company policies (e.g., orders placed less than 10 days ago are eligible for cancellation).

Technology Stack

  • LLM Integration: OpenRouter API
  • Programming Language: Python
  • Testing: pytest
  • Evaluation Framework: AISI's inspect framework
  • API Framework: FastAPI (for mock endpoints)

Phase 1: Mock API Implementation

1.1 Project Structure

customer-service-bot-mockup/
├── api/
│   ├── __init__.py
│   ├── models.py          # Data models for orders
│   ├── order_api.py       # FastAPI endpoints
├── bot/
│   ├── __init__.py
│   ├── bot.py             # Main bot logic
│   ├── tools.py           # Tool definitions for LLM
│   ├── policies.py        # Policy enforcement logic
│   └── llm_client.py      # OpenRouter API wrapper
├── tests/
│   ├── __init__.py
│   ├── test_api.py        # API endpoint tests
│   ├── test_bot.py        # Bot behavior tests
│   ├── test_policies.py   # Policy enforcement tests
│   └── mock_data.py       # Seed data for testing
├── eval/
│   ├── __init__.py
│   ├── scenarios.py       # Test scenarios/cases
│   ├── evaluator.py       # Evaluation logic using inspect
│   └── metrics.py         # Metric calculations
├── scripts/
│   └── generate_report.py # Report generation script
└── requirements.txt

1.2 Mock API Implementation Details

api/models.py

Purpose: Define data structures for orders and operations

  • Order Model (class)
    • Attributes:
      • order_id: str
      • customer_id: str
      • order_date: datetime
      • status: str (e.g., "pending", "shipped", "delivered", "cancelled")
      • items: List[dict]
      • total_amount: float
      • shipping_address: str
  • OrderStatus Model (class)
    • Attributes:
      • order_id: str
      • status: str
      • tracking_number: str (optional)
      • estimated_delivery: datetime (optional)
      • last_update: datetime
  • CancellationRequest Model (class)
    • Attributes:
      • order_id: str
      • customer_id: str
      • reason: str
      • timestamp: datetime
    • Methods:
      • validate() -> bool

        Check if request is valid (has required fields)

  • CancellationResponse Model (class)
    • Attributes:
      • success: bool
      • order_id: str
      • message: str
      • refund_amount: float (optional)
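A minimal sketch of these models, assuming Pydantic (which FastAPI already depends on); validate() is renamed validate_request() here to avoid shadowing Pydantic's own method:

from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel


class Order(BaseModel):
    order_id: str
    customer_id: str
    order_date: datetime
    status: str  # "pending", "shipped", "delivered", "cancelled"
    items: List[dict]
    total_amount: float
    shipping_address: str


class OrderStatus(BaseModel):
    order_id: str
    status: str
    tracking_number: Optional[str] = None
    estimated_delivery: Optional[datetime] = None
    last_update: datetime


class CancellationRequest(BaseModel):
    order_id: str
    customer_id: str
    reason: str
    timestamp: datetime

    def validate_request(self) -> bool:
        # Valid when all required fields are non-empty
        return all([self.order_id, self.customer_id, self.reason])


class CancellationResponse(BaseModel):
    success: bool
    order_id: str
    message: str
    refund_amount: Optional[float] = None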

tests/mock_data.py

Purpose: Seed in-memory database with sample orders

  • get_orders() -> List[Order]

    10-20 sample orders with varying dates and statuses (pre-generated)

    Mix of orders: some < 10 days old, some > 10 days old

    Include different statuses: pending, shipped, delivered, cancelled

  • get_order_by_id(order_id: str) -> Optional[Order]

    Query the mock data for specific order

  • search_orders(customer_id: str) -> List[Order]

    Find all orders for a customer
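A sketch of the seed helpers; order dates are generated relative to now so the under/over 10-day mix stays valid, and the order IDs mirror the ones used in the test sections below:

from datetime import datetime, timedelta
from typing import List, Optional

from api.models import Order


def get_orders() -> List[Order]:
    now = datetime.now()
    return [
        # Recent order: inside the 10-day cancellation window
        Order(order_id="ABC123", customer_id="CUST1",
              order_date=now - timedelta(days=3), status="pending",
              items=[{"sku": "WIDGET", "qty": 1}], total_amount=19.99,
              shipping_address="1 Main St"),
        # Old order: outside the 10-day cancellation window
        Order(order_id="XYZ789", customer_id="CUST1",
              order_date=now - timedelta(days=30), status="delivered",
              items=[{"sku": "GADGET", "qty": 2}], total_amount=49.98,
              shipping_address="1 Main St"),
        # ... 10-20 orders total, covering every status
    ]


def get_order_by_id(order_id: str) -> Optional[Order]:
    return next((o for o in get_orders() if o.order_id == order_id), None)


def search_orders(customer_id: str) -> List[Order]:
    return [o for o in get_orders() if o.customer_id == customer_id]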

api/order_api.py

Purpose: FastAPI endpoints for order operations; a sketch of the cancellation endpoint follows the endpoint list

  • ENDPOINTS:
    1. GET /health

      Return {"status": "ok"} to verify API is running

    2. GET /orders/{order_id}

      Retrieve order details by ID

      Return: Order object with details

      Error handling: 404 if not found

    3. GET /orders/{order_id}/tracking

      Get tracking information for an order

      Return: OrderStatus with tracking number and delivery estimate

      Error handling: 404 if order doesn't exist

    4. POST /orders/{order_id}/cancel

      Cancel an order

      Request body: CancellationRequest (reason, customer_id)

      Logic:

      1. Find order by ID

      2. Check policy: is order < 10 days old?

      3. Check status: can it be cancelled?

      4. Update order status to "cancelled"

      5. Return CancellationResponse

      Error handling:

      400: Policy violation (order too old)

      404: Order not found

      409: Order already cancelled

    5. GET /customers/{customer_id}

      Retrieve all orders for a customer

      Return: List of order IDs

      Error handling: 404 if customer not found
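A sketch of the cancellation endpoint, the one that carries the policy logic. It assumes an in-memory ORDERS dict built once from the seed data, and the PolicyChecker defined in Phase 2:

from fastapi import FastAPI, HTTPException

from api.models import CancellationRequest, CancellationResponse
from bot.policies import PolicyChecker
from tests.mock_data import get_orders

app = FastAPI()
ORDERS = {o.order_id: o for o in get_orders()}  # in-memory store, built once
policy = PolicyChecker()


@app.get("/health")
def health():
    return {"status": "ok"}


@app.post("/orders/{order_id}/cancel", response_model=CancellationResponse)
def cancel_order(order_id: str, request: CancellationRequest):
    order = ORDERS.get(order_id)
    if order is None:
        raise HTTPException(status_code=404, detail="Order not found")
    if order.status == "cancelled":
        raise HTTPException(status_code=409, detail="Order already cancelled")
    eligible, reason = policy.can_cancel_order(order)
    if not eligible:
        raise HTTPException(status_code=400, detail=reason)  # policy violation
    order.status = "cancelled"
    return CancellationResponse(success=True, order_id=order_id,
                                message=reason, refund_amount=order.total_amount)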

1.3 API Tests

tests/test_api.py

Test Cases:

  • test_health_endpoint()

    Verify /health returns 200 and {"status": "ok"}

  • test_get_order_success()

    Fetch valid order, verify response structure

  • test_get_order_not_found()

    Fetch non-existent order, verify 404

  • test_get_tracking_success()

    Fetch tracking for valid order

  • test_cancel_order_within_policy()

    Cancel order < 10 days old, verify success

  • test_cancel_order_outside_policy()

    Attempt to cancel order > 10 days old, verify 400 error

  • test_cancel_nonexistent_order()

    Attempt to cancel missing order, verify 404

  • test_cancel_already_cancelled_order()

    Attempt to cancel already-cancelled order, verify 409
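Two of these tests sketched with FastAPI's TestClient; the order ID and its 30-day age come from the assumed seed data above:

from fastapi.testclient import TestClient

from api.order_api import app

client = TestClient(app)


def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}


def test_cancel_order_outside_policy():
    # XYZ789 is seeded as 30 days old, beyond the 10-day window
    response = client.post("/orders/XYZ789/cancel", json={
        "order_id": "XYZ789",
        "customer_id": "CUST1",
        "reason": "changed my mind",
        "timestamp": "2024-01-01T00:00:00",
    })
    assert response.status_code == 400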


Phase 2: Bot Implementation

2.1 LLM Client

bot/llm_client.py

Purpose: Wrapper for OpenRouter API

  • LLMClient (class)
    • Attributes:
      • api_key: str
      • model: str
    • Methods:
      • __init__(api_key: str, model: str)

        Initialize with OpenRouter credentials

      • chat(messages: List[dict], tools: List[dict], temperature: float = 0.7) -> dict

        Send request to OpenRouter with messages and tool definitions

        Return API response with tokens used and content

      • format_function_calls(tool_calls: List[dict]) -> str

        Convert function call structure to readable format for logging
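A sketch of the wrapper. It reuses the OpenAI SDK pointed at OpenRouter, as in the tool calling example at the end of this document:

from typing import List, Optional

from openai import OpenAI


class LLMClient:
    def __init__(self, api_key: str, model: str):
        self.api_key = api_key
        self.model = model
        self.client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)

    def chat(self, messages: List[dict], tools: Optional[List[dict]] = None,
             temperature: float = 0.7) -> dict:
        kwargs = {"model": self.model, "messages": messages, "temperature": temperature}
        if tools:
            kwargs["tools"] = tools  # only send tools when some are defined
        response = self.client.chat.completions.create(**kwargs)
        message = response.choices[0].message
        return {
            "message": message,  # raw message, reusable in conversation history
            "tool_calls": message.tool_calls or [],
            "content": message.content,
            "tokens_used": response.usage.total_tokens if response.usage else 0,
        }

    def format_function_calls(self, tool_calls: list) -> str:
        # e.g. 'cancel_order({"order_id": "ABC123"})' for logging
        return "; ".join(f"{tc.function.name}({tc.function.arguments})"
                         for tc in tool_calls)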

2.2 Policy Enforcement

bot/policies.py

Purpose: Centralized policy logic for decision-making

  • PolicyChecker (class)
    • Methods:
      • can_cancel_order(order: Order) -> tuple[bool, str]

        Check if order can be cancelled

        Logic:

        1. Calculate days since order_date

        2. If days < 10: return (True, "Order eligible for cancellation")

        3. If days >= 10: return (False, "Order older than 10 days, not eligible")

        Returns: (success_boolean, reason_message)

      • get_order_status_message(order: Order) -> str

        Generate human-readable status message

        Format: "Your order {id} is {status} as of {date}"

      • validate_customer_ownership(order: Order, customer_id: str) -> bool

        Verify customer owns the order

        Returns: True if customer_id matches, False otherwise
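The checker translates almost directly into code. One assumption: the "as of" date in the status message uses order_date, since Order carries no last-update field:

from datetime import datetime

from api.models import Order


class PolicyChecker:
    CANCELLATION_WINDOW_DAYS = 10

    def can_cancel_order(self, order: Order) -> tuple[bool, str]:
        days = (datetime.now() - order.order_date).days
        if days < self.CANCELLATION_WINDOW_DAYS:
            return True, "Order eligible for cancellation"
        return False, "Order older than 10 days, not eligible"

    def get_order_status_message(self, order: Order) -> str:
        return (f"Your order {order.order_id} is {order.status} "
                f"as of {order.order_date:%Y-%m-%d}")

    def validate_customer_ownership(self, order: Order, customer_id: str) -> bool:
        return order.customer_id == customer_id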

2.3 Tool Definitions

bot/tools.py

Purpose: Define tools/functions for LLM to call

  • Tool Definitions (dictionaries):

    1. track_order_tool

      Tool to track an order

      Parameters: order_id (required)

      Returns: tracking information or error message

    2. cancel_order_tool

      Tool to cancel an order

      Parameters: order_id, customer_id, reason (optional)

      Returns: cancellation result or error message

    3. lookup_customer_orders_tool

      Tool to find all orders for a customer

      Parameters: customer_id

      Returns: list of order summaries

  • tool_executor() (function)

    Execute tool calls from LLM

    Parameters: tool_name, arguments

    Logic:

    Switch on tool_name

    Case "track_order": call API /orders/{id}/tracking

    Case "cancel_order": call API /orders/{id}/cancel with validation

    Case "lookup_customer_orders": call API /orders?customer_id=X

    Return: dict with result or error message
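A sketch of one tool definition plus the executor. The requests library and the localhost base URL are assumptions; the schema shape matches the tool calling example at the end of this document:

from datetime import datetime

import requests

API_BASE_URL = "http://localhost:8000"  # assumed mock API address

cancel_order_tool = {
    "type": "function",
    "function": {
        "name": "cancel_order",
        "description": "Cancel a customer's order, subject to the 10-day policy",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order to cancel"},
                "customer_id": {"type": "string", "description": "The requesting customer"},
                "reason": {"type": "string", "description": "Why the customer is cancelling"},
            },
            "required": ["order_id", "customer_id"],
        },
    },
}


def tool_executor(tool_name: str, arguments: dict) -> dict:
    try:
        if tool_name == "track_order":
            r = requests.get(f"{API_BASE_URL}/orders/{arguments['order_id']}/tracking")
        elif tool_name == "cancel_order":
            # CancellationRequest also needs a timestamp, added here since
            # the LLM does not supply one
            payload = {**arguments, "timestamp": datetime.now().isoformat()}
            r = requests.post(f"{API_BASE_URL}/orders/{arguments['order_id']}/cancel",
                              json=payload)
        elif tool_name == "lookup_customer_orders":
            r = requests.get(f"{API_BASE_URL}/customers/{arguments['customer_id']}")
        else:
            return {"error": f"Unknown tool: {tool_name}"}
        return r.json() if r.ok else {"error": r.json().get("detail", r.text)}
    except requests.RequestException as exc:
        return {"error": str(exc)}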

2.4 Main Bot Logic

bot/bot.py

Purpose: Orchestrate bot interactions with the LLM; a sketch of the tool-call loop follows the method list

  • CustomerServiceBot (class)
    • Attributes:
      • llm_client: LLMClient
      • api_base_url: str
      • policy_checker: PolicyChecker
      • conversation_history: List[dict]
    • Methods:
      • __init__(api_key: str, model: str, api_base_url: str)

        - Initialize bot with LLM client and policies

        - Create system prompt explaining:

        - Bot's role: customer service

        - Available tools: track_order, cancel_order, lookup_orders

        - Policies: 10-day cancellation window

        - Instructions: be polite, check policies before canceling

      • process_message(user_message: str, customer_id: str = None) -> str

        Main method to handle user queries

        Logic:

        1. Append user message to conversation history

        2. Call LLM with messages and tools

        3. Inspect the LLM response, which is composed of:

        - Tool calls (execute them)

        - Text response (return to user)

        4. If tool calls exist:

        - Execute each tool call

        - Append tool results to conversation

        - Call LLM again with results for the final response

        5. Return final message to user, then repeat from step 1 unless the user has closed the conversation

        6. Track metrics: tokens used, tools called

      • get_conversation_history() -> List[dict]

        Return conversation context for evaluation

      • reset_conversation()

        Clear history for new session

    • Helper Methods:
      • execute_tool_calls(tool_calls: List[dict], customer_id: str) -> List[dict]

      Run tool_executor for each call. Tool calls arrive as structured objects in the API response, so no parsing of model output is needed; see the tool calling example at the end of this document.

      • format_context_for_llm(tool_results: List[dict]) -> str

        Convert tool results to natural language for LLM
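A sketch of the core of process_message, wiring the pieces above together (one round of tool calls for brevity; self.tools and the self.total_tokens counter are assumed to be set up in __init__):

import json

# Inside CustomerServiceBot:
def process_message(self, user_message: str, customer_id: str = None) -> str:
    self.conversation_history.append({"role": "user", "content": user_message})
    result = self.llm_client.chat(self.conversation_history, tools=self.tools)
    self.total_tokens += result["tokens_used"]  # metric tracking

    if result["tool_calls"]:
        # Record the assistant's tool-call message, then each tool result
        self.conversation_history.append(result["message"])
        for tc in result["tool_calls"]:
            tool_result = tool_executor(tc.function.name,
                                        json.loads(tc.function.arguments))
            self.conversation_history.append({
                "role": "tool", "tool_call_id": tc.id,
                "content": json.dumps(tool_result)})
        # Second LLM call turns the tool results into a user-facing reply
        result = self.llm_client.chat(self.conversation_history, tools=self.tools)
        self.total_tokens += result["tokens_used"]

    self.conversation_history.append({"role": "assistant", "content": result["content"]})
    return result["content"]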

2.5 Bot Tests

tests/test_bot.py

Note: user messages and LLM API responses are mocked in these tests.

Test Cases:

  • test_bot_tracks_order_successfully()

    Send "Where is my order ABC123?" message

    Verify: bot calls track_order tool, returns tracking info

  • test_bot_cancels_recent_order()

    Send "Cancel my order ABC123" for order < 10 days old

    Verify: bot enforces policy, executes cancellation

  • test_bot_rejects_old_order_cancellation()

    Send "Cancel my order XYZ789" for order > 10 days old

    Verify: bot applies policy, refuses with explanation

  • test_bot_handles_invalid_order()

    Send cancellation request for non-existent order

    Verify: bot catches error and returns helpful message

  • test_bot_handles_conversation_history()

    Send multiple messages in sequence

    Verify: bot maintains context across turns
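One of these sketched with the LLM call mocked out, so no network traffic occurs; the mocked return shape matches the chat() wrapper sketched in 2.1:

from unittest.mock import patch

from bot.bot import CustomerServiceBot


def test_bot_rejects_old_order_cancellation():
    bot = CustomerServiceBot(api_key="test", model="test-model",
                             api_base_url="http://localhost:8000")
    refusal = {"message": None, "tool_calls": [], "tokens_used": 42,
               "content": "Sorry, order XYZ789 is older than 10 days "
                          "and cannot be cancelled."}
    with patch.object(bot.llm_client, "chat", return_value=refusal):
        reply = bot.process_message("Cancel my order XYZ789", customer_id="CUST1")
    assert "cannot be cancelled" in reply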


Phase 3: Evaluation Framework

3.1 Test Scenarios

eval/scenarios.py

Purpose: Define test cases for evaluation

  • Scenario (dataclass)
    • Attributes:
      • scenario_id: str
      • description: str
      • customer_id: str
      • messages: List[str] # Conversation turns
      • expected_behavior: str # What bot should do
      • expected_policy_enforcement: bool # Should policy be enforced?
      • category: str # "cancellation", "tracking", "policy_violation"
  • get_test_scenarios() -> List[Scenario]

    Define 10-15 scenarios covering:

    - Successful order tracking (various statuses)

    - Successful cancellations (within policy)

    - Failed cancellations (outside policy)

    - Invalid order IDs

    - Multiple conversation turns

    - Edge cases (already cancelled, no orders, etc.)
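The dataclass and one example scenario, taken directly from the attribute list above:

from dataclasses import dataclass
from typing import List


@dataclass
class Scenario:
    scenario_id: str
    description: str
    customer_id: str
    messages: List[str]
    expected_behavior: str
    expected_policy_enforcement: bool
    category: str  # "cancellation", "tracking", "policy_violation"


def get_test_scenarios() -> List[Scenario]:
    return [
        Scenario(
            scenario_id="S01",
            description="Cancellation request outside the 10-day window",
            customer_id="CUST1",
            messages=["Cancel my order XYZ789"],
            expected_behavior="Refuse and explain the 10-day policy",
            expected_policy_enforcement=True,
            category="policy_violation",
        ),
        # ... 10-15 scenarios covering the categories above
    ]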

3.2 Metrics

eval/metrics.py

Purpose: Calculate performance metrics

  • Metrics (class)
    • Attributes:
      • total_tokens: int
      • total_conversations: int
      • tool_calls_count: int
      • policy_enforcement_rate: float
      • correct_action_rate: float
      • average_tokens_per_conversation: float
      • tokens_per_user_message: float
    • Methods:
      • add_conversation(tokens: int, tools_used: int, policy_correct: bool)

        Record metrics for one conversation

      • calculate_averages() -> dict

        Compute all metrics

      • get_report() -> str

        Format metrics as readable report
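A sketch of the accounting; policy_enforcement_rate is the fraction of conversations where the policy decision was correct:

class Metrics:
    def __init__(self):
        self.total_tokens = 0
        self.total_conversations = 0
        self.tool_calls_count = 0
        self.policy_correct_count = 0

    def add_conversation(self, tokens: int, tools_used: int, policy_correct: bool):
        self.total_tokens += tokens
        self.total_conversations += 1
        self.tool_calls_count += tools_used
        self.policy_correct_count += int(policy_correct)

    def calculate_averages(self) -> dict:
        n = max(self.total_conversations, 1)  # guard against division by zero
        return {
            "average_tokens_per_conversation": self.total_tokens / n,
            "policy_enforcement_rate": self.policy_correct_count / n,
            "tool_calls_count": self.tool_calls_count,
        }

    def get_report(self) -> str:
        return "\n".join(f"{k}: {v}" for k, v in self.calculate_averages().items())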

3.3 Evaluator

eval/evaluator.py

Purpose: Run evaluation using AISI inspect framework

  • Evaluator (class)
    • Attributes:
      • bot: CustomerServiceBot
      • scenarios: List[Scenario]
      • metrics: Metrics
    • Methods:
      • __init__(bot: CustomerServiceBot)

        Initialize with bot instance

      • evaluate_scenario(scenario: Scenario) -> dict

        Run one scenario through the bot

        Logic:

        1. Reset bot conversation

        2. For each message in scenario:

        - Call bot.process_message()

        - Track: tokens used, tools called, response quality

        3. Record metrics

        4. Check if policy was enforced correctly

        5. Return evaluation result

      • evaluate_all_scenarios() -> List[dict]

        Run all scenarios

        Return: List of evaluation results per scenario

      • generate_summary() -> dict

        Aggregate results across all scenarios

        Calculate:

        - Total tokens used

        - Average tokens per conversation

        - Policy enforcement accuracy

        - Tool usage statistics

        - Error rate
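A sketch of evaluate_scenario tying the bot and metrics together; check_policy_outcome and the bot's tools_called counter are hypothetical helpers, not defined elsewhere in this plan:

def evaluate_scenario(self, scenario: Scenario) -> dict:
    self.bot.reset_conversation()
    for message in scenario.messages:
        reply = self.bot.process_message(message, customer_id=scenario.customer_id)
    # Compare the final reply against the scenario's expected policy outcome
    policy_correct = self.check_policy_outcome(reply, scenario)
    self.metrics.add_conversation(tokens=self.bot.total_tokens,
                                  tools_used=self.bot.tools_called,
                                  policy_correct=policy_correct)
    return {"scenario_id": scenario.scenario_id,
            "policy_correct": policy_correct,
            "tokens": self.bot.total_tokens}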

3.4 Evaluation with AISI Inspect

Integration Strategy:

  • Use AISI inspect framework to:
    1. Track decision paths: Record when bot decides to call tools vs. respond directly
    2. Monitor policy adherence: Log policy checks and outcomes
    3. Measure token efficiency: Record tokens per interaction
    4. Analyze tool usage: Count and categorize tool calls
  • Custom Inspectors:
    • PolicyInspector: Check if bot correctly applies 10-day rule
    • ToolUsageInspector: Track which tools are used and when
    • TokenInspector: Monitor token consumption patterns
    • ErrorInspector: Detect and classify errors
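A minimal sketch of exposing scenarios to Inspect as a task, assuming the inspect-ai package. The includes() scorer is a placeholder substring check; the custom inspectors above would be implemented as custom scorers:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate


@task
def policy_enforcement_eval():
    # One Sample per scenario; target is the phrase the bot should produce
    dataset = [
        Sample(input="Cancel my order XYZ789",
               target="not eligible",
               metadata={"category": "policy_violation"}),
        # ... one Sample per scenario from eval/scenarios.py
    ]
    return Task(dataset=dataset, solver=generate(), scorer=includes())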

Phase 4: Report Generation

4.1 Report Script

scripts/generate_report.py

Purpose: Generate comprehensive evaluation report by querying Inspect logs

  • generate_report(evaluation_results: List[dict], output_path: str) -> None

    Logic:

    1. Load evaluation results from evaluator

    2. Calculate aggregate metrics:

    - Total conversations: N

    - Tokens per conversation: average

    - Policy enforcement rate: percentage correct

    - Tool usage: breakdown by tool type

    - Response quality: qualitative assessment

    3. Create sections:

    - Executive Summary

    - Methodology

    - Quantitative Results (tables, charts)

    - Qualitative Analysis (scenario-by-scenario)

    - Key Findings

    - Recommendations

    4. Format as Markdown or HTML

    5. Save to output_path

  • Visualization (optional):

    Generate charts using matplotlib/plotly:

    - Token usage over conversations

    - Tool call distribution

    - Policy enforcement success rate

    - Error types breakdown
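A sketch of the report writer's core, emitting Markdown; the result-dict keys match the evaluate_scenario sketch above:

from pathlib import Path
from typing import List


def generate_report(evaluation_results: List[dict], output_path: str) -> None:
    n = len(evaluation_results)
    total_tokens = sum(r.get("tokens", 0) for r in evaluation_results)
    enforced = sum(1 for r in evaluation_results if r.get("policy_correct"))
    lines = [
        "# Evaluation Report",
        "## Executive Summary",
        f"- Total conversations: {n}",
        f"- Tokens per conversation: {total_tokens / max(n, 1):.1f}",
        f"- Policy enforcement rate: {enforced / max(n, 1):.0%}",
        # ... Methodology, Quantitative Results, Qualitative Analysis, Findings
    ]
    Path(output_path).write_text("\n".join(lines))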


Implementation Order Summary

Phase 1: Mock API (Days 1-2)

  1. Create project structure
  2. Implement api/models.py with data structures
  3. Implement tests/mock_data.py with seed data
  4. Implement api/order_api.py with FastAPI endpoints
  5. Write tests/test_api.py with comprehensive tests
  6. Run tests to verify API works

Phase 2: Bot (Days 3-4)

  1. Implement bot/llm_client.py for OpenRouter integration
  2. Implement bot/policies.py for policy enforcement
  3. Implement bot/tools.py for tool definitions
  4. Implement bot/bot.py for main bot logic
  5. Write tests/test_bot.py for bot behavior
  6. Run tests to verify bot works

Phase 3: Evaluation (Day 5)

  1. Implement eval/scenarios.py with test cases
  2. Implement eval/metrics.py for metric calculations
  3. Implement eval/evaluator.py with AISI inspect integration
  4. Run evaluation on all scenarios
  5. Verify metrics are being collected correctly

Phase 4: Report (Day 6)

  1. Implement scripts/generate_report.py
  2. Generate report with metrics and insights
  3. Review and refine report
  4. Add any visualizations if needed

Success Criteria

  1. API: All endpoints work, tests pass (100% coverage on critical paths)
  2. Bot: Bot correctly interprets queries, calls appropriate tools, enforces policies
  3. Evaluation: Metrics show bot follows policies correctly (target: 90%+ accuracy)
  4. Report: Clear documentation of approach, results, and insights
  5. Documentation: Code is well-documented, README explains setup and usage

Key Design Decisions

  1. Policy Enforcement: Centralized in PolicyChecker class to ensure consistency
  2. Tool-Based Architecture: Use function calling to give LLM structured API access
  3. Conversation History: Maintain full context for multi-turn interactions
  4. Evaluation: Use AISI inspect to get detailed insights into bot behavior
  5. Metrics: Focus on policy adherence and token efficiency
  6. Mock Data: Generate realistic scenarios with varying ages and statuses

Next Steps

After this plan is approved:

  1. Set up project structure
  2. Install dependencies (fastapi, openai, pytest, inspect-ai)
  3. Get OpenRouter API key
  4. Begin Phase 1 implementation

Tool calling example

import json
import os

from openai import OpenAI

# Initialize the OpenRouter client
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY"),
    default_headers={
        "HTTP-Referer": "http://localhost:5000",
        "X-Title": "Tool Calling Example"
    }
)

# Define the tool/function
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and country, e.g. London, UK"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Initial API call with tools
response = client.chat.completions.create(
    model="openai/gpt-4o",  # Use a model that supports tool calling
    messages=[
        {"role": "user", "content": "What's the weather like in Paris?"}
    ],
    tools=tools
)

# Check if the model wants to call a tool
message = response.choices[0].message
if message.tool_calls:
    # The model requested a tool call
    tool_call = message.tool_calls[0]
    print(f"Model wants to call: {tool_call.function.name}")
    print(f"With arguments: {tool_call.function.arguments}")
    
    # Simulate the tool execution (in reality, you'd call your actual function)
    args = json.loads(tool_call.function.arguments)
    
    # Mock weather data
    weather_result = {
        "location": args["location"],
        "temperature": 18,
        "unit": args.get("unit", "celsius"),
        "conditions": "Partly cloudy"
    }
    
    # Send the tool response back to the model
    final_response = client.chat.completions.create(
        model="openai/gpt-4o",
        messages=[
            {"role": "user", "content": "What's the weather like in Paris?"},
            message,  # Include the assistant's tool call message
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(weather_result)
            }
        ],
        tools=tools
    )
    
    print(f"Final response: {final_response.choices[0].message.content}")