
Implementation Plan: Generative Customer Service Bot with Policy Enforcement

Overview

Build a fully generative chatbot, using the OpenRouter API, that handles order cancellation and tracking requests while enforcing company policies (e.g., orders placed less than 10 days ago are eligible for cancellation).

Technology Stack

  • LLM Integration: OpenRouter API
  • Programming Language: Python
  • Testing: pytest
  • Evaluation Framework: AISI's inspect framework
  • API Framework: FastAPI (for mock endpoints)

Phase 1: Mock API Implementation

1.1 Project Structure

customer-service-bot-mockup/
├── api/
│   ├── __init__.py
│   ├── models.py          # Data models for orders
│   ├── order_api.py       # FastAPI endpoints
├── bot/
│   ├── __init__.py
│   ├── bot.py             # Main bot logic
│   ├── tools.py           # Tool definitions for LLM
│   ├── policies.py        # Policy enforcement logic
│   └── llm_client.py      # OpenRouter API wrapper
├── tests/
│   ├── __init__.py
│   ├── test_api.py        # API endpoint tests
│   ├── test_bot.py        # Bot behavior tests
│   ├── test_policies.py   # Policy enforcement tests
│   └── mock_data.py       # Seed data for testing
├── eval/
│   ├── __init__.py
│   ├── scenarios.py       # Test scenarios/cases
│   ├── evaluator.py       # Evaluation logic using inspect
│   └── metrics.py         # Metric calculations
├── scripts/
│   └── generate_report.py # Report generation script
└── requirements.txt

1.2 Mock API Implementation Details

api/models.py

Purpose: Define data structures for orders and operations

  • Order Model (class)
    • Attributes:
      • order_id: str
      • customer_id: str
      • order_date: datetime
      • status: str (e.g., "pending", "shipped", "delivered", "cancelled")
      • items: List[dict]
      • total_amount: float
      • shipping_address: str
  • OrderStatus Model (class)
    • Attributes:
      • order_id: str
      • status: str
      • tracking_number: str (optional)
      • estimated_delivery: datetime (optional)
      • last_update: datetime
  • CancellationRequest Model (class)
    • Attributes:
      • order_id: str
      • customer_id: str
      • reason: str
      • timestamp: datetime
    • Methods:
      • validate() -> bool

        Check if request is valid (has required fields)

  • CancellationResponse Model (class)
    • Attributes:
      • success: bool
      • order_id: str
      • message: str
      • refund_amount: float (optional)
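A minimal sketch of these models, assuming Pydantic (which FastAPI already depends on); validate() is renamed validate_request() here to avoid shadowing Pydantic's own method:

from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel


class Order(BaseModel):
    order_id: str
    customer_id: str
    order_date: datetime
    status: str  # "pending", "shipped", "delivered", "cancelled"
    items: List[dict]
    total_amount: float
    shipping_address: str


class OrderStatus(BaseModel):
    order_id: str
    status: str
    tracking_number: Optional[str] = None
    estimated_delivery: Optional[datetime] = None
    last_update: datetime


class CancellationRequest(BaseModel):
    order_id: str
    customer_id: str
    reason: str
    timestamp: datetime

    def validate_request(self) -> bool:
        # Valid when all required fields are non-empty
        return all([self.order_id, self.customer_id, self.reason])


class CancellationResponse(BaseModel):
    success: bool
    order_id: str
    message: str
    refund_amount: Optional[float] = None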

tests/mock_data.py

Purpose: Seed in-memory database with sample orders

  • get_orders() -> List[Order]

    10-20 sample orders with varying dates and statuses (pre-generated)

    Mix of orders: some < 10 days old, some > 10 days old

    Include different statuses: pending, shipped, delivered, cancelled

  • get_order_by_id(order_id: str) -> Optional[Order]

    Query the mock data for specific order

  • search_orders(customer_id: str) -> List[Order]

    Find all orders for a customer
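A sketch of the seed helpers; order dates are generated relative to now so the under/over 10-day mix stays valid, and the order IDs mirror the ones used in the test sections below:

from datetime import datetime, timedelta
from typing import List, Optional

from api.models import Order


def get_orders() -> List[Order]:
    now = datetime.now()
    return [
        # Recent order: inside the 10-day cancellation window
        Order(order_id="ABC123", customer_id="CUST1",
              order_date=now - timedelta(days=3), status="pending",
              items=[{"sku": "WIDGET", "qty": 1}], total_amount=19.99,
              shipping_address="1 Main St"),
        # Old order: outside the 10-day cancellation window
        Order(order_id="XYZ789", customer_id="CUST1",
              order_date=now - timedelta(days=30), status="delivered",
              items=[{"sku": "GADGET", "qty": 2}], total_amount=49.98,
              shipping_address="1 Main St"),
        # ... 10-20 orders total, covering every status
    ]


def get_order_by_id(order_id: str) -> Optional[Order]:
    return next((o for o in get_orders() if o.order_id == order_id), None)


def search_orders(customer_id: str) -> List[Order]:
    return [o for o in get_orders() if o.customer_id == customer_id]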

api/order_api.py

Purpose: FastAPI endpoints for order operations; a sketch of the cancellation endpoint follows the endpoint list

  • ENDPOINTS:
    1. GET /health

      Return {"status": "ok"} to verify API is running

    2. GET /orders/{order_id}

      Retrieve order details by ID

      Return: Order object with details

      Error handling: 404 if not found

    3. GET /orders/{order_id}/tracking

      Get tracking information for an order

      Return: OrderStatus with tracking number and delivery estimate

      Error handling: 404 if order doesn't exist

    4. POST /orders/{order_id}/cancel

      Cancel an order

      Request body: CancellationRequest (reason, customer_id)

      Logic:

      1. Find order by ID

      2. Check policy: is order < 10 days old?

      3. Check status: can it be cancelled?

      4. Update order status to "cancelled"

      5. Return CancellationResponse

      Error handling:

      400: Policy violation (order too old)

      404: Order not found

      409: Order already cancelled

    5. GET /customers/{customer_id}

      Retrieve all orders for a customer

      Return: List of order IDs

      Error handling: 404 if customer not found
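A sketch of the cancellation endpoint, the one that carries the policy logic. It assumes an in-memory ORDERS dict built once from the seed data, and the PolicyChecker defined in Phase 2:

from fastapi import FastAPI, HTTPException

from api.models import CancellationRequest, CancellationResponse
from bot.policies import PolicyChecker
from tests.mock_data import get_orders

app = FastAPI()
ORDERS = {o.order_id: o for o in get_orders()}  # in-memory store, built once
policy = PolicyChecker()


@app.get("/health")
def health():
    return {"status": "ok"}


@app.post("/orders/{order_id}/cancel", response_model=CancellationResponse)
def cancel_order(order_id: str, request: CancellationRequest):
    order = ORDERS.get(order_id)
    if order is None:
        raise HTTPException(status_code=404, detail="Order not found")
    if order.status == "cancelled":
        raise HTTPException(status_code=409, detail="Order already cancelled")
    eligible, reason = policy.can_cancel_order(order)
    if not eligible:
        raise HTTPException(status_code=400, detail=reason)  # policy violation
    order.status = "cancelled"
    return CancellationResponse(success=True, order_id=order_id,
                                message=reason, refund_amount=order.total_amount)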

1.3 API Tests

tests/test_api.py

Test Cases:

  • test_health_endpoint()

    Verify /health returns 200 and {"status": "ok"}

  • test_get_order_success()

    Fetch valid order, verify response structure

  • test_get_order_not_found()

    Fetch non-existent order, verify 404

  • test_get_tracking_success()

    Fetch tracking for valid order

  • test_cancel_order_within_policy()

    Cancel order < 10 days old, verify success

  • test_cancel_order_outside_policy()

    Attempt to cancel order > 10 days old, verify 400 error

  • test_cancel_nonexistent_order()

    Attempt to cancel missing order, verify 404

  • test_cancel_already_cancelled_order()

    Attempt to cancel already-cancelled order, verify 409
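Two of these tests sketched with FastAPI's TestClient; the order ID and its 30-day age come from the assumed seed data above:

from fastapi.testclient import TestClient

from api.order_api import app

client = TestClient(app)


def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}


def test_cancel_order_outside_policy():
    # XYZ789 is seeded as 30 days old, beyond the 10-day window
    response = client.post("/orders/XYZ789/cancel", json={
        "order_id": "XYZ789",
        "customer_id": "CUST1",
        "reason": "changed my mind",
        "timestamp": "2024-01-01T00:00:00",
    })
    assert response.status_code == 400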


Phase 2: Bot Implementation

2.1 LLM Client

bot/llm_client.py

Purpose: Wrapper for OpenRouter API

  • LLMClient (class)
    • Attributes:
      • api_key: str
      • model: str
    • Methods:
      • __init__(api_key: str, model: str)

        Initialize with OpenRouter credentials

      • chat(messages: List[dict], tools: List[dict], temperature: float = 0.7) -> dict

        Send request to OpenRouter with messages and tool definitions

        Return API response with tokens used and content

      • format_function_calls(tool_calls: List[dict]) -> str

        Convert function call structure to readable format for logging
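A sketch of the wrapper. It reuses the OpenAI SDK pointed at OpenRouter, as in the tool calling example at the end of this document:

from typing import List, Optional

from openai import OpenAI


class LLMClient:
    def __init__(self, api_key: str, model: str):
        self.api_key = api_key
        self.model = model
        self.client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)

    def chat(self, messages: List[dict], tools: Optional[List[dict]] = None,
             temperature: float = 0.7) -> dict:
        kwargs = {"model": self.model, "messages": messages, "temperature": temperature}
        if tools:
            kwargs["tools"] = tools  # only send tools when some are defined
        response = self.client.chat.completions.create(**kwargs)
        message = response.choices[0].message
        return {
            "message": message,  # raw message, reusable in conversation history
            "tool_calls": message.tool_calls or [],
            "content": message.content,
            "tokens_used": response.usage.total_tokens if response.usage else 0,
        }

    def format_function_calls(self, tool_calls: list) -> str:
        # e.g. 'cancel_order({"order_id": "ABC123"})' for logging
        return "; ".join(f"{tc.function.name}({tc.function.arguments})"
                         for tc in tool_calls)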

2.2 Policy Enforcement

bot/policies.py

Purpose: Centralized policy logic for decision-making

  • PolicyChecker (class)
    • Methods:
      • can_cancel_order(order: Order) -> tuple[bool, str]

        Check if order can be cancelled

        Logic:

        1. Calculate days since order_date

        2. If days < 10: return (True, "Order eligible for cancellation")

        3. If days >= 10: return (False, "Order older than 10 days, not eligible")

        Returns: (success_boolean, reason_message)

      • get_order_status_message(order: Order) -> str

        Generate human-readable status message

        Format: "Your order {id} is {status} as of {date}"

      • validate_customer_ownership(order: Order, customer_id: str) -> bool

        Verify customer owns the order

        Returns: True if customer_id matches, False otherwise
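The checker translates almost directly into code. One assumption: the "as of" date in the status message uses order_date, since Order carries no last-update field:

from datetime import datetime

from api.models import Order


class PolicyChecker:
    CANCELLATION_WINDOW_DAYS = 10

    def can_cancel_order(self, order: Order) -> tuple[bool, str]:
        days = (datetime.now() - order.order_date).days
        if days < self.CANCELLATION_WINDOW_DAYS:
            return True, "Order eligible for cancellation"
        return False, "Order older than 10 days, not eligible"

    def get_order_status_message(self, order: Order) -> str:
        return (f"Your order {order.order_id} is {order.status} "
                f"as of {order.order_date:%Y-%m-%d}")

    def validate_customer_ownership(self, order: Order, customer_id: str) -> bool:
        return order.customer_id == customer_id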

2.3 Tool Definitions

bot/tools.py

Purpose: Define tools/functions for LLM to call

  • Tool Definitions (dictionaries):

    1. track_order_tool

      Tool to track an order

      Parameters: order_id (required)

      Returns: tracking information or error message

    2. cancel_order_tool

      Tool to cancel an order

      Parameters: order_id, customer_id, reason (optional)

      Returns: cancellation result or error message

    3. lookup_customer_orders_tool

      Tool to find all orders for a customer

      Parameters: customer_id

      Returns: list of order summaries

  • tool_executor() (function)

    Execute tool calls from LLM

    Parameters: tool_name, arguments

    Logic:

    Switch on tool_name

    Case "track_order": call API /orders/{id}/tracking

    Case "cancel_order": call API /orders/{id}/cancel with validation

    Case "lookup_customer_orders": call API /orders?customer_id=X

    Return: dict with result or error message
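A sketch of one tool definition plus the executor. The requests library and the localhost base URL are assumptions; the schema shape matches the tool calling example at the end of this document:

from datetime import datetime

import requests

API_BASE_URL = "http://localhost:8000"  # assumed mock API address

cancel_order_tool = {
    "type": "function",
    "function": {
        "name": "cancel_order",
        "description": "Cancel a customer's order, subject to the 10-day policy",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order to cancel"},
                "customer_id": {"type": "string", "description": "The requesting customer"},
                "reason": {"type": "string", "description": "Why the customer is cancelling"},
            },
            "required": ["order_id", "customer_id"],
        },
    },
}


def tool_executor(tool_name: str, arguments: dict) -> dict:
    try:
        if tool_name == "track_order":
            r = requests.get(f"{API_BASE_URL}/orders/{arguments['order_id']}/tracking")
        elif tool_name == "cancel_order":
            # CancellationRequest also needs a timestamp, added here since
            # the LLM does not supply one
            payload = {**arguments, "timestamp": datetime.now().isoformat()}
            r = requests.post(f"{API_BASE_URL}/orders/{arguments['order_id']}/cancel",
                              json=payload)
        elif tool_name == "lookup_customer_orders":
            r = requests.get(f"{API_BASE_URL}/customers/{arguments['customer_id']}")
        else:
            return {"error": f"Unknown tool: {tool_name}"}
        return r.json() if r.ok else {"error": r.json().get("detail", r.text)}
    except requests.RequestException as exc:
        return {"error": str(exc)}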

2.4 Main Bot Logic

bot/bot.py

Purpose: Orchestrate bot interactions with the LLM; a sketch of the tool-call loop follows the method list

  • CustomerServiceBot (class)
    • Attributes:
      • llm_client: LLMClient
      • api_base_url: str
      • policy_checker: PolicyChecker
      • conversation_history: List[dict]
    • Methods:
      • __init__(api_key: str, model: str, api_base_url: str)

        - Initialize bot with LLM client and policies

        - Create system prompt explaining:

        - Bot's role: customer service

        - Available tools: track_order, cancel_order, lookup_orders

        - Policies: 10-day cancellation window

        - Instructions: be polite, check policies before canceling

      • process_message(user_message: str, customer_id: str = None) -> str

        Main method to handle user queries

        Logic:

        1. Append user message to conversation history

        2. Call LLM with messages and tools

        3. Inspect the LLM response, which is composed of:

        - Tool calls (execute them)

        - Text response (return to user)

        4. If tool calls exist:

        - Execute each tool call

        - Append tool results to conversation

        - Call LLM again with results for the final response

        5. Return final message to user, then repeat from step 1 unless the user has closed the conversation

        6. Track metrics: tokens used, tools called

      • get_conversation_history() -> List[dict]

        Return conversation context for evaluation

      • reset_conversation()

        Clear history for new session

    • Helper Methods:
      • execute_tool_calls(tool_calls: List[dict], customer_id: str) -> List[dict]

      Run tool_executor for each call. Tool calls arrive as structured objects in the API response, so no parsing of model output is needed; see the tool calling example at the end of this document.

      • format_context_for_llm(tool_results: List[dict]) -> str

        Convert tool results to natural language for LLM
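A sketch of the core of process_message, wiring the pieces above together (one round of tool calls for brevity; self.tools and the self.total_tokens counter are assumed to be set up in __init__):

import json

# Inside CustomerServiceBot:
def process_message(self, user_message: str, customer_id: str = None) -> str:
    self.conversation_history.append({"role": "user", "content": user_message})
    result = self.llm_client.chat(self.conversation_history, tools=self.tools)
    self.total_tokens += result["tokens_used"]  # metric tracking

    if result["tool_calls"]:
        # Record the assistant's tool-call message, then each tool result
        self.conversation_history.append(result["message"])
        for tc in result["tool_calls"]:
            tool_result = tool_executor(tc.function.name,
                                        json.loads(tc.function.arguments))
            self.conversation_history.append({
                "role": "tool", "tool_call_id": tc.id,
                "content": json.dumps(tool_result)})
        # Second LLM call turns the tool results into a user-facing reply
        result = self.llm_client.chat(self.conversation_history, tools=self.tools)
        self.total_tokens += result["tokens_used"]

    self.conversation_history.append({"role": "assistant", "content": result["content"]})
    return result["content"]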

2.5 Bot Tests

tests/test_bot.py

Note: user messages and LLM API responses are mocked in these tests.

Test Cases:

  • test_bot_tracks_order_successfully()

    Send "Where is my order ABC123?" message

    Verify: bot calls track_order tool, returns tracking info

  • test_bot_cancels_recent_order()

    Send "Cancel my order ABC123" for order < 10 days old

    Verify: bot enforces policy, executes cancellation

  • test_bot_rejects_old_order_cancellation()

    Send "Cancel my order XYZ789" for order > 10 days old

    Verify: bot applies policy, refuses with explanation

  • test_bot_handles_invalid_order()

    Send cancellation request for non-existent order

    Verify: bot catches error and returns helpful message

  • test_bot_handles_conversation_history()

    Send multiple messages in sequence

    Verify: bot maintains context across turns
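One of these sketched with the LLM call mocked out, so no network traffic occurs; the mocked return shape matches the chat() wrapper sketched in 2.1:

from unittest.mock import patch

from bot.bot import CustomerServiceBot


def test_bot_rejects_old_order_cancellation():
    bot = CustomerServiceBot(api_key="test", model="test-model",
                             api_base_url="http://localhost:8000")
    refusal = {"message": None, "tool_calls": [], "tokens_used": 42,
               "content": "Sorry, order XYZ789 is older than 10 days "
                          "and cannot be cancelled."}
    with patch.object(bot.llm_client, "chat", return_value=refusal):
        reply = bot.process_message("Cancel my order XYZ789", customer_id="CUST1")
    assert "cannot be cancelled" in reply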


Phase 3: Evaluation Framework

3.1 Test Scenarios

eval/scenarios.py

Purpose: Define test cases for evaluation

  • Scenario (dataclass)
    • Attributes:
      • scenario_id: str
      • description: str
      • customer_id: str
      • messages: List[str] # Conversation turns
      • expected_behavior: str # What bot should do
      • expected_policy_enforcement: bool # Should policy be enforced?
      • category: str # "cancellation", "tracking", "policy_violation"
  • get_test_scenarios() -> List[Scenario]

    Define 10-15 scenarios covering:

    - Successful order tracking (various statuses)

    - Successful cancellations (within policy)

    - Failed cancellations (outside policy)

    - Invalid order IDs

    - Multiple conversation turns

    - Edge cases (already cancelled, no orders, etc.)
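The dataclass and one example scenario, taken directly from the attribute list above:

from dataclasses import dataclass
from typing import List


@dataclass
class Scenario:
    scenario_id: str
    description: str
    customer_id: str
    messages: List[str]
    expected_behavior: str
    expected_policy_enforcement: bool
    category: str  # "cancellation", "tracking", "policy_violation"


def get_test_scenarios() -> List[Scenario]:
    return [
        Scenario(
            scenario_id="S01",
            description="Cancellation request outside the 10-day window",
            customer_id="CUST1",
            messages=["Cancel my order XYZ789"],
            expected_behavior="Refuse and explain the 10-day policy",
            expected_policy_enforcement=True,
            category="policy_violation",
        ),
        # ... 10-15 scenarios covering the categories above
    ]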

3.2 Metrics

eval/metrics.py

Purpose: Calculate performance metrics

  • Metrics (class)
    • Attributes:
      • total_tokens: int
      • total_conversations: int
      • tool_calls_count: int
      • policy_enforcement_rate: float
      • correct_action_rate: float
      • average_tokens_per_conversation: float
      • tokens_per_user_message: float
    • Methods:
      • add_conversation(tokens: int, tools_used: int, policy_correct: bool)

        Record metrics for one conversation

      • calculate_averages() -> dict

        Compute all metrics

      • get_report() -> str

        Format metrics as readable report
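A sketch of the accounting; policy_enforcement_rate is the fraction of conversations where the policy decision was correct:

class Metrics:
    def __init__(self):
        self.total_tokens = 0
        self.total_conversations = 0
        self.tool_calls_count = 0
        self.policy_correct_count = 0

    def add_conversation(self, tokens: int, tools_used: int, policy_correct: bool):
        self.total_tokens += tokens
        self.total_conversations += 1
        self.tool_calls_count += tools_used
        self.policy_correct_count += int(policy_correct)

    def calculate_averages(self) -> dict:
        n = max(self.total_conversations, 1)  # guard against division by zero
        return {
            "average_tokens_per_conversation": self.total_tokens / n,
            "policy_enforcement_rate": self.policy_correct_count / n,
            "tool_calls_count": self.tool_calls_count,
        }

    def get_report(self) -> str:
        return "\n".join(f"{k}: {v}" for k, v in self.calculate_averages().items())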

3.3 Evaluator

eval/evaluator.py

Purpose: Run evaluation using AISI inspect framework

  • Evaluator (class)
    • Attributes:
      • bot: CustomerServiceBot
      • scenarios: List[Scenario]
      • metrics: Metrics
    • Methods:
      • __init__(bot: CustomerServiceBot)

        Initialize with bot instance

      • evaluate_scenario(scenario: Scenario) -> dict

        Run one scenario through the bot

        Logic:

        1. Reset bot conversation

        2. For each message in scenario:

        - Call bot.process_message()

        - Track: tokens used, tools called, response quality

        3. Record metrics

        4. Check if policy was enforced correctly

        5. Return evaluation result

      • evaluate_all_scenarios() -> List[dict]

        Run all scenarios

        Return: List of evaluation results per scenario

      • generate_summary() -> dict

        Aggregate results across all scenarios

        Calculate:

        - Total tokens used

        - Average tokens per conversation

        - Policy enforcement accuracy

        - Tool usage statistics

        - Error rate
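A sketch of evaluate_scenario tying the bot and metrics together; check_policy_outcome and the bot's tools_called counter are hypothetical helpers, not defined elsewhere in this plan:

def evaluate_scenario(self, scenario: Scenario) -> dict:
    self.bot.reset_conversation()
    for message in scenario.messages:
        reply = self.bot.process_message(message, customer_id=scenario.customer_id)
    # Compare the final reply against the scenario's expected policy outcome
    policy_correct = self.check_policy_outcome(reply, scenario)
    self.metrics.add_conversation(tokens=self.bot.total_tokens,
                                  tools_used=self.bot.tools_called,
                                  policy_correct=policy_correct)
    return {"scenario_id": scenario.scenario_id,
            "policy_correct": policy_correct,
            "tokens": self.bot.total_tokens}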

3.4 Evaluation with AISI Inspect

Integration Strategy:

  • Use AISI inspect framework to:
    1. Track decision paths: Record when bot decides to call tools vs. respond directly
    2. Monitor policy adherence: Log policy checks and outcomes
    3. Measure token efficiency: Record tokens per interaction
    4. Analyze tool usage: Count and categorize tool calls
  • Custom Inspectors:
    • PolicyInspector: Check if bot correctly applies 10-day rule
    • ToolUsageInspector: Track which tools are used and when
    • TokenInspector: Monitor token consumption patterns
    • ErrorInspector: Detect and classify errors
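A minimal sketch of exposing scenarios to Inspect as a task, assuming the inspect-ai package. The includes() scorer is a placeholder substring check; the custom inspectors above would be implemented as custom scorers:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate


@task
def policy_enforcement_eval():
    # One Sample per scenario; target is the phrase the bot should produce
    dataset = [
        Sample(input="Cancel my order XYZ789",
               target="not eligible",
               metadata={"category": "policy_violation"}),
        # ... one Sample per scenario from eval/scenarios.py
    ]
    return Task(dataset=dataset, solver=generate(), scorer=includes())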

Phase 4: Report Generation

4.1 Report Script

scripts/generate_report.py

Purpose: Generate comprehensive evaluation report by querying Inspect logs

  • generate_report(evaluation_results: List[dict], output_path: str) -> None

    Logic:

    1. Load evaluation results from evaluator

    2. Calculate aggregate metrics:

    - Total conversations: N

    - Tokens per conversation: average

    - Policy enforcement rate: percentage correct

    - Tool usage: breakdown by tool type

    - Response quality: qualitative assessment

    3. Create sections:

    - Executive Summary

    - Methodology

    - Quantitative Results (tables, charts)

    - Qualitative Analysis (scenario-by-scenario)

    - Key Findings

    - Recommendations

    4. Format as Markdown or HTML

    5. Save to output_path

  • Visualization (optional):

    Generate charts using matplotlib/plotly:

    - Token usage over conversations

    - Tool call distribution

    - Policy enforcement success rate

    - Error types breakdown
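A sketch of the report writer's core, emitting Markdown; the result-dict keys match the evaluate_scenario sketch above:

from pathlib import Path
from typing import List


def generate_report(evaluation_results: List[dict], output_path: str) -> None:
    n = len(evaluation_results)
    total_tokens = sum(r.get("tokens", 0) for r in evaluation_results)
    enforced = sum(1 for r in evaluation_results if r.get("policy_correct"))
    lines = [
        "# Evaluation Report",
        "## Executive Summary",
        f"- Total conversations: {n}",
        f"- Tokens per conversation: {total_tokens / max(n, 1):.1f}",
        f"- Policy enforcement rate: {enforced / max(n, 1):.0%}",
        # ... Methodology, Quantitative Results, Qualitative Analysis, Findings
    ]
    Path(output_path).write_text("\n".join(lines))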


Implementation Order Summary

Phase 1: Mock API (Days 1-2)

  1. Create project structure
  2. Implement api/models.py with data structures
  3. Implement tests/mock_data.py with seed data
  4. Implement api/order_api.py with FastAPI endpoints
  5. Write tests/test_api.py with comprehensive tests
  6. Run tests to verify API works

Phase 2: Bot (Days 3-4)

  1. Implement bot/llm_client.py for OpenRouter integration
  2. Implement bot/policies.py for policy enforcement
  3. Implement bot/tools.py for tool definitions
  4. Implement bot/bot.py for main bot logic
  5. Write tests/test_bot.py for bot behavior
  6. Run tests to verify bot works

Phase 3: Evaluation (Day 5)

  1. Implement eval/scenarios.py with test cases
  2. Implement eval/metrics.py for metric calculations
  3. Implement eval/evaluator.py with AISI inspect integration
  4. Run evaluation on all scenarios
  5. Verify metrics are being collected correctly

Phase 4: Report (Day 6)

  1. Implement scripts/generate_report.py
  2. Generate report with metrics and insights
  3. Review and refine report
  4. Add any visualizations if needed

Success Criteria

  1. API: All endpoints work, tests pass (100% coverage on critical paths)
  2. Bot: Bot correctly interprets queries, calls appropriate tools, enforces policies
  3. Evaluation: Metrics show bot follows policies correctly (target: 90%+ accuracy)
  4. Report: Clear documentation of approach, results, and insights
  5. Documentation: Code is well-documented, README explains setup and usage

Key Design Decisions

  1. Policy Enforcement: Centralized in PolicyChecker class to ensure consistency
  2. Tool-Based Architecture: Use function calling to give LLM structured API access
  3. Conversation History: Maintain full context for multi-turn interactions
  4. Evaluation: Use AISI inspect to get detailed insights into bot behavior
  5. Metrics: Focus on policy adherence and token efficiency
  6. Mock Data: Generate realistic scenarios with varying ages and statuses

Next Steps

After this plan is approved:

  1. Set up project structure
  2. Install dependencies (fastapi, openai, pytest, inspect-ai)
  3. Get OpenRouter API key
  4. Begin Phase 1 implementation

Tool calling example

import json
import os

from openai import OpenAI

# Initialize the OpenRouter client
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY"),
    default_headers={
        "HTTP-Referer": "http://localhost:5000",
        "X-Title": "Tool Calling Example"
    }
)

# Define the tool/function
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and country, e.g. London, UK"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Initial API call with tools
response = client.chat.completions.create(
    model="openai/gpt-4o",  # Use a model that supports tool calling
    messages=[
        {"role": "user", "content": "What's the weather like in Paris?"}
    ],
    tools=tools
)

# Check if the model wants to call a tool
message = response.choices[0].message
if message.tool_calls:
    # The model requested a tool call
    tool_call = message.tool_calls[0]
    print(f"Model wants to call: {tool_call.function.name}")
    print(f"With arguments: {tool_call.function.arguments}")
    
    # Simulate the tool execution (in reality, you'd call your actual function)
    args = json.loads(tool_call.function.arguments)
    
    # Mock weather data
    weather_result = {
        "location": args["location"],
        "temperature": 18,
        "unit": args.get("unit", "celsius"),
        "conditions": "Partly cloudy"
    }
    
    # Send the tool response back to the model
    final_response = client.chat.completions.create(
        model="openai/gpt-4o",
        messages=[
            {"role": "user", "content": "What's the weather like in Paris?"},
            message,  # Include the assistant's tool call message
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(weather_result)
            }
        ],
        tools=tools
    )
    
    print(f"Final response: {final_response.choices[0].message.content}")