Agent reliability layer: circuit breakers, loop detection, and automatic model escalation #29

@weklund-agent

Summary

Implement an agent reliability layer directly in the serving infrastructure that detects common failure modes (infinite loops, malformed tool calls, runaway token usage) and automatically recovers — either by retrying with corrective prompts, escalating to a more capable model tier, or falling back to cloud. This makes any local model dramatically more reliable for autonomous agent workloads without requiring changes to the agent framework.

Problem

The #1 reason people give up on local models for autonomous agents is unreliable tool calling and instruction following. This is well-documented across communities:

  • CrewAI forums: "agents frequently attempted tool calls using placeholder data," "agents produce garbage output or incorrect JSON formatting"
  • Local model users consistently report: agent generates malformed JSON → retries → malformed again → infinite loop burning tokens and compute all night
  • Research shows 73% of enterprise agent deployments hit reliability failures in their first year. On local hardware, failure modes are worse because smaller models have weaker instruction following

Nobody is solving this at the serving layer. Every agent framework expects the model API to "just work" and handles errors poorly. The agent framework generates a prompt, sends it to the API, and trusts the response. When the response is garbage, the framework either crashes, retries blindly, or loops forever.

Failure Modes in the Wild

  1. Infinite tool-call loops: Model generates {"function": "search", "arguments": {"query": "..."}} → gets result → generates the exact same tool call → gets the same result → loops indefinitely. The agent framework doesn't detect this because each response is technically valid.

  2. Malformed function calls: Model generates almost-valid JSON but with wrong field names, missing required fields, or extra trailing text. The agent framework's parser fails, it retries, the model generates the same malformed output.

  3. Hallucinated tool names: Model calls execute_code when the available tools are run_terminal_command and write_file. Agent framework returns "unknown tool" error, model tries again with the same hallucinated name.

  4. Runaway token consumption: Agent enters a reasoning loop, generating thousands of tokens of "thinking" that goes nowhere. On a cloud API this costs money; on local hardware it monopolizes the model server for hours.

  5. Quality degradation under load: When memory pressure causes swap, model quality drops silently. The model starts generating incoherent responses, but the health check still passes because the server is technically running.

Proposed Solution

Architecture

The reliability layer sits between LiteLLM and the client (agent framework), implemented as a LiteLLM middleware or a thin proxy layer:

Agent Framework
       ↓
┌─────────────────────────┐
│   Reliability Layer     │
│  ┌───────────────────┐  │
│  │ Loop Detector     │  │
│  │ Tool Validator    │  │
│  │ Token Budget      │  │
│  │ Escalation Engine │  │
│  └───────────────────┘  │
└─────────────────────────┘
       ↓
   LiteLLM Proxy (:4000)
       ↓
   Model Tiers

Component 1: Loop Detection

Mechanism: Maintain a rolling window of the last N responses per session (identified by a session header or API key). Compute a similarity hash of each response (normalized, whitespace-stripped). If three or more responses in the window are identical or near-identical, trip the circuit breaker.

Implementation:

import hashlib
from collections import defaultdict, deque
from enum import Enum

LoopStatus = Enum("LoopStatus", "NORMAL WARNING CIRCUIT_BREAK")

class LoopDetector:
    def __init__(self, window_size=10, threshold=3, similarity=0.95):
        self.threshold = threshold
        self.similarity = similarity  # fuzzy near-duplicate matching omitted in this sketch
        self.sessions = defaultdict(lambda: deque(maxlen=window_size))  # session_id -> response hashes

    def check(self, session_id: str, response: str) -> LoopStatus:
        # Hash the normalized, whitespace-stripped response
        digest = hashlib.sha256("".join(response.split()).encode()).hexdigest()
        window = self.sessions[session_id]
        window.append(digest)
        hits = window.count(digest)  # identical responses in the rolling window
        if hits >= self.threshold:
            return LoopStatus.CIRCUIT_BREAK
        return LoopStatus.WARNING if hits > 1 else LoopStatus.NORMAL

On circuit break:

  1. Return a structured error response that signals the agent framework to try a different approach:
    {
      "error": {
        "type": "loop_detected",
        "message": "Repeated identical output detected (3x). Consider rephrasing the request or trying a different approach.",
        "retry_after": 5
      }
    }
  2. Optionally: auto-escalate to the next tier (see Component 4) instead of returning an error.

Configuration:

reliability:
  loop_detection:
    enabled: true
    window_size: 10
    duplicate_threshold: 3
    similarity_threshold: 0.95   # fuzzy matching for near-identical responses

Component 2: Tool Call Validation

Mechanism: When the request includes tools or functions definitions (OpenAI function calling format), validate the model's response against the provided schemas before returning it to the client.

Validation checks:

  1. JSON validity: Is the tool call valid JSON?
  2. Schema compliance: Do the arguments match the declared parameter types and required fields?
  3. Tool existence: Does the called function name exist in the provided tool list?
  4. Argument plausibility: Are required arguments present and non-empty?
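The four checks above can be sketched as a single validator over OpenAI-style `tools` definitions. The helper name and error strings are illustrative, and declared parameter types are only checked for presence, not deep type conformance:

```python
import json

def validate_tool_call(call: dict, tools: list[dict]) -> list[str]:
    """Return a list of validation errors; an empty list means the call is valid."""
    errors = []
    schemas = {t["function"]["name"]: t["function"] for t in tools}
    name = call.get("function", {}).get("name")
    if name not in schemas:                                   # 3. tool existence
        return [f"unknown tool {name!r}; available: {sorted(schemas)}"]
    try:                                                      # 1. JSON validity
        args = json.loads(call["function"].get("arguments") or "{}")
    except json.JSONDecodeError as e:
        return [f"arguments are not valid JSON: {e}"]
    params = schemas[name].get("parameters", {})
    for field in params.get("required", []):                  # 2. schema compliance
        if field not in args or args[field] in ("", None):    # 4. plausibility
            errors.append(f'missing or empty required field "{field}" in function "{name}"')
    return errors
```

Returning a list of errors (rather than a boolean) lets the retry path feed the specific failures back into the corrective prompt.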

On validation failure:

  1. Inject a corrective system message and retry on the same tier (up to N times):
    Your previous response contained an invalid tool call. The error was: 
    missing required field "query" in function "search". 
    You must respond with valid JSON matching the tool schema exactly.
    Available tools: search, write_file, run_terminal_command.
    
  2. After N retries, escalate to the next tier.
  3. Log the failure for observability.
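The corrective injection can be sketched as a helper that turns validation errors into a system message in the OpenAI chat format (the function name is illustrative):

```python
def corrective_message(errors: list[str], tool_names: list[str]) -> dict:
    """Build the system message injected before retrying on the same tier."""
    return {
        "role": "system",
        "content": (
            "Your previous response contained an invalid tool call. "
            f"The error was: {'; '.join(errors)}. "
            "You must respond with valid JSON matching the tool schema exactly. "
            f"Available tools: {', '.join(tool_names)}."
        ),
    }
```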

Configuration:

reliability:
  tool_validation:
    enabled: true
    max_retries: 3
    retry_with_correction: true   # inject corrective prompt on retry
    validate_json: true
    validate_schema: true
    validate_tool_names: true

Component 3: Token Budget Enforcement

Mechanism: Track token usage per session and per time window. Enforce configurable limits to prevent runaway agents from monopolizing the model server.

Limits:

  • Per-session: Maximum tokens consumed in a single session (input + output)
  • Per-hour: Maximum tokens across all sessions in a rolling hour
  • Per-request output: Maximum output tokens for a single generation (prevents infinite reasoning loops)

On budget exceeded:

  • warn_and_continue: Add a header X-MLX-Stack-Budget-Warning: 80% to responses
  • escalate: Switch to a cheaper/faster tier for the remainder of the session
  • hard_stop: Return an error and refuse further requests until the budget resets
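The tracking behind these limits can be sketched as follows; the class name and the 80% warn threshold are illustrative choices, not settled design:

```python
import time
from collections import defaultdict, deque

class TokenBudget:
    def __init__(self, per_session=500_000, per_hour=2_000_000):
        self.per_session = per_session
        self.per_hour = per_hour
        self.session_totals = defaultdict(int)
        self.hour_log = deque()  # (timestamp, tokens) across all sessions

    def record(self, session_id: str, tokens: int) -> str:
        """Record usage; return 'ok', 'warn' (>= 80% of a limit), or 'exceeded'."""
        now = time.monotonic()
        self.session_totals[session_id] += tokens
        self.hour_log.append((now, tokens))
        # Drop entries older than the rolling one-hour window
        while self.hour_log and now - self.hour_log[0][0] > 3600:
            self.hour_log.popleft()
        hourly = sum(t for _, t in self.hour_log)
        session = self.session_totals[session_id]
        if session > self.per_session or hourly > self.per_hour:
            return "exceeded"   # apply the configured policy: escalate / hard_stop
        if session >= 0.8 * self.per_session or hourly >= 0.8 * self.per_hour:
            return "warn"       # e.g. emit the X-MLX-Stack-Budget-Warning header
        return "ok"
```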

Configuration:

reliability:
  token_budget:
    enabled: true
    per_session: 500000           # 500K tokens per session
    per_hour: 2000000             # 2M tokens per hour across all sessions
    max_output_tokens: 16384      # per-request output cap
    policy: warn_and_continue     # or: escalate, hard_stop

Component 4: Automatic Model Escalation

Mechanism: When a tier fails to produce a valid response (after retries), transparently re-route the request to the next tier in the escalation chain. The agent framework receives a working response and never knows the escalation happened.

Escalation chain:

fast → standard → longctx → premium (cloud via OpenRouter)

Implementation:

  • On tool validation failure after max retries: escalate
  • On loop detection circuit break: escalate
  • On timeout (model too slow, possibly due to memory pressure): escalate
  • On repeated 5xx errors from a tier: escalate and mark tier as degraded
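The escalation walk might look like the following sketch, where `call_tier` and `is_valid` are hypothetical stand-ins for the actual dispatch and validation logic:

```python
from typing import Callable

CHAIN = ["fast", "standard", "longctx", "premium"]

def complete_with_escalation(request: dict,
                             call_tier: Callable[[str, dict], dict],
                             is_valid: Callable[[dict], bool],
                             max_retries: int = 3) -> tuple[dict, dict]:
    """Try each tier in order; return (response, observability headers)."""
    start = request.get("tier", CHAIN[0])
    for tier in CHAIN[CHAIN.index(start):]:
        for attempt in range(max_retries):
            response = call_tier(tier, request)
            if is_valid(response):
                headers = {"X-MLX-Stack-Tier": tier,
                           "X-MLX-Stack-Retries": str(attempt)}
                if tier != start:
                    headers["X-MLX-Stack-Escalated-From"] = start
                return response, headers
        # max_retries exhausted on this tier -> fall through to the next one
    raise RuntimeError("all tiers exhausted, including cloud fallback")
```

In the real middleware the escalation reason would also be recorded (for the X-MLX-Stack-Escalation-Reason header) rather than inferred from the call site.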

Response headers for observability:

X-MLX-Stack-Tier: standard          # which tier actually served the request
X-MLX-Stack-Escalated-From: fast    # original tier before escalation
X-MLX-Stack-Escalation-Reason: tool_validation_failure
X-MLX-Stack-Retries: 3

Configuration:

reliability:
  escalation:
    enabled: true
    chain: [fast, standard, longctx, premium]
    max_retries_before_escalate: 3
    escalate_on:
      - tool_validation_failure
      - loop_detected
      - timeout
      - server_error
    cloud_fallback:
      provider: openrouter
      model: anthropic/claude-sonnet-4-6
      api_key_env: OPENROUTER_API_KEY

Component 5: Memory Pressure Quality Monitor (Bonus)

Mechanism: Monitor macOS memory pressure via the kern.memorystatus_vm_pressure_level sysctl or a libdispatch DISPATCH_SOURCE_TYPE_MEMORYPRESSURE source. When the system enters memory pressure (indicating swap is active or imminent), model quality degrades silently — the model is technically serving responses, but they're worse.

On memory pressure:

  1. Add X-MLX-Stack-Memory-Pressure: warning header to all responses
  2. If sustained for >60s, preemptively escalate complex requests to cloud
  3. Log the event for postmortem analysis

Why This Belongs in mlx-stack (Not the Agent Framework)

  1. Framework-agnostic: Every agent framework benefits. Hermes, CrewAI, OpenHands, AutoGPT — none of them need to implement their own reliability logic.
  2. Model-aware: mlx-stack knows which tiers are available, their capabilities, and their health. The agent framework doesn't.
  3. Transparent: The agent framework's code doesn't change. It sends requests and gets responses. The reliability layer is invisible when everything works.
  4. Operational: Loop detection, token budgets, and memory pressure monitoring are infrastructure concerns, not application concerns.

Why This is a "WOW Factor" Feature

Today, the best advice on the CrewAI forum for dealing with unreliable local models is "keep trying different models until you find one that works." mlx-stack's reliability layer changes the value proposition from "hope your model is good enough" to "mlx-stack guarantees a working response through intelligent escalation."

The pitch becomes: 95% local (cheap, private, fast), 5% cloud (reliable safety net). Users get the economics and privacy of local inference with the reliability of cloud APIs.

Implementation Approach

Option A: LiteLLM custom callbacks (recommended)

LiteLLM supports custom callbacks that can intercept requests and responses. The reliability layer would be implemented as a callback class:

from litellm.integrations.custom_logger import CustomLogger

class ReliabilityCallback(CustomLogger):
    async def async_post_call_success_hook(self, data, user_api_key_dict, response):
        # Loop detection on the response content
        # Tool call validation against the request's tool schemas
        # Token budget tracking from the response usage
        # Escalation decision (re-dispatch to the next tier)
        return response
This avoids building a separate proxy and integrates directly into the existing LiteLLM process.

Option B: Standalone proxy

A thin FastAPI app that sits in front of LiteLLM, adding the reliability logic. This is more isolated but adds another process to manage.

Recommendation: Start with Option A (LiteLLM callbacks) for simplicity. Move to Option B only if the callback interface proves too limiting.

Priority

v0.2 — This solves the #1 reason people abandon local models for agent workloads. Combined with the existing multi-tier architecture and cloud fallback, it makes mlx-stack uniquely valuable in the local serving space.

Acceptance Criteria

  • Loop detection identifies repeated identical/near-identical responses and triggers circuit breaker
  • Tool call validation checks responses against provided function schemas
  • Corrective retry injects helpful context before re-prompting the model
  • Automatic escalation transparently re-routes failed requests to the next tier
  • Cloud fallback works as the last resort in the escalation chain
  • Token budget enforcement with configurable per-session and per-hour limits
  • Response headers expose escalation metadata for observability
  • All components individually configurable (enable/disable, thresholds)
  • Memory pressure monitoring with proactive escalation
  • Comprehensive logging of all reliability events
  • Documentation explaining the reliability model and configuration
  • Integration tests simulating common failure modes
