Summary
Implement an agent reliability layer directly in the serving infrastructure that detects common failure modes (infinite loops, malformed tool calls, runaway token usage) and automatically recovers — either by retrying with corrective prompts, escalating to a more capable model tier, or falling back to cloud. This makes any local model dramatically more reliable for autonomous agent workloads without requiring changes to the agent framework.
Problem
The #1 reason people give up on local models for autonomous agents is unreliable tool calling and instruction following. This is well-documented across communities:
- CrewAI forums: "agents frequently attempted tool calls using placeholder data," "agents produce garbage output or incorrect JSON formatting"
- Local model users consistently report: agent generates malformed JSON → retries → malformed again → infinite loop burning tokens and compute all night
- Research shows 73% of enterprise agent deployments hit reliability failures in their first year. On local hardware, failure modes are worse because smaller models have weaker instruction following
Nobody is solving this at the serving layer. Every agent framework expects the model API to "just work" and handles errors poorly. The agent framework generates a prompt, sends it to the API, and trusts the response. When the response is garbage, the framework either crashes, retries blindly, or loops forever.
Failure Modes in the Wild
- Infinite tool-call loops: Model generates {"function": "search", "arguments": {"query": "..."}} → gets result → generates the exact same tool call → gets the same result → loops indefinitely. The agent framework doesn't detect this because each response is technically valid.
- Malformed function calls: Model generates almost-valid JSON but with wrong field names, missing required fields, or extra trailing text. The agent framework's parser fails, it retries, the model generates the same malformed output.
- Hallucinated tool names: Model calls execute_code when the available tools are run_terminal_command and write_file. Agent framework returns "unknown tool" error, model tries again with the same hallucinated name.
- Runaway token consumption: Agent enters a reasoning loop, generating thousands of tokens of "thinking" that goes nowhere. On a cloud API this costs money; on local hardware it monopolizes the model server for hours.
- Quality degradation under load: When memory pressure causes swap, model quality drops silently. The model starts generating incoherent responses, but the health check still passes because the server is technically running.
Proposed Solution
Architecture
The reliability layer sits between LiteLLM and the client (agent framework), implemented as a LiteLLM middleware or a thin proxy layer:
Agent Framework
↓
┌─────────────────────────┐
│ Reliability Layer │
│ ┌───────────────────┐ │
│ │ Loop Detector │ │
│ │ Tool Validator │ │
│ │ Token Budget │ │
│ │ Escalation Engine │ │
│ └───────────────────┘ │
└─────────────────────────┘
↓
LiteLLM Proxy (:4000)
↓
Model Tiers
Component 1: Loop Detection
Mechanism: Maintain a rolling window of the last N responses per session (identified by a session header or API key). Compute a similarity hash of each response (normalized, whitespace-stripped). If 3+ responses in the window produce the same or near-identical output, trigger the circuit breaker.
Implementation:
import hashlib
from collections import deque
from enum import Enum

LoopStatus = Enum("LoopStatus", "NORMAL WARNING CIRCUIT_BREAK")

class LoopDetector:
    def __init__(self, window_size=10, threshold=3, similarity=0.95):
        self.window_size, self.threshold = window_size, threshold
        self.similarity = similarity  # reserved for fuzzy near-duplicate matching
        self.sessions = {}  # session_id -> deque of response hashes

    def check(self, session_id: str, response: str) -> LoopStatus:
        # Hash the normalized, whitespace-stripped response and add it to the rolling window
        digest = hashlib.sha256(" ".join(response.split()).encode()).hexdigest()
        window = self.sessions.setdefault(session_id, deque(maxlen=self.window_size))
        window.append(digest)
        n = window.count(digest)  # identical responses seen in the window
        if n >= self.threshold:
            return LoopStatus.CIRCUIT_BREAK
        return LoopStatus.WARNING if n == self.threshold - 1 else LoopStatus.NORMAL
On circuit break:
- Return a structured error response that signals the agent framework to try a different approach:
{
"error": {
"type": "loop_detected",
"message": "Repeated identical output detected (3x). Consider rephrasing the request or trying a different approach.",
"retry_after": 5
}
}
- Optionally: auto-escalate to the next tier (see Component 4) instead of returning an error.
Configuration:
reliability:
loop_detection:
enabled: true
window_size: 10
duplicate_threshold: 3
similarity_threshold: 0.95 # fuzzy matching for near-identical responses
Component 2: Tool Call Validation
Mechanism: When the request includes tools or functions definitions (OpenAI function calling format), validate the model's response against the provided schemas before returning it to the client.
Validation checks:
- JSON validity: Is the tool call valid JSON?
- Schema compliance: Do the arguments match the declared parameter types and required fields?
- Tool existence: Does the called function name exist in the provided tool list?
- Argument plausibility: Are required arguments present and non-empty?
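The four checks above could be implemented as a small validator over OpenAI-format tool definitions. A minimal sketch (`validate_tool_call` is an illustrative name, not an existing mlx-stack API; a fuller version would also type-check arguments against the JSON Schema, e.g. via the jsonschema package):

```python
import json

def validate_tool_call(call: dict, tools: list[dict]) -> list[str]:
    """Return human-readable validation errors for one tool call (empty list = valid)."""
    errors = []
    schemas = {t["function"]["name"]: t["function"] for t in tools}
    name = call.get("name")
    if name not in schemas:  # tool existence
        errors.append(f'unknown tool "{name}"; available tools: {", ".join(schemas)}')
        return errors
    try:  # JSON validity of the arguments string
        args = json.loads(call.get("arguments", ""))
    except json.JSONDecodeError as e:
        errors.append(f"arguments are not valid JSON: {e}")
        return errors
    required = schemas[name].get("parameters", {}).get("required", [])
    for field in required:  # required arguments present and non-empty
        if field not in args or args[field] in ("", None):
            errors.append(f'missing required field "{field}" in function "{name}"')
    return errors
```

The returned error strings can be dropped directly into the corrective retry prompt described below.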
On validation failure:
- Inject a corrective system message and retry on the same tier (up to N times):
Your previous response contained an invalid tool call. The error was:
missing required field "query" in function "search".
You must respond with valid JSON matching the tool schema exactly.
Available tools: search, write_file, run_terminal_command.
- After N retries, escalate to the next tier.
- Log the failure for observability.
Configuration:
reliability:
tool_validation:
enabled: true
max_retries: 3
retry_with_correction: true # inject corrective prompt on retry
validate_json: true
validate_schema: true
validate_tool_names: true
Component 3: Token Budget Enforcement
Mechanism: Track token usage per session and per time window. Enforce configurable limits to prevent runaway agents from monopolizing the model server.
Limits:
- Per-session: Maximum tokens consumed in a single session (input + output)
- Per-hour: Maximum tokens across all sessions in a rolling hour
- Per-request output: Maximum output tokens for a single generation (prevents infinite reasoning loops)
On budget exceeded:
- warn_and_continue: Add a header X-MLX-Stack-Budget-Warning: 80% to responses
- escalate: Switch to a cheaper/faster tier for the remainder of the session
- hard_stop: Return an error and refuse further requests until the budget resets
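The per-session and rolling-hour tracking could be sketched as follows (`TokenBudget` and its `record` method are hypothetical names; the warning threshold of 80% mirrors the header example above):

```python
import time
from collections import deque

class TokenBudget:
    def __init__(self, per_session=500_000, per_hour=2_000_000):
        self.per_session, self.per_hour = per_session, per_hour
        self.session_totals = {}   # session_id -> tokens consumed so far
        self.hour_window = deque() # (timestamp, tokens) across all sessions

    def record(self, session_id: str, tokens: int) -> str:
        """Record usage and return 'ok', 'warning', or 'exceeded'."""
        now = time.monotonic()
        self.session_totals[session_id] = self.session_totals.get(session_id, 0) + tokens
        self.hour_window.append((now, tokens))
        # Expire entries older than one rolling hour
        while self.hour_window and now - self.hour_window[0][0] > 3600:
            self.hour_window.popleft()
        hourly = sum(t for _, t in self.hour_window)
        if self.session_totals[session_id] > self.per_session or hourly > self.per_hour:
            return "exceeded"   # apply the configured policy (warn/escalate/hard_stop)
        if self.session_totals[session_id] > 0.8 * self.per_session:
            return "warning"    # e.g. emit X-MLX-Stack-Budget-Warning: 80%
        return "ok"
```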
Configuration:
reliability:
token_budget:
enabled: true
per_session: 500000 # 500K tokens per session
per_hour: 2000000 # 2M tokens per hour across all sessions
max_output_tokens: 16384 # per-request output cap
policy: warn_and_continue # or: escalate, hard_stop
Component 4: Automatic Model Escalation
Mechanism: When a tier fails to produce a valid response (after retries), transparently re-route the request to the next tier in the escalation chain. The agent framework receives a working response and never knows the escalation happened.
Escalation chain:
fast → standard → longctx → premium (cloud via OpenRouter)
Implementation:
- On tool validation failure after max retries: escalate
- On loop detection circuit break: escalate
- On timeout (model too slow, possibly due to memory pressure): escalate
- On repeated 5xx errors from a tier: escalate and mark tier as degraded
Response headers for observability:
X-MLX-Stack-Tier: standard # which tier actually served the request
X-MLX-Stack-Escalated-From: fast # original tier before escalation
X-MLX-Stack-Escalation-Reason: tool_validation_failure
X-MLX-Stack-Retries: 3
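Putting retries, escalation, and header stamping together might look like the following sketch (`attempt` is a hypothetical stand-in for dispatching a request to a tier through LiteLLM; `serve_with_escalation` is an illustrative name):

```python
ESCALATION_CHAIN = ["fast", "standard", "longctx", "premium"]

def serve_with_escalation(request, attempt, max_retries=3):
    """Walk the chain in order; move to the next tier after max_retries failures."""
    for i, tier in enumerate(ESCALATION_CHAIN):
        for retry in range(max_retries):
            ok, response = attempt(tier, request)  # hypothetical per-tier dispatch
            if ok:
                # Stamp observability headers for the caller
                response["headers"] = {
                    "X-MLX-Stack-Tier": tier,
                    "X-MLX-Stack-Retries": str(retry),
                }
                if i > 0:  # the request escalated past its original tier
                    response["headers"]["X-MLX-Stack-Escalated-From"] = ESCALATION_CHAIN[0]
                return response
    raise RuntimeError("all tiers exhausted, including cloud fallback")
```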
Configuration:
reliability:
escalation:
enabled: true
chain: [fast, standard, longctx, premium]
max_retries_before_escalate: 3
escalate_on:
- tool_validation_failure
- loop_detected
- timeout
- server_error
cloud_fallback:
provider: openrouter
model: anthropic/claude-sonnet-4-6
api_key_env: OPENROUTER_API_KEY
Component 5: Memory Pressure Quality Monitor (Bonus)
Mechanism: Monitor macOS memory pressure via sysctl vm.memory_pressure_level or the libdispatch memory pressure source. When the system enters memory pressure (indicating swap is active or imminent), model quality degrades silently — the model is technically serving responses but they're worse.
On memory pressure:
- Add an X-MLX-Stack-Memory-Pressure: warning header to all responses
- If sustained for >60s, preemptively escalate complex requests to cloud
- Log the event for postmortem analysis
Why This Belongs in mlx-stack (Not the Agent Framework)
- Framework-agnostic: Every agent framework benefits. Hermes, CrewAI, OpenHands, AutoGPT — none of them need to implement their own reliability logic.
- Model-aware: mlx-stack knows which tiers are available, their capabilities, and their health. The agent framework doesn't.
- Transparent: The agent framework's code doesn't change. It sends requests and gets responses. The reliability layer is invisible when everything works.
- Operational: Loop detection, token budgets, and memory pressure monitoring are infrastructure concerns, not application concerns.
Why This is a "WOW Factor" Feature
Today, the best advice on the CrewAI forum for dealing with unreliable local models is "keep trying different models until you find one that works." mlx-stack's reliability layer changes the value proposition from "hope your model is good enough" to "mlx-stack guarantees a working response through intelligent escalation."
The pitch becomes: 95% local (cheap, private, fast), 5% cloud (reliable safety net). Users get the economics and privacy of local inference with the reliability of cloud APIs.
Implementation Approach
Option A: LiteLLM custom callbacks (recommended)
LiteLLM supports custom callbacks that can intercept requests and responses. The reliability layer would be implemented as a callback class:
from litellm.integrations.custom_logger import CustomLogger

class ReliabilityCallback(CustomLogger):
    async def async_post_call_success_hook(self, data, user_api_key_dict, response):
        # Loop detection
        # Tool call validation
        # Token budget tracking
        # Escalation decision
        return response
This avoids building a separate proxy and integrates directly into the existing LiteLLM process.
Option B: Standalone proxy
A thin FastAPI app that sits in front of LiteLLM, adding the reliability logic. This is more isolated but adds another process to manage.
Recommendation: Start with Option A (LiteLLM callbacks) for simplicity. Move to Option B only if the callback interface proves too limiting.
Priority
v0.2 — This solves the #1 reason people abandon local models for agent workloads. Combined with the existing multi-tier architecture and cloud fallback, it makes mlx-stack uniquely valuable in the local serving space.
Acceptance Criteria