Add correctness benchmark: data corruption in naive LLM chaining vs compiled flows #103

@dgenio

Description

Context / Problem

Issue #29 benchmarks latency and cost — compiled flows vs naive chaining with simulated LLM delays. But latency savings are a commodity argument: as LLMs get faster and cheaper, the absolute savings shrink.

The correctness argument is more durable and harder to replicate: when an LLM mediates data passing between tools, it introduces data corruption that compiled flows provably eliminate. This corruption is not hypothetical — it's structural:

| Corruption type | What happens | How often (estimated) |
| --- | --- | --- |
| Field hallucination | Tool A outputs {"customer_id": 42}. LLM passes {"customer_id": 42, "account_type": "premium"} — the account_type was fabricated | Common with GPT-4 class models on ambiguous schemas |
| Data loss | Tool A outputs 15 fields. LLM "summarizes" and passes only 3 to Tool B | Very common — LLMs naturally compress |
| Type corruption | Tool A returns {"amount": 99.99}. LLM passes {"amount": "ninety-nine dollars"} | Common when schema types aren't enforced |
| Schema drift | Tool A returns {"user_id": 5}. LLM passes {"userId": 5} (camelCase vs snake_case) | Common across different schema conventions |
| Inconsistent routing | Same input → same Tool A output → LLM picks different Tool B on different runs | Inherent — LLM sampling is non-deterministic |

ChainWeaver's compiled execution eliminates all of these because Pydantic-validated schemas pass data directly with no LLM touching intermediate data. A benchmark proving this gives ChainWeaver its strongest positioning argument.
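
For illustration, a minimal sketch of how schema validation at a tool boundary rejects each of these corruption modes. Pydantic v2 is assumed, and ToolBInput is a hypothetical stand-in for a tool schema, not ChainWeaver's actual API:

from pydantic import BaseModel, ConfigDict, ValidationError

class ToolBInput(BaseModel):
    """Hypothetical downstream tool schema (illustrative, not ChainWeaver's real API)."""
    model_config = ConfigDict(extra="forbid")   # hallucinated extra fields are rejected
    customer_id: int
    amount: float

ToolBInput.model_validate({"customer_id": 42, "amount": 99.99})   # clean handoff passes

for corrupted in (
    {"customer_id": 42, "amount": 99.99, "account_type": "premium"},  # field hallucination
    {"customer_id": 42},                                              # data loss
    {"customer_id": 42, "amount": "ninety-nine dollars"},             # type corruption
    {"customerId": 42, "amount": 99.99},                              # schema drift
):
    try:
        ToolBInput.model_validate(corrupted)
    except ValidationError:
        print("rejected at the boundary:", corrupted)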

Proposal

1. Correctness benchmark suite

Create benchmarks/bench_correctness.py:

def benchmark_naive_correctness(
    tools: list[Tool],
    steps: list[str],
    initial_input: dict,
    llm_fn: Callable[[str], str],     # simulates LLM data routing
    runs: int = 100,
) -> CorrectnessReport:
    """
    Simulate naive LLM chaining where the LLM interprets each tool's
    output and constructs the next tool's input.
    
    The llm_fn simulates realistic LLM behavior: occasional field
    hallucination, data loss, type changes, and non-deterministic routing.
    """

def benchmark_compiled_correctness(
    executor: FlowExecutor,
    flow_name: str,
    initial_input: dict,
    runs: int = 100,
) -> CorrectnessReport:
    """
    Execute via FlowExecutor (schema-validated, deterministic).
    """

2. CorrectnessReport

@dataclass
class CorrectnessReport:
    total_runs: int
    successful_runs: int
    
    # Corruption metrics
    field_hallucinations: int          # Fields not in source schema appeared in output
    data_loss_events: int              # Fields from source schema missing in output
    type_corruptions: int              # Field present but wrong type
    schema_drift_events: int           # Field name changed (e.g., snake_case → camelCase)
    routing_inconsistencies: int       # Different tool sequence on different runs
    
    # Aggregate
    corruption_rate: float             # corrupted_runs / total_runs
    determinism_rate: float            # identical_results / total_runs
    data_integrity_score: float        # 0.0–1.0 composite score
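
One way the aggregate fields could be derived from per-run observations; the RunResult shape below is an assumption for illustration, not part of the proposal:

from collections import Counter
from dataclasses import dataclass

@dataclass
class RunResult:
    corruption_events: list[str]   # e.g. ["data_loss", "type_corruption"]
    output_fingerprint: str        # hash of the final output, for the determinism check

def aggregate(results: list[RunResult]) -> tuple[float, float]:
    total = len(results)
    corrupted_runs = sum(1 for r in results if r.corruption_events)
    corruption_rate = corrupted_runs / total
    # Determinism: fraction of runs whose output matches the most common output.
    top = Counter(r.output_fingerprint for r in results).most_common(1)[0][1]
    determinism_rate = top / total
    return corruption_rate, determinism_rate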

3. Simulated LLM corruption model

The benchmark needs a realistic LLM simulation that introduces corruption at configurable rates:

@dataclass
class LLMCorruptionProfile:
    """
    Configurable corruption rates for simulating LLM data routing.
    Default rates are illustrative; they should be grounded in published
    LLM tool-calling benchmarks, with sources documented.
    """
    hallucination_rate: float = 0.05      # 5% chance of adding a fabricated field
    data_loss_rate: float = 0.10          # 10% chance of dropping a field
    type_corruption_rate: float = 0.03    # 3% chance of type change
    schema_drift_rate: float = 0.02       # 2% chance of field name change
    routing_variance_rate: float = 0.08   # 8% chance of choosing a different next tool

These rates should be informed by published LLM tool-calling benchmarks where available, with documented sources.
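
A sketch of how the profile could drive the simulated handoff, with each corruption type applied independently at its configured rate; the event names and the crude field mutations are illustrative only:

import random

def corrupt_payload(payload: dict, profile: LLMCorruptionProfile,
                    rng: random.Random) -> tuple[dict, list[str]]:
    """Return a (possibly corrupted) copy of the payload plus the corruption events observed."""
    out, events = dict(payload), []
    if rng.random() < profile.hallucination_rate:
        out["fabricated_field"] = "hallucinated value"          # field hallucination
        events.append("field_hallucination")
    if out and rng.random() < profile.data_loss_rate:
        out.pop(rng.choice(sorted(out)))                        # data loss
        events.append("data_loss")
    if out and rng.random() < profile.type_corruption_rate:
        key = rng.choice(sorted(out))
        out[key] = str(out[key])                                # type corruption, e.g. 99.99 -> "99.99"
        events.append("type_corruption")
    if out and rng.random() < profile.schema_drift_rate:
        key = rng.choice(sorted(out))
        parts = key.split("_")                                  # schema drift: snake_case -> camelCase
        out[parts[0] + "".join(p.title() for p in parts[1:])] = out.pop(key)
        events.append("schema_drift")
    return out, events

Seeding the rng (e.g. random.Random(0)) keeps the simulation reproducible, which matters for the "consistent, reproducible" requirement under Out of Scope.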

4. Output format

ChainWeaver Correctness Benchmark
==================================
Chain: fetch → validate → transform → format (4 steps)
Runs: 100

                        Naive (LLM)    Compiled (ChainWeaver)
─────────────────────────────────────────────────────────────
Successful runs:        87/100         100/100
Field hallucinations:   12             0
Data loss events:       23             0
Type corruptions:       5              0
Schema drift:           3              0
Routing inconsistent:   8              0 (N/A — deterministic)
─────────────────────────────────────────────────────────────
Corruption rate:        13.0%          0.0%
Determinism rate:       84.0%          100.0%
Data integrity score:   0.87           1.00

Verdict: Compiled flows eliminated 100% of intermediate data corruption.
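
For the machine-readable side of the output (called out in the acceptance criteria), a minimal serialization sketch; the JSON layout here is an assumption:

import dataclasses, json

def reports_to_json(naive: CorrectnessReport, compiled: CorrectnessReport,
                    chain: list[str]) -> str:
    """Serialize both reports side by side; the schema is illustrative."""
    return json.dumps(
        {"chain": chain,
         "naive": dataclasses.asdict(naive),
         "compiled": dataclasses.asdict(compiled)},
        indent=2,
    )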

5. Parameterized test scenarios

| Scenario | Chain | Why it matters |
| --- | --- | --- |
| Numeric pipeline | double → add → format | Type corruption (int vs float vs string) |
| Data enrichment | fetch → enrich → validate → store | Field hallucination, data loss |
| Multi-schema | search → extract → translate → summarize | Schema drift between different domain schemas |
| Long chain | 10-step pipeline | Corruption compounds — 5% per step ≈ 40% at step 10 |
| Branching | fetch → (validate OR retry) → store | Routing inconsistency |
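
These scenarios could be wired up roughly as below; run_compiled_scenario and the step names are placeholders, and only the compiled side gets a hard zero-corruption assertion:

import pytest

SCENARIOS = {
    "numeric_pipeline": ["double", "add", "format"],
    "data_enrichment":  ["fetch", "enrich", "validate", "store"],
    "long_chain":       [f"step_{i}" for i in range(10)],
}

@pytest.mark.parametrize("name,steps", list(SCENARIOS.items()), ids=list(SCENARIOS))
def test_compiled_flow_has_zero_corruption(name, steps):
    report = run_compiled_scenario(name, steps, runs=100)   # hypothetical helper
    assert report.corruption_rate == 0.0
    assert report.determinism_rate == 1.0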

Relevant Code Locations

Acceptance Criteria

  • benchmarks/bench_correctness.py exists and runs standalone
  • CorrectnessReport captures all 5 corruption types + aggregate scores
  • LLMCorruptionProfile allows configurable corruption rates
  • Naive simulation introduces realistic corruption (hallucination, loss, type, drift, routing)
  • Compiled flow benchmark shows 0 corruption across all runs (schema validation enforces this)
  • At least 3 parameterized scenarios (numeric, data enrichment, long chain)
  • Human-readable table output and machine-readable JSON
  • Results include "corruption compounds" analysis (showing N-step corruption accumulation)
  • benchmarks/README.md updated with correctness benchmark instructions
  • Default corruption rates are documented with rationale

Out of Scope

  • Real LLM API calls (simulation only — consistent, reproducible)
  • Statistical significance testing (keep it simple: 100 runs is sufficient for demonstration)
  • Comparison against specific LLM providers (framework-agnostic)
  • Automated regression tracking in CI

Dependencies

Notes

  • The key insight: corruption compounds across steps. A 5% hallucination rate per step means a ~23% chance of at least one hallucination in a 5-step chain (1 − 0.95^5 ≈ 0.23). This makes the correctness argument stronger for longer chains — exactly where ChainWeaver's value is highest.
  • Plotting that cumulative probability against chain length gives the "compound corruption curve" — this should be a chart in the README (see the sketch after these notes).
  • The compiled flow correctness is provably 0% corruption because Pydantic schema validation rejects malformed data at every boundary. This is a mathematical guarantee, not an empirical observation.
  • Consider referencing published LLM tool-calling benchmarks (Berkeley Function Calling Leaderboard, etc.) to ground the default corruption rates in real data.
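
A quick sketch of the compound corruption curve from the first note, using the standard at-least-one-event formula 1 − (1 − p)^n:

def compound_corruption(p: float, n: int) -> float:
    """Probability of at least one corruption event in an n-step chain at per-step rate p."""
    return 1.0 - (1.0 - p) ** n

for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {compound_corruption(0.05, n):.1%}")
#  1 steps: 5.0%
#  5 steps: 22.6%   (~23%, as in the note above)
# 10 steps: 40.1%   (~40%, as in the long-chain scenario)
# 20 steps: 64.2%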

Metadata

Labels: ai-friendly (Designed for AI-assisted implementation), area:benchmarks (Performance benchmarks), complexity:average (Moderate effort, some design needed), priority:high (Must address first within the milestone), size:M (Medium effort, 1-3 days), type:feature (New feature or capability)
