Add correctness benchmark: data corruption in naive LLM chaining vs compiled flows #103

@dgenio

Description

Context / Problem

Issue #29 benchmarks latency and cost — compiled flows vs naive chaining with simulated LLM delays. But latency savings are a commodity argument: as LLMs get faster and cheaper, the absolute savings shrink.

The correctness argument is more durable and harder to replicate: when an LLM mediates data passing between tools, it introduces data corruption that compiled flows provably eliminate. This corruption is not hypothetical — it's structural:

| Corruption type | What happens | How often (estimated) |
| --- | --- | --- |
| Field hallucination | Tool A outputs {"customer_id": 42}. LLM passes {"customer_id": 42, "account_type": "premium"} — the account_type was fabricated | Common with GPT-4 class models on ambiguous schemas |
| Data loss | Tool A outputs 15 fields. LLM "summarizes" and passes only 3 to Tool B | Very common — LLMs naturally compress |
| Type corruption | Tool A returns {"amount": 99.99}. LLM passes {"amount": "ninety-nine dollars"} | Common when schema types aren't enforced |
| Schema drift | Tool A returns {"user_id": 5}. LLM passes {"userId": 5} (camelCase vs snake_case) | Common across different schema conventions |
| Inconsistent routing | Same input → same Tool A output → LLM picks different Tool B on different runs | Inherent — LLM sampling is non-deterministic |

ChainWeaver's compiled execution eliminates all of these because Pydantic-validated schemas pass data directly with no LLM touching intermediate data. A benchmark proving this gives ChainWeaver its strongest positioning argument.
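
For illustration, a minimal sketch of how schema validation at a tool boundary rejects each of these corruption modes. Pydantic v2 is assumed, and ToolBInput is a hypothetical stand-in for a tool schema, not ChainWeaver's actual API:

from pydantic import BaseModel, ConfigDict, ValidationError

class ToolBInput(BaseModel):
    """Hypothetical downstream tool schema (illustrative, not ChainWeaver's real API)."""
    model_config = ConfigDict(extra="forbid")   # hallucinated extra fields are rejected
    customer_id: int
    amount: float

ToolBInput.model_validate({"customer_id": 42, "amount": 99.99})   # clean handoff passes

for corrupted in (
    {"customer_id": 42, "amount": 99.99, "account_type": "premium"},  # field hallucination
    {"customer_id": 42},                                              # data loss
    {"customer_id": 42, "amount": "ninety-nine dollars"},             # type corruption
    {"customerId": 42, "amount": 99.99},                              # schema drift
):
    try:
        ToolBInput.model_validate(corrupted)
    except ValidationError:
        print("rejected at the boundary:", corrupted)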

Proposal

1. Correctness benchmark suite

Create benchmarks/bench_correctness.py:

def benchmark_naive_correctness(
    tools: list[Tool],
    steps: list[str],
    initial_input: dict,
    llm_fn: Callable[[str], str],     # simulates LLM data routing
    runs: int = 100,
) -> CorrectnessReport:
    """
    Simulate naive LLM chaining where the LLM interprets each tool's
    output and constructs the next tool's input.
    
    The llm_fn simulates realistic LLM behavior: occasional field
    hallucination, data loss, type changes, and non-deterministic routing.
    """

def benchmark_compiled_correctness(
    executor: FlowExecutor,
    flow_name: str,
    initial_input: dict,
    runs: int = 100,
) -> CorrectnessReport:
    """
    Execute via FlowExecutor (schema-validated, deterministic).
    """

2. CorrectnessReport

@dataclass
class CorrectnessReport:
    total_runs: int
    successful_runs: int
    
    # Corruption metrics
    field_hallucinations: int          # Fields not in source schema appeared in output
    data_loss_events: int              # Fields from source schema missing in output
    type_corruptions: int              # Field present but wrong type
    schema_drift_events: int           # Field name changed (e.g., snake_case → camelCase)
    routing_inconsistencies: int       # Different tool sequence on different runs
    
    # Aggregate
    corruption_rate: float             # corrupted_runs / total_runs
    determinism_rate: float            # identical_results / total_runs
    data_integrity_score: float        # 0.0–1.0 composite score
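
One way the aggregate fields could be derived from per-run observations; the RunResult shape below is an assumption for illustration, not part of the proposal:

from collections import Counter
from dataclasses import dataclass

@dataclass
class RunResult:
    corruption_events: list[str]   # e.g. ["data_loss", "type_corruption"]
    output_fingerprint: str        # hash of the final output, for the determinism check

def aggregate(results: list[RunResult]) -> tuple[float, float]:
    total = len(results)
    corrupted_runs = sum(1 for r in results if r.corruption_events)
    corruption_rate = corrupted_runs / total
    # Determinism: fraction of runs whose output matches the most common output.
    top = Counter(r.output_fingerprint for r in results).most_common(1)[0][1]
    determinism_rate = top / total
    return corruption_rate, determinism_rate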

3. Simulated LLM corruption model

The benchmark needs a realistic LLM simulation that introduces corruption at configurable rates:

@dataclass
class LLMCorruptionProfile:
    """
    Configurable corruption rates for simulating LLM data routing.
    Default rates are illustrative; they should be grounded in published
    LLM tool-calling benchmarks, with sources documented.
    """
    hallucination_rate: float = 0.05      # 5% chance of adding a fabricated field
    data_loss_rate: float = 0.10          # 10% chance of dropping a field
    type_corruption_rate: float = 0.03    # 3% chance of type change
    schema_drift_rate: float = 0.02       # 2% chance of field name change
    routing_variance_rate: float = 0.08   # 8% chance of choosing a different next tool

These rates should be informed by published LLM tool-calling benchmarks where available, with documented sources.
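
A sketch of how the profile could drive the simulated handoff, with each corruption type applied independently at its configured rate; the event names and the crude field mutations are illustrative only:

import random

def corrupt_payload(payload: dict, profile: LLMCorruptionProfile,
                    rng: random.Random) -> tuple[dict, list[str]]:
    """Return a (possibly corrupted) copy of the payload plus the corruption events observed."""
    out, events = dict(payload), []
    if rng.random() < profile.hallucination_rate:
        out["fabricated_field"] = "hallucinated value"          # field hallucination
        events.append("field_hallucination")
    if out and rng.random() < profile.data_loss_rate:
        out.pop(rng.choice(sorted(out)))                        # data loss
        events.append("data_loss")
    if out and rng.random() < profile.type_corruption_rate:
        key = rng.choice(sorted(out))
        out[key] = str(out[key])                                # type corruption, e.g. 99.99 -> "99.99"
        events.append("type_corruption")
    if out and rng.random() < profile.schema_drift_rate:
        key = rng.choice(sorted(out))
        parts = key.split("_")                                  # schema drift: snake_case -> camelCase
        out[parts[0] + "".join(p.title() for p in parts[1:])] = out.pop(key)
        events.append("schema_drift")
    return out, events

Seeding the rng (e.g. random.Random(0)) keeps the simulation reproducible, which matters for the "consistent, reproducible" requirement under Out of Scope.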

4. Output format

ChainWeaver Correctness Benchmark
==================================
Chain: fetch → validate → transform → format (4 steps)
Runs: 100

                        Naive (LLM)    Compiled (ChainWeaver)
─────────────────────────────────────────────────────────────
Successful runs:        87/100         100/100
Field hallucinations:   12             0
Data loss events:       23             0
Type corruptions:       5              0
Schema drift:           3              0
Routing inconsistent:   8              0 (N/A — deterministic)
─────────────────────────────────────────────────────────────
Corruption rate:        13.0%          0.0%
Determinism rate:       84.0%          100.0%
Data integrity score:   0.87           1.00

Verdict: Compiled flows eliminated 100% of intermediate data corruption.
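
For the machine-readable side of the output (called out in the acceptance criteria), a minimal serialization sketch; the JSON layout here is an assumption:

import dataclasses, json

def reports_to_json(naive: CorrectnessReport, compiled: CorrectnessReport,
                    chain: list[str]) -> str:
    """Serialize both reports side by side; the schema is illustrative."""
    return json.dumps(
        {"chain": chain,
         "naive": dataclasses.asdict(naive),
         "compiled": dataclasses.asdict(compiled)},
        indent=2,
    )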

5. Parameterized test scenarios

| Scenario | Chain | Why it matters |
| --- | --- | --- |
| Numeric pipeline | double → add → format | Type corruption (int vs float vs string) |
| Data enrichment | fetch → enrich → validate → store | Field hallucination, data loss |
| Multi-schema | search → extract → translate → summarize | Schema drift between different domain schemas |
| Long chain | 10-step pipeline | Corruption compounds — 5% per step ≈ 40% at step 10 |
| Branching | fetch → (validate OR retry) → store | Routing inconsistency |
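
These scenarios could be wired up roughly as below; run_compiled_scenario and the step names are placeholders, and only the compiled side gets a hard zero-corruption assertion:

import pytest

SCENARIOS = {
    "numeric_pipeline": ["double", "add", "format"],
    "data_enrichment":  ["fetch", "enrich", "validate", "store"],
    "long_chain":       [f"step_{i}" for i in range(10)],
}

@pytest.mark.parametrize("name,steps", list(SCENARIOS.items()), ids=list(SCENARIOS))
def test_compiled_flow_has_zero_corruption(name, steps):
    report = run_compiled_scenario(name, steps, runs=100)   # hypothetical helper
    assert report.corruption_rate == 0.0
    assert report.determinism_rate == 1.0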

Relevant Code Locations

Acceptance Criteria

  • benchmarks/bench_correctness.py exists and runs standalone
  • CorrectnessReport captures all 5 corruption types + aggregate scores
  • LLMCorruptionProfile allows configurable corruption rates
  • Naive simulation introduces realistic corruption (hallucination, loss, type, drift, routing)
  • Compiled flow benchmark shows 0 corruption across all runs (schema validation enforces this)
  • At least 3 parameterized scenarios (numeric, data enrichment, long chain)
  • Human-readable table output and machine-readable JSON
  • Results include "corruption compounds" analysis (showing N-step corruption accumulation)
  • benchmarks/README.md updated with correctness benchmark instructions
  • Default corruption rates are documented with rationale

Out of Scope

  • Real LLM API calls (simulation only — consistent, reproducible)
  • Statistical significance testing (keep it simple: 100 runs is sufficient for demonstration)
  • Comparison against specific LLM providers (framework-agnostic)
  • Automated regression tracking in CI

Dependencies

Notes

  • The key insight: corruption compounds across steps. A 5% hallucination rate per step means a ~23% chance of at least one hallucination in a 5-step chain (1 − 0.95^5 ≈ 0.23). This makes the correctness argument stronger for longer chains — exactly where ChainWeaver's value is highest.
  • Plotting that cumulative probability against chain length gives the "compound corruption curve" — this should be a chart in the README (see the sketch after these notes).
  • The compiled flow correctness is provably 0% corruption because Pydantic schema validation rejects malformed data at every boundary. This is a mathematical guarantee, not an empirical observation.
  • Consider referencing published LLM tool-calling benchmarks (Berkeley Function Calling Leaderboard, etc.) to ground the default corruption rates in real data.
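
A quick sketch of the compound corruption curve from the first note, using the standard at-least-one-event formula 1 − (1 − p)^n:

def compound_corruption(p: float, n: int) -> float:
    """Probability of at least one corruption event in an n-step chain at per-step rate p."""
    return 1.0 - (1.0 - p) ** n

for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {compound_corruption(0.05, n):.1%}")
#  1 steps: 5.0%
#  5 steps: 22.6%   (~23%, as in the note above)
# 10 steps: 40.1%   (~40%, as in the long-chain scenario)
# 20 steps: 64.2%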

Metadata

Labels: ai-friendly (Designed for AI-assisted implementation), area:benchmarks (Performance benchmarks), complexity:average (Moderate effort, some design needed), priority:high (Must address first within the milestone), size:M (Medium effort, 1-3 days), type:feature (New feature or capability)
