Context / Problem
Issue #29 benchmarks latency and cost — compiled flows vs naive chaining with simulated LLM delays. But latency savings are a commodity argument: as LLMs get faster and cheaper, the absolute savings shrink.
The correctness argument is more durable and harder to replicate: when an LLM mediates data passing between tools, it introduces data corruption that compiled flows provably eliminate. This corruption is not hypothetical — it's structural:
| Corruption type | What happens | How often (estimated) |
|---|---|---|
| Field hallucination | Tool A outputs `{"customer_id": 42}`. LLM passes `{"customer_id": 42, "account_type": "premium"}` — the `account_type` was fabricated | Common with GPT-4 class models on ambiguous schemas |
| Data loss | Tool A outputs 15 fields. LLM "summarizes" and passes only 3 to Tool B | Very common — LLMs naturally compress |
| Type corruption | Tool A returns `{"amount": 99.99}`. LLM passes `{"amount": "ninety-nine dollars"}` | Common when schema types aren't enforced |
| Schema drift | Tool A returns `{"user_id": 5}`. LLM passes `{"userId": 5}` (camelCase vs snake_case) | Common across different schema conventions |
| Inconsistent routing | Same input → same Tool A output → LLM picks a different Tool B on different runs | Inherent — LLM sampling is non-deterministic |
ChainWeaver's compiled execution eliminates all of these because Pydantic-validated schemas pass data directly with no LLM touching intermediate data. A benchmark proving this gives ChainWeaver its strongest positioning argument.
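As a minimal sketch of that boundary (the `PaymentRecord` model and its fields are illustrative, not ChainWeaver's actual schemas), Pydantic rejects a type-corrupted payload at the step boundary instead of letting it flow downstream:

```python
from pydantic import BaseModel, ValidationError

# Hypothetical schema for one tool boundary; field names are illustrative only.
class PaymentRecord(BaseModel):
    customer_id: int
    amount: float

# A compiled flow passes the upstream output dict straight into the
# downstream schema, so valid data flows through unchanged.
clean = {"customer_id": 42, "amount": 99.99}
PaymentRecord(**clean)  # validates

# Corrupted data is rejected at the boundary rather than silently propagating.
corrupted = {"customer_id": 42, "amount": "ninety-nine dollars"}
try:
    PaymentRecord(**corrupted)  # type corruption
except ValidationError as exc:
    print(exc)
```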
Proposal
1. Correctness benchmark suite
Create `benchmarks/bench_correctness.py`:

```python
from typing import Callable

# Tool and FlowExecutor come from chainweaver/tools.py and chainweaver/executor.py;
# CorrectnessReport is defined in section 2 below.

def benchmark_naive_correctness(
    tools: list[Tool],
    steps: list[str],
    initial_input: dict,
    llm_fn: Callable[[str], str],  # simulates LLM data routing
    runs: int = 100,
) -> CorrectnessReport:
    """
    Simulate naive LLM chaining where the LLM interprets each tool's
    output and constructs the next tool's input.

    The llm_fn simulates realistic LLM behavior: occasional field
    hallucination, data loss, type changes, and non-deterministic routing.
    """


def benchmark_compiled_correctness(
    executor: FlowExecutor,
    flow_name: str,
    initial_input: dict,
    runs: int = 100,
) -> CorrectnessReport:
    """
    Execute via FlowExecutor (schema-validated, deterministic).
    """
```
2. `CorrectnessReport`

```python
from dataclasses import dataclass

@dataclass
class CorrectnessReport:
    total_runs: int
    successful_runs: int

    # Corruption metrics
    field_hallucinations: int      # Fields not in source schema appeared in output
    data_loss_events: int          # Fields from source schema missing in output
    type_corruptions: int          # Field present but wrong type
    schema_drift_events: int       # Field name changed (e.g., snake_case → camelCase)
    routing_inconsistencies: int   # Different tool sequence on different runs

    # Aggregate
    corruption_rate: float         # corrupted_runs / total_runs
    determinism_rate: float        # identical_results / total_runs
    data_integrity_score: float    # 0.0–1.0 composite score
```
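The composite score needs a concrete definition so runs are comparable. One plausible formulation, an assumption rather than a decided formula: the fraction of field-passing opportunities that survived every boundary intact.

```python
def data_integrity_score(report: CorrectnessReport, total_field_passes: int) -> float:
    """
    Sketch of one possible composite. total_field_passes is the number of fields
    crossing a step boundary, summed over all runs.
    """
    corruption_events = (
        report.field_hallucinations
        + report.data_loss_events
        + report.type_corruptions
        + report.schema_drift_events
        + report.routing_inconsistencies
    )
    if total_field_passes == 0:
        return 1.0
    return max(0.0, 1.0 - corruption_events / total_field_passes)
```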
3. Simulated LLM corruption model
The benchmark needs a realistic LLM simulation that introduces corruption at configurable rates:
```python
from dataclasses import dataclass

@dataclass
class LLMCorruptionProfile:
    """
    Configurable corruption rates for simulating LLM data routing.
    Rates are based on empirical observations from LLM tool-calling benchmarks.
    """
    hallucination_rate: float = 0.05     # 5% chance of adding a fabricated field
    data_loss_rate: float = 0.10         # 10% chance of dropping a field
    type_corruption_rate: float = 0.03   # 3% chance of type change
    schema_drift_rate: float = 0.02      # 2% chance of field name change
    routing_variance_rate: float = 0.08  # 8% chance of choosing a different next tool
```
These rates should be informed by published LLM tool-calling benchmarks where available, with documented sources.
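To make the simulation concrete, here is one way the profile could be applied at each step boundary. The function name and structure are assumptions; a full implementation would also vary routing using `routing_variance_rate`, and would take a seeded `random.Random` so runs stay reproducible.

```python
import random

def apply_corruption(payload: dict, profile: LLMCorruptionProfile,
                     rng: random.Random) -> dict:
    """Sketch: corrupt one step's output the way a routing LLM might."""
    out = dict(payload)
    for key in list(out):
        if rng.random() < profile.data_loss_rate:
            del out[key]                                   # data loss
        elif rng.random() < profile.type_corruption_rate:
            out[key] = str(out[key])                       # crude type corruption
        elif rng.random() < profile.schema_drift_rate:
            parts = key.split("_")
            camel = parts[0] + "".join(p.title() for p in parts[1:])
            out[camel] = out.pop(key)                      # snake_case to camelCase
    if rng.random() < profile.hallucination_rate:
        out["fabricated_field"] = "premium"                # hallucinated field
    return out
```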
4. Output format
```text
ChainWeaver Correctness Benchmark
==================================

Chain: fetch → validate → transform → format (4 steps)
Runs:  100

                        Naive (LLM)    Compiled (ChainWeaver)
─────────────────────────────────────────────────────────────
Successful runs:        87/100         100/100
Field hallucinations:   12             0
Data loss events:       23             0
Type corruptions:       5              0
Schema drift:           3              0
Routing inconsistent:   8              0 (N/A — deterministic)
─────────────────────────────────────────────────────────────
Corruption rate:        13.0%          0.0%
Determinism rate:       84.0%          100.0%
Data integrity score:   0.87           1.00

Verdict: Compiled flows eliminated 100% of intermediate data corruption.
```
5. Parameterized test scenarios
| Scenario | Chain | Why it matters |
|---|---|---|
| Numeric pipeline | double → add → format | Type corruption (int vs float vs string) |
| Data enrichment | fetch → enrich → validate → store | Field hallucination, data loss |
| Multi-schema | search → extract → translate → summarize | Schema drift between different domain schemas |
| Long chain | 10-step pipeline | Corruption compounds — 5% per step ≈ 40% at step 10 |
| Branching | fetch → (validate OR retry) → store | Routing inconsistency |
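These scenarios lend themselves to a single data-driven harness. A sketch of how they might be declared in `bench_correctness.py` (names and step labels mirror the table; the exact structure is an assumption):

```python
# Hypothetical scenario registry driving both benchmark entry points.
SCENARIOS = [
    {"name": "numeric_pipeline", "steps": ["double", "add", "format"]},
    {"name": "data_enrichment",  "steps": ["fetch", "enrich", "validate", "store"]},
    {"name": "multi_schema",     "steps": ["search", "extract", "translate", "summarize"]},
    {"name": "long_chain",       "steps": [f"step_{i}" for i in range(1, 11)]},
    {"name": "branching",        "steps": ["fetch", "validate|retry", "store"]},
]

for scenario in SCENARIOS:
    print(scenario["name"], "→", " → ".join(scenario["steps"]))
```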
Relevant Code Locations
- `benchmarks/bench_naive_vs_compiled.py` (Create benchmark suite: naive chaining vs compiled flow execution #29)
- `FlowExecutor` → `chainweaver/executor.py`
- `Tool` schemas → `chainweaver/tools.py`
Acceptance Criteria
- `benchmarks/bench_correctness.py` exists and runs standalone
- `CorrectnessReport` captures all 5 corruption types + aggregate scores
- `LLMCorruptionProfile` allows configurable corruption rates
- `benchmarks/README.md` updated with correctness benchmark instructions
Out of Scope
- Real LLM API calls (simulation only — consistent, reproducible)
- Statistical significance testing (keep it simple: 100 runs is sufficient for demonstration)
- Comparison against specific LLM providers (framework-agnostic)
- Automated regression tracking in CI
Dependencies
- `benchmarks/` directory and output format conventions
- `FlowExecutor` and `Tool` classes
Notes
- The key insight: corruption compounds across steps. A 5% hallucination rate per step means ~23% chance of at least one hallucination in a 5-step chain. This makes the correctness argument stronger for longer chains — exactly where ChainWeaver's value is highest.
- Corruption rate compounded over chain length, i.e. 1 - (1 - p)^n rather than simply p × n, gives the "compound corruption curve"; this should be a chart in the README (see the sketch after this list).
- The compiled flow's corruption rate is provably 0% because Pydantic schema validation rejects malformed data at every boundary. This is a mathematical guarantee, not an empirical observation.
- Consider referencing published LLM tool-calling benchmarks (Berkeley Function Calling Leaderboard, etc.) to ground the default corruption rates in real data.
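A minimal sketch for generating the compound-corruption chart data (pure arithmetic, no ChainWeaver dependencies; the per-step rate is the illustrative 5% figure used above):

```python
# Compound corruption curve: P(at least one corruption in n steps) = 1 - (1 - p)^n
per_step_rate = 0.05  # illustrative 5% per-step corruption probability

for n in range(1, 11):
    compounded = 1 - (1 - per_step_rate) ** n
    print(f"{n:2d} steps: {compounded:.1%}")
# 5 steps ≈ 22.6%, 10 steps ≈ 40.1%, matching the figures cited above
```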