EPOCH-Bench

License: MIT · Python 3.10+

Evaluation of Planning Over Causal Horizons

A benchmark for evaluating LLM agentic planning capabilities on PDDL time-travel puzzles with causal propagation, inspired by Day of the Tentacle.

git clone https://github.com/hey-intent/epoch-bench.git && cd epoch-bench
pip install -e .
cp .env-example .env  # add your OpenRouter key
python cli/test_llm.py --model anthropic/claude-opus-4.6 --runs 1

EPOCH-Bench measures whether LLMs can act as goal-directed agents in a formal world with strict constraints, negative feedback, and temporal causality.


What it tests

Six progressively harder levels, from simple spatial navigation (L01: send an object through time in 4 steps) to multi-epoch synchronization with temporal decay (L06: coordinate 3 characters across past/present/future with timing constraints in 25+ steps).

Each action is validated by a deterministic PDDL engine. The agent receives structured feedback (actionable hints on failed preconditions, causal propagation events, state changes) and must adapt its plan in real time. The benchmark discriminates explicitly between format compliance, state tracking, causal understanding, and recovery after errors.

The puzzles are original creations inspired by the game, not reproductions of existing content.


Quick Start

Installation

git clone https://github.com/hey-intent/epoch-bench.git
cd epoch-bench
pip install -e .

Create an OpenRouter API key at https://openrouter.ai/settings/keys

cp .env-example .env   # then edit with your OpenRouter key

Run benchmark

# Run all 6 levels, 1 run each (using .env OPEN_ROUTER_MODEL)
python cli/test_llm.py

# Run specific levels with 10 runs each
python cli/test_llm.py --levels 1-3 --runs 10

# Use a specific model
python cli/test_llm.py --model google/gemini-3-flash-preview --runs 10

Batch execution

Edit the model list in cli/benchmark.py, or pass models via CLI:

# Run specific models and levels
python cli/benchmark.py --models anthropic/claude-opus-4.6 openai/gpt-5.2 --levels 1-3 --runs 5

# Run all configured models, all 6 levels, 5 runs each
python cli/benchmark.py --runs 5

Cost warning: A full benchmark run (13 models x 6 levels x 5 runs = 390 runs) costs approximately $30-40 on OpenRouter. L06 alone can consume 100k+ input tokens per run on some models. Start with --levels 1-2 --runs 1 to validate your setup before committing to a full run.

Verify world model (no API key needed)

python cli/test_world_model.py      # Unit tests
python cli/manual_solutions.py      # Verify optimal solutions

Code quality check

ruff check .

Design Decisions

Why PDDL?

PDDL provides a formal, deterministic, and reproducible framework. State transitions are mathematically verifiable. Every action either satisfies its preconditions or it doesn't.
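That deterministic check can be sketched in a few lines. The predicate strings below are illustrative (the `has(laverne, safe-key)` form appears in the benchmark's feedback messages; `at(laverne, office)` is a made-up example), not the engine's actual representation:

```python
def check_preconditions(required, state):
    """Return the preconditions missing from the current state (empty = applicable)."""
    return sorted(set(required) - set(state))

# open-bureau needs the key and co-location; only co-location holds:
missing = check_preconditions(
    {"has(laverne, safe-key)", "at(laverne, office)"},
    {"at(laverne, office)"},
)
# missing == ["has(laverne, safe-key)"]
```

An empty result means the action is applicable; a non-empty result is exactly the material for an actionable hint.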

Why Day of the Tentacle?

The game naturally provides a multi-epoch structure (past, present, future) with temporal causality (plant a tree in the past -> tree exists in the future). This allows testing causal understanding and multi-agent coordination without requiring real-world knowledge.

Why OpenRouter?

A benchmark comparing 10+ models across 6 providers needs a single API surface. OpenRouter provides exactly that: one endpoint, one auth token, unified tool calling format. Without it, the benchmark would require 6 separate SDKs, 6 auth flows, and 6 different tool calling implementations, all for infrastructure plumbing that adds zero scientific value.

The trade-off is real: OpenRouter adds a proxy hop (latency), and provider-specific features (Anthropic's extended thinking, Gemini's grounding) are not accessible. But for a benchmark measuring planning capability through standardized tool calls, consistency across models matters more than provider-specific optimizations. Every model gets the same prompt, the same tool schema, the same feedback loop.

Adding a new model to the benchmark is a one-line change in the config: no new SDK, no new auth, no new parsing logic.

Wrapper Pattern & Fixed-Point Propagation

TemporalWorldModel wraps WorldModel to add causal propagation without modifying the base class. The SyncEngine applies propagation rules via fixed-point iteration, automatically handling chain reactions (e.g., tree-planted -> tree-exists future -> unlocked park-future -> safe-key accessible).
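A minimal sketch of fixed-point propagation, assuming rules are simple (trigger, effect) pairs rather than the repository's actual rule objects:

```python
def propagate(facts, rules):
    """Apply (trigger, effect) rules until no new fact appears (fixed point)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for trigger, effect in rules:
            if trigger in facts and effect not in facts:
                facts.add(effect)
                changed = True  # a new effect may enable further rules
    return facts

# Chain reaction from the example above:
rules = [
    ("tree-planted", "tree-exists-future"),
    ("tree-exists-future", "unlocked-park-future"),
]
facts = propagate({"tree-planted"}, rules)
# facts now also contains tree-exists-future and unlocked-park-future
```

The outer loop re-scans the rules after every change, which is what lets a single planted tree cascade into an unlocked park without any rule-ordering assumptions.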

Conversation Pruning

Sliding window of N turns sent to the LLM API (configurable, default 10), while preserving the full history for export. Trade-off between available context and token consumption.
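A sliding-window pruner of this kind might look as follows (function and field names are assumptions, not the repository's API):

```python
def prune_history(messages, window=10):
    """Return the system prompt plus the last `window` non-system turns.
    The caller keeps the full `messages` list untouched for export."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-window:]
```

The default of 10 matches the documented window; only the pruned view is sent to the API, so token cost stays bounded while the exported trace remains complete.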

Tool Calling over Text Parsing

PDDL actions are exposed as OpenAI-compatible tools. This eliminates parsing ambiguity and directly tests the LLM's ability to use tools correctly, a fundamental agentic competency.
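For illustration, a PDDL `move` action exposed as an OpenAI-compatible tool could look like the sketch below; the action name, parameter names, and character list are hypothetical, not the benchmark's actual tool definitions:

```python
# Hypothetical tool definition in the OpenAI-compatible format OpenRouter accepts.
move_tool = {
    "type": "function",
    "function": {
        "name": "move",
        "description": "Move a character to an adjacent location in its epoch.",
        "parameters": {
            "type": "object",
            "properties": {
                "character": {"type": "string"},
                "destination": {"type": "string"},
            },
            "required": ["character", "destination"],
        },
    },
}
```

Because arguments arrive as typed JSON rather than free text, a malformed call is a clean, countable format error instead of a parsing guess.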


Three Levels of Knowledge Tested

The benchmark tests three distinct levels of agentic competency:

Level 1: Macro-Causality (EXPLICIT)

The model receives in the prompt:

  • Action signatures with typed parameters
  • Complete current state (all predicates)
  • Goal predicates
  • Explicit causal effects ("plant-tree -> tree-exists future")

Test: can the model follow explicit rules?

Level 2: Micro-Causality / Gating (FEEDBACK-DRIVEN)

The model discovers through feedback:

  • PDDL preconditions: PRECONDITION_FAILED: has(laverne, safe-key) is FALSE
  • Ordering constraints: form-beta-destroyed is FALSE -> use destroy action
  • Gating conditions: bureau-open is FALSE -> use open-bureau with correct key

Test: does the model integrate negative feedback to reorder its plan?

Level 3: Resource Management (IMPLICIT, NO FEEDBACK)

No preventive feedback. Suboptimal actions are not blocked:

  • Dropping an item in a location where no other character can reach it
  • Sending an item to the wrong character, requiring extra transfer steps
  • Moving to dead-end locations that waste steps from the limited budget

The engine does not warn against wasteful actions: they are technically valid.

Test: does the model plan ahead to avoid wasting resources (steps, items, positioning)?


The 6 Agentic Metrics

All metrics are extracted automatically from logs, with no human annotation.

The benchmark separates two distinct failure modes that earlier versions conflated:

LLM turn  ──>  Tool call produced?  ──yes──>  Preconditions met?  ──yes──>  World valid step
                   │ no                              │ no
                   v                                   v
             FORMAT_FAILURE                    PRECONDITION_FAILED
           (tool_call_validity_rate)         (world_action_accuracy)

Tool Call Validity Rate

tool_calls_total = total_steps - control_signals - api_errors
tool_calls_ok    = tool_calls_total - format_errors

Tool Call Validity Rate = tool_calls_ok / tool_calls_total

Does the model understand it is a tool-using agent, not a chatbot? A format error means the model produced plain text, called an unknown tool, or sent malformed arguments. The turn never reached the PDDL engine.

  • ~1.0 → agent-compatible
  • < 0.5 → unusable in an action loop

World Action Accuracy

World Action Accuracy = world_valid_steps / tool_calls_ok

Among syntactically valid tool calls, how many passed PDDL precondition checks? This measures whether the model maintains a correct mental model of the world state.

The two metrics are complementary: a model can have perfect tool call validity (always calls a tool) but low world accuracy (calls the wrong actions). Conversely, a model that barely tool-calls has low validity but its rare tool calls may be world-accurate.
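The two metrics can be derived together from the raw per-run counters; this is a sketch (counter names follow the CSV schema, the function itself is illustrative):

```python
def agent_metrics(total_steps, control_signals, api_errors,
                  format_errors, world_valid_steps):
    """Derive the two complementary format/world metrics from raw counters."""
    tool_calls_total = total_steps - control_signals - api_errors
    tool_calls_ok = tool_calls_total - format_errors
    return {
        "tool_call_validity_rate": tool_calls_ok / tool_calls_total,
        "world_action_accuracy": world_valid_steps / tool_calls_ok,
    }

# 50 turns, 2 control signals, 1 API error, 4 format errors, 38 world-valid steps:
metrics = agent_metrics(50, 2, 1, 4, 38)  # validity 43/47, accuracy 38/43
```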

Causal Progress

Causal Progress = milestones_reached / total_milestones

A causal milestone is an irreversible structuring state (tree-planted, permit-created, amendment-in-constitution). Do the actions causally advance the world?

Causal Efficiency

Causal Efficiency = milestones_reached / world_valid_steps

Does the model advance the world, or explore without converging?

Complementarity with Causal Progress:

| Progress | Efficiency | Observable pattern |
|---|---|---|
| High | High | Few valid actions, most trigger milestones |
| High | Low | Many valid actions needed to reach milestones |
| Low | High | Few valid actions, none wasted, but run stopped early |
| Low | Low | Many valid actions, none trigger milestones |

Recovery Rate (Streak-Based)

Recovery Rate = recovered_streaks / total_invalid_streaks

Where recovered_streak = block of consecutive errors followed by a DIFFERENT valid action.

Can the model unblock itself after a series of errors?
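A streak-based recovery computation might be implemented as follows; the record format and the convention of returning 1.0 when no streaks occur are assumptions of this sketch:

```python
def recovery_rate(steps):
    """steps: ordered list of {'valid': bool, 'action': str} records.
    A streak of consecutive invalid steps counts as recovered when the
    next valid action DIFFERS from the last failed one."""
    streaks = recovered = 0
    i = 0
    while i < len(steps):
        if not steps[i]["valid"]:
            j = i
            while j < len(steps) and not steps[j]["valid"]:
                j += 1
            streaks += 1
            if j < len(steps) and steps[j]["action"] != steps[j - 1]["action"]:
                recovered += 1
            i = j  # resume after the streak
        i += 1
    return recovered / streaks if streaks else 1.0
```

A run that ends mid-streak therefore counts that streak as unrecovered, which matches the intuition that the model never unblocked itself.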

Effort Metrics

Three complementary metrics:

Steps-to-Solve (total)  = total LLM turns until goal
Plan Length (valid-only) = world_valid_steps until goal
Error Overhead           = total - plan_length
Overhead Ratio           = total / plan_length

Overhead Ratio = 1.0 means zero errors. Higher values indicate proportionally more total steps relative to useful steps.
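All four quantities derive from just two raw counters, e.g.:

```python
def effort_metrics(total_steps, world_valid_steps):
    """Effort metrics for a solved run, from two raw counters."""
    return {
        "steps_to_solve_total": total_steps,
        "plan_length": world_valid_steps,
        "error_overhead": total_steps - world_valid_steps,
        "overhead_ratio": total_steps / world_valid_steps,
    }

# A run solved in 12 total turns with 8 valid world steps:
m = effort_metrics(12, 8)  # error_overhead = 4, overhead_ratio = 1.5
```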

Control Signals vs Errors

Control signals (STUCK, DONE) are NOT errors. They are excluded from error counts, from tool call validity, and from world action accuracy. API errors (infrastructure failures) are similarly excluded: they reflect provider issues, not model capability.


Results

13 models, 6 levels, 5 runs each = 390 runs. All models accessed via OpenRouter with standardized tool calling.

Solve Rate (mean over 5 runs)

| Model | L01 | L02 | L03 | L04 | L05 | L06 | ALL |
|---|---|---|---|---|---|---|---|
| claude-opus-4.6 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.80 | 0.97 |
| grok-4.1-fast | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 0.60 | 0.90 |
| gemini-3-flash-preview | 1.00 | 1.00 | 1.00 | 1.00 | 0.80 | 0.40 | 0.87 |
| kimi-k2.5 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.20 | 0.87 |
| gpt-5.2 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.83 |
| deepseek-v3.2 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 0.00 | 0.80 |
| gemini-2.5-pro | 0.80 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.80 |
| claude-haiku-4.5 | 1.00 | 1.00 | 0.60 | 1.00 | 1.00 | 0.00 | 0.77 |
| devstral-2512 | 1.00 | 0.60 | 1.00 | 1.00 | 0.80 | 0.00 | 0.73 |
| mistral-large-2512 | 1.00 | 0.40 | 0.80 | 1.00 | 0.60 | 0.00 | 0.63 |
| qwen3-coder-next | 1.00 | 1.00 | 0.00 | 1.00 | 0.20 | 0.00 | 0.53 |
| qwen3.5-plus-02-15 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 |
| llama-4-scout | 0.80 | 0.60 | 0.00 | 0.20 | 0.00 | 0.00 | 0.27 |

L06 (The Three Body Problem) is the main discriminant: only 4 models ever solve it, and only claude-opus-4.6 reaches 80%.

Per-Model Summary

| Model | Runs | SR | Time (avg) | In (avg) | Out (avg) |
|---|---|---|---|---|---|
| claude-opus-4.6 | 30 | 97% | 43.1s | 64k | 2k |
| grok-4.1-fast | 30 | 90% | 102.9s | 50k | 12k |
| gemini-3-flash-preview | 30 | 87% | 27.5s | 52k | 1k |
| kimi-k2.5 | 30 | 87% | 116.0s | 46k | 2k |
| gpt-5.2 | 30 | 83% | 52.8s | 39k | 2k |
| deepseek-v3.2 | 30 | 80% | 138.2s | 78k | 2k |
| gemini-2.5-pro | 30 | 80% | 39.4s | 32k | 2k |
| claude-haiku-4.5 | 30 | 77% | 29.2s | 74k | 2k |
| devstral-2512 | 30 | 73% | 12.1s | 69k | 573 |
| mistral-large-2512 | 30 | 63% | 27.5s | 81k | 1k |
| qwen3-coder-next | 30 | 53% | 35.9s | 108k | 2k |
| qwen3.5-plus-02-15 | 30 | 33% | 17.9s | 26k | 769 |
| llama-4-scout | 30 | 27% | 11.2s | 77k | 1k |

Figures

Pipeline Fingerprint

Run outcome distribution per model, sorted by SR. Dashed line = average total tokens per run. Green = solved, blue = temporal decay, yellow = max steps, red = max invalid streak.

Full traces (JSON + MD per run): results/raw/


Architecture

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐  │
│   │  State  │───>│ Planner │───>│ Execute │───>│Propagate│  │
│   │  (PDDL) │    │  (LLM)  │    │ Action  │    │ (Sync)  │  │
│   └─────────┘    └─────────┘    └─────────┘    └─────────┘  │
│        ^                                            │        │
│        └────────────────────────────────────────────┘        │
│                    Repeat until goal                         │
└──────────────────────────────────────────────────────────────┘
  1. The LLM receives the current state (grouped predicates), goal predicates, action history with results, and recent causal trigger logs.
  2. The LLM selects an action via tool calling (each PDDL action is a tool with typed parameters).
  3. The WorldModel validates preconditions and applies PDDL effects.
  4. The SyncEngine applies causal propagation (fixed-point iteration).
  5. Feedback (valid/invalid, triggered rules, state changes) is added to the history.
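The five steps above can be sketched as a loop; the class and method names here (WorldModel-style `apply`, `propagate`, etc.) are simplified assumptions, not the repository's exact API:

```python
def run_episode(world, sync, planner, max_steps=50):
    """One benchmark episode: plan, execute, propagate, feed back, repeat."""
    history = []
    for _ in range(max_steps):
        prompt = world.describe_state()                  # 1. grouped predicates + goal
        action = planner.select_action(prompt, history)  # 2. tool call from the LLM
        result = world.apply(action)                     # 3. precondition check + effects
        if result.valid:
            result.events = sync.propagate(world)        # 4. fixed-point causal propagation
        history.append((action, result))                 # 5. feedback for the next turn
        if world.goal_reached():
            return history
    return history
```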

| Component | File | Description |
|---|---|---|
| WorldModel | src/core/world_model.py | Parses PDDL, tracks state, validates preconditions |
| SyncEngine | src/core/sync_engine.py | Propagates causal effects across epochs (fixed-point) |
| Tracing | src/core/tracing.py | Run logging, step logs, stop reason tracking |
| Scoring | src/core/scoring.py | Multi-level metrics: tool call validity, world action accuracy, causal efficiency, recovery rate, effort |
| LLMPlanner | src/llm/planner.py | Queries LLM via OpenRouter (tool calling) |
| LLMRunner | src/llm/runner.py | Orchestrates planning loop with 9 stop conditions |
| Prompts | src/llm/prompts.py | Converts PDDL state to natural language prompts |
| Export | src/llm/export.py | Exports run results to Markdown + JSON |
| Benchmark | src/llm/benchmark.py | Multi-model benchmark runner |

Levels

| Level | Name | Optimal | Competencies Tested |
|---|---|---|---|
| L01 | Hello ChronoJohn | 4 | Spatial navigation, cross-epoch transfer |
| L02 | The Tree | 3 | Causal propagation (past -> future) |
| L03 | The Carved Code | 8 | Multi-agent coordination, character switching |
| L04 | The Vault | 10 | Locked-resource dependencies, key management |
| L05 | The Constitutional Vacuum | 18 | Full integration of all competencies |
| L06 | The Three Body Problem | 25 | Temporal decay (5 valid-action TTL), deductive reasoning, 3-epoch sync |

Causal Propagation Rules

| Trigger | Effect | Example |
|---|---|---|
| tree-planted | tree-exists present/future | Tree grows over time |
| tree-exists future | unlocked park-future | Gate opens when tree exists |
| code-carved | stone-readable present/future | Stone persists across epochs |
| founder-convinced | amendment-in-constitution | Constitutional change |

Stop Conditions

| Condition | Description | Agent fault? |
|---|---|---|
| SOLVED | Goal reached | -- |
| MAX_STEPS | Step budget exhausted | Yes (too slow) |
| MAX_INVALID_STREAK | 5 consecutive invalid actions | Yes (lost) |
| LOOP_DETECTED | Same state visited N times | Yes (stuck in loop) |
| STAGNATION | No progress in N steps | Yes (no convergence) |
| LLM_STUCK | LLM voluntarily signals impossibility | No (lucid) |
| LLM_DONE_EARLY | LLM signals done but goal not reached | Yes (premature) |
| API_FAILURE | Repeated API/network errors | No (infrastructure) |
| TEMPORAL_DECAY | Unstable predicate expired after 5 valid actions (L06) | Yes (too slow) |

Temporal Decay Semantics (L06)

The lever-pulled predicate is unstable: it expires after DECAY_WINDOW = 5 valid world actions. The formal counting rule:

What advances the TTL clock: only actions that pass PDDL precondition checks (world_valid_steps). This is the same counter used by world_action_accuracy.

What does NOT advance the TTL clock: format errors (no tool call produced), precondition failures (invalid tool call), API errors (infrastructure), control signals (STUCK/DONE). These consume the MAX_STEPS budget but do not shorten the decay window.

Rationale: the decay window measures planning coordination ability ("can you sequence 3 lever pulls within N real moves?"), not format compliance. Format failures are already penalized by tool_call_validity_rate.

Execution order per step:

1. Execute action via WorldModel (precondition check)
2. If invalid → return early, TTL clock unchanged
3. Increment valid-action counter (current_step += 1)
4. Causal propagation (SyncEngine.propagate)
5. Track newly created unstable facts
6. Temporal decay: remove facts where age > DECAY_WINDOW

Because propagation (4) runs before decay (6), a synchronizing lever pull on the last possible step fires the sync rule (all 3 levers still in state) before the oldest lever decays. The model has exactly DECAY_WINDOW valid moves after creation.
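A sketch of this step order, with the propagation-before-decay guarantee made explicit; the method names and the `unstable` bookkeeping dict are illustrative, not the engine's actual code:

```python
DECAY_WINDOW = 5  # TTL in valid world actions

def step(world, sync, action, unstable):
    """One engine step. `unstable` maps fact -> valid-step of creation."""
    result = world.apply(action)           # 1. precondition check
    if not result.valid:
        return result                      # 2. invalid: TTL clock unchanged
    world.current_step += 1                # 3. only valid actions tick the clock
    new_facts = sync.propagate(world)      # 4. propagation runs BEFORE decay
    for f in new_facts:
        unstable.setdefault(f, world.current_step)   # 5. track new unstable facts
    for f, born in list(unstable.items()):
        if world.current_step - born > DECAY_WINDOW: # 6. age > TTL -> expire
            world.remove_fact(f)
            del unstable[f]
    return result
```

A fact created at valid-step 15 thus survives through valid-step 20 and expires at valid-step 21 (age 6 > TTL 5), matching the trace example below.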

Auditability: the state description surfaces a TTL countdown for active unstable facts (e.g., lever-pulled mainframe-past: 3 valid actions remaining), and decay messages in traces show exact valid-step numbers (created at valid-step 15, expired at valid-step 21, age=6 > TTL=5).

Observed Failure Modes

Three dominant failure patterns emerge from the 390 runs:

1. Format Failure (MAX_INVALID_STREAK) The model fails to produce valid tool calls: it outputs plain text, calls unknown tools, or sends malformed arguments (tool_call_validity_rate ≈ 0). No action ever reaches the PDDL engine. This is the exclusive failure mode for Qwen3.5-Plus (20/20 failures), and a significant contributor for Gemini 2.5 Pro on L01/L06 (4/6) and Llama-4-Scout (4/22). These models either do not engage with the planning task at all, or fail to maintain tool-calling format as complexity increases.

2. Stagnation / Budget Exhaustion (MAX_STEPS + STAGNATION) The model produces valid tool calls but fails to converge toward the goal within the step budget. It may wander through valid-but-unproductive actions, repeat similar sequences without progress, or fail to integrate feedback. This is the dominant failure mode for Llama-4-Scout (13/22 failures), Qwen3-Coder-Next (10/14), and Mistral Large 2512 (6/11). Indicates tool-use ability but insufficient planning depth or causal reasoning.

3. Temporal Coordination Failure (TEMPORAL_DECAY) Specific to L06. The model understands the sub-goals (unlock mainframes, pull levers) but fails to execute the three lever pulls within the 5-valid-action decay window. Only successful world actions count toward the TTL, failed attempts do not shorten the window. This is the dominant failure mode for top-tier models on L06: GPT-5.2 (5/5 L06 failures), DeepSeek-V3.2 (5/6), Claude Haiku 4.5 (5/7), Kimi K2.5 (4/4), Gemini 3 Flash (3/4). Even Claude Opus 4.6's single failure is a temporal decay. This failure requires understanding timing constraints that are stated in the prompt but demand tight multi-epoch coordination to satisfy.

| Failure Mode | Models Most Affected | Root Cause |
|---|---|---|
| Format failure | qwen3.5-plus, gemini-2.5-pro (L06), llama-4-scout | Cannot produce or maintain valid tool calls |
| Stagnation | llama-4-scout, qwen3-coder-next, mistral-large | Valid tool calls but no convergence toward goal |
| Temporal decay | gpt-5.2, deepseek-v3.2, claude-haiku-4.5, kimi-k2.5 | Implicit timing constraints (L06) |

Data

Directory structure

results/
├── summary/
│   ├── _results.csv                  # All runs, 37 columns, one row per run
│   └── fig_pipeline_fingerprint_overall.png
├── raw/
│   ├── claude-opus-4.6/              # One folder per model
│   │   ├── ..._level_01_..._20260220.md   # Human-readable trace
│   │   └── ..._level_01_..._20260220.json # Full machine-readable trace
│   ├── gpt-5.2/
│   └── ...
└── figs/                         # Generated plots (SR, radar, fingerprints)

Per-run output

Each run produces three outputs:

Markdown trace (.md): human-readable step-by-step execution log with system prompt, each action, validation result, causal propagation events, and final scoring.

JSON trace (.json): full machine-readable record including execution metadata (model, timestamps, call params), complete conversation history (all messages sent to/from the LLM), tool definitions, per-step token counts, and the system/user prompts used.

CSV row: one row appended to results/summary/_results.csv per run.

CSV schema (37 columns)

| Column | Description |
|---|---|
| timestamp, problem, model, run_id | Run identification |
| solved, stop_reason | Outcome (True/False) and why the run ended |
| total_steps, world_valid_steps, world_invalid_steps | Step counts |
| steps_to_solve_total, plan_length, error_overhead, overhead_ratio | Effort metrics |
| tool_calls_total, tool_calls_ok, tool_call_validity_rate | Format compliance |
| world_action_accuracy, invalid_rate, max_invalid_streak | World model accuracy |
| format_errors, precondition_errors, api_errors, control_signals | Error breakdown |
| recovery_rate, recovered_streaks, total_invalid_streaks | Recovery metrics |
| milestones_reached, milestones_total, milestone_progress, causal_efficiency | Causal progress |
| unique_states, loop_detected, stagnation_stop | Exploration metrics |
| total_time, tokens_in, tokens_out, tokens_reasoning | Cost and latency |

The published results (390 runs) are included in the repository. Running the benchmark appends new rows to the same CSV.
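For example, per-model solve rates can be aggregated from the CSV with pandas (assuming pandas is installed; column names follow the schema above):

```python
import pandas as pd

def solve_rate_by_model(df):
    """Mean solve rate per model, highest first."""
    return df.groupby("model")["solved"].mean().sort_values(ascending=False)

# Typical use on the published results:
# df = pd.read_csv("results/summary/_results.csv")
# print(solve_rate_by_model(df))
```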


Citation

@software{intent2026epochbench,
  author       = {J{\'e}z{\'e}quel, Yann},
  title        = {{EPOCH-Bench}: Evaluation of Planning Over Causal Horizons},
  year         = {2026},
  url          = {https://github.com/hey-intent/epoch-bench},
  note         = {LLM agentic planning benchmark using PDDL time-travel puzzles}
}

License

MIT, see LICENSE

Author

Yann JEZEQUEL, HeyIntent
