Evaluation of Planning Over Causal Horizons
A benchmark for evaluating LLM agentic planning capabilities on PDDL time-travel puzzles with causal propagation, inspired by Day of the Tentacle.
```bash
git clone https://github.com/hey-intent/epoch-bench.git && cd epoch-bench
pip install -e .
cp .env-example .env  # add your OpenRouter key
python cli/test_llm.py --model anthropic/claude-opus-4.6 --runs 1
```

EPOCH-Bench measures whether LLMs can act as goal-directed agents in a formal world with strict constraints, negative feedback, and temporal causality.
Six progressively harder levels, from simple spatial navigation (L01: send an object through time in 4 steps) to multi-epoch synchronization with temporal decay (L06: coordinate 3 characters across past/present/future with timing constraints in 25+ steps).
Each action is validated by a deterministic PDDL engine. The agent receives structured feedback (actionable hints on failed preconditions, causal propagation events, state changes) and must adapt its plan in real time. The benchmark discriminates explicitly between format compliance, state tracking, causal understanding, and recovery after errors.
The puzzles are original creations inspired by the game, not reproductions of existing content.
```bash
git clone https://github.com/hey-intent/epoch-bench.git
cd epoch-bench
pip install -e .
```

Create an OpenRouter API key at https://openrouter.ai/settings/keys

```bash
cp .env-example .env  # then edit with your OpenRouter key
```

```bash
# Run all 6 levels, 1 run each (using .env OPEN_ROUTER_MODEL)
python cli/test_llm.py

# Run specific levels with 10 runs each
python cli/test_llm.py --levels 1-3 --runs 10

# Use a specific model
python cli/test_llm.py --model google/gemini-3-flash-preview --runs 10
```
Edit the model list in cli/benchmark.py, or pass models via CLI:
```bash
# Run specific models and levels
python cli/benchmark.py --models anthropic/claude-opus-4.6 openai/gpt-5.2 --levels 1-3 --runs 5

# Run all configured models, all 6 levels, 5 runs each
python cli/benchmark.py --runs 5
```

Cost warning: a full benchmark run (13 models x 6 levels x 5 runs = 390 runs) costs approximately $30-40 on OpenRouter. L06 alone can consume 100k+ input tokens per run on some models. Start with `--levels 1-2 --runs 1` to validate your setup before committing to a full run.
```bash
python cli/test_world_model.py   # Unit tests
python cli/manual_solutions.py   # Verify optimal solutions
ruff check .                     # Lint
```

PDDL provides a formal, deterministic, and reproducible framework. State transitions are mathematically verifiable. Every action either satisfies its preconditions or it doesn't.
The game naturally provides a multi-epoch structure (past, present, future) with temporal causality (plant a tree in the past -> tree exists in the future). This allows testing causal understanding and multi-agent coordination without requiring real-world knowledge.
A benchmark comparing 10+ models across 6 providers needs a single API surface. OpenRouter provides exactly that: one endpoint, one auth token, unified tool calling format. Without it, the benchmark would require 6 separate SDKs, 6 auth flows, and 6 different tool calling implementations, all for infrastructure plumbing that adds zero scientific value.
The trade-off is real: OpenRouter adds a proxy hop (latency), and provider-specific features (Anthropic's extended thinking, Gemini's grounding) are not accessible. But for a benchmark measuring planning capability through standardized tool calls, consistency across models matters more than provider-specific optimizations. Every model gets the same prompt, the same tools schema, the same feedback loop.
Adding a new model to the benchmark is a one-line change in the config: no new SDK, no new auth, no new parsing logic.
TemporalWorldModel wraps WorldModel to add causal propagation without modifying the base class. The SyncEngine applies propagation rules via fixed-point iteration, automatically handling chain reactions (e.g., tree-planted -> tree-exists future -> unlocked park-future -> safe-key accessible).
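As a minimal illustration of the fixed-point idea (the `propagate` helper and the rule/fact names are hypothetical, not the repo's `SyncEngine` API): rules are re-applied until the fact set stops changing, so one call resolves the entire chain reaction.

```python
# Hypothetical sketch of fixed-point causal propagation. Rules are
# (condition, effect) pairs; we re-apply them until the fact set stops
# changing, so chain reactions resolve in a single call.
def propagate(facts: set, rules: list) -> set:
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for condition, effect in rules:
            if condition in facts and effect not in facts:
                facts.add(effect)
                changed = True
    return facts

# Illustrative rule chain mirroring the example above.
RULES = [
    ("tree-planted", "tree-exists-future"),
    ("tree-exists-future", "unlocked-park-future"),
    ("unlocked-park-future", "safe-key-accessible"),
]

result = propagate({"tree-planted"}, RULES)
```

Planting the tree alone is enough to make the safe key accessible: the loop keeps firing rules until nothing new can be derived.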
Sliding window of N turns sent to the LLM API (configurable, default 10), while preserving the full history for export. Trade-off between available context and token consumption.
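A sketch of the windowing trade-off (the message shape and the default of 10 are illustrative assumptions):

```python
# Hypothetical sketch: the full transcript is kept for export, but only
# the last WINDOW turns are sent to the API, trading context for tokens.
WINDOW = 10

def api_messages(full_history: list, window: int = WINDOW) -> list:
    return full_history[-window:]

history = [{"turn": i} for i in range(25)]  # full history, preserved for export
recent = api_messages(history)              # only this suffix reaches the LLM
```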
PDDL actions are exposed as OpenAI-compatible tools. This eliminates parsing ambiguity and directly tests the LLM's ability to use tools correctly, a fundamental agentic competency.
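For illustration, one PDDL action exposed as an OpenAI-compatible tool might look like the following (the action name, parameters, and character names are hypothetical, not the benchmark's exact schema):

```python
# Hypothetical tool definition for a single PDDL action; typed parameters
# remove any parsing ambiguity from the model's action choice.
MOVE_TOOL = {
    "type": "function",
    "function": {
        "name": "move",
        "description": "Move a character to an adjacent location in its epoch.",
        "parameters": {
            "type": "object",
            "properties": {
                "character": {"type": "string",
                              "enum": ["laverne", "hoagie", "bernard"]},
                "destination": {"type": "string"},
            },
            "required": ["character", "destination"],
        },
    },
}
```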
The benchmark tests three distinct levels of agentic competency:
The model receives in the prompt:
- Action signatures with typed parameters
- Complete current state (all predicates)
- Goal predicates
- Explicit causal effects ("plant-tree -> tree-exists future")
Test: can the model follow explicit rules?
The model discovers through feedback:
- PDDL preconditions: `PRECONDITION_FAILED: has(laverne, safe-key) is FALSE`
- Ordering constraints: `form-beta-destroyed is FALSE -> use destroy action`
- Gating conditions: `bureau-open is FALSE -> use open-bureau with correct key`
Test: does the model integrate negative feedback to reorder its plan?
No preventive feedback. Suboptimal actions are not blocked:
- Dropping an item in a location where no other character can reach it
- Sending an item to the wrong character, requiring extra transfer steps
- Moving to dead-end locations that waste steps toward the budget limit
The engine does not warn against wasteful actions: they are technically valid.
Test: does the model plan ahead to avoid wasting resources (steps, items, positioning)?
All metrics are extracted automatically from logs, with no human annotation.
The benchmark separates two distinct failure modes that earlier versions conflated:
```
LLM turn ──> Tool call produced? ──yes──> Preconditions met? ──yes──> World valid step
                   │ no                         │ no
                   v                            v
            FORMAT_FAILURE             PRECONDITION_FAILED
      (tool_call_validity_rate)       (world_action_accuracy)
```
tool_calls_total = total_steps - control_signals - api_errors
tool_calls_ok = tool_calls_total - format_errors
Tool Call Validity Rate = tool_calls_ok / tool_calls_total
Does the model understand it is a tool-using agent, not a chatbot? A format error means the model produced plain text, called an unknown tool, or sent malformed arguments. The turn never reached the PDDL engine.
- ~1.0 → agent-compatible
- < 0.5 → unusable in an action loop
World Action Accuracy = world_valid_steps / tool_calls_ok
Among syntactically valid tool calls, how many passed PDDL precondition checks? This measures whether the model maintains a correct mental model of the world state.
The two metrics are complementary: a model can have perfect tool call validity (always calls a tool) but low world accuracy (calls the wrong actions). Conversely, a model that barely tool-calls has low validity but its rare tool calls may be world-accurate.
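In code, both metrics fall out of the same raw counters (a hypothetical helper following the formulas above; the step counts in the example are made up):

```python
# Hypothetical computation of both metrics from raw run counters.
def format_and_world_metrics(total_steps, control_signals, api_errors,
                             format_errors, world_valid_steps):
    tool_calls_total = total_steps - control_signals - api_errors
    tool_calls_ok = tool_calls_total - format_errors
    validity = tool_calls_ok / tool_calls_total   # format compliance
    accuracy = world_valid_steps / tool_calls_ok  # world-model correctness
    return validity, accuracy

# e.g. 40 turns, 2 control signals, 0 API errors, 4 format errors,
# 17 steps passing PDDL precondition checks
validity, accuracy = format_and_world_metrics(40, 2, 0, 4, 17)
```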
Causal Progress = milestones_reached / total_milestones
A causal milestone is an irreversible structuring state (tree-planted, permit-created, amendment-in-constitution). Do the actions causally advance the world?
Causal Efficiency = milestones_reached / world_valid_steps
Does the model advance the world, or explore without converging?
Complementarity with Causal Progress:
| Progress | Efficiency | Observable pattern |
|---|---|---|
| High | High | Few valid actions, most trigger milestones |
| High | Low | Many valid actions needed to reach milestones |
| Low | High | Few valid actions, none wasted, but run stopped early |
| Low | Low | Many valid actions, none trigger milestones |
Recovery Rate = recovered_streaks / total_invalid_streaks
Where recovered_streak = block of consecutive errors followed by a DIFFERENT valid action.
Can the model unblock itself after a series of errors?
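A hypothetical implementation of the streak logic (the `(action, valid)` step encoding is an assumption; the repo's scoring code may differ):

```python
# Hypothetical recovery-rate computation: scan step outcomes, count blocks
# of consecutive invalid steps, and mark a block recovered if the next
# valid action DIFFERS from the last failed one.
def recovery_rate(steps: list) -> float:
    """steps = [(action_name, valid), ...]"""
    streaks = recovered = 0
    i = 0
    while i < len(steps):
        if not steps[i][1]:
            j = i
            while j < len(steps) and not steps[j][1]:
                j += 1                      # end of the invalid streak
            streaks += 1
            if j < len(steps) and steps[j][0] != steps[j - 1][0]:
                recovered += 1              # unblocked with a different action
            i = j
        else:
            i += 1
    return recovered / streaks if streaks else 1.0

# Two streaks: [open, open] -> unlock (recovered), trailing [open] (not).
demo = [("move", True), ("open", False), ("open", False),
        ("unlock", True), ("open", False)]
rate = recovery_rate(demo)
```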
Three complementary metrics:
Steps-to-Solve (total) = total LLM turns until goal
Plan Length (valid-only) = world_valid_steps until goal
Error Overhead = total - plan_length
Overhead Ratio = total / plan_length
Overhead Ratio = 1.0 means zero errors. Higher values indicate proportionally more total steps relative to useful steps.
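A worked example with illustrative numbers: a run that solves in 24 total turns using 16 valid world steps.

```python
# Illustrative effort-metric arithmetic for one hypothetical run.
total_turns = 24      # Steps-to-Solve (total)
plan_length = 16      # world_valid_steps until goal
error_overhead = total_turns - plan_length   # turns wasted on errors
overhead_ratio = total_turns / plan_length   # 1.0 would mean zero errors
```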
Control signals (STUCK, DONE) are NOT errors. They are excluded from error counts, from tool call validity, and from world action accuracy. API errors (infrastructure failures) are similarly excluded: they reflect provider issues, not model capability.
13 models, 6 levels, 5 runs each = 390 runs. All models accessed via OpenRouter with standardized tool calling.
| Model | L01 | L02 | L03 | L04 | L05 | L06 | ALL |
|---|---|---|---|---|---|---|---|
| claude-opus-4.6 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.80 | 0.97 |
| grok-4.1-fast | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 0.60 | 0.90 |
| gemini-3-flash-preview | 1.00 | 1.00 | 1.00 | 1.00 | 0.80 | 0.40 | 0.87 |
| kimi-k2.5 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.20 | 0.87 |
| gpt-5.2 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.83 |
| deepseek-v3.2 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 0.00 | 0.80 |
| gemini-2.5-pro | 0.80 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.80 |
| claude-haiku-4.5 | 1.00 | 1.00 | 0.60 | 1.00 | 1.00 | 0.00 | 0.77 |
| devstral-2512 | 1.00 | 0.60 | 1.00 | 1.00 | 0.80 | 0.00 | 0.73 |
| mistral-large-2512 | 1.00 | 0.40 | 0.80 | 1.00 | 0.60 | 0.00 | 0.63 |
| qwen3-coder-next | 1.00 | 1.00 | 0.00 | 1.00 | 0.20 | 0.00 | 0.53 |
| qwen3.5-plus-02-15 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 |
| llama-4-scout | 0.80 | 0.60 | 0.00 | 0.20 | 0.00 | 0.00 | 0.27 |
L06 (The Three Body Problem) is the main discriminant: only 4 models ever solve it, and only claude-opus-4.6 reaches 80%.
| Model | Runs | SR | Time(avg) | In(avg) | Out(avg) |
|---|---|---|---|---|---|
| claude-opus-4.6 | 30 | 97% | 43.1s | 64k | 2k |
| grok-4.1-fast | 30 | 90% | 102.9s | 50k | 12k |
| gemini-3-flash-preview | 30 | 87% | 27.5s | 52k | 1k |
| kimi-k2.5 | 30 | 87% | 116.0s | 46k | 2k |
| gpt-5.2 | 30 | 83% | 52.8s | 39k | 2k |
| deepseek-v3.2 | 30 | 80% | 138.2s | 78k | 2k |
| gemini-2.5-pro | 30 | 80% | 39.4s | 32k | 2k |
| claude-haiku-4.5 | 30 | 77% | 29.2s | 74k | 2k |
| devstral-2512 | 30 | 73% | 12.1s | 69k | 573 |
| mistral-large-2512 | 30 | 63% | 27.5s | 81k | 1k |
| qwen3-coder-next | 30 | 53% | 35.9s | 108k | 2k |
| qwen3.5-plus-02-15 | 30 | 33% | 17.9s | 26k | 769 |
| llama-4-scout | 30 | 27% | 11.2s | 77k | 1k |
Run outcome distribution per model, sorted by SR. Dashed line = average total tokens per run. Green = solved, blue = temporal decay, yellow = max steps, red = max invalid streak.
Full traces (JSON + MD per run): results/raw/
┌──────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ State │───>│ Planner │───>│ Execute │───>│Propagate│ │
│ │ (PDDL) │ │ (LLM) │ │ Action │ │ (Sync) │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ ^ │ │
│ └────────────────────────────────────────────┘ │
│ Repeat until goal │
└──────────────────────────────────────────────────────────────┘
- The LLM receives the current state (grouped predicates), goal predicates, action history with results, and recent causal trigger logs.
- The LLM selects an action via tool calling (each PDDL action is a tool with typed parameters).
- The WorldModel validates preconditions and applies PDDL effects.
- The SyncEngine applies causal propagation (fixed-point iteration).
- Feedback (valid/invalid, triggered rules, state changes) is added to the history.
| Component | File | Description |
|---|---|---|
| WorldModel | `src/core/world_model.py` | Parses PDDL, tracks state, validates preconditions |
| SyncEngine | `src/core/sync_engine.py` | Propagates causal effects across epochs (fixed-point) |
| Tracing | `src/core/tracing.py` | Run logging, step logs, stop reason tracking |
| Scoring | `src/core/scoring.py` | Multi-level metrics: tool call validity, world action accuracy, causal efficiency, recovery rate, effort |
| LLMPlanner | `src/llm/planner.py` | Queries LLM via OpenRouter (tool calling) |
| LLMRunner | `src/llm/runner.py` | Orchestrates planning loop with 9 stop conditions |
| Prompts | `src/llm/prompts.py` | Converts PDDL state to natural language prompts |
| Export | `src/llm/export.py` | Exports run results to Markdown + JSON |
| Benchmark | `src/llm/benchmark.py` | Multi-model benchmark runner |
| Level | Name | Optimal | Competencies Tested |
|---|---|---|---|
| L01 | Hello ChronoJohn | 4 | Spatial navigation, cross-epoch transfer |
| L02 | The Tree | 3 | Causal propagation (past -> future) |
| L03 | The Carved Code | 8 | Multi-agent coordination, character switching |
| L04 | The Vault | 10 | Locked-resource dependencies, key management |
| L05 | The Constitutional Vacuum | 18 | Full integration of all competencies |
| L06 | The Three Body Problem | 25 | Temporal decay (5 valid-action TTL), deductive reasoning, 3-epoch sync |
| Trigger | Effect | Example |
|---|---|---|
| `tree-planted` | `tree-exists` present/future | Tree grows over time |
| `tree-exists` future | `unlocked park-future` | Gate opens when tree exists |
| `code-carved` | `stone-readable` present/future | Stone persists across epochs |
| `founder-convinced` | `amendment-in-constitution` | Constitutional change |
| Condition | Description | Agent fault? |
|---|---|---|
| `SOLVED` | Goal reached | -- |
| `MAX_STEPS` | Step budget exhausted | Yes (too slow) |
| `MAX_INVALID_STREAK` | 5 consecutive invalid actions | Yes (lost) |
| `LOOP_DETECTED` | Same state visited N times | Yes (stuck in loop) |
| `STAGNATION` | No progress in N steps | Yes (no convergence) |
| `LLM_STUCK` | LLM voluntarily signals impossibility | No (lucid) |
| `LLM_DONE_EARLY` | LLM signals done but goal not reached | Yes (premature) |
| `API_FAILURE` | Repeated API/network errors | No (infrastructure) |
| `TEMPORAL_DECAY` | Unstable predicate expired after 5 valid actions (L06) | Yes (too slow) |
The `lever-pulled` predicate is unstable: it expires after `DECAY_WINDOW = 5` valid world actions. The formal counting rule:
What advances the TTL clock: only actions that pass PDDL precondition checks (world_valid_steps). This is the same counter used by world_action_accuracy.
What does NOT advance the TTL clock: format errors (no tool call produced), precondition failures (invalid tool call), API errors (infrastructure), control signals (STUCK/DONE). These consume the MAX_STEPS budget but do not shorten the decay window.
Rationale: the decay window measures planning coordination ability ("can you sequence 3 lever pulls within N real moves?"), not format compliance. Format failures are already penalized by tool_call_validity_rate.
Execution order per step:
1. Execute action via WorldModel (precondition check)
2. If invalid → return early, TTL clock unchanged
3. Increment valid-action counter (current_step += 1)
4. Causal propagation (SyncEngine.propagate)
5. Track newly created unstable facts
6. Temporal decay: remove facts where age > DECAY_WINDOW
Because propagation (4) runs before decay (6), a synchronizing lever pull on the last possible step fires the sync rule (all 3 levers still in state) before the oldest lever decays. The model has exactly DECAY_WINDOW valid moves after creation.
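The ordering can be sketched in Python. This is an illustrative model, not the repo's code: `apply_valid_action`, the `lever-pulled` fact prefix, and the decay arithmetic are assumptions; invalid actions are assumed to return before this function is reached, so they never advance the TTL clock.

```python
# Hypothetical sketch of steps 3-6 above for one VALID action.
# Propagation would run before decay, so a lever pulled on its last
# legal step is still visible to the sync rules.
DECAY_WINDOW = 5

def apply_valid_action(state: set, new_fact: str, current_step: int,
                       unstable: dict) -> int:
    current_step += 1                      # 3: valid-action counter
    state.add(new_fact)
    # 4: causal propagation (SyncEngine) would run here, seeing all live facts
    if new_fact.startswith("lever-pulled"):
        unstable[new_fact] = current_step  # 5: track new unstable fact
    for fact, born in list(unstable.items()):
        if current_step - born > DECAY_WINDOW:   # 6: decay AFTER propagation
            state.discard(fact)
            del unstable[fact]
    return current_step

# A lever pulled at valid-step 1 survives 5 more valid moves (steps 2-6)
# and expires on the 6th (age 6 > TTL 5), matching the trace semantics.
state, unstable, step = set(), {}, 0
step = apply_valid_action(state, "lever-pulled-a", step, unstable)
for i in range(5):
    step = apply_valid_action(state, f"move-{i}", step, unstable)
alive_at_window = "lever-pulled-a" in state
step = apply_valid_action(state, "move-x", step, unstable)
expired_after = "lever-pulled-a" not in state
```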
Auditability: the state description surfaces a TTL countdown for active unstable facts (e.g., lever-pulled mainframe-past: 3 valid actions remaining), and decay messages in traces show exact valid-step numbers (created at valid-step 15, expired at valid-step 21, age=6 > TTL=5).
Three dominant failure patterns emerge from the 390 runs:
1. Format Failure (MAX_INVALID_STREAK) The model fails to produce valid tool calls: it outputs plain text, calls unknown tools, or sends malformed arguments (tool_call_validity_rate ≈ 0). No action ever reaches the PDDL engine. This is the exclusive failure mode for Qwen3.5-Plus (20/20 failures), and a significant contributor for Gemini 2.5 Pro on L01/L06 (4/6) and Llama-4-Scout (4/22). These models either do not engage with the planning task at all, or fail to maintain tool-calling format as complexity increases.
2. Stagnation / Budget Exhaustion (MAX_STEPS + STAGNATION) The model produces valid tool calls but fails to converge toward the goal within the step budget. It may wander through valid-but-unproductive actions, repeat similar sequences without progress, or fail to integrate feedback. This is the dominant failure mode for Llama-4-Scout (13/22 failures), Qwen3-Coder-Next (10/14), and Mistral Large 2512 (6/11). Indicates tool-use ability but insufficient planning depth or causal reasoning.
3. Temporal Coordination Failure (TEMPORAL_DECAY) Specific to L06. The model understands the sub-goals (unlock mainframes, pull levers) but fails to execute the three lever pulls within the 5-valid-action decay window. Only successful world actions count toward the TTL, failed attempts do not shorten the window. This is the dominant failure mode for top-tier models on L06: GPT-5.2 (5/5 L06 failures), DeepSeek-V3.2 (5/6), Claude Haiku 4.5 (5/7), Kimi K2.5 (4/4), Gemini 3 Flash (3/4). Even Claude Opus 4.6's single failure is a temporal decay. This failure requires understanding timing constraints that are stated in the prompt but demand tight multi-epoch coordination to satisfy.
| Failure Mode | Models Most Affected | Root Cause |
|---|---|---|
| Format failure | qwen3.5-plus, gemini-2.5-pro (L06), llama-4-scout | Cannot produce or maintain valid tool calls |
| Stagnation | llama-4-scout, qwen3-coder-next, mistral-large | Valid tool calls but no convergence toward goal |
| Temporal decay | gpt-5.2, deepseek-v3.2, claude-haiku-4.5, kimi-k2.5 | Implicit timing constraints (L06) |
```
results/
├── summary/
│   ├── _results.csv                        # All runs, 37 columns, one row per run
│   └── fig_pipeline_fingerprint_overall.png
├── raw/
│   ├── claude-opus-4.6/                    # One folder per model
│   │   ├── ..._level_01_..._20260220.md    # Human-readable trace
│   │   └── ..._level_01_..._20260220.json  # Full machine-readable trace
│   ├── gpt-5.2/
│   └── ...
└── figs/                                   # Generated plots (SR, radar, fingerprints)
```
Each run produces three outputs:
Markdown trace (.md): human-readable step-by-step execution log with system prompt, each action, validation result, causal propagation events, and final scoring.
JSON trace (.json): full machine-readable record including execution metadata (model, timestamps, call params), complete conversation history (all messages sent to/from the LLM), tool definitions, per-step token counts, and the system/user prompts used.
CSV row: one row appended to results/summary/_results.csv per run.
| Column | Description |
|---|---|
| `timestamp`, `problem`, `model`, `run_id` | Run identification |
| `solved`, `stop_reason` | Outcome (True/False) and why the run ended |
| `total_steps`, `world_valid_steps`, `world_invalid_steps` | Step counts |
| `steps_to_solve_total`, `plan_length`, `error_overhead`, `overhead_ratio` | Effort metrics |
| `tool_calls_total`, `tool_calls_ok`, `tool_call_validity_rate` | Format compliance |
| `world_action_accuracy`, `invalid_rate`, `max_invalid_streak` | World model accuracy |
| `format_errors`, `precondition_errors`, `api_errors`, `control_signals` | Error breakdown |
| `recovery_rate`, `recovered_streaks`, `total_invalid_streaks` | Recovery metrics |
| `milestones_reached`, `milestones_total`, `milestone_progress`, `causal_efficiency` | Causal progress |
| `unique_states`, `loop_detected`, `stagnation_stop` | Exploration metrics |
| `total_time`, `tokens_in`, `tokens_out`, `tokens_reasoning` | Cost and latency |
The published results (390 runs) are included in the repository. Running the benchmark appends new rows to the same CSV.
```bibtex
@software{intent2026epochbench,
  author = {J{\'e}z{\'e}quel, Yann},
  title = {{EPOCH-Bench}: Evaluation of Planning Over Causal Horizons},
  year = {2026},
  url = {https://github.com/hey-intent/epoch-bench},
  note = {LLM agentic planning benchmark using PDDL time-travel puzzles}
}
```

MIT, see LICENSE

Yann JEZEQUEL, HeyIntent
