EPOCH-Bench

License: MIT · Python 3.10+

Evaluation of Planning Over Causal Horizons

A benchmark for evaluating LLM agentic planning capabilities on PDDL time-travel puzzles with causal propagation, inspired by Day of the Tentacle.

git clone https://github.com/hey-intent/epoch-bench.git && cd epoch-bench
pip install -e .
cp .env-example .env  # add your OpenRouter key
python cli/test_llm.py --model anthropic/claude-opus-4.6 --runs 1

EPOCH-Bench measures whether LLMs can act as goal-directed agents in a formal world with strict constraints, negative feedback, and temporal causality.


What it tests

Six progressively harder levels, from simple spatial navigation (L01: send an object through time in 4 steps) to multi-epoch synchronization with temporal decay (L06: coordinate 3 characters across past/present/future with timing constraints in 25+ steps).

Each action is validated by a deterministic PDDL engine. The agent receives structured feedback (actionable hints on failed preconditions, causal propagation events, state changes) and must adapt its plan in real time. The benchmark discriminates explicitly between format compliance, state tracking, causal understanding, and recovery after errors.

The puzzles are original creations inspired by the game, not reproductions of existing content.


Quick Start

Installation

git clone https://github.com/hey-intent/epoch-bench.git
cd epoch-bench
pip install -e .

Create an OpenRouter API key at https://openrouter.ai/settings/keys

cp .env-example .env   # then edit with your OpenRouter key

Run benchmark

# Run all 6 levels, 1 run each (using .env OPEN_ROUTER_MODEL)
python cli/test_llm.py

# Run specific levels with 10 runs each
python cli/test_llm.py --levels 1-3 --runs 10

# Use a specific model
python cli/test_llm.py --model google/gemini-3-flash-preview --runs 10

Batch execution

Edit the model list in cli/benchmark.py, or pass models via CLI:

# Run specific models and levels
python cli/benchmark.py --models anthropic/claude-opus-4.6 openai/gpt-5.2 --levels 1-3 --runs 5

# Run all configured models, all 6 levels, 5 runs each
python cli/benchmark.py --runs 5

Cost warning: A full benchmark run (13 models x 6 levels x 5 runs = 390 runs) costs approximately $30-40 on OpenRouter. L06 alone can consume 100k+ input tokens per run on some models. Start with --levels 1-2 --runs 1 to validate your setup before committing to a full run.

Verify world model (no API key needed)

python cli/test_world_model.py      # Unit tests
python cli/manual_solutions.py      # Verify optimal solutions

Code quality check

ruff check .

Design Decisions

Why PDDL?

PDDL provides a formal, deterministic, and reproducible framework. State transitions are mathematically verifiable. Every action either satisfies its preconditions or it doesn't.
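That deterministic check can be sketched in a few lines. The predicate strings below are illustrative (the `has(laverne, safe-key)` form appears in the benchmark's feedback messages; `at(laverne, office)` is a made-up example), not the engine's actual representation:

```python
def check_preconditions(required, state):
    """Return the preconditions missing from the current state (empty = applicable)."""
    return sorted(set(required) - set(state))

# open-bureau needs the key and co-location; only co-location holds:
missing = check_preconditions(
    {"has(laverne, safe-key)", "at(laverne, office)"},
    {"at(laverne, office)"},
)
# missing == ["has(laverne, safe-key)"]
```

An empty result means the action is applicable; a non-empty result is exactly the material for an actionable hint.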

Why Day of the Tentacle?

The game naturally provides a multi-epoch structure (past, present, future) with temporal causality (plant a tree in the past -> tree exists in the future). This allows testing causal understanding and multi-agent coordination without requiring real-world knowledge.

Why OpenRouter?

A benchmark comparing 10+ models across 6 providers needs a single API surface. OpenRouter provides exactly that: one endpoint, one auth token, unified tool calling format. Without it, the benchmark would require 6 separate SDKs, 6 auth flows, and 6 different tool calling implementations, all for infrastructure plumbing that adds zero scientific value.

The trade-off is real: OpenRouter adds a proxy hop (latency), and provider-specific features (Anthropic's extended thinking, Gemini's grounding) are not accessible. But for a benchmark measuring planning capability through standardized tool calls, consistency across models matters more than provider-specific optimizations. Every model gets the same prompt, the same tool schema, the same feedback loop.

Adding a new model to the benchmark is a one-line change in the config: no new SDK, no new auth, no new parsing logic.

Wrapper Pattern & Fixed-Point Propagation

TemporalWorldModel wraps WorldModel to add causal propagation without modifying the base class. The SyncEngine applies propagation rules via fixed-point iteration, automatically handling chain reactions (e.g., tree-planted -> tree-exists future -> unlocked park-future -> safe-key accessible).
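A minimal sketch of fixed-point propagation, assuming rules are simple (trigger, effect) pairs rather than the repository's actual rule objects:

```python
def propagate(facts, rules):
    """Apply (trigger, effect) rules until no new fact appears (fixed point)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for trigger, effect in rules:
            if trigger in facts and effect not in facts:
                facts.add(effect)
                changed = True  # a new effect may enable further rules
    return facts

# Chain reaction from the example above:
rules = [
    ("tree-planted", "tree-exists-future"),
    ("tree-exists-future", "unlocked-park-future"),
]
facts = propagate({"tree-planted"}, rules)
# facts now also contains tree-exists-future and unlocked-park-future
```

The outer loop re-scans the rules after every change, which is what lets a single planted tree cascade into an unlocked park without any rule-ordering assumptions.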

Conversation Pruning

Sliding window of N turns sent to the LLM API (configurable, default 10), while preserving the full history for export. Trade-off between available context and token consumption.
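A sliding-window pruner of this kind might look as follows (function and field names are assumptions, not the repository's API):

```python
def prune_history(messages, window=10):
    """Return the system prompt plus the last `window` non-system turns.
    The caller keeps the full `messages` list untouched for export."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-window:]
```

The default of 10 matches the documented window; only the pruned view is sent to the API, so token cost stays bounded while the exported trace remains complete.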

Tool Calling over Text Parsing

PDDL actions are exposed as OpenAI-compatible tools. This eliminates parsing ambiguity and directly tests the LLM's ability to use tools correctly, a fundamental agentic competency.
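For illustration, a PDDL `move` action exposed as an OpenAI-compatible tool could look like the sketch below; the action name, parameter names, and character list are hypothetical, not the benchmark's actual tool definitions:

```python
# Hypothetical tool definition in the OpenAI-compatible format OpenRouter accepts.
move_tool = {
    "type": "function",
    "function": {
        "name": "move",
        "description": "Move a character to an adjacent location in its epoch.",
        "parameters": {
            "type": "object",
            "properties": {
                "character": {"type": "string"},
                "destination": {"type": "string"},
            },
            "required": ["character", "destination"],
        },
    },
}
```

Because arguments arrive as typed JSON rather than free text, a malformed call is a clean, countable format error instead of a parsing guess.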


Three Levels of Knowledge Tested

The benchmark tests three distinct levels of agentic competency:

Level 1: Macro-Causality (EXPLICIT)

The model receives in the prompt:

  • Action signatures with typed parameters
  • Complete current state (all predicates)
  • Goal predicates
  • Explicit causal effects ("plant-tree -> tree-exists future")

Test: can the model follow explicit rules?

Level 2: Micro-Causality / Gating (FEEDBACK-DRIVEN)

The model discovers through feedback:

  • PDDL preconditions: PRECONDITION_FAILED: has(laverne, safe-key) is FALSE
  • Ordering constraints: form-beta-destroyed is FALSE -> use destroy action
  • Gating conditions: bureau-open is FALSE -> use open-bureau with correct key

Test: does the model integrate negative feedback to reorder its plan?

Level 3: Resource Management (IMPLICIT, NO FEEDBACK)

No preventive feedback. Suboptimal actions are not blocked:

  • Dropping an item in a location where no other character can reach it
  • Sending an item to the wrong character, requiring extra transfer steps
  • Moving to dead-end locations that waste steps from the limited budget

The engine does not warn against wasteful actions: they are technically valid.

Test: does the model plan ahead to avoid wasting resources (steps, items, positioning)?


The 6 Agentic Metrics

All metrics are extracted automatically from logs, with no human annotation.

The benchmark separates two distinct failure modes that earlier versions conflated:

LLM turn  ──>  Tool call produced?  ──yes──>  Preconditions met?  ──yes──>  World valid step
                   │ no                              │ no
                   v                                   v
             FORMAT_FAILURE                    PRECONDITION_FAILED
           (tool_call_validity_rate)         (world_action_accuracy)

Tool Call Validity Rate

tool_calls_total = total_steps - control_signals - api_errors
tool_calls_ok    = tool_calls_total - format_errors

Tool Call Validity Rate = tool_calls_ok / tool_calls_total

Does the model understand it is a tool-using agent, not a chatbot? A format error means the model produced plain text, called an unknown tool, or sent malformed arguments. The turn never reached the PDDL engine.

  • ~1.0 → agent-compatible
  • < 0.5 → unusable in an action loop

World Action Accuracy

World Action Accuracy = world_valid_steps / tool_calls_ok

Among syntactically valid tool calls, how many passed PDDL precondition checks? This measures whether the model maintains a correct mental model of the world state.

The two metrics are complementary: a model can have perfect tool call validity (always calls a tool) but low world accuracy (calls the wrong actions). Conversely, a model that barely tool-calls has low validity but its rare tool calls may be world-accurate.
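The two metrics can be derived together from the raw per-run counters; this is a sketch (counter names follow the CSV schema, the function itself is illustrative):

```python
def agent_metrics(total_steps, control_signals, api_errors,
                  format_errors, world_valid_steps):
    """Derive the two complementary format/world metrics from raw counters."""
    tool_calls_total = total_steps - control_signals - api_errors
    tool_calls_ok = tool_calls_total - format_errors
    return {
        "tool_call_validity_rate": tool_calls_ok / tool_calls_total,
        "world_action_accuracy": world_valid_steps / tool_calls_ok,
    }

# 50 turns, 2 control signals, 1 API error, 4 format errors, 38 world-valid steps:
metrics = agent_metrics(50, 2, 1, 4, 38)  # validity 43/47, accuracy 38/43
```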

Causal Progress

Causal Progress = milestones_reached / total_milestones

A causal milestone is an irreversible structuring state (tree-planted, permit-created, amendment-in-constitution). Do the actions causally advance the world?

Causal Efficiency

Causal Efficiency = milestones_reached / world_valid_steps

Does the model advance the world, or explore without converging?

Complementarity with Causal Progress:

| Progress | Efficiency | Observable pattern |
|---|---|---|
| High | High | Few valid actions, most trigger milestones |
| High | Low | Many valid actions needed to reach milestones |
| Low | High | Few valid actions, none wasted, but run stopped early |
| Low | Low | Many valid actions, none trigger milestones |

Recovery Rate (Streak-Based)

Recovery Rate = recovered_streaks / total_invalid_streaks

Where recovered_streak = block of consecutive errors followed by a DIFFERENT valid action.

Can the model unblock itself after a series of errors?
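A streak-based recovery computation might be implemented as follows; the record format and the convention of returning 1.0 when no streaks occur are assumptions of this sketch:

```python
def recovery_rate(steps):
    """steps: ordered list of {'valid': bool, 'action': str} records.
    A streak of consecutive invalid steps counts as recovered when the
    next valid action DIFFERS from the last failed one."""
    streaks = recovered = 0
    i = 0
    while i < len(steps):
        if not steps[i]["valid"]:
            j = i
            while j < len(steps) and not steps[j]["valid"]:
                j += 1
            streaks += 1
            if j < len(steps) and steps[j]["action"] != steps[j - 1]["action"]:
                recovered += 1
            i = j  # resume after the streak
        i += 1
    return recovered / streaks if streaks else 1.0
```

A run that ends mid-streak therefore counts that streak as unrecovered, which matches the intuition that the model never unblocked itself.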

Effort Metrics

Three complementary metrics:

Steps-to-Solve (total)  = total LLM turns until goal
Plan Length (valid-only) = world_valid_steps until goal
Error Overhead           = total - plan_length
Overhead Ratio           = total / plan_length

Overhead Ratio = 1.0 means zero errors. Higher values indicate proportionally more total steps relative to useful steps.
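All four quantities derive from just two raw counters, e.g.:

```python
def effort_metrics(total_steps, world_valid_steps):
    """Effort metrics for a solved run, from two raw counters."""
    return {
        "steps_to_solve_total": total_steps,
        "plan_length": world_valid_steps,
        "error_overhead": total_steps - world_valid_steps,
        "overhead_ratio": total_steps / world_valid_steps,
    }

# A run solved in 12 total turns with 8 valid world steps:
m = effort_metrics(12, 8)  # error_overhead = 4, overhead_ratio = 1.5
```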

Control Signals vs Errors

Control signals (STUCK, DONE) are NOT errors. They are excluded from error counts, from tool call validity, and from world action accuracy. API errors (infrastructure failures) are similarly excluded: they reflect provider issues, not model capability.


Results

13 models, 6 levels, 5 runs each = 390 runs. All models accessed via OpenRouter with standardized tool calling.

Solve Rate (mean over 5 runs)

| Model | L01 | L02 | L03 | L04 | L05 | L06 | ALL |
|---|---|---|---|---|---|---|---|
| claude-opus-4.6 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.80 | 0.97 |
| grok-4.1-fast | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 0.60 | 0.90 |
| gemini-3-flash-preview | 1.00 | 1.00 | 1.00 | 1.00 | 0.80 | 0.40 | 0.87 |
| kimi-k2.5 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.20 | 0.87 |
| gpt-5.2 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.83 |
| deepseek-v3.2 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 | 0.00 | 0.80 |
| gemini-2.5-pro | 0.80 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.80 |
| claude-haiku-4.5 | 1.00 | 1.00 | 0.60 | 1.00 | 1.00 | 0.00 | 0.77 |
| devstral-2512 | 1.00 | 0.60 | 1.00 | 1.00 | 0.80 | 0.00 | 0.73 |
| mistral-large-2512 | 1.00 | 0.40 | 0.80 | 1.00 | 0.60 | 0.00 | 0.63 |
| qwen3-coder-next | 1.00 | 1.00 | 0.00 | 1.00 | 0.20 | 0.00 | 0.53 |
| qwen3.5-plus-02-15 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 |
| llama-4-scout | 0.80 | 0.60 | 0.00 | 0.20 | 0.00 | 0.00 | 0.27 |

L06 (The Three Body Problem) is the main discriminant: only 4 models ever solve it, and only claude-opus-4.6 reaches 80%.

Per-Model Summary

| Model | Runs | SR | Time (avg) | In (avg) | Out (avg) |
|---|---|---|---|---|---|
| claude-opus-4.6 | 30 | 97% | 43.1s | 64k | 2k |
| grok-4.1-fast | 30 | 90% | 102.9s | 50k | 12k |
| gemini-3-flash-preview | 30 | 87% | 27.5s | 52k | 1k |
| kimi-k2.5 | 30 | 87% | 116.0s | 46k | 2k |
| gpt-5.2 | 30 | 83% | 52.8s | 39k | 2k |
| deepseek-v3.2 | 30 | 80% | 138.2s | 78k | 2k |
| gemini-2.5-pro | 30 | 80% | 39.4s | 32k | 2k |
| claude-haiku-4.5 | 30 | 77% | 29.2s | 74k | 2k |
| devstral-2512 | 30 | 73% | 12.1s | 69k | 573 |
| mistral-large-2512 | 30 | 63% | 27.5s | 81k | 1k |
| qwen3-coder-next | 30 | 53% | 35.9s | 108k | 2k |
| qwen3.5-plus-02-15 | 30 | 33% | 17.9s | 26k | 769 |
| llama-4-scout | 30 | 27% | 11.2s | 77k | 1k |

Figures

Pipeline Fingerprint

Run outcome distribution per model, sorted by SR. Dashed line = average total tokens per run. Green = solved, blue = temporal decay, yellow = max steps, red = max invalid streak.

Full traces (JSON + MD per run): results/raw/


Architecture

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐  │
│   │  State  │───>│ Planner │───>│ Execute │───>│Propagate│  │
│   │  (PDDL) │    │  (LLM)  │    │ Action  │    │ (Sync)  │  │
│   └─────────┘    └─────────┘    └─────────┘    └─────────┘  │
│        ^                                            │        │
│        └────────────────────────────────────────────┘        │
│                    Repeat until goal                         │
└──────────────────────────────────────────────────────────────┘
  1. The LLM receives the current state (grouped predicates), goal predicates, action history with results, and recent causal trigger logs.
  2. The LLM selects an action via tool calling (each PDDL action is a tool with typed parameters).
  3. The WorldModel validates preconditions and applies PDDL effects.
  4. The SyncEngine applies causal propagation (fixed-point iteration).
  5. Feedback (valid/invalid, triggered rules, state changes) is added to the history.
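The five steps above can be sketched as a loop; the class and method names here (WorldModel-style `apply`, `propagate`, etc.) are simplified assumptions, not the repository's exact API:

```python
def run_episode(world, sync, planner, max_steps=50):
    """One benchmark episode: plan, execute, propagate, feed back, repeat."""
    history = []
    for _ in range(max_steps):
        prompt = world.describe_state()                  # 1. grouped predicates + goal
        action = planner.select_action(prompt, history)  # 2. tool call from the LLM
        result = world.apply(action)                     # 3. precondition check + effects
        if result.valid:
            result.events = sync.propagate(world)        # 4. fixed-point causal propagation
        history.append((action, result))                 # 5. feedback for the next turn
        if world.goal_reached():
            return history
    return history
```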

| Component | File | Description |
|---|---|---|
| WorldModel | src/core/world_model.py | Parses PDDL, tracks state, validates preconditions |
| SyncEngine | src/core/sync_engine.py | Propagates causal effects across epochs (fixed-point) |
| Tracing | src/core/tracing.py | Run logging, step logs, stop reason tracking |
| Scoring | src/core/scoring.py | Multi-level metrics: tool call validity, world action accuracy, causal efficiency, recovery rate, effort |
| LLMPlanner | src/llm/planner.py | Queries LLM via OpenRouter (tool calling) |
| LLMRunner | src/llm/runner.py | Orchestrates planning loop with 9 stop conditions |
| Prompts | src/llm/prompts.py | Converts PDDL state to natural language prompts |
| Export | src/llm/export.py | Exports run results to Markdown + JSON |
| Benchmark | src/llm/benchmark.py | Multi-model benchmark runner |

Levels

| Level | Name | Optimal | Competencies Tested |
|---|---|---|---|
| L01 | Hello ChronoJohn | 4 | Spatial navigation, cross-epoch transfer |
| L02 | The Tree | 3 | Causal propagation (past -> future) |
| L03 | The Carved Code | 8 | Multi-agent coordination, character switching |
| L04 | The Vault | 10 | Locked-resource dependencies, key management |
| L05 | The Constitutional Vacuum | 18 | Full integration of all competencies |
| L06 | The Three Body Problem | 25 | Temporal decay (5 valid-action TTL), deductive reasoning, 3-epoch sync |

Causal Propagation Rules

| Trigger | Effect | Example |
|---|---|---|
| tree-planted | tree-exists present/future | Tree grows over time |
| tree-exists future | unlocked park-future | Gate opens when tree exists |
| code-carved | stone-readable present/future | Stone persists across epochs |
| founder-convinced | amendment-in-constitution | Constitutional change |

Stop Conditions

| Condition | Description | Agent fault? |
|---|---|---|
| SOLVED | Goal reached | -- |
| MAX_STEPS | Step budget exhausted | Yes (too slow) |
| MAX_INVALID_STREAK | 5 consecutive invalid actions | Yes (lost) |
| LOOP_DETECTED | Same state visited N times | Yes (stuck in loop) |
| STAGNATION | No progress in N steps | Yes (no convergence) |
| LLM_STUCK | LLM voluntarily signals impossibility | No (lucid) |
| LLM_DONE_EARLY | LLM signals done but goal not reached | Yes (premature) |
| API_FAILURE | Repeated API/network errors | No (infrastructure) |
| TEMPORAL_DECAY | Unstable predicate expired after 5 valid actions (L06) | Yes (too slow) |

Temporal Decay Semantics (L06)

The lever-pulled predicate is unstable: it expires after DECAY_WINDOW = 5 valid world actions. The formal counting rule:

What advances the TTL clock: only actions that pass PDDL precondition checks (world_valid_steps). This is the same counter used by world_action_accuracy.

What does NOT advance the TTL clock: format errors (no tool call produced), precondition failures (invalid tool call), API errors (infrastructure), control signals (STUCK/DONE). These consume the MAX_STEPS budget but do not shorten the decay window.

Rationale: the decay window measures planning coordination ability ("can you sequence 3 lever pulls within N real moves?"), not format compliance. Format failures are already penalized by tool_call_validity_rate.

Execution order per step:

1. Execute action via WorldModel (precondition check)
2. If invalid → return early, TTL clock unchanged
3. Increment valid-action counter (current_step += 1)
4. Causal propagation (SyncEngine.propagate)
5. Track newly created unstable facts
6. Temporal decay: remove facts where age > DECAY_WINDOW

Because propagation (4) runs before decay (6), a synchronizing lever pull on the last possible step fires the sync rule (all 3 levers still in state) before the oldest lever decays. The model has exactly DECAY_WINDOW valid moves after creation.
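A sketch of this step order, with the propagation-before-decay guarantee made explicit; the method names and the `unstable` bookkeeping dict are illustrative, not the engine's actual code:

```python
DECAY_WINDOW = 5  # TTL in valid world actions

def step(world, sync, action, unstable):
    """One engine step. `unstable` maps fact -> valid-step of creation."""
    result = world.apply(action)           # 1. precondition check
    if not result.valid:
        return result                      # 2. invalid: TTL clock unchanged
    world.current_step += 1                # 3. only valid actions tick the clock
    new_facts = sync.propagate(world)      # 4. propagation runs BEFORE decay
    for f in new_facts:
        unstable.setdefault(f, world.current_step)   # 5. track new unstable facts
    for f, born in list(unstable.items()):
        if world.current_step - born > DECAY_WINDOW: # 6. age > TTL -> expire
            world.remove_fact(f)
            del unstable[f]
    return result
```

A fact created at valid-step 15 thus survives through valid-step 20 and expires at valid-step 21 (age 6 > TTL 5), matching the trace example below.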

Auditability: the state description surfaces a TTL countdown for active unstable facts (e.g., lever-pulled mainframe-past: 3 valid actions remaining), and decay messages in traces show exact valid-step numbers (created at valid-step 15, expired at valid-step 21, age=6 > TTL=5).

Observed Failure Modes

Three dominant failure patterns emerge from the 390 runs:

1. Format Failure (MAX_INVALID_STREAK) The model fails to produce valid tool calls: it outputs plain text, calls unknown tools, or sends malformed arguments (tool_call_validity_rate ≈ 0). No action ever reaches the PDDL engine. This is the exclusive failure mode for Qwen3.5-Plus (20/20 failures), and a significant contributor for Gemini 2.5 Pro on L01/L06 (4/6) and Llama-4-Scout (4/22). These models either do not engage with the planning task at all, or fail to maintain tool-calling format as complexity increases.

2. Stagnation / Budget Exhaustion (MAX_STEPS + STAGNATION) The model produces valid tool calls but fails to converge toward the goal within the step budget. It may wander through valid-but-unproductive actions, repeat similar sequences without progress, or fail to integrate feedback. This is the dominant failure mode for Llama-4-Scout (13/22 failures), Qwen3-Coder-Next (10/14), and Mistral Large 2512 (6/11). Indicates tool-use ability but insufficient planning depth or causal reasoning.

3. Temporal Coordination Failure (TEMPORAL_DECAY) Specific to L06. The model understands the sub-goals (unlock mainframes, pull levers) but fails to execute the three lever pulls within the 5-valid-action decay window. Only successful world actions count toward the TTL, failed attempts do not shorten the window. This is the dominant failure mode for top-tier models on L06: GPT-5.2 (5/5 L06 failures), DeepSeek-V3.2 (5/6), Claude Haiku 4.5 (5/7), Kimi K2.5 (4/4), Gemini 3 Flash (3/4). Even Claude Opus 4.6's single failure is a temporal decay. This failure requires understanding timing constraints that are stated in the prompt but demand tight multi-epoch coordination to satisfy.

| Failure Mode | Models Most Affected | Root Cause |
|---|---|---|
| Format failure | qwen3.5-plus, gemini-2.5-pro (L06), llama-4-scout | Cannot produce or maintain valid tool calls |
| Stagnation | llama-4-scout, qwen3-coder-next, mistral-large | Valid tool calls but no convergence toward goal |
| Temporal decay | gpt-5.2, deepseek-v3.2, claude-haiku-4.5, kimi-k2.5 | Implicit timing constraints (L06) |

Data

Directory structure

results/
├── summary/
│   ├── _results.csv                  # All runs, 37 columns, one row per run
│   └── fig_pipeline_fingerprint_overall.png
├── raw/
│   ├── claude-opus-4.6/              # One folder per model
│   │   ├── ..._level_01_..._20260220.md   # Human-readable trace
│   │   └── ..._level_01_..._20260220.json # Full machine-readable trace
│   ├── gpt-5.2/
│   └── ...
└── figs/                         # Generated plots (SR, radar, fingerprints)

Per-run output

Each run produces three outputs:

Markdown trace (.md): human-readable step-by-step execution log with system prompt, each action, validation result, causal propagation events, and final scoring.

JSON trace (.json): full machine-readable record including execution metadata (model, timestamps, call params), complete conversation history (all messages sent to/from the LLM), tool definitions, per-step token counts, and the system/user prompts used.

CSV row: one row appended to results/summary/_results.csv per run.

CSV schema (37 columns)

| Column | Description |
|---|---|
| timestamp, problem, model, run_id | Run identification |
| solved, stop_reason | Outcome (True/False) and why the run ended |
| total_steps, world_valid_steps, world_invalid_steps | Step counts |
| steps_to_solve_total, plan_length, error_overhead, overhead_ratio | Effort metrics |
| tool_calls_total, tool_calls_ok, tool_call_validity_rate | Format compliance |
| world_action_accuracy, invalid_rate, max_invalid_streak | World model accuracy |
| format_errors, precondition_errors, api_errors, control_signals | Error breakdown |
| recovery_rate, recovered_streaks, total_invalid_streaks | Recovery metrics |
| milestones_reached, milestones_total, milestone_progress, causal_efficiency | Causal progress |
| unique_states, loop_detected, stagnation_stop | Exploration metrics |
| total_time, tokens_in, tokens_out, tokens_reasoning | Cost and latency |

The published results (390 runs) are included in the repository. Running the benchmark appends new rows to the same CSV.
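For example, per-model solve rates can be aggregated from the CSV with pandas (assuming pandas is installed; column names follow the schema above):

```python
import pandas as pd

def solve_rate_by_model(df):
    """Mean solve rate per model, highest first."""
    return df.groupby("model")["solved"].mean().sort_values(ascending=False)

# Typical use on the published results:
# df = pd.read_csv("results/summary/_results.csv")
# print(solve_rate_by_model(df))
```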


Citation

@software{intent2026epochbench,
  author       = {J{\'e}z{\'e}quel, Yann},
  title        = {{EPOCH-Bench}: Evaluation of Planning Over Causal Horizons},
  year         = {2026},
  url          = {https://github.com/hey-intent/epoch-bench},
  note         = {LLM agentic planning benchmark using PDDL time-travel puzzles}
}

License

MIT, see LICENSE

Author

Yann JEZEQUEL, HeyIntent
