why agent guess when agent can know
The loop • Install • Your first run • Existing repo • Commands • How it works • Examples
🪨 Caveman Ecosystem · caveman talk less · cavemem remember more · cavekit build better (you are here)
A Claude Code plugin that turns natural language into specs, specs into parallel build plans, and build plans into working software — with an autonomous loop, validation gates, and optional cross-model review.
You describe what you want. Cavekit writes the contract. Agents build from the contract. Every line of code traces to a requirement. Every requirement has acceptance criteria.
/ck:sketch → /ck:map → /ck:make → /ck:check
   what to build      how to build it      build it      verify
Four commands. In that order. That is the whole main cycle.
- `/ck:sketch` — decompose your project into domains; write kits with R-numbered requirements and testable acceptance criteria.
- `/ck:map` — read kits; generate a tiered build site with a task dependency graph.
- `/ck:make` — autonomous parallel build loop. Ready tasks are grouped into packets, validated, and merged wave by wave. Runs until all tasks are done or a budget trips.
- `/ck:check` — gap analysis against kits plus peer review of the code. Produces an APPROVE / REVISE / REJECT verdict and auto-amends kits when gaps appear.
One shortcut: /ck:ship "<description>" runs all four end-to-end with no user gates. For tiny features and throwaways — the guided path is better for anything non-trivial, because the design conversation is where the value is.
git clone https://github.com/JuliusBrussee/cavekit.git ~/.cavekit
cd ~/.cavekit && ./install.sh
Registers the plugin with Claude Code, syncs into the Codex marketplace, and installs the cavekit CLI. Restart Claude Code after installing.
Requires: Claude Code, git, macOS/Linux.
Optional: Codex (npm install -g @openai/codex) — adds adversarial review. Cavekit works without it.
Greenfield. New repo. You want a task-management API.
> /ck:init
Context hierarchy created. Capabilities detected.
Next: /ck:sketch
> /ck:sketch
What are you building?
> A REST API for task management. Users, projects, tasks with
priorities and due dates. PostgreSQL.
(design conversation — research if warranted, domain decomposition,
acceptance criteria refinement)
4 kits, 22 requirements, 69 acceptance criteria.
Next: /ck:map
> /ck:map
34 tasks across 5 tiers. Coverage: 69/69 criteria mapped.
Next: /ck:make
> /ck:make
Loop active — 34 tasks, 20 max iterations.
Wave 1 (3 tasks), Wave 2 (4 tasks), …
ALL TASKS DONE. Build passes. Tests pass.
Next: /ck:check
> /ck:check
Coverage: 100%. Verdict: APPROVE.
CAVEKIT COMPLETE
That is the experience.
Same loop, different first command.
> /ck:sketch --from-code
Scanning repo… Next.js 14, Prisma, NextAuth.
6 kits reverse-engineered. 4 requirements flagged as gaps (not yet implemented).
Next: /ck:map --filter collaboration (or whichever domain you're adding to)
> /ck:map --filter collaboration
8 tasks, 3 tiers.
Next: /ck:make
> /ck:make
…
> /ck:check
Verdict: APPROVE. Design-system compliance: 100%.
See example.md for fully annotated sessions.
The main cycle is four commands. Everything else is optional.
| Command | Phase | What it does |
|---|---|---|
| `/ck:sketch` | Draft | Decompose into domains; write kits with R-numbered requirements and testable acceptance criteria |
| `/ck:map` | Architect | Generate a tiered build site (task dependency graph) from kits |
| `/ck:make` | Build | Autonomous parallel build loop; validates each task against its criteria |
| `/ck:check` | Inspect | Gap analysis + peer review; verdict APPROVE / REVISE / REJECT |
| `/ck:ship` | End-to-end | One-shot sketch → map → make → check, no user gates. Tiny features only. |
| Command | What it does |
|---|---|
| `/ck:init` | Bootstrap context/ and .cavekit/ runtime state. `--tools-only` re-detects capabilities. |
| `/ck:design` | Create / import / update / audit DESIGN.md (9-section visual system) |
| `/ck:research` | Parallel multi-agent research → brief in context/refs/ |
| `/ck:revise` | Trace manual fixes back into kits. `--trace` runs the single-failure backpropagation protocol (auto-invoked on test failure). |
| `/ck:review` | Branch review: kit compliance + code quality. `--mode gap`, `--codex`, `--tier`, `--strict` narrow the scope. |
| `/ck:status` | Task frontier and runtime state. `--watch` tails the live dashboard. |
| `/ck:config` | Execution presets and runtime keys. `--global` writes to ~/.cavekit/config. |
| `/ck:resume` | Recover an interrupted loop. |
| `/ck:help` | Command reference. |
Run /ck:help for flag-level detail on any command.
| Command | What it does |
|---|---|
| `cavekit monitor` | Interactive launcher — pick build sites, launch in tmux |
| `cavekit status` | Build site progress |
| `cavekit kill` | Stop all sessions, clean up worktrees |
| `cavekit version` | Print version |
| `cavekit reset` | Clear persisted state |
Four phases. Each one a slash command.
Describe what you're building in natural language. Cavekit decomposes it into domain kits — structured documents with numbered requirements (R1, R2, …) and testable acceptance criteria. Stack-independent. Human-readable.
If Codex is installed, kits go through a design challenge — adversarial review that catches decomposition flaws before any code is written.
Brownfield: /ck:sketch --from-code reverse-engineers kits from your code and flags gaps.
Reads every kit. Breaks requirements into tasks. Maps dependencies. Organizes into a tiered build site — a dependency graph where Tier 0 has no deps, Tier 1 depends only on Tier 0, and so on. Includes a Coverage Matrix mapping every acceptance criterion to its task(s). Nothing specified gets lost in translation.
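The tiering rule itself is mechanical: a task's tier is one more than the highest tier among its dependencies. A minimal sketch in Python (the task IDs and graph shape are illustrative, not Cavekit's actual on-disk format):

```python
def assign_tiers(deps: dict[str, set[str]]) -> dict[str, int]:
    """Assign each task the lowest tier where all of its dependencies
    sit in strictly lower tiers. Tier 0 = no dependencies."""
    tiers: dict[str, int] = {}

    def tier_of(task: str) -> int:
        if task not in tiers:
            tiers[task] = 0 if not deps[task] else 1 + max(tier_of(d) for d in deps[task])
        return tiers[task]

    for task in deps:
        tier_of(task)
    return tiers

# Hypothetical fragment of a build site
deps = {
    "T-001": set(),               # database schema
    "T-002": set(),               # auth middleware
    "T-004": {"T-001", "T-002"},  # user endpoints
}
print(assign_tiers(deps))  # {'T-001': 0, 'T-002': 0, 'T-004': 1}
```

Everything in tier N can be dispatched as soon as tiers below N have merged, which is what makes the wave-based parallelism in `/ck:make` safe.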
A pre-flight coverage check confirms every acceptance criterion is mapped to a task. Then the loop runs:
┌──────────────────────────────────────────────────────┐
│ Read build site → Find next unblocked task │
│ ▼ │
│ Load kit + acceptance criteria │
│ ▼ │
│ Implement task │
│ ▼ │
│ Validate (build + tests + acceptance criteria) │
│ ▼ │
│ PASS → commit → mark done → next ──┐ │
│ FAIL → diagnose → fix → revalidate │ │
│ ◄──────────────────────────────────┘ │
│ │
│ Loop until: all tasks done OR budget exhausted │
└──────────────────────────────────────────────────────┘
/ck:make parallelizes automatically. Multiple ready tasks get grouped into coherent work packets and dispatched concurrently:
═══ Wave 1 ═══
3 task(s) ready:
T-001: Database schema (tier 0)
T-002: Auth middleware (tier 0)
T-003: Config loader (tier 0)
Dispatching 2 grouped subagents…
All 3 tasks complete. Merging…
═══ Wave 2 ═══
2 task(s) ready:
T-004: User endpoints (tier 1, deps: T-001, T-002)
T-005: Health check (tier 1, deps: T-003)
…
Circuit breakers prevent infinite loops: 3 test failures → task BLOCKED; all blocked → stop and report.
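The per-task breaker reduces to a failure counter; the 3-strike threshold is the one named above, while the function and variable names are illustrative:

```python
MAX_FAILURES = 3  # validation failures allowed before a task is BLOCKED

def record_attempt(task: str, passed: bool,
                   failures: dict[str, int], blocked: set[str]) -> None:
    """Record one validation attempt; trip the breaker at MAX_FAILURES."""
    if passed:
        failures.pop(task, None)  # a pass resets the task's failure count
        return
    failures[task] = failures.get(task, 0) + 1
    if failures[task] >= MAX_FAILURES:
        blocked.add(task)

failures: dict[str, int] = {}
blocked: set[str] = set()
for _ in range(3):
    record_attempt("T-007", passed=False, failures=failures, blocked=blocked)
print(blocked)  # {'T-007'}

# When every remaining task is blocked, the loop stops and reports
remaining = {"T-007"}
print(remaining <= blocked)  # True → halt
```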
At every tier boundary, optional Codex review gates advancement. P0/P1 findings must be fixed before the next tier starts. Speculative review (default) adds near-zero latency.
Gap analysis: built vs. specified. Peer review: bugs, security, missed requirements. Everything traced back to kit requirements. Verdict: APPROVE / REVISE / REJECT. Gaps feed back as remediation tasks.
If the project has UI, run /ck:design first. It creates or imports DESIGN.md — a 9-section Google-Stitch-format design system. Every kit then references its design tokens; every UI task carries a Design Ref; every build result is audited for design violations during /ck:check.
/ck:design # interactive
/ck:design --import vercel # start from a known system
/ck:design --from-site <url> # extract tokens from a live site
/ck:design --audit # gap-check an existing DESIGN.md
Settings live in two places:
| Location | Scope |
|---|---|
| `~/.cavekit/config` | User default |
| `.cavekit/config` | Project override (takes precedence) |
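Resolution is a simple overlay: read the user default first, then let project keys win. A hedged sketch, assuming both files hold flat JSON (the real file format may differ):

```python
import json
from pathlib import Path

def load_config(home: Path, project: Path) -> dict:
    """Merge user-level and project-level config; project keys win.
    Paths and JSON format here are assumptions for illustration."""
    merged: dict = {}
    for path in (home / ".cavekit" / "config",
                 project / ".cavekit" / "config"):
        if path.exists():
            merged.update(json.loads(path.read_text()))
    return merged
```

So a repo that sets only `bp_model_preset` still inherits every other key from the user default.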
| Preset | Reasoning | Execution | Exploration |
|---|---|---|---|
| `expensive` | opus | opus | opus |
| `quality` | opus | opus | sonnet |
| `balanced` | opus | sonnet | haiku |
| `fast` | sonnet | sonnet | haiku |
/ck:config # show current
/ck:config preset balanced # change preset for this repo
/ck:config preset fast --global # change default for all repos
All configuration keys
| Setting | Values | Default | Purpose |
|---|---|---|---|
| `bp_model_preset` | expensive / quality / balanced / fast | quality | Model selection |
| `codex_review` | auto / off | auto | Enable/disable Codex reviews |
| `codex_model` | model string | (Codex default) | Model for Codex calls |
| `tier_gate_mode` | severity / strict / permissive / off | severity | How findings gate tier advancement |
| `command_gate` | all / interactive / off | all | Command-gating scope |
| `command_gate_timeout` | ms | 3000 | Codex classification timeout |
| `speculative_review` | on / off | on | Background review of previous tier |
| `speculative_review_timeout` | s | 300 | Max wait for speculative results |
| `caveman_mode` | on / off | on | Token-compressed output (~75% savings) |
| `caveman_phases` | comma-separated | build,inspect | Which phases use caveman-speak |
| `session_budget` | tokens | 500000 | Loop token cap before auto-halt |
| `max_iterations` | integer | 60 | Stop-hook iteration cap |
| `task_budget_quick` | tokens | 8000 | Per-task budget for depth: quick |
| `task_budget_standard` | tokens | 20000 | Per-task budget for depth: standard |
| `task_budget_thorough` | tokens | 45000 | Per-task budget for depth: thorough |
| `auto_backprop` | on / off | on | Trigger backpropagation on test failure |
| `tool_cache` | on / off | on | Cache read-only tool results |
| `tool_cache_ttl_ms` | ms | 120000 | TTL for cached tool results |
| `test_filter` | on / off | on | Condense test output around failures |
| `progress_tracker` | on / off | on | Write .cavekit/.progress.json |
| `parallelism_max_agents` | integer | 3 | Max concurrent subagents per wave |
| `parallelism_max_per_repo` | integer | 2 | Max concurrent subagents writing the same repo |
| `model_routing` | on / off | on | Score-based tier routing |
| `graphify_enabled` | on / off | off | Use knowledge-graph queries |
Codex review modes
Cavekit uses Codex as an adversarial reviewer — a second model with different training and different blind spots. Three levels:
After kits are drafted and internally reviewed, the full set goes to Codex.
| Finding type | Behavior |
|---|---|
| Critical | Must fix before building. Auto-fix loop, up to 2 cycles. |
| Advisory | Presented alongside kits at user review gate. |
Only design-level concerns. No implementation feedback.
Every completed tier triggers a Codex code review before advancing.
| Severity | Behavior |
|---|---|
| P0 (critical) | Blocks advancement. Auto-generates fix task. |
| P1 (high) | Blocks advancement. Auto-generates fix task. |
| P2 (medium) | Logged, does not block. |
| P3 (low) | Logged, does not block. |
Gate modes: severity (default — P0/P1 block), strict (all block), permissive (nothing blocks), off.
Fix cycle runs up to 2 iterations per tier. After that, advances with warning. Never deadlocks.
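The four gate modes reduce to a small predicate over a tier's findings. An illustrative sketch (finding labels P0..P3 as above; the function name is made up):

```python
def gate_blocks(findings: list[str], mode: str = "severity") -> bool:
    """Does this tier's review block advancement?
    Modes mirror tier_gate_mode: severity (P0/P1 block),
    strict (anything blocks), permissive / off (nothing blocks)."""
    if mode in ("permissive", "off"):
        return False
    if mode == "strict":
        return bool(findings)
    return any(f in ("P0", "P1") for f in findings)  # severity, the default

print(gate_blocks(["P2", "P1"]))           # True  — P1 blocks under severity
print(gate_blocks(["P2", "P3"]))           # False — logged, does not block
print(gate_blocks(["P3"], mode="strict"))  # True
```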
Codex reviews the previous tier in the background while Claude builds the current tier. Results are ready when the gate checks. Near-zero latency. Falls back to synchronous if needed.
PreToolUse hook intercepts every Bash command. Fast-path allowlist (50+ safe commands) / blocklist (rm -rf, force push, DROP TABLE). Ambiguous commands → Codex classifies → safe / warn / block. Verdict cached per session. Falls back to static rules when Codex is unavailable — never blocks solely because classifier is unreachable.
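The decision order can be sketched as: static fast path first, classifier only for the ambiguous middle. Everything below (the allowlist subset, the patterns, the `codex` callable) is illustrative, not the hook's real rule set:

```python
import re

ALLOWLIST = {"git status", "ls", "pwd"}  # tiny illustrative subset of the 50+ safe commands
BLOCK_PATTERNS = [r"\brm\s+-rf\b", r"push\s+--force", r"\bDROP\s+TABLE\b"]

def classify(cmd: str, codex=None) -> str:
    """Fast-path static rules; ambiguous commands go to the classifier if one
    is available. `codex` is a stand-in callable, not a real API."""
    if cmd.strip() in ALLOWLIST:
        return "safe"
    if any(re.search(p, cmd, re.IGNORECASE) for p in BLOCK_PATTERNS):
        return "block"
    if codex is not None:
        return codex(cmd)  # returns safe / warn / block; cached per session
    # Static fallback when the classifier is unreachable: surface a warning,
    # never block solely on classifier unavailability (assumed behavior).
    return "warn"

print(classify("git status"))               # safe
print(classify("rm -rf /tmp/x"))            # block
print(classify("curl example.com"))         # warn
```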
Without Codex installed: design challenge skipped, tier gate skipped, command gate falls back to static allowlist. Cavekit works the same; Codex makes it harder to ship bad specs and bad code.
Autonomous runtime internals
/ck:make is an autonomous loop. A Claude Code Stop hook drives the session iteration by iteration until every task is complete or a budget trips a circuit breaker.
- `hooks/stop-hook.sh` — state-machine driver. Fires on every Stop event, routes the next prompt, returns `{"decision":"block"}` so the session continues.
- `hooks/token-monitor.sh` — PostToolUse budget guard. Warns at 80% of the per-task budget, halts at 100%.
- `hooks/tool-cache.js` / `tool-cache-store.js` — 120s TTL cache for read-only commands (`git status`, `ls`, `Read`, `Grep`, `Glob`).
- `hooks/test-output-filter.js` — condenses test output around failures.
- `hooks/auto-backprop.js` — on test failure, writes a flag file; the next iteration prepends a trace directive.
- `hooks/progress-tracker.js` — zero-stdout snapshot writer for `/ck:status --watch`.
- `scripts/cavekit-tools.cjs` — orchestration engine: state machine, heartbeat lock, token ledger, task registry, routing, capability discovery, checkpoints, artifact summaries.
- `scripts/cavekit-router.cjs` — model-tier router. Scores tasks across five axes, maps each to haiku/sonnet/opus within its role's band, demotes under budget pressure.
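The stop hook's contract can be summarized in a few lines: keep returning a block decision until the completion sentinel appears or no tasks remain. This is a Python sketch of that behavior under stated assumptions, not the shell script itself:

```python
import json

SENTINEL = "<promise>CAVEKIT COMPLETE</promise>"

def stop_hook(transcript_tail: str, tasks_remaining: int) -> str:
    """Decide whether the autonomous loop continues.
    Emitting {"decision": "block"} keeps the Claude Code session iterating;
    empty output lets it stop. The sentinel must sit on its own line."""
    done = any(line.strip() == SENTINEL for line in transcript_tail.splitlines())
    if done or tasks_remaining == 0:
        return ""  # allow the session to end
    return json.dumps({"decision": "block", "reason": "tasks remaining; continue loop"})

print(stop_hook("still working\n", tasks_remaining=5))
```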
Runtime state under <project>/.cavekit/:
.cavekit/
├── config.json
├── state.md
├── .loop.json
├── .loop.lock
├── token-ledger.json
├── task-status.json
├── capabilities.json
├── .progress.json
├── .auto-backprop-pending.json
├── history/backprop-log.md
└── tool-cache/
Agents end the loop cleanly by emitting `<promise>CAVEKIT COMPLETE</promise>` on its own line. Debug with `CAVEKIT_DEBUG=1`. Recover with `/ck:resume`.
| Phase | Runtime call |
|---|---|
| `/ck:init` | `cavekit-tools init` + discover; seeds .cavekit/ and .gitignore |
| `/ck:sketch` | dispatches ck:complexity per kit to auto-fill `complexity:` |
| `/ck:map` | writes .cavekit/tasks.json + init-registry + cavekit-router |
| `/ck:make` | setup-build.sh calls setup-loop; stop-hook drives waves |
| `/ck:check` | dispatches ck:verifier for goal-backward check |
| `/ck:review` | two-pass review; fix-cycle emits fix tasks back into the loop |
| `/ck:revise` | routes each manual fix through the single-failure trace |
| `/ck:status` | prints live runtime status; `--watch` tails snapshots |
| `/ck:resume` | steals stale locks, validates state, re-enters the loop |
| `/ck:config` | surfaces runtime keys alongside preset controls |
Opt in per repo with /ck:init. Commands fall back to the pre-3.0 path when .cavekit/ is absent.
File structure
context/ # Project artifacts (persist across cycles)
├── kits/
│ ├── cavekit-overview.md
│ └── cavekit-{domain}.md
├── designs/
│ ├── DESIGN.md
│ └── design-changelog.md
├── plans/
│ └── build-site.md
├── impl/
│ ├── impl-{domain}.md
│ ├── impl-review-findings.md
│ └── loop-log.md
└── refs/
.cavekit/ # Runtime state (machine-managed)
├── config.json
├── state.md
├── .loop.json
├── .loop.lock
├── token-ledger.json
├── task-status.json
├── capabilities.json
├── .progress.json
├── .auto-backprop-pending.json
├── history/backprop-log.md
└── tool-cache/
Skills
| Skill | What it covers |
|---|---|
| Methodology | Core Hunt lifecycle |
| Design System | Create and maintain DESIGN.md |
| UI Craft | Component patterns, animation, accessibility, review checklist |
| Cavekit Writing | Write kits agents can consume |
| Peer Review | Six review modes + Codex Loop Mode |
| Validation-First Design | Every requirement must be verifiable |
| Context Architecture | Progressive disclosure for agent context |
| Revision | Trace bugs upstream to kits (includes automated backprop) |
| Convergence Monitoring | Detect when iterations plateau |
| Impl Tracking | Living records of build progress |
| Brownfield Adoption | Add Cavekit to existing codebases |
| Speculative Pipeline | Overlap phases for faster builds |
| Prompt Pipeline | Design the prompts driving each phase |
| Documentation Inversion | Docs for agents, not just humans |
| Karpathy Guardrails | Think-before-code, simplicity, surgical changes |
| Autonomous Loop | State machine, sentinels, lock protocol |
| Caveman | Token-compressed output (~75% savings) |
Methodology
Cavekit applies the scientific method to AI-generated code. LLMs are non-deterministic. Software engineering does not have to be.
| Concept | Role |
|---|---|
| Kits | The hypothesis — what you expect the software to do |
| Validation gates | Controlled conditions — build, tests, acceptance criteria |
| Convergence loops | Repeated trials — iterate until stable |
| Implementation tracking | Lab notebook — what was tried, what worked, what failed |
| Revision | Update the hypothesis — trace bugs back to kits |
Ships with specialized agents (including design-reviewer for UI validation against DESIGN.md), a multi-agent research system, and 21 skills. With Codex, operates as a dual-model architecture — Claude builds, Codex reviews — catching errors single-model self-review cannot.
The spec is the product. The code is a derivative.
When the spec is clear, the code follows. When the code is wrong, the spec tells you why.
Two models disagreeing is a signal. Two models agreeing is confidence.
If cavekit save you mass debug time — leave star.
Three tools. One philosophy: agent do more with less.
| Repo | What | One-liner |
|---|---|---|
| caveman | Output compression skill | why use many token when few do trick — ~75% fewer output tokens across Claude Code, Cursor, Gemini, Codex |
| cavemem | Cross-agent persistent memory | why agent forget when agent can remember — compressed SQLite + MCP, local by default |
| cavekit (you are here) | Spec-driven autonomous build loop | why agent guess when agent can know — natural language → kits → parallel build → verified |
They compose: cavekit orchestrates the build, caveman compresses what the agent says (bundled, on by default for build/inspect phases), cavemem compresses what the agent remembers across sessions. Install one, some, or all — each stands alone.
- Revu — local-first macOS study app with FSRS spaced repetition. revu.cards
MIT