4 changes: 4 additions & 0 deletions bunfig.toml
@@ -0,0 +1,4 @@
preload = ["./src/load-env.ts"]

[test]
preload = ["./src/load-env.ts"]
35 changes: 17 additions & 18 deletions docs/architecture.md
@@ -2,13 +2,13 @@

## Pipeline Overview

Evalbuff follows a plan → carve → evaluate → refactor loop:
Evalbuff follows a plan → carve → baseline → gated improvement round:

1. **Plan** — `planFeatures()` in `src/carve-features.ts` uses a Codex agent to scan the target repo and identify 15–25 discrete features that can be cleanly removed.
2. **Carve** — `carveFeature()` creates an isolated git worktree, runs a Codex agent to remove the feature, and captures the resulting diff and file operations.
3. **Evaluate** — `runAgentOnCarve()` in `src/eval-runner.ts` clones the repo, applies the carve, copies current docs, runs a coding agent to rebuild the feature, then hands the result to `judgeTaskResult()` in `src/judge.ts`.
4. **Write docs** — `runDocsWriterAgent()` in `src/docs-writer.ts` collects judge suggestions and runs a Claude agent in a temp clone to edit `docs/`, `AGENTS.md`, and `CLAUDE.md`.
5. **Repeat** — Steps 3–4 loop N times. Each loop also re-judges the baseline diffs with current docs to separate judge recalibration from real agent improvement.
3. **Baseline** — `runAgentOnCarve()` in `src/eval-runner.ts` clones the repo, applies the carve, copies current docs, runs a coding agent to rebuild the feature, then hands the result to `judgeTaskResult()` in `src/judge.ts`.
4. **Gate docs changes** — during the improvement round, every feature is re-run sequentially. The judge and coding agent both suggest independent docs changes. `planDocsChangesForTask()` in `src/docs-writer.ts` reads the docs once, rejects overfit/low-value suggestions, and creates one independent committed docs candidate per surviving suggestion. Evalbuff then materializes each candidate patch onto the current docs state, re-judges the originating task, and optionally re-runs the coding agent before accepting it.
5. **Baseline rejudge** — after the improvement round, Evalbuff re-judges the baseline diffs with the updated docs to separate judge recalibration from real agent improvement.
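
The gating in step 4 can be sketched roughly as below. This is a minimal illustration, not Evalbuff's actual code: `DocsCandidate`, `GateResult`, `gateCandidates`, the `rejudge` callback, and the "accept only if the score improves" rule are all assumptions. The text above only states that each candidate is materialized onto the current docs state and the originating task is re-judged before acceptance.

```typescript
// Hypothetical shapes; the real types in src/docs-writer.ts may differ.
interface DocsCandidate {
  id: string;
  patch: string; // docs-only patch produced by the planning pass
}

interface GateResult {
  accepted: string[];
  rejected: string[];
  docs: string[]; // patches applied so far, in acceptance order
}

// Stand-in for re-judging the originating task against a trial docs state.
type Rejudge = (docs: string[]) => number;

// Try each candidate on top of the *current* docs state, one at a time.
// Assumed acceptance rule: keep a candidate only if the score improves.
function gateCandidates(
  baselineScore: number,
  candidates: DocsCandidate[],
  rejudge: Rejudge,
): GateResult {
  const result: GateResult = { accepted: [], rejected: [], docs: [] };
  let currentScore = baselineScore;
  for (const candidate of candidates) {
    const trialDocs = [...result.docs, candidate.patch];
    const score = rejudge(trialDocs);
    if (score > currentScore) {
      result.docs = trialDocs;
      result.accepted.push(candidate.id);
      currentScore = score;
    } else {
      result.rejected.push(candidate.id);
    }
  }
  return result;
}
```

Because accepted patches accumulate in `docs`, a later candidate is always evaluated against everything accepted before it, which is the reason the round has to run sequentially.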

## Key Modules

@@ -19,7 +19,7 @@ Evalbuff follows a plan → carve → evaluate → refactor loop:
| `src/eval-helpers.ts` | Git/docs utilities — carve ops, docs sync, diff capture, ground-truth computation |
| `src/carve-features.ts` | Feature identification and extraction via Codex agents in git worktrees |
| `src/judge.ts` | Codex-based reviewer that scores agent output with E2E testing |
| `src/docs-writer.ts` | Holistic docs editing agent + judge suggestion collector |
| `src/docs-writer.ts` | Coding-agent suggestion parsing + per-task docs-change planning/materialization |
| `src/perfect-feature.ts` | Single-feature iterative optimizer (rebuild → judge → diagnose → update docs) |
| `src/report.ts` | Persists round results and generates `summary.json` + `report.md` |
| `src/trace-compressor.ts` | Extracts large tool outputs from traces into content-addressed sidecar files |
@@ -36,14 +36,13 @@ Target repo
↓ carveFeature() → CarvedFeature[]
↓ [saved as features.json]
For each round:
↓ runAgentOnCarve() → TaskResult (per feature, in parallel)
Baseline round:
↓ runAgentOnCarve() → TaskResult (per feature, sequentially)
↓ saveRoundResults() → round-N/ directory
↓ For each improvement loop:
↓ collectDocSuggestions() → text
↓ runDocsWriterAgent() → edits docs in target repo
↓ Improvement round:
↓ runEvalRound() → new scores
↓ gateDocsChangesForTask() → per-feature accepted/rejected doc candidates
↓ runBaselineRejudgeRound() → re-scored baseline
↓ saveSummary() → summary.json + report.md
@@ -67,14 +66,14 @@ Most workflows (eval, docs writer, judging) operate in temporary clones, not the

### Docs Refactor Pattern

`runDocsWriterAgent()` in `src/docs-writer.ts` builds a holistic prompt, not a task-specific checklist. The prompt tells the agent to:
1. Read all current docs (`docs/`, `AGENTS.md`, `CLAUDE.md`).
2. Generalize judge feedback into reusable project patterns — avoid feature-specific examples.
3. Verify every referenced symbol/path with grep before documenting it.
4. Restrict `AGENTS.md` changes to doc-index maintenance or factual corrections.
5. Sync docs back only after a successful run.
`planDocsChangesForTask()` in `src/docs-writer.ts` does one planning pass per feature. The prompt tells the agent to:
1. Read all current docs (`docs/`, `AGENTS.md`, `CLAUDE.md`) once.
2. Reject suggestions that are overfit, already covered, low-priority, or not grounded in the current code.
3. For each surviving suggestion, create one independent docs-only commit on its own branch from the same baseline docs commit.
4. Reset back to the baseline docs commit before preparing the next candidate so branches stay independent.
5. Write a manifest explaining which suggestions were accepted or rejected and why.

When building similar doc-editing agents, follow the same holistic approach: read first, generalize, verify, then write.
When building similar doc-editing agents, favor one read-heavy planning pass that emits independently replayable doc changes instead of rereading the full docs corpus for every candidate.
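
As a rough illustration of the manifest described in step 5, the sketch below records one entry per suggestion, with accepted entries pointing at their own branch off a shared baseline commit. Every name here (`TriagedSuggestion`, `ManifestEntry`, `buildManifest`, the `docs-candidate/` branch prefix, and the reason strings) is hypothetical; the actual manifest format in `src/docs-writer.ts` is not shown in this document.

```typescript
// Hypothetical rejection categories mirroring step 2 of the prompt.
type RejectReason = "overfit" | "already-covered" | "low-priority" | "not-grounded";

interface TriagedSuggestion {
  id: string;
  text: string;
  rejectReason?: RejectReason; // unset means the suggestion survived triage
}

interface ManifestEntry {
  suggestionId: string;
  status: "accepted" | "rejected";
  reason: string;
  branch?: string; // only for accepted suggestions
  baseCommit?: string; // every candidate branches from the same commit
}

// Build one manifest entry per suggestion. Accepted entries all share
// baseCommit, which is what keeps the candidate branches independent.
function buildManifest(
  baseCommit: string,
  suggestions: TriagedSuggestion[],
): ManifestEntry[] {
  return suggestions.map((s): ManifestEntry => {
    if (s.rejectReason !== undefined) {
      return { suggestionId: s.id, status: "rejected", reason: s.rejectReason };
    }
    return {
      suggestionId: s.id,
      status: "accepted",
      reason: "grounded and not yet covered",
      branch: `docs-candidate/${s.id}`, // assumed naming scheme
      baseCommit,
    };
  });
}
```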

## Orchestration Patterns

@@ -96,7 +95,7 @@ When modifying the orchestration (new `EvalbuffOptions` fields, new phases, new

## Concurrency

Eval rounds use bounded concurrency: `opts.parallelism` workers pull from a shared queue. Each worker runs a full clone → carve → agent → judge cycle independently.
Carving uses a fixed internal worker pool in `src/run-evalbuff.ts` to speed up feature extraction, while baseline evaluation, docs gating, and baseline rejudging stay sequential. The sequential improvement round still matters because accepted docs changes from one feature should affect the very next feature.
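
A fixed-size worker pool of the kind described here can be sketched as follows. This is a generic illustration under assumptions, not the code in `src/run-evalbuff.ts`: the `runPool` name and signature are hypothetical, and the pool size is passed in rather than read from a constant.

```typescript
// Run `worker` over `items` with at most `poolSize` calls in flight.
// Results are returned in input order regardless of completion order.
async function runPool<T, R>(
  items: T[],
  poolSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item (shared queue cursor)

  async function runWorker(): Promise<void> {
    // Claiming an index is synchronous, so two workers never take the same item.
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const workerCount = Math.max(1, Math.min(poolSize, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => runWorker()));
  return results;
}
```

Calling the same helper with a pool size of 1 degenerates to sequential execution, which is one way to keep a single code path for both the parallel carve phase and the sequential rounds.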

## Events and TUI

34 changes: 26 additions & 8 deletions docs/cli.md
@@ -3,18 +3,17 @@
## Main Pipeline

```bash
bun run src/run-evalbuff.ts \
bun run src/run-evalbuff.ts \
--repo /path/to/repo \
[--n 20] \
[--parallelism 10] \
[--loops 3] \
[--init-command "npm install"] \
[--coding-model sonnet] \
[--docs-model opus] \
[--cached-features /path/to/features.json]
[--cached-features /path/to/features.json] \
[--output-dir /path/to/output]
```

All flags are parsed explicitly in the `import.meta.main` block. Required flags must be validated with helpful errors. The `--cached-features` flag skips planning/carving and loads pre-carved features directly.
All flags are parsed explicitly in the `import.meta.main` block. Required flags must be validated with helpful errors. The `--cached-features` flag skips planning/carving and loads pre-carved features directly. The `--output-dir` flag overrides the default artifact location (`<repo>/.evalbuff`). Evalbuff now always runs a single sequential improvement round after baseline, and carve concurrency is an internal fixed constant rather than a public flag.
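
The explicit parse-and-validate pattern might look roughly like the sketch below. It is an assumption-laden illustration rather than the real `import.meta.main` block: `RunOptions`, `parseRunArgs`, and the error wording are hypothetical, and only a subset of the flags listed above is handled. The `<repo>/.evalbuff` default for `--output-dir` comes from the description above.

```typescript
// Hypothetical options type covering a subset of the documented flags.
interface RunOptions {
  repoPath: string;
  outputDir: string;
  cachedFeatures?: string;
  initCommand?: string;
}

// Parse "--flag value" pairs and fail early with a usage hint.
function parseRunArgs(argv: string[]): RunOptions {
  const flags = new Map<string, string>();
  for (let i = 0; i < argv.length; i += 2) {
    const name = argv[i];
    const value = argv[i + 1];
    if (!name.startsWith("--") || value === undefined) {
      throw new Error(`Expected "--flag value" but got "${name}"`);
    }
    flags.set(name.slice(2), value);
  }

  const repoPath = flags.get("repo");
  if (!repoPath) {
    throw new Error("Missing required flag: --repo /path/to/repo");
  }

  return {
    repoPath,
    // Default artifact location is <repo>/.evalbuff unless overridden.
    outputDir: flags.get("output-dir") ?? `${repoPath}/.evalbuff`,
    cachedFeatures: flags.get("cached-features"),
    initCommand: flags.get("init-command"),
  };
}
```

Returning a typed options object makes it harder to parse a flag and then forget to thread it into the runtime path, since an unused field is easy to spot in review.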

## Perfect Feature (Single-Feature Optimizer)

@@ -28,7 +27,8 @@ bun run src/perfect-feature.ts \
[--judge-model opus] \
[--analyzer-model opus] \
[--docs-model opus] \
[--init-command "npm install"]
[--init-command "npm install"] \
[--output-dir /path/to/output]
```

Iteratively rebuilds a single feature: rebuild → judge → diagnose → update docs → repeat until 10/10 or max rounds.
@@ -42,6 +42,24 @@ bun run src/trace-compressor.ts --restore <compressed> # Restore

Options: `--output`, `--sidecar-dir`, `--threshold <bytes>`, `--format auto|jsonl|text`, `--summarize heuristic|claude|none`. Supports stdin/stdout with `-`.

## E2E Benchmark Repo Setup

```bash
bun run setup:e2e-repos
bun run setup:e2e-repos -- --repo mock-simple
bun run setup:e2e-repos -- --root /tmp/evalbuff-test-repos --force
```

Creates deterministic local benchmark repos under `test-repos/` by default:
- `mock-simple` — generated locally for fast/mock E2E coverage
- `codebuff` — pinned checkout of `CodebuffAI/codebuff`
- `manifold` — pinned checkout of `manifoldmarkets/manifold`, plus a local fixture commit that renames `docs/` to `external-docs/`

Flags:
- `--root <path>` chooses the target directory
- `--repo <id>` limits setup to specific repo ids and may be repeated
- `--force` rebuilds fixture directories that already exist

## TUI Dashboard

```bash
@@ -55,7 +73,7 @@ bun run tui -- --repo /path/to/repo # Start a live run with TUI attac

**Navigation**: `Enter` drills into detail screens, `Esc` goes back, `q` quits. Arrow keys and `j`/`k` navigate lists.

**Run discovery**: On macOS, run directories may appear under both `os.tmpdir()` (which resolves through `/private/var/...`) and `/tmp`. Discovery logic must scan both locations to find all runs.
**Run discovery**: The TUI scans `.evalbuff/` in the current working directory (the default output location) as well as legacy temp locations (`os.tmpdir()` and `/tmp` on macOS) to find all runs.
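
One way to assemble the list of directories to scan is sketched below. The `runSearchRoots` helper and its parameters are hypothetical, and the real TUI discovery code may order or filter locations differently; the sketch only shows the default `.evalbuff/` location plus the legacy temp locations, de-duplicated.

```typescript
// Return the directories to scan for runs, without duplicates.
// Inputs are passed in explicitly so the function stays pure and testable.
function runSearchRoots(cwd: string, tmpDir: string, platform: string): string[] {
  const roots = [`${cwd}/.evalbuff`, tmpDir];
  if (platform === "darwin") {
    // On macOS the temp directory usually lives under /private/var/...,
    // so /tmp has to be scanned as a separate legacy location.
    roots.push("/tmp");
  }
  return Array.from(new Set(roots));
}
```

In a real caller the arguments would presumably be `process.cwd()`, `os.tmpdir()`, and `process.platform`.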

## CLI Conventions

@@ -65,7 +83,7 @@ For any new CLI command:
2. **Validate required flags** and print helpful error messages for missing ones. Exit early with usage text rather than failing deep in the pipeline.
3. **Add a `scripts` entry** in `package.json`.
4. **Keep the CLI contract consistent** between the file header usage comment, the flag parser, the options type, and the `package.json` script entry.
5. **Log non-default options** in startup output when they affect behavior (e.g., model overrides, parallelism).
5. **Log non-default options** in startup output when they affect behavior (e.g., model overrides).
6. **Thread every flag** through the options type into the runtime path — never parse a flag and ignore it.

### New Command Checklist
15 changes: 15 additions & 0 deletions docs/eval-helpers.md
@@ -42,6 +42,21 @@

Both helpers use `getErrorObject()` from `src/vendor/error.ts` for logging init-command failures.

## Worktree Isolation Pattern (Carve Features)

`carveFeature()` in `src/carve-features.ts` uses git worktrees instead of full clones to isolate each carve operation. The lifecycle:

1. **Create**: `git worktree add -b "<branchName>" "<worktreePath>" HEAD` — creates a new worktree checked out at the current HEAD on a temporary branch. The worktree path is constructed inline as `${repoPath}-carve-${candidate.id}` and the branch as `evalbuff-carve-${candidate.id}-${Date.now()}`.
2. **Run**: The Codex agent operates inside the worktree directory, making changes to remove the feature.
3. **Capture**: Diff and file operations are captured from the worktree before cleanup.
4. **Cleanup** (in a `finally` block):
- `git worktree remove --force "<worktreePath>"`
- `git branch -D "<branchName>"`

**Why worktrees over clones**: Worktrees share the parent repo's object store rather than duplicating it. This avoids network I/O, object copying, and disk duplication — important when carving runs a parallel worker pool (`CARVE_PARALLELISM`) and each worker needs an isolated checkout. The tradeoff is that worktrees are coupled to the parent repo (deleting the parent breaks them), but carve worktrees are ephemeral and cleaned up in the same function call.

**Cleanup safety**: Both the `worktree remove` and `branch -D` commands run inside a `finally` block so worktrees and branches are cleaned up even when the carve agent fails. Tests should verify no leaked worktrees after carving by asserting `git worktree list` shows exactly one entry (the main working tree).
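
The create, run, cleanup lifecycle can be sketched as a small wrapper. This is an illustration under assumptions, not the body of `carveFeature()`: the `withCarveWorktree` name and the injected `runGit` callback are hypothetical, and placing `worktree add` outside the `try` is a choice made for this sketch. The path and branch templates and the git arguments are taken from the lifecycle described above.

```typescript
// Injected git runner so the lifecycle can be exercised without a real repo.
type RunGit = (args: string[]) => void;

// Create a carve worktree, run `body` inside it, and always clean up.
async function withCarveWorktree<T>(
  repoPath: string,
  candidateId: string,
  runGit: RunGit,
  body: (worktreePath: string) => Promise<T>,
): Promise<T> {
  const worktreePath = `${repoPath}-carve-${candidateId}`;
  const branchName = `evalbuff-carve-${candidateId}-${Date.now()}`;

  runGit(["worktree", "add", "-b", branchName, worktreePath, "HEAD"]);
  try {
    // Awaiting here matters: cleanup must not start before the agent finishes.
    return await body(worktreePath);
  } finally {
    // Runs on success and on failure, so no worktree or branch is leaked.
    runGit(["worktree", "remove", "--force", worktreePath]);
    runGit(["branch", "-D", branchName]);
  }
}
```

In practice `runGit` could wrap `execFileSync("git", args, { cwd: repoPath })` from `node:child_process`; injecting it is only a convenience for this sketch and does not replace the real-temp-repo tests the docs call for.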

## Testing Helpers

See `docs/testing.md` section "Helper-Contract Tests" for the required test patterns when modifying or extending these helpers. Key rule: always test against real temp git repos, not mocked filesystem calls.