3 changes: 3 additions & 0 deletions .gitignore
@@ -27,3 +27,6 @@ snapshots/
.vscode/
*.swp
*.swo

# Local-only design specs and implementation plans (never committed)
docs/superpowers/
6 changes: 3 additions & 3 deletions AGENTS.md
@@ -75,7 +75,7 @@ The `evolution/<tier>/` directories form **a clean layering**: `evolution/core/`
1. CLI resolves `--skill <name>` to a `SKILL.md` via the `SkillSource` walk.
2. Eval dataset is built (synthetic LM gen / golden file / sessiondb mining).
3. Skill body wrapped as `dspy.Module`. **Saturation pre-flight** (`evolution/core/saturation_check.py`) scores the baseline on the holdout + closed-loop suite, classifies into one of four bands, and aborts (or prompts) on non-`healthy` bands — `--no-saturation-check` to skip, `--force-saturation-check` to override the default-deny in non-interactive contexts. Then GEPA optimizes the candidate with `BudgetAwareProposer` injecting a char budget into the reflection prompt.
4. Knee-point Pareto selection walks the candidates within ε of the best valset score in `--knee-point-strategy` order. Default `val-best`: highest val first, smallest body as tiebreak. `smallest` (greedy parsimony) is available via the flag for users explicitly chasing compression.
4. Candidate selection happens per `--knee-point-strategy`. Default `val-best` defers to GEPA's val-argmax (`detailed_results.best_idx`) — empirical calibration showed the ε-band walker picked GEPA's default on every run, so the val-best path skips the band entirely. `--knee-point-strategy smallest` walks the ε-band in ascending body-char order (greedy parsimony) for users explicitly chasing compression.
5. Static constraints + paired-bootstrap growth-quality gate decide deploy vs. reject; both outcomes write `gate_decision.json`. The default rule is `no_regression` (`mean >= 0`); `--quality-gate non-inferiority` switches to `lower_bound > -inferiority_tolerance` (recommended for compression-focused runs at small N where the bootstrap CI swamps tiny effects). The post-GEPA holdout eval reuses the baseline scores from the pre-flight, so net cost stays ~zero when the pre-flight ran.
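The two deploy rules in step 5 can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the percentile-bootstrap details, and the `0.02` default tolerance are assumptions; only the two comparisons (`mean >= 0` and `lower_bound > -inferiority_tolerance`) come from the text above.

```python
import random
from statistics import mean


def paired_bootstrap(baseline, evolved, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean_diff, lower_bound) for evolved - baseline on paired examples.

    Hypothetical helper: a plain percentile bootstrap over per-example deltas.
    """
    if len(baseline) != len(evolved) or not baseline:
        raise ValueError("baseline and evolved must be non-empty and equal length")
    rng = random.Random(seed)
    diffs = [e - b for b, e in zip(baseline, evolved)]
    n = len(diffs)
    # Resample the paired deltas with replacement and collect the means.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower_bound = resampled[int((alpha / 2) * n_resamples)]
    return mean(diffs), lower_bound


def gate_passes(mean_diff, lower_bound, rule="no_regression", inferiority_tolerance=0.02):
    """Apply one of the two rules described above (tolerance default is assumed)."""
    if rule == "no_regression":
        return mean_diff >= 0
    if rule == "non_inferiority":
        return lower_bound > -inferiority_tolerance
    raise ValueError(f"unknown rule: {rule}")
```

At small N the lower bound can sit well below zero even when the mean delta is tiny, which is why the non-inferiority rule compares against a tolerance instead of zero.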

## What lives where
@@ -210,7 +210,7 @@ Per-run dir: `output/<skill>/<YYYYMMDD_HHMMSS>/`. Contents vary by outcome:
| File | When | Purpose |
|---|---|---|
| `run.log` | always | Every LM call (start, end, heartbeats), every retry |
| `gate_decision.json` | always | Structured deploy decision (schema_version `"4"`) |
| `gate_decision.json` | always | Structured deploy decision (schema_version `"5"`) |
| `evolved_skill.md` | deploy only | New SKILL.md ready to ship |
| `baseline_skill.md` | deploy only | Original (for diffing) |
| `metrics.json` | deploy only | Top-level run summary |
@@ -230,7 +230,7 @@ Per-run dir: `output/<skill>/<YYYYMMDD_HHMMSS>/`. Contents vary by outcome:
- **`max_tokens=16000` on dataset gen LM** — load-bearing. At `eval_dataset_size>=60` the JSON output truncates mid-string with anything lower; the current default `eval_dataset_size=150` makes this even more critical. Locked by `TestSyntheticGeneratorLMConfig`.
- **`eval_dataset_size=150` is the current default**, sized for a ~53-example holdout that's tight enough on the bootstrap CI to detect ±2% effects. Per-skill calibration runs at smaller N are no longer authoritative.
- **Reflection LM `request_timeout=300, num_retries=2`** (vs `=5` for judge) — deliberate fast-fail. A reflection-LM `TimeoutError` triggers MIPROv2 fallback rather than burning more time on a stuck call.
- **Knee-point reads `optimized_module.detailed_results`** — only present when GEPA succeeded (and `track_stats=True`). MIPROv2 fallback path skips knee-point cleanly. `gate_decision.json.knee_point.applied=false` with `reason="no_detailed_results"` is the signal.
- **Candidate selection reads `optimized_module.detailed_results`** — only present when GEPA succeeded (and `track_stats=True`). The val-best path reads `details.candidates[details.best_idx]` directly; the `smallest` path additionally consumes `val_aggregate_scores` to walk the ε-band. MIPROv2 fallback path skips both cleanly. `gate_decision.json.knee_point.applied=false` with `reason="no_detailed_results"` is the signal.
- **`SkillModule.TaskWithSkill` docstring is a placeholder** — `__init__` overwrites the signature instructions per-instance via `with_instructions(skill_text)`. Don't rely on the class-level docstring.
- **`reassemble_skill` strips a leading `---` block** — defensive against the reflection LM mimicking YAML frontmatter (would otherwise produce a double-frontmatter file). Logged at WARNING when it fires; see if the prompt needs tightening.
- **Test uses both `~/.hermes/skills/` and `~/.hermes/hermes-agent/skills/`** — `external_importers._load_skill_text` (standalone CLI only) reads the former; `HermesSkillSource` (the optimizer's path) reads the latter. Same prefix, different paths.
46 changes: 36 additions & 10 deletions README.md
@@ -6,7 +6,7 @@

Agent Self-Evolution evolves and optimizes agent skills, tool descriptions, system prompts, and code — producing measurably better versions through reflective evolutionary search. Built on DSPy + GEPA (Genetic-Pareto Prompt Evolution), with extra safeguards on top so what ships is reliably better than the original.

**No GPU training required.** Everything operates via API calls — mutating text, evaluating results, and selecting the best variants. ~$2-10 per optimization run.
**No GPU training required.** Everything operates via API calls — mutating text, evaluating results, and selecting the best variants. ~$1-5 per optimization run.

Works on any agent framework that emits `SKILL.md` markdown files. [Hermes Agent](https://github.com/NousResearch/hermes-agent) skills are the original target; Claude Code skills (and any other agent's `<dir>/<skill>/SKILL.md` layout) are also supported via a pluggable skill-source abstraction.

@@ -32,9 +32,8 @@ GEPA reads execution traces to understand *why* things fail (not just that they

GEPA was designed against benchmarks with hundreds of validation examples per task. Skill evolution typically has 20-60 examples, which is small enough that picking the highest-scoring candidate often picks one that won by chance — there's a real risk of shipping a "winner" that just got lucky on the eval set.

This framework adds three checks on top of GEPA so the candidate that ships is one that genuinely improved the skill:
This framework adds two checks on top of GEPA so the candidate that ships is one that genuinely improved the skill:

- **Knee-point selection** — instead of strictly the highest-scoring candidate, looks at every candidate close to the top score and prefers shorter ones. Filters out wins that came from a single lucky example.
- **Held-out deploy check** — before a candidate ships, it's compared against the baseline on examples it never saw during optimization. Several rules available, including a lenient one that's appropriate for compression-style refactors.
- **Three-dimensional scoring** — instead of pass/fail, the LLM judge rates each output on correctness, whether it followed the right procedure, and how concise it is. GEPA's reflection step uses these as feedback to guide the next mutation.

@@ -211,9 +210,21 @@ The chosen profile is recorded in `gate_decision.json` so any deployed variant c

Each profile also selects a reflection-prompt proposer template. `compression` tells the LM to cut redundancy under a tight char budget; `growth` tells it to add only what the failure feedback explicitly identifies as missing; `balanced` (the default) is direction-agnostic — it asks the LM to fix the failures without prescribing cuts or additions, and uses a soft "stay near N characters, ±20%" budget. All three share the same anti-hallucination guardrails: every change must ground in a specific feedback phrase, and empty feedback returns the instruction unchanged.

### Ship the evolved skill back to source
### Tune GEPA's search behavior

By default, the evolved skill lands in `output/<skill>/<timestamp>/evolved_skill.md` and stops there. Two opt-in flags automate the next step:
A few knobs control how aggressively GEPA explores the candidate space and how the deployed candidate is picked from the final population. Defaults are tuned for the typical 20-60-example skill-evolution regime; reach for these on calibration runs or when the saturation pre-flight flags a degenerate signal.

| Flag | Default | What it does |
|---|---|---|
| `--gepa-acceptance` | `improvement-or-equal` | Whether GEPA accepts plateau-equal candidates (`improvement-or-equal`) or only strictly-better ones (`strict-improvement`). The default allows more lateral exploration; the strict mode is the legacy `gepa<0.1.2` behavior. |
| `--gepa-minibatch-size` | `3` | Training examples sampled per reflective step. Bump to ~8 when saturation pre-flight flags `weak_signal` so discriminating examples appear more often in the minibatch. Larger minibatches consume more metric budget per accepted proposal — pair with `--budget heavy`. |
| `--knee-point-strategy` | `val-best` | How to pick the deployed candidate from GEPA's output. `val-best` defers to GEPA's val-argmax. `smallest` walks every candidate within ε of the top val score and picks the shortest body, trading val score for parsimony on compression-mode runs. |
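The two `--knee-point-strategy` values can be pictured with a short sketch. The `Candidate` type, the function name, and the `epsilon=0.02` default are hypothetical and chosen only for illustration; the behavior shown (val-argmax versus shortest body within ε of the top val score) follows the table above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    body: str         # candidate skill text
    val_score: float  # aggregate validation score


def select_candidate(candidates, strategy="val-best", epsilon=0.02):
    """Return the index of the candidate to deploy (illustrative only)."""
    if not candidates:
        raise ValueError("no candidates to select from")
    best_idx = max(range(len(candidates)), key=lambda i: candidates[i].val_score)
    if strategy == "val-best":
        # Defer to the val-argmax; no band is computed.
        return best_idx
    if strategy == "smallest":
        # Keep every candidate within epsilon of the top score, then take the
        # shortest body (greedy parsimony).
        top = candidates[best_idx].val_score
        band = [i for i, c in enumerate(candidates) if c.val_score >= top - epsilon]
        return min(band, key=lambda i: len(candidates[i].body))
    raise ValueError(f"unknown strategy: {strategy}")
```

With `smallest`, a candidate that scores slightly below the top can win purely on length, which is the intended trade on compression-mode runs.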

### Shipping the evolved artifact

By default, the evolved artifact lands in `output/<artifact>/<timestamp>/` and stops there. Three opt-in flags automate the next step. They are independent and can be combined or used alone; all three are no-ops on a reject decision (with a stderr notice).

#### `--apply` / `--patch`: local file delivery

```bash
# Copy evolved_skill.md over the source SKILL.md in place on a deploy decision.
@@ -224,7 +235,26 @@ uv run python -m evolution.skills.evolve_skill --skill X --apply
uv run python -m evolution.skills.evolve_skill --skill X --patch | git apply
```

Both flags are no-ops on a reject decision (with a stderr notice). `--apply` also skips with a warning when the source path is under Claude Code's plugin cache (read-only by design).
`--apply` skips with a warning when the source path is under Claude Code's plugin cache (read-only by design). `--patch` is the review-by-hand path: it prints the diff and never touches the source.

#### `--create-pr`: open a draft PR against the source repo

```bash
uv run python -m evolution.skills.evolve_skill --skill X \
--create-pr --pr-draft
```

Branches the source repo from `origin/<pr-base-branch>` (default `main`), commits the evolved artifact via atomic write, pushes, and opens a GitHub PR via `gh` with a structured body. Off by default; intended for personal-use direct-push workflows against a repo you own.

| Flag | Default | Purpose |
|---|---|---|
| `--create-pr` / `--no-create-pr` | off | Toggle PR creation. |
| `--pr-base-branch` | `main` | Target branch for the PR. |
| `--pr-branch-prefix` | `evolve/` | Head branch becomes `{prefix}{artifact}-{timestamp}-{hex}`. |
| `--pr-draft` | off | Open as draft (recommended for a human review gate). |
| `--pr-allow-dirty` | off | Override the default refusal when the source tree has uncommitted changes. |

Skips cleanly when the source isn't git-backed (e.g. the Claude Code plugin cache). **Do not pair with campaign loops** — every accepted run opens its own PR, so a 10-skill sweep is 10 PRs to review.
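As a rough sketch of the head-branch pattern `{prefix}{artifact}-{timestamp}-{hex}` from the table above: the timestamp format and the hex-suffix length below are assumptions (the source does not state them), and `make_branch_name` is a hypothetical helper, not the project's function.

```python
import secrets
from datetime import datetime, timezone


def make_branch_name(artifact, prefix="evolve/", now=None, hex_suffix=None):
    """Build a head-branch name following {prefix}{artifact}-{timestamp}-{hex}."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y%m%d_%H%M%S")    # assumed format
    suffix = hex_suffix or secrets.token_hex(3)  # assumed 6 hex characters
    return f"{prefix}{artifact}-{timestamp}-{suffix}"
```

The random suffix keeps two runs against the same artifact within the same second from colliding on a branch name.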

### Safety knobs

@@ -318,10 +348,6 @@ Every evolved variant must pass:
4. **Semantic preservation** — Must not drift from original purpose
5. **PR review** — All changes go through human review, never direct commit

### Automated PR opening (opt-in)

`--create-pr` branches the source repo, commits the evolved artifact, pushes, and opens a GitHub PR via `gh` on a deploy decision. Off by default; intended for personal-use direct-push workflows against a repo you own. Pair with `--pr-draft` for a human review gate, and `--pr-base-branch`/`--pr-branch-prefix` to control where the PR lands. The default refuses to run against a dirty source tree (escape hatch: `--pr-allow-dirty`) and against non-git-backed sources like the Claude Code plugin cache. **Do not pair with campaign loops** — every accepted run opens its own PR, so a 10-skill sweep is 10 PRs to review.

## Full Plan

See [PLAN.md](PLAN.md) for the complete architecture, evaluation data strategy, constraints, benchmarks integration, and phased timeline.
2 changes: 1 addition & 1 deletion docs/codebase_info.md
@@ -111,7 +111,7 @@ evolution/
| `evolution/core/behavioral_example.py` | ~35 | builder for behavioral dspy.Examples |
| **Total** | **~9,000** | excludes empty `__init__.py` shims |

Test suite: 55 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1076 tests** collected.
Test suite: 61 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1166 tests** collected.

## Runtime dependencies

33 changes: 29 additions & 4 deletions docs/data_models.md
@@ -323,7 +323,7 @@ class SaturationReport:

Only the target tool's `description` is changed; every other tool's `description`, `inputSchema`, and any `_evolution_metadata` block are preserved verbatim. With `--apply`, the source manifest file is rewritten in place with the same preservation guarantees. With `--patch`, a unified diff of (baseline → evolved) manifest JSON is written to stdout.

## gate_decision.json (schema_version "4")
## gate_decision.json (schema_version "5")

The structured deploy-gate decision, written to `output/<skill>/<timestamp>/gate_decision.json` on every run regardless of outcome. The schema is the **calibration substrate** — `tests/skills/test_evolve_skill_validation_flow.py:TestGrowthGateDecisionSchema` locks the field list so future calibration scripts (`jq -s '...' output/*/*/gate_decision.json`) don't break.

@@ -333,7 +333,7 @@ Written when any `validate_static` check fails on the evolved artifact (short-ci

```json
{
"schema_version": "4",
"schema_version": "5",
"decision": "reject",
"reason": "static_constraint_failure",
"failed_constraints": ["non_empty"],
@@ -367,7 +367,7 @@ Written when `--max-total-cost-usd` is set and cumulative LM cost exceeds the ce

```json
{
"schema_version": "4",
"schema_version": "5",
"decision": "aborted",
"reason": "cost_ceiling_exceeded",
"cost_ceiling_usd": 0.50,
@@ -399,7 +399,7 @@ Written when `--max-total-cost-usd` is set and cumulative LM cost exceeds the ce

```json
{
"schema_version": "4",
"schema_version": "5",
"decision": "deploy", // or "reject"
"reason": "passed", // or "growth_quality_gate"
"decision_rule_used": "dual_check", // or "no_regression_only" | "non_inferiority"
@@ -530,6 +530,31 @@ Runs of `evolution.tools.evolve_tool` write the same schema with four extra top-
| `dataset.sessiondb_drops` | `dict[str, int]` | Per-reason drop counts across the two pipeline stages. Importer keys: `short_task`, `slash_command`, `secret`, `no_tool_calls`, `non_manifest`. Judge keys: `judge_irrelevant`, `judge_error`, `noisy_middle`, `low_confidence`, `unknown_correct_tool`. Judge keys are absent when zero candidates reached the judge stage. |
| `dataset.dropped_non_manifest_count` | `int` | Pulled out of `sessiondb_drops["non_manifest"]` as a top-level int so calibration scripts don't have to know the inner key set. Counts session invocations of tools that exist in the historical session but not in the current manifest under evolution. |

### Schema v5 additions

v5 adds always-present `decision_signal` and `pr_created` fields, plus a closed-loop-primary field group that is present only when the deploy gate was decided on closed-loop signal rather than the synthetic holdout.

| Field | Type | Notes |
|---|---|---|
| `decision_signal` | `"synthetic" \| "closed_loop"` | Always present. Which signal the deploy gate actually decided on. `"closed_loop"` lands when the run executed CL-primary scoring (closed-loop tasks gained ≥ `cl_required_gain` AND synthetic non-inferiority held); `"synthetic"` otherwise. Calibration scripts should branch on this before interpreting `bootstrap` vs `cl_tasks_gained`. |
| `pr_created` | `dict` | Always present. Shape-stable across `--create-pr` on/off and across success/failure. Keys: `status` (`"created" \| "skipped" \| "failed" \| "disabled"`), `reason` (`str \| None`), `branch` (`str \| None`), `commit_sha` (`str \| None`), `url` (`str \| None`). `"disabled"` is the default when `--create-pr` is off. |
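A calibration script might consume the two always-present v5 fields roughly as below. This is a sketch under assumptions: `load_decisions`, `summarize`, and `disabled_pr_created` are hypothetical names, the glob mirrors the `output/<skill>/<timestamp>/` layout, and the all-`None` values for the `"disabled"` case are assumed, not confirmed by the schema table.

```python
import json
from collections import Counter
from pathlib import Path


def disabled_pr_created():
    """Assumed shape of pr_created when --create-pr is off (None values are a guess)."""
    return {"status": "disabled", "reason": None, "branch": None,
            "commit_sha": None, "url": None}


def load_decisions(root="output"):
    """Read every gate_decision.json under <root>/<skill>/<timestamp>/."""
    return [json.loads(p.read_text()) for p in Path(root).glob("*/*/gate_decision.json")]


def summarize(decisions):
    """Count runs per decision_signal and per pr_created status."""
    signals = Counter(d["decision_signal"] for d in decisions)
    pr_statuses = Counter(d["pr_created"]["status"] for d in decisions)
    return signals, pr_statuses
```

Because both fields are always present, the counters need no `.get()` fallbacks on v5 files; older schema versions would need a guard.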

#### Closed-loop-primary fields (`decision_signal == "closed_loop"`)

Written by `evolution/core/quality_gate.py::append_cl_decision_fields` when the gate decision is taken on closed-loop signal.

| Field | Type | Notes |
|---|---|---|
| `baseline_closed_loop_per_example` | `list[float]` | Cached per-task closed-loop scores for the baseline artifact (0.0/1.0 per task). |
| `evolved_closed_loop_per_example` | `list[float]` | Per-task closed-loop scores for the evolved artifact (0.0/1.0 per task). Same length and task order as `baseline_closed_loop_per_example`. |
| `evolved_closed_loop_errored_tasks` | `list` | Task identifiers (or empty) for closed-loop evaluations that errored rather than scored. Empty list is the common case. |
| `cl_tasks_gained` | `int` | `int(sum(evolved)) - int(sum(baseline))` — the net delta of tasks passing closed-loop. The CL-primary gate requires this to meet `cl_required_gain`. |
| `cl_required_gain` | `int` | The CL-primary threshold the run had to clear, computed from `growth_pct` via the CL-primary slope/free-threshold constants. At least `1` for any non-zero growth. |
| `synthetic_sanity_check` | `dict` | The non-inferiority guard that runs alongside CL-primary. Keys: `tolerance` (float), `baseline_mean` (float), `evolved_mean` (float), `passed` (bool — `(evolved - baseline) >= -tolerance`). |
| `evolved_cl_eval_cost_usd` | `float` | LM cost in USD attributable to the evolved closed-loop evaluation pass — surfaces the CL-primary path's incremental spend. |
| `band_trigger_score` | `dict` | Pre-flight scores that decided whether CL-primary fired. Keys: `holdout` (`float \| None`), `closed_loop` (`float \| None`). |
| `validator_agent_model` | `str` | The LiteLLM model id used for the closed-loop validator agent. Recorded so historical decisions stay analysable if the default changes. |
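The arithmetic behind `cl_tasks_gained` and `synthetic_sanity_check.passed` can be sketched as below. The function names are hypothetical and this is not the body of `append_cl_decision_fields`; the two formulas and the "gain met AND non-inferiority held" condition are taken from the tables above.

```python
def cl_tasks_gained(baseline_per_example, evolved_per_example):
    """Net delta of tasks passing closed-loop (scores are 0.0 or 1.0 per task)."""
    if len(baseline_per_example) != len(evolved_per_example):
        raise ValueError("per-example lists must have the same length and task order")
    return int(sum(evolved_per_example)) - int(sum(baseline_per_example))


def synthetic_sanity_check(baseline_mean, evolved_mean, tolerance):
    """Non-inferiority guard that runs alongside the CL-primary decision."""
    return {
        "tolerance": tolerance,
        "baseline_mean": baseline_mean,
        "evolved_mean": evolved_mean,
        "passed": (evolved_mean - baseline_mean) >= -tolerance,
    }


def cl_primary_passes(baseline_cl, evolved_cl, cl_required_gain, sanity):
    """CL-primary holds when the gain threshold is met and the sanity check passed."""
    return cl_tasks_gained(baseline_cl, evolved_cl) >= cl_required_gain and sanity["passed"]
```

For example, a baseline passing 1 of 3 closed-loop tasks and an evolved artifact passing all 3 gives a gain of 2, which clears a required gain of 1 as long as the synthetic mean did not drop by more than the tolerance.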

## metrics.json (deploy-only summary)

Written to `output/<skill>/<timestamp>/metrics.json` only on deploy. Top-level summary for quick scanning: