From d09c5f201a18a11dbdfc5390b785106e56899159 Mon Sep 17 00:00:00 2001 From: Justin Ramos Date: Mon, 25 May 2026 14:00:57 -0600 Subject: [PATCH 1/3] docs(readme): refresh for May 23-25 feature ship; consolidate deployment section Reflect the recent shipping arc: PR automation, GEPA acceptance default flip, knee-point selector becoming opt-in, and updated per-run cost range. Collapse the two overlapping deployment subsections into a single "Shipping the evolved artifact" section with separate --apply/--patch and --create-pr subsections so the orthogonal delivery paths are discoverable side-by-side. Add a "Tune GEPA's search behavior" table so --gepa-acceptance, --gepa-minibatch-size, and --knee-point-strategy are documented in user-facing prose for the first time. --- README.md | 46 ++++++++++++++++++++++++++++++++++++---------- 1 file changed, 36 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index e4672c8..3961d13 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ Agent Self-Evolution evolves and optimizes agent skills, tool descriptions, system prompts, and code — producing measurably better versions through reflective evolutionary search. Built on DSPy + GEPA (Genetic-Pareto Prompt Evolution), with extra safeguards on top so what ships is reliably better than the original. -**No GPU training required.** Everything operates via API calls — mutating text, evaluating results, and selecting the best variants. ~$2-10 per optimization run. +**No GPU training required.** Everything operates via API calls — mutating text, evaluating results, and selecting the best variants. ~$1-5 per optimization run. Works on any agent framework that emits `SKILL.md` markdown files. [Hermes Agent](https://github.com/NousResearch/hermes-agent) skills are the original target; Claude Code skills (and any other agent's `//SKILL.md` layout) are also supported via a pluggable skill-source abstraction. 
@@ -32,9 +32,8 @@ GEPA reads execution traces to understand *why* things fail (not just that they GEPA was designed against benchmarks with hundreds of validation examples per task. Skill evolution typically has 20-60 examples, which is small enough that picking the highest-scoring candidate often picks one that won by chance — there's a real risk of shipping a "winner" that just got lucky on the eval set. -This framework adds three checks on top of GEPA so the candidate that ships is one that genuinely improved the skill: +This framework adds two checks on top of GEPA so the candidate that ships is one that genuinely improved the skill: -- **Knee-point selection** — instead of strictly the highest-scoring candidate, looks at every candidate close to the top score and prefers shorter ones. Filters out wins that came from a single lucky example. - **Held-out deploy check** — before a candidate ships, it's compared against the baseline on examples it never saw during optimization. Several rules available, including a lenient one that's appropriate for compression-style refactors. - **Three-dimensional scoring** — instead of pass/fail, the LLM judge rates each output on correctness, whether it followed the right procedure, and how concise it is. GEPA's reflection step uses these as feedback to guide the next mutation. @@ -211,9 +210,21 @@ The chosen profile is recorded in `gate_decision.json` so any deployed variant c Each profile also selects a reflection-prompt proposer template. `compression` tells the LM to cut redundancy under a tight char budget; `growth` tells it to add only what the failure feedback explicitly identifies as missing; `balanced` (the default) is direction-agnostic — it asks the LM to fix the failures without prescribing cuts or additions, and uses a soft "stay near N characters, ±20%" budget. 
All three share the same anti-hallucination guardrails: every change must ground in a specific feedback phrase, and empty feedback returns the instruction unchanged. -### Ship the evolved skill back to source +### Tune GEPA's search behavior -By default, the evolved skill lands in `output///evolved_skill.md` and stops there. Two opt-in flags automate the next step: +A few knobs control how aggressively GEPA explores the candidate space and how the deployed candidate is picked from the final population. Defaults are tuned for the typical 20-60-example skill-evolution regime; reach for these on calibration runs or when the saturation pre-flight flags a degenerate signal. + +| Flag | Default | What it does | +|---|---|---| +| `--gepa-acceptance` | `improvement-or-equal` | Whether GEPA accepts plateau-equal candidates (`improvement-or-equal`) or only strictly-better ones (`strict-improvement`). The default allows more lateral exploration; the strict mode is the legacy `gepa<0.1.2` behavior. | +| `--gepa-minibatch-size` | `3` | Training examples sampled per reflective step. Bump to ~8 when saturation pre-flight flags `weak_signal` so discriminating examples appear more often in the minibatch. Larger minibatches consume more metric budget per accepted proposal — pair with `--budget heavy`. | +| `--knee-point-strategy` | `val-best` | How to pick the deployed candidate from GEPA's output. `val-best` defers to GEPA's val-argmax. `smallest` walks every candidate within ε of the top val score and picks the shortest body, trading val score for parsimony on compression-mode runs. | + +### Shipping the evolved artifact + +By default, the evolved artifact lands in `output///` and stops there. Three opt-in flags automate the next step. They are independent and can be combined or used alone; all three are no-ops on a reject decision (with a stderr notice). 
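The gating described in the paragraph above can be sketched roughly as follows. This is a minimal illustration, assuming the run ends with a `decision` string such as `deploy` or `reject`; the names `DeliveryFlags` and `planned_deliveries` are hypothetical and are not the framework's API.

```python
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class DeliveryFlags:
    """Hypothetical container for the three opt-in delivery flags."""
    apply: bool = False
    patch: bool = False
    create_pr: bool = False


def planned_deliveries(decision: str, flags: DeliveryFlags) -> List[str]:
    """Return the delivery steps that would run for this decision.

    Assumption: each flag is evaluated independently, and every requested
    step is skipped (with a stderr notice) unless the decision is "deploy".
    """
    requested = [
        name
        for name, enabled in (
            ("apply", flags.apply),
            ("patch", flags.patch),
            ("create-pr", flags.create_pr),
        )
        if enabled
    ]
    if decision != "deploy":
        if requested:
            skipped = ", ".join(requested)
            print(f"decision={decision}: skipping {skipped}", file=sys.stderr)
        return []
    return requested
```

Under these assumptions, a reject decision yields an empty plan no matter which flags were passed, while a deploy decision runs exactly the steps that were opted into.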
+ +#### `--apply` / `--patch`: local file delivery ```bash # Copy evolved_skill.md over the source SKILL.md in place on a deploy decision. @@ -224,7 +235,26 @@ uv run python -m evolution.skills.evolve_skill --skill X --apply uv run python -m evolution.skills.evolve_skill --skill X --patch | git apply ``` -Both flags are no-ops on a reject decision (with a stderr notice). `--apply` also skips with a warning when the source path is under Claude Code's plugin cache (read-only by design). +`--apply` skips with a warning when the source path is under Claude Code's plugin cache (read-only by design). `--patch` is the review-by-hand path: it prints the diff and never touches the source. + +#### `--create-pr`: open a draft PR against the source repo + +```bash +uv run python -m evolution.skills.evolve_skill --skill X \ + --create-pr --pr-draft +``` + +Branches the source repo from `origin/` (default `main`), commits the evolved artifact via atomic write, pushes, and opens a GitHub PR via `gh` with a structured body. Off by default; intended for personal-use direct-push workflows against a repo you own. + +| Flag | Default | Purpose | +|---|---|---| +| `--create-pr` / `--no-create-pr` | off | Toggle PR creation. | +| `--pr-base-branch` | `main` | Target branch for the PR. | +| `--pr-branch-prefix` | `evolve/` | Head branch becomes `{prefix}{artifact}-{timestamp}-{hex}`. | +| `--pr-draft` | off | Open as draft (recommended for a human review gate). | +| `--pr-allow-dirty` | off | Override the default refusal when the source tree has uncommitted changes. | + +Skips cleanly when the source isn't git-backed (e.g. the Claude Code plugin cache). **Do not pair with campaign loops** — every accepted run opens its own PR, so a 10-skill sweep is 10 PRs to review. ### Safety knobs @@ -318,10 +348,6 @@ Every evolved variant must pass: 4. **Semantic preservation** — Must not drift from original purpose 5. 
**PR review** — All changes go through human review, never direct commit -### Automated PR opening (opt-in) - -`--create-pr` branches the source repo, commits the evolved artifact, pushes, and opens a GitHub PR via `gh` on a deploy decision. Off by default; intended for personal-use direct-push workflows against a repo you own. Pair with `--pr-draft` for a human review gate, and `--pr-base-branch`/`--pr-branch-prefix` to control where the PR lands. The default refuses to run against a dirty source tree (escape hatch: `--pr-allow-dirty`) and against non-git-backed sources like the Claude Code plugin cache. **Do not pair with campaign loops** — every accepted run opens its own PR, so a 10-skill sweep is 10 PRs to review. - ## Full Plan See [PLAN.md](PLAN.md) for the complete architecture, evaluation data strategy, constraints, benchmarks integration, and phased timeline. From 8bf0e3328ce15947636d567dcfeb06f1288de429 Mon Sep 17 00:00:00 2001 From: Justin Ramos Date: Mon, 25 May 2026 14:16:51 -0600 Subject: [PATCH 2/3] docs(refs): refresh dependencies, model_resolution, review_notes, interfaces for current state --- AGENTS.md | 4 +-- docs/dependencies.md | 21 +++++++++++- docs/index.md | 1 - docs/interfaces.md | 24 ++++++++++--- docs/model_resolution.md | 2 ++ docs/review_notes.md | 74 ---------------------------------------- 6 files changed, 43 insertions(+), 83 deletions(-) delete mode 100644 docs/review_notes.md diff --git a/AGENTS.md b/AGENTS.md index ff85cf6..661512e 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -75,7 +75,7 @@ The `evolution//` directories form **a clean layering**: `evolution/core/` 1. CLI resolves `--skill ` to a `SKILL.md` via the `SkillSource` walk. 2. Eval dataset is built (synthetic LM gen / golden file / sessiondb mining). 3. Skill body wrapped as `dspy.Module`. 
**Saturation pre-flight** (`evolution/core/saturation_check.py`) scores the baseline on the holdout + closed-loop suite, classifies into one of four bands, and aborts (or prompts) on non-`healthy` bands — `--no-saturation-check` to skip, `--force-saturation-check` to override the default-deny in non-interactive contexts. Then GEPA optimizes the candidate with `BudgetAwareProposer` injecting a char budget into the reflection prompt. -4. Knee-point Pareto selection walks the candidates within ε of the best valset score in `--knee-point-strategy` order. Default `val-best`: highest val first, smallest body as tiebreak. `smallest` (greedy parsimony) is available via the flag for users explicitly chasing compression. +4. Candidate selection happens per `--knee-point-strategy`. Default `val-best` defers to GEPA's val-argmax (`detailed_results.best_idx`) — empirical calibration showed the ε-band walker picked GEPA's default on every run, so the val-best path skips the band entirely. `--knee-point-strategy smallest` walks the ε-band in ascending body-char order (greedy parsimony) for users explicitly chasing compression. 5. Static constraints + paired-bootstrap growth-quality gate decide deploy vs. reject; both outcomes write `gate_decision.json`. The default rule is `no_regression` (`mean >= 0`); `--quality-gate non-inferiority` switches to `lower_bound > -inferiority_tolerance` (recommended for compression-focused runs at small N where the bootstrap CI swamps tiny effects). The post-GEPA holdout eval reuses the baseline scores from the pre-flight, so net cost stays ~zero when the pre-flight ran. ## What lives where @@ -230,7 +230,7 @@ Per-run dir: `output///`. Contents vary by outcome: - **`max_tokens=16000` on dataset gen LM** — load-bearing. At `eval_dataset_size>=60` the JSON output truncates mid-string with anything lower; the current default `eval_dataset_size=150` makes this even more critical. Locked by `TestSyntheticGeneratorLMConfig`. 
- **`eval_dataset_size=150` is the current default**, sized for a ~53-example holdout that's tight enough on the bootstrap CI to detect ±2% effects. Per-skill calibration runs at smaller N are no longer authoritative. - **Reflection LM `request_timeout=300, num_retries=2`** (vs `=5` for judge) — deliberate fast-fail. A reflection-LM `TimeoutError` triggers MIPROv2 fallback rather than burning more time on a stuck call. -- **Knee-point reads `optimized_module.detailed_results`** — only present when GEPA succeeded (and `track_stats=True`). MIPROv2 fallback path skips knee-point cleanly. `gate_decision.json.knee_point.applied=false` with `reason="no_detailed_results"` is the signal. +- **Candidate selection reads `optimized_module.detailed_results`** — only present when GEPA succeeded (and `track_stats=True`). The val-best path reads `details.candidates[details.best_idx]` directly; the `smallest` path additionally consumes `val_aggregate_scores` to walk the ε-band. MIPROv2 fallback path skips both cleanly. `gate_decision.json.knee_point.applied=false` with `reason="no_detailed_results"` is the signal. - **`SkillModule.TaskWithSkill` docstring is a placeholder** — `__init__` overwrites the signature instructions per-instance via `with_instructions(skill_text)`. Don't rely on the class-level docstring. - **`reassemble_skill` strips a leading `---` block** — defensive against the reflection LM mimicking YAML frontmatter (would otherwise produce a double-frontmatter file). Logged at WARNING when it fires; see if the prompt needs tightening. - **Test uses both `~/.hermes/skills/` and `~/.hermes/hermes-agent/skills/`** — `external_importers._load_skill_text` (standalone CLI only) reads the former; `HermesSkillSource` (the optimizer's path) reads the latter. Same prefix, different paths. 
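The two `--knee-point-strategy` modes described in the AGENTS.md notes above can be sketched as follows. This is a minimal illustration under stated assumptions: candidates are plain strings with one aggregate validation score each, and `select_candidate` with its parameters is a hypothetical helper, not the framework's implementation. Only the behavior comes from the text: `val-best` defers to the arg-max index, and `smallest` picks the shortest body within ε of the top validation score, with ε defaulting to `1/n_val`.

```python
from typing import Optional, Sequence


def select_candidate(
    candidates: Sequence[str],
    val_scores: Sequence[float],
    best_idx: int,
    n_val: int,
    strategy: str = "val-best",
    epsilon: Optional[float] = None,
) -> int:
    """Return the index of the candidate to deploy (hypothetical helper)."""
    if not candidates or len(candidates) != len(val_scores):
        raise ValueError("candidates and val_scores must be non-empty and aligned")

    if strategy == "val-best":
        # Defer to the optimizer's own arg-max; the epsilon band is ignored.
        return best_idx

    if strategy == "smallest":
        # Default epsilon mirrors the documented 1/n_val.
        eps = (1.0 / n_val) if epsilon is None else epsilon
        top = max(val_scores)
        band = [i for i, score in enumerate(val_scores) if score >= top - eps]
        # Shortest body in the band wins; ties fall to the earliest index
        # (an assumed tiebreak, not taken from the source).
        return min(band, key=lambda i: len(candidates[i]))

    raise ValueError(f"unknown strategy: {strategy!r}")
```

With a wider ε the band admits lower-scoring candidates, which is why `smallest` can trade validation score for a shorter body while `val-best` never does.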
diff --git a/docs/dependencies.md b/docs/dependencies.md index a02fa63..8563621 100644 --- a/docs/dependencies.md +++ b/docs/dependencies.md @@ -2,11 +2,13 @@ External packages the framework depends on, what each is used for, and how they're constrained. +Last verified against `pyproject.toml` + `uv.lock` on 2026-05-25. + ## Hard runtime dependencies ### `dspy>=3.2.0,<3.3` -The optimization engine. Used pervasively: +The optimization engine. DSPy 3.2.0 is the floor; in practice the lock pins it at 3.2.0 because that's the version whose `gepa[dspy]==0.0.27` transitive constraint is what forced the gepa override below. Used pervasively: | Module | Usage | |---|---| @@ -25,6 +27,23 @@ The optimization engine. Used pervasively: **Why pinned to `<3.3`:** `BaseCallback` lives at `dspy.utils.callback.BaseCallback` and is not in `dspy.__all__`. A 3.3 minor bump could move or rename it. Same for the `DspyAdapter` / `propose_new_texts` interaction in `gepa/api.py:317-321` that forces our use of `instruction_proposer` instead of `reflection_prompt_template`. +### `gepa @ git+https://github.com/gepa-ai/gepa.git@5e24ee5c8e1857a62a1ba19731de9da45ffb6f1b` + +GEPA itself is pinned to an exact git SHA — the merge commit for the upstream `acceptance_criterion` API. The framework's `--gepa-acceptance` flag forwards `acceptance_criterion="improvement_or_equal"` to `dspy.GEPA(..., gepa_kwargs={...})`; the latest released `gepa==0.1.1` on PyPI predates that merge and doesn't accept the kwarg. + +**Why a git pin instead of a version range:** the version that ships with the merged API (`0.1.2`) hadn't been published at the time the framework adopted it. When `0.1.2` (or newer) lands on PyPI, swap the dep to `gepa>=0.1.2,<0.2` and drop the `[tool.uv]` override below. + +**`[tool.uv] override-dependencies`:** DSPy 3.2.0 hard-pins `gepa[dspy]==0.0.27` as a transitive constraint. 
Without the override, `uv sync` would resolve to that pre-`acceptance_criterion` build and the `improvement-or-equal` default would error at GEPA construction time. The override block forces every resolver path — direct dep, transitive via `dspy[optuna]`, transitive via anything else — onto the same git SHA: + +```toml +[tool.uv] +override-dependencies = [ + "gepa @ git+https://github.com/gepa-ai/gepa.git@5e24ee5c8e1857a62a1ba19731de9da45ffb6f1b", +] +``` + +This is uv-specific (PEP 735 doesn't standardize overrides). Users installing with pip pick up the direct git dep but not the override — if they also install `dspy[optuna]`, pip's resolver will report a conflict. `uv sync` is the supported install path. + ### `litellm>=1.82.0,<2.0` The provider-agnostic LLM client that DSPy wraps. Used directly only in `evolution/core/lm_timing_callback.py`: diff --git a/docs/index.md b/docs/index.md index 06b5c3a..ce075a0 100644 --- a/docs/index.md +++ b/docs/index.md @@ -56,7 +56,6 @@ The codebase is mid-sized (~9K LOC of source + 55 test files / ~1076 tests) and | [`workflows.md`](workflows.md) | Step-by-step workflows with mermaid sequence diagrams: skill deploy path, reject paths, GEPA→MIPROv2 fallback, sessiondb mining, tool evolution, closed-loop validation, closed-loop signal during evolution | | [`dependencies.md`](dependencies.md) | Each external package — what it's used for, why it's pinned, what we don't depend on | | [`framework_advantages.md`](framework_advantages.md) | User-facing explainer of how this framework's selection layer, deploy gate, proposer, and composite fitness differ from raw DSPy + GEPA — and when raw GEPA is the right choice | -| [`review_notes.md`](review_notes.md) | Consistency + completeness gaps found during this docs pass | ## Documents elsewhere worth knowing about diff --git a/docs/interfaces.md b/docs/interfaces.md index 20ad548..56faadd 100644 --- a/docs/interfaces.md +++ b/docs/interfaces.md @@ -18,6 +18,8 @@ The primary user-facing 
interface. | `--iterations ` | `10` | DEPRECATED. Maps `1→light`, `2→medium`, `3→heavy`; anything else collapses to `light`. | | `--no-fallback` | off | Re-raise GEPA exceptions instead of falling back to MIPROv2. Debug only. | | `--seed ` | `42` | RNG seed for dataset shuffles + DSPy optimizer. | +| `--gepa-minibatch-size ` | `3` | GEPA's reflective minibatch size. Default matches GEPA's own default. Bump to ~8 when the saturation pre-flight flags `weak_signal` (wider sampling window makes discriminating examples appear in ~68% of minibatches vs ~34% at default). Aborts at startup if value exceeds trainset size. | +| `--gepa-acceptance {strict-improvement,improvement-or-equal}` | `improvement-or-equal` | GEPA's acceptance criterion under the `sum(minibatch_scores)` gate. `improvement-or-equal` (default) allows plateau-equal candidates through — the literature-recommended fix for noisy LM-judge fitness where strict acceptance rejects ~50% of true-equal mutations. `strict-improvement` is the legacy `gepa<0.1.2` default; pass it only to reproduce strict-acceptance behavior for comparison runs. Mapped to GEPA's underlying `acceptance_criterion` kwarg with hyphens converted to underscores. | ### Models | Flag | Default | Notes | @@ -45,8 +47,8 @@ The primary user-facing interface. | `--max-absolute-chars ` | (preset) | Override absolute char ceiling. | | `--bootstrap-confidence ` | `0.90` | Two-sided CI confidence for the holdout improvement bootstrap. | | `--bootstrap-resamples ` | `2000` | Bootstrap iterations. | -| `--knee-point-epsilon ` | `1/n_val` | ε for knee-point Pareto band. Override only with calibrated reason. | -| `--knee-point-strategy {val-best,smallest}` | `val-best` | Within the ε-band, which candidate to pick. `val-best` (default): highest val score wins, smallest body as tiebreak. `smallest`: greedy parsimony — picks the smallest body in the band regardless of val cost; available for users explicitly chasing compression. 
| +| `--knee-point-epsilon ` | `1/n_val` | ε for knee-point Pareto band. Only consulted by `--knee-point-strategy smallest`; the default `val-best` path defers to GEPA's val-argmax and ignores ε. Override only with calibrated reason. | +| `--knee-point-strategy {val-best,smallest}` | `val-best` | How to pick the deployed candidate from GEPA's output. `val-best` (default): defer to GEPA's `detailed_results.best_idx` — empirical calibration showed the ε-band walker picked GEPA's default on every observed run, so this path skips the band entirely. `smallest`: walk the ε-band in ascending body-char order (greedy parsimony) for users explicitly chasing compression even at val cost. | | `--fitness-profile {balanced,compression,growth}` | `balanced` | Composite fitness weighting profile for the LLM judge. `balanced` (0.5/0.3/0.2 for correctness/procedure/conciseness) is general-purpose. `compression` (0.4/0.2/0.4) upweights conciseness for shrink-direction work. `growth` (0.6/0.4/0.0) drops conciseness so the optimizer doesn't punish necessary additions. Also selects the `BudgetAwareProposer` template: `compression` → compression-mode (cut redundancy under a tight budget), `balanced` → balanced-mode (direction-agnostic, soft ±20% target), `growth` → growth-mode (add only what feedback identifies as missing). Both the profile and the resolved proposer mode are recorded in `gate_decision.json`. | ### Proposer @@ -60,8 +62,13 @@ The primary user-facing interface. |---|---|---| | `--apply` | off | On a deploy decision, copy `evolved_skill.md` over the source `SKILL.md` in place. No git operations — leaves workflow to the user. No-op (with warning) when the skill source is read-only (Claude Code plugin cache under `~/.claude/plugins/cache`). | | `--patch` | off | On a deploy decision, emit a unified diff of (baseline → evolved) to stdout, labelled with the source path. Pipe to `patch`, `git apply`, or a code-review tool. 
| +| `--create-pr / --no-create-pr` | off | On a deploy decision, branch the source repo, atomically copy the evolved artifact in, commit, push, and open a GitHub PR via `gh pr create`. Skips cleanly when the source isn't git-backed (e.g. Claude Code plugin cache). Skips when the working tree is dirty unless `--pr-allow-dirty` is also set. Requires `gh` on `$PATH`. | +| `--pr-base-branch ` | `main` | Target branch for the PR opened by `--create-pr`. The PR's head branch is created from `origin/`. | +| `--pr-branch-prefix ` | `evolve/` | Prefix for the PR's head branch under `--create-pr`. Branch names become `{prefix}{artifact}-{timestamp}-{hex}`. | +| `--pr-draft` | off | Open the `--create-pr` PR as a draft. Recommended for personal automation pipelines that want a human review gate before merge. | +| `--pr-allow-dirty` | off | Override `--create-pr`'s dirty-tree refusal. Default behavior skips PR creation when the source repo has uncommitted changes, to avoid sweeping unrelated edits into the evolution PR. | -Both delivery flags are no-ops on a reject decision and emit a one-line stderr notice in that case. Both default off; they only fire when the user opts in. +`--apply`, `--patch`, and `--create-pr` are all no-ops on a reject decision and emit a one-line stderr notice in that case. All three default off; they only fire when the user opts in. ### Misc | Flag | Default | Notes | @@ -71,7 +78,7 @@ Both delivery flags are no-ops on a reject decision and emit a one-line stderr n | `--max-total-cost-usd FLOAT` | off | Safety net: abort cleanly when cumulative LM cost exceeds this dollar amount. Worst-case overshoot is one LM call past the ceiling (the cost callback fires AFTER the call returns; the next call aborts at start). 0 is accepted (aborts on first call). Negatives rejected. Writes a `decision="aborted"` `gate_decision.json` with `cost_at_abort_usd`, `cost_ceiling_usd`, and the full `cost_summary` block. 
| | `--benchmark-cmd ""` | off | Deploy-gate hook: shell command run AFTER the framework's own deploy gate passes; nonzero exit flips the decision to `reject` with `reason="benchmark_failed"`. Receives `EVOLVED_PATH`, `BASELINE_PATH`, `RUN_DIR`, `TARGET_NAME`, `ARTIFACT_TYPE` via env. Runs under `/bin/sh -c`; aliases and shell functions from your interactive shell are not available. Trust boundary: the command string is yours; do not pass strings you didn't write. Adds a `benchmark` block to `gate_decision.json`. | | `--benchmark-timeout-seconds INT` | `600` | Wall-clock cap for the `--benchmark-cmd` hook. Timeout treated as a benchmark fail with `reason="timeout"`. | -| `--closed-loop-during-evolution ` | off | Wired symmetrically with `evolve_tool` for CLI consistency. Skill-side closed-loop validation requires a `SkillFileInstaller` that doesn't exist yet, so setting this flag raises with a clear error. | +| `--closed-loop-during-evolution ` | off | Path to a closed-loop JSONL task suite. The skill-side flow drives `hermes -z` against a temporary working copy of the resolved `SKILL.md` for each task; same `--closed-loop-mode`/`--closed-loop-in-valset` semantics as `evolve_tool`. The full closed-loop flag family (`--closed-loop-mode`, `--closed-loop-saturation-threshold`, `--closed-loop-min-iters`, `--closed-loop-window-size`, `--closed-loop-in-valset`, `--closed-loop-agent-model`, `--closed-loop-task-timeout-seconds`) is wired symmetrically with `evolve_tool`. | | `--no-saturation-check` | off | Skip the saturation pre-flight (`evolution/core/saturation_check.py`). By default, the framework scores the baseline on the holdout (and the closed-loop suite, if `--closed-loop-during-evolution` is set) BEFORE GEPA starts; non-`healthy` bands prompt for confirmation (interactive) or default-deny (non-interactive) with a `--force-saturation-check` override. Pass `--no-saturation-check` to skip the probe entirely. 
| | `--force-saturation-check` | off | Run the saturation pre-flight, render the panel, but proceed regardless of band. Required to override a non-`healthy` verdict in non-interactive contexts (no TTY on stdin). Without this in such a context, the framework exits cleanly without spending GEPA budget. | @@ -99,8 +106,15 @@ Evolves one tool's top-level `description` field inside an MCP-shape manifest. T | `--fitness-profile {compression,balanced,growth}` | `balanced` | Same composite-weighting profile as `evolve_skill`. Maps to `BudgetAwareToolProposer` mode via `resolve_proposer_mode`. | | `--quality-gate {strict,default,lenient,off,non-inferiority}` | `default` | Same preset semantics as `evolve_skill`. | | `--max-absolute-chars ` | preset value | Override the description's absolute-length ceiling. | +| `--gepa-minibatch-size ` | `3` | GEPA's reflective minibatch size; same meaning as the skill-path flag. Bump alongside `--iterations` when the saturation pre-flight flags `weak_signal`. Aborts at startup if value exceeds trainset size. | +| `--gepa-acceptance {strict-improvement,improvement-or-equal}` | `improvement-or-equal` | Same meaning as the skill-path flag. `improvement-or-equal` (default) lets plateau-equal candidates through GEPA's acceptance gate. `strict-improvement` is the legacy `gepa<0.1.2` default; pass it only to reproduce strict-acceptance behavior for comparison runs. | | `--apply` | off | Rewrite the source manifest file in place with the evolved description on a deploy decision. Preserves every non-target tool's description, `inputSchema`, and any `_evolution_metadata` block. No-op (with stderr notice) when the manifest is under `~/.claude/plugins/cache`. Mutually exclusive with `--patch`. | | `--patch` | off | Emit a unified diff of (baseline → evolved) manifest JSON to stdout. Mutually exclusive with `--apply`. 
| +| `--create-pr / --no-create-pr` | off | On a deploy decision, branch the source repo, atomically copy the evolved manifest in, commit, push, and open a GitHub PR via `gh pr create`. Skips cleanly when the source isn't git-backed. Skips when the working tree is dirty unless `--pr-allow-dirty` is also set. Requires `gh` on `$PATH`. | +| `--pr-base-branch ` | `main` | Target branch for the PR opened by `--create-pr`. | +| `--pr-branch-prefix ` | `evolve/` | Prefix for the PR's head branch. Branch names become `{prefix}{tool}-{timestamp}-{hex}`. | +| `--pr-draft` | off | Open the `--create-pr` PR as a draft. | +| `--pr-allow-dirty` | off | Override `--create-pr`'s dirty-tree refusal. | | `--seed ` | `42` | RNG seed for dataset splitting. | | `--eval-source {synthetic,sessiondb}` | `synthetic` | Where the eval dataset comes from. `synthetic` runs the three-bucket generator (50%/30%/20% target-correct / confusable-neighbor / regression-detection). `sessiondb` mines Hermes session JSON for `(task, invoked_tool)` pairs and re-judges them against the current manifest; misselections at judge confidence ≥0.85 become flipped-label training examples. Claude Code and Copilot logs aren't mined (no tool-call data). | | `--dry-run` | off | Build the eval dataset and stop. Useful for confirming sessiondb discovery before spending judge + GEPA budget. Returns `{"decision": "dry-run", "dataset_size": N}`. | @@ -119,7 +133,7 @@ Evolves one tool's top-level `description` field inside an MCP-shape manifest. T `main()` rejects `--closed-loop-during-evolution` without `--closed-loop-hermes-repo`, and rejects `--closed-loop-mode != feedback` without `--closed-loop-during-evolution`. Local imports keep the validation stack out of cold-path runs. -Both `--apply` and `--patch` are no-ops on a reject decision and emit a one-line stderr notice in that case. +`--apply`, `--patch`, and `--create-pr` are all no-ops on a reject decision and emit a one-line stderr notice in that case. 
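The `--gepa-acceptance` rows in the tables above describe two things: the CLI value is mapped to GEPA's `acceptance_criterion` kwarg by converting hyphens to underscores, and the criterion is evaluated on the sum of minibatch scores. A minimal sketch of both, assuming per-example scores are plain floats; the helpers `to_gepa_kwargs` and `accepts` are hypothetical illustrations and do not reproduce GEPA's internals.

```python
from typing import Dict, Sequence

# CLI spellings taken from the flag table.
_ALLOWED = ("strict-improvement", "improvement-or-equal")


def to_gepa_kwargs(cli_value: str) -> Dict[str, str]:
    """Map the CLI value to the kwarg dict forwarded to the optimizer."""
    if cli_value not in _ALLOWED:
        raise ValueError(f"unsupported --gepa-acceptance value: {cli_value!r}")
    return {"acceptance_criterion": cli_value.replace("-", "_")}


def accepts(
    parent_scores: Sequence[float],
    child_scores: Sequence[float],
    criterion: str,
) -> bool:
    """Decide whether a proposed child replaces its parent.

    Assumption: the gate compares summed minibatch scores, as the flag
    description states; everything else here is illustrative.
    """
    parent_total = sum(parent_scores)
    child_total = sum(child_scores)
    if criterion == "strict_improvement":
        return child_total > parent_total
    if criterion == "improvement_or_equal":
        return child_total >= parent_total
    raise ValueError(f"unknown criterion: {criterion!r}")
```

In this sketch a plateau-equal child (same summed score as its parent) passes only under `improvement_or_equal`, which matches the lateral-exploration behavior the default is meant to allow.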
 ### Exit conditions
 - `sys.exit(1)` if `--eval-source sessiondb` produces zero usable examples — the run.log includes a per-reason drop breakdown (importer + judge stages); the suggestion is to switch to `--eval-source synthetic`.
diff --git a/docs/model_resolution.md b/docs/model_resolution.md
index 1082348..a9dcade 100644
--- a/docs/model_resolution.md
+++ b/docs/model_resolution.md
@@ -2,6 +2,8 @@

 How `agent-self-evolution` decides which model to call for each LM role.

+Last verified against `evolution/core/hermes_provider.py`, `evolution/core/codex_lm.py`, and `evolution/core/nous_lm.py` on 2026-05-25. Unchanged behavior since the OpenAI Codex Responses API + Nous Portal OAuth + AWS Bedrock provider work landed.
+
 ## TL;DR

 If you have Hermes Agent configured (`~/.hermes/config.yaml` exists), the framework uses your Hermes-configured model and provider automatically — for the optimizer, reflection, eval, and judge roles. No env vars to set.
diff --git a/docs/review_notes.md b/docs/review_notes.md
deleted file mode 100644
index 31bde95..0000000
--- a/docs/review_notes.md
+++ /dev/null
@@ -1,74 +0,0 @@
-# Documentation Review Notes
-
-A consistency + completeness pass over the docs in this directory.
-
-## Verified accurate
-
-- Module / package layout matches `find evolution -type f -name "*.py"`. Four subpackages now have content: `core/`, `skills/`, `tools/`, `validation/`. `prompts/`, `code/`, `monitor/` remain empty stubs.
-- `EvolutionConfig` field defaults match `evolution/core/config.py`.
-- `gate_decision.json` schema_version `"4"` matches both payload-writer sites in `evolution/skills/evolve_skill.py` and the test fixtures in `tests/skills/test_evolve_skill_validation_flow.py`.
-- `ValidationReport` schema_version `"1"` matches `evolution/validation/validator.py` and the consumers in `evolution/core/closed_loop_feedback.py`.
-- `_HEARTBEAT_TIERS` table matches `evolution/core/lm_timing_callback.py`.
-- LM `request_timeout` / `num_retries` values per surface verified:
-  - judge LM (`fitness.py`): `request_timeout=60, num_retries=5`
-  - dataset gen LM (`dataset_builder.py`): `request_timeout=120, num_retries=5`
-  - reflection LM (`evolve_skill.py`): `request_timeout=300, num_retries=2`
-- Pinned dep ranges verified against `pyproject.toml`. Direct deps include `numpy>=1.24` and `pyyaml>=6.0`.
-- 681 tests collected (`pytest tests/ -q` inside venv). 37 test files spanning `tests/{core,skills,tools,validation}/`.
-- `generate_report.py` is a renderer that takes `--run output/// --prose reports/_prose.yaml --out reports/_validation_report.pdf`. Numbers come from the run dir's `gate_decision.json` + `metrics.json` + `run.log`; editorial prose + tables come from the YAML; the title-page logo is `assets/dna.png`.
-- Closed-loop integration: `--closed-loop-during-evolution`, `--closed-loop-hermes-repo`, `--closed-loop-mode {feedback,trainset,both}`, `--closed-loop-in-valset`, `--closed-loop-saturation-threshold`, `--closed-loop-min-iters`, `--closed-loop-window-size` all wired in `evolve_tool.main()`. Symmetric flag on `evolve_skill.main()` raises `UsageError` until a `SkillFileInstaller` ships.
-- `examples/hermes_tools_evolution_metadata.json` ships the confusable-neighbors sidecar users copy into `/tools/_evolution_metadata.json`.
-
-## Minor inconsistencies in the codebase (worth tracking, not blockers)
-
-### 4. Module-import-time `logging.basicConfig`
-`evolution/skills/evolve_skill.py:30-34` calls `logging.basicConfig` at import. This is *idempotent* in stdlib (only first call wins) but means importing `evolve_skill` from another script silently configures the root logger. Documented in `interfaces.md` (Logging conventions) — flag if a future user wants to import `evolve()` from a notebook without the side effect.
-
-### 5. `HermesSkillSource` env var name has changed
-The `external_importers._load_skill_text` standalone CLI uses `~/.hermes/skills/`, but the `HermesSkillSource` adapter uses `~/.hermes/hermes-agent/skills/` (or `$SKILL_SOURCES_HERMES_REPO/skills/`). Different path under the same `~/.hermes/` prefix; could confuse a user who deletes one and expects both surfaces to break together.
-
-### 6. CLI flag naming inconsistency
-- `--bootstrap-resamples` (CLI) maps to `bootstrap_n_resamples` (Python) — note the `n_` prefix difference.
-- All other CLI flags map straightforwardly.
-
-### 7. Tier-3/4/5 packages are empty stubs
-`evolution/{prompts,code,monitor}/` contain only `__init__.py`. They anchor the planned architecture but currently do nothing. Documented in `codebase_info.md` (implementation status table). Could confuse a new contributor expecting working code there.
-
-### 8. Tool descriptions have a parallel-but-not-shared infrastructure
-`evolution/tools/` mirrors `evolution/skills/` for the dataset, judge, proposer, and orchestration — the modules are intentionally duplicated rather than parameterized because the prompts differ enough that a shared base would be more abstract than helpful. `evolution/core/quality_gate.py` is the one piece that was extracted out and is now genuinely shared (preset table + gate-decision persistence).
-
-### 9. Closed-loop signal is opt-in even when wired
-`--closed-loop-during-evolution` is required to construct the `ClosedLoopFeedbackCache`; otherwise the metric's behavioral branch is dead code (no behavioral examples in trainset). Default `--closed-loop-mode feedback` keeps full backward compatibility with the pre-closed-loop CLI.
-
-## Gaps that warrant future docs
-
-### 1. No deployment / release docs
-No `release.md`, `CONTRIBUTING.md`, `RELEASE.md`. Project is currently single-author with PR-based merges; if it scales, these would be needed.
-
-### 2. No example `gate_decision.json` walkthrough
-`data_models.md` shows the schema; a worked example narrating "the bootstrap CI lower bound was -0.06 so dual-check rejected" would help users reading their own decisions for the first time. Could be added if rejection diagnostics become a frequent user task.
-
-### 3. No "how to add a new constraint" guide
-`ConstraintValidator` is closed over a hardcoded set of checks. Adding a new one requires editing both the validator and (for the gate-payload integration) `evolve_skill.py`. Pattern is straightforward but undocumented; would be useful when Tier 2/3 lands and tool-description-specific constraints are added.
-
-### 4. No GEPA-vs-MIPROv2 comparison
-The fallback chain is implemented but the "when does GEPA underperform / when does MIPROv2" narrative isn't documented. The MIPROv2 path is a degraded mode (no knee-point, no `detailed_results`); user-facing implications are not surfaced beyond "knee_point.applied=false."
-
-### 5. No "how to author a closed-loop task suite" guide
-Users adopting `--closed-loop-during-evolution` need to write JSONL suites with calibrated `expected_tools` / `forbidden_tools` per task. The shape is documented in `data_models.md`; the *design heuristics* (how to choose tasks the agent's behavior is sensitive to) are not. The current shipped suites (`evolution/validation/suites/{patch,write_file,search_files}.jsonl`) are the de-facto examples.
-
-## Recommended documentation maintenance
-
-1. **Re-verify defaults on every release.** `EvolutionConfig` defaults are tuned often; doc table in `data_models.md` will drift.
-2. **Re-collect test count when refactoring.** Currently ~680; bump if tests are added/removed.
-3. **Update `gate_decision.json` schema docs on every schema bump.** When `schema_version` increments, both `data_models.md` and `interfaces.md` (test surfaces) need to mention the new fields.
-4. **Update `ValidationReport` schema docs on every schema bump.** Currently `"1"`; fields are stable but new diagnostic fields will likely accumulate.
-5. **Verify mermaid diagrams render.** GitHub renders mermaid in markdown; if a diagram breaks during edits, the rest of the page still renders, so silent breakage is possible. Spot-check on github.com after pushing.
-
-## What's NOT documented (intentionally)
-
-- **Per-PR rationale or change log.** That's `git log` + PR descriptions — not duplicated here.
-- **Bug-fix recipes.** The fix is in the code; the commit message has the context.
-- **Debugging output samples.** Run logs and `gate_decision.json` snapshots are user-specific and rot fast.
-- **Style preferences.** Lives in `AGENTS.md`.
-- **Experimental run results / findings docs.** Run outcomes belong in PR descriptions; the durable claims surface in `PLAN.md` deviations or in code-level docstrings where the constraint applies.
From c19dd288088772aa3e221102289f6d12dca07f50 Mon Sep 17 00:00:00 2001
From: Justin Ramos
Date: Mon, 25 May 2026 14:26:43 -0600
Subject: [PATCH 3/3] chore: remove docs/superpowers/ from repo (plans + specs are local-only); refresh schema_version + test counts; mark superseded research

---
 .gitignore | 3 +++
 AGENTS.md | 2 +-
 docs/codebase_info.md | 2 +-
 docs/data_models.md | 33 ++++++++++++++++++++++++----
 docs/framework_advantages.md | 2 +-
 docs/index.md | 6 ++---
 docs/interfaces.md | 2 +-
 docs/research/knee_point_analysis.md | 2 ++
 evolution/core/saturation_check.py | 2 +-
 9 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/.gitignore b/.gitignore
index d5436f8..000e03a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -27,3 +27,6 @@ snapshots/
 .vscode/
 *.swp
 *.swo
+
+# Local-only design specs and implementation plans (never committed)
+docs/superpowers/
diff --git a/AGENTS.md b/AGENTS.md
index 661512e..458da8e 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -210,7 +210,7 @@ Per-run dir: `output///`. Contents vary by outcome:
 | File | When | Purpose |
 |---|---|---|
 | `run.log` | always | Every LM call (start, end, heartbeats), every retry |
-| `gate_decision.json` | always | Structured deploy decision (schema_version `"4"`) |
+| `gate_decision.json` | always | Structured deploy decision (schema_version `"5"`) |
 | `evolved_skill.md` | deploy only | New SKILL.md ready to ship |
 | `baseline_skill.md` | deploy only | Original (for diffing) |
 | `metrics.json` | deploy only | Top-level run summary |
diff --git a/docs/codebase_info.md b/docs/codebase_info.md
index b6e787d..2900d46 100644
--- a/docs/codebase_info.md
+++ b/docs/codebase_info.md
@@ -111,7 +111,7 @@ evolution/
 | `evolution/core/behavioral_example.py` | ~35 | builder for behavioral dspy.Examples |
 | **Total** | **~9,000** | excludes empty `__init__.py` shims |

-Test suite: 55 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1076 tests** collected.
+Test suite: 61 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1166 tests** collected.

 ## Runtime dependencies

diff --git a/docs/data_models.md b/docs/data_models.md
index a41a97b..c2455d4 100644
--- a/docs/data_models.md
+++ b/docs/data_models.md
@@ -323,7 +323,7 @@ class SaturationReport:

 Only the target tool's `description` is changed; every other tool's `description`, `inputSchema`, and any `_evolution_metadata` block are preserved verbatim. With `--apply`, the source manifest file is rewritten in place with the same preservation guarantees. With `--patch`, a unified diff of (baseline → evolved) manifest JSON is written to stdout.

-## gate_decision.json (schema_version "4")
+## gate_decision.json (schema_version "5")

 The structured deploy-gate decision, written to `output///gate_decision.json` on every run regardless of outcome. The schema is the **calibration substrate** — `tests/skills/test_evolve_skill_validation_flow.py:TestGrowthGateDecisionSchema` locks the field list so future calibration scripts (`jq -s '...' output/*/*/gate_decision.json`) don't break.

@@ -333,7 +333,7 @@ Written when any `validate_static` check fails on the evolved artifact (short-ci

 ```json
 {
-  "schema_version": "4",
+  "schema_version": "5",
   "decision": "reject",
   "reason": "static_constraint_failure",
   "failed_constraints": ["non_empty"],
@@ -367,7 +367,7 @@ Written when `--max-total-cost-usd` is set and cumulative LM cost exceeds the ce

 ```json
 {
-  "schema_version": "4",
+  "schema_version": "5",
   "decision": "aborted",
   "reason": "cost_ceiling_exceeded",
   "cost_ceiling_usd": 0.50,
@@ -399,7 +399,7 @@ Written when `--max-total-cost-usd` is set and cumulative LM cost exceeds the ce

 ```json
 {
-  "schema_version": "4",
+  "schema_version": "5",
   "decision": "deploy", // or "reject"
   "reason": "passed", // or "growth_quality_gate"
   "decision_rule_used": "dual_check", // or "no_regression_only" | "non_inferiority"
@@ -530,6 +530,31 @@ Runs of `evolution.tools.evolve_tool` write the same schema with four extra top-
 | `dataset.sessiondb_drops` | `dict[str, int]` | Per-reason drop counts across the two pipeline stages. Importer keys: `short_task`, `slash_command`, `secret`, `no_tool_calls`, `non_manifest`. Judge keys: `judge_irrelevant`, `judge_error`, `noisy_middle`, `low_confidence`, `unknown_correct_tool`. Judge keys are absent when zero candidates reached the judge stage. |
 | `dataset.dropped_non_manifest_count` | `int` | Pulled out of `sessiondb_drops["non_manifest"]` as a top-level int so calibration scripts don't have to know the inner key set. Counts session invocations of tools that exist in the historical session but not in the current manifest under evolution. |

+### Schema v5 additions
+
+v5 adds always-present `decision_signal` and `pr_created` fields, plus a closed-loop-primary field group that is present only when the deploy gate was decided on closed-loop signal rather than the synthetic holdout.
+
+| Field | Type | Notes |
+|---|---|---|
+| `decision_signal` | `"synthetic" \| "closed_loop"` | Always present. Which signal the deploy gate actually decided on. `"closed_loop"` lands when the run executed CL-primary scoring (closed-loop tasks gained ≥ `cl_required_gain` AND synthetic non-inferiority held); `"synthetic"` otherwise. Calibration scripts should branch on this before interpreting `bootstrap` vs `cl_tasks_gained`. |
+| `pr_created` | `dict` | Always present. Shape-stable across `--create-pr` on/off and across success/failure. Keys: `status` (`"created" \| "skipped" \| "failed" \| "disabled"`), `reason` (`str \| None`), `branch` (`str \| None`), `commit_sha` (`str \| None`), `url` (`str \| None`). `"disabled"` is the default when `--create-pr` is off. |
+
+#### Closed-loop-primary fields (`decision_signal == "closed_loop"`)
+
+Written by `evolution/core/quality_gate.py::append_cl_decision_fields` when the gate decision is taken on closed-loop signal.
+
+| Field | Type | Notes |
+|---|---|---|
+| `baseline_closed_loop_per_example` | `list[float]` | Cached per-task closed-loop scores for the baseline artifact (0.0/1.0 per task). |
+| `evolved_closed_loop_per_example` | `list[float]` | Per-task closed-loop scores for the evolved artifact (0.0/1.0 per task). Same length and task order as `baseline_closed_loop_per_example`. |
+| `evolved_closed_loop_errored_tasks` | `list` | Task identifiers (or empty) for closed-loop evaluations that errored rather than scored. Empty list is the common case. |
+| `cl_tasks_gained` | `int` | `int(sum(evolved)) - int(sum(baseline))` — the net delta of tasks passing closed-loop. The CL-primary gate requires this to meet `cl_required_gain`. |
+| `cl_required_gain` | `int` | The CL-primary threshold the run had to clear, computed from `growth_pct` via the CL-primary slope/free-threshold constants. At least `1` for any non-zero growth. |
+| `synthetic_sanity_check` | `dict` | The non-inferiority guard that runs alongside CL-primary. Keys: `tolerance` (float), `baseline_mean` (float), `evolved_mean` (float), `passed` (bool — `(evolved - baseline) >= -tolerance`). |
+| `evolved_cl_eval_cost_usd` | `float` | LM cost in USD attributable to the evolved closed-loop evaluation pass — surfaces the CL-primary path's incremental spend. |
+| `band_trigger_score` | `dict` | Pre-flight scores that decided whether CL-primary fired. Keys: `holdout` (`float \| None`), `closed_loop` (`float \| None`). |
+| `validator_agent_model` | `str` | The LiteLLM model id used for the closed-loop validator agent. Recorded so historical decisions stay analysable if the default changes. |
+
 ## metrics.json (deploy-only summary)

 Written to `output///metrics.json` only on deploy. Top-level summary for quick scanning:
diff --git a/docs/framework_advantages.md b/docs/framework_advantages.md
index a10c338..5737f3e 100644
--- a/docs/framework_advantages.md
+++ b/docs/framework_advantages.md
@@ -52,7 +52,7 @@ Files: `evolution/core/saturation_check.py`.

 ## Telemetry as a first-class feature

-Every run writes `gate_decision.json` (schema_version `"4"`) capturing the deploy decision, the paired-bootstrap statistics, the static-constraint results, the knee-point band roster, and an explicit comparison against the candidate stock GEPA would have picked. Combined with `metrics.json` (deploy summary) and `run.log` (every LM call timing), this means a deploy decision is auditable post-hoc and the system can be re-calibrated on accumulated runs. Most upstream users won't realize they're missing this until they need to debug a bad ship.
+Every run writes `gate_decision.json` (schema_version `"5"`) capturing the deploy decision, the paired-bootstrap statistics, the static-constraint results, the knee-point band roster, and an explicit comparison against the candidate stock GEPA would have picked. Combined with `metrics.json` (deploy summary) and `run.log` (every LM call timing), this means a deploy decision is auditable post-hoc and the system can be re-calibrated on accumulated runs. Most upstream users won't realize they're missing this until they need to debug a bad ship.

 ## When raw GEPA is the right choice

diff --git a/docs/index.md b/docs/index.md
index ce075a0..1d810c8 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,7 +6,7 @@ This directory is a structured documentation set for **`agent-self-evolution`**

 **Start here every time.** This file is the entry point — it describes which documents to consult for which kinds of question. Load it into context first; the other docs are loaded on demand.

-The codebase is mid-sized (~9K LOC of source + 55 test files / ~1076 tests) and architecturally dense — most of the substance is in *why* things are shaped a certain way, not *what* they are. The docs prioritize that "why."
+The codebase is mid-sized (~9K LOC of source + 61 test files / ~1166 tests) and architecturally dense — most of the substance is in *why* things are shaped a certain way, not *what* they are. The docs prioritize that "why."

 ### Question routing table

@@ -77,9 +77,9 @@ The codebase is mid-sized (~9K LOC of source + 55 test files / ~1076 tests) and
 The fast-moving parts to verify against source when consulting these docs:

 - `EvolutionConfig` defaults (especially `eval_dataset_size`, `growth_*`, `bootstrap_*`)
-- `gate_decision.json` schema_version (currently `"4"`)
+- `gate_decision.json` schema_version (currently `"5"`)
 - LM model defaults in `evolve_skill.py` / `evolve_tool.py` CLI options
-- Test count (currently ~1076)
+- Test count (currently ~1166)
 - LM `request_timeout` / `num_retries` — may be tuned further
 - Closed-loop CLI flags on `evolve_tool` (`--closed-loop-during-evolution`, `--closed-loop-mode`, …)
 - Saturation pre-flight default thresholds (`evolution/core/saturation_check.py:DEFAULT_THRESHOLDS`) — likely to be calibrated as more real-world bands are observed
diff --git a/docs/interfaces.md b/docs/interfaces.md
index 56faadd..412291b 100644
--- a/docs/interfaces.md
+++ b/docs/interfaces.md
@@ -284,7 +284,7 @@ Per-run directory: `output///`. Contents vary by ou
 These are technically internal but tested directly because downstream calibration scripts depend on them:

 - `_write_gate_decision(output_dir, payload) -> Path` — keep filename `gate_decision.json`.
-- `gate_decision.json` schema fields — `tests/skills/test_evolve_skill_validation_flow.py:TestGrowthGateDecisionSchema` and `TestStaticValidationShortCircuitsBeforeHoldout` lock `schema_version="4"` plus the full key list. See [data_models.md](data_models.md).
+- `gate_decision.json` schema fields — `tests/skills/test_evolve_skill_validation_flow.py:TestGrowthGateDecisionSchema` and `TestStaticValidationShortCircuitsBeforeHoldout` lock `schema_version="5"` plus the full key list. See [data_models.md](data_models.md).
 - `_dataset_payload(dataset)` — `size_total`, `size_train`, `size_val`, `size_holdout`, `sources` (per-source counter; "unknown" bucket for `source=""`). Locked by `TestDatasetPayloadHelper`.
 - `_knee_point_payload(pick)` — applied/skipped shapes both locked by `TestKneePointPayloadHelper`.
 - `paired_bootstrap()` return shape — `mean`, `lower_bound`, `upper_bound`, `n_examples`, `n_resamples`, `confidence`. Calibration scripts depend on these key names.
diff --git a/docs/research/knee_point_analysis.md b/docs/research/knee_point_analysis.md
index 3b3e105..c515af5 100644
--- a/docs/research/knee_point_analysis.md
+++ b/docs/research/knee_point_analysis.md
@@ -1,3 +1,5 @@
+> **Superseded.** Historical analysis from 2026-05-03. The knee-point ε-band selector was empirically dropped as a no-op for the val-best path in May 2026 (10/10 mode agreement across 5 ε modes on a regenerated calibration corpus). The selector survives only for the `--knee-point-strategy smallest` opt-in path. See [`reports/calibration_findings.md`](../../reports/calibration_findings.md) for current status.
+
 # Knee-point Pareto Selection: Analysis

 This document evaluates the framework's custom Pareto selector at
diff --git a/evolution/core/saturation_check.py b/evolution/core/saturation_check.py
index a8eaec8..5d0bedc 100644
--- a/evolution/core/saturation_check.py
+++ b/evolution/core/saturation_check.py
@@ -4,7 +4,7 @@
 returns a structured report. Call sites in evolve_skill / evolve_tool render a Rich panel and decide whether to prompt or default-deny.

-See docs/superpowers/specs/2026-05-21-path-f-saturation-preflight-design.md
+See reports/calibration_findings.md for the calibration data behind DEFAULT_THRESHOLDS.
 """

 from __future__ import annotations
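
Reviewer note on the schema v5 hunk in `docs/data_models.md` above: the closed-loop-primary fields are plain arithmetic over the per-task scores, so the table can be checked against a small sketch. The one below is illustrative only. The function name `cl_primary_fields`, its parameters, and the extra `cl_primary_passed` key are hypothetical and are not the framework's `append_cl_decision_fields`; only the formulas for `cl_tasks_gained` and `synthetic_sanity_check.passed` are taken from the table, and using a plain mean for the synthetic scores is an assumption.

```python
from statistics import fmean


def cl_primary_fields(baseline_cl, evolved_cl, baseline_syn, evolved_syn,
                      cl_required_gain, tolerance):
    """Hypothetical helper mirroring the v5 table; not the real gate code."""
    if len(baseline_cl) != len(evolved_cl):
        raise ValueError("per-task closed-loop lists must have the same length")

    # Table formula: int(sum(evolved)) - int(sum(baseline)).
    cl_tasks_gained = int(sum(evolved_cl)) - int(sum(baseline_cl))

    # Assumption: the synthetic means are simple arithmetic means.
    baseline_mean = fmean(baseline_syn)
    evolved_mean = fmean(evolved_syn)
    sanity = {
        "tolerance": tolerance,
        "baseline_mean": baseline_mean,
        "evolved_mean": evolved_mean,
        # Table formula: (evolved - baseline) >= -tolerance.
        "passed": (evolved_mean - baseline_mean) >= -tolerance,
    }

    return {
        "cl_tasks_gained": cl_tasks_gained,
        "cl_required_gain": cl_required_gain,
        "synthetic_sanity_check": sanity,
        # Illustration-only key (not in the schema): both conditions the
        # table names for the closed-loop signal must hold.
        "cl_primary_passed": cl_tasks_gained >= cl_required_gain and sanity["passed"],
    }


fields = cl_primary_fields(
    baseline_cl=[0.0, 1.0, 0.0, 1.0],
    evolved_cl=[1.0, 1.0, 1.0, 1.0],
    baseline_syn=[0.8, 0.6],
    evolved_syn=[0.7, 0.65],
    cl_required_gain=1,
    tolerance=0.05,
)
print(fields["cl_tasks_gained"], fields["cl_primary_passed"])
```

With these made-up scores, two more tasks pass closed-loop and the synthetic mean drops by less than the tolerance, so both conditions hold. Tightening `tolerance` to `0.01` would fail the sanity check while leaving `cl_tasks_gained` unchanged.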