diff --git a/README.md b/README.md index e145a91..5fae14d 100644 --- a/README.md +++ b/README.md @@ -169,6 +169,22 @@ The framework parses every `*_SCHEMA = {...}` and `*_SCHEMAS = [...]` declaratio With `--apply`, the evolved description is spliced into the source file's bytes at the original position — comments, formatting, and unrelated tools are untouched. Multi-line parenthesized concatenations collapse to a single triple-quoted string at the same indent. +### Evolve a system prompt section + +For Hermes Agent, evolve a named section of the assembled system prompt — any top-level string constant in `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`, which governs when and what the agent saves to memory): + +```bash +uv run python -m evolution.prompts.evolve_prompt_section \ + --section MEMORY_GUIDANCE \ + --hermes-repo /path/to/hermes-agent \ + --tasks evolution/validation/suites/memory_guidance.jsonl \ + --iterations 10 +``` + +Unlike skill and tool evolution — where the deploy gate can lean on a synthetic LLM-judge signal — a prompt section is evaluated **purely behaviorally**: every candidate is spliced into the live `prompt_builder.py` and scored by running the real agent (`hermes -z`) against the task suite. The verdict is compound — Layer 1 checks whether the agent invoked the expected tool (e.g. `memory`), and Layer 2 runs an LLM judge over the saved content against each task's `expected_save_content` rubric. The candidate is spliced in only for the duration of the run; the file is restored byte-for-byte afterward (atomic backup + flock + checksum-drift detection, shared with the tool-description path). + +`--apply` writes the evolved section into `prompt_builder.py` in place; results land in `output/prompts/
//`. PR automation (`--create-pr`) is not yet wired for prompt sections — use `--apply` plus a manual PR. To demonstrate the loop on an already-tuned section (which the saturation pre-flight will otherwise correctly default-deny as having no headroom), `--baseline-override-file` starts evolution from arbitrary text — e.g. a deliberately-weakened baseline that gives GEPA real failures to learn from. + ### Mine real session history for evals For skill evolution: @@ -331,7 +347,7 @@ Cost: each task is one `hermes -z` run (~$0.05–$0.50). The bundled `patch.json |-------|--------|--------|--------| | **Phase 1** | Skill files (SKILL.md) | DSPy + GEPA | ✅ [Validated](reports/phase1_validation_report.pdf) | | **Phase 2** | Tool descriptions + dual-signal deploy gate | DSPy + GEPA | ✅ [Validated](reports/phase2_validation_report.pdf) | -| **Phase 3** | System prompt sections | DSPy + GEPA | 🔲 Planned | +| **Phase 3** | System prompt sections | DSPy + GEPA | ✅ [Validated](reports/phase3_validation_report.pdf) | | **Phase 4** | Tool implementation code | Darwinian Evolver | 🔲 Planned | | **Phase 5** | Continuous improvement loop | Automated pipeline | 🔲 Planned | diff --git a/docs/architecture.md b/docs/architecture.md index 772a94b..8e73e94 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -56,10 +56,20 @@ graph TB hermes_source[tools.hermes_source
Hermes *_SCHEMA AST adapter] end + subgraph prompts_tier[Prompt Tier] + evolve_prompt[prompts.evolve_prompt_section
main + evolve] + prompt_module[prompts.prompt_module
PromptModule + sentinels] + prompt_proposer[prompts.prompt_proposer
PromptSectionProposer] + prompt_judge[prompts.prompt_judge
SaveCallJudge + judge_save_calls
+ prompt fitness/splice scorer] + prompt_source[prompts.prompt_source
PromptSource protocol + SectionDescriptor] + hermes_prompt_source[prompts.hermes_prompt_source
HermesPromptSource — prompt_builder.py AST] + end + subgraph validation_subsystem[Closed-loop validation] validator[validation.validator
ClosedLoopValidator] hermes_runner[validation.hermes_runner
hermes -z subprocess] - installer[validation.artifact_installer
HermesToolDescriptionInstaller] + installer[validation.artifact_installer
HermesToolDescriptionInstaller +
HermesPromptSectionInstaller] + savejudge[validation.report
score_task Layer-2 judge hook] report[validation.report
ValidationReport + decision] task[validation.task
Task + TaskSuite] cl_cli[validation.closed_loop
CLI] @@ -117,10 +127,26 @@ graph TB tool_judge --> fitness tool_proposer --> budget + evolve_prompt --> prompt_module + evolve_prompt --> prompt_proposer + evolve_prompt --> prompt_judge + evolve_prompt --> prompt_source + evolve_prompt --> hermes_prompt_source + evolve_prompt --> config + evolve_prompt --> quality + evolve_prompt --> timing + evolve_prompt --> validator + hermes_prompt_source --> prompt_source + prompt_module --> dspy + prompt_proposer --> budget + prompt_judge --> fitness + installer --> hermes_prompt_source + validator --> hermes_runner validator --> installer validator --> report validator --> task + validator --> savejudge cl_cli --> validator hermes_runner --> hermes @@ -138,7 +164,9 @@ graph TB importers --> dataset ``` -`evolution/core/` has no dependency on `evolution/skills/`, `evolution/tools/`, or `evolution/validation/`. The reverse holds: tier packages use core helpers but core never imports from a tier package. `closed_loop_feedback.py` imports `evolution.validation.*` types because it's the integration seam, but the validation subpackage doesn't import from skills/tools. This keeps the tier-3/4/5 expansion path open. +`evolution/core/` has no dependency on `evolution/skills/`, `evolution/tools/`, `evolution/prompts/`, or `evolution/validation/`. The reverse holds: tier packages use core helpers but core never imports from a tier package. `closed_loop_feedback.py` imports `evolution.validation.*` types because it's the integration seam, but the validation subpackage doesn't import from skills/tools/prompts. This keeps the tier-4/5 expansion path open. 
+ +The `prompts` tier (Phase 3) is the prompt-section evolution path: `evolve_prompt_section` wraps a named `prompt_builder.py` constant as a `PromptModule` (a passthrough predictor carrying the candidate in sentinel-delimited instructions), mutates it with `PromptSectionProposer`, and — because there is no synthetic classification signal for a system-prompt section — scores **purely behaviorally** through the closed-loop validator running a real `hermes -z` against a curated JSONL suite. The deploy gate is therefore a closed-loop pass-rate / win-loss decision, not a paired-bootstrap one. Unlike the skill/tool tiers it reuses `ClosedLoopValidator` directly rather than going through `closed_loop_feedback.py`, and it integrates by AST-splicing the candidate into the live `agent/prompt_builder.py` (`HermesPromptSectionInstaller`) with atomic restore. The Layer-2 content judge (`SaveCallJudge` / `judge_save_calls`) runs inside `score_task` to grade memory-save *content* on top of the Layer-1 trigger-membership check. 
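The AST splice mentioned in this paragraph can be sketched as follows. This is a minimal illustration, assuming the section is a plain top-level `NAME = "..."` assignment in a UTF-8 file; `splice_string_constant` and its signature are hypothetical and are not taken from `HermesPromptSource` or `HermesPromptSectionInstaller`, whose real behavior may differ.

```python
import ast
import os
import tempfile
from pathlib import Path


def splice_string_constant(path: Path, name: str, new_text: str) -> None:
    """Replace the literal of a top-level string constant, byte-precisely.

    Hypothetical helper for illustration only.
    """
    data = path.read_bytes()
    tree = ast.parse(data)

    # Find a top-level `NAME = "<string literal>"` assignment.
    literal = None
    for node in tree.body:
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == name
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            literal = node.value
    if literal is None:
        raise KeyError(name)

    # ast reports 1-based line numbers and UTF-8 byte column offsets.
    lines = data.splitlines(keepends=True)
    start = sum(len(l) for l in lines[: literal.lineno - 1]) + literal.col_offset
    end = sum(len(l) for l in lines[: literal.end_lineno - 1]) + literal.end_col_offset

    # repr() yields a literal that evaluates back to exactly new_text.
    new_data = data[:start] + repr(new_text).encode("utf-8") + data[end:]
    ast.parse(new_data)  # raises SyntaxError before anything is written

    # Write to a sibling temp file, then swap it in atomically.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "wb") as handle:
        handle.write(new_data)
    os.replace(tmp_name, path)
```

Everything outside the literal's byte range is copied through unchanged, which is why comments and unrelated constants survive the splice in this sketch.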
## Design patterns in active use diff --git a/docs/codebase_info.md b/docs/codebase_info.md index 2900d46..c16decc 100644 --- a/docs/codebase_info.md +++ b/docs/codebase_info.md @@ -67,14 +67,20 @@ evolution/ │ └── tool_judge.py # tool-flavored LLMJudge + GEPA-shaped metric ├── validation/ # closed-loop validation against a real agent │ ├── agent_runner.py # AgentRunner Protocol + AgentRunResult dataclass -│ ├── artifact_installer.py # ArtifactInstaller Protocol + HermesToolDescriptionInstaller +│ ├── artifact_installer.py # ArtifactInstaller Protocol + HermesToolDescriptionInstaller + HermesPromptSectionInstaller │ ├── closed_loop.py # CLI: drive baseline + evolved through hermes -z, compare -│ ├── hermes_runner.py # HermesAgentRunner — subprocess hermes -z with sandboxed HOME -│ ├── report.py # ValidationReport + TaskResult + decision rule -│ ├── suites/ # JSONL task suites (patch.jsonl, write_file.jsonl, search_files.jsonl) +│ ├── hermes_runner.py # HermesAgentRunner — subprocess hermes -z; reads sessions from SQLite state.db (parse_session_from_db) +│ ├── report.py # ValidationReport + TaskResult + decision rule + Layer-2 SaveCallJudge in score_task +│ ├── suites/ # JSONL task suites (patch.jsonl, write_file.jsonl, search_files.jsonl, memory_guidance.jsonl) │ ├── task.py # Task + TaskSuite.from_jsonl (with sha256 audit) │ └── validator.py # ClosedLoopValidator.validate — mutates + restores live agent file -├── prompts/ # Tier 3: planned, empty package +├── prompts/ # Tier 3: system-prompt-section evolution +│ ├── evolve_prompt_section.py # CLI + orchestration; purely-behavioral closed-loop gate +│ ├── prompt_source.py # PromptSource Protocol (read + write) + SectionDescriptor +│ ├── hermes_prompt_source.py # HermesPromptSource — AST read/write of prompt_builder.py constants +│ ├── prompt_module.py # PromptModule — passthrough predictor carrying candidate in sentinels +│ ├── prompt_proposer.py # PromptSectionProposer — sentinel-preserving GEPA proposer +│ └── 
prompt_judge.py # SaveCallJudge + judge_save_calls Layer-2 content judge + fitness/splice scorers ├── code/ # Tier 4: planned, empty package └── monitor/ # planned, empty package ``` @@ -86,6 +92,7 @@ evolution/ | `evolution/skills/evolve_skill.py` | ~1340 | CLI, orchestration, gate-decision payload assembly | | `evolution/tools/evolve_tool.py` | ~1170 | CLI + orchestration for tool-description evolution | | `evolution/core/external_importers.py` | ~770 | 3 importers + relevance filter + standalone CLI | +| `evolution/prompts/evolve_prompt_section.py` | ~660 | CLI + orchestration; purely-behavioral closed-loop deploy gate | | `evolution/core/dataset_builder.py` | ~480 | synthetic generator + golden loader + tool-selection three-bucket gen | | `evolution/core/lm_timing_callback.py` | ~400 | DSPy BaseCallback + litellm.failure_callback + cost ledger | | `evolution/core/fitness.py` | ~380 | LLMJudge + skill/tool fitness metrics + behavioral score helper | @@ -94,6 +101,7 @@ evolution/ | `evolution/core/closed_loop_feedback.py` | ~320 | cache + saturation gate + deterministic feedback block + `force_run` (bypasses gate for pre-flight) | | `evolution/core/saturation_check.py` | ~255 | pre-flight: band classifier + `SaturationReport` + Rich panel + interactive confirm | | `evolution/tools/tool_judge.py` | ~230 | tool-flavored judge + GEPA-shaped metric with behavioral branch | +| `evolution/prompts/prompt_judge.py` | ~230 | SaveCallJudge + judge_save_calls Layer-2 content judge + prompt fitness/splice scorers | | `evolution/validation/validator.py` | ~220 | mutate + restore live agent file with flock + checksum drift check | | `evolution/validation/report.py` | ~225 | ValidationReport JSON + Rich rendering + two-condition decision | | `evolution/core/skill_sources.py` | ~210 | Hermes / Claude Code / LocalDir | @@ -101,15 +109,19 @@ evolution/ | `evolution/skills/knee_point.py` | ~205 | parsimony-based candidate picker | | `evolution/validation/hermes_runner.py` | ~205 | 
hermes -z subprocess with sandboxed HOME | | `evolution/tools/tool_proposer.py` | ~200 | sentinel-preserving reflection prompt | -| `evolution/validation/artifact_installer.py` | ~150 | byte-precise splice + atomic restore | +| `evolution/prompts/prompt_proposer.py` | ~160 | sentinel-preserving GEPA proposer for prompt sections | +| `evolution/validation/artifact_installer.py` | ~150 | byte-precise splice + atomic restore (tool + prompt-section installers) | +| `evolution/prompts/hermes_prompt_source.py` | ~135 | AST read/write of prompt_builder.py string constants | +| `evolution/prompts/prompt_module.py` | ~120 | PromptModule passthrough predictor + sentinel parse | | `evolution/validation/closed_loop.py` | ~135 | standalone closed-loop CLI | | `evolution/skills/skill_module.py` | ~125 | wraps SKILL.md as `dspy.Module` | | `evolution/validation/task.py` | ~90 | Task + TaskSuite.from_jsonl | | `evolution/core/config.py` | ~80 | `EvolutionConfig` dataclass | | `evolution/core/stats.py` | ~60 | `paired_bootstrap` helper | +| `evolution/prompts/prompt_source.py` | ~55 | PromptSource Protocol + SectionDescriptor | | `evolution/validation/agent_runner.py` | ~55 | AgentRunner Protocol + dataclasses | | `evolution/core/behavioral_example.py` | ~35 | builder for behavioral dspy.Examples | -| **Total** | **~9,000** | excludes empty `__init__.py` shims | +| **Total** | **~10,400** | excludes empty `__init__.py` shims | Test suite: 61 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1166 tests** collected. 
@@ -139,11 +151,11 @@ The README's table summarizes intent; reality: |---|---|---|---| | 1 | Skill files (SKILL.md) | DSPy + GEPA | ✅ implemented in `evolution/skills/` | | 2 | Tool descriptions | DSPy + GEPA | ✅ implemented in `evolution/tools/` — MCP-JSON and Hermes-Python-AST adapters; one target tool per run | -| 3 | System prompt sections | DSPy + GEPA | 🔲 `evolution/prompts/` package exists, empty | +| 3 | System prompt sections | DSPy + GEPA | ✅ implemented in `evolution/prompts/` — AST splice of `prompt_builder.py` constants; purely-behavioral closed-loop deploy gate (no synthetic signal) | | 4 | Tool implementation code | Darwinian Evolver | 🔲 `evolution/code/` package exists, empty; `[darwinian]` extra reserves the dep | | 5 | Continuous improvement loop | Automated pipeline | 🔲 `evolution/monitor/` package exists, empty | -Tiers 1 and 2 are built. Tier 3-5 packages exist as empty stubs to anchor the planned architecture. See PLAN.md's per-phase "Deviations from plan" subsections for where the built tiers diverge from the original spec. +Tiers 1-3 are built. Tier 4-5 packages exist as empty stubs to anchor the planned architecture. See PLAN.md's per-phase "Deviations from plan" subsections for where the built tiers diverge from the original spec. **Orthogonal validation surface.** `evolution/validation/` runs a real agent (`hermes -z`) through a JSONL task suite with baseline vs evolved artifacts spliced into the live install. Scores actual tool-selection behavior with `expected_tools` / `forbidden_tools` per task; compares with a two-condition decision rule. 
Available three ways: diff --git a/docs/components.md b/docs/components.md index 8821142..2d3c85c 100644 --- a/docs/components.md +++ b/docs/components.md @@ -368,6 +368,51 @@ Score is **never** modified by `pred_trace` enrichment — GEPA enforces score e **Cost ceiling + benchmark hook (shared with `evolve_skill`):** `--max-total-cost-usd` participates in the same `CostLedger` kill switch (see `lm_timing_callback.py`); `--benchmark-cmd` is a post-gate shell hook whose env vars include `EVOLVED_PATH` / `BASELINE_PATH` pointing at the rendered manifest JSONs and `ARTIFACT_TYPE="tool_description"`. Both write structured blocks into `gate_decision.json` — see `data_models.md`. +## evolution/prompts/evolve_prompt_section.py — CLI + orchestrator + +**Owns:** the end-to-end `evolve_prompt_section()` flow and the Click CLI (`main`) for evolving a named system-prompt section — a top-level string constant in Hermes `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`). The phase-3 analogue of `evolve_tool`, but with a fundamentally different eval substrate: there is no cheap synthetic classification GEPA can score, so **every** candidate is spliced into the live `prompt_builder.py` and run through a real `hermes -z` subprocess. The deploy gate is therefore a `ClosedLoopValidator` win/loss decision, not a paired-bootstrap CI. + +**Public surface:** +- `main()` — Click command. CLI flags map onto `evolve_prompt_section()` kwargs. +- `evolve_prompt_section(section_name, hermes_repo, tasks_path, ...) -> dict` — orchestrator function. Importable and used directly by tests. + +**Integration model — in-place splice + atomic restore.** Unlike skills (separate writable workdir) there is no env-var hook or plugin seam: the section is a constant inside Hermes' own source, so the framework edits that file in place and restores it. 
The whole evolution runs inside `_prompt_builder_guard(target_path)` — a context manager that takes an atomic `.cl_backup` (`_BACKUP_SUFFIX`), grabs an exclusive `fcntl.flock` on `.cl_validation.lock` (`_LOCK_FILENAME`) in the target's parent dir, and byte-restores the original on exit (refusing to start on a stale backup or a held lock). These are the *same* lock + backup names `ClosedLoopValidator` uses, so the guard is sequenced *before* the deploy-gate validator, never nested. The deploy gate then re-acquires the lock itself. + +**Phases inside `evolve_prompt_section()`:** +1. Resolve baseline: `HermesPromptSource.read(section_name)` validates the section is a top-level string constant, then reads its text — or `--baseline-override-file` supplies starting text (a deliberately-weakened baseline for headroom, or a regression ablation) while the *live* file is still backed up/restored and `--apply` still writes the live section. +2. Train/holdout split of the JSONL suite (`_split_train_holdout`, deterministic shuffle+seed, ≥1 task each side; suites with <2 tasks are rejected). +3. Build the eval stack: `SaveCallJudge` + a per-task Layer-2 factory (`_make_layer2_factory`, binds each task's `expected_save_content` rubric + message into a `score_task`-shaped scorer; returns `None` for tasks with no rubric) → `HermesPromptSectionInstaller` + `HermesAgentRunner` + a `make_memoizing_splice_scorer` over `install_candidate` / `score_task_id`, serialized under a `threading.Lock`. +4. `dspy.configure(lm=eval_lm)` sets the **global** default LM (not just `dspy.context`) so the passthrough predictor resolves an LM inside GEPA's worker threads — without it, `forward()`'s passthrough call raises "No LM is loaded" in those threads, yielding no trajectories and no proposal. +5. 
Inside `_prompt_builder_guard`: saturation pre-flight (baseline behavior on the holdout; aborts/denies on a non-`healthy` band unless `--force-saturation-check`, with non-interactive contexts refusing rather than prompting) followed by GEPA(`PromptModule`, `PromptSectionProposer`, `make_prompt_fitness_metric` + the memoizing splice scorer). Trainset/valset are `_behavioral_examples` (task message + `closed_loop_task_id`). +6. Select the evolved section via GEPA val-argmax (`detailed_results.best_idx`), reading the body back out of the winning candidate's sentinel region (`_section_text_from_candidate`). +7. Deploy gate: `ClosedLoopValidator.validate(...)` runs baseline vs evolved on the holdout suite (the same per-task Layer-2 factory + threshold threaded in). `report.decision == "pass"` is the deploy verdict. +8. Write `gate_decision.json`; on a passing gate `--apply` writes the evolved section back into `prompt_builder.py`. `baseline_section.txt` / `evolved_section.txt` are also emitted. + +`_run_one_task_score` is the GEPA in-loop scorer: materialize the task fixture into a tmp dir, run the agent against whatever section is currently spliced, `score_task`, return 1.0/0.0 (in-loop abstentions score 0.0 — the deploy gate handles abstentions properly). Budget rides the shared `COST_LEDGER` + `CostCeilingExceeded` kill switch; the ceiling abort writes a `cost_ceiling_exceeded` gate decision. + +**`gate_decision.json` additions:** `artifact_type: "prompt_section"`, `target_section: `, `baseline_chars` / `evolved_chars` / `growth_pct`, a `closed_loop` block (the validator decision + pass rates + W/L/T), and `sentinel_failures` (proposer candidates rejected for losing the sentinels). `decision_signal` is always `"closed_loop"`. `--create-pr` is **deferred** for prompt sections (it would pollute the diff with the local override-hook commit) and is recorded as `skipped`; use `--apply` + a manual PR. 
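The backup, lock, and restore sequence attributed to `_prompt_builder_guard` above can be sketched roughly like this. It is an illustration under stated assumptions, not the project's implementation: `guarded_file` is a hypothetical name, the backup is assumed to live next to the target as `<name>.cl_backup`, and `fcntl` limits the sketch to Unix-like systems.

```python
import contextlib
import fcntl
import os
import shutil
from pathlib import Path


@contextlib.contextmanager
def guarded_file(target: Path,
                 backup_suffix: str = ".cl_backup",
                 lock_name: str = ".cl_validation.lock"):
    """Back up `target`, hold an exclusive lock, restore the bytes on exit."""
    backup = target.with_name(target.name + backup_suffix)
    if backup.exists():
        # A leftover backup means an earlier run did not restore cleanly.
        raise RuntimeError(f"stale backup present: {backup}")

    with open(target.parent / lock_name, "w") as lock_file:
        try:
            # Non-blocking: refuse to start rather than wait on a held lock.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise RuntimeError("validation lock is already held") from exc
        try:
            # Copy to a temp name first so the backup appears atomically.
            tmp = backup.with_name(backup.name + ".tmp")
            shutil.copy2(target, tmp)
            os.replace(tmp, backup)
            try:
                yield backup
            finally:
                # Restores the original bytes and removes the backup in one step.
                os.replace(backup, target)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Because `flock` is tied to the open file description, a second guard on the same lock file is refused even from within the same process, which is consistent with running the guard before the deploy-gate validator rather than around it.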
+ +### Supporting modules (`evolution/prompts/`) + +- `prompt_source.py` — `PromptSource` Protocol (`read` + `write` only, `runtime_checkable`) + `SectionDescriptor` (frozen metadata). The Protocol is deliberately minimal — the driver only reads a baseline and writes/splices an evolved value. `list_sections` is a concrete convenience on `HermesPromptSource` (a future `--list-sections` affordance), not part of the contract. +- `hermes_prompt_source.py` — `HermesPromptSource`, the splice primitive. `read` AST-walks top-level `NAME = "..."` string constants (v1 string-typed only; dict-typed constants like `PLATFORM_HINTS` raise `KeyError`). `write` splices by byte offset using `repr(new_text)` so the literal round-trips byte-equal regardless of embedded quotes/newlines, and `ast.parse`-guards the result before an atomic `os.replace` — it **refuses to write non-parseable Python**, leaving the user's Hermes startable. +- `prompt_module.py` — `PromptModule(section_name, candidate_text)`: a `dspy.Module` whose `ChainOfThought` passthrough predictor carries the candidate in `signature.instructions` between sentinel markers (`` … ``). There is no cheap classification to score, so the predictor exists only as a mutation target. `forward()` **must** invoke the passthrough so GEPA captures a trace for `passthrough.predict` — without a traced predictor call, `make_reflective_dataset` finds "no valid predictions" and never proposes a mutation. It returns a placeholder response with `_closed_loop_task_id` + `_candidate_text` attached for the behavioral metric. GEPA discovers the target via `named_predictors()` → `"passthrough.predict"`. +- `prompt_proposer.py` — `PromptSectionProposer`, a sentinel-preserving GEPA `instruction_proposer` subclassing `BudgetAwareProposer` (inherits the char-budget infrastructure; see `budget_aware_proposer.py`). Runs the proposer LM, then passes the candidate through `extract_and_rebuild` so only the sentinel-delimited region survives. 
On a candidate that loses the sentinels it increments `sentinel_failures` and **re-raises** `SentinelParseError` rather than returning the parent unchanged — GEPA's reflective-mutation path skips the iteration instead of admitting a phantom identical-to-parent candidate into the selection pool. +- `prompt_judge.py` — + - `SaveCallJudge` — LLM-as-judge scoring an individual memory-save's content against `MEMORY_GUIDANCE`'s rules (durable, declarative, fact-focused; not task progress / PR numbers / completed-work logs). Unparseable judge output falls back to a neutral 0.5 (logged so it's distinguishable from a real mediocre score). + - `judge_save_calls` — the Layer-2 aggregate. Only judges `SAVE_ACTIONS = {add, replace}` (the real Hermes `memory` tool actions that carry a `content` payload; `remove` is not a save), caps judged calls at `MAX_JUDGED_CALLS_PER_TASK = 5` (excess score 0 each), and returns a vacuous 1.0 when there are no save calls or no judge/rubric is configured. + - `make_prompt_fitness_metric` — the GEPA 5-arg metric. Routes purely behaviorally: a prediction missing `_closed_loop_task_id` is degenerate and scores 0 with a diagnostic; otherwise `closed_loop_scorer(task_id, candidate_text)` runs one closed-loop trial. Appends a `[BUDGET]` feedback line. + - `make_memoizing_splice_scorer` — builds `closed_loop_scorer(task_id, candidate_text)` that splices **only when `candidate_text` changes** (consecutive tasks for one candidate reuse the live splice). Serialized under a `threading.Lock` because `dspy.Evaluate` is multi-threaded but `prompt_builder.py` is one shared mutable file — behavioral scoring is therefore effectively serial, an accepted v1 cost of splice-and-restore. Backup/restore is the caller's job (the guard wraps the whole run). 
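The splice-only-on-change behavior described for `make_memoizing_splice_scorer` can be sketched as below. This is a simplified stand-in, assuming `install_candidate(text)` splices a candidate and `score_task_id(task_id)` runs one closed-loop trial against whatever is currently spliced; the name `sketch_memoizing_splice_scorer` and both callback signatures are hypothetical.

```python
import threading
from typing import Callable, Dict, Optional


def sketch_memoizing_splice_scorer(
    install_candidate: Callable[[str], None],
    score_task_id: Callable[[str], float],
) -> Callable[[str, str], float]:
    """Build a scorer that re-splices only when the candidate text changes."""
    lock = threading.Lock()
    live: Dict[str, Optional[str]] = {"text": None}  # text currently spliced

    def closed_loop_scorer(task_id: str, candidate_text: str) -> float:
        # One shared file on disk, so install and scoring run under one lock.
        with lock:
            if candidate_text != live["text"]:
                install_candidate(candidate_text)
                live["text"] = candidate_text
            return score_task_id(task_id)

    return closed_loop_scorer
```

Holding the lock across both the install and the trial is what makes scoring effectively serial: a second thread cannot swap in a different candidate while a trial is still reading the file.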
+ +### Shared validation-stack changes that enable the prompt path + +These let the prompt path reuse `ClosedLoopValidator` unchanged (see the validation section below for the base machinery): + +- `HermesPromptSectionInstaller` (in `artifact_installer.py`) — implements the `ArtifactInstaller` Protocol. `target_path` = `agent/prompt_builder.py`; `install(text_file)` reads the candidate body and calls `HermesPromptSource.write`, returning the post-install `sha256`; `verify_backup` = `verify_python_parses`. Constraint: the section must be a top-level string constant. +- `ClosedLoopValidator` gained an optional `layer2_judge_factory` (per-task — prompt-section judging needs the task's `expected_save_content` rubric + message, which a single global fn couldn't carry) plus a `layer2_threshold`. When unset, scoring is Layer 1 only and the tool-description path is unchanged. +- `report.py`'s `score_task` gained the compound Layer 2: when a `layer2_judge_fn` is supplied a task passes only if Layer 1 (trigger membership) passes **and** the judge scores `>= layer2_threshold`. Layer 1 short-circuits — the judge is never called (no LLM cost) on a task that already failed the trigger test, and `test_command` mode ignores Layer 2. The judge receives the subset of `run.tool_calls_with_args` whose name is `memory`. `Task` gained `expected_save_content`; `AgentRunResult` gained `tool_calls_with_args`. +- `hermes_runner.py` (shared change): reads agent sessions from the SQLite `state.db` (`parse_session_from_db`) since the current one-shot `hermes -z` is ephemeral and no longer writes `session_*.json`. A row whose `tool_calls` column won't parse as JSON aborts with an `error` result (the task **abstains**) rather than being silently read as "no tools." 
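The compound verdict described for `score_task` can be illustrated with a short sketch. It assumes each tool call is a dict with a `name` key and reduces Layer 1 to expected/forbidden tool membership; `compound_pass` and its parameter order are hypothetical, and the real `score_task` handles more cases (abstentions, `test_command` mode).

```python
from typing import Callable, Optional, Sequence


def compound_pass(
    tool_calls: Sequence[dict],
    expected_tools: Sequence[str],
    forbidden_tools: Sequence[str],
    layer2_threshold: float,
    layer2_judge_fn: Optional[Callable[[list], float]] = None,
) -> bool:
    """Layer 1 (trigger membership) AND Layer 2 (content judge) verdict."""
    called = {call["name"] for call in tool_calls}

    # Layer 1: every expected tool was invoked and no forbidden tool was.
    layer1 = set(expected_tools) <= called and not (set(forbidden_tools) & called)
    if not layer1:
        return False  # short-circuit: the judge is never called, no LLM cost

    if layer2_judge_fn is None:
        return True  # Layer 1 only, as on the tool-description path

    # Layer 2: the judge sees only the `memory` calls.
    memory_calls = [call for call in tool_calls if call["name"] == "memory"]
    return layer2_judge_fn(memory_calls) >= layer2_threshold
```

Checking Layer 1 first keeps the cheap structural test in front of the expensive judge call, so a task that never triggered the expected tool costs nothing beyond the agent run itself.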
+ ## evolution/validation/ — closed-loop validation against a real agent Drives an actual agent (`HermesAgentRunner` via `hermes -z`) through a small task suite with baseline and evolved artifacts, scores real tool-selection behavior, compares. Orthogonal to skills/tools/prompts/code — measures agent behavior, not artifact production. @@ -388,6 +433,6 @@ Drives an actual agent (`HermesAgentRunner` via `hermes -z`) through a small tas **During-evolution integration.** Beyond the standalone CLI, the same `ClosedLoopValidator` powers `evolution/core/closed_loop_feedback.py`'s `ClosedLoopFeedbackCache`. The cache writes the candidate description into a tmp manifest JSON, calls `validator.validate(ValidationInputs(...))` with it as `evolved_artifact`, and caches the returned `ValidationReport` by candidate text. The cache surfaces verdicts to the metric two ways: as a deterministic feedback block on the reflection path (`feedback` mode), or as per-task `TaskResult.passed` reads via `get_task_verdict(candidate, task_id)` for the behavioral-example branch (`trainset` mode). The validator itself doesn't know about the cache; it always sees a `ValidationInputs` with two artifacts and produces a `ValidationReport`. -## evolution/{prompts, code, monitor}/ — planned, empty +## evolution/{code, monitor}/ — planned, empty -These packages exist as empty stubs anchoring the planned tier-3/4/5 work. See `PLAN.md` for the design. +These packages exist as empty stubs anchoring the planned tier-4/5 work. See `PLAN.md` for the design. (`prompts/` is now implemented — see the phase-3 section above.) diff --git a/docs/data_models.md b/docs/data_models.md index c2455d4..6c0c23e 100644 --- a/docs/data_models.md +++ b/docs/data_models.md @@ -555,6 +555,79 @@ Written by `evolution/core/quality_gate.py::append_cl_decision_fields` when the | `band_trigger_score` | `dict` | Pre-flight scores that decided whether CL-primary fired. 
Keys: `holdout` (`float \| None`), `closed_loop` (`float \| None`). | | `validator_agent_model` | `str` | The LiteLLM model id used for the closed-loop validator agent. Recorded so historical decisions stay analysable if the default changes. | +### Prompt-section additions (`artifact_type == "prompt_section"`) + +Runs of `evolution.prompts.evolve_prompt_section` (Phase 3) write the same `schema_version` "5" envelope but a **deliberately different field set** from the skill/tool variant, because the deploy gate is a closed-loop pass-rate / win-loss decision, **not** a paired-bootstrap one. There is no synthetic classification signal for a system-prompt section — every candidate is scored behaviorally by a real `hermes -z` against a curated suite — so the bootstrap substrate doesn't apply. + +```json +{ + "schema_version": "5", + "artifact_type": "prompt_section", + "target_section": "MEMORY_GUIDANCE", + "decision": "deploy", // "deploy" | "reject" | "denied" | "dry_run" | "aborted" + "decision_signal": "closed_loop", // always "closed_loop" on this path + "baseline_chars": 1840, + "evolved_chars": 2104, + "growth_pct": 0.143, // (evolved_chars - baseline_chars) / baseline_chars + "closed_loop": { + "decision": "pass", // "pass" | "regression" (ValidationReport.decision) + "decision_reasons": ["pass_rate 0.92 >= baseline 0.75", "n_wins 4 >= 2*n_losses 0"], + "baseline_pass_rate": 0.75, + "evolved_pass_rate": 0.92, + "n_wins": 4, + "n_losses": 0, + "n_ties": 8 + }, + "sentinel_failures": 1, // reflection-LM outputs the proposer rejected for breaking sentinel preservation + "elapsed_seconds": 412.6, + "cost": { /* same shape as cost_summary: total_usd + by_model */ }, + "run_inputs": { /* seed, iterations, model versions, suite path/sha, validator_agent_model, ... 
*/ }, + "pr_created": { "status": "skipped", "reason": "prompt_section_pr_unsupported", "branch": null, "commit_sha": null, "url": null } +} +``` + +**Fields this variant carries** (and the tool/skill variant does not, or differs on): + +| Field | Type | Notes | +|---|---|---| +| `artifact_type` | `"prompt_section"` | Disjoint from `"skill"` / `"tool_description"`. | +| `target_section` | `str` | The `prompt_builder.py` constant whose text was evolved (e.g. `MEMORY_GUIDANCE`). | +| `decision` | `"deploy" \| "reject" \| "denied" \| "dry_run" \| "aborted"` | `"denied"` lands on a saturation pre-flight default-deny; `"dry_run"` when the run was asked to evaluate without splicing; `"aborted"` on cost-ceiling / interrupt. | +| `decision_signal` | `"closed_loop"` | Always `"closed_loop"` here — the synthetic value never appears on this path. | +| `baseline_chars` / `evolved_chars` / `growth_pct` | int / int / float | Size telemetry; growth informs the closed-loop required-gain threshold but is not gated on a bootstrap. | +| `closed_loop` | `dict` | `{decision, decision_reasons, baseline_pass_rate, evolved_pass_rate, n_wins, n_losses, n_ties}` — the deploy gate's primary evidence (sourced from `ValidationReport` over the behavioral suite). | +| `sentinel_failures` | `int` | Count of reflection-LM proposals rejected for failing sentinel preservation (same meaning as the tool path). | +| `elapsed_seconds` / `cost` | float / dict | Wall-clock + per-model cost ledger. | +| `run_inputs` | `dict` | Reproduction inputs (seed, iterations, models, suite path + sha, `validator_agent_model`). | +| `pr_created` | `dict` | Shape-stable with the skill/tool path, but the prompt-section path currently emits a `status: "skipped"` block (PR automation for in-place `prompt_builder.py` splices is not wired). 
| + +**Fields the prompt-section variant deliberately OMITS.** A reader or calibration script must not assume these are present — they exist only on the skill/tool (paired-bootstrap) path: + +- `bootstrap` — no per-example bootstrap CI; the gate is win-loss, not a resampled mean. +- `avg_baseline` / `avg_evolved` — no synthetic holdout mean. The analogous numbers live inside `closed_loop` as `baseline_pass_rate` / `evolved_pass_rate`. +- `dataset` — there is no synthetic eval dataset and no `dataset` block with per-source/per-category counts; the behavioral suite is the JSONL passed via `--tasks`. `run_inputs` records the run config (models, seed, iterations, holdout-ratio, `eval_source: "closed_loop"`), not the suite path or sha. +- `knee_point` — Pareto knee-point selection over a synthetic valset doesn't apply; candidates are chosen on behavioral score. + +#### Saturation-denied variant (prompt section) + +When the saturation pre-flight default-denies (non-healthy band, non-interactive context, no `--force-saturation-check`), the prompt-section gate writes `decision: "denied"` and carries a `saturation_band` field naming the band that triggered the denial: + +```json +{ + "schema_version": "5", + "artifact_type": "prompt_section", + "target_section": "MEMORY_GUIDANCE", + "decision": "denied", + "decision_signal": "closed_loop", + "saturation_band": "no_headroom", // "healthy" never lands here; one of no_headroom | weak_signal | uniform_failure + "baseline_chars": 1840, + "run_inputs": { /* ... */ }, + "pr_created": { "status": "skipped", "reason": "prompt_section_pr_unsupported", "branch": null, "commit_sha": null, "url": null } +} +``` + +`saturation_band` appears only on the `"denied"` decision (it records why the run never started); it is absent on `deploy` / `reject` / `dry_run`. + ## metrics.json (deploy-only summary) Written to `output///metrics.json` only on deploy. 
Top-level summary for quick scanning: diff --git a/docs/index.md b/docs/index.md index 1d810c8..dfe14e0 100644 --- a/docs/index.md +++ b/docs/index.md @@ -15,6 +15,7 @@ The codebase is mid-sized (~9K LOC of source + 61 test files / ~1166 tests) and | **What this project is** | `codebase_info.md` → `architecture.md` → repo-root `README.md` | | **How a skill run works end-to-end** | `workflows.md` (Workflow 1) → `architecture.md` (top-level flow) | | **How a tool-description run works end-to-end** | `workflows.md` (Workflow 9) → `components.md` (`evolve_tool.py`) | +| **How a prompt-section run works end-to-end** | `workflows.md` (Workflow 12) → `components.md` (`evolve_prompt_section.py`) | | **What flag does X / how to run the CLI** | `interfaces.md` (CLI section) | | **Why the deploy gate rejected a run** | `data_models.md` (gate_decision.json) → `components.md` (`constraints.py`) | | **What's in `gate_decision.json` / `metrics.json`** | `data_models.md` (full schema with examples) | @@ -53,7 +54,7 @@ The codebase is mid-sized (~9K LOC of source + 61 test files / ~1166 tests) and | [`components.md`](components.md) | Per-module reference: what each owns, public surface, load-bearing implementation notes | | [`interfaces.md`](interfaces.md) | CLIs (skill, tool, closed-loop, sessiondb importer), Python API, SkillSource + ToolSource Protocols, output artifacts, DSPy + litellm integration, test surfaces, env vars | | [`data_models.md`](data_models.md) | All dataclasses, on-disk formats, full `gate_decision.json` schema with worked examples, `ValidationReport` schema | -| [`workflows.md`](workflows.md) | Step-by-step workflows with mermaid sequence diagrams: skill deploy path, reject paths, GEPA→MIPROv2 fallback, sessiondb mining, tool evolution, closed-loop validation, closed-loop signal during evolution | +| [`workflows.md`](workflows.md) | Step-by-step workflows with mermaid sequence diagrams: skill deploy path, reject paths, GEPA→MIPROv2 fallback, sessiondb mining, 
tool evolution, closed-loop validation, closed-loop signal during evolution, prompt-section evolution | | [`dependencies.md`](dependencies.md) | Each external package — what it's used for, why it's pinned, what we don't depend on | | [`framework_advantages.md`](framework_advantages.md) | User-facing explainer of how this framework's selection layer, deploy gate, proposer, and composite fitness differ from raw DSPy + GEPA — and when raw GEPA is the right choice | diff --git a/docs/interfaces.md b/docs/interfaces.md index 412291b..4476fd3 100644 --- a/docs/interfaces.md +++ b/docs/interfaces.md @@ -140,6 +140,46 @@ Evolves one tool's top-level `description` field inside an MCP-shape manifest. T - `sys.exit(1)` if the holdout split has fewer than `min_holdout_size` (default 10) examples. - Returns normally (rejection path) if static or growth-quality gate fails — `evolved_FAILED.json` + `gate_decision.json` are written. +## CLI: `python -m evolution.prompts.evolve_prompt_section` + +Evolves one named section of an agent's system prompt — a top-level string constant in Hermes Agent's `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`). Unlike the skill and tool paths, evaluation is **purely behavioral**: there is no synthetic LLM-judge signal. Every candidate is spliced into the live `prompt_builder.py` and scored by running the real agent (`hermes -z`) against the task suite, so the deploy gate is a `ClosedLoopValidator` run (pass-rate + win/loss), not a paired-bootstrap CI over judge scores. + +The verdict is **compound**: Layer 1 is the same `expected_tools` / `forbidden_tools` membership rule as the closed-loop tool path; Layer 2 is an LLM judge that scores each `memory(action=add|replace)` call's content against the task's `expected_save_content` rubric (only tasks that declare a rubric are Layer-2 judged). 
The candidate is spliced in for the duration of the run and the file is restored byte-for-byte afterward, reusing the tool-path backup + flock + checksum-drift machinery. + +### Required flags +| Flag | Purpose | +|---|---| +| `--section ` | The `prompt_builder.py` top-level string constant to evolve (e.g. `MEMORY_GUIDANCE`). Dict-typed constants (e.g. `PLATFORM_HINTS`) are not supported. | +| `--hermes-repo ` | Path to your hermes-agent checkout. `agent/prompt_builder.py` inside it is the splice/restore target. | +| `--tasks ` | JSONL eval suite (e.g. `evolution/validation/suites/memory_guidance.jsonl`). Same task shape as the closed-loop tool suite, plus an optional `expected_save_content` rubric per task for Layer 2. Must contain ≥2 tasks (so the split yields a non-empty trainset and holdout). | + +### Optional flags +| Flag | Default | Notes | +|---|---|---| +| `--iterations ` | `10` | GEPA `max_full_evals`. | +| `--holdout-ratio ` | `0.5` | Fraction of tasks held out for the deploy gate. Clamped to keep both the trainset and holdout non-empty. | +| `--seed ` | `42` | RNG seed for the train/holdout split and GEPA. | +| `--max-growth ` | `0.2` | Section length budget as a fraction over the baseline; framed to the `PromptSectionProposer` so candidates stay near the baseline length (set higher when evolving from a short baseline that needs to grow). | +| `--optimizer-model` / `--reflection-model` / `--eval-model ` | config default | Per-role LiteLLM model overrides; resolved like the other CLIs. `--eval-model` is the Layer 2 content judge. | +| `--agent-model ` | config default | The model the `hermes -z` agent itself runs as. A deliberately weaker agent exposes more behavioral signal (a strong agent saturates the suite regardless of the prompt). LiteLLM provider prefixes are stripped before `hermes -m`. | +| `--layer2-threshold ` | `0.7` | Minimum mean content-judge score for a save task to pass Layer 2. 
| +| `--task-timeout-seconds ` | `120` | Per-task wall-clock cap for `hermes -z`. Timeouts abstain (don't tip the decision). | +| `--max-cost-usd ` | `150.0` | Abort cleanly when cumulative **in-process** LM cost (judge + reflection + the passthrough predictor) exceeds this. The agent's own LM spend happens inside the `hermes` child process and is not captured by this ceiling. | +| `--gepa-minibatch-size ` | `3` | GEPA reflective minibatch size; same meaning as the other paths. | +| `--gepa-acceptance {improvement-or-equal,strict-improvement}` | `improvement-or-equal` | Same meaning as the other paths. | +| `--apply` | off | On a deploy decision, write the evolved section into `prompt_builder.py` in place (byte-precise AST splice, `ast.parse`-guarded, atomic). | +| `--create-pr` | off | **Deferred for prompt sections** — accepted and recorded as a `skipped` PR block in `gate_decision.json`, but no PR is opened (copying a full evolved `prompt_builder.py` over `origin/` would carry unrelated local changes into the diff). Use `--apply` + a manual PR. | +| `--baseline-override-file ` | off | Start evolution from this text instead of the live section. The live section is still the splice/restore target (backed up + restored); `--apply` still writes the evolved text. Use it to create headroom on an already-tuned section (e.g. a deliberately-weakened baseline) or for regression-injection ablations. | +| `--skip-saturation-check` | off | Skip the saturation pre-flight entirely. | +| `--force-saturation-check` | off | Run the pre-flight, render the panel, but proceed regardless of band — required to override a non-`healthy` verdict non-interactively. | +| `--dry-run` | off | Resolve the baseline + build the modules, then stop — exercises wiring with no LM/agent calls. Writes a `decision="dry_run"` `gate_decision.json`. | +| `--output-dir ` | `output/prompts/
//` | Where `gate_decision.json` and the baseline/evolved section text files land. | + +### Exit conditions +- `0` on a `deploy` decision (or a `--dry-run`). +- `1` on `reject` (the holdout deploy gate found a regression), `denied` (saturated baseline default-denied non-interactively), or `aborted` (cost ceiling). +- `ValueError` at startup if the suite has fewer than 2 tasks. + ## CLI: `python -m evolution.core.external_importers` Standalone session-history importer. Useful for previewing what `--eval-source sessiondb` would produce without running the full evolution. diff --git a/docs/workflows.md b/docs/workflows.md index eb80148..e686908 100644 --- a/docs/workflows.md +++ b/docs/workflows.md @@ -545,6 +545,158 @@ When your daily-driver Hermes model is capable enough to solve every textbook bu Manual smoke harness: `tests/manual/skill_closed_loop_smoke.py` (supports `--suite {basic,advanced}`, `--agent-model MODEL`, `--task-timeout-seconds N`). +## Workflow 12: Evolve a prompt section (deploy path) + +The prompt-section analog of Workflow 9 (tool descriptions), but **purely behavioral** end to end. There is no synthetic judge dataset and no paired-bootstrap gate: every candidate is spliced into the live `prompt_builder.py` and scored by a real `hermes -z` subprocess, and the deploy gate is a `ClosedLoopValidator` run. Three structural contrasts with the tool path: + +- **Integration is in-place splice-and-restore**, not an MCP manifest rewrite or a copied skill directory. The target is a single named string constant inside the user's `prompt_builder.py`; the harness backs it up byte-for-byte and restores it on exit. +- **The deploy gate is closed-loop pass-rate / win-loss**, not a paired-bootstrap confidence interval. Decision = pass-rate no-regression + `n_wins >= 2 * n_losses` (the `ClosedLoopValidator.decide` rule), all behavioral. 
+- **PR automation is deferred.** `--create-pr` is recorded as `skipped`; deploy means `--apply` writes the evolved section into `prompt_builder.py` in place, and the user opens a PR by hand. + +```bash +python -m evolution.prompts.evolve_prompt_section \ + --section MEMORY_GUIDANCE \ + --hermes-repo ~/src/NousResearch/hermes-agent \ + --tasks evolution/validation/suites/memory_guidance.jsonl \ + --iterations 10 \ + --apply +``` + +### Phase A — Setup: resolve baseline, split, build the behavioral harness + +```mermaid +sequenceDiagram + autonumber + participant CLI as evolve_prompt_section + participant Src as HermesPromptSource + participant Suite as TaskSuite + participant Judge as SaveCallJudge + participant Inst as HermesPromptSectionInstaller + participant Run as HermesAgentRunner + participant V as ClosedLoopValidator + + CLI->>Src: read(section_name) — validate it exists / is a string constant + alt --baseline-override-file + CLI->>CLI: baseline_text = override_file.read_text() + else + CLI->>Src: baseline_text = read(section_name) + end + CLI->>Suite: TaskSuite.from_jsonl(tasks) — reject < 2 tasks + CLI->>CLI: _split_train_holdout(seed) — ≥1 task each side + CLI->>Judge: SaveCallJudge(config) → layer2_factory(task) + CLI->>Inst: HermesPromptSectionInstaller(repo, section) + CLI->>Run: HermesAgentRunner(timeout, agent_model?) + CLI->>V: ClosedLoopValidator(installer, runner, layer2_judge_factory, layer2_threshold) +``` + +The baseline is the **live section text** unless `--baseline-override-file` points evolution at arbitrary text — e.g. a deliberately-weakened baseline to manufacture headroom, or a regression-injection ablation. The override only changes where evolution *starts*; the guard still backs up and restores the real file, and `--apply` writes the evolved text back into the live section. The suite floor is 2 tasks so the seeded split yields a non-empty GEPA trainset **and** a non-empty deploy-gate holdout. 
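The split floor described above can be illustrated with a minimal sketch. It is not the harness's actual `_split_train_holdout`: the function name, signature, and rounding rule here are assumptions chosen for illustration. Only the documented behavior is taken from the text (a seeded split, a 2-task floor, and a holdout ratio clamped so both sides stay non-empty).

```python
import random


def split_train_holdout(tasks, holdout_ratio=0.5, seed=42):
    """Hypothetical sketch of a seeded train/holdout split.

    Assumption: the real helper may shuffle or round differently; this
    only demonstrates the clamp that keeps both sides non-empty.
    """
    if len(tasks) < 2:
        # Mirrors the documented startup failure for suites with < 2 tasks.
        raise ValueError("need at least 2 tasks to split into train + holdout")

    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)  # private RNG, so the split is reproducible

    n_holdout = round(len(shuffled) * holdout_ratio)
    # Clamp: at least 1 holdout task, and at least 1 task left for training.
    n_holdout = max(1, min(len(shuffled) - 1, n_holdout))

    holdout = shuffled[:n_holdout]
    train = shuffled[n_holdout:]
    return train, holdout
```

With the clamp in place, even an extreme `holdout_ratio` of `0.0` or `1.0` still leaves one task on the smaller side, which is why a 2-task suite is the smallest one that can run.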
+ +### Phase B — Configure the global LM, then enter the guard + +```mermaid +sequenceDiagram + autonumber + participant CLI as evolve_prompt_section + participant Scorer as memoizing_splice_scorer + participant Metric as prompt_fitness_metric + participant LM as eval_lm + participant DSPy as dspy.configure + + CLI->>Scorer: make_memoizing_splice_scorer(install_fn=source.write, score_fn=run_one_task, lock) + CLI->>Metric: make_prompt_fitness_metric(baseline_text, max_growth, closed_loop_scorer=scorer) + CLI->>LM: instantiate eval_lm (role=eval, temp=0) + CLI->>DSPy: dspy.configure(lm=eval_lm, callbacks=[LMTimingCallback()]) + Note over CLI,DSPy: global LM set so GEPA worker threads can run PromptModule's
passthrough predictor — the pre-flight's dspy.context doesn't reach them +``` + +The `closed_loop_scorer` is the spine of behavioral scoring: `score(task_id, candidate_text)` splices the candidate into the live `prompt_builder.py` **only when it changes** (consecutive tasks for the same candidate reuse the live splice), runs the task via `hermes -z`, and reads the session back from the sandbox `state.db`. The splice+run is serialized under one `threading.Lock` because `dspy.Evaluate` scores with a thread pool but the spliced file is a single shared mutable resource — behavioral scoring is therefore effectively serial, an accepted v1 cost. The explicit `dspy.configure` is load-bearing: `dspy.context` inside the saturation pre-flight does **not** propagate into GEPA's worker threads, so without the global LM the passthrough predictor raises "No LM is loaded" → no trajectories → no proposal. + +### Phase C — Inside the guard: saturation pre-flight, then GEPA + +```mermaid +sequenceDiagram + autonumber + participant CLI as evolve_prompt_section + participant Guard as _prompt_builder_guard + participant FS as live prompt_builder.py + participant Sat as saturation_preflight + participant GEPA as dspy.GEPA + participant PM as PromptModule + participant Prop as PromptSectionProposer + participant Scorer as splice scorer + participant H as hermes -z + state.db + + CLI->>Guard: enter(installer.target_path) + Guard->>FS: refuse if stale .cl_backup; flock parent dir (LOCK_EX|NB) + Guard->>FS: atomic_write_bytes(.cl_backup, target.read_bytes()) + opt not --skip-saturation-check + CLI->>Sat: saturation_preflight(baseline_module, holdout, metric, eval_lm, baseline_text) + Sat->>Scorer: behavioral score of baseline on each holdout task + Sat-->>CLI: SaturationReport(band, ...) 
+ alt band != healthy + alt --force-saturation-check + Note over CLI: proceed regardless + else non-interactive + CLI->>FS: write gate_decision.json (decision=denied, reason=saturated_baseline) + Note over CLI: return — GEPA never runs (default-deny) + else interactive + CLI->>CLI: prompt "Continue anyway? [y/N]" + end + end + end + CLI->>GEPA: compile(PromptModule(baseline), trainset, valset, instruction_proposer=PromptSectionProposer) + loop per iteration + GEPA->>PM: forward(task, closed_loop_task_id) — candidate in sentinel region of predictor instructions + PM-->>GEPA: Prediction(_candidate_text, _closed_loop_task_id) + GEPA->>Scorer: metric → closed_loop_scorer(task_id, candidate_text) + Scorer->>FS: splice candidate into live section (only if changed) + Scorer->>H: run task; read session from sandbox state.db + H-->>Scorer: tool_calls_with_args + final text + Scorer->>Scorer: compound verdict = Layer 1 (memory fired?) + Layer 2 (judge on memory add/replace content) + Scorer-->>GEPA: score ∈ {0.0, 1.0} + GEPA->>Prop: reflect on failures → sentinel-preserving candidate + end + GEPA-->>CLI: optimized module with detailed_results + CLI->>Guard: exit → atomic_write_bytes(target, .cl_backup); unlink backup; release flock +``` + +Everything that mutates the file lives **inside** the guard, which holds an exclusive `flock` (the same lock name the deploy-gate `ClosedLoopValidator` uses — sequenced before it, never nested) and restores the original bytes on exit. The saturation pre-flight scores the baseline behaviorally on the holdout; a non-`healthy` band (e.g. `no_headroom` on an already-tuned section) **default-denies in non-interactive contexts** unless `--force-saturation-check`, writing a `decision="denied"` gate before GEPA spends a cent. 
The compound per-task verdict is two layers: **Layer 1** is trigger membership (did the `memory` tool fire, via `expected_tools` / `forbidden_tools`), **Layer 2** is the `SaveCallJudge` scoring `memory(action=add|replace)` content against the task's `expected_save_content` rubric (`remove` is not a save; a passing Layer 1 with no save action scores a vacuous 1.0 on Layer 2). GEPA mutates only the sentinel-delimited region of the passthrough predictor's instructions; the `PromptSectionProposer` rejects any reflection-LM output that fails sentinel preservation. + +### Phase D — Deploy gate (closed-loop on the holdout), persist, apply + +```mermaid +sequenceDiagram + autonumber + participant CLI as evolve_prompt_section + participant Sel as candidate selection + participant V as ClosedLoopValidator + participant Inst as HermesPromptSectionInstaller + participant FS as live prompt_builder.py + participant H as hermes -z + participant Src as HermesPromptSource + + Note over CLI: guard already exited — file restored to baseline + CLI->>Sel: evolved_text = section_from_candidate(best_idx) # GEPA val-argmax + CLI->>FS: write baseline_section.txt + evolved_section.txt + CLI->>V: validate(ValidationInputs(section, holdout_suite, baseline_file, evolved_file)) + Note over V: own backup/restore + flock — independent of the Phase C guard + loop baseline phase, then evolved phase + V->>Inst: install(section_file) — splice into live prompt_builder.py + loop each holdout task + V->>H: run task; score Layer 1 + Layer 2 via layer2_judge_factory + end + end + V-->>CLI: ValidationReport(baseline_pass_rate, evolved_pass_rate, n_wins/n_losses, decision) + CLI->>FS: write gate_decision.json (artifact_type="prompt_section", decision=deploy|reject) + alt decision == pass AND --apply + CLI->>Src: write(section_name, evolved_text) — live section updated in place + end +``` + +The selected candidate is GEPA's val-argmax (`detailed_results.best_idx`) — there's no knee-point parsimony pass on 
the prompt-section path. The deploy gate is a fresh `ClosedLoopValidator.validate` over the **holdout** suite, with its own backup/restore + `flock` (it runs after the Phase C guard has already exited and restored the file, so the two never nest). Its decision is closed-loop only: pass-rate no-regression plus `n_wins >= 2 * n_losses`. The gate decision is written with `artifact_type="prompt_section"`, `target_section`, `baseline_chars` / `evolved_chars` / `growth_pct`, a `closed_loop` block (both pass-rates + win/loss/tie counts), and `sentinel_failures`. `--create-pr` records a `skipped` PR block (deferred for sections); `--apply` is the only way to ship, writing the evolved text into the live section. + +**Empirical anchors.** The real `MEMORY_GUIDANCE` section saturates — it scored 1.0 across the holdout (`no_headroom` band) and the harness correctly default-denied a non-interactive run before GEPA started. To exercise the full deploy path, an adversarially-weakened baseline (via `--baseline-override-file`) evolved `0.67 → 1.00` pass-rate with 2 wins / 0 losses on the holdout, clearing the closed-loop gate and deploying. The saturating-real-section result is the expected, correct outcome, not a bug: there is no headroom to evolve into when the section already passes every behavioral task. 
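As a rough illustration of the gate rule stated above (pass-rate no-regression plus `n_wins >= 2 * n_losses`), the sketch below encodes only those two conditions. The function name, the reason strings, and the `"deploy"` / `"reject"` return values are hypothetical; the real `ClosedLoopValidator.decide` may apply further checks (for example a required-gain threshold) that are not modeled here.

```python
def closed_loop_decide(baseline_pass_rate, evolved_pass_rate, n_wins, n_losses):
    """Hypothetical sketch of the two-condition closed-loop gate.

    Assumption: only the two conditions named in the text are checked;
    reason labels are illustrative, not the framework's actual strings.
    """
    reasons = []
    if evolved_pass_rate < baseline_pass_rate:
        # Condition 1: the evolved section must not lower the pass-rate.
        reasons.append("pass_rate_regression")
    if n_wins < 2 * n_losses:
        # Condition 2: wins must outnumber losses at least two to one.
        reasons.append("insufficient_win_loss_margin")
    decision = "deploy" if not reasons else "reject"
    return decision, reasons


# The weakened-baseline run from the empirical anchors: 0.67 -> 1.00, 2W / 0L.
print(closed_loop_decide(0.67, 1.00, n_wins=2, n_losses=0))
```

Under these two conditions the weakened-baseline run clears the gate, while a candidate that wins two tasks but loses two would be rejected on the win/loss margin even if its pass-rate were unchanged.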
+ ## Failure-mode summary | Trigger | Outcome | Where to look | @@ -565,3 +717,8 @@ Manual smoke harness: `tests/manual/skill_closed_loop_smoke.py` (supports `--sui | Closed-loop validator concurrent run | `ConcurrentRunError` (`fcntl.flock` non-blocking acquire fails) | console only | | Closed-loop validator drift between tasks | `ChecksumDriftError` after the offending task; phase aborts, restore still runs | run.log + raised error | | Closed-loop cache validator failure during evolution | `WARNING` logged, cache returns `None`, GEPA continues without the verdict — never aborts the run | run.log | +| Prompt-section suite < 2 tasks | `ValueError` (can't split into non-empty train + holdout) | console only | +| Prompt-section stale `.cl_backup` on guard entry | `RuntimeError` naming the backup file; refuses to start | console only | +| Prompt-section saturated baseline, non-interactive | `decision="denied"` `gate_decision.json`; GEPA never runs (override with `--force-saturation-check`) | `gate_decision.json` (`saturation_band`) | +| Prompt-section closed-loop gate rejects | `decision="reject"` `reason="closed_loop_gate"`; section not applied | `gate_decision.json` (`closed_loop` block) | +| Prompt-section `--create-pr` | recorded as `skipped` (PR automation deferred); use `--apply` + manual PR | `gate_decision.json` (`pr_created` block) | diff --git a/generate_report.py b/generate_report.py index 3008116..b7a1ec7 100644 --- a/generate_report.py +++ b/generate_report.py @@ -45,13 +45,81 @@ DEFAULT_LOGO = REPO_ROOT / "assets" / "dna.png" +def _extract_prompt_section_data(gate: dict, run_dir: Path) -> dict[str, Any]: + """Build the render context for a Phase 3 prompt-section run. + + The prompt-section path is behavioral-only — its gate_decision carries a + ``closed_loop`` pass-rate / win-loss block instead of the skill/tool + bootstrap-CI + synthetic-dataset + knee-point fields, and self-sources + cost/timing/call-count (no metrics.json needed). 
The ``_experiment`` and + ``_results`` renderers branch on ``artifact_type`` to lay out the matching + tables; every other section is prose-driven via the keys returned here. + """ + cl = gate.get("closed_loop", {}) + cost = gate.get("cost", {}) + resolved = (gate.get("run_inputs", {}) or {}).get("resolved_lms", {}) + + n_wins = int(cl.get("n_wins", 0)) + n_losses = int(cl.get("n_losses", 0)) + n_ties = int(cl.get("n_ties", 0)) + cl_total = n_wins + n_losses + n_ties + baseline_rate = float(cl.get("baseline_pass_rate", 0.0)) + evolved_rate = float(cl.get("evolved_pass_rate", 0.0)) + cl_baseline_pass = round(baseline_rate * cl_total) + cl_evolved_pass = round(evolved_rate * cl_total) + elapsed = int(float(gate.get("elapsed_seconds", 0))) + lm_calls = sum(int(m.get("calls", 0)) for m in (cost.get("by_model") or {}).values()) + decision = gate.get("decision", "") + + def _model(role: str) -> str: + return (resolved.get(role) or {}).get("model", "—") + + return { + "artifact_type": "prompt_section", + "skill_name": gate.get("target_section", run_dir.parent.name), + "section_name": gate.get("target_section", ""), + "baseline_chars": int(gate.get("baseline_chars", 0)), + "evolved_chars": int(gate.get("evolved_chars", 0)), + "growth_pct": float(gate.get("growth_pct", 0.0)), + "growth_abs_pct": abs(float(gate.get("growth_pct", 0.0))), + "decision": decision, + "decision_upper": "DEPLOYED" if decision == "deploy" else "REJECTED", + "decision_signal": gate.get("decision_signal", "closed_loop"), + "baseline_pass_rate": baseline_rate, + "evolved_pass_rate": evolved_rate, + "baseline_pass_pct": baseline_rate * 100, + "evolved_pass_pct": evolved_rate * 100, + "cl_baseline_pass": cl_baseline_pass, + "cl_evolved_pass": cl_evolved_pass, + "cl_total_tasks": cl_total, + "cl_tasks_gained": cl_evolved_pass - cl_baseline_pass, + "n_wins": n_wins, + "n_losses": n_losses, + "n_ties": n_ties, + "elapsed_seconds": elapsed, + "elapsed_minutes": elapsed // 60, + "cost_total_usd": 
float(cost.get("total_usd", 0.0)), + "lm_calls_metrics": lm_calls, + "optimizer_lm": _model("optimizer"), + "reflection_lm": _model("reflection"), + "eval_lm": _model("eval"), + "saturation_band": gate.get("saturation_band", ""), + "sentinel_failures": int(gate.get("sentinel_failures", 0)), + "decision_reasons": "; ".join(cl.get("decision_reasons", [])), + } + + def _extract_run_data(run_dir: Path) -> dict[str, Any]: """Pull all numbers the renderer needs from a run dir. Reads gate_decision.json (always present) + metrics.json (deploy only) + - run.log (LM call counts grep'd from timing-callback lines). + run.log (LM call counts grep'd from timing-callback lines). Prompt-section + (Phase 3) runs are behavioral-only and self-source from gate_decision.json + alone — see ``_extract_prompt_section_data``. """ gate = json.loads((run_dir / "gate_decision.json").read_text()) + if gate.get("artifact_type") == "prompt_section": + return _extract_prompt_section_data(gate, run_dir) metrics_path = run_dir / "metrics.json" metrics = json.loads(metrics_path.read_text()) if metrics_path.is_file() else {} @@ -442,7 +510,7 @@ def _approach(prose: dict, ctx: dict, styles) -> list: ap = prose["approach"] engines = ap["engines"] flow = [ - Paragraph("Approach: Evolutionary Skill Optimization", styles['SectionHead']), + Paragraph(ap.get("section_title", "Approach: Evolutionary Optimization"), styles['SectionHead']), Paragraph("Three Optimization Engines", styles['SubSection']), _highlight_table( header=engines["header"], @@ -463,9 +531,57 @@ def _approach(prose: dict, ctx: dict, styles) -> list: return flow +def _experiment_prompt_section(exp: dict, overrides: dict, ctx: dict, styles) -> list: + """Phase 3 experiment section: behavioral config (no synthetic eval set, + no knee-point), and the suite is described in prose rather than via a + train.jsonl examples table (prompt-section runs don't write one).""" + config_rows = [ + ['Target Section', _fmt(overrides["target_section_label"], 
ctx)], + ['Baseline Size', f'{ctx["baseline_chars"]:,} characters'], + ['Optimizer LM', _fmt(overrides["optimizer_lm"], ctx)], + ['Reflection LM (GEPA)', _fmt(overrides["reflection_lm"], ctx)], + ['Content-Judge LM (Layer 2)', _fmt(overrides["eval_judge_lm"], ctx)], + ['Agent (hermes -z)', _fmt(overrides["agent_lm"], ctx)], + ['Behavioral Suite', f'{ctx["cl_total_tasks"]} holdout tasks (real hermes -z, scored end-to-end)'], + ['Total Optimization Time', + f'{ctx["elapsed_seconds"]:,} seconds (~{ctx["elapsed_minutes"]} minutes)'], + ['Total LM Calls (in-process)', f'{ctx["lm_calls_metrics"]:,}'], + ['Total Cost (USD, in-process)', f'${ctx["cost_total_usd"]:.2f}'], + ['Deploy Gate', _fmt(overrides["quality_gate_label"], ctx)], + ['Saturation Pre-flight', _fmt(overrides["saturation_label"], ctx)], + ] + config_data = [[_wrap_cell(c, styles['TableHeaderCell']) for c in ['Parameter', 'Value']]] + config_data += [[_wrap_cell(c, styles['TableCell']) for c in row] for row in config_rows] + config_table = Table(config_data, colWidths=[2.0 * inch, 4.0 * inch]) + config_table.setStyle(TableStyle([ + ('BACKGROUND', (0, 0), (-1, 0), HexColor('#1a1a2e')), + ('GRID', (0, 0), (-1, -1), 0.5, HexColor('#cccccc')), + ('TOPPADDING', (0, 0), (-1, -1), 5), + ('BOTTOMPADDING', (0, 0), (-1, -1), 5), + ('LEFTPADDING', (0, 0), (-1, -1), 8), + ])) + return [ + Paragraph(exp.get("section_title", "Experiment"), styles['SectionHead']), + Paragraph("Configuration", styles['SubSection']), + config_table, + Paragraph("Evaluation Suite", styles['SubSection']), + Paragraph(_fmt(exp["dataset_intro"], ctx), styles['BodyJust']), + Paragraph("Fitness Function", styles['SubSection']), + Paragraph(_fmt(exp["fitness_intro"], ctx), styles['BodyJust']), + Paragraph( + f"{exp['fitness_formula']}", + ParagraphStyle('Formula', parent=styles['Normal'], alignment=TA_CENTER, + spaceBefore=8, spaceAfter=8, fontSize=10), + ), + Paragraph(_fmt(exp["fitness_closing"], ctx), styles['BodyJust']), + ] + + def 
_experiment(prose: dict, ctx: dict, styles, examples: list[tuple[str, str]]) -> list: exp = prose["experiment"] overrides = exp["config_overrides"] + if ctx.get("artifact_type") == "prompt_section": + return _experiment_prompt_section(exp, overrides, ctx, styles) # Phase 1 runs counted gpt-4.1-mini + gpt-5-mini explicitly via run.log grep; # Phase 2 runs use a single optimizer LM tier (e.g., gpt-5.4-mini), so fall @@ -571,24 +687,38 @@ def _results(prose: dict, ctx: dict, styles) -> list: accent_bg = HexColor('#fff8e1') accent_fg = HexColor('#5d4037') - results_rows = [ - ['Metric', 'Baseline', 'Evolved (knee-point pick)', 'Δ'], - ['Body size (chars)', f'{ctx["baseline_chars"]:,}', f'{ctx["evolved_chars"]:,}', f'{ctx["growth_pct"]:+.1%}'], - [f'Avg holdout score (n={ctx["n_holdout"]})', - f'{ctx["avg_baseline"]:.3f}', f'{ctx["avg_evolved"]:.3f}', f'{ctx["improvement"]:+.3f}'], - ['Bootstrap mean diff', '—', f'{ctx["bootstrap_mean"]:+.3f}', '—'], - ['Bootstrap 90% CI lower', '—', f'{ctx["bootstrap_lower"]:+.3f}', '—'], - ['Bootstrap 90% CI upper', '—', f'{ctx["bootstrap_upper"]:+.3f}', '—'], - ] - # Phase 2: surface the closed-loop behavioral signal when the v5 schema - # exposed it (absent on synthetic-only runs). - if ctx.get("cl_total_tasks"): - results_rows.append([ - f'Closed-loop tasks (n={ctx["cl_total_tasks"]})', - f'{ctx["cl_baseline_pass"]}/{ctx["cl_total_tasks"]}', - f'{ctx["cl_evolved_pass"]}/{ctx["cl_total_tasks"]}', - f'+{ctx["cl_tasks_gained"]} (req ≥{ctx["cl_required_gain"]})', - ]) + if ctx.get("artifact_type") == "prompt_section": + # Behavioral-only: pass-rate + win/loss, no bootstrap/synthetic rows. 
+ delta_rate = ctx["evolved_pass_rate"] - ctx["baseline_pass_rate"] + results_rows = [ + ['Metric', 'Baseline', 'Evolved', 'Δ'], + ['Section size (chars)', f'{ctx["baseline_chars"]:,}', f'{ctx["evolved_chars"]:,}', f'{ctx["growth_pct"]:+.1%}'], + [f'Holdout pass-rate (n={ctx["cl_total_tasks"]})', + f'{ctx["baseline_pass_rate"]:.0%}', f'{ctx["evolved_pass_rate"]:.0%}', f'{delta_rate:+.0%}'], + [f'Tasks passing (n={ctx["cl_total_tasks"]})', + f'{ctx["cl_baseline_pass"]}/{ctx["cl_total_tasks"]}', + f'{ctx["cl_evolved_pass"]}/{ctx["cl_total_tasks"]}', + f'+{ctx["n_wins"]}W / {ctx["n_losses"]}L'], + ] + else: + results_rows = [ + ['Metric', 'Baseline', 'Evolved (knee-point pick)', 'Δ'], + ['Body size (chars)', f'{ctx["baseline_chars"]:,}', f'{ctx["evolved_chars"]:,}', f'{ctx["growth_pct"]:+.1%}'], + [f'Avg holdout score (n={ctx["n_holdout"]})', + f'{ctx["avg_baseline"]:.3f}', f'{ctx["avg_evolved"]:.3f}', f'{ctx["improvement"]:+.3f}'], + ['Bootstrap mean diff', '—', f'{ctx["bootstrap_mean"]:+.3f}', '—'], + ['Bootstrap 90% CI lower', '—', f'{ctx["bootstrap_lower"]:+.3f}', '—'], + ['Bootstrap 90% CI upper', '—', f'{ctx["bootstrap_upper"]:+.3f}', '—'], + ] + # Phase 2: surface the closed-loop behavioral signal when the v5 schema + # exposed it (absent on synthetic-only runs). + if ctx.get("cl_total_tasks"): + results_rows.append([ + f'Closed-loop tasks (n={ctx["cl_total_tasks"]})', + f'{ctx["cl_baseline_pass"]}/{ctx["cl_total_tasks"]}', + f'{ctx["cl_evolved_pass"]}/{ctx["cl_total_tasks"]}', + f'+{ctx["cl_tasks_gained"]} (req ≥{ctx["cl_required_gain"]})', + ]) results_rows.append(['Decision', '—', decision_cell, decision_note]) # Per-cell style picks: header row uses bold/white; first column (metric diff --git a/reports/phase3_prose.yaml b/reports/phase3_prose.yaml new file mode 100644 index 0000000..e68499c --- /dev/null +++ b/reports/phase3_prose.yaml @@ -0,0 +1,233 @@ +# Editorial content for the Phase 3 validation report. 
+# Numbers come from the run dir's gate_decision.json (the prompt-section path is +# behavioral-only and self-sources cost/timing/calls — no metrics.json/run.log +# needed). Pass via `generate_report.py --run output/prompts//`. Text blocks +# may include {placeholder} substitutions the renderer fills from that data. + +meta: + title: "Agent Self-Evolution" + subtitle: "Phase 3 Validation Report
System-prompt section evolution via splice-and-restore" + organization: "" + repository: "github.com/jramos/agent-self-evolution" + +executive_summary: + framework_intro: > + Agent Self-Evolution is a standalone optimization pipeline that uses DSPy and GEPA + (Genetic-Pareto Prompt Evolution) to automatically improve an agent's skills, tool + descriptions, system prompts, and code through evolutionary search — all via API + calls with no GPU training required. Phase 1 shipped a synthetic-only deploy gate; + Phase 2 made it behavior-aware and brought tool-description parity. Phase 3 extends + the framework to the third instructions surface — named sections of the agent's + system prompt — evaluated end-to-end against the real agent. + run_summary: > + This report documents the Phase 3 validation of system-prompt section evolution. + The target is a top-level string constant in Hermes Agent's + prompt_builder.py (here, {section_name}), evolved + via GEPA and validated purely behaviorally: every candidate is spliced into + the live prompt file and scored by running the real agent + (hermes -z) against a curated task suite — there is no + synthetic LLM-as-judge signal to lean on. Production {section_name} is already + well-tuned, so the saturation pre-flight correctly default-denies it (no headroom). + To exercise the loop end-to-end, the headline run evolves a deliberately-weakened + baseline (supplied via --baseline-override-file): the + agent's holdout pass-rate moved {baseline_pass_rate:.0%} → {evolved_pass_rate:.0%} + ({cl_baseline_pass}/{cl_total_tasks} → {cl_evolved_pass}/{cl_total_tasks} tasks, + +{n_wins}W / {n_losses}L) while the section shrank {growth_pct:+.1%}. + The closed-loop deploy gate decided {decision_upper}, and the live prompt file + was restored byte-for-byte after every trial. 
+
+key_result_box:
+  title_template: "KEY RESULT — {section_name} (prompt-section deploy via closed-loop gate)"
+  rows:
+    - "Holdout pass-rate (n={cl_total_tasks}): {baseline_pass_rate:.0%} → {evolved_pass_rate:.0%} (+{n_wins}W / {n_losses}L)"
+    - "Tasks passing: {cl_baseline_pass}/{cl_total_tasks} → {cl_evolved_pass}/{cl_total_tasks}"
+    - "Section size: {baseline_chars:,} → {evolved_chars:,} chars ({growth_pct:+.1%})"
+    - "Decision: {decision_upper} via the closed-loop behavioral gate"
+
+background:
+  intro: >
+    Agent Self-Evolution targets the instructions layer of an LLM agent — skill files,
+    tool descriptions, and system prompts — and evolves the text via API-only
+    evolutionary search. An agent's behavior is governed by three layers:
+  layers:
+    header: ["Layer", "What It Is", "How It's Currently Improved"]
+    rows:
+      - ["Model Weights", "The underlying LLM (Claude, GPT, etc.)", "RL training (Tinker-Atropos)"]
+      - ["Instructions", "Skills, system prompts, tool descriptions", "Manual authoring (static)"]
+      - ["Tool Code", "Python implementations of each tool", "Manual development"]
+    highlight_row: 1
+  closing: >
+    Phases 1 and 2 validated skill files and tool descriptions. Phase 3 completes the
+    instructions trio with system-prompt sections — the highest-leverage, widest
+    blast-radius surface, since one section governs the agent across every task. The
+    section is a string constant inside Hermes' own source, so unlike the skill path
+    (separate writable workdir) there is no env-var hook or plugin seam: the framework
+    edits prompt_builder.py in place. The integration is an
+    AST-precise splice-and-restore — the candidate is byte-spliced into the live
+    file for the duration of a trial and restored from an atomic backup afterward
+    (flock + checksum-drift detection + parse-guard, reused
+    from the Phase 2 closed-loop validator).
Crucially, a system-prompt section has no
+    cheap synthetic proxy: the only honest measure of "did this guidance help" is
+    running the real agent, so Phase 3's deploy gate is purely behavioral.
+
+approach:
+  section_title: "Approach: Behavioral Prompt-Section Evolution"
+  engines:
+    header: ["Engine", "What It Optimizes", "License", "Role"]
+    rows:
+      - ["DSPy + GEPA", "Skills, prompts, tool descriptions", "MIT", "Primary (validated)"]
+      - ["DSPy MIPROv2", "Few-shot examples, instruction text", "MIT", "Fallback optimizer"]
+      - ["Darwinian Evolver", "Code files, algorithms", "AGPL v3", "Code evolution (Phase 4)"]
+  gepa_narrative: >
+    GEPA (Genetic-Pareto Prompt Evolution) is the star engine — an ICLR 2026
+    Oral paper from Stanford/UC Berkeley. Unlike traditional evolutionary search that
+    only sees pass/fail scores, GEPA reads full execution traces to understand
+    why things failed, then proposes targeted mutations. Phase 3 wires GEPA to a
+    sentinel-preserving proposer (mutations are confined to the section's text, never
+    the surrounding scaffolding) and routes every candidate score through a real
+    hermes -z subprocess. Because the spliced
+    prompt_builder.py is a single shared file and DSPy
+    evaluates with a thread pool, candidate scoring is serialized under a lock — an
+    accepted cost of the splice-and-restore model.
+  pipeline_steps:
+    - "Resolve baseline — Read the section's current text from prompt_builder.py (or accept a weakened baseline via --baseline-override-file to create headroom on an already-tuned section)"
+    - "Split — Deterministic seeded train / holdout split of the curated JSONL task suite"
+    - "Saturation pre-flight — Score the baseline behaviorally on the holdout; a no_headroom band default-denies (correctly refusing to evolve a saturated section) unless overridden"
+    - "GEPA loop — The section text is a sentinel-delimited region of a passthrough predictor's instructions; GEPA mutates it with the sentinel-preserving proposer.
Each candidate is spliced into the live file and scored by running the agent on each task"
+    - "Compound verdict — Layer 1: did the agent invoke the expected tool (e.g. memory)? Layer 2: an LLM judge scores the saved content against each task's rubric"
+    - "Closed-loop deploy gate — Select the GEPA val-best candidate, then run baseline vs. evolved on the holdout suite; deploy iff holdout pass-rate doesn't regress and per-task wins offset losses ≥ 2:1"
+    - "Report + restore — Structured gate_decision.json (v5 schema, prompt-section variant); the live file is restored byte-for-byte"
+  cost_paragraph: >
+    The honest Phase 3 story is two-part. First, the framework's regression-catching
+    discipline: the production {section_name} is already
+    well-tuned, so a capable agent satisfies the suite regardless of small wording
+    changes — the saturation pre-flight scores the baseline at ceiling and correctly
+    default-denies, refusing to spend GEPA budget where no improvement is
+    possible. This mirrors the Phase 2 finding that the framework is improvement-finding
+    only where headroom genuinely exists. Second, to demonstrate that the loop produces
+    a real, grounded improvement when headroom does exist, the headline run
+    evolves a deliberately-adversarial baseline (one that instructs the agent not
+    to save) — exactly the weakened-target approach Phase 2 used for its headline. That
+    run consumed ${cost_total_usd:.2f} across {lm_calls_metrics:,} in-process LM
+    calls in ~{elapsed_minutes:.0f} minutes (the agent's own subprocess spend is
+    separate). Splicing a different section measurably changed live agent behavior, and
+    GEPA recovered a corrected section that the closed-loop gate deployed.
+
+experiment:
+  section_title: "Phase 3 Experiment"
+  config_overrides:
+    target_section_label: "{section_name} — evolved from a deliberately-weakened baseline (production {section_name} is saturated; the weak baseline, supplied via --baseline-override-file, exercises the loop end-to-end)"
+    optimizer_lm: "{optimizer_lm}"
+    reflection_lm: "{reflection_lm}"
+    eval_judge_lm: "{eval_lm}"
+    agent_lm: "openai/gpt-5.4-mini (Hermes-configured default)"
+    quality_gate_label: "closed-loop behavioral — holdout pass-rate no-regression + per-task wins ≥ 2·losses; compound verdict (Layer 1 trigger + Layer 2 content judge)"
+    saturation_label: "forced via --force-saturation-check (the weakened baseline had real headroom; production {section_name} default-denies as no_headroom)"
+  dataset_intro: >
+    The evaluation suite is a curated, hand-authored JSONL benchmark
+    (memory_guidance.jsonl, 12 tasks across five categories:
+    save-preference, save-correction, dont-save-task-progress,
+    dont-save-completed-work-log, and declarative-vs-imperative). Unlike Phases 1 and 2,
+    there is no synthetically-generated train/val/holdout of LLM-judge examples —
+    every task is scored behaviorally by running the real agent, and the deploy gate's
+    holdout is {cl_total_tasks} of those tasks. Each save task carries an
+    expected_save_content rubric consumed by the Layer 2
+    content judge.
+  fitness_intro: >
+    Fitness is behavioral, not a synthetic judge score. For each task, the candidate
+    section is spliced into the live prompt_builder.py, the
+    agent runs once via hermes -z, and the resulting session
+    is read back from Hermes' SQLite session store.
The verdict is compound:
+  fitness_formula: "pass = Layer1(expected memory action fired, forbidden actions absent) AND Layer2(content-judge score ≥ 0.7 on save tasks)"
+  fitness_closing: >
+    GEPA's reflection LM reads the per-task failures and proposes a targeted mutation of
+    the section text; the sentinel-preserving proposer confines edits to the section and
+    re-raises rather than admit a candidate that drops the markers. The deploy gate then
+    re-runs baseline vs. evolved on the holdout and decides on the behavioral signal
+    alone — holdout pass-rate no-regression plus a per-task win/loss rule — with no
+    paired-bootstrap CI, because there is no synthetic per-example distribution to
+    resample.
+
+results:
+  narrative: >
+    Evolving the weakened {section_name} baseline, the agent's holdout pass-rate
+    moved {baseline_pass_rate:.0%} → {evolved_pass_rate:.0%}
+    ({cl_baseline_pass}/{cl_total_tasks} → {cl_evolved_pass}/{cl_total_tasks} tasks,
+    +{n_wins} wins / {n_losses} losses, {n_ties} ties) while the section text
+    shrank {growth_pct:+.1%} ({baseline_chars:,} → {evolved_chars:,}
+    chars). GEPA learned from the save-task failures and inverted the adversarial
+    instruction — it removed the "never proactively save" misdirection and restored
+    proactive saving while keeping the legitimate "don't store passing remarks"
+    discrimination, in fewer characters. Decision: {decision_upper} via the
+    closed-loop gate ({decision_reasons}). The proposer rejected
+    {sentinel_failures} sentinel-breaking candidates. Throughout, the live
+    prompt_builder.py was restored byte-for-byte after every
+    trial. The production {section_name} itself is saturated and correctly
+    default-denies — the framework is regression-catching, and only finds improvements
+    where real headroom exists.
+  how_produced_intro: "GEPA evolves the section text through a reflective loop; the gate then reads the behavioral signal:"
+  how_produced_steps:
+    - "Splice a candidate section into the live prompt_builder.py (only when the candidate changes); run each holdout task once via hermes -z and read the session from Hermes' state.db"
+    - "Score each run with the compound verdict (Layer 1 tool-trigger membership + Layer 2 content judge on memory-save content); abstentions (agent/runner errors) score 0 in-loop and tie at the gate"
+    - "The reflection LM reads the failures and proposes a sentinel-confined mutation of the section text; GEPA accepts on improvement-or-equal"
+    - "Select the GEPA val-best candidate; run the closed-loop deploy gate (baseline vs. evolved on {cl_total_tasks} holdout tasks, its own backup/restore)"
+    - "Decide — Deploy iff evolved holdout pass-rate ≥ baseline AND per-task wins offset losses ≥ 2:1. On this run: {baseline_pass_rate:.0%} → {evolved_pass_rate:.0%}, {n_wins}W/{n_losses}L → {decision_upper}"
+  how_produced_closing: >
+    Two design choices made this outcome trustworthy. First, the splice-and-restore
+    guard (atomic backup + exclusive flock + byte-restore,
+    with stale-backup refusal) means the user's Hermes checkout is never left mutated,
+    even on crash. Second, the deploy gate is the same proven closed-loop validator used
+    for tool descriptions — the prompt path adds only a thin installer plus a per-task
+    content judge, so the decision rule, audit trail, and restore machinery are shared
+    and already battle-tested. The behavioral-only design is not a shortcut: it is the
+    only honest measure for a system-prompt section, which has no cheap synthetic proxy.
+
+safety:
+  intro: "Every evolved section must clear these constraints, and the live prompt file is protected throughout:"
+  table:
+    header: ["Constraint", "Enforcement", "Status"]
+    rows:
+      - ["Self-evolution test suite", "1,232 pytest tests pass on the optimizer itself", "Implemented"]
+      - ["Byte-clean splice/restore", "Atomic backup + byte-for-byte restore of prompt_builder.py after every run", "Implemented"]
+      - ["Parse-guarded write", "Candidate spliced via repr() + ast.parse check; refuses to write non-parseable Python", "Implemented"]
+      - ["Exclusive lock + drift check", "flock on the prompt file's dir + sha-drift detection; stale-backup refusal on startup", "Implemented"]
+      - ["Compound verdict", "Layer 1 tool-trigger membership AND Layer 2 LLM content judge (≥ threshold)", "Implemented"]
+      - ["Abstain on corrupt session", "A malformed agent session abstains (neutral), never scores as a behavioral regression", "Implemented"]
+      - ["Closed-loop deploy gate", "Holdout pass-rate no-regression + per-task wins ≥ 2·losses", "Implemented"]
+      - ["Saturation pre-flight", "Default-denies a saturated (no_headroom) section before spending GEPA budget", "Implemented"]
+      - ["Budget ceiling", "--max-cost-usd aborts on in-process LM spend overrun", "Implemented"]
+      - ["Deployment via apply + review", "--apply writes the section; PR automation deferred for prompt sections", "By design"]
+      - ["Benchmark regression", "External --benchmark-cmd hook (TBLite / harness)", "Planned"]
+  closing: >
+    The source Hermes repository is never left modified: the section is spliced in only
+    for the duration of a trial and restored from an atomic backup, and all evolution
+    output (gate decisions, section before/after text, run logs) is written under the
+    framework's local output/ directory. PR automation is
+    deferred for prompt sections — a section-scoped PR path is future work — so the
+    deploy step is an explicit --apply plus a human-authored
+    pull request.
+
+roadmap:
+  table:
+    header: ["Phase", "Target", "Engine", "Timeline", "Status"]
+    rows:
+      - ["Phase 1", "Skill files (SKILL.md)", "DSPy + GEPA", "3-4 weeks", "Validated ✓"]
+      - ["Phase 2", "Tool descriptions", "DSPy + GEPA", "2-3 weeks", "Validated ✓"]
+      - ["Phase 3", "System prompt sections", "DSPy + GEPA", "2-3 weeks", "Validated ✓"]
+      - ["Phase 4", "Tool implementation code", "Darwinian Evolver", "3-4 weeks", "Planned"]
+      - ["Phase 5", "Continuous improvement", "Automated pipeline", "2 weeks", "Planned"]
+    highlight_row: 2
+  closing: >
+    Phase 3 completes the instructions trio — skills, tool descriptions, and now
+    system-prompt sections — all gated by the same closed-loop discipline. The
+    behavioral-only deploy gate proves the framework can evolve the highest-blast-radius
+    instructions surface safely: it default-denies a saturated section, produces a real
+    grounded improvement where headroom exists, and never leaves the agent's source
+    mutated. Phase 4 (tool implementation code) and Phase 5 (continuous improvement)
+    extend the framework beyond the instructions layer.
+
+next_steps:
+  - "Harder behavioral suites — Production system-prompt sections are heavily tuned and saturate the current suites; develop richer, harder task suites (and weaker agent tiers) so headroom exists on real targets, not only adversarial baselines."
+  - "Additional sections — The same path supports any string-constant section (SKILLS_GUIDANCE, SESSION_SEARCH_GUIDANCE, etc.); MEMORY_GUIDANCE was the first proof point, chosen for its clear tool-call anchor."
+  - "Section-scoped PR automation — Wire --create-pr for prompt sections by splicing into origin/<base>'s prompt_builder.py (not the local checkout), so the PR diff carries only the section change."
+  - "Agent-side cost capture — The agent's own LM spend happens inside the hermes subprocess and is invisible to the in-process budget ceiling; surface it from the session store so --max-cost-usd accounts for end-to-end spend."
diff --git a/reports/phase3_validation_report.pdf b/reports/phase3_validation_report.pdf
new file mode 100644
index 0000000..b2cd415
Binary files /dev/null and b/reports/phase3_validation_report.pdf differ