From d32e9976376ea4ce0a953bb2e17a9da12ac6d9fd Mon Sep 17 00:00:00 2001 From: Justin Ramos Date: Tue, 2 Jun 2026 09:04:07 -0600 Subject: [PATCH 1/4] docs(readme): document Phase 3 prompt-section evolution Add an 'Evolve a system prompt section' Quick Start subsection (behavioral closed-loop validation, compound verdict, splice-and-restore, --apply, --baseline-override-file) and mark Phase 3 complete in the capabilities table. --- README.md | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index e145a91..60386b8 100644 --- a/README.md +++ b/README.md @@ -169,6 +169,22 @@ The framework parses every `*_SCHEMA = {...}` and `*_SCHEMAS = [...]` declaratio With `--apply`, the evolved description is spliced into the source file's bytes at the original position — comments, formatting, and unrelated tools are untouched. Multi-line parenthesized concatenations collapse to a single triple-quoted string at the same indent. +### Evolve a system prompt section + +For Hermes Agent, evolve a named section of the assembled system prompt — any top-level string constant in `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`, which governs when and what the agent saves to memory): + +```bash +uv run python -m evolution.prompts.evolve_prompt_section \ + --section MEMORY_GUIDANCE \ + --hermes-repo /path/to/hermes-agent \ + --tasks evolution/validation/suites/memory_guidance.jsonl \ + --iterations 10 +``` + +Unlike skill and tool evolution — where the deploy gate can lean on a synthetic LLM-judge signal — a prompt section is evaluated **purely behaviorally**: every candidate is spliced into the live `prompt_builder.py` and scored by running the real agent (`hermes -z`) against the task suite. The verdict is compound — Layer 1 checks whether the agent invoked the expected tool (e.g. `memory`), and Layer 2 runs an LLM judge over the saved content against each task's `expected_save_content` rubric. 
The candidate is spliced in only for the duration of the run; the file is restored byte-for-byte afterward (atomic backup + flock + checksum-drift detection, shared with the tool-description path). + +`--apply` writes the evolved section into `prompt_builder.py` in place; results land in `output/prompts/
//`. PR automation (`--create-pr`) is not yet wired for prompt sections — use `--apply` plus a manual PR. To demonstrate the loop on an already-tuned section (which the saturation pre-flight will otherwise correctly default-deny as having no headroom), `--baseline-override-file` starts evolution from arbitrary text — e.g. a deliberately-weakened baseline that gives GEPA real failures to learn from. + ### Mine real session history for evals For skill evolution: @@ -331,7 +347,7 @@ Cost: each task is one `hermes -z` run (~$0.05–$0.50). The bundled `patch.json |-------|--------|--------|--------| | **Phase 1** | Skill files (SKILL.md) | DSPy + GEPA | ✅ [Validated](reports/phase1_validation_report.pdf) | | **Phase 2** | Tool descriptions + dual-signal deploy gate | DSPy + GEPA | ✅ [Validated](reports/phase2_validation_report.pdf) | -| **Phase 3** | System prompt sections | DSPy + GEPA | 🔲 Planned | +| **Phase 3** | System prompt sections | DSPy + GEPA | ✅ Complete | | **Phase 4** | Tool implementation code | Darwinian Evolver | 🔲 Planned | | **Phase 5** | Continuous improvement loop | Automated pipeline | 🔲 Planned | From ef318c0797e60586484dedd851ba8176e8c9ba58 Mon Sep 17 00:00:00 2001 From: Justin Ramos Date: Tue, 2 Jun 2026 09:05:52 -0600 Subject: [PATCH 2/4] docs(interfaces): add evolve_prompt_section CLI reference --- docs/interfaces.md | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/docs/interfaces.md b/docs/interfaces.md index 412291b..4476fd3 100644 --- a/docs/interfaces.md +++ b/docs/interfaces.md @@ -140,6 +140,46 @@ Evolves one tool's top-level `description` field inside an MCP-shape manifest. T - `sys.exit(1)` if the holdout split has fewer than `min_holdout_size` (default 10) examples. - Returns normally (rejection path) if static or growth-quality gate fails — `evolved_FAILED.json` + `gate_decision.json` are written. 
+## CLI: `python -m evolution.prompts.evolve_prompt_section` + +Evolves one named section of an agent's system prompt — a top-level string constant in Hermes Agent's `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`). Unlike the skill and tool paths, evaluation is **purely behavioral**: there is no synthetic LLM-judge signal. Every candidate is spliced into the live `prompt_builder.py` and scored by running the real agent (`hermes -z`) against the task suite, so the deploy gate is a `ClosedLoopValidator` run (pass-rate + win/loss), not a paired-bootstrap CI over judge scores. + +The verdict is **compound**: Layer 1 is the same `expected_tools` / `forbidden_tools` membership rule as the closed-loop tool path; Layer 2 is an LLM judge that scores each `memory(action=add|replace)` call's content against the task's `expected_save_content` rubric (only tasks that declare a rubric are Layer-2 judged). The candidate is spliced in for the duration of the run and the file is restored byte-for-byte afterward, reusing the tool-path backup + flock + checksum-drift machinery. + +### Required flags +| Flag | Purpose | +|---|---| +| `--section ` | The `prompt_builder.py` top-level string constant to evolve (e.g. `MEMORY_GUIDANCE`). Dict-typed constants (e.g. `PLATFORM_HINTS`) are not supported. | +| `--hermes-repo ` | Path to your hermes-agent checkout. `agent/prompt_builder.py` inside it is the splice/restore target. | +| `--tasks ` | JSONL eval suite (e.g. `evolution/validation/suites/memory_guidance.jsonl`). Same task shape as the closed-loop tool suite, plus an optional `expected_save_content` rubric per task for Layer 2. Must contain ≥2 tasks (so the split yields a non-empty trainset and holdout). | + +### Optional flags +| Flag | Default | Notes | +|---|---|---| +| `--iterations ` | `10` | GEPA `max_full_evals`. | +| `--holdout-ratio ` | `0.5` | Fraction of tasks held out for the deploy gate. Clamped to keep both the trainset and holdout non-empty. 
| +| `--seed ` | `42` | RNG seed for the train/holdout split and GEPA. | +| `--max-growth ` | `0.2` | Section length budget as a fraction over the baseline; framed to the `PromptSectionProposer` so candidates stay near the baseline length (set higher when evolving from a short baseline that needs to grow). | +| `--optimizer-model` / `--reflection-model` / `--eval-model ` | config default | Per-role LiteLLM model overrides; resolved like the other CLIs. `--eval-model` is the Layer 2 content judge. | +| `--agent-model ` | config default | The model the `hermes -z` agent itself runs as. A deliberately weaker agent exposes more behavioral signal (a strong agent saturates the suite regardless of the prompt). LiteLLM provider prefixes are stripped before `hermes -m`. | +| `--layer2-threshold ` | `0.7` | Minimum mean content-judge score for a save task to pass Layer 2. | +| `--task-timeout-seconds ` | `120` | Per-task wall-clock cap for `hermes -z`. Timeouts abstain (don't tip the decision). | +| `--max-cost-usd ` | `150.0` | Abort cleanly when cumulative **in-process** LM cost (judge + reflection + the passthrough predictor) exceeds this. The agent's own LM spend happens inside the `hermes` child process and is not captured by this ceiling. | +| `--gepa-minibatch-size ` | `3` | GEPA reflective minibatch size; same meaning as the other paths. | +| `--gepa-acceptance {improvement-or-equal,strict-improvement}` | `improvement-or-equal` | Same meaning as the other paths. | +| `--apply` | off | On a deploy decision, write the evolved section into `prompt_builder.py` in place (byte-precise AST splice, `ast.parse`-guarded, atomic). | +| `--create-pr` | off | **Deferred for prompt sections** — accepted and recorded as a `skipped` PR block in `gate_decision.json`, but no PR is opened (copying a full evolved `prompt_builder.py` over `origin/` would carry unrelated local changes into the diff). Use `--apply` + a manual PR. 
| +| `--baseline-override-file ` | off | Start evolution from this text instead of the live section. The live section is still the splice/restore target (backed up + restored); `--apply` still writes the evolved text. Use it to create headroom on an already-tuned section (e.g. a deliberately-weakened baseline) or for regression-injection ablations. | +| `--skip-saturation-check` | off | Skip the saturation pre-flight entirely. | +| `--force-saturation-check` | off | Run the pre-flight, render the panel, but proceed regardless of band — required to override a non-`healthy` verdict non-interactively. | +| `--dry-run` | off | Resolve the baseline + build the modules, then stop — exercises wiring with no LM/agent calls. Writes a `decision="dry_run"` `gate_decision.json`. | +| `--output-dir ` | `output/prompts/
//` | Where `gate_decision.json` and the baseline/evolved section text files land. | + +### Exit conditions +- `0` on a `deploy` decision (or a `--dry-run`). +- `1` on `reject` (the holdout deploy gate found a regression), `denied` (saturated baseline default-denied non-interactively), or `aborted` (cost ceiling). +- `ValueError` at startup if the suite has fewer than 2 tasks. + ## CLI: `python -m evolution.core.external_importers` Standalone session-history importer. Useful for previewing what `--eval-source sessiondb` would produce without running the full evolution. From 04a3efb73efb95253ad9c7c1ba24d2a143031998 Mon Sep 17 00:00:00 2001 From: Justin Ramos Date: Tue, 2 Jun 2026 09:13:25 -0600 Subject: [PATCH 3/4] docs(reference): add Phase 3 prompt-section evolution to the knowledge base components.md (orchestrator + supporting modules + shared validation changes), workflows.md (Workflow 12: prompt-section deploy path), architecture.md (prompts tier + HermesPromptSectionInstaller in the module graph), codebase_info.md (prompts package + LOC + Tier 3 implemented), data_models.md (prompt-section gate_decision shape + the fields it deliberately omits vs the paired-bootstrap path), index.md (routing rows). --- docs/architecture.md | 32 ++++++++- docs/codebase_info.md | 30 +++++--- docs/components.md | 49 ++++++++++++- docs/data_models.md | 73 ++++++++++++++++++++ docs/index.md | 3 +- docs/workflows.md | 157 ++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 330 insertions(+), 14 deletions(-) diff --git a/docs/architecture.md b/docs/architecture.md index 772a94b..8e73e94 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -56,10 +56,20 @@ graph TB hermes_source[tools.hermes_source
Hermes *_SCHEMA AST adapter]
     end
 
+    subgraph prompts_tier[Prompt Tier]
+        evolve_prompt[prompts.evolve_prompt_section<br/>main + evolve]
+        prompt_module[prompts.prompt_module<br/>PromptModule + sentinels]
+        prompt_proposer[prompts.prompt_proposer<br/>PromptSectionProposer]
+        prompt_judge[prompts.prompt_judge<br/>SaveCallJudge + judge_save_calls<br/>+ prompt fitness/splice scorer]
+        prompt_source[prompts.prompt_source<br/>PromptSource protocol + SectionDescriptor]
+        hermes_prompt_source[prompts.hermes_prompt_source<br/>HermesPromptSource — prompt_builder.py AST]
+    end
+
     subgraph validation_subsystem[Closed-loop validation]
         validator[validation.validator<br/>ClosedLoopValidator]
         hermes_runner[validation.hermes_runner<br/>hermes -z subprocess]
-        installer[validation.artifact_installer<br/>HermesToolDescriptionInstaller]
+        installer[validation.artifact_installer<br/>HermesToolDescriptionInstaller +<br/>HermesPromptSectionInstaller]
+        savejudge[validation.report<br/>score_task Layer-2 judge hook]
         report[validation.report<br/>ValidationReport + decision]
         task[validation.task<br/>Task + TaskSuite]
         cl_cli[validation.closed_loop
CLI] @@ -117,10 +127,26 @@ graph TB tool_judge --> fitness tool_proposer --> budget + evolve_prompt --> prompt_module + evolve_prompt --> prompt_proposer + evolve_prompt --> prompt_judge + evolve_prompt --> prompt_source + evolve_prompt --> hermes_prompt_source + evolve_prompt --> config + evolve_prompt --> quality + evolve_prompt --> timing + evolve_prompt --> validator + hermes_prompt_source --> prompt_source + prompt_module --> dspy + prompt_proposer --> budget + prompt_judge --> fitness + installer --> hermes_prompt_source + validator --> hermes_runner validator --> installer validator --> report validator --> task + validator --> savejudge cl_cli --> validator hermes_runner --> hermes @@ -138,7 +164,9 @@ graph TB importers --> dataset ``` -`evolution/core/` has no dependency on `evolution/skills/`, `evolution/tools/`, or `evolution/validation/`. The reverse holds: tier packages use core helpers but core never imports from a tier package. `closed_loop_feedback.py` imports `evolution.validation.*` types because it's the integration seam, but the validation subpackage doesn't import from skills/tools. This keeps the tier-3/4/5 expansion path open. +`evolution/core/` has no dependency on `evolution/skills/`, `evolution/tools/`, `evolution/prompts/`, or `evolution/validation/`. The reverse holds: tier packages use core helpers but core never imports from a tier package. `closed_loop_feedback.py` imports `evolution.validation.*` types because it's the integration seam, but the validation subpackage doesn't import from skills/tools/prompts. This keeps the tier-4/5 expansion path open. 
+ +The `prompts` tier (Phase 3) is the prompt-section evolution path: `evolve_prompt_section` wraps a named `prompt_builder.py` constant as a `PromptModule` (a passthrough predictor carrying the candidate in sentinel-delimited instructions), mutates it with `PromptSectionProposer`, and — because there is no synthetic classification signal for a system-prompt section — scores **purely behaviorally** through the closed-loop validator running a real `hermes -z` against a curated JSONL suite. The deploy gate is therefore a closed-loop pass-rate / win-loss decision, not a paired-bootstrap one. Unlike the skill/tool tiers it reuses `ClosedLoopValidator` directly rather than going through `closed_loop_feedback.py`, and it integrates by AST-splicing the candidate into the live `agent/prompt_builder.py` (`HermesPromptSectionInstaller`) with atomic restore. The Layer-2 content judge (`SaveCallJudge` / `judge_save_calls`) runs inside `score_task` to grade memory-save *content* on top of the Layer-1 trigger-membership check. 
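
The AST splice mentioned above can be sketched roughly as follows. This is a minimal illustration, not the framework's `HermesPromptSource.write`: the function name `splice_string_constant` is hypothetical, and the sketch assumes the target file is UTF-8 encoded so that `ast` column offsets line up with byte offsets.

```python
import ast
import os
import shutil
import tempfile
from pathlib import Path


def splice_string_constant(path: Path, name: str, new_text: str) -> None:
    """Replace the literal bound to top-level `name` in `path` (hypothetical helper)."""
    source = path.read_bytes()
    tree = ast.parse(source)

    # Byte offset at which each physical line starts.
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    for node in tree.body:  # top-level statements only
        if not isinstance(node, ast.Assign):
            continue
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        value = node.value
        is_str = isinstance(value, ast.Constant) and isinstance(value.value, str)
        if name in targets and is_str:
            # ast reports col offsets as UTF-8 byte offsets within the line.
            start = line_starts[value.lineno - 1] + value.col_offset
            end = line_starts[value.end_lineno - 1] + value.end_col_offset
            break
    else:
        raise KeyError(f"{name} is not a top-level string constant")

    # repr() yields a single-line literal that evaluates back to new_text.
    patched = source[:start] + repr(new_text).encode("utf-8") + source[end:]
    ast.parse(patched)  # raises SyntaxError before anything is written

    # Write to a sibling temp file, then swap it in atomically.
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "wb") as handle:
        handle.write(patched)
    shutil.copymode(path, tmp)
    os.replace(tmp, path)
```

Only the bytes of the literal itself are replaced, so comments, surrounding parentheses, and unrelated constants stay untouched. A dict-typed constant falls through to the `KeyError` branch.
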
## Design patterns in active use diff --git a/docs/codebase_info.md b/docs/codebase_info.md index 2900d46..c16decc 100644 --- a/docs/codebase_info.md +++ b/docs/codebase_info.md @@ -67,14 +67,20 @@ evolution/ │ └── tool_judge.py # tool-flavored LLMJudge + GEPA-shaped metric ├── validation/ # closed-loop validation against a real agent │ ├── agent_runner.py # AgentRunner Protocol + AgentRunResult dataclass -│ ├── artifact_installer.py # ArtifactInstaller Protocol + HermesToolDescriptionInstaller +│ ├── artifact_installer.py # ArtifactInstaller Protocol + HermesToolDescriptionInstaller + HermesPromptSectionInstaller │ ├── closed_loop.py # CLI: drive baseline + evolved through hermes -z, compare -│ ├── hermes_runner.py # HermesAgentRunner — subprocess hermes -z with sandboxed HOME -│ ├── report.py # ValidationReport + TaskResult + decision rule -│ ├── suites/ # JSONL task suites (patch.jsonl, write_file.jsonl, search_files.jsonl) +│ ├── hermes_runner.py # HermesAgentRunner — subprocess hermes -z; reads sessions from SQLite state.db (parse_session_from_db) +│ ├── report.py # ValidationReport + TaskResult + decision rule + Layer-2 SaveCallJudge in score_task +│ ├── suites/ # JSONL task suites (patch.jsonl, write_file.jsonl, search_files.jsonl, memory_guidance.jsonl) │ ├── task.py # Task + TaskSuite.from_jsonl (with sha256 audit) │ └── validator.py # ClosedLoopValidator.validate — mutates + restores live agent file -├── prompts/ # Tier 3: planned, empty package +├── prompts/ # Tier 3: system-prompt-section evolution +│ ├── evolve_prompt_section.py # CLI + orchestration; purely-behavioral closed-loop gate +│ ├── prompt_source.py # PromptSource Protocol (read + write) + SectionDescriptor +│ ├── hermes_prompt_source.py # HermesPromptSource — AST read/write of prompt_builder.py constants +│ ├── prompt_module.py # PromptModule — passthrough predictor carrying candidate in sentinels +│ ├── prompt_proposer.py # PromptSectionProposer — sentinel-preserving GEPA proposer +│ └── 
prompt_judge.py # SaveCallJudge + judge_save_calls Layer-2 content judge + fitness/splice scorers ├── code/ # Tier 4: planned, empty package └── monitor/ # planned, empty package ``` @@ -86,6 +92,7 @@ evolution/ | `evolution/skills/evolve_skill.py` | ~1340 | CLI, orchestration, gate-decision payload assembly | | `evolution/tools/evolve_tool.py` | ~1170 | CLI + orchestration for tool-description evolution | | `evolution/core/external_importers.py` | ~770 | 3 importers + relevance filter + standalone CLI | +| `evolution/prompts/evolve_prompt_section.py` | ~660 | CLI + orchestration; purely-behavioral closed-loop deploy gate | | `evolution/core/dataset_builder.py` | ~480 | synthetic generator + golden loader + tool-selection three-bucket gen | | `evolution/core/lm_timing_callback.py` | ~400 | DSPy BaseCallback + litellm.failure_callback + cost ledger | | `evolution/core/fitness.py` | ~380 | LLMJudge + skill/tool fitness metrics + behavioral score helper | @@ -94,6 +101,7 @@ evolution/ | `evolution/core/closed_loop_feedback.py` | ~320 | cache + saturation gate + deterministic feedback block + `force_run` (bypasses gate for pre-flight) | | `evolution/core/saturation_check.py` | ~255 | pre-flight: band classifier + `SaturationReport` + Rich panel + interactive confirm | | `evolution/tools/tool_judge.py` | ~230 | tool-flavored judge + GEPA-shaped metric with behavioral branch | +| `evolution/prompts/prompt_judge.py` | ~230 | SaveCallJudge + judge_save_calls Layer-2 content judge + prompt fitness/splice scorers | | `evolution/validation/validator.py` | ~220 | mutate + restore live agent file with flock + checksum drift check | | `evolution/validation/report.py` | ~225 | ValidationReport JSON + Rich rendering + two-condition decision | | `evolution/core/skill_sources.py` | ~210 | Hermes / Claude Code / LocalDir | @@ -101,15 +109,19 @@ evolution/ | `evolution/skills/knee_point.py` | ~205 | parsimony-based candidate picker | | `evolution/validation/hermes_runner.py` | ~205 | 
hermes -z subprocess with sandboxed HOME | | `evolution/tools/tool_proposer.py` | ~200 | sentinel-preserving reflection prompt | -| `evolution/validation/artifact_installer.py` | ~150 | byte-precise splice + atomic restore | +| `evolution/prompts/prompt_proposer.py` | ~160 | sentinel-preserving GEPA proposer for prompt sections | +| `evolution/validation/artifact_installer.py` | ~150 | byte-precise splice + atomic restore (tool + prompt-section installers) | +| `evolution/prompts/hermes_prompt_source.py` | ~135 | AST read/write of prompt_builder.py string constants | +| `evolution/prompts/prompt_module.py` | ~120 | PromptModule passthrough predictor + sentinel parse | | `evolution/validation/closed_loop.py` | ~135 | standalone closed-loop CLI | | `evolution/skills/skill_module.py` | ~125 | wraps SKILL.md as `dspy.Module` | | `evolution/validation/task.py` | ~90 | Task + TaskSuite.from_jsonl | | `evolution/core/config.py` | ~80 | `EvolutionConfig` dataclass | | `evolution/core/stats.py` | ~60 | `paired_bootstrap` helper | +| `evolution/prompts/prompt_source.py` | ~55 | PromptSource Protocol + SectionDescriptor | | `evolution/validation/agent_runner.py` | ~55 | AgentRunner Protocol + dataclasses | | `evolution/core/behavioral_example.py` | ~35 | builder for behavioral dspy.Examples | -| **Total** | **~9,000** | excludes empty `__init__.py` shims | +| **Total** | **~10,400** | excludes empty `__init__.py` shims | Test suite: 61 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1166 tests** collected. 
@@ -139,11 +151,11 @@ The README's table summarizes intent; reality: |---|---|---|---| | 1 | Skill files (SKILL.md) | DSPy + GEPA | ✅ implemented in `evolution/skills/` | | 2 | Tool descriptions | DSPy + GEPA | ✅ implemented in `evolution/tools/` — MCP-JSON and Hermes-Python-AST adapters; one target tool per run | -| 3 | System prompt sections | DSPy + GEPA | 🔲 `evolution/prompts/` package exists, empty | +| 3 | System prompt sections | DSPy + GEPA | ✅ implemented in `evolution/prompts/` — AST splice of `prompt_builder.py` constants; purely-behavioral closed-loop deploy gate (no synthetic signal) | | 4 | Tool implementation code | Darwinian Evolver | 🔲 `evolution/code/` package exists, empty; `[darwinian]` extra reserves the dep | | 5 | Continuous improvement loop | Automated pipeline | 🔲 `evolution/monitor/` package exists, empty | -Tiers 1 and 2 are built. Tier 3-5 packages exist as empty stubs to anchor the planned architecture. See PLAN.md's per-phase "Deviations from plan" subsections for where the built tiers diverge from the original spec. +Tiers 1-3 are built. Tier 4-5 packages exist as empty stubs to anchor the planned architecture. See PLAN.md's per-phase "Deviations from plan" subsections for where the built tiers diverge from the original spec. **Orthogonal validation surface.** `evolution/validation/` runs a real agent (`hermes -z`) through a JSONL task suite with baseline vs evolved artifacts spliced into the live install. Scores actual tool-selection behavior with `expected_tools` / `forbidden_tools` per task; compares with a two-condition decision rule. 
Available three ways: diff --git a/docs/components.md b/docs/components.md index 8821142..2d3c85c 100644 --- a/docs/components.md +++ b/docs/components.md @@ -368,6 +368,51 @@ Score is **never** modified by `pred_trace` enrichment — GEPA enforces score e **Cost ceiling + benchmark hook (shared with `evolve_skill`):** `--max-total-cost-usd` participates in the same `CostLedger` kill switch (see `lm_timing_callback.py`); `--benchmark-cmd` is a post-gate shell hook whose env vars include `EVOLVED_PATH` / `BASELINE_PATH` pointing at the rendered manifest JSONs and `ARTIFACT_TYPE="tool_description"`. Both write structured blocks into `gate_decision.json` — see `data_models.md`. +## evolution/prompts/evolve_prompt_section.py — CLI + orchestrator + +**Owns:** the end-to-end `evolve_prompt_section()` flow and the Click CLI (`main`) for evolving a named system-prompt section — a top-level string constant in Hermes `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`). The phase-3 analogue of `evolve_tool`, but with a fundamentally different eval substrate: there is no cheap synthetic classification GEPA can score, so **every** candidate is spliced into the live `prompt_builder.py` and run through a real `hermes -z` subprocess. The deploy gate is therefore a `ClosedLoopValidator` win/loss decision, not a paired-bootstrap CI. + +**Public surface:** +- `main()` — Click command. CLI flags map onto `evolve_prompt_section()` kwargs. +- `evolve_prompt_section(section_name, hermes_repo, tasks_path, ...) -> dict` — orchestrator function. Importable and used directly by tests. + +**Integration model — in-place splice + atomic restore.** Unlike skills (separate writable workdir) there is no env-var hook or plugin seam: the section is a constant inside Hermes' own source, so the framework edits that file in place and restores it. 
The whole evolution runs inside `_prompt_builder_guard(target_path)` — a context manager that takes an atomic `.cl_backup` (`_BACKUP_SUFFIX`), grabs an exclusive `fcntl.flock` on `.cl_validation.lock` (`_LOCK_FILENAME`) in the target's parent dir, and byte-restores the original on exit (refusing to start on a stale backup or a held lock). These are the *same* lock + backup names `ClosedLoopValidator` uses, so the guard is sequenced *before* the deploy-gate validator, never nested. The deploy gate then re-acquires the lock itself. + +**Phases inside `evolve_prompt_section()`:** +1. Resolve baseline: `HermesPromptSource.read(section_name)` validates the section is a top-level string constant, then reads its text — or `--baseline-override-file` supplies starting text (a deliberately-weakened baseline for headroom, or a regression ablation) while the *live* file is still backed up/restored and `--apply` still writes the live section. +2. Train/holdout split of the JSONL suite (`_split_train_holdout`, deterministic shuffle+seed, ≥1 task each side; suites with <2 tasks are rejected). +3. Build the eval stack: `SaveCallJudge` + a per-task Layer-2 factory (`_make_layer2_factory`, binds each task's `expected_save_content` rubric + message into a `score_task`-shaped scorer; returns `None` for tasks with no rubric) → `HermesPromptSectionInstaller` + `HermesAgentRunner` + a `make_memoizing_splice_scorer` over `install_candidate` / `score_task_id`, serialized under a `threading.Lock`. +4. `dspy.configure(lm=eval_lm)` sets the **global** default LM (not just `dspy.context`) so the passthrough predictor resolves an LM inside GEPA's worker threads — without it, `forward()`'s passthrough call raises "No LM is loaded" in those threads, yielding no trajectories and no proposal. +5. 
Inside `_prompt_builder_guard`: saturation pre-flight (baseline behavior on the holdout; aborts/denies on a non-`healthy` band unless `--force-saturation-check`, with non-interactive contexts refusing rather than prompting) followed by GEPA(`PromptModule`, `PromptSectionProposer`, `make_prompt_fitness_metric` + the memoizing splice scorer). Trainset/valset are `_behavioral_examples` (task message + `closed_loop_task_id`). +6. Select the evolved section via GEPA val-argmax (`detailed_results.best_idx`), reading the body back out of the winning candidate's sentinel region (`_section_text_from_candidate`). +7. Deploy gate: `ClosedLoopValidator.validate(...)` runs baseline vs evolved on the holdout suite (the same per-task Layer-2 factory + threshold threaded in). `report.decision == "pass"` is the deploy verdict. +8. Write `gate_decision.json`; on a passing gate `--apply` writes the evolved section back into `prompt_builder.py`. `baseline_section.txt` / `evolved_section.txt` are also emitted. + +`_run_one_task_score` is the GEPA in-loop scorer: materialize the task fixture into a tmp dir, run the agent against whatever section is currently spliced, `score_task`, return 1.0/0.0 (in-loop abstentions score 0.0 — the deploy gate handles abstentions properly). Budget rides the shared `COST_LEDGER` + `CostCeilingExceeded` kill switch; the ceiling abort writes a `cost_ceiling_exceeded` gate decision. + +**`gate_decision.json` additions:** `artifact_type: "prompt_section"`, `target_section: `, `baseline_chars` / `evolved_chars` / `growth_pct`, a `closed_loop` block (the validator decision + pass rates + W/L/T), and `sentinel_failures` (proposer candidates rejected for losing the sentinels). `decision_signal` is always `"closed_loop"`. `--create-pr` is **deferred** for prompt sections (it would pollute the diff with the local override-hook commit) and is recorded as `skipped`; use `--apply` + a manual PR. 
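
The in-place guard described in this section can be sketched as a context manager. The names `file_guard`, `BACKUP_SUFFIX`, and `LOCK_FILENAME` are illustrative stand-ins for `_prompt_builder_guard`, `_BACKUP_SUFFIX`, and `_LOCK_FILENAME`. The ordering of lock and backup and the exception types are assumptions, and `fcntl` makes the sketch POSIX-only.

```python
import contextlib
import fcntl
import os
import shutil
from pathlib import Path

BACKUP_SUFFIX = ".cl_backup"           # suffix named in the text above
LOCK_FILENAME = ".cl_validation.lock"  # lock file named in the text above


@contextlib.contextmanager
def file_guard(target: Path):
    """Hold an exclusive lock, back up `target`, and byte-restore it on exit."""
    backup = target.with_name(target.name + BACKUP_SUFFIX)
    lock_path = target.parent / LOCK_FILENAME
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking: fail fast instead of waiting on another run.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise RuntimeError(f"{lock_path} is held by another run") from exc
        if backup.exists():
            # A leftover backup means an earlier run did not restore cleanly.
            raise RuntimeError(f"stale backup {backup}; resolve it first")
        # Copy to a temp name, then rename, so the backup appears atomically.
        tmp = backup.with_name(backup.name + ".tmp")
        shutil.copy2(target, tmp)
        os.replace(tmp, backup)
        try:
            yield backup
        finally:
            # Restore the original bytes even if the body raised.
            os.replace(backup, target)
    # Leaving the `with open(...)` block closes the file and releases the lock.
```

Because the sketch refuses to start when the lock is already held, a second guard on the same file has to run after the first has exited rather than nested inside it, which matches the sequencing constraint noted above.
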
+ +### Supporting modules (`evolution/prompts/`) + +- `prompt_source.py` — `PromptSource` Protocol (`read` + `write` only, `runtime_checkable`) + `SectionDescriptor` (frozen metadata). The Protocol is deliberately minimal — the driver only reads a baseline and writes/splices an evolved value. `list_sections` is a concrete convenience on `HermesPromptSource` (a future `--list-sections` affordance), not part of the contract. +- `hermes_prompt_source.py` — `HermesPromptSource`, the splice primitive. `read` AST-walks top-level `NAME = "..."` string constants (v1 string-typed only; dict-typed constants like `PLATFORM_HINTS` raise `KeyError`). `write` splices by byte offset using `repr(new_text)` so the literal round-trips byte-equal regardless of embedded quotes/newlines, and `ast.parse`-guards the result before an atomic `os.replace` — it **refuses to write non-parseable Python**, leaving the user's Hermes startable. +- `prompt_module.py` — `PromptModule(section_name, candidate_text)`: a `dspy.Module` whose `ChainOfThought` passthrough predictor carries the candidate in `signature.instructions` between sentinel markers (`` … ``). There is no cheap classification to score, so the predictor exists only as a mutation target. `forward()` **must** invoke the passthrough so GEPA captures a trace for `passthrough.predict` — without a traced predictor call, `make_reflective_dataset` finds "no valid predictions" and never proposes a mutation. It returns a placeholder response with `_closed_loop_task_id` + `_candidate_text` attached for the behavioral metric. GEPA discovers the target via `named_predictors()` → `"passthrough.predict"`. +- `prompt_proposer.py` — `PromptSectionProposer`, a sentinel-preserving GEPA `instruction_proposer` subclassing `BudgetAwareProposer` (inherits the char-budget infrastructure; see `budget_aware_proposer.py`). Runs the proposer LM, then passes the candidate through `extract_and_rebuild` so only the sentinel-delimited region survives. 
On a candidate that loses the sentinels it increments `sentinel_failures` and **re-raises** `SentinelParseError` rather than returning the parent unchanged — GEPA's reflective-mutation path skips the iteration instead of admitting a phantom identical-to-parent candidate into the selection pool. +- `prompt_judge.py` — + - `SaveCallJudge` — LLM-as-judge scoring an individual memory-save's content against `MEMORY_GUIDANCE`'s rules (durable, declarative, fact-focused; not task progress / PR numbers / completed-work logs). Unparseable judge output falls back to a neutral 0.5 (logged so it's distinguishable from a real mediocre score). + - `judge_save_calls` — the Layer-2 aggregate. Only judges `SAVE_ACTIONS = {add, replace}` (the real Hermes `memory` tool actions that carry a `content` payload; `remove` is not a save), caps judged calls at `MAX_JUDGED_CALLS_PER_TASK = 5` (excess score 0 each), and returns a vacuous 1.0 when there are no save calls or no judge/rubric is configured. + - `make_prompt_fitness_metric` — the GEPA 5-arg metric. Routes purely behaviorally: a prediction missing `_closed_loop_task_id` is degenerate and scores 0 with a diagnostic; otherwise `closed_loop_scorer(task_id, candidate_text)` runs one closed-loop trial. Appends a `[BUDGET]` feedback line. + - `make_memoizing_splice_scorer` — builds `closed_loop_scorer(task_id, candidate_text)` that splices **only when `candidate_text` changes** (consecutive tasks for one candidate reuse the live splice). Serialized under a `threading.Lock` because `dspy.Evaluate` is multi-threaded but `prompt_builder.py` is one shared mutable file — behavioral scoring is therefore effectively serial, an accepted v1 cost of splice-and-restore. Backup/restore is the caller's job (the guard wraps the whole run). 
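
The splice-only-on-change behavior of `make_memoizing_splice_scorer` can be illustrated with a small sketch. The callables `install` and `score_task` are hypothetical stand-ins for `install_candidate` and `score_task_id`, and the real signatures may differ.

```python
import threading
from typing import Callable, Optional


def make_memoizing_scorer(
    install: Callable[[str], None],
    score_task: Callable[[str], float],
) -> Callable[[str, str], float]:
    """Build scorer(task_id, candidate_text) that re-splices only on change."""
    lock = threading.Lock()
    installed: Optional[str] = None  # candidate text currently spliced in

    def scorer(task_id: str, candidate_text: str) -> float:
        nonlocal installed
        # One shared file on disk, so install + run must not interleave.
        with lock:
            if candidate_text != installed:
                install(candidate_text)
                installed = candidate_text
            return score_task(task_id)

    return scorer
```

Holding the lock across both the install and the agent run is what makes behavioral scoring effectively serial: a second thread cannot swap the spliced section while a task is still executing against it.
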
+ +### Shared validation-stack changes that enable the prompt path + +These let the prompt path reuse `ClosedLoopValidator` unchanged (see the validation section below for the base machinery): + +- `HermesPromptSectionInstaller` (in `artifact_installer.py`) — implements the `ArtifactInstaller` Protocol. `target_path` = `agent/prompt_builder.py`; `install(text_file)` reads the candidate body and calls `HermesPromptSource.write`, returning the post-install `sha256`; `verify_backup` = `verify_python_parses`. Constraint: the section must be a top-level string constant. +- `ClosedLoopValidator` gained an optional `layer2_judge_factory` (per-task — prompt-section judging needs the task's `expected_save_content` rubric + message, which a single global fn couldn't carry) plus a `layer2_threshold`. When unset, scoring is Layer 1 only and the tool-description path is unchanged. +- `report.py`'s `score_task` gained the compound Layer 2: when a `layer2_judge_fn` is supplied a task passes only if Layer 1 (trigger membership) passes **and** the judge scores `>= layer2_threshold`. Layer 1 short-circuits — the judge is never called (no LLM cost) on a task that already failed the trigger test, and `test_command` mode ignores Layer 2. The judge receives the subset of `run.tool_calls_with_args` whose name is `memory`. `Task` gained `expected_save_content`; `AgentRunResult` gained `tool_calls_with_args`. +- `hermes_runner.py` (shared change): reads agent sessions from the SQLite `state.db` (`parse_session_from_db`) since the current one-shot `hermes -z` is ephemeral and no longer writes `session_*.json`. A row whose `tool_calls` column won't parse as JSON aborts with an `error` result (the task **abstains**) rather than being silently read as "no tools." 
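
The compound verdict with its Layer-1 short-circuit can be sketched as below. `ToolCall` and `compound_pass` are hypothetical names, and the Layer-1 rule shown (every expected tool called, no forbidden tool called) is an assumption about the membership rule rather than a copy of `score_task`.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence


@dataclass
class ToolCall:
    """Hypothetical stand-in for one entry of tool_calls_with_args."""
    name: str
    args: dict = field(default_factory=dict)


def compound_pass(
    calls: Sequence[ToolCall],
    expected_tools: Sequence[str],
    forbidden_tools: Sequence[str],
    layer2_judge_fn: Optional[Callable[[Sequence[ToolCall]], float]] = None,
    layer2_threshold: float = 0.7,
) -> bool:
    names = {call.name for call in calls}
    # Layer 1: trigger membership (assumed rule, see lead-in).
    if not set(expected_tools) <= names or names & set(forbidden_tools):
        return False  # short-circuit: the judge is never invoked
    # No judge configured: Layer 1 alone decides.
    if layer2_judge_fn is None:
        return True
    # Layer 2: the judge only sees the memory calls.
    memory_calls = [call for call in calls if call.name == "memory"]
    return layer2_judge_fn(memory_calls) >= layer2_threshold
```

Returning before the judge is consulted is what keeps a task that already failed the trigger test from incurring any LLM cost.
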
+ ## evolution/validation/ — closed-loop validation against a real agent Drives an actual agent (`HermesAgentRunner` via `hermes -z`) through a small task suite with baseline and evolved artifacts, scores real tool-selection behavior, compares. Orthogonal to skills/tools/prompts/code — measures agent behavior, not artifact production. @@ -388,6 +433,6 @@ Drives an actual agent (`HermesAgentRunner` via `hermes -z`) through a small tas **During-evolution integration.** Beyond the standalone CLI, the same `ClosedLoopValidator` powers `evolution/core/closed_loop_feedback.py`'s `ClosedLoopFeedbackCache`. The cache writes the candidate description into a tmp manifest JSON, calls `validator.validate(ValidationInputs(...))` with it as `evolved_artifact`, and caches the returned `ValidationReport` by candidate text. The cache surfaces verdicts to the metric two ways: as a deterministic feedback block on the reflection path (`feedback` mode), or as per-task `TaskResult.passed` reads via `get_task_verdict(candidate, task_id)` for the behavioral-example branch (`trainset` mode). The validator itself doesn't know about the cache; it always sees a `ValidationInputs` with two artifacts and produces a `ValidationReport`. -## evolution/{prompts, code, monitor}/ — planned, empty +## evolution/{code, monitor}/ — planned, empty -These packages exist as empty stubs anchoring the planned tier-3/4/5 work. See `PLAN.md` for the design. +These packages exist as empty stubs anchoring the planned tier-4/5 work. See `PLAN.md` for the design. (`prompts/` is now implemented — see the phase-3 section above.) diff --git a/docs/data_models.md b/docs/data_models.md index c2455d4..6c0c23e 100644 --- a/docs/data_models.md +++ b/docs/data_models.md @@ -555,6 +555,79 @@ Written by `evolution/core/quality_gate.py::append_cl_decision_fields` when the | `band_trigger_score` | `dict` | Pre-flight scores that decided whether CL-primary fired. 
Keys: `holdout` (`float \| None`), `closed_loop` (`float \| None`). | | `validator_agent_model` | `str` | The LiteLLM model id used for the closed-loop validator agent. Recorded so historical decisions stay analysable if the default changes. | +### Prompt-section additions (`artifact_type == "prompt_section"`) + +Runs of `evolution.prompts.evolve_prompt_section` (Phase 3) write the same `schema_version` "5" envelope but a **deliberately different field set** from the skill/tool variant, because the deploy gate is a closed-loop pass-rate / win-loss decision, **not** a paired-bootstrap one. There is no synthetic classification signal for a system-prompt section — every candidate is scored behaviorally by a real `hermes -z` against a curated suite — so the bootstrap substrate doesn't apply. + +```json +{ + "schema_version": "5", + "artifact_type": "prompt_section", + "target_section": "MEMORY_GUIDANCE", + "decision": "deploy", // "deploy" | "reject" | "denied" | "dry_run" | "aborted" + "decision_signal": "closed_loop", // always "closed_loop" on this path + "baseline_chars": 1840, + "evolved_chars": 2104, + "growth_pct": 0.143, // (evolved_chars - baseline_chars) / baseline_chars + "closed_loop": { + "decision": "pass", // "pass" | "regression" (ValidationReport.decision) + "decision_reasons": ["pass_rate 0.92 >= baseline 0.75", "n_wins 2 >= 2*n_losses 0"], + "baseline_pass_rate": 0.75, + "evolved_pass_rate": 0.92, + "n_wins": 2, + "n_losses": 0, + "n_ties": 10 + }, + "sentinel_failures": 1, // reflection-LM outputs the proposer rejected for breaking sentinel preservation + "elapsed_seconds": 412.6, + "cost": { /* same shape as cost_summary: total_usd + by_model */ }, + "run_inputs": { /* seed, iterations, model versions, suite path/sha, validator_agent_model, ... 
*/ }, + "pr_created": { "status": "skipped", "reason": "prompt_section_pr_unsupported", "branch": null, "commit_sha": null, "url": null } +} +``` + +**Fields this variant carries** (and the tool/skill variant does not, or differs on): + +| Field | Type | Notes | +|---|---|---| +| `artifact_type` | `"prompt_section"` | Disjoint from `"skill"` / `"tool_description"`. | +| `target_section` | `str` | The `prompt_builder.py` constant whose text was evolved (e.g. `MEMORY_GUIDANCE`). | +| `decision` | `"deploy" \| "reject" \| "denied" \| "dry_run" \| "aborted"` | `"denied"` lands on a saturation pre-flight default-deny; `"dry_run"` when the run was asked to evaluate without splicing; `"aborted"` on cost-ceiling / interrupt. | +| `decision_signal` | `"closed_loop"` | Always `"closed_loop"` here — the synthetic value never appears on this path. | +| `baseline_chars` / `evolved_chars` / `growth_pct` | int / int / float | Size telemetry; growth informs the closed-loop required-gain threshold but is not gated on a bootstrap. | +| `closed_loop` | `dict` | `{decision, decision_reasons, baseline_pass_rate, evolved_pass_rate, n_wins, n_losses, n_ties}` — the deploy gate's primary evidence (sourced from `ValidationReport` over the behavioral suite). | +| `sentinel_failures` | `int` | Count of reflection-LM proposals rejected for failing sentinel preservation (same meaning as the tool path). | +| `elapsed_seconds` / `cost` | float / dict | Wall-clock + per-model cost ledger. | +| `run_inputs` | `dict` | Reproduction inputs (seed, iterations, models, suite path + sha, `validator_agent_model`). | +| `pr_created` | `dict` | Shape-stable with the skill/tool path, but the prompt-section path currently emits a `status: "skipped"` block (PR automation for in-place `prompt_builder.py` splices is not wired). 
| + +**Fields the prompt-section variant deliberately OMITS.** A reader or calibration script must not assume these are present — they exist only on the skill/tool (paired-bootstrap) path: + +- `bootstrap` — no per-example bootstrap CI; the gate is win-loss, not a resampled mean. +- `avg_baseline` / `avg_evolved` — no synthetic holdout mean. The analogous numbers live inside `closed_loop` as `baseline_pass_rate` / `evolved_pass_rate`. +- `dataset` — there is no synthetic eval dataset and no `dataset` block with per-source/per-category counts; the behavioral suite is the JSONL passed via `--tasks`. `run_inputs` records the run config (models, seed, iterations, holdout-ratio, `eval_source: "closed_loop"`), not the suite path or sha. +- `knee_point` — Pareto knee-point selection over a synthetic valset doesn't apply; candidates are chosen on behavioral score. + +#### Saturation-denied variant (prompt section) + +When the saturation pre-flight default-denies (non-healthy band, non-interactive context, no `--force-saturation-check`), the prompt-section gate writes `decision: "denied"` and carries a `saturation_band` field naming the band that triggered the denial: + +```json +{ + "schema_version": "5", + "artifact_type": "prompt_section", + "target_section": "MEMORY_GUIDANCE", + "decision": "denied", + "decision_signal": "closed_loop", + "saturation_band": "no_headroom", // "healthy" never lands here; one of no_headroom | weak_signal | uniform_failure + "baseline_chars": 1840, + "run_inputs": { /* ... */ }, + "pr_created": { "status": "skipped", "reason": "prompt_section_pr_unsupported", "branch": null, "commit_sha": null, "url": null } +} +``` + +`saturation_band` appears only on the `"denied"` decision (it records why the run never started); it is absent on `deploy` / `reject` / `dry_run`. + ## metrics.json (deploy-only summary) Written to `output///metrics.json` only on deploy. 
Top-level summary for quick scanning: diff --git a/docs/index.md b/docs/index.md index 1d810c8..dfe14e0 100644 --- a/docs/index.md +++ b/docs/index.md @@ -15,6 +15,7 @@ The codebase is mid-sized (~9K LOC of source + 61 test files / ~1166 tests) and | **What this project is** | `codebase_info.md` → `architecture.md` → repo-root `README.md` | | **How a skill run works end-to-end** | `workflows.md` (Workflow 1) → `architecture.md` (top-level flow) | | **How a tool-description run works end-to-end** | `workflows.md` (Workflow 9) → `components.md` (`evolve_tool.py`) | +| **How a prompt-section run works end-to-end** | `workflows.md` (Workflow 12) → `components.md` (`evolve_prompt_section.py`) | | **What flag does X / how to run the CLI** | `interfaces.md` (CLI section) | | **Why the deploy gate rejected a run** | `data_models.md` (gate_decision.json) → `components.md` (`constraints.py`) | | **What's in `gate_decision.json` / `metrics.json`** | `data_models.md` (full schema with examples) | @@ -53,7 +54,7 @@ The codebase is mid-sized (~9K LOC of source + 61 test files / ~1166 tests) and | [`components.md`](components.md) | Per-module reference: what each owns, public surface, load-bearing implementation notes | | [`interfaces.md`](interfaces.md) | CLIs (skill, tool, closed-loop, sessiondb importer), Python API, SkillSource + ToolSource Protocols, output artifacts, DSPy + litellm integration, test surfaces, env vars | | [`data_models.md`](data_models.md) | All dataclasses, on-disk formats, full `gate_decision.json` schema with worked examples, `ValidationReport` schema | -| [`workflows.md`](workflows.md) | Step-by-step workflows with mermaid sequence diagrams: skill deploy path, reject paths, GEPA→MIPROv2 fallback, sessiondb mining, tool evolution, closed-loop validation, closed-loop signal during evolution | +| [`workflows.md`](workflows.md) | Step-by-step workflows with mermaid sequence diagrams: skill deploy path, reject paths, GEPA→MIPROv2 fallback, sessiondb mining, 
tool evolution, closed-loop validation, closed-loop signal during evolution, prompt-section evolution | | [`dependencies.md`](dependencies.md) | Each external package — what it's used for, why it's pinned, what we don't depend on | | [`framework_advantages.md`](framework_advantages.md) | User-facing explainer of how this framework's selection layer, deploy gate, proposer, and composite fitness differ from raw DSPy + GEPA — and when raw GEPA is the right choice | diff --git a/docs/workflows.md b/docs/workflows.md index eb80148..e686908 100644 --- a/docs/workflows.md +++ b/docs/workflows.md @@ -545,6 +545,158 @@ When your daily-driver Hermes model is capable enough to solve every textbook bu Manual smoke harness: `tests/manual/skill_closed_loop_smoke.py` (supports `--suite {basic,advanced}`, `--agent-model MODEL`, `--task-timeout-seconds N`). +## Workflow 12: Evolve a prompt section (deploy path) + +The prompt-section analog of Workflow 9 (tool descriptions), but **purely behavioral** end to end. There is no synthetic judge dataset and no paired-bootstrap gate: every candidate is spliced into the live `prompt_builder.py` and scored by a real `hermes -z` subprocess, and the deploy gate is a `ClosedLoopValidator` run. Three structural contrasts with the tool path: + +- **Integration is in-place splice-and-restore**, not an MCP manifest rewrite or a copied skill directory. The target is a single named string constant inside the user's `prompt_builder.py`; the harness backs it up byte-for-byte and restores it on exit. +- **The deploy gate is closed-loop pass-rate / win-loss**, not a paired-bootstrap confidence interval. Decision = pass-rate no-regression + `n_wins >= 2 * n_losses` (the `ClosedLoopValidator.decide` rule), all behavioral. +- **PR automation is deferred.** `--create-pr` is recorded as `skipped`; deploy means `--apply` writes the evolved section into `prompt_builder.py` in place, and the user opens a PR by hand. 
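
The splice-and-restore contrast in the first bullet can be sketched as a context manager. This is a simplified illustration under stated assumptions: `splice_guard_sketch` and the `<name>.cl_backup` sibling naming are hypothetical, and the exclusive `flock` and checksum-drift detection that the real guard performs are omitted to keep the example portable.

```python
import contextlib
import os
import tempfile
from pathlib import Path
from typing import Iterator


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write via a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


@contextlib.contextmanager
def splice_guard_sketch(target: Path) -> Iterator[Path]:
    """Hypothetical sketch: back up the file, let the caller mutate it, restore it."""
    backup = target.with_name(target.name + ".cl_backup")  # assumed naming
    if backup.exists():
        # A leftover backup means an earlier run did not restore cleanly.
        raise RuntimeError(f"stale backup present, refusing to start: {backup}")
    atomic_write_bytes(backup, target.read_bytes())
    try:
        yield target
    finally:
        # Restore the original bytes even if the body raised.
        atomic_write_bytes(target, backup.read_bytes())
        backup.unlink()
```

Restoring in a `finally` block is what makes the original bytes come back whether the run completes, fails, or is interrupted. One caveat of this sketch: `mkstemp` creates the temp file with restrictive permissions, so a production guard would also need to preserve the target's original mode.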
+ +```bash +python -m evolution.prompts.evolve_prompt_section \ + --section MEMORY_GUIDANCE \ + --hermes-repo ~/src/NousResearch/hermes-agent \ + --tasks evolution/validation/suites/memory_guidance.jsonl \ + --iterations 10 \ + --apply +``` + +### Phase A — Setup: resolve baseline, split, build the behavioral harness + +```mermaid +sequenceDiagram + autonumber + participant CLI as evolve_prompt_section + participant Src as HermesPromptSource + participant Suite as TaskSuite + participant Judge as SaveCallJudge + participant Inst as HermesPromptSectionInstaller + participant Run as HermesAgentRunner + participant V as ClosedLoopValidator + + CLI->>Src: read(section_name) — validate it exists / is a string constant + alt --baseline-override-file + CLI->>CLI: baseline_text = override_file.read_text() + else + CLI->>Src: baseline_text = read(section_name) + end + CLI->>Suite: TaskSuite.from_jsonl(tasks) — reject < 2 tasks + CLI->>CLI: _split_train_holdout(seed) — ≥1 task each side + CLI->>Judge: SaveCallJudge(config) → layer2_factory(task) + CLI->>Inst: HermesPromptSectionInstaller(repo, section) + CLI->>Run: HermesAgentRunner(timeout, agent_model?) + CLI->>V: ClosedLoopValidator(installer, runner, layer2_judge_factory, layer2_threshold) +``` + +The baseline is the **live section text** unless `--baseline-override-file` points evolution at arbitrary text — e.g. a deliberately-weakened baseline to manufacture headroom, or a regression-injection ablation. The override only changes where evolution *starts*; the guard still backs up and restores the real file, and `--apply` writes the evolved text back into the live section. The suite floor is 2 tasks so the seeded split yields a non-empty GEPA trainset **and** a non-empty deploy-gate holdout. 
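
The two-task floor and the seeded split can be sketched like this. It is an illustrative stand-in, not the real `_split_train_holdout`: the function name, the `holdout_ratio` default, and the clamping strategy are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_holdout_sketch(
    tasks: Sequence[T], seed: int, holdout_ratio: float = 0.5
) -> Tuple[List[T], List[T]]:
    """Hypothetical sketch: deterministic split with at least one task per side."""
    if len(tasks) < 2:
        raise ValueError("need at least 2 tasks to form non-empty train and holdout sets")
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)  # local RNG: same seed, same split
    n_holdout = round(len(shuffled) * holdout_ratio)
    # Clamp so neither side can end up empty, whatever the ratio.
    n_holdout = min(max(n_holdout, 1), len(shuffled) - 1)
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state, so a rerun with the same seed scores the same holdout tasks.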
+ +### Phase B — Configure the global LM, then enter the guard + +```mermaid +sequenceDiagram + autonumber + participant CLI as evolve_prompt_section + participant Scorer as memoizing_splice_scorer + participant Metric as prompt_fitness_metric + participant LM as eval_lm + participant DSPy as dspy.configure + + CLI->>Scorer: make_memoizing_splice_scorer(install_fn=source.write, score_fn=run_one_task, lock) + CLI->>Metric: make_prompt_fitness_metric(baseline_text, max_growth, closed_loop_scorer=scorer) + CLI->>LM: instantiate eval_lm (role=eval, temp=0) + CLI->>DSPy: dspy.configure(lm=eval_lm, callbacks=[LMTimingCallback()]) + Note over CLI,DSPy: global LM set so GEPA worker threads can run PromptModule's
passthrough predictor — the pre-flight's dspy.context doesn't reach them +``` + +The `closed_loop_scorer` is the spine of behavioral scoring: `score(task_id, candidate_text)` splices the candidate into the live `prompt_builder.py` **only when it changes** (consecutive tasks for the same candidate reuse the live splice), runs the task via `hermes -z`, and reads the session back from the sandbox `state.db`. The splice+run is serialized under one `threading.Lock` because `dspy.Evaluate` scores with a thread pool but the spliced file is a single shared mutable resource — behavioral scoring is therefore effectively serial, an accepted v1 cost. The explicit `dspy.configure` is load-bearing: `dspy.context` inside the saturation pre-flight does **not** propagate into GEPA's worker threads, so without the global LM the passthrough predictor raises "No LM is loaded" → no trajectories → no proposal. + +### Phase C — Inside the guard: saturation pre-flight, then GEPA + +```mermaid +sequenceDiagram + autonumber + participant CLI as evolve_prompt_section + participant Guard as _prompt_builder_guard + participant FS as live prompt_builder.py + participant Sat as saturation_preflight + participant GEPA as dspy.GEPA + participant PM as PromptModule + participant Prop as PromptSectionProposer + participant Scorer as splice scorer + participant H as hermes -z + state.db + + CLI->>Guard: enter(installer.target_path) + Guard->>FS: refuse if stale .cl_backup; flock parent dir (LOCK_EX|NB) + Guard->>FS: atomic_write_bytes(.cl_backup, target.read_bytes()) + opt not --skip-saturation-check + CLI->>Sat: saturation_preflight(baseline_module, holdout, metric, eval_lm, baseline_text) + Sat->>Scorer: behavioral score of baseline on each holdout task + Sat-->>CLI: SaturationReport(band, ...) 
+ alt band != healthy + alt --force-saturation-check + Note over CLI: proceed regardless + else non-interactive + CLI->>FS: write gate_decision.json (decision=denied, reason=saturated_baseline) + Note over CLI: return — GEPA never runs (default-deny) + else interactive + CLI->>CLI: prompt "Continue anyway? [y/N]" + end + end + end + CLI->>GEPA: compile(PromptModule(baseline), trainset, valset, instruction_proposer=PromptSectionProposer) + loop per iteration + GEPA->>PM: forward(task, closed_loop_task_id) — candidate in sentinel region of predictor instructions + PM-->>GEPA: Prediction(_candidate_text, _closed_loop_task_id) + GEPA->>Scorer: metric → closed_loop_scorer(task_id, candidate_text) + Scorer->>FS: splice candidate into live section (only if changed) + Scorer->>H: run task; read session from sandbox state.db + H-->>Scorer: tool_calls_with_args + final text + Scorer->>Scorer: compound verdict = Layer 1 (memory fired?) + Layer 2 (judge on memory add/replace content) + Scorer-->>GEPA: score ∈ {0.0, 1.0} + GEPA->>Prop: reflect on failures → sentinel-preserving candidate + end + GEPA-->>CLI: optimized module with detailed_results + CLI->>Guard: exit → atomic_write_bytes(target, .cl_backup); unlink backup; release flock +``` + +Everything that mutates the file lives **inside** the guard, which holds an exclusive `flock` (the same lock name the deploy-gate `ClosedLoopValidator` uses — sequenced before it, never nested) and restores the original bytes on exit. The saturation pre-flight scores the baseline behaviorally on the holdout; a non-`healthy` band (e.g. `no_headroom` on an already-tuned section) **default-denies in non-interactive contexts** unless `--force-saturation-check`, writing a `decision="denied"` gate before GEPA spends a cent. 
The compound per-task verdict is two layers: **Layer 1** is trigger membership (did the `memory` tool fire, via `expected_tools` / `forbidden_tools`), **Layer 2** is the `SaveCallJudge` scoring `memory(action=add|replace)` content against the task's `expected_save_content` rubric (`remove` is not a save; a passing Layer 1 with no save action scores a vacuous 1.0 on Layer 2). GEPA mutates only the sentinel-delimited region of the passthrough predictor's instructions; the `PromptSectionProposer` rejects any reflection-LM output that fails sentinel preservation. + +### Phase D — Deploy gate (closed-loop on the holdout), persist, apply + +```mermaid +sequenceDiagram + autonumber + participant CLI as evolve_prompt_section + participant Sel as candidate selection + participant V as ClosedLoopValidator + participant Inst as HermesPromptSectionInstaller + participant FS as live prompt_builder.py + participant H as hermes -z + participant Src as HermesPromptSource + + Note over CLI: guard already exited — file restored to baseline + CLI->>Sel: evolved_text = section_from_candidate(best_idx) # GEPA val-argmax + CLI->>FS: write baseline_section.txt + evolved_section.txt + CLI->>V: validate(ValidationInputs(section, holdout_suite, baseline_file, evolved_file)) + Note over V: own backup/restore + flock — independent of the Phase C guard + loop baseline phase, then evolved phase + V->>Inst: install(section_file) — splice into live prompt_builder.py + loop each holdout task + V->>H: run task; score Layer 1 + Layer 2 via layer2_judge_factory + end + end + V-->>CLI: ValidationReport(baseline_pass_rate, evolved_pass_rate, n_wins/n_losses, decision) + CLI->>FS: write gate_decision.json (artifact_type="prompt_section", decision=deploy|reject) + alt decision == pass AND --apply + CLI->>Src: write(section_name, evolved_text) — live section updated in place + end +``` + +The selected candidate is GEPA's val-argmax (`detailed_results.best_idx`) — there's no knee-point parsimony pass on 
the prompt-section path. The deploy gate is a fresh `ClosedLoopValidator.validate` over the **holdout** suite, with its own backup/restore + `flock` (it runs after the Phase C guard has already exited and restored the file, so the two never nest). Its decision is closed-loop only: pass-rate no-regression plus `n_wins >= 2 * n_losses`. The gate decision is written with `artifact_type="prompt_section"`, `target_section`, `baseline_chars` / `evolved_chars` / `growth_pct`, a `closed_loop` block (both pass-rates + win/loss/tie counts), and `sentinel_failures`. `--create-pr` records a `skipped` PR block (deferred for sections); `--apply` is the only way to ship, writing the evolved text into the live section. + +**Empirical anchors.** The real `MEMORY_GUIDANCE` section saturates — it scored 1.0 across the holdout (`no_headroom` band) and the harness correctly default-denied a non-interactive run before GEPA started. To exercise the full deploy path, an adversarially-weakened baseline (via `--baseline-override-file`) evolved `0.67 → 1.00` pass-rate with 2 wins / 0 losses on the holdout, clearing the closed-loop gate and deploying. The saturating-real-section result is the expected, correct outcome, not a bug: there is no headroom to evolve into when the section already passes every behavioral task. 
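
The closed-loop decision rule (pass-rate no-regression plus `n_wins >= 2 * n_losses`) can be sketched from paired per-task results. This takes the rule as stated in this workflow literally; the function names are hypothetical, and the win/loss/tie definitions (a win is a task the evolved section passes and the baseline fails) are an assumption about how `ClosedLoopValidator` counts them.

```python
from typing import Dict, List, Tuple


def count_outcomes_sketch(
    baseline: Dict[str, bool], evolved: Dict[str, bool]
) -> Tuple[int, int, int]:
    """Return (wins, losses, ties) over tasks keyed by task id."""
    wins = losses = ties = 0
    for task_id, baseline_passed in baseline.items():
        evolved_passed = evolved[task_id]
        if evolved_passed and not baseline_passed:
            wins += 1
        elif baseline_passed and not evolved_passed:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties


def decide_sketch(
    baseline: Dict[str, bool], evolved: Dict[str, bool]
) -> Tuple[str, List[str]]:
    """Hypothetical sketch of the gate: returns ("pass" | "regression", reasons)."""
    if not baseline:
        raise ValueError("need at least one holdout task")
    n_tasks = len(baseline)
    baseline_rate = sum(baseline.values()) / n_tasks
    evolved_rate = sum(evolved[task_id] for task_id in baseline) / n_tasks
    wins, losses, _ = count_outcomes_sketch(baseline, evolved)

    failures: List[str] = []
    if evolved_rate < baseline_rate:
        failures.append(f"pass_rate {evolved_rate:.2f} < baseline {baseline_rate:.2f}")
    if wins < 2 * losses:
        failures.append(f"n_wins {wins} < 2*n_losses {losses}")
    return ("pass" if not failures else "regression"), failures
```

As a made-up example, a six-task holdout where the baseline passes four tasks and the evolved section passes all six yields 2 wins, 0 losses, and 4 ties, which clears both conditions. Swapping two baseline passes for two new passes leaves the pass-rate flat but fails the win-loss condition, which is why the rule checks both.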
+ ## Failure-mode summary | Trigger | Outcome | Where to look | @@ -565,3 +717,8 @@ Manual smoke harness: `tests/manual/skill_closed_loop_smoke.py` (supports `--sui | Closed-loop validator concurrent run | `ConcurrentRunError` (`fcntl.flock` non-blocking acquire fails) | console only | | Closed-loop validator drift between tasks | `ChecksumDriftError` after the offending task; phase aborts, restore still runs | run.log + raised error | | Closed-loop cache validator failure during evolution | `WARNING` logged, cache returns `None`, GEPA continues without the verdict — never aborts the run | run.log | +| Prompt-section suite < 2 tasks | `ValueError` (can't split into non-empty train + holdout) | console only | +| Prompt-section stale `.cl_backup` on guard entry | `RuntimeError` naming the backup file; refuses to start | console only | +| Prompt-section saturated baseline, non-interactive | `decision="denied"` `gate_decision.json`; GEPA never runs (override with `--force-saturation-check`) | `gate_decision.json` (`saturation_band`) | +| Prompt-section closed-loop gate rejects | `decision="reject"` `reason="closed_loop_gate"`; section not applied | `gate_decision.json` (`closed_loop` block) | +| Prompt-section `--create-pr` | recorded as `skipped` (PR automation deferred); use `--apply` + manual PR | `gate_decision.json` (`pr_created` block) | From e254e604ed297dee29d1b808a3297cb2e0cf9a12 Mon Sep 17 00:00:00 2001 From: Justin Ramos Date: Tue, 2 Jun 2026 09:32:10 -0600 Subject: [PATCH 4/4] docs(reports): Phase 3 validation report MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a prompt-section branch to generate_report.py (behavioral-only runs self-source from gate_decision.json — no metrics.json/run.log; the _experiment and _results renderers lay out pass-rate/win-loss tables instead of bootstrap/knee/synthetic), author reports/phase3_prose.yaml, render reports/phase3_validation_report.pdf from the adversarial-baseline 
headline run (67%→100% holdout, 2W/0L, section shrank 15.2%), and link it from the README phase table. The skill/tool report path is unchanged (additive artifact_type branch). --- README.md | 2 +- generate_report.py | 170 ++++++++++++++++--- reports/phase3_prose.yaml | 233 +++++++++++++++++++++++++++ reports/phase3_validation_report.pdf | Bin 0 -> 24431 bytes 4 files changed, 384 insertions(+), 21 deletions(-) create mode 100644 reports/phase3_prose.yaml create mode 100644 reports/phase3_validation_report.pdf diff --git a/README.md b/README.md index 60386b8..5fae14d 100644 --- a/README.md +++ b/README.md @@ -347,7 +347,7 @@ Cost: each task is one `hermes -z` run (~$0.05–$0.50). The bundled `patch.json |-------|--------|--------|--------| | **Phase 1** | Skill files (SKILL.md) | DSPy + GEPA | ✅ [Validated](reports/phase1_validation_report.pdf) | | **Phase 2** | Tool descriptions + dual-signal deploy gate | DSPy + GEPA | ✅ [Validated](reports/phase2_validation_report.pdf) | -| **Phase 3** | System prompt sections | DSPy + GEPA | ✅ Complete | +| **Phase 3** | System prompt sections | DSPy + GEPA | ✅ [Validated](reports/phase3_validation_report.pdf) | | **Phase 4** | Tool implementation code | Darwinian Evolver | 🔲 Planned | | **Phase 5** | Continuous improvement loop | Automated pipeline | 🔲 Planned | diff --git a/generate_report.py b/generate_report.py index 3008116..b7a1ec7 100644 --- a/generate_report.py +++ b/generate_report.py @@ -45,13 +45,81 @@ DEFAULT_LOGO = REPO_ROOT / "assets" / "dna.png" +def _extract_prompt_section_data(gate: dict, run_dir: Path) -> dict[str, Any]: + """Build the render context for a Phase 3 prompt-section run. + + The prompt-section path is behavioral-only — its gate_decision carries a + ``closed_loop`` pass-rate / win-loss block instead of the skill/tool + bootstrap-CI + synthetic-dataset + knee-point fields, and self-sources + cost/timing/call-count (no metrics.json needed). 
The ``_experiment`` and + ``_results`` renderers branch on ``artifact_type`` to lay out the matching + tables; every other section is prose-driven via the keys returned here. + """ + cl = gate.get("closed_loop", {}) + cost = gate.get("cost", {}) + resolved = (gate.get("run_inputs", {}) or {}).get("resolved_lms", {}) + + n_wins = int(cl.get("n_wins", 0)) + n_losses = int(cl.get("n_losses", 0)) + n_ties = int(cl.get("n_ties", 0)) + cl_total = n_wins + n_losses + n_ties + baseline_rate = float(cl.get("baseline_pass_rate", 0.0)) + evolved_rate = float(cl.get("evolved_pass_rate", 0.0)) + cl_baseline_pass = round(baseline_rate * cl_total) + cl_evolved_pass = round(evolved_rate * cl_total) + elapsed = int(float(gate.get("elapsed_seconds", 0))) + lm_calls = sum(int(m.get("calls", 0)) for m in (cost.get("by_model") or {}).values()) + decision = gate.get("decision", "") + + def _model(role: str) -> str: + return (resolved.get(role) or {}).get("model", "—") + + return { + "artifact_type": "prompt_section", + "skill_name": gate.get("target_section", run_dir.parent.name), + "section_name": gate.get("target_section", ""), + "baseline_chars": int(gate.get("baseline_chars", 0)), + "evolved_chars": int(gate.get("evolved_chars", 0)), + "growth_pct": float(gate.get("growth_pct", 0.0)), + "growth_abs_pct": abs(float(gate.get("growth_pct", 0.0))), + "decision": decision, + "decision_upper": "DEPLOYED" if decision == "deploy" else "REJECTED", + "decision_signal": gate.get("decision_signal", "closed_loop"), + "baseline_pass_rate": baseline_rate, + "evolved_pass_rate": evolved_rate, + "baseline_pass_pct": baseline_rate * 100, + "evolved_pass_pct": evolved_rate * 100, + "cl_baseline_pass": cl_baseline_pass, + "cl_evolved_pass": cl_evolved_pass, + "cl_total_tasks": cl_total, + "cl_tasks_gained": cl_evolved_pass - cl_baseline_pass, + "n_wins": n_wins, + "n_losses": n_losses, + "n_ties": n_ties, + "elapsed_seconds": elapsed, + "elapsed_minutes": elapsed // 60, + "cost_total_usd": 
float(cost.get("total_usd", 0.0)), + "lm_calls_metrics": lm_calls, + "optimizer_lm": _model("optimizer"), + "reflection_lm": _model("reflection"), + "eval_lm": _model("eval"), + "saturation_band": gate.get("saturation_band", ""), + "sentinel_failures": int(gate.get("sentinel_failures", 0)), + "decision_reasons": "; ".join(cl.get("decision_reasons", [])), + } + + def _extract_run_data(run_dir: Path) -> dict[str, Any]: """Pull all numbers the renderer needs from a run dir. Reads gate_decision.json (always present) + metrics.json (deploy only) + - run.log (LM call counts grep'd from timing-callback lines). + run.log (LM call counts grep'd from timing-callback lines). Prompt-section + (Phase 3) runs are behavioral-only and self-source from gate_decision.json + alone — see ``_extract_prompt_section_data``. """ gate = json.loads((run_dir / "gate_decision.json").read_text()) + if gate.get("artifact_type") == "prompt_section": + return _extract_prompt_section_data(gate, run_dir) metrics_path = run_dir / "metrics.json" metrics = json.loads(metrics_path.read_text()) if metrics_path.is_file() else {} @@ -442,7 +510,7 @@ def _approach(prose: dict, ctx: dict, styles) -> list: ap = prose["approach"] engines = ap["engines"] flow = [ - Paragraph("Approach: Evolutionary Skill Optimization", styles['SectionHead']), + Paragraph(ap.get("section_title", "Approach: Evolutionary Optimization"), styles['SectionHead']), Paragraph("Three Optimization Engines", styles['SubSection']), _highlight_table( header=engines["header"], @@ -463,9 +531,57 @@ def _approach(prose: dict, ctx: dict, styles) -> list: return flow +def _experiment_prompt_section(exp: dict, overrides: dict, ctx: dict, styles) -> list: + """Phase 3 experiment section: behavioral config (no synthetic eval set, + no knee-point), and the suite is described in prose rather than via a + train.jsonl examples table (prompt-section runs don't write one).""" + config_rows = [ + ['Target Section', _fmt(overrides["target_section_label"], 
ctx)], + ['Baseline Size', f'{ctx["baseline_chars"]:,} characters'], + ['Optimizer LM', _fmt(overrides["optimizer_lm"], ctx)], + ['Reflection LM (GEPA)', _fmt(overrides["reflection_lm"], ctx)], + ['Content-Judge LM (Layer 2)', _fmt(overrides["eval_judge_lm"], ctx)], + ['Agent (hermes -z)', _fmt(overrides["agent_lm"], ctx)], + ['Behavioral Suite', f'{ctx["cl_total_tasks"]} holdout tasks (real hermes -z, scored end-to-end)'], + ['Total Optimization Time', + f'{ctx["elapsed_seconds"]:,} seconds (~{ctx["elapsed_minutes"]} minutes)'], + ['Total LM Calls (in-process)', f'{ctx["lm_calls_metrics"]:,}'], + ['Total Cost (USD, in-process)', f'${ctx["cost_total_usd"]:.2f}'], + ['Deploy Gate', _fmt(overrides["quality_gate_label"], ctx)], + ['Saturation Pre-flight', _fmt(overrides["saturation_label"], ctx)], + ] + config_data = [[_wrap_cell(c, styles['TableHeaderCell']) for c in ['Parameter', 'Value']]] + config_data += [[_wrap_cell(c, styles['TableCell']) for c in row] for row in config_rows] + config_table = Table(config_data, colWidths=[2.0 * inch, 4.0 * inch]) + config_table.setStyle(TableStyle([ + ('BACKGROUND', (0, 0), (-1, 0), HexColor('#1a1a2e')), + ('GRID', (0, 0), (-1, -1), 0.5, HexColor('#cccccc')), + ('TOPPADDING', (0, 0), (-1, -1), 5), + ('BOTTOMPADDING', (0, 0), (-1, -1), 5), + ('LEFTPADDING', (0, 0), (-1, -1), 8), + ])) + return [ + Paragraph(exp.get("section_title", "Experiment"), styles['SectionHead']), + Paragraph("Configuration", styles['SubSection']), + config_table, + Paragraph("Evaluation Suite", styles['SubSection']), + Paragraph(_fmt(exp["dataset_intro"], ctx), styles['BodyJust']), + Paragraph("Fitness Function", styles['SubSection']), + Paragraph(_fmt(exp["fitness_intro"], ctx), styles['BodyJust']), + Paragraph( + f"{exp['fitness_formula']}", + ParagraphStyle('Formula', parent=styles['Normal'], alignment=TA_CENTER, + spaceBefore=8, spaceAfter=8, fontSize=10), + ), + Paragraph(_fmt(exp["fitness_closing"], ctx), styles['BodyJust']), + ] + + def 
_experiment(prose: dict, ctx: dict, styles, examples: list[tuple[str, str]]) -> list: exp = prose["experiment"] overrides = exp["config_overrides"] + if ctx.get("artifact_type") == "prompt_section": + return _experiment_prompt_section(exp, overrides, ctx, styles) # Phase 1 runs counted gpt-4.1-mini + gpt-5-mini explicitly via run.log grep; # Phase 2 runs use a single optimizer LM tier (e.g., gpt-5.4-mini), so fall @@ -571,24 +687,38 @@ def _results(prose: dict, ctx: dict, styles) -> list: accent_bg = HexColor('#fff8e1') accent_fg = HexColor('#5d4037') - results_rows = [ - ['Metric', 'Baseline', 'Evolved (knee-point pick)', 'Δ'], - ['Body size (chars)', f'{ctx["baseline_chars"]:,}', f'{ctx["evolved_chars"]:,}', f'{ctx["growth_pct"]:+.1%}'], - [f'Avg holdout score (n={ctx["n_holdout"]})', - f'{ctx["avg_baseline"]:.3f}', f'{ctx["avg_evolved"]:.3f}', f'{ctx["improvement"]:+.3f}'], - ['Bootstrap mean diff', '—', f'{ctx["bootstrap_mean"]:+.3f}', '—'], - ['Bootstrap 90% CI lower', '—', f'{ctx["bootstrap_lower"]:+.3f}', '—'], - ['Bootstrap 90% CI upper', '—', f'{ctx["bootstrap_upper"]:+.3f}', '—'], - ] - # Phase 2: surface the closed-loop behavioral signal when the v5 schema - # exposed it (absent on synthetic-only runs). - if ctx.get("cl_total_tasks"): - results_rows.append([ - f'Closed-loop tasks (n={ctx["cl_total_tasks"]})', - f'{ctx["cl_baseline_pass"]}/{ctx["cl_total_tasks"]}', - f'{ctx["cl_evolved_pass"]}/{ctx["cl_total_tasks"]}', - f'+{ctx["cl_tasks_gained"]} (req ≥{ctx["cl_required_gain"]})', - ]) + if ctx.get("artifact_type") == "prompt_section": + # Behavioral-only: pass-rate + win/loss, no bootstrap/synthetic rows. 
+ delta_rate = ctx["evolved_pass_rate"] - ctx["baseline_pass_rate"] + results_rows = [ + ['Metric', 'Baseline', 'Evolved', 'Δ'], + ['Section size (chars)', f'{ctx["baseline_chars"]:,}', f'{ctx["evolved_chars"]:,}', f'{ctx["growth_pct"]:+.1%}'], + [f'Holdout pass-rate (n={ctx["cl_total_tasks"]})', + f'{ctx["baseline_pass_rate"]:.0%}', f'{ctx["evolved_pass_rate"]:.0%}', f'{delta_rate:+.0%}'], + [f'Tasks passing (n={ctx["cl_total_tasks"]})', + f'{ctx["cl_baseline_pass"]}/{ctx["cl_total_tasks"]}', + f'{ctx["cl_evolved_pass"]}/{ctx["cl_total_tasks"]}', + f'+{ctx["n_wins"]}W / {ctx["n_losses"]}L'], + ] + else: + results_rows = [ + ['Metric', 'Baseline', 'Evolved (knee-point pick)', 'Δ'], + ['Body size (chars)', f'{ctx["baseline_chars"]:,}', f'{ctx["evolved_chars"]:,}', f'{ctx["growth_pct"]:+.1%}'], + [f'Avg holdout score (n={ctx["n_holdout"]})', + f'{ctx["avg_baseline"]:.3f}', f'{ctx["avg_evolved"]:.3f}', f'{ctx["improvement"]:+.3f}'], + ['Bootstrap mean diff', '—', f'{ctx["bootstrap_mean"]:+.3f}', '—'], + ['Bootstrap 90% CI lower', '—', f'{ctx["bootstrap_lower"]:+.3f}', '—'], + ['Bootstrap 90% CI upper', '—', f'{ctx["bootstrap_upper"]:+.3f}', '—'], + ] + # Phase 2: surface the closed-loop behavioral signal when the v5 schema + # exposed it (absent on synthetic-only runs). + if ctx.get("cl_total_tasks"): + results_rows.append([ + f'Closed-loop tasks (n={ctx["cl_total_tasks"]})', + f'{ctx["cl_baseline_pass"]}/{ctx["cl_total_tasks"]}', + f'{ctx["cl_evolved_pass"]}/{ctx["cl_total_tasks"]}', + f'+{ctx["cl_tasks_gained"]} (req ≥{ctx["cl_required_gain"]})', + ]) results_rows.append(['Decision', '—', decision_cell, decision_note]) # Per-cell style picks: header row uses bold/white; first column (metric diff --git a/reports/phase3_prose.yaml b/reports/phase3_prose.yaml new file mode 100644 index 0000000..e68499c --- /dev/null +++ b/reports/phase3_prose.yaml @@ -0,0 +1,233 @@ +# Editorial content for the Phase 3 validation report. 
+# Numbers come from the run dir's gate_decision.json (the prompt-section path is +# behavioral-only and self-sources cost/timing/calls — no metrics.json/run.log +# needed). Pass via `generate_report.py --run output/prompts//`. Text blocks +# may include {placeholder} substitutions the renderer fills from that data. + +meta: + title: "Agent Self-Evolution" + subtitle: "Phase 3 Validation Report
System-prompt section evolution via splice-and-restore" + organization: "" + repository: "github.com/jramos/agent-self-evolution" + +executive_summary: + framework_intro: > + Agent Self-Evolution is a standalone optimization pipeline that uses DSPy and GEPA + (Genetic-Pareto Prompt Evolution) to automatically improve an agent's skills, tool + descriptions, system prompts, and code through evolutionary search — all via API + calls with no GPU training required. Phase 1 shipped a synthetic-only deploy gate; + Phase 2 made it behavior-aware and brought tool-description parity. Phase 3 extends + the framework to the third instructions surface — named sections of the agent's + system prompt — evaluated end-to-end against the real agent. + run_summary: > + This report documents the Phase 3 validation of system-prompt section evolution. + The target is a top-level string constant in Hermes Agent's + prompt_builder.py (here, {section_name}), evolved + via GEPA and validated purely behaviorally: every candidate is spliced into + the live prompt file and scored by running the real agent + (hermes -z) against a curated task suite — there is no + synthetic LLM-as-judge signal to lean on. Production {section_name} is already + well-tuned, so the saturation pre-flight correctly default-denies it (no headroom). + To exercise the loop end-to-end, the headline run evolves a deliberately-weakened + baseline (supplied via --baseline-override-file): the + agent's holdout pass-rate moved {baseline_pass_rate:.0%} → {evolved_pass_rate:.0%} + ({cl_baseline_pass}/{cl_total_tasks} → {cl_evolved_pass}/{cl_total_tasks} tasks, + +{n_wins}W / {n_losses}L) while the section shrank {growth_pct:+.1%}. + The closed-loop deploy gate decided {decision_upper}, and the live prompt file + was restored byte-for-byte after every trial. 
+ +key_result_box: + title_template: "KEY RESULT — {section_name} (prompt-section deploy via closed-loop gate)" + rows: + - "Holdout pass-rate (n={cl_total_tasks}): {baseline_pass_rate:.0%} → {evolved_pass_rate:.0%} (+{n_wins}W / {n_losses}L)" + - "Tasks passing: {cl_baseline_pass}/{cl_total_tasks} → {cl_evolved_pass}/{cl_total_tasks}" + - "Section size: {baseline_chars:,} → {evolved_chars:,} chars ({growth_pct:+.1%})" + - "Decision: {decision_upper} via the closed-loop behavioral gate" + +background: + intro: > + Agent Self-Evolution targets the instructions layer of an LLM agent — skill files, + tool descriptions, and system prompts — and evolves the text via API-only + evolutionary search. An agent's behavior is governed by three layers: + layers: + header: ["Layer", "What It Is", "How It's Currently Improved"] + rows: + - ["Model Weights", "The underlying LLM (Claude, GPT, etc.)", "RL training (Tinker-Atropos)"] + - ["Instructions", "Skills, system prompts, tool descriptions", "Manual authoring (static)"] + - ["Tool Code", "Python implementations of each tool", "Manual development"] + highlight_row: 1 + closing: > + Phases 1 and 2 validated skill files and tool descriptions. Phase 3 completes the + instructions trio with system-prompt sections — the highest-leverage, widest + blast-radius surface, since one section governs the agent across every task. The + section is a string constant inside Hermes' own source, so unlike the skill path + (separate writable workdir) there is no env-var hook or plugin seam: the framework + edits prompt_builder.py in place. The integration is an + AST-precise splice-and-restore — the candidate is byte-spliced into the live + file for the duration of a trial and restored from an atomic backup afterward + (flock + checksum-drift detection + parse-guard, reused + from the Phase 2 closed-loop validator). 
Crucially, a system-prompt section has no + cheap synthetic proxy: the only honest measure of "did this guidance help" is + running the real agent, so Phase 3's deploy gate is purely behavioral. + +approach: + section_title: "Approach: Behavioral Prompt-Section Evolution" + engines: + header: ["Engine", "What It Optimizes", "License", "Role"] + rows: + - ["DSPy + GEPA", "Skills, prompts, tool descriptions", "MIT", "Primary (validated)"] + - ["DSPy MIPROv2", "Few-shot examples, instruction text", "MIT", "Fallback optimizer"] + - ["Darwinian Evolver", "Code files, algorithms", "AGPL v3", "Code evolution (Phase 4)"] + gepa_narrative: > + GEPA (Genetic-Pareto Prompt Evolution) is the star engine — an ICLR 2026 + Oral paper from Stanford/UC Berkeley. Unlike traditional evolutionary search that + only sees pass/fail scores, GEPA reads full execution traces to understand + why things failed, then proposes targeted mutations. Phase 3 wires GEPA to a + sentinel-preserving proposer (mutations are confined to the section's text, never + the surrounding scaffolding) and routes every candidate score through a real + hermes -z subprocess. Because the spliced + prompt_builder.py is a single shared file and DSPy + evaluates with a thread pool, candidate scoring is serialized under a lock — an + accepted cost of the splice-and-restore model. + pipeline_steps: + - "Resolve baseline — Read the section's current text from prompt_builder.py (or accept a weakened baseline via --baseline-override-file to create headroom on an already-tuned section)" + - "Split — Deterministic seeded train / holdout split of the curated JSONL task suite" + - "Saturation pre-flight — Score the baseline behaviorally on the holdout; a no_headroom band default-denies (correctly refusing to evolve a saturated section) unless overridden" + - "GEPA loop — The section text is a sentinel-delimited region of a passthrough predictor's instructions; GEPA mutates it with the sentinel-preserving proposer. 
Each candidate is spliced into the live file and scored by running the agent on each task" + - "Compound verdict — Layer 1: did the agent invoke the expected tool (e.g. memory)? Layer 2: an LLM judge scores the saved content against each task's rubric" + - "Closed-loop deploy gate — Select the GEPA val-best candidate, then run baseline vs. evolved on the holdout suite; deploy iff holdout pass-rate doesn't regress and per-task wins offset losses ≥ 2:1" + - "Report + restore — Structured gate_decision.json (v5 schema, prompt-section variant); the live file is restored byte-for-byte" + cost_paragraph: > + The honest Phase 3 story is two-part. First, the framework's regression-catching + discipline: the production {section_name} is already + well-tuned, so a capable agent satisfies the suite regardless of small wording + changes — the saturation pre-flight scores the baseline at ceiling and correctly + default-denies, refusing to spend GEPA budget where no improvement is + possible. This mirrors the Phase 2 finding that the framework is improvement-finding + only where headroom genuinely exists. Second, to demonstrate that the loop produces + a real, grounded improvement when headroom does exist, the headline run + evolves a deliberately-adversarial baseline (one that instructs the agent not + to save) — exactly the weakened-target approach Phase 2 used for its headline. That + run consumed ${cost_total_usd:.2f} across {lm_calls_metrics:,} in-process LM + calls in ~{elapsed_minutes:.0f} minutes (the agent's own subprocess spend is + separate). Splicing a different section measurably changed live agent behavior, and + GEPA recovered a corrected section that the closed-loop gate deployed. 
+ +experiment: + section_title: "Phase 3 Experiment" + config_overrides: + target_section_label: "{section_name} — evolved from a deliberately-weakened baseline (production {section_name} is saturated; the weak baseline, supplied via --baseline-override-file, exercises the loop end-to-end)" + optimizer_lm: "{optimizer_lm}" + reflection_lm: "{reflection_lm}" + eval_judge_lm: "{eval_lm}" + agent_lm: "openai/gpt-5.4-mini (Hermes-configured default)" + quality_gate_label: "closed-loop behavioral — holdout pass-rate no-regression + per-task wins ≥ 2·losses; compound verdict (Layer 1 trigger + Layer 2 content judge)" + saturation_label: "forced via --force-saturation-check (the weakened baseline had real headroom; production {section_name} default-denies as no_headroom)" + dataset_intro: > + The evaluation suite is a curated, hand-authored JSONL benchmark + (memory_guidance.jsonl, 12 tasks across five categories: + save-preference, save-correction, dont-save-task-progress, + dont-save-completed-work-log, and declarative-vs-imperative). Unlike Phases 1 and 2, + there is no synthetically-generated train/val/holdout of LLM-judge examples — + every task is scored behaviorally by running the real agent, and the deploy gate's + holdout is {cl_total_tasks} of those tasks. Each save task carries an + expected_save_content rubric consumed by the Layer 2 + content judge. + fitness_intro: > + Fitness is behavioral, not a synthetic judge score. For each task, the candidate + section is spliced into the live prompt_builder.py, the + agent runs once via hermes -z, and the resulting session + is read back from Hermes' SQLite session store. 
The verdict is compound: + fitness_formula: "pass = Layer1(expected memory action fired, forbidden actions absent) AND Layer2(content-judge score ≥ 0.7 on save tasks)" + fitness_closing: > + GEPA's reflection LM reads the per-task failures and proposes a targeted mutation of + the section text; the sentinel-preserving proposer confines edits to the section and + re-raises rather than admit a candidate that drops the markers. The deploy gate then + re-runs baseline vs. evolved on the holdout and decides on the behavioral signal + alone — holdout pass-rate no-regression plus a per-task win/loss rule — with no + paired-bootstrap CI, because there is no synthetic per-example distribution to + resample. + +results: + narrative: > + Evolving the weakened {section_name} baseline, the agent's holdout pass-rate + moved {baseline_pass_rate:.0%} → {evolved_pass_rate:.0%} + ({cl_baseline_pass}/{cl_total_tasks} → {cl_evolved_pass}/{cl_total_tasks} tasks, + +{n_wins} wins / {n_losses} losses, {n_ties} ties) while the section text + shrank {growth_pct:+.1%} ({baseline_chars:,} → {evolved_chars:,} + chars). GEPA learned from the save-task failures and inverted the adversarial + instruction — it removed the "never proactively save" misdirection and restored + proactive saving while keeping the legitimate "don't store passing remarks" + discrimination, in fewer characters. Decision: {decision_upper} via the + closed-loop gate ({decision_reasons}). The proposer rejected + {sentinel_failures} sentinel-breaking candidates. Throughout, the live + prompt_builder.py was restored byte-for-byte after every + trial. The production {section_name} itself is saturated and correctly + default-denies — the framework is regression-catching, and only finds improvements + where real headroom exists. 
+ how_produced_intro: "GEPA evolves the section text through a reflective loop; the gate then reads the behavioral signal:" + how_produced_steps: + - "Splice a candidate section into the live prompt_builder.py (only when the candidate changes); run each holdout task once via hermes -z and read the session from Hermes' state.db" + - "Score each run with the compound verdict (Layer 1 tool-trigger membership + Layer 2 content judge on memory-save content); abstentions (agent/runner errors) score 0 in-loop and tie at the gate" + - "The reflection LM reads the failures and proposes a sentinel-confined mutation of the section text; GEPA accepts on improvement-or-equal" + - "Select the GEPA val-best candidate; run the closed-loop deploy gate (baseline vs. evolved on {cl_total_tasks} holdout tasks, its own backup/restore)" + - "Decide — Deploy iff evolved holdout pass-rate ≥ baseline AND per-task wins offset losses ≥ 2:1. On this run: {baseline_pass_rate:.0%} → {evolved_pass_rate:.0%}, {n_wins}W/{n_losses}L → DEPLOY" + how_produced_closing: > + Two design choices made this outcome trustworthy. First, the splice-and-restore + guard (atomic backup + exclusive flock + byte-restore, + with stale-backup refusal) means the user's Hermes checkout is never left mutated, + even on crash. Second, the deploy gate is the same proven closed-loop validator used + for tool descriptions — the prompt path adds only a thin installer plus a per-task + content judge, so the decision rule, audit trail, and restore machinery are shared + and already battle-tested. The behavioral-only design is not a shortcut: it is the + only honest measure for a system-prompt section, which has no cheap synthetic proxy. 
+ +safety: + intro: "Every evolved section must clear these constraints, and the live prompt file is protected throughout:" + table: + header: ["Constraint", "Enforcement", "Status"] + rows: + - ["Self-evolution test suite", "1,232 pytest tests pass on the optimizer itself", "Implemented"] + - ["Byte-clean splice/restore", "Atomic backup + byte-for-byte restore of prompt_builder.py after every run", "Implemented"] + - ["Parse-guarded write", "Candidate spliced via repr() + ast.parse check; refuses to write non-parseable Python", "Implemented"] + - ["Exclusive lock + drift check", "flock on the prompt file's dir + sha-drift detection; stale-backup refusal on startup", "Implemented"] + - ["Compound verdict", "Layer 1 tool-trigger membership AND Layer 2 LLM content judge (≥ threshold)", "Implemented"] + - ["Abstain on corrupt session", "A malformed agent session abstains (neutral), never scores as a behavioral regression", "Implemented"] + - ["Closed-loop deploy gate", "Holdout pass-rate no-regression + per-task wins ≥ 2·losses", "Implemented"] + - ["Saturation pre-flight", "Default-denies a saturated (no_headroom) section before spending GEPA budget", "Implemented"] + - ["Budget ceiling", "--max-cost-usd aborts on in-process LM spend overrun", "Implemented"] + - ["Deployment via apply + review", "--apply writes the section; PR automation deferred for prompt sections", "By design"] + - ["Benchmark regression", "External --benchmark-cmd hook (TBLite / harness)", "Planned"] + closing: > + The source Hermes repository is never left modified: the section is spliced in only + for the duration of a trial and restored from an atomic backup, and all evolution + output (gate decisions, section before/after text, run logs) is written under the + framework's local output/ directory. PR automation is + deferred for prompt sections — a section-scoped PR path is future work — so the + deploy step is an explicit --apply plus a human-authored + pull request. 
+ +roadmap: + table: + header: ["Phase", "Target", "Engine", "Timeline", "Status"] + rows: + - ["Phase 1", "Skill files (SKILL.md)", "DSPy + GEPA", "3-4 weeks", "Validated ✓"] + - ["Phase 2", "Tool descriptions", "DSPy + GEPA", "2-3 weeks", "Validated ✓"] + - ["Phase 3", "System prompt sections", "DSPy + GEPA", "2-3 weeks", "Validated ✓"] + - ["Phase 4", "Tool implementation code", "Darwinian Evolver", "3-4 weeks", "Planned"] + - ["Phase 5", "Continuous improvement", "Automated pipeline", "2 weeks", "Planned"] + highlight_row: 2 + closing: > + Phase 3 completes the instructions trio — skills, tool descriptions, and now + system-prompt sections — all gated by the same closed-loop discipline. The + behavioral-only deploy gate proves the framework can evolve the highest-blast-radius + instructions surface safely: it default-denies a saturated section, produces a real + grounded improvement where headroom exists, and never leaves the agent's source + mutated. Phase 4 (tool implementation code) and Phase 5 (continuous improvement) + extend the framework beyond the instructions layer. + +next_steps: + - "Harder behavioral suites — Production system-prompt sections are heavily tuned and saturate the current suites; develop richer, harder task suites (and weaker agent tiers) so headroom exists on real targets, not only adversarial baselines." + - "Additional sections — The same path supports any string-constant section (SKILLS_GUIDANCE, SESSION_SEARCH_GUIDANCE, etc.); MEMORY_GUIDANCE was the first proof point, chosen for its clear tool-call anchor." + - "Section-scoped PR automation — Wire --create-pr for prompt sections by splicing into origin/<base>'s prompt_builder.py (not the local checkout), so the PR diff carries only the section change." + - "Agent-side cost capture — The agent's own LM spend happens inside the hermes subprocess and is invisible to the in-process budget ceiling; surface it from the session store so --max-cost-usd accounts for end-to-end spend." 
diff --git a/reports/phase3_validation_report.pdf b/reports/phase3_validation_report.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..b2cd415560bce8a8ba870816882e2f5649df6e29
Binary files /dev/null and b/reports/phase3_validation_report.pdf differ