jramos · Jun 2, 2026
diff --git a/README.md b/README.md
@@ -169,6 +169,22 @@ The framework parses every `*_SCHEMA = {...}` and `*_SCHEMAS = [...]` declaratio
 
 With `--apply`, the evolved description is spliced into the source file's bytes at the original position — comments, formatting, and unrelated tools are untouched. Multi-line parenthesized concatenations collapse to a single triple-quoted string at the same indent.
 
+### Evolve a system prompt section
+
+For Hermes Agent, evolve a named section of the assembled system prompt — any top-level string constant in `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`, which governs when and what the agent saves to memory):
+
+```bash
+uv run python -m evolution.prompts.evolve_prompt_section \
+    --section MEMORY_GUIDANCE \
+    --hermes-repo /path/to/hermes-agent \
+    --tasks evolution/validation/suites/memory_guidance.jsonl \
+    --iterations 10
+```
+
+Unlike skill and tool evolution — where the deploy gate can lean on a synthetic LLM-judge signal — a prompt section is evaluated **purely behaviorally**: every candidate is spliced into the live `prompt_builder.py` and scored by running the real agent (`hermes -z`) against the task suite. The verdict is compound — Layer 1 checks whether the agent invoked the expected tool (e.g. `memory`), and Layer 2 runs an LLM judge over the saved content against each task's `expected_save_content` rubric. The candidate is spliced in only for the duration of the run; the file is restored byte-for-byte afterward (atomic backup + flock + checksum-drift detection, shared with the tool-description path).
+
+`--apply` writes the evolved section into `prompt_builder.py` in place; results land in `output/prompts/<section>/<timestamp>/`. PR automation (`--create-pr`) is not yet wired for prompt sections — use `--apply` plus a manual PR. To demonstrate the loop on an already-tuned section (which the saturation pre-flight will otherwise correctly default-deny as having no headroom), `--baseline-override-file` starts evolution from arbitrary text — e.g. a deliberately-weakened baseline that gives GEPA real failures to learn from.
+
 ### Mine real session history for evals
 
 For skill evolution:
@@ -331,7 +347,7 @@ Cost: each task is one `hermes -z` run (~$0.05–$0.50). The bundled `patch.json
 |-------|--------|--------|--------|
 | **Phase 1** | Skill files (SKILL.md) | DSPy + GEPA | ✅ [Validated](reports/phase1_validation_report.pdf) |
 | **Phase 2** | Tool descriptions + dual-signal deploy gate | DSPy + GEPA | ✅ [Validated](reports/phase2_validation_report.pdf) |
-| **Phase 3** | System prompt sections | DSPy + GEPA | 🔲 Planned |
+| **Phase 3** | System prompt sections | DSPy + GEPA | ✅ [Validated](reports/phase3_validation_report.pdf) |
 | **Phase 4** | Tool implementation code | Darwinian Evolver | 🔲 Planned |
 | **Phase 5** | Continuous improvement loop | Automated pipeline | 🔲 Planned |
 

diff --git a/docs/architecture.md b/docs/architecture.md
@@ -56,10 +56,20 @@ graph TB
         hermes_source[tools.hermes_source<br/>Hermes *_SCHEMA AST adapter]
     end
 
+    subgraph prompts_tier[Prompt Tier]
+        evolve_prompt[prompts.evolve_prompt_section<br/>main + evolve]
+        prompt_module[prompts.prompt_module<br/>PromptModule + sentinels]
+        prompt_proposer[prompts.prompt_proposer<br/>PromptSectionProposer]
+        prompt_judge[prompts.prompt_judge<br/>SaveCallJudge + judge_save_calls<br/>+ prompt fitness/splice scorer]
+        prompt_source[prompts.prompt_source<br/>PromptSource protocol + SectionDescriptor]
+        hermes_prompt_source[prompts.hermes_prompt_source<br/>HermesPromptSource — prompt_builder.py AST]
+    end
+
     subgraph validation_subsystem[Closed-loop validation]
         validator[validation.validator<br/>ClosedLoopValidator]
         hermes_runner[validation.hermes_runner<br/>hermes -z subprocess]
-        installer[validation.artifact_installer<br/>HermesToolDescriptionInstaller]
+        installer[validation.artifact_installer<br/>HermesToolDescriptionInstaller +<br/>HermesPromptSectionInstaller]
+        savejudge[validation.report<br/>score_task Layer-2 judge hook]
         report[validation.report<br/>ValidationReport + decision]
         task[validation.task<br/>Task + TaskSuite]
         cl_cli[validation.closed_loop<br/>CLI]
@@ -117,10 +127,26 @@ graph TB
     tool_judge --> fitness
     tool_proposer --> budget
 
+    evolve_prompt --> prompt_module
+    evolve_prompt --> prompt_proposer
+    evolve_prompt --> prompt_judge
+    evolve_prompt --> prompt_source
+    evolve_prompt --> hermes_prompt_source
+    evolve_prompt --> config
+    evolve_prompt --> quality
+    evolve_prompt --> timing
+    evolve_prompt --> validator
+    hermes_prompt_source --> prompt_source
+    prompt_module --> dspy
+    prompt_proposer --> budget
+    prompt_judge --> fitness
+    installer --> hermes_prompt_source
+
     validator --> hermes_runner
     validator --> installer
     validator --> report
     validator --> task
+    validator --> savejudge
     cl_cli --> validator
     hermes_runner --> hermes
 
@@ -138,7 +164,9 @@ graph TB
     importers --> dataset
 ```
 
-`evolution/core/` has no dependency on `evolution/skills/`, `evolution/tools/`, or `evolution/validation/`. The reverse holds: tier packages use core helpers but core never imports from a tier package. `closed_loop_feedback.py` imports `evolution.validation.*` types because it's the integration seam, but the validation subpackage doesn't import from skills/tools. This keeps the tier-3/4/5 expansion path open.
+`evolution/core/` has no dependency on `evolution/skills/`, `evolution/tools/`, `evolution/prompts/`, or `evolution/validation/`. The dependency runs one way: tier packages use core helpers but core never imports from a tier package. `closed_loop_feedback.py` imports `evolution.validation.*` types because it's the integration seam, but the validation subpackage doesn't import from skills/tools/prompts. This keeps the tier-4/5 expansion path open.
+
+The `prompts` tier (Phase 3) is the prompt-section evolution path: `evolve_prompt_section` wraps a named `prompt_builder.py` constant as a `PromptModule` (a passthrough predictor carrying the candidate in sentinel-delimited instructions), mutates it with `PromptSectionProposer`, and — because there is no synthetic classification signal for a system-prompt section — scores **purely behaviorally** through the closed-loop validator running a real `hermes -z` against a curated JSONL suite. The deploy gate is therefore a closed-loop pass-rate / win-loss decision, not a paired-bootstrap one. Unlike the skill/tool tiers it reuses `ClosedLoopValidator` directly rather than going through `closed_loop_feedback.py`, and it integrates by AST-splicing the candidate into the live `agent/prompt_builder.py` (`HermesPromptSectionInstaller`) with atomic restore. The Layer-2 content judge (`SaveCallJudge` / `judge_save_calls`) runs inside `score_task` to grade memory-save *content* on top of the Layer-1 trigger-membership check.
 
 ## Design patterns in active use
 

diff --git a/docs/codebase_info.md b/docs/codebase_info.md
@@ -67,14 +67,20 @@ evolution/
 │   └── tool_judge.py                    # tool-flavored LLMJudge + GEPA-shaped metric
 ├── validation/                          # closed-loop validation against a real agent
 │   ├── agent_runner.py                  # AgentRunner Protocol + AgentRunResult dataclass
-│   ├── artifact_installer.py            # ArtifactInstaller Protocol + HermesToolDescriptionInstaller
+│   ├── artifact_installer.py            # ArtifactInstaller Protocol + HermesToolDescriptionInstaller + HermesPromptSectionInstaller
 │   ├── closed_loop.py                   # CLI: drive baseline + evolved through hermes -z, compare
-│   ├── hermes_runner.py                 # HermesAgentRunner — subprocess hermes -z with sandboxed HOME
-│   ├── report.py                        # ValidationReport + TaskResult + decision rule
-│   ├── suites/                          # JSONL task suites (patch.jsonl, write_file.jsonl, search_files.jsonl)
+│   ├── hermes_runner.py                 # HermesAgentRunner — subprocess hermes -z; reads sessions from SQLite state.db (parse_session_from_db)
+│   ├── report.py                        # ValidationReport + TaskResult + decision rule + Layer-2 SaveCallJudge in score_task
+│   ├── suites/                          # JSONL task suites (patch.jsonl, write_file.jsonl, search_files.jsonl, memory_guidance.jsonl)
 │   ├── task.py                          # Task + TaskSuite.from_jsonl (with sha256 audit)
 │   └── validator.py                     # ClosedLoopValidator.validate — mutates + restores live agent file
-├── prompts/                             # Tier 3: planned, empty package
+├── prompts/                             # Tier 3: system-prompt-section evolution
+│   ├── evolve_prompt_section.py         # CLI + orchestration; purely-behavioral closed-loop gate
+│   ├── prompt_source.py                 # PromptSource Protocol (read + write) + SectionDescriptor
+│   ├── hermes_prompt_source.py          # HermesPromptSource — AST read/write of prompt_builder.py constants
+│   ├── prompt_module.py                 # PromptModule — passthrough predictor carrying candidate in sentinels
+│   ├── prompt_proposer.py               # PromptSectionProposer — sentinel-preserving GEPA proposer
+│   └── prompt_judge.py                  # SaveCallJudge + judge_save_calls Layer-2 content judge + fitness/splice scorers
 ├── code/                                # Tier 4: planned, empty package
 └── monitor/                             # planned, empty package
 ```
@@ -86,6 +92,7 @@ evolution/
 | `evolution/skills/evolve_skill.py` | ~1340 | CLI, orchestration, gate-decision payload assembly |
 | `evolution/tools/evolve_tool.py` | ~1170 | CLI + orchestration for tool-description evolution |
 | `evolution/core/external_importers.py` | ~770 | 3 importers + relevance filter + standalone CLI |
+| `evolution/prompts/evolve_prompt_section.py` | ~660 | CLI + orchestration; purely-behavioral closed-loop deploy gate |
 | `evolution/core/dataset_builder.py` | ~480 | synthetic generator + golden loader + tool-selection three-bucket gen |
 | `evolution/core/lm_timing_callback.py` | ~400 | DSPy BaseCallback + litellm.failure_callback + cost ledger |
 | `evolution/core/fitness.py` | ~380 | LLMJudge + skill/tool fitness metrics + behavioral score helper |
@@ -94,22 +101,27 @@ evolution/
 | `evolution/core/closed_loop_feedback.py` | ~320 | cache + saturation gate + deterministic feedback block + `force_run` (bypasses gate for pre-flight) |
 | `evolution/core/saturation_check.py` | ~255 | pre-flight: band classifier + `SaturationReport` + Rich panel + interactive confirm |
 | `evolution/tools/tool_judge.py` | ~230 | tool-flavored judge + GEPA-shaped metric with behavioral branch |
+| `evolution/prompts/prompt_judge.py` | ~230 | SaveCallJudge + judge_save_calls Layer-2 content judge + prompt fitness/splice scorers |
 | `evolution/validation/validator.py` | ~220 | mutate + restore live agent file with flock + checksum drift check |
 | `evolution/validation/report.py` | ~225 | ValidationReport JSON + Rich rendering + two-condition decision |
 | `evolution/core/skill_sources.py` | ~210 | Hermes / Claude Code / LocalDir |
 | `evolution/core/quality_gate.py` | ~210 | preset table + proposer-mode resolution + gate-decision persistence |
 | `evolution/skills/knee_point.py` | ~205 | parsimony-based candidate picker |
 | `evolution/validation/hermes_runner.py` | ~205 | hermes -z subprocess with sandboxed HOME |
 | `evolution/tools/tool_proposer.py` | ~200 | sentinel-preserving reflection prompt |
-| `evolution/validation/artifact_installer.py` | ~150 | byte-precise splice + atomic restore |
+| `evolution/prompts/prompt_proposer.py` | ~160 | sentinel-preserving GEPA proposer for prompt sections |
+| `evolution/validation/artifact_installer.py` | ~150 | byte-precise splice + atomic restore (tool + prompt-section installers) |
+| `evolution/prompts/hermes_prompt_source.py` | ~135 | AST read/write of prompt_builder.py string constants |
+| `evolution/prompts/prompt_module.py` | ~120 | PromptModule passthrough predictor + sentinel parse |
 | `evolution/validation/closed_loop.py` | ~135 | standalone closed-loop CLI |
 | `evolution/skills/skill_module.py` | ~125 | wraps SKILL.md as `dspy.Module` |
 | `evolution/validation/task.py` | ~90 | Task + TaskSuite.from_jsonl |
 | `evolution/core/config.py` | ~80 | `EvolutionConfig` dataclass |
 | `evolution/core/stats.py` | ~60 | `paired_bootstrap` helper |
+| `evolution/prompts/prompt_source.py` | ~55 | PromptSource Protocol + SectionDescriptor |
 | `evolution/validation/agent_runner.py` | ~55 | AgentRunner Protocol + dataclasses |
 | `evolution/core/behavioral_example.py` | ~35 | builder for behavioral dspy.Examples |
-| **Total** | **~9,000** | excludes empty `__init__.py` shims |
+| **Total** | **~10,400** | excludes empty `__init__.py` shims |
 
 Test suite: 61 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1166 tests** collected.
 
@@ -139,11 +151,11 @@ The README's table summarizes intent; reality:
 |---|---|---|---|
 | 1 | Skill files (SKILL.md) | DSPy + GEPA | ✅ implemented in `evolution/skills/` |
 | 2 | Tool descriptions | DSPy + GEPA | ✅ implemented in `evolution/tools/` — MCP-JSON and Hermes-Python-AST adapters; one target tool per run |
-| 3 | System prompt sections | DSPy + GEPA | 🔲 `evolution/prompts/` package exists, empty |
+| 3 | System prompt sections | DSPy + GEPA | ✅ implemented in `evolution/prompts/` — AST splice of `prompt_builder.py` constants; purely-behavioral closed-loop deploy gate (no synthetic signal) |
 | 4 | Tool implementation code | Darwinian Evolver | 🔲 `evolution/code/` package exists, empty; `[darwinian]` extra reserves the dep |
 | 5 | Continuous improvement loop | Automated pipeline | 🔲 `evolution/monitor/` package exists, empty |
 
-Tiers 1 and 2 are built. Tier 3-5 packages exist as empty stubs to anchor the planned architecture. See PLAN.md's per-phase "Deviations from plan" subsections for where the built tiers diverge from the original spec.
+Tiers 1-3 are built. Tier 4-5 packages exist as empty stubs to anchor the planned architecture. See PLAN.md's per-phase "Deviations from plan" subsections for where the built tiers diverge from the original spec.
 
 **Orthogonal validation surface.** `evolution/validation/` runs a real agent (`hermes -z`) through a JSONL task suite with baseline vs evolved artifacts spliced into the live install. Scores actual tool-selection behavior with `expected_tools` / `forbidden_tools` per task; compares with a two-condition decision rule. Available three ways: