diff --git a/README.md b/README.md
index e145a91..5fae14d 100644
--- a/README.md
+++ b/README.md
@@ -169,6 +169,22 @@ The framework parses every `*_SCHEMA = {...}` and `*_SCHEMAS = [...]` declaratio
 
 With `--apply`, the evolved description is spliced into the source file's bytes at the original position — comments, formatting, and unrelated tools are untouched. Multi-line parenthesized concatenations collapse to a single triple-quoted string at the same indent.
 
+### Evolve a system prompt section
+
+For Hermes Agent, evolve a named section of the assembled system prompt — any top-level string constant in `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`, which governs when and what the agent saves to memory):
+
+```bash
+uv run python -m evolution.prompts.evolve_prompt_section \
+    --section MEMORY_GUIDANCE \
+    --hermes-repo /path/to/hermes-agent \
+    --tasks evolution/validation/suites/memory_guidance.jsonl \
+    --iterations 10
+```
+
+Unlike skill and tool evolution, where the deploy gate can lean on a synthetic LLM-judge signal, a prompt section is evaluated **purely behaviorally**: every candidate is spliced into the live `prompt_builder.py` and scored by running the real agent (`hermes -z`) against the task suite. The verdict is compound, and a task with a rubric passes only if both layers pass: Layer 1 checks whether the agent invoked the expected tool (e.g. `memory`), and Layer 2 runs an LLM judge over the saved content against the task's `expected_save_content` rubric. The candidate stays spliced in only for the duration of the run; afterward the file is restored byte-for-byte (atomic backup + flock + checksum-drift detection, shared with the tool-description path).
+
+`--apply` writes the evolved section into `prompt_builder.py` in place when the deploy gate passes; results land in `output/prompts/<section>/<timestamp>/`. PR automation (`--create-pr`) is not yet wired for prompt sections; use `--apply` plus a manual PR. The saturation pre-flight correctly default-denies an already-tuned section as having no headroom, so to demonstrate the loop on one, pass `--baseline-override-file` to start evolution from arbitrary text, e.g. a deliberately weakened baseline that gives GEPA real failures to learn from.
+
 ### Mine real session history for evals
 
 For skill evolution:
@@ -331,7 +347,7 @@ Cost: each task is one `hermes -z` run (~$0.05–$0.50). The bundled `patch.json
 |-------|--------|--------|--------|
 | **Phase 1** | Skill files (SKILL.md) | DSPy + GEPA | ✅ [Validated](reports/phase1_validation_report.pdf) |
 | **Phase 2** | Tool descriptions + dual-signal deploy gate | DSPy + GEPA | ✅ [Validated](reports/phase2_validation_report.pdf) |
-| **Phase 3** | System prompt sections | DSPy + GEPA | 🔲 Planned |
+| **Phase 3** | System prompt sections | DSPy + GEPA | ✅ [Validated](reports/phase3_validation_report.pdf) |
 | **Phase 4** | Tool implementation code | Darwinian Evolver | 🔲 Planned |
 | **Phase 5** | Continuous improvement loop | Automated pipeline | 🔲 Planned |
 
diff --git a/docs/architecture.md b/docs/architecture.md
index 772a94b..8e73e94 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -56,10 +56,20 @@ graph TB
         hermes_source[tools.hermes_source<br/>Hermes *_SCHEMA AST adapter]
     end
 
+    subgraph prompts_tier[Prompt Tier]
+        evolve_prompt[prompts.evolve_prompt_section<br/>main + evolve]
+        prompt_module[prompts.prompt_module<br/>PromptModule + sentinels]
+        prompt_proposer[prompts.prompt_proposer<br/>PromptSectionProposer]
+        prompt_judge[prompts.prompt_judge<br/>SaveCallJudge + judge_save_calls<br/>+ prompt fitness/splice scorer]
+        prompt_source[prompts.prompt_source<br/>PromptSource protocol + SectionDescriptor]
+        hermes_prompt_source[prompts.hermes_prompt_source<br/>HermesPromptSource — prompt_builder.py AST]
+    end
+
     subgraph validation_subsystem[Closed-loop validation]
         validator[validation.validator<br/>ClosedLoopValidator]
         hermes_runner[validation.hermes_runner<br/>hermes -z subprocess]
-        installer[validation.artifact_installer<br/>HermesToolDescriptionInstaller]
+        installer[validation.artifact_installer<br/>HermesToolDescriptionInstaller +<br/>HermesPromptSectionInstaller]
+        savejudge[validation.report<br/>score_task Layer-2 judge hook]
         report[validation.report<br/>ValidationReport + decision]
         task[validation.task<br/>Task + TaskSuite]
         cl_cli[validation.closed_loop<br/>CLI]
@@ -117,10 +127,26 @@ graph TB
     tool_judge --> fitness
     tool_proposer --> budget
 
+    evolve_prompt --> prompt_module
+    evolve_prompt --> prompt_proposer
+    evolve_prompt --> prompt_judge
+    evolve_prompt --> prompt_source
+    evolve_prompt --> hermes_prompt_source
+    evolve_prompt --> config
+    evolve_prompt --> quality
+    evolve_prompt --> timing
+    evolve_prompt --> validator
+    hermes_prompt_source --> prompt_source
+    prompt_module --> dspy
+    prompt_proposer --> budget
+    prompt_judge --> fitness
+    installer --> hermes_prompt_source
+
     validator --> hermes_runner
     validator --> installer
     validator --> report
     validator --> task
+    validator --> savejudge
     cl_cli --> validator
     hermes_runner --> hermes
 
@@ -138,7 +164,9 @@ graph TB
     importers --> dataset
 ```
 
-`evolution/core/` has no dependency on `evolution/skills/`, `evolution/tools/`, or `evolution/validation/`. The reverse holds: tier packages use core helpers but core never imports from a tier package. `closed_loop_feedback.py` imports `evolution.validation.*` types because it's the integration seam, but the validation subpackage doesn't import from skills/tools. This keeps the tier-3/4/5 expansion path open.
+`evolution/core/` has no dependency on `evolution/skills/`, `evolution/tools/`, `evolution/prompts/`, or `evolution/validation/`. The reverse holds: tier packages use core helpers but core never imports from a tier package. `closed_loop_feedback.py` imports `evolution.validation.*` types because it's the integration seam, but the validation subpackage doesn't import from skills/tools/prompts. This keeps the tier-4/5 expansion path open.
+
+The `prompts` tier (Phase 3) is the prompt-section evolution path: `evolve_prompt_section` wraps a named `prompt_builder.py` constant as a `PromptModule` (a passthrough predictor carrying the candidate in sentinel-delimited instructions), mutates it with `PromptSectionProposer`, and — because there is no synthetic classification signal for a system-prompt section — scores **purely behaviorally** through the closed-loop validator running a real `hermes -z` against a curated JSONL suite. The deploy gate is therefore a closed-loop pass-rate / win-loss decision, not a paired-bootstrap one. Unlike the skill/tool tiers it reuses `ClosedLoopValidator` directly rather than going through `closed_loop_feedback.py`, and it integrates by AST-splicing the candidate into the live `agent/prompt_builder.py` (`HermesPromptSectionInstaller`) with atomic restore. The Layer-2 content judge (`SaveCallJudge` / `judge_save_calls`) runs inside `score_task` to grade memory-save *content* on top of the Layer-1 trigger-membership check.
 
 ## Design patterns in active use
 
diff --git a/docs/codebase_info.md b/docs/codebase_info.md
index 2900d46..c16decc 100644
--- a/docs/codebase_info.md
+++ b/docs/codebase_info.md
@@ -67,14 +67,20 @@ evolution/
 │   └── tool_judge.py                    # tool-flavored LLMJudge + GEPA-shaped metric
 ├── validation/                          # closed-loop validation against a real agent
 │   ├── agent_runner.py                  # AgentRunner Protocol + AgentRunResult dataclass
-│   ├── artifact_installer.py            # ArtifactInstaller Protocol + HermesToolDescriptionInstaller
+│   ├── artifact_installer.py            # ArtifactInstaller Protocol + HermesToolDescriptionInstaller + HermesPromptSectionInstaller
 │   ├── closed_loop.py                   # CLI: drive baseline + evolved through hermes -z, compare
-│   ├── hermes_runner.py                 # HermesAgentRunner — subprocess hermes -z with sandboxed HOME
-│   ├── report.py                        # ValidationReport + TaskResult + decision rule
-│   ├── suites/                          # JSONL task suites (patch.jsonl, write_file.jsonl, search_files.jsonl)
+│   ├── hermes_runner.py                 # HermesAgentRunner — subprocess hermes -z; reads sessions from SQLite state.db (parse_session_from_db)
+│   ├── report.py                        # ValidationReport + TaskResult + decision rule + Layer-2 SaveCallJudge in score_task
+│   ├── suites/                          # JSONL task suites (patch.jsonl, write_file.jsonl, search_files.jsonl, memory_guidance.jsonl)
 │   ├── task.py                          # Task + TaskSuite.from_jsonl (with sha256 audit)
 │   └── validator.py                     # ClosedLoopValidator.validate — mutates + restores live agent file
-├── prompts/                             # Tier 3: planned, empty package
+├── prompts/                             # Tier 3: system-prompt-section evolution
+│   ├── evolve_prompt_section.py         # CLI + orchestration; purely-behavioral closed-loop gate
+│   ├── prompt_source.py                 # PromptSource Protocol (read + write) + SectionDescriptor
+│   ├── hermes_prompt_source.py          # HermesPromptSource — AST read/write of prompt_builder.py constants
+│   ├── prompt_module.py                 # PromptModule — passthrough predictor carrying candidate in sentinels
+│   ├── prompt_proposer.py               # PromptSectionProposer — sentinel-preserving GEPA proposer
+│   └── prompt_judge.py                  # SaveCallJudge + judge_save_calls Layer-2 content judge + fitness/splice scorers
 ├── code/                                # Tier 4: planned, empty package
 └── monitor/                             # planned, empty package
 ```
@@ -86,6 +92,7 @@ evolution/
 | `evolution/skills/evolve_skill.py` | ~1340 | CLI, orchestration, gate-decision payload assembly |
 | `evolution/tools/evolve_tool.py` | ~1170 | CLI + orchestration for tool-description evolution |
 | `evolution/core/external_importers.py` | ~770 | 3 importers + relevance filter + standalone CLI |
+| `evolution/prompts/evolve_prompt_section.py` | ~660 | CLI + orchestration; purely-behavioral closed-loop deploy gate |
 | `evolution/core/dataset_builder.py` | ~480 | synthetic generator + golden loader + tool-selection three-bucket gen |
 | `evolution/core/lm_timing_callback.py` | ~400 | DSPy BaseCallback + litellm.failure_callback + cost ledger |
 | `evolution/core/fitness.py` | ~380 | LLMJudge + skill/tool fitness metrics + behavioral score helper |
@@ -94,6 +101,7 @@ evolution/
 | `evolution/core/closed_loop_feedback.py` | ~320 | cache + saturation gate + deterministic feedback block + `force_run` (bypasses gate for pre-flight) |
 | `evolution/core/saturation_check.py` | ~255 | pre-flight: band classifier + `SaturationReport` + Rich panel + interactive confirm |
 | `evolution/tools/tool_judge.py` | ~230 | tool-flavored judge + GEPA-shaped metric with behavioral branch |
+| `evolution/prompts/prompt_judge.py` | ~230 | SaveCallJudge + judge_save_calls Layer-2 content judge + prompt fitness/splice scorers |
 | `evolution/validation/validator.py` | ~220 | mutate + restore live agent file with flock + checksum drift check |
 | `evolution/validation/report.py` | ~225 | ValidationReport JSON + Rich rendering + two-condition decision |
 | `evolution/core/skill_sources.py` | ~210 | Hermes / Claude Code / LocalDir |
@@ -101,15 +109,19 @@ evolution/
 | `evolution/skills/knee_point.py` | ~205 | parsimony-based candidate picker |
 | `evolution/validation/hermes_runner.py` | ~205 | hermes -z subprocess with sandboxed HOME |
 | `evolution/tools/tool_proposer.py` | ~200 | sentinel-preserving reflection prompt |
-| `evolution/validation/artifact_installer.py` | ~150 | byte-precise splice + atomic restore |
+| `evolution/prompts/prompt_proposer.py` | ~160 | sentinel-preserving GEPA proposer for prompt sections |
+| `evolution/validation/artifact_installer.py` | ~150 | byte-precise splice + atomic restore (tool + prompt-section installers) |
+| `evolution/prompts/hermes_prompt_source.py` | ~135 | AST read/write of prompt_builder.py string constants |
+| `evolution/prompts/prompt_module.py` | ~120 | PromptModule passthrough predictor + sentinel parse |
 | `evolution/validation/closed_loop.py` | ~135 | standalone closed-loop CLI |
 | `evolution/skills/skill_module.py` | ~125 | wraps SKILL.md as `dspy.Module` |
 | `evolution/validation/task.py` | ~90 | Task + TaskSuite.from_jsonl |
 | `evolution/core/config.py` | ~80 | `EvolutionConfig` dataclass |
 | `evolution/core/stats.py` | ~60 | `paired_bootstrap` helper |
+| `evolution/prompts/prompt_source.py` | ~55 | PromptSource Protocol + SectionDescriptor |
 | `evolution/validation/agent_runner.py` | ~55 | AgentRunner Protocol + dataclasses |
 | `evolution/core/behavioral_example.py` | ~35 | builder for behavioral dspy.Examples |
-| **Total** | **~9,000** | excludes empty `__init__.py` shims |
+| **Total** | **~10,400** | excludes empty `__init__.py` shims |
 
 Test suite: 61 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1166 tests** collected.
 
@@ -139,11 +151,11 @@ The README's table summarizes intent; reality:
 |---|---|---|---|
 | 1 | Skill files (SKILL.md) | DSPy + GEPA | ✅ implemented in `evolution/skills/` |
 | 2 | Tool descriptions | DSPy + GEPA | ✅ implemented in `evolution/tools/` — MCP-JSON and Hermes-Python-AST adapters; one target tool per run |
-| 3 | System prompt sections | DSPy + GEPA | 🔲 `evolution/prompts/` package exists, empty |
+| 3 | System prompt sections | DSPy + GEPA | ✅ implemented in `evolution/prompts/` — AST splice of `prompt_builder.py` constants; purely-behavioral closed-loop deploy gate (no synthetic signal) |
 | 4 | Tool implementation code | Darwinian Evolver | 🔲 `evolution/code/` package exists, empty; `[darwinian]` extra reserves the dep |
 | 5 | Continuous improvement loop | Automated pipeline | 🔲 `evolution/monitor/` package exists, empty |
 
-Tiers 1 and 2 are built. Tier 3-5 packages exist as empty stubs to anchor the planned architecture. See PLAN.md's per-phase "Deviations from plan" subsections for where the built tiers diverge from the original spec.
+Tiers 1-3 are built. Tier 4-5 packages exist as empty stubs to anchor the planned architecture. See PLAN.md's per-phase "Deviations from plan" subsections for where the built tiers diverge from the original spec.
 
 **Orthogonal validation surface.** `evolution/validation/` runs a real agent (`hermes -z`) through a JSONL task suite with baseline vs evolved artifacts spliced into the live install. Scores actual tool-selection behavior with `expected_tools` / `forbidden_tools` per task; compares with a two-condition decision rule. Available three ways:
 
diff --git a/docs/components.md b/docs/components.md
index 8821142..2d3c85c 100644
--- a/docs/components.md
+++ b/docs/components.md
@@ -368,6 +368,51 @@ Score is **never** modified by `pred_trace` enrichment — GEPA enforces score e
 
 **Cost ceiling + benchmark hook (shared with `evolve_skill`):** `--max-total-cost-usd` participates in the same `CostLedger` kill switch (see `lm_timing_callback.py`); `--benchmark-cmd` is a post-gate shell hook whose env vars include `EVOLVED_PATH` / `BASELINE_PATH` pointing at the rendered manifest JSONs and `ARTIFACT_TYPE="tool_description"`. Both write structured blocks into `gate_decision.json` — see `data_models.md`.
 
+## evolution/prompts/evolve_prompt_section.py — CLI + orchestrator
+
+**Owns:** the end-to-end `evolve_prompt_section()` flow and the Click CLI (`main`) for evolving a named system-prompt section, i.e. a top-level string constant in Hermes `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`). It is the Phase 3 analogue of `evolve_tool`, but with a fundamentally different eval substrate: there is no cheap synthetic classification task for GEPA to score, so **every** candidate is spliced into the live `prompt_builder.py` and run through a real `hermes -z` subprocess. The deploy gate is therefore a `ClosedLoopValidator` win/loss decision, not a paired-bootstrap CI.
+
+**Public surface:**
+- `main()` — Click command. CLI flags map onto `evolve_prompt_section()` kwargs.
+- `evolve_prompt_section(section_name, hermes_repo, tasks_path, ...) -> dict` — orchestrator function. Importable and used directly by tests.
+
+**Integration model: in-place splice + atomic restore.** Unlike skills (which get a separate writable workdir), there is no env-var hook or plugin seam: the section is a constant inside Hermes' own source, so the framework edits that file in place and restores it. The whole evolution runs inside `_prompt_builder_guard(target_path)`, a context manager that takes an atomic `.cl_backup` (`_BACKUP_SUFFIX`), grabs an exclusive `fcntl.flock` on `.cl_validation.lock` (`_LOCK_FILENAME`) in the target's parent dir, and byte-restores the original on exit (it refuses to start on a stale backup or a held lock). These are the *same* lock + backup names `ClosedLoopValidator` uses, so the guard must exit *before* the deploy-gate validator starts and the two are never nested; the deploy gate then re-acquires the lock itself.
+
+**Phases inside `evolve_prompt_section()`:**
+1. Resolve baseline: `HermesPromptSource.read(section_name)` validates the section is a top-level string constant, then reads its text — or `--baseline-override-file` supplies starting text (a deliberately-weakened baseline for headroom, or a regression ablation) while the *live* file is still backed up/restored and `--apply` still writes the live section.
+2. Train/holdout split of the JSONL suite (`_split_train_holdout`, deterministic shuffle+seed, ≥1 task each side; suites with <2 tasks are rejected).
+3. Build the eval stack: `SaveCallJudge` + a per-task Layer-2 factory (`_make_layer2_factory`, binds each task's `expected_save_content` rubric + message into a `score_task`-shaped scorer; returns `None` for tasks with no rubric) → `HermesPromptSectionInstaller` + `HermesAgentRunner` + a `make_memoizing_splice_scorer` over `install_candidate` / `score_task_id`, serialized under a `threading.Lock`.
+4. `dspy.configure(lm=eval_lm)` sets the **global** default LM (not just `dspy.context`) so the passthrough predictor resolves an LM inside GEPA's worker threads — without it, `forward()`'s passthrough call raises "No LM is loaded" in those threads, yielding no trajectories and no proposal.
+5. Inside `_prompt_builder_guard`: saturation pre-flight (baseline behavior on the holdout; aborts/denies on a non-`healthy` band unless `--force-saturation-check`, with non-interactive contexts refusing rather than prompting) followed by GEPA(`PromptModule`, `PromptSectionProposer`, `make_prompt_fitness_metric` + the memoizing splice scorer). Trainset/valset are `_behavioral_examples` (task message + `closed_loop_task_id`).
+6. Select the evolved section via GEPA val-argmax (`detailed_results.best_idx`), reading the body back out of the winning candidate's sentinel region (`_section_text_from_candidate`).
+7. Deploy gate: `ClosedLoopValidator.validate(...)` runs baseline vs evolved on the holdout suite (the same per-task Layer-2 factory + threshold threaded in). `report.decision == "pass"` is the deploy verdict.
+8. Write `gate_decision.json`; on a passing gate `--apply` writes the evolved section back into `prompt_builder.py`. `baseline_section.txt` / `evolved_section.txt` are also emitted.
+
+`_run_one_task_score` is the GEPA in-loop scorer: it materializes the task fixture into a tmp dir, runs the agent against whatever section is currently spliced, calls `score_task`, and returns 1.0 or 0.0 (in-loop abstentions score 0.0; the deploy gate handles abstentions properly). Budget rides the shared `COST_LEDGER` + `CostCeilingExceeded` kill switch; a ceiling abort writes a `cost_ceiling_exceeded` gate decision.
+
+**`gate_decision.json` additions:** `artifact_type: "prompt_section"`, `target_section: <name>`, `baseline_chars` / `evolved_chars` / `growth_pct`, a `closed_loop` block (the validator decision + pass rates + W/L/T), and `sentinel_failures` (proposer candidates rejected for losing the sentinels). `decision_signal` is always `"closed_loop"`. `--create-pr` is **deferred** for prompt sections (it would pollute the diff with the local override-hook commit) and is recorded as `skipped`; use `--apply` + a manual PR.
+
+### Supporting modules (`evolution/prompts/`)
+
+- `prompt_source.py` — `PromptSource` Protocol (`read` + `write` only, `runtime_checkable`) + `SectionDescriptor` (frozen metadata). The Protocol is deliberately minimal — the driver only reads a baseline and writes/splices an evolved value. `list_sections` is a concrete convenience on `HermesPromptSource` (a future `--list-sections` affordance), not part of the contract.
+- `hermes_prompt_source.py` — `HermesPromptSource`, the splice primitive. `read` AST-walks top-level `NAME = "..."` string constants (v1 string-typed only; dict-typed constants like `PLATFORM_HINTS` raise `KeyError`). `write` splices by byte offset using `repr(new_text)` so the literal round-trips byte-equal regardless of embedded quotes/newlines, and `ast.parse`-guards the result before an atomic `os.replace` — it **refuses to write non-parseable Python**, leaving the user's Hermes startable.
+- `prompt_module.py` — `PromptModule(section_name, candidate_text)`: a `dspy.Module` whose `ChainOfThought` passthrough predictor carries the candidate in `signature.instructions` between sentinel markers (`<!-- SECTION:name -->` … `<!-- /SECTION:name -->`). There is no cheap classification to score, so the predictor exists only as a mutation target. `forward()` **must** invoke the passthrough so GEPA captures a trace for `passthrough.predict` — without a traced predictor call, `make_reflective_dataset` finds "no valid predictions" and never proposes a mutation. It returns a placeholder response with `_closed_loop_task_id` + `_candidate_text` attached for the behavioral metric. GEPA discovers the target via `named_predictors()` → `"passthrough.predict"`.
+- `prompt_proposer.py` — `PromptSectionProposer`, a sentinel-preserving GEPA `instruction_proposer` subclassing `BudgetAwareProposer` (inherits the char-budget infrastructure; see `budget_aware_proposer.py`). Runs the proposer LM, then passes the candidate through `extract_and_rebuild` so only the sentinel-delimited region survives. On a candidate that loses the sentinels it increments `sentinel_failures` and **re-raises** `SentinelParseError` rather than returning the parent unchanged — GEPA's reflective-mutation path skips the iteration instead of admitting a phantom identical-to-parent candidate into the selection pool.
+- `prompt_judge.py` —
+  - `SaveCallJudge` — LLM-as-judge scoring an individual memory-save's content against `MEMORY_GUIDANCE`'s rules (durable, declarative, fact-focused; not task progress / PR numbers / completed-work logs). Unparseable judge output falls back to a neutral 0.5 (logged so it's distinguishable from a real mediocre score).
+  - `judge_save_calls` — the Layer-2 aggregate. Only judges `SAVE_ACTIONS = {add, replace}` (the real Hermes `memory` tool actions that carry a `content` payload; `remove` is not a save), caps judged calls at `MAX_JUDGED_CALLS_PER_TASK = 5` (excess score 0 each), and returns a vacuous 1.0 when there are no save calls or no judge/rubric is configured.
+  - `make_prompt_fitness_metric` — the GEPA 5-arg metric. Routes purely behaviorally: a prediction missing `_closed_loop_task_id` is degenerate and scores 0 with a diagnostic; otherwise `closed_loop_scorer(task_id, candidate_text)` runs one closed-loop trial. Appends a `[BUDGET]` feedback line.
+  - `make_memoizing_splice_scorer` — builds `closed_loop_scorer(task_id, candidate_text)` that splices **only when `candidate_text` changes** (consecutive tasks for one candidate reuse the live splice). Serialized under a `threading.Lock` because `dspy.Evaluate` is multi-threaded but `prompt_builder.py` is one shared mutable file — behavioral scoring is therefore effectively serial, an accepted v1 cost of splice-and-restore. Backup/restore is the caller's job (the guard wraps the whole run).
+
+### Shared validation-stack changes that enable the prompt path
+
+These let the prompt path reuse `ClosedLoopValidator` unchanged (see the validation section below for the base machinery):
+
+- `HermesPromptSectionInstaller` (in `artifact_installer.py`) — implements the `ArtifactInstaller` Protocol. `target_path` = `agent/prompt_builder.py`; `install(text_file)` reads the candidate body and calls `HermesPromptSource.write`, returning the post-install `sha256`; `verify_backup` = `verify_python_parses`. Constraint: the section must be a top-level string constant.
+- `ClosedLoopValidator` gained an optional `layer2_judge_factory` (per-task — prompt-section judging needs the task's `expected_save_content` rubric + message, which a single global fn couldn't carry) plus a `layer2_threshold`. When unset, scoring is Layer 1 only and the tool-description path is unchanged.
+- `report.py`'s `score_task` gained the compound Layer 2: when a `layer2_judge_fn` is supplied a task passes only if Layer 1 (trigger membership) passes **and** the judge scores `>= layer2_threshold`. Layer 1 short-circuits — the judge is never called (no LLM cost) on a task that already failed the trigger test, and `test_command` mode ignores Layer 2. The judge receives the subset of `run.tool_calls_with_args` whose name is `memory`. `Task` gained `expected_save_content`; `AgentRunResult` gained `tool_calls_with_args`.
+- `hermes_runner.py` (shared change): reads agent sessions from the SQLite `state.db` (`parse_session_from_db`) since the current one-shot `hermes -z` is ephemeral and no longer writes `session_*.json`. A row whose `tool_calls` column won't parse as JSON aborts with an `error` result (the task **abstains**) rather than being silently read as "no tools."
+
 ## evolution/validation/ — closed-loop validation against a real agent
 
 Drives an actual agent (`HermesAgentRunner` via `hermes -z`) through a small task suite with baseline and evolved artifacts, scores real tool-selection behavior, compares. Orthogonal to skills/tools/prompts/code — measures agent behavior, not artifact production.
@@ -388,6 +433,6 @@ Drives an actual agent (`HermesAgentRunner` via `hermes -z`) through a small tas
 
 **During-evolution integration.** Beyond the standalone CLI, the same `ClosedLoopValidator` powers `evolution/core/closed_loop_feedback.py`'s `ClosedLoopFeedbackCache`. The cache writes the candidate description into a tmp manifest JSON, calls `validator.validate(ValidationInputs(...))` with it as `evolved_artifact`, and caches the returned `ValidationReport` by candidate text. The cache surfaces verdicts to the metric two ways: as a deterministic feedback block on the reflection path (`feedback` mode), or as per-task `TaskResult.passed` reads via `get_task_verdict(candidate, task_id)` for the behavioral-example branch (`trainset` mode). The validator itself doesn't know about the cache; it always sees a `ValidationInputs` with two artifacts and produces a `ValidationReport`.
 
-## evolution/{prompts, code, monitor}/ — planned, empty
+## evolution/{code, monitor}/ — planned, empty
 
-These packages exist as empty stubs anchoring the planned tier-3/4/5 work. See `PLAN.md` for the design.
+These packages exist as empty stubs anchoring the planned tier-4/5 work. See `PLAN.md` for the design. (`prompts/` is now implemented — see the phase-3 section above.)
diff --git a/docs/data_models.md b/docs/data_models.md
index c2455d4..6c0c23e 100644
--- a/docs/data_models.md
+++ b/docs/data_models.md
@@ -555,6 +555,79 @@ Written by `evolution/core/quality_gate.py::append_cl_decision_fields` when the
 | `band_trigger_score` | `dict` | Pre-flight scores that decided whether CL-primary fired. Keys: `holdout` (`float \| None`), `closed_loop` (`float \| None`). |
 | `validator_agent_model` | `str` | The LiteLLM model id used for the closed-loop validator agent. Recorded so historical decisions stay analysable if the default changes. |
 
+### Prompt-section additions (`artifact_type == "prompt_section"`)
+
+Runs of `evolution.prompts.evolve_prompt_section` (Phase 3) write the same `schema_version` "5" envelope but a **deliberately different field set** from the skill/tool variant, because the deploy gate is a closed-loop pass-rate / win-loss decision, **not** a paired-bootstrap one. There is no synthetic classification signal for a system-prompt section — every candidate is scored behaviorally by a real `hermes -z` against a curated suite — so the bootstrap substrate doesn't apply.
+
+```json
+{
+  "schema_version": "5",
+  "artifact_type": "prompt_section",
+  "target_section": "MEMORY_GUIDANCE",
+  "decision": "deploy",                          // "deploy" | "reject" | "denied" | "dry_run" | "aborted"
+  "decision_signal": "closed_loop",              // always "closed_loop" on this path
+  "baseline_chars": 1840,
+  "evolved_chars": 2104,
+  "growth_pct": 0.143,                           // (evolved_chars - baseline_chars) / baseline_chars
+  "closed_loop": {
+    "decision": "pass",                          // "pass" | "regression" (ValidationReport.decision)
+    "decision_reasons": ["pass_rate 0.92 >= baseline 0.75", "n_wins 2 >= 2*n_losses 0"],
+    "baseline_pass_rate": 0.75,
+    "evolved_pass_rate": 0.92,
+    "n_wins": 2,
+    "n_losses": 0,
+    "n_ties": 10
+  },
+  "sentinel_failures": 1,                         // reflection-LM outputs the proposer rejected for breaking sentinel preservation
+  "elapsed_seconds": 412.6,
+  "cost": { /* same shape as cost_summary: total_usd + by_model */ },
+  "run_inputs": { /* models, seed, iterations, holdout ratio, eval_source: "closed_loop", ... */ },
+  "pr_created": { "status": "skipped", "reason": "prompt_section_pr_unsupported", "branch": null, "commit_sha": null, "url": null }
+}
+```
+
+**Fields this variant carries** (and the tool/skill variant does not, or differs on):
+
+| Field | Type | Notes |
+|---|---|---|
+| `artifact_type` | `"prompt_section"` | Disjoint from `"skill"` / `"tool_description"`. |
+| `target_section` | `str` | The `prompt_builder.py` constant whose text was evolved (e.g. `MEMORY_GUIDANCE`). |
+| `decision` | `"deploy" \| "reject" \| "denied" \| "dry_run" \| "aborted"` | `"denied"` lands on a saturation pre-flight default-deny; `"dry_run"` when the run was asked to evaluate without splicing; `"aborted"` on cost-ceiling / interrupt. |
+| `decision_signal` | `"closed_loop"` | Always `"closed_loop"` here — the synthetic value never appears on this path. |
+| `baseline_chars` / `evolved_chars` / `growth_pct` | int / int / float | Size telemetry; growth informs the closed-loop required-gain threshold but is not gated on a bootstrap. |
+| `closed_loop` | `dict` | `{decision, decision_reasons, baseline_pass_rate, evolved_pass_rate, n_wins, n_losses, n_ties}` — the deploy gate's primary evidence (sourced from `ValidationReport` over the behavioral suite). |
+| `sentinel_failures` | `int` | Count of reflection-LM proposals rejected for failing sentinel preservation (same meaning as the tool path). |
+| `elapsed_seconds` / `cost` | float / dict | Wall-clock + per-model cost ledger. |
+| `run_inputs` | `dict` | Run configuration (models, seed, iterations, holdout ratio, `eval_source: "closed_loop"`); the suite path and sha are not recorded here. |
+| `pr_created` | `dict` | Shape-stable with the skill/tool path, but the prompt-section path currently emits a `status: "skipped"` block (PR automation for in-place `prompt_builder.py` splices is not wired). |
+
+**Fields the prompt-section variant deliberately OMITS.** A reader or calibration script must not assume these are present — they exist only on the skill/tool (paired-bootstrap) path:
+
+- `bootstrap` — no per-example bootstrap CI; the gate is win-loss, not a resampled mean.
+- `avg_baseline` / `avg_evolved` — no synthetic holdout mean. The analogous numbers live inside `closed_loop` as `baseline_pass_rate` / `evolved_pass_rate`.
+- `dataset` — there is no synthetic eval dataset and no `dataset` block with per-source/per-category counts; the behavioral suite is the JSONL passed via `--tasks`. `run_inputs` records the run config (models, seed, iterations, holdout-ratio, `eval_source: "closed_loop"`), not the suite path or sha.
+- `knee_point` — Pareto knee-point selection over a synthetic valset doesn't apply; candidates are chosen on behavioral score.
+
+#### Saturation-denied variant (prompt section)
+
+When the saturation pre-flight default-denies (non-healthy band, non-interactive context, no `--force-saturation-check`), the prompt-section gate writes `decision: "denied"` and carries a `saturation_band` field naming the band that triggered the denial:
+
+```json
+{
+  "schema_version": "5",
+  "artifact_type": "prompt_section",
+  "target_section": "MEMORY_GUIDANCE",
+  "decision": "denied",
+  "decision_signal": "closed_loop",
+  "saturation_band": "no_headroom",              // "healthy" never lands here; one of no_headroom | weak_signal | uniform_failure
+  "baseline_chars": 1840,
+  "run_inputs": { /* ... */ },
+  "pr_created": { "status": "skipped", "reason": "prompt_section_pr_unsupported", "branch": null, "commit_sha": null, "url": null }
+}
+```
+
+`saturation_band` appears only on the `"denied"` decision (it records why the run never started); it is absent on `deploy` / `reject` / `dry_run`.
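A reader script can stay safe across these variants by branching on `artifact_type` and `decision` and by never indexing the omitted fields directly. The following is a minimal sketch under that assumption; `summarize_gate` is a hypothetical helper, not part of the framework, and it relies only on the field names documented above.

```python
from typing import Any


def summarize_gate(gate: dict[str, Any]) -> str:
    """Return a one-line summary without assuming skill/tool-only fields exist."""
    decision = gate.get("decision", "unknown")
    if gate.get("artifact_type") != "prompt_section":
        # Other variants are out of scope for this sketch.
        return f"{decision} (non-prompt-section variant)"

    section = gate.get("target_section", "?")
    if decision == "denied":
        # saturation_band is documented as present only on denied runs.
        band = gate.get("saturation_band", "unknown band")
        return f"{section}: denied ({band})"

    cl = gate.get("closed_loop") or {}
    if not cl:
        # No behavioral evidence recorded (for example, a run that stopped early).
        return f"{section}: {decision} (no closed_loop block)"

    # The pass rates live inside closed_loop, not in avg_baseline / avg_evolved.
    return (
        f"{section}: {decision}, pass-rate "
        f"{cl.get('baseline_pass_rate', 0.0):.2f} -> {cl.get('evolved_pass_rate', 0.0):.2f}, "
        f"{cl.get('n_wins', 0)}W/{cl.get('n_losses', 0)}L/{cl.get('n_ties', 0)}T"
    )
```

Using `.get` throughout means a calibration script written this way degrades to a readable summary instead of raising `KeyError` on `bootstrap`, `dataset`, or `knee_point`.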
+
 ## metrics.json (deploy-only summary)
 
 Written to `output/<skill>/<timestamp>/metrics.json` only on deploy. Top-level summary for quick scanning:
diff --git a/docs/index.md b/docs/index.md
index 1d810c8..dfe14e0 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -15,6 +15,7 @@ The codebase is mid-sized (~9K LOC of source + 61 test files / ~1166 tests) and
 | **What this project is** | `codebase_info.md` → `architecture.md` → repo-root `README.md` |
 | **How a skill run works end-to-end** | `workflows.md` (Workflow 1) → `architecture.md` (top-level flow) |
 | **How a tool-description run works end-to-end** | `workflows.md` (Workflow 9) → `components.md` (`evolve_tool.py`) |
+| **How a prompt-section run works end-to-end** | `workflows.md` (Workflow 12) → `components.md` (`evolve_prompt_section.py`) |
 | **What flag does X / how to run the CLI** | `interfaces.md` (CLI section) |
 | **Why the deploy gate rejected a run** | `data_models.md` (gate_decision.json) → `components.md` (`constraints.py`) |
 | **What's in `gate_decision.json` / `metrics.json`** | `data_models.md` (full schema with examples) |
@@ -53,7 +54,7 @@ The codebase is mid-sized (~9K LOC of source + 61 test files / ~1166 tests) and
 | [`components.md`](components.md) | Per-module reference: what each owns, public surface, load-bearing implementation notes |
 | [`interfaces.md`](interfaces.md) | CLIs (skill, tool, closed-loop, sessiondb importer), Python API, SkillSource + ToolSource Protocols, output artifacts, DSPy + litellm integration, test surfaces, env vars |
 | [`data_models.md`](data_models.md) | All dataclasses, on-disk formats, full `gate_decision.json` schema with worked examples, `ValidationReport` schema |
-| [`workflows.md`](workflows.md) | Step-by-step workflows with mermaid sequence diagrams: skill deploy path, reject paths, GEPA→MIPROv2 fallback, sessiondb mining, tool evolution, closed-loop validation, closed-loop signal during evolution |
+| [`workflows.md`](workflows.md) | Step-by-step workflows with mermaid sequence diagrams: skill deploy path, reject paths, GEPA→MIPROv2 fallback, sessiondb mining, tool evolution, closed-loop validation, closed-loop signal during evolution, prompt-section evolution |
 | [`dependencies.md`](dependencies.md) | Each external package — what it's used for, why it's pinned, what we don't depend on |
 | [`framework_advantages.md`](framework_advantages.md) | User-facing explainer of how this framework's selection layer, deploy gate, proposer, and composite fitness differ from raw DSPy + GEPA — and when raw GEPA is the right choice |
 
diff --git a/docs/interfaces.md b/docs/interfaces.md
index 412291b..4476fd3 100644
--- a/docs/interfaces.md
+++ b/docs/interfaces.md
@@ -140,6 +140,46 @@ Evolves one tool's top-level `description` field inside an MCP-shape manifest. T
 - `sys.exit(1)` if the holdout split has fewer than `min_holdout_size` (default 10) examples.
 - Returns normally (rejection path) if static or growth-quality gate fails — `evolved_FAILED.json` + `gate_decision.json` are written.
 
+## CLI: `python -m evolution.prompts.evolve_prompt_section`
+
+Evolves one named section of an agent's system prompt — a top-level string constant in Hermes Agent's `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`). Unlike the skill and tool paths, evaluation is **purely behavioral**: there is no synthetic LLM-judge signal. Every candidate is spliced into the live `prompt_builder.py` and scored by running the real agent (`hermes -z`) against the task suite, so the deploy gate is a `ClosedLoopValidator` run (pass-rate + win/loss), not a paired-bootstrap CI over judge scores.
+
+The verdict is **compound**: Layer 1 is the same `expected_tools` / `forbidden_tools` membership rule as the closed-loop tool path; Layer 2 is an LLM judge that scores each `memory(action=add|replace)` call's content against the task's `expected_save_content` rubric (only tasks that declare a rubric are Layer-2 judged). The candidate is spliced in for the duration of the run and the file is restored byte-for-byte afterward, reusing the tool-path backup + flock + checksum-drift machinery.
+
+### Required flags
+| Flag | Purpose |
+|---|---|
+| `--section <name>` | The `prompt_builder.py` top-level string constant to evolve (e.g. `MEMORY_GUIDANCE`). Dict-typed constants (e.g. `PLATFORM_HINTS`) are not supported. |
+| `--hermes-repo <path>` | Path to your hermes-agent checkout. `agent/prompt_builder.py` inside it is the splice/restore target. |
+| `--tasks <path>` | JSONL eval suite (e.g. `evolution/validation/suites/memory_guidance.jsonl`). Same task shape as the closed-loop tool suite, plus an optional `expected_save_content` rubric per task for Layer 2. Must contain ≥2 tasks (so the split yields a non-empty trainset and holdout). |
+
+### Optional flags
+| Flag | Default | Notes |
+|---|---|---|
+| `--iterations <int>` | `10` | GEPA `max_full_evals`. |
+| `--holdout-ratio <float>` | `0.5` | Fraction of tasks held out for the deploy gate. Clamped to keep both the trainset and holdout non-empty. |
+| `--seed <int>` | `42` | RNG seed for the train/holdout split and GEPA. |
+| `--max-growth <float>` | `0.2` | Section length budget as a fraction over the baseline. The budget is passed to the `PromptSectionProposer` so candidates stay near the baseline length (set higher when evolving from a short baseline that needs to grow). |
+| `--optimizer-model` / `--reflection-model` / `--eval-model <name>` | config default | Per-role LiteLLM model overrides; resolved like the other CLIs. `--eval-model` is the Layer 2 content judge. |
+| `--agent-model <name>` | config default | The model the `hermes -z` agent itself runs as. A deliberately weaker agent exposes more behavioral signal (a strong agent saturates the suite regardless of the prompt). LiteLLM provider prefixes are stripped before `hermes -m`. |
+| `--layer2-threshold <float>` | `0.7` | Minimum mean content-judge score for a save task to pass Layer 2. |
+| `--task-timeout-seconds <int>` | `120` | Per-task wall-clock cap for `hermes -z`. Timeouts abstain (don't tip the decision). |
+| `--max-cost-usd <float>` | `150.0` | Abort cleanly when cumulative **in-process** LM cost (judge + reflection + the passthrough predictor) exceeds this. The agent's own LM spend happens inside the `hermes` child process and is not captured by this ceiling. |
+| `--gepa-minibatch-size <int>` | `3` | GEPA reflective minibatch size; same meaning as the other paths. |
+| `--gepa-acceptance {improvement-or-equal,strict-improvement}` | `improvement-or-equal` | Same meaning as the other paths. |
+| `--apply` | off | On a deploy decision, write the evolved section into `prompt_builder.py` in place (byte-precise AST splice, `ast.parse`-guarded, atomic). |
+| `--create-pr` | off | **Deferred for prompt sections** — accepted and recorded as a `skipped` PR block in `gate_decision.json`, but no PR is opened (copying a full evolved `prompt_builder.py` over `origin/<base>` would carry unrelated local changes into the diff). Use `--apply` + a manual PR. |
+| `--baseline-override-file <path>` | off | Start evolution from this text instead of the live section. The live section is still the splice/restore target (backed up + restored); `--apply` still writes the evolved text. Use it to create headroom on an already-tuned section (e.g. a deliberately-weakened baseline) or for regression-injection ablations. |
+| `--skip-saturation-check` | off | Skip the saturation pre-flight entirely. |
+| `--force-saturation-check` | off | Run the pre-flight, render the panel, but proceed regardless of band — required to override a non-`healthy` verdict non-interactively. |
+| `--dry-run` | off | Resolve the baseline + build the modules, then stop — exercises wiring with no LM/agent calls. Writes a `decision="dry_run"` `gate_decision.json`. |
+| `--output-dir <path>` | `output/prompts/<section>/<timestamp>/` | Where `gate_decision.json` and the baseline/evolved section text files land. |
+
+### Exit conditions
+- `0` on a `deploy` decision (or a `--dry-run`).
+- `1` on `reject` (the holdout deploy gate did not pass: a pass-rate regression or too few wins relative to losses), `denied` (saturated baseline default-denied non-interactively), or `aborted` (cost ceiling).
+- `ValueError` at startup if the suite has fewer than 2 tasks.
+
 ## CLI: `python -m evolution.core.external_importers`
 
 Standalone session-history importer. Useful for previewing what `--eval-source sessiondb` would produce without running the full evolution.
diff --git a/docs/workflows.md b/docs/workflows.md
index eb80148..e686908 100644
--- a/docs/workflows.md
+++ b/docs/workflows.md
@@ -545,6 +545,158 @@ When your daily-driver Hermes model is capable enough to solve every textbook bu
 
 Manual smoke harness: `tests/manual/skill_closed_loop_smoke.py` (supports `--suite {basic,advanced}`, `--agent-model MODEL`, `--task-timeout-seconds N`).
 
+## Workflow 12: Evolve a prompt section (deploy path)
+
+The prompt-section analog of Workflow 9 (tool descriptions), but **purely behavioral** end to end. There is no synthetic judge dataset and no paired-bootstrap gate: every candidate is spliced into the live `prompt_builder.py` and scored by a real `hermes -z` subprocess, and the deploy gate is a `ClosedLoopValidator` run. Three structural contrasts with the tool path:
+
+- **Integration is in-place splice-and-restore**, not an MCP manifest rewrite or a copied skill directory. The target is a single named string constant inside the user's `prompt_builder.py`; the harness backs it up byte-for-byte and restores it on exit.
+- **The deploy gate is closed-loop pass-rate / win-loss**, not a paired-bootstrap confidence interval. Decision = pass-rate no-regression + `n_wins >= 2 * n_losses` (the `ClosedLoopValidator.decide` rule), all behavioral.
+- **PR automation is deferred.** `--create-pr` is recorded as `skipped`; deploy means `--apply` writes the evolved section into `prompt_builder.py` in place, and the user opens a PR by hand.
+
+```bash
+python -m evolution.prompts.evolve_prompt_section \
+    --section MEMORY_GUIDANCE \
+    --hermes-repo ~/src/NousResearch/hermes-agent \
+    --tasks evolution/validation/suites/memory_guidance.jsonl \
+    --iterations 10 \
+    --apply
+```
+
+### Phase A — Setup: resolve baseline, split, build the behavioral harness
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant CLI as evolve_prompt_section
+    participant Src as HermesPromptSource
+    participant Suite as TaskSuite
+    participant Judge as SaveCallJudge
+    participant Inst as HermesPromptSectionInstaller
+    participant Run as HermesAgentRunner
+    participant V as ClosedLoopValidator
+
+    CLI->>Src: read(section_name) — validate it exists / is a string constant
+    alt --baseline-override-file
+        CLI->>CLI: baseline_text = override_file.read_text()
+    else
+        CLI->>Src: baseline_text = read(section_name)
+    end
+    CLI->>Suite: TaskSuite.from_jsonl(tasks) — reject < 2 tasks
+    CLI->>CLI: _split_train_holdout(seed) — ≥1 task each side
+    CLI->>Judge: SaveCallJudge(config)  → layer2_factory(task)
+    CLI->>Inst: HermesPromptSectionInstaller(repo, section)
+    CLI->>Run: HermesAgentRunner(timeout, agent_model?)
+    CLI->>V: ClosedLoopValidator(installer, runner, layer2_judge_factory, layer2_threshold)
+```
+
+The baseline is the **live section text** unless `--baseline-override-file` points evolution at arbitrary text — e.g. a deliberately-weakened baseline to manufacture headroom, or a regression-injection ablation. The override only changes where evolution *starts*; the guard still backs up and restores the real file, and `--apply` writes the evolved text back into the live section. The suite floor is 2 tasks so the seeded split yields a non-empty GEPA trainset **and** a non-empty deploy-gate holdout.
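The seeded split with its 2-task floor can be sketched as follows. This is an illustration only: `split_train_holdout` is a hypothetical stand-in for the real `_split_train_holdout`, and the shuffle-then-slice strategy and the rounding are assumptions. Only the defaults (`0.5`, `42`), the `ValueError` on fewer than 2 tasks, and the clamp that keeps both sides non-empty come from the text above.

```python
import random


def split_train_holdout(task_ids, holdout_ratio=0.5, seed=42):
    """Deterministically split task ids into (trainset, holdout), both non-empty."""
    if len(task_ids) < 2:
        raise ValueError("suite needs at least 2 tasks to split into train + holdout")

    shuffled = list(task_ids)
    random.Random(seed).shuffle(shuffled)  # local RNG, so the split is reproducible

    # Clamp the holdout size so neither side can end up empty.
    n_holdout = round(len(shuffled) * holdout_ratio)
    n_holdout = max(1, min(len(shuffled) - 1, n_holdout))

    return shuffled[n_holdout:], shuffled[:n_holdout]
```

With the clamp in place, even extreme ratios such as `0.0` or `1.0` still leave one task on the smaller side.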
+
+### Phase B — Configure the global LM, then enter the guard
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant CLI as evolve_prompt_section
+    participant Scorer as memoizing_splice_scorer
+    participant Metric as prompt_fitness_metric
+    participant LM as eval_lm
+    participant DSPy as dspy.configure
+
+    CLI->>Scorer: make_memoizing_splice_scorer(install_fn=source.write, score_fn=run_one_task, lock)
+    CLI->>Metric: make_prompt_fitness_metric(baseline_text, max_growth, closed_loop_scorer=scorer)
+    CLI->>LM: instantiate eval_lm (role=eval, temp=0)
+    CLI->>DSPy: dspy.configure(lm=eval_lm, callbacks=[LMTimingCallback()])
+    Note over CLI,DSPy: global LM set so GEPA worker threads can run PromptModule's<br/>passthrough predictor — the pre-flight's dspy.context doesn't reach them
+```
+
+The `closed_loop_scorer` is the spine of behavioral scoring: `score(task_id, candidate_text)` splices the candidate into the live `prompt_builder.py` **only when it changes** (consecutive tasks for the same candidate reuse the live splice), runs the task via `hermes -z`, and reads the session back from the sandbox `state.db`. The splice+run is serialized under one `threading.Lock` because `dspy.Evaluate` scores with a thread pool but the spliced file is a single shared mutable resource — behavioral scoring is therefore effectively serial, an accepted v1 cost. The explicit `dspy.configure` is load-bearing: `dspy.context` inside the saturation pre-flight does **not** propagate into GEPA's worker threads, so without the global LM the passthrough predictor raises "No LM is loaded" → no trajectories → no proposal.
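The splice-only-on-change behavior and the single lock can be sketched like this. It is a simplified illustration, not the real `make_memoizing_splice_scorer`: the factory name `make_splice_scorer`, the callback signatures, and the way the live candidate is tracked are all assumptions.

```python
import threading
from typing import Callable, Optional


def make_splice_scorer(
    install_fn: Callable[[str], None],
    score_fn: Callable[[str], float],
) -> Callable[[str, str], float]:
    """Build score(task_id, candidate_text) that re-splices only when the text changes."""
    lock = threading.Lock()
    live_text: Optional[str] = None  # candidate currently spliced into the shared file

    def score(task_id: str, candidate_text: str) -> float:
        nonlocal live_text
        # One lock covers splice + run: the spliced file is a single shared resource,
        # so concurrent callers are forced to take turns.
        with lock:
            if candidate_text != live_text:
                install_fn(candidate_text)
                live_text = candidate_text
            return score_fn(task_id)

    return score
```

Holding the lock across both the splice and the task run is what makes scoring effectively serial: a second thread cannot swap the file while the first thread's agent is still reading it.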
+
+### Phase C — Inside the guard: saturation pre-flight, then GEPA
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant CLI as evolve_prompt_section
+    participant Guard as _prompt_builder_guard
+    participant FS as live prompt_builder.py
+    participant Sat as saturation_preflight
+    participant GEPA as dspy.GEPA
+    participant PM as PromptModule
+    participant Prop as PromptSectionProposer
+    participant Scorer as splice scorer
+    participant H as hermes -z + state.db
+
+    CLI->>Guard: enter(installer.target_path)
+    Guard->>FS: refuse if stale .cl_backup; flock parent dir (LOCK_EX|NB)
+    Guard->>FS: atomic_write_bytes(.cl_backup, target.read_bytes())
+    opt not --skip-saturation-check
+        CLI->>Sat: saturation_preflight(baseline_module, holdout, metric, eval_lm, baseline_text)
+        Sat->>Scorer: behavioral score of baseline on each holdout task
+        Sat-->>CLI: SaturationReport(band, ...)
+        alt band != healthy
+            alt --force-saturation-check
+                Note over CLI: proceed regardless
+            else non-interactive
+                CLI->>FS: write gate_decision.json (decision=denied, reason=saturated_baseline)
+                Note over CLI: return — GEPA never runs (default-deny)
+            else interactive
+                CLI->>CLI: prompt "Continue anyway? [y/N]"
+            end
+        end
+    end
+    CLI->>GEPA: compile(PromptModule(baseline), trainset, valset, instruction_proposer=PromptSectionProposer)
+    loop per iteration
+        GEPA->>PM: forward(task, closed_loop_task_id) — candidate in sentinel region of predictor instructions
+        PM-->>GEPA: Prediction(_candidate_text, _closed_loop_task_id)
+        GEPA->>Scorer: metric → closed_loop_scorer(task_id, candidate_text)
+        Scorer->>FS: splice candidate into live section (only if changed)
+        Scorer->>H: run task; read session from sandbox state.db
+        H-->>Scorer: tool_calls_with_args + final text
+        Scorer->>Scorer: compound verdict = Layer 1 (memory fired?) + Layer 2 (judge on memory add/replace content)
+        Scorer-->>GEPA: score ∈ {0.0, 1.0}
+        GEPA->>Prop: reflect on failures → sentinel-preserving candidate
+    end
+    GEPA-->>CLI: optimized module with detailed_results
+    CLI->>Guard: exit → atomic_write_bytes(target, .cl_backup); unlink backup; release flock
+```
+
+Everything that mutates the file lives **inside** the guard, which holds an exclusive `flock` (the same lock name the deploy-gate `ClosedLoopValidator` uses — sequenced before it, never nested) and restores the original bytes on exit. The saturation pre-flight scores the baseline behaviorally on the holdout; a non-`healthy` band (e.g. `no_headroom` on an already-tuned section) **default-denies in non-interactive contexts** unless `--force-saturation-check`, writing a `decision="denied"` gate before GEPA spends a cent. The compound per-task verdict is two layers: **Layer 1** is trigger membership (did the `memory` tool fire, via `expected_tools` / `forbidden_tools`), **Layer 2** is the `SaveCallJudge` scoring `memory(action=add|replace)` content against the task's `expected_save_content` rubric (`remove` is not a save; a passing Layer 1 with no save action scores a vacuous 1.0 on Layer 2). GEPA mutates only the sentinel-delimited region of the passthrough predictor's instructions; the `PromptSectionProposer` rejects any reflection-LM output that fails sentinel preservation.
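The guard sequence in the diagram (refuse a stale backup, lock, back up, restore on exit) can be sketched as a context manager. This is a POSIX-only illustration under stated assumptions: `file_guard` is a hypothetical name, the backup path `<target>.cl_backup` and the temp-file naming are guesses, and the real `_prompt_builder_guard` also does checksum-drift detection, which is omitted here.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write to a sibling temp file, then rename over the destination."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    os.replace(tmp, path)  # atomic rename on the same filesystem


@contextmanager
def file_guard(target: Path):
    """Back up `target`, yield for mutation, and restore the original bytes on exit."""
    backup = target.with_name(target.name + ".cl_backup")  # assumed naming
    if backup.exists():
        raise RuntimeError(f"stale backup found, refusing to start: {backup}")

    dir_fd = os.open(target.parent, os.O_RDONLY)
    try:
        # Non-blocking exclusive lock on the parent directory;
        # raises BlockingIOError if another run already holds it.
        fcntl.flock(dir_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            atomic_write_bytes(backup, target.read_bytes())
            try:
                yield target
            finally:
                # Restore runs even if the body raised.
                atomic_write_bytes(target, backup.read_bytes())
                backup.unlink()
        finally:
            fcntl.flock(dir_fd, fcntl.LOCK_UN)
    finally:
        os.close(dir_fd)
```

Because the restore sits in a `finally` block, an exception during scoring still leaves the file byte-identical to the original, and the stale-backup check catches the one case this cannot cover: a process killed before `finally` runs.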
+
+### Phase D — Deploy gate (closed-loop on the holdout), persist, apply
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant CLI as evolve_prompt_section
+    participant Sel as candidate selection
+    participant V as ClosedLoopValidator
+    participant Inst as HermesPromptSectionInstaller
+    participant FS as live prompt_builder.py
+    participant H as hermes -z
+    participant Src as HermesPromptSource
+
+    Note over CLI: guard already exited — file restored to baseline
+    CLI->>Sel: evolved_text = section_from_candidate(best_idx)  # GEPA val-argmax
+    CLI->>FS: write baseline_section.txt + evolved_section.txt
+    CLI->>V: validate(ValidationInputs(section, holdout_suite, baseline_file, evolved_file))
+    Note over V: own backup/restore + flock — independent of the Phase C guard
+    loop baseline phase, then evolved phase
+        V->>Inst: install(section_file) — splice into live prompt_builder.py
+        loop each holdout task
+            V->>H: run task; score Layer 1 + Layer 2 via layer2_judge_factory
+        end
+    end
+    V-->>CLI: ValidationReport(baseline_pass_rate, evolved_pass_rate, n_wins/n_losses, decision)
+    CLI->>FS: write gate_decision.json (artifact_type="prompt_section", decision=deploy|reject)
+    alt decision == deploy AND --apply
+        CLI->>Src: write(section_name, evolved_text) — live section updated in place
+    end
+```
+
+The selected candidate is GEPA's val-argmax (`detailed_results.best_idx`) — there's no knee-point parsimony pass on the prompt-section path. The deploy gate is a fresh `ClosedLoopValidator.validate` over the **holdout** suite, with its own backup/restore + `flock` (it runs after the Phase C guard has already exited and restored the file, so the two never nest). Its decision is closed-loop only: pass-rate no-regression plus `n_wins >= 2 * n_losses`. The gate decision is written with `artifact_type="prompt_section"`, `target_section`, `baseline_chars` / `evolved_chars` / `growth_pct`, a `closed_loop` block (both pass-rates + win/loss/tie counts), and `sentinel_failures`. `--create-pr` records a `skipped` PR block (deferred for sections); `--apply` is the only way to ship, writing the evolved text into the live section.
+
+**Empirical anchors.** The real `MEMORY_GUIDANCE` section saturates — it scored 1.0 across the holdout (`no_headroom` band) and the harness correctly default-denied a non-interactive run before GEPA started. To exercise the full deploy path, an adversarially-weakened baseline (via `--baseline-override-file`) evolved `0.67 → 1.00` pass-rate with 2 wins / 0 losses on the holdout, clearing the closed-loop gate and deploying. The saturating-real-section result is the expected, correct outcome, not a bug: there is no headroom to evolve into when the section already passes every behavioral task.
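The gate rule as stated (pass-rate no-regression plus `n_wins >= 2 * n_losses`) can be checked against the numbers above. This is a sketch of the rule as written in this document, not the actual `ClosedLoopValidator.decide` implementation, which may apply further conditions; `gate_decision` and its reason strings are hypothetical.

```python
def gate_decision(baseline_pass_rate, evolved_pass_rate, n_wins, n_losses):
    """Apply the two stated conditions and return (decision, reasons)."""
    reasons = []
    if evolved_pass_rate < baseline_pass_rate:
        reasons.append("pass-rate regression")
    if n_wins < 2 * n_losses:
        reasons.append("wins below 2x losses")
    return ("reject" if reasons else "deploy"), reasons


# The weakened-baseline run: 0.67 -> 1.00 with 2 wins and 0 losses.
decision, reasons = gate_decision(0.67, 1.00, n_wins=2, n_losses=0)
```

For that run both conditions hold (`1.00 >= 0.67` and `2 >= 2 * 0`), so the sketch returns `deploy` with no reasons, matching the reported outcome.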
+
 ## Failure-mode summary
 
 | Trigger | Outcome | Where to look |
@@ -565,3 +717,8 @@ Manual smoke harness: `tests/manual/skill_closed_loop_smoke.py` (supports `--sui
 | Closed-loop validator concurrent run | `ConcurrentRunError` (`fcntl.flock` non-blocking acquire fails) | console only |
 | Closed-loop validator drift between tasks | `ChecksumDriftError` after the offending task; phase aborts, restore still runs | run.log + raised error |
 | Closed-loop cache validator failure during evolution | `WARNING` logged, cache returns `None`, GEPA continues without the verdict — never aborts the run | run.log |
+| Prompt-section suite < 2 tasks | `ValueError` (can't split into non-empty train + holdout) | console only |
+| Prompt-section stale `.cl_backup` on guard entry | `RuntimeError` naming the backup file; refuses to start | console only |
+| Prompt-section saturated baseline, non-interactive | `decision="denied"` `gate_decision.json`; GEPA never runs (override with `--force-saturation-check`) | `gate_decision.json` (`saturation_band`) |
+| Prompt-section closed-loop gate rejects | `decision="reject"` `reason="closed_loop_gate"`; section not applied | `gate_decision.json` (`closed_loop` block) |
+| Prompt-section `--create-pr` | recorded as `skipped` (PR automation deferred); use `--apply` + manual PR | `gate_decision.json` (`pr_created` block) |
diff --git a/generate_report.py b/generate_report.py
index 3008116..b7a1ec7 100644
--- a/generate_report.py
+++ b/generate_report.py
@@ -45,13 +45,81 @@
 DEFAULT_LOGO = REPO_ROOT / "assets" / "dna.png"
 
 
+def _extract_prompt_section_data(gate: dict, run_dir: Path) -> dict[str, Any]:
+    """Build the render context for a Phase 3 prompt-section run.
+
+    The prompt-section path is behavioral-only — its gate_decision carries a
+    ``closed_loop`` pass-rate / win-loss block instead of the skill/tool
+    bootstrap-CI + synthetic-dataset + knee-point fields, and self-sources
+    cost/timing/call-count (no metrics.json needed). The ``_experiment`` and
+    ``_results`` renderers branch on ``artifact_type`` to lay out the matching
+    tables; every other section is prose-driven via the keys returned here.
+    """
+    cl = gate.get("closed_loop", {})
+    cost = gate.get("cost", {})
+    resolved = (gate.get("run_inputs", {}) or {}).get("resolved_lms", {})
+
+    n_wins = int(cl.get("n_wins", 0))
+    n_losses = int(cl.get("n_losses", 0))
+    n_ties = int(cl.get("n_ties", 0))
+    cl_total = n_wins + n_losses + n_ties
+    baseline_rate = float(cl.get("baseline_pass_rate", 0.0))
+    evolved_rate = float(cl.get("evolved_pass_rate", 0.0))
+    cl_baseline_pass = round(baseline_rate * cl_total)
+    cl_evolved_pass = round(evolved_rate * cl_total)
+    elapsed = int(float(gate.get("elapsed_seconds", 0)))
+    lm_calls = sum(int(m.get("calls", 0)) for m in (cost.get("by_model") or {}).values())
+    decision = gate.get("decision", "")
+
+    def _model(role: str) -> str:
+        return (resolved.get(role) or {}).get("model", "—")
+
+    return {
+        "artifact_type": "prompt_section",
+        "skill_name": gate.get("target_section", run_dir.parent.name),
+        "section_name": gate.get("target_section", ""),
+        "baseline_chars": int(gate.get("baseline_chars", 0)),
+        "evolved_chars": int(gate.get("evolved_chars", 0)),
+        "growth_pct": float(gate.get("growth_pct", 0.0)),
+        "growth_abs_pct": abs(float(gate.get("growth_pct", 0.0))),
+        "decision": decision,
+        "decision_upper": "DEPLOYED" if decision == "deploy" else "REJECTED",
+        "decision_signal": gate.get("decision_signal", "closed_loop"),
+        "baseline_pass_rate": baseline_rate,
+        "evolved_pass_rate": evolved_rate,
+        "baseline_pass_pct": baseline_rate * 100,
+        "evolved_pass_pct": evolved_rate * 100,
+        "cl_baseline_pass": cl_baseline_pass,
+        "cl_evolved_pass": cl_evolved_pass,
+        "cl_total_tasks": cl_total,
+        "cl_tasks_gained": cl_evolved_pass - cl_baseline_pass,
+        "n_wins": n_wins,
+        "n_losses": n_losses,
+        "n_ties": n_ties,
+        "elapsed_seconds": elapsed,
+        "elapsed_minutes": elapsed // 60,
+        "cost_total_usd": float(cost.get("total_usd", 0.0)),
+        "lm_calls_metrics": lm_calls,
+        "optimizer_lm": _model("optimizer"),
+        "reflection_lm": _model("reflection"),
+        "eval_lm": _model("eval"),
+        "saturation_band": gate.get("saturation_band", ""),
+        "sentinel_failures": int(gate.get("sentinel_failures", 0)),
+        "decision_reasons": "; ".join(cl.get("decision_reasons", [])),
+    }
+
+
 def _extract_run_data(run_dir: Path) -> dict[str, Any]:
     """Pull all numbers the renderer needs from a run dir.
 
     Reads gate_decision.json (always present) + metrics.json (deploy only) +
-    run.log (LM call counts grep'd from timing-callback lines).
+    run.log (LM call counts grep'd from timing-callback lines). Prompt-section
+    (Phase 3) runs are behavioral-only and self-source from gate_decision.json
+    alone — see ``_extract_prompt_section_data``.
     """
     gate = json.loads((run_dir / "gate_decision.json").read_text())
+    if gate.get("artifact_type") == "prompt_section":
+        return _extract_prompt_section_data(gate, run_dir)
     metrics_path = run_dir / "metrics.json"
     metrics = json.loads(metrics_path.read_text()) if metrics_path.is_file() else {}
 
@@ -442,7 +510,7 @@ def _approach(prose: dict, ctx: dict, styles) -> list:
     ap = prose["approach"]
     engines = ap["engines"]
     flow = [
-        Paragraph("Approach: Evolutionary Skill Optimization", styles['SectionHead']),
+        Paragraph(ap.get("section_title", "Approach: Evolutionary Optimization"), styles['SectionHead']),
         Paragraph("Three Optimization Engines", styles['SubSection']),
         _highlight_table(
             header=engines["header"],
@@ -463,9 +531,57 @@ def _approach(prose: dict, ctx: dict, styles) -> list:
     return flow
 
 
+def _experiment_prompt_section(exp: dict, overrides: dict, ctx: dict, styles) -> list:
+    """Phase 3 experiment section: behavioral config (no synthetic eval set,
+    no knee-point), and the suite is described in prose rather than via a
+    train.jsonl examples table (prompt-section runs don't write one)."""
+    config_rows = [
+        ['Target Section', _fmt(overrides["target_section_label"], ctx)],
+        ['Baseline Size', f'{ctx["baseline_chars"]:,} characters'],
+        ['Optimizer LM', _fmt(overrides["optimizer_lm"], ctx)],
+        ['Reflection LM (GEPA)', _fmt(overrides["reflection_lm"], ctx)],
+        ['Content-Judge LM (Layer 2)', _fmt(overrides["eval_judge_lm"], ctx)],
+        ['Agent (hermes -z)', _fmt(overrides["agent_lm"], ctx)],
+        ['Behavioral Suite', f'{ctx["cl_total_tasks"]} holdout tasks (real hermes -z, scored end-to-end)'],
+        ['Total Optimization Time',
+         f'{ctx["elapsed_seconds"]:,} seconds (~{ctx["elapsed_minutes"]} minutes)'],
+        ['Total LM Calls (in-process)', f'{ctx["lm_calls_metrics"]:,}'],
+        ['Total Cost (USD, in-process)', f'${ctx["cost_total_usd"]:.2f}'],
+        ['Deploy Gate', _fmt(overrides["quality_gate_label"], ctx)],
+        ['Saturation Pre-flight', _fmt(overrides["saturation_label"], ctx)],
+    ]
+    config_data = [[_wrap_cell(c, styles['TableHeaderCell']) for c in ['Parameter', 'Value']]]
+    config_data += [[_wrap_cell(c, styles['TableCell']) for c in row] for row in config_rows]
+    config_table = Table(config_data, colWidths=[2.0 * inch, 4.0 * inch])
+    config_table.setStyle(TableStyle([
+        ('BACKGROUND', (0, 0), (-1, 0), HexColor('#1a1a2e')),
+        ('GRID', (0, 0), (-1, -1), 0.5, HexColor('#cccccc')),
+        ('TOPPADDING', (0, 0), (-1, -1), 5),
+        ('BOTTOMPADDING', (0, 0), (-1, -1), 5),
+        ('LEFTPADDING', (0, 0), (-1, -1), 8),
+    ]))
+    return [
+        Paragraph(exp.get("section_title", "Experiment"), styles['SectionHead']),
+        Paragraph("Configuration", styles['SubSection']),
+        config_table,
+        Paragraph("Evaluation Suite", styles['SubSection']),
+        Paragraph(_fmt(exp["dataset_intro"], ctx), styles['BodyJust']),
+        Paragraph("Fitness Function", styles['SubSection']),
+        Paragraph(_fmt(exp["fitness_intro"], ctx), styles['BodyJust']),
+        Paragraph(
+            f"<font face='Courier' size=9>{exp['fitness_formula']}</font>",
+            ParagraphStyle('Formula', parent=styles['Normal'], alignment=TA_CENTER,
+                           spaceBefore=8, spaceAfter=8, fontSize=10),
+        ),
+        Paragraph(_fmt(exp["fitness_closing"], ctx), styles['BodyJust']),
+    ]
+
+
 def _experiment(prose: dict, ctx: dict, styles, examples: list[tuple[str, str]]) -> list:
     exp = prose["experiment"]
     overrides = exp["config_overrides"]
+    if ctx.get("artifact_type") == "prompt_section":
+        return _experiment_prompt_section(exp, overrides, ctx, styles)
 
     # Phase 1 runs counted gpt-4.1-mini + gpt-5-mini explicitly via run.log grep;
     # Phase 2 runs use a single optimizer LM tier (e.g., gpt-5.4-mini), so fall
@@ -571,24 +687,38 @@ def _results(prose: dict, ctx: dict, styles) -> list:
         accent_bg = HexColor('#fff8e1')
         accent_fg = HexColor('#5d4037')
 
-    results_rows = [
-        ['Metric', 'Baseline', 'Evolved (knee-point pick)', 'Δ'],
-        ['Body size (chars)', f'{ctx["baseline_chars"]:,}', f'{ctx["evolved_chars"]:,}', f'{ctx["growth_pct"]:+.1%}'],
-        [f'Avg holdout score (n={ctx["n_holdout"]})',
-         f'{ctx["avg_baseline"]:.3f}', f'{ctx["avg_evolved"]:.3f}', f'{ctx["improvement"]:+.3f}'],
-        ['Bootstrap mean diff', '—', f'{ctx["bootstrap_mean"]:+.3f}', '—'],
-        ['Bootstrap 90% CI lower', '—', f'{ctx["bootstrap_lower"]:+.3f}', '—'],
-        ['Bootstrap 90% CI upper', '—', f'{ctx["bootstrap_upper"]:+.3f}', '—'],
-    ]
-    # Phase 2: surface the closed-loop behavioral signal when the v5 schema
-    # exposed it (absent on synthetic-only runs).
-    if ctx.get("cl_total_tasks"):
-        results_rows.append([
-            f'Closed-loop tasks (n={ctx["cl_total_tasks"]})',
-            f'{ctx["cl_baseline_pass"]}/{ctx["cl_total_tasks"]}',
-            f'{ctx["cl_evolved_pass"]}/{ctx["cl_total_tasks"]}',
-            f'+{ctx["cl_tasks_gained"]} (req ≥{ctx["cl_required_gain"]})',
-        ])
+    if ctx.get("artifact_type") == "prompt_section":
+        # Behavioral-only: pass-rate + win/loss, no bootstrap/synthetic rows.
+        delta_rate = ctx["evolved_pass_rate"] - ctx["baseline_pass_rate"]
+        results_rows = [
+            ['Metric', 'Baseline', 'Evolved', 'Δ'],
+            ['Section size (chars)', f'{ctx["baseline_chars"]:,}', f'{ctx["evolved_chars"]:,}', f'{ctx["growth_pct"]:+.1%}'],
+            [f'Holdout pass-rate (n={ctx["cl_total_tasks"]})',
+             f'{ctx["baseline_pass_rate"]:.0%}', f'{ctx["evolved_pass_rate"]:.0%}', f'{delta_rate:+.0%}'],
+            [f'Tasks passing (n={ctx["cl_total_tasks"]})',
+             f'{ctx["cl_baseline_pass"]}/{ctx["cl_total_tasks"]}',
+             f'{ctx["cl_evolved_pass"]}/{ctx["cl_total_tasks"]}',
+             f'+{ctx["n_wins"]}W / {ctx["n_losses"]}L'],
+        ]
+    else:
+        results_rows = [
+            ['Metric', 'Baseline', 'Evolved (knee-point pick)', 'Δ'],
+            ['Body size (chars)', f'{ctx["baseline_chars"]:,}', f'{ctx["evolved_chars"]:,}', f'{ctx["growth_pct"]:+.1%}'],
+            [f'Avg holdout score (n={ctx["n_holdout"]})',
+             f'{ctx["avg_baseline"]:.3f}', f'{ctx["avg_evolved"]:.3f}', f'{ctx["improvement"]:+.3f}'],
+            ['Bootstrap mean diff', '—', f'{ctx["bootstrap_mean"]:+.3f}', '—'],
+            ['Bootstrap 90% CI lower', '—', f'{ctx["bootstrap_lower"]:+.3f}', '—'],
+            ['Bootstrap 90% CI upper', '—', f'{ctx["bootstrap_upper"]:+.3f}', '—'],
+        ]
+        # Phase 2: surface the closed-loop behavioral signal when the v5 schema
+        # exposed it (absent on synthetic-only runs).
+        if ctx.get("cl_total_tasks"):
+            results_rows.append([
+                f'Closed-loop tasks (n={ctx["cl_total_tasks"]})',
+                f'{ctx["cl_baseline_pass"]}/{ctx["cl_total_tasks"]}',
+                f'{ctx["cl_evolved_pass"]}/{ctx["cl_total_tasks"]}',
+                f'+{ctx["cl_tasks_gained"]} (req ≥{ctx["cl_required_gain"]})',
+            ])
     results_rows.append(['Decision', '—', decision_cell, decision_note])
 
     # Per-cell style picks: header row uses bold/white; first column (metric
diff --git a/reports/phase3_prose.yaml b/reports/phase3_prose.yaml
new file mode 100644
index 0000000..e68499c
--- /dev/null
+++ b/reports/phase3_prose.yaml
@@ -0,0 +1,233 @@
+# Editorial content for the Phase 3 validation report.
+# Numbers come from the run dir's gate_decision.json (the prompt-section path is
+# behavioral-only and self-sources cost/timing/calls — no metrics.json/run.log
+# needed). Pass via `generate_report.py --run output/prompts/<run>/`. Text blocks
+# may include {placeholder} substitutions the renderer fills from that data.
+
+meta:
+  title: "Agent Self-Evolution"
+  subtitle: "Phase 3 Validation Report<br/>System-prompt section evolution via splice-and-restore"
+  organization: ""
+  repository: "github.com/jramos/agent-self-evolution"
+
+executive_summary:
+  framework_intro: >
+    Agent Self-Evolution is a standalone optimization pipeline that uses DSPy and GEPA
+    (Genetic-Pareto Prompt Evolution) to automatically improve an agent's skills, tool
+    descriptions, system prompts, and code through evolutionary search — all via API
+    calls with no GPU training required. Phase 1 shipped a synthetic-only deploy gate;
+    Phase 2 made it behavior-aware and brought tool-description parity. Phase 3 extends
+    the framework to the third instructions surface — <i>named sections of the agent's
+    system prompt</i> — evaluated end-to-end against the real agent.
+  run_summary: >
+    This report documents the Phase 3 validation of system-prompt section evolution.
+    The target is a top-level string constant in Hermes Agent's
+    <font face="Courier">prompt_builder.py</font> (here, <b>{section_name}</b>), evolved
+    via GEPA and validated <i>purely behaviorally</i>: every candidate is spliced into
+    the live prompt file and scored by running the real agent
+    (<font face="Courier">hermes -z</font>) against a curated task suite — there is no
+    synthetic LLM-as-judge signal to lean on. Production <b>{section_name}</b> is already
+    well-tuned, so the saturation pre-flight correctly default-denies it (no headroom).
+    To exercise the loop end-to-end, the headline run evolves a <i>deliberately weakened</i>
+    baseline (supplied via <font face="Courier">--baseline-override-file</font>): the
+    agent's holdout pass-rate moved <b>{baseline_pass_rate:.0%} → {evolved_pass_rate:.0%}</b>
+    ({cl_baseline_pass}/{cl_total_tasks} → {cl_evolved_pass}/{cl_total_tasks} tasks,
+    <b>+{n_wins}W / {n_losses}L</b>) while the section shrank <b>{growth_pct:+.1%}</b>.
+    The closed-loop deploy gate decided <b>{decision_upper}</b>, and the live prompt file
+    was restored byte-for-byte after every trial.
+
+key_result_box:
+  title_template: "KEY RESULT — {section_name} (prompt-section deploy via closed-loop gate)"
+  rows:
+    - "Holdout pass-rate (n={cl_total_tasks}):   {baseline_pass_rate:.0%} → {evolved_pass_rate:.0%}   (+{n_wins}W / {n_losses}L)"
+    - "Tasks passing:   {cl_baseline_pass}/{cl_total_tasks} → {cl_evolved_pass}/{cl_total_tasks}"
+    - "Section size:   {baseline_chars:,} → {evolved_chars:,} chars   ({growth_pct:+.1%})"
+    - "Decision:   {decision_upper}   via the closed-loop behavioral gate"
+
+background:
+  intro: >
+    Agent Self-Evolution targets the instructions layer of an LLM agent — skill files,
+    tool descriptions, and system prompts — and evolves the text via API-only
+    evolutionary search. An agent's behavior is governed by three layers:
+  layers:
+    header: ["Layer", "What It Is", "How It's Currently Improved"]
+    rows:
+      - ["Model Weights", "The underlying LLM (Claude, GPT, etc.)", "RL training (Tinker-Atropos)"]
+      - ["Instructions", "Skills, system prompts, tool descriptions", "Manual authoring (static)"]
+      - ["Tool Code", "Python implementations of each tool", "Manual development"]
+    highlight_row: 1
+  closing: >
+    Phases 1 and 2 validated skill files and tool descriptions. Phase 3 completes the
+    instructions trio with <b>system-prompt sections</b> — the highest-leverage, widest
+    blast-radius surface, since one section governs the agent across every task. The
+    section is a string constant inside Hermes' own source, so unlike the skill path
+    (separate writable workdir) there is no env-var hook or plugin seam: the framework
+    edits <font face="Courier">prompt_builder.py</font> in place. The integration is an
+    AST-precise <b>splice-and-restore</b> — the candidate is byte-spliced into the live
+    file for the duration of a trial and restored from an atomic backup afterward
+    (<font face="Courier">flock</font> + checksum-drift detection + parse-guard, reused
+    from the Phase 2 closed-loop validator). Crucially, a system-prompt section has no
+    cheap synthetic proxy: the only honest measure of "did this guidance help" is
+    running the real agent, so Phase 3's deploy gate is purely behavioral.
+
+approach:
+  section_title: "Approach: Behavioral Prompt-Section Evolution"
+  engines:
+    header: ["Engine", "What It Optimizes", "License", "Role"]
+    rows:
+      - ["DSPy + GEPA", "Skills, prompts, tool descriptions", "MIT", "Primary (validated)"]
+      - ["DSPy MIPROv2", "Few-shot examples, instruction text", "MIT", "Fallback optimizer"]
+      - ["Darwinian Evolver", "Code files, algorithms", "AGPL v3", "Code evolution (Phase 4)"]
+  gepa_narrative: >
+    <b>GEPA</b> (Genetic-Pareto Prompt Evolution) is the star engine — an ICLR 2026
+    Oral paper from Stanford/UC Berkeley. Unlike traditional evolutionary search that
+    only sees pass/fail scores, GEPA reads full execution traces to understand
+    <i>why</i> things failed, then proposes targeted mutations. Phase 3 wires GEPA to a
+    sentinel-preserving proposer (mutations are confined to the section's text, never
+    the surrounding scaffolding) and routes every candidate score through a real
+    <font face="Courier">hermes -z</font> subprocess. Because the spliced
+    <font face="Courier">prompt_builder.py</font> is a single shared file and DSPy
+    evaluates with a thread pool, candidate scoring is serialized under a lock — an
+    accepted cost of the splice-and-restore model.
+  pipeline_steps:
+    - "<b>Resolve baseline</b> — Read the section's current text from <font face=\"Courier\">prompt_builder.py</font> (or accept a weakened baseline via <font face=\"Courier\">--baseline-override-file</font> to create headroom on an already-tuned section)"
+    - "<b>Split</b> — Deterministic seeded train / holdout split of the curated JSONL task suite"
+    - "<b>Saturation pre-flight</b> — Score the baseline behaviorally on the holdout; a <font face=\"Courier\">no_headroom</font> band default-denies (correctly refusing to evolve a saturated section) unless overridden"
+    - "<b>GEPA loop</b> — The section text is a sentinel-delimited region of a passthrough predictor's instructions; GEPA mutates it with the sentinel-preserving proposer. Each candidate is spliced into the live file and scored by running the agent on each task"
+    - "<b>Compound verdict</b> — Layer 1: did the agent invoke the expected tool (e.g. <font face=\"Courier\">memory</font>)? Layer 2: an LLM judge scores the saved content against each task's rubric"
+    - "<b>Closed-loop deploy gate</b> — Select the GEPA val-best candidate, then run baseline vs. evolved on the holdout suite; deploy iff holdout pass-rate does not regress and per-task wins ≥ 2·losses"
+    - "<b>Report + restore</b> — Structured <font face=\"Courier\">gate_decision.json</font> (v5 schema, prompt-section variant); the live file is restored byte-for-byte"
+  cost_paragraph: >
+    The honest Phase 3 story is two-part. First, the framework's <b>regression-catching
+    discipline</b>: the production <font face="Courier">{section_name}</font> is already
+    well-tuned, so a capable agent satisfies the suite regardless of small wording
+    changes — the saturation pre-flight scores the baseline at ceiling and correctly
+    <i>default-denies</i>, refusing to spend GEPA budget where no improvement is
+    possible. This mirrors the Phase 2 finding that the framework is improvement-finding
+    only where headroom genuinely exists. Second, to demonstrate that the loop produces
+    a real, grounded improvement when headroom <i>does</i> exist, the headline run
+    evolves a deliberately adversarial baseline (one that instructs the agent <i>not</i>
+    to save) — exactly the weakened-target approach Phase 2 used for its headline. That
+    run consumed <b>${cost_total_usd:.2f}</b> across {lm_calls_metrics:,} in-process LM
+    calls in ~{elapsed_minutes:.0f} minutes (the agent's own subprocess spend is
+    separate). Splicing a different section measurably changed live agent behavior, and
+    GEPA recovered a corrected section that the closed-loop gate deployed.
+
+experiment:
+  section_title: "Phase 3 Experiment"
+  config_overrides:
+    target_section_label: "{section_name} — evolved from a deliberately weakened baseline (production {section_name} is saturated; the weak baseline, supplied via --baseline-override-file, exercises the loop end-to-end)"
+    optimizer_lm: "{optimizer_lm}"
+    reflection_lm: "{reflection_lm}"
+    eval_judge_lm: "{eval_lm}"
+    agent_lm: "openai/gpt-5.4-mini (Hermes-configured default)"
+    quality_gate_label: "closed-loop behavioral — holdout pass-rate no-regression + per-task wins ≥ 2·losses; compound verdict (Layer 1 trigger + Layer 2 content judge)"
+    saturation_label: "forced via --force-saturation-check (the weakened baseline had real headroom; production {section_name} default-denies as no_headroom)"
+  dataset_intro: >
+    The evaluation suite is a curated, hand-authored JSONL benchmark
+    (<font face="Courier">memory_guidance.jsonl</font>, 12 tasks across five categories:
+    save-preference, save-correction, dont-save-task-progress,
+    dont-save-completed-work-log, and declarative-vs-imperative). Unlike Phases 1 and 2,
+    there is <i>no</i> synthetically generated train/val/holdout of LLM-judge examples —
+    every task is scored behaviorally by running the real agent, and the deploy gate's
+    holdout is {cl_total_tasks} of those tasks. Each save task carries an
+    <font face="Courier">expected_save_content</font> rubric consumed by the Layer 2
+    content judge.
+  fitness_intro: >
+    Fitness is behavioral, not a synthetic judge score. For each task, the candidate
+    section is spliced into the live <font face="Courier">prompt_builder.py</font>, the
+    agent runs once via <font face="Courier">hermes -z</font>, and the resulting session
+    is read back from Hermes' SQLite session store. The verdict is compound:
+  fitness_formula: "pass  =  Layer1(expected memory action fired, forbidden actions absent)  AND  Layer2(content-judge score ≥ 0.7 on save tasks)"
+  fitness_closing: >
+    GEPA's reflection LM reads the per-task failures and proposes a targeted mutation of
+    the section text; the sentinel-preserving proposer confines edits to the section and
+    re-raises rather than admitting a candidate that drops the markers. The deploy gate then
+    re-runs baseline vs. evolved on the holdout and decides on the behavioral signal
+    alone — holdout pass-rate no-regression plus a per-task win/loss rule — with no
+    paired-bootstrap CI, because there is no synthetic per-example distribution to
+    resample.
+
+results:
+  narrative: >
+    Evolving the weakened <b>{section_name}</b> baseline, the agent's holdout pass-rate
+    moved <b>{baseline_pass_rate:.0%} → {evolved_pass_rate:.0%}</b>
+    ({cl_baseline_pass}/{cl_total_tasks} → {cl_evolved_pass}/{cl_total_tasks} tasks,
+    <b>+{n_wins} wins / {n_losses} losses</b>, {n_ties} ties) while the section text
+    <i>shrank</i> <b>{growth_pct:+.1%}</b> ({baseline_chars:,} → {evolved_chars:,}
+    chars). GEPA learned from the save-task failures and inverted the adversarial
+    instruction — it removed the "never proactively save" misdirection and restored
+    proactive saving while keeping the legitimate "don't store passing remarks"
+    discrimination, in fewer characters. <b>Decision: {decision_upper}</b> via the
+    closed-loop gate ({decision_reasons}). The proposer rejected
+    {sentinel_failures} sentinel-breaking candidates. Throughout, the live
+    <font face="Courier">prompt_builder.py</font> was restored byte-for-byte after every
+    trial. The production {section_name} itself is saturated and correctly
+    default-denies — the framework is regression-catching, and only finds improvements
+    where real headroom exists.
+  how_produced_intro: "GEPA evolves the section text through a reflective loop; the gate then reads the behavioral signal:"
+  how_produced_steps:
+    - "Splice a candidate section into the live <font face=\"Courier\">prompt_builder.py</font> (only when the candidate changes); run each holdout task once via <font face=\"Courier\">hermes -z</font> and read the session from Hermes' <font face=\"Courier\">state.db</font>"
+    - "Score each run with the compound verdict (Layer 1 tool-trigger membership + Layer 2 content judge on memory-save content); abstentions (agent/runner errors) score 0 in-loop and tie at the gate"
+    - "The reflection LM reads the failures and proposes a sentinel-confined mutation of the section text; GEPA accepts on improvement-or-equal"
+    - "Select the GEPA val-best candidate; run the closed-loop deploy gate (baseline vs. evolved on {cl_total_tasks} holdout tasks, its own backup/restore)"
+    - "<b>Decide</b> — Deploy iff evolved holdout pass-rate ≥ baseline AND per-task wins ≥ 2·losses. On this run: {baseline_pass_rate:.0%} → {evolved_pass_rate:.0%}, {n_wins}W/{n_losses}L → {decision_upper}"
+  how_produced_closing: >
+    Two design choices made this outcome trustworthy. First, the splice-and-restore
+    guard (atomic backup + exclusive <font face="Courier">flock</font> + byte-restore,
+    with stale-backup refusal) means the user's Hermes checkout is never left mutated,
+    even on crash. Second, the deploy gate is the same proven closed-loop validator used
+    for tool descriptions — the prompt path adds only a thin installer plus a per-task
+    content judge, so the decision rule, audit trail, and restore machinery are shared
+    and already battle-tested. The behavioral-only design is not a shortcut: it is the
+    only honest measure for a system-prompt section, which has no cheap synthetic proxy.
+
+safety:
+  intro: "Every evolved section must clear these constraints, and the live prompt file is protected throughout:"
+  table:
+    header: ["Constraint", "Enforcement", "Status"]
+    rows:
+      - ["Self-evolution test suite", "1,232 pytest tests pass on the optimizer itself", "Implemented"]
+      - ["Byte-clean splice/restore", "Atomic backup + byte-for-byte restore of prompt_builder.py after every run", "Implemented"]
+      - ["Parse-guarded write", "Candidate spliced via repr() + ast.parse check; refuses to write non-parseable Python", "Implemented"]
+      - ["Exclusive lock + drift check", "flock on the prompt file's dir + sha-drift detection; stale-backup refusal on startup", "Implemented"]
+      - ["Compound verdict", "Layer 1 tool-trigger membership AND Layer 2 LLM content judge (≥ threshold)", "Implemented"]
+      - ["Abstain on corrupt session", "A malformed agent session abstains (neutral), never scores as a behavioral regression", "Implemented"]
+      - ["Closed-loop deploy gate", "Holdout pass-rate no-regression + per-task wins ≥ 2·losses", "Implemented"]
+      - ["Saturation pre-flight", "Default-denies a saturated (no_headroom) section before spending GEPA budget", "Implemented"]
+      - ["Budget ceiling", "--max-cost-usd aborts on in-process LM spend overrun", "Implemented"]
+      - ["Deployment via apply + review", "--apply writes the section; PR automation deferred for prompt sections", "By design"]
+      - ["Benchmark regression", "External --benchmark-cmd hook (TBLite / harness)", "Planned"]
+  closing: >
+    The source Hermes repository is never left modified: the section is spliced in only
+    for the duration of a trial and restored from an atomic backup, and all evolution
+    output (gate decisions, section before/after text, run logs) is written under the
+    framework's local <font face="Courier">output/</font> directory. PR automation is
+    deferred for prompt sections — a section-scoped PR path is future work — so the
+    deploy step is an explicit <font face="Courier">--apply</font> plus a human-authored
+    pull request.
+
+roadmap:
+  table:
+    header: ["Phase", "Target", "Engine", "Timeline", "Status"]
+    rows:
+      - ["Phase 1", "Skill files (SKILL.md)", "DSPy + GEPA", "3-4 weeks", "Validated ✓"]
+      - ["Phase 2", "Tool descriptions", "DSPy + GEPA", "2-3 weeks", "Validated ✓"]
+      - ["Phase 3", "System prompt sections", "DSPy + GEPA", "2-3 weeks", "Validated ✓"]
+      - ["Phase 4", "Tool implementation code", "Darwinian Evolver", "3-4 weeks", "Planned"]
+      - ["Phase 5", "Continuous improvement", "Automated pipeline", "2 weeks", "Planned"]
+    highlight_row: 2
+  closing: >
+    Phase 3 completes the instructions trio — skills, tool descriptions, and now
+    system-prompt sections — all gated by the same closed-loop discipline. The
+    behavioral-only deploy gate proves the framework can evolve the highest-blast-radius
+    instructions surface safely: it default-denies a saturated section, produces a real
+    grounded improvement where headroom exists, and never leaves the agent's source
+    mutated. Phase 4 (tool implementation code) and Phase 5 (continuous improvement)
+    extend the framework beyond the instructions layer.
+
+next_steps:
+  - "<b>Harder behavioral suites</b> — Production system-prompt sections are heavily tuned and saturate the current suites; develop richer, harder task suites (and weaker agent tiers) so headroom exists on real targets, not only adversarial baselines."
+  - "<b>Additional sections</b> — The same path supports any string-constant section (SKILLS_GUIDANCE, SESSION_SEARCH_GUIDANCE, etc.); MEMORY_GUIDANCE was the first proof point, chosen for its clear tool-call anchor."
+  - "<b>Section-scoped PR automation</b> — Wire --create-pr for prompt sections by splicing into origin/&lt;base&gt;'s prompt_builder.py (not the local checkout), so the PR diff carries only the section change."
+  - "<b>Agent-side cost capture</b> — The agent's own LM spend happens inside the hermes subprocess and is invisible to the in-process budget ceiling; surface it from the session store so --max-cost-usd accounts for end-to-end spend."
diff --git a/reports/phase3_validation_report.pdf b/reports/phase3_validation_report.pdf
new file mode 100644
index 0000000..b2cd415
Binary files /dev/null and b/reports/phase3_validation_report.pdf differ
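
Review note on the gate rule stated in `reports/phase3_prose.yaml`: the prose describes the prompt-section deploy decision as "holdout pass-rate no-regression + per-task wins ≥ 2·losses". A minimal sketch of that rule, read literally, might look like the following. It is an illustration only, not the framework's code: `GateInputs` and `should_deploy` are hypothetical names, the field names simply mirror the report placeholders, and the handling of an empty holdout is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    """Hypothetical container; fields mirror the report's placeholders."""
    baseline_pass: int  # holdout tasks passing with the baseline section
    evolved_pass: int   # holdout tasks passing with the evolved section
    total_tasks: int    # holdout size (same suite for both arms)
    n_wins: int         # tasks failing on baseline but passing on evolved
    n_losses: int       # tasks passing on baseline but failing on evolved


def should_deploy(g: GateInputs) -> bool:
    """Literal reading: no pass-rate regression AND wins >= 2 * losses."""
    if g.total_tasks <= 0:
        # Assumption: with no holdout tasks there is nothing to judge.
        return False
    # Both arms run the same holdout, so comparing pass counts is
    # equivalent to comparing pass rates.
    no_regression = g.evolved_pass >= g.baseline_pass
    wins_offset_losses = g.n_wins >= 2 * g.n_losses
    return no_regression and wins_offset_losses


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(should_deploy(GateInputs(2, 5, 6, n_wins=3, n_losses=0)))
    print(should_deploy(GateInputs(4, 4, 6, n_wins=1, n_losses=1)))
```

Read this literally and a candidate with zero wins and zero losses also passes both conditions. The prose does not say whether the real gate treats that all-ties case as a deploy, so it may be worth stating explicitly in the report.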