feat(prompts): evolve Hermes system-prompt sections via GEPA + closed-loop validation by jramos · Pull Request #78 · jramos/agent-self-evolution

jramos · 2026-06-02T14:53:51Z

Phase 3 of the framework: evolve a named section of an agent's system prompt (first target: Hermes MEMORY_GUIDANCE) with the same GEPA + deploy-gate discipline Phases 1–2 brought to skills and tool descriptions.

What it ships

A new CLI, python -m evolution.prompts.evolve_prompt_section, that evolves a top-level string constant in agent/prompt_builder.py end-to-end:

Integration: in-place AST splice + restore. A candidate section is spliced byte-precisely into the live prompt_builder.py (repr-based, ast.parse-guarded, atomic write), the real agent runs against it, and the file is restored from an atomic backup. This reuses Phase 2's ClosedLoopValidator machinery (flock + sha-drift + stale-backup refusal) via a new HermesPromptSectionInstaller, so there's no parallel validator and no dependency on any upstream Hermes change.
Compound verdict. Layer 1 = did the agent invoke memory (trigger membership); Layer 2 = an LLM judge scores the saved content (memory(action=add/replace)) against each task's expected_save_content rubric, threaded as a per-task factory.
GEPA wiring. PromptModule + a sentinel-preserving PromptSectionProposer, scored by a serialized memoizing splice scorer (the shared file is mutated under a lock because DSPy evaluates candidates on multiple threads).
Saturation pre-flight + budget cap + paired-bootstrap deploy gate, inherited from the shared infrastructure.
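To make the splice-and-restore idea above concrete, here is a minimal sketch, not the PR's implementation. It assumes a UTF-8 source file and a plain `NAME = "..."` assignment. The names `splice_constant`, `atomic_write`, and `spliced_section` are hypothetical, and it keeps the original bytes in memory instead of using the on-disk backup, flock, and sha-drift checks the real installer relies on.

```python
import ast
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


def _constant_node(tree: ast.Module, name: str) -> ast.Constant:
    """Find the value node of a top-level `name = "<string literal>"`."""
    for node in tree.body:
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == name
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            return node.value
    raise KeyError(f"no top-level string constant named {name!r}")


def splice_constant(source: bytes, name: str, new_text: str) -> bytes:
    """Return `source` with the string constant `name` replaced by `new_text`."""
    value = _constant_node(ast.parse(source), name)
    # ast column offsets count UTF-8 bytes, so the splice is done on bytes.
    lines = source.splitlines(keepends=True)
    start = sum(map(len, lines[: value.lineno - 1])) + value.col_offset
    end = sum(map(len, lines[: value.end_lineno - 1])) + value.end_col_offset
    candidate = source[:start] + repr(new_text).encode("utf-8") + source[end:]
    # Guard: the spliced file must parse and must yield exactly the new text.
    if _constant_node(ast.parse(candidate), name).value != new_text:
        raise ValueError("splice did not round-trip")
    return candidate


def atomic_write(path: Path, data: bytes) -> None:
    """Write via a temp file in the same directory, then rename over `path`."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


@contextmanager
def spliced_section(path: Path, name: str, new_text: str):
    """Install a candidate section for the duration of the block, then restore."""
    original = path.read_bytes()
    atomic_write(path, splice_constant(original, name, new_text))
    try:
        yield
    finally:
        atomic_write(path, original)
```

Everything outside the literal's byte range is copied through untouched, which is what lets the file come back byte-identical after the `with` block exits.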

Validation

Demonstrated end-to-end against real hermes -z:

The polished real MEMORY_GUIDANCE scores 1.0/6 on the holdout — the saturation pre-flight correctly classifies no_headroom and default-denies (regression-catching, not improvement-finding, on a tuned prompt).
On a deliberately adversarial baseline (via --baseline-override-file), the loop produces a real, grounded mutation: the holdout pass rate rises from 0.67 to 1.00 with 2 wins / 0 losses, so the gate deploys, and the evolved section is shorter than the baseline. prompt_builder.py is restored byte-clean after every run.
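For readers unfamiliar with a paired-bootstrap gate, the following is a rough sketch of the general technique, not the shared infrastructure's actual rule. The function name, the `max_prob_regression` threshold, and the "more wins than losses" condition are all assumptions. The per-task scores in the usage note are made-up values chosen only to match the reported 0.67 → 1.00 and 2W/0L.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GateResult:
    baseline_rate: float
    evolved_rate: float
    wins: int
    losses: int
    prob_regression: float
    deploy: bool


def paired_bootstrap_gate(
    baseline: Sequence[float],
    evolved: Sequence[float],
    n_resamples: int = 2000,
    max_prob_regression: float = 0.05,  # hypothetical threshold
    seed: int = 0,
) -> GateResult:
    """Compare per-task scores pairwise and bootstrap the paired differences."""
    if not baseline or len(baseline) != len(evolved):
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(baseline)
    diffs = [e - b for b, e in zip(baseline, evolved)]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)

    # Resample tasks with replacement, keeping each baseline/evolved pair
    # together, and count how often the evolved variant comes out worse.
    rng = random.Random(seed)
    regressions = sum(
        sum(rng.choices(diffs, k=n)) < 0 for _ in range(n_resamples)
    )
    prob_regression = regressions / n_resamples

    # Assumed decision rule: strictly more wins than losses, and a low
    # bootstrap probability of regression.
    deploy = wins > losses and prob_regression <= max_prob_regression
    return GateResult(
        baseline_rate=sum(baseline) / n,
        evolved_rate=sum(evolved) / n,
        wins=wins,
        losses=losses,
        prob_regression=prob_regression,
        deploy=deploy,
    )
```

With illustrative scores `baseline = [1, 1, 1, 1, 0, 0]` and `evolved = [1, 1, 1, 1, 1, 1]`, this reports 2 wins and 0 losses and deploys under the assumed rule. An evolved run identical to the baseline has no wins and is denied.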

Notable

The smoke test surfaced a shared-infrastructure issue, now fixed: the current hermes -z one-shot mode is ephemeral and no longer writes session_*.json; sessions persist to a SQLite state.db. HermesAgentRunner now reads sessions from state.db, which unblocks all closed-loop validation (skills + tools + prompts).
PR automation is deferred for prompt sections (a section-scoped PR path is future work); --apply writes the evolved section in place.
Full Phase 3 deviations are recorded in PLAN.md.

1232 tests passing.

Docs (README + reference docs) and the Phase 3 validation report will follow in a separate docs PR, matching the Phase 2 sequencing.

… + closed-loop deploy gate

Wires HermesPromptSectionInstaller + HermesAgentRunner + ClosedLoopValidator into a full-parity evolution pipeline for prompt sections. GEPA mutates via PromptSectionProposer; the inner loop scores through a serialized memoizing splice scorer; the deploy gate runs baseline-vs-evolved closed-loop on the holdout suite. Saturation pre-flight default-denies a saturated baseline; budget cap aborts on overrun.

ClosedLoopValidator's Layer 2 hook becomes a per-task judge factory so the content judge can read each task's expected_save_content rubric. The memoizing scorer serializes splice+run under a lock: dspy.Evaluate is multi-threaded but prompt_builder.py is a single shared file.

PR automation is deferred for prompt sections (copying a full evolved file over origin/base would pollute the diff with the local override-hook commit).
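The serialize-and-memoize pattern described in this commit can be sketched roughly as below. This is an illustration under assumptions, not the PR's scorer: the class name, the `run_with_section(section_text, task_id)` callback, and the cache key are all hypothetical.

```python
import hashlib
import threading
from typing import Callable, Dict, Tuple


class SerializedMemoizingScorer:
    """Score (section, task) pairs one at a time and cache the results.

    `run_with_section` stands in for "splice the candidate into the shared
    file, run the agent on the task, restore, return a score".
    """

    def __init__(self, run_with_section: Callable[[str, str], float]) -> None:
        self._run = run_with_section
        self._lock = threading.Lock()
        self._cache: Dict[Tuple[str, str], float] = {}

    def score(self, section_text: str, task_id: str) -> float:
        digest = hashlib.sha256(section_text.encode("utf-8")).hexdigest()
        key = (digest, task_id)
        # One lock covers both the cache and the run, so two worker threads
        # can never have different candidates spliced into the file at once.
        with self._lock:
            if key not in self._cache:
                self._cache[key] = self._run(section_text, task_id)
            return self._cache[key]
```

Holding the lock across the whole run trades away parallelism for safety. That seems to be the right trade here, since the spliced file is a single shared resource and each real agent run is far more expensive than the lock wait.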

Modern hermes -z one-shot mode is ephemeral — it prints only the final response and no longer writes session_*.json. Sessions now persist to a SQLite state.db in HERMES_HOME. The runner globbed for the obsolete JSON files, so every closed-loop run abstained ('no session JSON'). Read the most-recent session's messages from state.db instead; the tool_calls column holds the same OpenAI-nested shape the extractors already parse, so the message-extraction core is shared between the JSON and DB paths. Unblocks all closed-loop validation (tools, skills, and prompt sections).
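As a rough illustration of reading the most recent session from SQLite, the sketch below uses a hypothetical schema. The `sessions(id, started_at)` and `messages(id, session_id, role, content, tool_calls)` layouts are assumptions for the example. Only the existence of state.db and its tool_calls column comes from the commit message above.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import Dict, List, Optional

# Assumed schema (hypothetical): sessions(id, started_at) and
# messages(id, session_id, role, content, tool_calls).
LATEST_SESSION_SQL = """
SELECT role, content, tool_calls
FROM messages
WHERE session_id = (SELECT id FROM sessions ORDER BY started_at DESC LIMIT 1)
ORDER BY id
"""


def latest_session_messages(db_path: Path) -> List[Dict[str, Optional[str]]]:
    """Return the messages of the most recently started session, oldest first."""
    # Open read-only so a validation run can never modify the agent's state.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        conn.row_factory = sqlite3.Row
        return [dict(row) for row in conn.execute(LATEST_SESSION_SQL)]
```

A missing file or missing table raises `sqlite3.OperationalError` here. A caller would presumably turn that into an abstention instead of a behavioral failure.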

The Hermes memory tool's content-bearing actions are add and replace (full set: add/replace/remove/read); there is no 'save' action. The Layer 2 filter matched the nonexistent 'save', so it never scored any real call. Match SAVE_ACTIONS = {add, replace} instead.
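A minimal sketch of how the two layers might combine once the filter matches the real actions. The `content` argument name, the pass threshold, treating "memory invoked but nothing saved" as a failure, and the keyword-overlap judge (a stand-in for the LLM judge) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

SAVE_ACTIONS = frozenset({"add", "replace"})


@dataclass
class Verdict:
    triggered: bool                 # Layer 1: did the agent call memory at all?
    content_score: Optional[float]  # Layer 2: None when no save call was scored
    passed: bool


def compound_verdict(
    tool_calls: List[Dict[str, Any]],
    judge: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical pass mark
) -> Verdict:
    memory_calls = [c for c in tool_calls if c.get("name") == "memory"]
    if not memory_calls:
        return Verdict(False, None, False)
    saves = [
        c for c in memory_calls
        if c.get("args", {}).get("action") in SAVE_ACTIONS
    ]
    if not saves:
        # Memory was invoked (e.g. remove) but nothing content-bearing was saved.
        return Verdict(True, None, False)
    score = max(judge(c["args"].get("content", "")) for c in saves)
    return Verdict(True, score, score >= threshold)


def make_keyword_judge(expected_save_content: str) -> Callable[[str], float]:
    """Per-task factory; a keyword-overlap stand-in for the LLM judge."""
    keywords = expected_save_content.lower().split()

    def judge(saved: str) -> float:
        if not keywords:
            return 0.0
        return sum(k in saved.lower() for k in keywords) / len(keywords)

    return judge
```

Filtering on a `save` action with this structure would leave `saves` empty for every real call, which is the failure mode this commit fixes.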

PromptModule.forward returned a Prediction without calling the predictor, so GEPA captured no trace for passthrough.predict and make_reflective_dataset raised 'No valid predictions found' every iteration — no candidate was ever proposed. The tool path gets traces from synthetic examples; prompt sections are pure-behavioral, so forward must call the passthrough to produce a trace. The predictor output stays a placeholder; the real score is the metric's behavioral branch.

…predictor

The forward() trace fix was necessary but insufficient: GEPA evaluates the module in worker threads that don't inherit the saturation pre-flight's dspy.context(lm=...), so the passthrough predictor raised 'No LM is loaded', captured no trajectories, and never proposed. Set the global default LM via dspy.configure (matching evolve_tool), which the parallelizer propagates to worker threads. GEPA now scores the valset correctly and the proposer fires; on a saturated target it correctly declines to mutate (no failures to ground a change in).

…ting text

Lets evolution start from text other than the live section, e.g. a deliberately weakened or adversarial baseline to create headroom for demonstrating a real mutation, or a regression-injection ablation. The live section remains the splice/restore target (backed up + restored), so the user's file is never left mutated; --apply still writes the evolved text. Verified end-to-end: an adversarial 'never save' baseline scored 0.67, GEPA proposed a corrected section, deploy gate measured 0.67 -> 1.00 (2W/0L).

…ons, doc/comment accuracy, guards + tests

- Critical: a malformed tool_calls column in state.db now abstains (error set + logged) instead of reading as 'agent invoked no tools', which scored a DB-format regression as a fake behavioral failure and contaminated fitness.
- Surface previously-silent fallbacks: malformed tool-call args, a memory call with no save action, and an unparseable judge score now log.
- Doc/comment accuracy: memory action is add (not the nonexistent 'save'); tool schema enum is add/replace/remove (not 'read'); state.db tool_calls is the flat shape (nested handled for compat); the memoizing-scorer/validator splice cadence; the guard wraps pre-flight + GEPA; _closed_loop_task_id is set by PromptModule.forward.
- Reject a <2-task suite up front (empty GEPA trainset otherwise).
- Tests: parse_session_from_db malformed/corrupt/missing-table matrix; the _prompt_builder_guard restore round-trip, stale-backup refusal, and concurrent lock refusal; the single-task-suite guard.
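The distinction in the first two bullets (abstain on a malformed column, log and fall back on malformed args) could look roughly like this. It is a sketch under assumptions: `ParsedCalls`, `parse_tool_calls`, and the exact flat and nested JSON layouts are hypothetical stand-ins, not the repository's parser.

```python
import json
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

log = logging.getLogger(__name__)


@dataclass
class ParsedCalls:
    calls: List[Dict[str, Any]] = field(default_factory=list)
    error: Optional[str] = None  # set means "abstain", not "no tools invoked"


def parse_tool_calls(raw: Optional[str]) -> ParsedCalls:
    """Parse a tool_calls column value, separating 'empty' from 'unreadable'."""
    if not raw:
        return ParsedCalls()  # the agent genuinely made no tool calls
    try:
        items = json.loads(raw)
        if not isinstance(items, list):
            raise ValueError("tool_calls is not a JSON list")
        calls = []
        for item in items:
            # Flat shape: {"name", "arguments"}; nested OpenAI-style shape
            # keeps the same fields under a "function" key.
            fn = item.get("function", item)
            name = fn["name"]
            args = fn.get("arguments", {})
            if isinstance(args, str):
                try:
                    args = json.loads(args)
                except ValueError:
                    log.warning("malformed args for tool %r; using {}", name)
                    args = {}
            calls.append({"name": name, "args": args})
        return ParsedCalls(calls=calls)
    except (ValueError, KeyError, AttributeError, TypeError) as exc:
        log.warning("malformed tool_calls column: %s", exc)
        return ParsedCalls(error=f"malformed tool_calls: {exc}")
```

A caller that sees `error` set would skip scoring the task, so a storage-format regression never shows up as a behavioral failure in fitness.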

read/write are the only members the evolution driver exercises (the runtime override seam moved to HermesPromptSectionInstaller). name and list_sections had no production consumer, so they're no longer part of the shared contract — list_sections + SectionDescriptor remain as concrete conveniences on HermesPromptSource for a future --list-sections affordance. Every member of a Protocol is a cost on every future implementer; this keeps the contract to exactly what's shared.
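A narrowed contract of this kind might be sketched as follows. The method signatures (`read(section) -> str`, `write(section, text) -> None`), the in-memory implementation, and `evolve_once` are assumptions for illustration. Only the read/write membership and the idea of list_sections living on the concrete class come from the commit message.

```python
from typing import Callable, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class PromptSource(Protocol):
    """Shared contract: only what the evolution driver actually calls."""

    def read(self, section: str) -> str: ...

    def write(self, section: str, text: str) -> None: ...


class InMemoryPromptSource:
    """Hypothetical implementer; satisfies PromptSource structurally."""

    def __init__(self, sections: Dict[str, str]) -> None:
        self._sections = dict(sections)

    def read(self, section: str) -> str:
        return self._sections[section]

    def write(self, section: str, text: str) -> None:
        if section not in self._sections:
            raise KeyError(section)
        self._sections[section] = text

    # Concrete convenience, deliberately not part of the Protocol.
    def list_sections(self) -> List[str]:
        return sorted(self._sections)


def evolve_once(
    source: PromptSource, section: str, mutate: Callable[[str], str]
) -> None:
    """Driver-side code depends on read/write and nothing else."""
    source.write(section, mutate(source.read(section)))
```

Because the Protocol is structural, a new source needs only the two methods to plug into the driver, while extras like `list_sections` stay optional.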

jramos added 23 commits May 31, 2026 20:28

feat(prompts): add PromptSource protocol + SectionDescriptor

1a5714e

feat(prompts): HermesPromptSource AST-based read

2b857ac

feat(prompts): HermesPromptSource AST-based write

c9f48e7

feat(prompts): HermesPromptSource section enumeration

22fac1c

feat(validation): extend Task with expected_save_content

9f2f4a6

feat(validation): capture tool call args in session parser

b217e8b

feat(validation): HermesPromptSectionInstaller

3385837

feat(validation): optional Layer 2 judge in ClosedLoopValidator + score_task

552a408

feat(prompts): SaveCallJudge signature + scorer

8435864

feat(prompts): PromptModule DSPy wrapper

2365031

feat(prompts): GEPA fitness metric + memoizing splice scorer

625f979

feat(prompts): memory_guidance dataset builder + curated eval suite

725908b

feat(prompts): PromptSectionProposer (sentinel-preserving GEPA proposal fn)

d17ead8

test(validation): use real memory actions (add) in compound-verdict fixtures

3c96e57

docs(plan): Phase 3 deviations — splice-and-restore, compound verdict, state.db runner fix, adversarial-baseline proof

621b23d

jramos merged commit 3585ae9 into main Jun 2, 2026
4 checks passed

jramos deleted the feat/prompt-section-evolution branch June 2, 2026 15:00

jramos merged 23 commits into main from feat/prompt-section-evolution
