feat(prompts): evolve Hermes system-prompt sections via GEPA + closed-loop validation#78

Merged: jramos merged 23 commits into main from feat/prompt-section-evolution on Jun 2, 2026

@jramos (Owner) commented on Jun 2, 2026

Phase 3 of the framework: evolve a named section of an agent's system prompt (first target: Hermes MEMORY_GUIDANCE) with the same GEPA + deploy-gate discipline Phases 1–2 brought to skills and tool descriptions.

What it ships

A new CLI, python -m evolution.prompts.evolve_prompt_section, that evolves a top-level string constant in agent/prompt_builder.py end-to-end:

  • Integration: in-place AST splice + restore. A candidate section is spliced byte-precisely into the live prompt_builder.py (repr-based, ast.parse-guarded, atomic write), the real agent runs against it, and the file is restored from an atomic backup. This reuses Phase 2's ClosedLoopValidator machinery (flock + sha-drift + stale-backup refusal) via a new HermesPromptSectionInstaller, so there's no parallel validator and no dependency on any upstream Hermes change.
  • Compound verdict. Layer 1 = did the agent invoke memory (trigger membership); Layer 2 = an LLM judge scores the saved content (memory(action=add/replace)) against each task's expected_save_content rubric, threaded as a per-task factory.
  • GEPA wiring. PromptModule + a sentinel-preserving PromptSectionProposer, scored by a serialized memoizing splice scorer (the shared file is mutated under a lock since DSPy evaluates multi-threaded).
  • Saturation pre-flight + budget cap + paired-bootstrap deploy gate, inherited from the shared infrastructure.
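
The splice-and-restore idea in the Integration bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's HermesPromptSectionInstaller: the helper names (splice_constant, atomic_write, spliced) are hypothetical, it keeps the backup in memory rather than in a backup file, and it omits the flock, sha-drift, and stale-backup checks the description mentions.

```python
import ast
import contextlib
import os
import shutil
import tempfile


def _find_string_constant(tree: ast.Module, name: str) -> ast.Constant:
    """Locate `NAME = "..."` at module top level (plain assignments only)."""
    for node in tree.body:
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == name
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            return node.value
    raise KeyError(f"no top-level string constant named {name!r}")


def splice_constant(source: bytes, name: str, new_value: str) -> bytes:
    """Replace only the bytes of the string literal bound to `name`."""
    value = _find_string_constant(ast.parse(source), name)
    # ast column offsets are UTF-8 byte offsets, so the splice works on bytes.
    lines = source.splitlines(keepends=True)
    start = sum(len(l) for l in lines[: value.lineno - 1]) + value.col_offset
    end = sum(len(l) for l in lines[: value.end_lineno - 1]) + value.end_col_offset
    candidate = source[:start] + repr(new_value).encode("utf-8") + source[end:]
    # Guard: the result must still parse and must bind exactly the new text.
    if _find_string_constant(ast.parse(candidate), name).value != new_value:
        raise ValueError("spliced file does not evaluate to the new section")
    return candidate


def atomic_write(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over `path`."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        if os.path.exists(path):
            shutil.copymode(path, tmp)  # keep the original file mode
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


@contextlib.contextmanager
def spliced(path: str, name: str, new_value: str):
    """Splice a candidate section in, run the body, then restore the file."""
    with open(path, "rb") as fh:
        original = fh.read()
    atomic_write(path, splice_constant(original, name, new_value))
    try:
        yield
    finally:
        atomic_write(path, original)
```

Everything outside the literal (comments, parentheses, neighbouring constants) is left untouched, which is what makes a byte-clean restore straightforward to verify.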

Validation

Demonstrated end-to-end against real hermes -z:

  • The polished real MEMORY_GUIDANCE scores 1.0/6 on the holdout, so the saturation pre-flight correctly classifies it as no_headroom and default-denies. On an already-tuned prompt the loop acts as a regression catcher rather than an improvement finder.
  • On a deliberately adversarial baseline (via --baseline-override-file), the loop produces a real, grounded mutation: the holdout pass rate goes from 0.67 to 1.00 with 2 wins / 0 losses, so the gate deploys, and the evolved section is shorter than the baseline. prompt_builder.py is restored byte-clean after every run.
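
To make the "2 wins / 0 losses" figure concrete, here is a rough sketch of a paired bootstrap over per-task outcomes. It is an illustration only: the function name, the resample count, and the 5th-percentile bound are assumptions, and the PR does not state the gate's actual decision threshold. The six-task holdout in the usage lines is inferred from the reported 0.67 → 1.00 and 2 wins, not quoted from the PR.

```python
import random
from typing import Dict, Sequence


def paired_bootstrap(
    baseline: Sequence[float],
    evolved: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Dict[str, float]:
    """Compare two systems scored on the SAME tasks, task by task."""
    if not baseline or len(baseline) != len(evolved):
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [e - b for b, e in zip(baseline, evolved)]
    n = len(diffs)
    rng = random.Random(seed)  # seeded so the report is reproducible
    # Resample tasks with replacement, keeping each baseline/evolved pair intact.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    return {
        "wins": sum(d > 0 for d in diffs),
        "losses": sum(d < 0 for d in diffs),
        "mean_diff": sum(diffs) / n,
        "lower_bound": means[int(0.05 * n_resamples)],  # 5th percentile
    }


# Hypothetical per-task pass/fail vectors: 4/6 passing before, 6/6 after.
report = paired_bootstrap([1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1])
```

Pairing matters because both variants run the same holdout tasks; resampling the per-task differences removes task-difficulty variance that an unpaired comparison would count as noise.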

Notable

  • The smoke surfaced and fixed a shared-infrastructure issue: current hermes -z one-shot is ephemeral and no longer writes session_*.json; sessions persist to a SQLite state.db. HermesAgentRunner now reads sessions from state.db, which unblocks all closed-loop validation (skills + tools + prompts).
  • PR automation is deferred for prompt sections (a section-scoped PR path is future work); --apply writes the evolved section in place.
  • Full Phase 3 deviations are recorded in PLAN.md.

1232 tests passing.

Docs (README + reference docs) and the Phase 3 validation report will follow in a separate docs PR, matching the Phase 2 sequencing.

jramos added 23 commits May 31, 2026 20:28
… + closed-loop deploy gate

Wires HermesPromptSectionInstaller + HermesAgentRunner + ClosedLoopValidator
into a full-parity evolution pipeline for prompt sections. GEPA mutates via
PromptSectionProposer; the inner loop scores through a serialized memoizing
splice scorer; the deploy gate runs baseline-vs-evolved closed-loop on the
holdout suite. Saturation pre-flight default-denies a saturated baseline;
budget cap aborts on overrun.

ClosedLoopValidator's Layer 2 hook becomes a per-task judge factory so the
content judge can read each task's expected_save_content rubric. The memoizing
scorer serializes splice+run under a lock — dspy.Evaluate is multi-threaded but
prompt_builder.py is a single shared file. PR automation is deferred for prompt
sections (copying a full evolved file over origin/base would pollute the diff
with the local override-hook commit).
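
The "serialized memoizing splice scorer" described above could look something like the sketch below. The class name and the run_with_section callback are hypothetical stand-ins for "splice the candidate, run the agent, return a score"; the only claims taken from the commit message are that splice+run happens under a lock and that results are memoized.

```python
import threading
from typing import Callable, Dict, Tuple


class SerializedMemoScorer:
    """Score (section_text, task_id) pairs one at a time, caching results."""

    def __init__(self, run_with_section: Callable[[str, str], float]) -> None:
        self._run = run_with_section
        self._lock = threading.Lock()
        self._cache: Dict[Tuple[str, str], float] = {}

    def score(self, section_text: str, task_id: str) -> float:
        key = (section_text, task_id)
        # One lock covers both the cache check and the run, so two evaluator
        # threads can never mutate the shared file at the same time, and a
        # repeated candidate never triggers a second agent run.
        with self._lock:
            if key not in self._cache:
                self._cache[key] = self._run(section_text, task_id)
            return self._cache[key]
```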

Modern hermes -z one-shot mode is ephemeral — it prints only the final
response and no longer writes session_*.json. Sessions now persist to a
SQLite state.db in HERMES_HOME. The runner globbed for the obsolete JSON
files, so every closed-loop run abstained ('no session JSON'). Read the
most-recent session's messages from state.db instead; the tool_calls column
holds the same OpenAI-nested shape the extractors already parse, so the
message-extraction core is shared between the JSON and DB paths. Unblocks all
closed-loop validation (tools, skills, and prompt sections).
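
Reading the most recent session from a SQLite file might be sketched as below. The schema is entirely an assumption made for illustration (tables named sessions and messages, and the columns id, started_at, session_id, role, content, tool_calls); the real state.db layout is not shown in this PR. The function returns an error string instead of an empty tool list when the data cannot be trusted, in the spirit of the abstain-on-malformed behaviour a later commit in this PR describes.

```python
import json
import sqlite3
from typing import List, Optional, Tuple

Messages = List[dict]


def read_latest_session(db_path: str) -> Tuple[Optional[Messages], Optional[str]]:
    """Return (messages, None) on success or (None, reason) to abstain."""
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical schema: sessions(id, started_at).
        row = conn.execute(
            "SELECT id FROM sessions ORDER BY started_at DESC LIMIT 1"
        ).fetchone()
        if row is None:
            return None, "no session in state.db"
        messages: Messages = []
        # Hypothetical schema: messages(id, session_id, role, content, tool_calls).
        cursor = conn.execute(
            "SELECT role, content, tool_calls FROM messages "
            "WHERE session_id = ? ORDER BY id",
            (row[0],),
        )
        for role, content, raw_calls in cursor:
            try:
                calls = json.loads(raw_calls) if raw_calls else []
            except (TypeError, json.JSONDecodeError):
                return None, "malformed tool_calls column"
            if not isinstance(calls, list):
                return None, "malformed tool_calls column"
            messages.append({"role": role, "content": content, "tool_calls": calls})
        return messages, None
    except sqlite3.Error as exc:
        return None, f"state.db unreadable: {exc}"
    finally:
        conn.close()
```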

The Hermes memory tool's content-bearing actions are add and replace (full
set: add/replace/remove/read); there is no 'save' action. The Layer 2 filter
matched the nonexistent 'save', so it never scored any real call. Match
SAVE_ACTIONS = {add, replace} instead.
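
A small sketch of the corrected Layer 2 filter follows. SAVE_ACTIONS and the memory tool name come from the text above; the helper names, the "content" argument key, and the exact dict layouts of the nested and flat tool-call shapes are assumptions for illustration.

```python
import json
from typing import Iterable, List, Tuple

SAVE_ACTIONS = {"add", "replace"}


def _name_and_args(call: dict) -> Tuple[object, dict]:
    """Accept both the nested {'function': {...}} shape and a flat shape."""
    fn = call.get("function")
    if not isinstance(fn, dict):
        fn = call
    args = fn.get("arguments", {})
    if isinstance(args, str):  # arguments are often a JSON-encoded string
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            args = {}
    return fn.get("name"), args if isinstance(args, dict) else {}


def saved_contents(tool_calls: Iterable[dict]) -> List[str]:
    """Collect the text of memory calls that actually write content."""
    contents = []
    for call in tool_calls:
        name, args = _name_and_args(call)
        if name == "memory" and args.get("action") in SAVE_ACTIONS:
            contents.append(str(args.get("content", "")))
    return contents
```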

PromptModule.forward returned a Prediction without calling the predictor, so
GEPA captured no trace for passthrough.predict and make_reflective_dataset
raised 'No valid predictions found' every iteration — no candidate was ever
proposed. The tool path gets traces from synthetic examples; prompt sections
are pure-behavioral, so forward must call the passthrough to produce a trace.
The predictor output stays a placeholder; the real score is the metric's
behavioral branch.

…predictor

The forward() trace fix was necessary but insufficient: GEPA evaluates the
module in worker threads that don't inherit the saturation pre-flight's
dspy.context(lm=...), so the passthrough predictor raised 'No LM is loaded',
captured no trajectories, and never proposed. Set the global default LM via
dspy.configure (matching evolve_tool), which the parallelizer propagates to
worker threads. GEPA now scores the valset correctly and the proposer fires;
on a saturated target it correctly declines to mutate (no failures to ground
a change in).

…ting text

Lets evolution start from text other than the live section — e.g. a
deliberately-weakened or adversarial baseline to create headroom for
demonstrating a real mutation, or a regression-injection ablation. The live
section remains the splice/restore target (backed up + restored), so the user's
file is never left mutated; --apply still writes the evolved text. Verified
end-to-end: an adversarial 'never save' baseline scored 0.67, GEPA proposed a
corrected section, deploy gate measured 0.67 -> 1.00 (2W/0L).

…, state.db runner fix, adversarial-baseline proof

…ons, doc/comment accuracy, guards + tests

- Critical: a malformed tool_calls column in state.db now abstains (error set +
  logged) instead of reading as 'agent invoked no tools', which scored a
  DB-format regression as a fake behavioral failure and contaminated fitness.
- Surface previously-silent fallbacks: malformed tool-call args, a memory call
  with no save action, and an unparseable judge score now log.
- Doc/comment accuracy: memory action is add (not the nonexistent 'save'); tool
  schema enum is add/replace/remove (not 'read'); state.db tool_calls is the
  flat shape (nested handled for compat); the memoizing-scorer/validator splice
  cadence; the guard wraps pre-flight + GEPA; _closed_loop_task_id is set by
  PromptModule.forward.
- Reject a <2-task suite up front (empty GEPA trainset otherwise).
- Tests: parse_session_from_db malformed/corrupt/missing-table matrix; the
  _prompt_builder_guard restore round-trip, stale-backup refusal, and concurrent
  lock refusal; the single-task-suite guard.

read/write are the only members the evolution driver exercises (the runtime
override seam moved to HermesPromptSectionInstaller). name and list_sections
had no production consumer, so they're no longer part of the shared contract —
list_sections + SectionDescriptor remain as concrete conveniences on
HermesPromptSource for a future --list-sections affordance. Every member of a
Protocol is a cost on every future implementer; this keeps the contract to
exactly what's shared.
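
As an illustration of the narrowed contract, a two-member Protocol might look like the sketch below. The Protocol name, the method signatures, and the in-memory implementation are hypothetical; the commit message only establishes that read and write are the shared members and that list_sections stays on the concrete class.

```python
from typing import Dict, List, Protocol, runtime_checkable


@runtime_checkable
class PromptSource(Protocol):
    """Shared contract: only what the evolution driver actually calls."""

    def read(self, section: str) -> str: ...

    def write(self, section: str, text: str) -> None: ...


class InMemoryPromptSource:
    """Toy implementation; extra conveniences live here, not on the Protocol."""

    def __init__(self, sections: Dict[str, str]) -> None:
        self._sections = dict(sections)

    def read(self, section: str) -> str:
        return self._sections[section]

    def write(self, section: str, text: str) -> None:
        self._sections[section] = text

    def list_sections(self) -> List[str]:  # concrete-only helper
        return sorted(self._sections)
```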

@jramos merged commit 3585ae9 into main on Jun 2, 2026
4 checks passed
@jramos deleted the feat/prompt-section-evolution branch on June 2, 2026 15:00