feat(prompts): evolve Hermes system-prompt sections via GEPA + closed-loop validation #78
Merged
… + closed-loop deploy gate
Wires HermesPromptSectionInstaller + HermesAgentRunner + ClosedLoopValidator into a full-parity evolution pipeline for prompt sections. GEPA mutates via PromptSectionProposer; the inner loop scores through a serialized memoizing splice scorer; the deploy gate runs baseline-vs-evolved closed-loop on the holdout suite. Saturation pre-flight default-denies a saturated baseline; budget cap aborts on overrun. ClosedLoopValidator's Layer 2 hook becomes a per-task judge factory so the content judge can read each task's expected_save_content rubric. The memoizing scorer serializes splice+run under a lock, because dspy.Evaluate is multi-threaded but prompt_builder.py is a single shared file. PR automation is deferred for prompt sections (copying a full evolved file over origin/base would pollute the diff with the local override-hook commit).
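The serialized memoizing scorer described in this commit can be sketched as follows. This is a minimal illustration, not the PR's implementation: the class name SerializedMemoScorer, the splice_and_run callable, and its (section_text, task_id) signature are all hypothetical. It shows only the two properties the commit message names: one lock around splice+run, and a cache so an identical candidate/task pair is not re-run.

```python
import threading
from typing import Callable, Dict, Tuple


class SerializedMemoScorer:
    """Hypothetical stand-in for a memoizing scorer that serializes splice+run."""

    def __init__(self, splice_and_run: Callable[[str, str], float]) -> None:
        self._splice_and_run = splice_and_run
        self._lock = threading.Lock()
        self._cache: Dict[Tuple[str, str], float] = {}
        self.runs = 0  # number of real (uncached) agent runs

    def score(self, section_text: str, task_id: str) -> float:
        key = (section_text, task_id)
        # One lock covers both the cache check and the run, so two threads
        # can never mutate the single shared file at the same time.
        with self._lock:
            if key not in self._cache:
                self.runs += 1
                self._cache[key] = self._splice_and_run(section_text, task_id)
            return self._cache[key]


def fake_splice_and_run(section_text: str, task_id: str) -> float:
    # Placeholder for "splice the section, run the agent, score the session".
    return 1.0 if "save" in section_text else 0.0


scorer = SerializedMemoScorer(fake_splice_and_run)
threads = [
    threading.Thread(target=scorer.score, args=("always save facts", "t1"))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Holding the lock across the whole run trades throughput for safety, which fits a setting where the evaluator is multi-threaded but the mutated file is a single shared resource.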
Modern hermes -z one-shot mode is ephemeral — it prints only the final
response and no longer writes session_*.json. Sessions now persist to a
SQLite state.db in HERMES_HOME. The runner globbed for the obsolete JSON
files, so every closed-loop run abstained ('no session JSON'). Read the
most-recent session's messages from state.db instead; the tool_calls column
holds the same OpenAI-nested shape the extractors already parse, so the
message-extraction core is shared between the JSON and DB paths. Unblocks all
closed-loop validation (tools, skills, and prompt sections).
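Reading the most recent session from a SQLite database might look like the sketch below. The schema is an assumption made for illustration only: the table names sessions and messages and the columns started_at, session_id, role, and content are hypothetical (only the tool_calls column and the state.db file are named in the commit message), and the function name latest_session_messages is invented here.

```python
import json
import sqlite3
from typing import Any, Dict, List


def latest_session_messages(conn: sqlite3.Connection) -> List[Dict[str, Any]]:
    """Return the messages of the most recent session (assumed schema)."""
    row = conn.execute(
        "SELECT id FROM sessions ORDER BY started_at DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return []  # no session recorded at all
    rows = conn.execute(
        "SELECT role, content, tool_calls FROM messages "
        "WHERE session_id = ? ORDER BY id",
        (row[0],),
    ).fetchall()
    return [
        {
            "role": role,
            "content": content,
            # tool_calls is stored as JSON text; NULL means no calls.
            "tool_calls": json.loads(tool_calls) if tool_calls else [],
        }
        for role, content, tool_calls in rows
    ]


# In-memory demo database using the assumed schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sessions (id TEXT PRIMARY KEY, started_at REAL);
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY, session_id TEXT,
        role TEXT, content TEXT, tool_calls TEXT
    );
    """
)
call = [{
    "id": "c1",
    "type": "function",
    "function": {
        "name": "memory",
        "arguments": json.dumps({"action": "add", "content": "User prefers tabs"}),
    },
}]
insert = "INSERT INTO messages (session_id, role, content, tool_calls) VALUES (?, ?, ?, ?)"
conn.execute("INSERT INTO sessions VALUES ('old', 1.0)")
conn.execute("INSERT INTO sessions VALUES ('new', 2.0)")
conn.execute(insert, ("old", "assistant", "stale", None))
conn.execute(insert, ("new", "user", "remember I prefer tabs", None))
conn.execute(insert, ("new", "assistant", "", json.dumps(call)))
messages = latest_session_messages(conn)
```

Because the returned dictionaries carry the same nested tool_calls shape as the old JSON files, downstream extractors would not need to know which storage path produced them.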
The Hermes memory tool's content-bearing actions are add and replace (full
set: add/replace/remove/read); there is no 'save' action. The Layer 2 filter
matched the nonexistent 'save', so it never scored any real call. Match
SAVE_ACTIONS = {add, replace} instead.
PromptModule.forward returned a Prediction without calling the predictor, so GEPA captured no trace for passthrough.predict and make_reflective_dataset raised 'No valid predictions found' every iteration — no candidate was ever proposed. The tool path gets traces from synthetic examples; prompt sections are pure-behavioral, so forward must call the passthrough to produce a trace. The predictor output stays a placeholder; the real score is the metric's behavioral branch.
…predictor
The forward() trace fix was necessary but insufficient: GEPA evaluates the module in worker threads that don't inherit the saturation pre-flight's dspy.context(lm=...), so the passthrough predictor raised 'No LM is loaded', captured no trajectories, and never proposed. Set the global default LM via dspy.configure (matching evolve_tool), which the parallelizer propagates to worker threads. GEPA now scores the valset correctly and the proposer fires; on a saturated target it correctly declines to mutate (no failures to ground a change in).
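The failure mode in this commit can be illustrated with a standard-library analogue. This sketch does not use DSPy and makes no claim about how DSPy stores its settings; configure, context, and current_lm are hypothetical stand-ins that model only the behaviour the commit describes: a scoped override set in one thread is invisible to a worker thread, while a global default is visible everywhere.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from typing import Iterator, Optional

_global_lm: Optional[str] = None
_local = threading.local()


def configure(lm: str) -> None:
    """Set a process-wide default (analogue of a global configure call)."""
    global _global_lm
    _global_lm = lm


@contextmanager
def context(lm: str) -> Iterator[None]:
    """Scoped override stored per thread (analogue of a context manager)."""
    previous = getattr(_local, "lm", None)
    _local.lm = lm
    try:
        yield
    finally:
        _local.lm = previous


def current_lm() -> str:
    lm = getattr(_local, "lm", None) or _global_lm
    if lm is None:
        raise RuntimeError("No LM is loaded")
    return lm


def worker_sees_lm() -> bool:
    try:
        current_lm()
        return True
    except RuntimeError:
        return False


# The override lives only in the calling thread, so the worker fails.
with context("judge-lm"):
    with ThreadPoolExecutor(max_workers=1) as pool:
        inside_context = pool.submit(worker_sees_lm).result()

# A global default is visible from any thread.
configure("judge-lm")
with ThreadPoolExecutor(max_workers=1) as pool:
    after_configure = pool.submit(worker_sees_lm).result()
```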
…ting text
Lets evolution start from text other than the live section, e.g. a deliberately-weakened or adversarial baseline to create headroom for demonstrating a real mutation, or a regression-injection ablation. The live section remains the splice/restore target (backed up + restored), so the user's file is never left mutated; --apply still writes the evolved text. Verified end-to-end: an adversarial 'never save' baseline scored 0.67, GEPA proposed a corrected section, deploy gate measured 0.67 -> 1.00 (2W/0L).
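The splice/restore cycle mentioned here (described in the PR summary as repr-based, ast.parse-guarded, and atomically written) can be sketched with the standard library. The helper names splice_constant and atomic_write are hypothetical, and the real installer also uses flock, sha-drift checks, and stale-backup refusal, none of which this sketch attempts.

```python
import ast
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def splice_constant(source: str, name: str, new_value: str) -> str:
    """Replace a top-level `name = <string>` assignment with repr(new_value)."""
    for node in ast.parse(source).body:
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == name
        ):
            lines = source.splitlines(keepends=True)
            # repr() yields a valid Python string literal for any text.
            lines[node.lineno - 1 : node.end_lineno] = [f"{name} = {new_value!r}\n"]
            candidate = "".join(lines)
            ast.parse(candidate)  # guard: refuse to emit unparseable source
            return candidate
    raise KeyError(name)


with tempfile.TemporaryDirectory() as tmpdir:
    target = Path(tmpdir) / "prompt_builder.py"
    original = 'MEMORY_GUIDANCE = (\n    "Save durable facts."\n)\nOTHER = 1\n'
    target.write_bytes(original.encode("utf-8"))
    backup = target.read_bytes()  # taken before any mutation
    try:
        evolved = splice_constant(original, "MEMORY_GUIDANCE", "Never save.")
        atomic_write(target, evolved.encode("utf-8"))
        spliced = target.read_text(encoding="utf-8")
        # ... the agent would run against the spliced file here ...
    finally:
        atomic_write(target, backup)  # always restore the user's file
    restored = target.read_bytes()
```

Restoring in a finally block is what keeps the file from being left mutated when a run fails partway through.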
…, state.db runner fix, adversarial-baseline proof
…ons, doc/comment accuracy, guards + tests
- Critical: a malformed tool_calls column in state.db now abstains (error set + logged) instead of reading as 'agent invoked no tools', which scored a DB-format regression as a fake behavioral failure and contaminated fitness.
- Surface previously-silent fallbacks: malformed tool-call args, a memory call with no save action, and an unparseable judge score now log.
- Doc/comment accuracy: memory action is add (not the nonexistent 'save'); tool schema enum is add/replace/remove (not 'read'); state.db tool_calls is the flat shape (nested handled for compat); the memoizing-scorer/validator splice cadence; the guard wraps pre-flight + GEPA; _closed_loop_task_id is set by PromptModule.forward.
- Reject a <2-task suite up front (empty GEPA trainset otherwise).
- Tests: parse_session_from_db malformed/corrupt/missing-table matrix; the _prompt_builder_guard restore round-trip, stale-backup refusal, and concurrent lock refusal; the single-task-suite guard.
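The abstain-on-malformed behaviour from the first item can be sketched as below. ParsedCalls and parse_tool_calls_column are hypothetical names; the point is only that a parse failure sets an error (so the caller abstains) rather than returning an empty list, which would be indistinguishable from "the agent invoked no tools".

```python
import json
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

log = logging.getLogger("closed_loop_sketch")


@dataclass
class ParsedCalls:
    calls: List[Dict[str, Any]] = field(default_factory=list)
    error: Optional[str] = None  # set => the run should abstain, not score

    @property
    def abstain(self) -> bool:
        return self.error is not None


def parse_tool_calls_column(raw: Optional[str]) -> ParsedCalls:
    if raw is None or raw == "":
        return ParsedCalls()  # genuinely no tool calls
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        log.warning("malformed tool_calls column: %s", exc)
        return ParsedCalls(error=f"malformed tool_calls: {exc}")
    if not isinstance(value, list) or not all(isinstance(c, dict) for c in value):
        log.warning("tool_calls column is not a list of objects")
        return ParsedCalls(error="malformed tool_calls: expected a list of objects")
    return ParsedCalls(calls=value)


no_calls = parse_tool_calls_column(None)
good = parse_tool_calls_column('[{"name": "memory"}]')
corrupt = parse_tool_calls_column('[{"name": "memory"')
wrong_shape = parse_tool_calls_column('{"name": "memory"}')
```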
read/write are the only members the evolution driver exercises (the runtime override seam moved to HermesPromptSectionInstaller). name and list_sections had no production consumer, so they're no longer part of the shared contract — list_sections + SectionDescriptor remain as concrete conveniences on HermesPromptSource for a future --list-sections affordance. Every member of a Protocol is a cost on every future implementer; this keeps the contract to exactly what's shared.
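A narrowed contract along these lines might be expressed with typing.Protocol, as in the sketch below. The Protocol name PromptSource, the method signatures, and the InMemoryPromptSource class are assumptions for illustration; only the read/write membership and the idea that list_sections stays a concrete convenience come from the commit message.

```python
from typing import Dict, List, Protocol, runtime_checkable


@runtime_checkable
class PromptSource(Protocol):
    """Shared contract: only what the evolution driver exercises."""

    def read(self, section: str) -> str: ...

    def write(self, section: str, text: str) -> None: ...


class InMemoryPromptSource:
    """Hypothetical implementer; satisfies the Protocol structurally."""

    def __init__(self, sections: Dict[str, str]) -> None:
        self._sections = dict(sections)

    def read(self, section: str) -> str:
        return self._sections[section]

    def write(self, section: str, text: str) -> None:
        self._sections[section] = text

    # Concrete convenience, deliberately outside the shared contract.
    def list_sections(self) -> List[str]:
        return sorted(self._sections)


def evolve_once(source: PromptSource, section: str, new_text: str) -> str:
    """Driver code depends only on read/write."""
    previous = source.read(section)
    source.write(section, new_text)
    return previous


src = InMemoryPromptSource({"MEMORY_GUIDANCE": "Save durable facts."})
old_text = evolve_once(src, "MEMORY_GUIDANCE", "Save durable user facts.")
```

An implementer that offers extra methods still satisfies the Protocol, while a future implementer is obliged to provide only the two members the driver calls.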
Phase 3 of the framework: evolve a named section of an agent's system prompt (first target: Hermes MEMORY_GUIDANCE) with the same GEPA + deploy-gate discipline Phases 1–2 brought to skills and tool descriptions.

What it ships

A new CLI, python -m evolution.prompts.evolve_prompt_section, that evolves a top-level string constant in agent/prompt_builder.py end-to-end:

- The candidate section is spliced into prompt_builder.py (repr-based, ast.parse-guarded, atomic write), the real agent runs against it, and the file is restored from an atomic backup. This reuses Phase 2's ClosedLoopValidator machinery (flock + sha-drift + stale-backup refusal) via a new HermesPromptSectionInstaller, so there's no parallel validator and no dependency on any upstream Hermes change.
- Layer 1 = the agent called memory (trigger membership); Layer 2 = an LLM judge scores the saved content (memory(action=add/replace)) against each task's expected_save_content rubric, threaded as a per-task factory.
- GEPA evolves the section through PromptModule + a sentinel-preserving PromptSectionProposer, scored by a serialized memoizing splice scorer (the shared file is mutated under a lock since DSPy evaluates multi-threaded).

Validation
Demonstrated end-to-end against real hermes -z:

- MEMORY_GUIDANCE scores 1.0/6 on the holdout: the saturation pre-flight correctly classifies no_headroom and default-denies (regression-catching, not improvement-finding, on a tuned prompt).
- From an adversarial baseline (--baseline-override-file), the loop produces a real, grounded mutation: 0.67 → 1.00 holdout pass-rate, 2 wins / 0 losses → deploy, with the evolved section shorter than the baseline.
- prompt_builder.py is restored byte-clean after every run.

Notable
- hermes -z one-shot is ephemeral and no longer writes session_*.json; sessions persist to a SQLite state.db. HermesAgentRunner now reads sessions from state.db, which unblocks all closed-loop validation (skills + tools + prompts).
- PR automation is deferred for prompt sections; --apply writes the evolved section in place.
- … PLAN.md.

1232 tests passing.
Docs (README + reference docs) and the Phase 3 validation report will follow in a separate docs PR, matching the Phase 2 sequencing.