Merged
23 commits
1a5714e
feat(prompts): add PromptSource protocol + SectionDescriptor
jramos Jun 1, 2026
2b857ac
feat(prompts): HermesPromptSource AST-based read
jramos Jun 1, 2026
c9f48e7
feat(prompts): HermesPromptSource AST-based write
jramos Jun 1, 2026
22fac1c
feat(prompts): HermesPromptSource section enumeration
jramos Jun 1, 2026
9f2f4a6
feat(validation): extend Task with expected_save_content
jramos Jun 1, 2026
b217e8b
feat(validation): capture tool call args in session parser
jramos Jun 1, 2026
3385837
feat(validation): HermesPromptSectionInstaller
jramos Jun 1, 2026
552a408
feat(validation): optional Layer 2 judge in ClosedLoopValidator + sco…
jramos Jun 1, 2026
8435864
feat(prompts): SaveCallJudge signature + scorer
jramos Jun 1, 2026
2365031
feat(prompts): PromptModule DSPy wrapper
jramos Jun 1, 2026
625f979
feat(prompts): GEPA fitness metric + memoizing splice scorer
jramos Jun 1, 2026
725908b
feat(prompts): memory_guidance dataset builder + curated eval suite
jramos Jun 1, 2026
d17ead8
feat(prompts): PromptSectionProposer (sentinel-preserving GEPA propos…
jramos Jun 1, 2026
6b13272
feat(prompts): evolve_prompt_section CLI — GEPA + saturation + budget…
jramos Jun 1, 2026
fc853ab
fix(validation): read agent sessions from hermes state.db
jramos Jun 1, 2026
63f3fd2
fix(prompts): judge real memory actions (add/replace), not 'save'
jramos Jun 1, 2026
f685314
fix(prompts): invoke passthrough predictor so GEPA can reflect
jramos Jun 1, 2026
3c96e57
test(validation): use real memory actions (add) in compound-verdict f…
jramos Jun 1, 2026
384a6d4
fix(prompts): configure global LM so GEPA worker threads can run the …
jramos Jun 2, 2026
d44b81f
feat(prompts): --baseline-override-file to evolve from arbitrary star…
jramos Jun 2, 2026
621b23d
docs(plan): Phase 3 deviations — splice-and-restore, compound verdict…
jramos Jun 2, 2026
b681c08
fix(prompts,validation): address PR review — abstain on corrupt sessi…
jramos Jun 2, 2026
5658e7b
refactor(prompts): narrow PromptSource Protocol to read + write
jramos Jun 2, 2026
22 changes: 22 additions & 0 deletions PLAN.md
@@ -466,6 +466,8 @@ These descriptions are sent with every API call as part of the tool schema — e

**Goal:** Optimize the sections of the system prompt that guide agent behavior.

**Status:** ✅ Complete (MEMORY_GUIDANCE proof point). See "Deviations from plan" at the end of this section.

**Prerequisite:** Phase 2 gate passed — benchmark gating validated, GEPA producing sensible text mutations.

**Week 1 (Build):** Build section-as-DSPy-parameter wrapper for the 5 evolvable prompt sections. Build behavioral test suite generator. This is the riskiest tier so far — system prompt changes affect everything.
@@ -535,6 +537,26 @@ The system prompt is assembled in `run_agent.py` / `agent/prompt_builder.py` fro
- Identity section must retain core traits (helpful, direct, admits uncertainty)
- Platform hints must remain platform-accurate (don't tell Telegram to use ANSI codes)

**Deviations from plan (Phase 3):**

1. **Integration is in-place splice-and-restore, not an env-var hook or a plugin.** The design's primary path routed candidate overrides through an upstream `HERMES_PROMPT_OVERRIDES_JSON` env var; that hook was not accepted upstream, so depending on it would make the framework a local-only patch that silently no-ops on any hermes pull. A plugin alternative was ruled out as non-viable: consumers bind the constants at import time (`from agent.prompt_builder import MEMORY_GUIDANCE` in `run_agent.py` and `agent/system_prompt.py`), so a plugin's `register()` runs too late to reach them. Phase 3 instead splices the candidate directly into `agent/prompt_builder.py` (byte-precise AST replacement via `repr()`, parse-checked) and restores from an atomic backup, reusing Phase 2's `ClosedLoopValidator` flock + sha-drift + stale-backup machinery. No upstream dependency; runs against stock Hermes. This also **collapsed the planned parallel `PromptSectionValidator`** into a small `HermesPromptSectionInstaller` (an `ArtifactInstaller`) plus a one-method Layer-2 hook on the shared validator — less code than the design called for.

2. **One target section per run; MEMORY_GUIDANCE is the only proof point.** Joint multi-section optimization and identity/persona evolution are deferred — joint runs carry Phase 2's "stealing selections" risk, and `DEFAULT_AGENT_IDENTITY` has no tool-call anchor for the verdict. The `PromptSource` abstraction supports the other string sections (`SKILLS_GUIDANCE`, `SESSION_SEARCH_GUIDANCE`, etc.) with no refactor; dict-typed sections like `PLATFORM_HINTS` are out of scope for v1 (string constants only).

3. **Verdict is compound (tool-membership + LLM content judge), threaded per-task.** Layer 1 is the Phase 2 expected/forbidden rule on whether `memory` was invoked; Layer 2 is an LLM judge scoring the saved content against each task's `expected_save_content` rubric. The validator builds the judge per-task (a factory) so the content judge sees the task's rubric and message — a fixed global judge couldn't. Note the real Hermes `memory` tool actions are **`add`/`replace`** (content-bearing), not `save`; the full set is add/replace/remove/read.

4. **Eval suite ships as a curated 12-task golden set, not 50 synthetic + 10 golden.** A hand-authored `memory_guidance.jsonl` spans the five categories (save-preference, save-correction, dont-save-task-progress, dont-save-completed-work-log, declarative-vs-imperative). The synthetic generator (`build_memory_guidance_dataset`) is built and unit-tested, but full synthetic expansion via a funded generation run is deferred — curation gives a higher-signal first suite and avoids upfront generation spend.

5. **PR automation is deferred for prompt sections.** `create_pr` atomically copies a full evolved artifact over `origin/<base>`'s file; deriving an evolved `prompt_builder.py` from the local checkout would carry the unmerged override-hook commit into the PR diff. `--create-pr` is accepted but records a skipped block; the deploy path is `--apply` (writes the evolved section into the live file) plus a manual PR. A section-scoped PR path (splice into `origin/<base>`'s file, not the local one) is future work.

6. **The shared closed-loop runner had to be rebuilt for current Hermes.** Surfaced by the Phase 3 end-to-end smoke: `hermes -z` one-shot mode is now ephemeral — it prints only the final response and no longer writes `session_*.json`; sessions persist to a SQLite `state.db` in `HERMES_HOME`. `HermesAgentRunner` globbed for the obsolete JSON files, so **every** closed-loop run had been silently abstaining ("no session JSON") across Phases 1–3. The runner now reads the most-recent session's messages from `state.db` (the `tool_calls` column carries the same OpenAI-nested shape the extractors already parse). This is a shared-infrastructure fix that unblocks all closed-loop validation, not just prompts.

7. **Behavioral eval is serialized, and agent-subprocess cost is invisible to the budget cap.** Because every candidate is spliced into one shared `prompt_builder.py`, the GEPA inner-loop scorer serializes splice+run under a lock (DSPy's evaluator is multi-threaded; the shared file is not). Per-section closed-loop is therefore effectively serial — an accepted v1 cost of the splice-and-restore model. The agent's own LM spend happens inside the `hermes` child process, invisible to the in-process cost ledger, so `--max-cost-usd` bounds only judge + reflection + passthrough spend; `sessions.actual_cost_usd` in `state.db` could close that gap later.

8. **Saturation default-deny confirmed on a capable agent; a demonstrated improvement required an adversarial baseline.** With `gpt-5.4-mini`, both the live `MEMORY_GUIDANCE` and a *passively*-weakened baseline scored 1.0 on the 6-task holdout: `no_headroom`, correctly default-denied. This matches Phase 2's "regression-catching, not improvement-finding on tuned artifacts" finding and the binary model-tier effect (a capable agent saves correctly regardless of vague guidance). A real mutation was demonstrated only by *actively misdirecting* the baseline: an adversarial "never proactively save" section scored 0.67, GEPA's reflective proposer inverted it to restore proactive saving (and made it shorter), and the deploy gate measured **0.67 → 1.00, 2 wins / 0 losses → deploy**. The `--baseline-override-file` flag enables this ablation (and regression-injection testing generally) without mutating the live section.

9. **Benchmark gating again not built in (same as Phases 1–2).** The built-in deploy gate is paired-bootstrap CI plus the dual-condition rule on the holdout; `--benchmark-cmd` remains the external-benchmark hook. TBLite / YC-Bench wiring is left to the user's `--benchmark-cmd`.
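
The splice-and-restore flow from deviation 1 can be sketched roughly as follows. This is a minimal illustration, not the `HermesPromptSource` implementation: `splice_constant`, `atomic_write`, and `spliced` are hypothetical names, the target is assumed to be a UTF-8 module-level string constant assigned exactly once, the backup is held in memory rather than on disk, and the flock, sha-drift, and stale-backup checks attributed to `ClosedLoopValidator` are left out.

```python
import ast
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


def splice_constant(source: bytes, name: str, new_value: str) -> bytes:
    """Replace the string literal assigned to module-level `name` (hypothetical helper)."""
    tree = ast.parse(source)
    for node in tree.body:
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == name
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            value = node.value
            break
    else:
        raise KeyError(f"no module-level string constant named {name!r}")

    # ast column offsets are UTF-8 byte offsets, so index into the raw bytes.
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    start = line_starts[value.lineno - 1] + value.col_offset
    end = line_starts[value.end_lineno - 1] + value.end_col_offset

    # repr() yields a valid Python literal, so no manual escaping is needed.
    result = source[:start] + repr(new_value).encode("utf-8") + source[end:]
    ast.parse(result)  # parse check: refuse to hand back broken Python
    return result


def atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename it over `path`."""
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    os.replace(tmp, path)


@contextmanager
def spliced(path: Path, name: str, candidate: str):
    """Install `candidate` for the duration of the block, then restore the original."""
    original = path.read_bytes()
    atomic_write(path, splice_constant(original, name, candidate))
    try:
        yield
    finally:
        atomic_write(path, original)
```

Under these assumptions, a closed-loop run would wrap the agent subprocess in `with spliced(path, "MEMORY_GUIDANCE", candidate):`, so the file is restored byte-for-byte even if the run raises. Everything outside the replaced literal is left untouched, which keeps the diff against the stock file limited to the one section.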

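The compound verdict from deviation 3 might look something like the sketch below. It is illustrative only: the `Task` dataclass, the flattened `{"name": ..., "args": {...}}` tool-call shape, the `action` and `content` argument keys, and the 0.5 pass threshold are all assumptions made for the example, not the validator's actual types or settings.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Task:
    """Hypothetical stand-in for the closed-loop task record."""
    user_message: str
    expected_tools: list[str] = field(default_factory=list)
    forbidden_tools: list[str] = field(default_factory=list)
    expected_save_content: Optional[str] = None  # rubric, not exact text


Judge = Callable[[str], float]          # saved content -> score in [0, 1]
JudgeFactory = Callable[[Task], Judge]  # built per task so it sees the rubric

CONTENT_ACTIONS = {"add", "replace"}    # the content-bearing memory actions


def layer1_passes(task: Task, tool_calls: list[dict]) -> bool:
    """Expected tools were all called and no forbidden tool was called."""
    called = {call["name"] for call in tool_calls}
    return set(task.expected_tools) <= called and not (called & set(task.forbidden_tools))


def saved_contents(tool_calls: list[dict]) -> list[str]:
    """Collect content from memory calls whose action carries content."""
    return [
        call["args"].get("content", "")
        for call in tool_calls
        if call["name"] == "memory" and call["args"].get("action") in CONTENT_ACTIONS
    ]


def compound_verdict(
    task: Task,
    tool_calls: list[dict],
    judge_factory: JudgeFactory,
    threshold: float = 0.5,  # assumed cut-off for this sketch
) -> bool:
    if not layer1_passes(task, tool_calls):
        return False
    if task.expected_save_content is None:
        return True  # no rubric, so Layer 1 alone decides
    contents = saved_contents(tool_calls)
    if not contents:
        return False  # memory was called, but nothing content-bearing was saved
    judge = judge_factory(task)
    return max(judge(text) for text in contents) >= threshold
```

Passing a factory rather than a single judge is what lets each judgment depend on the task's own rubric and message, which is the point the text makes about a fixed global judge.
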
### Phase 4: Code Evolution via Darwinian Evolver

**Goal:** Evolve tool implementation code for better performance and fewer bugs.
99 changes: 99 additions & 0 deletions evolution/core/dataset_builder.py
@@ -485,3 +485,102 @@ def load(path: Path, seed: int = 42) -> EvalDataset:
val_ratio=0.25,
holdout_ratio=0.25,
)


MEMORY_GUIDANCE_CATEGORIES = (
    "save-preference",
    "save-correction",
    "dont-save-task-progress",
    "dont-save-completed-work-log",
    "declarative-vs-imperative",
)

_MEMORY_GUIDANCE_CATEGORY_PROMPTS = {
    "save-preference": (
        "Generate ONE closed-loop eval task (category: save-preference) where the "
        "user explicitly states a durable preference the agent SHOULD save to "
        "memory. Output a single JSON object with fields: user_message, "
        "expected_tools=[\"memory\"], expected_save_content (a rubric describing "
        "what a good save would look like — not exact text)."
    ),
    "save-correction": (
        "Generate ONE closed-loop eval task (category: save-correction) where the "
        "user corrects the agent on a recurring pattern (e.g. 'no, I use uv not "
        "pip'). The agent SHOULD save the correction. Output a single JSON object "
        "with fields: user_message, expected_tools=[\"memory\"], "
        "expected_save_content."
    ),
    "dont-save-task-progress": (
        "Generate ONE closed-loop eval task (category: dont-save-task-progress) "
        "where the user asks the agent to complete a task (write code, fix a bug). "
        "The agent SHOULD NOT save task progress to memory. Output a single JSON "
        "object with fields: user_message, expected_tools=[], "
        "forbidden_tools=[\"memory\"]."
    ),
    "dont-save-completed-work-log": (
        "Generate ONE closed-loop eval task (category: dont-save-completed-work-log) "
        "where the user asks for a summary of work done. The agent SHOULD NOT log "
        "the work to memory. Output a single JSON object with fields: user_message, "
        "expected_tools=[], forbidden_tools=[\"memory\"]."
    ),
    "declarative-vs-imperative": (
        "Generate ONE closed-loop eval task (category: declarative-vs-imperative) "
        "where the user states a preference in imperative form ('always respond "
        "concisely'). The agent SHOULD save it in declarative form ('user prefers "
        "concise responses'). Output a single JSON object with fields: "
        "user_message, expected_tools=[\"memory\"], expected_save_content "
        "(specifying the declarative-phrasing rubric)."
    ),
}


def build_memory_guidance_dataset(
    *,
    lm_call,
    n_per_category: int = 10,
) -> list[dict]:
    """Generate synthetic MEMORY_GUIDANCE eval tasks across the 5 categories.

    ``lm_call`` is a callable taking a prompt string and returning a JSON
    object (one task) as text. The builder issues ``n_per_category`` calls
    per category and stamps a unique ``task_id`` on each parsed row so the
    output is a valid closed-loop suite regardless of what the LM emits for
    the id. Rows the LM returns that don't parse as a JSON object are
    skipped (logged), not fatal — a single noisy generation shouldn't abort
    the whole build.

    Returns a flat list of Task-shaped dicts ready to write to a JSONL suite
    (consumable by ``TaskSuite.from_jsonl``).
    """
    out: list[dict] = []
    for category in MEMORY_GUIDANCE_CATEGORIES:
        prompt = _MEMORY_GUIDANCE_CATEGORY_PROMPTS[category]
        for index in range(n_per_category):
            raw = lm_call(prompt)
            row = _parse_memory_task_row(raw)
            if row is None:
                logger.warning(
                    "build_memory_guidance_dataset: unparseable row for "
                    "category %s index %d", category, index,
                )
                continue
            row["task_id"] = f"{category}-{index:03d}"
            row.setdefault("expected_tools", [])
            out.append(row)
    return out


def _parse_memory_task_row(raw: str):
    """Parse a single JSON object from an LM response. Returns the dict, or
    None if the text isn't a JSON object (tolerant of fenced/extra prose)."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        match = re.search(r"\{.*\}", str(raw), re.DOTALL)
        if not match:
            return None
        try:
            obj = json.loads(match.group())
        except json.JSONDecodeError:
            return None
    return obj if isinstance(obj, dict) else None
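
To see what the tolerant fallback accepts and rejects, the following standalone sketch mirrors the logic of `_parse_memory_task_row` under the stand-in name `parse_row` and runs it on a few representative LM responses. The sample inputs are invented for illustration and are not taken from a real generation run.

```python
import json
import re


def parse_row(raw):
    """Stand-in copy of the parser logic, so the example is self-contained."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Fallback: take the span from the first "{" to the last "}".
        match = re.search(r"\{.*\}", str(raw), re.DOTALL)
        if not match:
            return None
        try:
            obj = json.loads(match.group())
        except json.JSONDecodeError:
            return None
    return obj if isinstance(obj, dict) else None


fence = "`" * 3  # built at runtime to keep a literal fence out of this example
fenced = f'{fence}json\n{{"user_message": "I use uv, not pip"}}\n{fence}'
trailing = '{"user_message": "be concise"} Hope that helps!'
two_objects = 'First: {"a": 1} and second: {"b": 2}'

print(parse_row(fenced))       # dict recovered from inside the fence
print(parse_row(trailing))     # dict recovered despite the trailing prose
print(parse_row(two_objects))  # None: the greedy span covers both objects
print(parse_row("[1, 2]"))     # None: valid JSON, but not an object
print(parse_row(None))         # None: non-string input is tolerated
```

Because the fallback pattern is greedy, a response containing two separate objects is skipped entirely rather than partially parsed. That matches the builder's skip-and-log behavior, at the cost of dropping rows where the LM emitted more than one candidate.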
2 changes: 1 addition & 1 deletion evolution/prompts/__init__.py
@@ -1 +1 @@
-"""Phase placeholder: prompts evolution."""
+"""Phase 3: system prompt section evolution."""