18 changes: 17 additions & 1 deletion README.md
@@ -169,6 +169,22 @@ The framework parses every `*_SCHEMA = {...}` and `*_SCHEMAS = [...]` declaratio

With `--apply`, the evolved description is spliced into the source file's bytes at the original position — comments, formatting, and unrelated tools are untouched. Multi-line parenthesized concatenations collapse to a single triple-quoted string at the same indent.

### Evolve a system prompt section

For Hermes Agent, evolve a named section of the assembled system prompt — any top-level string constant in `agent/prompt_builder.py` (e.g. `MEMORY_GUIDANCE`, which governs when and what the agent saves to memory):

```bash
uv run python -m evolution.prompts.evolve_prompt_section \
--section MEMORY_GUIDANCE \
--hermes-repo /path/to/hermes-agent \
--tasks evolution/validation/suites/memory_guidance.jsonl \
--iterations 10
```

Unlike skill and tool evolution — where the deploy gate can lean on a synthetic LLM-judge signal — a prompt section is evaluated **purely behaviorally**: every candidate is spliced into the live `prompt_builder.py` and scored by running the real agent (`hermes -z`) against the task suite. The verdict is compound — Layer 1 checks whether the agent invoked the expected tool (e.g. `memory`), and Layer 2 runs an LLM judge over the saved content against each task's `expected_save_content` rubric. The candidate is spliced in only for the duration of the run; the file is restored byte-for-byte afterward (atomic backup + flock + checksum-drift detection, shared with the tool-description path).
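The two-layer verdict described above can be sketched roughly as follows. This is an illustration only, not the project's `score_task` code: `ToolCall`, `compound_verdict`, and `keyword_judge` are hypothetical names, the keyword matcher stands in for the real LLM judge, and treating one passing save as sufficient is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation recorded from an agent run (hypothetical shape)."""
    name: str
    content: str


def compound_verdict(
    calls: Sequence[ToolCall],
    expected_tool: str,
    rubric: str,
    judge: Callable[[str, str], bool],
) -> bool:
    # Layer 1: did the agent invoke the expected tool at all?
    matching = [call for call in calls if call.name == expected_tool]
    if not matching:
        return False
    # Layer 2: does any saved content satisfy the rubric?
    # Assumption: a single passing call is enough; the real gate may differ.
    return any(judge(call.content, rubric) for call in matching)


def keyword_judge(content: str, rubric: str) -> bool:
    """Stand-in for an LLM judge: pass when every rubric word appears."""
    return all(word in content.lower() for word in rubric.lower().split())
```

Under these assumptions, a run that calls `memory` with unrelated content fails at Layer 2, and a run that never calls `memory` fails at Layer 1 without the judge being consulted.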

`--apply` writes the evolved section into `prompt_builder.py` in place; results land in `output/prompts/<section>/<timestamp>/`. PR automation (`--create-pr`) is not yet wired for prompt sections — use `--apply` plus a manual PR. To demonstrate the loop on an already-tuned section (which the saturation pre-flight will otherwise correctly default-deny as having no headroom), `--baseline-override-file` starts evolution from arbitrary text — e.g. a deliberately-weakened baseline that gives GEPA real failures to learn from.

### Mine real session history for evals

For skill evolution:
@@ -331,7 +347,7 @@ Cost: each task is one `hermes -z` run (~$0.05–$0.50). The bundled `patch.json
|-------|--------|--------|--------|
| **Phase 1** | Skill files (SKILL.md) | DSPy + GEPA | ✅ [Validated](reports/phase1_validation_report.pdf) |
| **Phase 2** | Tool descriptions + dual-signal deploy gate | DSPy + GEPA | ✅ [Validated](reports/phase2_validation_report.pdf) |
| **Phase 3** | System prompt sections | DSPy + GEPA | 🔲 Planned |
| **Phase 3** | System prompt sections | DSPy + GEPA | ✅ [Validated](reports/phase3_validation_report.pdf) |
| **Phase 4** | Tool implementation code | Darwinian Evolver | 🔲 Planned |
| **Phase 5** | Continuous improvement loop | Automated pipeline | 🔲 Planned |

32 changes: 30 additions & 2 deletions docs/architecture.md
@@ -56,10 +56,20 @@ graph TB
hermes_source[tools.hermes_source<br/>Hermes *_SCHEMA AST adapter]
end

subgraph prompts_tier[Prompt Tier]
evolve_prompt[prompts.evolve_prompt_section<br/>main + evolve]
prompt_module[prompts.prompt_module<br/>PromptModule + sentinels]
prompt_proposer[prompts.prompt_proposer<br/>PromptSectionProposer]
prompt_judge[prompts.prompt_judge<br/>SaveCallJudge + judge_save_calls<br/>+ prompt fitness/splice scorer]
prompt_source[prompts.prompt_source<br/>PromptSource protocol + SectionDescriptor]
hermes_prompt_source[prompts.hermes_prompt_source<br/>HermesPromptSource — prompt_builder.py AST]
end

subgraph validation_subsystem[Closed-loop validation]
validator[validation.validator<br/>ClosedLoopValidator]
hermes_runner[validation.hermes_runner<br/>hermes -z subprocess]
installer[validation.artifact_installer<br/>HermesToolDescriptionInstaller]
installer[validation.artifact_installer<br/>HermesToolDescriptionInstaller +<br/>HermesPromptSectionInstaller]
savejudge[validation.report<br/>score_task Layer-2 judge hook]
report[validation.report<br/>ValidationReport + decision]
task[validation.task<br/>Task + TaskSuite]
cl_cli[validation.closed_loop<br/>CLI]
@@ -117,10 +127,26 @@ graph TB
tool_judge --> fitness
tool_proposer --> budget

evolve_prompt --> prompt_module
evolve_prompt --> prompt_proposer
evolve_prompt --> prompt_judge
evolve_prompt --> prompt_source
evolve_prompt --> hermes_prompt_source
evolve_prompt --> config
evolve_prompt --> quality
evolve_prompt --> timing
evolve_prompt --> validator
hermes_prompt_source --> prompt_source
prompt_module --> dspy
prompt_proposer --> budget
prompt_judge --> fitness
installer --> hermes_prompt_source

validator --> hermes_runner
validator --> installer
validator --> report
validator --> task
validator --> savejudge
cl_cli --> validator
hermes_runner --> hermes

@@ -138,7 +164,9 @@ graph TB
importers --> dataset
```

`evolution/core/` has no dependency on `evolution/skills/`, `evolution/tools/`, or `evolution/validation/`. The reverse holds: tier packages use core helpers but core never imports from a tier package. `closed_loop_feedback.py` imports `evolution.validation.*` types because it's the integration seam, but the validation subpackage doesn't import from skills/tools. This keeps the tier-3/4/5 expansion path open.
`evolution/core/` has no dependency on `evolution/skills/`, `evolution/tools/`, `evolution/prompts/`, or `evolution/validation/`. The reverse holds: tier packages use core helpers but core never imports from a tier package. `closed_loop_feedback.py` imports `evolution.validation.*` types because it's the integration seam, but the validation subpackage doesn't import from skills/tools/prompts. This keeps the tier-4/5 expansion path open.
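One way to check the import-direction rule mechanically is to scan a core module's source for absolute imports that reach into a tier package. The sketch below is hypothetical (`tier_imports` and `TIER_PACKAGES` are not part of the repository) and ignores relative imports; an integration seam such as `closed_loop_feedback.py` would need an explicit allowlist.

```python
import ast

# Package names taken from the paragraph above.
TIER_PACKAGES = (
    "evolution.skills",
    "evolution.tools",
    "evolution.prompts",
    "evolution.validation",
)


def _in_tier(dotted: str) -> bool:
    """True when a dotted name is a tier package or lives under one."""
    return any(dotted == pkg or dotted.startswith(pkg + ".") for pkg in TIER_PACKAGES)


def tier_imports(source: str) -> list[str]:
    """Return the tier-package names that a module's source imports."""
    hits: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            candidates = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Join module and name so `from evolution import skills` is caught.
            candidates = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        hits.extend(name for name in candidates if _in_tier(name))
    return hits
```

A test could run this over every file under `evolution/core/` and assert that the result is empty for all non-allowlisted modules.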

The `prompts` tier (Phase 3) is the prompt-section evolution path: `evolve_prompt_section` wraps a named `prompt_builder.py` constant as a `PromptModule` (a passthrough predictor carrying the candidate in sentinel-delimited instructions), mutates it with `PromptSectionProposer`, and — because there is no synthetic classification signal for a system-prompt section — scores **purely behaviorally** through the closed-loop validator running a real `hermes -z` against a curated JSONL suite. The deploy gate is therefore a closed-loop pass-rate / win-loss decision, not a paired-bootstrap one. Unlike the skill/tool tiers it reuses `ClosedLoopValidator` directly rather than going through `closed_loop_feedback.py`, and it integrates by AST-splicing the candidate into the live `agent/prompt_builder.py` (`HermesPromptSectionInstaller`) with atomic restore. The Layer-2 content judge (`SaveCallJudge` / `judge_save_calls`) runs inside `score_task` to grade memory-save *content* on top of the Layer-1 trigger-membership check.
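The AST-splice idea can be illustrated with the standard-library `ast` module: locate the top-level string constant, then replace only the bytes of its literal. This is a minimal sketch under stated assumptions, not the `HermesPromptSectionInstaller` implementation; `splice_constant` is a hypothetical helper, and it re-emits the literal with `repr()` rather than preserving the original quoting style.

```python
import ast


def splice_constant(source: str, name: str, new_text: str) -> str:
    """Replace the string literal assigned to top-level `name`, leaving the rest intact."""
    data = source.encode("utf-8")
    lines = data.splitlines(keepends=True)
    for node in ast.parse(source).body:
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == name
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            value = node.value
            # ast column offsets are UTF-8 byte offsets, so splice on bytes.
            start = sum(len(line) for line in lines[: value.lineno - 1]) + value.col_offset
            end = sum(len(line) for line in lines[: value.end_lineno - 1]) + value.end_col_offset
            literal = repr(new_text).encode("utf-8")
            return (data[:start] + literal + data[end:]).decode("utf-8")
    raise KeyError(name)
```

Because only the literal's byte range changes, trailing comments and unrelated constants survive unchanged. A production installer would presumably also take the backup, lock, and checksum steps the text mentions, which this sketch omits.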

## Design patterns in active use

30 changes: 21 additions & 9 deletions docs/codebase_info.md
@@ -67,14 +67,20 @@ evolution/
│ └── tool_judge.py # tool-flavored LLMJudge + GEPA-shaped metric
├── validation/ # closed-loop validation against a real agent
│ ├── agent_runner.py # AgentRunner Protocol + AgentRunResult dataclass
│ ├── artifact_installer.py # ArtifactInstaller Protocol + HermesToolDescriptionInstaller
│ ├── artifact_installer.py # ArtifactInstaller Protocol + HermesToolDescriptionInstaller + HermesPromptSectionInstaller
│ ├── closed_loop.py # CLI: drive baseline + evolved through hermes -z, compare
│ ├── hermes_runner.py # HermesAgentRunner — subprocess hermes -z with sandboxed HOME
│ ├── report.py # ValidationReport + TaskResult + decision rule
│ ├── suites/ # JSONL task suites (patch.jsonl, write_file.jsonl, search_files.jsonl)
│ ├── hermes_runner.py # HermesAgentRunner — subprocess hermes -z; reads sessions from SQLite state.db (parse_session_from_db)
│ ├── report.py # ValidationReport + TaskResult + decision rule + Layer-2 SaveCallJudge in score_task
│ ├── suites/ # JSONL task suites (patch.jsonl, write_file.jsonl, search_files.jsonl, memory_guidance.jsonl)
│ ├── task.py # Task + TaskSuite.from_jsonl (with sha256 audit)
│ └── validator.py # ClosedLoopValidator.validate — mutates + restores live agent file
├── prompts/ # Tier 3: planned, empty package
├── prompts/ # Tier 3: system-prompt-section evolution
│ ├── evolve_prompt_section.py # CLI + orchestration; purely-behavioral closed-loop gate
│ ├── prompt_source.py # PromptSource Protocol (read + write) + SectionDescriptor
│ ├── hermes_prompt_source.py # HermesPromptSource — AST read/write of prompt_builder.py constants
│ ├── prompt_module.py # PromptModule — passthrough predictor carrying candidate in sentinels
│ ├── prompt_proposer.py # PromptSectionProposer — sentinel-preserving GEPA proposer
│ └── prompt_judge.py # SaveCallJudge + judge_save_calls Layer-2 content judge + fitness/splice scorers
├── code/ # Tier 4: planned, empty package
└── monitor/ # planned, empty package
```
@@ -86,6 +92,7 @@ evolution/
| `evolution/skills/evolve_skill.py` | ~1340 | CLI, orchestration, gate-decision payload assembly |
| `evolution/tools/evolve_tool.py` | ~1170 | CLI + orchestration for tool-description evolution |
| `evolution/core/external_importers.py` | ~770 | 3 importers + relevance filter + standalone CLI |
| `evolution/prompts/evolve_prompt_section.py` | ~660 | CLI + orchestration; purely-behavioral closed-loop deploy gate |
| `evolution/core/dataset_builder.py` | ~480 | synthetic generator + golden loader + tool-selection three-bucket gen |
| `evolution/core/lm_timing_callback.py` | ~400 | DSPy BaseCallback + litellm.failure_callback + cost ledger |
| `evolution/core/fitness.py` | ~380 | LLMJudge + skill/tool fitness metrics + behavioral score helper |
@@ -94,22 +101,27 @@ evolution/
| `evolution/core/closed_loop_feedback.py` | ~320 | cache + saturation gate + deterministic feedback block + `force_run` (bypasses gate for pre-flight) |
| `evolution/core/saturation_check.py` | ~255 | pre-flight: band classifier + `SaturationReport` + Rich panel + interactive confirm |
| `evolution/tools/tool_judge.py` | ~230 | tool-flavored judge + GEPA-shaped metric with behavioral branch |
| `evolution/prompts/prompt_judge.py` | ~230 | SaveCallJudge + judge_save_calls Layer-2 content judge + prompt fitness/splice scorers |
| `evolution/validation/validator.py` | ~220 | mutate + restore live agent file with flock + checksum drift check |
| `evolution/validation/report.py` | ~225 | ValidationReport JSON + Rich rendering + two-condition decision |
| `evolution/core/skill_sources.py` | ~210 | Hermes / Claude Code / LocalDir |
| `evolution/core/quality_gate.py` | ~210 | preset table + proposer-mode resolution + gate-decision persistence |
| `evolution/skills/knee_point.py` | ~205 | parsimony-based candidate picker |
| `evolution/validation/hermes_runner.py` | ~205 | hermes -z subprocess with sandboxed HOME |
| `evolution/tools/tool_proposer.py` | ~200 | sentinel-preserving reflection prompt |
| `evolution/validation/artifact_installer.py` | ~150 | byte-precise splice + atomic restore |
| `evolution/prompts/prompt_proposer.py` | ~160 | sentinel-preserving GEPA proposer for prompt sections |
| `evolution/validation/artifact_installer.py` | ~150 | byte-precise splice + atomic restore (tool + prompt-section installers) |
| `evolution/prompts/hermes_prompt_source.py` | ~135 | AST read/write of prompt_builder.py string constants |
| `evolution/prompts/prompt_module.py` | ~120 | PromptModule passthrough predictor + sentinel parse |
| `evolution/validation/closed_loop.py` | ~135 | standalone closed-loop CLI |
| `evolution/skills/skill_module.py` | ~125 | wraps SKILL.md as `dspy.Module` |
| `evolution/validation/task.py` | ~90 | Task + TaskSuite.from_jsonl |
| `evolution/core/config.py` | ~80 | `EvolutionConfig` dataclass |
| `evolution/core/stats.py` | ~60 | `paired_bootstrap` helper |
| `evolution/prompts/prompt_source.py` | ~55 | PromptSource Protocol + SectionDescriptor |
| `evolution/validation/agent_runner.py` | ~55 | AgentRunner Protocol + dataclasses |
| `evolution/core/behavioral_example.py` | ~35 | builder for behavioral dspy.Examples |
| **Total** | **~9,000** | excludes empty `__init__.py` shims |
| **Total** | **~10,400** | excludes empty `__init__.py` shims |

Test suite: 61 test files under `tests/core/`, `tests/skills/`, `tests/tools/`, `tests/validation/`. **1166 tests** collected.

@@ -139,11 +151,11 @@ The README's table summarizes intent; reality:
|---|---|---|---|
| 1 | Skill files (SKILL.md) | DSPy + GEPA | ✅ implemented in `evolution/skills/` |
| 2 | Tool descriptions | DSPy + GEPA | ✅ implemented in `evolution/tools/` — MCP-JSON and Hermes-Python-AST adapters; one target tool per run |
| 3 | System prompt sections | DSPy + GEPA | 🔲 `evolution/prompts/` package exists, empty |
| 3 | System prompt sections | DSPy + GEPA | ✅ implemented in `evolution/prompts/` — AST splice of `prompt_builder.py` constants; purely-behavioral closed-loop deploy gate (no synthetic signal) |
| 4 | Tool implementation code | Darwinian Evolver | 🔲 `evolution/code/` package exists, empty; `[darwinian]` extra reserves the dep |
| 5 | Continuous improvement loop | Automated pipeline | 🔲 `evolution/monitor/` package exists, empty |

Tiers 1 and 2 are built. Tier 3-5 packages exist as empty stubs to anchor the planned architecture. See PLAN.md's per-phase "Deviations from plan" subsections for where the built tiers diverge from the original spec.
Tiers 1-3 are built. Tier 4-5 packages exist as empty stubs to anchor the planned architecture. See PLAN.md's per-phase "Deviations from plan" subsections for where the built tiers diverge from the original spec.

**Orthogonal validation surface.** `evolution/validation/` runs a real agent (`hermes -z`) through a JSONL task suite with baseline vs evolved artifacts spliced into the live install. Scores actual tool-selection behavior with `expected_tools` / `forbidden_tools` per task; compares with a two-condition decision rule. Available three ways:
