Context
Two papers define the space for self-improving agent systems. We need to decide which architecture to adopt for devswarm's evolutionary grid (#274), prompt evolution (#353), and eval framework (#354) — and whether the cost is practical for production use.
Darwin Gödel Machine (DGM) — open-ended evolution of coding agents
HyperAgents (DGM-H) (arXiv:2603.19461) — extends DGM with editable meta-agents
The core difference
|  | DGM | HyperAgents (DGM-H) |
| --- | --- | --- |
| What evolves | Task agent (coding agent) | Task agent + meta agent (both editable) |
| Meta-level | Fixed, handcrafted scaffold | Editable — agent can improve how it improves |
| Domain assumption | Coding skill ≈ self-improvement skill (aligned) | No alignment needed — meta agent is domain-agnostic |
| Archive | Stepping stones of agent variants | Same, but hyperagent variants (task + meta) |
| Transfer | Weak — improvements are coding-specific | Strong — meta-improvements transfer across domains |
What DGM does well
- Crisp formulation. Mutate coding agents, evaluate empirically, keep good branches, repeat. Clean bridge between Gödel-machine theory and running systems.
- Archive/open-ended search is the real contribution. Keeping many lineages matters because useful stepping stones may only pay off much later.
- Strong empirical results in its home domain. 20% → 50% on SWE-bench, 14.2% → 30.7% on Polyglot.
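The archive-based loop in the second bullet can be sketched as follows. This is a minimal illustration, not the paper's implementation: `evaluate`, `mutate`, and the selection weighting are hypothetical placeholders for benchmark scoring, LLM-driven self-editing, and DGM's actual parent-selection rule.

```python
import random

def dgm_outer_loop(seed_agent, evaluate, mutate, iterations=80):
    """Minimal sketch of DGM-style open-ended search (illustrative only).

    `mutate` stands in for LLM-driven self-editing of the agent's code;
    `evaluate` stands in for empirical benchmark scoring."""
    archive = [(seed_agent, evaluate(seed_agent))]  # keep every lineage
    for _ in range(iterations):
        # Parent selection favors high scorers but keeps weak ones reachable,
        # so stepping stones that only pay off later are not discarded.
        weights = [0.1 + max(score, 0.0) for _, score in archive]
        parent, _ = random.choices(archive, weights=weights, k=1)[0]
        child = mutate(parent)                     # agent edits its own scaffold
        archive.append((child, evaluate(child)))   # archive only grows
    return max(archive, key=lambda entry: entry[1])
```

Keeping the whole archive, rather than greedy hill-climbing on the current best, is exactly the open-ended-search point the bullet makes.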
Critique of DGM
- "Self-improvement" is only partly self-improvement. The open-ended exploration process (archive maintenance, parent selection) is fixed and non-modifiable. The instruction-generation mechanism is handcrafted. DGM is really: self-editing agent inside a fixed evolutionary scaffold.
- Domain-specific alignment assumption. Coding performance as proxy for self-improvement ability is plausible in coding, weak outside it.
- May be too benchmark-coupled. Better at the benchmark distribution ≠ more general self-reflection.
What HyperAgents does well
- Attacks the real bottleneck. Makes the meta-level procedure editable — the strongest conceptual advance over DGM.
- Generalizes beyond coding. Evaluated on paper review, robotics reward design, Olympiad math grading.
- Transfer of meta-level improvements. Learned mechanisms (persistent memory, performance tracking) transfer across domains and accumulate across runs. DGM-H goes from 0.0 → 0.710 on paper review; DGM gets 0.0.
- Stays competitive on coding. 0.140 → 0.340 on Polyglot (comparable to DGM, without being handcrafted for coding).
Critique of HyperAgents
- Experimentally messier. Broader claim surface (coding, review, robotics, math, transfer, compounding) gives more room for hidden evaluator dependence and benchmark fragility.
- "Improves how it improves" is shallow-ish. Evidence includes memory and tracking — real meta-improvement, but still high-level software-engineering scaffolding, not deep algorithmic self-redesign.
- Non-coding tasks are less ground-truth hard. Paper review and grading can partially mirror the generator's style. "Works on any computable task" is a research direction, not a proven result.
- Still inherits fixed outer-loop choices. Human-defined evaluation tasks, archive rules, staged evaluation, compute budgets. More self-referential than DGM, but not fully unconstrained.
The production cost problem
Both approaches are expensive to run continuously in production software:
- DGM/DGM-H run in Docker containers, generating full code diffs, evaluating over 50-80 iterations
- Each generation costs real API calls (orchestrator + workers + evaluation)
- Our swarm already incurs significant API spend at 4 agents per run
- Running an outer evolution loop ON TOP of the swarm is prohibitive for daily use
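A back-of-envelope calculation makes the point concrete. Every constant below is an assumption for illustration, not a measured devswarm figure:

```python
# Illustrative cost model; all numbers are assumptions, not measurements.
CALLS_PER_SWARM_RUN = 1 + 4 + 1        # orchestrator + 4 workers + synthesizer
COST_PER_CALL = 0.05                   # assumed average $ per API call
EVAL_RUNS_PER_CANDIDATE = 3            # assumed repeats to dampen eval noise
ITERATIONS = 80                        # upper end of the 50-80 range above

outer_loop_cost = ITERATIONS * EVAL_RUNS_PER_CANDIDATE * CALLS_PER_SWARM_RUN * COST_PER_CALL
per_run_cost = CALLS_PER_SWARM_RUN * COST_PER_CALL

print(f"one evolution cycle    ~ ${outer_loop_cost:.2f}")
print(f"one ordinary swarm run ~ ${per_run_cost:.2f}")
```

Under these made-up numbers, a single evolution cycle costs as much as roughly 240 ordinary runs, which is why the outer loop cannot live in the per-run path.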
Proposed compromise: periodic tuning cadence
Instead of continuous evolution, devswarm should adopt a periodic tuning schedule:
| Cadence | What evolves | Cost |
| --- | --- | --- |
| Per-run | Nothing — use current grid + prompts as-is | $0 extra |
| Weekly | Light: prompt variants via QD search on accumulated telemetry | Low — reuse existing swarm runs as eval signal |
| Bi-weekly | Medium: grid mapping (role→model) + prompt text | Moderate — dedicated eval runs |
| Monthly | Heavy: full DGM-H style — meta agent + task agent + strategies | High — but amortized over a month of usage |
Telemetry from every run_swarm call (#342, #346) feeds into the archive passively. The evolution loop runs offline, not in the hot path.
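A passive hook of roughly this shape would suffice; the file layout and field names below are assumptions for illustration, not an existing devswarm API:

```python
import json
import time
from pathlib import Path

def record_run(task: str, variant_id: str, score: float,
               archive: Path = Path(".devswarm/evolved/telemetry.jsonl")) -> None:
    """Append one run's outcome to the telemetry archive.

    Called from the hot path but makes no model calls, so per-run overhead
    is a single file append. Path and field names are illustrative."""
    archive.parent.mkdir(parents=True, exist_ok=True)
    entry = {"ts": time.time(), "task": task, "variant": variant_id, "score": score}
    with archive.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```

The offline evolution loop then reads this file on its own schedule; nothing in the hot path waits on it.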
What this means for devswarm's architecture
Current state (what we have)
[Orchestrator] → [Workers] → [Synthesizer]
↑ fixed prompt ↑ fixed roles ↑ fixed strategy
#353 as filed (prompt-only evolution) — this is the "DGM-H w/o self-improve" baseline
[Orchestrator] → [Workers] → [Synthesizer]
↑ fixed ↑ EVOLVED prompts ↑ fixed
The HyperAgents paper shows this is the weaker approach.
What HyperAgents suggests (full hyperagent evolution)
[Orchestrator] → [Workers] → [Synthesizer]
 ↑ EVOLVED decomposition strategy + meta strategies   ↑ EVOLVED prompts + tool chains   ↑ EVOLVED synthesis strategy
The meta agent (orchestrator + synthesis) must also be evolvable.
Practical middle ground for devswarm
Monthly evolution loop:
1. Collect telemetry from all swarm runs (passive)
2. Evolve prompt variants (QD search) — cheap
3. Evolve decomposition strategies (orchestrator prompt variants) — moderate
4. Evaluate on accumulated task history — use real past runs as benchmark
5. Archive best variants in .devswarm/evolved/
6. Grid auto-updates to use winners
Per-run: just use the current best from the archive. Zero overhead.
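Step 2's QD search can be as simple as a MAP-Elites-style grid that keeps the best variant per behavior cell instead of one global winner. The behavior axes below are invented for illustration; devswarm would derive its own from telemetry:

```python
def qd_update(grid: dict, variant: str, behavior: tuple, score: float) -> dict:
    """Keep a prompt variant only if it beats the incumbent in its behavior cell.

    This is the minimal quality-diversity move: diversity comes from the cell
    key, quality from the per-cell comparison."""
    incumbent = grid.get(behavior)
    if incumbent is None or score > incumbent[1]:
        grid[behavior] = (variant, score)
    return grid

# Example behavior descriptor: (prompt length bucket, tool-use bucket).
grid: dict = {}
qd_update(grid, "terse prompt", ("short", "light"), 0.61)
qd_update(grid, "verbose prompt", ("long", "light"), 0.58)
qd_update(grid, "terse prompt v2", ("short", "light"), 0.64)  # displaces incumbent
```

Per-run lookup is then just "read the current cell winners", which is why the hot path stays at zero overhead.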
Decision needed
- Do we go DGM-style (evolve task agents only, fixed meta) or DGM-H-style (evolve everything)?
  - DGM is simpler, cheaper, proven for coding
  - DGM-H is more general, but costlier and less proven in production
- What's the tuning cadence?
  - Weekly light / bi-weekly medium / monthly heavy?
  - Or triggered by performance regression (adaptive)?
- What's the "genome" that evolves?
  - Prompt text only? (cheapest, weakest)
  - Prompt + grid mapping? (moderate)
  - Prompt + grid + decomposition + synthesis strategies? (DGM-H level)
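The three genome options can be modeled as one type with optional tiers, so widening the genome later is an additive change rather than a rewrite. Field names here are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Genome:
    """Sketch of the evolvable state; only the tiers a deployment opts into
    get mutated, everything else stays fixed."""
    prompts: dict                                   # tier 1: prompt text only
    grid_mapping: Optional[dict] = None             # tier 2: + role->model grid
    decomposition_strategy: Optional[str] = None    # tier 3: + meta strategies
    synthesis_strategy: Optional[str] = None        #         (DGM-H level)

def tier(g: Genome) -> int:
    """Classify how much of the genome a given variant actually evolves."""
    if g.decomposition_strategy or g.synthesis_strategy:
        return 3
    return 2 if g.grid_mapping else 1
```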
Verdict
DGM is the stronger paper qua experiment. HyperAgents is the stronger paper qua research agenda.
For devswarm: start with DGM-style (evolve prompts + grid, fixed meta) because it's cheaper and we have production constraints. But architect the system so the meta-level (orchestrator, synthesis) CAN be swapped in as evolvable later — don't paint ourselves into a corner.
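Concretely, "don't paint ourselves into a corner" can mean putting the meta-level behind an interface today, so the fixed implementation can later be swapped for an evolved one without touching callers. Everything below is a hypothetical sketch, not current devswarm code:

```python
from typing import Protocol

class MetaStrategy(Protocol):
    """The meta-level surface: how tasks are decomposed and results merged."""
    def decompose(self, task: str) -> list: ...
    def synthesize(self, results: list) -> str: ...

class FixedMeta:
    """Today's handcrafted meta-level (the DGM-style choice: not evolved)."""
    def decompose(self, task: str) -> list:
        return [task]  # trivial placeholder decomposition
    def synthesize(self, results: list) -> str:
        return "\n".join(results)

def run_swarm_with(meta: MetaStrategy, task: str) -> str:
    """Stand-in for the swarm pipeline: decompose, run workers, synthesize."""
    subtasks = meta.decompose(task)
    results = [f"done: {s}" for s in subtasks]  # stands in for worker calls
    return meta.synthesize(results)
```

An evolved `MetaStrategy` produced by a monthly DGM-H-style run would drop in at the same seam.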
Blocks
- `score_child_prop` parent selection from DGM-H
References