research: DGM vs HyperAgents — what should devswarm's self-improvement architecture look like? #355

@justrach

Description

Context

Two papers define the space for self-improving agent systems. We need to decide which architecture to adopt for devswarm's evolutionary grid (#274), prompt evolution (#353), and eval framework (#354) — and whether the cost is practical for production use.

  • Darwin Gödel Machine (DGM) — open-ended evolution of coding agents
  • HyperAgents (DGM-H) (arXiv:2603.19461) — extends DGM with editable meta-agents

The core difference

|                   | DGM                                            | HyperAgents (DGM-H)                                |
|-------------------|------------------------------------------------|----------------------------------------------------|
| What evolves      | Task agent (coding agent)                      | Task agent + meta agent (both editable)            |
| Meta-level        | Fixed, handcrafted scaffold                    | Editable — agent can improve how it improves       |
| Domain assumption | Coding skill ≈ self-improvement skill (aligned)| No alignment needed — meta agent is domain-agnostic|
| Archive           | Stepping stones of agent variants              | Same, but hyperagent variants (task + meta)        |
| Transfer          | Weak — improvements are coding-specific        | Strong — meta-improvements transfer across domains |

What DGM does well

  1. Crisp formulation. Mutate coding agents, evaluate empirically, keep good branches, repeat. Clean bridge between Gödel-machine theory and running systems.
  2. Archive/open-ended search is the real contribution. Keeping many lineages matters because useful stepping stones may only pay off much later.
  3. Strong empirical results in its home domain. 20% → 50% on SWE-bench, 14.2% → 30.7% on Polyglot.
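The mutate/evaluate/keep loop from point 1 can be sketched in a few lines of Python. This is an illustration of the archive-based search shape, not the paper's released code: `mutate`, `evaluate`, and `AgentVariant` are placeholder names, and the real mutation step is an LLM proposing a self-edit diff.

```python
import random
from dataclasses import dataclass

@dataclass
class AgentVariant:
    """One coding-agent variant in the archive (illustrative)."""
    code: str           # the agent's own scaffold/source
    score: float = 0.0  # empirical benchmark score

def mutate(parent: AgentVariant) -> AgentVariant:
    # Placeholder: in DGM this is an LLM proposing a self-edit to its own code.
    return AgentVariant(code=parent.code + f"\n# edit {random.randint(0, 9)}")

def evaluate(agent: AgentVariant) -> float:
    # Placeholder: in DGM this runs the agent on SWE-bench / Polyglot tasks.
    return random.random()

def dgm_loop(seed: AgentVariant, iterations: int = 50) -> list[AgentVariant]:
    """Open-ended search: keep ALL lineages as stepping stones, not just the best."""
    archive = [seed]
    for _ in range(iterations):
        # Parent selection over the whole archive (fixed and handcrafted in DGM —
        # exactly the part the critique below says is NOT self-improving).
        parent = random.choice(archive)
        child = mutate(parent)
        child.score = evaluate(child)
        archive.append(child)  # every variant is kept as a potential stepping stone
    return archive
```

The key design point is the last line: nothing is ever discarded, because a low-scoring variant may be the stepping stone a later breakthrough needs.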

Critique of DGM

  1. "Self-improvement" is only partly self-improvement. The open-ended exploration process (archive maintenance, parent selection) is fixed and non-modifiable. The instruction-generation mechanism is handcrafted. DGM is really: self-editing agent inside a fixed evolutionary scaffold.
  2. Domain-specific alignment assumption. Coding performance as proxy for self-improvement ability is plausible in coding, weak outside it.
  3. May be too benchmark-coupled. Better at the benchmark distribution ≠ more general self-reflection.

What HyperAgents does well

  1. Attacks the real bottleneck. Makes the meta-level procedure editable — the strongest conceptual advance over DGM.
  2. Generalizes beyond coding. Evaluated on paper review, robotics reward design, Olympiad math grading.
  3. Transfer of meta-level improvements. Learned mechanisms (persistent memory, performance tracking) transfer across domains and accumulate across runs. DGM-H goes from 0.0 → 0.710 on paper review, where DGM stays at 0.0.
  4. Stays competitive on coding. 0.140 → 0.340 on Polyglot (comparable to DGM, without being handcrafted for coding).

Critique of HyperAgents

  1. Experimentally messier. Broader claim surface (coding, review, robotics, math, transfer, compounding) gives more room for hidden evaluator dependence and benchmark fragility.
  2. "Improves how it improves" is shallow-ish. Evidence includes memory and tracking — real meta-improvement, but still high-level software-engineering scaffolding, not deep algorithmic self-redesign.
  3. Non-coding tasks are less ground-truth hard. Paper review and grading can partially mirror the generator's style. "Works on any computable task" is a research direction, not a proven result.
  4. Still inherits fixed outer-loop choices. Human-defined evaluation tasks, archive rules, staged evaluation, compute budgets. More self-referential than DGM, but not fully unconstrained.

The production cost problem

Both approaches are expensive to run continuously in production software:

  • DGM/DGM-H run in Docker containers, generating full code diffs and evaluating over 50-80 iterations
  • Each generation costs real API calls (orchestrator + workers + evaluation)
  • Our swarm already incurs significant API spend at 4 agents per run
  • Running an outer evolution loop ON TOP of the swarm is prohibitive for daily use

Proposed compromise: periodic tuning cadence

Instead of continuous evolution, devswarm should adopt a periodic tuning schedule:

| Cadence   | What evolves                                                  | Cost                                         |
|-----------|---------------------------------------------------------------|----------------------------------------------|
| Per-run   | Nothing — use current grid + prompts as-is                    | $0 extra                                     |
| Weekly    | Light: prompt variants via QD search on accumulated telemetry | Low — reuse existing swarm runs as eval signal |
| Bi-weekly | Medium: grid mapping (role→model) + prompt text               | Moderate — dedicated eval runs               |
| Monthly   | Heavy: full DGM-H style — meta agent + task agent + strategies | High — but amortized over a month of usage   |

Telemetry from every run_swarm call (#342, #346) feeds into the archive passively. The evolution loop runs offline, not in the hot path.
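A minimal sketch of that passive telemetry path follows. The field names and the `.devswarm/telemetry` location are assumptions for illustration, not the actual #342/#346 schema:

```python
import json
import time
from pathlib import Path

TELEMETRY_DIR = Path(".devswarm/telemetry")  # assumed location, not the real layout

def record_run(task: str, prompts: dict[str, str], outcome: float) -> Path:
    """Append one run_swarm result to the offline archive.

    Called at the end of each run; a single local JSON write, so it never
    adds API cost or latency to the hot path.
    """
    TELEMETRY_DIR.mkdir(parents=True, exist_ok=True)
    entry = {
        "ts": time.time(),
        "task": task,
        "prompts": prompts,   # which prompt variants were active for this run
        "outcome": outcome,   # eval signal the weekly QD search reuses for free
    }
    path = TELEMETRY_DIR / f"{int(entry['ts'] * 1000)}.json"
    path.write_text(json.dumps(entry))
    return path
```

The offline evolution loop then reads this directory on its own schedule; production runs only ever write to it.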

What this means for devswarm's architecture

Current state (what we have)

```
[Orchestrator] → [Workers] → [Synthesizer]
     ↑ fixed prompt    ↑ fixed roles    ↑ fixed strategy
```

#353 as filed (prompt-only evolution) — this is the "DGM-H w/o self-improve" baseline

```
[Orchestrator] → [Workers] → [Synthesizer]
     ↑ fixed         ↑ EVOLVED prompts    ↑ fixed
```

The HyperAgents paper shows this is the weaker approach.

What HyperAgents suggests (full hyperagent evolution)

```
[Orchestrator] → [Workers] → [Synthesizer]
     ↑ EVOLVED       ↑ EVOLVED prompts    ↑ EVOLVED
     decomposition    + tool chains         synthesis strategy
     strategy         + meta strategies
```

The meta agent (orchestrator + synthesis) must also be evolvable.

Practical middle ground for devswarm

Monthly evolution loop:
  1. Collect telemetry from all swarm runs (passive)
  2. Evolve prompt variants (QD search) — cheap
  3. Evolve decomposition strategies (orchestrator prompt variants) — moderate
  4. Evaluate on accumulated task history — use real past runs as benchmark
  5. Archive best variants in .devswarm/evolved/
  6. Grid auto-updates to use winners

Per-run: just use the current best from the archive. Zero overhead.
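The six steps of the monthly loop can be sketched end to end. Every function body here is a placeholder for the real #353/#354 machinery; only the shape of the loop and the `.devswarm/evolved/` target are taken from the plan above:

```python
import json
import random
from pathlib import Path

ARCHIVE_DIR = Path(".devswarm/evolved")  # step 5 target from the plan above

def load_telemetry() -> list[dict]:
    # Step 1: passive telemetry collected from past swarm runs (placeholder data).
    return [{"task": "t1", "outcome": 0.6}, {"task": "t2", "outcome": 0.8}]

def evolve_prompts(history: list[dict]) -> list[str]:
    # Steps 2-3: QD search over prompt + decomposition variants (placeholder).
    return [f"prompt-variant-{i}" for i in range(4)]

def evaluate_on_history(variant: str, history: list[dict]) -> float:
    # Step 4: score against accumulated REAL runs, not a synthetic benchmark.
    return random.random()

def monthly_evolution() -> str:
    history = load_telemetry()
    variants = evolve_prompts(history)
    scored = {v: evaluate_on_history(v, history) for v in variants}
    winner = max(scored, key=scored.get)
    # Steps 5-6: archive the winner; the grid reads this file per-run at zero cost.
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    (ARCHIVE_DIR / "best.json").write_text(json.dumps({"prompt": winner}))
    return winner
```

Per-run code then only does the equivalent of `json.loads((ARCHIVE_DIR / "best.json").read_text())`: a file read, no evolution in the hot path.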

Decision needed

  1. Do we go DGM-style (evolve task agents only, fixed meta) or DGM-H-style (evolve everything)?

    • DGM is simpler, cheaper, proven for coding
    • DGM-H is more general, but costlier and less proven in production
  2. What's the tuning cadence?

    • Weekly light / bi-weekly medium / monthly heavy?
    • Or triggered by performance regression (adaptive)?
  3. What's the "genome" that evolves?

    • Prompt text only? (cheapest, weakest)
    • Prompt + grid mapping? (moderate)
    • Prompt + grid + decomposition + synthesis strategies? (DGM-H level)
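The three genome options in question 3 can be made concrete as a small type hierarchy, where each tier strictly extends the previous one. All names here are illustrative, not committed API:

```python
from dataclasses import dataclass, field

@dataclass
class PromptGenome:
    """Cheapest tier: prompt text only (question 3, option 1)."""
    worker_prompts: dict[str, str] = field(default_factory=dict)

@dataclass
class GridGenome(PromptGenome):
    """Moderate tier: prompts + role→model grid mapping (option 2)."""
    grid: dict[str, str] = field(default_factory=dict)  # e.g. {"reviewer": "model-x"}

@dataclass
class HyperGenome(GridGenome):
    """DGM-H tier: the meta level is itself part of the genome (option 3)."""
    decomposition_strategy: str = "default"
    synthesis_strategy: str = "default"
```

Starting with `GridGenome` and later switching the archive to `HyperGenome` is a widening change, which is one concrete way to keep the DGM-to-DGM-H upgrade path open.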

Verdict

DGM is the stronger paper qua experiment. HyperAgents is the stronger paper qua research agenda.

For devswarm: start with DGM-style (evolve prompts + grid, fixed meta) because it's cheaper and we have production constraints. But architect the system so the meta-level (orchestrator, synthesis) CAN be swapped in as evolvable later — don't paint ourselves into a corner.
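One concrete way to avoid painting ourselves into that corner is to put the meta level behind an interface from day one: fixed today, evolvable later. This is a sketch under assumed names (`MetaStrategy`, `FixedMeta`); the worker calls are stubbed out:

```python
from typing import Protocol

class MetaStrategy(Protocol):
    """Boundary around everything DGM-H would evolve: decomposition + synthesis."""
    def decompose(self, task: str) -> list[str]: ...
    def synthesize(self, results: list[str]) -> str: ...

class FixedMeta:
    """Today's handcrafted meta level (DGM-style: not evolved)."""
    def decompose(self, task: str) -> list[str]:
        return [f"{task} [part {i}]" for i in range(4)]  # 4 workers per run
    def synthesize(self, results: list[str]) -> str:
        return "\n".join(results)

def run_swarm(task: str, meta: MetaStrategy) -> str:
    """Sketch of the swarm entry point: it only sees the MetaStrategy interface,
    so an evolved meta agent can drop in later without touching the swarm."""
    subtasks = meta.decompose(task)
    results = [f"done: {s}" for s in subtasks]  # placeholder worker calls
    return meta.synthesize(results)
```

If the monthly loop later produces an `EvolvedMeta` satisfying the same protocol, switching it in is a one-line change at the call site.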
