research: DGM vs HyperAgents — what should devswarm's self-improvement architecture look like? #355

@justrach

Description

Context

Two papers define the space for self-improving agent systems. We need to decide which architecture to adopt for devswarm's evolutionary grid (#274), prompt evolution (#353), and eval framework (#354) — and whether the cost is practical for production use.

  • Darwin Gödel Machine (DGM) — open-ended evolution of coding agents
  • HyperAgents (DGM-H) (arXiv:2603.19461) — extends DGM with editable meta-agents

The core difference

|                   | DGM                                            | HyperAgents (DGM-H)                                |
|-------------------|------------------------------------------------|----------------------------------------------------|
| What evolves      | Task agent (coding agent)                      | Task agent + meta agent (both editable)            |
| Meta-level        | Fixed, handcrafted scaffold                    | Editable — agent can improve how it improves       |
| Domain assumption | Coding skill ≈ self-improvement skill (aligned)| No alignment needed — meta agent is domain-agnostic|
| Archive           | Stepping stones of agent variants              | Same, but hyperagent variants (task + meta)        |
| Transfer          | Weak — improvements are coding-specific        | Strong — meta-improvements transfer across domains |

What DGM does well

  1. Crisp formulation. Mutate coding agents, evaluate empirically, keep good branches, repeat. Clean bridge between Gödel-machine theory and running systems.
  2. Archive/open-ended search is the real contribution. Keeping many lineages matters because useful stepping stones may only pay off much later.
  3. Strong empirical results in its home domain. 20% → 50% on SWE-bench, 14.2% → 30.7% on Polyglot.
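The mutate/evaluate/keep loop from point 1 can be sketched in a few lines of Python. This is an illustration of the archive-based search shape, not the paper's released code: `mutate`, `evaluate`, and `AgentVariant` are placeholder names, and the real mutation step is an LLM proposing a self-edit diff.

```python
import random
from dataclasses import dataclass

@dataclass
class AgentVariant:
    """One coding-agent variant in the archive (illustrative)."""
    code: str           # the agent's own scaffold/source
    score: float = 0.0  # empirical benchmark score

def mutate(parent: AgentVariant) -> AgentVariant:
    # Placeholder: in DGM this is an LLM proposing a self-edit to its own code.
    return AgentVariant(code=parent.code + f"\n# edit {random.randint(0, 9)}")

def evaluate(agent: AgentVariant) -> float:
    # Placeholder: in DGM this runs the agent on SWE-bench / Polyglot tasks.
    return random.random()

def dgm_loop(seed: AgentVariant, iterations: int = 50) -> list[AgentVariant]:
    """Open-ended search: keep ALL lineages as stepping stones, not just the best."""
    archive = [seed]
    for _ in range(iterations):
        # Parent selection over the whole archive (fixed and handcrafted in DGM —
        # exactly the part the critique below says is NOT self-improving).
        parent = random.choice(archive)
        child = mutate(parent)
        child.score = evaluate(child)
        archive.append(child)  # every variant is kept as a potential stepping stone
    return archive
```

The key design point is the last line: nothing is ever discarded, because a low-scoring variant may be the stepping stone a later breakthrough needs.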

Critique of DGM

  1. "Self-improvement" is only partly self-improvement. The open-ended exploration process (archive maintenance, parent selection) is fixed and non-modifiable. The instruction-generation mechanism is handcrafted. DGM is really: self-editing agent inside a fixed evolutionary scaffold.
  2. Domain-specific alignment assumption. Coding performance as proxy for self-improvement ability is plausible in coding, weak outside it.
  3. May be too benchmark-coupled. Better at the benchmark distribution ≠ more general self-reflection.

What HyperAgents does well

  1. Attacks the real bottleneck. Makes the meta-level procedure editable — the strongest conceptual advance over DGM.
  2. Generalizes beyond coding. Evaluated on paper review, robotics reward design, Olympiad math grading.
  3. Transfer of meta-level improvements. Learned mechanisms (persistent memory, performance tracking) transfer across domains and accumulate across runs. DGM-H goes from 0.0 → 0.710 on paper review, where DGM stays at 0.0.
  4. Stays competitive on coding. 0.140 → 0.340 on Polyglot (comparable to DGM, without being handcrafted for coding).

Critique of HyperAgents

  1. Experimentally messier. Broader claim surface (coding, review, robotics, math, transfer, compounding) gives more room for hidden evaluator dependence and benchmark fragility.
  2. "Improves how it improves" is shallow-ish. Evidence includes memory and tracking — real meta-improvement, but still high-level software-engineering scaffolding, not deep algorithmic self-redesign.
  3. Non-coding tasks are less ground-truth hard. Paper review and grading can partially mirror the generator's style. "Works on any computable task" is a research direction, not a proven result.
  4. Still inherits fixed outer-loop choices. Human-defined evaluation tasks, archive rules, staged evaluation, compute budgets. More self-referential than DGM, but not fully unconstrained.

The production cost problem

Both approaches are expensive to run continuously in production software:

  • DGM/DGM-H run in Docker containers, generating full code diffs and evaluating over 50-80 iterations
  • Each generation costs real API calls (orchestrator + workers + evaluation)
  • Our swarm already incurs significant API spend at 4 agents per run
  • Running an outer evolution loop ON TOP of the swarm is prohibitive for daily use

Proposed compromise: periodic tuning cadence

Instead of continuous evolution, devswarm should adopt a periodic tuning schedule:

| Cadence   | What evolves                                                  | Cost                                         |
|-----------|---------------------------------------------------------------|----------------------------------------------|
| Per-run   | Nothing — use current grid + prompts as-is                    | $0 extra                                     |
| Weekly    | Light: prompt variants via QD search on accumulated telemetry | Low — reuse existing swarm runs as eval signal |
| Bi-weekly | Medium: grid mapping (role→model) + prompt text               | Moderate — dedicated eval runs               |
| Monthly   | Heavy: full DGM-H style — meta agent + task agent + strategies | High — but amortized over a month of usage   |

Telemetry from every run_swarm call (#342, #346) feeds into the archive passively. The evolution loop runs offline, not in the hot path.
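A minimal sketch of that passive telemetry path follows. The field names and the `.devswarm/telemetry` location are assumptions for illustration, not the actual #342/#346 schema:

```python
import json
import time
from pathlib import Path

TELEMETRY_DIR = Path(".devswarm/telemetry")  # assumed location, not the real layout

def record_run(task: str, prompts: dict[str, str], outcome: float) -> Path:
    """Append one run_swarm result to the offline archive.

    Called at the end of each run; a single local JSON write, so it never
    adds API cost or latency to the hot path.
    """
    TELEMETRY_DIR.mkdir(parents=True, exist_ok=True)
    entry = {
        "ts": time.time(),
        "task": task,
        "prompts": prompts,   # which prompt variants were active for this run
        "outcome": outcome,   # eval signal the weekly QD search reuses for free
    }
    path = TELEMETRY_DIR / f"{int(entry['ts'] * 1000)}.json"
    path.write_text(json.dumps(entry))
    return path
```

The offline evolution loop then reads this directory on its own schedule; production runs only ever write to it.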

What this means for devswarm's architecture

Current state (what we have)

```
[Orchestrator] → [Workers] → [Synthesizer]
     ↑ fixed prompt    ↑ fixed roles    ↑ fixed strategy
```

#353 as filed (prompt-only evolution) — this is the "DGM-H w/o self-improve" baseline

```
[Orchestrator] → [Workers] → [Synthesizer]
     ↑ fixed         ↑ EVOLVED prompts    ↑ fixed
```

The HyperAgents paper shows this is the weaker approach.

What HyperAgents suggests (full hyperagent evolution)

```
[Orchestrator] → [Workers] → [Synthesizer]
     ↑ EVOLVED       ↑ EVOLVED prompts    ↑ EVOLVED
     decomposition    + tool chains         synthesis strategy
     strategy         + meta strategies
```

The meta agent (orchestrator + synthesis) must also be evolvable.

Practical middle ground for devswarm

Monthly evolution loop:
  1. Collect telemetry from all swarm runs (passive)
  2. Evolve prompt variants (QD search) — cheap
  3. Evolve decomposition strategies (orchestrator prompt variants) — moderate
  4. Evaluate on accumulated task history — use real past runs as benchmark
  5. Archive best variants in .devswarm/evolved/
  6. Grid auto-updates to use winners

Per-run: just use the current best from the archive. Zero overhead.
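The six steps of the monthly loop can be sketched end to end. Every function body here is a placeholder for the real #353/#354 machinery; only the shape of the loop and the `.devswarm/evolved/` target are taken from the plan above:

```python
import json
import random
from pathlib import Path

ARCHIVE_DIR = Path(".devswarm/evolved")  # step 5 target from the plan above

def load_telemetry() -> list[dict]:
    # Step 1: passive telemetry collected from past swarm runs (placeholder data).
    return [{"task": "t1", "outcome": 0.6}, {"task": "t2", "outcome": 0.8}]

def evolve_prompts(history: list[dict]) -> list[str]:
    # Steps 2-3: QD search over prompt + decomposition variants (placeholder).
    return [f"prompt-variant-{i}" for i in range(4)]

def evaluate_on_history(variant: str, history: list[dict]) -> float:
    # Step 4: score against accumulated REAL runs, not a synthetic benchmark.
    return random.random()

def monthly_evolution() -> str:
    history = load_telemetry()
    variants = evolve_prompts(history)
    scored = {v: evaluate_on_history(v, history) for v in variants}
    winner = max(scored, key=scored.get)
    # Steps 5-6: archive the winner; the grid reads this file per-run at zero cost.
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    (ARCHIVE_DIR / "best.json").write_text(json.dumps({"prompt": winner}))
    return winner
```

Per-run code then only does the equivalent of `json.loads((ARCHIVE_DIR / "best.json").read_text())`: a file read, no evolution in the hot path.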

Decision needed

  1. Do we go DGM-style (evolve task agents only, fixed meta) or DGM-H-style (evolve everything)?

    • DGM is simpler, cheaper, proven for coding
    • DGM-H is more general, but costlier and less proven in production
  2. What's the tuning cadence?

    • Weekly light / bi-weekly medium / monthly heavy?
    • Or triggered by performance regression (adaptive)?
  3. What's the "genome" that evolves?

    • Prompt text only? (cheapest, weakest)
    • Prompt + grid mapping? (moderate)
    • Prompt + grid + decomposition + synthesis strategies? (DGM-H level)
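The three genome options in question 3 can be made concrete as a small type hierarchy, where each tier strictly extends the previous one. All names here are illustrative, not committed API:

```python
from dataclasses import dataclass, field

@dataclass
class PromptGenome:
    """Cheapest tier: prompt text only (question 3, option 1)."""
    worker_prompts: dict[str, str] = field(default_factory=dict)

@dataclass
class GridGenome(PromptGenome):
    """Moderate tier: prompts + role→model grid mapping (option 2)."""
    grid: dict[str, str] = field(default_factory=dict)  # e.g. {"reviewer": "model-x"}

@dataclass
class HyperGenome(GridGenome):
    """DGM-H tier: the meta level is itself part of the genome (option 3)."""
    decomposition_strategy: str = "default"
    synthesis_strategy: str = "default"
```

Starting with `GridGenome` and later switching the archive to `HyperGenome` is a widening change, which is one concrete way to keep the DGM-to-DGM-H upgrade path open.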

Verdict

DGM is the stronger paper qua experiment. HyperAgents is the stronger paper qua research agenda.

For devswarm: start with DGM-style (evolve prompts + grid, fixed meta) because it's cheaper and we have production constraints. But architect the system so the meta-level (orchestrator, synthesis) CAN be swapped in as evolvable later — don't paint ourselves into a corner.
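One concrete way to avoid painting ourselves into that corner is to put the meta level behind an interface from day one: fixed today, evolvable later. This is a sketch under assumed names (`MetaStrategy`, `FixedMeta`); the worker calls are stubbed out:

```python
from typing import Protocol

class MetaStrategy(Protocol):
    """Boundary around everything DGM-H would evolve: decomposition + synthesis."""
    def decompose(self, task: str) -> list[str]: ...
    def synthesize(self, results: list[str]) -> str: ...

class FixedMeta:
    """Today's handcrafted meta level (DGM-style: not evolved)."""
    def decompose(self, task: str) -> list[str]:
        return [f"{task} [part {i}]" for i in range(4)]  # 4 workers per run
    def synthesize(self, results: list[str]) -> str:
        return "\n".join(results)

def run_swarm(task: str, meta: MetaStrategy) -> str:
    """Sketch of the swarm entry point: it only sees the MetaStrategy interface,
    so an evolved meta agent can drop in later without touching the swarm."""
    subtasks = meta.decompose(task)
    results = [f"done: {s}" for s in subtasks]  # placeholder worker calls
    return meta.synthesize(results)
```

If the monthly loop later produces an `EvolvedMeta` satisfying the same protocol, switching it in is a one-line change at the call site.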
