The slow-path agent is CUCo's performance optimization engine. Starting from the fast-path baseline (generation 0), it runs an island-based evolutionary search with LLM-driven mutation to discover high-performance compute-communication kernels.
Module: cuco/core/runner.py
Class: EvolutionRunner
For each generation:
1. Select parent from population (fitness-weighted)
2. Sample archive inspirations + top-k programs
3. Assemble prompt (task msg + parent + inspirations + meta-recommendations)
4. LLM generates mutation (diff / full rewrite / crossover)
5. Apply patch to parent code
6. Novelty check (reject near-duplicates)
7. Submit for evaluation (build → run → score)
8. Store result in database (including failures)
9. Periodically: meta-summarization
10. Periodically: island migration
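The per-generation loop above can be sketched in miniature. This is a hypothetical, in-memory stand-in for EvolutionRunner: every helper name is illustrative, the "mutation" is a random tweak standing in for the LLM, and the novelty check compares exact strings instead of embeddings.

```python
import random

def fitness_weighted_choice(population, rng):
    # Step 1: selection probability proportional to fitness
    weights = [max(p["score"], 1e-9) for p in population]
    return rng.choices(population, weights=weights, k=1)[0]

def mutate(parent_code, rng):
    # Steps 3-5: stand-in for prompt assembly + LLM mutation + patch apply
    return parent_code + f"\n// tweak {rng.randint(0, 999)}"

def is_novel(code, population):
    # Step 6: the real system compares code embeddings, not exact strings
    return all(code != p["code"] for p in population)

def evaluate(code):
    # Step 7: toy score; the real system compiles, runs, and benchmarks
    return float(len(code) % 7)

def run_generation(population, rng):
    parent = fitness_weighted_choice(population, rng)
    child_code = mutate(parent["code"], rng)
    if not is_novel(child_code, population):
        return None                      # rejected near-duplicate
    child = {"code": child_code, "score": evaluate(child_code)}
    population.append(child)             # step 8: store, even failures
    return child

rng = random.Random(0)
population = [{"code": "kernel_v0()", "score": 1.0}]
child = run_generation(population, rng)
```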
Evolution is typically split into two phases:
Phase 1 — Explore (first 40% of budget):
- 70% full rewrites, 15% diffs, 15% crossover
- Higher temperature (0.2, 0.5, 0.8)
- Goal: discover structurally diverse architectures (multi-stream, fused kernel, split put/wait, warp specialization)
Phase 2 — Exploit (remaining 60%):
- 60% full rewrites, 25% diffs, 15% crossover
- Lower temperature (0.0, 0.2, 0.5)
- Goal: refine the best architectures found during exploration
The phase split is controlled by explore_fraction in run_evo.py. The key insight: without initial diversity from the explore phase, the search converges slowly to a lower-quality local optimum.
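The schedule can be sketched as follows. EXPLORE_FRACTION mirrors the role of explore_fraction, and the probability tables restate the percentages listed above; the constant and dict names themselves are illustrative, not actual config keys.

```python
import random

EXPLORE_FRACTION = 0.4
PHASES = {
    "explore": {"probs": {"full": 0.70, "diff": 0.15, "cross": 0.15},
                "temps": [0.2, 0.5, 0.8]},
    "exploit": {"probs": {"full": 0.60, "diff": 0.25, "cross": 0.15},
                "temps": [0.0, 0.2, 0.5]},
}

def phase_for(gen, num_generations):
    """Explore for the first 40% of the budget, then exploit."""
    return "explore" if gen < EXPLORE_FRACTION * num_generations else "exploit"

def sample_mutation(gen, num_generations, rng):
    """Draw a mutation type and sampling temperature for this generation."""
    cfg = PHASES[phase_for(gen, num_generations)]
    kinds, weights = zip(*cfg["probs"].items())
    return rng.choices(kinds, weights=weights, k=1)[0], rng.choice(cfg["temps"])
```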
The LLM acts as a variation operator — not an open-ended code generator. Its output is structurally bounded by the EVOLVE-BLOCK markers.
Localized SEARCH/REPLACE edits within evolve blocks. The LLM proposes specific code changes:
```
<<<< SEARCH
gin.put(world, r, recvwin, offset, sendwin, offset, size, ncclGin_SignalInc(0));
==== REPLACE
gin.put(world, r, recvwin, offset, sendwin, offset, chunk_size, ncclGin_SignalInc(0));
gin.put(world, r, recvwin, offset + chunk_size, sendwin, offset + chunk_size,
        size - chunk_size, ncclGin_SignalInc(1));
>>>>
```
Best for: fine-grained parameter tuning, adding synchronization, reordering operations.
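A minimal applier for this diff format might look like the sketch below. It assumes the exact <<<< SEARCH / ==== REPLACE / >>>> delimiters shown above; the real patch applier may be more tolerant of whitespace and formatting drift.

```python
import re

# Matches one SEARCH/REPLACE hunk; DOTALL lets the bodies span multiple lines.
DIFF_RE = re.compile(r"<<<< SEARCH\n(.*?)\n==== REPLACE\n(.*?)\n>>>>", re.DOTALL)

def apply_diff(code, diff):
    """Apply each hunk exactly once to the parent code."""
    for search, replace in DIFF_RE.findall(diff):
        if search not in code:
            raise ValueError("SEARCH block not found in parent code")
        code = code.replace(search, replace, 1)
    return code
```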
Complete replacement of all code within EVOLVE-BLOCK markers. The LLM generates an entirely new implementation while preserving the frozen interface.
Best for: architectural changes (sequential → pipelined, single-kernel → multi-stream).
Synthesis from multiple archive programs. The LLM receives 2-3 high-performing candidates and combines their best aspects.
Best for: combining complementary strategies (e.g., one program's stream topology with another's synchronization pattern).
Module: cuco/database/parents.py
Three strategies are available:
Default. Programs are ranked by fitness, and selection probability follows a power law distribution. The exploitation_alpha parameter controls selection pressure:
- alpha = 0 → uniform random (maximum exploration)
- alpha = 1 → strong bias toward top programs (maximum exploitation)
Uses a sigmoid-based weighting over the fitness distribution. The parent_selection_lambda parameter controls sharpness — higher values concentrate selection on the best programs.
Maintains the num_beams best programs and selects parents only from this beam. This is the most exploitative strategy.
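As an illustration, the default power-law strategy could be realized with a rank-based weighting such as (r + 1) ** -alpha. This is one plausible form consistent with the alpha semantics above (0 = uniform, larger = more elitist); the implementation's exact formula may differ.

```python
import random

def power_law_select(population, alpha, rng):
    """Rank by fitness (rank 0 = best), weight rank r by (r + 1) ** -alpha."""
    ranked = sorted(population, key=lambda p: p["score"], reverse=True)
    weights = [(r + 1) ** -alpha for r in range(len(ranked))]
    return rng.choices(ranked, weights=weights, k=1)[0]
```

With alpha = 0 every rank gets weight 1 (uniform sampling); as alpha grows, probability mass concentrates on the top-ranked program.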
Module: cuco/database/inspirations.py
Each mutation prompt includes context from other successful programs:
Drawn from a MAP-Elites diversity archive that maintains structurally distinct high-performing solutions. The archive ensures the LLM sees programs outside the current lineage, analogous to crossover across distant population members.
Configuration:
- archive_size — maximum archive capacity
- num_archive_inspirations — how many to include per prompt
- elite_selection_ratio — proportion of archive slots reserved for fitness elites
The highest-scoring programs overall, regardless of structural diversity. Provides the LLM with clear performance targets.
Configuration:
- num_top_k_inspirations — how many to include per prompt
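Combining the two inspiration sources might look like the following sketch; the function and parameter names are illustrative, not the actual API of cuco/database/inspirations.py.

```python
import random

def sample_inspirations(archive, population, num_archive, num_top_k, rng):
    """Mix diversity-archive samples with the global top-k, de-duplicated."""
    from_archive = rng.sample(archive, min(num_archive, len(archive)))
    top_k = sorted(population, key=lambda p: p["score"], reverse=True)[:num_top_k]
    seen, out = set(), []
    for prog in from_archive + top_k:     # archive picks first, then elites
        if prog["id"] not in seen:
            seen.add(prog["id"])
            out.append(prog)
    return out
```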
Module: cuco/core/summarizer.py
Class: MetaSummarizer
Every meta_rec_interval generations, the meta-summarizer runs a three-step LLM pipeline:
- Summarize: Digest the most recent batch of candidates — their scores, mutation types, architectural choices, and evaluation feedback.
- Update scratchpad: Maintain a persistent global scratchpad tracking which strategies have been attempted, which succeeded, and which failed.
- Recommend: Produce a ranked list of concrete optimization directions for the next generation.
These recommendations are injected into subsequent mutation prompts, creating a closed-loop meta-learning signal. Early recommendations may suggest exploring different fusion levels; later recommendations — informed by observing that multi-stream overlap consistently outperforms full fusion for a particular workload — redirect effort toward refining that strategy.
Output is stored in:
- meta_memory.json — persistent state (summaries, scratchpad, recommendations)
- meta_N.txt — human-readable snapshot at generation N
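The three-step pipeline can be sketched with the LLM calls stubbed out. Only the meta_memory.json layout follows the description above; the prompts and function signature are illustrative.

```python
import json

def meta_summarize(batch, scratchpad, llm, gen, path="meta_memory.json"):
    """Three-step pipeline: summarize -> update scratchpad -> recommend."""
    summary = llm(f"Summarize these {len(batch)} candidates: "
                  "scores, mutation types, architecture, feedback")
    scratchpad = llm(f"Update strategy scratchpad.\nOld: {scratchpad}\n"
                     f"New summary: {summary}")
    recommendations = llm("Rank concrete optimization directions given:\n"
                          f"{scratchpad}")
    state = {"generation": gen, "summary": summary,
             "scratchpad": scratchpad, "recommendations": recommendations}
    with open(path, "w") as f:            # persisted across generations
        json.dump(state, f, indent=2)
    return recommendations
```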
Module: cuco/core/novelty_judge.py
Class: NoveltyJudge
To prevent population collapse, candidates are checked for novelty before evaluation:
- Embedding similarity: The candidate's code is embedded (via EmbeddingClient), and cosine similarity is computed against all existing database entries. If similarity exceeds code_embed_sim_threshold (default: 0.995), the candidate is rejected.
- LLM novelty assessment (optional): An LLM judges whether the candidate introduces meaningful structural differences.
Rejected candidates are resampled up to max_novelty_attempts times.
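A sketch of the embedding gate and the resampling loop, assuming embeddings are plain float vectors (the real system obtains them via EmbeddingClient, and the helper names here are hypothetical):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_novel(candidate_emb, db_embeddings, threshold=0.995):
    """Reject if the candidate is too close to any existing entry."""
    return all(cosine(candidate_emb, e) < threshold for e in db_embeddings)

def sample_novel(generate, db_embeddings, max_attempts=3):
    """Resample up to max_novelty_attempts times, then accept the last draw."""
    cand = None
    for _ in range(max_attempts):
        cand = generate()
        if is_novel(cand["embedding"], db_embeddings):
            return cand
    return cand  # budget exhausted: accept the near-duplicate
```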
Module: cuco/database/islands.py
The search uses multiple independent islands to maintain diversity:
Each program belongs to exactly one island. Islands can have different:
- Seed programs (init_program_paths_per_island)
- Task system messages (task_sys_msg_per_island)
- Communication APIs (island_api_types: e.g., island 0 = LSA, island 1 = GIN)
Every migration_interval generations, top-performing programs are copied between islands:
- Elitist migration: Copy the best migration_rate fraction of each island to randomly selected targets
- Directional migration: Follow a configured migration_graph (e.g., LSA island → hybrid island ← GIN island)
Migration cross-pollinates successful patterns without collapsing diversity.
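Elitist migration reduces to a short sketch (names illustrative; the real logic lives in cuco/database/islands.py):

```python
import random

def migrate(islands, migration_rate, rng):
    """Copy each island's best migration_rate fraction to a random other island."""
    updates = []
    for idx, programs in enumerate(islands):
        n = max(1, int(len(programs) * migration_rate))
        elites = sorted(programs, key=lambda p: p["score"], reverse=True)[:n]
        target = rng.choice([t for t in range(len(islands)) if t != idx])
        updates.append((target, [dict(p) for p in elites]))  # copy, don't move
    for target, elites in updates:        # apply after scanning all islands
        islands[target].extend(elites)
```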
Every candidate passes through a three-level cascade:
| Level | Check | Cost | On Failure |
|---|---|---|---|
| L1 | Compile (nvcc) | Seconds | Store with score 0, feed compiler errors to next mutation |
| L2 | Run + verify (mpirun) | Seconds-minutes | Store with score 0, feed runtime errors |
| L3 | Benchmark (best of N runs) | Seconds-minutes | Store with measured score |
Failed candidates are retained in the database with their diagnostics. They serve as negative examples that inform future mutations — a form of explicit negative selection absent from classical evolutionary methods.
At every cascade level, an LLM feedback agent receives the candidate code and evaluation outcome and generates a concise diagnostic. This feedback is stored with the candidate and injected when its lineage is later selected as a parent.
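The cascade itself reduces to a short sketch once the build, run, and benchmark steps are stubbed out; compile_fn, run_fn, and bench_fn stand in for the real nvcc, mpirun, and benchmarking invocations.

```python
def evaluate_cascade(code, compile_fn, run_fn, bench_fn):
    """Three-level cascade: each level only runs if the previous one passed."""
    ok, err = compile_fn(code)             # L1: compile
    if not ok:
        return {"score": 0.0, "feedback": f"compile error: {err}"}
    ok, err = run_fn(code)                 # L2: run + verify
    if not ok:
        return {"score": 0.0, "feedback": f"runtime error: {err}"}
    score = bench_fn(code)                 # L3: best-of-N benchmark
    return {"score": score, "feedback": "ok"}
```

Note that failures still produce a stored result (score 0 plus diagnostics), matching the table above.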
Module: cuco/database/dbase.py
Class: ProgramDatabase
All evaluated candidates — including failures — are persisted to an SQLite database. The database serves two roles:
- Candidate pool: Source for parent selection, archive sampling, and inspiration retrieval across all islands.
- Knowledge base: Backing store for the meta-summarizer, which queries historical results to distill cross-generation patterns.
Each candidate's code embedding enables:
- Novelty filtering: Reject near-duplicates before evaluation
- Nearest-neighbor lookup: Surface structurally similar programs and their feedback
- Clustering: Group candidates by architectural similarity for visualization
The programs table stores:
| Column | Type | Description |
|---|---|---|
id | TEXT | Unique identifier |
code | TEXT | Full source code |
generation | INTEGER | Generation number |
island_idx | INTEGER | Island assignment |
parent_id | TEXT | Parent program ID |
combined_score | REAL | Fitness score |
correct | INTEGER | 0 or 1 |
public_metrics | TEXT | JSON timing data |
text_feedback | TEXT | LLM feedback |
embedding | BLOB | Code embedding vector |
code_diff | TEXT | Mutation diff from parent |
in_archive | INTEGER | Whether in MAP-Elites archive |
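A minimal sketch of this schema (a subset of the columns above) and a typical parent-selection query, using an in-memory SQLite database for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE programs (
    id TEXT PRIMARY KEY, code TEXT, generation INTEGER, island_idx INTEGER,
    parent_id TEXT, combined_score REAL, correct INTEGER, text_feedback TEXT)""")
rows = [("p0", "kernel_v0()", 0, 0, None, 1.2, 1, "baseline"),
        ("p1", "kernel_v1()", 1, 0, "p0", 0.0, 0, "compile error"),
        ("p2", "kernel_v2()", 1, 1, "p0", 1.9, 1, "overlap improved")]
conn.executemany("INSERT INTO programs VALUES (?,?,?,?,?,?,?,?)", rows)

# Failures (correct = 0) remain queryable as negative examples; parent
# selection here just takes the best verified program.
best = conn.execute("""SELECT id, combined_score FROM programs
                       WHERE correct = 1
                       ORDER BY combined_score DESC LIMIT 1""").fetchone()
```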
Key EvolutionConfig parameters for the slow-path agent:
| Parameter | Default | Description |
|---|---|---|
num_generations | 10 | Total generation budget |
patch_types | ["diff"] | Available mutation forms |
patch_type_probs | [1.0] | Sampling probabilities |
llm_models | ["azure-gpt-4.1-mini"] | LLM models for mutation |
llm_kwargs | {} | Temperature, max_tokens, etc. |
meta_rec_interval | None | Generations between meta-summaries |
max_novelty_attempts | 3 | Resamples before accepting a duplicate |
code_embed_sim_threshold | 1.0 | Cosine similarity rejection threshold |
use_text_feedback | False | Include LLM feedback in prompts |
embedding_model | None | Model for code embeddings |
Key DatabaseConfig parameters:
| Parameter | Default | Description |
|---|---|---|
num_islands | 4 | Number of independent islands |
archive_size | 100 | MAP-Elites archive capacity |
migration_interval | 10 | Generations between migrations |
parent_selection_strategy | "power_law" | Selection algorithm |
See Configuration Reference for the complete list.