
Slow-Path Agent

The slow-path agent is CUCo's performance optimization engine. Starting from the fast-path baseline (generation 0), it runs an island-based evolutionary search with LLM-driven mutation to discover high-performance compute-communication kernels.

Evolution Loop

Module: cuco/core/runner.py
Class: EvolutionRunner

For each generation:
  1. Select parent from population (fitness-weighted)
  2. Sample archive inspirations + top-k programs
  3. Assemble prompt (task msg + parent + inspirations + meta-recommendations)
  4. LLM generates mutation (diff / full rewrite / crossover)
  5. Apply patch to parent code
  6. Novelty check (reject near-duplicates)
  7. Submit for evaluation (build → run → score)
  8. Store result in database (including failures)
  9. Periodically: meta-summarization
  10. Periodically: island migration
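
The loop above can be condensed into a minimal, self-contained sketch. All helpers here are simplified stand-ins for the real cuco components (selection, mutation, and evaluation are toy implementations, not the actual API):

```python
import random

def select_parent(pop):
    # Step 1: fitness-weighted parent selection.
    weights = [max(p["score"], 1e-6) for p in pop]
    return random.choices(pop, weights=weights, k=1)[0]

def mutate(parent):
    # Steps 3-5: prompt assembly -> LLM mutation -> patched child (stand-in).
    return {"code": parent["code"] + "+", "score": 0.0}

def evaluate(child):
    # Step 7: build -> run -> score (random stand-in).
    child["score"] = random.random()
    return child

def evolve(seed, generations=10, meta_interval=5, migration_interval=10):
    pop = [dict(seed)]
    for gen in range(1, generations + 1):
        parent = select_parent(pop)
        child = mutate(parent)
        if any(p["code"] == child["code"] for p in pop):
            continue  # Step 6: novelty check rejects near-duplicates.
        pop.append(evaluate(child))  # Step 8: store result (incl. failures).
        if meta_interval and gen % meta_interval == 0:
            pass  # Step 9: meta-summarization hook.
        if migration_interval and gen % migration_interval == 0:
            pass  # Step 10: island migration hook.
    return pop

random.seed(0)
population = evolve({"code": "kernel", "score": 0.5})
```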

Two-Phase Scheduling

Evolution is typically split into two phases:

Phase 1 — Explore (first 40% of budget):

  • 70% full rewrites, 15% diffs, 15% crossover
  • Higher temperature (0.2, 0.5, 0.8)
  • Goal: discover structurally diverse architectures (multi-stream, fused kernel, split put/wait, warp specialization)

Phase 2 — Exploit (remaining 60%):

  • 60% full rewrites, 25% diffs, 15% crossover
  • Lower temperature (0.0, 0.2, 0.5)
  • Goal: refine the best architectures found during exploration

The phase split is controlled by explore_fraction in run_evo.py. The key insight: without initial diversity from the explore phase, the search converges slowly to a lower-quality local optimum.
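
The phase-dependent sampling can be sketched as follows, using the percentages and temperature sets above. `explore_fraction` mirrors the run_evo.py parameter; the helper itself is illustrative:

```python
import random

# Mutation-type probabilities and temperature sets from the two phases above.
EXPLORE = {"patch_probs": {"full_rewrite": 0.70, "diff": 0.15, "crossover": 0.15},
           "temperatures": [0.2, 0.5, 0.8]}
EXPLOIT = {"patch_probs": {"full_rewrite": 0.60, "diff": 0.25, "crossover": 0.15},
           "temperatures": [0.0, 0.2, 0.5]}

def sample_mutation(gen, num_generations, explore_fraction=0.4):
    # First explore_fraction of the budget explores; the rest exploits.
    phase = EXPLORE if gen < explore_fraction * num_generations else EXPLOIT
    kinds, probs = zip(*phase["patch_probs"].items())
    kind = random.choices(kinds, weights=probs, k=1)[0]
    return kind, random.choice(phase["temperatures"])

kind, temp = sample_mutation(gen=1, num_generations=10)  # explore phase
```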

Mutation Forms

The LLM acts as a variation operator — not an open-ended code generator. Its output is structurally bounded by the EVOLVE-BLOCK markers.

Diff Patches

Localized SEARCH/REPLACE edits within evolve blocks. The LLM proposes specific code changes:

<<<< SEARCH
gin.put(world, r, recvwin, offset, sendwin, offset, size, ncclGin_SignalInc(0));
==== REPLACE
gin.put(world, r, recvwin, offset, sendwin, offset, chunk_size, ncclGin_SignalInc(0));
gin.put(world, r, recvwin, offset + chunk_size, sendwin, offset + chunk_size,
        size - chunk_size, ncclGin_SignalInc(1));
>>>>

Best for: fine-grained parameter tuning, adding synchronization, reordering operations.
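
A minimal patcher for this marker syntax might look like the following sketch (the real diff applier in CUCo may differ):

```python
# Apply one <<<< SEARCH / ==== REPLACE / >>>> block to a source string.
def apply_diff(source: str, patch: str) -> str:
    body = patch.split("<<<< SEARCH\n", 1)[1]
    search, rest = body.split("\n==== REPLACE\n", 1)
    replace = rest.split("\n>>>>", 1)[0]
    if search not in source:
        raise ValueError("SEARCH block not found in source")
    return source.replace(search, replace, 1)

code = "a = 1\nb = 2\n"
patched = apply_diff(code, "<<<< SEARCH\nb = 2\n==== REPLACE\nb = 3\n>>>>")
```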

Full Rewrites

Complete replacement of all code within EVOLVE-BLOCK markers. The LLM generates an entirely new implementation while preserving the frozen interface.

Best for: architectural changes (sequential → pipelined, single-kernel → multi-stream).

Crossover

Synthesis from multiple archive programs. The LLM receives 2-3 high-performing candidates and combines their best aspects.

Best for: combining complementary strategies (e.g., one program's stream topology with another's synchronization pattern).

Parent Selection

Module: cuco/database/parents.py

Three strategies are available:

Power-Law Sampling

Default. Programs are ranked by fitness, and selection probability follows a power-law distribution over rank. The exploitation_alpha parameter controls selection pressure:

  • alpha = 0 → uniform random (maximum exploration)
  • alpha = 1 → strong bias toward top programs (maximum exploitation)
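
One way to realize this, assuming a rank-based weighting (the exact formula in cuco/database/parents.py may differ):

```python
import random

def power_law_select(programs, alpha=0.5, scale=4.0, rng=random):
    # Rank by fitness (best first); rank r gets weight (r + 1) ** (-alpha * scale),
    # so alpha = 0 is uniform random and alpha = 1 heavily favors top ranks.
    ranked = sorted(programs, key=lambda p: p["score"], reverse=True)
    weights = [(r + 1) ** (-alpha * scale) for r in range(len(ranked))]
    return rng.choices(ranked, weights=weights, k=1)[0]

random.seed(0)
programs = [{"score": 0.9}, {"score": 0.5}, {"score": 0.1}]
parent = power_law_select(programs, alpha=1.0)  # strongly biased to 0.9
```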

Weighted Tree Sampling

Uses a sigmoid-based weighting over the fitness distribution. The parent_selection_lambda parameter controls sharpness — higher values concentrate selection on the best programs.
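
A hypothetical version of this weighting, assuming the median score as the pivot (the actual pivot and scaling in cuco/database/parents.py may differ):

```python
import math

# Scores above the pivot approach weight 1, scores below approach 0;
# lam (parent_selection_lambda) controls how sharp the transition is.
def sigmoid_weights(scores, lam=10.0):
    pivot = sorted(scores)[len(scores) // 2]
    return [1.0 / (1.0 + math.exp(-lam * (s - pivot))) for s in scores]

weights = sigmoid_weights([0.1, 0.5, 0.9])
```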

Beam Search

Maintains num_beams best programs and only selects from this beam. Most exploitative strategy.
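
A sketch of beam selection (uniform choice within the beam is an assumption):

```python
import random

# Keep only the num_beams fittest programs and draw parents from that beam.
def beam_select(programs, num_beams=4, rng=random):
    beam = sorted(programs, key=lambda p: p["score"], reverse=True)[:num_beams]
    return rng.choice(beam)

parent = beam_select([{"score": s} for s in range(10)], num_beams=3)
```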

Archive and Inspirations

Module: cuco/database/inspirations.py

Each mutation prompt includes context from other successful programs:

Archive Inspirations

Drawn from a MAP-Elites diversity archive that maintains structurally distinct high-performing solutions. The archive ensures the LLM sees programs outside the current lineage, analogous to crossover across distant population members.

Configuration:

  • archive_size — maximum archive capacity
  • num_archive_inspirations — how many to include per prompt
  • elite_selection_ratio — proportion of archive slots reserved for fitness elites

Top-K Inspirations

The highest-scoring programs overall, regardless of structural diversity. Provides the LLM with clear performance targets.

Configuration:

  • num_top_k_inspirations — how many to include per prompt
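
Combined, the two sources might be sampled like this (names and signature are illustrative, not the real cuco/database/inspirations.py API):

```python
import random

def sample_inspirations(archive, population, num_archive=2, num_top_k=2, rng=random):
    # Top-k: best programs overall, as explicit performance targets.
    top_k = sorted(population, key=lambda p: p["score"], reverse=True)[:num_top_k]
    # Archive: structurally diverse elites, excluding anything already in top-k.
    pool = [p for p in archive if p not in top_k]
    return rng.sample(pool, min(num_archive, len(pool))) + top_k

population = [{"id": i, "score": i} for i in range(5)]
archive = population[:3]
inspirations = sample_inspirations(archive, population)
```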

Meta-Summarizer

Module: cuco/core/summarizer.py
Class: MetaSummarizer

Every meta_rec_interval generations, the meta-summarizer runs a three-step LLM pipeline:

  1. Summarize: Digest the most recent batch of candidates — their scores, mutation types, architectural choices, and evaluation feedback.
  2. Update scratchpad: Maintain a persistent global scratchpad tracking which strategies have been attempted, which succeeded, and which failed.
  3. Recommend: Produce a ranked list of concrete optimization directions for the next generation.
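
The three steps chain naturally as successive LLM calls. In this sketch `llm` stands in for a chat-completion client, and the prompts are illustrative:

```python
def meta_summarize(recent_batch, scratchpad, llm):
    # 1. Summarize the latest batch of candidates and their outcomes.
    summary = llm(f"Summarize these candidates and their scores: {recent_batch}")
    # 2. Fold the summary into the persistent global scratchpad.
    scratchpad = llm(f"Update this scratchpad: {scratchpad}\nwith: {summary}")
    # 3. Produce ranked optimization directions for the next generation.
    recommendations = llm(f"Given {scratchpad}, rank the next optimization directions")
    return summary, scratchpad, recommendations

# Usage with a trivial echo "LLM":
summary, pad, recs = meta_summarize(["cand-1: 0.8, multi-stream"], "", lambda p: p[:40])
```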

These recommendations are injected into subsequent mutation prompts, creating a closed-loop meta-learning signal. Early recommendations may suggest exploring different fusion levels; later recommendations — informed by observing that multi-stream overlap consistently outperforms full fusion for a particular workload — redirect effort toward refining that strategy.

Output is stored in:

  • meta_memory.json — persistent state (summaries, scratchpad, recommendations)
  • meta_N.txt — human-readable snapshot at generation N

Novelty Filtering

Module: cuco/core/novelty_judge.py
Class: NoveltyJudge

To prevent population collapse, candidates are checked for novelty before evaluation:

  1. Embedding similarity: The candidate's code is embedded (via EmbeddingClient), and cosine similarity is computed against all existing database entries. If similarity exceeds code_embed_sim_threshold (default: 0.995), the candidate is rejected.

  2. LLM novelty assessment (optional): An LLM judges whether the candidate introduces meaningful structural differences.

Rejected candidates are resampled up to max_novelty_attempts times.
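
The embedding-similarity gate reduces to a cosine check against the database. In this sketch the vectors are toy stand-ins (real embeddings come from EmbeddingClient), and the threshold follows the text:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def is_novel(candidate_emb, existing_embs, threshold=0.995):
    # Reject if any existing program is closer than the threshold.
    return all(cosine(candidate_emb, e) < threshold for e in existing_embs)

novel = is_novel([1.0, 0.0], [[0.0, 1.0]])   # orthogonal: accepted
dup = is_novel([1.0, 0.01], [[1.0, 0.0]])    # near-duplicate: rejected
```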

Island Model

Module: cuco/database/islands.py

The search uses multiple independent islands to maintain diversity:

Assignment

Each program belongs to exactly one island. Islands can have different:

  • Seed programs (init_program_paths_per_island)
  • Task system messages (task_sys_msg_per_island)
  • Communication APIs (island_api_types: e.g., island 0 = LSA, island 1 = GIN)

Migration

Every migration_interval generations, top-performing programs are copied between islands:

  • Elitist migration: Copy the best migration_rate fraction of each island to randomly selected targets
  • Directional migration: Follow a configured migration_graph (e.g., LSA island → hybrid island ← GIN island)

Migration cross-pollinates successful patterns without collapsing diversity.
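
Elitist migration can be sketched as follows (data shapes and the random target choice are illustrative):

```python
import random

# Copy each island's top migration_rate fraction into one randomly
# chosen other island, relabeling the copies' island assignment.
def migrate(islands, migration_rate=0.25, rng=random):
    migrants = {}
    for idx, progs in islands.items():
        k = max(1, int(len(progs) * migration_rate))
        migrants[idx] = sorted(progs, key=lambda p: p["score"], reverse=True)[:k]
    for src, elites in migrants.items():
        dst = rng.choice([i for i in islands if i != src])
        islands[dst].extend(dict(p, island_idx=dst) for p in elites)

islands = {0: [{"score": 1.0, "island_idx": 0}], 1: [{"score": 0.5, "island_idx": 1}]}
migrate(islands)
```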

Cascade Evaluation

Every candidate passes through a three-level cascade:

| Level | Check | Cost | On Failure |
| ----- | ----- | ---- | ---------- |
| L1 | Compile (nvcc) | Seconds | Store with score 0, feed compiler errors to next mutation |
| L2 | Run + verify (mpirun) | Seconds-minutes | Store with score 0, feed runtime errors |
| L3 | Benchmark (best of N runs) | Seconds-minutes | Store with measured score |

Failed candidates are retained in the database with their diagnostics. They serve as negative examples that inform future mutations — a form of explicit negative selection absent from classical evolutionary methods.
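
The cascade reduces to an early-exit pipeline. Here compile_fn and run_fn return an (ok, log) pair and bench_fn returns a score; all three stand in for the nvcc, mpirun, and benchmark stages:

```python
def cascade_evaluate(candidate, compile_fn, run_fn, bench_fn, n_runs=3):
    ok, log = compile_fn(candidate)                  # L1: compile
    if not ok:
        return {"score": 0.0, "feedback": log}
    ok, log = run_fn(candidate)                      # L2: run + verify
    if not ok:
        return {"score": 0.0, "feedback": log}
    score = max(bench_fn(candidate) for _ in range(n_runs))  # L3: best of N runs
    return {"score": score, "feedback": "ok"}

failed = cascade_evaluate("k", lambda c: (False, "nvcc: error"), None, None)
passed = cascade_evaluate("k", lambda c: (True, ""), lambda c: (True, ""), lambda c: 5.0)
```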

LLM Feedback

At every cascade level, an LLM feedback agent receives the candidate code and evaluation outcome and generates a concise diagnostic. This feedback is stored with the candidate and injected when its lineage is later selected as a parent.

Candidate Database

Module: cuco/database/dbase.py
Class: ProgramDatabase

All evaluated candidates — including failures — are persisted to an SQLite database. The database serves two roles:

  1. Candidate pool: Source for parent selection, archive sampling, and inspiration retrieval across all islands.
  2. Knowledge base: Backing store for the meta-summarizer, which queries historical results to distill cross-generation patterns.

Embedding-Guided Retrieval

Each candidate's code embedding enables:

  • Novelty filtering: Reject near-duplicates before evaluation
  • Nearest-neighbor lookup: Surface structurally similar programs and their feedback
  • Clustering: Group candidates by architectural similarity for visualization

Schema

The programs table stores:

| Column | Type | Description |
| ------ | ---- | ----------- |
| id | TEXT | Unique identifier |
| code | TEXT | Full source code |
| generation | INTEGER | Generation number |
| island_idx | INTEGER | Island assignment |
| parent_id | TEXT | Parent program ID |
| combined_score | REAL | Fitness score |
| correct | INTEGER | 0 or 1 |
| public_metrics | TEXT | JSON timing data |
| text_feedback | TEXT | LLM feedback |
| embedding | BLOB | Code embedding vector |
| code_diff | TEXT | Mutation diff from parent |
| in_archive | INTEGER | Whether in MAP-Elites archive |
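
An illustrative DDL for this table, assuming plain SQLite (the actual schema statements in cuco/database/dbase.py may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE programs (
        id TEXT PRIMARY KEY, code TEXT, generation INTEGER,
        island_idx INTEGER, parent_id TEXT, combined_score REAL,
        correct INTEGER, public_metrics TEXT, text_feedback TEXT,
        embedding BLOB, code_diff TEXT, in_archive INTEGER)""")
conn.execute("INSERT INTO programs (id, code, generation, island_idx, combined_score, correct) "
             "VALUES ('p0', '// kernel', 0, 0, 1.0, 1)")
# Typical query shape: best correct program for parent selection.
best = conn.execute(
    "SELECT id FROM programs WHERE correct = 1 ORDER BY combined_score DESC LIMIT 1"
).fetchone()
```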

Configuration Summary

Key EvolutionConfig parameters for the slow-path agent:

| Parameter | Default | Description |
| --------- | ------- | ----------- |
| num_generations | 10 | Total generation budget |
| patch_types | ["diff"] | Available mutation forms |
| patch_type_probs | [1.0] | Sampling probabilities |
| llm_models | ["azure-gpt-4.1-mini"] | LLM models for mutation |
| llm_kwargs | {} | Temperature, max_tokens, etc. |
| meta_rec_interval | None | Generations between meta-summaries |
| max_novelty_attempts | 3 | Resamples before accepting a duplicate |
| code_embed_sim_threshold | 1.0 | Cosine similarity rejection threshold |
| use_text_feedback | False | Include LLM feedback in prompts |
| embedding_model | None | Model for code embeddings |
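
Putting it together, a run might override these defaults along the following lines. This is a plain-dict sketch: the field names come from the table above, the 0.995 threshold follows the novelty-filtering section, the other values are illustrative, and the real EvolutionConfig constructor may differ:

```python
# Hypothetical slow-path override of the EvolutionConfig defaults.
evo_cfg = dict(
    num_generations=10,
    patch_types=["diff", "full_rewrite", "crossover"],
    patch_type_probs=[0.25, 0.60, 0.15],   # must align with patch_types
    llm_models=["azure-gpt-4.1-mini"],
    llm_kwargs={"temperature": 0.2},
    meta_rec_interval=5,
    max_novelty_attempts=3,
    code_embed_sim_threshold=0.995,
    use_text_feedback=True,
)
```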

Key DatabaseConfig parameters:

| Parameter | Default | Description |
| --------- | ------- | ----------- |
| num_islands | 4 | Number of independent islands |
| archive_size | 100 | MAP-Elites archive capacity |
| migration_interval | 10 | Generations between migrations |
| parent_selection_strategy | "power_law" | Selection algorithm |

See Configuration Reference for the complete list.