The fast-path agent converts host-driven CUDA+NCCL programs into device-initiated equivalents (GIN or LSA). It prioritizes correctness over performance, producing a conservative but verified baseline that the slow-path agent can then optimize.
```
Host-driven .cu          CUDA Analysis        Host-to-Device        Evolve-Block
(ncclAlltoAll, etc.) ──► (CUDAAnalyzer)  ──►  Transformation  ──►   Annotation
                               │              (LLM + Judge)         (LLM + regex)
                               │                    │                     │
                               ▼                    ▼                     ▼
                        AnalysisReport        Verified device       Annotated seed
                         (comm graph)            .cu file           with EVOLVE-BLOCK
```
The pipeline is implemented in two layers:
- `PreTransformPipeline` (`cuco/transform/pipeline.py`) — orchestrates the ordered steps
- `HostToDeviceTransformer` (`cuco/transform/transformer.py`) — runs the LLM build/verify loop
Module: `cuco/transform/cuda_analyzer.py`
Class: `CUDAAnalyzer`
The analyzer extracts a communication dependency graph from the input source using regex-based pattern matching. It identifies:
- NCCL collectives: `ncclAllReduce`, `ncclAllGather`, `ncclAlltoAll`, `ncclSend`, `ncclRecv`, `ncclGroupStart`/`ncclGroupEnd`
- Memory allocations: `cudaMalloc` vs. `ncclMemAlloc` (device APIs require the latter)
- Kernel launches: names, grid/block dimensions, stream assignments
- Synchronization points: `cudaStreamSynchronize`, `cudaDeviceSynchronize`, `MPI_Barrier`
- Communication graph: for each collective, which buffers are sent/received and which kernels produce/consume them
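The regex pass can be sketched in a few lines of Python. This is a simplified, stand-alone illustration, not the actual `CUDAAnalyzer` implementation; the real analyzer also builds the buffer-level communication graph.

```python
import re

# Simplified sketch of the regex pass: scan CUDA source text for
# host-side NCCL collectives and allocation calls.
COLLECTIVE_RE = re.compile(
    r"\b(ncclAllReduce|ncclAllGather|ncclAlltoAll|ncclSend|ncclRecv"
    r"|ncclGroupStart|ncclGroupEnd)\s*\("
)
ALLOC_RE = re.compile(r"\b(cudaMalloc|ncclMemAlloc)\s*\(")

def scan_source(source: str) -> dict:
    """Return which collectives and allocation APIs appear in the source."""
    return {
        "collectives": COLLECTIVE_RE.findall(source),
        "allocations": ALLOC_RE.findall(source),
    }

src = """
cudaMalloc(&sendbuf, bytes);
ncclAlltoAll(sendbuf, recvbuf, count, ncclFloat, comm, stream);
"""
report = scan_source(src)
print(report["collectives"])  # ['ncclAlltoAll']
print(report["allocations"])  # ['cudaMalloc']
```

A scan like this is enough to drive the skip logic later in the pipeline: if no host collectives are found, the host-to-device transformation step has nothing to do.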
```python
analyzer = CUDAAnalyzer("my_kernel.cu")
report = analyzer.analyze()

# Check if host-side collectives exist
if report.has_host_communication():
    # Feed to the transformer
    llm_context = report.format_for_llm()
```

`format_for_llm()` produces a structured text summary that the rewriter LLM uses to understand the communication pattern, data dependencies, and which buffers need migration from `cudaMalloc` to `ncclMemAlloc`.
Module: `cuco/transform/transformer.py`
Class: `HostToDeviceTransformer`
This is the core of the fast-path agent. It runs an LLM-driven feedback loop:
```
┌─────────────────────┐
│     LLM Rewrite     │
│  (code generation)  │
└────────┬────────────┘
         │ candidate .cu
         ▼
┌─────────────────────┐
│    Build (nvcc)     │
└────────┬────────────┘
         │
 success │ fail → error feedback ──► back to LLM
         ▼
┌─────────────────────┐
│    Run (mpirun)     │
└────────┬────────────┘
         │
    pass │ fail → diagnostic rerun ──► back to LLM
         ▼
┌─────────────────────┐
│      LLM Judge      │
│  (analyzes result)  │
└────────┬────────────┘
         │
verified │ issues found ──► corrective feedback ──► back to LLM
         ▼
      Success!
```
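The control flow above can be sketched with stand-in callables for the LLM, nvcc, and mpirun steps. This is a schematic sketch only, not the transformer's actual interface:

```python
# Schematic sketch of the rewrite/build/run/judge cycle. Each step is a
# stand-in callable so the control flow is visible; the real transformer
# shells out to an LLM for rewriting/judging, nvcc for builds, and
# mpirun for verification runs.
def transform_loop(source, rewrite, build, run, judge, max_iterations=5):
    feedback = ""
    for _ in range(max_iterations):
        candidate = rewrite(source, feedback)          # LLM rewrite
        ok, build_log = build(candidate)               # nvcc
        if not ok:
            feedback = f"build failed:\n{build_log}"   # error feedback -> LLM
            continue
        ok, run_log = run(candidate)                   # mpirun
        if not ok:
            feedback = f"run failed:\n{run_log}"       # diagnostic rerun -> LLM
            continue
        verdict = judge(candidate, run_log)            # LLM judge
        if verdict == "verified":
            return candidate
        feedback = f"judge found issues: {verdict}"    # corrective feedback -> LLM
    raise RuntimeError("transformation did not converge")

# Toy usage: the run step fails until the rewrite reacts to feedback.
result = transform_loop(
    "src",
    rewrite=lambda s, fb: s + ("+fix" if fb else ""),
    build=lambda c: (True, ""),
    run=lambda c: (c.endswith("+fix"), "crash log"),
    judge=lambda c, log: "verified",
)
print(result)  # src+fix
```

The key property of the loop is that every failure mode (compile error, runtime fault, judge objection) is converted into text fed back into the next rewrite attempt.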
By default (`two_stage=True`), the transformation is split into two stages:
Stage A: Infrastructure Setup
The LLM adds device-side NCCL infrastructure while keeping host collectives intact:
- Replace `cudaMalloc` with `ncclMemAlloc` for communication buffers
- Create and register NCCL windows (`ncclCommWindowRegister`)
- Configure device communicator requirements (`ncclDevCommRequirements`)
- Instantiate the device communicator (`ncclDevCommCreate`)
- Set up cooperative kernel launch infrastructure
The host-side NCCL collectives (`ncclAlltoAll`, etc.) remain unchanged. This isolates infrastructure errors from communication logic errors.
Stage B: Collective Replacement
With infrastructure in place, the LLM replaces host collectives with device-initiated equivalents:
- For GIN: `ncclAlltoAll` → `gin.put()` + `gin.flush()` + `gin.waitSignal()`
- For LSA: `ncclAlltoAll` → `ncclGetLsaPointer()` + direct stores + `ncclLsaBarrierSession.sync()`
All directives are set conservatively:
- CTA-level issuance (`ncclCoopCta`) to avoid warp-level divergence
- Fully deferred placement to minimize ordering complexity
- Global synchronization scope for cross-rank visibility
- Coarse transfer granularity
Set `two_stage=False` for simpler programs where infrastructure and collective replacement can be done together.
When a runtime failure occurs, the transformer can inject `cudaDeviceSynchronize()` after each kernel launch to isolate which kernel is causing the fault. This provides much more targeted feedback to the LLM than a generic crash report.
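A minimal version of this injection can be sketched as a regex rewrite over triple-chevron launch statements. This is illustrative only: `CUDA_CHECK` is a hypothetical error-checking macro, and the real transformer handles launch syntax (multi-line launches, macros) more carefully.

```python
import re

# Sketch: append a synchronization probe after every single-line
# triple-chevron kernel launch so a fault is attributed to the kernel
# that caused it. CUDA_CHECK is a hypothetical error-checking macro.
LAUNCH_RE = re.compile(r"(\w+\s*<<<[^>]*>>>\s*\([^;]*\);)")

def inject_sync(source: str) -> str:
    probe = " CUDA_CHECK(cudaDeviceSynchronize()); // fault isolation probe"
    return LAUNCH_RE.sub(lambda m: m.group(1) + probe, source)

src = "scatter<<<grid, block, 0, s>>>(buf, n);\n"
print(inject_sync(src))
```

After the failing kernel is identified, the probes would be removed again so the timed region is not perturbed.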
The combined loop typically converges in 2-4 iterations per stage. Stage A usually succeeds in 1-2 iterations (infrastructure is more formulaic). Stage B may need 3-4 iterations due to the complexity of synchronization semantics.
Function: `insert_evolve_markers()` in `transformer.py`
After transformation, the code is annotated with mutable-region markers:
```cpp
// EVOLVE-BLOCK-START
// ... kernel definitions and pipeline logic ...
// EVOLVE-BLOCK-END
```

The annotator uses LLM analysis with a regex fallback:
- LLM pass: Ask the LLM to identify which regions are safe to mutate
- Regex fallback: If LLM fails, use pattern matching to find kernel definitions and pipeline sections
Frozen regions are explicitly excluded:
- MPI/NCCL initialization and teardown
- Verification and output formatting
- Main function structure
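The regex fallback can be sketched as follows. This is a simplified stand-in for `insert_evolve_markers()`, wrapping each `__global__` kernel definition and leaving everything else frozen:

```python
import re

# Sketch of the regex fallback: wrap each __global__ kernel definition
# in EVOLVE-BLOCK markers. A small brace counter finds the kernel's
# closing brace; init/teardown, verification, and main stay unmarked.
KERNEL_RE = re.compile(r"__global__\s+void\s+\w+\s*\([^)]*\)\s*\{")

def mark_kernels(source: str) -> str:
    out, pos = [], 0
    for m in KERNEL_RE.finditer(source):
        depth, i = 0, m.end() - 1        # index of the opening '{'
        while i < len(source):           # walk to the matching '}'
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    break
            i += 1
        out.append(source[pos:m.start()])
        out.append("// EVOLVE-BLOCK-START\n")
        out.append(source[m.start():i + 1])
        out.append("\n// EVOLVE-BLOCK-END")
        pos = i + 1
    out.append(source[pos:])
    return "".join(out)

src = "__global__ void axpy(float* x) { x[0] += 1.0f; }\nint main() { return 0; }\n"
print(mark_kernels(src))
```

Note that `main` is untouched: only the kernel body lands inside the mutable region, matching the frozen-region rules above.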
The pipeline can optionally inject communication warmup before the timed section. The first GIN/LSA/NCCL call triggers lazy RDMA/NIC initialization (10-50 ms). Warmup rounds amortize this cost.
The warmup step:
- Asks the LLM to add 2 rounds of dummy communication calls before timing starts
- Builds and verifies the result compiles and runs correctly
- Skips if warmup is already detected in the code
Module: `cuco/transform/pipeline.py`
Class: `PreTransformPipeline`
Orchestrates the four steps in order, with skip logic:
```python
pipeline = PreTransformPipeline(
    config=transform_config,
    steps=["analyze", "host_to_device", "evolve_markers", "warmup"],
)
result = pipeline.run(source_path="my_kernel.cu", output_dir="_transform_output")
```

| Step | Method | Skipped when |
|---|---|---|
| `analyze` | Regex (Python) | Never — always runs |
| `host_to_device` | LLM loop | No host NCCL collectives detected |
| `evolve_markers` | LLM + regex | EVOLVE-BLOCK markers already present |
| `warmup` | LLM + build/verify | Warmup section already detected |
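The skip conditions can be sketched as a single predicate. This is illustrative only; the marker and warmup checks here are guesses at the real conditions in `PreTransformPipeline`:

```python
# Sketch of the per-step skip checks. The EVOLVE-BLOCK and warmup
# heuristics are assumptions; the actual checks live in
# cuco/transform/pipeline.py.
def should_skip(step: str, source: str, report=None) -> bool:
    if step == "analyze":
        return False                                # always runs
    if step == "host_to_device":
        return not report.has_host_communication()  # nothing to transform
    if step == "evolve_markers":
        return "EVOLVE-BLOCK-START" in source       # already annotated
    if step == "warmup":
        return "warmup" in source.lower()           # already warmed up
    return False
```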
Each step produces a `PipelineStepResult` with timing, cost, and error information.
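A plausible shape for that record is shown below; the field names are assumptions based on the description (timing, cost, error), so check `cuco/transform/pipeline.py` for the actual definition.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape of a per-step result record. Field names are
# assumptions, not the actual cuco definition.
@dataclass
class PipelineStepResult:
    step: str                      # e.g. "analyze", "host_to_device"
    success: bool
    duration_s: float              # wall-clock time for the step
    cost_usd: float                # LLM spend attributed to the step
    error: Optional[str] = None    # populated when success is False

r = PipelineStepResult(step="analyze", success=True, duration_s=0.4, cost_usd=0.0)
print(r.success)  # True
```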
CUCo offers two modes for the fast-path transformation:
Agent mode (the default) uses the Claude Code CLI (`claude -p`) with full file-system autonomy. The agent:
- Reads the source file
- Iteratively edits, builds, and runs until verification passes
- Has access to Bash, Read, Write, and Edit tools
```shell
python run_transform.py   # Agent mode (default)
```

Advantages: more flexible; can handle unexpected edge cases. Disadvantages: higher cost, less predictable.
Loop mode uses the `HostToDeviceTransformer` with a fixed rewrite-build-judge cycle:
```shell
python run_transform.py --no-agent
```

Advantages: deterministic, lower cost, easier to debug. Disadvantages: less flexible when the transformation requires creative solutions.
See Configuration Reference for all `TransformConfig` parameters. The most important ones:

| Parameter | Default | Description |
|---|---|---|
| `api_type` | `"gin"` | Target API: `"gin"` or `"lsa"` |
| `two_stage` | `True` | Split into infrastructure + replacement stages |
| `max_iterations` | 5 | Max iterations (single-stage mode) |
| `stage_a_max_iterations` | 5 | Max iterations for infrastructure stage |
| `stage_b_max_iterations` | 10 | Max iterations for replacement stage |
| `rewrite_model` | Sonnet 4.6 | LLM for code generation |
| `judge_model` | `""` (same) | LLM for judge feedback |
| `reference_code` | `""` | Working device-side example to show the LLM |
| `nccl_api_docs` | `""` | NCCL API documentation string |
| `verification_pass_str` | `"Verification: PASS"` | Expected output for success |