A CLI agent that writes, profiles, benchmarks, and optimizes GPU kernels. SSH into any GPU machine, iterate on CUDA/Triton/HIP/OpenCL kernels with automated profiling loops, and track every optimization iteration.
# Install dependencies
npm install
# Set your Anthropic API key
npm run dev -- auth
# Load built-in GPU optimization docs (one-time)
npm run dev -- gpu docs builtin
# Start the agent
npm run dev

Then inside the REPL:
> Connect to my-gpu-server.com as ubuntu and show the GPU status
> Upload my_kernel.cu, compile with nvcc -O3, benchmark it, profile with ncu,
and optimize it to maximize memory bandwidth
L2 connects to a GPU machine (remote via SSH or local), then enters an autonomous optimization loop:
Write kernel → Upload → Compile → Benchmark → Profile → Analyze → Optimize → Repeat
The agent checkpoints every 3 iterations with a comparison table, asking whether to continue or stop. It tracks every iteration in a SQLite database so you can compare any two versions.
git clone https://github.com/AnishKamatam/L2.git
cd L2
npm install
npm run build

Requirements:

- Node.js >= 20
- Anthropic API key — get one at console.anthropic.com
- GPU machine with SSH access (or local GPU)
The remote (or local) GPU machine needs:
| Tool | Required For | Install |
|---|---|---|
| `nvidia-smi` | GPU status | Comes with NVIDIA driver |
| `nvcc` | Compiling CUDA kernels | `apt install nvidia-cuda-toolkit` |
| `ncu` | Nsight Compute profiling | `apt install nsight-compute` (auto-installed if missing) |
| `nsys` | Nsight Systems tracing | `apt install nsight-systems` (auto-installed if missing) |
| `python3` + `triton` | Triton kernels | `pip install triton` |
| `python3` + `torch` | PyTorch profiling | `pip install torch` |
L2 will auto-detect and auto-install ncu and nsys if they're missing.
l2 # Start interactive session (default)
l2 chat # Same as above
l2 chat -m <model> # Use a specific Claude model
l2 auth # Set Anthropic API key
l2 auth --show # Show current key status
l2 gpu connect <host> # Test GPU connection
l2 gpu status # Show configured GPU info
l2 gpu docs builtin # Load built-in CUDA/Triton/profiling docs
l2 gpu docs ingest <url> # Ingest custom documentation

Slash commands inside the REPL:

/help Show all commands
/gpu Show GPU connection status
/iterations <kernel> Show optimization history for a kernel
/benchmark <kernel> Show best benchmark results
/compare <kernel> [a] [b] Compare two iterations side-by-side
/clear Reset conversation
/history Show message history
/exit Exit
Remote via SSH (auto-discovers your SSH key):
Connect to gpu-server.example.com as ubuntu
Local GPU (runs on the same machine):
Connect to the local GPU
Password auth:
Connect to 192.168.1.100 with username root using password authentication
SSH key auto-discovery checks ~/.ssh/id_ed25519, id_rsa, and id_ecdsa in order.
Optimize a CUDA kernel:
Upload vector_add.cu to the GPU, compile with nvcc -O3 -arch=sm_90,
benchmark it 100 times, profile with ncu, analyze the bottlenecks,
and optimize it. Target: 80% of peak memory bandwidth.
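If you don't have a kernel handy, a minimal file like the sketch below is enough to exercise the whole loop. The file name, problem size, and timing harness are illustrative rather than anything L2 requires; the warmup and iteration counts simply match the benchmarkWarmup / benchmarkIterations values in the example GPU config further down.

```cuda
// vector_add.cu — deliberately naive starting point (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];                  // one element per thread, coalesced
}

int main() {
    const int n = 1 << 24;                          // 16M floats
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);

    dim3 block(256), grid((n + block.x - 1) / block.x);

    for (int i = 0; i < 5; ++i)                     // warmup runs
        vector_add<<<grid, block>>>(a, b, c, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)                   // timed iterations
        vector_add<<<grid, block>>>(a, b, c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    ms /= 100.f;
    // 2 reads + 1 write per element = 3 * bytes moved per run
    printf("avg %.3f ms, %.1f GB/s\n", ms, 3.0 * bytes / (ms * 1e6));

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

A kernel like this is memory-bound, so the printed GB/s figure and ncu's DRAM throughput are the numbers to watch against a bandwidth target.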
Optimize a Triton kernel:
Upload matmul.py to the GPU and run it. Profile with the PyTorch profiler,
then tune the tile sizes (BLOCK_M, BLOCK_N, BLOCK_K) for maximum TFLOPS.
Compare iterations:
Show me the comparison of all iterations for my_kernel
src/
├── cli/ # CLI entry point (Commander)
│ └── app.ts
├── core/ # Agent orchestration
│ ├── agent.ts # Main agent loop (Claude + tools)
│ ├── session.ts # REPL with slash commands
│ ├── config.ts # API key config (~/.l2/config.json)
│ └── types.ts
├── providers/ # LLM integration
│ ├── anthropic.ts # Streaming Anthropic client
│ ├── system-prompt.ts # Base system prompt
│ └── gpu-system-prompt.ts # GPU-specialized system prompt
├── gpu/ # GPU subsystem
│ ├── connection.ts # SSH client (ssh2)
│ ├── sftp.ts # File transfer
│ ├── local-adapter.ts # Local GPU mode
│ ├── cloud-adapter.ts # Lambda Labs / RunPod APIs
│ ├── manager.ts # Connection manager singleton
│ ├── iteration.ts # Autonomous optimization loop
│ ├── config.ts # GPU config (~/.l2/gpu.json)
│ ├── types.ts # GPU type definitions
│ ├── parsers/ # Profiling output parsers
│ │ ├── ncu-parser.ts # Nsight Compute CSV/text
│ │ ├── nsys-parser.ts # Nsight Systems CSV
│ │ ├── torch-profiler-parser.ts
│ │ └── nvidia-smi-parser.ts
│ └── docs/ # Documentation RAG
│ ├── ingest.ts # Chunk + store docs
│ └── retrieve.ts # Semantic search + retrieval
├── tools/ # Tool implementations
│ ├── schema.ts # All 28 tool definitions
│ ├── executor.ts # Tool dispatch
│ ├── gpu/ # GPU-specific tools
│ │ ├── connect.ts # gpu_connect / gpu_disconnect
│ │ ├── status.ts # gpu_status
│ │ ├── exec.ts # gpu_exec
│ │ ├── transfer.ts # gpu_upload / gpu_download
│ │ ├── profile-ncu.ts # profile_ncu
│ │ ├── profile-nsys.ts # profile_nsys
│ │ ├── profile-torch.ts # profile_torch
│ │ ├── nvidia-smi.ts # nvidia_smi
│ │ ├── benchmark.ts # benchmark_kernel
│ │ ├── compare.ts # compare_iterations
│ │ └── ensure-tools.ts # Auto-find/install profiling binaries
│ └── ... # Standard tools (read, write, grep, shell, etc.)
├── memory/ # Persistent memory
│ ├── db.ts # SQLite schema
│ ├── store.ts # Conversation/tool logging
│ ├── search.ts # Hybrid FTS + semantic search
│ ├── embeddings.ts # Local embeddings (MiniLM)
│ └── gpu-store.ts # Kernel iterations + benchmark storage
└── ui/ # Terminal rendering
└── ...
Standard tools:

| Tool | Purpose |
|---|---|
| `read_file` | Read file contents |
| `write_file` | Create/overwrite files |
| `edit_file` | Replace a unique string in a file |
| `search_and_replace` | Pattern replace (literal/regex) |
| `insert_lines` | Insert text at a line number |
| `grep` | Regex search across files |
| `glob` | Find files by pattern |
| `list_dir` | List directory contents |
| `shell` | Execute local shell commands |
| `mkdir` / `cd` / `delete_file` / `move_file` / `copy_file` | Filesystem ops |
| `web_fetch` | Fetch URL content |
| `memory_search` | Search conversation history |
GPU tools:

| Tool | Purpose |
|---|---|
| `gpu_connect` | SSH into GPU machine (or detect local GPU) |
| `gpu_disconnect` | Close connection |
| `gpu_status` | Parsed `nvidia-smi` output |
| `gpu_exec` | Run commands on GPU machine (5-minute timeout) |
| `gpu_upload` / `gpu_download` | SFTP file transfer |
| `profile_ncu` | Nsight Compute profiling with structured metrics |
| `profile_nsys` | Nsight Systems tracing |
| `profile_torch` | PyTorch profiler |
| `nvidia_smi` | Quick GPU monitoring |
| `benchmark_kernel` | N-run benchmarking with statistics |
| `compare_iterations` | Side-by-side iteration comparison |
The agent has built-in expertise in:
- CUDA — thread hierarchy, memory model, warp semantics, shared memory tiling, coalescing, bank conflicts, occupancy tuning, warp-level primitives (see the tiling sketch after this list)
- Triton — tile-based programming, autotune, tl.dot, block size selection, num_warps/num_stages tuning
- HIP — CUDA-to-HIP porting, wavefront (64 threads) differences, hipcc compilation
- OpenCL — work-groups, work-items, local/global memory model
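To make a few of the CUDA items above concrete, here is a small illustrative kernel (hand-written for this README, not agent output) that combines coalesced global accesses, shared-memory tiling, and the classic +1 padding trick for avoiding bank conflicts:

```cuda
#define TILE 32

// Tiled matrix transpose: launch with dim3 block(TILE, TILE) and a grid of
// (ceil(cols/TILE), ceil(rows/TILE)) blocks for a rows x cols input.
__global__ void transpose_tiled(const float* __restrict__ in,
                                float* __restrict__ out,
                                int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];           // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;         // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;         // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];      // coalesced load

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;             // column in the output
    y = blockIdx.x * TILE + threadIdx.y;             // row in the output
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];     // coalesced store
}
```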
It also knows how to interpret profiling output:
- ncu metrics — SM throughput, DRAM throughput, occupancy, warp stall reasons, L1/L2 hit rates, roofline analysis
- nsys traces — kernel launch gaps, memory transfer overlap, API call overhead (a stream-overlap sketch follows this list)
- PyTorch profiler — CUDA vs CPU time, operator breakdown, memory allocations
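As one example of the kind of change an nsys trace tends to motivate (copies serializing with compute), here is a rough sketch of a chunked copy/compute pipeline on multiple CUDA streams; the kernel, function names, and chunk count are invented for illustration:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for whatever kernel the trace showed waiting on copies.
__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Split the work into chunks so the H2D copy of one chunk can overlap with the
// kernel running on the previous chunk. `host` must be pinned (cudaHostAlloc /
// cudaHostRegister) or the async copy silently degrades to a synchronous one.
void run_chunked(const float* host, float* dev, int n, int chunks) {
    cudaStream_t streams[8];                        // assumes chunks <= 8
    int per = n / chunks;                           // assumes n % chunks == 0
    for (int i = 0; i < chunks; ++i) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < chunks; ++i) {
        int off = i * per;
        cudaMemcpyAsync(dev + off, host + off, per * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        process<<<(per + 255) / 256, 256, 0, streams[i]>>>(dev + off, per);
    }

    cudaDeviceSynchronize();                        // wait for all streams
    for (int i = 0; i < chunks; ++i) cudaStreamDestroy(streams[i]);
}
```

With pinned host memory and the work split across streams, copies and kernels from different chunks show up as overlapping rows in the nsys timeline instead of a strictly serial sequence.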
~/.l2/config.json:

{
"apiKey": "sk-ant-..."
}

~/.l2/gpu.json:

{
"defaultConnection": {
"host": "gpu-server.example.com",
"username": "ubuntu",
"authMethod": "key"
},
"profiling": {
"benchmarkWarmup": 5,
"benchmarkIterations": 100
},
"checkpointEvery": 3,
"maxAutoIterations": 10
}

Per-project kernel settings:

{
"framework": "cuda",
"remoteWorkDir": "/home/ubuntu/kernels/my-project",
"compileCommand": "nvcc -O3 -arch=sm_90 kernel.cu -o kernel",
"runCommand": "./kernel"
}

All data is stored in .l2/memory.db (SQLite, per-project):
- Conversations — every message and tool call
- Kernel iterations — code snapshots, benchmark results, profiling data, analysis notes
- Benchmark results — granular metrics (mean, median, p95, GFLOPS, bandwidth)
- Documentation chunks — ingested docs with FTS5 search
Query iteration history anytime with /iterations <kernel> or the compare_iterations tool.
npm run dev # Run in development mode (tsx)
npm run build # Build with tsup
npm run typecheck # Type check without emitting
npm run clean # Remove dist/

License: ISC