L2: Performance Engineering Agent

A CLI agent that writes, profiles, benchmarks, and optimizes GPU kernels. SSH into any GPU machine, iterate on CUDA/Triton/HIP/OpenCL kernels with automated profiling loops, and track every optimization iteration.

Quick Start

# Install dependencies
npm install

# Set your Anthropic API key
npm run dev -- auth

# Load built-in GPU optimization docs (one-time)
npm run dev -- gpu docs builtin

# Start the agent
npm run dev

Then inside the REPL:

> Connect to my-gpu-server.com as ubuntu and show the GPU status

> Upload my_kernel.cu, compile with nvcc -O3, benchmark it, profile with ncu,
  and optimize it to maximize memory bandwidth

What It Does

L2 connects to a GPU machine (remote via SSH or local), then enters an autonomous optimization loop:

Write kernel → Upload → Compile → Benchmark → Profile → Analyze → Optimize → Repeat

The agent checkpoints every 3 iterations with a comparison table, asking whether to continue or stop. It tracks every iteration in a SQLite database so you can compare any two versions.
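
In simplified form, the loop's control flow looks roughly like the sketch below; the names are illustrative, not the actual functions in src/gpu/iteration.ts.

// A minimal sketch of the checkpointed optimization loop; all names here are
// illustrative, not the project's real API.
interface IterationResult {
  version: number;
  meanMs: number;   // mean kernel time reported by the benchmark step
  notes: string;    // analysis summary from the profiling step
}

async function optimizeKernel(
  runIteration: (version: number) => Promise<IterationResult>, // write/upload/compile/benchmark/profile/analyze
  askToContinue: () => Promise<boolean>,                        // checkpoint prompt shown to the user
  maxAutoIterations = 10,
  checkpointEvery = 3,
): Promise<IterationResult[]> {
  const history: IterationResult[] = [];
  for (let version = 1; version <= maxAutoIterations; version++) {
    history.push(await runIteration(version));   // each result is also persisted to .l2/memory.db
    if (version % checkpointEvery === 0) {
      console.table(history);                    // comparison table at the checkpoint
      if (!(await askToContinue())) break;
    }
  }
  return history;
}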

Installation

git clone https://github.com/AnishKamatam/L2.git
cd L2
npm install
npm run build

Requirements

  • Node.js >= 20
  • Anthropic API key — get one at console.anthropic.com
  • GPU machine with SSH access (or local GPU)

GPU Machine Requirements

The remote (or local) GPU machine needs:

Tool               Required For               Install
nvidia-smi         GPU status                 Comes with NVIDIA driver
nvcc               Compiling CUDA kernels     apt install nvidia-cuda-toolkit
ncu                Nsight Compute profiling   apt install nsight-compute (auto-installed if missing)
nsys               Nsight Systems tracing     apt install nsight-systems (auto-installed if missing)
python3 + triton   Triton kernels             pip install triton
python3 + torch    PyTorch profiling          pip install torch

L2 will auto-detect and auto-install ncu and nsys if they're missing.
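
The detect-then-install step amounts to something like the following sketch. It is written as a local check for brevity; the real ensure-tools logic runs these commands on the connected GPU machine, and the function name is illustrative.

import { execSync } from "node:child_process";

// Check whether a profiler binary is on PATH; if not, fall back to apt.
function ensureTool(binary: string, aptPackage: string): void {
  try {
    execSync(`which ${binary}`, { stdio: "ignore" });
  } catch {
    console.log(`${binary} not found, installing ${aptPackage}...`);
    execSync(`sudo apt-get install -y ${aptPackage}`, { stdio: "inherit" });
  }
}

ensureTool("ncu", "nsight-compute");
ensureTool("nsys", "nsight-systems");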

Usage

CLI Commands

l2                          # Start interactive session (default)
l2 chat                     # Same as above
l2 chat -m <model>          # Use a specific Claude model
l2 auth                     # Set Anthropic API key
l2 auth --show              # Show current key status
l2 gpu connect <host>       # Test GPU connection
l2 gpu status               # Show configured GPU info
l2 gpu docs builtin         # Load built-in CUDA/Triton/profiling docs
l2 gpu docs ingest <url>    # Ingest custom documentation

Slash Commands (inside REPL)

/help                       Show all commands
/gpu                        Show GPU connection status
/iterations <kernel>        Show optimization history for a kernel
/benchmark <kernel>         Show best benchmark results
/compare <kernel> [a] [b]   Compare two iterations side-by-side
/clear                      Reset conversation
/history                    Show message history
/exit                       Exit

Connecting to a GPU

Remote via SSH (auto-discovers your SSH key):

Connect to gpu-server.example.com as ubuntu

Local GPU (runs on the same machine):

Connect to the local GPU

Password auth:

Connect to 192.168.1.100 with username root using password authentication

SSH key auto-discovery checks ~/.ssh/id_ed25519, id_rsa, and id_ecdsa in order.
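
That discovery order amounts to roughly this sketch (the actual code in src/gpu/connection.ts may differ in details):

import { existsSync, readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Try the known key names in ~/.ssh in order and return the first match.
function discoverSshKey(): string | undefined {
  for (const name of ["id_ed25519", "id_rsa", "id_ecdsa"]) {
    const keyPath = join(homedir(), ".ssh", name);
    if (existsSync(keyPath)) return readFileSync(keyPath, "utf8");
  }
  return undefined; // no key found: fall back to password authentication
}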

Example Workflows

Optimize a CUDA kernel:

Upload vector_add.cu to the GPU, compile with nvcc -O3 -arch=sm_90,
benchmark it 100 times, profile with ncu, analyze the bottlenecks,
and optimize it. Target: 80% of peak memory bandwidth.

Optimize a Triton kernel:

Upload matmul.py to the GPU and run it. Profile with the PyTorch profiler,
then tune the tile sizes (BLOCK_M, BLOCK_N, BLOCK_K) for maximum TFLOPS.

Compare iterations:

Show me the comparison of all iterations for my_kernel

Architecture

src/
├── cli/                    # CLI entry point (Commander)
│   └── app.ts
├── core/                   # Agent orchestration
│   ├── agent.ts            # Main agent loop (Claude + tools)
│   ├── session.ts          # REPL with slash commands
│   ├── config.ts           # API key config (~/.l2/config.json)
│   └── types.ts
├── providers/              # LLM integration
│   ├── anthropic.ts        # Streaming Anthropic client
│   ├── system-prompt.ts    # Base system prompt
│   └── gpu-system-prompt.ts # GPU-specialized system prompt
├── gpu/                    # GPU subsystem
│   ├── connection.ts       # SSH client (ssh2)
│   ├── sftp.ts             # File transfer
│   ├── local-adapter.ts    # Local GPU mode
│   ├── cloud-adapter.ts    # Lambda Labs / RunPod APIs
│   ├── manager.ts          # Connection manager singleton
│   ├── iteration.ts        # Autonomous optimization loop
│   ├── config.ts           # GPU config (~/.l2/gpu.json)
│   ├── types.ts            # GPU type definitions
│   ├── parsers/            # Profiling output parsers
│   │   ├── ncu-parser.ts       # Nsight Compute CSV/text
│   │   ├── nsys-parser.ts      # Nsight Systems CSV
│   │   ├── torch-profiler-parser.ts
│   │   └── nvidia-smi-parser.ts
│   └── docs/               # Documentation RAG
│       ├── ingest.ts           # Chunk + store docs
│       └── retrieve.ts         # Semantic search + retrieval
├── tools/                  # Tool implementations
│   ├── schema.ts           # All 28 tool definitions
│   ├── executor.ts         # Tool dispatch
│   ├── gpu/                # GPU-specific tools
│   │   ├── connect.ts          # gpu_connect / gpu_disconnect
│   │   ├── status.ts           # gpu_status
│   │   ├── exec.ts             # gpu_exec
│   │   ├── transfer.ts         # gpu_upload / gpu_download
│   │   ├── profile-ncu.ts      # profile_ncu
│   │   ├── profile-nsys.ts     # profile_nsys
│   │   ├── profile-torch.ts    # profile_torch
│   │   ├── nvidia-smi.ts       # nvidia_smi
│   │   ├── benchmark.ts        # benchmark_kernel
│   │   ├── compare.ts          # compare_iterations
│   │   └── ensure-tools.ts     # Auto-find/install profiling binaries
│   └── ...                 # Standard tools (read, write, grep, shell, etc.)
├── memory/                 # Persistent memory
│   ├── db.ts               # SQLite schema
│   ├── store.ts            # Conversation/tool logging
│   ├── search.ts           # Hybrid FTS + semantic search
│   ├── embeddings.ts       # Local embeddings (MiniLM)
│   └── gpu-store.ts        # Kernel iterations + benchmark storage
└── ui/                     # Terminal rendering
    └── ...

Tools

Standard Tools (16)

Tool                        Purpose
read_file                   Read file contents
write_file                  Create/overwrite files
edit_file                   Replace unique string in file
search_and_replace          Pattern replace (literal/regex)
insert_lines                Insert text at line number
grep                        Regex search across files
glob                        Find files by pattern
list_dir                    List directory contents
shell                       Execute local shell commands
mkdir / cd / delete_file /
  move_file / copy_file     Filesystem ops
web_fetch                   Fetch URL content
memory_search               Search conversation history

GPU Tools (12)

Tool                        Purpose
gpu_connect                 SSH into GPU machine (or detect local GPU)
gpu_disconnect              Close connection
gpu_status                  Parsed nvidia-smi output
gpu_exec                    Run commands on GPU machine (5 min timeout)
gpu_upload / gpu_download   SFTP file transfer
profile_ncu                 Nsight Compute profiling with structured metrics
profile_nsys                Nsight Systems tracing
profile_torch               PyTorch profiler
nvidia_smi                  Quick GPU monitoring
benchmark_kernel            N-run benchmarking with statistics
compare_iterations          Side-by-side iteration comparison
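
All of these are declared in src/tools/schema.ts and handed to Claude as tool definitions. A sketch of what one definition could look like in the Anthropic tool-use format; the parameter names inside input_schema are illustrative, not the project's actual schema:

// Illustrative tool definition in the Anthropic Messages API "tools" format.
const benchmarkKernelTool = {
  name: "benchmark_kernel",
  description:
    "Run a compiled kernel N times on the GPU machine and report timing statistics.",
  input_schema: {
    type: "object" as const,
    properties: {
      command: { type: "string", description: "Command that launches the kernel, e.g. ./kernel" },
      warmup: { type: "number", description: "Warm-up runs excluded from statistics" },
      iterations: { type: "number", description: "Timed runs used for mean/median/p95" },
    },
    required: ["command"],
  },
};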

GPU Optimization Knowledge

The agent has built-in expertise in:

  • CUDA — thread hierarchy, memory model, warp semantics, shared memory tiling, coalescing, bank conflicts, occupancy tuning, warp-level primitives
  • Triton — tile-based programming, autotune, tl.dot, block size selection, num_warps/num_stages tuning
  • HIP — CUDA-to-HIP porting, wavefront (64 threads) differences, hipcc compilation
  • OpenCL — work-groups, work-items, local/global memory model

It also knows how to interpret profiling output:

  • ncu metrics — SM throughput, DRAM throughput, occupancy, warp stall reasons, L1/L2 hit rates, roofline analysis
  • nsys traces — kernel launch gaps, memory transfer overlap, API call overhead
  • PyTorch profiler — CUDA vs CPU time, operator breakdown, memory allocations
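
For example, the analysis step can boil ncu throughput and occupancy percentages down to a first-order bottleneck call. A rough sketch with a hypothetical parsed-metrics shape (the real ncu-parser output may differ):

// Hypothetical shape for a parsed ncu summary.
interface NcuSummary {
  smThroughputPct: number;     // compute (SM) throughput, % of peak
  dramThroughputPct: number;   // DRAM throughput, % of peak
  achievedOccupancyPct: number;
}

// First-order classification in the spirit of roofline analysis.
function classifyBottleneck(s: NcuSummary): string {
  if (s.dramThroughputPct > 60 && s.dramThroughputPct > s.smThroughputPct) {
    return "memory-bound: improve coalescing, reuse data via shared-memory tiling";
  }
  if (s.smThroughputPct > 60) {
    return "compute-bound: cut instruction count, use warp-level primitives";
  }
  if (s.achievedOccupancyPct < 30) {
    return "latency-bound: raise occupancy (registers, block size) or add ILP";
  }
  return "neither unit saturated: check launch overhead and warp stall reasons";
}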

Configuration

Global config (~/.l2/config.json)

{
  "apiKey": "sk-ant-..."
}

GPU config (~/.l2/gpu.json)

{
  "defaultConnection": {
    "host": "gpu-server.example.com",
    "username": "ubuntu",
    "authMethod": "key"
  },
  "profiling": {
    "benchmarkWarmup": 5,
    "benchmarkIterations": 100
  },
  "checkpointEvery": 3,
  "maxAutoIterations": 10
}

Per-project overrides (.l2/gpu.json)

{
  "framework": "cuda",
  "remoteWorkDir": "/home/ubuntu/kernels/my-project",
  "compileCommand": "nvcc -O3 -arch=sm_90 kernel.cu -o kernel",
  "runCommand": "./kernel"
}
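
Per-project values take precedence over the global file. A simplified sketch of how such a merge could work (shallow merge shown for brevity; the function name is illustrative):

import { existsSync, readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Read ~/.l2/gpu.json, then overlay the project's .l2/gpu.json if present.
function loadGpuConfig(projectDir: string): Record<string, unknown> {
  const globalPath = join(homedir(), ".l2", "gpu.json");
  const projectPath = join(projectDir, ".l2", "gpu.json");
  const read = (p: string) =>
    existsSync(p) ? JSON.parse(readFileSync(p, "utf8")) : {};
  return { ...read(globalPath), ...read(projectPath) };
}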

Memory & Iteration Tracking

All data is stored in .l2/memory.db (SQLite, per-project):

  • Conversations — every message and tool call
  • Kernel iterations — code snapshots, benchmark results, profiling data, analysis notes
  • Benchmark results — granular metrics (mean, median, p95, GFLOPS, bandwidth)
  • Documentation chunks — ingested docs with FTS5 search

Query iteration history anytime with /iterations <kernel> or the compare_iterations tool.
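
The stored statistics follow the usual definitions. Assuming per-run timings in milliseconds plus known byte and FLOP counts for the kernel, they can be computed like this (illustrative, not the project's exact code):

// Summary statistics for a benchmark run: mean, median, p95, GFLOPS, bandwidth.
function summarize(timesMs: number[], bytesMoved: number, flops: number) {
  const sorted = [...timesMs].sort((a, b) => a - b);
  const mean = timesMs.reduce((s, t) => s + t, 0) / timesMs.length;
  const median = sorted[Math.floor(sorted.length / 2)];
  const p95 = sorted[Math.floor(sorted.length * 0.95)];
  const seconds = mean / 1000;
  return {
    meanMs: mean,
    medianMs: median,
    p95Ms: p95,
    gflops: flops / seconds / 1e9,              // achieved GFLOP/s at the mean time
    bandwidthGBs: bytesMoved / seconds / 1e9,   // achieved GB/s at the mean time
  };
}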

Development

npm run dev          # Run in development mode (tsx)
npm run build        # Build with tsup
npm run typecheck    # Type check without emitting
npm run clean        # Remove dist/

License

ISC
