A CLI agent that writes, profiles, benchmarks, and optimizes GPU kernels. SSH into any GPU machine, iterate on CUDA/Triton/HIP/OpenCL kernels with automated profiling loops, and track every optimization iteration.
# Install dependencies
npm install
# Set your Anthropic API key
npm run dev -- auth
# Load built-in GPU optimization docs (one-time)
npm run dev -- gpu docs builtin
# Start the agent
npm run dev

Then inside the REPL:
> Connect to my-gpu-server.com as ubuntu and show the GPU status
> Upload my_kernel.cu, compile with nvcc -O3, benchmark it, profile with ncu,
and optimize it to maximize memory bandwidth
L2 connects to a GPU machine (remote via SSH or local), then enters an autonomous optimization loop:
Write kernel → Upload → Compile → Benchmark → Profile → Analyze → Optimize → Repeat
The agent checkpoints every 3 iterations with a comparison table, asking whether to continue or stop. It tracks every iteration in a SQLite database so you can compare any two versions.
git clone https://github.com/AnishKamatam/L2.git
cd L2
npm install
npm run build

Requirements:

- Node.js >= 20
- Anthropic API key — get one at console.anthropic.com
- GPU machine with SSH access (or local GPU)
The remote (or local) GPU machine needs:
| Tool | Required For | Install |
|---|---|---|
| `nvidia-smi` | GPU status | Comes with NVIDIA driver |
| `nvcc` | Compiling CUDA kernels | `apt install nvidia-cuda-toolkit` |
| `ncu` | Nsight Compute profiling | `apt install nsight-compute` (auto-installed if missing) |
| `nsys` | Nsight Systems tracing | `apt install nsight-systems` (auto-installed if missing) |
| `python3` + `triton` | Triton kernels | `pip install triton` |
| `python3` + `torch` | PyTorch profiling | `pip install torch` |
L2 will auto-detect and auto-install ncu and nsys if they're missing.
l2 # Start interactive session (default)
l2 chat # Same as above
l2 chat -m <model> # Use a specific Claude model
l2 auth # Set Anthropic API key
l2 auth --show # Show current key status
l2 gpu connect <host> # Test GPU connection
l2 gpu status # Show configured GPU info
l2 gpu docs builtin # Load built-in CUDA/Triton/profiling docs
l2 gpu docs ingest <url> # Ingest custom documentation

Slash commands inside the REPL:

/help Show all commands
/gpu Show GPU connection status
/iterations <kernel> Show optimization history for a kernel
/benchmark <kernel> Show best benchmark results
/compare <kernel> [a] [b] Compare two iterations side-by-side
/clear Reset conversation
/history Show message history
/exit Exit
Remote via SSH (auto-discovers your SSH key):
Connect to gpu-server.example.com as ubuntu
Local GPU (runs on the same machine):
Connect to the local GPU
Password auth:
Connect to 192.168.1.100 with username root using password authentication
SSH key auto-discovery checks ~/.ssh/id_ed25519, id_rsa, and id_ecdsa in order.
Optimize a CUDA kernel:
Upload vector_add.cu to the GPU, compile with nvcc -O3 -arch=sm_90,
benchmark it 100 times, profile with ncu, analyze the bottlenecks,
and optimize it. Target: 80% of peak memory bandwidth.
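If you don't have a kernel handy, a minimal file like the sketch below is enough to exercise the whole loop. The file name, problem size, and timing harness are illustrative rather than anything L2 requires; the warmup and iteration counts simply match the benchmarkWarmup / benchmarkIterations values in the example GPU config further down.

```cuda
// vector_add.cu — deliberately naive starting point (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];                  // one element per thread, coalesced
}

int main() {
    const int n = 1 << 24;                          // 16M floats
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);

    dim3 block(256), grid((n + block.x - 1) / block.x);

    for (int i = 0; i < 5; ++i)                     // warmup runs
        vector_add<<<grid, block>>>(a, b, c, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)                   // timed iterations
        vector_add<<<grid, block>>>(a, b, c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    ms /= 100.f;
    // 2 reads + 1 write per element = 3 * bytes moved per run
    printf("avg %.3f ms, %.1f GB/s\n", ms, 3.0 * bytes / (ms * 1e6));

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

A kernel like this is memory-bound, so the printed GB/s figure and ncu's DRAM throughput are the numbers to watch against a bandwidth target.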
Optimize a Triton kernel:
Upload matmul.py to the GPU and run it. Profile with the PyTorch profiler,
then tune the tile sizes (BLOCK_M, BLOCK_N, BLOCK_K) for maximum TFLOPS.
Compare iterations:
Show me the comparison of all iterations for my_kernel
src/
├── cli/ # CLI entry point (Commander)
│ └── app.ts
├── core/ # Agent orchestration
│ ├── agent.ts # Main agent loop (Claude + tools)
│ ├── session.ts # REPL with slash commands
│ ├── config.ts # API key config (~/.l2/config.json)
│ └── types.ts
├── providers/ # LLM integration
│ ├── anthropic.ts # Streaming Anthropic client
│ ├── system-prompt.ts # Base system prompt
│ └── gpu-system-prompt.ts # GPU-specialized system prompt
├── gpu/ # GPU subsystem
│ ├── connection.ts # SSH client (ssh2)
│ ├── sftp.ts # File transfer
│ ├── local-adapter.ts # Local GPU mode
│ ├── cloud-adapter.ts # Lambda Labs / RunPod APIs
│ ├── manager.ts # Connection manager singleton
│ ├── iteration.ts # Autonomous optimization loop
│ ├── config.ts # GPU config (~/.l2/gpu.json)
│ ├── types.ts # GPU type definitions
│ ├── parsers/ # Profiling output parsers
│ │ ├── ncu-parser.ts # Nsight Compute CSV/text
│ │ ├── nsys-parser.ts # Nsight Systems CSV
│ │ ├── torch-profiler-parser.ts
│ │ └── nvidia-smi-parser.ts
│ └── docs/ # Documentation RAG
│ ├── ingest.ts # Chunk + store docs
│ └── retrieve.ts # Semantic search + retrieval
├── tools/ # Tool implementations
│ ├── schema.ts # All 28 tool definitions
│ ├── executor.ts # Tool dispatch
│ ├── gpu/ # GPU-specific tools
│ │ ├── connect.ts # gpu_connect / gpu_disconnect
│ │ ├── status.ts # gpu_status
│ │ ├── exec.ts # gpu_exec
│ │ ├── transfer.ts # gpu_upload / gpu_download
│ │ ├── profile-ncu.ts # profile_ncu
│ │ ├── profile-nsys.ts # profile_nsys
│ │ ├── profile-torch.ts # profile_torch
│ │ ├── nvidia-smi.ts # nvidia_smi
│ │ ├── benchmark.ts # benchmark_kernel
│ │ ├── compare.ts # compare_iterations
│ │ └── ensure-tools.ts # Auto-find/install profiling binaries
│ └── ... # Standard tools (read, write, grep, shell, etc.)
├── memory/ # Persistent memory
│ ├── db.ts # SQLite schema
│ ├── store.ts # Conversation/tool logging
│ ├── search.ts # Hybrid FTS + semantic search
│ ├── embeddings.ts # Local embeddings (MiniLM)
│ └── gpu-store.ts # Kernel iterations + benchmark storage
└── ui/ # Terminal rendering
└── ...
Standard tools:

| Tool | Purpose |
|---|---|
| `read_file` | Read file contents |
| `write_file` | Create/overwrite files |
| `edit_file` | Replace a unique string in a file |
| `search_and_replace` | Pattern replace (literal/regex) |
| `insert_lines` | Insert text at a line number |
| `grep` | Regex search across files |
| `glob` | Find files by pattern |
| `list_dir` | List directory contents |
| `shell` | Execute local shell commands |
| `mkdir` / `cd` / `delete_file` / `move_file` / `copy_file` | Filesystem ops |
| `web_fetch` | Fetch URL content |
| `memory_search` | Search conversation history |
GPU tools:

| Tool | Purpose |
|---|---|
| `gpu_connect` | SSH into GPU machine (or detect local GPU) |
| `gpu_disconnect` | Close connection |
| `gpu_status` | Parsed `nvidia-smi` output |
| `gpu_exec` | Run commands on GPU machine (5-minute timeout) |
| `gpu_upload` / `gpu_download` | SFTP file transfer |
| `profile_ncu` | Nsight Compute profiling with structured metrics |
| `profile_nsys` | Nsight Systems tracing |
| `profile_torch` | PyTorch profiler |
| `nvidia_smi` | Quick GPU monitoring |
| `benchmark_kernel` | N-run benchmarking with statistics |
| `compare_iterations` | Side-by-side iteration comparison |
The agent has built-in expertise in:
- CUDA — thread hierarchy, memory model, warp semantics, shared memory tiling, coalescing, bank conflicts, occupancy tuning, warp-level primitives (see the tiling sketch after this list)
- Triton — tile-based programming, autotune, tl.dot, block size selection, num_warps/num_stages tuning
- HIP — CUDA-to-HIP porting, wavefront (64 threads) differences, hipcc compilation
- OpenCL — work-groups, work-items, local/global memory model
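To make a few of the CUDA items above concrete, here is a small illustrative kernel (hand-written for this README, not agent output) that combines coalesced global accesses, shared-memory tiling, and the classic +1 padding trick for avoiding bank conflicts:

```cuda
#define TILE 32

// Tiled matrix transpose: launch with dim3 block(TILE, TILE) and a grid of
// (ceil(cols/TILE), ceil(rows/TILE)) blocks for a rows x cols input.
__global__ void transpose_tiled(const float* __restrict__ in,
                                float* __restrict__ out,
                                int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];           // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;         // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;         // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];      // coalesced load

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;             // column in the output
    y = blockIdx.x * TILE + threadIdx.y;             // row in the output
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];     // coalesced store
}
```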
It also knows how to interpret profiling output:
- ncu metrics — SM throughput, DRAM throughput, occupancy, warp stall reasons, L1/L2 hit rates, roofline analysis
- nsys traces — kernel launch gaps, memory transfer overlap, API call overhead (a stream-overlap sketch follows this list)
- PyTorch profiler — CUDA vs CPU time, operator breakdown, memory allocations
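As one example of the kind of change an nsys trace tends to motivate (copies serializing with compute), here is a rough sketch of a chunked copy/compute pipeline on multiple CUDA streams; the kernel, function names, and chunk count are invented for illustration:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for whatever kernel the trace showed waiting on copies.
__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Split the work into chunks so the H2D copy of one chunk can overlap with the
// kernel running on the previous chunk. `host` must be pinned (cudaHostAlloc /
// cudaHostRegister) or the async copy silently degrades to a synchronous one.
void run_chunked(const float* host, float* dev, int n, int chunks) {
    cudaStream_t streams[8];                        // assumes chunks <= 8
    int per = n / chunks;                           // assumes n % chunks == 0
    for (int i = 0; i < chunks; ++i) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < chunks; ++i) {
        int off = i * per;
        cudaMemcpyAsync(dev + off, host + off, per * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        process<<<(per + 255) / 256, 256, 0, streams[i]>>>(dev + off, per);
    }

    cudaDeviceSynchronize();                        // wait for all streams
    for (int i = 0; i < chunks; ++i) cudaStreamDestroy(streams[i]);
}
```

With pinned host memory and the work split across streams, copies and kernels from different chunks show up as overlapping rows in the nsys timeline instead of a strictly serial sequence.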
~/.l2/config.json:

{
"apiKey": "sk-ant-..."
}

~/.l2/gpu.json:

{
"defaultConnection": {
"host": "gpu-server.example.com",
"username": "ubuntu",
"authMethod": "key"
},
"profiling": {
"benchmarkWarmup": 5,
"benchmarkIterations": 100
},
"checkpointEvery": 3,
"maxAutoIterations": 10
}

Per-project kernel settings:

{
"framework": "cuda",
"remoteWorkDir": "/home/ubuntu/kernels/my-project",
"compileCommand": "nvcc -O3 -arch=sm_90 kernel.cu -o kernel",
"runCommand": "./kernel"
}

All data is stored in .l2/memory.db (SQLite, per-project):
- Conversations — every message and tool call
- Kernel iterations — code snapshots, benchmark results, profiling data, analysis notes
- Benchmark results — granular metrics (mean, median, p95, GFLOPS, bandwidth)
- Documentation chunks — ingested docs with FTS5 search
Query iteration history anytime with /iterations <kernel> or the compare_iterations tool.
npm run dev # Run in development mode (tsx)
npm run build # Build with tsup
npm run typecheck # Type check without emitting
npm run clean # Remove dist/

License: ISC