roycho96/nki-kernel-agent

NKI Kernel Optimization Agent

An autonomous, profile-guided optimization agent for NKI (Neuron Kernel Interface) kernels on AWS Trainium2. It iteratively improves NKI kernel performance by combining LLM-based code generation with hardware profiling feedback, structured self-reflection, and automatic regression detection.

The agent is a hybrid of two systems:

  • AccelOpt — an NKI-native kernel optimization framework with a Planner → Executor → Summarizer workflow and optimization memory (slow→fast experience accumulation). Provides the NKI-specific profiling infrastructure, prompt engineering, and domain knowledge.
  • KernelAgent — Meta's multi-agent GPU kernel synthesis system. Provides the stability patterns: reflexion (structured self-reflection), divergence-based revert, error feedback loops, and experiment history tracking.

AccelOpt handles the what (NKI-aware optimization planning and code generation), while KernelAgent patterns handle the how (keeping the agent loop stable over many rounds of autonomous execution).

How It Works

Local Machine                          Trn2 Instance
    │                                      │
    │  orchestrator.py                     │
    │  ├── Plan (LLM: planner prompt)      │
    │  ├── Execute (LLM: executor prompt)  │
    │  ├── Upload ─── scp ──────────────>  │ target kernel file
    │  │                                   │ ├── compile (neuronxcc)
    │  │                                   │ ├── correctness check
    │  │  <─── ssh result ────────────────── ├── benchmark
    │  ├── Decide (divergence guard)       │
    │  ├── Summarize (optimization memory) │
    │  └── Reflexion (self-reflection)     │
    │                                      │
    └── repeat ────────────────────────────┘

Each round:

  1. Profile the current kernel on Trn2 via neuron-profile summary-json
  2. Plan an optimization using profile metrics + accumulated experience (AccelOpt planner)
  3. Execute the plan as code, injecting error feedback and reflexion context (AccelOpt executor + KernelAgent patterns)
  4. Verify correctness with multi-seed L2-norm checks — incorrect kernels are immediately discarded
  5. Benchmark end-to-end latency and throughput
  6. Decide whether to keep or revert — if the new kernel regresses beyond a threshold, the agent automatically reverts to the best known version (KernelAgent divergence guard)
  7. Summarize improvements into optimization memory for future rounds (AccelOpt summarizer)
  8. Reflect on what worked or failed, feeding lessons into subsequent prompts (KernelAgent reflexion)
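Steps 4 and 6 can be sketched in plain Python. This is a minimal illustration, not the repo's actual API: the function names, seed set, tolerance, and regression threshold are all assumptions.

```python
import numpy as np

def l2_rel_error(ref: np.ndarray, out: np.ndarray) -> float:
    # Relative L2-norm error used for the correctness check (step 4)
    return float(np.linalg.norm(out - ref) / (np.linalg.norm(ref) + 1e-12))

def is_correct(run_kernel, run_reference, seeds=(0, 1, 2), tol=1e-3) -> bool:
    # Multi-seed check: the candidate must match the reference on every seed;
    # shapes, seeds, and tolerance here are illustrative
    for seed in seeds:
        rng = np.random.default_rng(seed)
        x = rng.standard_normal((128, 128)).astype(np.float32)
        if l2_rel_error(run_reference(x), run_kernel(x)) > tol:
            return False
    return True

def decide(best_latency_us: float, new_latency_us: float,
           regression_threshold: float = 0.15) -> str:
    # Divergence guard (step 6): keep improvements, revert large regressions
    # to the best known version, otherwise hold and try another plan
    if new_latency_us < best_latency_us:
        return "keep"
    if new_latency_us > best_latency_us * (1.0 + regression_threshold):
        return "revert"
    return "hold"
```

Incorrect kernels never reach the `decide` step, which is why step 4 discards them before benchmarking.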

Quick Start

# 1. Set up the Trn2 instance (run once)
ssh ubuntu@<trn2-ip> 'bash -s' < setup_trn2.sh

# 2. Deploy your target code on Trn2
ssh ubuntu@<trn2-ip> "cd ~/nki-moe && git clone <YOUR_REPO> ."

# 3. Run the agent locally
chmod +x run.sh
./run.sh ubuntu@<trn2-ip>

# Or with more control:
python3 orchestrator.py \
    --host ubuntu@<trn2-ip> \
    --rounds 50 \
    --kernel qwen_with_nki.py \
    --problem reference_implementation.py
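The CLI flags above map onto a small set of agent parameters. A hedged sketch of that mapping (field names and defaults are illustrative, not the repo's actual config.py):

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    # Illustrative parameters; the real config.py may differ
    host: str                                     # e.g. "ubuntu@<trn2-ip>"
    rounds: int = 50                              # optimization rounds to run
    kernel: str = "qwen_with_nki.py"              # target kernel on the Trn2 instance
    problem: str = "reference_implementation.py"  # correctness reference
```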

Project Structure

nki-kernel-agent/
├── orchestrator.py                    # Main optimization loop
├── config.py                          # All configuration (remote, compile, agent params)
├── CLAUDE.md                          # Instructions for Claude Code integration
│
├── accelopt_core/
│   └── kernel_wrapper.py              # NKI profiling, benchmarking, correctness checking
│                                      # (ported from AccelOpt, SDK 2.28 namespace)
│
├── ka_extensions/
│   └── stability.py                   # ReflexionManager, DivergenceGuard, AttemptHistory
│                                      # (patterns adapted from KernelAgent)
│
├── infra/
│   ├── ssh_runner.py                  # SSH remote execution, file transfer, cache management
│   └── e2e_benchmark.py               # End-to-end benchmark parsing and score calculation
│
├── prompts/
│   ├── planner_prompts/
│   │   ├── base_prompt.txt            # NKI API reference + Trn2 features + bottleneck taxonomy
│   │   ├── planner_prompt_template.txt
│   │   └── construct_base_prompt.py   # Injects optimization memory into planner prompt
│   ├── executor_prompts/
│   │   ├── base_prompt.txt            # NKI constraints + Trn2-specific opportunities
│   │   └── user_prompt_template.txt   # Slots for error feedback + reflexion context
│   ├── summarizer_prompts/
│   │   ├── base_prompt.txt
│   │   └── user_prompt_template.txt
│   ├── profile_list.json              # 18 neuron-profile metrics (16 original + 2 Trn2)
│   └── displayed_profiles.json
│
├── nkibench_seeds/                    # Reference NKI kernels (SDK 2.28 namespace)
│   ├── add_rmsnorm_matmul_*.py        # Fused residual + norm + matmul
│   ├── matmul_add_rmsnorm_*.py        # Fused matmul + residual + norm
│   ├── rope_single_freq_apply.py      # Rotary position embedding
│   └── ref_add_rmsnorm_matmul.py      # NumPy reference for correctness
│
├── optimization_memory/
│   └── rewrites.json                  # Accumulated slow→fast transformation experiences
│
├── checkpoints/                       # Saved kernel versions at each improvement
├── experiments.jsonl                  # Structured experiment log (auto-generated)
│
├── test_agent.py                      # 15-test validation suite (runs without Trn2)
├── setup_trn2.sh                      # One-time Trn2 instance setup
└── run.sh                             # Convenience launcher
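The stability layer in ka_extensions/stability.py can be sketched as follows. Class and method names are illustrative adaptations of the KernelAgent patterns, not the repo's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class AttemptHistory:
    # Structured log of every attempt, mirroring experiments.jsonl (illustrative)
    attempts: list = field(default_factory=list)

    def record(self, round_no: int, latency_us: float,
               correct: bool, lesson: str) -> None:
        self.attempts.append({"round": round_no, "latency_us": latency_us,
                              "correct": correct, "lesson": lesson})

    def reflexion_context(self, last_n: int = 3) -> str:
        # Lessons from the most recent rounds, injected into the next
        # executor prompt as reflexion context
        recent = self.attempts[-last_n:]
        return "\n".join(f"Round {a['round']}: {a['lesson']}" for a in recent)
```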

Trn2 Adaptations

The original AccelOpt was built for Trainium1 (NKI Beta 1, SDK ≤2.27). This agent applies the following adaptations for Trainium2:

Must-fix (compile errors without these):

  • neuronxcc.nki.* → nki.* namespace migration (SDK 2.28)
  • Top-level kernel I/O must be HBM (buffer=nl.shared_hbm)
  • Auto/direct SBUF/PSUM allocation cannot be mixed in the same kernel

Performance-relevant (Trn2 features not in original AccelOpt):

  • Engine parallelism: VectorE+GPSIMD can access SBUF simultaneously; VectorE+ScalarE can access PSUM simultaneously
  • DMA transpose during HBM→SBUF transfer (replaces identity matrix trick)
  • GPSIMD integrated DMA at 307 GB/s (useful for irregular data movement)
  • New APIs: gather_flattened, no_reorder, range_select
  • "Underutilized" bottleneck category added (neither compute nor memory bound)
  • tensor_engine_active_time_percent and dma_active_time metrics added to profile list

Testing

python3 test_agent.py

Validates all components that work without a Trn2 instance: imports, config sanity, reflexion/divergence/history logic, prompt construction, score calculation, code extraction, namespace compliance of seed kernels.
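The namespace-compliance check over the seed kernels can be sketched as a simple source scan. The pattern arguments below are illustrative, not the repo's actual rules:

```python
import re

def namespace_compliant(source: str, required: str, forbidden: str) -> bool:
    # A seed kernel must import the required NKI namespace and must not
    # reference the old one; both patterns are caller-supplied assumptions
    has_required = re.search(rf"\bimport\s+{re.escape(required)}\b", source)
    has_forbidden = re.search(rf"\b{re.escape(forbidden)}\b", source)
    return bool(has_required) and not has_forbidden
```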

References

  • AccelOpt: Jia et al., "AccelOpt: An NKI Kernel Optimization Agent on Trainium" (2025). GitHub / arXiv:2502.15253 — NKI-native Planner/Executor/Summarizer workflow, optimization memory, NKIBench profiling metrics.
  • KernelAgent: Meta PyTorch, "KernelAgent — Multi-Agent GPU Kernel Synthesis and Optimization." GitHub / Blog — parallel worker verification, reflexion, divergence-based revert, roofline-guided bottleneck analysis.
  • NKI Documentation: AWS Neuron NKI Guide
