PyGPUkit aims to free developers from the complexity of the CUDA Toolkit, Anaconda, and fragile GPU environments. Its goal is to make GPU programming and model execution feel like using a standard Python library: installable via pip, minimal setup, and no mandatory external SDKs. PyGPUkit provides high-performance GPU kernels, memory management, scheduling, and model execution (e.g., loading SafeTensors models) through a NumPy-like API and a Kubernetes-inspired resource model, allowing developers to use GPUs explicitly, predictably, and productively without fighting their environment.
- Provide the smallest usable GPU runtime for Python
- Expose GPU scheduling (bandwidth, memory, partitioning)
- Make writing custom GPU kernels easy
- Serve as a building block for inference engines, DSP systems, and real-time workloads
Python (High-level orchestration only)
↓
Rust (Core scheduling, memory management, GPU coordination)
↓
C++ (CUDA Driver/Runtime API, NVRTC, kernel launch)
Python is ONLY a high-level orchestration layer. The core scheduling, memory management, GPU coordination, and performance-critical components MUST remain implemented in Rust.
PyGPUkit/
├── src/pygpukit/ # Python API (NumPy-compatible)
│ ├── core/ # Core abstractions
│ │ ├── array.py # GPUArray implementation
│ │ ├── backend.py # Backend detection/initialization
│ │ ├── memory.py # Memory utilities (copy, sync)
│ │ └── stream.py # CUDA Stream wrapper
│ ├── ops/ # GPU operations (modular packages)
│ │ ├── matmul/ # Matrix multiplication
│ │ │ ├── gemm/ # GEMM operations (M > 1)
│ │ │ └── gemv/ # GEMV operations (M = 1)
│ │ ├── nn/ # Neural network ops
│ │ │ ├── activation.py # GELU, SiLU, etc.
│ │ │ ├── attention.py # SDPA, paged attention
│ │ │ ├── norm.py # RMSNorm, LayerNorm
│ │ │ └── rope.py # Rotary position embedding
│ │ └── audio/ # Audio processing
│ │ ├── transforms/ # FFT, Mel spectrogram
│ │ └── analysis/ # Pitch, onset detection
│ ├── llm/ # LLM inference (modular)
│ │ ├── models/ # Model implementations
│ │ │ └── causal_transformer.py
│ │ ├── layers/ # Layer types
│ │ │ ├── attention.py # Multi-head attention
│ │ │ ├── ffn.py # Feed-forward networks
│ │ │ ├── norm.py # Normalization layers
│ │ │ ├── embedding.py # Token/position embeddings
│ │ │ └── recurrent.py # LSTM, Mamba
│ │ ├── decode/ # Decoding strategies
│ │ ├── loader/ # Model loading
│ │ │ ├── safetensors.py # SafeTensors loader
│ │ │ └── tokenizer.py # Tokenizer wrapper
│ │ └── quantization/ # Quantization utilities
│ │ ├── config.py # Quant configs
│ │ └── repack.py # Weight repacking
│ ├── asr/ # Speech recognition (Whisper)
│ │ └── whisper/ # Whisper model implementation
│ └── tts/ # Text-to-speech (Kokoro)
│ └── kokoro/ # Kokoro TTS model
├── native/
│ ├── core/ # C++ (CUDA Runtime/Driver API)
│ ├── jit/ # C++ (NVRTC)
│ ├── ops/ # C++ (CUDA kernels)
│ │ ├── matmul/ # MatMul kernels (see below)
│ │ │ ├── matmul.cu # Main dispatcher
│ │ │ ├── fused.cu # Fused ops (linear+bias+GELU)
│ │ │ └── batched.cu # Batched GEMM
│ │ ├── nn/ # Neural network ops
│ │ │ ├── activation/ # Activation functions
│ │ │ ├── attention/ # Attention kernels
│ │ │ ├── norm/ # Normalization kernels
│ │ │ ├── rope/ # RoPE kernels
│ │ │ └── recurrent/ # LSTM/Mamba kernels
│ │ └── audio/ # Audio processing kernels
│ └── bindings/ # pybind11 (modular)
│ ├── gemm/ # GEMM bindings by dtype
│ ├── gemv/ # GEMV bindings by dtype
│ └── nn/ # NN operation bindings
├── rust/
│ ├── pygpukit-core/ # Pure Rust GPU runtime
│ │ └── src/
│ │ ├── memory/ # MemoryPool, LRU, size-class allocator
│ │ ├── scheduler/ # Task state machine, QoS policies
│ │ └── device.rs # DeviceCapabilities, KernelType
│ └── pygpukit-python/ # PyO3 bindings
├── examples/ # Example scripts (organized)
│ ├── benchmarks/ # Performance benchmarks
│ ├── chat/ # Chat CLI applications
│ ├── demos/ # Feature demos
│ │ └── archived/ # Version-specific demos (historical)
│ └── demo_*.py # Current feature demos
└── tests/
native/ops/matmul/
├── common/ # Shared utilities
│ └── aligned_copy_sm120.cuh
├── gemm/ # GEMM kernels (M > 1)
│ └── {w_dtype}_{a_dtype}_{out_dtype}/{arch}/{kernel}.{cu,cuh}
├── gemv/ # GEMV kernels (M = 1)
│ └── {w_dtype}_{a_dtype}_{out_dtype}/{arch}/{kernel}.{cu,cuh}
├── cublaslt.cuh # cuBLASLt wrapper
├── matmul.cu # Main dispatcher
└── matmul_cutlass.cu # CUTLASS dispatcher
Path Convention: `{gemm|gemv}/w{weight}a{act}_{out}/{arch}/{kernel}.cu`
| Component | Values | Description |
|---|---|---|
| `w_dtype` | `w4`, `w8`, `bf16`, `f32`, `int4`, `int8` | Weight dtype (w=weight) |
| `a_dtype` | `a4`, `a8`, `a16`, `bf16`, `f32`, `int4`, `int8` | Activation dtype (a=act) |
| `out_dtype` | `bf16`, `f32` | Output dtype |
| `arch` | `generic`, `sm80`, `sm90`, `sm100`, `sm120` | Target architecture |
Naming Rationale (Issue #122 Option 2):
- `w8a16_bf16`: FP8 weights, BF16 activations, BF16 output (W8A16 GEMM)
- `w4a16_bf16`: NVF4 weights, BF16 activations, BF16 output (NVF4 GEMV)
- `w8a8_bf16`: FP8 weights, FP8 activations, BF16 output (pure FP8)
- `bf16_bf16`: BF16 weights, BF16 activations (no quantization)
- `f32_f32`: FP32 weights, FP32 activations (baseline)
Examples:
gemm/bf16_bf16/sm80/bf16_cutlass.cuh # BF16 GEMM, SM80, CUTLASS
gemm/w8a8_f32/sm90/fp8_cutlass.cu # FP8->F32 GEMM, SM90, CUTLASS
gemm/w4a16_bf16/sm120/nvf4_cutlass.cu # NVF4 weights, BF16 act->BF16, SM120
gemv/w4a16_bf16/sm120/nvf4.cu # NVF4 GEMV, SM120
gemv/w8a16_bf16/sm120/fp8_opt_kernels.cu # FP8 weight, BF16 act GEMV, SM120
gemm/f32_f32/generic/tf32_mma.cuh # TF32 GEMM, generic (SM80+)
| Module | Purpose | Input | Output |
|---|---|---|---|
| `llm/` | Text generation | Text tokens | Text tokens |
| `asr/` | Speech recognition | Audio waveform | Text |
| `ops/` | Low-level GPU ops | GPUArray | GPUArray |
Rationale: Modules are separated by modality (audio vs text), not by architecture (transformer). This follows industry conventions (HuggingFace, OpenAI API) and enables clean future expansion (TTS, vision, etc.).
| Component | Language | Reason |
|---|---|---|
| Python API | Python | NumPy-compatible user interface |
| CUDA Driver/Runtime | C++ | Direct hardware access |
| NVRTC JIT | C++ | Kernel compilation |
| Memory Pool/LRU | Rust | Safe, fast memory management |
| Scheduler State | Rust | Thread-safe state machine |
| Kernel Launch | C++ | CUDA kernel dispatch |
| Bindings | pybind11, PyO3 | C++/Rust to Python |
- Rust memory pool (with LRU eviction)
- Rust GPU scheduler state machine
- Rust-side async GPU memory transfer engine
- Rust-side kernel dispatch controller
- pygpukit-core is the authoritative runtime - MemoryPool, Scheduler, Task, LRU, SizeClass MUST be implemented here
- All GPU memory management MUST live in `rust/pygpukit-core/src/memory/`
- All scheduling logic MUST live in `rust/pygpukit-core/src/scheduler/`
- Python bindings MUST be thin wrappers only - no logic duplication
- When adding new features, always add them to Rust first, then expose via PyO3
Python API → pybind11 → C++ backend → CUDA Driver API (cu*) / Runtime API (cuda*) / NVRTC
source.cu (string) → NVRTC → PTX → CUDA Driver API → CUmodule → CUfunction
- NO cuda-python
- NO external Python CUDA dependencies
- ALL GPU kernels compiled at runtime
- PTX → SASS handled by NVIDIA driver
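A minimal host-side sketch of this pipeline, assuming a CUDA context already exists and omitting error checks (the helper name is illustrative, not PyGPUkit's actual API):

```cpp
// Sketch: compile CUDA source at runtime via NVRTC, then load it through the
// Driver API. PTX -> SASS happens inside the driver at cuModuleLoadData.
#include <cuda.h>
#include <nvrtc.h>
#include <vector>

CUfunction jit_compile(const char* source, const char* kernel_name) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source, "source.cu", 0, nullptr, nullptr);

    const char* opts[] = {"--gpu-architecture=compute_80"};
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    CUmodule module;
    CUfunction func;
    cuModuleLoadData(&module, ptx.data());
    cuModuleGetFunction(&func, module, kernel_name);
    return func;
}
```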
PyGPUkit uses its own C++ backend with CUDA Driver API / Runtime API / NVRTC.
Do NOT mention or require:
- ❌ `cuda-python`
- ❌ `numba.cuda`
- ❌ `cupy.cuda`
- ❌ PyCUDA-style wrappers
GPU availability is detected via these C++ calls:
- `cudaGetDeviceCount()`
- `cudaDriverGetVersion()`
- `cudaRuntimeGetVersion()`
- `nvrtcVersion()`
CPU fallback happens only if one of these fails.
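A sketch of what that probe sequence might look like (helper name hypothetical; the actual backend code may differ):

```cpp
// Sketch: GPU availability probe. Any failing call triggers CPU fallback.
#include <cuda_runtime.h>
#include <nvrtc.h>

bool gpu_available() {
    int devices = 0, driver_ver = 0, runtime_ver = 0, major = 0, minor = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) return false;
    if (cudaDriverGetVersion(&driver_ver) != cudaSuccess) return false;
    if (cudaRuntimeGetVersion(&runtime_ver) != cudaSuccess) return false;
    if (nvrtcVersion(&major, &minor) != NVRTC_SUCCESS) return false;
    return true;
}
```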
When GPU is unavailable, PyGPUkit must:
- Run scheduler in CPU simulation mode
- Use NumPy as backend for GPUArray ops
- Disable NVRTC
- Still expose full API (no errors)
Python loads a shared library:
- Linux: `_pygpukit_native.cpython-3xx-x86_64-linux-gnu.so`
- Windows: `_pygpukit_native.cp3xx-win_amd64.pyd`
- macOS: CPU backend only
v0.1.x:
- Requires CUDA Toolkit installation
- Loads DLLs from `CUDA_PATH/bin`

v0.2.x (Current):
- cuBLASLt loaded dynamically at runtime
- Searches: `CUDA_PATH/bin/x64` → `CUDA_PATH/bin` → system PATH
- Descriptor caching for matmul performance
- Falls back gracefully if cuBLASLt unavailable

// Dynamic loading sequence
cublasLt64_13.dll  // CUDA 13.x
cublasLt64_12.dll  // CUDA 12.x
cublasLt64_11.dll  // CUDA 11.x

Future (Driver-Only Mode):
- NVRTC DLL shipped inside the wheel
- CUDA Driver (`nvcuda.dll`) provided by NVIDIA GPU drivers
- No cudart dependency
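A Windows-side sketch of that loading order (illustrative only; the real loader also walks the search paths listed above):

```cpp
// Sketch: try the newest cuBLASLt DLL first; return nullptr to signal the
// caller to fall back gracefully when cuBLASLt is unavailable.
#include <windows.h>

HMODULE load_cublaslt() {
    const char* candidates[] = {
        "cublasLt64_13.dll",  // CUDA 13.x
        "cublasLt64_12.dll",  // CUDA 12.x
        "cublasLt64_11.dll",  // CUDA 11.x
    };
    for (const char* name : candidates) {
        if (HMODULE h = LoadLibraryA(name)) {
            return h;
        }
    }
    return nullptr;
}
```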
NEVER generate:
- ❌ "Please install cuda-python"
- ❌ "GPU mode requires the cuda-python package"
Instead use:
- ✅ "CUDA driver not detected"
- ✅ "NVRTC JIT compiler not available"
- ✅ "No GPU devices found (cudaGetDeviceCount == 0)"
- ✅ "Falling back to CPU simulation backend"
- Use or mention `cuda-python` - it is NOT a dependency
- Call CUDA APIs from Python directly
- Implement memory management in pure Python (use Rust)
- Ship precompiled CUDA kernels
- Require specific CUDA toolkit versions at runtime
- Convert Rust features to Python, Cython, Numba, or pure CUDA kernels
- Delete Rust tasks from roadmap
- Simplify architecture by removing Rust layer
- Use emoji or non-ASCII characters in source code or comments (cp932/Shift-JIS compatibility)
- Use C++ for all CUDA Driver/Runtime API calls
- Compile all kernels at runtime with NVRTC
- Use pybind11 for C++ to Python bindings
- Keep Python layer thin - only API surface and NumPy interop
- Support CPU fallback when GPU unavailable
- Add new features to Rust first, then expose via PyO3
- Supported: Ampere (SM 80-86), Ada (SM 89), Hopper (SM 90), Blackwell (SM 100, 120a)
- Unsupported: Architectures below SM80
- Build default: SM 80, 86, 89, 90, 100, 120a (CUDA 13.1+)
DO NOT use classic shared-memory tiling as the main optimization. On Ampere, L2 is large and fast; naive or warp-level kernels outperform tiled kernels.
Prefer:
- L2-friendly memory access patterns
- Coalesced loads (`ld.global.cs`)
- Warp-level primitives (shuffle, reduce)
- Tensor-core paths when possible (`wmma`, `mma.sync`)
- Asynchronous copy (`cp.async`) for global→shared prefetch (see the sketch after the Avoid list)
Avoid:
- Unnecessary `__syncthreads()`
- Complex shared-memory patterns designed for Pascal/Turing
- Block sizes > 256 unless occupancy analysis proves benefit
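As a concrete instance of the `cp.async` item above, a minimal SM80+ prefetch sketch (sizes illustrative; production kernels pipeline multiple stages as described later):

```cpp
// Sketch: asynchronous global -> shared prefetch using the cp.async
// intrinsics from <cuda_pipeline.h> (SM80+). Assumes blockDim.x == 128.
#include <cuda_pipeline.h>

__global__ void prefetch_demo(const float4* __restrict__ gmem, float* out) {
    __shared__ float4 smem[128];
    int tid = threadIdx.x;

    __pipeline_memcpy_async(&smem[tid], &gmem[tid], sizeof(float4));
    __pipeline_commit();
    __pipeline_wait_prior(0);  // block until the prefetch has landed
    __syncthreads();           // one sync so all lanes see the full tile

    out[tid] = smem[tid].x;
}
```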
int sm = device_sm_major * 10 + device_sm_minor;
if (sm >= 90) {
use_mma_sync_kernels(); // Hopper/Ada
} else if (sm >= 80) {
use_ampere_optimized_kernels(); // Ampere
} else {
throw std::runtime_error("PyGPUkit requires SM >= 80 (Ampere)");
}

For Ampere, implement two variants:
- L2-optimized naive kernel (fast for FP32)
- Warp-level MMA kernel (TensorCore for TF32/FP16/BF16)
Block sizes: (16, 16) or (32, 8) - do NOT increase to 32×32 unless profiler proves faster.
- Align pointers to 128 bytes where possible
- Ensure loads are coalesced across warps
- Prefer `float4`/`half8` vectorized loads
- Avoid bank conflicts in shared memory
- Use register blocking aggressively
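A minimal sketch of the vectorized, coalesced pattern these rules describe (illustrative kernel, not from the codebase):

```cpp
// Sketch: each lane loads 16 bytes via float4, so a warp covers 512
// contiguous bytes per step and loads coalesce into full 128-byte
// transactions. Pointers are assumed 16-byte aligned.
__global__ void copy_vec4(const float4* __restrict__ src,
                          float4* __restrict__ dst, int n_vec4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec4) {
        dst[i] = src[i];
    }
}
```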
| GPU | FP32 | TF32 TensorCore |
|---|---|---|
| RTX 3090 Ti | 18 TFLOPS | 27+ TFLOPS |
| A100 | 5.5+ TFLOPS | 156 TFLOPS |
Achieved (v0.2.3): TF32 on RTX 3090 Ti: 27.38 TFLOPS (8192×8192×8192)
Single Token Decode (M=1):
| Mode | Tokens/sec | ms/token |
|---|---|---|
| Non-graph decode | 1.84 | 544 |
| CUDA Graph decode | 2.19 | 457 |
| Speedup | 1.19x | - |
Batch Decode (v0.2.11):
| Batch Size | Per Token (us) | Throughput | Speedup |
|---|---|---|---|
| 1 | 381,303 | 2.6 tok/s | 1.00x |
| 2 | 205,030 | 4.9 tok/s | 1.86x |
| 4 | 108,521 | 9.2 tok/s | 3.51x |
| 8 | 55,845 | 17.9 tok/s | 6.83x |
E2E Batch Verification (32 tokens):
| Method | Time (ms) | tok/s | Speedup |
|---|---|---|---|
| Sequential | 14,541 | 2.13 | 1.00x |
| Batch Verify (batch=4) | 4,082 | 7.59 | 3.56x |
| Batch Verify (batch=8) | 2,147 | 14.44 | 6.77x |
Decode Strategy Benchmark (v0.2.11):
Model: Qwen2.5-7B-Instruct (bfloat16), RTX 3090 Ti
| Strategy | tok/s | Speedup | Notes |
|---|---|---|---|
| DecodeM1 (baseline) | 3.2 | 1.00x | Single token per step |
| DecodeBatch (batch=8) | 19.6 | 6.06x | TensorCore efficient |
| DecodeSpeculative | 1.4 | 0.42x | Self-speculative (early layers) |
| DecodeJacobi | 1.7 | 0.53x | Parallel iterative |
Note: Large models (8B+) are GPU compute-bound; CUDA Graph benefit is modest. Batch decode shows near-linear scaling with TensorCore utilization.
-arch=sm_80
--expt-relaxed-constexpr
--use_fast_math

CRITICAL: PTX inline assembly mma.sync has DIFFERENT fragment layouts than the WMMA API.
Verified empirically using dump_c_fragment.cu.
Each thread in a warp (lane 0-31) holds:
- A fragment: 4 registers (16×8 matrix, row-major)
- B fragment: 2 registers (8×8 matrix, col-major)
- C fragment: 4 registers (16×8 matrix)
A fragment (16×8):
a[0] = A[lane/4][lane%4] // rows 0-7, cols 0-3
a[1] = A[lane/4 + 8][lane%4] // rows 8-15, cols 0-3
a[2] = A[lane/4][lane%4 + 4] // rows 0-7, cols 4-7
a[3] = A[lane/4 + 8][lane%4 + 4] // rows 8-15, cols 4-7
B fragment (8×8):
b[0] = B[lane%4][lane/4] // rows 0-3, cols 0-7
b[1] = B[lane%4 + 4][lane/4] // rows 4-7, cols 0-7
C fragment (16×8) - KEY DIFFERENCE FROM WMMA:
c[0] = C[lane/4][(lane%4)*2] // rows 0-7, cols 0,2,4,6
c[1] = C[lane/4][(lane%4)*2 + 1] // rows 0-7, cols 1,3,5,7
c[2] = C[lane/4 + 8][(lane%4)*2] // rows 8-15, cols 0,2,4,6
c[3] = C[lane/4 + 8][(lane%4)*2 + 1] // rows 8-15, cols 1,3,5,7
- C fragment column stride: PTX uses `(lane%4)*2` (stride 2), NOT `lane%4` (stride 1)
- C fragment pairs: c[0],c[1] are adjacent columns; c[2],c[3] are +8 rows
| Aspect | WMMA API | PTX mma.sync |
|---|---|---|
| Fragment types | `wmma::fragment<>` | Raw registers |
| Layout | Opaque (compiler-managed) | Must match PTX spec exactly |
| Flexibility | Limited shapes | Full control |
| API | A | B | C |
|---|---|---|---|
| WMMA 16×16×8 | 16×8 | 8×16 | 16×16 |
| PTX m16n8k8 | 16×8 | 8×8 | 16×8 |
PTX m16n8k8 uses only the left half (cols 0-7) of WMMA's B/C.
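For reference, a sketch of issuing one m16n8k8 TF32 `mma.sync` on raw registers, assuming `a[4]`, `b[2]`, and `c[4]` already hold the per-lane fragment elements in the layout above and that the inputs were already rounded to TF32:

```cpp
// Sketch: raw-register m16n8k8 TF32 MMA. D reuses the C registers (c += a*b).
__device__ __forceinline__ void mma_m16n8k8_tf32(float c[4], const float a[4],
                                                 const float b[2]) {
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
        : "r"(__float_as_uint(a[0])), "r"(__float_as_uint(a[1])),
          "r"(__float_as_uint(a[2])), "r"(__float_as_uint(a[3])),
          "r"(__float_as_uint(b[0])), "r"(__float_as_uint(b[1])));
}
```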
Common Bug: Prefetching into the wrong stage.
// WRONG - overwrites current buffer
for (int kt = 0; kt < num_k_tiles; ++kt) {
int curr = kt & 1;
if (kt + 2 < num_k_tiles) {
load_async((kt+2) & 1, kt + 2); // BUG!
}
process(curr);
}
// CORRECT - prefetch into OTHER stage
load_async(0, 0);
cp_async_wait_0();
for (int kt = 0; kt < num_k_tiles; ++kt) {
int curr = kt & 1;
int next = curr ^ 1; // OTHER stage
if (kt + 1 < num_k_tiles) {
load_async(next, kt + 1);
}
process(curr);
cp_async_wait_0();
}

Key Insight: Always prefetch into the stage you're NOT currently reading from.
// WMMA row_major × row_major (PASS)
fragment<matrix_a, 16, 16, 8, precision::tf32, row_major> a_frag;
fragment<matrix_b, 16, 16, 8, precision::tf32, row_major> b_frag;
fragment<accumulator, 16, 16, 8, float> c_frag;
load_matrix_sync(a_frag, A + k, K);
load_matrix_sync(b_frag, B + k * N, N);
mma_sync(c_frag, a_frag, b_frag, c_frag);
store_matrix_sync(C, c_frag, N, mem_row_major);

Note: the row_major A + col_major B combination fails due to different memory layout interpretation.
- `native/ops/matmul_f32_tf32.cuh` - TF32 kernel
- `native/ops/basic.cu` - Dispatch logic
- Environment variable `PYGPUKIT_ALLOW_TF32=1` to enable
| Metric | Value |
|---|---|
| Current | 27.38 TFLOPS (8192×8192) |
| RTX 3090 Ti TF32 Theoretical | ~40 TFLOPS |
| cuBLAS Reference | ~59 TFLOPS |
| Gap to cuBLAS | 47% |
Block Tile: BM=128, BN=128, BK=16
Warp Tile: WARP_TILES_M=2, WARP_TILES_N=8 (32×64 per warp)
MMA Instruction: mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32
Pipeline: 2-stage double buffering
Thread Block: 256 threads (8 warps)
Shared Memory: ~37KB/block → occupancy ~16.7%
Current implementation uses simple padding (A_PAD=4, B_PAD=4) but bank conflicts are not fully eliminated.
CUTLASS Approach:
// XOR-based swizzle pattern
int store_column = (lane_id % 8) ^ (lane_id / 8);

- Store and Load phases use a transposed index relationship
- XOR operation applied per 8×8 block unit
- Combined with `ldmatrix` for fully bank conflict-free access
Key Insight:
"the indexing in the 'Loading from Shared Memory to Registers' slide is transposed from the indexing in 'Load from Global/Store to Shared' slide."
Current implementation manually loads from shared memory to registers:
// Current implementation
float a0 = smA[curr][tile_m + a_row_base][kk + a_col_base];

CUTLASS Approach:
- Uses `ldmatrix.sync.aligned.m8n8.x4.shared.b16`
- Single instruction loads four 8×8 matrices (entire warp)
TF32 Limitation:
"ldmatrix cannot transpose 32-bit data. CUTLASS uses 32-bit shared memory load to load data from shared memory to the registers to do the transpose right before calling tf32 tensor core."
Current: 2-stage → CUTLASS default: 4-stage
Past Failed Attempt:
"3-stage pipeline: -28% (50% more smem reduced occupancy)"
Considerations:
- Trade-off between shared memory usage and occupancy
- RTX 3090 Ti: 100KB/SM available
- Current 37KB → 4-stage at ~74KB should fit
| Priority | Optimization | Expected Gain | Difficulty |
|---|---|---|---|
| 1 | Swizzled shared memory layout | +10-15% | Medium |
| 2 | 4-stage pipeline (proper smem sizing) | +5-10% | Medium |
| 3 | Warp tile tuning (BM/BN/BK re-tuning) | +5-10% | Low |
| 4 | Epilogue fusion (bias + activation) | Memory reduction | Medium |
- Current: 27.38 TFLOPS (68% of target)
- Swizzle + 4-stage: 32-34 TFLOPS expected
- Fine-tuning: 35+ TFLOPS
- CUTLASS TF32 GEMM Example
- CUTLASS Efficient GEMM Documentation
- CUTLASS Swizzled Layouts Discussion
- Understanding CUTLASS Permuted Shared Memory
- Dissecting Tensor Cores (Academic Paper)
Edit → Build → Validate → Benchmark → Commit
Always commit after validation and benchmark, regardless of results.
Build from Git Bash (recommended):
cd /d/Projects/m96-chan/PyGPUkit
./build.sh 86    # SM 86 only (RTX 3090 Ti)
./build.sh 120a  # SM 120a only (RTX 5090)
./build.sh       # default: SM 120a

Notes:
- RTX 5090 (SM 120a) requires CUDA 13.1 or later
- Supported SMs: 80, 86, 89, 90, 100, 120a
Before EVERY commit, run these checks:
# 1. Ruff lint check (auto-fix and format)
git ls-files "*.py" | xargs python -m ruff check --fix
git ls-files "*.py" | xargs python -m ruff format
# 2. Mypy type check
python -m mypy src/ --ignore-missing-imports --disable-error-code=union-attr --disable-error-code=no-redef --disable-error-code=no-any-return --disable-error-code=attr-defined --disable-error-code=assignment --disable-error-code=arg-type --disable-error-code=index --disable-error-code=misc

NEVER commit without passing ALL checks. CI will reject PRs with lint/type errors.
Before creating a PR, verify ALL of the following:
# 1. Lint passes
git ls-files "*.py" | xargs python -m ruff check
# 2. Mypy passes
python -m mypy src/ --ignore-missing-imports --disable-error-code=union-attr --disable-error-code=no-redef --disable-error-code=no-any-return --disable-error-code=attr-defined --disable-error-code=assignment --disable-error-code=arg-type --disable-error-code=index --disable-error-code=misc
# 3. Tests pass
python -m pytest tests/ -v
# 4. Benchmark runs (optional but recommended)
python -m pygpukit.benchmark --quick

DO NOT create a PR until all checks pass locally.
- Run lint check before commit (see above)
- Commit after every validation/benchmark completion, regardless of outcome
- Include benchmark results in commit message
- Never proceed to next kernel edit until commit is complete
- Never overwrite a working kernel without committing first
wip(tf32): <summary of changes>
Benchmark results (RTX 5090):
- 2048x2048: XX.XX TFLOPS
- 4096x4096: XX.XX TFLOPS
- 8192x8192: XX.XX TFLOPS
Correctness: <PASS/FAIL>
You MUST commit immediately when:
- Benchmark improves in ANY matrix size (even +0.01 TFLOPS)
- Correctness achieved (relative error < 1e-3 for all sizes)
- After EVERY benchmark execution - even if no improvement, commit with `bench: results logged (no improvement)`
If performance or correctness degrades:
- MUST revert to the previous commit BEFORE continuing
Rationale:
- Prevent losing fast kernel versions
- Track performance changes over time
- Preserve trial-and-error history
Use unified benchmark suite: python -m pygpukit.benchmark
# Quick benchmark (GEMM + GEMV)
python -m pygpukit.benchmark --quick
# Full benchmark
python -m pygpukit.benchmark
# Save results and compare with baseline
python -m pygpukit.benchmark --quick --save baseline.json
python -m pygpukit.benchmark --compare baseline.json --fail-on-regression
# Specific benchmarks
python -m pygpukit.benchmark --gemm --sizes 4096,8192
python -m pygpukit.benchmark --gemv --dtypes bf16,fp8
python -m pygpukit.benchmark --attention --seq-lens 512,1024
# All benchmarks including FP8 (SM120+)
python -m pygpukit.benchmark --all --fp8
# Markdown output for README
python -m pygpukit.benchmark --quick --markdown

Output includes:
- Time in microseconds (us)
- TFLOPS for compute benchmarks
- Correctness verification
- JSON export for regression tracking
Environment Variables:
- `PYGPUKIT_ALLOW_TF32=1` - Enable TF32 TensorCore
- `PYGPUKIT_TF32_V2=1` - Use PTX mma.sync kernel (default when TF32 enabled)
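A sketch of how the native dispatcher might read these flags (helper names hypothetical, not PyGPUkit's actual internals):

```cpp
// Sketch: environment-variable gating for the TF32 paths. PYGPUKIT_TF32_V2
// defaults to on once TF32 is enabled, matching the note above.
#include <cstdlib>
#include <cstring>

static bool env_is(const char* name, const char* value) {
    const char* v = std::getenv(name);
    return v != nullptr && std::strcmp(v, value) == 0;
}

bool tf32_allowed() { return env_is("PYGPUKIT_ALLOW_TF32", "1"); }

bool tf32_v2_kernel() {
    // default when TF32 is enabled, unless explicitly disabled with 0
    return tf32_allowed() && !env_is("PYGPUKIT_TF32_V2", "0");
}
```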
M=1 decode is implemented in two variants: a CUDA Graph version and a Non-Graph version.
Use CUDA Graph ONLY when ALL conditions are met:
- Fixed shapes/dtypes/RoPE tables - No dynamic changes during replay
- Identical kernel path - warmup / capture / replay use the same code path
- No KV cache pollution - Graph must not write to real KV cache during warmup/capture
- H2D copies on capture stream - All host-to-device copies must be on the stream being captured
If any condition is NOT met, fallback to Non-Graph version.
- Conditional branches based on runtime values
- `copy_to` operations (use direct buffer writes instead)
- Any operation that reads from or writes to KV cache
- SDPA (Scaled Dot-Product Attention) - always run outside graph
# Graph captures ONLY stateless operations:
# - Embedding lookup (via GPU pointer)
# - Linear projections (QKV, O, MLP)
# - RMSNorm
# - RoPE (via pre-computed GPU tables)
# These run OUTSIDE graph:
# - KV cache update
# - SDPA attention
# - Any operation that depends on context_len at runtime

PyGPUkit is not a replacement for PyTorch, JAX, or TensorFlow. Its purpose is to provide low-level, explicit, and controllable GPU execution primitives.
- Focus: memory, kernels, scheduling, bandwidth, latency
- Not focus: autograd graphs, optimizers, training loops
High performance is assumed. Optimization enables scheduling, concurrency, and predictability.
- Slower-than-cuBLAS requires justification
- Faster-than-cuBLAS is welcome, but not mandatory
- Performance regressions are unacceptable without explicit trade-offs
User-facing APIs should resemble NumPy-style array operations.
- `C = A @ B` is preferred over opaque operator graphs
- Explicit is better than implicit
- Users should understand when and how GPU work is executed
PyGPUkit treats the GPU as a shared, schedulable resource (Kubernetes-inspired).
- Admission control, QoS, memory reservation, kernel pacing
- Scheduling decisions are explicit and inspectable
- Kernels are workloads, not side effects
SafeTensors are treated as immutable, read-only GPU resources.
- No in-place mutation
- No hidden ownership or lifecycle coupling
Leveraging vendor or OSS-optimized kernels is acceptable and encouraged.
- Value lies in orchestration, scheduling, and integration
- Reusing proven kernels is preferable to reinventing them
- TF32 precision loss is acceptable when explicitly enabled
- FP32 correctness must remain available
- Non-determinism must be explainable and bounded
PyGPUkit includes a minimal LLM inference engine for SafeTensors models (Qwen, LLaMA, etc.).
SafeTensors → Model Loading → Prefill → Decode Loop → Token Output
↓
CUDA Graph (optional)
| Mode | Description | Use Case |
|---|---|---|
| Standard | `model.forward()` with allocation | Simple usage |
| Zero-Alloc | `_decode_step_zero_alloc()` | Low-latency |
| CUDA Graph | `_decode_step_graph_replay()` | Reduced kernel launch overhead |
| Jacobi | Parallel iterative decode | Speculative execution |
All kernels must use internal::get_capture_stream() for CUDA Graph compatibility:
cudaStream_t stream = internal::get_capture_stream();
my_kernel<<<grid, block, 0, stream>>>(...);

Critical: Kernels launched without a stream parameter will NOT be captured in the graph.
For dynamic values during graph replay, use _ptr kernel variants:
// Static value (captured at graph creation)
sdpa_causal_fixed_cache(..., context_len, ...);
// Pointer-based (read from GPU buffer at runtime)
sdpa_causal_fixed_cache_ptr(..., context_len_buf, max_kv_len, ...);

Pre-allocated buffers for zero-allocation decode:
@dataclass
class DecodeBuffers:
hidden: GPUArray # [1, hidden_size]
q: GPUArray # [1, num_heads, head_dim]
k: GPUArray # [1, num_kv_heads, head_dim]
v: GPUArray # [1, num_kv_heads, head_dim]
attn_out: GPUArray # [num_heads, 1, head_dim]
# ... (layer-shared, reused across all layers)

model.init_decode_graph(max_seq_len=512)  # Capture graph
# Replay loop
for i in range(num_tokens):
logits = model._decode_step_graph_replay(token_id, position, context_len)
next_token = sample(logits)

| Scenario | CUDA Graph Speedup |
|---|---|
| Full decode loop (with D2H) | ~1.2x |
| Kernel-only (large model) | ~1.0x (GPU-bound) |
| Small model / many kernels | Higher benefit |
Limitation: Current implementation has 2 device syncs per replay (H2D visibility + completion wait), which reduces benefit for large models.
Fixed-length KV cache with GQA support:
# Initialize
for block in model.blocks:
block.attn.init_fixed_cache(max_seq_len, dtype="float16")
# Prefill
hidden, past_kv = model(input_ids, use_cache=True)
for i, block in enumerate(model.blocks):
kv_cache_prefill_gqa(past_kv[i][0], block.attn._k_cache, num_heads, start_pos=0)
kv_cache_prefill_gqa(past_kv[i][1], block.attn._v_cache, num_heads, start_pos=0)
# Backup/Restore for benchmarking
kv_backup = model.snapshot_kv_cache()
model.restore_kv_cache(kv_backup)

Parallel iterative generation for speculative execution:
# Initialize Jacobi buffers
model.init_jacobi_decode(lookahead_k=4, max_seq_len=512)
# Parallel decode
accepted_tokens = model.jacobi_decode_step(draft_tokens, position)

- Full Training Framework - No optimizers, training loops, dataset pipelines, autograd engines
- Abstracting Away GPU Reality - Memory transfers, sync points, kernel costs, precision trade-offs are NOT hidden
- Supporting Legacy GPUs - Only Ampere/Ada and newer; Turing and below are out of scope
- PyTorch API Compatibility - Clarity over familiarity; APIs may diverge intentionally
- "Magic" Performance - No undocumented heuristics; all optimizations must be explainable
- C++/CUDA: CMake with CUDA toolkit
- Python: scikit-build-core for CMake integration
- Rust: Cargo with PyO3
- CI/CD: cibuildwheel with CUDA
| Change Type | Branch | Flow |
|---|---|---|
| Hotfix (v0.1.x) | main | Direct push → tag |
| Minor/Major (v0.2+) | feature/* | Branch → PR → CI test → main → tag |
- ✅ Native C++ backend with CUDA Runtime/Driver API
- ✅ NVRTC JIT compilation
- ✅ pybind11 bindings
- ✅ Zero-copy Python↔Native interop
- ✅ CPU simulation fallback
- ✅ Rust memory pool with LRU eviction
- ✅ Rust GPU scheduler state machine
- ✅ L2-optimized naive matmul (18 TFLOPS)
- ✅ TF32 TensorCore GEMM (27 TFLOPS)
- ✅ SM >= 80 runtime check
- ✅ 106 Rust tests
- ✅ CUDA Graph for single-token decode (M=1)
- ✅ cuBLASLt dynamic loading with descriptor caching
- ✅ Top-k sampling in graph capture
- ✅ Zero-allocation decode path (DecodeBuffers)
- ✅ CUDA Graph stream fix (RoPE/SDPA now properly captured)
- ✅ Batch decode support (seq_len > 1)
- ✅ Jacobi decoding for parallel iterative generation
- ✅ Self-Speculative decoding framework
- ✅ GPU-side Lookahead KV Cache
- ✅ CUDA Events API
- Rust-side async memory transfer engine
- Rust-side kernel dispatch controller
- Python API wrappers for Rust scheduler/memory pool (thin wrappers only)
Build from Git Bash (recommended):
cd /d/Projects/m96-chan/PyGPUkit
./build.sh 86    # SM 86 only (RTX 3090 Ti)
./build.sh 120a  # SM 120a only (RTX 5090)
./build.sh       # default: SM 120a

Supported SMs: 80, 86, 89, 90, 100, 120a
| Machine | GPU | SM | CUDA Toolkit | Notes |
|---|---|---|---|---|
| Primary | RTX 5090 | 120a | 13.1 | Blackwell GeForce, FP8 testing |
| Secondary | RTX 3090 Ti | 86 | 12.x | Ampere, TF32 benchmarks |
Do not use PyGPUkit's built-in Tokenizer. Use the HuggingFace tokenizers library instead.
# Recommended: HuggingFace tokenizers
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("/path/to/tokenizer.json")

# Not recommended: built-in Tokenizer (compatibility issues)
# from pygpukit.llm import Tokenizer

Primary model storage: F:/LLM/
All LLM models for inference testing are stored in F:/LLM/. Use this path when loading models.
F:/LLM/
├── Qwen2.5-7B-Instruct/ # Main test model
├── Qwen3-8B/ # Qwen3 variant
├── TinyLlama-1.1B-Chat-v1.0/ # Small model for quick tests
└── ...
Usage example:
from pygpukit.llm import QwenModel
model = QwenModel.from_safetensors("F:/LLM/Qwen2.5-7B-Instruct")

Note: The HuggingFace cache (~/.cache/huggingface/) may also contain models, but F:/LLM/ is the canonical location.
Development workflow automation:
| Skill | Description |
|---|---|
| `build` | Build native module with SM selection |
| `benchmark` | Run matmul performance benchmarks |
| `lint` | Ruff lint + format |
| `typecheck` | Mypy type check |
| `test` | Run pytest |
| `precommit` | Pre-commit checks (lint + typecheck) |
| `check-all` | Full validation (lint + typecheck + test) |
| `chat-test` | LLM inference testing |
| `kernel-dev` | Kernel development workflow |
Specialized agents for specific tasks:
| Agent | Model | Description |
|---|---|---|
| `kernel-reviewer` | opus | CUDA kernel code review |
| `perf-analyzer` | opus | Benchmark analysis and optimization |
| `api-designer` | sonnet | Python API design review |
| `commit-helper` | haiku | Commit message and PR generation |
| `doc-generator` | haiku | Documentation updates |
Skills and agents are automatically invoked based on task context. Examples:
- "Build for RTX 5090" ->
buildskill - "Review the kernel changes" ->
kernel-revieweragent - "Analyze benchmark results" ->
perf-analyzeragent - "Commit these changes" ->
commit-helperagent