PyGPUkit aims to free developers from the complexity of the CUDA Toolkit, Anaconda, and fragile GPU environments. Its goal is to make GPU programming and model execution feel like using a standard Python library: installable via pip, minimal setup, and no mandatory external SDKs. PyGPUkit provides high-performance GPU kernels, memory management, scheduling, and model execution (e.g., loading SafeTensors models) through a NumPy-like API and a Kubernetes-inspired resource model, allowing developers to use GPUs explicitly, predictably, and productively without fighting their environment.
- Provide the smallest usable GPU runtime for Python
- Expose GPU scheduling (bandwidth, memory, partitioning)
- Make writing custom GPU kernels easy
- Serve as a building block for inference engines, DSP systems, and real-time workloads
Python (High-level orchestration only)
↓
Rust (Core scheduling, memory management, GPU coordination)
↓
C++ (CUDA Driver/Runtime API, NVRTC, kernel launch)
Python is ONLY a high-level orchestration layer. The core scheduling, memory management, GPU coordination, and performance-critical components MUST remain implemented in Rust.
PyGPUkit/
├── src/pygpukit/ # Python API (NumPy-compatible)
│ ├── core/ # Core abstractions
│ │ ├── array.py # GPUArray implementation
│ │ ├── backend.py # Backend detection/initialization
│ │ ├── memory.py # Memory utilities (copy, sync)
│ │ └── stream.py # CUDA Stream wrapper
│ ├── ops/ # GPU operations (modular packages)
│ │ ├── matmul/ # Matrix multiplication
│ │ │ ├── gemm/ # GEMM operations (M > 1)
│ │ │ └── gemv/ # GEMV operations (M = 1)
│ │ ├── nn/ # Neural network ops
│ │ │ ├── activation.py # GELU, SiLU, etc.
│ │ │ ├── attention.py # SDPA, paged attention
│ │ │ ├── norm.py # RMSNorm, LayerNorm
│ │ │ └── rope.py # Rotary position embedding
│ │ └── audio/ # Audio processing
│ │ ├── transforms/ # FFT, Mel spectrogram
│ │ └── analysis/ # Pitch, onset detection
│ ├── llm/ # LLM inference (modular)
│ │ ├── models/ # Model implementations
│ │ │ └── causal_transformer.py
│ │ ├── layers/ # Layer types
│ │ │ ├── attention.py # Multi-head attention
│ │ │ ├── ffn.py # Feed-forward networks
│ │ │ ├── norm.py # Normalization layers
│ │ │ ├── embedding.py # Token/position embeddings
│ │ │ └── recurrent.py # LSTM, Mamba
│ │ ├── decode/ # Decoding strategies
│ │ ├── loader/ # Model loading
│ │ │ ├── safetensors.py # SafeTensors loader
│ │ │ └── tokenizer.py # Tokenizer wrapper
│ │ └── quantization/ # Quantization utilities
│ │ ├── config.py # Quant configs
│ │ └── repack.py # Weight repacking
│ ├── asr/ # Speech recognition (Whisper)
│ │ └── whisper/ # Whisper model implementation
│ └── tts/ # Text-to-speech (Kokoro)
│ └── kokoro/ # Kokoro TTS model
├── native/
│ ├── core/ # C++ (CUDA Runtime/Driver API)
│ ├── jit/ # C++ (NVRTC)
│ ├── ops/ # C++ (CUDA kernels)
│ │ ├── matmul/ # MatMul kernels (see below)
│ │ │ ├── matmul.cu # Main dispatcher
│ │ │ ├── fused.cu # Fused ops (linear+bias+GELU)
│ │ │ └── batched.cu # Batched GEMM
│ │ ├── nn/ # Neural network ops
│ │ │ ├── activation/ # Activation functions
│ │ │ ├── attention/ # Attention kernels
│ │ │ ├── norm/ # Normalization kernels
│ │ │ ├── rope/ # RoPE kernels
│ │ │ └── recurrent/ # LSTM/Mamba kernels
│ │ └── audio/ # Audio processing kernels
│ └── bindings/ # pybind11 (modular)
│ ├── gemm/ # GEMM bindings by dtype
│ ├── gemv/ # GEMV bindings by dtype
│ └── nn/ # NN operation bindings
├── rust/
│ ├── pygpukit-core/ # Pure Rust GPU runtime
│ │ └── src/
│ │ ├── memory/ # MemoryPool, LRU, size-class allocator
│ │ ├── scheduler/ # Task state machine, QoS policies
│ │ └── device.rs # DeviceCapabilities, KernelType
│ └── pygpukit-python/ # PyO3 bindings
├── examples/ # Example scripts (organized)
│ ├── benchmarks/ # Performance benchmarks
│ ├── chat/ # Chat CLI applications
│ ├── demos/ # Feature demos
│ │ └── archived/ # Version-specific demos (historical)
│ └── demo_*.py # Current feature demos
└── tests/
native/ops/matmul/
├── common/ # Shared utilities
│ └── aligned_copy_sm120.cuh
├── gemm/ # GEMM kernels (M > 1)
│ └── {w_dtype}_{a_dtype}_{out_dtype}/{arch}/{kernel}.{cu,cuh}
├── gemv/ # GEMV kernels (M = 1)
│ └── {w_dtype}_{a_dtype}_{out_dtype}/{arch}/{kernel}.{cu,cuh}
├── cublaslt.cuh # cuBLASLt wrapper
├── matmul.cu # Main dispatcher
└── matmul_cutlass.cu # CUTLASS dispatcher
Path Convention: `{gemm|gemv}/w{weight}a{act}_{out}/{arch}/{kernel}.cu`
| Component | Values | Description |
|---|---|---|
| `w_dtype` | `w4`, `w8`, `bf16`, `f32`, `int4`, `int8` | Weight dtype (w=weight) |
| `a_dtype` | `a4`, `a8`, `a16`, `bf16`, `f32`, `int4`, `int8` | Activation dtype (a=act) |
| `out_dtype` | `bf16`, `f32` | Output dtype |
| `arch` | `generic`, `sm80`, `sm90`, `sm100`, `sm120` | Target architecture |
Naming Rationale (Issue #122 Option 2):
- `w8a16_bf16`: FP8 weights, BF16 activations, BF16 output (W8A16 GEMM)
- `w4a16_bf16`: NVF4 weights, BF16 activations, BF16 output (NVF4 GEMV)
- `w8a8_bf16`: FP8 weights, FP8 activations, BF16 output (pure FP8)
- `bf16_bf16`: BF16 weights, BF16 activations (no quantization)
- `f32_f32`: FP32 weights, FP32 activations (baseline)
Examples:
gemm/bf16_bf16/sm80/bf16_cutlass.cuh # BF16 GEMM, SM80, CUTLASS
gemm/w8a8_f32/sm90/fp8_cutlass.cu # FP8->F32 GEMM, SM90, CUTLASS
gemm/w4a16_bf16/sm120/nvf4_cutlass.cu # NVF4 weights, BF16 act->BF16, SM120
gemv/w4a16_bf16/sm120/nvf4.cu # NVF4 GEMV, SM120
gemv/w8a16_bf16/sm120/fp8_opt_kernels.cu # FP8 weight, BF16 act GEMV, SM120
gemm/f32_f32/generic/tf32_mma.cuh # TF32 GEMM, generic (SM80+)
| Module | Purpose | Input | Output |
|---|---|---|---|
| `llm/` | Text generation | Text tokens | Text tokens |
| `asr/` | Speech recognition | Audio waveform | Text |
| `ops/` | Low-level GPU ops | GPUArray | GPUArray |
Rationale: Modules are separated by modality (audio vs text), not by architecture (transformer). This follows industry conventions (HuggingFace, OpenAI API) and enables clean future expansion (TTS, vision, etc.).
| Component | Language | Reason |
|---|---|---|
| Python API | Python | NumPy-compatible user interface |
| CUDA Driver/Runtime | C++ | Direct hardware access |
| NVRTC JIT | C++ | Kernel compilation |
| Memory Pool/LRU | Rust | Safe, fast memory management |
| Scheduler State | Rust | Thread-safe state machine |
| Kernel Launch | C++ | CUDA kernel dispatch |
| Bindings | pybind11, PyO3 | C++/Rust to Python |
- Rust memory pool (with LRU eviction)
- Rust GPU scheduler state machine
- Rust-side async GPU memory transfer engine
- Rust-side kernel dispatch controller
- pygpukit-core is the authoritative runtime - MemoryPool, Scheduler, Task, LRU, SizeClass MUST be implemented here
- All GPU memory management MUST live in `rust/pygpukit-core/src/memory/`
- All scheduling logic MUST live in `rust/pygpukit-core/src/scheduler/`
- Python bindings MUST be thin wrappers only - no logic duplication
- When adding new features, always add them to Rust first, then expose via PyO3
Python API → pybind11 → C++ backend → CUDA Driver API (cu*) / Runtime API (cuda*) / NVRTC
source.cu (string) → NVRTC → PTX → CUDA Driver API → CUmodule → CUfunction
- NO cuda-python
- NO external Python CUDA dependencies
- ALL GPU kernels compiled at runtime
- PTX → SASS handled by NVIDIA driver
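A minimal host-side sketch of this pipeline, assuming a CUDA context already exists and omitting error checks (the helper name is illustrative, not PyGPUkit's actual API):

```cpp
// Sketch: compile CUDA source at runtime via NVRTC, then load it through the
// Driver API. PTX -> SASS happens inside the driver at cuModuleLoadData.
#include <cuda.h>
#include <nvrtc.h>
#include <vector>

CUfunction jit_compile(const char* source, const char* kernel_name) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source, "source.cu", 0, nullptr, nullptr);

    const char* opts[] = {"--gpu-architecture=compute_80"};
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    CUmodule module;
    CUfunction func;
    cuModuleLoadData(&module, ptx.data());
    cuModuleGetFunction(&func, module, kernel_name);
    return func;
}
```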
PyGPUkit uses its own C++ backend with CUDA Driver API / Runtime API / NVRTC.
Do NOT mention or require:
- ❌ `cuda-python`
- ❌ `numba.cuda`
- ❌ `cupy.cuda`
- ❌ PyCUDA-style wrappers
GPU availability is detected via these C++ calls:
- `cudaGetDeviceCount()`
- `cudaDriverGetVersion()`
- `cudaRuntimeGetVersion()`
- `nvrtcVersion()`
CPU fallback happens only if one of these fails.
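A sketch of what that probe sequence might look like (helper name hypothetical; the actual backend code may differ):

```cpp
// Sketch: GPU availability probe. Any failing call triggers CPU fallback.
#include <cuda_runtime.h>
#include <nvrtc.h>

bool gpu_available() {
    int devices = 0, driver_ver = 0, runtime_ver = 0, major = 0, minor = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) return false;
    if (cudaDriverGetVersion(&driver_ver) != cudaSuccess) return false;
    if (cudaRuntimeGetVersion(&runtime_ver) != cudaSuccess) return false;
    if (nvrtcVersion(&major, &minor) != NVRTC_SUCCESS) return false;
    return true;
}
```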
When GPU is unavailable, PyGPUkit must:
- Run scheduler in CPU simulation mode
- Use NumPy as backend for GPUArray ops
- Disable NVRTC
- Still expose full API (no errors)
Python loads a shared library:
- Linux: `_pygpukit_native.cpython-3xx-x86_64-linux-gnu.so`
- Windows: `_pygpukit_native.cp3xx-win_amd64.pyd`
- macOS: CPU backend only
v0.1.x:
- Requires CUDA Toolkit installation
- Loads DLLs from `CUDA_PATH/bin`

v0.2.x (Current):
- cuBLASLt loaded dynamically at runtime
- Searches: `CUDA_PATH/bin/x64` → `CUDA_PATH/bin` → system PATH
- Descriptor caching for matmul performance
- Falls back gracefully if cuBLASLt unavailable

// Dynamic loading sequence
cublasLt64_13.dll  // CUDA 13.x
cublasLt64_12.dll  // CUDA 12.x
cublasLt64_11.dll  // CUDA 11.x

Future (Driver-Only Mode):
- NVRTC DLL shipped inside the wheel
- CUDA Driver (`nvcuda.dll`) provided by NVIDIA GPU drivers
- No cudart dependency
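A Windows-side sketch of that loading order (illustrative only; the real loader also walks the search paths listed above):

```cpp
// Sketch: try the newest cuBLASLt DLL first; return nullptr to signal the
// caller to fall back gracefully when cuBLASLt is unavailable.
#include <windows.h>

HMODULE load_cublaslt() {
    const char* candidates[] = {
        "cublasLt64_13.dll",  // CUDA 13.x
        "cublasLt64_12.dll",  // CUDA 12.x
        "cublasLt64_11.dll",  // CUDA 11.x
    };
    for (const char* name : candidates) {
        if (HMODULE h = LoadLibraryA(name)) {
            return h;
        }
    }
    return nullptr;
}
```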
NEVER generate:
- ❌ "Please install cuda-python"
- ❌ "GPU mode requires the cuda-python package"
Instead use:
- ✅ "CUDA driver not detected"
- ✅ "NVRTC JIT compiler not available"
- ✅ "No GPU devices found (cudaGetDeviceCount == 0)"
- ✅ "Falling back to CPU simulation backend"
- Use or mention `cuda-python` - it is NOT a dependency
- Call CUDA APIs from Python directly
- Implement memory management in pure Python (use Rust)
- Ship precompiled CUDA kernels
- Require specific CUDA toolkit versions at runtime
- Convert Rust features to Python, Cython, Numba, or pure CUDA kernels
- Delete Rust tasks from roadmap
- Simplify architecture by removing Rust layer
- Use emoji or non-ASCII characters in source code or comments (cp932/Shift-JIS compatibility)
- Use C++ for all CUDA Driver/Runtime API calls
- Compile all kernels at runtime with NVRTC
- Use pybind11 for C++ to Python bindings
- Keep Python layer thin - only API surface and NumPy interop
- Support CPU fallback when GPU unavailable
- Add new features to Rust first, then expose via PyO3
- Supported: Ampere (SM 80-86), Ada (SM 89), Hopper (SM 90), Blackwell (SM 100, 120a)
- Unsupported: Architectures below SM80
- Build default: SM 80, 86, 89, 90, 100, 120a (CUDA 13.1+)
DO NOT use classic shared-memory tiling as the main optimization. On Ampere, L2 is large and fast; naive or warp-level kernels outperform tiled kernels.
Prefer:
- L2-friendly memory access patterns
- Coalesced loads (`ld.global.cs`)
- Warp-level primitives (shuffle, reduce)
- Tensor-core paths when possible (`wmma`, `mma.sync`)
- Asynchronous copy (`cp.async`) for global→shared prefetch (see the sketch after the Avoid list)
Avoid:
- Unnecessary `__syncthreads()`
- Complex shared-memory patterns designed for Pascal/Turing
- Block sizes > 256 unless occupancy analysis proves benefit
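As a concrete instance of the `cp.async` item above, a minimal SM80+ prefetch sketch (sizes illustrative; production kernels pipeline multiple stages as described later):

```cpp
// Sketch: asynchronous global -> shared prefetch using the cp.async
// intrinsics from <cuda_pipeline.h> (SM80+). Assumes blockDim.x == 128.
#include <cuda_pipeline.h>

__global__ void prefetch_demo(const float4* __restrict__ gmem, float* out) {
    __shared__ float4 smem[128];
    int tid = threadIdx.x;

    __pipeline_memcpy_async(&smem[tid], &gmem[tid], sizeof(float4));
    __pipeline_commit();
    __pipeline_wait_prior(0);  // block until the prefetch has landed
    __syncthreads();           // one sync so all lanes see the full tile

    out[tid] = smem[tid].x;
}
```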
int sm = device_sm_major * 10 + device_sm_minor;
if (sm >= 90) {
use_mma_sync_kernels(); // Hopper/Ada
} else if (sm >= 80) {
use_ampere_optimized_kernels(); // Ampere
} else {
throw std::runtime_error("PyGPUkit requires SM >= 80 (Ampere)");
}

For Ampere, implement two variants:
- L2-optimized naive kernel (fast for FP32)
- Warp-level MMA kernel (TensorCore for TF32/FP16/BF16)
Block sizes: (16, 16) or (32, 8) - do NOT increase to 32×32 unless profiler proves faster.
- Align pointers to 128 bytes where possible
- Ensure loads are coalesced across warps
- Prefer `float4`/`half8` vectorized loads
- Avoid bank conflicts in shared memory
- Use register blocking aggressively
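A minimal sketch of the vectorized, coalesced pattern these rules describe (illustrative kernel, not from the codebase):

```cpp
// Sketch: each lane loads 16 bytes via float4, so a warp covers 512
// contiguous bytes per step and loads coalesce into full 128-byte
// transactions. Pointers are assumed 16-byte aligned.
__global__ void copy_vec4(const float4* __restrict__ src,
                          float4* __restrict__ dst, int n_vec4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec4) {
        dst[i] = src[i];
    }
}
```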
| GPU | FP32 | TF32 TensorCore |
|---|---|---|
| RTX 3090 Ti | 18 TFLOPS | 27+ TFLOPS |
| A100 | 5.5+ TFLOPS | 156 TFLOPS |
Achieved (v0.2.3): TF32 on RTX 3090 Ti: 27.38 TFLOPS (8192×8192×8192)
Single Token Decode (M=1):
| Mode | Tokens/sec | ms/token |
|---|---|---|
| Non-graph decode | 1.84 | 544 |
| CUDA Graph decode | 2.19 | 457 |
| Speedup | 1.19x | - |
Batch Decode (v0.2.11):
| Batch Size | Per Token (us) | Throughput | Speedup |
|---|---|---|---|
| 1 | 381,303 | 2.6 tok/s | 1.00x |
| 2 | 205,030 | 4.9 tok/s | 1.86x |
| 4 | 108,521 | 9.2 tok/s | 3.51x |
| 8 | 55,845 | 17.9 tok/s | 6.83x |
E2E Batch Verification (32 tokens):
| Method | Time (ms) | tok/s | Speedup |
|---|---|---|---|
| Sequential | 14,541 | 2.13 | 1.00x |
| Batch Verify (batch=4) | 4,082 | 7.59 | 3.56x |
| Batch Verify (batch=8) | 2,147 | 14.44 | 6.77x |
Decode Strategy Benchmark (v0.2.11):
Model: Qwen2.5-7B-Instruct (bfloat16), RTX 3090 Ti
| Strategy | tok/s | Speedup | Notes |
|---|---|---|---|
| DecodeM1 (baseline) | 3.2 | 1.00x | Single token per step |
| DecodeBatch (batch=8) | 19.6 | 6.06x | TensorCore efficient |
| DecodeSpeculative | 1.4 | 0.42x | Self-speculative (early layers) |
| DecodeJacobi | 1.7 | 0.53x | Parallel iterative |
Note: Large models (8B+) are GPU compute-bound; CUDA Graph benefit is modest. Batch decode shows near-linear scaling with TensorCore utilization.
-arch=sm_80
--expt-relaxed-constexpr
--use_fast_math

CRITICAL: PTX inline assembly mma.sync has DIFFERENT fragment layouts than the WMMA API.
Verified empirically using dump_c_fragment.cu.
Each thread in a warp (lane 0-31) holds:
- A fragment: 4 registers (16×8 matrix, row-major)
- B fragment: 2 registers (8×8 matrix, col-major)
- C fragment: 4 registers (16×8 matrix)
A fragment (16×8):
a[0] = A[lane/4][lane%4] // rows 0-7, cols 0-3
a[1] = A[lane/4 + 8][lane%4] // rows 8-15, cols 0-3
a[2] = A[lane/4][lane%4 + 4] // rows 0-7, cols 4-7
a[3] = A[lane/4 + 8][lane%4 + 4] // rows 8-15, cols 4-7
B fragment (8×8):
b[0] = B[lane%4][lane/4] // rows 0-3, cols 0-7
b[1] = B[lane%4 + 4][lane/4] // rows 4-7, cols 0-7
C fragment (16×8) - KEY DIFFERENCE FROM WMMA:
c[0] = C[lane/4][(lane%4)*2] // rows 0-7, cols 0,2,4,6
c[1] = C[lane/4][(lane%4)*2 + 1] // rows 0-7, cols 1,3,5,7
c[2] = C[lane/4 + 8][(lane%4)*2] // rows 8-15, cols 0,2,4,6
c[3] = C[lane/4 + 8][(lane%4)*2 + 1] // rows 8-15, cols 1,3,5,7
- C fragment column stride: PTX uses `(lane%4)*2` (stride 2), NOT `lane%4` (stride 1)
- C fragment pairs: c[0],c[1] are adjacent columns; c[2],c[3] are +8 rows
| Aspect | WMMA API | PTX mma.sync |
|---|---|---|
| Fragment types | `wmma::fragment<>` | Raw registers |
| Layout | Opaque (compiler-managed) | Must match PTX spec exactly |
| Flexibility | Limited shapes | Full control |
| API | A | B | C |
|---|---|---|---|
| WMMA 16×16×8 | 16×8 | 8×16 | 16×16 |
| PTX m16n8k8 | 16×8 | 8×8 | 16×8 |
PTX m16n8k8 uses only the left half (cols 0-7) of WMMA's B/C.
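For reference, a sketch of issuing one m16n8k8 TF32 `mma.sync` on raw registers, assuming `a[4]`, `b[2]`, and `c[4]` already hold the per-lane fragment elements in the layout above and that the inputs were already rounded to TF32:

```cpp
// Sketch: raw-register m16n8k8 TF32 MMA. D reuses the C registers (c += a*b).
__device__ __forceinline__ void mma_m16n8k8_tf32(float c[4], const float a[4],
                                                 const float b[2]) {
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
        : "r"(__float_as_uint(a[0])), "r"(__float_as_uint(a[1])),
          "r"(__float_as_uint(a[2])), "r"(__float_as_uint(a[3])),
          "r"(__float_as_uint(b[0])), "r"(__float_as_uint(b[1])));
}
```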
Common Bug: Prefetching into the wrong stage.
// WRONG - overwrites current buffer
for (int kt = 0; kt < num_k_tiles; ++kt) {
int curr = kt & 1;
if (kt + 2 < num_k_tiles) {
load_async((kt+2) & 1, kt + 2); // BUG!
}
process(curr);
}
// CORRECT - prefetch into OTHER stage
load_async(0, 0);
cp_async_wait_0();
for (int kt = 0; kt < num_k_tiles; ++kt) {
int curr = kt & 1;
int next = curr ^ 1; // OTHER stage
if (kt + 1 < num_k_tiles) {
load_async(next, kt + 1);
}
process(curr);
cp_async_wait_0();
}

Key Insight: Always prefetch into the stage you're NOT currently reading from.
// WMMA row_major × row_major (PASS)
fragment<matrix_a, 16, 16, 8, precision::tf32, row_major> a_frag;
fragment<matrix_b, 16, 16, 8, precision::tf32, row_major> b_frag;
fragment<accumulator, 16, 16, 8, float> c_frag;
load_matrix_sync(a_frag, A + k, K);
load_matrix_sync(b_frag, B + k * N, N);
mma_sync(c_frag, a_frag, b_frag, c_frag);
store_matrix_sync(C, c_frag, N, mem_row_major);

Note: the row_major A + col_major B combination fails due to different memory layout interpretation.
- `native/ops/matmul_f32_tf32.cuh` - TF32 kernel
- `native/ops/basic.cu` - Dispatch logic
- Environment variable `PYGPUKIT_ALLOW_TF32=1` to enable
| Metric | Value |
|---|---|
| Current | 27.38 TFLOPS (8192×8192) |
| RTX 3090 Ti TF32 Theoretical | ~40 TFLOPS |
| cuBLAS Reference | ~59 TFLOPS |
| Gap to cuBLAS | 47% |
Block Tile: BM=128, BN=128, BK=16
Warp Tile: WARP_TILES_M=2, WARP_TILES_N=8 (32×64 per warp)
MMA Instruction: mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32
Pipeline: 2-stage double buffering
Thread Block: 256 threads (8 warps)
Shared Memory: ~37KB/block → occupancy ~16.7%
Current implementation uses simple padding (A_PAD=4, B_PAD=4) but bank conflicts are not fully eliminated.
CUTLASS Approach:
// XOR-based swizzle pattern
int store_column = (lane_id % 8) ^ (lane_id / 8);

- Store and Load phases use a transposed index relationship
- XOR operation applied per 8×8 block unit
- Combined with `ldmatrix` for fully bank conflict-free access
Key Insight:
"the indexing in the 'Loading from Shared Memory to Registers' slide is transposed from the indexing in 'Load from Global/Store to Shared' slide."
Current implementation manually loads from shared memory to registers:
// Current implementation
float a0 = smA[curr][tile_m + a_row_base][kk + a_col_base];

CUTLASS Approach:
- Uses `ldmatrix.sync.aligned.m8n8.x4.shared.b16`
- Single instruction loads four 8×8 matrices (entire warp)
TF32 Limitation:
"ldmatrix cannot transpose 32-bit data. CUTLASS uses 32-bit shared memory load to load data from shared memory to the registers to do the transpose right before calling tf32 tensor core."
Current: 2-stage → CUTLASS default: 4-stage
Past Failed Attempt:
"3-stage pipeline: -28% (50% more smem reduced occupancy)"
Considerations:
- Trade-off between shared memory usage and occupancy
- RTX 3090 Ti: 100KB/SM available
- Current 37KB → 4-stage at ~74KB should fit
| Priority | Optimization | Expected Gain | Difficulty |
|---|---|---|---|
| 1 | Swizzled shared memory layout | +10-15% | Medium |
| 2 | 4-stage pipeline (proper smem sizing) | +5-10% | Medium |
| 3 | Warp tile tuning (BM/BN/BK re-tuning) | +5-10% | Low |
| 4 | Epilogue fusion (bias + activation) | Memory reduction | Medium |
- Current: 27.38 TFLOPS (68% of target)
- Swizzle + 4-stage: 32-34 TFLOPS expected
- Fine-tuning: 35+ TFLOPS
- CUTLASS TF32 GEMM Example
- CUTLASS Efficient GEMM Documentation
- CUTLASS Swizzled Layouts Discussion
- Understanding CUTLASS Permuted Shared Memory
- Dissecting Tensor Cores (Academic Paper)
Edit → Build → Validate → Benchmark → Commit
Always commit after validation and benchmark, regardless of results.
Build from Git Bash (recommended):
cd /d/Projects/m96-chan/PyGPUkit
./build.sh 86    # SM 86 only (RTX 3090 Ti)
./build.sh 120a  # SM 120a only (RTX 5090)
./build.sh       # default: SM 120a

Notes:
- RTX 5090 (SM 120a) requires CUDA 13.1 or later
- Supported SMs: 80, 86, 89, 90, 100, 120a
Before EVERY commit, run these checks:
# 1. Ruff lint check (auto-fix and format)
git ls-files "*.py" | xargs python -m ruff check --fix
git ls-files "*.py" | xargs python -m ruff format
# 2. Mypy type check
python -m mypy src/ --ignore-missing-imports --disable-error-code=union-attr --disable-error-code=no-redef --disable-error-code=no-any-return --disable-error-code=attr-defined --disable-error-code=assignment --disable-error-code=arg-type --disable-error-code=index --disable-error-code=misc

NEVER commit without passing ALL checks. CI will reject PRs with lint/type errors.
Before creating a PR, verify ALL of the following:
# 1. Lint passes
git ls-files "*.py" | xargs python -m ruff check
# 2. Mypy passes
python -m mypy src/ --ignore-missing-imports --disable-error-code=union-attr --disable-error-code=no-redef --disable-error-code=no-any-return --disable-error-code=attr-defined --disable-error-code=assignment --disable-error-code=arg-type --disable-error-code=index --disable-error-code=misc
# 3. Tests pass
python -m pytest tests/ -v
# 4. Benchmark runs (optional but recommended)
python -m pygpukit.benchmark --quick

DO NOT create a PR until all checks pass locally.
- Run lint check before commit (see above)
- Commit after every validation/benchmark completion, regardless of outcome
- Include benchmark results in commit message
- Never proceed to next kernel edit until commit is complete
- Never overwrite a working kernel without committing first
wip(tf32): <summary of changes>
Benchmark results (RTX 5090):
- 2048x2048: XX.XX TFLOPS
- 4096x4096: XX.XX TFLOPS
- 8192x8192: XX.XX TFLOPS
Correctness: <PASS/FAIL>
You MUST commit immediately when:
- Benchmark improves in ANY matrix size (even +0.01 TFLOPS)
- Correctness achieved (relative error < 1e-3 for all sizes)
- After EVERY benchmark execution - even if no improvement, commit with `bench: results logged (no improvement)`
If performance or correctness degrades:
- MUST revert to the previous commit BEFORE continuing
Rationale:
- Prevent losing fast kernel versions
- Track performance changes over time
- Preserve trial-and-error history
Use unified benchmark suite: python -m pygpukit.benchmark
# Quick benchmark (GEMM + GEMV)
python -m pygpukit.benchmark --quick
# Full benchmark
python -m pygpukit.benchmark
# Save results and compare with baseline
python -m pygpukit.benchmark --quick --save baseline.json
python -m pygpukit.benchmark --compare baseline.json --fail-on-regression
# Specific benchmarks
python -m pygpukit.benchmark --gemm --sizes 4096,8192
python -m pygpukit.benchmark --gemv --dtypes bf16,fp8
python -m pygpukit.benchmark --attention --seq-lens 512,1024
# All benchmarks including FP8 (SM120+)
python -m pygpukit.benchmark --all --fp8
# Markdown output for README
python -m pygpukit.benchmark --quick --markdown

Output includes:
- Time in microseconds (us)
- TFLOPS for compute benchmarks
- Correctness verification
- JSON export for regression tracking
Environment Variables:
- `PYGPUKIT_ALLOW_TF32=1` - Enable TF32 TensorCore
- `PYGPUKIT_TF32_V2=1` - Use PTX mma.sync kernel (default when TF32 enabled)
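A sketch of how the native dispatcher might read these flags (helper names hypothetical, not PyGPUkit's actual internals):

```cpp
// Sketch: environment-variable gating for the TF32 paths. PYGPUKIT_TF32_V2
// defaults to on once TF32 is enabled, matching the note above.
#include <cstdlib>
#include <cstring>

static bool env_is(const char* name, const char* value) {
    const char* v = std::getenv(name);
    return v != nullptr && std::strcmp(v, value) == 0;
}

bool tf32_allowed() { return env_is("PYGPUKIT_ALLOW_TF32", "1"); }

bool tf32_v2_kernel() {
    // default when TF32 is enabled, unless explicitly disabled with 0
    return tf32_allowed() && !env_is("PYGPUKIT_TF32_V2", "0");
}
```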
M=1 decode is implemented in two variants: a CUDA Graph version and a Non-Graph version.
Use CUDA Graph ONLY when ALL conditions are met:
- Fixed shapes/dtypes/RoPE tables - No dynamic changes during replay
- Identical kernel path - warmup / capture / replay use the same code path
- No KV cache pollution - Graph must not write to real KV cache during warmup/capture
- H2D copies on capture stream - All host-to-device copies must be on the stream being captured
If any condition is NOT met, fallback to Non-Graph version.
- Conditional branches based on runtime values
- `copy_to` operations (use direct buffer writes instead)
- Any operation that reads from or writes to KV cache
- SDPA (Scaled Dot-Product Attention) - always run outside graph
# Graph captures ONLY stateless operations:
# - Embedding lookup (via GPU pointer)
# - Linear projections (QKV, O, MLP)
# - RMSNorm
# - RoPE (via pre-computed GPU tables)
# These run OUTSIDE graph:
# - KV cache update
# - SDPA attention
# - Any operation that depends on context_len at runtime

PyGPUkit is not a replacement for PyTorch, JAX, or TensorFlow. Its purpose is to provide low-level, explicit, and controllable GPU execution primitives.
- Focus: memory, kernels, scheduling, bandwidth, latency
- Not focus: autograd graphs, optimizers, training loops
High performance is assumed. Optimization enables scheduling, concurrency, and predictability.
- Slower-than-cuBLAS requires justification
- Faster-than-cuBLAS is welcome, but not mandatory
- Performance regressions are unacceptable without explicit trade-offs
User-facing APIs should resemble NumPy-style array operations.
- `C = A @ B` is preferred over opaque operator graphs
- Explicit is better than implicit
- Users should understand when and how GPU work is executed
PyGPUkit treats the GPU as a shared, schedulable resource (Kubernetes-inspired).
- Admission control, QoS, memory reservation, kernel pacing
- Scheduling decisions are explicit and inspectable
- Kernels are workloads, not side effects
SafeTensors are treated as immutable, read-only GPU resources.
- No in-place mutation
- No hidden ownership or lifecycle coupling
Leveraging vendor or OSS-optimized kernels is acceptable and encouraged.
- Value lies in orchestration, scheduling, and integration
- Reusing proven kernels is preferable to reinventing them
- TF32 precision loss is acceptable when explicitly enabled
- FP32 correctness must remain available
- Non-determinism must be explainable and bounded
PyGPUkit includes a minimal LLM inference engine for SafeTensors models (Qwen, LLaMA, etc.).
SafeTensors → Model Loading → Prefill → Decode Loop → Token Output
↓
CUDA Graph (optional)
| Mode | Description | Use Case |
|---|---|---|
| Standard | `model.forward()` with allocation | Simple usage |
| Zero-Alloc | `_decode_step_zero_alloc()` | Low-latency |
| CUDA Graph | `_decode_step_graph_replay()` | Reduced kernel launch overhead |
| Jacobi | Parallel iterative decode | Speculative execution |
All kernels must use internal::get_capture_stream() for CUDA Graph compatibility:
cudaStream_t stream = internal::get_capture_stream();
my_kernel<<<grid, block, 0, stream>>>(...);

Critical: Kernels launched without a stream parameter will NOT be captured in the graph.
For dynamic values during graph replay, use _ptr kernel variants:
// Static value (captured at graph creation)
sdpa_causal_fixed_cache(..., context_len, ...);
// Pointer-based (read from GPU buffer at runtime)
sdpa_causal_fixed_cache_ptr(..., context_len_buf, max_kv_len, ...);

Pre-allocated buffers for zero-allocation decode:
@dataclass
class DecodeBuffers:
hidden: GPUArray # [1, hidden_size]
q: GPUArray # [1, num_heads, head_dim]
k: GPUArray # [1, num_kv_heads, head_dim]
v: GPUArray # [1, num_kv_heads, head_dim]
attn_out: GPUArray # [num_heads, 1, head_dim]
# ... (layer-shared, reused across all layers)

model.init_decode_graph(max_seq_len=512)  # Capture graph
# Replay loop
for i in range(num_tokens):
logits = model._decode_step_graph_replay(token_id, position, context_len)
next_token = sample(logits)

| Scenario | CUDA Graph Speedup |
|---|---|
| Full decode loop (with D2H) | ~1.2x |
| Kernel-only (large model) | ~1.0x (GPU-bound) |
| Small model / many kernels | Higher benefit |
Limitation: Current implementation has 2 device syncs per replay (H2D visibility + completion wait), which reduces benefit for large models.
Fixed-length KV cache with GQA support:
# Initialize
for block in model.blocks:
block.attn.init_fixed_cache(max_seq_len, dtype="float16")
# Prefill
hidden, past_kv = model(input_ids, use_cache=True)
for i, block in enumerate(model.blocks):
kv_cache_prefill_gqa(past_kv[i][0], block.attn._k_cache, num_heads, start_pos=0)
kv_cache_prefill_gqa(past_kv[i][1], block.attn._v_cache, num_heads, start_pos=0)
# Backup/Restore for benchmarking
kv_backup = model.snapshot_kv_cache()
model.restore_kv_cache(kv_backup)

Parallel iterative generation for speculative execution:
# Initialize Jacobi buffers
model.init_jacobi_decode(lookahead_k=4, max_seq_len=512)
# Parallel decode
accepted_tokens = model.jacobi_decode_step(draft_tokens, position)

- Full Training Framework - No optimizers, training loops, dataset pipelines, autograd engines
- Abstracting Away GPU Reality - Memory transfers, sync points, kernel costs, precision trade-offs are NOT hidden
- Supporting Legacy GPUs - Only Ampere/Ada and newer; Turing and below are out of scope
- PyTorch API Compatibility - Clarity over familiarity; APIs may diverge intentionally
- "Magic" Performance - No undocumented heuristics; all optimizations must be explainable
- C++/CUDA: CMake with CUDA toolkit
- Python: scikit-build-core for CMake integration
- Rust: Cargo with PyO3
- CI/CD: cibuildwheel with CUDA
| Change Type | Branch | Flow |
|---|---|---|
| Hotfix (v0.1.x) | main | Direct push → tag |
| Minor/Major (v0.2+) | feature/* | Branch → PR → CI test → main → tag |
- ✅ Native C++ backend with CUDA Runtime/Driver API
- ✅ NVRTC JIT compilation
- ✅ pybind11 bindings
- ✅ Zero-copy Python↔Native interop
- ✅ CPU simulation fallback
- ✅ Rust memory pool with LRU eviction
- ✅ Rust GPU scheduler state machine
- ✅ L2-optimized naive matmul (18 TFLOPS)
- ✅ TF32 TensorCore GEMM (27 TFLOPS)
- ✅ SM >= 80 runtime check
- ✅ 106 Rust tests
- ✅ CUDA Graph for single-token decode (M=1)
- ✅ cuBLASLt dynamic loading with descriptor caching
- ✅ Top-k sampling in graph capture
- ✅ Zero-allocation decode path (DecodeBuffers)
- ✅ CUDA Graph stream fix (RoPE/SDPA now properly captured)
- ✅ Batch decode support (seq_len > 1)
- ✅ Jacobi decoding for parallel iterative generation
- ✅ Self-Speculative decoding framework
- ✅ GPU-side Lookahead KV Cache
- ✅ CUDA Events API
- Rust-side async memory transfer engine
- Rust-side kernel dispatch controller
- Python API wrappers for Rust scheduler/memory pool (thin wrappers only)
Build from Git Bash (recommended):
cd /d/Projects/m96-chan/PyGPUkit
./build.sh 86    # SM 86 only (RTX 3090 Ti)
./build.sh 120a  # SM 120a only (RTX 5090)
./build.sh       # default: SM 120a

Supported SMs: 80, 86, 89, 90, 100, 120a
| Machine | GPU | SM | CUDA Toolkit | Notes |
|---|---|---|---|---|
| Primary | RTX 5090 | 120a | 13.1 | Blackwell GeForce, FP8 testing |
| Secondary | RTX 3090 Ti | 86 | 12.x | Ampere, TF32 benchmarks |
Do not use PyGPUkit's built-in Tokenizer. Use the HuggingFace tokenizers library instead.
# Recommended: HuggingFace tokenizers
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("/path/to/tokenizer.json")

# Not recommended: built-in Tokenizer (compatibility issues)
# from pygpukit.llm import Tokenizer

Primary model storage: F:/LLM/
All LLM models for inference testing are stored in F:/LLM/. Use this path when loading models.
F:/LLM/
├── Qwen2.5-7B-Instruct/ # Main test model
├── Qwen3-8B/ # Qwen3 variant
├── TinyLlama-1.1B-Chat-v1.0/ # Small model for quick tests
└── ...
Usage example:
from pygpukit.llm import QwenModel
model = QwenModel.from_safetensors("F:/LLM/Qwen2.5-7B-Instruct")

Note: The HuggingFace cache (~/.cache/huggingface/) may also contain models, but F:/LLM/ is the canonical location.
Development workflow automation:
| Skill | Description |
|---|---|
| `build` | Build native module with SM selection |
| `benchmark` | Run matmul performance benchmarks |
| `lint` | Ruff lint + format |
| `typecheck` | Mypy type check |
| `test` | Run pytest |
| `precommit` | Pre-commit checks (lint + typecheck) |
| `check-all` | Full validation (lint + typecheck + test) |
| `chat-test` | LLM inference testing |
| `kernel-dev` | Kernel development workflow |
Specialized agents for specific tasks:
| Agent | Model | Description |
|---|---|---|
| `kernel-reviewer` | opus | CUDA kernel code review |
| `perf-analyzer` | opus | Benchmark analysis and optimization |
| `api-designer` | sonnet | Python API design review |
| `commit-helper` | haiku | Commit message and PR generation |
| `doc-generator` | haiku | Documentation updates |
Skills and agents are automatically invoked based on task context. Examples:
- "Build for RTX 5090" ->
buildskill - "Review the kernel changes" ->
kernel-revieweragent - "Analyze benchmark results" ->
perf-analyzeragent - "Commit these changes" ->
commit-helperagent