All notable changes to PyGPUkit will be documented in this file.
For detailed release notes with code examples, see the README.md "What's New" sections.
- Triton Backend MVP: Optional Triton backend for rapid kernel prototyping
- pygpukit.triton module: TritonArray wrapper, from_gpuarray, triton_available
- Triton Kernels: RMSNorm, LayerNorm, Softmax, Rotary
- Hybrid Execution: Mix Triton + Native CUDA in same model
- examples/chat_cli_triton.py: Hybrid chat example demonstrating Triton + CUDA
- TritonArray dtype mapping: Support both PascalCase and lowercase dtype strings
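The Triton kernels listed above (RMSNorm, LayerNorm, Softmax, Rotary) implement standard normalization math. As a hedged NumPy reference for the expected semantics (this is not the Triton kernel itself), RMSNorm normalizes by the root-mean-square of the last axis:

```python
import numpy as np

def rmsnorm_ref(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Reference RMSNorm: x / rms(x) * weight over the last axis."""
    rms = np.sqrt(np.mean(x.astype(np.float64) ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms * weight).astype(x.dtype)

x = np.random.randn(2, 8).astype(np.float32)
w = np.ones(8, dtype=np.float32)
y = rmsnorm_ref(x, w)
```

With unit weights, each output row has RMS ≈ 1, which is a quick sanity check for any kernel implementation.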
- MoE (Mixture of Experts): Full Mixtral support with TopK routing, grouped GEMM
- Thinking Model: Qwen3 `<think>...</think>` block parsing
- GEMV Kernels (SM120): FP8/FP8 (W8A8), NVF4/NVF4 (W4A4), Int4
- GEMM Kernels (SM120): W8A16, Int8 native (dp4a), Int4 via Int8, Grouped GEMM v2
- Claude Code Skills: Build, benchmark, lint, test automation
- Subagents: kernel-reviewer, perf-analyzer, api-designer
- Kernel directory restructure: `{gemm|gemv}/{input}/{output}/{arch}/`
- Removed redundant slow kernels (FP8 GEMV basic, Int8 via FP8)
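Separating Qwen3's `<think>...</think>` reasoning from the visible answer takes only a small parser. This sketch in plain Python illustrates the idea; it is not PyGPUkit's actual implementation, and the function name is hypothetical:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer): concatenated <think> contents and the rest."""
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return thinking, answer

out = "<think>Check units first.</think>The answer is 42."
thinking, answer = split_thinking(out)
# thinking == "Check units first."
# answer == "The answer is 42."
```

A non-greedy match with `re.DOTALL` keeps multi-line reasoning blocks intact while leaving the answer text untouched.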
- Whisper ASR Module: Full encoder/decoder, preprocessing, streaming transcription
- GEMV Kernels: BF16 (vectorized BF16x2), NVF4 (pre-scaled LUT)
- FP8 I/O GEMM (SM120): Pure FP8 E4M3 input/output with blockwise scaling
- Pure NVF4 GEMM: 446 TFLOPS with GPU-side quantization (170x vs CPU)
- GPUArray improvements: Scalar arithmetic, transpose, reshape, indexing
- GPU Transpose Kernels: 2D, 3D (0,2,1), 4D (0,2,1,3), 4D (0,1,3,2)
- Math operations: sin, cos, sqrt, rsqrt, abs, neg, clamp, where, sigmoid, tanh, argmax, min, sum_axis
- uint8/int8 NumPy support: `from_numpy` for FP8 data handling
- Linear layer uses GEMV for M=1 decode (1.3-2.4x faster than matmul)
- SM120 compatibility via CUTLASS fork with alignment fixes
- Windows wheel RECORD file: Missing `licenses/LICENSE` entry
- RECORD file generation: Dynamic dist-info folder detection
- Moved benchmark/demo files to `benchmarks/` and `examples/`
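The transpose kernels above cover four fixed axis permutations. Their expected layouts can be stated with NumPy's `transpose` as a reference (the GPU kernels themselves are CUDA; this only pins down the semantics):

```python
import numpy as np

# The permutations supported by the GPU transpose kernels, expressed with
# NumPy's transpose as a reference for the expected output layouts.
x2 = np.arange(6).reshape(2, 3)
x3 = np.arange(24).reshape(2, 3, 4)
x4 = np.arange(120).reshape(2, 3, 4, 5)

t2 = x2.T                        # 2D transpose          -> (3, 2)
t3 = x3.transpose(0, 2, 1)       # 3D, swap last two     -> (2, 4, 3)
t4a = x4.transpose(0, 2, 1, 3)   # 4D, swap middle axes  -> (2, 4, 3, 5)
t4b = x4.transpose(0, 1, 3, 2)   # 4D, swap last two     -> (2, 3, 5, 4)
```

The (0, 2, 1, 3) case is the common attention-layout shuffle between (batch, seq, heads, dim) and (batch, heads, seq, dim).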
- GPU Audio Processing (no cuFFT dependency):
  - Time-Frequency: `istft`, `griffin_lim`
  - Spectral: `spectral_centroid`, `spectral_bandwidth`, `spectral_rolloff`, `spectral_flatness`, `spectral_contrast`
  - Pitch: `detect_pitch_yin`, `detect_pitch_yin_frames`, `autocorrelation`
  - Music: `cqt`, `chroma_stft`, `chroma_cqt`, `zero_crossing_rate`
  - Source Separation: `hpss`, `harmonic`, `percussive`
  - Time/Pitch: `time_stretch`, `pitch_shift`
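Two of the simpler audio primitives listed above, autocorrelation and zero-crossing rate, have compact NumPy reference definitions. This is a hedged sketch of the math, not the GPU implementation; the `_ref` names are illustrative:

```python
import numpy as np

def autocorrelation_ref(x: np.ndarray) -> np.ndarray:
    """Autocorrelation, positive lags only (lag 0 first)."""
    n = len(x)
    return np.correlate(x, x, mode="full")[n - 1:]

def zero_crossing_rate_ref(x: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    return float(np.mean(np.signbit(x[:-1]) != np.signbit(x[1:])))

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 100 * t)      # 100 Hz sine, one second

ac = autocorrelation_ref(tone)
period = int(np.argmax(ac[20:]) + 20)   # skip the lag-0 peak region
zcr = zero_crossing_rate_ref(tone)
```

For a 100 Hz tone at 8 kHz the autocorrelation peaks at a lag of 80 samples, and the zero-crossing rate lands near 200/8000 = 0.025, which is the kind of invariant pitch detectors such as YIN build on.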
- Batch decode: Near-linear speedup (6.83x at batch=8)
- Decode strategies: DecodeM1, DecodeM1Graph, DecodeBatch, DecodeJacobi
- Driver API: Async memory ops, pinned malloc
- RTX 5090 (SM120): Full support via CUDA 13.x
- Qwen2 architecture: `QWEN2_SPEC` for Qwen2/2.5
- Audio ops: STFT, Mel filterbank, MFCC, VAD, streaming
- CUDA Graph stream fix (RoPE/SDPA properly captured)
- Dynamic cuBLASLt loading: True driver-only deployment
- Descriptor caching: 2.67x faster (395ms -> 148ms for 224 matmuls)
- CUDA Graph optimizations (eliminated GPU allocations)
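The descriptor-caching speedup (395ms → 148ms for 224 matmuls) comes from building an expensive per-shape object once and reusing it. A minimal sketch of the pattern with Python's `functools.lru_cache` and a stand-in factory (the function name and key are hypothetical, not PyGPUkit's API):

```python
from functools import lru_cache

calls = {"created": 0}

@lru_cache(maxsize=None)
def get_matmul_descriptor(m: int, n: int, k: int, dtype: str) -> tuple:
    """Stand-in for building a matmul descriptor: constructed once per
    (m, n, k, dtype) key, then served from the cache on every later call."""
    calls["created"] += 1
    return (m, n, k, dtype)  # real code would return a library handle

# 224 matmuls over two distinct shapes -> only two descriptor builds
shapes = [(1, 4096, 4096, "bf16"), (1, 11008, 4096, "bf16")]
for _ in range(112):
    for s in shapes:
        get_matmul_descriptor(*s)
```

Because decode workloads reuse a handful of shapes thousands of times, amortizing descriptor setup this way removes it from the hot path almost entirely.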
- Unified LLM Interface: `CausalTransformerModel` with `ModelSpec`
- Architecture support: GPT-2, LLaMA 2/3, Qwen2/2.5, Qwen3
- Hybrid attention: Auto CPU/GPU switching
- LLM operations: `sdpa_causal`, `rope_inplace`, `silu`, `rmsnorm`
- Sharded models: Auto-load split safetensors
- CUTLASS epilogue fusion: Linear + Bias + GELU
- Multi-SM kernels: SM80/86/89/90/100/120 optimized
- Operations: `transpose`, `bias_add_inplace`, `linear_bias_gelu`
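The LLM operations listed in this release follow the standard definitions. As a hedged NumPy reference for the semantics of SiLU and causal SDPA (single head, unbatched; not the CUDA kernels themselves):

```python
import numpy as np

def silu_ref(x: np.ndarray) -> np.ndarray:
    """SiLU activation: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + np.exp(-x)))

def sdpa_causal_ref(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Causal scaled dot-product attention for (seq, dim) inputs."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Mask strictly-upper-triangular entries: no attending to future tokens.
    scores = np.where(np.triu(np.ones_like(scores, dtype=bool), 1), -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out = sdpa_causal_ref(q, k, v)
```

A useful invariant for testing any causal-attention kernel: the first output row must equal `v[0]`, since position 0 can attend only to itself.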
- Complete public API exports with snake_case naming
- CUTLASS backend: Default GEMM (TF32: 31 TFLOPS, FP16/BF16: 63 TFLOPS)
- Multi-LLM scheduling: Concurrent execution with VRAM budgets
- FP16/BF16 TensorCore: Via CUTLASS
- FP16/BF16 support: Half and brain float types
- Type conversion: `astype()` method
- Reduction ops: `sum`, `mean`, `max`
- Operator overloads: `+`, `-`, `*`, `/`, `@`
- Driver-only mode: No cudart dependency
- Dynamic NVRTC: JIT loaded at runtime
- TF32 TensorCore GEMM: PTX mma.sync with cp.async pipeline