v0.2.19 - FLUX.1 Image Generation
Highlights
FLUX.1 Image Generation
Text-to-image generation with Black Forest Labs' FLUX.1 model:
- Full FLUX.1-schnell transformer (19 joint + 38 single blocks)
- Flow matching Euler scheduler
- GPU-native operations (transpose, batched matmul, RoPE)
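The flow matching Euler scheduler mentioned above can be sketched as follows. This is an illustrative host-side sketch, not the library's actual API: flow-matching samplers integrate a learned velocity field `v(x, t)` from noise (`t = 1`) toward the image (`t = 0`) with plain Euler steps. All names here are assumptions.

```cpp
#include <functional>
#include <vector>

using Vec = std::vector<float>;

// One Euler step of the flow ODE dx/dt = v(x, t).
// dt is negative when integrating from noise (t = 1) down to the image (t = 0).
Vec euler_step(const Vec& x, float t, float dt,
               const std::function<Vec(const Vec&, float)>& velocity) {
    Vec v = velocity(x, t);
    Vec out(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        out[i] = x[i] + dt * v[i];
    return out;
}

// Integrate over a fixed number of steps (schnell-class models use few, e.g. 4).
Vec sample(Vec x, int num_steps,
           const std::function<Vec(const Vec&, float)>& velocity) {
    float dt = -1.0f / num_steps;  // from t = 1 down to t = 0
    for (int i = 0; i < num_steps; ++i) {
        float t = 1.0f + i * dt;
        x = euler_step(x, t, dt, velocity);
    }
    return x;
}
```

With a constant velocity field of 1, four steps move each component from 0 to exactly -1, which makes the stepping logic easy to verify in isolation.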
Lazy Model Loading with Streaming
Memory-efficient model loading strategies:
- `StreamingStrategy.EAGER` - Load all at once (default)
- `StreamingStrategy.PROGRESSIVE` - Load during the first forward pass
- `StreamingStrategy.LAYER_BY_LAYER` - Minimal memory usage
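The memory trade-off between the strategies can be summarized with a small sketch. This is not the library's real loader code; the enum and function below are hypothetical, modeling only the peak-residency behavior the release notes describe.

```cpp
enum class StreamingStrategy { EAGER, PROGRESSIVE, LAYER_BY_LAYER };

// Peak number of layers held in memory during one forward pass over
// `num_layers`, under each strategy (illustrative model, not the real API).
int peak_resident_layers(StreamingStrategy s, int num_layers) {
    switch (s) {
        case StreamingStrategy::EAGER:           // everything loaded up front
        case StreamingStrategy::PROGRESSIVE:     // loaded on first use, then kept
            return num_layers;
        case StreamingStrategy::LAYER_BY_LAYER:  // load, run, evict
            return 1;
    }
    return 0;
}
```

EAGER and PROGRESSIVE reach the same peak; PROGRESSIVE only defers the cost to the first forward pass, while LAYER_BY_LAYER trades repeated I/O for a constant-size working set.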
cuBLAS Dynamic Loader
- Runtime DLL loading without compile-time CUDA Toolkit
- Auto-detection of cuBLASLt versions (13/12/11)
- Graceful fallback to native kernels
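The version probe and fallback can be sketched like this. The project's actual loader presumably calls `dlopen`/`LoadLibrary` directly; injecting the probe as a function keeps the sketch testable on machines without CUDA. The function name and signature are assumptions.

```cpp
#include <functional>
#include <string>

// Try cuBLASLt sonames in order of preference (13, then 12, then 11).
// `try_open` stands in for dlopen (Linux) or LoadLibrary (Windows).
// Returns the first version that loads, or 0 meaning "fall back to
// native kernels".
int detect_cublaslt(const std::function<bool(const std::string&)>& try_open) {
    for (int ver : {13, 12, 11}) {
        if (try_open("libcublasLt.so." + std::to_string(ver)))
            return ver;
    }
    return 0;  // graceful fallback: no CUDA Toolkit required at runtime
}
```

Because the probe is injected, the fallback path is exercised simply by passing a probe that always fails, mirroring a machine with no CUDA libraries installed.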
C++ Kernel Profiler
- Built-in CUDA kernel profiling with minimal overhead
- Per-kernel timing statistics
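The statistics-gathering side of such a profiler might look like the sketch below. The real profiler would time GPU work with `cudaEvent` pairs; this host-side sketch (with hypothetical names) shows only the per-kernel aggregation shape.

```cpp
#include <map>
#include <string>

struct KernelStats {
    int    calls    = 0;
    double total_ms = 0.0;
    double max_ms   = 0.0;
};

// Aggregates elapsed times per kernel name (illustrative, not the real API).
class KernelProfiler {
public:
    void record(const std::string& kernel, double elapsed_ms) {
        KernelStats& s = stats_[kernel];
        s.calls    += 1;
        s.total_ms += elapsed_ms;
        if (elapsed_ms > s.max_ms) s.max_ms = elapsed_ms;
    }
    KernelStats get(const std::string& kernel) const {
        auto it = stats_.find(kernel);
        return it == stats_.end() ? KernelStats{} : it->second;
    }
private:
    std::map<std::string, KernelStats> stats_;
};
```

Keeping only running totals and maxima keeps the per-record cost to a map lookup and a few arithmetic ops, consistent with the "minimal overhead" goal.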
HuggingFace T5 Encoder
- Sharded safetensors support
- Full T5 encoder for FLUX/SD3
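Sharded safetensors checkpoints ship an index file mapping each tensor name to the shard that contains it, so a loader only opens the shards it needs. A minimal sketch of that resolution step (illustrative names; the real loader parses `model.safetensors.index.json`):

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Index = std::map<std::string, std::string>;  // tensor name -> shard file

// Which shard files must be opened to load the requested tensors?
std::set<std::string> shards_for(const Index& index,
                                 const std::vector<std::string>& tensors) {
    std::set<std::string> shards;
    for (const auto& t : tensors) {
        auto it = index.find(t);
        if (it != index.end())
            shards.insert(it->second);
    }
    return shards;
}
```

Deduplicating into a set means a request for many tensors from the same shard opens that file once.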
DiT Architecture
- PixArt transformer with AdaLN-Zero
- Self/cross attention with GQA
- GEGLU FFN
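The GEGLU gating in the feed-forward block can be sketched as follows: the up-projection output is split in half, and the GELU of one half gates the other, i.e. GEGLU(x) = GELU(xW) * (xV). This is a reference sketch of the standard formulation; shapes and names are assumptions, not the library's code.

```cpp
#include <cmath>
#include <vector>

// Exact (erf-based) GELU.
float gelu(float x) {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// `h` holds the concatenated [xW | xV] halves of the up-projection output.
std::vector<float> geglu(const std::vector<float>& h) {
    size_t d = h.size() / 2;
    std::vector<float> out(d);
    for (size_t i = 0; i < d; ++i)
        out[i] = gelu(h[i]) * h[d + i];
    return out;
}
```

Note that GELU(0) = 0, so a zero gate half zeroes the output regardless of the value half, and for large positive inputs GELU approaches the identity.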
New GPU Operations
- `transpose_4d_0213`, `transpose_3d_012`
- `gpu_batched_matmul`, `gpu_softmax`, `gpu_apply_rope`
- `cross_attention`, `conv2d`, `group_norm`
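As an example of these ops' semantics, `transpose_4d_0213` (judging by its name) permutes a row-major `[A, B, C, D]` tensor to `[A, C, B, D]`, the layout swap commonly used between `[batch, heads, seq, dim]` and `[batch, seq, heads, dim]` in attention. A CPU reference sketch, not the GPU kernel:

```cpp
#include <vector>

// Permute axes (0, 1, 2, 3) -> (0, 2, 1, 3) of a row-major [A, B, C, D] tensor.
std::vector<float> transpose_4d_0213(const std::vector<float>& in,
                                     int A, int B, int C, int D) {
    std::vector<float> out(in.size());
    for (int a = 0; a < A; ++a)
        for (int b = 0; b < B; ++b)
            for (int c = 0; c < C; ++c)
                for (int d = 0; d < D; ++d)
                    out[((a * C + c) * B + b) * D + d] =
                        in[((a * B + b) * C + c) * D + d];
    return out;
}
```

For a tiny `[1, 2, 2, 1]` input `{0, 1, 2, 3}` the middle axes swap, giving `{0, 2, 1, 3}`.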
Known Issues
- FLUX.1 performance needs optimization (#187)