v0.2.19 - FLUX.1 Image Generation
Highlights
FLUX.1 Image Generation
Text-to-image generation with Black Forest Labs' FLUX.1 model:
- Full FLUX.1-schnell transformer (19 joint + 38 single blocks)
- Flow matching Euler scheduler
- GPU-native operations (transpose, batched matmul, RoPE)
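The flow matching Euler scheduler mentioned above can be sketched as follows. This is an illustrative host-side sketch, not the library's actual API: flow-matching samplers integrate a learned velocity field `v(x, t)` from noise (`t = 1`) toward the image (`t = 0`) with plain Euler steps. All names here are assumptions.

```cpp
#include <functional>
#include <vector>

using Vec = std::vector<float>;

// One Euler step of the flow ODE dx/dt = v(x, t).
// dt is negative when integrating from noise (t = 1) down to the image (t = 0).
Vec euler_step(const Vec& x, float t, float dt,
               const std::function<Vec(const Vec&, float)>& velocity) {
    Vec v = velocity(x, t);
    Vec out(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        out[i] = x[i] + dt * v[i];
    return out;
}

// Integrate over a fixed number of steps (schnell-class models use few, e.g. 4).
Vec sample(Vec x, int num_steps,
           const std::function<Vec(const Vec&, float)>& velocity) {
    float dt = -1.0f / num_steps;  // from t = 1 down to t = 0
    for (int i = 0; i < num_steps; ++i) {
        float t = 1.0f + i * dt;
        x = euler_step(x, t, dt, velocity);
    }
    return x;
}
```

With a constant velocity field of 1, four steps move each component from 0 to exactly -1, which makes the stepping logic easy to verify in isolation.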
Lazy Model Loading with Streaming
Memory-efficient model loading strategies:
- `StreamingStrategy.EAGER` - Load all at once (default)
- `StreamingStrategy.PROGRESSIVE` - Load during the first forward pass
- `StreamingStrategy.LAYER_BY_LAYER` - Minimal memory usage
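The memory trade-off between the strategies can be summarized with a small sketch. This is not the library's real loader code; the enum and function below are hypothetical, modeling only the peak-residency behavior the release notes describe.

```cpp
enum class StreamingStrategy { EAGER, PROGRESSIVE, LAYER_BY_LAYER };

// Peak number of layers held in memory during one forward pass over
// `num_layers`, under each strategy (illustrative model, not the real API).
int peak_resident_layers(StreamingStrategy s, int num_layers) {
    switch (s) {
        case StreamingStrategy::EAGER:           // everything loaded up front
        case StreamingStrategy::PROGRESSIVE:     // loaded on first use, then kept
            return num_layers;
        case StreamingStrategy::LAYER_BY_LAYER:  // load, run, evict
            return 1;
    }
    return 0;
}
```

EAGER and PROGRESSIVE reach the same peak; PROGRESSIVE only defers the cost to the first forward pass, while LAYER_BY_LAYER trades repeated I/O for a constant-size working set.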
cuBLAS Dynamic Loader
- Runtime DLL loading without compile-time CUDA Toolkit
- Auto-detection of cuBLASLt versions (13/12/11)
- Graceful fallback to native kernels
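The version probe and fallback can be sketched like this. The project's actual loader presumably calls `dlopen`/`LoadLibrary` directly; injecting the probe as a function keeps the sketch testable on machines without CUDA. The function name and signature are assumptions.

```cpp
#include <functional>
#include <string>

// Try cuBLASLt sonames in order of preference (13, then 12, then 11).
// `try_open` stands in for dlopen (Linux) or LoadLibrary (Windows).
// Returns the first version that loads, or 0 meaning "fall back to
// native kernels".
int detect_cublaslt(const std::function<bool(const std::string&)>& try_open) {
    for (int ver : {13, 12, 11}) {
        if (try_open("libcublasLt.so." + std::to_string(ver)))
            return ver;
    }
    return 0;  // graceful fallback: no CUDA Toolkit required at runtime
}
```

Because the probe is injected, the fallback path is exercised simply by passing a probe that always fails, mirroring a machine with no CUDA libraries installed.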
C++ Kernel Profiler
- Built-in CUDA kernel profiling with minimal overhead
- Per-kernel timing statistics
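The statistics-gathering side of such a profiler might look like the sketch below. The real profiler would time GPU work with `cudaEvent` pairs; this host-side sketch (with hypothetical names) shows only the per-kernel aggregation shape.

```cpp
#include <map>
#include <string>

struct KernelStats {
    int    calls    = 0;
    double total_ms = 0.0;
    double max_ms   = 0.0;
};

// Aggregates elapsed times per kernel name (illustrative, not the real API).
class KernelProfiler {
public:
    void record(const std::string& kernel, double elapsed_ms) {
        KernelStats& s = stats_[kernel];
        s.calls    += 1;
        s.total_ms += elapsed_ms;
        if (elapsed_ms > s.max_ms) s.max_ms = elapsed_ms;
    }
    KernelStats get(const std::string& kernel) const {
        auto it = stats_.find(kernel);
        return it == stats_.end() ? KernelStats{} : it->second;
    }
private:
    std::map<std::string, KernelStats> stats_;
};
```

Keeping only running totals and maxima keeps the per-record cost to a map lookup and a few arithmetic ops, consistent with the "minimal overhead" goal.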
HuggingFace T5 Encoder
- Sharded safetensors support
- Full T5 encoder for FLUX/SD3
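Sharded safetensors checkpoints ship an index file mapping each tensor name to the shard that contains it, so a loader only opens the shards it needs. A minimal sketch of that resolution step (illustrative names; the real loader parses `model.safetensors.index.json`):

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Index = std::map<std::string, std::string>;  // tensor name -> shard file

// Which shard files must be opened to load the requested tensors?
std::set<std::string> shards_for(const Index& index,
                                 const std::vector<std::string>& tensors) {
    std::set<std::string> shards;
    for (const auto& t : tensors) {
        auto it = index.find(t);
        if (it != index.end())
            shards.insert(it->second);
    }
    return shards;
}
```

Deduplicating into a set means a request for many tensors from the same shard opens that file once.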
DiT Architecture
- PixArt transformer with AdaLN-Zero
- Self/cross attention with GQA
- GEGLU FFN
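The GEGLU gating in the feed-forward block can be sketched as follows: the up-projection output is split in half, and the GELU of one half gates the other, i.e. GEGLU(x) = GELU(xW) * (xV). This is a reference sketch of the standard formulation; shapes and names are assumptions, not the library's code.

```cpp
#include <cmath>
#include <vector>

// Exact (erf-based) GELU.
float gelu(float x) {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// `h` holds the concatenated [xW | xV] halves of the up-projection output.
std::vector<float> geglu(const std::vector<float>& h) {
    size_t d = h.size() / 2;
    std::vector<float> out(d);
    for (size_t i = 0; i < d; ++i)
        out[i] = gelu(h[i]) * h[d + i];
    return out;
}
```

Note that GELU(0) = 0, so a zero gate half zeroes the output regardless of the value half, and for large positive inputs GELU approaches the identity.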
New GPU Operations
- `transpose_4d_0213`, `transpose_3d_012`
- `gpu_batched_matmul`, `gpu_softmax`, `gpu_apply_rope`
- `cross_attention`, `conv2d`, `group_norm`
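As an example of these ops' semantics, `transpose_4d_0213` (judging by its name) permutes a row-major `[A, B, C, D]` tensor to `[A, C, B, D]`, the layout swap commonly used between `[batch, heads, seq, dim]` and `[batch, seq, heads, dim]` in attention. A CPU reference sketch, not the GPU kernel:

```cpp
#include <vector>

// Permute axes (0, 1, 2, 3) -> (0, 2, 1, 3) of a row-major [A, B, C, D] tensor.
std::vector<float> transpose_4d_0213(const std::vector<float>& in,
                                     int A, int B, int C, int D) {
    std::vector<float> out(in.size());
    for (int a = 0; a < A; ++a)
        for (int b = 0; b < B; ++b)
            for (int c = 0; c < C; ++c)
                for (int d = 0; d < D; ++d)
                    out[((a * C + c) * B + b) * D + d] =
                        in[((a * B + b) * C + c) * D + d];
    return out;
}
```

For a tiny `[1, 2, 2, 1]` input `{0, 1, 2, 3}` the middle axes swap, giving `{0, 2, 1, 3}`.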
Known Issues
- FLUX.1 performance needs optimization (#187)