v0.2.19 - FLUX.1 Image Generation

@m96-chan m96-chan released this 01 Jan 19:25
· 7 commits to main since this release
7adbe5f

Highlights

FLUX.1 Image Generation

Text-to-image generation with Black Forest Labs' FLUX.1 model:

  • Full FLUX.1-schnell transformer (19 joint + 38 single blocks)
  • Flow matching Euler scheduler
  • GPU-native operations (transpose, batched matmul, RoPE)
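The flow matching Euler scheduler integrates the model's predicted velocity along a sigma schedule that runs from 1 (pure noise) down to 0. A minimal NumPy sketch of that stepping rule (the function names and the linear schedule here are illustrative, not this project's API):

```python
import numpy as np

def flow_match_euler_step(x, v, sigma, sigma_next):
    """One Euler step of the flow-matching ODE: move x along the
    predicted velocity v, scaled by the sigma decrement."""
    return x + (sigma_next - sigma) * v

def sample(velocity_model, x, num_steps=4):
    """Toy denoising loop over a linear sigma schedule (1 -> 0).
    `velocity_model` stands in for the FLUX.1 transformer's output."""
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_model(x, sigma)
        x = flow_match_euler_step(x, v, sigma, sigma_next)
    return x
```

Because the sigma decrements sum to exactly -1, a constant unit velocity carries the sample a total distance of one, regardless of the step count.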

Lazy Model Loading with Streaming

Memory-efficient model loading strategies:

  • StreamingStrategy.EAGER - Load all at once (default)
  • StreamingStrategy.PROGRESSIVE - Load during first forward
  • StreamingStrategy.LAYER_BY_LAYER - Minimal memory usage
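The trade-off between the three strategies can be sketched as a loader that decides when weights become resident; this is an illustrative Python model of the idea, not the library's actual interface:

```python
from enum import Enum, auto

class StreamingStrategy(Enum):
    EAGER = auto()           # materialize every layer up front
    PROGRESSIVE = auto()     # load lazily, keep everything loaded
    LAYER_BY_LAYER = auto()  # keep at most one layer resident

def iter_layers(layer_names, load, strategy):
    """Yield loaded layers according to the chosen strategy (sketch;
    `load` stands in for reading one layer's weights from disk)."""
    if strategy is StreamingStrategy.EAGER:
        resident = {name: load(name) for name in layer_names}
        for name in layer_names:
            yield resident[name]
    else:
        cache = {}
        for name in layer_names:
            if name not in cache:
                cache[name] = load(name)  # first-touch load
            yield cache[name]
            if strategy is StreamingStrategy.LAYER_BY_LAYER:
                cache.pop(name)  # release before the next layer loads
```

EAGER pays all I/O before the first forward pass; LAYER_BY_LAYER bounds peak memory to roughly one layer at the cost of reloading on every pass.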

cuBLAS Dynamic Loader

  • Runtime DLL loading without compile-time CUDA Toolkit
  • Auto-detection of cuBLASLt versions (13/12/11)
  • Graceful fallback to native kernels
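The probe-newest-first-then-fall-back pattern behind the loader can be sketched with `ctypes` (the candidate DLL names below assume Windows naming for cuBLASLt 13/12/11; the helper itself is hypothetical):

```python
import ctypes

# Newest first, matching the 13/12/11 auto-detection order.
CUBLASLT_CANDIDATES = (
    "cublasLt64_13.dll",
    "cublasLt64_12.dll",
    "cublasLt64_11.dll",
)

def load_first(candidates):
    """Return a handle to the first library that loads at runtime,
    or None so the caller can fall back to native kernels."""
    for name in candidates:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue  # this version is not installed; try the next
    return None
```

Because resolution happens at runtime, no CUDA Toolkit headers or import libraries are needed at compile time; a `None` result simply routes dispatch to the native kernel path.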

C++ Kernel Profiler

  • Built-in CUDA kernel profiling with minimal overhead
  • Per-kernel timing statistics
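Per-kernel statistics of this kind boil down to accumulating call counts and elapsed time keyed by kernel name. A host-side Python sketch of that bookkeeping (the real profiler times CUDA kernels on-device with far lower overhead; this class is illustrative only):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class KernelProfiler:
    """Accumulate per-kernel call counts and total wall-clock time."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"calls": 0, "total_s": 0.0})

    @contextmanager
    def profile(self, kernel_name):
        start = time.perf_counter()
        try:
            yield
        finally:
            entry = self.stats[kernel_name]
            entry["calls"] += 1
            entry["total_s"] += time.perf_counter() - start
```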

HuggingFace T5 Encoder

  • Sharded safetensors support
  • Full T5 encoder for FLUX/SD3
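Sharded safetensors checkpoints on the HuggingFace Hub ship a `model.safetensors.index.json` whose `weight_map` maps each tensor name to the shard file containing it. A sketch of resolving which shards a loader must open (the helper name is hypothetical; the `weight_map` key is the standard HF index format):

```python
import json

def shards_for(index_json, tensor_names):
    """Map each requested tensor to its shard file, using the
    `weight_map` of a HuggingFace safetensors index."""
    weight_map = json.loads(index_json)["weight_map"]
    return {name: weight_map[name] for name in tensor_names}
```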

DiT Architecture

  • PixArt transformer with AdaLN-Zero
  • Self/cross attention with GQA
  • GEGLU FFN
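A GEGLU feed-forward projects to twice the hidden width, splits the result into a value half and a gate half, and multiplies the value by GELU of the gate. A minimal NumPy reference of that block (illustrative; not this project's kernel code):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, common in transformer FFNs
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, w_in, w_out):
    """GEGLU FFN: project to 2*hidden, split into (value, gate),
    gate the value with GELU, then project back down."""
    h = x @ w_in                          # (..., 2 * hidden)
    value, gate = np.split(h, 2, axis=-1)
    return (value * gelu(gate)) @ w_out   # (..., d_model)
```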

New GPU Operations

  • transpose_4d_0213, transpose_3d_012
  • gpu_batched_matmul, gpu_softmax, gpu_apply_rope
  • cross_attention, conv2d, group_norm
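The digits in `transpose_4d_0213` name the axis permutation (0, 2, 1, 3) — the usual (batch, seq, heads, dim) to (batch, heads, seq, dim) shuffle performed before batched attention matmuls. A NumPy reference for those semantics (the GPU op itself runs as a kernel; this only documents the expected layout):

```python
import numpy as np

def transpose_4d_0213(x):
    """Reference semantics: permute a 4-D tensor's axes to (0, 2, 1, 3)
    and make the result contiguous, as a GPU transpose kernel would."""
    return np.ascontiguousarray(x.transpose(0, 2, 1, 3))
```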

Known Issues

  • FLUX.1 performance needs optimization (#187)

Full Changelog

v0.2.18...v0.2.19