Skip to content

zoecarver/tt-lang-models

Repository files navigation

tt-lang-models

Reference models that are partially or entirely implemented using TT-Lang.


DFlash is a lightweight cross-attention draft model for speculative decoding on Tenstorrent hardware. It proposes 16 tokens in parallel, verified by a target Qwen3-30B LLM, achieving a 5-6x decoding speedup. Draft model kernels (RoPE, RMSNorm, SiLU, residual adds) run entirely on device via TT-Lang.

Acceptance rate matches the PyTorch reference model. With caching and 120k context, the draft forward pass runs in 93ms (vs 887ms without caching).

Also includes a full Qwen3-Coder-30B-A3B inference implementation, a 48-layer MoE target model running on 4-chip TP with traced execution and zero host transfers in the hot loop. TT-Lang kernels cover RMSNorm, per-head RMSNorm, RoPE, SiLU, residual adds, softmax, cross-attention, and argmax.

A port of the DeepSeek Engram conditional memory module to TT-Lang on Wormhole. Engram uses streaming dataflow kernels with inter-core boundary sharing via PipeNet for overlap-aware depthwise convolution.

Gating Conv All Kernels
TTNN 3.86 ms 0.99 ms 4.84 ms
TT-Lang 1.15 ms 1.02 ms 2.17 ms

Autoregressive inference for Google's Gemma 4 E4B on a single Blackhole chip. TT-Lang kernels cover linear, flash attention, RoPE, SwiGLU, and softcap across all 42 layers with sliding/global attention and KV sharing. Runs at ~17 tok/s.

Inference and training for nanochat entirely in TT-Lang. Every kernel has a backwards version. Single-file implementations at nanochat/ttlang/inference.py and nanochat/ttlang/train.py.

Notable fusions:

  • Fused MLP projection -- replaces 7 dispatches (4 slice matmuls + 3 residual adds) with a single kernel using L1 accumulation via ping-pong DFBs. 13.13 to 15.89 tok/s (+21%).
  • Fused QKV projection -- reads input once and computes Q, K, V in one dispatch, reducing DRAM reads. 12.30 to 13.13 tok/s (+6.7%).

DeepSeek-V4-Flash inference on Tenstorrent Galaxy (4×8 BH), end-to-end in a single self-contained inference.py. Hot ops are TT-Lang fused kernels (RMSNorm, MHC pre/post mixes, log-domain Sinkhorn, KV act-quant, compressor slot-shift and softmax/sum/RMSNorm fusion, SwiGLU); the rest is TTNN. The whole decode step is captured as a TTNN trace replayed in alternation between emit and no-emit branches.

Routed-expert weights are ported from fp4 e2m1 + e8m0 scales to native ttnn.bfloat4_b via an offline lattice cache and an offload-time algebraic remap, so the hot path is a single ttnn.matmul(bf16, bfp4_b) with no per-call dequant.

Real-time Minecraft world generation on Tenstorrent Blackhole using the Oasis 500M diffusion transformer. Runs end-to-end inference (DiT denoising, VAE decode, video output) in a single captured trace at 8 FPS. Supports multi-chip 4-way tensor parallelism.

oasis

TTNN + TT-Lang implementation of Qwen-Image 20B image generation across 4 Blackhole chips.

TT-Lang:

Resolution Steps Time
256x256 4 1.1s
256x256 20 5.3s
512x512 60 37.7s
1024x1024 60 146.6s

XLA:

Resolution CFG Steps Per-step Total
256x256 4.0 15 1.75s 28s
256x256 1.0 15 1.04s 18s
512x512 4.0 20 5.42s 112s

Normalized per-step, TT-Lang is ~4-7x faster at 256x256 and ~8.6x faster at 512x512.

qwen-image

Cell-list molecular dynamics on Tenstorrent hardware using TT-Lang. Full Ewald electrostatics with LJ short-range forces, periodic boundary conditions, and on-device Verlet integration. Validated at 10K atoms, 10K steps, 1.1ms/step.

micelle

UNet-based diffusion world model (DIAMOND, NeurIPS 2024) running on a single Blackhole card. Generates Atari game frames autoregressively using a 4-level encoder/decoder with 3 Euler denoising steps per frame. Runs at ~14 FPS, with interactive browser play across 26 Atari games.

diamond

Video generation from the LingBot-World-Fast 14B DiT model on a 4-chip QuietBox with tensor parallelism. Generates 480x832 video with camera pose conditioning at 0.47 fps. TT-Lang kernels cover 3D RoPE and AdaLN broadcast modulation.

lingbot-world

A Pong world model based on a diffusion transformer, trained on 9 hours of gameplay, running interactively on a single Blackhole card.

toy-wm

About

A collection of models that use tt-lang for some part of their implementation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages