tt-lang-models

Reference models that are partially or entirely implemented using TT-Lang.

DFlash

DFlash is a lightweight cross-attention draft model for speculative decoding on Tenstorrent hardware. It proposes 16 tokens in parallel, verified by a target Qwen3-30B LLM, achieving a 5-6x decoding speedup. Draft model kernels (RoPE, RMSNorm, SiLU, residual adds) run entirely on device via TT-Lang.

Acceptance rate matches the PyTorch reference model. With caching and 120k context, the draft forward pass runs in 93ms (vs 887ms without caching).

Also includes a full Qwen3-Coder-30B-A3B inference implementation, a 48-layer MoE target model running on 4-chip TP with traced execution and zero host transfers in the hot loop. TT-Lang kernels cover RMSNorm, per-head RMSNorm, RoPE, SiLU, residual adds, softmax, cross-attention, and argmax.

Engram

A port of the DeepSeek Engram conditional memory module to TT-Lang on Wormhole. Engram uses streaming dataflow kernels with inter-core boundary sharing via PipeNet for overlap-aware depthwise convolution.

	Gating	Conv	All Kernels
TTNN	3.86 ms	0.99 ms	4.84 ms
TT-Lang	1.15 ms	1.02 ms	2.17 ms

Gemma 4

Autoregressive inference for Google's Gemma 4 E4B on a single Blackhole chip. TT-Lang kernels cover linear, flash attention, RoPE, SwiGLU, and softcap across all 42 layers with sliding/global attention and KV sharing. Runs at ~17 tok/s.

nanochat

Inference and training for nanochat entirely in TT-Lang. Every kernel has a backwards version. Single-file implementations at nanochat/ttlang/inference.py and nanochat/ttlang/train.py.

Notable fusions:

Fused MLP projection -- replaces 7 dispatches (4 slice matmuls + 3 residual adds) with a single kernel using L1 accumulation via ping-pong DFBs. 13.13 to 15.89 tok/s (+21%).
Fused QKV projection -- reads input once and computes Q, K, V in one dispatch, reducing DRAM reads. 12.30 to 13.13 tok/s (+6.7%).

DeepSeek-V4-Flash

DeepSeek-V4-Flash inference on Tenstorrent Galaxy (4×8 BH), end-to-end in a single self-contained inference.py. Hot ops are TT-Lang fused kernels (RMSNorm, MHC pre/post mixes, log-domain Sinkhorn, KV act-quant, compressor slot-shift and softmax/sum/RMSNorm fusion, SwiGLU); the rest is TTNN. The whole decode step is captured as a TTNN trace replayed in alternation between emit and no-emit branches.

Routed-expert weights are ported from fp4 e2m1 + e8m0 scales to native ttnn.bfloat4_b via an offline lattice cache and an offload-time algebraic remap, so the hot path is a single ttnn.matmul(bf16, bfp4_b) with no per-call dequant.

Oasis

Real-time Minecraft world generation on Tenstorrent Blackhole using the Oasis 500M diffusion transformer. Runs end-to-end inference (DiT denoising, VAE decode, video output) in a single captured trace at 8 FPS. Supports multi-chip 4-way tensor parallelism.

Qwen-Image

TTNN + TT-Lang implementation of Qwen-Image 20B image generation across 4 Blackhole chips.

TT-Lang:

Resolution	Steps	Time
256x256	4	1.1s
256x256	20	5.3s
512x512	60	37.7s
1024x1024	60	146.6s

XLA:

Resolution	CFG	Steps	Per-step	Total
256x256	4.0	15	1.75s	28s
256x256	1.0	15	1.04s	18s
512x512	4.0	20	5.42s	112s

Normalized per-step, TT-Lang is ~4-7x faster at 256x256 and ~8.6x faster at 512x512.

Micelle MD

Cell-list molecular dynamics on Tenstorrent hardware using TT-Lang. Full Ewald electrostatics with LJ short-range forces, periodic boundary conditions, and on-device Verlet integration. Validated at 10K atoms, 10K steps, 1.1ms/step.

Diamond

UNet-based diffusion world model (DIAMOND, NeurIPS 2024) running on a single Blackhole card. Generates Atari game frames autoregressively using a 4-level encoder/decoder with 3 Euler denoising steps per frame. Runs at ~14 FPS, with interactive browser play across 26 Atari games.

LingBot-World (WIP)

Video generation from the LingBot-World-Fast 14B DiT model on a 4-chip QuietBox with tensor parallelism. Generates 480x832 video with camera pose conditioning at 0.47 fps. TT-Lang kernels cover 3D RoPE and AdaLN broadcast modulation.

Toy World Model

A Pong world model based on a diffusion transformer, trained on 9 hours of gameplay, running interactively on a single Blackhole card.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
Engram @ ff8e1ac		Engram @ ff8e1ac
deepseek @ 56630aa		deepseek @ 56630aa
dflash @ 5e0d24d		dflash @ 5e0d24d
diamond @ b719c3a		diamond @ b719c3a
doc		doc
gemma4 @ fef2e9b		gemma4 @ fef2e9b
lingbot-world @ 1085222		lingbot-world @ 1085222
micelle-demo		micelle-demo
nanochat @ 16d92bf		nanochat @ 16d92bf
open-oasis @ ce51a2d		open-oasis @ ce51a2d
qwen-image-tt-xla @ e79e1af		qwen-image-tt-xla @ e79e1af
toy-wm @ 64e5b07		toy-wm @ 64e5b07
.DS_Store		.DS_Store
.gitmodules		.gitmodules
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tt-lang-models

DFlash

Engram

Gemma 4

nanochat

DeepSeek-V4-Flash

Oasis

Qwen-Image

Micelle MD

Diamond

LingBot-World (WIP)

Toy World Model

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

tt-lang-models

LingBot-World (WIP)

About

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Uh oh!

Languages