LingBot-VLA is a Qwen2.5-VL backbone + flow-matching action expert. This doc covers building, the FA4 fast path, accuracy/latency, and running it on Jetson AGX Thor (sm_110).
Status: low-level path only. LingBot runs through the low-level
`graph_runner` path (`graph_runner.sample_actions_graph`: weight spec + CUDA-graph capture). It is not registered in `_PIPELINE_MAP` and `flash_rt.load_model` does not dispatch a `lingbot` config; this is not the stable `load_model()` API. `LingbotTorchFrontendThor` (`flash_rt/frontends/torch/lingbot_thor.py`) is a G1 scaffold whose methods raise `NotImplementedError`. Use `examples/lingbot_quickstart.py` / `benchmarks/lingbot_thor_latency.py`, not `flash_rt.load_model("lingbot")`.
| stage | layers | notes |
|---|---|---|
| ViT (SigLIP-style) | 32 | 3 camera views, 224² |
| VLM prefix (Qwen2.5-VL) | 36 | FP8; FA4 for the prefix self-attention |
| Action expert (flow-matching) | 36 | per-step denoise loop, FP8 + FP4 gate_up; FA4 denoise attention |
Action chunk: [1, 50, 75] (horizon 50, action dim 75). Denoise step count is
configurable (10 / 25 / 50).
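
To illustrate how the denoise step count enters, here is a minimal sketch of a flow-matching sampler using plain Euler integration. It assumes integration from t=0 (noise) to t=1 (actions) with uniform steps; `denoise` and `toy_velocity` are hypothetical names, and the toy velocity field only stands in for the action expert. LingBot's actual schedule and time convention may differ.

```python
import numpy as np

HORIZON, ACTION_DIM = 50, 75  # action chunk shape from this doc: [1, 50, 75]


def denoise(velocity_fn, noise, num_steps):
    """Euler-integrate a velocity field from t=0 (noise) to t=1 (actions).

    velocity_fn(x, t) is a hypothetical stand-in for the action expert.
    """
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
noise = rng.standard_normal((1, HORIZON, ACTION_DIM)).astype(np.float32)
target = np.ones_like(noise)


def toy_velocity(x, t):
    # Straight-line velocity from the initial noise toward `target` (toy field).
    return target - noise


for steps in (10, 25, 50):
    actions = denoise(toy_velocity, noise, steps)
    print(steps, actions.shape, float(np.abs(actions - target).max()))
```

With a real learned velocity field, more steps generally cost more latency for a finer integration, which is the trade-off behind the 10 / 25 / 50 options.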
LingBot's model-specific kernels (fused AdaRMSNorm, SwiGLU tail, QKV+RoPE) are
compiled into flash_rt_kernels — same pattern as the qwen36 kernels in
csrc/kernels/. There is no separate flash_rt_lingbot.so. The kernels are
lingbot_-prefixed and gated behind ENABLE_LINGBOT, built only when
FLASHRT_ENABLE_LINGBOT=ON (default) and GPU_ARCH=110 (Thor). RTX
(sm_120) / Orin / L40 builds compile neither the sources nor the bindings.
git clone --depth 1 --branch v4.4.2 \
https://github.com/NVIDIA/cutlass.git third_party/cutlass
cmake -B build -S . -DGPU_ARCH=110 # -DFLASHRT_ENABLE_LINGBOT=OFF to skip
cmake --build build -j --target flash_rt_kernels flash_rt_fp4 fmha_fp16_strided
pip install -e ".[torch,thor-fa4]"

Sanity-check that the LingBot kernels and FA4 are present:
python - <<'PY'
import flash_rt.flash_rt_kernels as k
print("lingbot kernels:", sum(x.startswith("lingbot_") for x in dir(k))) # 15
from flash_rt.hardware.thor import fa4_backend
print("FA4:", fa4_backend.is_available(), "-", fa4_backend.status()) # True - active
PY

FA4 (CuTe-DSL) gives the denoise + prefix attention ~17% over the fmha path
(pack_gqa, cosine preserved). On Thor it must be compiled for sm_101a (the
sm_110 Blackwell alias; fa4_backend sets CUTE_DSL_ARCH=sm_101a for you).
- The FA4 forward source is vendored, trimmed, and privately namespaced at `csrc/attention/flash_attn_4_src/` (package `flashrt_fa4`, a forward / SM100-only subset of `flash_attn/cute`; see its `VENDOR.md`). No `flash-attn` wheel is needed, and it never shadows a pip-installed `flash_attn`.

  Install note: the vendor lives under `csrc/` and is loaded from the source tree, so it is not bundled into a built wheel. Use an editable / source-tree install (`pip install -e .`); a plain `pip install .` wheel would not ship it (FA4 would silently fall back to fmha). If wheel packaging is needed later, move the vendor under `flash_rt/` or add it to `package-data`.
- The import is isolated in `flash_rt/hardware/thor/fa4_backend.py`, the only place that touches FA4. It returns `None` (→ fmha fallback) if unavailable.
- Its runtime deps (`nvidia-cutlass-dsl`, `quack-kernels`) come from the `thor-fa4` extra: `pip install ".[thor-fa4]"`. They are not in `all`.
- FA4 is an optional fast path: if its deps are missing it silently falls back to the fmha kernel (correct, ~+18 ms@25).
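
The isolate-and-fall-back pattern described above can be sketched generically. This is not the contents of `fa4_backend.py`: `load_optional_backend` is a hypothetical helper, and treating an unset `FLASHRT_THOR_FA4` as enabled is an assumption.

```python
import importlib
import os


def load_optional_backend(module_name, env_var="FLASHRT_THOR_FA4", environ=None):
    """Return the imported module, or None to signal 'use the fallback kernel'."""
    environ = os.environ if environ is None else environ
    if environ.get(env_var, "1") == "0":  # explicit opt-out (assumed default: on)
        return None
    try:
        return importlib.import_module(module_name)
    except ImportError:  # missing package or missing runtime dependency
        return None


backend = load_optional_backend("flashrt_fa4")
print("attention path:", "FA4" if backend is not None else "fmha fallback")
```

Keeping the optional import in one function means callers only ever check for `None`, so a missing dependency degrades to the slower path instead of raising at import time.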
A/B the attention path:
FLASHRT_THOR_FA4=1 python benchmarks/lingbot_thor_latency.py ... # FA4
FLASHRT_THOR_FA4=0 python benchmarks/lingbot_thor_latency.py ... # force fmha

The benchmark prints FA4 status: active or the failure reason. To debug an
unexpected fallback, set LINGBOT_FA4_DEBUG=1 to print the import traceback,
and check (1) pip install .[thor-fa4], (2) the vendored
csrc/attention/flash_attn_4_src exists, (3) FLASHRT_THOR_FA4 is not 0.
python examples/lingbot_quickstart.py \
--checkpoint /path/to/lingbot-vla-4b \
--calibration /path/to/lingbot_thor_static.json \
--inputs /path/to/baseline_artifacts_10/inputs \
--steps 50 25 10

--checkpoint is the lingbot-vla-4b/ dir (model.safetensors + config.json;
modelscope download --model Robbyant/lingbot-vla-4b). --inputs is a dir of
images/img_masks/lang_tokens/lang_masks/state/noise .pt tensors.
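
Before launching the quickstart, it can help to confirm the inputs directory is complete. The sketch below assumes each tensor is stored as `<name>.pt` directly inside the directory (the exact file layout is not stated here), and `missing_inputs` is a hypothetical helper.

```python
from pathlib import Path

# Tensor names listed in this doc; the "<name>.pt" file layout is an assumption.
EXPECTED = ("images", "img_masks", "lang_tokens", "lang_masks", "state", "noise")


def missing_inputs(inputs_dir):
    """Return the expected tensor names with no matching .pt file in inputs_dir."""
    root = Path(inputs_dir)
    return [name for name in EXPECTED if not (root / f"{name}.pt").is_file()]


print(missing_inputs("/path/to/baseline_artifacts_10/inputs"))
```

An empty list means every expected file was found; anything else names the tensors to regenerate or copy over.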
Measured back-to-back A/B (fixed noise from baseline_artifacts_10). Cosine is
the action chunk [1,50,75] vs the upstream LingBot BF16 PyTorch reference
(baseline_artifacts_10/outputs/actions.pt, available for the 10-step run):
| attention path | steps | cosine vs 10-step ref | P50 |
|---|---|---|---|
| upstream LingBot BF16 (reference) | 10 | 1.000000 | — |
| FlashRT (FA4) | 10 | 0.996245 | 64.1 ms |
| FlashRT (fmha fallback) | 10 | 0.996067 | 73.0 ms |
| FlashRT (FA4) | 25 | 0.995721 | 97.5 ms |
| FlashRT (fmha fallback) | 25 | 0.994890 | 118.1 ms |
| FlashRT (FA4) | 50 | 0.995455 | 155.8 ms |
| FlashRT (fmha fallback) | 50 | 0.994928 | 193.8 ms |
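
A cosine figure like those in the table can be computed by flattening both action chunks and taking the normalized dot product. This is a sketch under that assumption (the exact reduction used by the benchmark is not specified here); `chunk_cosine` is a hypothetical helper and the data is synthetic.

```python
import numpy as np


def chunk_cosine(actions, reference):
    """Cosine similarity between two action chunks, computed over all elements."""
    a = np.asarray(actions, dtype=np.float64).ravel()
    b = np.asarray(reference, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 50, 75))  # stand-in for the BF16 reference
actions = reference + 0.05 * rng.standard_normal(reference.shape)  # synthetic error
print(f"cosine = {chunk_cosine(actions, reference):.6f}")
```

With real data, `reference` would be the tensor loaded from `baseline_artifacts_10/outputs/actions.pt` and `actions` the FlashRT output for the same fixed noise.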
LingBot model cleanup baseline:
| ns | Baseline latency |
|---|---|
| 5 | 1501 ms |
| 10 | 1741 ms |
| 50 | 2481 ms |
TRT-aligned FP4 loop comparison, using the same quantization scheme:
| steps | TRT aligned FP4 loop | FlashRT full E2E | Speedup |
|---|---|---|---|
| 10 | ~122 ms | 64.1 ms | ~1.9x |
| 25 | ~304 ms | 97.5 ms | ~3.1x |
| 50 | ~608 ms | 155.8 ms | ~3.9x |
- Reference: upstream LingBot BF16, fixed-noise action chunk, 10 denoise steps (`baseline_artifacts_10/outputs/actions.pt`). The 25/50-step rows are compared against the same 10-step reference, hence the slightly lower cosine; they are not a step-matched comparison.
- Acceptance: cosine ≥ 0.995 vs the step-matched BF16 reference (FP8 bring-up threshold), met by both paths at 10 steps.
- FA4 vs fmha are numerically equivalent paths (the FP8/FP4 GEMMs are identical;
only the attention kernel differs); FA4 is ~15–20% faster (e.g. 155.8 vs
193.8 ms @50, 97.5 vs 118.1 ms @25), measured back-to-back with
`FLASHRT_THOR_FA4=1` vs `=0`.
Thor has CUDA-graph tactic jitter (±2–3 ms); always A/B back-to-back and don't compare runs taken at different times.
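
One way to keep an A/B comparison back-to-back is to alternate the two variants inside a single loop and compare medians. The sketch below is a generic in-process timer, not the repo's benchmark (which toggles `FLASHRT_THOR_FA4` between runs); `ab_p50` is a hypothetical helper and the workloads are CPU stand-ins.

```python
import statistics
import time


def ab_p50(fn_a, fn_b, iters=50, warmup=5):
    """Time fn_a and fn_b in alternation; return their median latencies in ms."""
    for _ in range(warmup):
        fn_a()
        fn_b()
    samples = {"a": [], "b": []}
    for _ in range(iters):
        for key, fn in (("a", fn_a), ("b", fn_b)):
            start = time.perf_counter()
            fn()  # for GPU work, fn must synchronize before returning
            samples[key].append((time.perf_counter() - start) * 1e3)
    return {key: statistics.median(vals) for key, vals in samples.items()}


# Stand-in CPU workloads; replace with the two inference paths under test.
print(ab_p50(lambda: sum(range(20_000)), lambda: sum(range(40_000))))
```

Alternating the calls exposes both variants to the same thermal and clock conditions, so slow drift affects them equally rather than biasing one side.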
- Thor (sm_110) only. The kernels and FA4 are additive: other hardware builds neither compile the `lingbot_*` sources (gated) nor inherit the FA4 deps. There is no RTX / JAX LingBot path.
- All intermediate buffers are pre-allocated; the denoise loop is captured into a CUDA Graph (no dynamic allocation on the hot path). Shapes (action dim 75, horizon 50) and step counts are fixed per captured graph.
- FP8 static scales come from the calibration JSON (`docs/calibration.md` contract); calibration is required.