Status: design draft · 2026-05-28 · branch
spec/exec-contractThis layer defines FlashRT's common execution contract: a clean, minimal C ABI that takes the already-proven "kernelize + CUDA Graph replay" seam — today an implicit convention scattered across the Python frontends — and turns it into an explicit, embeddable common layer. It fixes mechanism only, never scenario policy, and has zero dependency on the csrc kernel layer.
FlashRT runs a model by capturing its kernel sequence into a CUDA Graph once (setup), then
replaying that graph forever (hot path). This contract is a thin C ABI that takes over only the
replay side. A model frontend captures its graph however it likes — torch (torch.cuda.graph) or
the framework's ctypes CUDAGraph — and adopts the instantiated graph-exec into the contract;
the contract then owns replay, shape-variant selection, zero-copy buffer hand-off, streams, events,
and a Plan DAG. It adds zero replay overhead (the very same cudaGraphLaunch) and the default
path is byte-identical. All capture-time work — autotune, calibration, warmup — stays in the
frontend and is fully under the caller's explicit control; the contract never sees it.
Buffer named device memory; the ONLY "state" (KV cache, scales, noise, subgoal embedding...).
Graphs are wired together by SHARING one Buffer (zero-copy hand-off).
Graph a ShapeKey -> graph-exec variant table (exact key + LRU). capture (own) or adopt
(external exec); replay(key, stream); bind named I/O ports.
Plan a dumb DAG of (Graph, ShapeKey) replays across streams with explicit cross-stream deps.
Data dependencies only — no priority/deadline/preemption (that's upper-layer policy).
ShapeKey opaque u64 encoding (B, S, ...). Batch is just a FIELD of the key, not a new axis.
Plus mechanism helpers: prioritized + wrappable streams (incl. the default stream), cross-stream
events, and device-to-device buffer_copy. Lives in a top-level exec/ layer, sibling to csrc/
with zero dependency on it (§5). Capture/autotune/calibration/sessions/schedulers/protocols are
out of scope — the upper layer (frontend / serving) owns them; the contract only adapts (§8.5).
frt_ctx c = frt_ctx_create();
/* (1) LLM/VLA: drive a graph the frontend already captured, bit-identical */
frt_graph g = frt_graph_create(c, "decode", /*max_variants=*/256); /* exact-key LRU table */
frt_graph_adopt(g, /*key=*/cur_pos, external_exec); /* e.g. torch raw_cuda_graph_exec() */
int s = frt_ctx_wrap_stream(c, host_stream); /* reuse the frontend's own stream */
frt_graph_replay(g, /*key=*/cur_pos, s); /* == cudaGraphLaunch, zero overhead */
/* (2) Multi-subgraph (VLA): vision -> encoder -> action, zero-copy + cross-stream */
frt_buffer x = frt_buffer_alloc(c, "enc_x", nbytes);
frt_graph_bind(vision, "out", x); frt_graph_bind(encoder, "in", x); /* same buffer => no copy */
frt_plan p = frt_plan_create(c);
int a = frt_plan_add(p, vision, 0, s0);
int b = frt_plan_add(p, encoder, 0, s1);
frt_plan_after(p, b, a); /* encoder (s1) waits vision (s0) via an event */
frt_plan_execute(p, 0);Proven (v1): drives a real LLM (Qwen3.6 decode+spec — bit-identical, 133 tok/s, no regression) and a real VLA (Pi0.5 FP16 + FP8 — cosine 1.0); every primitive validated. Acceptance: §8.
Everything model-specific and iterative is setup (cold, runs once): weight load, autotune,
calibration (single / multi-frame N>1 dataset / int8-static / AWQ), graph capture, and adopt.
It stays in the Python frontend — fastest to develop, inspect, debug; adding a model = a script, no
recompile. The contract sees only the result: an adopted graph (+ variants) + buffers (scales/state).
SETUP (cold, once, model-specific) HOT PATH (replay, hot, model-agnostic)
─────────────────────────────────── ──────────────────────────────────────
Python frontend: host loop (Python today; C++/Rust at deploy):
load weights write inputs -> Buffer
calibrate (N>1 dataset) / autotune ──▶ frt_graph_replay / frt_plan_execute ──▶ GPU
capture graph (torch or ctypes) read outputs kernel
frt_graph.adopt(exec) / buffer.wrap() (one cudaGraphLaunch per replay = pure GPU) graph
Why this split is optimal: the rigid, debug-heavy "engine-as-code" problem (TensorRT-LLM-style one model = one compiled engine) comes from compiling model logic. Here the kernel sequence is captured at runtime into a graph (data), so the C++ layer holds zero model-specific code — one binary drives all models. Flexibility lives in Python; only the stable, mechanical replay is native.
(A) today — Python frontend drives; exec is the C++ library it calls
Python frontend ──pybind──▶ C++ exec (libflashrt_exec) ──cudaGraphLaunch──▶ GPU kernel graph
(pybind crosses the SAME C ABI a native host would; Python is the loop driver for now)
(B) deployment — a native host drives; Python only at setup (same process) or absent
C++/Rust host ──C ABI──▶ C++ exec (libflashrt_exec.so) ──cudaGraphLaunch──▶ GPU kernel graph
(no Python in the hot loop; capture/adopt done once — in-process Python, or ported to C++)
A captured CUDA graph is not serializable across processes (baked pointers are process-local), so capture must run in the same process that replays — which is why "Python setup once + native hot loop in one process" is the pragmatic embedded path (Python out of the hot loop without porting capture).
Same libflashrt_exec; the host language is chosen per scenario.
serving/robot_host/ (C++) — real-time VLA serving/llm_agent/ (Rust) — LLM server
─────────────────────────────────────────── ─────────────────────────────────────────
C++ host loop: Rust host (axum/SSE + session registry):
vision ─enc_x─▶ action (Plan, 1 DAG) per token: write tok -> Buffer
ASR on a concurrent stream (overlap/interrupt) frt_graph_replay(decode, key=cur_pos, s)
subgoal change = overwrite Buffer (no recapture) read logits -> argmax -> append
interrupt granularity = one short replay spec/MTP = more adopted graphs, same way
Why C++: real-time, ROS2-adjacent, one toolchain Why Rust: async/safety shell; FFI seam
with exec+kernel, no FFI seam, easiest debug. crossed once per token (thin).
What FlashRT already does (see flash_rt/core/cuda_graph.py, flash_rt/frontends/):
setup-once (
set_prompt: tokenize / calibrate / autotune / capture) → replay-forever (infer: copy inputs into static buffers →graph.replay(stream)→ clone outputs)
Multi-subgraph wiring (vision→encoder→action) already exists, and it is dead simple: the vision
graph writes _enc_x, the encoder graph reads _enc_x from the same device pointer — no HBM
round-trip, no copy. Multi-batch already exists too: a separate graph is captured for B=2.
This layer's whole job is to make that already-validated seam an explicit contract. It invents no new abstraction.
- A replayable graph node + its bound, named I/O buffers
- Zero-copy hand-off between graphs via a shared buffer
selectamong shape variants (batch 1-8, seqlen buckets) and replay- Multi-stream + event sync + stream priority (the hardware mechanisms behind parallel scheduling, interruption, gap-fill)
- Imperative driving: the host may fire any graph/Plan on demand
- ❌ session registry, prefix/radix cache, KV append/fork/evict semantics
- ❌ scheduler: priority decisions, deadlines, preemption policy
- ❌ protocols: OpenAI
/v1/chat, SSE, MCP, agent protocols - ❌ robotics: sensor triggers, action cadence, multi-rate orchestration
- ❌ scenario tags on a graph like
family: llm|vla,latency_hint,bottleneck_hint
Red line: if any scenario forces you to add a field to the common layer, that field is policy and belongs back in an example. The only things allowed to grow are
ShapeKeysemantics and the number of buffers/graphs.
Buffer A named device memory region. Graphs are wired together by SHARING one Buffer
(zero-copy). Every "mutable state" (KV, vision cache, subgoal embedding, scales)
is a Buffer. The framework owns its lifetime + device pointer only; append/fork/
evict are caller logic written on top of the pointer.
Graph A captured graph-exec + its bound named I/O Buffers + replay(stream).
Internally a ShapeKey -> graph-exec variant table (batch 1-8 / seqlen buckets go here).
Plan An ordered replay of (Graph, ShapeKey) across streams, with explicit event deps.
A dumb DAG: it expresses DATA dependencies only — never priority/deadline/preemption.
ShapeKey An opaque u64 encoding (B, S, ...). Batch is NOT a new axis — just one field of the key.
- Batch is not a new axis; it is a field of
ShapeKey. And the ShapeKey is an exact key, not a bucket: Qwen3.6 keys a decode graph per exactcur_pos, a verify graph per(cur_pos, K), evicted by an LRU table (cap 256). Bucketing (e.g. seqlen {512,1024}) is just one caller strategy layered on top of an exact key — not a framework concept. Which keys to capture and how to evict is caller/example policy; the framework only provides the keyed variant table + capacity LRU + "does this key have a variant". - No
Stateobject. State is uniformly aBuffer. The framework defines no append/fork verbs. Planis dumb. It expresses data deps only. Multi-model co-host and VLA multi-rate are composed by the upper layer using Plans as building blocks.- Capture "intelligence" stays out of the common layer. Capture is cold and model-specific
(autotune/calibrate) and stays in the Python frontend. The common layer provides only
framework-agnostic
begin/end capture(stream level) + buffer registration, owning the replay-time contract + buffer registry. This is the core red line that keeps the layer thin.
typedef struct frt_ctx_s* frt_ctx; /* owns arena + stream/event pool */
typedef struct frt_buffer_s* frt_buffer;
typedef struct frt_graph_s* frt_graph;
typedef struct frt_plan_s* frt_plan;
typedef struct frt_event_s* frt_event; /* cross-stream sync point */
typedef uint64_t frt_shape_key;
/* --- ctx / stream / event --- */
frt_ctx frt_ctx_create(void);
int frt_ctx_stream(frt_ctx, int priority); /* prioritized stream; 0 = normal */
frt_event frt_ctx_event (frt_ctx); /* imperative cross-stream sync (interrupt / spec snapshot) */
int frt_event_record(frt_event, int stream_id);
int frt_stream_wait (frt_ctx, int stream_id, frt_event);
/* --- Buffer: named device memory; graphs wire via sharing it --- */
frt_buffer frt_buffer_alloc(frt_ctx, const char* name, size_t bytes);
frt_buffer frt_buffer_wrap (frt_ctx, const char* name, void* dptr, size_t bytes);
void* frt_buffer_dptr (frt_buffer); /* caller does copy_/append/zero; framework owns no semantics */
int frt_buffer_copy (frt_ctx, frt_buffer dst, size_t dst_off,
frt_buffer src, size_t src_off, size_t bytes, int stream_id);
/* --- Graph: a ShapeKey -> graph-exec variant table (exact key + LRU, cap = max_variants) --- */
frt_graph frt_graph_create (frt_ctx, const char* name, size_t max_variants);
int frt_graph_capture(frt_graph, frt_shape_key key,
void (*record)(void* user, void* stream), void* user);
int frt_graph_bind (frt_graph, const char* port, frt_buffer); /* named I/O */
int frt_graph_replay (frt_graph, frt_shape_key key, int stream_id); /* missing key -> error */
int frt_graph_has_variant(frt_graph, frt_shape_key key);
/* --- Plan: dumb DAG, data dependencies only --- */
frt_plan frt_plan_create (frt_ctx);
int frt_plan_add (frt_plan, frt_graph, frt_shape_key key, int stream_id);
int frt_plan_after (frt_plan, int node_idx, int dep_node_idx); /* event sync */
int frt_plan_execute(frt_plan, frt_shape_key key);That's all. No session, scheduler, OpenAI, KV, or family.
A graph is a closed capsule. Every split point = one HBM materialization (intermediate values stay in registers/smem within a capsule; across graphs they must land in HBM). So split a boundary only when any of these holds, otherwise fuse into one graph:
- The output is consumed at a different cadence (vision 30Hz → action 50Hz)
- Inputs must be mutated between stages (copy new obs / new token)
- Control flow must return to host (spec accept, sampling, early exit, interrupt point)
- The stage is reused (vision encoder across calls / across models)
Without one of these, don't split — you would forfeit the megakernel fusion you worked for. Don't split where you just finished fusing.
To reclaim small-batch GPU headroom, put the second model entirely on another stream and let the hardware overlap — not by finely splitting the first model and hand-inserting work into bubbles. The mechanism (multi-stream + event) is free within the 3 atoms; whether to overlap, or to keep the GPU idle to protect p99, is upper-layer policy and stays out of the common layer.
A CUDA graph cannot be preempted mid-replay — but that is exactly the dividend of kernelization: the graphs are short (VLA inference ~17ms, LLM decode sub-ms), so interrupt granularity = one replay duration, and the host re-decides between replays. All three real-robot interrupt actions are expressible with the 3 atoms + multi-stream:
| Robot action | Implementation | New concept? |
|---|---|---|
| Voice concurrency (ASR ‖ VLA) | ASR on a separate stream, hardware overlaps | No |
| Change subgoal / goal | subgoal embedding is a Buffer; host overwrites it; next replay uses it — no recapture |
No |
| Hard interrupt (abort current action) | host loop stops issuing the next replay, issues a high-priority graph; granularity = one short graph | No |
Two accompanying disciplines (usage, not new concepts):
- Imperative driving is first-class.
Plancovers only the static DAG inside one inference (vision→encoder→action). The interruptible outer loop (read sensor/voice → decide which Plan to fire → swap a buffer → abort/switch) belongs to the robot host, not the framework. The framework only guarantees graphs/Plans can be fired imperatively on demand. - Anything mutable at "interrupt cadence" must be a bound
Buffer, never baked into the graph. subgoal/prompt embeddings, target poses, mode flags — all become overwritable Buffers, so changing the goal is a µs-scale copy, not a seconds-scale recapture. (Proven pattern: FP8 scales are updated in-place today, graph pointers stay valid, no recapture.)
| Scenario | How the atoms compose | Framework knows the "scenario"? |
|---|---|---|
| Multi-subgraph VLA (vision→llm→action) | 3 Graphs; vision out-port and llm in-port bind the same Buffer; a Plan chains them |
No — just 3 graphs sharing a buffer |
| Multi-model co-host (Pi05 + Qwen) | Two graph sets + two Plans on different stream_ids; ordering is the host's call |
No — just executes DAGs |
| LLM decode + KV | KV is a Buffer; the decode graph binds it; each step the host copies into the KV offset, then replay(key=seqlen_bucket) |
No — unaware of KV/session |
| Batch 1-8 | ShapeKey carries B; capture 4 variants; host picks the key by the packed B |
No — just a key |
| Spec decode / MTP | draft/verify are Graphs; accept length and rollback live in host code; rollback = rewrite the KV buffer's logical length |
No — unaware of spec |
| Voice interrupt → change subgoal | the ASR Graph runs on another stream; the subgoal-embedding Buffer is overwritten by the host |
No — just a buffer overwrite |
This table is the acceptance test: if any scenario forces a new common-layer field, the design failed — go back to the §0 red line.
Core point: the exec layer has zero dependency on the csrc kernel layer. It captures whatever
the record callback enqueues onto a stream (FlashRT kernels / torch ops / raw CUDA), and only ever
sees streams / graphs / events. It is a kernel-agnostic + framework-agnostic orchestration layer,
orthogonal to csrc's compute kernels, with its own independent cross-hardware backend axis. So it sits
as a sibling to csrc, not inside it.
exec/ # NEW top-level: execution-contract layer (orchestration, not kernels; zero csrc dep)
include/flashrt/exec.h # public C ABI — the authoritative form of this spec
src/
context.cpp # frt_ctx: arena + stream/event pool
buffer.cpp # frt_buffer
graph.cpp # frt_graph: ShapeKey -> graph-exec variant table + capture/replay
plan.cpp # frt_plan: dumb DAG executor
backend/ # exec's own cross-hardware backend axis (orthogonal to csrc's kernel backends)
backend.h # graph/stream/event abstract interface
cuda/cuda_backend.cpp # CUDA impl (cudaGraph/Stream/Event)
# future: hip/ , level_zero/ ...
bindings/
exec_pybind.cpp # pybind -> flash_rt.runtime.exec (dev/migration only)
CMakeLists.txt # builds libflashrt_exec (.so) as an independent target
flash_rt/runtime/exec.py # thin Python wrapper (sibling to rtc.py), dev only
serving/robot_host/ # scenario: multi-model + interrupt (VLA + ASR), policy-layer demo
serving/llm_agent/ # scenario: session/KV/OpenAI shell (folds in existing *_openai_server.py)
docs/exec_contract.md # this document
csrc/ # unchanged: pure compute kernels (attention/gemm/conv/...), own cross-hw backends
Capture "intelligence" (autotune/calibrate) stays in the Python frontend; this layer owns only the replay-time contract.
Map both real frontends' graph/buffer/key structures onto frt_* to test sufficiency.
Conclusion: the 3-atom core holds, with no scenario field forced out; validation did force out 3
mechanism primitives (now added to exec.h), none of which touch the policy red line.
- Graphs (all exact-key, LRU table cap 256): decode by
cur_pos, verify by(cur_pos,K), MTP-draft bymtp_pos, prefill-chunk by(start,len). → onefrt_graphper type, ShapeKey encodes those exact values,max_variants=256LRU. No batch axis (B=1). - All state is
Buffer:K_cache/V_cache(16 layers), linear-attn_lin_state/_lin_conv_state, MTP KV,_static_token_id,_logits_buf, RoPE tables, snapshot bufs. → allfrt_buffer_alloc+bind. - KV append = write into
K_cache[layer, cur_pos:cur_pos+1](host writes the offset before/within the replay). No append verb. - Spec flow: host imperatively runs K draft graphs → concurrent snapshot on a side stream
(
_snap_stream+ wait_stream) → replay the(cur_pos,K)verify graph → host argmax-compares to pick accept length N → on mismatch,Buffer-to-Buffercopy rollback + re-advance with theK=N+1graph. Accept/rollback are entirely host control flow; the framework is unaware of spec. - Forced-out mechanisms: ① multi-stream + standalone event (the snapshot side-stream's
imperative wait_stream, outside any Plan) →
frt_ctx_event/record/stream_wait; ②frt_buffer_copyfor snapshot/restore; ③ Graph variant tablemax_variants+ LRU.
- Graphs: SigLIP (vision) one; encoder+decoder fused into one (all 10 diffusion steps inside
the graph, one replay runs them all, no host loop); CFG B=2 captures a separate
_enc_ae_graph_b2(there is even an outer graph fusing lang-swap + SigLIP×2 + enc + dec into one). → vision/enc-dec are twofrt_graphs; B=2 is the B-field variant of a ShapeKey. - Zero-copy hand-off (the core): SigLIP writes
_enc_x→ encoder reads the same_enc_x.data_ptr(), no copy; encoder writes_Kc/_Vc→ decoder reads the same pointers. → both graphsbindthe samefrt_buffer— exactly the wiring mechanism the design intends. - Per-step time embeds are precomputed into
_sa_all/_sf_all/_fs_allat set_prompt time, read-only inside the graph. →Buffer, bind. - Batch/CFG:
set_batched_mode(B=2)triggers a separate capture = the B-variant of the ShapeKey. B is not a new concept. - "Plan vs one big graph" is the author's choice: Pi05 can either chain vision + enc-dec via a
Plan, or capture them as one graph like the CFG outer graph. The contract supports both — which validates the flexibility. - Today Pi05 is single-stream; multi-stream/multi-rate is forward-looking for
serving/robot_host/, and the mechanism is already in place.
- ✅ KV append / rollback / accept / batch / multi-subgraph are all expressed with only Buffer + exact-key Graph + host control flow + shared bind, with no session/scheduler/family/KV-verb policy field added. Red line held.
- ➕ Added 3 mechanisms:
event(imperative cross-stream),buffer_copy(torch-free rollback), graphmax_variants+ LRU. This is "minimal complete", not bloat — without them, imperative interrupt sync and spec rollback would have to drop to raw CUDA, defeating the embeddability goal.
- Spec (~1 week): this document +
exec/include/flashrt/exec.h. Acceptance: use the §4 table to wire Qwen decode and Pi05 vision→action on paper without adding any field. (Done — see §5b.) - C++ impl + pybind + one real model (2-3 weeks): implement Buffer/Graph/Plan + the CUDA backend;
route the existing Qwen frontend's replay hot path through it (capture stays in Python).
Acceptance: E2E tok/s and cos do not regress (the floor); multi-session B≤8 runs through the
same
select(key). - One multi-node + interrupt example (1-2 weeks):
serving/robot_host/chains Pi05 vision→llm→action across streams via a Plan, and demonstrates concurrent ASR + subgoal-buffer overwrite interruption. Acceptance: Pi05 latency does not regress; no scenario field was added to the common layer throughout (if one was, go back to step 1).
The Rust shell, OpenAI, and scheduler all live in serving / the upper layer, off this main line.
- Single-model B=1 decode loop:
replayis already a ~10µs ctypes call; Python is fine, and a C++ rewrite buys ≈0. - Where C++ actually earns its keep: (a) the multi-node / multi-stream / event Plan execution loop
(escaping GIL/GC jitter); (b) embeddability — a robot host or Rust agent server can link a pure
C-ABI
.sodirectly. - Therefore the common layer = replay-time execution of Buffer/Graph/Plan + a C ABI, and not one bit more.
Status of the first version (branch spec/exec-contract). Everything below is implemented and
verified in-container on RTX 5090 (sm_120).
-
Bufferalloc / wrap / dptr / bytes / name — owned + external pointers -
frt_buffer_copydevice-to-device async on a stream -
Graphcapture (own, via record callback) — stream-level RELAXED capture -
Graphadopt an external instantiated exec (torchraw_cuda_graph_exec()or ctypesCUDAGraph._graph_exec); non-owned, never freed by the layer -
Graphbindnamed ports (zero-copy hand-off bookkeeping + lifetime) -
GraphShapeKey variant table + LRU (exact key,max_variants); missing key →FRT_ERR_NO_VARIANT(never a silent no-op) -
Plandumb DAG:add/after(cross-stream event dep) /execute/sync - Events:
frt_ctx_event/frt_event_record/frt_stream_wait(imperative cross-stream) - Streams:
frt_ctx_stream(priority)+frt_ctx_wrap_stream(external); default stream (handle 0) is a first-class wrappable stream (id validated by range, not handle nullness) - Lifetime: ctx owns streams/events/buffers/graphs/plans; clean teardown, no leaks of owned memory
Test: exec/tests/test_exec.py (5/5) + exec/tests/test_plan_vla.py (VLA-shaped Plan chain).
- Qwen3.6-27B NVFP4 (LLM) decode-S=1 graph driven via exec — bit-identical to torch replay
(
test_adopt_qwen36.py, max abs diff 0.0) - Qwen3.6 spec-decode path (MTP-chain + verify graphs) via exec — output sha identical,
spec stats identical (commit
0b988d8) - Qwen3.6 spec rollback (lin/conv restore) via
frt_buffer_copy— sha identical (c3c66f4) - Pi0.5 (VLA) FP16 full infer graph via exec —
cos = 1.000000, bit-identical (test_adopt_pi05.py) - Pi0.5 (VLA) FP8 full infer graph via exec —
cos = 1.000000, bit-identical (test_adopt_pi05.py --fp8)
- Qwen3.6 decode tok/s: 133.7 (off) vs 133.9 (on), K=6 representative prompt — within noise
- Qwen3.6 prefill TTFT: 217.1 ms off vs on — unchanged
- Pi0.5 FP16 p50: 34.5 ms off vs on — unchanged
- Pi0.5 FP8 p50: 24.55 ms off vs on — unchanged
- Rationale: frt replay is the same
cudaGraphLaunchas the baseline; zero added replay overhead by design. - Measurement notes:
wall ≈ internal(the predict() Python pre/post is ~0.1 ms — negligible). Absolute latency is warmup/clock sensitive: the RTX 5090 SM clock ramps toward its 3105 MHz peak over a run, so a short warmup reads ~1–2 ms high (e.g. Pi0.5 FP16 internal_p50 34.0 ms at warmup=30 vs 33.1 ms / min 32.2 ms at warmup=200). Warmup, calibration, and clock pinning stay under the caller's explicit control (the contract does not touch capture/warmup); the flag on-vs-off A/B is parity at any warmup level — that is the acceptance signal, not the absolute number.
- Qwen3.6:
FLASHRT_QWEN36_USE_EXEC=1(decode + spec + rollback) - Pi0.5 FP16:
FLASHRT_PI05_USE_EXEC=1(Pi05PipelineFP16) - Pi0.5 FP8/default:
FLASHRT_PI05_USE_EXEC=1(Pi05Pipeline) - Flag unset → original path runs unchanged (verified: identical output + latency)
Validated against the real VLA feature surface — all are capture/setup-time and stay in the frontend; each resolves to "a captured graph (+ batch/CFG/decoder variants) + buffers (scales/state)", which the spec already expresses:
- autotune (
autotune_gemms), FP8 / int8-static / AWQ calibration, multi-frame (N>1) dataset calibration → run beforerecord_infer_graph; the layer only adopts the resulting graph - in-place recalibration without recapture → scales are
Buffers; adopted exec stays valid set_prompt_batch/set_batched_mode(B=2 CFG) → a ShapeKey B-variant (another adopted graph)- CFG/RL pipelines, temporal-KV decoder-only graph → additional adopted graphs
- sessions, schedulers, OpenAI/MCP, sensor/cadence orchestration, interruption policy
# exec layer (seconds)
cmake -S exec -B exec/build -DCMAKE_BUILD_TYPE=Release && cmake --build exec/build -j
# flash_rt fa2 with fp16 (needed by Pi0.5 FP16): default FA2_DTYPES already fp16;bf16
cmake -S . -B build -DGPU_ARCH=120 && cmake --build build -j --target flash_rt_fa2
# tests
python exec/tests/test_exec.py
python exec/tests/test_plan_vla.py
python exec/tests/test_adopt_qwen36.py # needs FLASHRT_QWEN36_{NVFP4,MTP}_CKPT_DIR
python exec/tests/test_adopt_pi05.py [--fp8] # needs a pi05 ckpt
- Long-context Qwen3.6 spec paths (≥512-token FP8-KV / TQ routes) not yet routed through exec (short BF16/spec path is)
serving/robot_host/interrupt demo,serving/llm_agent/OpenAI shell, Rust shell — upper layer- Full 64K–256K Qwen TTFT/decode re-alignment with the doc (512–32K done)
- Merge
spec/exec-contractback to main
This section is the maintenance contract for the execution layer and everything built on top of it
(serving/). It is the rule every contract-touching or serving PR is reviewed against. Keep it short
on purpose — if a change needs a longer justification than this, it is probably policy.
- The contract fixes MECHANISM, never scenario policy. Three primitives plus one key:
Buffer/Graph/Plan/ShapeKey. The only allowed extensions are ShapeKey semantics and the number of buffers/graphs. Anything else that wants a field is policy. - If a scenario pushes you to add a field or verb to
exec/, that field is policy. Push it back toserving/or user code. There is no session, KV, cache, schedule, batch-policy, protocol, agent, or robot concept in the contract, and there must never be one. - Policy lives above the contract (in
serving/or the user host), and includes: sessions and prefix/KV reuse, eviction/scheduling, batching policy, OpenAI/MCP protocol and tool-call parsing, agent/ReAct loops, robot episode/cadence/interruption orchestration, and prompt routing. - Capture-time intelligence stays in the Python frontend. Autotune, FP8/NVFP4/AWQ calibration, multi-frame calibration, CFG-batched capture, and warmup all run before graph capture; the contract only adopts the resulting instantiated exec and owns replay-time. The contract never captures on behalf of a scenario.
exec/is a top-level sibling ofcsrc/with zerocsrcdependency. It is kernel-agnostic: it only sees streams/graphs/events and replays whatever was captured (kernel, torch op, or raw CUDA). It must not#includeor link csrc kernels.- Additive-only at the integration seam. Existing kernels, bindings, loaders, and graph-capture paths are not modified; new behavior is new methods / new files / new ShapeKey variants, gated by an opt-in flag, with the default path byte-identical. A shared-method edit that changes behavior on a path other than the new one is the one thing to flag, justify, and back with before/after numbers.
- Cut graphs only at real seams — data/cadence boundaries, host control-flow returns, or reuse points — never to fill scheduling bubbles. Each cut materializes one HBM round-trip and can erase a fusion win.
-
git diff main -- exec/is empty, or the change only adds a Buffer/Graph/Plan/Event/ShapeKey mechanism (no session/KV/schedule/protocol/scenario field, verb, or struct member). - The serving/example layer does not import or call into the scenario's behavior from inside the
contract; orchestration (sessions, loops, interruption, tool calls) is in
serving/only. - GPU/model state stays owned by the frontend; the serving layer holds metadata only (token journal, cache plan, episode bookkeeping) — never device buffers it allocated behind the contract.
- Capture/calibration/warmup stays in the frontend; the contract only adopts. No new capture path
was added to
exec/. - The change is additive: new methods/files/flags, default path unchanged. If a shared method was edited, the PR lists the affected non-new paths and shows output is preserved (bit-identical / token-exact) and reports before/after latency.
- Correctness gate: the new path produces output identical to the pre-existing path it replaces (bit-identical for deterministic LLM decode, token-exact for spec decode, cosine ≥ 0.999 for VLA diffusion), with a runnable command in the PR.
- Optional dependencies in tests use
pytest.importorskip(...)so they skip (not fail) when a serving-only dependency (e.g.fastapi) is absent from a kernel-only environment. - Public repo hygiene: English docs/code/comments, no local paths, no machine/container/env names, no generated build outputs or checkpoints committed.
The Qwen3.6 agent-serving host is the reference application of this red line, and is what a reviewer should compare a new serving example against:
exec/untouched —git diff main -- exec/is empty; the agent adds zero contract verbs.- Pure policy layer —
server.pyis a thin FastAPI shell;service.py/session.py/prefix.pyown session cache, exact token-prefix reuse, and request planning;tool_stream.pyparses tool calls without leaking partial JSON. All of this is policy and lives only inserving/qwen36_agent/. - State ownership respected —
SessionRecordexplicitly tracks only the token journal and cache metadata; the GPU KV / linear-attention / MTP state stays owned by the Qwen frontend. - Drives the model through the frontend, not the contract —
qwen36_engine.pyis a thin adapter to theAgentEngineprotocol; the contract is reached (optionally) inside the frontend viaFLASHRT_QWEN36_USE_EXEC, not from the serving layer. - Additive frontend split — new
*_agent/*_committed_streammethods; the one shared-method edit (_long_tq_effective_k, a speculative-K tuning heuristic) is verify-corrected, so decode output is unchanged and only speed is affected. - Verified — policy + OpenAI HTTP surface:
tests/test_qwen36_agent_serving_policy.pyandtests/test_qwen36_server_warmup.py; real-model equivalence:tests/test_qwen36_agent_gpu_split.pyasserts the split prefill + committed-stream path is token-identical to the old full-generate path (short, long, long-append, long-text).