feat(optimizations): Luce Spark — calibrated hot/cold expert residency by davide221 · Pull Request #345 · Luce-Org/lucebox-hub

davide221 · 2026-06-05T14:58:54Z

Adds optimizations/spark/, the placement + caching product layer on top of the merged hot/cold MoE offload engine (server/src/common/moe_hybrid_*). Mirrors optimizations/pflash/: standalone tooling + docs here, engine code stays in server/.

What Spark is

A 33B-total MoE only fires ~8 of 256 experts/token, but naive hot/cold offload still hits the CPU tier ~36% of the time if it keeps the wrong experts resident. Spark:

Calibrates the hot set from the traffic you actually serve (replays real agent sessions, accumulates per-(layer,expert) routing → placement profile).
Caches the per-session tail in a bounded ring of spare GPU slots (LRU swap), driving cold-misses to ~0 in fixed VRAM.
Pre-gate (research): captures (hidden → experts) traces + trains a predictor; documents why fusion needs a model fine-tune, not a fitted predictor.

Results (RTX 3090, Laguna-XS.2 Q4_K_M 33B-total, held-out Claude Code sessions)

Config	tok/s	% all-GPU	cold-hit	VRAM
All-GPU	111	100%	–	18.8 GiB
Uniform 60%	66	59%	36%	10.6 GiB
Spark calibrated 60%	81	73%	6.6%	10.6 GiB
Spark + cache (32 slots)	85–88	~79%	~0%	14.6 GiB

Calibration: 333 chunks / ~171K tokens from real sessions, split by session (no leakage). Full tables in RESULTS.md.

Scope of this PR

In: standalone tooling (extract_sessions, tokenizer, calibrate, validate, research train_pregate) + README + RESULTS + hero. Nothing here compiles into the daemon.
Not in: the engine-side Spark additions (bounded cache / gpu_remap / trace hook). Those land separately in server/. The calibration path runs on merged main via DFLASH_LAGUNA_HOTNESS / NEXT_PLACEMENT_OUT; the cache flags are documented for when the engine PR lands.

Built on the merged hot/cold offload engine (#289 et al). Honest negative result included: a fitted pre-gate caps at ~53% recall@8 (pre/post-attention information gap), so full all-GPU speed needs a gate fine-tune, not a predictor.

🧙 Built with WOZCODE

…ency) Spark is the placement + caching layer on top of the merged hot/cold MoE offload engine. Standalone tooling to calibrate expert placement from real agent traffic, validate on held-out sessions, and (research) train a pre-gate predictor from routing traces. Laguna-XS.2 Q4_K_M (33B total MoE) / RTX 3090, held-out Claude Code sessions: calibration lifts naive offload 66 -> 81 tok/s (cold-hit 36% -> 6.6%); the bounded expert cache reaches ~88 tok/s at ~0 cold, 14.6 GiB peak (vs 18.8 GiB to hold the full model). Engine (cache, gpu_remap, trace) lives in server/. Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

9 issues found across 10 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="optimizations/spark/spark/tokenizer.py">

<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>

<file name="optimizations/spark/spark/extract_sessions.py">

<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>

<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>

<file name="optimizations/spark/README.md">

<violation number="1" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
</file>

<file name="optimizations/spark/spark/validate.py">

<violation number="1" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>
</file>

<file name="optimizations/spark/spark/train_pregate.py">

<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) [FEEDBACK_USED]</violation>

<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>

_{Tip: cubic used a learning from your PR history. Let your coding agent read cubic learnings directly with the cubic MCP.

Re-trigger cubic}

cubic-dev-ai · 2026-06-05T15:08:18Z

+    def list_str(fl):
+        return [bytes(fl.parts[d]).decode("utf-8", errors="replace") for d in fl.data]
+
+    model = bytes(f["tokenizer.ggml.model"].parts[f["tokenizer.ggml.model"].data[0]]).decode()


P2: Unguarded GGUF field access on f["tokenizer.ggml.model"] and f["tokenizer.ggml.tokens"]. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/tokenizer.py, line 24: <comment>Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</comment> <file context> @@ -0,0 +1,56 @@ + def list_str(fl): + return [bytes(fl.parts[d]).decode("utf-8", errors="replace") for d in fl.data] + + model = bytes(f["tokenizer.ggml.model"].parts[f["tokenizer.ggml.model"].data[0]]).decode() + if model != "gpt2": + raise SystemExit(f"only gpt2 byte-level BPE supported, gguf says model={model!r}") </file context>

cubic-dev-ai · 2026-06-05T15:08:18Z

+    train, test = [], []
+    for f in files:
+        # split by session-path hash so a whole session is train xor test
+        bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train


P2: hashlib.md5() without usedforsecurity=False will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/extract_sessions.py, line 79: <comment>`hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</comment> <file context> @@ -0,0 +1,115 @@ + train, test = [], [] + for f in files: + # split by session-path hash so a whole session is train xor test + bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train + sess = [] + try: </file context>

cubic-dev-ai · 2026-06-05T15:08:18Z

+
+```bash
+cd optimizations/spark
+uv sync                                   # tokenizers (+ gguf/torch optional extras)


P2: uv sync does not install optional dependencies (gguf, numpy) required by the tokenizer step. The comment on the uv sync line is misleading — it lists gguf/torch optional extras but the command does not install them. Step 0 (python -m spark.tokenizer) will fail with ImportError because gguf and numpy are in the tokenizer extra, not the base dependency set.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/README.md, line 104: <comment>`uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</comment> <file context> @@ -0,0 +1,187 @@ + +```bash +cd optimizations/spark +uv sync # tokenizers (+ gguf/torch optional extras) + +# 0. one tokenizer, extracted from the GGUF (gpt2 byte-level BPE) </file context>

Suggested change

uv sync # tokenizers (+ gguf/torch optional extras)

uv sync --all-extras # tokenizers (+ gguf/torch optional extras)

cubic-dev-ai · 2026-06-05T15:08:19Z

+    train, test = [], []
+    for f in files:
+        # split by session-path hash so a whole session is train xor test
+        bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train


P2: --test-frac is not validated, so --test-frac 0 crashes with ZeroDivisionError during split assignment.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/extract_sessions.py, line 79: <comment>`--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</comment> <file context> @@ -0,0 +1,115 @@ + train, test = [], [] + for f in files: + # split by session-path hash so a whole session is train xor test + bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train + sess = [] + try: </file context>

cubic-dev-ai · 2026-06-05T15:08:19Z

+            proc.kill()
+            raise SystemExit("ready timeout")
+
+    pp = Path("/tmp/spark_val_chunk.bin")


P2: Using fixed /tmp filenames causes cross-process collisions when multiple validation runs execute concurrently.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 95: <comment>Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</comment> <file context> @@ -0,0 +1,142 @@ + proc.kill() + raise SystemExit("ready timeout") + + pp = Path("/tmp/spark_val_chunk.bin") + op = Path("/tmp/spark_val_out.bin") + toks = [] </file context>

cubic-dev-ai · 2026-06-05T15:08:19Z

+    import torch.nn as nn
+    H, E = args.n_embd, args.n_expert
+    dt = np.dtype([("layer", "<i2"), ("nsel", "<i2"), ("sel", "<i4", (8,)), ("hid", "<f4", (H,))])
+    arr = np.fromfile(args.trace, dtype=dt)


P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before np.fromfile, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.)

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/train_pregate.py, line 42: <comment>The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early. (Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) </comment> <file context> @@ -0,0 +1,85 @@ + import torch.nn as nn + H, E = args.n_embd, args.n_expert + dt = np.dtype([("layer", "<i2"), ("nsel", "<i2"), ("sel", "<i4", (8,)), ("hid", "<f4", (H,))]) + arr = np.fromfile(args.trace, dtype=dt) + print(f"records={len(arr)}", flush=True) + dev = "cuda" if torch.cuda.is_available() else "cpu" </file context>

cubic-dev-ai · 2026-06-05T15:08:19Z

+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+
+    agg = {8: [], 16: [], 24: []}
+    for L in range(1, args.n_layer):


P3: The per-layer training loop skips layer 0 (range(1, args.n_layer)), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/train_pregate.py, line 47: <comment>The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</comment> <file context> @@ -0,0 +1,85 @@ + dev = "cuda" if torch.cuda.is_available() else "cpu" + + agg = {8: [], 16: [], 24: []} + for L in range(1, args.n_layer): + idx = np.where(arr["layer"] == L)[0][:args.max_per_layer] + if len(idx) < 400: </file context>

A blocking proc.stdout.readline() returns only on a new line or EOF, so the time-based timeout check ran *after* it: a daemon that goes silent (stalled model load, deadlock, disk stall) blocked the caller forever and the timeout never fired. This broke unattended calibration/validation runs. Move the protocol into spark/_daemon.py: a reader thread pumps stdout into a queue and callers wait with queue.get(timeout=...), so a read can never outlive its deadline. wait_ready() and the per-chunk generate loops in calibrate.py and validate.py now honor --ready-timeout / --gen-timeout and kill the daemon on stall. (select() + buffered readline would miss buffered data; the reader thread does the blocking readline itself.) Verified: silent daemon times out in ~1s, dead daemon returns immediately, happy path unchanged. Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

5 issues found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="optimizations/spark/spark/validate.py">

<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>

<violation number="2" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>

<file name="optimizations/spark/spark/calibrate.py">

<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>

<file name="optimizations/spark/spark/_daemon.py">

<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>

<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) [FEEDBACK_USED].</violation>
</file>

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

cubic-dev-ai · 2026-06-05T15:34:02Z

+        if len(ids) < 8:
+            continue
+        write_counted_i32(pp, ids)
+        reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)


P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 91: <comment>daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</comment> <file context> @@ -100,30 +88,16 @@ def drain(): - break - if okl and okl.startswith("ok "): - m = re.search(r"decode_tok_s=([0-9.]+)", okl) + reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout) + if reply is None: + print("daemon stalled/closed; stopping early") </file context>

cubic-dev-ai · 2026-06-05T15:34:02Z

+    if toks:
+        print(f"decode tok/s: mean={statistics.mean(toks):.1f} median={statistics.median(toks):.1f} "
+              f"over {len(toks)} chunks")
+    colds = [float(m.group(1)) for l in daemon.stderr_lines


P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 109: <comment>daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</comment> <file context> @@ -132,7 +106,7 @@ def drain(): print(f"decode tok/s: mean={statistics.mean(toks):.1f} median={statistics.median(toks):.1f} " f"over {len(toks)} chunks") - colds = [float(m.group(1)) for l in prof + colds = [float(m.group(1)) for l in daemon.stderr_lines for m in [re.search(r"cold_experts/tok=([0-9.]+)", l)] if m] if colds: </file context>

cubic-dev-ai · 2026-06-05T15:34:02Z

+        if len(ids) < 8:
+            continue
+        write_counted_i32(pp, ids)
+        reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)


P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/calibrate.py, line 78: <comment>Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</comment> <file context> @@ -84,38 +75,18 @@ def main(): - if time.time() - ts > 120: - break - if not ok: + reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout) + if reply is None: + print(f"[calib] daemon stalled/closed at chunk {i}; stopping", flush=True) </file context>

cubic-dev-ai · 2026-06-05T15:34:02Z

+        self._q.put(None)  # EOF sentinel
+
+    def _pump_stderr(self):
+        for line in self.proc.stderr:


P2: _pump_stderr silently crashes on encoding errors — same UnicodeDecodeError risk as stdout, causing data loss for callers that read stderr_lines (e.g. validate.py extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/_daemon.py, line 44: <comment>`_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</comment> <file context> @@ -0,0 +1,106 @@ + self._q.put(None) # EOF sentinel + + def _pump_stderr(self): + for line in self.proc.stderr: + self.stderr_lines.append(line.rstrip()) + </file context>

cubic-dev-ai · 2026-06-05T15:34:02Z

+                line = self.readline(deadline - time.time())
+            except DaemonTimeout:
+                self.kill()
+                raise SystemExit(f"daemon did not become ready within {timeout}s (no output)")


P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) .

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/_daemon.py, line 66: <comment>Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts. (Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) .</comment> <file context> @@ -0,0 +1,106 @@ + line = self.readline(deadline - time.time()) + except DaemonTimeout: + self.kill() + raise SystemExit(f"daemon did not become ready within {timeout}s (no output)") + if line is None: + raise SystemExit("daemon died before ready banner") </file context>

Server side of Luce Spark, on top of the merged hybrid-offload engine: - Bounded GPU expert cache: spare slots over-allocated on the hot expert stack + moe_hybrid_cache_swap_in (LRU). On a cold hit the expert is swapped into a spare slot and served through the unified GPU FFN, so cold-misses fall to ~0 in fixed VRAM after warmup. Gated by DFLASH_LAGUNA_EXPERT_CACHE / _CACHE_SLOTS / _GPU_REMAP. - Swap-rebuild fix + clean profile flush: build_hybrid_storage_from_file() re-reads experts from the GGUF mmap (partial-load keeps no full expert tensors resident), used by both init and post-request swap (was asserting in ggml_backend_tensor_get). routing_stats_ now also allocates when swap is enabled, and the placement profile flushes after each hybrid generate so NEXT_PLACEMENT_OUT works without the swap path. - Pre-gate trace capture (DFLASH_LAGUNA_PREGATE_TRACE), profile-gated, feeds the optimizations/spark pregate trainer. Laguna-XS.2 Q4_K_M / RTX 3090, calibrated 60% + 32 cache slots: ~85-88 tok/s (cold ~0) at 14.6 GiB peak, vs 66 uniform / 111 all-GPU @18.8 GiB. Built + smoke-tested against origin/main (3-way merge over the generate_impl rename). Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

5 issues found across 6 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/moe_hybrid_storage.cpp">

<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>

<violation number="2" location="server/src/common/moe_hybrid_storage.cpp:507">
P0: `moe_hybrid_cache_swap_in` only handles non-fused gate/up/down tensors. For fused gate-up architectures (Laguna), `gate_hot`/`up_hot` are nullptr so the function always returns -1, making the bounded cache a complete no-op for the very architecture it targets.</violation>
</file>

<file name="server/src/common/moe_hybrid_ffn_eval.h">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>

<file name="server/src/laguna/laguna_backend.cpp">

<violation number="1" location="server/src/laguna/laguna_backend.cpp:169">
P3: Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</violation>

<violation number="2" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
</file>

_{Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic}

cubic-dev-ai · 2026-06-05T16:09:12Z

        const int cold_count = (int)dst.cold_expert_ids.size();
+        const int spare = (cold_count > 0 && cache_slots > 0)
+                          ? std::min(cache_slots, cold_count) : 0;
+        const int hot_alloc = hot_count + spare;


P2: Hot tensor allocation still depends on hot_count > 0, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_storage.cpp, line 377: <comment>Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</comment> <file context> @@ -371,6 +372,13 @@ bool build_moe_hybrid_storage_from_file( const int cold_count = (int)dst.cold_expert_ids.size(); + const int spare = (cold_count > 0 && cache_slots > 0) + ? std::min(cache_slots, cold_count) : 0; + const int hot_alloc = hot_count + spare; + dst.hot_active = hot_count; + dst.cache_slots = spare; </file context>

cubic-dev-ai · 2026-06-05T16:09:12Z

-    int n_hot);
+    int n_hot,
+    bool gpu_remap = false,
+    int n_expert = 0);


P2: int n_expert = 0 default silently disables remapping when gpu_remap=true is set without n_expert. The n_expert > 0 guard in the implementation (moe_hybrid_ffn_eval.cpp:440) prevents a crash but silently produces a non-remapped graph, making gpu_remap=true a non-binding hint rather than a semantic guarantee. Future callers adding /*gpu_remap=*/true without also passing cfg.n_expert will get incorrect (non-remapped) behavior with no warning.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_ffn_eval.h, line 175: <comment>`int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</comment> <file context> @@ -170,7 +170,9 @@ bool build_cached_hot_graph( - int n_hot); + int n_hot, + bool gpu_remap = false, + int n_expert = 0); // Build/rebuild cached cold FFN graph. </file context>

cubic-dev-ai · 2026-06-05T16:09:12Z

    for (int s = 0; s < req.n_gen; ++s) {
        maybe_force_close(next_tok, s);
-        if (next_tok == w_.eos_id || next_tok == w_.eos_chat_id) break;
+        if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;


P2: DFLASH_IGNORE_EOS is interpreted by presence instead of value, so setting it to 0 still disables EOS stopping.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 297: <comment>`DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</comment> <file context> @@ -283,7 +294,7 @@ GenerateResult LagunaBackend::generate_impl(const GenerateRequest & req, for (int s = 0; s < req.n_gen; ++s) { maybe_force_close(next_tok, s); - if (next_tok == w_.eos_id || next_tok == w_.eos_chat_id) break; + if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break; result.tokens.push_back(next_tok); history.push_back(next_tok); </file context>

Suggested change

if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;

if (!([]{ const char * v = std::getenv("DFLASH_IGNORE_EOS"); return v && std::atoi(v) != 0; })() && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;

cubic-dev-ai · 2026-06-05T16:09:12Z

-        if (result.ok) maybe_post_request_swap();
+        if (result.ok) {
+            // Flush routing-frequency profile if requested (independent of swap).
+            if (!routing_stats_out_path_.empty() && routing_stats_) {


P3: Routing stats CSV saved twice per request when swaps occur — once in generate_impl and again in maybe_post_request_swap, both writing identical data to the same file.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 169: <comment>Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</comment> <file context> @@ -163,7 +164,17 @@ GenerateResult LagunaBackend::generate_impl(const GenerateRequest & req, - if (result.ok) maybe_post_request_swap(); + if (result.ok) { + // Flush routing-frequency profile if requested (independent of swap). + if (!routing_stats_out_path_.empty() && routing_stats_) { + std::string serr; + if (!routing_stats_->save_csv(routing_stats_out_path_, &serr)) </file context>

dflash_server --spark: one self-tuning command for both MoE backends. - Enables the bounded expert cache (auto working set) with sized slots (--spark-slots, default 32). - Auto-loads <model>.gguf.spark.csv if present and keeps persisting it after each request from live routing (laguna via NEXT_PLACEMENT_OUT, qwen35moe via RUNTIME_STATS_OUT, which is the var that allocates its routing accumulator). - Wires the expert cache into qwen35moe: cache_slots into its from-file storage + moe_hybrid_cache_swap_in before both host-partition loops in the pipelined decode (symmetric to laguna; swaps are auto-picked-up by the hot_local_by_global lookup, no graph surgery). Verified end-to-end on RTX 3090: laguna and Qwen3.6-35B-A3B both write + reload the profile (source=hotness:...) and stay coherent under forced offload + cache, no crash. dflash_server needs libcurl dev headers (find_package(CURL)). Built against origin/main. Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="optimizations/spark/spark/tokenizer.py">

<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>

<file name="optimizations/spark/spark/extract_sessions.py">

<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>

<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>

<file name="optimizations/spark/README.md">

<violation number="1" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
</file>

<file name="optimizations/spark/spark/validate.py">

<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>

<violation number="2" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>

<violation number="3" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>

<file name="optimizations/spark/spark/train_pregate.py">

<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) [FEEDBACK_USED]</violation>

<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>

<file name="optimizations/spark/spark/calibrate.py">

<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>

<file name="optimizations/spark/spark/_daemon.py">

<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>

<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) [FEEDBACK_USED].</violation>
</file>

<file name="server/src/common/moe_hybrid_storage.cpp">

<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>
</file>

<file name="server/src/common/moe_hybrid_ffn_eval.h">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>

<file name="server/src/laguna/laguna_backend.cpp">

<violation number="1" location="server/src/laguna/laguna_backend.cpp:169">
P3: Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</violation>

<violation number="2" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
</file>

<file name="server/src/server/server_main.cpp">

<violation number="1" location="server/src/server/server_main.cpp:644">
P2: Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</violation>
</file>

_{Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic}

cubic-dev-ai · 2026-06-05T16:37:25Z

+            const std::string pfx = is_laguna ? "DFLASH_LAGUNA_" : "DFLASH_QWEN35MOE_";
+            const std::string profile = std::string(bargs.model_path) + ".spark.csv";
+            std::FILE * pf = std::fopen(profile.c_str(), "rb");
+            const bool have_profile = (pf != nullptr);


P2: Stale profile sidecar not validated against the model file. The .spark.csv profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/server_main.cpp, line 644: <comment>Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</comment> <file context> @@ -623,6 +629,43 @@ int main(int argc, char ** argv) { + const std::string pfx = is_laguna ? "DFLASH_LAGUNA_" : "DFLASH_QWEN35MOE_"; + const std::string profile = std::string(bargs.model_path) + ".spark.csv"; + std::FILE * pf = std::fopen(profile.c_str(), "rb"); + const bool have_profile = (pf != nullptr); + if (pf) std::fclose(pf); + const std::string slots = std::to_string(spark_slots); </file context>

moe_hybrid_cache_swap_in only validated/copied the separate gate/up/down tensors, so for a fused-gate-up MoE (gate_up_hot populated, gate_hot/up_hot null, a layout the storage supports via fused_gate_up) it returned -1 and the bounded cache silently no-op'd. Branch on fused_gate_up: copy gate_up + down when fused, gate + up + down otherwise. Current targets (laguna-xs2, Qwen3.6-35B-A3B) use SEPARATE tensors, so no behavior change for them (laguna cache verified unchanged: cold 38.5 -> 1.9 / 77 -> 85 tok/s at 32 slots). Fixes the latent no-op for fused-gate-up archs. Co-Authored-By: WOZCODE <contact@withwoz.com>

extract_sessions.py now pulls both Claude Code (~/.claude/projects) and Codex (~/.codex/sessions/**/rollout-*.jsonl) by default (--source claude|codex|both). Codex rollouts are parsed from response_item user+assistant content blocks (input_text/output_text/text), skipping the developer/system boilerplate so the corpus reflects real traffic, not instructions. Both sources merge into one corpus with the same per-session train/held-out split. Verified locally: 157 sessions (claude + codex) -> 432 train chunks (~212K tok), held-out split by session. Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai Bot reviewed Jun 5, 2026

View reviewed changes

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026

docs: refresh auto-integration manifest for PR Luce-Org#345

6f81085

cubic-dev-ai Bot reviewed Jun 5, 2026

View reviewed changes

Davide Cifarelli and others added 2 commits June 5, 2026 18:45

	uv sync # tokenizers (+ gguf/torch optional extras)
	uv sync --all-extras # tokenizers (+ gguf/torch optional extras)

	if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id \|\| next_tok == w_.eos_chat_id)) break;
	if (!([]{ const char * v = std::getenv("DFLASH_IGNORE_EOS"); return v && std::atoi(v) != 0; })() && (next_tok == w_.eos_id \|\| next_tok == w_.eos_chat_id)) break;

Conversation

davide221 commented Jun 5, 2026

What Spark is

Results (RTX 3090, Laguna-XS.2 Q4_K_M 33B-total, held-out Claude Code sessions)

Scope of this PR

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

cubic-dev-ai Bot left a comment •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading