Skip to content

feat(optimizations): Luce Spark — calibrated hot/cold expert residency#345

Open
davide221 wants to merge 6 commits into
mainfrom
feat/optimizations-spark
Open

feat(optimizations): Luce Spark — calibrated hot/cold expert residency#345
davide221 wants to merge 6 commits into
mainfrom
feat/optimizations-spark

Conversation

@davide221
Copy link
Copy Markdown
Contributor

Adds optimizations/spark/, the placement + caching product layer on top of the merged hot/cold MoE offload engine (server/src/common/moe_hybrid_*). Mirrors optimizations/pflash/: standalone tooling + docs here, engine code stays in server/.

What Spark is

A 33B-total MoE only fires ~8 of 256 experts/token, but naive hot/cold offload still hits the CPU tier ~36% of the time if it keeps the wrong experts resident. Spark:

  1. Calibrates the hot set from the traffic you actually serve (replays real agent sessions, accumulates per-(layer,expert) routing → placement profile).
  2. Caches the per-session tail in a bounded ring of spare GPU slots (LRU swap), driving cold-misses to ~0 in fixed VRAM.
  3. Pre-gate (research): captures (hidden → experts) traces + trains a predictor; documents why fusion needs a model fine-tune, not a fitted predictor.

Results (RTX 3090, Laguna-XS.2 Q4_K_M 33B-total, held-out Claude Code sessions)

Config tok/s % all-GPU cold-hit VRAM
All-GPU 111 100% 18.8 GiB
Uniform 60% 66 59% 36% 10.6 GiB
Spark calibrated 60% 81 73% 6.6% 10.6 GiB
Spark + cache (32 slots) 85–88 ~79% ~0% 14.6 GiB

Calibration: 333 chunks / ~171K tokens from real sessions, split by session (no leakage). Full tables in RESULTS.md.

Scope of this PR

  • In: standalone tooling (extract_sessions, tokenizer, calibrate, validate, research train_pregate) + README + RESULTS + hero. Nothing here compiles into the daemon.
  • Not in: the engine-side Spark additions (bounded cache / gpu_remap / trace hook). Those land separately in server/. The calibration path runs on merged main via DFLASH_LAGUNA_HOTNESS / NEXT_PLACEMENT_OUT; the cache flags are documented for when the engine PR lands.

Built on the merged hot/cold offload engine (#289 et al). Honest negative result included: a fitted pre-gate caps at ~53% recall@8 (pre/post-attention information gap), so full all-GPU speed needs a gate fine-tune, not a predictor.

🧙 Built with WOZCODE

…ency)

Spark is the placement + caching layer on top of the merged hot/cold MoE
offload engine. Standalone tooling to calibrate expert placement from real
agent traffic, validate on held-out sessions, and (research) train a pre-gate
predictor from routing traces.

Laguna-XS.2 Q4_K_M (33B total MoE) / RTX 3090, held-out Claude Code sessions:
calibration lifts naive offload 66 -> 81 tok/s (cold-hit 36% -> 6.6%); the
bounded expert cache reaches ~88 tok/s at ~0 cold, 14.6 GiB peak (vs 18.8 GiB
to hold the full model). Engine (cache, gpu_remap, trace) lives in server/.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9 issues found across 10 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="optimizations/spark/spark/tokenizer.py">

<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>

<file name="optimizations/spark/spark/extract_sessions.py">

<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>

<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>

<file name="optimizations/spark/README.md">

<violation number="1" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
</file>

<file name="optimizations/spark/spark/validate.py">

<violation number="1" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>
</file>

<file name="optimizations/spark/spark/train_pregate.py">

<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) [FEEDBACK_USED]</violation>

<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>

Tip: cubic used a learning from your PR history. Let your coding agent read cubic learnings directly with the cubic MCP.

Re-trigger cubic

Comment thread optimizations/spark/spark/calibrate.py Outdated
Comment thread optimizations/spark/spark/validate.py Outdated
def list_str(fl):
return [bytes(fl.parts[d]).decode("utf-8", errors="replace") for d in fl.data]

model = bytes(f["tokenizer.ggml.model"].parts[f["tokenizer.ggml.model"].data[0]]).decode()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Unguarded GGUF field access on f["tokenizer.ggml.model"] and f["tokenizer.ggml.tokens"]. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/tokenizer.py, line 24:

<comment>Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</comment>

<file context>
@@ -0,0 +1,56 @@
+    def list_str(fl):
+        return [bytes(fl.parts[d]).decode("utf-8", errors="replace") for d in fl.data]
+
+    model = bytes(f["tokenizer.ggml.model"].parts[f["tokenizer.ggml.model"].data[0]]).decode()
+    if model != "gpt2":
+        raise SystemExit(f"only gpt2 byte-level BPE supported, gguf says model={model!r}")
</file context>

train, test = [], []
for f in files:
# split by session-path hash so a whole session is train xor test
bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: hashlib.md5() without usedforsecurity=False will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/extract_sessions.py, line 79:

<comment>`hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</comment>

<file context>
@@ -0,0 +1,115 @@
+    train, test = [], []
+    for f in files:
+        # split by session-path hash so a whole session is train xor test
+        bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train
+        sess = []
+        try:
</file context>


```bash
cd optimizations/spark
uv sync # tokenizers (+ gguf/torch optional extras)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: uv sync does not install optional dependencies (gguf, numpy) required by the tokenizer step. The comment on the uv sync line is misleading — it lists gguf/torch optional extras but the command does not install them. Step 0 (python -m spark.tokenizer) will fail with ImportError because gguf and numpy are in the tokenizer extra, not the base dependency set.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/README.md, line 104:

<comment>`uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</comment>

<file context>
@@ -0,0 +1,187 @@
+
+```bash
+cd optimizations/spark
+uv sync                                   # tokenizers (+ gguf/torch optional extras)
+
+# 0. one tokenizer, extracted from the GGUF (gpt2 byte-level BPE)
</file context>
Suggested change
uv sync # tokenizers (+ gguf/torch optional extras)
uv sync --all-extras # tokenizers (+ gguf/torch optional extras)

train, test = [], []
for f in files:
# split by session-path hash so a whole session is train xor test
bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: --test-frac is not validated, so --test-frac 0 crashes with ZeroDivisionError during split assignment.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/extract_sessions.py, line 79:

<comment>`--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</comment>

<file context>
@@ -0,0 +1,115 @@
+    train, test = [], []
+    for f in files:
+        # split by session-path hash so a whole session is train xor test
+        bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train
+        sess = []
+        try:
</file context>

proc.kill()
raise SystemExit("ready timeout")

pp = Path("/tmp/spark_val_chunk.bin")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Using fixed /tmp filenames causes cross-process collisions when multiple validation runs execute concurrently.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 95:

<comment>Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</comment>

<file context>
@@ -0,0 +1,142 @@
+            proc.kill()
+            raise SystemExit("ready timeout")
+
+    pp = Path("/tmp/spark_val_chunk.bin")
+    op = Path("/tmp/spark_val_out.bin")
+    toks = []
</file context>

import torch.nn as nn
H, E = args.n_embd, args.n_expert
dt = np.dtype([("layer", "<i2"), ("nsel", "<i2"), ("sel", "<i4", (8,)), ("hid", "<f4", (H,))])
arr = np.fromfile(args.trace, dtype=dt)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before np.fromfile, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.)

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/train_pregate.py, line 42:

<comment>The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) </comment>

<file context>
@@ -0,0 +1,85 @@
+    import torch.nn as nn
+    H, E = args.n_embd, args.n_expert
+    dt = np.dtype([("layer", "<i2"), ("nsel", "<i2"), ("sel", "<i4", (8,)), ("hid", "<f4", (H,))])
+    arr = np.fromfile(args.trace, dtype=dt)
+    print(f"records={len(arr)}", flush=True)
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
</file context>

dev = "cuda" if torch.cuda.is_available() else "cpu"

agg = {8: [], 16: [], 24: []}
for L in range(1, args.n_layer):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: The per-layer training loop skips layer 0 (range(1, args.n_layer)), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/train_pregate.py, line 47:

<comment>The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</comment>

<file context>
@@ -0,0 +1,85 @@
+    dev = "cuda" if torch.cuda.is_available() else "cpu"
+
+    agg = {8: [], 16: [], 24: []}
+    for L in range(1, args.n_layer):
+        idx = np.where(arr["layer"] == L)[0][:args.max_per_layer]
+        if len(idx) < 400:
</file context>

A blocking proc.stdout.readline() returns only on a new line or EOF, so the
time-based timeout check ran *after* it: a daemon that goes silent (stalled
model load, deadlock, disk stall) blocked the caller forever and the timeout
never fired. This broke unattended calibration/validation runs.

Move the protocol into spark/_daemon.py: a reader thread pumps stdout into a
queue and callers wait with queue.get(timeout=...), so a read can never outlive
its deadline. wait_ready() and the per-chunk generate loops in calibrate.py and
validate.py now honor --ready-timeout / --gen-timeout and kill the daemon on
stall. (select() + buffered readline would miss buffered data; the reader thread
does the blocking readline itself.)

Verified: silent daemon times out in ~1s, dead daemon returns immediately, happy
path unchanged.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="optimizations/spark/spark/validate.py">

<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>

<violation number="2" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>

<file name="optimizations/spark/spark/calibrate.py">

<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>

<file name="optimizations/spark/spark/_daemon.py">

<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>

<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) [FEEDBACK_USED].</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

if len(ids) < 8:
continue
write_counted_i32(pp, ids)
reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 91:

<comment>daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</comment>

<file context>
@@ -100,30 +88,16 @@ def drain():
-                break
-        if okl and okl.startswith("ok "):
-            m = re.search(r"decode_tok_s=([0-9.]+)", okl)
+        reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)
+        if reply is None:
+            print("daemon stalled/closed; stopping early")
</file context>

if toks:
print(f"decode tok/s: mean={statistics.mean(toks):.1f} median={statistics.median(toks):.1f} "
f"over {len(toks)} chunks")
colds = [float(m.group(1)) for l in daemon.stderr_lines
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 109:

<comment>daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</comment>

<file context>
@@ -132,7 +106,7 @@ def drain():
         print(f"decode tok/s: mean={statistics.mean(toks):.1f} median={statistics.median(toks):.1f} "
               f"over {len(toks)} chunks")
-    colds = [float(m.group(1)) for l in prof
+    colds = [float(m.group(1)) for l in daemon.stderr_lines
              for m in [re.search(r"cold_experts/tok=([0-9.]+)", l)] if m]
     if colds:
</file context>

if len(ids) < 8:
continue
write_counted_i32(pp, ids)
reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/calibrate.py, line 78:

<comment>Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</comment>

<file context>
@@ -84,38 +75,18 @@ def main():
-            if time.time() - ts > 120:
-                break
-        if not ok:
+        reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)
+        if reply is None:
+            print(f"[calib] daemon stalled/closed at chunk {i}; stopping", flush=True)
</file context>

self._q.put(None) # EOF sentinel

def _pump_stderr(self):
for line in self.proc.stderr:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: _pump_stderr silently crashes on encoding errors — same UnicodeDecodeError risk as stdout, causing data loss for callers that read stderr_lines (e.g. validate.py extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/_daemon.py, line 44:

<comment>`_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</comment>

<file context>
@@ -0,0 +1,106 @@
+        self._q.put(None)  # EOF sentinel
+
+    def _pump_stderr(self):
+        for line in self.proc.stderr:
+            self.stderr_lines.append(line.rstrip())
+
</file context>

line = self.readline(deadline - time.time())
except DaemonTimeout:
self.kill()
raise SystemExit(f"daemon did not become ready within {timeout}s (no output)")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) .

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/_daemon.py, line 66:

<comment>Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) .</comment>

<file context>
@@ -0,0 +1,106 @@
+                line = self.readline(deadline - time.time())
+            except DaemonTimeout:
+                self.kill()
+                raise SystemExit(f"daemon did not become ready within {timeout}s (no output)")
+            if line is None:
+                raise SystemExit("daemon died before ready banner")
</file context>

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026
Server side of Luce Spark, on top of the merged hybrid-offload engine:

- Bounded GPU expert cache: spare slots over-allocated on the hot expert stack
  + moe_hybrid_cache_swap_in (LRU). On a cold hit the expert is swapped into a
  spare slot and served through the unified GPU FFN, so cold-misses fall to ~0
  in fixed VRAM after warmup. Gated by DFLASH_LAGUNA_EXPERT_CACHE / _CACHE_SLOTS
  / _GPU_REMAP.
- Swap-rebuild fix + clean profile flush: build_hybrid_storage_from_file()
  re-reads experts from the GGUF mmap (partial-load keeps no full expert tensors
  resident), used by both init and post-request swap (was asserting in
  ggml_backend_tensor_get). routing_stats_ now also allocates when swap is
  enabled, and the placement profile flushes after each hybrid generate so
  NEXT_PLACEMENT_OUT works without the swap path.
- Pre-gate trace capture (DFLASH_LAGUNA_PREGATE_TRACE), profile-gated, feeds the
  optimizations/spark pregate trainer.

Laguna-XS.2 Q4_K_M / RTX 3090, calibrated 60% + 32 cache slots: ~85-88 tok/s
(cold ~0) at 14.6 GiB peak, vs 66 uniform / 111 all-GPU @18.8 GiB. Built +
smoke-tested against origin/main (3-way merge over the generate_impl rename).

Co-Authored-By: WOZCODE <contact@withwoz.com>
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 6 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/moe_hybrid_storage.cpp">

<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>

<violation number="2" location="server/src/common/moe_hybrid_storage.cpp:507">
P0: `moe_hybrid_cache_swap_in` only handles non-fused gate/up/down tensors. For fused gate-up architectures (Laguna), `gate_hot`/`up_hot` are nullptr so the function always returns -1, making the bounded cache a complete no-op for the very architecture it targets.</violation>
</file>

<file name="server/src/common/moe_hybrid_ffn_eval.h">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>

<file name="server/src/laguna/laguna_backend.cpp">

<violation number="1" location="server/src/laguna/laguna_backend.cpp:169">
P3: Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</violation>

<violation number="2" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread server/src/common/moe_hybrid_storage.cpp Outdated
const int cold_count = (int)dst.cold_expert_ids.size();
const int spare = (cold_count > 0 && cache_slots > 0)
? std::min(cache_slots, cold_count) : 0;
const int hot_alloc = hot_count + spare;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Hot tensor allocation still depends on hot_count > 0, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_storage.cpp, line 377:

<comment>Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</comment>

<file context>
@@ -371,6 +372,13 @@ bool build_moe_hybrid_storage_from_file(
         const int cold_count = (int)dst.cold_expert_ids.size();
+        const int spare = (cold_count > 0 && cache_slots > 0)
+                          ? std::min(cache_slots, cold_count) : 0;
+        const int hot_alloc = hot_count + spare;
+        dst.hot_active  = hot_count;
+        dst.cache_slots = spare;
</file context>

int n_hot);
int n_hot,
bool gpu_remap = false,
int n_expert = 0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: int n_expert = 0 default silently disables remapping when gpu_remap=true is set without n_expert. The n_expert > 0 guard in the implementation (moe_hybrid_ffn_eval.cpp:440) prevents a crash but silently produces a non-remapped graph, making gpu_remap=true a non-binding hint rather than a semantic guarantee. Future callers adding /*gpu_remap=*/true without also passing cfg.n_expert will get incorrect (non-remapped) behavior with no warning.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_ffn_eval.h, line 175:

<comment>`int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</comment>

<file context>
@@ -170,7 +170,9 @@ bool build_cached_hot_graph(
-    int n_hot);
+    int n_hot,
+    bool gpu_remap = false,
+    int n_expert = 0);
 
 // Build/rebuild cached cold FFN graph.
</file context>

for (int s = 0; s < req.n_gen; ++s) {
maybe_force_close(next_tok, s);
if (next_tok == w_.eos_id || next_tok == w_.eos_chat_id) break;
if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: DFLASH_IGNORE_EOS is interpreted by presence instead of value, so setting it to 0 still disables EOS stopping.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 297:

<comment>`DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</comment>

<file context>
@@ -283,7 +294,7 @@ GenerateResult LagunaBackend::generate_impl(const GenerateRequest & req,
     for (int s = 0; s < req.n_gen; ++s) {
         maybe_force_close(next_tok, s);
-        if (next_tok == w_.eos_id || next_tok == w_.eos_chat_id) break;
+        if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
         result.tokens.push_back(next_tok);
         history.push_back(next_tok);
</file context>
Suggested change
if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
if (!([]{ const char * v = std::getenv("DFLASH_IGNORE_EOS"); return v && std::atoi(v) != 0; })() && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;

if (result.ok) maybe_post_request_swap();
if (result.ok) {
// Flush routing-frequency profile if requested (independent of swap).
if (!routing_stats_out_path_.empty() && routing_stats_) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Routing stats CSV saved twice per request when swaps occur — once in generate_impl and again in maybe_post_request_swap, both writing identical data to the same file.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 169:

<comment>Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</comment>

<file context>
@@ -163,7 +164,17 @@ GenerateResult LagunaBackend::generate_impl(const GenerateRequest & req,
-        if (result.ok) maybe_post_request_swap();
+        if (result.ok) {
+            // Flush routing-frequency profile if requested (independent of swap).
+            if (!routing_stats_out_path_.empty() && routing_stats_) {
+                std::string serr;
+                if (!routing_stats_->save_csv(routing_stats_out_path_, &serr))
</file context>

dflash_server --spark: one self-tuning command for both MoE backends.
- Enables the bounded expert cache (auto working set) with sized slots
  (--spark-slots, default 32).
- Auto-loads <model>.gguf.spark.csv if present and keeps persisting it after each
  request from live routing (laguna via NEXT_PLACEMENT_OUT, qwen35moe via
  RUNTIME_STATS_OUT, which is the var that allocates its routing accumulator).
- Wires the expert cache into qwen35moe: cache_slots into its from-file storage
  + moe_hybrid_cache_swap_in before both host-partition loops in the pipelined
  decode (symmetric to laguna; swaps are auto-picked-up by the hot_local_by_global
  lookup, no graph surgery).

Verified end-to-end on RTX 3090: laguna and Qwen3.6-35B-A3B both write + reload
the profile (source=hotness:...) and stay coherent under forced offload + cache,
no crash. dflash_server needs libcurl dev headers (find_package(CURL)). Built
against origin/main.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="optimizations/spark/spark/tokenizer.py">

<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>

<file name="optimizations/spark/spark/extract_sessions.py">

<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>

<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>

<file name="optimizations/spark/README.md">

<violation number="1" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
</file>

<file name="optimizations/spark/spark/validate.py">

<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>

<violation number="2" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>

<violation number="3" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>

<file name="optimizations/spark/spark/train_pregate.py">

<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.

(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) [FEEDBACK_USED]</violation>

<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>

<file name="optimizations/spark/spark/calibrate.py">

<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>

<file name="optimizations/spark/spark/_daemon.py">

<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>

<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.

(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) [FEEDBACK_USED].</violation>
</file>

<file name="server/src/common/moe_hybrid_storage.cpp">

<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>
</file>

<file name="server/src/common/moe_hybrid_ffn_eval.h">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>

<file name="server/src/laguna/laguna_backend.cpp">

<violation number="1" location="server/src/laguna/laguna_backend.cpp:169">
P3: Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</violation>

<violation number="2" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
</file>

<file name="server/src/server/server_main.cpp">

<violation number="1" location="server/src/server/server_main.cpp:644">
P2: Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

const std::string pfx = is_laguna ? "DFLASH_LAGUNA_" : "DFLASH_QWEN35MOE_";
const std::string profile = std::string(bargs.model_path) + ".spark.csv";
std::FILE * pf = std::fopen(profile.c_str(), "rb");
const bool have_profile = (pf != nullptr);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Stale profile sidecar not validated against the model file. The .spark.csv profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/server_main.cpp, line 644:

<comment>Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</comment>

<file context>
@@ -623,6 +629,43 @@ int main(int argc, char ** argv) {
+            const std::string pfx = is_laguna ? "DFLASH_LAGUNA_" : "DFLASH_QWEN35MOE_";
+            const std::string profile = std::string(bargs.model_path) + ".spark.csv";
+            std::FILE * pf = std::fopen(profile.c_str(), "rb");
+            const bool have_profile = (pf != nullptr);
+            if (pf) std::fclose(pf);
+            const std::string slots = std::to_string(spark_slots);
</file context>

Davide Cifarelli and others added 2 commits June 5, 2026 18:45
moe_hybrid_cache_swap_in only validated/copied the separate gate/up/down
tensors, so for a fused-gate-up MoE (gate_up_hot populated, gate_hot/up_hot
null, a layout the storage supports via fused_gate_up) it returned -1 and the
bounded cache silently no-op'd. Branch on fused_gate_up: copy gate_up + down
when fused, gate + up + down otherwise.

Current targets (laguna-xs2, Qwen3.6-35B-A3B) use SEPARATE tensors, so no
behavior change for them (laguna cache verified unchanged: cold 38.5 -> 1.9 /
77 -> 85 tok/s at 32 slots). Fixes the latent no-op for fused-gate-up archs.

Co-Authored-By: WOZCODE <contact@withwoz.com>
extract_sessions.py now pulls both Claude Code (~/.claude/projects) and Codex
(~/.codex/sessions/**/rollout-*.jsonl) by default (--source claude|codex|both).
Codex rollouts are parsed from response_item user+assistant content blocks
(input_text/output_text/text), skipping the developer/system boilerplate so the
corpus reflects real traffic, not instructions. Both sources merge into one
corpus with the same per-session train/held-out split.

Verified locally: 157 sessions (claude + codex) -> 432 train chunks (~212K tok),
held-out split by session.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant