feat(optimizations): Luce Spark — calibrated hot/cold expert residency#345
feat(optimizations): Luce Spark — calibrated hot/cold expert residency#345davide221 wants to merge 6 commits into
Conversation
…ency) Spark is the placement + caching layer on top of the merged hot/cold MoE offload engine. Standalone tooling to calibrate expert placement from real agent traffic, validate on held-out sessions, and (research) train a pre-gate predictor from routing traces. Laguna-XS.2 Q4_K_M (33B total MoE) / RTX 3090, held-out Claude Code sessions: calibration lifts naive offload 66 -> 81 tok/s (cold-hit 36% -> 6.6%); the bounded expert cache reaches ~88 tok/s at ~0 cold, 14.6 GiB peak (vs 18.8 GiB to hold the full model). Engine (cache, gpu_remap, trace) lives in server/. Co-Authored-By: WOZCODE <contact@withwoz.com>
There was a problem hiding this comment.
9 issues found across 10 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="optimizations/spark/spark/tokenizer.py">
<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>
<file name="optimizations/spark/spark/extract_sessions.py">
<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>
<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>
<file name="optimizations/spark/README.md">
<violation number="1" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
</file>
<file name="optimizations/spark/spark/validate.py">
<violation number="1" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>
</file>
<file name="optimizations/spark/spark/train_pregate.py">
<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.
(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) [FEEDBACK_USED]</violation>
<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>
Tip: cubic used a learning from your PR history. Let your coding agent read cubic learnings directly with the cubic MCP.
Re-trigger cubic
| def list_str(fl): | ||
| return [bytes(fl.parts[d]).decode("utf-8", errors="replace") for d in fl.data] | ||
|
|
||
| model = bytes(f["tokenizer.ggml.model"].parts[f["tokenizer.ggml.model"].data[0]]).decode() |
There was a problem hiding this comment.
P2: Unguarded GGUF field access on f["tokenizer.ggml.model"] and f["tokenizer.ggml.tokens"]. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/tokenizer.py, line 24:
<comment>Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</comment>
<file context>
@@ -0,0 +1,56 @@
+ def list_str(fl):
+ return [bytes(fl.parts[d]).decode("utf-8", errors="replace") for d in fl.data]
+
+ model = bytes(f["tokenizer.ggml.model"].parts[f["tokenizer.ggml.model"].data[0]]).decode()
+ if model != "gpt2":
+ raise SystemExit(f"only gpt2 byte-level BPE supported, gguf says model={model!r}")
</file context>
| train, test = [], [] | ||
| for f in files: | ||
| # split by session-path hash so a whole session is train xor test | ||
| bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train |
There was a problem hiding this comment.
P2: hashlib.md5() without usedforsecurity=False will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/extract_sessions.py, line 79:
<comment>`hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</comment>
<file context>
@@ -0,0 +1,115 @@
+ train, test = [], []
+ for f in files:
+ # split by session-path hash so a whole session is train xor test
+ bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train
+ sess = []
+ try:
</file context>
|
|
||
| ```bash | ||
| cd optimizations/spark | ||
| uv sync # tokenizers (+ gguf/torch optional extras) |
There was a problem hiding this comment.
P2: uv sync does not install optional dependencies (gguf, numpy) required by the tokenizer step. The comment on the uv sync line is misleading — it lists gguf/torch optional extras but the command does not install them. Step 0 (python -m spark.tokenizer) will fail with ImportError because gguf and numpy are in the tokenizer extra, not the base dependency set.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/README.md, line 104:
<comment>`uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</comment>
<file context>
@@ -0,0 +1,187 @@
+
+```bash
+cd optimizations/spark
+uv sync # tokenizers (+ gguf/torch optional extras)
+
+# 0. one tokenizer, extracted from the GGUF (gpt2 byte-level BPE)
</file context>
| uv sync # tokenizers (+ gguf/torch optional extras) | |
| uv sync --all-extras # tokenizers (+ gguf/torch optional extras) |
| train, test = [], [] | ||
| for f in files: | ||
| # split by session-path hash so a whole session is train xor test | ||
| bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train |
There was a problem hiding this comment.
P2: --test-frac is not validated, so --test-frac 0 crashes with ZeroDivisionError during split assignment.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/extract_sessions.py, line 79:
<comment>`--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</comment>
<file context>
@@ -0,0 +1,115 @@
+ train, test = [], []
+ for f in files:
+ # split by session-path hash so a whole session is train xor test
+ bucket = test if int(hashlib.md5(str(f).encode()).hexdigest(), 16) % args.test_frac == 0 else train
+ sess = []
+ try:
</file context>
| proc.kill() | ||
| raise SystemExit("ready timeout") | ||
|
|
||
| pp = Path("/tmp/spark_val_chunk.bin") |
There was a problem hiding this comment.
P2: Using fixed /tmp filenames causes cross-process collisions when multiple validation runs execute concurrently.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 95:
<comment>Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</comment>
<file context>
@@ -0,0 +1,142 @@
+ proc.kill()
+ raise SystemExit("ready timeout")
+
+ pp = Path("/tmp/spark_val_chunk.bin")
+ op = Path("/tmp/spark_val_out.bin")
+ toks = []
</file context>
| import torch.nn as nn | ||
| H, E = args.n_embd, args.n_expert | ||
| dt = np.dtype([("layer", "<i2"), ("nsel", "<i2"), ("sel", "<i4", (8,)), ("hid", "<f4", (H,))]) | ||
| arr = np.fromfile(args.trace, dtype=dt) |
There was a problem hiding this comment.
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before np.fromfile, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.
(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/train_pregate.py, line 42:
<comment>The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.
(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) </comment>
<file context>
@@ -0,0 +1,85 @@
+ import torch.nn as nn
+ H, E = args.n_embd, args.n_expert
+ dt = np.dtype([("layer", "<i2"), ("nsel", "<i2"), ("sel", "<i4", (8,)), ("hid", "<f4", (H,))])
+ arr = np.fromfile(args.trace, dtype=dt)
+ print(f"records={len(arr)}", flush=True)
+ dev = "cuda" if torch.cuda.is_available() else "cpu"
</file context>
| dev = "cuda" if torch.cuda.is_available() else "cpu" | ||
|
|
||
| agg = {8: [], 16: [], 24: []} | ||
| for L in range(1, args.n_layer): |
There was a problem hiding this comment.
P3: The per-layer training loop skips layer 0 (range(1, args.n_layer)), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/train_pregate.py, line 47:
<comment>The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</comment>
<file context>
@@ -0,0 +1,85 @@
+ dev = "cuda" if torch.cuda.is_available() else "cpu"
+
+ agg = {8: [], 16: [], 24: []}
+ for L in range(1, args.n_layer):
+ idx = np.where(arr["layer"] == L)[0][:args.max_per_layer]
+ if len(idx) < 400:
</file context>
A blocking proc.stdout.readline() returns only on a new line or EOF, so the time-based timeout check ran *after* it: a daemon that goes silent (stalled model load, deadlock, disk stall) blocked the caller forever and the timeout never fired. This broke unattended calibration/validation runs. Move the protocol into spark/_daemon.py: a reader thread pumps stdout into a queue and callers wait with queue.get(timeout=...), so a read can never outlive its deadline. wait_ready() and the per-chunk generate loops in calibrate.py and validate.py now honor --ready-timeout / --gen-timeout and kill the daemon on stall. (select() + buffered readline would miss buffered data; the reader thread does the blocking readline itself.) Verified: silent daemon times out in ~1s, dead daemon returns immediately, happy path unchanged. Co-Authored-By: WOZCODE <contact@withwoz.com>
There was a problem hiding this comment.
5 issues found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="optimizations/spark/spark/validate.py">
<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>
<violation number="2" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>
<file name="optimizations/spark/spark/calibrate.py">
<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>
<file name="optimizations/spark/spark/_daemon.py">
<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>
<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.
(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) [FEEDBACK_USED].</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| if len(ids) < 8: | ||
| continue | ||
| write_counted_i32(pp, ids) | ||
| reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout) |
There was a problem hiding this comment.
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 91:
<comment>daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</comment>
<file context>
@@ -100,30 +88,16 @@ def drain():
- break
- if okl and okl.startswith("ok "):
- m = re.search(r"decode_tok_s=([0-9.]+)", okl)
+ reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)
+ if reply is None:
+ print("daemon stalled/closed; stopping early")
</file context>
| if toks: | ||
| print(f"decode tok/s: mean={statistics.mean(toks):.1f} median={statistics.median(toks):.1f} " | ||
| f"over {len(toks)} chunks") | ||
| colds = [float(m.group(1)) for l in daemon.stderr_lines |
There was a problem hiding this comment.
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/validate.py, line 109:
<comment>daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</comment>
<file context>
@@ -132,7 +106,7 @@ def drain():
print(f"decode tok/s: mean={statistics.mean(toks):.1f} median={statistics.median(toks):.1f} "
f"over {len(toks)} chunks")
- colds = [float(m.group(1)) for l in prof
+ colds = [float(m.group(1)) for l in daemon.stderr_lines
for m in [re.search(r"cold_experts/tok=([0-9.]+)", l)] if m]
if colds:
</file context>
| if len(ids) < 8: | ||
| continue | ||
| write_counted_i32(pp, ids) | ||
| reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout) |
There was a problem hiding this comment.
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/calibrate.py, line 78:
<comment>Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</comment>
<file context>
@@ -84,38 +75,18 @@ def main():
- if time.time() - ts > 120:
- break
- if not ok:
+ reply = daemon.request(f"generate {pp} {args.n_gen} {op}", timeout=args.gen_timeout)
+ if reply is None:
+ print(f"[calib] daemon stalled/closed at chunk {i}; stopping", flush=True)
</file context>
| self._q.put(None) # EOF sentinel | ||
|
|
||
| def _pump_stderr(self): | ||
| for line in self.proc.stderr: |
There was a problem hiding this comment.
P2: _pump_stderr silently crashes on encoding errors — same UnicodeDecodeError risk as stdout, causing data loss for callers that read stderr_lines (e.g. validate.py extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/_daemon.py, line 44:
<comment>`_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</comment>
<file context>
@@ -0,0 +1,106 @@
+ self._q.put(None) # EOF sentinel
+
+ def _pump_stderr(self):
+ for line in self.proc.stderr:
+ self.stderr_lines.append(line.rstrip())
+
</file context>
| line = self.readline(deadline - time.time()) | ||
| except DaemonTimeout: | ||
| self.kill() | ||
| raise SystemExit(f"daemon did not become ready within {timeout}s (no output)") |
There was a problem hiding this comment.
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.
(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) .
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At optimizations/spark/spark/_daemon.py, line 66:
<comment>Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.
(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) .</comment>
<file context>
@@ -0,0 +1,106 @@
+ line = self.readline(deadline - time.time())
+ except DaemonTimeout:
+ self.kill()
+ raise SystemExit(f"daemon did not become ready within {timeout}s (no output)")
+ if line is None:
+ raise SystemExit("daemon died before ready banner")
</file context>
Server side of Luce Spark, on top of the merged hybrid-offload engine: - Bounded GPU expert cache: spare slots over-allocated on the hot expert stack + moe_hybrid_cache_swap_in (LRU). On a cold hit the expert is swapped into a spare slot and served through the unified GPU FFN, so cold-misses fall to ~0 in fixed VRAM after warmup. Gated by DFLASH_LAGUNA_EXPERT_CACHE / _CACHE_SLOTS / _GPU_REMAP. - Swap-rebuild fix + clean profile flush: build_hybrid_storage_from_file() re-reads experts from the GGUF mmap (partial-load keeps no full expert tensors resident), used by both init and post-request swap (was asserting in ggml_backend_tensor_get). routing_stats_ now also allocates when swap is enabled, and the placement profile flushes after each hybrid generate so NEXT_PLACEMENT_OUT works without the swap path. - Pre-gate trace capture (DFLASH_LAGUNA_PREGATE_TRACE), profile-gated, feeds the optimizations/spark pregate trainer. Laguna-XS.2 Q4_K_M / RTX 3090, calibrated 60% + 32 cache slots: ~85-88 tok/s (cold ~0) at 14.6 GiB peak, vs 66 uniform / 111 all-GPU @18.8 GiB. Built + smoke-tested against origin/main (3-way merge over the generate_impl rename). Co-Authored-By: WOZCODE <contact@withwoz.com>
There was a problem hiding this comment.
5 issues found across 6 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/moe_hybrid_storage.cpp">
<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>
<violation number="2" location="server/src/common/moe_hybrid_storage.cpp:507">
P0: `moe_hybrid_cache_swap_in` only handles non-fused gate/up/down tensors. For fused gate-up architectures (Laguna), `gate_hot`/`up_hot` are nullptr so the function always returns -1, making the bounded cache a complete no-op for the very architecture it targets.</violation>
</file>
<file name="server/src/common/moe_hybrid_ffn_eval.h">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>
<file name="server/src/laguna/laguna_backend.cpp">
<violation number="1" location="server/src/laguna/laguna_backend.cpp:169">
P3: Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</violation>
<violation number="2" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
| const int cold_count = (int)dst.cold_expert_ids.size(); | ||
| const int spare = (cold_count > 0 && cache_slots > 0) | ||
| ? std::min(cache_slots, cold_count) : 0; | ||
| const int hot_alloc = hot_count + spare; |
There was a problem hiding this comment.
P2: Hot tensor allocation still depends on hot_count > 0, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_storage.cpp, line 377:
<comment>Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</comment>
<file context>
@@ -371,6 +372,13 @@ bool build_moe_hybrid_storage_from_file(
const int cold_count = (int)dst.cold_expert_ids.size();
+ const int spare = (cold_count > 0 && cache_slots > 0)
+ ? std::min(cache_slots, cold_count) : 0;
+ const int hot_alloc = hot_count + spare;
+ dst.hot_active = hot_count;
+ dst.cache_slots = spare;
</file context>
| int n_hot); | ||
| int n_hot, | ||
| bool gpu_remap = false, | ||
| int n_expert = 0); |
There was a problem hiding this comment.
P2: int n_expert = 0 default silently disables remapping when gpu_remap=true is set without n_expert. The n_expert > 0 guard in the implementation (moe_hybrid_ffn_eval.cpp:440) prevents a crash but silently produces a non-remapped graph, making gpu_remap=true a non-binding hint rather than a semantic guarantee. Future callers adding /*gpu_remap=*/true without also passing cfg.n_expert will get incorrect (non-remapped) behavior with no warning.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_ffn_eval.h, line 175:
<comment>`int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</comment>
<file context>
@@ -170,7 +170,9 @@ bool build_cached_hot_graph(
- int n_hot);
+ int n_hot,
+ bool gpu_remap = false,
+ int n_expert = 0);
// Build/rebuild cached cold FFN graph.
</file context>
| for (int s = 0; s < req.n_gen; ++s) { | ||
| maybe_force_close(next_tok, s); | ||
| if (next_tok == w_.eos_id || next_tok == w_.eos_chat_id) break; | ||
| if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break; |
There was a problem hiding this comment.
P2: DFLASH_IGNORE_EOS is interpreted by presence instead of value, so setting it to 0 still disables EOS stopping.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 297:
<comment>`DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</comment>
<file context>
@@ -283,7 +294,7 @@ GenerateResult LagunaBackend::generate_impl(const GenerateRequest & req,
for (int s = 0; s < req.n_gen; ++s) {
maybe_force_close(next_tok, s);
- if (next_tok == w_.eos_id || next_tok == w_.eos_chat_id) break;
+ if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break;
result.tokens.push_back(next_tok);
history.push_back(next_tok);
</file context>
| if (!std::getenv("DFLASH_IGNORE_EOS") && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break; | |
| if (!([]{ const char * v = std::getenv("DFLASH_IGNORE_EOS"); return v && std::atoi(v) != 0; })() && (next_tok == w_.eos_id || next_tok == w_.eos_chat_id)) break; |
| if (result.ok) maybe_post_request_swap(); | ||
| if (result.ok) { | ||
| // Flush routing-frequency profile if requested (independent of swap). | ||
| if (!routing_stats_out_path_.empty() && routing_stats_) { |
There was a problem hiding this comment.
P3: Routing stats CSV saved twice per request when swaps occur — once in generate_impl and again in maybe_post_request_swap, both writing identical data to the same file.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_backend.cpp, line 169:
<comment>Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</comment>
<file context>
@@ -163,7 +164,17 @@ GenerateResult LagunaBackend::generate_impl(const GenerateRequest & req,
- if (result.ok) maybe_post_request_swap();
+ if (result.ok) {
+ // Flush routing-frequency profile if requested (independent of swap).
+ if (!routing_stats_out_path_.empty() && routing_stats_) {
+ std::string serr;
+ if (!routing_stats_->save_csv(routing_stats_out_path_, &serr))
</file context>
dflash_server --spark: one self-tuning command for both MoE backends. - Enables the bounded expert cache (auto working set) with sized slots (--spark-slots, default 32). - Auto-loads <model>.gguf.spark.csv if present and keeps persisting it after each request from live routing (laguna via NEXT_PLACEMENT_OUT, qwen35moe via RUNTIME_STATS_OUT, which is the var that allocates its routing accumulator). - Wires the expert cache into qwen35moe: cache_slots into its from-file storage + moe_hybrid_cache_swap_in before both host-partition loops in the pipelined decode (symmetric to laguna; swaps are auto-picked-up by the hot_local_by_global lookup, no graph surgery). Verified end-to-end on RTX 3090: laguna and Qwen3.6-35B-A3B both write + reload the profile (source=hotness:...) and stay coherent under forced offload + cache, no crash. dflash_server needs libcurl dev headers (find_package(CURL)). Built against origin/main. Co-Authored-By: WOZCODE <contact@withwoz.com>
There was a problem hiding this comment.
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="optimizations/spark/spark/tokenizer.py">
<violation number="1" location="optimizations/spark/spark/tokenizer.py:24">
P2: Unguarded GGUF field access on `f["tokenizer.ggml.model"]` and `f["tokenizer.ggml.tokens"]`. Missing keys or empty data arrays will raise unhandled KeyError/IndexError at runtime instead of producing a clear error message.</violation>
</file>
<file name="optimizations/spark/spark/extract_sessions.py">
<violation number="1" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `hashlib.md5()` without `usedforsecurity=False` will raise a ValueError on FIPS-enforcing systems, breaking the calibration pipeline.</violation>
<violation number="2" location="optimizations/spark/spark/extract_sessions.py:79">
P2: `--test-frac` is not validated, so `--test-frac 0` crashes with `ZeroDivisionError` during split assignment.</violation>
</file>
<file name="optimizations/spark/README.md">
<violation number="1" location="optimizations/spark/README.md:104">
P2: `uv sync` does not install optional dependencies (`gguf`, `numpy`) required by the tokenizer step. The comment on the `uv sync` line is misleading — it lists `gguf/torch optional extras` but the command does not install them. Step 0 (`python -m spark.tokenizer`) will fail with `ImportError` because `gguf` and `numpy` are in the `tokenizer` extra, not the base dependency set.</violation>
</file>
<file name="optimizations/spark/spark/validate.py">
<violation number="1" location="optimizations/spark/spark/validate.py:91">
P1: daemon.request timeout yields exit code 0, masking an incomplete validation run. Automation wrappers expecting GNU-timeout semantics (exit 124) or any non-zero code will see a false success.</violation>
<violation number="2" location="optimizations/spark/spark/validate.py:95">
P2: Using fixed `/tmp` filenames causes cross-process collisions when multiple validation runs execute concurrently.</violation>
<violation number="3" location="optimizations/spark/spark/validate.py:109">
P2: daemon.stderr_lines may miss the last few stderr lines because the pumping daemon thread is not joined before reading. The cold_experts/tok metric aggregation could slightly undercount.</violation>
</file>
<file name="optimizations/spark/spark/train_pregate.py">
<violation number="1" location="optimizations/spark/spark/train_pregate.py:42">
P2: The trace reader does not validate that file size is an exact multiple of the fixed record size before `np.fromfile`, so truncated/corrupt traces can be consumed as partial datasets instead of failing early.
(Based on your team's feedback about validating binary sidecar/input freshness using file-size checks.) [FEEDBACK_USED]</violation>
<violation number="2" location="optimizations/spark/spark/train_pregate.py:47">
P3: The per-layer training loop skips layer 0 (`range(1, args.n_layer)`), which likely omits one layer from the reported aggregate recall unless the trace is explicitly 1-indexed.</violation>
</file>
<file name="optimizations/spark/spark/calibrate.py">
<violation number="1" location="optimizations/spark/spark/calibrate.py:78">
P2: Gen-timeout silently exits with code 0, and ready-timeout exits with code 1 — neither produces exit code 124 for GNU timeout-compatible wrapper detection.</violation>
</file>
<file name="optimizations/spark/spark/_daemon.py">
<violation number="1" location="optimizations/spark/spark/_daemon.py:44">
P2: `_pump_stderr` silently crashes on encoding errors — same `UnicodeDecodeError` risk as stdout, causing data loss for callers that read `stderr_lines` (e.g. `validate.py` extracts cold_experts/tok metrics from stderr). Wrap in try/except to prevent thread death and keep accumulating lines.</violation>
<violation number="2" location="optimizations/spark/spark/_daemon.py:66">
P2: Ready-timeout exits with a generic status instead of timeout code 124, so wrappers cannot reliably detect and handle daemon load timeouts.
(Based on your team's feedback about preserving GNU-timeout-compatible timeout exit semantics.) [FEEDBACK_USED].</violation>
</file>
<file name="server/src/common/moe_hybrid_storage.cpp">
<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:377">
P2: Hot tensor allocation still depends on `hot_count > 0`, so layers with zero pinned-hot experts cannot use newly requested cache spare slots.</violation>
</file>
<file name="server/src/common/moe_hybrid_ffn_eval.h">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.h:175">
P2: `int n_expert = 0` default silently disables remapping when `gpu_remap=true` is set without `n_expert`. The `n_expert > 0` guard in the implementation (`moe_hybrid_ffn_eval.cpp:440`) prevents a crash but silently produces a non-remapped graph, making `gpu_remap=true` a non-binding hint rather than a semantic guarantee. Future callers adding `/*gpu_remap=*/true` without also passing `cfg.n_expert` will get incorrect (non-remapped) behavior with no warning.</violation>
</file>
<file name="server/src/laguna/laguna_backend.cpp">
<violation number="1" location="server/src/laguna/laguna_backend.cpp:169">
P3: Routing stats CSV saved twice per request when swaps occur — once in `generate_impl` and again in `maybe_post_request_swap`, both writing identical data to the same file.</violation>
<violation number="2" location="server/src/laguna/laguna_backend.cpp:297">
P2: `DFLASH_IGNORE_EOS` is interpreted by presence instead of value, so setting it to `0` still disables EOS stopping.</violation>
</file>
<file name="server/src/server/server_main.cpp">
<violation number="1" location="server/src/server/server_main.cpp:644">
P2: Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
| const std::string pfx = is_laguna ? "DFLASH_LAGUNA_" : "DFLASH_QWEN35MOE_"; | ||
| const std::string profile = std::string(bargs.model_path) + ".spark.csv"; | ||
| std::FILE * pf = std::fopen(profile.c_str(), "rb"); | ||
| const bool have_profile = (pf != nullptr); |
There was a problem hiding this comment.
P2: Stale profile sidecar not validated against the model file. The .spark.csv profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/server_main.cpp, line 644:
<comment>Stale profile sidecar not validated against the model file. The `.spark.csv` profile is trusted purely by existence — if the model GGUF is replaced at the same path, the old stale profile is loaded without warning.</comment>
<file context>
@@ -623,6 +629,43 @@ int main(int argc, char ** argv) {
+ const std::string pfx = is_laguna ? "DFLASH_LAGUNA_" : "DFLASH_QWEN35MOE_";
+ const std::string profile = std::string(bargs.model_path) + ".spark.csv";
+ std::FILE * pf = std::fopen(profile.c_str(), "rb");
+ const bool have_profile = (pf != nullptr);
+ if (pf) std::fclose(pf);
+ const std::string slots = std::to_string(spark_slots);
</file context>
moe_hybrid_cache_swap_in only validated/copied the separate gate/up/down tensors, so for a fused-gate-up MoE (gate_up_hot populated, gate_hot/up_hot null, a layout the storage supports via fused_gate_up) it returned -1 and the bounded cache silently no-op'd. Branch on fused_gate_up: copy gate_up + down when fused, gate + up + down otherwise. Current targets (laguna-xs2, Qwen3.6-35B-A3B) use SEPARATE tensors, so no behavior change for them (laguna cache verified unchanged: cold 38.5 -> 1.9 / 77 -> 85 tok/s at 32 slots). Fixes the latent no-op for fused-gate-up archs. Co-Authored-By: WOZCODE <contact@withwoz.com>
extract_sessions.py now pulls both Claude Code (~/.claude/projects) and Codex (~/.codex/sessions/**/rollout-*.jsonl) by default (--source claude|codex|both). Codex rollouts are parsed from response_item user+assistant content blocks (input_text/output_text/text), skipping the developer/system boilerplate so the corpus reflects real traffic, not instructions. Both sources merge into one corpus with the same per-session train/held-out split. Verified locally: 157 sessions (claude + codex) -> 432 train chunks (~212K tok), held-out split by session. Co-Authored-By: WOZCODE <contact@withwoz.com>
Adds
optimizations/spark/, the placement + caching product layer on top of the merged hot/cold MoE offload engine (server/src/common/moe_hybrid_*). Mirrorsoptimizations/pflash/: standalone tooling + docs here, engine code stays inserver/.What Spark is
A 33B-total MoE only fires ~8 of 256 experts/token, but naive hot/cold offload still hits the CPU tier ~36% of the time if it keeps the wrong experts resident. Spark:
(hidden → experts)traces + trains a predictor; documents why fusion needs a model fine-tune, not a fitted predictor.Results (RTX 3090, Laguna-XS.2 Q4_K_M 33B-total, held-out Claude Code sessions)
Calibration: 333 chunks / ~171K tokens from real sessions, split by session (no leakage). Full tables in
RESULTS.md.Scope of this PR
extract_sessions,tokenizer,calibrate,validate, researchtrain_pregate) +README+RESULTS+ hero. Nothing here compiles into the daemon.gpu_remap/ trace hook). Those land separately inserver/. The calibration path runs on merged main viaDFLASH_LAGUNA_HOTNESS/NEXT_PLACEMENT_OUT; the cache flags are documented for when the engine PR lands.Built on the merged hot/cold offload engine (#289 et al). Honest negative result included: a fitted pre-gate caps at ~53% recall@8 (pre/post-attention information gap), so full all-GPU speed needs a gate fine-tune, not a predictor.
🧙 Built with WOZCODE