feat(drafter): ee3 as production default (depends on #274) by dusterbloom · Pull Request #275 · Luce-Org/lucebox-hub

dusterbloom · 2026-05-24T17:19:13Z

Summary

Changes the recommended early-exit default from N=7 to N=3, based on an empirical N-sweep. Depends on PR #274 landing first (it introduces PFLASH_DRAFTER_EARLY_EXIT_N).

Source change

dflash/README.md — documents ee3 as the recommended default with reproduction instructions.

Evidence

Results not committed. Reproduce via:

N-sweep NIAH @ 32K/64K/128K (ee3=3/3 everywhere, 24.3× drafter speedup at 128K vs baseline): dflash/bench/run_ee_n_sweep.sh
5-client multi-client accept_rate (ee3 +1.2 pp vs ee7, all clients within ±2 pp): dflash/bench/run_ee_n_multiclient.sh
Sweep plan + rationale: bench/2026-05-25_ee_n_sweep/PLAN.md

Headline numbers (RTX 3090, Qwen3.6-27B-Q4_K_M, Qwen2.5-0.5B-BF16):

ee3 drafter speedup: 6.9× @ 32K, 24.3× @ 128K
accept_rate vs ee7: +1.2 pp mean (claude_code +0.0, hermes -0.4, opencode +1.7, pi +0.0, codex +4.6)
NIAH 3/3 at 32K, 64K, 128K
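
The per-client deltas can be checked against the stated mean with a few lines of Python. This is only an arithmetic sketch over the numbers quoted above, not part of the benchmark scripts; the `summarize` helper and its `band_pp` parameter are illustrative names. It shows that the mean rounds to +1.2 pp, and that codex (+4.6 pp) lies outside a symmetric ±2 pp band on the favorable side, so the "within ±2 pp" wording may be intended as a bound on regressions.

```python
# Per-client accept_rate deltas (ee3 minus ee7, in percentage points),
# copied from the headline numbers above.
deltas_pp = {
    "claude_code": 0.0,
    "hermes": -0.4,
    "opencode": 1.7,
    "pi": 0.0,
    "codex": 4.6,
}


def summarize(deltas, band_pp=2.0):
    """Return (mean delta, sorted clients whose |delta| exceeds band_pp)."""
    mean = sum(deltas.values()) / len(deltas)
    outside = sorted(name for name, d in deltas.items() if abs(d) > band_pp)
    return mean, outside


mean_pp, outside_band = summarize(deltas_pp)
print(f"mean delta: {mean_pp:+.2f} pp")
print(f"clients outside ±2 pp: {outside_band}")
```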

Dependency

Requires PR #274 to merge before this can land (env vars are introduced there).

Reviewer note

Bench results not committed; run the scripts above to regenerate.

…idation) Runner scripts: run_ee_n_sweep.sh, run_ee_n_sweep_niah.py, run_ee_n_multiclient.sh. Decision: ee3 drafter speedup 6.9x@32K, 24.3x@128K, accept_rate within ±2 pp of ee7. Bench results not committed; reproduce via the added scripts.

…s on Luce-Org#274) PFLASH_DRAFTER_EARLY_EXIT_N=3 PFLASH_DRAFTER_SCORE_LAYERS=3 is the production default after ee_n sweep: 6.9x@32K, 24.3x@128K, accept_rate +1.2 pp vs ee7. Reproduce via dflash/bench/run_ee_n_sweep.sh + run_ee_n_multiclient.sh.

cubic-dev-ai

3 issues found across 5 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/bench/run_ee_n_sweep_niah.py">

<violation number="1" location="dflash/bench/run_ee_n_sweep_niah.py:167">
P1: `ttft_s` measures total completion time, not time-to-first-token, because the request uses non-streaming mode (`"stream": False`)</violation>
</file>

<file name="dflash/bench/run_ee_n_multiclient.sh">

<violation number="1" location="dflash/bench/run_ee_n_multiclient.sh:12">
P2: Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark</violation>

<violation number="2" location="dflash/bench/run_ee_n_multiclient.sh:71">
P2: Server-log capture uses global `ls -t` latest rather than run-specific path, causing stale/mismatched server logs after failures.</violation>
</file>


cubic-dev-ai · 2026-05-24T17:27:30Z

+        t0 = time.perf_counter()
+        try:
+            r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=600)
+            result["ttft_s"] = time.perf_counter() - t0


P1: ttft_s measures total completion time, not time-to-first-token, because the request uses non-streaming mode ("stream": False)

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_sweep_niah.py, line 167: <comment>`ttft_s` measures total completion time, not time-to-first-token, because the request uses non-streaming mode (`"stream": False`)</comment> <file context> @@ -0,0 +1,296 @@ + t0 = time.perf_counter() + try: + r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=600) + result["ttft_s"] = time.perf_counter() - t0 + r.raise_for_status() + data = r.json() </file context>
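
One possible fix, sketched under assumptions: switch the request to streaming and stop the clock at the first chunk that carries content. The sketch assumes the server emits OpenAI-style server-sent events (`data: {...}` lines with `choices[0].delta.content`, ending in `data: [DONE]`), which the PR does not confirm. `measure_ttft` and `open_stream` are hypothetical names, and the clock is injectable so the logic can be exercised without a server.

```python
import json
import time
from typing import Callable, Iterable, Optional, Tuple


def _has_content(body: bytes) -> bool:
    """True if an SSE data payload carries non-empty delta content."""
    try:
        delta = json.loads(body)["choices"][0]["delta"]
    except (ValueError, KeyError, IndexError, TypeError):
        return False  # covers the "[DONE]" sentinel and malformed chunks
    return isinstance(delta, dict) and bool(delta.get("content"))


def measure_ttft(
    open_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (ttft_s, total_s); ttft_s is None if no content chunk arrives."""
    t0 = clock()  # start before the request is sent
    ttft = None
    for raw in open_stream():
        line = raw.strip()
        if not line.startswith(b"data:"):
            continue
        if ttft is None and _has_content(line[len(b"data:"):].strip()):
            ttft = clock() - t0
    return ttft, clock() - t0
```

With `requests`, `open_stream` could wrap `requests.post(url, json={**payload, "stream": True}, stream=True, timeout=600)` and return `response.iter_lines()`. Alternatively, keeping the non-streaming request and renaming the field to something like `total_s` would also remove the mismatch.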

cubic-dev-ai · 2026-05-24T17:27:30Z

+            || echo "FAIL: $name x $client (see $cond_dir/${client}.log)"
+
+        # Capture server log if the harness wrote one to the standard evidence dir.
+        latest_server_log=$(ls -t "$WORKTREE"/dflash/bench/results/*_adaptive_evidence/server.log 2>/dev/null | head -1 || true)


P2: Server-log capture uses global ls -t latest rather than run-specific path, causing stale/mismatched server logs after failures.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_multiclient.sh, line 71: <comment>Server-log capture uses global `ls -t` latest rather than run-specific path, causing stale/mismatched server logs after failures.</comment> <file context> @@ -0,0 +1,78 @@ + || echo "FAIL: $name x $client (see $cond_dir/${client}.log)" + + # Capture server log if the harness wrote one to the standard evidence dir. + latest_server_log=$(ls -t "$WORKTREE"/dflash/bench/results/*_adaptive_evidence/server.log 2>/dev/null | head -1 || true) + if [[ -n "$latest_server_log" ]]; then + cp "$latest_server_log" "$cond_dir/${client}_server.log" </file context>
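
A sketch of one mitigation, written in Python for illustration even though the script itself is bash: only accept a `server.log` whose modification time is at or after the start of the current run, and return nothing otherwise. This mtime filter is a stand-in technique; pointing the harness at a run-specific output directory would be more robust, but the harness options are not shown in this PR. `server_log_for_run` and `run_started_at` are hypothetical names.

```python
from pathlib import Path
from typing import Optional


def server_log_for_run(results_dir: Path, run_started_at: float) -> Optional[Path]:
    """Newest */server.log written since run_started_at (epoch seconds), else None."""
    candidates = [
        p
        for p in results_dir.glob("*_adaptive_evidence/server.log")
        if p.stat().st_mtime >= run_started_at
    ]
    # default=None avoids copying a stale log when the current run wrote nothing.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

A caller would record `time.time()` just before launching each client and pass it in, so a failed run yields `None` rather than a log left over from an earlier condition.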

cubic-dev-ai · 2026-05-24T17:27:30Z

@@ -0,0 +1,78 @@
+#!/usr/bin/env bash


P2: Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_multiclient.sh, line 12: <comment>Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark</comment> <file context> @@ -0,0 +1,78 @@ +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" + +WORKTREE="/home/peppi/Dev/lucebox-hub/.claude/worktrees/harness-adapters" +DRIVER="$WORKTREE/harness/client_test_runner.py" + </file context>
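
A minimal sketch of the usual remedy, shown in Python for illustration: take the worktree from an environment variable when one is set, and otherwise derive it from the script's own location, mirroring the `ROOT` computation already in the file context. Whether the repository root is a suitable fallback for the `harness-adapters` worktree is an assumption; `resolve_worktree` is a hypothetical helper. In bash the same idea would be `WORKTREE="${WORKTREE:-$ROOT}"`.

```python
import os
from pathlib import Path


def resolve_worktree(script_path, env=None) -> Path:
    """Return $WORKTREE if set, else the directory two levels above the script's directory."""
    env = os.environ if env is None else env
    override = env.get("WORKTREE")
    if override:
        return Path(override).expanduser().resolve()
    # <root>/dflash/bench/<script>: parents[2] is <root>, matching "$SCRIPT_DIR/../..".
    return Path(script_path).resolve().parents[2]


worktree = resolve_worktree("/repo/dflash/bench/run_ee_n_multiclient.sh", env={})
driver = worktree / "harness" / "client_test_runner.py"
```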

davide221 · 2026-05-24T22:02:31Z

NIAH is a simple benchmark, but this is a very interesting result. Can you check whether we can push this finding further on agentic coding use cases/benchmarks? Worth exploring before defaulting to N layers.

dusterbloom · 2026-05-25T09:55:03Z

> NIAH is a simple benchmark, it is a very interesting result. Can you check if we can push deeper this finding on agentic coding use cases/benchmarks? Worth exploring before default to N layers

Absolutely, 100% agreed. It has to work. Initial testing suggests it does, but Qwen3.6 tool calling is acting weird and not really functioning, at least on claude-code. Digging deeper into that today.

dusterbloom added 2 commits May 24, 2026 19:15

cubic-dev-ai Bot reviewed May 24, 2026

dusterbloom marked this pull request as draft May 26, 2026 15:32

This was referenced May 28, 2026

feat(pflash): prefill compress up to 128k -> 2-12× prefill (content-dependent), decode at parity #274

Open

pflash + dflash optimization on top of qwen35moe (PR #262) #280

Open

dusterbloom wants to merge 2 commits into Luce-Org:main from dusterbloom:feat/pflash-drafter-ee3-default
