feat(drafter): ee3 as production default (depends on #274) #275
dusterbloom wants to merge 2 commits into
…idation) Runner scripts: run_ee_n_sweep.sh, run_ee_n_sweep_niah.py, run_ee_n_multiclient.sh. Decision: ee3 drafter speedup 6.9x@32K, 24.3x@128K, accept_rate within ±2 pp of ee7. Bench results not committed; reproduce via the added scripts.
…s on Luce-Org#274) PFLASH_DRAFTER_EARLY_EXIT_N=3 PFLASH_DRAFTER_SCORE_LAYERS=3 is the production default after ee_n sweep: 6.9x@32K, 24.3x@128K, accept_rate +1.2 pp vs ee7. Reproduce via dflash/bench/run_ee_n_sweep.sh + run_ee_n_multiclient.sh.
3 issues found across 5 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/bench/run_ee_n_sweep_niah.py">
<violation number="1" location="dflash/bench/run_ee_n_sweep_niah.py:167">
P1: `ttft_s` measures total completion time, not time-to-first-token, because the request uses non-streaming mode (`"stream": False`)</violation>
</file>
<file name="dflash/bench/run_ee_n_multiclient.sh">
<violation number="1" location="dflash/bench/run_ee_n_multiclient.sh:12">
P2: Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark</violation>
<violation number="2" location="dflash/bench/run_ee_n_multiclient.sh:71">
P2: Server-log capture uses global `ls -t` latest rather than run-specific path, causing stale/mismatched server logs after failures.</violation>
</file>
t0 = time.perf_counter()
try:
    r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=600)
    result["ttft_s"] = time.perf_counter() - t0
P1: `ttft_s` measures total completion time, not time-to-first-token, because the request uses non-streaming mode (`"stream": False`).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_sweep_niah.py, line 167:
<comment>`ttft_s` measures total completion time, not time-to-first-token, because the request uses non-streaming mode (`"stream": False`)</comment>
<file context>
@@ -0,0 +1,296 @@
+ t0 = time.perf_counter()
+ try:
+ r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=600)
+ result["ttft_s"] = time.perf_counter() - t0
+ r.raise_for_status()
+ data = r.json()
</file context>
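One possible way to address the P1 finding is to stream the response and record the time at which the first content chunk arrives. The sketch below assumes the server supports OpenAI-style server-sent-event streaming on `/v1/chat/completions` (`data: {...}` lines ending with `data: [DONE]`), which this PR does not confirm. The helper names `time_stream` and `post_and_time` are hypothetical and do not exist in `run_ee_n_sweep_niah.py`.

```python
import json
import time


def time_stream(lines, t0, clock=time.perf_counter):
    """Return (ttft_s, total_s, text) for an iterable of SSE lines.

    Assumption: OpenAI-style chunks, i.e. 'data: {...}' lines carrying
    choices[0].delta.content, terminated by 'data: [DONE]'.
    """
    ttft_s = None
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break
        choices = json.loads(body).get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        piece = delta.get("content")
        if piece:
            if ttft_s is None:
                ttft_s = clock() - t0  # first generated text arrived
            parts.append(piece)
    return ttft_s, clock() - t0, "".join(parts)


def post_and_time(base_url, payload, timeout=600):
    """Hypothetical wrapper: send a streaming request and time it."""
    # Third-party dependency, imported lazily so time_stream works offline.
    import requests

    t0 = time.perf_counter()
    with requests.post(
        f"{base_url}/v1/chat/completions",
        json={**payload, "stream": True},
        stream=True,
        timeout=timeout,
    ) as r:
        r.raise_for_status()
        return time_stream(r.iter_lines(), t0)
```

With this shape, the first value could be stored as `result["ttft_s"]` and the second under a separate key for total latency. If the server cannot stream, renaming the existing field to reflect total completion time would be the smaller fix.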
|| echo "FAIL: $name x $client (see $cond_dir/${client}.log)"

# Capture server log if the harness wrote one to the standard evidence dir.
latest_server_log=$(ls -t "$WORKTREE"/dflash/bench/results/*_adaptive_evidence/server.log 2>/dev/null | head -1 || true)
P2: Server-log capture uses global `ls -t` latest rather than run-specific path, causing stale/mismatched server logs after failures.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_multiclient.sh, line 71:
<comment>Server-log capture uses global `ls -t` latest rather than run-specific path, causing stale/mismatched server logs after failures.</comment>
<file context>
@@ -0,0 +1,78 @@
+ || echo "FAIL: $name x $client (see $cond_dir/${client}.log)"
+
+ # Capture server log if the harness wrote one to the standard evidence dir.
+ latest_server_log=$(ls -t "$WORKTREE"/dflash/bench/results/*_adaptive_evidence/server.log 2>/dev/null | head -1 || true)
+ if [[ -n "$latest_server_log" ]]; then
+ cp "$latest_server_log" "$cond_dir/${client}_server.log"
</file context>
@@ -0,0 +1,78 @@
#!/usr/bin/env bash
P2: Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_multiclient.sh, line 12:
<comment>Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark</comment>
<file context>
@@ -0,0 +1,78 @@
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+WORKTREE="/home/peppi/Dev/lucebox-hub/.claude/worktrees/harness-adapters"
+DRIVER="$WORKTREE/harness/client_test_runner.py"
+
</file context>
NIAH is a simple benchmark, but this is a very interesting result. Can you check whether the finding holds on agentic coding use cases and benchmarks? That is worth exploring before defaulting to N layers.
Absolutely, 100% agreed. It has to work. Initial testing suggests it works, but Qwen3.6 tool calling is acting weird and not really functioning, at least on claude-code. Digging deeper on that today.
Summary
Changes the recommended early-exit default from N=7 to N=3 based on an empirical N-sweep. Depends on PR #274 landing first (introduces PFLASH_DRAFTER_EARLY_EXIT_N).
Source change
dflash/README.md: documents ee3 as the recommended default with reproduction instructions.
Evidence
Results not committed. Reproduce via:
dflash/bench/run_ee_n_sweep.sh
dflash/bench/run_ee_n_multiclient.sh
bench/2026-05-25_ee_n_sweep/PLAN.md
Headline numbers (RTX 3090, Qwen3.6-27B-Q4_K_M, Qwen2.5-0.5B-BF16):
Dependency
Requires PR #274 to merge before this can land (env vars are introduced there).
Reviewer note
Bench results not committed; run the scripts above to regenerate.