feat(drafter): ee3 as production default (depends on #274)#275

Draft
dusterbloom wants to merge 2 commits into
Luce-Org:mainfrom
dusterbloom:feat/pflash-drafter-ee3-default

Conversation

@dusterbloom
Collaborator

Summary

Changes the recommended early-exit default from N=7 to N=3 based on an empirical N-sweep. Depends on PR #274 landing first, since that PR introduces PFLASH_DRAFTER_EARLY_EXIT_N.

Source change

dflash/README.md — documents ee3 as the recommended default with reproduction instructions.

Evidence

Results not committed. Reproduce via:

  • N-sweep NIAH @ 32K/64K/128K (ee3=3/3 everywhere, 24.3× drafter speedup at 128K vs baseline): dflash/bench/run_ee_n_sweep.sh
  • 5-client multi-client accept_rate (ee3 +1.2 pp vs ee7, all clients within ±2 pp): dflash/bench/run_ee_n_multiclient.sh
  • Sweep plan + rationale: bench/2026-05-25_ee_n_sweep/PLAN.md

Headline numbers (RTX 3090, Qwen3.6-27B-Q4_K_M, Qwen2.5-0.5B-BF16):

  • ee3 drafter speedup: 6.9× @ 32K, 24.3× @ 128K
  • accept_rate vs ee7: +1.2 pp mean (claude_code +0.0, hermes -0.4, opencode +1.7, pi +0.0, codex +4.6)
  • NIAH 3/3 at 32K, 64K, 128K
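The +1.2 pp mean can be checked directly from the five per-client deltas listed above. The short sketch below only redoes that arithmetic; the variable names are illustrative and nothing here comes from the bench scripts. It also lists any client whose delta falls outside a ±2 pp band, which may be worth reconciling with the "within ±2 pp" wording used elsewhere in this PR.

```python
# Per-client accept_rate deltas (ee3 minus ee7), in percentage points,
# copied from the headline numbers in this PR description.
deltas_pp = {
    "claude_code": 0.0,
    "hermes": -0.4,
    "opencode": 1.7,
    "pi": 0.0,
    "codex": 4.6,
}

# Unweighted mean across the five clients.
mean_pp = sum(deltas_pp.values()) / len(deltas_pp)

# Clients whose absolute delta exceeds 2 pp.
outside_band = [name for name, d in deltas_pp.items() if abs(d) > 2.0]

print(f"mean delta: {mean_pp:+.1f} pp")      # mean delta: +1.2 pp
print("outside +/-2 pp:", outside_band)      # outside +/-2 pp: ['codex']
```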

Dependency

Requires PR #274 to merge before this can land (env vars are introduced there).

Reviewer note

Bench results not committed; run the scripts above to regenerate.

…idation)

Runner scripts: run_ee_n_sweep.sh, run_ee_n_sweep_niah.py, run_ee_n_multiclient.sh.
Decision: ee3 drafter speedup 6.9x@32K, 24.3x@128K, accept_rate within ±2 pp of ee7.
Bench results not committed; reproduce via the added scripts.
…s on Luce-Org#274)

PFLASH_DRAFTER_EARLY_EXIT_N=3 PFLASH_DRAFTER_SCORE_LAYERS=3 is the production
default after ee_n sweep: 6.9x@32K, 24.3x@128K, accept_rate +1.2 pp vs ee7.
Reproduce via dflash/bench/run_ee_n_sweep.sh + run_ee_n_multiclient.sh.
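As a rough illustration of the configuration named in this commit message, the sketch below builds a process environment with both variables pinned to the same N. The helper name `ee_env` is hypothetical, and setting both variables to the same value for every N is an assumption taken from the ee3 case only; the source does not say they must always match.

```python
import os
from typing import Dict, Mapping


def ee_env(n: int, base: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Copy `base` and set both early-exit variables to N.

    Assumption: both variables take the same value, as in the ee3 setting
    quoted in the commit message (N=3 for both).
    """
    if n < 1:
        raise ValueError("early-exit N must be a positive integer")
    env = dict(base)
    env["PFLASH_DRAFTER_EARLY_EXIT_N"] = str(n)
    env["PFLASH_DRAFTER_SCORE_LAYERS"] = str(n)
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching whatever server command the bench uses; that command is not shown in this PR, so it is left out here.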
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

3 issues found across 5 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/bench/run_ee_n_sweep_niah.py">

<violation number="1" location="dflash/bench/run_ee_n_sweep_niah.py:167">
P1: `ttft_s` measures total completion time, not time-to-first-token, because the request uses non-streaming mode (`"stream": False`)</violation>
</file>

<file name="dflash/bench/run_ee_n_multiclient.sh">

<violation number="1" location="dflash/bench/run_ee_n_multiclient.sh:12">
P2: Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark</violation>

<violation number="2" location="dflash/bench/run_ee_n_multiclient.sh:71">
P2: Server-log capture uses global `ls -t` latest rather than run-specific path, causing stale/mismatched server logs after failures.</violation>
</file>

t0 = time.perf_counter()
try:
    r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=600)
    result["ttft_s"] = time.perf_counter() - t0

P1: ttft_s measures total completion time, not time-to-first-token, because the request uses non-streaming mode ("stream": False)

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_sweep_niah.py, line 167:

<comment>`ttft_s` measures total completion time, not time-to-first-token, because the request uses non-streaming mode (`"stream": False`)</comment>

<file context>
@@ -0,0 +1,296 @@
+        t0 = time.perf_counter()
+        try:
+            r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=600)
+            result["ttft_s"] = time.perf_counter() - t0
+            r.raise_for_status()
+            data = r.json()
</file context>
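One possible way to address the P1 comment is to request a streamed response and stamp the clock when the first content chunk arrives. The sketch below is not the PR's code: it assumes the server accepts `"stream": True` and emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`), which this PR does not confirm. The helper name `time_stream` and its parameters are hypothetical.

```python
import json
import time
from typing import Callable, Iterable, Optional


def time_stream(lines: Iterable[bytes], t0: float,
                clock: Callable[[], float] = time.perf_counter) -> dict:
    """Return TTFT, total time, and text for an OpenAI-style SSE stream.

    `lines` yields raw SSE lines; `t0` is the clock value taken just
    before the request was sent. The SSE layout is an assumption.
    """
    ttft: Optional[float] = None
    chunks = []
    for raw in lines:
        if not raw.startswith(b"data:"):
            continue  # skip blank separator lines and SSE comments
        body = raw[len(b"data:"):].strip()
        if body == b"[DONE]":
            break
        choices = json.loads(body).get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        text = delta.get("content")
        if text:
            if ttft is None:
                ttft = clock() - t0  # first generated content seen
            chunks.append(text)
    return {"ttft_s": ttft, "total_s": clock() - t0, "text": "".join(chunks)}
```

With the `requests` library, the iterable could come from `requests.post(url, json=payload, stream=True, timeout=600).iter_lines()`, with `t0` taken immediately before the call. Alternatively, keeping the non-streaming request and renaming the field to something like a total-latency name would also resolve the mismatch the reviewer describes.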

|| echo "FAIL: $name x $client (see $cond_dir/${client}.log)"

# Capture server log if the harness wrote one to the standard evidence dir.
latest_server_log=$(ls -t "$WORKTREE"/dflash/bench/results/*_adaptive_evidence/server.log 2>/dev/null | head -1 || true)

P2: Server-log capture uses global ls -t latest rather than run-specific path, causing stale/mismatched server logs after failures.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_multiclient.sh, line 71:

<comment>Server-log capture uses global `ls -t` latest rather than run-specific path, causing stale/mismatched server logs after failures.</comment>

<file context>
@@ -0,0 +1,78 @@
+            || echo "FAIL: $name x $client (see $cond_dir/${client}.log)"
+
+        # Capture server log if the harness wrote one to the standard evidence dir.
+        latest_server_log=$(ls -t "$WORKTREE"/dflash/bench/results/*_adaptive_evidence/server.log 2>/dev/null | head -1 || true)
+        if [[ -n "$latest_server_log" ]]; then
+            cp "$latest_server_log" "$cond_dir/${client}_server.log"
</file context>
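The stale-log risk described in this comment comes from picking the newest `server.log` overall, regardless of when it was written. A minimal sketch of one mitigation, written in Python rather than the script's bash, is to record the run's start time and accept only logs modified at or after it. The function name `log_for_run` is hypothetical; the glob pattern mirrors the one in the excerpt above. Having the harness write to a run-specific directory would be a stricter fix, but whether the harness supports that is not stated here.

```python
from pathlib import Path
from typing import Optional


def log_for_run(results_dir: Path, started_at: float) -> Optional[Path]:
    """Return the newest server.log written during this run, if any.

    `started_at` is a POSIX timestamp (e.g. time.time()) captured just
    before the client run began. Logs older than that are ignored, so a
    failed run yields None instead of a log left over from a previous run.
    """
    candidates = [
        p for p in results_dir.glob("*_adaptive_evidence/server.log")
        if p.stat().st_mtime >= started_at
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```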

@@ -0,0 +1,78 @@
#!/usr/bin/env bash

P2: Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_multiclient.sh, line 12:

<comment>Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark</comment>

<file context>
@@ -0,0 +1,78 @@
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+WORKTREE="/home/peppi/Dev/lucebox-hub/.claude/worktrees/harness-adapters"
+DRIVER="$WORKTREE/harness/client_test_runner.py"
+
</file context>
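A common way to remove a machine-specific path like the one flagged here is to read it from an environment variable and fall back to a location derived from the script itself. The sketch below shows that pattern in Python; the variable name `HARNESS_WORKTREE` and the function `resolve_worktree` are hypothetical, and the fallback assumes the script lives two directories below the repository root, as the `ROOT` line in the excerpt suggests.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_worktree(script_path: Path,
                     env: Mapping[str, str] = os.environ) -> Path:
    """Pick the harness worktree without hardcoding a home directory.

    Order of preference:
      1. HARNESS_WORKTREE from the environment (hypothetical name).
      2. The repository root, taken as two levels above the script's
         directory (dflash/bench/<script> -> repo root).
    """
    override = env.get("HARNESS_WORKTREE")
    if override:
        return Path(override).expanduser()
    return script_path.resolve().parent.parent.parent
```

The bash equivalent would be a default-expansion such as `WORKTREE="${HARNESS_WORKTREE:-$ROOT}"`, again with the variable name being an assumption.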

@davide221
Contributor

NIAH is a simple benchmark, but it is a very interesting result. Can you check whether we can push this finding further on agentic coding use cases/benchmarks? Worth exploring before defaulting to N layers.

@dusterbloom
Collaborator Author

NIAH is a simple benchmark, it is a very interesting result. Can you check if we can push deeper this finding on agentic coding use cases/benchmarks? Worth exploring before default to N layers

Absolutely, 100% agreed. It has to work. Initial testing suggests it works, but Qwen3.6 tool calling is acting weird and not really functioning, at least on claude-code. Digging deeper on that today.
