test(server): CPU-only HTTP server test rig (stub backend + scenarios) by easel · Pull Request #343 · Luce-Org/lucebox-hub

easel · 2026-06-04T15:38:47Z

What

Adds a CPU-only end-to-end test rig for the dflash HttpServer that runs under CUDA_VISIBLE_DEVICES="" and validates streaming + non-streaming responses, OpenAI + Anthropic API shapes, and reasoning-channel routing without booting a real GPU model.

New files (8 test sources + 1 fixture + 1 .gitattributes change):

server/test/replay_http_server.cpp — driver binary: tokenizer + stub backend + HttpServer, wired through real chat-template render and SSE emitter paths
server/test/scenario_store.{h,cpp} — JSON scenario loader
server/test/stub_model_backend.{h,cpp} — ModelBackend impl that replays scripted token streams (no GPU)
server/test/scenarios/qwen3_enable_thinking_basic.json — first scenario, exercises Qwen3.6 enable_thinking -> reasoning_content
server/test/scripts/strip_gguf_to_tokenizer.py — utility used to produce the tokenizer-only fixture
server/test/fixtures/qwen3.6-tokenizer.gguf — 10 MiB LFS-tracked tokenizer-only GGUF (committed as pointer)
server/test/test_stub_integration.py — pytest driver that spawns replay_http_server and asserts on HTTP responses

Build/CI wiring (additive only):

server/CMakeLists.txt: new replay_http_server target — links dflash_common (CUDA TUs included) but never instantiates a real ModelBackend, so ggml_cuda_init() is never called
.github/workflows/ci.yml: opt checkout into LFS, build replay_http_server, run the pytest suite with CUDA_VISIBLE_DEVICES=""
.gitattributes: track *.gguf via LFS
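
The CPU-only launch described above can be sketched in Python. This is a minimal illustration, not the PR's actual driver: `cpu_only_env` and `spawn` are hypothetical helper names, and the child process is a Python one-liner standing in for replay_http_server, whose command-line arguments are not shown in this PR description.

```python
import os
import subprocess
import sys


def cpu_only_env(base=None):
    """Return a copy of the environment with every CUDA device hidden."""
    env = dict(os.environ if base is None else base)
    # An empty string (rather than an unset variable) hides all devices.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def spawn(cmd, env=None):
    """Start a child process under the CPU-only environment."""
    return subprocess.Popen(
        cmd,
        env=cpu_only_env(env),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )


if __name__ == "__main__":
    # Stand-in child: a real driver would launch replay_http_server here,
    # with whatever arguments that binary actually accepts.
    probe = "import os; print(repr(os.environ['CUDA_VISIBLE_DEVICES']))"
    child = spawn([sys.executable, "-c", probe])
    out, _ = child.communicate(timeout=30)
    print(out.strip())
```

Copying the environment instead of mutating `os.environ` keeps the override scoped to the spawned server, so other tests in the same pytest session are unaffected.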

Why

The HTTP server's reasoning-channel routing, SSE framing, and chat-template wiring previously had no CI-time coverage — regressions could only be caught by the full-image GPU smoke tests, which are expensive and serialized. This rig exercises the same call chain (ParsedRequest -> render_chat_template -> SseEmitter -> socket) on every push with deterministic, scripted token streams, so wire-format regressions surface in minutes instead of after a release build.
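
As a rough illustration of the kind of wire-format check this enables, the sketch below parses an SSE body and separates the two channels. It assumes OpenAI-style streaming chunks with `choices[].delta.content` and `choices[].delta.reasoning_content` and a `[DONE]` terminator; the exact chunk shape emitted by the server and the assertions in test_stub_integration.py are not shown here, and `parse_sse`, `collect_channels`, and `SAMPLE` are hypothetical names.

```python
import json


def parse_sse(raw):
    """Yield the decoded JSON payload of each `data:` event in an SSE body."""
    for block in raw.replace("\r\n", "\n").split("\n\n"):
        data_lines = []
        for line in block.split("\n"):
            if line.startswith("data:"):
                value = line[5:]
                # SSE strips at most one leading space after the colon.
                data_lines.append(value[1:] if value.startswith(" ") else value)
        if not data_lines:
            continue
        payload = "\n".join(data_lines)
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_channels(raw):
    """Concatenate reasoning and content deltas across all chunks."""
    reasoning, content = [], []
    for chunk in parse_sse(raw):
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            reasoning.append(delta.get("reasoning_content") or "")
            content.append(delta.get("content") or "")
    return "".join(reasoning), "".join(content)


# Hand-written sample stream (assumed shape, not captured server output).
SAMPLE = (
    'data: {"choices":[{"delta":{"reasoning_content":"2+2 is 4."}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"4"}}]}\n\n'
    "data: [DONE]\n\n"
)

if __name__ == "__main__":
    reasoning, content = collect_channels(SAMPLE)
    print(repr(reasoning), repr(content))
```

A test built this way can assert both that `reasoning_content` is non-empty and that no `</think>` tag appears in `content`.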

This PR carries forward only the unique test-rig content from #308 (which is being closed). The reasoning-channel routing fix itself already lives on main and in #341; this PR is purely additive test infrastructure with no production-code changes.

Dependencies

None. Purely additive test infrastructure.

Test plan

CI build step Build dflash (smoke + server) succeeds with replay_http_server target
CI step Run CPU integration tests (stub backend, no GPU) passes under CUDA_VISIBLE_DEVICES=""
LFS smudge fetches the tokenizer fixture on checkout (pointer is 133 bytes; blob is 10 MiB)
Local: cd server && cmake --build build --target replay_http_server and uv run pytest -v server/test/test_stub_integration.py
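
To check locally whether the fixture was smudged or is still a pointer, something like the following sketch could be used. It relies only on the published Git LFS pointer layout (a `version` line followed by `oid` and `size` lines); `parse_lfs_pointer` and `make_pointer` are hypothetical helpers, and the oid and size in the example are placeholders, not the real fixture's values.

```python
LFS_SPEC_LINE = b"version https://git-lfs.github.com/spec/v1\n"


def parse_lfs_pointer(data):
    """Return {'oid', 'size'} if data is a Git LFS pointer, else None."""
    if len(data) > 1024 or not data.startswith(LFS_SPEC_LINE):
        return None
    fields = {}
    body = data[len(LFS_SPEC_LINE):].decode("ascii", errors="replace")
    for line in body.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    oid, size = fields.get("oid", ""), fields.get("size", "")
    if not oid.startswith("sha256:") or not size.isdigit():
        return None
    return {"oid": oid, "size": int(size)}


def make_pointer(oid_hex, size):
    """Build pointer bytes for illustration (placeholder oid and size)."""
    return (LFS_SPEC_LINE
            + b"oid sha256:" + oid_hex.encode("ascii") + b"\n"
            + b"size " + str(size).encode("ascii") + b"\n")


if __name__ == "__main__":
    # Placeholder values: 64 zero hex digits and exactly 10 MiB.
    example = make_pointer("0" * 64, 10 * 1024 * 1024)
    print(len(example), parse_lfs_pointer(example))
```

A pointer with a 64-digit SHA-256 oid and an 8-digit size field comes to 133 bytes, which is consistent with the pointer size quoted in the test plan.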

Generated with Claude Code.

howard0su · 2026-06-05T01:34:31Z

nice work.

## What

Adds a CPU-only end-to-end test rig for the dflash HttpServer that runs under CUDA_VISIBLE_DEVICES="" and validates streaming + non-streaming responses, OpenAI + Anthropic API shapes, and reasoning-channel routing without booting a real GPU model.

New files:

- server/test/replay_http_server.cpp: driver binary (tokenizer + stub backend + HttpServer), wired through real chat-template render and SSE emitter paths
- server/test/scenario_store.{h,cpp}: JSON scenario loader
- server/test/stub_model_backend.{h,cpp}: ModelBackend impl that replays scripted token streams (no GPU)
- server/test/scenarios/qwen3_enable_thinking_basic.json: first scenario, exercises Qwen3.6 enable_thinking → reasoning_content
- server/test/scripts/strip_gguf_to_tokenizer.py: utility to strip a full GGUF down to tokenizer/vocab metadata (the fixture below was produced with this)
- server/test/fixtures/qwen3.6-tokenizer.gguf: 10 MiB LFS-tracked tokenizer-only GGUF (stored as a pointer; smudge fetches on demand)
- server/test/test_stub_integration.py: pytest driver that spawns replay_http_server, fires HTTP requests, and asserts on the responses

Build/CI wiring:

- server/CMakeLists.txt: new replay_http_server target (links dflash_common but never instantiates a real ModelBackend, so CUDA init is never triggered)
- .github/workflows/ci.yml: opt checkout into LFS, build replay_http_server, run the pytest suite with CUDA_VISIBLE_DEVICES=""
- .gitattributes: track *.gguf via LFS

## Why

The HTTP server's reasoning-channel routing, SSE framing, and chat-template wiring previously had no CI-time coverage; regressions could only be caught by the full-image GPU smoke tests, which are expensive and serialized. This rig exercises the same call chain (ParsedRequest → render_chat_template → SseEmitter → socket) on every push with deterministic, scripted token streams, so wire-format regressions surface in minutes instead of after a release build.

## Dependencies

None. Purely additive test infrastructure; no production code paths are modified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ontent channel

The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna chat templates append `<think>\n` to the prompt suffix when enable_thinking is honored, so the model emits reasoning tokens directly with no opening tag; the emitter never transitioned and reasoning text leaked into `content` while `reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7% (no-think) for Qwen3.6-27B Q4_K_M.

The plumbing was already there: parse_reasoning() supports started_in_thinking=true (reasoning.h:17-19) but no caller passed it.

Fix:

1. chat_template.h: render_chat_template / render_chat_template_jinja now return a PromptRenderResult { text, started_in_thinking }. The built-in QWEN3 and LAGUNA branches set started_in_thinking deterministically when enable_thinking && add_generation_prompt; GEMMA4 stays false (its reasoning channel is opened by the model emitting `<|channel>`, which http_server forwards into the emitter as `<think>`). The Jinja path suffix-sniffs the rendered prompt for a trailing `<think>` opener and emits a [WARN] log when sniffing decides true so a template/model-card mismatch surfaces at runtime.
2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter. When constructed with REASONING, active_kind_ initializes to "thinking" so the Anthropic first content_block is `thinking` instead of `text` (avoids a spurious empty text-block stop+restart on the first reasoning delta). Deliberately leaves checked_think_prefix_ at its default (false) so the existing one-time `<think>` strip guard still trips if a template/model-card mismatch causes the model to emit a redundant opener.
3. http_server.cpp: thread render_result.started_in_thinking through ParsedRequest into the SseEmitter's initial_mode. Both streaming and non-streaming paths feed tokens through the same emitter, so the fix covers both response shapes.

Tests: add 12 unit tests under test_server_unit (assertion count 1608 → 1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA / GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff positive/negative cases.

Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming `/v1/chat/completions` with `thinking:{type:enabled}` now populates reasoning_content and never leaks `</think>` into content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
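
The routing behavior described in this commit message can be modeled with a small Python sketch. It is a simplified illustration, not the C++ SseEmitter: it works on complete text instead of a token stream (so it ignores tags split across chunks), and `route` and `started_in_thinking` are hypothetical Python stand-ins for the initial_mode parameter and the Jinja suffix-sniff.

```python
from enum import Enum


class StreamMode(Enum):
    CONTENT = "content"
    REASONING = "reasoning"


OPEN, CLOSE = "<think>", "</think>"


def started_in_thinking(prompt):
    """Suffix-sniff: does the rendered prompt end with a <think> opener?"""
    return prompt.rstrip().endswith(OPEN)


def route(text, initial_mode=StreamMode.CONTENT):
    """Split complete generated text into (reasoning, content)."""
    mode = initial_mode
    reasoning, content = [], []
    # One-time strip of a redundant opener when already in reasoning mode.
    if mode is StreamMode.REASONING and text.lstrip().startswith(OPEN):
        text = text.lstrip()[len(OPEN):]
    while text:
        in_reasoning = mode is StreamMode.REASONING
        tag = CLOSE if in_reasoning else OPEN
        head, found, text = text.partition(tag)
        (reasoning if in_reasoning else content).append(head)
        if found:
            # Seeing the tag flips the channel for the remaining text.
            mode = StreamMode.CONTENT if in_reasoning else StreamMode.REASONING
    return "".join(reasoning), "".join(content)


if __name__ == "__main__":
    generated = "Let me add.\n</think>\n\n4"  # no opening tag in the output
    prompt = "...assistant\n<think>\n"        # illustrative prompt suffix
    mode = (StreamMode.REASONING if started_in_thinking(prompt)
            else StreamMode.CONTENT)
    print(route(generated, mode))
    print(route(generated))  # default CONTENT start reproduces the leak
```

Starting in CONTENT with the same text returns an empty reasoning string and leaves `</think>` inside the content, which is the leak the commit fixes; starting in REASONING splits the text at the closing tag.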

cubic-dev-ai

1 issue found across 19 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/CMakeLists.txt">

<violation number="1" location="server/CMakeLists.txt:799">
P2: The new replay test target makes libcurl a hard configure-time dependency, which can break builds that previously skipped curl-dependent tests.</violation>
</file>


cubic-dev-ai · 2026-06-05T20:55:05Z

+        # http_server.cpp #includes <curl/curl.h> for its upstream-proxy
+        # passthrough; replay_http_server compiles that TU so it must link
+        # libcurl even though the stub backend itself doesn't use it.
+        find_package(CURL REQUIRED)


P2: The new replay test target makes libcurl a hard configure-time dependency, which can break builds that previously skipped curl-dependent tests.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/CMakeLists.txt, line 799: <comment>The new replay test target makes libcurl a hard configure-time dependency, which can break builds that previously skipped curl-dependent tests.</comment> <file context> @@ -785,6 +785,44 @@ if(DFLASH27B_TESTS) + # http_server.cpp #includes <curl/curl.h> for its upstream-proxy + # passthrough; replay_http_server compiles that TU so it must link + # libcurl even though the stub backend itself doesn't use it. + find_package(CURL REQUIRED) + add_executable(replay_http_server + test/replay_http_server.cpp </file context>

easel mentioned this pull request Jun 4, 2026

fix(server): route Qwen3.6/Laguna think-mode reasoning to reasoning_content channel #308

Closed

easel force-pushed the feat/server-cpu-test-rig branch 3 times, most recently from 9542cd8 to ab3b595 Compare June 4, 2026 19:06

easel and others added 2 commits June 5, 2026 16:01

easel force-pushed the feat/server-cpu-test-rig branch from 8ebcfe9 to 57b6d14 Compare June 5, 2026 20:01

easel marked this pull request as ready for review June 5, 2026 20:51

cubic-dev-ai Bot reviewed Jun 5, 2026


easel wants to merge 2 commits into Luce-Org:main from easel:feat/server-cpu-test-rig
