test(server): CPU-only HTTP server test rig (stub backend + scenarios) #343
Open
easel wants to merge 2 commits into
9542cd8 to
ab3b595
Compare
Contributor:
nice work.
## What
Adds a CPU-only end-to-end test rig for the dflash HttpServer that runs
under CUDA_VISIBLE_DEVICES="" and validates streaming + non-streaming
responses, OpenAI + Anthropic API shapes, and reasoning-channel routing
without booting a real GPU model.
New files:
- server/test/replay_http_server.cpp — driver binary: tokenizer + stub
backend + HttpServer, wired through real chat-template render and SSE
emitter paths
- server/test/scenario_store.{h,cpp} — JSON scenario loader
- server/test/stub_model_backend.{h,cpp} — ModelBackend impl that
replays scripted token streams (no GPU)
- server/test/scenarios/qwen3_enable_thinking_basic.json — first
scenario, exercises Qwen3.6 enable_thinking → reasoning_content
- server/test/scripts/strip_gguf_to_tokenizer.py — utility to strip a
full GGUF down to tokenizer/vocab metadata (the fixture below was
produced with this)
- server/test/fixtures/qwen3.6-tokenizer.gguf — 10 MiB LFS-tracked
tokenizer-only GGUF (stored as a pointer; smudge fetches on demand)
- server/test/test_stub_integration.py — pytest driver that spawns
replay_http_server, fires HTTP requests, and asserts on the responses
Build/CI wiring:
- server/CMakeLists.txt: new replay_http_server target (links
dflash_common but never instantiates a real ModelBackend, so CUDA
init is never triggered)
- .github/workflows/ci.yml: opt checkout into LFS, build
replay_http_server, run the pytest suite with CUDA_VISIBLE_DEVICES=""
- .gitattributes: track *.gguf via LFS
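A minimal sketch of how a pytest driver might decode a streamed response before asserting on it. It is not taken from `test_stub_integration.py`; the helper names, the `data: [DONE]` sentinel, and the `delta.reasoning_content` field are assumptions modelled on the OpenAI-style streaming shape this PR refers to.

```python
import json


def parse_sse_events(raw):
    """Split a raw SSE body into decoded JSON payloads, stopping at [DONE]."""
    events = []
    for frame in raw.split("\n\n"):
        # An SSE frame may carry several "data:" lines; join them with newlines.
        data_lines = [
            line[len("data:"):].strip()
            for line in frame.splitlines()
            if line.startswith("data:")
        ]
        if not data_lines:
            continue
        payload = "\n".join(data_lines)
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        events.append(json.loads(payload))
    return events


def collect_deltas(events):
    """Concatenate streamed deltas per channel; returns (content, reasoning)."""
    content, reasoning = [], []
    for event in events:
        for choice in event.get("choices", []):
            delta = choice.get("delta", {})
            content.append(delta.get("content") or "")
            reasoning.append(delta.get("reasoning_content") or "")
    return "".join(content), "".join(reasoning)


# Hand-written sample stream, not captured from the real server.
SAMPLE = (
    'data: {"choices":[{"delta":{"reasoning_content":"plan"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"answer"}}]}\n\n'
    "data: [DONE]\n\n"
)
```

Asserting on the per-channel strings rather than on raw bytes keeps a test focused on routing (which channel a token landed in) instead of on incidental framing details.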
## Why
The HTTP server's reasoning-channel routing, SSE framing, and
chat-template wiring previously had no CI-time coverage — regressions
could only be caught by the full-image GPU smoke tests, which are
expensive and serialized. This rig exercises the same call chain
(ParsedRequest → render_chat_template → SseEmitter → socket) on every
push with deterministic, scripted token streams, so wire-format
regressions surface in minutes instead of after a release build.
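To illustrate what "deterministic, scripted token streams" means in practice, here is a rough Python model of a scenario loader plus stub backend. The real rig is C++ (`scenario_store`, `stub_model_backend`); the JSON keys `name` and `tokens` and the `generate` signature below are hypothetical and only stand in for whatever the actual scenario schema is.

```python
import json
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass(frozen=True)
class Scenario:
    name: str
    tokens: Tuple[str, ...]  # scripted output, replayed verbatim


def load_scenario(text: str) -> Scenario:
    """Parse a scenario document (hypothetical schema: name + tokens)."""
    doc = json.loads(text)
    return Scenario(name=doc["name"], tokens=tuple(doc["tokens"]))


class StubBackend:
    """Replays a scripted token stream; the prompt never influences output."""

    def __init__(self, scenario: Scenario) -> None:
        self._scenario = scenario

    def generate(self, prompt: str, max_tokens: Optional[int] = None) -> Iterator[str]:
        tokens = self._scenario.tokens
        if max_tokens is not None:
            tokens = tokens[:max_tokens]  # honour the request's token budget
        yield from tokens


# Illustrative scenario, not the contents of qwen3_enable_thinking_basic.json.
SCENARIO_JSON = '{"name": "thinking_basic", "tokens": ["plan", "</think>", "answer"]}'
```

Because the output depends only on the scenario, any change in the HTTP response between two runs has to come from the template, emitter, or framing code under test.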
## Dependencies
None. Purely additive test infrastructure — no production code paths
are modified.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ontent channel
The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to
REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna
chat templates append `<think>\n` to the prompt suffix when enable_thinking is
honored, so the model emits reasoning tokens directly with no opening tag —
the emitter never transitioned and reasoning text leaked into `content` while
`reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7%
(no-think) for Qwen3.6-27B Q4_K_M.
The plumbing was already there: parse_reasoning() supports
started_in_thinking=true (reasoning.h:17-19) but no caller passed it.
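The effect of the unused flag can be shown with a small Python model. This is not the code in `reasoning.h`; it is a sketch that assumes the function splits generated text into a (reasoning, content) pair and that `started_in_thinking` means the opening `<think>` was already consumed by the prompt.

```python
OPEN, CLOSE = "<think>", "</think>"


def parse_reasoning(text, started_in_thinking=False):
    """Split generated text into (reasoning, content). Illustrative model only."""
    before = ""
    if not started_in_thinking:
        start = text.find(OPEN)
        if start == -1:
            # No opener seen: everything is treated as content, including
            # any stray closing tag. This is the leak described above.
            return "", text
        before = text[:start]
        text = text[start + len(OPEN):]
    else:
        # Tolerate a redundant opener if the model emits one anyway.
        stripped = text.lstrip()
        if stripped.startswith(OPEN):
            text = stripped[len(OPEN):]
    end = text.find(CLOSE)
    if end == -1:
        # Unclosed block: the remaining text is all reasoning.
        return text.strip(), before.strip()
    reasoning = text[:end]
    content = before + text[end + len(CLOSE):]
    return reasoning.strip(), content.strip()
```

With the default flag, `"step 1</think>answer"` comes back entirely as content; with `started_in_thinking=True` the same input splits into `"step 1"` and `"answer"`.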
Fix:
1. chat_template.h: render_chat_template / render_chat_template_jinja now
return a PromptRenderResult { text, started_in_thinking }. The built-in
QWEN3 and LAGUNA branches set started_in_thinking deterministically when
enable_thinking && add_generation_prompt; GEMMA4 stays false (its
reasoning channel is opened by the model emitting `<|channel>`, which
http_server forwards into the emitter as `<think>`). The Jinja path
suffix-sniffs the rendered prompt for a trailing `<think>` opener and
emits a [WARN] log when sniffing decides true so a template/model-card
mismatch surfaces at runtime.
2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter.
When constructed with REASONING, active_kind_ initializes to "thinking"
so the Anthropic first content_block is `thinking` instead of `text`
(avoids a spurious empty text-block stop+restart on the first reasoning
delta). Deliberately leaves checked_think_prefix_ at its default (false)
so the existing one-time `<think>` strip guard still trips if a
template/model-card mismatch causes the model to emit a redundant opener.
3. http_server.cpp: thread render_result.started_in_thinking through
ParsedRequest into the SseEmitter's initial_mode. Both streaming and
non-streaming paths feed tokens through the same emitter, so the fix
covers both response shapes.
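The routing change in steps 2 and 3 can be modelled in a few lines of Python. This is a toy, not the real `SseEmitter`: it assumes `<think>` and `</think>` each arrive as a single whole token, ignores the Anthropic block bookkeeping, and treats `checked_think_prefix` as a one-time check on the first token, which is an interpretation of the description above rather than a reading of the source.

```python
from enum import Enum


class StreamMode(Enum):
    CONTENT = "content"
    REASONING = "reasoning"


class EmitterSketch:
    """Toy model of channel routing with a configurable initial mode."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self, initial_mode=StreamMode.CONTENT):
        self.mode = initial_mode
        # Left False even when starting in REASONING, so a redundant
        # opener from the model is still stripped once.
        self.checked_think_prefix = False
        self.reasoning = []
        self.content = []

    def feed(self, token):
        if not self.checked_think_prefix:
            self.checked_think_prefix = True
            if token == self.OPEN:
                self.mode = StreamMode.REASONING
                return
        if token == self.OPEN and self.mode is StreamMode.CONTENT:
            self.mode = StreamMode.REASONING
            return
        if token == self.CLOSE and self.mode is StreamMode.REASONING:
            self.mode = StreamMode.CONTENT
            return
        target = self.reasoning if self.mode is StreamMode.REASONING else self.content
        target.append(token)


def route(tokens, initial_mode=StreamMode.CONTENT):
    """Feed all tokens through a fresh emitter; returns (reasoning, content)."""
    emitter = EmitterSketch(initial_mode)
    for token in tokens:
        emitter.feed(token)
    return "".join(emitter.reasoning), "".join(emitter.content)
```

Feeding `["plan", "</think>", "answer"]` with the default mode reproduces the leak (everything lands in content), while `StreamMode.REASONING` yields `"plan"` as reasoning and `"answer"` as content, with or without a redundant leading `<think>`.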
Tests: add 12 unit tests under test_server_unit (assertion count 1608 →
1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and
ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus
PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA /
GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff
positive/negative cases.
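For the Jinja suffix-sniff cases, the check under test can be approximated as follows. The function name and the logger are hypothetical, the prompt strings in the usage note are illustrative rather than copied from a real rendered template, and the actual C++ check may differ in how it treats trailing whitespace.

```python
import logging

log = logging.getLogger("chat_template_sketch")


def sniff_started_in_thinking(rendered_prompt):
    """Return True when the rendered prompt ends with an unclosed <think> opener."""
    started = rendered_prompt.rstrip().endswith("<think>")
    if started:
        # Mirrors the [WARN] log described above so a template/model-card
        # mismatch is visible at runtime.
        log.warning("rendered prompt ends with <think>; starting in reasoning mode")
    return started
```

A positive case would be a prompt ending in `<think>\n`; negative cases include a prompt with no opener at all and one whose think block is already closed, such as `<think>\n\n</think>\n\n`.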
Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming
`/v1/chat/completions` with `thinking:{type:enabled}` now populates
reasoning_content and never leaks `</think>` into content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8ebcfe9 to
57b6d14
Compare
Contributor
1 issue found across 19 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/CMakeLists.txt">
<violation number="1" location="server/CMakeLists.txt:799">
P2: The new replay test target makes libcurl a hard configure-time dependency, which can break builds that previously skipped curl-dependent tests.</violation>
</file>
# http_server.cpp #includes <curl/curl.h> for its upstream-proxy
# passthrough; replay_http_server compiles that TU so it must link
# libcurl even though the stub backend itself doesn't use it.
find_package(CURL REQUIRED)
Contributor
P2: The new replay test target makes libcurl a hard configure-time dependency, which can break builds that previously skipped curl-dependent tests.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/CMakeLists.txt, line 799:
<comment>The new replay test target makes libcurl a hard configure-time dependency, which can break builds that previously skipped curl-dependent tests.</comment>
<file context>
@@ -785,6 +785,44 @@ if(DFLASH27B_TESTS)
+ # http_server.cpp #includes <curl/curl.h> for its upstream-proxy
+ # passthrough; replay_http_server compiles that TU so it must link
+ # libcurl even though the stub backend itself doesn't use it.
+ find_package(CURL REQUIRED)
+ add_executable(replay_http_server
+ test/replay_http_server.cpp
</file context>
## What
Adds a CPU-only end-to-end test rig for the dflash HttpServer that runs under `CUDA_VISIBLE_DEVICES=""` and validates streaming + non-streaming responses, OpenAI + Anthropic API shapes, and reasoning-channel routing without booting a real GPU model.
New files (8 test sources + 1 fixture + 1 .gitattributes change):
- `server/test/replay_http_server.cpp`: driver binary (tokenizer + stub backend + HttpServer), wired through real chat-template render and SSE emitter paths
- `server/test/scenario_store.{h,cpp}`: JSON scenario loader
- `server/test/stub_model_backend.{h,cpp}`: ModelBackend impl that replays scripted token streams (no GPU)
- `server/test/scenarios/qwen3_enable_thinking_basic.json`: first scenario, exercises Qwen3.6 `enable_thinking` -> `reasoning_content`
- `server/test/scripts/strip_gguf_to_tokenizer.py`: utility used to produce the tokenizer-only fixture
- `server/test/fixtures/qwen3.6-tokenizer.gguf`: 10 MiB LFS-tracked tokenizer-only GGUF (committed as pointer)
- `server/test/test_stub_integration.py`: pytest driver that spawns `replay_http_server` and asserts on HTTP responses
Build/CI wiring (additive only):
- `server/CMakeLists.txt`: new `replay_http_server` target; links `dflash_common` (CUDA TUs included) but never instantiates a real `ModelBackend`, so `ggml_cuda_init()` is never called
- `.github/workflows/ci.yml`: opt checkout into LFS, build `replay_http_server`, run the pytest suite with `CUDA_VISIBLE_DEVICES=""`
- `.gitattributes`: track `*.gguf` via LFS
## Why
The HTTP server's reasoning-channel routing, SSE framing, and chat-template wiring previously had no CI-time coverage, so regressions could only be caught by the full-image GPU smoke tests, which are expensive and serialized. This rig exercises the same call chain (`ParsedRequest` -> `render_chat_template` -> `SseEmitter` -> socket) on every push with deterministic, scripted token streams, so wire-format regressions surface in minutes instead of after a release build.
This PR carries forward only the unique test-rig content from #308 (which is being closed). The reasoning-channel routing fix itself already lives on `main` and in #341; this PR is purely additive test infrastructure with no production-code changes.
## Dependencies
None. Purely additive test infrastructure.
## Test plan
- `Build dflash (smoke + server)` succeeds with the `replay_http_server` target
- `Run CPU integration tests (stub backend, no GPU)` passes under `CUDA_VISIBLE_DEVICES=""`
- Locally: `cd server && cmake --build build --target replay_http_server` and `uv run pytest -v server/test/test_stub_integration.py`

Generated with Claude Code.