Skip to content

test(server): CPU-only HTTP server test rig (stub backend + scenarios)#343

Open
easel wants to merge 2 commits into
Luce-Org:mainfrom
easel:feat/server-cpu-test-rig
Open

test(server): CPU-only HTTP server test rig (stub backend + scenarios)#343
easel wants to merge 2 commits into
Luce-Org:mainfrom
easel:feat/server-cpu-test-rig

Conversation

@easel
Copy link
Copy Markdown
Collaborator

@easel easel commented Jun 4, 2026

What

Adds a CPU-only end-to-end test rig for the dflash HttpServer that runs under CUDA_VISIBLE_DEVICES="" and validates streaming + non-streaming responses, OpenAI + Anthropic API shapes, and reasoning-channel routing without booting a real GPU model.

New files (8 test sources + 1 fixture + 1 .gitattributes change):

  • server/test/replay_http_server.cpp — driver binary: tokenizer + stub backend + HttpServer, wired through real chat-template render and SSE emitter paths
  • server/test/scenario_store.{h,cpp} — JSON scenario loader
  • server/test/stub_model_backend.{h,cpp} — ModelBackend impl that replays scripted token streams (no GPU)
  • server/test/scenarios/qwen3_enable_thinking_basic.json — first scenario, exercises Qwen3.6 enable_thinking -> reasoning_content
  • server/test/scripts/strip_gguf_to_tokenizer.py — utility used to produce the tokenizer-only fixture
  • server/test/fixtures/qwen3.6-tokenizer.gguf — 10 MiB LFS-tracked tokenizer-only GGUF (committed as pointer)
  • server/test/test_stub_integration.py — pytest driver that spawns replay_http_server and asserts on HTTP responses

Build/CI wiring (additive only):

  • server/CMakeLists.txt: new replay_http_server target — links dflash_common (CUDA TUs included) but never instantiates a real ModelBackend, so ggml_cuda_init() is never called
  • .github/workflows/ci.yml: opt checkout into LFS, build replay_http_server, run the pytest suite with CUDA_VISIBLE_DEVICES=""
  • .gitattributes: track *.gguf via LFS

Why

The HTTP server's reasoning-channel routing, SSE framing, and chat-template wiring previously had no CI-time coverage — regressions could only be caught by the full-image GPU smoke tests, which are expensive and serialized. This rig exercises the same call chain (ParsedRequest -> render_chat_template -> SseEmitter -> socket) on every push with deterministic, scripted token streams, so wire-format regressions surface in minutes instead of after a release build.

This PR carries forward only the unique test-rig content from #308 (which is being closed). The reasoning-channel routing fix itself already lives on main and in #341; this PR is purely additive test infrastructure with no production-code changes.

Dependencies

None. Purely additive test infrastructure.

Test plan

  • CI build step Build dflash (smoke + server) succeeds with replay_http_server target
  • CI step Run CPU integration tests (stub backend, no GPU) passes under CUDA_VISIBLE_DEVICES=""
  • LFS smudge fetches the tokenizer fixture on checkout (pointer is 133 bytes; blob is 10 MiB)
  • Local: cd server && cmake --build build --target replay_http_server and uv run pytest -v server/test/test_stub_integration.py

Generated with Claude Code.

@howard0su
Copy link
Copy Markdown
Contributor

nice work.

easel and others added 2 commits June 5, 2026 16:01
## What
Adds a CPU-only end-to-end test rig for the dflash HttpServer that runs
under CUDA_VISIBLE_DEVICES="" and validates streaming + non-streaming
responses, OpenAI + Anthropic API shapes, and reasoning-channel routing
without booting a real GPU model.

New files:
- server/test/replay_http_server.cpp — driver binary: tokenizer + stub
  backend + HttpServer, wired through real chat-template render and SSE
  emitter paths
- server/test/scenario_store.{h,cpp} — JSON scenario loader
- server/test/stub_model_backend.{h,cpp} — ModelBackend impl that
  replays scripted token streams (no GPU)
- server/test/scenarios/qwen3_enable_thinking_basic.json — first
  scenario, exercises Qwen3.6 enable_thinking → reasoning_content
- server/test/scripts/strip_gguf_to_tokenizer.py — utility to strip a
  full GGUF down to tokenizer/vocab metadata (the fixture below was
  produced with this)
- server/test/fixtures/qwen3.6-tokenizer.gguf — 10 MiB LFS-tracked
  tokenizer-only GGUF (stored as a pointer; smudge fetches on demand)
- server/test/test_stub_integration.py — pytest driver that spawns
  replay_http_server, fires HTTP requests, and asserts on the responses

Build/CI wiring:
- server/CMakeLists.txt: new replay_http_server target (links
  dflash_common but never instantiates a real ModelBackend, so CUDA
  init is never triggered)
- .github/workflows/ci.yml: opt checkout into LFS, build
  replay_http_server, run the pytest suite with CUDA_VISIBLE_DEVICES=""
- .gitattributes: track *.gguf via LFS

## Why
The HTTP server's reasoning-channel routing, SSE framing, and
chat-template wiring previously had no CI-time coverage — regressions
could only be caught by the full-image GPU smoke tests, which are
expensive and serialized. This rig exercises the same call chain
(ParsedRequest → render_chat_template → SseEmitter → socket) on every
push with deterministic, scripted token streams, so wire-format
regressions surface in minutes instead of after a release build.

## Dependencies
None. Purely additive test infrastructure — no production code paths
are modified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ontent channel

The SseEmitter hard-started in StreamMode::CONTENT and only transitioned to
REASONING when it saw `<think>` in the generated stream. But Qwen3.6 / Laguna
chat templates append `<think>\n` to the prompt suffix when enable_thinking is
honored, so the model emits reasoning tokens directly with no opening tag —
the emitter never transitioned and reasoning text leaked into `content` while
`reasoning_content` stayed empty. ds4-eval pass rate: 14.1% (think) vs 71.7%
(no-think) for Qwen3.6-27B Q4_K_M.

The plumbing was already there: parse_reasoning() supports
started_in_thinking=true (reasoning.h:17-19) but no caller passed it.

Fix:

1. chat_template.h: render_chat_template / render_chat_template_jinja now
   return a PromptRenderResult { text, started_in_thinking }. The built-in
   QWEN3 and LAGUNA branches set started_in_thinking deterministically when
   enable_thinking && add_generation_prompt; GEMMA4 stays false (its
   reasoning channel is opened by the model emitting `<|channel>`, which
   http_server forwards into the emitter as `<think>`). The Jinja path
   suffix-sniffs the rendered prompt for a trailing `<think>` opener and
   emits a [WARN] log when sniffing decides true so a template/model-card
   mismatch surfaces at runtime.

2. SseEmitter: add `initial_mode = StreamMode::CONTENT` defaulted parameter.
   When constructed with REASONING, active_kind_ initializes to "thinking"
   so the Anthropic first content_block is `thinking` instead of `text`
   (avoids a spurious empty text-block stop+restart on the first reasoning
   delta). Deliberately leaves checked_think_prefix_ at its default (false)
   so the existing one-time `<think>` strip guard still trips if a
   template/model-card mismatch causes the model to emit a redundant opener.

3. http_server.cpp: thread render_result.started_in_thinking through
   ParsedRequest into the SseEmitter's initial_mode. Both streaming and
   non-streaming paths feed tokens through the same emitter, so the fix
   covers both response shapes.

Tests: add 12 unit tests under test_server_unit (assertion count 1608 →
1637): SseEmitter initial_mode=REASONING routing for OPENAI_CHAT and
ANTHROPIC formats (closed, unclosed, redundant-opener-strip cases) plus
PromptRenderResult.started_in_thinking provenance for QWEN3 / LAGUNA /
GEMMA4 (enable/disable/no-gen-prompt) and the Jinja suffix-sniff
positive/negative cases.

Smoke-tested manually against Qwen3.6-27B Q4_K_M; non-streaming
`/v1/chat/completions` with `thinking:{type:enabled}` now populates
reasoning_content and never leaks `</think>` into content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@easel easel force-pushed the feat/server-cpu-test-rig branch from 8ebcfe9 to 57b6d14 Compare June 5, 2026 20:01
@easel easel marked this pull request as ready for review June 5, 2026 20:51
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 19 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/CMakeLists.txt">

<violation number="1" location="server/CMakeLists.txt:799">
P2: The new replay test target makes libcurl a hard configure-time dependency, which can break builds that previously skipped curl-dependent tests.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/CMakeLists.txt
# http_server.cpp #includes <curl/curl.h> for its upstream-proxy
# passthrough; replay_http_server compiles that TU so it must link
# libcurl even though the stub backend itself doesn't use it.
find_package(CURL REQUIRED)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The new replay test target makes libcurl a hard configure-time dependency, which can break builds that previously skipped curl-dependent tests.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/CMakeLists.txt, line 799:

<comment>The new replay test target makes libcurl a hard configure-time dependency, which can break builds that previously skipped curl-dependent tests.</comment>

<file context>
@@ -785,6 +785,44 @@ if(DFLASH27B_TESTS)
+        # http_server.cpp #includes <curl/curl.h> for its upstream-proxy
+        # passthrough; replay_http_server compiles that TU so it must link
+        # libcurl even though the stub backend itself doesn't use it.
+        find_package(CURL REQUIRED)
+        add_executable(replay_http_server
+            test/replay_http_server.cpp
</file context>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants