fix(test): chain DFlash verify/replay produce valid logits at all positions#346
Open
davide221 wants to merge 1 commit into
…itions

The chain (non-DDTree) decode path in test_dflash aborted after step 0 on Qwen3.6 and generated 0 tokens; running with --ddtree worked, so the bug was specific to the chain verify/replay.

Two pre-existing root causes, both independent of the DDTree path:

1. build_causal_mask for the batched verify and the legacy replay was called without kv_pad_override, so the mask buffer was strided by align_up(win_len) while the mask tensor was allocated with stride align_up(max_ctx + n_tokens). Only query row 0 landed at the right offset; rows 1.. read an unwritten region, zeroing attention (and logits) for every verify/replay position > 0. DDTree already passed this; the chain did not.

2. The chain verify read sg.argmax_tokens, whose CUDA ggml_argmax returns -1 after position 0 even when logits are valid (same defect fixed for DDTree in 1b3882d). Switched to reading full logits + CPU argmax per position.

Verified on main (f59f2a3), RTX 3090 / Qwen3.6-27B Q4_K_M: 7 steps, accepted=41/112 (36.6%/step), coherent output matching the DDTree path.

Co-Authored-By: WOZCODE <contact@withwoz.com>
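The second fix in the commit message, reading the full logits and taking the argmax on the CPU for each verify position, could look roughly like the sketch below. It assumes the logits arrive as a row-major `[n_tokens x n_vocab]` float buffer; `argmax_row` and `argmax_per_position` are hypothetical names for illustration and are not the harness's own `argmax_f32`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: index of the largest of n floats (first one wins ties).
static std::int32_t argmax_row(const float* x, std::size_t n) {
    assert(n > 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i) {
        if (x[i] > x[best]) {
            best = i;
        }
    }
    return static_cast<std::int32_t>(best);
}

// Assumed layout: logits is row-major, one row of n_vocab floats per position.
// Returns one token id per position, computed independently for each row.
static std::vector<std::int32_t> argmax_per_position(const std::vector<float>& logits,
                                                     std::size_t n_tokens,
                                                     std::size_t n_vocab) {
    assert(logits.size() == n_tokens * n_vocab);
    std::vector<std::int32_t> tokens(n_tokens);
    for (std::size_t t = 0; t < n_tokens; ++t) {
        tokens[t] = argmax_row(logits.data() + t * n_vocab, n_vocab);
    }
    return tokens;
}
```

Because every position is reduced from its own logits row, a result of `-1` cannot occur here; the cost is copying the full logits to the host instead of one token id per position.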
Summary
The chain (non-DDTree) DFlash decode path in `test_dflash` aborts after step 0 on Qwen3.6 and generates 0 tokens. Running the same prompt with `--ddtree` works, so the bug is specific to the chain verify/replay. This fixes it; chain mode now decodes correctly.

Root cause
Two pre-existing bugs in `server/test/test_dflash.cpp`, both independent of the DDTree path:

1. Causal-mask stride mismatch (the main bug). The batched verify and the legacy replay call `build_causal_mask(...)` without `kv_pad_override`, so the mask buffer is strided by `align_up(win_len, kq_stride_pad)`, while the mask tensor is allocated with stride `align_up(max_ctx + n_tokens, kq_stride_pad)`. Only query row 0 lands at the correct offset; rows 1.. read an unwritten region, so attention (and therefore logits) is zeroed for every verify/replay position > 0. DDTree already passes the override; the chain path did not. Empirically, logits `maxabs` was `p0=26.5, p1..=0.0`; with the override, `p1..=26.x`.
2. Broken GPU argmax read. The chain verify read `sg.argmax_tokens`, whose CUDA `ggml_argmax` returns `-1` after position 0 even when logits are valid (the same defect fixed for DDTree in `1b3882d`). Switched to reading full `sg.logits` + CPU `argmax_f32` per position.

With (1) alone the verify is correct but the replay still corrupts `last_tok`; both are needed.

Validation
RTX 3090 (sm_86) / CUDA 12, Qwen3.6-27B Q4_K_M target + 3.6 DFlash draft, on `main` (`f59f2a3`):

- `[step 0] accept_n=2 bonus=-1`, then silent abort, 0 tokens.
- […] `temp=1.0` sampling).

Notes
This is the chain test-harness path; the production server uses DDTree (unaffected). Issue #259 is a different (V100/Volta MMA) problem.
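
The stride mismatch described under Root cause can be reproduced in isolation. The sketch below is a toy model, not the harness code: `sketch_causal_mask` and `visible_per_row` are hypothetical helpers, the pad value and sizes are made up, and it assumes that cells the writer never touches behave as masked (`-inf`). The writer places row `q` at `q * write_stride`, while the reader indexes rows by the allocation stride.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Round n up to the next multiple of pad (pad > 0).
static std::size_t align_up(std::size_t n, std::size_t pad) {
    return (n + pad - 1) / pad * pad;
}

// Hypothetical stand-in for a causal-mask builder.
// The buffer is sized for alloc_stride, but row q is written at q * write_stride.
// 0.0f means "may attend"; -inf means masked.
// Assumption: untouched cells are treated as masked.
static std::vector<float> sketch_causal_mask(std::size_t n_q, std::size_t n_past,
                                             std::size_t write_stride,
                                             std::size_t alloc_stride) {
    assert(write_stride <= alloc_stride);
    assert(n_past + n_q <= write_stride);
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> buf(n_q * alloc_stride, neg_inf);
    for (std::size_t q = 0; q < n_q; ++q) {
        // Query q sees the n_past cached positions, earlier queries, and itself.
        for (std::size_t kv = 0; kv <= n_past + q; ++kv) {
            buf[q * write_stride + kv] = 0.0f;
        }
    }
    return buf;
}

// Reader side: index rows by read_stride and count the visible cells
// among the first kv_len columns of each row.
static std::vector<std::size_t> visible_per_row(const std::vector<float>& buf,
                                                std::size_t n_q, std::size_t kv_len,
                                                std::size_t read_stride) {
    std::vector<std::size_t> visible(n_q, 0);
    for (std::size_t q = 0; q < n_q; ++q) {
        for (std::size_t kv = 0; kv < kv_len; ++kv) {
            if (buf[q * read_stride + kv] == 0.0f) {
                ++visible[q];
            }
        }
    }
    return visible;
}
```

With illustrative values `pad = 32`, `n_past = 36`, `n_q = 4` and `max_ctx = 256`, the narrow stride is `align_up(40, 32) = 64` and the allocation stride is `align_up(260, 32) = 288`. Writing with 64 and reading with 288 leaves rows 1..3 with no visible cells, while row 0 still sees its 37 positions; writing with 288 gives 37, 38, 39 and 40 visible cells. This mirrors the `p0=26.5, p1..=0.0` pattern reported above.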
🧙 Built with WOZCODE