Scope
Implement Option 2 from `qwen3-moe-forward-gpu-v1` v1.8.0 (PR #1825 cascade discharge amendment): add a CUDA `f32→Q8K` activation quantization kernel and route the Q4_K MoE matvec dispatch through the existing `PackedDp4aQ4KQ8Kernel` instead of `q4k_matvec`.
Closes the root cause of #1583 (the 0.94-cos drop on real Qwen3-MoE forward) the correct way: making CUDA match the CPU algorithm, not papering over it.
Background — why this exists
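The M-GPU-MOE-3 cascade (#1801, #1805, #1811, #1816, #1818, #1821, #1822, #1825) empirically pinned the root cause to a CPU/CUDA algorithm mismatch:
CPU: `Q4_K(weights) × Q8_K(quantize(f32_activations))`, integer math via maddubs (4-8× speedup)
CUDA `q4k_matvec`: `Q4_K(weights) × f32_activations`, no activation quantization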
Per matvec, the CUDA path diverges by 2.88% on real Qwen3 weights. Compounded across 128 experts × 48 layers, this yields a ~6% cumulative cos drop. Neither side is wrong; they compute different operations.
Option 2 (recommended in v1.8.0) makes CUDA match CPU.
What this issue covers
A multi-PR cascade to implement the fix:
PR-A — CUDA f32→Q8K activation quant kernel scaffold
Add a new PTX kernel + `Kernel` trait impl in `crates/aprender-gpu/src/kernels/quantize/`. Mirrors the CPU `quantize_activations_q8k_into` in `crates/aprender-serve/src/quantize/parallel_k.rs` (search for that name; produces `scales: &[f32]` + `quants: &[i8]` from f32 activations).
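Q8_K block format:
256 quants per super-block (matches Q4_K's super-block size, hence the "K" suffix vs Q8_0's 32)
Per-super-block scale (f32)
256 int8 quants per super-block
For `in_dim=768`: 3 super-blocks → 768 quants + 3 scales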
Kernel structure (similar to `Q4KDequantKernel` / `PackedDp4aQ4KQ8Kernel`):
Tests: source-level codegen (assert PTX contains key ops, similar to `test_fused_swiglu_ptx_generation` pattern in #1802).
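As a CPU-side reference for what the new kernel has to reproduce, here is a minimal sketch of per-super-block quantization: one f32 scale and 256 int8 quants per 256 inputs. The function name `quantize_q8k_reference` is hypothetical, and the symmetric max-abs scaling (`scale = max|x| / 127`) is an assumption; the real `quantize_activations_q8k_into` may scale or round differently, so treat this as an illustration of the shape of the computation, not as the production algorithm.

```rust
/// Super-block size shared by Q4_K and Q8_K.
const QK_K: usize = 256;

/// Hypothetical reference quantizer: one f32 scale plus 256 i8 quants per super-block.
/// ASSUMPTION: symmetric max-abs scaling (scale = max|x| / 127). The production
/// `quantize_activations_q8k_into` may use a different scale or rounding rule.
fn quantize_q8k_reference(input: &[f32], scales: &mut [f32], quants: &mut [i8]) {
    assert_eq!(input.len() % QK_K, 0, "input length must be a multiple of 256");
    assert_eq!(scales.len(), input.len() / QK_K, "one scale per super-block");
    assert_eq!(quants.len(), input.len(), "one quant per input value");

    for (sb, block) in input.chunks_exact(QK_K).enumerate() {
        // Largest magnitude in this super-block decides the scale.
        let max_abs = block.iter().fold(0.0f32, |m, x| m.max(x.abs()));
        let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 0.0 };
        let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
        scales[sb] = scale;

        // Round each value to the nearest representable int8 step.
        let out = &mut quants[sb * QK_K..(sb + 1) * QK_K];
        for (q, x) in out.iter_mut().zip(block) {
            *q = (x * inv).round().clamp(-127.0, 127.0) as i8;
        }
    }
}
```

For `in_dim=768` this fills 3 scales and 768 quants, and each dequantized value `quants[i] as f32 * scales[i / 256]` lands within half a quantization step of the input.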
PR-B — `CudaExecutor::quantize_activations_q8k` host API
Add a host-friendly wrapper:
```rust
pub fn quantize_activations_q8k(
    &mut self,
    f32_input: &[f32],
    out_quants: &mut [i8],
    out_scales: &mut [f32],
) -> Result<(), GpuError>
```
Mirrors `CudaExecutor::fused_swiglu_host` shape (upload f32, dispatch kernel, download quants + scales).
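Before uploading anything, a wrapper like this would presumably want to validate buffer shapes. The sketch below shows one way to do that; `q8k_super_blocks` is a hypothetical helper that does not exist in the codebase, and it returns a `String` error only to stay self-contained (the real API returns `GpuError`).

```rust
/// Number of values per Q8_K super-block.
const Q8K_SUPER_BLOCK: usize = 256;

/// Hypothetical shape check for a Q8_K host wrapper (not part of the existing code).
/// Returns the super-block count when the three buffers are mutually consistent.
fn q8k_super_blocks(
    in_len: usize,
    quants_len: usize,
    scales_len: usize,
) -> Result<usize, String> {
    if in_len == 0 || in_len % Q8K_SUPER_BLOCK != 0 {
        return Err(format!(
            "input length {} is not a positive multiple of {}",
            in_len, Q8K_SUPER_BLOCK
        ));
    }
    let blocks = in_len / Q8K_SUPER_BLOCK;
    if quants_len != in_len {
        return Err(format!("expected {} quants, got {}", in_len, quants_len));
    }
    if scales_len != blocks {
        return Err(format!("expected {} scales, got {}", blocks, scales_len));
    }
    Ok(blocks)
}
```

With `in_dim=768`, `q8k_super_blocks(768, 768, 3)` yields `Ok(3)`, while a length that is not a multiple of 256 is rejected before any device work starts.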
PR-C — Falsifier: CPU+CUDA q4k_q8k_matvec parity on real Qwen3 weights
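The decisive empirical test. Three-path bisection like #1822:
A = CPU `fused_q4k_parallel_matvec` (production MoE path, the current truth)
B = CPU `quantize_activations_q8k_into` → CPU `fused_q4k_q8k_parallel_matvec_into` (manual split; already known to equal A from the CPU code structure)
C = CUDA `quantize_activations_q8k` (new from PR-A/B) → CUDA `packed_dp4a_q4k_q8_gemv_async` (existing); this is the proposed Option 2 path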
Acceptance: `rel_diff(A, C) < 1e-3` on the same `blk.0.attn_k.weight` slab that #1822 used. If A and C agree at ulp scale, Option 2 is empirically validated.
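The acceptance check can be sketched as follows. The issue does not define `rel_diff`, so this assumes the relative L2 error (norm of the difference divided by the norm of the reference output); `passes_parity` is a hypothetical name, and the actual falsifier may use a different metric.

```rust
/// ASSUMPTION: rel_diff = ||reference - candidate||_2 / ||reference||_2.
/// The metric used by the real falsifier is not specified in this issue.
fn rel_diff(reference: &[f32], candidate: &[f32]) -> f64 {
    assert_eq!(reference.len(), candidate.len(), "outputs must have equal length");
    let mut num = 0.0f64;
    let mut den = 0.0f64;
    for (r, c) in reference.iter().zip(candidate) {
        let r = f64::from(*r);
        let d = r - f64::from(*c);
        num += d * d;
        den += r * r;
    }
    if den == 0.0 {
        // An all-zero reference only matches an all-zero candidate.
        return if num == 0.0 { 0.0 } else { f64::INFINITY };
    }
    (num / den).sqrt()
}

/// Hypothetical gate: path A (CPU) is the reference, path C (CUDA) the candidate.
fn passes_parity(a: &[f32], c: &[f32]) -> bool {
    rel_diff(a, c) < 1e-3
}
```

Accumulating in `f64` keeps the metric itself from adding rounding noise that is comparable to the differences being measured.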
PR-D — Wire `expert_swiglu_cuda` Q4_K dispatch to Option 2 path
Replace the qtype-aware dispatch in `crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs` so Q4_K matvecs go through the new pre-quant + DP4A path instead of `q4k_matvec`.
Q6_K dispatch stays on the current `q6k_gemv` (the cascade verified it's ulp-scale already — #1801, #1816).
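The intended routing can be pictured with a small sketch. All names here (`QType`, `MatvecPath`, `select_matvec_path`) are hypothetical stand-ins, not the types used in `expert_swiglu_cuda.rs`; the point is only that Q4_K changes path while Q6_K is left alone.

```rust
/// Hypothetical quantization-type tag; the real dispatch uses the crate's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QType {
    Q4K,
    Q6K,
}

/// Hypothetical label for the kernel path chosen per matvec.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MatvecPath {
    /// f32 -> Q8_K activation quant, then the DP4A Q4_K x Q8 GEMV.
    PrequantDp4a,
    /// Existing Q6_K GEMV, unchanged.
    Q6kGemv,
}

/// Q4_K goes through the new pre-quant + DP4A path; Q6_K stays where it is.
fn select_matvec_path(qtype: QType) -> MatvecPath {
    match qtype {
        QType::Q4K => MatvecPath::PrequantDp4a,
        QType::Q6K => MatvecPath::Q6kGemv,
    }
}
```

An exhaustive `match` means any qtype added later forces an explicit routing decision at compile time instead of silently falling through to one of the two paths.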
PR-E — Re-run per-layer real-model parity gate
Run `tests/qwen3_moe_per_layer_gpu_parity.rs` (FALSIFY-QW3-MOE-PER-LAYER-001) and verify that all 48 layers reach cos ≥ 0.99 (currently 47/48 pass, with layer 47 at cos = 0.961).
On success: flip `qwen3-moe-forward-gpu-v1` status to `ACTIVE_RUNTIME` (currently `ACTIVE_ALGORITHM_LEVEL_WITH_DOCUMENTED_DIVERGENCE`). Closes #1583.
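The gate boils down to a cosine similarity per layer plus a threshold. A minimal sketch, assuming the CPU and GPU hidden states are available as flat f32 slices; `cosine` and `failing_layers` are hypothetical helpers, not functions from the existing test file.

```rust
/// Cosine similarity between two equally sized vectors, accumulated in f64.
/// Returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (x, y) in a.iter().zip(b) {
        let (x, y) = (f64::from(*x), f64::from(*y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Hypothetical gate helper: indices of layers whose cosine falls below the threshold.
fn failing_layers(per_layer_cos: &[f64], threshold: f64) -> Vec<usize> {
    per_layer_cos
        .iter()
        .enumerate()
        .filter(|(_, c)| **c < threshold)
        .map(|(i, _)| i)
        .collect()
}
```

Under the current numbers, a 48-entry list with layer 47 at 0.961 and a threshold of 0.99 reports exactly that one layer; the success condition for this PR is an empty result.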
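Effort estimate
PR-D: risk Medium (must preserve qtype dispatch; Q6_K unchanged)
PR-E (real-model parity verification): 2 hr, risk Medium (float-equivalence is hard; tolerate documented residual)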
Total: ~1 week focused engineering.
Bonus perf
`packed_dp4a_q4k_q8_gemv_async` uses DP4A integer ops on Ampere+ GPUs (e.g. RTX 4090). It is expected to be 2-4× faster than the current `q4k_matvec` f32 path, since each DP4A instruction performs 4 int8 multiply-accumulates where an f32 FMA performs 1. This is a perf win in addition to the parity fix.
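To make the "4 muls per instruction" point concrete, here is a software model of a DP4A-style operation: four int8 products summed into an int32 accumulator in one step. This is an illustrative sketch with hypothetical names (`dp4a_i8`, `dot_i8`) and signed operands on both sides; the actual kernel works on packed registers and may treat the Q4_K nibbles as unsigned.

```rust
/// Software model of a signed DP4A: acc + sum of four i8 x i8 products.
/// On the GPU this is a single instruction over two packed 32-bit registers.
fn dp4a_i8(a: [i8; 4], b: [i8; 4], acc: i32) -> i32 {
    a.iter()
        .zip(b.iter())
        .fold(acc, |sum, (&x, &y)| sum + i32::from(x) * i32::from(y))
}

/// Integer dot product built from DP4A-sized groups of four.
fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    assert_eq!(a.len() % 4, 0, "length must be a multiple of 4");
    a.chunks_exact(4)
        .zip(b.chunks_exact(4))
        .fold(0, |acc, (x, y)| {
            dp4a_i8([x[0], x[1], x[2], x[3]], [y[0], y[1], y[2], y[3]], acc)
        })
}
```

A 256-value super-block therefore needs 64 such steps for its integer dot product, after which the per-block scales are applied in floating point.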
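Reference files
CPU Q8_K quant: `crates/aprender-serve/src/quantize/parallel_k.rs::quantize_activations_q8k_into` (search inside that file)
CPU Q4_K × Q8_K matmul: `crates/aprender-serve/src/quantize/fused_q.rs::fused_q4k_q8k_parallel_matvec_into`
CUDA Q4_K × Q8 matmul (existing, ready to wire): `crates/aprender-serve/src/cuda/executor/q4k_q8_gemv.rs::packed_dp4a_q4k_q8_gemv_async`
CUDA dispatch to update: `crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs`
Contract to flip: `contracts/qwen3-moe-forward-gpu-v1.yaml` (currently v1.8.0)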
Cross-refs
🤖 Generated with Claude Code