M-GPU-MOE-3 Option 2: CUDA f32→Q8K activation quant kernel (close v1.8.0 discharge with parity fix) #1838

@noahgift

Scope

Implement Option 2 from `qwen3-moe-forward-gpu-v1` v1.8.0 (PR #1825 cascade discharge amendment): add a CUDA `f32→Q8K` activation quantization kernel and route the Q4_K MoE matvec dispatch through the existing `PackedDp4aQ4KQ8Kernel` instead of `q4k_matvec`.

Closes the root cause of #1583 (cosine similarity dropping to 0.94 on the real Qwen3-MoE forward pass) the correct way: by making CUDA match the CPU algorithm rather than papering over the mismatch.

Background — why this exists

The M-GPU-MOE-3 cascade (#1801, #1805, #1811, #1816, #1818, #1821, #1822, #1825) empirically pinned the root cause to a CPU/CUDA algorithm mismatch:

| Path | Computes |
|---|---|
| CPU `fused_q4k_parallel_matvec` | `Q4_K(weights) × Q8_K(quantize(f32_activations))`; integer math via maddubs (4-8× speedup) |
| CUDA `q4k_matvec` | `Q4_K(weights) × f32_activations`; no activation quantization |

Per matvec, the CUDA path diverges from the CPU path by 2.88% on real Qwen3 weights. Compounded across 128 experts × 48 layers, that becomes a ~6% cumulative cos drop. Neither side is wrong: they compute different operations.

Option 2 (recommended in v1.8.0) makes CUDA match CPU.

What this issue covers

A multi-PR cascade to implement the fix:

PR-A — CUDA f32→Q8K activation quant kernel scaffold

Add a new PTX kernel + `Kernel` trait impl in `crates/aprender-gpu/src/kernels/quantize/`. Mirrors the CPU `quantize_activations_q8k_into` in `crates/aprender-serve/src/quantize/parallel_k.rs` (search for that name; produces `scales: &[f32]` + `quants: &[i8]` from f32 activations).

Q8_K block format:

  • 256 int8 quants per super-block (matches Q4_K's super-block size, hence the "K" suffix; Q8_0 uses 32-value blocks)
  • One f32 scale per super-block
  • For `in_dim=768`: 3 super-blocks → 768 quants + 3 scales
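
As a rough illustration of the block format above, here is a minimal CPU-side reference sketch. It is not the actual `quantize_activations_q8k_into` code: the function name `quantize_q8k_ref` is hypothetical, and the `max_abs / 127` scale with round-to-nearest is an assumed convention that the real CPU routine may not share. The CUDA kernel should mirror whatever the CPU routine actually does.

```rust
/// Values per Q8_K super-block, as listed in the block format above.
const QK_K: usize = 256;

/// Hypothetical reference quantizer: one f32 scale plus 256 i8 quants per super-block.
/// ASSUMPTION: symmetric scale = max_abs / 127 with round-to-nearest; the real
/// CPU routine may use a different convention.
fn quantize_q8k_ref(input: &[f32], scales: &mut [f32], quants: &mut [i8]) {
    assert_eq!(input.len() % QK_K, 0, "in_dim must be a multiple of 256");
    assert_eq!(quants.len(), input.len());
    assert_eq!(scales.len(), input.len() / QK_K);

    for (sb, block) in input.chunks_exact(QK_K).enumerate() {
        // Largest magnitude in this super-block decides the scale.
        let max_abs = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
        let out = &mut quants[sb * QK_K..(sb + 1) * QK_K];
        if max_abs == 0.0 {
            // All-zero block: avoid dividing by zero.
            scales[sb] = 0.0;
            out.fill(0);
            continue;
        }
        let scale = max_abs / 127.0;
        let inv = 1.0 / scale;
        scales[sb] = scale;
        for (q, &x) in out.iter_mut().zip(block) {
            *q = (x * inv).round().clamp(-127.0, 127.0) as i8;
        }
    }
}
```

For `in_dim=768` this fills 3 scales and 768 quants, and each dequantized value `quants[i] as f32 * scales[i / 256]` should land within about half a scale step of the original.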

Kernel structure (similar to `Q4KDequantKernel` / `PackedDp4aQ4KQ8Kernel`):

  • Input: `f32[in_dim]` activation vector + `in_dim` parameter
  • Output: `u8[in_dim]` (the quants, reinterpreted as i8 host-side) + `f32[num_super_blocks]` (the per-block scales)
  • One block per super-block, 256 threads per block
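  • Per-thread: load 1 f32, take part in a block-wide max-abs reduction (a warp reduction within each warp, then a combine across the block's 8 warps, since a single warp covers only 32 of the 256 values), derive the scale, divide, round, store i8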
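
The max-abs reduction for a 256-thread block can be emulated on the host to check the kernel's logic. The sketch below is a plain Rust simulation, not PTX: `warp_max` and `block_max_abs` are hypothetical names, and the shuffle-down pattern plus one partial slot per warp is an assumed layout for the real kernel.

```rust
const WARP: usize = 32;
const BLOCK: usize = 256;

/// Emulates a shuffle-down max reduction over one 32-lane warp.
/// After the loop, lane 0 holds the maximum of all lanes.
fn warp_max(lanes: &mut [f32; WARP]) -> f32 {
    let mut offset = WARP / 2;
    while offset > 0 {
        for lane in 0..offset {
            lanes[lane] = lanes[lane].max(lanes[lane + offset]);
        }
        offset /= 2;
    }
    lanes[0]
}

/// Emulates the block-wide step: 8 warps each reduce their 32 values,
/// write one partial (the shared-memory stage), then one warp reduces the partials.
fn block_max_abs(vals: &[f32; BLOCK]) -> f32 {
    // One slot per possible warp; unused slots stay 0.0, which is neutral for max-abs.
    let mut partials = [0.0f32; WARP];
    for (w, chunk) in vals.chunks_exact(WARP).enumerate() {
        let mut lanes = [0.0f32; WARP];
        for (l, &x) in lanes.iter_mut().zip(chunk) {
            *l = x.abs();
        }
        partials[w] = warp_max(&mut lanes);
    }
    warp_max(&mut partials)
}
```

A block whose largest magnitude sits outside the first warp (say at index 170) is a useful test case, because a single warp-level reduction would miss it.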

Tests: source-level codegen (assert PTX contains key ops, similar to `test_fused_swiglu_ptx_generation` pattern in #1802).

PR-B — `CudaExecutor::quantize_activations_q8k` host API

Add a host-friendly wrapper:
```rust
pub fn quantize_activations_q8k(
    &mut self,
    f32_input: &[f32],
    out_quants: &mut [i8],
    out_scales: &mut [f32],
) -> Result<(), GpuError>
```

Mirrors `CudaExecutor::fused_swiglu_host` shape (upload f32, dispatch kernel, download quants + scales).
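
Before uploading anything, the wrapper would likely need to validate buffer sizes against the Q8_K layout. A minimal sketch of that check follows. The helper `check_q8k_buffers` and its `String` error are hypothetical stand-ins; the real wrapper would presumably return a `GpuError`, whose variants are not shown in this issue.

```rust
/// Values per Q8_K super-block.
const QK_K: usize = 256;

/// Hypothetical pre-flight check: returns the number of super-blocks on success.
/// ASSUMPTION: the input length must be a whole number of 256-value super-blocks.
fn check_q8k_buffers(
    in_len: usize,
    quants_len: usize,
    scales_len: usize,
) -> Result<usize, String> {
    if in_len == 0 || in_len % QK_K != 0 {
        return Err(format!("in_dim {in_len} is not a positive multiple of {QK_K}"));
    }
    let num_super_blocks = in_len / QK_K;
    if quants_len != in_len {
        return Err(format!("out_quants has {quants_len} entries, expected {in_len}"));
    }
    if scales_len != num_super_blocks {
        return Err(format!(
            "out_scales has {scales_len} entries, expected {num_super_blocks}"
        ));
    }
    Ok(num_super_blocks)
}
```

With `in_dim=768`, `check_q8k_buffers(768, 768, 3)` yields 3 super-blocks, which is also the natural grid size for a one-block-per-super-block launch.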

PR-C — Falsifier: CPU+CUDA q4k_q8k_matvec parity on real Qwen3 weights

The decisive empirical test. Three-path bisection like #1822:

  • A = CPU `fused_q4k_parallel_matvec` (production-MoE, the current truth)
  • B = CPU `quantize_activations_q8k_into` → CPU `fused_q4k_q8k_parallel_matvec_into` (manual split — already known to equal A from CPU code structure)
  • C = CUDA `quantize_activations_q8k` (new from PR-A/B) → CUDA `packed_dp4a_q4k_q8_gemv_async` (existing) — the proposed Option 2 path

Acceptance: `rel_diff(A, C) < 1e-3` on the same `blk.0.attn_k.weight` slab that #1822 used. If A and C agree within that bound, Option 2 is empirically validated.
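
The issue does not spell out how `rel_diff` is defined. The sketch below assumes a relative L2 error, `||A - C|| / ||A||`, accumulated in f64; if #1822 used a different metric (for example max element-wise relative error), that definition should be used instead.

```rust
/// ASSUMED metric: relative L2 error of `candidate` against `reference`.
/// Accumulates in f64 so the metric itself adds little rounding noise.
fn rel_diff(reference: &[f32], candidate: &[f32]) -> f64 {
    assert_eq!(reference.len(), candidate.len(), "length mismatch");
    let mut num = 0.0f64;
    let mut den = 0.0f64;
    for (&a, &c) in reference.iter().zip(candidate) {
        let d = a as f64 - c as f64;
        num += d * d;
        den += (a as f64) * (a as f64);
    }
    if den == 0.0 {
        // Zero reference: identical outputs pass, anything else fails.
        return if num == 0.0 { 0.0 } else { f64::INFINITY };
    }
    (num / den).sqrt()
}
```

Under this definition, a uniform 2.88% deviation (the per-matvec figure quoted in the background) gives `rel_diff ≈ 0.0288`, which would fail the `1e-3` gate by a wide margin.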

PR-D — Wire `expert_swiglu_cuda` Q4_K dispatch to Option 2 path

Replace the qtype-aware dispatch in `crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs` so Q4_K matvecs go through the new pre-quant + DP4A path instead of `q4k_matvec`.

Q6_K dispatch stays on the current `q6k_gemv` (the cascade verified it's ulp-scale already — #1801, #1816).

PR-E — Re-run per-layer real-model parity gate

Run `tests/qwen3_moe_per_layer_gpu_parity.rs` (FALSIFY-QW3-MOE-PER-LAYER-001) and verify that all 48 layers reach cos ≥ 0.99 (currently 47/48 pass, with L47 at cos=0.961).

On success: flip `qwen3-moe-forward-gpu-v1` status to `ACTIVE_RUNTIME` (currently `ACTIVE_ALGORITHM_LEVEL_WITH_DOCUMENTED_DIVERGENCE`). Closes #1583.
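
For reference, the per-layer gate comes down to a cosine similarity per layer followed by a threshold check. The sketch below is illustrative only: `cosine` and `failing_layers` are hypothetical helpers, not the code in `qwen3_moe_per_layer_gpu_parity.rs`.

```rust
/// Cosine similarity between two activation vectors, accumulated in f64.
/// Returns 0.0 if either vector has zero norm (a choice made for this sketch).
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "length mismatch");
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        dot += x as f64 * y as f64;
        na += x as f64 * x as f64;
        nb += y as f64 * y as f64;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Indices of layers whose CPU-vs-GPU cosine falls below the threshold.
fn failing_layers(cos_per_layer: &[f64], threshold: f64) -> Vec<usize> {
    cos_per_layer
        .iter()
        .enumerate()
        .filter(|(_, &c)| c < threshold)
        .map(|(i, _)| i)
        .collect()
}
```

With the numbers quoted above (47 layers passing and L47 at 0.961), `failing_layers(&cos, 0.99)` would report only index 47; the gate passes when the returned list is empty.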

Effort estimate

| PR | Time | Risk |
|---|---|---|
| A: Quant kernel + PTX scaffold | 1-2 days | Medium: new PTX kernel; reuse Q8_K block format math from CPU |
| B: Host API wrapper | 2 hr | Low |
| C: Falsifier | 4 hr | Low: pattern from #1822 |
| D: Wire dispatch | 2 hr | Medium: must preserve qtype dispatch; Q6_K unchanged |
| E: Real-model parity verification | 2 hr | Medium: float-equivalence is hard; tolerate documented residual |

Total: ~1 week focused engineering.

Bonus perf

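`packed_dp4a_q4k_q8_gemv_async` uses DP4A integer ops, which the RTX 4090 supports (DP4A is available on compute capability 6.1 and newer). It is expected to be 2-4× faster than the current `q4k_matvec` f32 path, since each DP4A instruction performs 4 multiply-accumulates versus 1. This is a perf win in addition to the parity fix.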
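
To make the "4 muls/instruction" point concrete, here is a host-side emulation of a signed DP4A. It is a sketch, not the kernel's code: `dp4a`, `pack4`, and `dot_i8` are hypothetical names, and it assumes signed×signed lanes, whereas the real Q4_K path may use a mixed-sign variant because Q4_K nibbles are unsigned.

```rust
/// Emulates signed DP4A: treat each 32-bit word as four i8 lanes,
/// multiply lane-wise, and add the four products to the accumulator.
fn dp4a(a: u32, b: u32, acc: i32) -> i32 {
    let a = a.to_le_bytes();
    let b = b.to_le_bytes();
    let mut sum = acc;
    for i in 0..4 {
        sum += (a[i] as i8 as i32) * (b[i] as i8 as i32);
    }
    sum
}

/// Packs four i8 values into one 32-bit word (little-endian lane order).
fn pack4(v: [i8; 4]) -> u32 {
    u32::from_le_bytes(v.map(|x| x as u8))
}

/// Integer dot product built from DP4A steps: one call covers four
/// multiply-accumulates that the f32 path would issue one at a time.
fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len(), "length mismatch");
    assert_eq!(a.len() % 4, 0, "length must be a multiple of 4");
    a.chunks_exact(4)
        .zip(b.chunks_exact(4))
        .fold(0, |acc, (x, y)| {
            dp4a(
                pack4([x[0], x[1], x[2], x[3]]),
                pack4([y[0], y[1], y[2], y[3]]),
                acc,
            )
        })
}
```

A 256-value super-block then needs 64 such steps for its integer sum, after which the result is multiplied by the weight and activation scales in floating point.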

Reference files

  • CPU Q8_K quant: `crates/aprender-serve/src/quantize/parallel_k.rs::quantize_activations_q8k_into` (search inside that file)
  • CPU Q4_K × Q8_K matmul: `crates/aprender-serve/src/quantize/fused_q.rs::fused_q4k_q8k_parallel_matvec_into`
  • CUDA Q4_K × Q8 matmul (existing, ready to wire): `crates/aprender-serve/src/cuda/executor/q4k_q8_gemv.rs::packed_dp4a_q4k_q8_gemv_async`
  • CUDA dispatch to update: `crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs`
  • Contract to flip: `contracts/qwen3-moe-forward-gpu-v1.yaml` (currently v1.8.0)
  • Real-model parity gate: `crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs`

Cross-refs

🤖 Generated with Claude Code
