M-GPU-MOE-3 Option 2: CUDA f32→Q8K activation quant kernel (close v1.8.0 discharge with parity fix) #1838

## Scope

Implement Option 2 from [`qwen3-moe-forward-gpu-v1` v1.8.0](https://github.com/paiml/aprender/blob/main/contracts/qwen3-moe-forward-gpu-v1.yaml) (PR #1825 cascade discharge amendment): add a CUDA `f32→Q8K` activation quantization kernel and route the Q4_K MoE matvec dispatch through the existing `PackedDp4aQ4KQ8Kernel` instead of `q4k_matvec`.

Closes the root cause of #1583 (the 0.94-cos drop on real Qwen3-MoE forward) the *correct* way: making CUDA match the CPU algorithm, not papering over it.

## Background — why this exists

The M-GPU-MOE-3 cascade (#1801, #1805, #1811, #1816, #1818, #1821, #1822, #1825) empirically pinned the root cause to a **CPU/CUDA algorithm mismatch**:

| Path | Computes |
|---|---|
| CPU `fused_q4k_parallel_matvec` | `Q4_K(weights) × Q8_K(quantize(f32_activations))`: integer math via maddubs (4-8× speedup) |
| CUDA `q4k_matvec` | `Q4_K(weights) × f32_activations`: no activation quantization |

Per matvec, the CUDA path diverges from the CPU path by 2.88% on real Qwen3 weights. Compounded across 128 experts × 48 layers, this yields a ~6% cumulative cosine drop. **Neither side is wrong**: they compute different operations.
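
To make the mismatch concrete, here is a minimal sketch of how quantizing activations before a dot product changes the result relative to using the f32 activations directly. It is a simplification under stated assumptions: a single max-abs int8 scale stands in for Q8_K, weights stay in f32, and all function names are hypothetical rather than taken from the codebase.

```rust
/// Quantize `x` to int8 with one max-abs scale, then dequantize back to f32.
/// Simplified stand-in for Q8_K (assumption: the real format works per
/// 256-element super-block and may use a different scale convention).
fn quant_dequant_i8(x: &[f32]) -> Vec<f32> {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if amax == 0.0 {
        return vec![0.0; x.len()];
    }
    let scale = amax / 127.0;
    x.iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) * scale)
        .collect()
}

/// Plain f32 dot product.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(p, q)| p * q).sum()
}

/// Relative gap between the f32-activation result (CUDA-like path)
/// and the quantized-activation result (CPU-like path).
fn activation_quant_gap(w: &[f32], x: &[f32]) -> f32 {
    let exact = dot(w, x);
    let quantized = dot(w, &quant_dequant_i8(x));
    (exact - quantized).abs() / exact.abs()
}
```

Both results are valid for the operation they compute; the gap only shows that the two operations are not the same.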

Option 2 (recommended in v1.8.0) makes CUDA match CPU.

## What this issue covers

A multi-PR cascade to implement the fix:

### PR-A — CUDA f32→Q8K activation quant kernel scaffold

Add a new PTX kernel and `Kernel` trait impl in `crates/aprender-gpu/src/kernels/quantize/`. It mirrors the CPU `quantize_activations_q8k_into` in `crates/aprender-serve/src/quantize/parallel_k.rs` (search for that name; it produces `scales: &[f32]` + `quants: &[i8]` from f32 activations).

Q8_K block format:
- **256 quants per super-block** (matches Q4_K's super-block size, hence the "K" suffix vs Q8_0's 32)
- **Per-super-block scale** (f32)
- **256 int8 quants** per super-block
- For \`in_dim=768\`: 3 super-blocks → 768 quants + 3 scales
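
The sizing rule in the list above can be written as a small helper. This is an illustrative sketch; the constant name `QK_K` and the function `q8k_layout` are assumptions, not identifiers from the repository.

```rust
/// Super-block length shared by Q4_K and Q8_K, per the format notes above.
/// (Hypothetical constant name.)
const QK_K: usize = 256;

/// Returns `(num_super_blocks, num_quants)` for an activation vector of
/// length `in_dim`, or `None` when `in_dim` is not a whole number of
/// super-blocks.
fn q8k_layout(in_dim: usize) -> Option<(usize, usize)> {
    if in_dim == 0 || in_dim % QK_K != 0 {
        return None;
    }
    Some((in_dim / QK_K, in_dim))
}
```

For `in_dim = 768` this yields 3 super-blocks and 768 quants, matching the example above.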

Kernel structure (similar to `Q4KDequantKernel` / `PackedDp4aQ4KQ8Kernel`):
- Input: `f32[in_dim]` activation vector + `in_dim` parameter
- Output: `u8[in_dim]` (the quants, reinterpreted as i8 host-side) + `f32[num_super_blocks]` (the per-super-block scales)
- Launch: one block per super-block, 256 threads per block
- Per-thread: load 1 f32, take part in a block-wide max-abs reduction to get the scale (a warp-level reduction alone covers only 32 of the 256 threads, so the 8 warp results must also be combined, e.g. via shared memory), then divide, round, and store an i8
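
The per-super-block steps above can be sketched as a CPU reference, which may be handy when checking the kernel's output. This is not the repository's `quantize_activations_q8k_into`: it assumes a symmetric scale of `max_abs / 127` and round-to-nearest, and the real routine may use a different scale or rounding convention. The function name is hypothetical.

```rust
const SUPER_BLOCK: usize = 256;

/// Quantize `input` into `quants` (int8) and `scales` (one f32 per super-block).
/// Each chunk of 256 values plays the role of one CUDA block.
/// Assumption: scale = max_abs / 127, round-to-nearest.
fn quantize_q8k_reference(input: &[f32], quants: &mut [i8], scales: &mut [f32]) {
    assert_eq!(input.len() % SUPER_BLOCK, 0);
    assert_eq!(quants.len(), input.len());
    assert_eq!(scales.len(), input.len() / SUPER_BLOCK);

    for (sb, chunk) in input.chunks_exact(SUPER_BLOCK).enumerate() {
        // Step 1: max-abs reduction over the super-block
        // (a block-wide reduction on the GPU).
        let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        // Step 2: one scale per super-block; an all-zero block keeps scale 0.
        let scale = amax / 127.0;
        scales[sb] = scale;
        let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
        // Step 3: each "thread" scales, rounds, and stores one i8.
        let out = &mut quants[sb * SUPER_BLOCK..(sb + 1) * SUPER_BLOCK];
        for (q, v) in out.iter_mut().zip(chunk) {
            *q = (v * inv).round().clamp(-127.0, 127.0) as i8;
        }
    }
}
```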

Tests: source-level codegen (assert that the PTX contains key ops, similar to the `test_fused_swiglu_ptx_generation` pattern in [#1802](https://github.com/paiml/aprender/pull/1802)).

### PR-B: `CudaExecutor::quantize_activations_q8k` host API

Add a host-friendly wrapper:
```rust
/// Quantizes f32 activations to Q8_K on the GPU:
/// 256 i8 quants plus one f32 scale per super-block.
pub fn quantize_activations_q8k(
    &mut self,
    f32_input: &[f32],
    out_quants: &mut [i8],
    out_scales: &mut [f32],
) -> Result<(), GpuError>
```

It follows the shape of `CudaExecutor::fused_swiglu_host`: upload f32, dispatch the kernel, download quants and scales.
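
Before uploading anything, a wrapper like this would plausibly validate its buffer shapes against the Q8_K layout. The sketch below shows one way to do that; `ShapeError` and `check_q8k_buffers` are hypothetical names, and how the real wrapper maps such failures onto `GpuError` is not specified here.

```rust
const SUPER_BLOCK: usize = 256;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
enum ShapeError {
    InputNotSuperBlockMultiple { in_dim: usize },
    QuantsLen { expected: usize, got: usize },
    ScalesLen { expected: usize, got: usize },
}

/// Checks that the output buffers match the input length and returns the
/// number of super-blocks (one CUDA block each in the proposed kernel).
fn check_q8k_buffers(
    f32_input: &[f32],
    out_quants: &[i8],
    out_scales: &[f32],
) -> Result<usize, ShapeError> {
    let in_dim = f32_input.len();
    if in_dim == 0 || in_dim % SUPER_BLOCK != 0 {
        return Err(ShapeError::InputNotSuperBlockMultiple { in_dim });
    }
    if out_quants.len() != in_dim {
        return Err(ShapeError::QuantsLen { expected: in_dim, got: out_quants.len() });
    }
    let num_super_blocks = in_dim / SUPER_BLOCK;
    if out_scales.len() != num_super_blocks {
        return Err(ShapeError::ScalesLen { expected: num_super_blocks, got: out_scales.len() });
    }
    Ok(num_super_blocks)
}
```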

### PR-C — Falsifier: CPU+CUDA q4k_q8k_matvec parity on real Qwen3 weights

The decisive empirical test. Three-path bisection like [#1822](https://github.com/paiml/aprender/pull/1822):

- A = CPU `fused_q4k_parallel_matvec` (production-MoE, the current truth)
- B = CPU `quantize_activations_q8k_into` → CPU `fused_q4k_q8k_parallel_matvec_into` (manual split, already known to equal A from the CPU code structure)
- **C** = CUDA `quantize_activations_q8k` (new from PR-A/B) → CUDA `packed_dp4a_q4k_q8_gemv_async` (existing), the proposed Option 2 path

Acceptance: `rel_diff(A, C) < 1e-3` on the same `blk.0.attn_k.weight` slab that #1822 used. If A ≈ C within that tolerance, **Option 2 is empirically validated**.
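
The acceptance check could be computed along these lines. The exact definition of `rel_diff` is not given in this issue, so the relative L2 norm used below is an assumption; the falsifier in #1822 may define it differently.

```rust
/// Relative L2 difference ||a - c|| / ||a||, accumulated in f64.
/// Assumption: this is one reasonable reading of `rel_diff`, not
/// necessarily the one used by the existing falsifiers.
fn rel_diff(a: &[f32], c: &[f32]) -> f64 {
    assert_eq!(a.len(), c.len());
    let mut num = 0.0f64;
    let mut den = 0.0f64;
    for (&x, &y) in a.iter().zip(c) {
        let d = f64::from(x) - f64::from(y);
        num += d * d;
        den += f64::from(x) * f64::from(x);
    }
    if den == 0.0 {
        // Zero reference vector: identical inputs give 0, anything else is unbounded.
        return if num == 0.0 { 0.0 } else { f64::INFINITY };
    }
    (num / den).sqrt()
}

/// Acceptance gate from this issue: rel_diff(A, C) < 1e-3.
fn passes_parity(a: &[f32], c: &[f32]) -> bool {
    rel_diff(a, c) < 1e-3
}
```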

### PR-D: Wire `expert_swiglu_cuda` Q4_K dispatch to Option 2 path

Replace the qtype-aware dispatch in `crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs` so that Q4_K matvecs go through the new pre-quant + DP4A path instead of `q4k_matvec`.

Q6_K dispatch stays on the current `q6k_gemv` (the cascade already verified that it is ulp-scale: #1801, #1816).
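
The intended routing can be pictured as a two-arm match. The enum and function names below are hypothetical and only illustrate the decision; the real dispatch in `expert_swiglu_cuda.rs` may be structured differently.

```rust
/// Hypothetical weight quantization tags for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum QType {
    Q4K,
    Q6K,
}

/// Hypothetical labels for the two CUDA matvec routes.
#[derive(Debug, PartialEq)]
enum MatvecPath {
    /// New: quantize activations to Q8_K, then run the DP4A Q4_K × Q8 GEMV.
    PrequantDp4a,
    /// Unchanged: existing Q6_K GEMV.
    Q6kGemv,
}

fn select_matvec_path(qtype: QType) -> MatvecPath {
    match qtype {
        QType::Q4K => MatvecPath::PrequantDp4a,
        QType::Q6K => MatvecPath::Q6kGemv,
    }
}
```

Using an exhaustive `match` means that adding another quantization type later forces an explicit routing decision at compile time.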

### PR-E — Re-run per-layer real-model parity gate

Run [`tests/qwen3_moe_per_layer_gpu_parity.rs`](https://github.com/paiml/aprender/blob/main/crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs) (FALSIFY-QW3-MOE-PER-LAYER-001) and verify that all 48 layers reach cos ≥ 0.99 (currently 47/48 pass, with L47 at cos=0.961).

On success: flip the `qwen3-moe-forward-gpu-v1` status to `ACTIVE_RUNTIME` (currently `ACTIVE_ALGORITHM_LEVEL_WITH_DOCUMENTED_DIVERGENCE`). Closes #1583.
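
For reference, the per-layer gate amounts to a cosine similarity per layer followed by a threshold check. The sketch below is not the code in the parity test; the function names are hypothetical and only the 0.99 threshold comes from this issue.

```rust
/// Cosine similarity between two activation vectors, accumulated in f64.
/// Returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Indices of layers whose CPU-vs-GPU cosine falls below `threshold`.
fn failing_layers(cos_per_layer: &[f64], threshold: f64) -> Vec<usize> {
    cos_per_layer
        .iter()
        .enumerate()
        .filter(|(_, c)| **c < threshold)
        .map(|(i, _)| i)
        .collect()
}
```

With the current numbers (47 layers passing and layer 47 at 0.961), `failing_layers` with a threshold of 0.99 would report only index 47; the goal of PR-E is an empty result.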

## Effort estimate

| PR | Time | Risk |
|---|---|---|
| A — Quant kernel + PTX scaffold | 1-2 days | Medium — new PTX kernel; reuse Q8_K block format math from CPU |
| B — Host API wrapper | 2 hr | Low |
| C — Falsifier | 4 hr | Low — pattern from #1822 |
| D — Wire dispatch | 2 hr | Medium — must preserve qtype dispatch; Q6_K unchanged |
| E — Real-model parity verification | 2 hr | Medium — Float-equivalence is hard; tolerate documented residual |

**Total: ~1 week focused engineering.**

## Bonus perf

`packed_dp4a_q4k_q8_gemv_async` uses DP4A integer ops on Ampere+ (RTX 4090). It is expected to be **faster than the current `q4k_matvec` f32 path** by 2-4×, since each DP4A instruction performs four int8 multiply-accumulates where the f32 path performs one. This would be a perf win in addition to the parity fix.
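
For readers unfamiliar with DP4A, the operation is a four-lane int8 dot product accumulated into a 32-bit integer. The following is a host-side emulation for illustration only. It assumes signed lanes on both operands; which lanes the existing kernel treats as signed or unsigned is not stated in this issue. Function names are hypothetical.

```rust
/// Packs four i8 values into one u32, lowest lane first.
fn pack_i8x4(v: [i8; 4]) -> u32 {
    u32::from_le_bytes([v[0] as u8, v[1] as u8, v[2] as u8, v[3] as u8])
}

/// Emulates a signed DP4A: treat each u32 as four packed i8 lanes,
/// multiply lane-wise, and add the four products to `acc`.
fn dp4a_i8(a: u32, b: u32, acc: i32) -> i32 {
    let a = a.to_le_bytes();
    let b = b.to_le_bytes();
    let mut sum = acc;
    for i in 0..4 {
        sum = sum.wrapping_add(i32::from(a[i] as i8) * i32::from(b[i] as i8));
    }
    sum
}
```

One such instruction replaces four scalar multiply-adds, which is where the "4 muls/instruction vs 1" estimate comes from.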

## Reference files

- CPU Q8_K quant: `crates/aprender-serve/src/quantize/parallel_k.rs::quantize_activations_q8k_into` (search inside that file)
- CPU Q4_K × Q8_K matmul: `crates/aprender-serve/src/quantize/fused_q.rs::fused_q4k_q8k_parallel_matvec_into`
- CUDA Q4_K × Q8 matmul (existing, ready to wire): `crates/aprender-serve/src/cuda/executor/q4k_q8_gemv.rs::packed_dp4a_q4k_q8_gemv_async`
- CUDA dispatch to update: `crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs`
- Contract to flip: `contracts/qwen3-moe-forward-gpu-v1.yaml` (currently v1.8.0)
- Real-model parity gate: `crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs`

## Cross-refs

- Cascade discharge contract: #1825 (qwen3-moe-forward-gpu-v1 v1.7.2 → v1.8.0)
- Cascade falsifiers: #1801, #1805, #1811, #1816, #1818, #1821, #1822
- Issue closed by this: #1583 (M-GPU-MOE-3)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
