Skip to content

Remove HipW4A16SkinnyLinearKernel and no-group wvSplitK_int4#869

Merged
mgehre-amd merged 3 commits intogfx11from
matthias.remove-wvsplitk-int4-nogrouped
Apr 17, 2026
Merged

Remove HipW4A16SkinnyLinearKernel and no-group wvSplitK_int4#869
mgehre-amd merged 3 commits intogfx11from
matthias.remove-wvsplitk-int4-nogrouped

Conversation

@mgehre-amd
Copy link
Copy Markdown

Remove wvSplitK_int4 no-groups kernel variant and remove HipW4A16SkinnyLinearKernel. Both are not unused (all int4 model use groups), and HybridW4A16 supersedes HipW4A16SkinnyLinearKernel.

@mgehre-amd mgehre-amd requested a review from eble-amd April 13, 2026 11:50
@mgehre-amd mgehre-amd requested a review from gshtras as a code owner April 13, 2026 11:50
@mgehre-amd mgehre-amd changed the title Matthias.remove wvsplitk int4 nogrouped Remove HipW4A16SkinnyLinearKernel and no-group wvSplitK_int4 Apr 13, 2026
Copy link
Copy Markdown

@eble-amd eble-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my favorite kind of change. Please merge it as soon as you receive all appropriate approval.

No real INT4 quantized model uses per-channel (group_size=-1) quantization.
All AWQ, GPTQ, and compressed-tensors models use per-group scales with
group_size=32 or 128, handled by wvSplitK_int4_g. Removing the unused
no-groups variant reduces compile time and code complexity.

Changes:
- Removes ~260 lines of C++ kernel entry point and sweep code
- Eliminates group_size=-1 dispatch path in Python kernel wrapper
- Keeps shared kernel templates (used by grouped variant via template param)

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
HybridW4A16LinearKernel fully supersedes it — it handles both the skinny
decode path (wvSplitK_int4_g) and large-batch prefill (Triton), including
asymmetric/zero-point models. The skinny-only kernel class added no unique
capability.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
@mgehre-amd mgehre-amd force-pushed the matthias.remove-wvsplitk-int4-nogrouped branch from 055ceeb to 7ba0080 Compare April 14, 2026 21:34
The test_triton_w4a16_skinny_fmt_gemm_matches_reference test was
failing with 1/16384 elements marginally exceeding atol=1e-2
(actual diff: 0.0106). Relax to atol=5e-2, matching the asymmetric
test's tolerance on the same kernel.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
@mgehre-amd mgehre-amd merged commit 68d9e72 into gfx11 Apr 17, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants