cpu: fix ARM NEON nvfp4 vec dot #1

Open
aidaiprivate-source wants to merge 1 commit into master from 0cc4m/cpu-arm-nvfp4-fix

aidaiprivate-source (Owner) commented Jun 4, 2026

Overview

Additional information

Requirements

Summary by CodeRabbit

  • Refactor
    • Reworked the ARM NEON NVFP4 vector dot product (ggml_vec_dot_nvfp4_q8_0) to process its operands in four 8-lane chunks through a new ggml_nvfp4_dot8 helper.

coderabbitai (Bot) commented Jun 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b99754bf-5bb5-4587-b8d4-2a45ae1cf6cf

📥 Commits

Reviewing files that changed from the base of the PR and between 94a220c and a30369d.

📒 Files selected for processing (2)
  • ggml/src/ggml-cpu/arch/arm/quants.c
  • ggml/src/ggml-cpu/ggml-cpu-impl.h

Walkthrough

This PR refactors the ARM NEON implementation of 4-bit quantized dot product calculation. It introduces a new ggml_nvfp4_dot8 helper function and modifies ggml_vec_dot_nvfp4_q8_0 to process data in four 8-lane chunks instead of two 16-lane vectors, computing four partial results that are then directly used in the final fused multiply-add operation.
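The chunked structure described above can be modelled in portable scalar C. This is only a sketch under stated assumptions: the names `chunk_dot_sketch` and `block_dot_sketch`, the chunk length parameter, and the one-scale-per-chunk layout are hypothetical and are not taken from the PR or from the actual NVFP4 block layout. It only illustrates the idea of computing four integer partial results and then folding each into a float accumulator with a multiply-add.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration: four chunks, one float scale each. */
enum { SKETCH_CHUNKS = 4 };

/* Integer dot product over one chunk of n int8 values. */
static int32_t chunk_dot_sketch(const int8_t *x, const int8_t *y, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += (int32_t)x[i] * (int32_t)y[i];
    }
    return sum;
}

/* Computes one partial sum per chunk (p0..p3 in the walkthrough's terms),
 * then accumulates partial * scale into a single float result. */
static float block_dot_sketch(const int8_t *x, const int8_t *y,
                              size_t chunk_len,
                              const float scale[SKETCH_CHUNKS]) {
    float acc = 0.0f;
    for (size_t c = 0; c < SKETCH_CHUNKS; ++c) {
        int32_t p = chunk_dot_sketch(x + c * chunk_len, y + c * chunk_len, chunk_len);
        acc += (float)p * scale[c];  /* scalar stand-in for the vector FMA */
    }
    return acc;
}
```

In the real kernel the four partials would presumably be packed into one vector and scaled in a single fused multiply-add; the loop here performs the same arithmetic one chunk at a time.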

Changes

NVFP4 Quantized Dot Product Optimization

Layer / File(s) | Summary
NVFP4 dot8 helper function (ggml/src/ggml-cpu/ggml-cpu-impl.h) | New ggml_nvfp4_dot8 NEON helper multiplies paired int8x8_t lanes using vmull_s8, widens pairwise sums via vpaddlq_s16, and combines low/high results via vaddq_s32 into a final int32x4_t.
vec_dot_nvfp4_q8_0 implementation refactor (ggml/src/ggml-cpu/arch/arm/quants.c) | Operand loading splits q8 and q4 into four 8-lane chunks each; ggml_nvfp4_dot8 computes four partial results (p0 through p3); accumulation builds a float32x4_t sums vector from horizontal sums of all four partials and performs fused multiply-add with scales, removing prior int32 to float conversion.
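The intrinsic sequence named in the summary (vmull_s8, then vpaddlq_s16, then vaddq_s32) can be sketched as follows. The function name `nvfp4_dot16_sketch`, its pointer-based signature, and the way the low and high halves are paired are assumptions made for illustration; the actual ggml_nvfp4_dot8 signature is not shown on this page. A scalar fallback with the same lane semantics is included so the sketch also builds on non-ARM hosts.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical stand-in for the helper: multiplies 16 int8 pairs and
 * reduces them to four int32 lanes. Lane j of the result holds
 * a[2j]*b[2j] + a[2j+1]*b[2j+1] + a[8+2j]*b[8+2j] + a[8+2j+1]*b[8+2j+1]. */
static void nvfp4_dot16_sketch(const int8_t a[16], const int8_t b[16],
                               int32_t out[4]) {
#if defined(__ARM_NEON)
    /* Widening multiply: 8 x int8 * 8 x int8 -> 8 x int16. */
    int16x8_t prod_lo = vmull_s8(vld1_s8(a), vld1_s8(b));
    int16x8_t prod_hi = vmull_s8(vld1_s8(a + 8), vld1_s8(b + 8));
    /* Pairwise add-long: 8 x int16 -> 4 x int32, then add low and high. */
    vst1q_s32(out, vaddq_s32(vpaddlq_s16(prod_lo), vpaddlq_s16(prod_hi)));
#else
    /* Scalar emulation of the same lane arithmetic. */
    for (int j = 0; j < 4; ++j) {
        int32_t lo = (int32_t)a[2 * j] * b[2 * j]
                   + (int32_t)a[2 * j + 1] * b[2 * j + 1];
        int32_t hi = (int32_t)a[8 + 2 * j] * b[8 + 2 * j]
                   + (int32_t)a[8 + 2 * j + 1] * b[8 + 2 * j + 1];
        out[j] = lo + hi;
    }
#endif
}

/* Horizontal sum of the four lanes, giving the full 16-element dot product. */
static int32_t hsum4_sketch(const int32_t v[4]) {
    return v[0] + v[1] + v[2] + v[3];
}
```

Because vmull_s8 widens to int16 before any addition and vpaddlq_s16 widens again to int32, the products of int8 inputs cannot overflow at either step in this sequence.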

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Four lanes split wide, the chunks align,
Where dot products dance in parallel line,
NEON helpers bloom with widening grace,
Quantized math speeds up this ARM race!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Description check | ⚠️ Warning | The PR description is a blank template with all sections empty; no actual description of changes, rationale, or AI usage disclosure is provided. | Fill in the Overview section with a clear description of what the fix addresses and why it was needed. Complete the AI usage disclosure field and provide any relevant context in Additional information.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately and specifically describes the main change: a fix to the ARM NEON nvfp4 vector dot implementation.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
ggml/src/ggml-cpu/arch/arm/quants.c

ggml/src/ggml-cpu/arch/arm/quants.c:2:10: fatal error: 'ggml-common.h' file not found
2 | #include "ggml-common.h"
| ^~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/a30369d51585883a578bf7272075e9f0412ca86d-55390c3f7c50f038/tmp/clang_command_.tmp.d0bc3a.txt
++Contents of '/tmp/coderabbit-infer/a30369d51585883a578bf7272075e9f0412ca86d-55390c3f7c50f038/tmp/clang_command_.tmp.d0bc3a.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-emit-obj" "-mrelax-all" "-dis

... [truncated 718 characters] ...

x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/lib/clang/18/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fskip-odr-check-in-gmf"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/55390c3f7c50f038/file.o" "-x" "c"
"ggml/src/ggml-cpu/arch/arm/quants.c" "-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

