Feature: ARM32 NEON Kernel Portability

### Describe what you are looking for

NEON extensions have been back-ported to Arm v8.2, which still ships in hundreds of millions of devices worldwide. I've started considering a compatibility layer for older Android devices, but for now it's low priority.

## What's truly ARM64-only

Only two things have no ARM32 equivalent:
- `float64x2_t` — the type doesn't exist on AArch32 (no 64-bit float SIMD)
- `vrndnq_f32` — no NEON rounding instruction on AArch32

Everything else has a portable helper — including `vqtbl1q_u8` (via `vtbl2`) and all horizontal reductions (via `vpadd` cascades). Functions that accumulate in f64 for precision (like `nk_dot_f32_neon`) use `nk_b128_vec_t` as a portable accumulator: SIMD f64 FMA on ARM64, NEON f32 multiply + scalar f64 accumulation via VFP on ARM32.

## Portable helpers

We can add several helpers that wrap the ARM64-only intrinsics, so the kernel code stays clean and does not call those intrinsics directly.

### Reduction helpers in `reduce/neon.h`

| Helper                        | Replaces     | ARM32 implementation               |
| ----------------------------- | ------------ | ---------------------------------- |
| `nk_f32x4_reduce_sum_neon_`   | `vaddvq_f32` | `vpadd_f32` cascade                |
| `nk_i32x4_reduce_sum_neon_`   | `vaddvq_s32` | `vpadd_s32` cascade                |
| `nk_u32x4_reduce_sum_neon_`   | `vaddvq_u32` | `vpadd_u32` cascade                |
| `nk_f64x2_reduce_sum_neon_`   | `vaddvq_f64` | ARM64-only, guarded                |
| `nk_u8x16_reduce_sum_neon_`   | `vaddlvq_u8` | `vpaddlq` cascade (u8→u16→u32→u64) |
| `nk_f32x4_reduce_max_neon_`   | `vmaxvq_f32` | `vpmax_f32` cascade                |
| `nk_f32x4_reduce_min_neon_`   | `vminvq_f32` | `vpmin_f32` cascade                |
| `nk_f32x4_pairwise_add_neon_` | `vpaddq_f32` | two `vpadd_f32` + `vcombine_f32`   |
| `nk_i32x4_pairwise_add_neon_` | `vpaddq_s32` | two `vpadd_s32` + `vcombine_s32`   |
| `nk_u32x4_pairwise_add_neon_` | `vpaddq_u32` | two `vpadd_u32` + `vcombine_u32`   |
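
As an illustration of the cascade pattern in the table above, here is a minimal sketch of a horizontal f32 sum. The name `sketch_f32x4_reduce_sum` and its array-based signature are hypothetical, chosen so the snippet also compiles on non-Arm hosts through a scalar fallback; the real `nk_f32x4_reduce_sum_neon_` presumably takes a `float32x4_t` instead.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: horizontal sum of four floats. */
static float sketch_f32x4_reduce_sum(const float values[4]) {
#if defined(__aarch64__)
    /* AArch64 offers a single across-vector add. */
    return vaddvq_f32(vld1q_f32(values));
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* ARM32: two pairwise adds on 64-bit halves. */
    float32x4_t v = vld1q_f32(values);
    float32x2_t pairs = vpadd_f32(vget_low_f32(v), vget_high_f32(v)); /* {v0+v1, v2+v3} */
    pairs = vpadd_f32(pairs, pairs);                                  /* {total, total} */
    return vget_lane_f32(pairs, 0);
#else
    /* Scalar fallback with the same pairing as the ARM32 cascade. */
    return (values[0] + values[1]) + (values[2] + values[3]);
#endif
}
```

The max and min helpers would likely follow the same shape with `vpmax_f32` and `vpmin_f32` in place of `vpadd_f32`.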

### Conversion helpers in `cast/neon.h`

| Helper                           | Replaces                  | ARM32 implementation                  |
| -------------------------------- | ------------------------- | ------------------------------------- |
| `nk_f16x8_high_to_f32x4_neon_`   | `vcvt_high_f32_f16`       | `vcvt_f32_f16(vget_high_f16(...))`    |
| `nk_f64x2_from_f32x4_high_neon_` | `vcvt_high_f64_f32`       | ARM64-only, guarded                   |
| `nk_u16x8_shll_high_16_neon_`    | `vshll_high_n_u16(v, 16)` | `vshll_n_u16(vget_high_u16(...), 16)` |
| `nk_u8x16_tbl1_neon_`            | `vqtbl1q_u8`              | `vtbl2_u8` with split halves          |
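
The `vtbl2_u8` row can be sketched as follows: the 16-byte table is split into two 8-byte halves, and each half of the index vector is looked up separately. The name `sketch_u8x16_tbl1` and the array-based signature are hypothetical, and the scalar branch exists only so the snippet builds on non-Arm hosts. Both `vqtbl1q_u8` and `vtbl2_u8` return 0 for indices of 16 or more, so the paths agree on out-of-range input.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = table[indices[i]], or 0 when indices[i] >= 16. */
static void sketch_u8x16_tbl1(const uint8_t table[16], const uint8_t indices[16], uint8_t out[16]) {
#if defined(__aarch64__)
    vst1q_u8(out, vqtbl1q_u8(vld1q_u8(table), vld1q_u8(indices)));
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t table_u8x16 = vld1q_u8(table);
    uint8x16_t indices_u8x16 = vld1q_u8(indices);
    /* Present the 128-bit table as a pair of 64-bit registers. */
    uint8x8x2_t halves;
    halves.val[0] = vget_low_u8(table_u8x16);
    halves.val[1] = vget_high_u8(table_u8x16);
    /* One lookup per 8-byte half of the index vector. */
    uint8x8_t low = vtbl2_u8(halves, vget_low_u8(indices_u8x16));
    uint8x8_t high = vtbl2_u8(halves, vget_high_u8(indices_u8x16));
    vst1q_u8(out, vcombine_u8(low, high));
#else
    for (int i = 0; i < 16; ++i) out[i] = indices[i] < 16 ? table[indices[i]] : 0;
#endif
}
```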

### f64-precision accumulator in `dot/neon.h`

`nk_dot_f32_neon` and similar functions accumulate in f64 for numerical stability.
These helpers use `nk_b128_vec_t` as the portable accumulator type.

```c
NK_INTERNAL void nk_f32_dot_acc_init_neon_(nk_b128_vec_t *acc_low_vec, nk_b128_vec_t *acc_high_vec) {
#if NK_TARGET_ARM64_
    acc_low_vec->f64x2 = vdupq_n_f64(0);
    acc_high_vec->f64x2 = vdupq_n_f64(0);
#else
    acc_low_vec->f64s[0] = 0; acc_low_vec->f64s[1] = 0;
    acc_high_vec->f64s[0] = 0; acc_high_vec->f64s[1] = 0;
#endif
}

NK_INTERNAL void nk_f32_dot_acc_fma_neon_(
    nk_b128_vec_t *acc_low_vec, nk_b128_vec_t *acc_high_vec,
    float32x4_t a_f32x4, float32x4_t b_f32x4) {
#if NK_TARGET_ARM64_
    acc_low_vec->f64x2 = vfmaq_f64(acc_low_vec->f64x2,
        vcvt_f64_f32(vget_low_f32(a_f32x4)), vcvt_f64_f32(vget_low_f32(b_f32x4)));
    acc_high_vec->f64x2 = vfmaq_f64(acc_high_vec->f64x2,
        nk_f64x2_from_f32x4_high_neon_(a_f32x4), nk_f64x2_from_f32x4_high_neon_(b_f32x4));
#else
    // NEON f32 multiply (each product is rounded to f32), then widen the lanes
    // into scalar f64 accumulators (VFP double).
    float32x4_t products_f32x4 = vmulq_f32(a_f32x4, b_f32x4);
    acc_low_vec->f64s[0] += (nk_f64_t)vgetq_lane_f32(products_f32x4, 0);
    acc_low_vec->f64s[1] += (nk_f64_t)vgetq_lane_f32(products_f32x4, 1);
    acc_high_vec->f64s[0] += (nk_f64_t)vgetq_lane_f32(products_f32x4, 2);
    acc_high_vec->f64s[1] += (nk_f64_t)vgetq_lane_f32(products_f32x4, 3);
#endif
}

NK_INTERNAL nk_f64_t nk_f32_dot_acc_reduce_neon_(nk_b128_vec_t acc_low_vec, nk_b128_vec_t acc_high_vec) {
#if NK_TARGET_ARM64_
    return nk_f64x2_reduce_sum_neon_(vaddq_f64(acc_low_vec.f64x2, acc_high_vec.f64x2));
#else
    return acc_low_vec.f64s[0] + acc_low_vec.f64s[1] + acc_high_vec.f64s[0] + acc_high_vec.f64s[1];
#endif
}
```
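
To show how a kernel might drive the three helpers above (init, accumulate, reduce), here is a scalar-only model of the ARM32 branch. Everything prefixed `sketch_` is hypothetical: `sketch_b128_t` stands in for `nk_b128_vec_t` with only its `f64s` view, and the four-element stride with a scalar tail loop is an assumption about how the real kernel iterates.

```c
#include <stddef.h>

/* Hypothetical stand-in for nk_b128_vec_t; only the f64s view is modeled. */
typedef union {
    double f64s[2];
    unsigned char bytes[16];
} sketch_b128_t;

static void sketch_acc_init(sketch_b128_t *low, sketch_b128_t *high) {
    low->f64s[0] = low->f64s[1] = 0.0;
    high->f64s[0] = high->f64s[1] = 0.0;
}

/* Models the ARM32 branch: products are stored as f32 first, then widened to f64. */
static void sketch_acc_fma(sketch_b128_t *low, sketch_b128_t *high, const float *a, const float *b) {
    float products[4];
    for (int lane = 0; lane < 4; ++lane) products[lane] = a[lane] * b[lane];
    low->f64s[0] += (double)products[0];
    low->f64s[1] += (double)products[1];
    high->f64s[0] += (double)products[2];
    high->f64s[1] += (double)products[3];
}

static double sketch_acc_reduce(sketch_b128_t low, sketch_b128_t high) {
    return low.f64s[0] + low.f64s[1] + high.f64s[0] + high.f64s[1];
}

/* Dot product: four elements per step, then a scalar tail for the remainder. */
static double sketch_dot_f32(const float *a, const float *b, size_t n) {
    sketch_b128_t low, high;
    sketch_acc_init(&low, &high);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) sketch_acc_fma(&low, &high, a + i, b + i);
    double sum = sketch_acc_reduce(low, high);
    for (; i < n; ++i) {
        float product = a[i] * b[i];
        sum += (double)product;
    }
    return sum;
}
```

One design consequence worth noting: because the ARM32 branch multiplies in f32, its results can differ in the low bits from the ARM64 branch, which widens both operands before the multiply.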

Once done, we'll need to change many of the `#if NK_TARGET_ARM64_` guards to `#if NK_TARGET_ARM64_ || NK_TARGET_ARM32_` and change pragmas:

- Clang, in all widened files: `target("arch=armv8-a+simd")` → `target("neon")`.
- GCC, in all widened files: `target("arch=armv8-a+simd")` → `target("fpu=neon-vfpv4")`.
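
The guard and pragma changes above could be combined roughly as in the sketch below. The `SKETCH_TARGET_*` macros are hypothetical stand-ins for `NK_TARGET_ARM64_` and `NK_TARGET_ARM32_`, and `sketch_kernel_flavor` is a placeholder for real kernel code. The target strings are the ones listed above; pairing `target("neon")` with Clang and wrapping the region in push/pop pragmas are assumptions, and none of this has been checked on an ARM32 toolchain.

```c
/* Hypothetical detection macros; the library's own names may be derived differently. */
#if defined(__aarch64__)
#define SKETCH_TARGET_ARM64 1
#else
#define SKETCH_TARGET_ARM64 0
#endif
#if defined(__arm__) && !defined(__aarch64__)
#define SKETCH_TARGET_ARM32 1
#else
#define SKETCH_TARGET_ARM32 0
#endif

/* Widened guard: open a target region on either Arm flavor. */
#if SKETCH_TARGET_ARM64 || SKETCH_TARGET_ARM32
#if defined(__clang__)
#if SKETCH_TARGET_ARM64
#pragma clang attribute push(__attribute__((target("arch=armv8-a+simd"))), apply_to = function)
#else
#pragma clang attribute push(__attribute__((target("neon"))), apply_to = function)
#endif
#elif defined(__GNUC__)
#pragma GCC push_options
#if SKETCH_TARGET_ARM64
#pragma GCC target("arch=armv8-a+simd")
#else
#pragma GCC target("fpu=neon-vfpv4")
#endif
#endif
#endif

/* Placeholder for kernel code compiled inside the target region. */
static const char *sketch_kernel_flavor(void) {
#if SKETCH_TARGET_ARM64
    return "arm64-neon";
#elif SKETCH_TARGET_ARM32
    return "arm32-neon";
#else
    return "no-neon";
#endif
}

/* Close the target region opened above. */
#if SKETCH_TARGET_ARM64 || SKETCH_TARGET_ARM32
#if defined(__clang__)
#pragma clang attribute pop
#elif defined(__GNUC__)
#pragma GCC pop_options
#endif
#endif
```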

### Can you contribute to the implementation?

- [x] I can contribute

### Is your feature request specific to a certain interface?

It applies to everything

### Contact Details

_No response_

### Is there an existing issue for this?

- [x] I have searched the existing issues

### Code of Conduct

- [x] I agree to follow this project's Code of Conduct
