Describe what you are looking for
LASX provides 256-bit SIMD with 32 vector registers, analogous to AVX2. Its standout strength is widening integer multiply-accumulate via even/odd split (xvmaddwev_h_b / xvmaddwod_h_b for i8->i16, then xvmaddwev_w_h / xvmaddwod_w_h into i32). No native FP16 or BF16 — only f32 and f64 hardware floats.
dot/ and dots/
The highest-value new kernels are nk_dot_i8_loongapx and nk_dot_u8_loongapx. The widening even/odd multiply-add pair processes 32 i8 elements per 256-bit iteration into i32 accumulators, matching AVX2 throughput. Sub-byte i4/u4 use xvandi_b + shift for nibble extraction, then the same widening chain.
nk_dot_f32_loongapx uses xvfmadd_s directly. nk_dot_bf16_loongapx needs manual upcast: unpack via xvilvl_h / xvilvh_h with zero, left-shift 16 via xvslli_w, reinterpret as f32. FP16 requires a more involved software conversion (rebias the exponent from 15 to 127, shift the mantissa into place, special-case subnormals and NaN/infinity). The e4m3/e5m2 float8 types need LUT-based conversion to f32, similar to Haswell. Batched dots/ variants replicate accumulators across output lanes with the same arithmetic.
Complex dot products (f32c, bf16c) use xvfmul_s + xvxor_v + xvfadd_s for the delayed sign-flip pattern.
spatial/ and spatials/
nk_euclidean_f32_loongapx uses xvfsub_s + xvfmul_s + xvfadd_s. The i8 Euclidean variant benefits most — subtract then widen-multiply the difference with itself through the even/odd chain. Cosine kernels run three accumulators (ab, a^2, b^2) in parallel; LASX's 32 registers handle this comfortably.
BF16/FP16 spatial kernels apply the same manual upcast as dot before the subtract-square-accumulate sequence. Batched spatials/ tiles fit well at 8 f32 lanes per accumulator, 4 accumulators per tile row.
set/ and sets/
nk_hamming_u1_loongapx and nk_jaccard_u1_loongapx use xvxor_v + xvpcnt_b + horizontal sum. Batched sets/ variants replicate accumulators with the same pattern.
Can you contribute to the implementation?
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
Code of Conduct