Eliminate use_hint 32/88 intrinsics #940

Open
willieyz wants to merge 2 commits into main from eliminate-use_hint_32_88-intrinsics

willieyz commented Feb 3, 2026

We also tried unrolling the loops mld_poly_use_hint_88_avx2_loop and mld_poly_use_hint_32_avx2_loop (by 4) in both files.
The benchmark results below show that unrolling gave no performance benefit, so we decided to keep the current (non-unrolled) version.

  • bench components
    • Δ (%) = (asm − AVX2) / AVX2 × 100
| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| mld_poly_caddq (avg) | AVX2 intrinsics | no-opt | 821 | 781 | 789 | |
| | x86_64 asm | no-opt | 847 | 786 | 787 | |
| | Δ (%) | no-opt | +3.17% | +0.64% | -0.25% | |
| mld_poly_caddq (avg) | AVX2 intrinsics | opt | 210 | 147 | 143 | |
| | x86_64 asm | opt | 220 | 153 | 155 | |
| | x86_64 asm (unroll) | opt | 273 | 154 | 156 | unroll by 4 |
| | Δ (%) | opt | +4.76% | +4.08% | +8.39% | |
| | Δ (%) (unroll) | opt | +30.00% | +4.76% | +9.09% | unroll by 4 |
  • bench
    • Δ (%) = (asm − AVX2) / AVX2 × 100
| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| keypair cycles (avg) | AVX2 intrinsics | no-opt | 127436 | 218610 | 360739 | baseline (main) |
| | x86_64 asm | no-opt | 127459 | 217604 | 367118 | |
| | Δ (%) | no-opt | +0.02% | -0.46% | +1.77% | |
| | AVX2 intrinsics | opt | 56955 | 98362 | 157869 | baseline (main) |
| | x86_64 asm | opt | 59747 | 102961 | 165706 | |
| | x86_64 asm (unroll) | opt | 59483 | 104732 | 166654 | |
| | Δ (%) | opt | +4.90% | +4.68% | +4.96% | |
| | Δ (%) (unroll) | opt | +4.44% | +6.48% | +5.56% | unroll by 4 |
| sign cycles (avg) | AVX2 intrinsics | no-opt | 451922 | 756003 | 958151 | baseline (main) |
| | x86_64 asm | no-opt | 452833 | 752512 | 974497 | |
| | Δ (%) | no-opt | +0.20% | -0.46% | +1.71% | |
| | AVX2 intrinsics | opt | 170370 | 281545 | 347924 | baseline (main) |
| | x86_64 asm | opt | 178564 | 294843 | 362677 | |
| | x86_64 asm (unroll) | opt | 177251 | 300667 | 366158 | |
| | Δ (%) | opt | +4.81% | +4.72% | +4.24% | |
| | Δ (%) (unroll) | opt | +4.04% | +6.79% | +5.24% | unroll by 4 |
| verify cycles (avg) | AVX2 intrinsics | no-opt | 134113 | 220671 | 363234 | baseline (main) |
| | x86_64 asm | no-opt | 134633 | 220015 | 369763 | |
| | Δ (%) | no-opt | +0.39% | -0.30% | +1.80% | |
| | AVX2 intrinsics | opt | 60234 | 98904 | 156281 | baseline (main) |
| | x86_64 asm | opt | 63140 | 103682 | 164376 | |
| | x86_64 asm (unroll) | opt | 62822 | 105719 | 164028 | |
| | Δ (%) | opt | +4.82% | +4.83% | +5.18% | |
| | Δ (%) (unroll) | opt | +4.30% | +6.89% | +4.96% | unroll by 4 |
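
The Δ (%) rows in both tables follow the stated formula, Δ = (asm − AVX2) / AVX2 × 100, so a positive value means the assembly build took more cycles than the intrinsics baseline. A minimal sketch for re-deriving them from the raw cycle counts; the function name delta_percent is made up for this illustration and is not part of the benchmark tooling:

```python
def delta_percent(asm_cycles: int, avx2_cycles: int) -> str:
    """Relative change of the asm build against the AVX2 intrinsics baseline."""
    # Multiply before dividing so that exact ratios (e.g. 63/210) stay exact.
    delta = (asm_cycles - avx2_cycles) * 100 / avx2_cycles
    return f"{delta:+.2f}%"


# Cycle counts copied from the ML-DSA-44 column of the tables above.
print(delta_percent(847, 821))      # component bench, no-opt
print(delta_percent(273, 210))      # component bench, opt, unrolled by 4
print(delta_percent(59747, 56955))  # keypair, opt
```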

oqs-bot commented Feb 3, 2026

CBMC Results (ML-DSA-87)

Full Results (174 proofs)
Proof Status Current Previous Change
**TOTAL** 2304s 2700s -14.7%
mld_attempt_signature_generation 238s 282s -16%
sign_verify_internal 219s 251s -13%
polyvecl_pointwise_acc_montgomery_c 175s 230s -24%
polyvec_matrix_expand 136s 154s -12%
poly_pointwise_montgomery_c 135s 189s -29%
rej_uniform_native 128s 153s -16%
polyvec_matrix_expand_serial 103s 121s -15%
mld_ct_memcmp 80s 105s -24%
mld_invntt_layer 71s 85s -16%
mld_ntt_layer 53s 62s -15%
sign_signature_internal 43s 44s -2%
keccak_squeezeblocks_x4 42s 50s -16%
mld_compute_t0_t1_tr_from_sk_components 27s 28s -4%
polymat_permute_bitrev_to_custom 25s 27s -7%
rej_uniform 20s 23s -13%
fqmul 19s 25s -24%
poly_chknorm_c 19s 21s -10%
poly_uniform_eta_4x 19s 18s +6%
rej_uniform_c 17s 24s -29%
keccakf1600x4_permute_native 15s 13s +15%
poly_uniform_4x 15s 16s -6%
polyt0_unpack 15s 17s -12%
polyveck_add 15s 17s -12%
polyeta_unpack 14s 18s -22%
polyveck_power2round 14s 17s -18%
keccak_absorb_once_x4 13s 14s -7%
polyvec_matrix_pointwise_montgomery 13s 14s -7%
keccakf1600_permute 11s 10s +10%
mld_ntt_butterfly_block 11s 14s -21%
polyveck_reduce 11s 12s -8%
mld_compute_pack_z 10s 7s +43%
mld_polyvecl_permute_bitrev_to_custom_native 10s 8s +25%
sign_pk_from_sk 10s 12s -17%
mld_h 9s 5s +80%
poly_invntt_tomont_c 9s 11s -18%
polyveck_shiftl 9s 8s +12%
keccakf1600_permute_native 8s 9s -11%
poly_caddq_native_aarch64 8s 5s +60%
poly_decompose_c 8s 8s +0%
polyveck_caddq 8s 10s -20%
polyveck_make_hint 8s 6s +33%
polyveck_pointwise_poly_montgomery 8s 8s +0%
polyvecl_ntt 8s 8s +0%
sign 8s 8s +0%
sign_signature_pre_hash_internal 8s 3s +167%
mld_check_pct 7s 8s -12%
poly_uniform_eta 7s 6s +17%
polyveck_chknorm 7s 9s -22%
polyveck_decompose 7s 7s +0%
polyveck_invntt_tomont 7s 8s -12%
polyveck_use_hint 7s 10s -30%
poly_ntt_c 6s 4s +50%
polyveck_ntt 6s 10s -40%
polyveck_sub 6s 8s -25%
sign_keypair 6s 7s -14%
sign_keypair_internal 6s 6s +0%
sign_signature_extmu 6s 3s +100%
sign_verify_pre_hash_shake256 6s 8s -25%
unpack_hints 6s 5s +20%
unpack_sk 6s 5s +20%
keccak_absorb 5s 6s -17%
keccakf1600_xor_bytes 5s 2s +150%
mld_sample_s1_s2 5s 8s -38%
mld_sample_s1_s2_serial 5s 7s -29%
pack_pk 5s 6s -17%
poly_add 5s 6s -17%
poly_challenge 5s 6s -17%
poly_ntt_native 5s 6s -17%
poly_power2round 5s 6s -17%
poly_uniform_gamma1_4x 5s 3s +67%
poly_use_hint 5s 5s +0%
polyvecl_pointwise_acc_montgomery_native 5s 4s +25%
polyvecl_uniform_gamma1 5s 6s -17%
polyw1_pack 5s 5s +0%
rej_eta 5s 3s +67%
rej_eta_c 5s 6s -17%
rej_eta_native 5s 4s +25%
shake256 5s 3s +67%
shake256x4_squeezeblocks 5s 5s +0%
sign_verify_extmu 5s 6s -17%
sign_verify_pre_hash_internal 5s 6s -17%
caddq 4s 4s +0%
keccak_init 4s 3s +33%
keccakf1600x4_permute 4s 2s +100%
mld_ct_cmask_nonzero_u8 4s 2s +100%
montgomery_reduce 4s 5s -20%
ntt_native_x86_64 4s 4s +0%
poly_caddq_native 4s 6s -33%
poly_chknorm 4s 3s +33%
poly_chknorm_native 4s 8s -50%
poly_ntt 4s 2s +100%
poly_uniform 4s 5s -20%
poly_use_hint_c 4s 4s +0%
polyt0_pack 4s 3s +33%
polyveck_pack_eta 4s 2s +100%
polyveck_pack_w1 4s 4s +0%
polyvecl_chknorm 4s 4s +0%
polyvecl_uniform_gamma1_serial 4s 5s -20%
polyvecl_unpack_eta 4s 4s +0%
polyz_unpack_c 4s 6s -33%
power2round 4s 3s +33%
shake128_absorb 4s 1s +300%
sign_open 4s 3s +33%
sign_signature 4s 6s -33%
sign_verify 4s 4s +0%
decompose 3s 2s +50%
keccak_squeeze 3s 6s -50%
keccakf1600_extract_bytes (big endian) 3s 2s +50%
keccakf1600_xor_bytes (big endian) 3s 3s +0%
keccakf1600x4_xor_bytes 3s 3s +0%
mld_ct_sel_int32 3s 2s +50%
mld_prepare_domain_separation_prefix 3s 4s -25%
pack_sig_z 3s 3s +0%
poly_decompose_native 3s 5s -40%
poly_invntt_tomont 3s 2s +50%
poly_make_hint 3s 5s -40%
poly_pointwise_montgomery 3s 4s -25%
poly_pointwise_montgomery_native 3s 4s -25%
poly_shiftl 3s 3s +0%
poly_sub 3s 3s +0%
polyveck_pack_t0 3s 4s -25%
polyveck_unpack_eta 3s 3s +0%
polyveck_unpack_t0 3s 7s -57%
polyvecl_pack_eta 3s 4s -25%
polyvecl_permute_bitrev_to_custom 3s 2s +50%
polyvecl_unpack_z 3s 5s -40%
polyz_pack 3s 3s +0%
polyz_unpack 3s 3s +0%
shake128_finalize 3s 1s +200%
shake128_squeeze 3s 2s +50%
shake128x4_absorb_once 3s 4s -25%
shake256_finalize 3s 2s +50%
shake256x4_absorb_once 3s 2s +50%
sign_signature_pre_hash_shake256 3s 3s +0%
unpack_pk 3s 4s -25%
unpack_sig 3s 3s +0%
fqscale 2s 4s -50%
keccak_finalize 2s 1s +100%
make_hint 2s 2s +0%
mld_ct_abs_i32 2s 5s -60%
mld_ct_get_optblocker_u32 2s 1s +100%
mld_ct_get_optblocker_u8 2s 4s -50%
mld_value_barrier_i64 2s 2s +0%
mld_value_barrier_u32 2s 2s +0%
mld_value_barrier_u8 2s 2s +0%
pack_sig_c_h 2s 4s -50%
pack_sk 2s 3s -33%
poly_caddq 2s 3s -33%
poly_caddq_c 2s 5s -60%
poly_decompose 2s 3s -33%
poly_invntt_tomont_native 2s 4s -50%
poly_reduce 2s 4s -50%
poly_uniform_gamma1 2s 3s -33%
poly_use_hint_native 2s 4s -50%
polyeta_pack 2s 4s -50%
polyt1_pack 2s 5s -60%
polyt1_unpack 2s 2s +0%
polyvecl_pointwise_acc_montgomery 2s 7s -71%
polyz_unpack_native 2s 3s -33%
reduce32 2s 3s -33%
shake128_init 2s 2s +0%
shake128_release 2s 2s +0%
shake256_absorb 2s 2s +0%
shake256_init 2s 3s -33%
shake256_release 2s 2s +0%
shake256_squeeze 2s 3s -33%
sys_check_capability 2s 4s -50%
keccakf1600x4_extract_bytes 1s 5s -80%
mld_ct_cmask_neg_i32 1s 1s +0%
mld_ct_cmask_nonzero_u32 1s 4s -75%
mld_ct_get_optblocker_i64 1s 4s -75%
mld_keccakf1600_extract_bytes 1s 2s -50%
shake128x4_squeezeblocks 1s 4s -75%
use_hint 1s 3s -67%

oqs-bot commented Feb 3, 2026

CBMC Results (ML-DSA-44)

Full Results (174 proofs)
Proof Status Current Previous Change
**TOTAL** 2103s 1958s +7.4%
sign_verify_internal 254s 242s +5%
mld_attempt_signature_generation 216s 199s +9%
polyvecl_pointwise_acc_montgomery_c 207s 184s +12%
poly_pointwise_montgomery_c 130s 124s +5%
rej_uniform_native 129s 125s +3%
mld_invntt_layer 117s 112s +4%
mld_ct_memcmp 85s 76s +12%
keccak_squeezeblocks_x4 47s 43s +9%
mld_ntt_layer 44s 41s +7%
sign_signature_internal 43s 42s +2%
polyvec_matrix_expand 28s 28s +0%
fqmul 23s 18s +28%
rej_uniform 23s 18s +28%
poly_uniform_eta_4x 18s 15s +20%
rej_uniform_c 17s 17s +0%
poly_chknorm_c 16s 17s -6%
polymat_permute_bitrev_to_custom 16s 18s -11%
mld_compute_t0_t1_tr_from_sk_components 15s 14s +7%
polyt0_unpack 15s 17s -12%
polyeta_unpack 14s 12s +17%
keccakf1600x4_permute_native 13s 13s +0%
mld_ntt_butterfly_block 13s 11s +18%
poly_uniform_4x 13s 20s -35%
polyz_unpack_c 13s 10s +30%
keccak_absorb_once_x4 12s 12s +0%
mld_polyvecl_permute_bitrev_to_custom_native 12s 7s +71%
polyveck_add 10s 8s +25%
mld_check_pct 9s 12s -25%
keccakf1600_permute 8s 8s +0%
polyvec_matrix_pointwise_montgomery 8s 9s -11%
polyveck_decompose 8s 6s +33%
unpack_hints 8s 5s +60%
keccakf1600_permute_native 7s 8s -12%
keccak_absorb 6s 5s +20%
mld_compute_pack_z 6s 6s +0%
poly_sub 6s 2s +200%
poly_uniform_eta 6s 6s +0%
polyvec_matrix_expand_serial 6s 5s +20%
polyveck_make_hint 6s 3s +100%
polyveck_ntt 6s 7s -14%
polyvecl_ntt 6s 3s +100%
polyz_unpack_native 6s 5s +20%
sign_verify_extmu 6s 6s +0%
fqscale 5s 3s +67%
keccak_squeeze 5s 3s +67%
poly_caddq_native 5s 4s +25%
poly_decompose_c 5s 3s +67%
poly_invntt_tomont_c 5s 6s -17%
poly_make_hint 5s 4s +25%
poly_ntt_c 5s 3s +67%
poly_pointwise_montgomery 5s 3s +67%
polyt0_pack 5s 3s +67%
polyveck_invntt_tomont 5s 6s -17%
polyveck_pointwise_poly_montgomery 5s 5s +0%
polyveck_power2round 5s 5s +0%
polyveck_sub 5s 3s +67%
polyveck_use_hint 5s 3s +67%
polyvecl_chknorm 5s 7s -29%
polyw1_pack 5s 1s +400%
sign 5s 3s +67%
sign_keypair 5s 5s +0%
sign_open 5s 5s +0%
sign_signature 5s 4s +25%
sign_signature_pre_hash_internal 5s 6s -17%
unpack_sk 5s 3s +67%
decompose 4s 3s +33%
keccak_init 4s 1s +300%
keccakf1600x4_permute 4s 2s +100%
mld_ct_cmask_nonzero_u32 4s 2s +100%
mld_ct_cmask_nonzero_u8 4s 3s +33%
mld_keccakf1600_extract_bytes 4s 2s +100%
mld_prepare_domain_separation_prefix 4s 3s +33%
mld_sample_s1_s2_serial 4s 3s +33%
mld_value_barrier_u32 4s 2s +100%
pack_sig_c_h 4s 4s +0%
pack_sk 4s 3s +33%
poly_add 4s 4s +0%
poly_challenge 4s 4s +0%
poly_chknorm_native 4s 3s +33%
poly_invntt_tomont_native 4s 2s +100%
poly_pointwise_montgomery_native 4s 3s +33%
poly_power2round 4s 4s +0%
poly_reduce 4s 2s +100%
poly_uniform 4s 5s -20%
poly_use_hint_c 4s 3s +33%
polyeta_pack 4s 3s +33%
polyt1_pack 4s 3s +33%
polyveck_caddq 4s 3s +33%
polyveck_pack_eta 4s 3s +33%
polyveck_pack_t0 4s 3s +33%
polyveck_reduce 4s 7s -43%
polyvecl_pack_eta 4s 4s +0%
polyvecl_permute_bitrev_to_custom 4s 2s +100%
polyvecl_pointwise_acc_montgomery_native 4s 3s +33%
polyvecl_unpack_z 4s 4s +0%
reduce32 4s 3s +33%
rej_eta_native 4s 3s +33%
shake128_init 4s 2s +100%
shake128x4_absorb_once 4s 2s +100%
shake128x4_squeezeblocks 4s 2s +100%
shake256 4s 3s +33%
shake256_absorb 4s 3s +33%
shake256x4_squeezeblocks 4s 2s +100%
sign_pk_from_sk 4s 7s -43%
sign_signature_extmu 4s 4s +0%
sign_signature_pre_hash_shake256 4s 4s +0%
sign_verify_pre_hash_internal 4s 6s -33%
sys_check_capability 4s 2s +100%
unpack_pk 4s 3s +33%
use_hint 4s 3s +33%
caddq 3s 3s +0%
keccakf1600_extract_bytes (big endian) 3s 1s +200%
keccakf1600_xor_bytes 3s 4s -25%
keccakf1600_xor_bytes (big endian) 3s 3s +0%
make_hint 3s 2s +50%
mld_ct_abs_i32 3s 2s +50%
mld_ct_cmask_neg_i32 3s 3s +0%
mld_h 3s 2s +50%
mld_sample_s1_s2 3s 4s -25%
mld_value_barrier_u8 3s 3s +0%
montgomery_reduce 3s 3s +0%
ntt_native_x86_64 3s 5s -40%
pack_sig_z 3s 3s +0%
poly_caddq_c 3s 4s -25%
poly_decompose_native 3s 3s +0%
poly_invntt_tomont 3s 3s +0%
poly_shiftl 3s 3s +0%
poly_uniform_gamma1_4x 3s 3s +0%
poly_use_hint 3s 3s +0%
polyveck_pack_w1 3s 3s +0%
polyveck_shiftl 3s 6s -50%
polyveck_unpack_eta 3s 3s +0%
polyveck_unpack_t0 3s 2s +50%
polyvecl_pointwise_acc_montgomery 3s 2s +50%
polyvecl_uniform_gamma1_serial 3s 2s +50%
polyvecl_unpack_eta 3s 3s +0%
polyz_unpack 3s 2s +50%
power2round 3s 3s +0%
rej_eta 3s 2s +50%
rej_eta_c 3s 4s -25%
shake128_absorb 3s 3s +0%
shake128_finalize 3s 3s +0%
shake128_release 3s 2s +50%
shake128_squeeze 3s 5s -40%
shake256_finalize 3s 5s -40%
shake256_release 3s 2s +50%
sign_keypair_internal 3s 5s -40%
sign_verify_pre_hash_shake256 3s 2s +50%
unpack_sig 3s 3s +0%
keccak_finalize 2s 3s -33%
keccakf1600x4_extract_bytes 2s 3s -33%
keccakf1600x4_xor_bytes 2s 3s -33%
mld_ct_get_optblocker_u32 2s 2s +0%
mld_ct_sel_int32 2s 2s +0%
mld_value_barrier_i64 2s 4s -50%
pack_pk 2s 3s -33%
poly_caddq 2s 1s +100%
poly_caddq_native_aarch64 2s 2s +0%
poly_chknorm 2s 1s +100%
poly_decompose 2s 4s -50%
poly_ntt_native 2s 4s -50%
poly_uniform_gamma1 2s 5s -60%
poly_use_hint_native 2s 4s -50%
polyt1_unpack 2s 3s -33%
polyveck_chknorm 2s 2s +0%
polyvecl_uniform_gamma1 2s 3s -33%
shake256_init 2s 4s -50%
shake256_squeeze 2s 2s +0%
sign_verify 2s 5s -60%
mld_ct_get_optblocker_i64 1s 3s -67%
mld_ct_get_optblocker_u8 1s 1s +0%
poly_ntt 1s 2s -50%
polyz_pack 1s 2s -50%
shake256x4_absorb_once 1s 2s -50%

oqs-bot commented Feb 3, 2026

CBMC Results (ML-DSA-65)

Full Results (174 proofs)
Proof Status Current Previous Change
**TOTAL** 2463s 2562s -3.9%
sign_verify_internal 373s 391s -5%
polyvecl_pointwise_acc_montgomery_c 245s 275s -11%
mld_attempt_signature_generation 213s 217s -2%
poly_pointwise_montgomery_c 147s 167s -12%
rej_uniform_native 139s 140s -1%
polyvec_matrix_expand 99s 100s -1%
mld_ct_memcmp 91s 94s -3%
mld_invntt_layer 78s 78s +0%
polyvec_matrix_expand_serial 64s 69s -7%
mld_ntt_layer 58s 58s +0%
keccak_squeezeblocks_x4 45s 45s +0%
sign_signature_internal 45s 46s -2%
mld_compute_t0_t1_tr_from_sk_components 27s 27s +0%
rej_uniform_c 23s 20s +15%
rej_uniform 21s 22s -5%
fqmul 20s 23s -13%
polymat_permute_bitrev_to_custom 19s 19s +0%
polyveck_decompose 18s 16s +12%
poly_chknorm_c 17s 18s -6%
poly_uniform_eta_4x 17s 18s -6%
mld_ntt_butterfly_block 16s 13s +23%
polyt0_unpack 15s 16s -6%
keccakf1600x4_permute_native 14s 13s +8%
poly_uniform_4x 14s 15s -7%
polyvec_matrix_pointwise_montgomery 13s 16s -19%
sign_pk_from_sk 12s 6s +100%
keccak_absorb_once_x4 11s 17s -35%
sign 11s 11s +0%
keccakf1600_permute_native 10s 8s +25%
polyveck_add 10s 10s +0%
polyveck_reduce 10s 7s +43%
keccakf1600_permute 9s 7s +29%
mld_polyvecl_permute_bitrev_to_custom_native 9s 7s +29%
polyveck_power2round 9s 8s +12%
polyvecl_ntt 9s 8s +12%
mld_h 8s 5s +60%
poly_invntt_tomont_c 8s 13s -38%
polyveck_chknorm 8s 5s +60%
polyveck_sub 8s 7s +14%
polyveck_use_hint 8s 7s +14%
mld_check_pct 7s 7s +0%
mld_compute_pack_z 7s 6s +17%
poly_challenge 7s 4s +75%
poly_decompose_c 7s 9s -22%
poly_use_hint_c 7s 7s +0%
polyveck_invntt_tomont 7s 9s -22%
polyveck_ntt 7s 7s +0%
polyveck_shiftl 7s 8s -12%
keccak_absorb 6s 7s -14%
keccakf1600x4_extract_bytes 6s 3s +100%
make_hint 6s 3s +100%
mld_sample_s1_s2_serial 6s 5s +20%
poly_invntt_tomont 6s 5s +20%
polyveck_caddq 6s 7s -14%
polyveck_make_hint 6s 5s +20%
polyvecl_uniform_gamma1 6s 4s +50%
rej_eta_c 6s 7s -14%
sign_signature_pre_hash_shake256 6s 5s +20%
mld_ct_cmask_nonzero_u8 5s 4s +25%
mld_prepare_domain_separation_prefix 5s 4s +25%
mld_sample_s1_s2 5s 6s -17%
poly_add 5s 3s +67%
poly_caddq_native_aarch64 5s 4s +25%
poly_pointwise_montgomery_native 5s 4s +25%
poly_power2round 5s 6s -17%
polyeta_unpack 5s 6s -17%
polyveck_pack_t0 5s 2s +150%
polyveck_pointwise_poly_montgomery 5s 6s -17%
polyz_unpack_c 5s 6s -17%
rej_eta_native 5s 4s +25%
shake128_finalize 5s 2s +150%
shake256_finalize 5s 2s +150%
sign_keypair_internal 5s 4s +25%
sign_signature 5s 4s +25%
sign_signature_extmu 5s 4s +25%
unpack_hints 5s 5s +0%
unpack_sk 5s 3s +67%
keccak_finalize 4s 4s +0%
keccak_squeeze 4s 3s +33%
pack_pk 4s 2s +100%
pack_sig_z 4s 2s +100%
poly_caddq_native 4s 4s +0%
poly_chknorm 4s 3s +33%
poly_decompose_native 4s 3s +33%
poly_reduce 4s 4s +0%
poly_sub 4s 2s +100%
poly_uniform_eta 4s 3s +33%
poly_uniform_gamma1_4x 4s 4s +0%
polyt0_pack 4s 4s +0%
polyveck_pack_eta 4s 5s -20%
polyveck_unpack_eta 4s 5s -20%
polyveck_unpack_t0 4s 2s +100%
polyvecl_chknorm 4s 5s -20%
polyvecl_pointwise_acc_montgomery_native 4s 7s -43%
polyvecl_unpack_eta 4s 5s -20%
polyw1_pack 4s 5s -20%
reduce32 4s 4s +0%
shake128x4_squeezeblocks 4s 3s +33%
sign_keypair 4s 4s +0%
sign_open 4s 5s -20%
sign_verify_extmu 4s 4s +0%
sign_verify_pre_hash_internal 4s 4s +0%
sys_check_capability 4s 2s +100%
unpack_pk 4s 4s +0%
keccakf1600_xor_bytes (big endian) 3s 2s +50%
keccakf1600x4_permute 3s 3s +0%
mld_ct_cmask_nonzero_u32 3s 2s +50%
mld_ct_get_optblocker_u8 3s 2s +50%
mld_keccakf1600_extract_bytes 3s 2s +50%
mld_value_barrier_u32 3s 2s +50%
ntt_native_x86_64 3s 5s -40%
poly_caddq 3s 2s +50%
poly_chknorm_native 3s 4s -25%
poly_decompose 3s 3s +0%
poly_invntt_tomont_native 3s 3s +0%
poly_make_hint 3s 3s +0%
poly_ntt 3s 5s -40%
poly_ntt_c 3s 5s -40%
poly_ntt_native 3s 7s -57%
poly_pointwise_montgomery 3s 4s -25%
poly_uniform 3s 4s -25%
poly_use_hint 3s 5s -40%
poly_use_hint_native 3s 5s -40%
polyt1_pack 3s 2s +50%
polyveck_pack_w1 3s 4s -25%
polyvecl_pack_eta 3s 4s -25%
polyvecl_uniform_gamma1_serial 3s 4s -25%
polyvecl_unpack_z 3s 3s +0%
polyz_pack 3s 2s +50%
polyz_unpack 3s 3s +0%
polyz_unpack_native 3s 3s +0%
shake128_absorb 3s 3s +0%
shake128_init 3s 2s +50%
shake128_release 3s 3s +0%
shake128_squeeze 3s 3s +0%
shake128x4_absorb_once 3s 2s +50%
shake256 3s 3s +0%
shake256_absorb 3s 5s -40%
shake256_init 3s 3s +0%
shake256_squeeze 3s 3s +0%
shake256x4_absorb_once 3s 1s +200%
shake256x4_squeezeblocks 3s 2s +50%
sign_verify 3s 3s +0%
unpack_sig 3s 2s +50%
caddq 2s 4s -50%
decompose 2s 2s +0%
fqscale 2s 4s -50%
keccakf1600_xor_bytes 2s 2s +0%
keccakf1600x4_xor_bytes 2s 5s -60%
mld_ct_cmask_neg_i32 2s 3s -33%
mld_ct_get_optblocker_i64 2s 4s -50%
mld_ct_get_optblocker_u32 2s 3s -33%
mld_ct_sel_int32 2s 5s -60%
mld_value_barrier_u8 2s 3s -33%
montgomery_reduce 2s 6s -67%
pack_sig_c_h 2s 2s +0%
pack_sk 2s 4s -50%
poly_caddq_c 2s 4s -50%
poly_shiftl 2s 4s -50%
poly_uniform_gamma1 2s 5s -60%
polyeta_pack 2s 2s +0%
polyvecl_permute_bitrev_to_custom 2s 2s +0%
polyvecl_pointwise_acc_montgomery 2s 4s -50%
rej_eta 2s 3s -33%
sign_signature_pre_hash_internal 2s 6s -67%
sign_verify_pre_hash_shake256 2s 4s -50%
use_hint 2s 2s +0%
keccak_init 1s 3s -67%
keccakf1600_extract_bytes (big endian) 1s 3s -67%
mld_ct_abs_i32 1s 2s -50%
mld_value_barrier_i64 1s 2s -50%
polyt1_unpack 1s 2s -50%
power2round 1s 5s -80%
shake256_release 1s 3s -67%

willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 8 times, most recently from da005db to 1ea9d5f, February 5, 2026 04:02
This commit adds poly_use_hint to bench --components for benchmarking
the performance impact of the changes to:
- poly_use_hint_32
- poly_use_hint_88

Signed-off-by: willieyz <willie.zhao@chelpis.com>
In this PR, we replace the AVX2 intrinsics implementation of
poly_use_hint_32 and poly_use_hint_88 with an x86_64 assembly version.
This is part of the effort to enable HOL-Light proofs.

Signed-off-by: willieyz <willie.zhao@chelpis.com>
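
For readers unfamiliar with the two routines, here is a scalar sketch of what poly_use_hint_32 and poly_use_hint_88 compute per coefficient, assuming they follow the Decompose and UseHint definitions of FIPS 204 with γ2 = (q − 1)/32 and γ2 = (q − 1)/88 respectively. It is a reference-style illustration only: it is not the assembly from this PR, it is not constant-time, and every name in it is chosen for this example.

```python
Q = 8380417  # ML-DSA modulus q


def decompose(r: int, gamma2: int) -> tuple[int, int]:
    """Split r into (r1, r0) such that r = r1 * 2*gamma2 + r0 (mod Q)."""
    r %= Q
    r0 = r % (2 * gamma2)
    if r0 > gamma2:          # centre r0 into the range (-gamma2, gamma2]
        r0 -= 2 * gamma2
    if r - r0 == Q - 1:      # wrap-around corner case: fold the top bucket into 0
        return 0, r0 - 1
    return (r - r0) // (2 * gamma2), r0


def use_hint(h: int, r: int, gamma2: int) -> int:
    """Return the high bits of r, adjusted by one step if the hint bit is set."""
    m = (Q - 1) // (2 * gamma2)  # number of buckets: 16 or 44
    r1, r0 = decompose(r, gamma2)
    if h == 0:
        return r1
    return (r1 + 1) % m if r0 > 0 else (r1 - 1) % m


GAMMA2_32 = (Q - 1) // 32  # assumed parameter behind poly_use_hint_32
GAMMA2_88 = (Q - 1) // 88  # assumed parameter behind poly_use_hint_88
```

A vectorised implementation would apply the same per-coefficient rule to every coefficient of a polynomial; only the scalar logic is shown here.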
willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch from 1ea9d5f to 8a19e9a, February 5, 2026 06:05
willieyz marked this pull request as ready for review, February 5, 2026 06:39
willieyz requested a review from a team as a code owner, February 5, 2026 06:39
willieyz marked this pull request as draft, February 5, 2026 07:19
@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 46205 cycles 46203 cycles 1.00
ML-DSA-44 sign 131278 cycles 131278 cycles 1
ML-DSA-44 verify 47765 cycles 47768 cycles 1.00
ML-DSA-65 keypair 81014 cycles 81024 cycles 1.00
ML-DSA-65 sign 215785 cycles 215787 cycles 1.00
ML-DSA-65 verify 80057 cycles 80052 cycles 1.00
ML-DSA-87 keypair 132158 cycles 132151 cycles 1.00
ML-DSA-87 sign 276862 cycles 276816 cycles 1.00
ML-DSA-87 verify 130418 cycles 130384 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 114213 cycles 114155 cycles 1.00
ML-DSA-44 sign 418158 cycles 417994 cycles 1.00
ML-DSA-44 verify 122319 cycles 122262 cycles 1.00
ML-DSA-65 keypair 195508 cycles 195499 cycles 1.00
ML-DSA-65 sign 682497 cycles 682470 cycles 1.00
ML-DSA-65 verify 197760 cycles 197741 cycles 1.00
ML-DSA-87 keypair 322642 cycles 322656 cycles 1.00
ML-DSA-87 sign 864585 cycles 864584 cycles 1.00
ML-DSA-87 verify 328628 cycles 328653 cycles 1.00

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 34677 cycles 34696 cycles 1.00
ML-DSA-44 sign 120151 cycles 120195 cycles 1.00
ML-DSA-44 verify 38151 cycles 38145 cycles 1.00
ML-DSA-65 keypair 61275 cycles 60582 cycles 1.01
ML-DSA-65 sign 202094 cycles 200476 cycles 1.01
ML-DSA-65 verify 62940 cycles 62563 cycles 1.01
ML-DSA-87 keypair 93525 cycles 94602 cycles 0.99
ML-DSA-87 sign 236210 cycles 240494 cycles 0.98
ML-DSA-87 verify 95587 cycles 95761 cycles 1.00
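
The Ratio column in these github-action-benchmark tables appears to be Current divided by Previous, shown to two decimals, so a value below 1 means this PR took fewer cycles than the previous commit. A small sketch under that assumption (the helper name ratio is illustrative only), using rows from the table above:

```python
def ratio(current_cycles: int, previous_cycles: int) -> str:
    """Current / Previous, formatted to two decimals like the Ratio column."""
    return f"{current_cycles / previous_cycles:.2f}"


print(ratio(61275, 60582))    # ML-DSA-65 keypair
print(ratio(236210, 240494))  # ML-DSA-87 sign
```

Differences of one or two percent in either direction are within what these tables show for untouched code paths, for example the AArch64 runs, which this x86_64-only change does not affect.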

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 93726 cycles 93889 cycles 1.00
ML-DSA-44 sign 333512 cycles 333450 cycles 1.00
ML-DSA-44 verify 99955 cycles 99851 cycles 1.00
ML-DSA-65 keypair 160065 cycles 160390 cycles 1.00
ML-DSA-65 sign 545794 cycles 545908 cycles 1.00
ML-DSA-65 verify 160881 cycles 160887 cycles 1.00
ML-DSA-87 keypair 267728 cycles 267405 cycles 1.00
ML-DSA-87 sign 707504 cycles 707235 cycles 1.00
ML-DSA-87 verify 270918 cycles 269967 cycles 1.00

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 276468 cycles 277102 cycles 1.00
ML-DSA-44 sign 818650 cycles 810656 cycles 1.01
ML-DSA-44 verify 276672 cycles 278882 cycles 0.99
ML-DSA-65 keypair 475323 cycles 478906 cycles 0.99
ML-DSA-65 sign 1367640 cycles 1360800 cycles 1.01
ML-DSA-65 verify 459822 cycles 466415 cycles 0.99
ML-DSA-87 keypair 825623 cycles 818822 cycles 1.01
ML-DSA-87 sign 1873209 cycles 1878770 cycles 1.00
ML-DSA-87 verify 800938 cycles 794467 cycles 1.01

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 69035 cycles 69134 cycles 1.00
ML-DSA-44 sign 187364 cycles 187688 cycles 1.00
ML-DSA-44 verify 69341 cycles 69282 cycles 1.00
ML-DSA-65 keypair 119503 cycles 119368 cycles 1.00
ML-DSA-65 sign 303527 cycles 300862 cycles 1.01
ML-DSA-65 verify 115926 cycles 115513 cycles 1.00
ML-DSA-87 keypair 203793 cycles 203546 cycles 1.00
ML-DSA-87 sign 394456 cycles 394636 cycles 1.00
ML-DSA-87 verify 195809 cycles 195483 cycles 1.00

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 57235 cycles 56751 cycles 1.01
ML-DSA-44 sign 181496 cycles 181670 cycles 1.00
ML-DSA-44 verify 61165 cycles 61146 cycles 1.00
ML-DSA-65 keypair 98680 cycles 98647 cycles 1.00
ML-DSA-65 sign 298309 cycles 298480 cycles 1.00
ML-DSA-65 verify 100528 cycles 100288 cycles 1.00
ML-DSA-87 keypair 152581 cycles 152587 cycles 1.00
ML-DSA-87 sign 355291 cycles 355235 cycles 1.00
ML-DSA-87 verify 153950 cycles 153556 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton4

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 68156 cycles 68132 cycles 1.00
ML-DSA-44 sign 202004 cycles 201919 cycles 1.00
ML-DSA-44 verify 70775 cycles 70781 cycles 1.00
ML-DSA-65 keypair 120970 cycles 120914 cycles 1.00
ML-DSA-65 sign 331183 cycles 331101 cycles 1.00
ML-DSA-65 verify 117884 cycles 117908 cycles 1.00
ML-DSA-87 keypair 198649 cycles 198347 cycles 1.00
ML-DSA-87 sign 427544 cycles 427112 cycles 1.00
ML-DSA-87 verify 194417 cycles 194311 cycles 1.00

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 135070 cycles 134705 cycles 1.00
ML-DSA-44 sign 526006 cycles 524023 cycles 1.00
ML-DSA-44 verify 147853 cycles 147704 cycles 1.00
ML-DSA-65 keypair 226865 cycles 226528 cycles 1.00
ML-DSA-65 sign 860582 cycles 861852 cycles 1.00
ML-DSA-65 verify 235373 cycles 235761 cycles 1.00
ML-DSA-87 keypair 370367 cycles 371080 cycles 1.00
ML-DSA-87 sign 1079627 cycles 1079785 cycles 1.00
ML-DSA-87 verify 382615 cycles 383268 cycles 1.00

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 41639 cycles 42042 cycles 0.99
ML-DSA-44 sign 134495 cycles 135046 cycles 1.00
ML-DSA-44 verify 44953 cycles 45886 cycles 0.98
ML-DSA-65 keypair 72877 cycles 72408 cycles 1.01
ML-DSA-65 sign 214749 cycles 215490 cycles 1.00
ML-DSA-65 verify 73910 cycles 73252 cycles 1.01
ML-DSA-87 keypair 107778 cycles 107965 cycles 1.00
ML-DSA-87 sign 252308 cycles 254024 cycles 0.99
ML-DSA-87 verify 109196 cycles 111034 cycles 0.98

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 157593 cycles 157623 cycles 1.00
ML-DSA-44 sign 550359 cycles 549610 cycles 1.00
ML-DSA-44 verify 169225 cycles 169078 cycles 1.00
ML-DSA-65 keypair 267977 cycles 267943 cycles 1.00
ML-DSA-65 sign 903637 cycles 902493 cycles 1.00
ML-DSA-65 verify 274125 cycles 274108 cycles 1.00
ML-DSA-87 keypair 450990 cycles 447542 cycles 1.01
ML-DSA-87 sign 1162617 cycles 1156527 cycles 1.01
ML-DSA-87 verify 460584 cycles 457749 cycles 1.01

@oqs-bot oqs-bot left a comment

Graviton3

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 72258 cycles 72244 cycles 1.00
ML-DSA-44 sign 211991 cycles 212021 cycles 1.00
ML-DSA-44 verify 75712 cycles 75740 cycles 1.00
ML-DSA-65 keypair 127432 cycles 127429 cycles 1.00
ML-DSA-65 sign 350175 cycles 350138 cycles 1.00
ML-DSA-65 verify 125364 cycles 125365 cycles 1.00
ML-DSA-87 keypair 208138 cycles 208164 cycles 1.00
ML-DSA-87 sign 448958 cycles 448891 cycles 1.00
ML-DSA-87 verify 205105 cycles 205092 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 128309 cycles 128287 cycles 1.00
ML-DSA-44 sign 447743 cycles 447655 cycles 1.00
ML-DSA-44 verify 138349 cycles 144617 cycles 0.96
ML-DSA-65 keypair 220300 cycles 220134 cycles 1.00
ML-DSA-65 sign 727626 cycles 727309 cycles 1.00
ML-DSA-65 verify 223200 cycles 223042 cycles 1.00
ML-DSA-87 keypair 365101 cycles 365095 cycles 1.00
ML-DSA-87 sign 926593 cycles 926085 cycles 1.00
ML-DSA-87 verify 372803 cycles 372794 cycles 1.00

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 120283 cycles 123215 cycles 0.98
ML-DSA-44 sign 447117 cycles 449447 cycles 0.99
ML-DSA-44 verify 131120 cycles 129997 cycles 1.01
ML-DSA-65 keypair 205159 cycles 204042 cycles 1.01
ML-DSA-65 sign 729240 cycles 726667 cycles 1.00
ML-DSA-65 verify 210548 cycles 209895 cycles 1.00
ML-DSA-87 keypair 336772 cycles 336983 cycles 1.00
ML-DSA-87 sign 923968 cycles 923345 cycles 1.00
ML-DSA-87 verify 346738 cycles 346079 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 138516 cycles 138530 cycles 1.00
ML-DSA-44 sign 484183 cycles 484184 cycles 1.00
ML-DSA-44 verify 148695 cycles 162312 cycles 0.92
ML-DSA-65 keypair 242236 cycles 242042 cycles 1.00
ML-DSA-65 sign 792617 cycles 792604 cycles 1.00
ML-DSA-65 verify 241189 cycles 241158 cycles 1.00
ML-DSA-87 keypair 396195 cycles 396278 cycles 1.00
ML-DSA-87 sign 1012977 cycles 1012741 cycles 1.00
ML-DSA-87 verify 402535 cycles 402584 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton2

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 113732 cycles 113782 cycles 1.00
ML-DSA-44 sign 356644 cycles 356752 cycles 1.00
ML-DSA-44 verify 118430 cycles 118475 cycles 1.00
ML-DSA-65 keypair 197173 cycles 196794 cycles 1.00
ML-DSA-65 sign 590265 cycles 589466 cycles 1.00
ML-DSA-65 verify 195302 cycles 194959 cycles 1.00
ML-DSA-87 keypair 323470 cycles 323525 cycles 1.00
ML-DSA-87 sign 754020 cycles 753949 cycles 1.00
ML-DSA-87 verify 320376 cycles 320428 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 464016 cycles 464390 cycles 1.00
ML-DSA-44 sign 2147712 cycles 2143061 cycles 1.00
ML-DSA-44 verify 549450 cycles 550761 cycles 1.00
ML-DSA-65 keypair 780154 cycles 778123 cycles 1.00
ML-DSA-65 sign 3526513 cycles 3512438 cycles 1.00
ML-DSA-65 verify 854883 cycles 855593 cycles 1.00
ML-DSA-87 keypair 1267406 cycles 1267438 cycles 1.00
ML-DSA-87 sign 4379414 cycles 4378736 cycles 1.00
ML-DSA-87 verify 1380266 cycles 1386777 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 212950 cycles 212911 cycles 1.00
ML-DSA-44 sign 760811 cycles 760738 cycles 1.00
ML-DSA-44 verify 229323 cycles 234592 cycles 0.98
ML-DSA-65 keypair 381103 cycles 381182 cycles 1.00
ML-DSA-65 sign 1254342 cycles 1254335 cycles 1.00
ML-DSA-65 verify 372069 cycles 372135 cycles 1.00
ML-DSA-87 keypair 604503 cycles 604612 cycles 1.00
ML-DSA-87 sign 1594375 cycles 1594912 cycles 1.00
ML-DSA-87 verify 618664 cycles 618598 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 113355 cycles 113272 cycles 1.00
ML-DSA-44 sign 356033 cycles 355738 cycles 1.00
ML-DSA-44 verify 117884 cycles 117885 cycles 1.00
ML-DSA-65 keypair 196542 cycles 196931 cycles 1.00
ML-DSA-65 sign 589198 cycles 589334 cycles 1.00
ML-DSA-65 verify 194585 cycles 194567 cycles 1.00
ML-DSA-87 keypair 322401 cycles 322504 cycles 1.00
ML-DSA-87 sign 752036 cycles 753152 cycles 1.00
ML-DSA-87 verify 319958 cycles 320215 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 212612 cycles 212810 cycles 1.00
ML-DSA-44 sign 759997 cycles 759720 cycles 1.00
ML-DSA-44 verify 228854 cycles 229136 cycles 1.00
ML-DSA-65 keypair 380708 cycles 380820 cycles 1.00
ML-DSA-65 sign 1252502 cycles 1251840 cycles 1.00
ML-DSA-65 verify 371854 cycles 372231 cycles 1.00
ML-DSA-87 keypair 605059 cycles 605579 cycles 1.00
ML-DSA-87 sign 1593982 cycles 1591706 cycles 1.00
ML-DSA-87 verify 618815 cycles 617581 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 828493 cycles 828629 cycles 1.00
ML-DSA-44 sign 3237874 cycles 3236899 cycles 1.00
ML-DSA-44 verify 920036 cycles 920218 cycles 1.00
ML-DSA-65 keypair 1414978 cycles 1413016 cycles 1.00
ML-DSA-65 sign 5366078 cycles 5357541 cycles 1.00
ML-DSA-65 verify 1482925 cycles 1480164 cycles 1.00
ML-DSA-87 keypair 2312703 cycles 2311040 cycles 1.00
ML-DSA-87 sign 6669160 cycles 6668340 cycles 1.00
ML-DSA-87 verify 2416765 cycles 2415856 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 222746 cycles 227029 cycles 0.98
ML-DSA-44 sign 609985 cycles 617875 cycles 0.99
ML-DSA-44 verify 223898 cycles 224701 cycles 1.00
ML-DSA-65 keypair 396984 cycles 412531 cycles 0.96
ML-DSA-65 sign 1037227 cycles 1061715 cycles 0.98
ML-DSA-65 verify 375316 cycles 387814 cycles 0.97
ML-DSA-87 keypair 658105 cycles 666611 cycles 0.99
ML-DSA-87 sign 1352975 cycles 1398456 cycles 0.97
ML-DSA-87 verify 638484 cycles 667131 cycles 0.96

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 314316 cycles 322374 cycles 0.98
ML-DSA-44 sign 1219077 cycles 1200283 cycles 1.02
ML-DSA-44 verify 347864 cycles 342633 cycles 1.02
ML-DSA-65 keypair 605825 cycles 566673 cycles 1.07
ML-DSA-65 sign 2034909 cycles 1937222 cycles 1.05
ML-DSA-65 verify 568560 cycles 546998 cycles 1.04
ML-DSA-87 keypair 877363 cycles 869944 cycles 1.01
ML-DSA-87 sign 2465004 cycles 2468357 cycles 1.00
ML-DSA-87 verify 897477 cycles 906874 cycles 0.99

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

A possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
The benchmark result of this commit is worse than the previous result by more than the alert threshold of 1.03 (ratio of current to previous cycles).

Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-65 keypair 605825 cycles 566673 cycles 1.07
ML-DSA-65 sign 2034909 cycles 1937222 cycles 1.05
ML-DSA-65 verify 568560 cycles 546998 cycles 1.04

This comment was automatically generated by workflow using github-action-benchmark.
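
For readers checking the alert by hand, the following is a minimal sketch of how the Ratio column and the 1.03 threshold appear to relate. It assumes the ratio is current cycles divided by previous cycles (which matches the figures in the tables) and that a row is flagged when the ratio is strictly greater than the threshold. The function names are hypothetical and this is not the code used by github-action-benchmark.

```python
# Hypothetical re-computation of the alert above; not the tool's own code.
# Threshold is taken from the alert text.
ALERT_THRESHOLD = 1.03


def regression_ratio(current_cycles, previous_cycles):
    """Ratio as it appears in the tables: current / previous (assumed)."""
    return current_cycles / previous_cycles


def flag_regressions(results, threshold=ALERT_THRESHOLD):
    """Return (name, rounded ratio) for rows whose ratio exceeds the threshold.

    The strict '>' comparison is an assumption made for this sketch.
    """
    flagged = []
    for name, (current, previous) in results.items():
        ratio = regression_ratio(current, previous)
        if ratio > threshold:
            flagged.append((name, round(ratio, 2)))
    return flagged


# (current, previous) cycle counts copied from the
# 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)' table.
a72_no_opt = {
    "ML-DSA-44 keypair": (314316, 322374),
    "ML-DSA-44 sign": (1219077, 1200283),
    "ML-DSA-44 verify": (347864, 342633),
    "ML-DSA-65 keypair": (605825, 566673),
    "ML-DSA-65 sign": (2034909, 1937222),
    "ML-DSA-65 verify": (568560, 546998),
    "ML-DSA-87 keypair": (877363, 869944),
    "ML-DSA-87 sign": (2465004, 2468357),
    "ML-DSA-87 verify": (897477, 906874),
}

for name, ratio in flag_regressions(a72_no_opt):
    print(f"{name}: {ratio:.2f}")
```

Under these assumptions, only the three ML-DSA-65 rows are flagged, which matches the rows listed in the alert.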

@willieyz willieyz marked this pull request as ready for review February 5, 2026 07:42
Development

Successfully merging this pull request may close these issues.

AVX2: Replace intrinsics implementation of poly_use_hint with assembly

2 participants