Commit 87eca7f
Optimize spqlios-stockham inner loop and clean up dead code
Separate the j=0 (no twiddle) and j>0 (with twiddle) paths in the
inner loop, eliminating the branch from the hot path. Clean up
commented-out code from the failed 4-element vectorization attempt.
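For illustration only, a minimal sketch of what separating the j=0 and j>0 paths can look like. This is not the code from this commit: it assumes a simplified scalar radix-2 Stockham stage rather than the radix-4 intrinsics kernel, and the names `SplitBuf` and `stockham_fft` are hypothetical. The j=0 loop is peeled out so that it needs only additions and subtractions, while the j>0 loop always applies the twiddle without testing j.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Split-format complex buffer: real and imaginary parts in separate arrays.
// (Hypothetical type, used only for this sketch.)
struct SplitBuf {
    std::vector<double> re, im;
    explicit SplitBuf(std::size_t n) : re(n, 0.0), im(n, 0.0) {}
};

// Simplified radix-2 Stockham forward FFT (assumption: stands in for the
// radix-4 kernel). On return, x holds the transform in natural order.
void stockham_fft(SplitBuf& x) {
    const std::size_t N = x.re.size();
    assert(N != 0 && (N & (N - 1)) == 0);  // length must be a power of two
    SplitBuf y(N);                          // out-of-place work buffer
    SplitBuf* src = &x;
    SplitBuf* dst = &y;
    const double pi = std::acos(-1.0);

    for (std::size_t n = N, s = 1; n > 1; n /= 2, s *= 2) {
        const std::size_t m = n / 2;

        // j = 0 path: the twiddle is exactly 1, so no complex multiply.
        for (std::size_t q = 0; q < s; ++q) {
            const double ar = src->re[q],         ai = src->im[q];
            const double br = src->re[q + s * m], bi = src->im[q + s * m];
            dst->re[q]     = ar + br;  dst->im[q]     = ai + bi;
            dst->re[q + s] = ar - br;  dst->im[q + s] = ai - bi;
        }

        // j > 0 path: always multiplies by the twiddle, no branch on j.
        for (std::size_t j = 1; j < m; ++j) {
            const double ang = -2.0 * pi * double(j) / double(n);
            const double wr = std::cos(ang), wi = std::sin(ang);
            for (std::size_t q = 0; q < s; ++q) {
                const std::size_t ia = q + s * j, ib = q + s * (j + m);
                const double ar = src->re[ia], ai = src->im[ia];
                const double br = src->re[ib], bi = src->im[ib];
                const double dr = ar - br, di = ai - bi;
                dst->re[q + s * (2 * j)] = ar + br;
                dst->im[q + s * (2 * j)] = ai + bi;
                dst->re[q + s * (2 * j + 1)] = dr * wr - di * wi;
                dst->im[q + s * (2 * j + 1)] = dr * wi + di * wr;
            }
        }
        std::swap(src, dst);  // the stage output becomes the next input
    }
    if (src != &x) x = *src;  // result ended up in the work buffer
}
```

In the sketch the j=0 loop touches s butterflies per stage, so its share of the work grows in the later stages, where s is large.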
Results (default 128-bit params, N=1024):
  RawIFFT: 1285 → 1235 ns (-3.9%)
  RawFFT:  1716 → 1682 ns (-2.0%)
  NAND:    11.58 → 11.26 ms (-2.8%)
Comparison of all backends:
  Backend                          NAND       BR(tfhers)
  Original SPQLIOS (CT+fused asm)  10.46 ms   14.95 ms
  Stockham split (radix-4 C++)     11.26 ms   17.05 ms
  Stockham interleaved (radix-4)   12.06 ms   13.55 ms
  tfhe-rs (Stockham r4 Rust)       -          14.07 ms
Conclusion: For split format, the Stockham algorithm with C++ intrinsics
cannot beat hand-tuned Cooley-Tukey assembly because:
1. The Stockham scatter pattern in the final stage requires scalar stores
   with split format (interleaved format uses catlo/cathi instead)
2. The j=0 first pass accounts for ~25% of the work but doesn't benefit
   from radix-4 (all twiddles are 1)
3. The out-of-place write pattern doesn't help when the data fits in L1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 94612d5 commit 87eca7f
2 files changed
Lines changed: 264 additions & 210 deletions