Commit 87eca7f

Ubuntu and claude committed
Optimize spqlios-stockham inner loop and clean up dead code
Separate the j=0 (no twiddle) and j>0 (with twiddle) paths in the inner loop, eliminating the branch from the hot path. Clean up commented-out code from the failed 4-element vectorization attempt.

Results (default 128-bit params, N=1024):
  RawIFFT: 1285 → 1235 ns (-3.9%)
  RawFFT:  1716 → 1682 ns (-2.0%)
  NAND:    11.58 → 11.26 ms (-2.8%)

Comparison of all backends:
  Original SPQLIOS (CT+fused asm): NAND 10.46 ms, BR(tfhers) 14.95 ms
  Stockham split (radix-4 C++):    NAND 11.26 ms, BR(tfhers) 17.05 ms
  Stockham interleaved (radix-4):  NAND 12.06 ms, BR(tfhers) 13.55 ms
  tfhe-rs (Stockham r4 Rust):      BR(tfhers) 14.07 ms

Conclusion: For split format, the Stockham algorithm with C++ intrinsics cannot beat hand-tuned Cooley-Tukey assembly because:
1. The Stockham scatter pattern in the final stage requires scalar stores with split format (interleaved format uses catlo/cathi instead)
2. The j=0 first pass accounts for ~25% of the work but doesn't benefit from radix-4 (all twiddles are 1)
3. The out-of-place write pattern doesn't help when the data fits in L1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
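
For readers unfamiliar with the j=0 / j>0 split described above, here is a minimal scalar sketch of the idea. It is not the code from this commit: it uses radix-2 rather than radix-4 for brevity, has no SIMD intrinsics, and every name in it (`SplitComplex`, `stockham_stage`, `stockham_fft`) is hypothetical. It only illustrates peeling the twiddle-free j=0 pass out of the loop, so the remaining loop always multiplies by a twiddle and needs no branch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Split format: real and imaginary parts live in separate arrays.
// (Hypothetical type, for illustration only.)
struct SplitComplex {
    std::vector<double> re, im;
};

// One radix-2 Stockham stage, out of place: reads x, writes y.
// n is the current sub-transform length, s is the stride.
static void stockham_stage(std::size_t n, std::size_t s,
                           const SplitComplex& x, SplitComplex& y) {
    const std::size_t m = n / 2;

    // j == 0 path: the twiddle is exactly 1, so no complex multiply.
    for (std::size_t q = 0; q < s; ++q) {
        const double ar = x.re[q], ai = x.im[q];
        const double br = x.re[q + s * m], bi = x.im[q + s * m];
        y.re[q] = ar + br;
        y.im[q] = ai + bi;
        y.re[q + s] = ar - br;
        y.im[q + s] = ai - bi;
    }

    // j > 0 path: always applies a twiddle, no branch inside the loop.
    const double theta = -2.0 * 3.14159265358979323846 / static_cast<double>(n);
    for (std::size_t j = 1; j < m; ++j) {
        const double wr = std::cos(theta * static_cast<double>(j));
        const double wi = std::sin(theta * static_cast<double>(j));
        for (std::size_t q = 0; q < s; ++q) {
            const double ar = x.re[q + s * j], ai = x.im[q + s * j];
            const double br = x.re[q + s * (j + m)], bi = x.im[q + s * (j + m)];
            const double dr = ar - br, di = ai - bi;
            y.re[q + s * (2 * j)] = ar + br;
            y.im[q + s * (2 * j)] = ai + bi;
            y.re[q + s * (2 * j + 1)] = dr * wr - di * wi;
            y.im[q + s * (2 * j + 1)] = dr * wi + di * wr;
        }
    }
}

// Forward FFT of a power-of-two-length signal, result written back to data.
void stockham_fft(SplitComplex& data) {
    const std::size_t N = data.re.size();
    assert(N == data.im.size());
    assert(N != 0 && (N & (N - 1)) == 0);  // power of two only

    SplitComplex tmp{std::vector<double>(N), std::vector<double>(N)};
    SplitComplex* x = &data;
    SplitComplex* y = &tmp;
    for (std::size_t n = N, s = 1; n > 1; n /= 2, s *= 2) {
        stockham_stage(n, s, *x, *y);
        std::swap(x, y);  // ping-pong between the two buffers
    }
    if (x != &data) data = *x;  // final result ended up in tmp
}
```

In a production kernel the twiddles would come from a precomputed table rather than `std::cos`/`std::sin` calls, and the inner `q` loop is where vector loads and stores would go; this sketch leaves both out to keep the control-flow change visible.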
1 parent 94612d5 commit 87eca7f

2 files changed

Lines changed: 264 additions & 210 deletions
