Problem
IR cabinet performance on the Raspberry Pi is inconsistent in a way that does not track IR length — heavily-trimmed (short) IRs can still perform poorly, while other IRs of similar length run fine. The differentiator is the IR's coefficient values, not its tap count.
Root cause: denormal (subnormal) floating-point arithmetic. There is no global flush-to-zero set anywhere in the codebase (no FPCR / MXCSR / FTZ setup on the audio thread). When the input signal decays toward silence, the FIR convolver's input_buffer and accumulator (ir/convolver/fir.rs) drift into the denormal range (magnitudes below ~1.2e-38, the smallest normal f32). On ARM (the Pi), denormal ops without FTZ are ~10-100x slower per operation, so the cost spikes depending on whether a given IR's tail structure drives products/sums into denormals against a decaying input.
This explains both observations:
- cost not proportional to trimmed length, and
- short IRs that still struggle.
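To make the threshold concrete, here is a minimal sketch of how quickly a decaying signal reaches the subnormal range. The function name and parameters are hypothetical and not taken from the codebase; it assumes the default floating-point mode (no FTZ/DAZ).

```rust
/// Hypothetical helper: counts how many samples an exponentially decaying
/// signal takes to become subnormal. Returns None if it never does within
/// `max_samples`, or if it reaches exactly zero first.
fn samples_until_subnormal(start: f32, decay: f32, max_samples: usize) -> Option<usize> {
    let mut x = start;
    for n in 0..max_samples {
        if x.is_subnormal() {
            return Some(n);
        }
        if x == 0.0 {
            return None;
        }
        x *= decay;
    }
    None
}
```

For example, `samples_until_subnormal(1.0, 0.5, 1000)` returns `Some(127)`: the smallest normal f32 is 2^-126 (`f32::MIN_POSITIVE`, about 1.18e-38), so the 127th halving is the first subnormal value.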
Current partial workaround
eq.rs, delay.rs, reverb.rs, poweramp.rs manually flush their internal state at a 1e-20 threshold. This is incomplete: it does not cover the IR convolver, and 1e-20 is itself a normal f32, so it does nothing for denormal products formed inside the FIR loop.
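The gap can be illustrated with a small sketch. `flush_small` below is a hypothetical stand-in for the per-stage flush (the real helpers may look different), and the sample and coefficient values are made up for illustration.

```rust
/// Assumed shape of the existing per-stage flush: zero anything whose
/// magnitude is below the 1e-20 threshold.
const FLUSH_THRESHOLD: f32 = 1e-20;

fn flush_small(x: f32) -> f32 {
    if x.abs() < FLUSH_THRESHOLD { 0.0 } else { x }
}

/// One multiply as it would occur inside a FIR loop: a sample that has
/// passed the threshold flush times a small IR tail coefficient.
fn fir_product(sample: f32, coefficient: f32) -> f32 {
    flush_small(sample) * coefficient
}
```

A sample of 2e-20 passes the flush unchanged, yet `fir_product(2e-20, 1e-20)` is about 2e-40, which is subnormal. A state-level threshold does not prevent subnormal intermediates inside the multiply-accumulate.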
Proposed fix
Set FTZ + DAZ once at the start of every real-time audio thread (JACK process thread / nih-plug process thread):
- AArch64: set the FZ bit in FPCR.
- x86: _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON) + DAZ.
This is standard real-time-audio practice, cheap (one-time), and makes the whole IR performance gradient consistent regardless of IR content. Keep the existing per-stage manual flushes as belt-and-suspenders.
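A minimal sketch of the per-thread setup in Rust, using inline assembly. The function name `enable_flush_to_zero` is hypothetical. Two assumptions are worth flagging: Rust's `core::arch` MXCSR helpers appear to be deprecated in recent toolchains in favour of inline assembly, and Rust does not formally guarantee behaviour under a non-default floating-point environment, so this should be treated as a pragmatic sketch rather than a sanctioned API. It also assumes a 64-bit OS on the Pi; 32-bit ARM uses FPSCR instead and is not covered.

```rust
/// Hypothetical helper: enable flush-to-zero for the calling thread.
/// Returns true if the mode was set, false on unsupported architectures.
#[cfg(target_arch = "aarch64")]
pub fn enable_flush_to_zero() -> bool {
    use std::arch::asm;
    let mut fpcr: u64;
    unsafe {
        // Read FPCR, set FZ (bit 24), write it back.
        asm!("mrs {0}, fpcr", out(reg) fpcr, options(nomem, nostack, preserves_flags));
        fpcr |= 1 << 24;
        asm!("msr fpcr, {0}", in(reg) fpcr, options(nomem, nostack, preserves_flags));
    }
    true
}

#[cfg(target_arch = "x86_64")]
pub fn enable_flush_to_zero() -> bool {
    use std::arch::asm;
    let mut mxcsr: u32 = 0;
    unsafe {
        // Store MXCSR to memory, set FTZ (bit 15) and DAZ (bit 6), reload it.
        asm!("stmxcsr dword ptr [{p}]", p = in(reg) &mut mxcsr as *mut u32,
             options(nostack, preserves_flags));
        mxcsr |= (1 << 15) | (1 << 6);
        asm!("ldmxcsr dword ptr [{p}]", p = in(reg) &mxcsr as *const u32,
             options(nostack, preserves_flags));
    }
    true
}

#[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
pub fn enable_flush_to_zero() -> bool {
    // No-op fallback so the call site compiles everywhere.
    false
}
```

Both registers are per-thread state, so the call has to run on the audio thread itself (for example at the top of the process callback, where repeating it is harmless), not on the thread that creates the client or plugin.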
Verification
Reproducible on x86 too (also slow on denormals without FTZ): benchmark fir.rs fed with a signal decaying toward zero, comparing a tail-heavy IR against a punchy IR, each with FTZ off and on. With FTZ on, the variance should collapse. The Pi is the same effect amplified.
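A rough sketch of such a benchmark harness. Everything here is a hypothetical stand-in: `fir_direct` is a naive direct-form FIR, not the convolver in fir.rs, and the two IR shapes and all constants are synthetic choices for illustration. Run it once with the thread's flush-to-zero mode off and once with it on, and compare the timings.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Exponentially decaying test signal: x[n] = decay^n.
fn decaying_input(len: usize, decay: f32) -> Vec<f32> {
    let mut x = 1.0f32;
    (0..len).map(|_| { let s = x; x *= decay; s }).collect()
}

/// Naive direct-form FIR: y[n] = sum over k of h[k] * x[n - k].
fn fir_direct(ir: &[f32], input: &[f32]) -> Vec<f32> {
    (0..input.len())
        .map(|n| {
            ir.iter()
                .take(n + 1)
                .enumerate()
                .map(|(k, &h)| h * input[n - k])
                .sum::<f32>()
        })
        .collect()
}

/// Synthetic "tail-heavy" IR: a long, slowly decaying low-level tail.
fn tail_heavy_ir(taps: usize) -> Vec<f32> {
    (0..taps).map(|k| 1e-6 * 0.99f32.powi(k as i32)).collect()
}

/// Synthetic "punchy" IR: a few strong early taps, exact zeros afterwards.
fn punchy_ir(taps: usize) -> Vec<f32> {
    (0..taps).map(|k| if k < 8 { 1.0 / (k as f32 + 1.0) } else { 0.0 }).collect()
}

/// Time `repeats` passes of the FIR over the input.
fn time_fir(ir: &[f32], input: &[f32], repeats: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..repeats {
        black_box(fir_direct(black_box(ir), black_box(input)));
    }
    start.elapsed()
}

/// Print timings for both IR shapes against the same decaying input.
pub fn report() {
    let input = decaying_input(16_384, 0.99);
    for (name, ir) in [("tail-heavy", tail_heavy_ir(512)), ("punchy", punchy_ir(512))] {
        println!("{name}: {:?}", time_fir(&ir, &input, 10));
    }
}
```

Both IRs have the same tap count, so any timing difference between them in the FTZ-off run points at coefficient values and not at length.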
Notes