Make exponentiation significantly faster by axman6 · Pull Request #221 · haskell/statistics

axman6 · 2025-08-10T14:18:19Z

From #220, make the (^) function significantly faster - it does this two ways, by unrolling common low exponents (1-5), which allows GHC to inline the full set of multiplications, and in cases above 5, it uses a tail recursive loop, which is much faster than the non-tail recursive definition that's been in hiding in this package for over a decade (it probably is faster than Prelude's definition, but it's not fast).

Benchmarks, adding the 10th and 20th central moment to demonstrate the effect of the loop (I'm sure someone must care about that statistic...). I've only included benchmarks that actually call (^)

                   , master   , unrolled ,            , master      , unrolled
Name               , Mean (ps), Mean (ps), % of master, 2*Stdev (ps), 2*Stdev (ps)
All.sample.skewness, 112217871,  48159448,       42.91,     10580706, 3933644
All.sample.kurtosis, 125990136,  51025634,       40.49,      7770772, 3650532
All.sample.C.M. 2  ,  71661035,  45683325,       63.74,      6811708, 4477812
All.sample.C.M. 3  ,  78805029,  43542919,       55.25,      7553062, 3881564
All.sample.C.M. 4  ,  85555419,  45128808,       52.74,      8290264, 4030818
All.sample.C.M. 5  ,  94791943,  44017114,       46.43,      7973198, 4154740
All.sample.C.M. 10 , 136170214,  76725390,       56.34,     13295868, 6833108
All.sample.C.M. 20 , 218253906, 106618505,       48.85,     15665576, 7769680

Shimuuar · 2025-08-12T22:13:41Z

I've checked core and it looks like unrolling doesn't work as intended. ^ is never called with statically known arguments. It's inlined into centralMoment{,s} and they're not inlined so there's no chance for specialization.

Variants with and without unrolling have similar performance and I've got interested what is exact difference? I'll benchmark it

axman6 · 2025-08-13T00:35:54Z

That's interesting, have you looked at the core that gets produced with known exponents when calling centralMoment? I was definitely seeing differences in the benchmarks with the CM 2-5 benchmarks with the expanded definitions for the small powers, but the differences weren't huge.

Since the exponent arguments to centralMoments are likely to always be known, it feels like its definition should also be marked INLINE, it's not a large function.

While I'm thinking about it, do you know why the definition chooses to use centralMoment twice when the exponents are small? The definition calling G.foldl' should perform at least as well as the two separate definitions. Maybe I should do some benchmarks...

Shimuuar · 2025-08-13T10:07:59Z

Unrolling does have an effect. Here benchmarking results using PAPI benchmarks (TOT_CYC — number of CPU cycles and BR_INS — branching instructions):

01 is you variant, 02 removes unrolling, and 03 moves unrolling into centralMoment definition:

centralMoment a xs
    | a < 0  = error "Statistics.Sample.centralMoment: negative input"
    | a == 0 = 1
    | a == 1 = 0
    | a == 2 = expectation go2 xs
    | a == 3 = expectation go3 xs
    | a == 4 = expectation go4 xs
    | otherwise = expectation go xs

So unrolling does have some effect. It not very surprising. After all it changes body of inlined function. For small powers unrolled version traverses chain of conditional jumps and computes result. It could be friendlier to branch predictor. And moving conditionals out of the loop improves performance slightly. And curiously benchmark for CM2 is not affected by all this changes at all.

While I'm thinking about it, do you know why the definition chooses to use centralMoment twice when the exponents are small?

For n=0 and n=1 central moments could be computed in constant time. (1 and 0 correspondingly)

Shimuuar · 2025-08-15T17:38:33Z

Lets drop unrolling and then this PR could be merged

As it currently stands unrolling doesn't achieve specialization for central momenta of statically known powers. But simply turning ^ will give us a new baseline.

Tail-recursive variant is much faster

We don't get specialization and gains are relatively small. Let go with simpler solution

Shimuuar · 2025-09-01T12:06:42Z

Thank you! Merged

axman6 and others added 2 commits September 1, 2025 14:42

Make exponentiation significantly faster

d3c97c2

Tail-recursive variant is much faster

Drop manual unrolling

0a34059

We don't get specialization and gains are relatively small. Let go with simpler solution

Shimuuar force-pushed the fast-exponentiation branch from 539c470 to 0a34059 Compare September 1, 2025 11:43

Shimuuar merged commit 7816c14 into haskell:master Sep 1, 2025
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Make exponentiation significantly faster#221

Make exponentiation significantly faster#221
Shimuuar merged 2 commits intohaskell:masterfrom
axman6:fast-exponentiation

axman6 commented Aug 10, 2025 •

edited

Loading

Uh oh!

Shimuuar commented Aug 12, 2025

Uh oh!

axman6 commented Aug 13, 2025 •

edited

Loading

Uh oh!

Shimuuar commented Aug 13, 2025

Uh oh!

Shimuuar commented Aug 15, 2025

Uh oh!

Uh oh!

Shimuuar commented Sep 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

axman6 commented Aug 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Shimuuar commented Aug 12, 2025

Uh oh!

axman6 commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Shimuuar commented Aug 13, 2025

Uh oh!

Shimuuar commented Aug 15, 2025

Uh oh!

Uh oh!

Shimuuar commented Sep 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

axman6 commented Aug 10, 2025 •

edited

Loading

axman6 commented Aug 13, 2025 •

edited

Loading