Skip to content

Performance optimisation for cobra_apply? #46

@mfreeborn

Description

@mfreeborn

I was just playing around with this library and was having a look at the cobra_apply function. It recalculates a.reverse_bits() >> ((block_width - 1).leading_zeros()); an awful lot due to the number of loops, and it showed up as a big chunk in the flame chart.

Block width is fixed at 128, and a (and b and c where relevant) are also always just 0..BLOCK_WIDTH. I tried just extracting it into a lookup table:

static FORWARD: [usize; 128] = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
    26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
    50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73,
    74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
    98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
    117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127,
];

static REVERSED: [usize; 128] = [
    0, 64, 32, 96, 16, 80, 48, 112, 8, 72, 40, 104, 24, 88, 56, 120, 4, 68, 36, 100, 20, 84, 52,
    116, 12, 76, 44, 108, 28, 92, 60, 124, 2, 66, 34, 98, 18, 82, 50, 114, 10, 74, 42, 106, 26, 90,
    58, 122, 6, 70, 38, 102, 22, 86, 54, 118, 14, 78, 46, 110, 30, 94, 62, 126, 1, 65, 33, 97, 17,
    81, 49, 113, 9, 73, 41, 105, 25, 89, 57, 121, 5, 69, 37, 101, 21, 85, 53, 117, 13, 77, 45, 109,
    29, 93, 61, 125, 3, 67, 35, 99, 19, 83, 51, 115, 11, 75, 43, 107, 27, 91, 59, 123, 7, 71, 39,
    103, 23, 87, 55, 119, 15, 79, 47, 111, 31, 95, 63, 127,
];

and then replacing all the similar

    for a in 0..BLOCK_WIDTH {
        let a_rev = a.reverse_bits() >> ((BLOCK_WIDTH - 1).leading_zeros());

with

for (a, a_rev) in FORWARD.into_iter().zip(REVERSED) { ...

and I got 25 - 30% gains across log_n = 15..20.

The cobra test still seems to pass fine. Is this a valid optimisation?

EDIT: actually just doing REVERSED.into_iter().enumerate() is both more elegant and worth another couple of percent on my laptop.

I also note that declaring REVERSED as static, rather than const harms performance by a few percent.

But it would be good if these benches could be double checked by someone else to make sure this isn't all just artefactual on on my particular set up or something...

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions