Performance optimisation for `cobra_apply`?

I was just playing around with this library and was having a look at the `cobra_apply` function. It recalculates `a.reverse_bits() >> ((block_width - 1).leading_zeros());` an awful lot due to the number of loops, and it showed up as a big chunk in the flame chart.

Block width is fixed at 128, and `a` (and `b` and `c` where relevant) are also always just `0..BLOCK_WIDTH`. I tried just extracting it into a lookup table:

```rust
static FORWARD: [usize; 128] = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
    26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
    50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73,
    74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
    98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
    117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127,
];

static REVERSED: [usize; 128] = [
    0, 64, 32, 96, 16, 80, 48, 112, 8, 72, 40, 104, 24, 88, 56, 120, 4, 68, 36, 100, 20, 84, 52,
    116, 12, 76, 44, 108, 28, 92, 60, 124, 2, 66, 34, 98, 18, 82, 50, 114, 10, 74, 42, 106, 26, 90,
    58, 122, 6, 70, 38, 102, 22, 86, 54, 118, 14, 78, 46, 110, 30, 94, 62, 126, 1, 65, 33, 97, 17,
    81, 49, 113, 9, 73, 41, 105, 25, 89, 57, 121, 5, 69, 37, 101, 21, 85, 53, 117, 13, 77, 45, 109,
    29, 93, 61, 125, 3, 67, 35, 99, 19, 83, 51, 115, 11, 75, 43, 107, 27, 91, 59, 123, 7, 71, 39,
    103, 23, 87, 55, 119, 15, 79, 47, 111, 31, 95, 63, 127,
];
``` 

and then replacing all the similar

```rust
    for a in 0..BLOCK_WIDTH {
        let a_rev = a.reverse_bits() >> ((BLOCK_WIDTH - 1).leading_zeros());
```

with

```rust
for (a, a_rev) in FORWARD.into_iter().zip(REVERSED) { ...
```

and I got 25 - 30% gains across `log_n = 15..20`.

The `cobra` test still seems to pass fine. Is this a valid optimisation?

EDIT: actually just doing `REVERSED.into_iter().enumerate()` is both more elegant and worth another couple of percent on my laptop.

I also note that declaring `REVERSED` as `static`, rather than `const` harms performance by a few percent.

But it would be good if these benches could be double checked by someone else to make sure this isn't all just artefactual on on my particular set up or something...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance optimisation for `cobra_apply`? #46

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Performance optimisation for cobra_apply? #46

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Performance optimisation for `cobra_apply`? #46