Benchmark script:
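# Invoked with three command-line arguments: n (matrix size), samples, evals.
const n = parse(Int, ARGS[1])
const samples = parse(Int, ARGS[2])
const evals = parse(Int, ARGS[3])
@show n
@show samples
@show evals
using BenchmarkTools, FixedSizeArrays
# seconds=Inf removes the time budget, so the run only stops once `samples` samples are collected.
# First timing: FixedSizeArray operands.
@btime x * y * z seconds=Inf samples=samples evals=evals setup=(x = FixedSizeArray(rand(Float32, n, n)); y = FixedSizeArray(rand(Float32, n, n)); z = FixedSizeArray(rand(Float32, n, n)););
# Second timing: plain Array operands.
@btime x * y * z seconds=Inf samples=samples evals=evals setup=(x = rand(Float32, n, n); y = rand(Float32, n, n); z = rand(Float32, n, n););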
My results for n in 0:9 (in each pair of timings, the first line is FixedSizeArray and the second is Array):
n = 0
samples = 20000
evals = 20
45.050 ns (0 allocations: 0 bytes)
40.050 ns (2 allocations: 96 bytes)
n = 1
samples = 20000
evals = 20
317.100 ns (2 allocations: 64 bytes)
300.050 ns (4 allocations: 160 bytes)
n = 2
samples = 20000
evals = 20
132.250 ns (2 allocations: 96 bytes)
99.150 ns (4 allocations: 192 bytes)
n = 3
samples = 20000
evals = 20
131.200 ns (2 allocations: 128 bytes)
118.700 ns (4 allocations: 224 bytes)
n = 4
samples = 20000
evals = 20
360.200 ns (2 allocations: 192 bytes)
353.650 ns (4 allocations: 288 bytes)
n = 5
samples = 20000
evals = 20
435.800 ns (2 allocations: 256 bytes)
417.300 ns (4 allocations: 352 bytes)
n = 6
samples = 20000
evals = 20
499.950 ns (2 allocations: 352 bytes)
463.850 ns (4 allocations: 448 bytes)
n = 7
samples = 20000
evals = 20
565.550 ns (2 allocations: 448 bytes)
557.550 ns (4 allocations: 544 bytes)
n = 8
samples = 20000
evals = 20
516.450 ns (2 allocations: 576 bytes)
500.900 ns (4 allocations: 672 bytes)
n = 9
samples = 20000
evals = 20
633.700 ns (2 allocations: 736 bytes)
604.650 ns (4 allocations: 832 bytes)
There is a lot of weird stuff here (why is the n == 1 case so slow?), but the takeaway is that FSA (FixedSizeArray) is slower than Array at every size tested, even though FSA makes two fewer allocations and allocates 96 fewer bytes each time.
Of course, the heavy lifting here is supposed to be done by BLAS, not by Julia code, so the question is where the difference comes from in the first place.
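One way to start narrowing this down might be to check which methods each array type actually dispatches to. The following is only a sketch, not something I ran as part of the benchmark above: it assumes InteractiveUtils' `@which` is enough to see the entry point, and the names `a`, `f`, `m_array` and `m_fsa` and the size `n = 4` are made up for illustration.

```julia
using InteractiveUtils, LinearAlgebra, FixedSizeArrays

n = 4                      # arbitrary size, chosen only for illustration
a = rand(Float32, n, n)    # plain Array operand
f = FixedSizeArray(a)      # FixedSizeArray operand, built as in the benchmark script

# Method selected for the three-argument product, for each array type.
m_array = @which a * a * a
m_fsa = @which f * f * f
println("Array:          ", m_array)
println("FixedSizeArray: ", m_fsa)

# Same check for a single two-argument product.
println("Array:          ", @which a * a)
println("FixedSizeArray: ", @which f * f)
```

If both types land on the same generic methods, the difference has to come from further down the call chain (for example, from whichever in-place multiplication routine is eventually selected), so this only shows where to look next rather than answering the question.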