Are the write tests here measuring write time correctly? #3
Description
Heya,
I've been trying to replicate some of these benchmarks on some real world data we have and I'm finding some pretty different results. I've forked and modified the bench code pretty heavily to reflect our use cases a bit more closely, but when I was doing that I noticed this snippet:
```python
if len(dimensions) == 1:
    t3 = time.perf_counter()
    dataset[:dimensions[0]] = data
elif len(dimensions) == 2:
    t3 = time.perf_counter()
    dataset[:dimensions[0], :dimensions[1]] = data
else:
    t3 = time.perf_counter()
    dataset[:dimensions[0], :dimensions[1], :dimensions[2]] = data
t4 = time.perf_counter()

# Add up the times taken to get the total time taken to create and write all datasets
dataset_creation_time += (t2 - t1)
dataset_population_time += (t4 - t3)
```
coming from the write benchmark.
This looks to me like you're just measuring the time to write into a buffer, rather than the time to actually write the files to disk. In a real usage scenario, I'm pretty sure disk IO rather than filling a buffer will dominate write time, so I don't think this is necessarily benchmarking exactly what you were hoping to measure?
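To make the distinction concrete, here's a minimal sketch of the two timings. I'm using plain file I/O from the standard library rather than h5py so it's self-contained; with h5py the analogous step would be something like calling `dataset.file.flush()` (and an OS-level fsync) before taking the second timestamp. The sizes and variable names here are mine, not from the benchmark code:

```python
import os
import tempfile
import time

# 8 MiB of zeros as stand-in payload (arbitrary size chosen for illustration)
data = b"\0" * (8 * 1024 * 1024)

with tempfile.NamedTemporaryFile(delete=False) as f:
    t0 = time.perf_counter()
    f.write(data)          # typically lands in userspace/OS buffers, not on disk
    t1 = time.perf_counter()
    f.flush()              # push Python's userspace buffer to the kernel
    os.fsync(f.fileno())   # force the kernel to commit the data to the device
    t2 = time.perf_counter()
    path = f.name

buffered = t1 - t0   # what the snippet above measures (buffer fill)
synced = t2 - t1     # the extra cost of actually reaching the disk
print(f"buffered write: {buffered:.6f}s, flush+fsync: {synced:.6f}s")

os.remove(path)
```

On most systems the fsync'd portion dominates for large writes, which is the gap I suspect the current benchmark isn't capturing.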
For example, our loads and writes look something like this:
which is pretty different to the results you guys found - at least at first glance!
(Ignore the high outlier for the HDF5 read - I'm pretty sure that's related to FS block caching from importing the code used to read the files off disk.)
Thanks so much for doing this work - it's something I'd never thought about much until I read the paper and it's definitely got me thinking about serialisation more deeply. I'm on my laptop right now, but when I get the chance I'll link my fork of the benchmarks too.