
Commit 5d8161f

Initial Commit
0 parents  commit 5d8161f

14 files changed

Lines changed: 432 additions & 0 deletions


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
/venv/
/.idea/

README.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# File Format Comparison Benchmark

Scientific data is often stored in files because of the simplicity they provide in managing, transferring, and sharing
data. These files are typically structured in a specific arrangement and contain metadata describing how the data is
laid out. Numerous file formats in use across scientific domains provide abstractions for storing and retrieving data.
With the abundance of formats aiming to store large amounts of scientific data quickly and easily, a question that
arises is, "Which scientific file format is best for a general use case?" In this study, we compiled a set of benchmarks
for common file operations, i.e., create, open, read, write, and close, and used the results of these benchmarks to
compare three popular formats: `HDF5`, `netCDF4`, and `Zarr`.

## Benchmark Overview

This benchmark compares the time taken to create a dataset, write data to a dataset, and finally open that dataset at a
later time and read its contents. This can be categorized into two types of operations: the writing operation and the
reading operation.

Additionally, this benchmark uses a configuration-based system in which the user can specify testing parameters, such as
the number of datasets to create within the file and the dimensions of the array written to each dataset, by editing a
YAML configuration file.
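
For reference, such a configuration file could be loaded with PyYAML along these lines. This is a minimal sketch: the
file path is illustrative and the way `runner.py` actually consumes the values is an assumption; only the field names
come from the configuration files added in this commit.

```python
# Minimal sketch of loading one benchmark configuration file with PyYAML.
# The path below is illustrative; only the field names match the .yaml files.
import yaml

with open('datasets_test/configuration_files/2048_Datasets.yaml') as f:
    config = yaml.safe_load(f)

file_name = config['FILE_NAME']           # e.g. '2048_Datasets'
num_datasets = config['NUMBER_DATASETS']  # e.g. 2048
dimensions = config['NUMBER_ELEMENTS']    # e.g. [256] for a 1-D array

print(file_name, num_datasets, dimensions)
```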

After the benchmark is done, the program stores the times taken across multiple trials in a CSV file and plots the
data with [matplotlib.pyplot](https://github.com/matplotlib/matplotlib) to allow the user to make a direct
comparison between the file formats being tested.

## How to Run

1. Install the requirements found in the `requirements.txt` file.
2. Run the `runner.py` file. If no configuration files are found in the `datasets_test/configuration_files/` directory,
   a configuration file will be generated. Otherwise, the benchmark will be run with all `.yaml` configuration files
   found in the directory. The benchmark will test each file format 5 times, but this can be modified by changing the
   `num_trials` variable in the `runner.py` file.

Note: Both the CSV files and the plots can be found under the generated `datasets_test/data/` folder after the benchmark
is run.

datasets_test/README.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
# Benchmark Operations

This benchmark consists of two main operations, both of which are discussed below.

## Write Benchmark

The write operation is the first operation tested in the benchmark. It creates files with the filename specified in the
configuration file and the extensions `.hdf5` for HDF5 files, `.netc` for netCDF4 files, and `.zarr` for Zarr files. Each
file is placed inside a folder named `files/` to help reduce clutter in the working directory.

Taking information from the configuration file, a sample data array is generated with the specified dimensions and
length. The program then creates a dataset within the file and writes the sample data array to it. This process of
generating a sample data array, creating a dataset, and populating it with the values from the sample data array is
repeated until the benchmark has created the number of datasets specified by the configuration file.

After the file is populated with data, the benchmark copies the file to a directory named `files_read/` and renames the
file to avoid any caching effects that may interfere with the read times.

Finally, the time taken to create all the datasets and the time taken to populate them with data are each divided by the
number of datasets to find the average time taken to create and to populate one dataset. Both of these times are then
returned to the main program, where they are written to the CSV output file.
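
As a rough illustration, the HDF5 variant of this write step could look like the sketch below. This is a minimal sketch,
not the benchmark's own code: it assumes `h5py` is among the requirements, the file and dataset names are illustrative,
it times dataset creation and writing together rather than separately, and the netCDF4 and Zarr variants would follow
the same pattern with their respective libraries.

```python
# Minimal sketch of the HDF5 write step: create the file, then repeatedly
# create a dataset and fill it with the generated sample array, timing the loop.
# Paths, names, and parameter values here are illustrative.
import os
import time

import h5py
import numpy as np

num_datasets = 2048          # NUMBER_DATASETS from the configuration file
dimensions = [256]           # NUMBER_ELEMENTS from the configuration file

os.makedirs('files', exist_ok=True)
sample_data = np.random.rand(*dimensions)   # sample data array

start = time.perf_counter()
with h5py.File('files/benchmark.hdf5', 'w') as f:
    for i in range(num_datasets):
        f.create_dataset(f'dataset_{i}', data=sample_data)
elapsed = time.perf_counter() - start

print(f'average create + write time per dataset: {elapsed / num_datasets * 1000:.5f} ms')
```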

## Read Benchmark

The benchmark now opens the copied file in the `files_read/` directory and begins testing the read operations of the
three file formats.

This operation consists of opening each dataset within the file and printing its contents to the standard output. The
time taken to open all the datasets and the time taken to read from all the datasets are once again divided by the
number of datasets within the file to find the average time taken to open and read one dataset.

Both of these times are then returned to the main program, where they are also written to the CSV output file. This
process of running the write operation benchmark and the read operation benchmark is then repeated multiple times to
ensure the consistency of the data gathered.
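
A correspondingly minimal sketch of the HDF5 read step is shown below. Again, this is illustrative rather than the
benchmark's own code: the name of the copied file is not specified in this commit, `h5py` stands in for all three
formats, and opening and reading are timed together here rather than separately.

```python
# Minimal sketch of the HDF5 read step: open the copied file, visit every
# dataset, print its contents, and report the average time per dataset.
# The file name under files_read/ is a placeholder.
import time

import h5py

start = time.perf_counter()
with h5py.File('files_read/benchmark_read.hdf5', 'r') as f:
    names = list(f.keys())
    for name in names:
        print(f[name][...])          # read the dataset contents to stdout
elapsed = time.perf_counter() - start

print(f'average open + read time per dataset: {elapsed / len(names) * 1000:.5f} ms')
```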

Finally, the data from the CSV file is averaged with [pandas](https://github.com/pandas-dev/pandas) and plotted
with [matplotlib.pyplot](https://github.com/matplotlib/matplotlib) to show a direct comparison between the file formats
being tested in a given operation.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
FILE_NAME: 2048_Vector
NUMBER_DATASETS: 2048
NUMBER_ELEMENTS:
  - 128
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
FILE_NAME: 2048_Matrix
NUMBER_DATASETS: 2048
NUMBER_ELEMENTS:
  - 128
  - 128
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
FILE_NAME: 2048_Tensor
NUMBER_DATASETS: 2048
NUMBER_ELEMENTS:
  - 128
  - 128
  - 128
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
FILE_NAME: 2048_Datasets
NUMBER_DATASETS: 2048
NUMBER_ELEMENTS:
  - 256
CHUNK_SIZE: 0
MIN_DATA_VALUE: 1
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
FILE_NAME: 4096_Datasets
NUMBER_DATASETS: 4096
NUMBER_ELEMENTS:
  - 256
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
FILE_NAME: 8192_Datasets
NUMBER_DATASETS: 8192
NUMBER_ELEMENTS:
  - 256

datasets_test/plot.py

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot(file_formats, num_datasets, dimensions):
    # Generate two plots - one for the read / write times and one for the dataset create / open times
    if not os.path.exists('datasets_test/data/plots'):
        os.mkdir('datasets_test/data/plots')
    create_time, write_time, open_time, read_time, error = process_csv(file_formats, num_datasets, dimensions)
    width = .25

    plt.figure(1)
    plt_labels = ['Dataset Read Time', 'Dataset Write Time']
    x = np.arange(len(plt_labels))
    offset = -width
    plt.ylabel('Time (ms)')
    plt.title(f'{num_datasets} Datasets {dimensions} Elements Dataset Read / Write Times')
    plt.xticks(x, plt_labels)
    for i in range(len(file_formats)):
        # Round to 5 decimal places so the data shows nicely
        read_time_rounded = round(read_time[i], 5)
        write_time_rounded = round(write_time[i], 5)
        read_error = error[i][3]
        write_error = error[i][1]
        bar_read_write = plt.bar(x=x + offset, height=[read_time_rounded, write_time_rounded], width=width,
                                 label=file_formats[i], edgecolor='black', yerr=[read_error, write_error])
        plt.bar_label(bar_read_write, padding=3)
        offset += width
    plt.legend()
    plt.tight_layout()
    plt.savefig(f'datasets_test/data/plots/{num_datasets}_{dimensions}_read_write.png')
    # plt.show()
    plt.cla()
    plt.clf()

    plt.figure(2)
    plt_labels = ['Dataset Create Time', 'Dataset Open Time']
    x = np.arange(len(plt_labels))
    offset = -width
    plt.ylabel('Time (ms)')
    plt.title(f'{num_datasets} Datasets {dimensions} Elements Dataset Create / Open Times')
    plt.xticks(x, plt_labels)
    for i in range(len(file_formats)):
        # Round to 5 decimal places, so that it displays nicely on the plot.
        create_time_rounded = round(create_time[i], 5)
        open_time_rounded = round(open_time[i], 5)
        create_error = error[i][0]
        open_error = error[i][2]
        bar_create_open = plt.bar(x=x + offset, height=[create_time_rounded, open_time_rounded], width=width,
                                  label=file_formats[i], edgecolor='black', yerr=[create_error, open_error])
        plt.bar_label(bar_create_open, padding=3)
        offset += width
    plt.legend()
    plt.tight_layout()
    plt.savefig(f'datasets_test/data/plots/{num_datasets}_{dimensions}_create_open.png')
    # plt.show()
    plt.cla()
    plt.clf()


def process_csv(file_formats, num_datasets, dimensions):
    # Calculate the average value in each column of the provided CSV file.
    # Append it to the file if not already appended.
    # Return these average times to be plotted.
    total_dataset_create_time = []
    total_dataset_write_time = []
    total_dataset_open_time = []
    total_dataset_read_time = []
    error = []
    for file_format in file_formats:
        df = pd.read_csv(f'datasets_test/data/{file_format}_{num_datasets}_{dimensions}.csv')
        dataset_create_time, dataset_write_time, dataset_open_time, dataset_read_time = df.iloc[:, 1:].mean(axis=0)
        create_deviation, write_deviation, open_deviation, read_deviation = df.iloc[:, 1:].std(axis=0)
        total_dataset_create_time.append(dataset_create_time)
        total_dataset_write_time.append(dataset_write_time)
        total_dataset_open_time.append(dataset_open_time)
        total_dataset_read_time.append(dataset_read_time)
        error.append([create_deviation, write_deviation, open_deviation, read_deviation])
        if df.iloc[-1, 0] == 'Average':
            # Go to the next iteration if the last row of the CSV file already holds the average times
            continue
        average_values = pd.DataFrame({
            file_format: 'Average',
            'Dataset Creation Time': [dataset_create_time],
            'Dataset Write Time': [dataset_write_time],
            'Dataset Open Time': [dataset_open_time],
            'Dataset Read Time': [dataset_read_time]
        })
        df = pd.concat([df, average_values], ignore_index=True)
        df.to_csv(f'datasets_test/data/{file_format}_{num_datasets}_{dimensions}.csv', index=False)
    return total_dataset_create_time, total_dataset_write_time, total_dataset_open_time, total_dataset_read_time, error
