
Commit 5d8161f

Initial Commit
0 parents  commit 5d8161f

14 files changed

Lines changed: 432 additions & 0 deletions


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
/venv/
/.idea/

README.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# File Format Comparison Benchmark

Scientific data is often stored in files because of the simplicity they provide in managing, transferring, and sharing
data. These files are typically structured in a specific arrangement and contain metadata describing how the data is
laid out. Numerous file formats in use across scientific domains provide abstractions for storing and retrieving data.
With the abundance of formats aiming to store large amounts of scientific data quickly and easily, a question that
arises is, "Which scientific file format is best for a general use case?" In this study, we compiled a set of benchmarks
for common file operations, i.e., create, open, read, write, and close, and used the results of these benchmarks to
compare three popular formats: `HDF5`, `netCDF4`, and `Zarr`.

## Benchmark Overview

This benchmark compares the time taken to create a dataset, write data to a dataset, and finally open that dataset at a
later time and read its contents. This can be categorized into two types of operations: the writing operation and the
reading operation.

Additionally, this benchmark uses a configuration-based system in which the user can specify testing parameters, such as
the number of datasets to create within the file and the dimensions of the array written to each dataset, by editing a
YAML configuration file.
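
For reference, such a configuration file could be loaded with PyYAML along these lines. This is a minimal sketch: the
file path is illustrative and the way `runner.py` actually consumes the values is an assumption; only the field names
come from the configuration files added in this commit.

```python
# Minimal sketch of loading one benchmark configuration file with PyYAML.
# The path below is illustrative; only the field names match the .yaml files.
import yaml

with open('datasets_test/configuration_files/2048_Datasets.yaml') as f:
    config = yaml.safe_load(f)

file_name = config['FILE_NAME']           # e.g. '2048_Datasets'
num_datasets = config['NUMBER_DATASETS']  # e.g. 2048
dimensions = config['NUMBER_ELEMENTS']    # e.g. [256] for a 1-D array

print(file_name, num_datasets, dimensions)
```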

After the benchmark is done, the program stores the times taken across multiple trials in a CSV file and plots the
data with [matplotlib.pyplot](https://github.com/matplotlib/matplotlib) to allow the user to make a direct
comparison between the file formats being tested.

## How to Run

1. Install the requirements found in the `requirements.txt` file.
2. Run the `runner.py` file. If no configuration files are found in the `datasets_test/configuration_files/` directory,
   a configuration file will be generated. Otherwise, the benchmark will be run with all `.yaml` configuration files
   found in the directory. The benchmark will test each file format 5 times, but this can be modified by changing the
   `num_trials` variable in the `runner.py` file.

Note: Both the CSV files and the plots can be found under the generated `datasets_test/data/` folder after the benchmark
is run.

datasets_test/README.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
# Benchmark Operations

This benchmark consists of two main operations, both of which are discussed below.

## Write Benchmark

The write operation is the first operation tested in the benchmark. It creates files with the filename specified in the
configuration file and the extensions `.hdf5` for HDF5 files, `.netc` for netCDF4 files, and `.zarr` for Zarr files. Each
file is placed inside a folder named `files/` to help reduce clutter in the working directory.

Taking information from the configuration file, a sample data array is generated with the specified dimensions and
length. The program then creates a dataset within the file and writes the sample data array to it. This process of
generating a sample data array, creating a dataset, and populating it with the values from the sample data array is
repeated until the benchmark has created the number of datasets specified by the configuration file.

After the file is populated with data, the benchmark copies the file to a directory named `files_read/` and renames the
file to avoid any caching effects that may interfere with the read times.

Finally, the time taken to create all the datasets and the time taken to populate them with data are each divided by the
number of datasets to find the average time taken to create and to populate one dataset. Both of these times are then
returned to the main program, where they are written to the CSV output file.
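
As a rough illustration, the HDF5 variant of this write step could look like the sketch below. This is a minimal sketch,
not the benchmark's own code: it assumes `h5py` is among the requirements, the file and dataset names are illustrative,
it times dataset creation and writing together rather than separately, and the netCDF4 and Zarr variants would follow
the same pattern with their respective libraries.

```python
# Minimal sketch of the HDF5 write step: create the file, then repeatedly
# create a dataset and fill it with the generated sample array, timing the loop.
# Paths, names, and parameter values here are illustrative.
import os
import time

import h5py
import numpy as np

num_datasets = 2048          # NUMBER_DATASETS from the configuration file
dimensions = [256]           # NUMBER_ELEMENTS from the configuration file

os.makedirs('files', exist_ok=True)
sample_data = np.random.rand(*dimensions)   # sample data array

start = time.perf_counter()
with h5py.File('files/benchmark.hdf5', 'w') as f:
    for i in range(num_datasets):
        f.create_dataset(f'dataset_{i}', data=sample_data)
elapsed = time.perf_counter() - start

print(f'average create + write time per dataset: {elapsed / num_datasets * 1000:.5f} ms')
```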

## Read Benchmark

The benchmark now opens the copied file in the `files_read/` directory and begins testing the read operations of the
three file formats.

This operation consists of opening each dataset within the file and printing its contents to the standard output. The
time taken to open all the datasets and the time taken to read from all the datasets are once again divided by the
number of datasets within the file to find the average time taken to open and read one dataset.

Both of these times are then returned to the main program, where they are also written to the CSV output file. This
process of running the write operation benchmark and the read operation benchmark is then repeated multiple times to
ensure the consistency of the data gathered.
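
A correspondingly minimal sketch of the HDF5 read step is shown below. Again, this is illustrative rather than the
benchmark's own code: the name of the copied file is not specified in this commit, `h5py` stands in for all three
formats, and opening and reading are timed together here rather than separately.

```python
# Minimal sketch of the HDF5 read step: open the copied file, visit every
# dataset, print its contents, and report the average time per dataset.
# The file name under files_read/ is a placeholder.
import time

import h5py

start = time.perf_counter()
with h5py.File('files_read/benchmark_read.hdf5', 'r') as f:
    names = list(f.keys())
    for name in names:
        print(f[name][...])          # read the dataset contents to stdout
elapsed = time.perf_counter() - start

print(f'average open + read time per dataset: {elapsed / len(names) * 1000:.5f} ms')
```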

Finally, the data from the CSV file is averaged with [pandas](https://github.com/pandas-dev/pandas) and plotted
with [matplotlib.pyplot](https://github.com/matplotlib/matplotlib) to show a direct comparison between the file formats
being tested in a given operation.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
FILE_NAME: 2048_Vector
NUMBER_DATASETS: 2048
NUMBER_ELEMENTS:
  - 128
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
FILE_NAME: 2048_Matrix
NUMBER_DATASETS: 2048
NUMBER_ELEMENTS:
  - 128
  - 128
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
FILE_NAME: 2048_Tensor
NUMBER_DATASETS: 2048
NUMBER_ELEMENTS:
  - 128
  - 128
  - 128
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
FILE_NAME: 2048_Datasets
NUMBER_DATASETS: 2048
NUMBER_ELEMENTS:
  - 256
CHUNK_SIZE: 0
MIN_DATA_VALUE: 1
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
FILE_NAME: 4096_Datasets
NUMBER_DATASETS: 4096
NUMBER_ELEMENTS:
  - 256
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
FILE_NAME: 8192_Datasets
NUMBER_DATASETS: 8192
NUMBER_ELEMENTS:
  - 256

datasets_test/plot.py

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot(file_formats, num_datasets, dimensions):
    # Generate two plots - one for the read / write times and one for the dataset create / open times
    if not os.path.exists('datasets_test/data/plots'):
        os.mkdir('datasets_test/data/plots')
    create_time, write_time, open_time, read_time, error = process_csv(file_formats, num_datasets, dimensions)
    width = .25

    plt.figure(1)
    plt_labels = ['Dataset Read Time', 'Dataset Write Time']
    x = np.arange(len(plt_labels))
    offset = -width
    plt.ylabel('Time (ms)')
    plt.title(f'{num_datasets} Datasets {dimensions} Elements Dataset Read / Write Times')
    plt.xticks(x, plt_labels)
    for i in range(len(file_formats)):
        # Round to 5 decimal places so the data shows nicely
        read_time_rounded = round(read_time[i], 5)
        write_time_rounded = round(write_time[i], 5)
        read_error = error[i][3]
        write_error = error[i][1]
        bar_read_write = plt.bar(x=x + offset, height=[read_time_rounded, write_time_rounded], width=width,
                                 label=file_formats[i], edgecolor='black', yerr=[read_error, write_error])
        plt.bar_label(bar_read_write, padding=3)
        offset += width
    plt.legend()
    plt.tight_layout()
    plt.savefig(f'datasets_test/data/plots/{num_datasets}_{dimensions}_read_write.png')
    # plt.show()
    plt.cla()
    plt.clf()

    plt.figure(2)
    plt_labels = ['Dataset Create Time', 'Dataset Open Time']
    x = np.arange(len(plt_labels))
    offset = -width
    plt.ylabel('Time (ms)')
    plt.title(f'{num_datasets} Datasets {dimensions} Elements Dataset Create / Open Times')
    plt.xticks(x, plt_labels)
    for i in range(len(file_formats)):
        # Round to 5 decimal places, so that it displays nicely on the plot.
        create_time_rounded = round(create_time[i], 5)
        open_time_rounded = round(open_time[i], 5)
        create_error = error[i][0]
        open_error = error[i][2]
        bar_create_open = plt.bar(x=x + offset, height=[create_time_rounded, open_time_rounded], width=width,
                                  label=file_formats[i], edgecolor='black', yerr=[create_error, open_error])
        plt.bar_label(bar_create_open, padding=3)
        offset += width
    plt.legend()
    plt.tight_layout()
    plt.savefig(f'datasets_test/data/plots/{num_datasets}_{dimensions}_create_open.png')
    # plt.show()
    plt.cla()
    plt.clf()


def process_csv(file_formats, num_datasets, dimensions):
    # Calculate the average value in each column of the provided CSV file.
    # Append it to the file if not already appended.
    # Return these average times to be plotted.
    total_dataset_create_time = []
    total_dataset_write_time = []
    total_dataset_open_time = []
    total_dataset_read_time = []
    error = []
    for file_format in file_formats:
        df = pd.read_csv(f'datasets_test/data/{file_format}_{num_datasets}_{dimensions}.csv')
        dataset_create_time, dataset_write_time, dataset_open_time, dataset_read_time = df.iloc[:, 1:].mean(axis=0)
        create_deviation, write_deviation, open_deviation, read_deviation = df.iloc[:, 1:].std(axis=0)
        total_dataset_create_time.append(dataset_create_time)
        total_dataset_write_time.append(dataset_write_time)
        total_dataset_open_time.append(dataset_open_time)
        total_dataset_read_time.append(dataset_read_time)
        error.append([create_deviation, write_deviation, open_deviation, read_deviation])
        if df.iloc[-1, 0] == 'Average':
            # Go to the next iteration if the last row of the CSV file already holds the average times
            continue
        average_values = pd.DataFrame({
            file_format: 'Average',
            'Dataset Creation Time': [dataset_create_time],
            'Dataset Write Time': [dataset_write_time],
            'Dataset Open Time': [dataset_open_time],
            'Dataset Read Time': [dataset_read_time]
        })
        df = pd.concat([df, average_values], ignore_index=True)
        df.to_csv(f'datasets_test/data/{file_format}_{num_datasets}_{dimensions}.csv', index=False)
    return total_dataset_create_time, total_dataset_write_time, total_dataset_open_time, total_dataset_read_time, error
