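HDF5 (Hierarchical Data Format version 5) is a hierarchical file format: groups can contain other groups, so its structure is recursive. It is used to store, organize, and optionally compress large amounts of data, especially for research.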
- Following Beej's guide: https://beej.us/guide/bgc/html/
- Learn the structure of HDF5 file formats
- Parse both UTF-8 and ASCII
- Use recursion in C to parse HDF5
- Implement a B+ Tree to index table objects
- Learn how to write unit tests in C
- Learn how to benchmark my own custom parsers against the HDF5 C API
HDF5 files
According to Wikipedia, an HDF5 file contains two primary kinds of objects:
- Datasets, which are typed multidimensional arrays
- Groups, which are container structures that can hold datasets and other groups
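The group/dataset relationship above maps naturally onto a recursive tree in C. The following is a minimal sketch, not the on-disk HDF5 layout: the names `h5_node`, `node_new`, `node_add_child`, `count_datasets`, and `node_free` are hypothetical and chosen only for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory model: a node is either a group or a dataset. */
typedef enum { NODE_GROUP, NODE_DATASET } node_kind;

typedef struct h5_node {
    node_kind kind;
    char *name;
    struct h5_node **children; /* used by groups only */
    size_t child_count;
} h5_node;

/* Allocate a node with a private copy of its name; NULL on failure. */
h5_node *node_new(node_kind kind, const char *name)
{
    h5_node *n = calloc(1, sizeof *n);
    if (!n) return NULL;
    n->kind = kind;
    n->name = malloc(strlen(name) + 1);
    if (!n->name) { free(n); return NULL; }
    strcpy(n->name, name);
    return n;
}

/* Attach child to a group. Returns 0 on success, -1 if parent is not a group
 * or if memory runs out. */
int node_add_child(h5_node *parent, h5_node *child)
{
    if (!parent || !child || parent->kind != NODE_GROUP) return -1;
    h5_node **grown = realloc(parent->children,
                              (parent->child_count + 1) * sizeof *parent->children);
    if (!grown) return -1;
    grown[parent->child_count++] = child;
    parent->children = grown;
    return 0;
}

/* Recursively count the datasets reachable from n. */
size_t count_datasets(const h5_node *n)
{
    if (!n) return 0;
    if (n->kind == NODE_DATASET) return 1;
    size_t total = 0;
    for (size_t i = 0; i < n->child_count; i++)
        total += count_datasets(n->children[i]);
    return total;
}

/* Free a node and everything below it. */
void node_free(h5_node *n)
{
    if (!n) return;
    for (size_t i = 0; i < n->child_count; i++)
        node_free(n->children[i]);
    free(n->children);
    free(n->name);
    free(n);
}
```

Using one struct with a `kind` tag keeps the traversal code uniform; a dataset is simply a node that never gets children.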
Components
- Superblock
- Object headers
- Groups + datasets (basic)
- B-tree (symbol table or chunk index)
General setup for tools:
- Use Python `h5py` to generate minimal files
- Use `h5dump` to view these files for debugging purposes
The superblock header should include:
- Information about the HDF5 file version (the superblock version number)
- A fixed 8-byte format signature at its start: `\x89HDF\r\n\x1a\n`
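A signature check could look like the sketch below. It assumes the superblock sits at offset 0 (files with a user block place it at a later offset, which this sketch does not search for) and that the byte right after the signature holds the superblock version. The function names `has_hdf5_signature` and `read_superblock_version` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define HDF5_SIG_LEN 8

/* The 8-byte HDF5 format signature: \x89 H D F \r \n \x1a \n */
static const unsigned char HDF5_SIGNATURE[HDF5_SIG_LEN] =
    { 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n' };

/* Returns 1 if buf starts with the HDF5 signature, 0 otherwise. */
int has_hdf5_signature(const unsigned char *buf, size_t len)
{
    return buf != NULL && len >= HDF5_SIG_LEN &&
           memcmp(buf, HDF5_SIGNATURE, HDF5_SIG_LEN) == 0;
}

/* Reads the signature and the version byte from offset 0 of the file.
 * Returns the superblock version (>= 0), or -1 if the file cannot be
 * opened, is too short, or does not start with the signature. */
int read_superblock_version(const char *path)
{
    unsigned char head[HDF5_SIG_LEN + 1];
    FILE *fp = fopen(path, "rb"); /* binary mode: no newline translation */
    if (!fp) return -1;
    size_t got = fread(head, 1, sizeof head, fp);
    fclose(fp);
    if (got != sizeof head || !has_hdf5_signature(head, got)) return -1;
    return head[HDF5_SIG_LEN];
}
```

Opening with `"rb"` matters here because the signature deliberately contains `\r\n` and `\x1a`, which text-mode reads can alter on some platforms.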
ASCII + UTF-8
- Read files in C in several ways
- Read files as raw bytes
- Read files as ASCII
- Read files as UTF-8
- Detect on the fly whether a file is ASCII or UTF-8 encoded
- Read dynamically
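One way to tell ASCII from UTF-8 on the fly is to scan the bytes once: if every byte is below 0x80 the data is ASCII, and if the remaining bytes form well-formed multi-byte sequences it is UTF-8. The sketch below is one possible approach, not necessarily the one this project uses; `encoding_t` and `classify_encoding` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum { ENC_ASCII, ENC_UTF8, ENC_OTHER } encoding_t;

/* Classify a buffer as pure 7-bit ASCII, valid UTF-8 containing at least
 * one multi-byte sequence, or neither. */
encoding_t classify_encoding(const unsigned char *buf, size_t len)
{
    int saw_multibyte = 0;
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;            /* continuation bytes expected */
        unsigned char lo = 0x80; /* allowed range for the first */
        unsigned char hi = 0xBF; /* continuation byte           */

        if (b < 0x80) { i++; continue; }  /* plain ASCII byte */

        if (b >= 0xC2 && b <= 0xDF) {
            extra = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            extra = 2;
            if (b == 0xE0) lo = 0xA0;     /* reject overlong forms */
            if (b == 0xED) hi = 0x9F;     /* reject UTF-16 surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            extra = 3;
            if (b == 0xF0) lo = 0x90;     /* reject overlong forms */
            if (b == 0xF4) hi = 0x8F;     /* stay within U+10FFFF */
        } else {
            return ENC_OTHER;             /* invalid lead byte */
        }

        if (len - i <= extra) return ENC_OTHER;  /* truncated sequence */
        if (buf[i + 1] < lo || buf[i + 1] > hi) return ENC_OTHER;
        for (size_t k = 2; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80) return ENC_OTHER;

        saw_multibyte = 1;
        i += extra + 1;
    }
    return saw_multibyte ? ENC_UTF8 : ENC_ASCII;
}
```

Since ASCII is a strict subset of UTF-8, an ASCII result also means the data can be safely read as UTF-8.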
gcc -o parser main.c errorUtils.c
gcc -g playground.c -o playground.exe
MVP of HDF5 parser:
- Read File as Superblock
- Leaves and branches OOP system
- Read object headers, attributes etc
- Navigate file system
- Fork this repository
- Generate and/or download example h5 data
- Generate h5 data through python
- Example complex h5 dataset taken from @sushanttwayana on Kaggle
- Implement the subgoals
- Benchmark my parser against the official HDF5 C API
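For the benchmarking step, a small timing harness is enough to compare two implementations. The sketch below uses the standard `clock()` function, which measures CPU time rather than wall-clock time. `bench_fn`, `bench_seconds_per_call`, and `dummy_workload` are hypothetical names; in practice one callback would wrap the custom parser and another would wrap the equivalent HDF5 C API calls.

```c
#include <time.h>

/* Hypothetical callback type: any workload that takes one context pointer. */
typedef void (*bench_fn)(void *ctx);

/* Run fn(ctx) `iterations` times and return the average CPU seconds per
 * call, or -1.0 if the arguments are invalid. */
double bench_seconds_per_call(bench_fn fn, void *ctx, int iterations)
{
    if (!fn || iterations <= 0) return -1.0;
    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        fn(ctx);
    clock_t end = clock();
    return ((double)(end - start) / CLOCKS_PER_SEC) / iterations;
}

/* Stand-in workload for illustration only: adds 0..999 into *ctx.
 * Replace with a wrapper around the parser being measured. */
void dummy_workload(void *ctx)
{
    volatile unsigned long *acc = ctx;
    for (unsigned long i = 0; i < 1000; i++)
        *acc += i;
}
```

Running both wrappers with the same file and the same iteration count gives two per-call averages that can be compared directly.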
How to run the C code for parsing:
- Compile the C program
- Requires a compiler to be installed, e.g. gcc
gcc main.c -o main
- Run the compiled executable with a command-line argument