Skip to content

jokasimr/h5db

Repository files navigation

h5db: HDF5 Extension for DuckDB

h5db lets DuckDB query HDF5 files directly with SQL. It is aimed at analytics-style access to HDF5 data: inspect file structure, read datasets as columns, read attributes, and work with remote files through DuckDB's filesystem stack or SFTP.

Highlights

  • Reads local or remote (https://, s3://, sftp://, ...) HDF5 files directly from SQL.
  • Maps numeric datasets, string datasets, and 1D-4D array datasets into DuckDB types.
  • Multiple datasets can be stacked horizontally to make a table.
  • Scalar datasets are treated as constant columns.
  • Supports projection pushdown in h5_read(...).
  • Supports row-range predicate pushdown for h5_index() and run-start encoded columns.
  • Table-valued h5_tree(...), h5_ls(...), h5_read(...), and h5_attributes(...) accept single files, glob patterns, or VARCHAR[] filename inputs.
  • h5_tree(...) and table-valued h5_ls(...) can prune glob/list inputs from selective filename filters before opening files.
  • Supports reading HDF5 attributes on objects and the file root.
  • Supports path-complete namespace listing with h5_tree(...).
  • Supports shallow group listing with table and scalar h5_ls(...).
  • Supports projecting selected HDF5 attributes as extra columns in h5_tree(...) and h5_ls(...).

Core Functions

  • h5_read(filename_or_filenames, datasets_or_definitions...) Reads one or more datasets as DuckDB columns. Supports regular datasets, special column encodings such as "run start encoded" columns (see h5_rse()), and virtual index columns (see h5_index()).
  • h5_tree(filename_or_filenames, projected_attributes...) Recursively lists namespace entries with path, type, dtype, and shape. Output is path-oriented: if multiple paths resolve to the same object, each path appears as its own row.
  • h5_ls(filename_or_filenames[, group_path], projected_attributes...) Lists only the immediate children of a group. The table form returns the same row shape as h5_tree; the scalar form returns a MAP(VARCHAR, STRUCT(...)) keyed by child name.
  • h5_attributes(filename_or_filenames, object_path) Reads attributes from an object or the file root. Multi-file reads return one wide row per file and require the same attribute names, types, and order in every matched file.

For a practical guide to the main workflows, see docs/USER_GUIDE.md. For the full API, see docs/API.md.

Quick Start

In DuckDB, install and load the extension:

INSTALL h5db FROM community;
LOAD h5db;

Then run queries such as:

-- Read one dataset
FROM h5_read('data.h5', '/measurements');

-- Read multiple datasets side by side
FROM h5_read('data.h5', '/timestamps', '/temperatures');

-- Add a virtual row index
FROM h5_read('data.h5', h5_index(), '/measurements');

-- Read a remote file
FROM h5_read('https://example.com/data.h5', '/dataset_name');

-- Read matching local files
FROM h5_read('runs/run_*.h5', '/counts');

-- Expand an explicit list of exact files and/or patterns in order
FROM h5_tree([
    'runs/calibration.h5',
    'runs/run_*.h5'
]);

-- Inspect the file structure
FROM h5_tree('data.h5');

-- Inspect the file structure and project selected attributes
SELECT path, type, NX_class
FROM h5_tree(
    'data.h5',
    h5_attr('NX_class')
);

-- Project all attributes as one map-valued column
SELECT path, h5_attr
FROM h5_tree(
    'data.h5',
    h5_attr()
);

-- List only the root group's immediate children
FROM h5_ls('data.h5');

-- List the immediate children of a specific group
FROM h5_ls('data.h5', '/entry/instrument');

-- Return a map of immediate children keyed by child name
SELECT h5_ls('data.h5', '/entry/instrument');

-- Read attributes
FROM h5_attributes('data.h5', '/measurements');

-- Read run-start encoded data
FROM h5_read(
    'experiment.h5',
    '/timestamp',
    h5_rse('/state_run_starts', '/state_values')
);

-- Rename a column definition
FROM h5_read(
    'data.h5',
    h5_alias('idx', h5_index()),
    '/measurements'
);

-- Rename a projected h5_tree attribute column
SELECT path, time
FROM h5_tree(
    'data.h5',
    h5_alias('time', h5_attr('count_time'))
);

Remote Access

All file-opening h5db functions accept local paths or remote URLs as filename.

The table-valued h5_tree(...), h5_ls(...), h5_read(...), and h5_attributes(...) also accept:

  • a glob pattern
  • a VARCHAR[] of exact filenames/URLs and/or glob patterns

Multi-file semantics:

  • a single glob expands to lexicographically sorted matches
  • list entries are expanded left-to-right
  • duplicate files are preserved and are processed more than once
  • h5_read(...) concatenates rows file by file
  • h5_attributes(...) returns one row per matched file
  • h5_index() is the outermost-dimension row index within each matched file, so it starts at 0 for each file
  • table h5_tree(...), table h5_ls(...), h5_read(...), and h5_attributes(...) expose a hidden virtual filename column; refer to it explicitly when you need file provenance
  • h5_read(...) requires compatible column definitions across all matched files
  • h5_attributes(...) requires the same attribute names, types, and order across all matched files
  • scalar h5_ls(...) accepts one filename/URL expression per row and does not expand lists or glob patterns
  • output column names are checked using DuckDB's case-insensitive identifier matching; use h5_alias(...) when two datasets or projected attributes would otherwise produce the same name

For local paths and DuckDB-backed remote schemes, glob expansion uses DuckDB's filesystem stack. For sftp:// URLs, glob expansion is handled by h5db's SFTP backend. In both cases, h5db follows DuckDB's other multi-file reader semantics such as read_parquet(...). In particular, recursive ** does not traverse symlink directories. Globs support the usual *, ?, and bracket classes, plus one recursive ** segment per pattern. A pattern that matches no files raises an error.

When you filter h5_tree(...) or table-valued h5_ls(...) by the virtual filename column, h5db can apply that filter before opening each expanded file:

SELECT path, type
FROM h5_tree('runs/**/*.h5')
WHERE filename LIKE '%/run_042.h5';

SELECT filename, path
FROM h5_ls('runs/run_*.h5', '/entry')
WHERE filename IN ('runs/run_001.h5', 'runs/run_002.h5');

This pruning applies to filename-only predicates and filename predicates inside AND filters that DuckDB pushes into the table function. Dynamic predicates from joins, semi-joins, or scalar subqueries may still open all expanded files. h5_read(...) and h5_attributes(...) do not currently use filename filters to avoid opening files.

When several files contain the same HDF5 path, use filename to distinguish them:

SELECT filename, path
FROM h5_tree('runs/**/*.h5')
WHERE path = '/entry/data';

SELECT filename, *
FROM h5_read('runs/run_*.h5', '/entry/data');
  • DuckDB-backed remote schemes such as http://, https://, s3://, s3a://, s3n://, r2://, gcs://, gs://, and hf:// use DuckDB's filesystem stack and benefit from DuckDB's remote caching features.
  • sftp:// URLs are supported through h5db's built-in SFTP backend.
  • Remote opens are treated as immutable snapshots. swmr := true is accepted for API consistency, but remote paths do not use H5F_ACC_SWMR_READ.

To read over SFTP, create a DuckDB secret of type sftp whose scope matches the URL you want to access.

Available authentication methods are:

  • USE_AGENT true
  • PASSWORD '...'
  • KEY_PATH '...' with optional KEY_PASSPHRASE '...'

Recommended default: if you already use an SSH agent / OS keychain integration, prefer USE_AGENT true. It avoids storing the SSH password or private-key passphrase in the DuckDB secret. Password and explicit key-file auth are also supported.

Example:

CREATE OR REPLACE SECRET beamline_sftp (
    TYPE sftp,
    SCOPE 'sftp://beamline.example.org/',
    USERNAME 'alice',
    USE_AGENT true,
    KNOWN_HOSTS_PATH '/home/alice/.ssh/known_hosts'
);

Then query the file normally:

FROM h5_read(
    'sftp://beamline.example.org/data/run001.h5',
    '/entry/data'
);

FROM h5_read(
    'sftp://beamline.example.org/data/run_*.h5',
    '/entry/data'
);

SFTP secrets require:

  • USERNAME
  • exactly one of PASSWORD, KEY_PATH, or USE_AGENT
  • at least one of KNOWN_HOSTS_PATH or HOST_KEY_FINGERPRINT

Optional SFTP secret fields:

  • KEY_PASSPHRASE
  • PORT (default 22)
  • HOST_KEY_ALGORITHMS

For key-based auth, replace PASSWORD with KEY_PATH and optionally KEY_PASSPHRASE. For agent-based auth, use USE_AGENT true. On Unix-like systems libssh2 resolves the agent through SSH_AUTH_SOCK; on Windows it uses the supported agent backends available through libssh2. If you use HOST_KEY_FINGERPRINT, provide the lowercase hex SHA1 host-key fingerprint. See docs/API.md for the full SFTP secret reference.

Build From Source

Prerequisites

  • vcpkg
  • VCPKG_TOOLCHAIN_PATH pointing to vcpkg/scripts/buildsystems/vcpkg.cmake
  • Git submodules initialized

Quick Build

# 1. Install vcpkg (one-time setup, outside this repo)
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
export VCPKG_TOOLCHAIN_PATH="$(pwd)/scripts/buildsystems/vcpkg.cmake"
cd ..

# 2. Clone and build h5db
git clone https://github.com/jokasimr/h5db.git
cd h5db
git submodule update --init --recursive
make -j8

If you prefer not to export VCPKG_TOOLCHAIN_PATH in your shell, put it in a repo-root .env file instead. See docs/developer/DEVELOPER.md for the full setup and troubleshooting guide.

For contributor workflows that run make test or the SFTP interaction harness, set up and activate the repo venv first:

./scripts/setup-dev-env.sh
source venv/bin/activate
make -j8

Build Outputs

  • ./build/release/duckdb DuckDB shell with h5db loaded.
  • ./build/release/test/unittest SQLLogicTest runner.
  • ./build/release/extension/h5db/h5db.duckdb_extension Loadable extension artifact.

Current Limitations

  • Compound, enum, reference, opaque, bitfield, time-like, and non-string variable-length HDF5 types are not supported.
  • Datasets with more than 4 dimensions are not supported.
  • Multi-dimensional string datasets are not supported.
  • Attribute multidimensional dataspaces are not supported.

See docs/API.md for full type-mapping details and error behavior.

Testing

# Full suite: local tests + rewritten remote HTTP/SFTP suites
make test

# Local SQLLogicTests only
./build/release/test/unittest "test/sql/*" "~test/sql/remote/*"

# Rewritten remote URL suite via the local range-capable HTTP server
make test_remote_http

# Rewritten remote URL suite via the local SFTP server + interaction harness
make test_remote_sftp

Notes:

  • make test ensures missing HDF5 fixtures exist before running tests.
  • make test runs both remote harnesses.

For targeted test runs, test-data generation, and debugging workflows, see docs/developer/DEVELOPER.md and test/README.md.

Documentation

About

Duckdb extension for reading HDF5 files.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Generated from duckdb/extension-template