
# DuckPond

*Badges: OpenSSF Scorecard, SLSA 3*

DuckPond is a query-native filesystem for time-series data, built on Apache Arrow, DataFusion, and Delta Lake. Every filesystem object can be queried with SQL, and SQL queries create new filesystem objects that appear as native files and directories.
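For instance, a session along these lines illustrates the idea (the file paths here are made up for illustration; `pond cat --sql` exposes the target file as a table named `source`):

```shell
# Illustrative sketch only: paths are assumptions, not from the repo.
# Guarded so the snippet is a no-op on machines without the pond binary.
if command -v pond >/dev/null 2>&1; then
    export POND=/tmp/demo-pond
    pond init
    # Import a local Parquet file, then query it in place with SQL.
    pond copy host:///tmp/readings.parquet /data/readings
    pond cat --sql "SELECT * FROM source LIMIT 5" /data/readings
    demo="ran"
else
    demo="skipped (pond not on PATH)"
fi
echo "demo: $demo"
```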

Built by the Caspar Water System.

*[Photo: Caspar Duck Pond]*

## Quick Start

```shell
# Build
make build

# Run unit tests
make test

# Initialize a pond and try it out
export POND=/tmp/mypond
pond init
pond mkdir /data
echo "hello" | pond copy - /data/greeting.txt
pond list '/**'
pond cat /data/greeting.txt
```

## Developer Guide

### Prerequisites

- Rust stable toolchain (see `rust-toolchain.toml`)
- Docker (for integration tests and site deployment)
- Node.js >= 22 (for browser tests and vendor download)

### Daily Workflow

```shell
make build          # Build pond binary (debug)
make test           # Run all unit tests
make integration    # Build Docker test image + run integration tests
make check          # fmt + clippy + test (CI equivalent)
```

Run `make` with no arguments to see all available targets.

### One-Time Setup

Download JavaScript vendor dependencies for offline site generation:

```shell
make vendor         # Downloads DuckDB-WASM, Observable Plot, D3
```

This populates `crates/sitegen/vendor/dist/` (gitignored, ~35 MB). After this, `pond run sitegen build` produces sites that work without network access.
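As a sketch, installing the sitegen factory and building a site could look roughly like this (the factory path `/etc/sitegen` and the config filename `site.yaml` are assumptions for illustration, not taken from the repo):

```shell
# Hypothetical sitegen workflow; guarded so it is a no-op without pond.
if command -v pond >/dev/null 2>&1; then
    # Install the factory at an assumed path, then build the static site.
    pond mknod sitegen /etc/sitegen --config-path site.yaml
    pond run /etc/sitegen build
    result="built"
else
    result="skipped (pond not on PATH)"
fi
echo "sitegen: $result"
```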

## Repository Structure

```
crates/
  tinyfs/       Pure filesystem abstractions (FS, WD, Node, path resolution)
  tlogfs/       Delta Lake persistence (OpLog, transactions, DataFusion)
  steward/      Transaction orchestration, control table, factory execution
  provider/     URL-based data access, factory registry, table providers
  cmd/          CLI commands (pond init/list/cat/copy/run/...)
  sitegen/      Static site generator (factory)
  remote/       S3 backup & replication (factory)
  hydrovu/      HydroVu API collector (factory)
  utilities/    Shared helpers (glob, chunked files, perf tracing)

scripts/        Shared deployment scripts
testsuite/      Integration tests (Docker-based)
  tests/        Individual test scripts (NNN-description.sh)
  browser/      Puppeteer browser validation tests

docs/           Architecture and design documentation
water/          Water monitoring demo site
septic/         Septic system demo site
noyo/           Noyo Harbor demo site
```

## Architecture

See docs/duckpond-overview.md for the full architecture description. Key layers (bottom to top):

| Layer | Crate | Role |
| --- | --- | --- |
| Filesystem | `tinyfs` | Pure abstractions: FS, WD, Node, path resolution |
| Persistence | `tlogfs` | Delta Lake storage, OpLog, DataFusion integration |
| Orchestration | `steward` | Transactions, control table, factory lifecycle |
| Data Access | `provider` | URL schemes, factory registry, table providers |
| CLI | `cmd` | User-facing commands |

## CLI Reference

See docs/cli-reference.md for the complete command reference. Common commands:

```shell
pond init                                               # Create a new pond
pond list '/**'                                         # List all entries
pond cat /path/to/file                                  # Read a file
pond cat --sql "SELECT * FROM source WHERE ..." /path   # Query a table
pond copy host:///local/file /pond/path                 # Import a file
pond copy host+series:///data.parquet /pond/series      # Import time-series
pond mkdir /dir                                         # Create a directory
pond mknod <factory> /path --config-path config.yaml    # Install a factory
pond run /path/to/factory <command>                     # Execute a factory
pond log                                                # Transaction history
```

## Integration Tests

Tests live in `testsuite/tests/` as numbered shell scripts. Each test runs in a fresh Docker container with the `pond` binary:

```shell
make test-image                     # Build the test Docker image
make integration                    # Run all tests (skips browser tests)
make integration-all                # Run all tests including browser

# Run a single test
cd testsuite && ./run-test.sh 201

# Run interactively (explore in container)
cd testsuite && ./run-test.sh --interactive
```
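A new test script might follow a skeleton like the one below. The `assert_contains` helper is a hypothetical name, not an existing testsuite utility; real scripts drive the `pond` binary inside the container:

```shell
#!/bin/sh
# Hypothetical skeleton for a testsuite/tests/NNN-description.sh script.
set -u

# Fail if the second argument does not occur in the first.
assert_contains() {
    case "$1" in
        *"$2"*) return 0 ;;
        *) echo "FAIL: expected '$2' in output" >&2; return 1 ;;
    esac
}

# A real script would capture pond output, e.g.:
#   echo "hello" | pond copy - /data/greeting.txt
#   out="$(pond cat /data/greeting.txt)"
out="hello"   # stand-in for captured output in this sketch
assert_contains "$out" "hello" && echo "PASS"
```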

## Demo Sites

Each demo site (water/, septic/, noyo/) rsyncs data from its remote machine and runs everything locally:

```shell
# First time: configure your site
cp water/deploy.env.example water/deploy.env
# Edit deploy.env with your remote host and S3 credentials

# Site workflow (all run locally)
cd water
./setup-local.sh          # rsync data + init pond + install factories
./run-local.sh            # rsync new data + ingest
./generate-local.sh       # build static site + preview
./update-local.sh         # after editing YAML/templates
```

Credentials are kept in deploy.env (gitignored) — never in the YAML configs checked into the repository. Remote machines use container images built by GitHub Actions.
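A `deploy.env` might look roughly like this (the variable names are guesses for illustration; the authoritative list is in `water/deploy.env.example`):

```shell
# Hypothetical deploy.env contents -- variable names are illustrative only.
# This file is gitignored; real credentials never enter the repository.
REMOTE_HOST="water.example.net"        # machine to rsync data from
S3_BUCKET="example-pond-backups"       # backup target for the remote factory
AWS_ACCESS_KEY_ID="AKIA..."            # S3 credentials live here, not in YAML
AWS_SECRET_ACCESS_KEY="..."
```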

## Documentation

| Document | Contents |
| --- | --- |
| CLI Reference | Complete command syntax and examples |
| Architecture Overview | System design and crate map |
| System Patterns | Transaction model, factories, providers |
| Sitegen Design | Static site generator architecture |
| Cross-Pond Import | Foreign pond import status |
| Large File Storage | Content-addressed storage for large files |
| Releasing | Release process and supply chain security |

## License

Apache-2.0 — see LICENSES/ for details.
