Lode

Structured persistence for object storage.

Lode is an embeddable Go persistence framework that brings snapshots, metadata, and safe write semantics to object storage—without running a database. It offers two coequal persistence paradigms: Datasets for structured collections of named objects, and Volumes for sparse, resumable byte-range storage.


What Lode is (and is not)

Lode is:

  • An embeddable Go library
  • A persistence framework for structured data and sparse byte ranges
  • A layer of snapshots, atomic commits, and metadata guarantees you'd otherwise hand-roll on raw storage APIs
  • Explicit and predictable by design

Lode is not:

  • A database or storage engine
  • A query planner or execution runtime
  • A distributed system or background service

Lode owns structure and lifecycle, not execution.


Why Lode exists

Many systems already use object storage as their primary persistence layer:

  • Events written to S3
  • Parquet files on a filesystem
  • Analytics data dumped “by convention”

What’s usually missing is discipline:

  • No clear snapshots or versions
  • No metadata guarantees
  • Accidental overwrites
  • Ad-hoc directory layouts
  • Unclear read/write safety

Lode formalizes the parts that should be formalized and refuses to do the rest.


Core ideas

Lode is built around a small set of invariants:

Dataset (structured collections):

  • Datasets are immutable — writes produce snapshots
  • Snapshots reference manifests that describe data files and metadata
  • Storage, format, compression, and partitioning are orthogonal

Volume (sparse byte ranges):

  • Volumes are sparse, resumable byte address spaces
  • Volume commits track which byte ranges exist — gaps are explicit, never zero-filled
  • Each commit produces a cumulative manifest covering all committed blocks

Shared invariants:

  • Manifest presence is the commit signal — a snapshot is visible only after its manifest is persisted
  • Metadata is explicit, persisted, and self-describing (never inferred)
  • Single-writer semantics required — Lode does not resolve concurrent writer conflicts

If you know the snapshot ID, you know exactly what data you are reading.
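To make these invariants concrete, here is a minimal sketch (filesystem backend, errors elided, as in the Quick Start below): two writes produce two independent snapshots, and the first snapshot keeps returning the first write's data.

// Sketch: snapshot immutability (errors elided for brevity)
ds, _ := lode.NewDataset("demo", lode.NewFSFactory("/tmp/lode-demo"))

snapA, _ := ds.Write(ctx, []any{[]byte("v1")}, lode.Metadata{"rev": "a"})
snapB, _ := ds.Write(ctx, []any{[]byte("v2")}, lode.Metadata{"rev": "b"})
fmt.Println("snapshots:", snapA.ID, snapB.ID)

// snapA is unaffected by the later write: it still reads "v1"
dataA, _ := ds.Read(ctx, snapA.ID)
fmt.Println(string(dataA[0].([]byte)))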


Quick Start (5 minutes)

Install

go get github.com/justapithecus/lode

Write and read a blob

package main

import (
    "context"
    "fmt"
    "os"

    "github.com/justapithecus/lode/lode"
)

func main() {
    ctx := context.Background()

    // Lode does not create storage infrastructure (see Storage
    // Prerequisites below), so ensure the root directory exists first
    _ = os.MkdirAll("/tmp/lode-demo", 0o755)

    // Create a dataset with filesystem storage (errors elided for brevity)
    ds, _ := lode.NewDataset("mydata", lode.NewFSFactory("/tmp/lode-demo"))

    // Write a blob (default: raw bytes, no codec)
    snap, _ := ds.Write(ctx, []any{[]byte("hello world")}, lode.Metadata{"source": "demo"})
    fmt.Println("Created snapshot:", snap.ID)

    // Read it back
    data, _ := ds.Read(ctx, snap.ID)
    fmt.Println("Data:", string(data[0].([]byte)))
}

Write structured records

// With a codec, Write accepts structured data
ds, _ := lode.NewDataset("events",
    lode.NewFSFactory("/tmp/lode-demo"),
    lode.WithCodec(lode.NewJSONLCodec()),
)

records := lode.R(
    lode.D{"id": 1, "event": "signup"},
    lode.D{"id": 2, "event": "login"},
)
snap, _ := ds.Write(ctx, records, lode.Metadata{"batch": "1"})

Stream large files

// StreamWrite is for large binary payloads (no codec)
ds, _ := lode.NewDataset("backups", lode.NewFSFactory("/tmp/lode-demo"))

sw, _ := ds.StreamWrite(ctx, lode.Metadata{"type": "backup"})
sw.Write([]byte("... large data ..."))
snap, _ := sw.Commit(ctx)

Write sparse byte ranges (Volume)

// Volume: sparse, resumable byte-range persistence
// (requires "bytes" in the import list)
vol, _ := lode.NewVolume("disk-image", lode.NewFSFactory("/tmp/lode-demo"), 1024)

// Stage a byte range at offset 0 and commit it
data := []byte("boot sector")
blk, _ := vol.StageWriteAt(ctx, 0, bytes.NewReader(data))
snap, _ := vol.Commit(ctx, []lode.BlockRef{blk}, lode.Metadata{"step": "boot"})

// Read the committed range back
result, _ := vol.ReadAt(ctx, snap.ID, 0, len(data))

See examples/ for complete runnable code.


Which Write API?

Paradigm   API                     Use Case                         Codec                     Partitioning
Dataset    Write                   In-memory data, small batches    ✅ Optional               ✅ Supported
Dataset    StreamWrite             Large binary payloads (GB+)      ❌ Raw only               ❌ Not supported
Dataset    StreamWriteRecords      Large record streams             ✅ Required (streaming)   ❌ Not supported
Volume     StageWriteAt + Commit   Sparse byte ranges, resumable    ❌ Raw only               ❌ Not applicable

Decision flow:

  1. Is the data already in memory? → Use Write
  2. Is it a large binary blob (no structure)? → Use StreamWrite
  3. Is it a large stream of records? → Use StreamWriteRecords with a streaming codec
  4. Is it a sparse/resumable byte address space? → Use Volume

Guarantees

What Lode commits to:

Guarantee             Detail
Immutable snapshots   Once written, data files and manifests never change
Atomic commits        Manifest presence is the commit signal; no partial visibility
Explicit metadata     Every snapshot has caller-supplied metadata (never inferred)
Safe writes           Overwrites are prevented (atomic for small files; best-effort for large)
Backend-agnostic      Same semantics on filesystem, memory, or S3

What Lode explicitly does NOT provide:

Non-goal                           Why
Multi-writer conflict resolution   Requires distributed coordination; use external locks
Query execution                    Lode structures data; query engines consume it
Background compaction              No implicit mutations; callers control lifecycle
Automatic cleanup                  Partial objects from failed writes may remain

For full contract details: docs/contracts/


Gotchas

Common pitfalls when using Lode:

  • Metadata must be non-nil — Pass lode.Metadata{} for empty metadata, not nil.
  • Raw mode expects []byte — Without a codec, Write expects exactly one []byte element.
  • Single-writer only — Concurrent writers to the same dataset or volume may corrupt history.
  • Cleanup is best-effort — Failed streams may leave partial objects in storage.
  • StreamWriteRecords requires streaming codec — Not all codecs support streaming.
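
A short sketch covering the first two pitfalls, with errors checked rather than elided (ds is a raw-mode dataset as in the Quick Start):

// Raw mode (no codec): pass exactly one []byte element and a non-nil Metadata
snap, err := ds.Write(ctx, []any{[]byte("payload")}, lode.Metadata{})
if err != nil {
    // e.g. nil metadata, multiple elements, or a non-[]byte element
    return fmt.Errorf("raw write: %w", err)
}
fmt.Println("committed:", snap.ID)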

See PUBLIC_API.md for complete usage guidance.


Example: event storage

A canonical Lode workflow looks like this:

  • Dataset: events
  • Rows: timestamped events
  • Partitioning: Hive-style by day (dt=YYYY-MM-DD)
  • Format: JSON Lines
  • Compression: gzip
  • Backend: file system or S3

Each write produces a new snapshot. Reads always target a snapshot explicitly.
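
A sketch of that workflow using only the APIs shown above; partitioning and compression are configured through additional dataset options whose names are not reproduced here (see examples/hive_layout for the real ones).

// Event storage sketch (errors elided). Hive partitioning and gzip
// compression are set via extra dataset options; see examples/hive_layout.
events, _ := lode.NewDataset("events",
    lode.NewFSFactory("/data/lode"),
    lode.WithCodec(lode.NewJSONLCodec()),
)

snap, _ := events.Write(ctx, lode.R(
    lode.D{"ts": "2026-01-02T03:04:05Z", "event": "signup"},
), lode.Metadata{"dt": "2026-01-02"})

// Reads always target an explicit snapshot
rows, _ := events.Read(ctx, snap.ID)
_ = rows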


Supported Backends

  • Filesystem — Local storage via NewFSFactory
  • In-memory — Testing via NewMemoryFactory
  • S3 — AWS S3, MinIO, LocalStack, R2 via lode/s3
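
For tests, the in-memory backend removes provisioning entirely. A minimal sketch, assuming NewMemoryFactory takes no arguments (check PUBLIC_API.md for its exact signature):

// In-memory dataset for tests: nothing to mkdir or provision.
// Assumption: NewMemoryFactory takes no arguments.
ds, err := lode.NewDataset("test-data", lode.NewMemoryFactory())
if err != nil {
    t.Fatal(err)
}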

Storage Prerequisites

Ensure storage exists before constructing a dataset, volume, or reader.

Lode does not create storage infrastructure. This is intentional:

  • No hidden side effects in constructors
  • Explicit provisioning keeps control with the caller
  • Same pattern across all backends

Filesystem

# Create directory before use
mkdir -p /data/lode

// Then construct the dataset
ds, err := lode.NewDataset("events", lode.NewFSFactory("/data/lode"))

S3

# Create bucket before use (via AWS CLI, console, or IaC)
aws s3 mb s3://my-bucket

// Then construct the dataset (wrap the Store in a factory);
// client is an already-configured S3 client value (see lode/s3)
store, err := s3.New(client, s3.Config{Bucket: "my-bucket"})
factory := func() (lode.Store, error) { return store, nil }
ds, err := lode.NewDataset("events", factory)

Bootstrap Helpers

If you need provisioning helpers, implement them outside core APIs:

// Example: ensure directory exists before constructing dataset
func EnsureFSDataset(id, root string, opts ...lode.Option) (lode.Dataset, error) {
    if err := os.MkdirAll(root, 0755); err != nil {
        return nil, fmt.Errorf("create storage root: %w", err)
    }
    return lode.NewDataset(id, lode.NewFSFactory(root), opts...)
}
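
Callers then provision and construct in one explicit call site:

// Provisioning and construction stay visible in application code
ds, err := EnsureFSDataset("events", "/data/lode", lode.WithCodec(lode.NewJSONLCodec()))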

This keeps Lode's core APIs explicit and predictable.


Examples

Example               Purpose                                         Run
default_layout        Write → list → read with default layout        go run ./examples/default_layout
hive_layout           Partition-first layout with Hive partitioner   go run ./examples/hive_layout
blob_upload           Raw blob write/read (no codec, default bundle)  go run ./examples/blob_upload
manifest_driven       Demonstrates manifest-as-commit-signal          go run ./examples/manifest_driven
stream_write_records  Streaming record writes with iterator           go run ./examples/stream_write_records
parquet               Parquet codec with schema-typed fields          go run ./examples/parquet
volume_sparse         Sparse Volume: stage, commit, read with gaps    go run ./examples/volume_sparse
s3_experimental       S3 adapter with LocalStack/MinIO                go run ./examples/s3_experimental

Each example is self-contained and runnable. See the example source for detailed comments.


Status

Lode is at v0.7.0 and under active development. APIs are stabilizing; some changes are possible before v1.0.

If you are evaluating Lode, focus on:

  • snapshot semantics (Dataset and Volume)
  • metadata visibility
  • API clarity

  • Usage overview: PUBLIC_API.md
  • Concrete usage: examples/
  • Implementation milestones: docs/IMPLEMENTATION_PLAN.md


License

Apache License 2.0. Copyright © 2026 Andrew Hu me@andrewhu.nyc.