
Architecture

Anup Ghatage edited this page Feb 12, 2026 · 1 revision


Overview

Zeppelin is an S3-native vector search engine. All persistent state lives in object storage. Nodes are stateless and disposable — restart any node and it reconstructs its view from S3.

┌─────────────────────────────────────────────────────────┐
│                      HTTP Client                         │
└──────────────────────────┬──────────────────────────────┘
                           │
                    ┌──────▼──────┐
                    │  Axum HTTP  │  (routes, handlers, middleware)
                    │   Server    │
                    └──────┬──────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
    ┌─────▼─────┐  ┌──────▼──────┐  ┌──────▼──────┐
    │ Namespace │  │   Query     │  │    WAL      │
    │  Manager  │  │  Engine     │  │   Writer    │
    └─────┬─────┘  └──────┬──────┘  └──────┬──────┘
          │                │                │
          │         ┌──────▼──────┐         │
          │         │    Index    │         │
          │         │  (IVF/HANN) │         │
          │         └──────┬──────┘         │
          │                │                │
    ┌─────▼────────────────▼────────────────▼─────┐
    │              Disk Cache (LRU)                │
    └──────────────────────┬──────────────────────┘
                           │
                    ┌──────▼──────┐
                    │   Storage   │  (object_store wrapper)
                    │    Layer    │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────────┐
              │            │                │
         ┌────▼───┐  ┌────▼───┐  ┌─────────▼──┐
         │ AWS S3 │  │  GCS   │  │ Azure Blob │
         └────────┘  └────────┘  └────────────┘

Design Principles

  1. S3 is the source of truth. Never trust local state over S3. The manifest on S3 is always authoritative. Local cache is disposable.

  2. Immutable artifacts. WAL fragments and segments are write-once. Never modified in place. The manifest tracks what exists.

  3. No fallbacks. Code crashes explicitly on errors. No silent degradation, no swallowing errors, no default values for things that should be configured.

  4. Stateless nodes. Any node can serve any namespace. On startup, the node scans S3 to discover namespaces.

S3 Key Structure

<namespace>/
├── meta.json                          # Namespace metadata
├── manifest.json                      # WAL manifest (fragments + segments)
├── lease.json                         # Writer lease (fencing token)
├── wal/
│   ├── <ulid>.fragment.json           # WAL fragment (vectors + deletes)
│   └── ...
└── segments/
    └── <segment_id>/
        ├── centroids.bin              # IVF centroid vectors
        ├── cluster_<N>.bin            # Full-precision cluster data
        ├── f16_cluster_<N>.bin        # f16-compressed cluster data
        ├── sq_calibration.bin         # SQ8 calibration parameters
        ├── sq_cluster_<N>.bin         # SQ8-quantized cluster data
        ├── pq_codebook.bin            # PQ codebook
        ├── pq_cluster_<N>.bin         # PQ-encoded cluster data
        ├── attributes_<N>.json        # Cluster attribute data
        ├── bitmap_<field>.bin         # RoaringBitmap per attribute field
        ├── fts_<field>.bin            # Inverted index per FTS field
        └── tree.json                  # Hierarchical tree structure
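Note that every artifact for a namespace shares the `<namespace>/` key prefix, so listing that prefix discovers all of its state. The layout can be sketched as a few key-building helpers; the function names below are illustrative, not Zeppelin's actual API:

```rust
/// Illustrative helpers that build keys matching the layout above.
/// These are hypothetical sketches, not the real code in src/storage/.
fn meta_key(ns: &str) -> String {
    format!("{ns}/meta.json")
}

fn fragment_key(ns: &str, ulid: &str) -> String {
    format!("{ns}/wal/{ulid}.fragment.json")
}

fn cluster_key(ns: &str, segment_id: &str, n: u32) -> String {
    format!("{ns}/segments/{segment_id}/cluster_{n}.bin")
}

fn main() {
    assert_eq!(meta_key("docs"), "docs/meta.json");
    assert_eq!(fragment_key("docs", "01ARZ3"), "docs/wal/01ARZ3.fragment.json");
    assert_eq!(cluster_key("docs", "seg-0", 7), "docs/segments/seg-0/cluster_7.bin");
}
```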

Module Map

| Module | Responsibility |
| --- | --- |
| `src/storage/` | Object store wrapper. Nothing above this layer touches `object_store` directly |
| `src/wal/` | Write-ahead log: fragment serialization, manifest management, leases |
| `src/namespace/` | Namespace CRUD and metadata |
| `src/index/` | Vector indexing: IVF-Flat, Hierarchical, SQ8, PQ, bitmap, f16 |
| `src/compaction/` | Background WAL → segment compaction |
| `src/cache/` | Local disk cache with LRU eviction |
| `src/server/` | Axum HTTP handlers (thin layer over domain logic) |
| `src/fts/` | Full-text search: tokenizer, BM25, inverted indexes, rank_by |
| `src/query.rs` | Query execution: manifest read → WAL scan + segment search → merge |
| `src/config.rs` | Configuration loading (env vars + TOML + defaults) |
| `src/types.rs` | Core types: VectorEntry, Filter, DistanceMetric, IndexType |
| `src/error.rs` | Error types with HTTP status code mapping |
| `src/metrics.rs` | Prometheus metrics registry |

Write Path

Client POST /v1/namespaces/:ns/vectors
    │
    ▼
Handler: validate dimensions, batch size, vector IDs
    │
    ▼
WalWriter::append()
    │
    ├── Serialize vectors + attributes to JSON
    ├── Compute xxHash checksum
    ├── Write fragment to S3: <ns>/wal/<ulid>.fragment.json
    └── CAS update manifest.json (add FragmentRef)

In summary:

  1. Validate input against namespace metadata
  2. Create a WAL fragment with vectors, attributes, and optional deletes
  3. Write the fragment to S3 (immutable, write-once)
  4. Update the manifest via CAS (compare-and-swap using ETags)
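The CAS step can be illustrated with an in-memory stand-in for the object store; the real code talks to S3 via the `object_store` crate, and all names below are simplified for illustration:

```rust
use std::collections::HashMap;

/// In-memory stand-in for an object store with ETag-based
/// compare-and-swap. Illustrative only; not Zeppelin's actual API.
struct FakeStore {
    objects: HashMap<String, (u64, String)>, // key -> (etag, body)
}

impl FakeStore {
    fn get(&self, key: &str) -> Option<(u64, String)> {
        self.objects.get(key).cloned()
    }

    /// Write succeeds only if the caller's ETag matches the stored one.
    fn put_if_match(&mut self, key: &str, expected: u64, body: String) -> Result<u64, ()> {
        match self.objects.get_mut(key) {
            Some((etag, stored)) if *etag == expected => {
                *etag += 1;
                *stored = body;
                Ok(*etag)
            }
            _ => Err(()), // precondition failed: another writer won the race
        }
    }
}

fn main() {
    let mut store = FakeStore { objects: HashMap::new() };
    store
        .objects
        .insert("ns/manifest.json".into(), (1, r#"{"fragments":[]}"#.into()));

    // Writers A and B both read the manifest at ETag 1.
    let (etag_a, _) = store.get("ns/manifest.json").unwrap();
    let (etag_b, _) = store.get("ns/manifest.json").unwrap();

    // A commits first; the ETag advances.
    assert!(store
        .put_if_match("ns/manifest.json", etag_a, r#"{"fragments":["f1"]}"#.into())
        .is_ok());

    // B's CAS fails against its now-stale ETag; it must re-read and retry.
    assert!(store
        .put_if_match("ns/manifest.json", etag_b, r#"{"fragments":["f2"]}"#.into())
        .is_err());
}
```

The losing writer never corrupts the manifest: it simply reloads the latest version and retries its update on top of it.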

Read Path

Client POST /v1/namespaces/:ns/query
    │
    ▼
Handler: validate query, resolve consistency
    │
    ▼
Read manifest.json from S3
    │
    ├── [Strong] Scan all WAL fragments
    │   └── Brute-force distance/BM25 on each fragment
    │
    ├── [Always] Search index segments
    │   ├── Load centroids from cache/S3
    │   ├── Find top-nprobe nearest centroids
    │   ├── Load cluster data for those centroids
    │   ├── Apply bitmap pre-filter (if available)
    │   └── Compute distances, collect top-k per segment
    │
    └── Merge WAL + segment results
        ├── Deduplicate (WAL wins on conflict)
        ├── Remove deleted IDs
        ├── Apply post-filter
        └── Return top-k results
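The merge step at the bottom of the read path can be sketched as follows; types and function names are illustrative, not the actual implementation in `src/query.rs`:

```rust
use std::collections::{HashMap, HashSet};

/// Merge WAL and segment hits: WAL results override segment results
/// for the same ID, deleted IDs are dropped, and the k lowest-distance
/// results are returned. Illustrative sketch only.
fn merge_results(
    wal: Vec<(u64, f32)>,      // (vector id, distance) from the WAL scan
    segments: Vec<(u64, f32)>, // (vector id, distance) from segment search
    deleted: &HashSet<u64>,
    k: usize,
) -> Vec<(u64, f32)> {
    let mut best: HashMap<u64, f32> = HashMap::new();
    // Insert segment results first, then WAL, so WAL wins on conflict.
    for (id, dist) in segments.into_iter().chain(wal) {
        if deleted.contains(&id) {
            continue; // drop tombstoned IDs
        }
        best.insert(id, dist); // later (WAL) insert overwrites segment value
    }
    let mut out: Vec<(u64, f32)> = best.into_iter().collect();
    out.sort_by(|a, b| a.1.total_cmp(&b.1));
    out.truncate(k);
    out
}

fn main() {
    let deleted = HashSet::from([3u64]);
    let merged = merge_results(
        vec![(1, 0.10), (3, 0.05)], // WAL: id 3 was deleted
        vec![(1, 0.20), (2, 0.15)], // segments: stale distance for id 1
        &deleted,
        2,
    );
    assert_eq!(merged, vec![(1, 0.10), (2, 0.15)]);
}
```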

Compaction Path

Background compaction loop (every N seconds)
    │
    ▼
For each namespace with pending WAL fragments:
    │
    ├── 1. Read manifest (get fragment list + existing segments)
    ├── 2. Acquire lease (fencing token)
    ├── 3. Load all WAL fragments
    ├── 4. Merge with existing segment data
    ├── 5. Compute delete set
    ├── 6. Train k-means centroids
    ├── 7. Assign vectors to clusters
    ├── 8. Write cluster artifacts to S3
    │   ├── centroids.bin
    │   ├── cluster_<N>.bin (full precision)
    │   ├── f16_cluster_<N>.bin (if f16 enabled)
    │   ├── sq_cluster_<N>.bin (if SQ8)
    │   ├── pq_cluster_<N>.bin (if PQ)
    │   ├── attributes_<N>.json
    │   ├── bitmap_<field>.bin (per attribute)
    │   └── fts_<field>.bin (per FTS field)
    ├── 9. CAS update manifest (add SegmentRef, clear processed fragments)
    └── 10. Deferred deletion of old artifacts

Compaction is atomic via CAS: if another writer updates the manifest concurrently, the compaction retries. Old segments are deleted only after the new manifest is committed (deferred deletion pattern).
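The lease in step 2 acts as a fencing token: a compactor whose token has been superseded must abort rather than commit. A minimal sketch of that check, with all names hypothetical:

```rust
/// Monotonically increasing fencing token stored in lease.json.
/// Illustrative sketch; not Zeppelin's actual lease type.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct FencingToken(u64);

struct Lease {
    current: FencingToken,
}

impl Lease {
    /// A compactor may commit only if it still holds the newest token.
    fn may_commit(&self, held: FencingToken) -> bool {
        held >= self.current
    }
}

fn main() {
    // Compactor A acquires the lease with token 7.
    let a = FencingToken(7);
    let mut lease = Lease { current: a };
    assert!(lease.may_commit(a));

    // Compactor B acquires a newer lease (token 8), fencing A out.
    let b = FencingToken(8);
    lease.current = b;
    assert!(!lease.may_commit(a)); // A's stale commit must be rejected
    assert!(lease.may_commit(b));
}
```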

Consistency Model

| Level | Behavior | Use Case |
| --- | --- | --- |
| Strong | Reads manifest + scans all uncompacted WAL fragments + queries segments | Default; sees all committed writes |
| Eventual | Reads segments only (skips WAL) | Faster queries; may miss recent writes |

The consistency level is set per query via the `consistency` field.
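The switch between the two levels amounts to whether the WAL scan runs at all; a minimal sketch, with illustrative names rather than Zeppelin's actual types:

```rust
/// Per-query consistency level. Illustrative sketch only.
enum Consistency {
    Strong,   // read manifest + scan WAL fragments + search segments
    Eventual, // search segments only; skip the WAL scan
}

/// Strong consistency is the only level that pays for the WAL scan.
fn scans_wal(level: &Consistency) -> bool {
    matches!(level, Consistency::Strong)
}

fn main() {
    assert!(scans_wal(&Consistency::Strong));
    assert!(!scans_wal(&Consistency::Eventual));
}
```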
