
RFC: pluggable storage backends — vector-optimized vs append-log modes #56

@justrach

RFC: Pluggable Storage Backends

Motivation

codedb2 currently has a single storage model: append-only version chains in memory + data.log on disk. This works well for edit provenance tracking, but different deployment scenarios want fundamentally different storage characteristics:

| Scenario | Needs | Current fit |
| --- | --- | --- |
| MCP agent (Claude Code, Cursor) | Fast symbol lookup, instant startup | Good (snapshot + trigram) |
| Semantic code search | Vector embeddings, similarity queries | Not supported |
| Long-running CI/audit | Full edit history, compact storage | Partial (data.log exists but no replay) |
| Embedded/WASM | Minimal memory, no disk I/O | Poor (loads everything into RAM) |

Proposal: Storage mode flag

codedb --mode=default    # current behavior (trigram + sparse ngram + append log)
codedb --mode=vector     # vector embeddings for semantic search
codedb --mode=compact    # minimal footprint, no version history
codedb --mode=full       # default + vector + persistent store

Architecture

                  ┌───────────────────────────┐
                  │       Explorer API        │
                  │  (tree, outline, search)  │
                  └─────────────┬─────────────┘
                                │
                  ┌─────────────▼─────────────┐
                  │   StorageBackend trait    │
                  │  indexFile()              │
                  │  search()                 │
                  │  persist() / restore()    │
                  └─────────────┬─────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
 ┌──────▼──────┐         ┌──────▼──────┐         ┌──────▼──────┐
 │   Default   │         │   Vector    │         │   Compact   │
 │             │         │             │         │             │
 │ trigram     │         │ embeddings  │         │ outline-    │
 │ sparse_ngr  │         │ HNSW/flat   │         │ only, no    │
 │ word_index  │         │ index       │         │ content     │
 │ data.log    │         │ .vec files  │         │ store       │
 └─────────────┘         └─────────────┘         └─────────────┘
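
The trait surface can be sketched language-agnostically. A minimal illustration in Python (the real interface would be a Zig comptime interface per Phase 1; the snake_case method names and the toy `CompactBackend` below are hypothetical stand-ins):

```python
from typing import Protocol

class StorageBackend(Protocol):
    """Sketch of the trait in the diagram above. The Zig version would be a
    comptime duck-typed interface rather than runtime dispatch."""
    def index_file(self, path: str, contents: str) -> None: ...
    def search(self, query: str, top_k: int) -> list[str]: ...
    def persist(self, dest: str) -> None: ...
    def restore(self, src: str) -> None: ...

class CompactBackend:
    """Toy outline-only backend: it keeps file paths and drops contents,
    so search can only match paths (a stand-in for symbol outlines)."""
    def __init__(self) -> None:
        self.paths: list[str] = []

    def index_file(self, path: str, contents: str) -> None:
        self.paths.append(path)  # contents deliberately discarded

    def search(self, query: str, top_k: int) -> list[str]:
        return [p for p in self.paths if query in p][:top_k]

    def persist(self, dest: str) -> None: ...
    def restore(self, src: str) -> None: ...
```

Any struct satisfying the four methods can be swapped in behind the Explorer API without the API layer knowing which mode is active.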

Vector mode design

Embedding generation:

  • codedb2 is pure Zig, no Python deps — so either:
    • (A) Ship a small ONNX-like inference engine in Zig (e.g., zml or custom)
    • (B) Call an external embedding API (OpenAI, local ollama) via HTTP
    • (C) Use pre-computed trigram hash vectors as a lightweight proxy (no model needed)

Option (C) is the most interesting: we already have trigram frequency data. A file's "vector" could be its trigram frequency distribution, and cosine similarity between these vectors gives semantic-ish similarity without any ML model:

file_vector[i] = count(trigram_i in file) / total_trigrams_in_file
similarity(a, b) = dot(a.vector, b.vector) / (|a.vector| * |b.vector|)

This gives "files with similar code patterns" for free using existing index data.
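
A minimal sketch of option (C), written in Python for brevity (the real implementation would be pure Zig and would reuse the existing trigram index rather than re-scan file contents):

```python
import math
from collections import Counter

def trigram_vector(text: str) -> dict[str, float]:
    """Sparse trigram frequency distribution of a file's contents,
    normalized by total trigram count."""
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(trigrams.values())
    return {t: c / total for t, c in trigrams.items()}

def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse trigram vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A file compared with itself scores 1.0, and two files in the same language and style score higher than files with disjoint token patterns, which is exactly the "similar code patterns" signal.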

Storage format for vectors:

.codedb/projects/<hash>/vectors.bin
  Header: magic(4) + version(2) + dim(4) + file_count(4)
  Per file: path_len(u16) + path + float32[dim]
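
A round-trip sketch of that layout in Python. Two details the RFC leaves open are assumed here: little-endian byte order, and a placeholder magic value (`CDBV` is hypothetical):

```python
import struct

MAGIC = b"CDBV"  # hypothetical magic; the RFC does not specify the value

def write_vectors(path: str, dim: int,
                  entries: list[tuple[str, list[float]]]) -> None:
    """Serialize (file_path, float32[dim]) pairs in the proposed
    vectors.bin layout. Assumes little-endian (not specified in the RFC)."""
    with open(path, "wb") as f:
        f.write(MAGIC)                                    # magic(4)
        f.write(struct.pack("<HII", 1, dim, len(entries)))  # version, dim, file_count
        for file_path, vec in entries:
            raw = file_path.encode("utf-8")
            f.write(struct.pack("<H", len(raw)))          # path_len(u16)
            f.write(raw)                                  # path
            f.write(struct.pack(f"<{dim}f", *vec))        # float32[dim]

def read_vectors(path: str) -> tuple[int, list[tuple[str, list[float]]]]:
    with open(path, "rb") as f:
        assert f.read(4) == MAGIC
        _version, dim, count = struct.unpack("<HII", f.read(10))
        entries = []
        for _ in range(count):
            (plen,) = struct.unpack("<H", f.read(2))
            file_path = f.read(plen).decode("utf-8")
            vec = list(struct.unpack(f"<{dim}f", f.read(4 * dim)))
            entries.append((file_path, vec))
        return dim, entries
```

Note that float32 loses precision relative to host doubles, so exact round-trips only hold for values representable in 32 bits.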

New CLI commands:

codedb similar src/snapshot.zig        # files with similar code patterns
codedb similar --query "error handling" # semantic search (needs embeddings)

New MCP tools:

{"name": "codedb_similar", "args": {"path": "src/snapshot.zig", "top_k": 5}}
{"name": "codedb_semantic", "args": {"query": "authentication flow", "top_k": 10}}

Compact mode design

For WASM/embedded — strip content storage, keep only outlines:

  • No contents HashMap (saves ~90% memory on large repos)
  • No trigram/sparse_ngram indexes
  • Symbol-only queries (outline, find, tree, deps)
  • Snapshot format uses TREE + OUTLINE sections only, skips CONTENT + FREQ

Implementation phases

  1. Phase 1: Define StorageBackend interface in Zig (comptime interface pattern)
  2. Phase 2: Refactor current code into DefaultBackend
  3. Phase 3: Add CompactBackend (outline-only, for WASM)
  4. Phase 4: Add trigram-frequency vector similarity (no ML, pure Zig)
  5. Phase 5: Optional external embedding API integration

Open questions

  1. Should vector mode be additive (default + vectors) or exclusive?
  2. For trigram-frequency vectors — what dimensionality? Full 256³ = 16M is too large. Top-K most discriminative trigrams (K=1024)?
  3. Should --mode be a CLI flag, config file, or auto-detected?
  4. How does this interact with the snapshot format? New section type for vectors?
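One way to attack question 2, sketched in Python: score every trigram by corpus frequency weighted by inverse document frequency, so trigrams that are heavy in a few files outrank trigrams that appear everywhere, then keep the top K as the fixed vector dimensions. This is only one reading of "discriminative"; variance- or chi-squared-based scoring would be equally plausible, and the weighting below is an illustrative choice, not a proposal:

```python
import math
from collections import Counter

def top_k_discriminative(files: list[str], k: int) -> list[str]:
    """Pick K trigrams by total frequency * inverse document frequency.
    Ubiquitous trigrams get a low idf and drop out of the top K."""
    n = len(files)
    tf = Counter()  # total occurrences across the corpus
    df = Counter()  # number of files containing the trigram
    for text in files:
        grams = [text[i:i + 3] for i in range(len(text) - 2)]
        tf.update(grams)
        df.update(set(grams))
    score = {t: tf[t] * math.log((n + 1) / (df[t] + 0.5)) for t in tf}
    return sorted(score, key=score.get, reverse=True)[:k]
```

The chosen trigram list would itself need to live in the snapshot (or vectors.bin header) so that vectors written by one index run stay comparable on the next.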

Prior art

  • Zoekt — trigram-based code search (Go)
  • Sourcegraph — hybrid trigram + embeddings
  • Sweep — embedding-based code search for AI agents
  • CodeSearchNet — benchmark for code search

cc @justrach
