RFC: pluggable storage backends — vector-optimized vs append-log modes #56
Open
Labels: priority:p2 (Medium priority)
RFC: Pluggable Storage Backends
Motivation
codedb2 currently has a single storage model: append-only version chains in memory + data.log on disk. This works well for edit provenance tracking, but different deployment scenarios want fundamentally different storage characteristics:
| Scenario | Needs | Current fit |
|---|---|---|
| MCP agent (Claude Code, Cursor) | Fast symbol lookup, instant startup | Good (snapshot + trigram) |
| Semantic code search | Vector embeddings, similarity queries | Not supported |
| Long-running CI/audit | Full edit history, compact storage | Partial (data.log exists but no replay) |
| Embedded/WASM | Minimal memory, no disk I/O | Poor (loads everything into RAM) |
Proposal: Storage mode flag
codedb --mode=default # current behavior (trigram + sparse ngram + append log)
codedb --mode=vector # vector embeddings for semantic search
codedb --mode=compact # minimal footprint, no version history
codedb --mode=full # default + vector + persistent store
Architecture
┌─────────────────────────┐
│ Explorer API │
│ (tree, outline, search) │
└────────┬────────────────┘
│
┌────────▼────────────────┐
│ StorageBackend trait │
│ indexFile() │
│ search() │
│ persist() / restore() │
└────────┬────────────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌─────▼─────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Default │ │ Vector │ │ Compact │
│ │ │ │ │ │
│ trigram │ │ embeddings │ │ outline- │
│ sparse_ngr │ │ HNSW/flat │ │ only, no │
│ word_index │ │ index │ │ content │
│ data.log │ │ .vec files │ │ store │
└────────────┘ └─────────────┘ └─────────────┘
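A hedged sketch of the StorageBackend surface, in Python for brevity (the real thing would be a Zig comptime interface; everything beyond the four methods named above — the toy CompactBackend, its outline heuristic — is illustrative, not codedb2's actual code):

```python
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Minimal surface the Explorer API would program against."""

    @abstractmethod
    def index_file(self, path: str, content: bytes) -> None: ...

    @abstractmethod
    def search(self, query: str, top_k: int = 10) -> list[tuple[str, float]]: ...

    @abstractmethod
    def persist(self, dest: str) -> None: ...

    @abstractmethod
    def restore(self, src: str) -> None: ...

class CompactBackend(StorageBackend):
    """Outline-only: stores symbol names, never file content."""

    def __init__(self) -> None:
        self.symbols: dict[str, list[str]] = {}

    def index_file(self, path: str, content: bytes) -> None:
        # Toy "outline": keep names of lines that look like fn definitions.
        syms = []
        for ln in content.decode().splitlines():
            ln = ln.strip().removeprefix("pub ")
            if ln.startswith(("fn ", "def ")):
                syms.append(ln.split()[1].split("(")[0])
        self.symbols[path] = syms

    def search(self, query: str, top_k: int = 10) -> list[tuple[str, float]]:
        hits = [(p, 1.0) for p, syms in self.symbols.items()
                if any(query in s for s in syms)]
        return hits[:top_k]

    def persist(self, dest: str) -> None:
        pass  # compact mode may simply skip persistence

    def restore(self, src: str) -> None:
        pass
```

The point of the split is that Explorer-level code (tree, outline, search) never touches trigram tables or `.vec` files directly; each backend decides what `search()` means for it.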
Vector mode design
Embedding generation:
- codedb2 is pure Zig, no Python deps — so either:
- (A) Ship a small ONNX-like inference engine in Zig (e.g., zml or custom)
- (B) Call an external embedding API (OpenAI, local ollama) via HTTP
- (C) Use pre-computed trigram hash vectors as a lightweight proxy (no model needed)
Option (C) is most interesting — we already have trigram frequency data. A file's "vector" could be its trigram frequency distribution normalized to a unit vector. Cosine similarity between these gives semantic-ish similarity without any ML model:
file_vector[i] = count(trigram_i in file) / total_trigrams_in_file
similarity(a, b) = dot(a, b) / (|a| * |b|)
This gives "files with similar code patterns" for free using existing index data.
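A minimal sketch of option (C), in Python for illustration (the real implementation would reuse the existing Zig trigram index rather than re-counting trigrams per file):

```python
import math
from collections import Counter

def trigram_vector(text: str) -> dict[str, float]:
    """Trigram frequency distribution, normalized to a unit vector."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity; vectors are unit-length, so this is just the dot product."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())
```

Files sharing many character trigrams (similar identifiers, idioms, boilerplate) score high; disjoint files score near zero. Note this measures surface pattern overlap, not meaning — hence "semantic-ish".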
Storage format for vectors:
.codedb/projects/<hash>/vectors.bin
Header: magic(4) + version(2) + dim(4) + file_count(4)
Per file: path_len(u16) + path + float32[dim]
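A round-trip sketch of that layout in Python's `struct` notation (little-endian assumed; the `CDBV` magic value is a placeholder, not a decided constant):

```python
import struct

MAGIC = b"CDBV"  # placeholder magic; the RFC doesn't fix a value

def write_vectors(path: str, dim: int, entries: list[tuple[str, list[float]]]) -> None:
    with open(path, "wb") as f:
        # Header: magic(4) + version(2) + dim(4) + file_count(4)
        f.write(MAGIC + struct.pack("<HII", 1, dim, len(entries)))
        for file_path, vec in entries:
            raw = file_path.encode()
            # Per file: path_len(u16) + path + float32[dim]
            f.write(struct.pack("<H", len(raw)) + raw)
            f.write(struct.pack(f"<{dim}f", *vec))

def read_vectors(path: str) -> tuple[int, list[tuple[str, list[float]]]]:
    with open(path, "rb") as f:
        assert f.read(4) == MAGIC, "bad magic"
        version, dim, count = struct.unpack("<HII", f.read(10))
        entries = []
        for _ in range(count):
            (plen,) = struct.unpack("<H", f.read(2))
            file_path = f.read(plen).decode()
            vec = list(struct.unpack(f"<{dim}f", f.read(4 * dim)))
            entries.append((file_path, vec))
        return dim, entries
```

Fixed-width float32 records mean the file can also be mmapped and scanned without parsing, which matters for the flat-index search path.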
New CLI commands:
codedb similar src/snapshot.zig # files with similar code patterns
codedb similar --query "error handling" # semantic search (needs embeddings)
New MCP tools:
{"name": "codedb_similar", "args": {"path": "src/snapshot.zig", "top_k": 5}}
{"name": "codedb_semantic", "args": {"query": "authentication flow", "top_k": 10}}Compact mode design
For WASM/embedded — strip content storage, keep only outlines:
- No contents HashMap (saves ~90% memory on large repos)
- No trigram/sparse_ngram indexes
- Symbol-only queries (outline, find, tree, deps)
- Snapshot format uses TREE + OUTLINE sections only, skips CONTENT + FREQ
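The section filter could look like the following sketch (Python for illustration; the record layout — 4-byte tag plus u32 length — and the `OUTL` tag spelling are assumptions, since the actual snapshot format is defined elsewhere in codedb2):

```python
import struct

# Keep TREE + OUTLINE, drop CONTENT + FREQ (tag spellings assumed).
COMPACT_SECTIONS = {b"TREE", b"OUTL"}

def compact_snapshot(data: bytes) -> bytes:
    """Copy only the section records compact mode keeps.

    Assumes a stream of [tag(4) + len(u32 LE) + payload] records.
    """
    out, i = [], 0
    while i < len(data):
        tag = data[i:i + 4]
        (length,) = struct.unpack("<I", data[i + 4:i + 8])
        if tag in COMPACT_SECTIONS:
            out.append(data[i:i + 8 + length])
        i += 8 + length
    return b"".join(out)
```

Because sections are skipped at write/load time rather than filtered afterward, a WASM build never has to allocate for CONTENT at all.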
Implementation phases
- Phase 1: Define StorageBackend interface in Zig (comptime interface pattern)
- Phase 2: Refactor current code into DefaultBackend
- Phase 3: Add CompactBackend (outline-only, for WASM)
- Phase 4: Add trigram-frequency vector similarity (no ML, pure Zig)
- Phase 5: Optional external embedding API integration
Open questions
- Should vector mode be additive (default + vectors) or exclusive?
- For trigram-frequency vectors — what dimensionality? Full 256³ = 16M is too large. Top-K most discriminative trigrams (K=1024)?
- Should --mode be a CLI flag, config file, or auto-detected?
- How does this interact with the snapshot format? New section type for vectors?
Prior art
- Zoekt — trigram-based code search (Go)
- Sourcegraph — hybrid trigram + embeddings
- Sweep — embedding-based code search for AI agents
- CodeSearchNet — benchmark for code search
cc @justrach