
RFC: pluggable storage backends — vector-optimized vs append-log modes #56

@justrach

RFC: Pluggable Storage Backends

Motivation

codedb2 currently has a single storage model: append-only version chains in memory + data.log on disk. This works well for edit provenance tracking, but different deployment scenarios want fundamentally different storage characteristics:

| Scenario | Needs | Current fit |
| --- | --- | --- |
| MCP agent (Claude Code, Cursor) | Fast symbol lookup, instant startup | Good (snapshot + trigram) |
| Semantic code search | Vector embeddings, similarity queries | Not supported |
| Long-running CI/audit | Full edit history, compact storage | Partial (data.log exists but no replay) |
| Embedded/WASM | Minimal memory, no disk I/O | Poor (loads everything into RAM) |

Proposal: Storage mode flag

codedb --mode=default    # current behavior (trigram + sparse ngram + append log)
codedb --mode=vector     # vector embeddings for semantic search
codedb --mode=compact    # minimal footprint, no version history
codedb --mode=full       # default + vector + persistent store

Architecture

                  ┌───────────────────────────┐
                  │       Explorer API        │
                  │  (tree, outline, search)  │
                  └─────────────┬─────────────┘
                                │
                  ┌─────────────▼─────────────┐
                  │   StorageBackend trait    │
                  │  indexFile()              │
                  │  search()                 │
                  │  persist() / restore()    │
                  └─────────────┬─────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
 ┌──────▼──────┐         ┌──────▼──────┐         ┌──────▼──────┐
 │   Default   │         │   Vector    │         │   Compact   │
 │             │         │             │         │             │
 │ trigram     │         │ embeddings  │         │ outline-    │
 │ sparse_ngr  │         │ HNSW/flat   │         │ only, no    │
 │ word_index  │         │ index       │         │ content     │
 │ data.log    │         │ .vec files  │         │ store       │
 └─────────────┘         └─────────────┘         └─────────────┘
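
The trait surface can be sketched language-agnostically. A minimal illustration in Python (the real interface would be a Zig comptime interface per Phase 1; the snake_case method names and the toy `CompactBackend` below are hypothetical stand-ins):

```python
from typing import Protocol

class StorageBackend(Protocol):
    """Sketch of the trait in the diagram above. The Zig version would be a
    comptime duck-typed interface rather than runtime dispatch."""
    def index_file(self, path: str, contents: str) -> None: ...
    def search(self, query: str, top_k: int) -> list[str]: ...
    def persist(self, dest: str) -> None: ...
    def restore(self, src: str) -> None: ...

class CompactBackend:
    """Toy outline-only backend: it keeps file paths and drops contents,
    so search can only match paths (a stand-in for symbol outlines)."""
    def __init__(self) -> None:
        self.paths: list[str] = []

    def index_file(self, path: str, contents: str) -> None:
        self.paths.append(path)  # contents deliberately discarded

    def search(self, query: str, top_k: int) -> list[str]:
        return [p for p in self.paths if query in p][:top_k]

    def persist(self, dest: str) -> None: ...
    def restore(self, src: str) -> None: ...
```

Any struct satisfying the four methods can be swapped in behind the Explorer API without the API layer knowing which mode is active.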

Vector mode design

Embedding generation:

  • codedb2 is pure Zig, no Python deps — so either:
    • (A) Ship a small ONNX-like inference engine in Zig (e.g., zml or custom)
    • (B) Call an external embedding API (OpenAI, local ollama) via HTTP
    • (C) Use pre-computed trigram hash vectors as a lightweight proxy (no model needed)

Option (C) is the most interesting: we already have trigram frequency data. A file's "vector" could be its trigram frequency distribution, and cosine similarity between these vectors gives semantic-ish similarity without any ML model:

file_vector[i] = count(trigram_i in file) / total_trigrams_in_file
similarity(a, b) = dot(a.vector, b.vector) / (|a.vector| * |b.vector|)

This gives "files with similar code patterns" for free using existing index data.
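
A minimal sketch of option (C), written in Python for brevity (the real implementation would be pure Zig and would reuse the existing trigram index rather than re-scan file contents):

```python
import math
from collections import Counter

def trigram_vector(text: str) -> dict[str, float]:
    """Sparse trigram frequency distribution of a file's contents,
    normalized by total trigram count."""
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(trigrams.values())
    return {t: c / total for t, c in trigrams.items()}

def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse trigram vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A file compared with itself scores 1.0, and two files in the same language and style score higher than files with disjoint token patterns, which is exactly the "similar code patterns" signal.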

Storage format for vectors:

.codedb/projects/<hash>/vectors.bin
  Header: magic(4) + version(2) + dim(4) + file_count(4)
  Per file: path_len(u16) + path + float32[dim]
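
A round-trip sketch of that layout in Python. Two details the RFC leaves open are assumed here: little-endian byte order, and a placeholder magic value (`CDBV` is hypothetical):

```python
import struct

MAGIC = b"CDBV"  # hypothetical magic; the RFC does not specify the value

def write_vectors(path: str, dim: int,
                  entries: list[tuple[str, list[float]]]) -> None:
    """Serialize (file_path, float32[dim]) pairs in the proposed
    vectors.bin layout. Assumes little-endian (not specified in the RFC)."""
    with open(path, "wb") as f:
        f.write(MAGIC)                                    # magic(4)
        f.write(struct.pack("<HII", 1, dim, len(entries)))  # version, dim, file_count
        for file_path, vec in entries:
            raw = file_path.encode("utf-8")
            f.write(struct.pack("<H", len(raw)))          # path_len(u16)
            f.write(raw)                                  # path
            f.write(struct.pack(f"<{dim}f", *vec))        # float32[dim]

def read_vectors(path: str) -> tuple[int, list[tuple[str, list[float]]]]:
    with open(path, "rb") as f:
        assert f.read(4) == MAGIC
        _version, dim, count = struct.unpack("<HII", f.read(10))
        entries = []
        for _ in range(count):
            (plen,) = struct.unpack("<H", f.read(2))
            file_path = f.read(plen).decode("utf-8")
            vec = list(struct.unpack(f"<{dim}f", f.read(4 * dim)))
            entries.append((file_path, vec))
        return dim, entries
```

Note that float32 loses precision relative to host doubles, so exact round-trips only hold for values representable in 32 bits.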

New CLI commands:

codedb similar src/snapshot.zig        # files with similar code patterns
codedb similar --query "error handling" # semantic search (needs embeddings)

New MCP tools:

{"name": "codedb_similar", "args": {"path": "src/snapshot.zig", "top_k": 5}}
{"name": "codedb_semantic", "args": {"query": "authentication flow", "top_k": 10}}

Compact mode design

For WASM/embedded — strip content storage, keep only outlines:

  • No contents HashMap (saves ~90% memory on large repos)
  • No trigram/sparse_ngram indexes
  • Symbol-only queries (outline, find, tree, deps)
  • Snapshot format uses TREE + OUTLINE sections only, skips CONTENT + FREQ

Implementation phases

  1. Phase 1: Define StorageBackend interface in Zig (comptime interface pattern)
  2. Phase 2: Refactor current code into DefaultBackend
  3. Phase 3: Add CompactBackend (outline-only, for WASM)
  4. Phase 4: Add trigram-frequency vector similarity (no ML, pure Zig)
  5. Phase 5: Optional external embedding API integration

Open questions

  1. Should vector mode be additive (default + vectors) or exclusive?
  2. For trigram-frequency vectors — what dimensionality? Full 256³ = 16M is too large. Top-K most discriminative trigrams (K=1024)?
  3. Should --mode be a CLI flag, config file, or auto-detected?
  4. How does this interact with the snapshot format? New section type for vectors?
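One way to attack question 2, sketched in Python: score every trigram by corpus frequency weighted by inverse document frequency, so trigrams that are heavy in a few files outrank trigrams that appear everywhere, then keep the top K as the fixed vector dimensions. This is only one reading of "discriminative"; variance- or chi-squared-based scoring would be equally plausible, and the weighting below is an illustrative choice, not a proposal:

```python
import math
from collections import Counter

def top_k_discriminative(files: list[str], k: int) -> list[str]:
    """Pick K trigrams by total frequency * inverse document frequency.
    Ubiquitous trigrams get a low idf and drop out of the top K."""
    n = len(files)
    tf = Counter()  # total occurrences across the corpus
    df = Counter()  # number of files containing the trigram
    for text in files:
        grams = [text[i:i + 3] for i in range(len(text) - 2)]
        tf.update(grams)
        df.update(set(grams))
    score = {t: tf[t] * math.log((n + 1) / (df[t] + 0.5)) for t in tf}
    return sorted(score, key=score.get, reverse=True)[:k]
```

The chosen trigram list would itself need to live in the snapshot (or vectors.bin header) so that vectors written by one index run stay comparable on the next.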

Prior art

  • Zoekt — trigram-based code search (Go)
  • Sourcegraph — hybrid trigram + embeddings
  • Sweep — embedding-based code search for AI agents
  • CodeSearchNet — benchmark for code search

cc @justrach
