Skip to content

Pluggable embedding providers (ollama / openai / voyage) + storage-path unification#57

Merged
dvcdsys merged 34 commits into
mainfrom
feat/embedding-providers
Jun 2, 2026
Merged

Pluggable embedding providers (ollama / openai / voyage) + storage-path unification#57
dvcdsys merged 34 commits into
mainfrom
feat/embedding-providers

Conversation

@dvcdsys
Copy link
Copy Markdown
Owner

@dvcdsys dvcdsys commented May 28, 2026

What

Adds a pluggable embedding-provider system so the server can embed via the
local ollama llama-server sidecar, any OpenAI-compatible endpoint
(vLLM/TEI/LocalAI/OpenAI), or Voyage AI — switchable at runtime from the
dashboard. Also unifies on-disk storage path derivation behind that change.

Major pieces:

  • Provider abstractionprovider.Provider interface + kind→Factory
    registry; ollama/openai/voyage implementations (ollama code moved into
    provider/ollama/).
  • Voyage provider — int8 dequant, operator-configurable RPM/TPM
    token-bucket throttle, token-aware batch splitting, adaptive bisect on the
    120K "batch too large" 400, byte-window sliding split + averaging for
    oversize single inputs.
  • Admin surfaceGET/PUT /admin/embedding-providers[/active],
    POST /…/{kind}/test, provider-aware /status; dashboard provider UI.
  • Storage unification — one permanent, model-INDEPENDENT system DB; the
    chroma vector store is namespaced by the ACTIVE provider identity
    (provider.StorageSlug(Provider.ID())) with live reopen on switch, so
    vectors of different dimensions never collide. One-time boot migrations
    adopt legacy per-model DB/chroma layouts without a reindex.

Why

The original design derived both the SQLite path and the chroma dir from the
embedding model name (a Python-era artefact). That was only safe while the
model was a fixed env var; runtime provider switching turned it into a
critical bug — switching to a higher-dim provider wrote vectors into the
previous provider's collection, breaking search with "vectors must have the
same length" — plus a footgun where a model change spawned a parallel system
DB with empty accounts.

How

  • Two-plane storage: model-independent system DB (cfg.SQLitePath) +
    identity-namespaced chroma; vectorstore.Holder allows atomic swap of the
    active store on provider switch with no restart.
  • API keys are never persisted/logged/returned — only the env-var NAME is
    stored; the key is read live from the environment at request time.

Pre-PR review fixes (last 5 commits)

  • voyage: validate returned vector dimension; guard the averaging path
    against inconsistent window widths (panic → clean error); accept int8 as
    base64; trim base-URL trailing slash.
  • openai/db: trim base-URL trailing slash; migration-13 now tests for a
    known kind prefix instead of any : so colon-bearing legacy ollama model
    names get prefixed.
  • httpapi: fix a nil-actor panic in SwitchEmbeddingProvider under
    CIX_AUTH_DISABLED=true; add per-endpoint 403 gating tests for all admin
    routes.
  • embeddings: close three concurrency hazards — a data race on the
    s.queue pointer swap during Restart, missing mutual exclusion between
    SwitchProvider and Restart, and an unsynchronized waiterDone read in the
    ollama supervisor's Stop.

Update — post-review fixes (commits f5d6e22, 668ad29)

A full review of the branch was run after the original description above; two follow-up commits address the findings.

Correctness / resilience (f5d6e22)

  • Provider switch is now atomic. The work queue stays blocked across both the provider swap and the vector-store reopen, and a failed reopen rolls the provider back — closing the window where a concurrent embed could write new-dimension vectors into the previous provider's collection.
  • Boot no longer hard-fails on a remote provider. If the persisted openai/voyage provider's connect-test (or the ollama sidecar) fails at startup, the server degrades instead of refusing to boot, so the dashboard / auth / all-project APIs stay reachable to recover. Remote providers self-heal once the upstream/key returns.
  • Validate-then-persist in PUT /admin/embedding-providers/active: the live swap is applied (and validated) before the DB row is written, so a transient switch failure can't leave the DB pointing at a provider that bricks the next boot.
  • Chroma migration fails closed at boot instead of silently opening an empty namespace.
  • Sidecar status fix: healthy openai/voyage providers (StateRemote) now report ready: true (previously always false).
  • Restart blocks the freshly-built queue during a concurrency-cap change so in-flight embeds get a clean 503 instead of ErrSupervisor.
  • ollama token-window guard against a pathologically small ctx_size (no more infinite split loop).
  • Shared HTTP-provider plumbing (base-URL normalize, API-key resolve, Ready/Status) extracted to one place.
  • Release gate fixed: make test-gate (the embedding-parity gate) compiles again — a deleted embedRaw helper was breaking it.

Nested per-provider vector namespacing (668ad29)

Replaces the flat chroma_<slug> scheme with a nested layout, one path segment per identity field:

<ChromaPersistDir>/<kind>/<model-slug>[/<variant>]
chroma/ollama/nomic_embed_text   ·   chroma/voyage/voyage_code_3/2048_int8

The provider kind is its own directory level (derived from the kind, not glued onto the model slug), which structurally removes the legacy-migration ambiguity where a model name normalizing to a kind-looking slug (e.g. openai-community/x) was silently orphaned. Adds Provider.StorageComponents(), Config.ChromaDirFor(), and an unambiguous flat→nested migration that preserves existing ollama vectors without a reindex.

Testing

  • go vet ./... ✅ · go test ./... ✅ (39 pkgs) · go test -race ./... ✅ (39 pkgs, 0 data races) · embed_gate compiles.
  • Added regression tests: switch rollback + happy-path; per-provider StorageComponents (incl. the foo vs ollama-foo anti-collision); flat→nested migration (incl. the kind-looking-name case).

Known deferred (non-blocking)

  • An empty pre-unification per-model DB leftover triggers a cosmetic boot WARN from AdoptLegacyModelDB; no data impact. Can be silenced in a follow-up (skip-warn-when-empty).

Type of change

  • Bug fix
  • New feature
  • Refactor

Checklist

  • go build ./..., go vet ./..., go test ./... all pass
  • go test -race ./internal/embeddings/... ./internal/vectorstore/... passes
  • No secrets or API keys committed (provider configs store only env-var names)

🤖 Generated with Claude Code

dvcdsys and others added 29 commits May 26, 2026 13:28
Add provider.Provider interface and a Factory/registry under
internal/embeddings/provider. Move the existing llama-server sidecar
into provider/ollama as the first implementation. Add provider/openai
(generic /v1/embeddings) and provider/voyage (api.voyageai.com with
input_type, output_dimension, and int8 dequantization). Rewrite
embeddings.Service to delegate to the active Provider while keeping
the queue/backpressure machinery centralized.

EmbeddingModel() now returns Provider.ID() — a fingerprint of the form
"{kind}:{model}[:{dim}][:{dtype}]" — so the existing per-project
drift check in repojobs transparently treats a provider switch as a
model change and forces a full reindex on the next clone job.

Behaviour unchanged for default ollama deployments; the openai and
voyage providers are registered but not yet selectable through any
admin surface. Tests are green.
Add migration 12 to extend runtime_settings with two columns:
embedding_provider (kind selector) and embedding_provider_config
(per-provider JSON blob). Idempotent ALTER TABLE; API keys are not
persisted — providers read them live from env vars named in the
blob.

Add internal/embeddingscfg, a thin persistence layer over the two
new columns. Wire it into main.go's boot path: read the persisted
selection, seed the row from env-derived ollama defaults on fresh
installs, then hand the resolved Provider to embeddings.Service via
the new NewWithProvider constructor.

Add embeddings.Service.SwitchProvider — build new + Start + drain
queue + atomic swap + background Stop of the old. Used by the new
admin endpoints:

  GET  /api/v1/admin/embedding-providers          list kinds + schemas + env-key readiness
  GET  /api/v1/admin/embedding-providers/active   current kind + persisted config
  PUT  /api/v1/admin/embedding-providers/active   atomic switch
  POST /api/v1/admin/embedding-providers/{kind}/test pre-save sanity check

Extend /api/v1/status with embedding_provider and
embedding_provider_manages_process so the dashboard footer can
render the provider name as the label and a permanent green dot
for HTTP-only providers that have no managed process to die.

Behaviour unchanged for existing ollama installs.
Provider-level: stub httptest.Server-based tests for the openai and
voyage clients. Cover request shape, response ordering, HTTP-error
surfacing, missing-API-key behavior, ID() fingerprinting, Voyage
input_type query/document selection, and the int8 dequantize path
(127 → ~1.0, -127 → ~-1.0, etc).

Admin handler: gating tests for the new endpoints — admin sees all
three kinds in the list with non-empty schemas, viewer gets 403,
unknown kind is rejected, /test with a missing API key returns 400.
Add EmbeddingProviderSection with a kind dropdown and per-provider
forms. Three hardcoded React components (OpenAIProviderForm,
VoyageProviderForm, ollama tuning kept in the existing sections)
render below the dropdown. Voyage form covers model picker, output
dimension (256/512/1024/2048), output dtype (float | int8), and
truncation toggle. Each form surfaces a red banner when the
referenced API-key env var is not set on the server and disables
Save accordingly — keys themselves are never stored in the
dashboard or DB.

New hooks: useEmbeddingProviders, useActiveProvider, useTestProvider,
useSwitchProvider. The save flow is test → switch → toast → invalidate
runtime-model + sidecar-status caches so the footer and project cards
update immediately.

Footer: label is now the active provider kind ("ollama"/"openai"/
"voyage") instead of hardcoded "llama". Dot is green/red for ollama
(real liveness signal from /health), permanently green for HTTP-only
providers (no managed process to die — failures surface at search
time with diagnostics).

ServerPage: ollama-specific sections (EmbeddingModelSection,
RuntimeParamsSection, SidecarSection) render only when the active
provider is ollama. Otherwise the provider form above is the only
edit surface.

openapi.yaml: extend StatusResponse with embedding_provider +
embedding_provider_manages_process; add the four new
/admin/embedding-providers endpoints and their schemas.
Add commented examples for CIX_OPENAI_API_KEY and CIX_VOYAGE_API_KEY
to both compose files. Default behavior is unchanged — first boot
seeds the ollama provider from the existing CIX_EMBEDDING_MODEL +
CIX_LLAMA_* vars. Switching to a remote provider is a dashboard
action; the env var only needs to be exported when the operator
intends to use OpenAI-compatible or Voyage backends.
The previous gate covered EmbeddingModelSection, RuntimeParamsSection,
and SidecarSection but left AdvancedSection (batch size / concurrency)
visible — and also exposed the ollama-only Save & Restart header
button regardless of provider. Switch to /status as the single
source of truth for the live active kind (already polled by the
footer; no extra request) and gate AdvancedSection plus the
Save & Restart button on it. Update the header copy to explain
what's editable per provider type.
The Service-level embedding queue depth (max_embedding_concurrency)
applies to every provider — it caps how many parallel
/v1/embeddings POSTs the server runs, which OpenAI and Voyage both
accept natively (subject to their account-level rate limits). The
previous UI hid it behind the ollama-only block.

Backend Service.Restart is now provider-aware. Submitting a runtime
config change with a remote provider active only rebuilds the
queue; the (HTTP-only) provider stays in place. The previous code
unconditionally called buildOllamaFromConfig, which would have
silently re-spawned llama-server on top of a live voyage/openai
provider — that bug is fixed here.

Dashboard rearrangement:
  - AdvancedSection renamed to Throughput in the UI, always visible.
    Receives isOllama: when false the llama-batch field is hidden;
    concurrency is shown regardless.
  - Save & Restart button on ServerPage stays visible for all
    providers; label switches to "Save" when the active provider
    has no managed sidecar to restart.
  - Header copy updated to explain the new layout.
The indexer already sends all chunks of one file as a batched POST
(input: [chunk1, chunk2, ...]); concurrency caps how many such
batched POSTs run in parallel. The previous copy only mentioned
"parallel" without explaining that batching is the inner loop —
caused confusion when picking values.

Also call out the unsplit Voyage limits (voyage-code-3 = 128
inputs per request) so operators know to keep files under that
threshold until the per-provider auto-split lands.
Voyage caps inputs per /v1/embeddings POST at 128 for voyage-code-*
models (1000 for voyage-3*). OpenAI proper caps at 2048. The
previous implementation forwarded the caller's slice as-is, so a
single file with 200 chunks would 422 against voyage-code-3.

Both HTTP providers now slice oversized inputs into sequential
sub-batches inside EmbedDocuments under the same Service queue slot
(no extra concurrency consumed) and concatenate results in input
order. The split is transparent to callers — the indexer keeps
sending one batched POST per file and providers handle the rest.

Tests assert two POSTs for a 200-item Voyage call (128 + 72) and
a 3000-item OpenAI call (2048 + 952).

Dashboard copy updated: throughput card no longer warns about
unsplit limits since the safety net is in place.
Add an informational banner to the Voyage form pointing operators
to the dashboard billing page and explaining the free-tier
3 RPM / 10K TPM ceiling — small enough for a smoke test but the
indexer will saturate it on any real repo. Suggest concurrency=1
as the workaround for free-tier users.

Add a one-paragraph note to the OpenAI form about the current
no-retry policy: 429s are surfaced as-is; users should lower
concurrency or upgrade their account tier. Self-hosted servers
typically don't need this.

Decision (per discussion): no client-side retry / token-bucket
implementation. Industry tools like LangChain do retry, but on
hard tier limits (3 RPM) retry buys very little — files just fail
into the next clone job which retries idempotently. Keeping the
provider HTTP layer simple was preferred.
Surface the per-provider cost expectation before the admin picks
anything. The callout sits inside EmbeddingProviderSection and is
visible regardless of which provider is currently active or
selected in the dropdown:

- Ollama: free, local.
- OpenAI-compatible: paid on api.openai.com, free on self-hosted.
- Voyage: paid plan strongly recommended (free tier 3 RPM = unusable).

Dropdown option labels also annotated with (free) / (paid) so the
expectation is visible even when the callout is scrolled past.
The EmbeddingProviderSection dropdown allowed picking "ollama" while
a remote provider was active, but the Save button was conditionally
hidden (draftKind !== 'ollama') so the switch never actually fired.
Even if it had, the backend SwitchEmbeddingProvider rejected empty
or {} configs — and the ollama-specific fields (GGUF model, ctx,
GPU layers, sidecar paths) aren't part of this card's form to begin
with, so the dashboard had nothing valid to send.

Fix on both sides:

  Backend admin_embeddings.go: when kind == "ollama" and the
  submitted config is empty / {} / null, synthesize the full config
  from the live runtime-cfg snapshot applied on top of the env
  defaults via BuildOllamaConfigFromEnv. This is the same path the
  bootstrap seed in main.go uses, so the resulting ollama provider
  config is identical to what a fresh install would have.

  Dashboard EmbeddingProviderSection.tsx: always send {} for
  ollama-switches and always show the Save button (label switches
  to "Save & switch to {kind}" when the kind has changed). Skip
  the /test pre-check for ollama since {} would fail factory
  validation. The button copy under the ollama hint now explains
  that switching back will restart the sidecar with the current
  runtime config + trigger a full reindex per project.
…legacy rows

Two coupled bugs were causing every project to show a "stale model"
badge forever even after reindex:

1. The indexer wrote a BARE model name to projects.indexed_with_model
   at FinishIndexing because Service.embeddingModel was set from
   cfg.EmbeddingModel ONCE at boot. The drift detector and dashboard
   on the read side were comparing against the live Provider.ID()
   which is now PREFIXED ("ollama:<model>") after the pluggable-
   provider refactor — so writes never matched reads, and reindex
   couldn't clear the badge.

2. Pre-existing rows (projects indexed under the old code path) hold
   the bare model name. Without backfill they'd stay marked stale
   even after the indexer fix, because their own row never matches
   the prefixed live ID until the next reindex.

Fixes:

Indexer:
  - Add SetEmbeddingModelLookup(func() string) — binds the indexer
    to a live function (typically embeddings.Service.EmbeddingModel).
  - FinishIndexing now writes Service.EmbeddingModel() which prefers
    the live lookup, so a runtime provider switch (PUT /admin/
    embedding-providers/active) takes effect on the next reindex
    without a process restart.
  - Static SetEmbeddingModel(string) kept for tests that don't wire
    a live Service.

main.go: wire idx.SetEmbeddingModelLookup(embedSvc.EmbeddingModel).

Migration 13 (indexed_with_model_provider_prefix): one-time UPDATE
that prepends "ollama:" to every projects.indexed_with_model value
that lacks a ":" prefix. Safe because pre-refactor there was no
non-ollama backend. Idempotent (rows containing ":" are skipped).

Tested: TestMigrate_IndexedWithModelProviderPrefix covers the three
expected cases — bare gets prefixed, already-prefixed left alone,
NULL left alone.
Voyage's /v1/embeddings caps each request at 120K tokens, separate
from the 128-input cap voyage-code-* models enforce. A single large
file (e.g. bench/results/reference_embeddings.json at 187K tokens)
sat under the input-count limit but blew past the token limit,
producing a 400 from the server with the message:

  "The max allowed tokens per submitted batch is 120000. Your batch
   has 187609 tokens after truncation."

The provider now plans batches against BOTH limits. planBatches
walks the input list and closes the current batch whenever either
limit would be exceeded:

  - inputs in batch >= 128 (voyage-code-* model cap), OR
  - estimated tokens + next text > 100_000 (our target, 17% under
    Voyage's 120K hard cap to leave headroom for estimation error).

Token estimation: bytes ÷ 3, conservative for code (typically 3-4
chars/token) and even safer for UTF-8 multi-byte input (Cyrillic
comments etc count as more bytes per character). Always over-counts
so we never under-batch into a 400.

Why not an SDK: investigated the community Go SDK at
github.com/austinfhunter/voyageai — it's a thin HTTP wrapper that
does NOT handle token batching, int8 dequant, or retry strategy.
Switching would lose our existing int8 support and add a third-
party dependency for no real benefit.

Observability: structured slog.Info at the split decision logs the
batch counts and limits; slog.Debug logs each sub-batch POST with
input count + estimated token cost. Operators can grep
"voyage: splitting batch" to see batch activity and
"voyage: sub-batch POST" to follow individual requests.

Tests cover:
  - planBatches splits a 300K-byte text + smalls into multiple
    batches respecting the 100K token budget.
  - 200 small texts still split at 128 + 72 (legacy count behavior).
  - End-to-end EmbedDocuments produces multiple POSTs when token
    budget is the binding constraint.

Dashboard Throughput card copy updated to mention both limits.
Voyage AI does NOT expose rate limits via API — confirmed by checking
https://docs.voyageai.com/docs/rate-limits and the /reference/rate-limits
page. All limit data lives in their public docs and web dashboard.

So instead of a fake "cache + refresh" mechanism, we ship a hardcoded
snapshot from the docs:

  internal/embeddings/provider/voyage/limits.go
    KnownModelLimits — per-model Tier 1 RPM / TPM + per-request
                       input and token caps for every Voyage model
                       the dashboard offers.
    KnownLimitsSource — attribution string with the docs URL and
                        snapshot date ("2026-05-27") so an operator
                        can tell when to consult upstream for an
                        update.

GET /api/v1/admin/embedding-providers gains a documented_limits
field on the voyage entry. Schema documented in openapi.yaml as
DocumentedEmbeddingLimits / DocumentedModelLimit.

Dashboard VoyageProviderForm renders a compact card under the model
dropdown:
  - Active model row showing RPM / TPM / inputs-per-req / tokens-per-req
  - Tier multiplier note (×2 / ×3) since Voyage scales by lifetime
    spend and we can't detect the operator's tier
  - <details> expander with all models in the snapshot, current model
    highlighted
  - Footnote with the source URL + snapshot date
  - Button linking to the Voyage dashboard for live tier confirmation

No "Refresh" button: it would re-fetch the same hardcoded constant,
giving a false impression of "sync my plan from here". A reload after
upgrading cix-server gets the new snapshot — the source footnote
tells the operator when to do that.
…ottle

Replace the hardcoded "documented limits" snapshot card with editable
input fields for RPM, TPM, max-inputs-per-request, and max-tokens-
per-request. Voyage doesn't expose limits via API (confirmed against
docs.voyageai.com/docs/rate-limits) — the operator copies their
tier's numbers from the Voyage dashboard, the indexer throttles
itself accordingly.

Backend (provider/voyage):

  Config gains four new fields, all optional:
    RateLimitRPM         — caps requests/minute via token-bucket
    RateLimitTPM         — caps tokens/minute via token-bucket
    MaxInputsPerRequest  — overrides the static 128 default
    MaxTokensPerRequest  — overrides the static 100K default

  Two rate.Limiter instances are wired into Provider when the
  corresponding RPM/TPM > 0; embed() waits on both before each POST
  (ctx-aware — server shutdown drains cleanly). nil limiters short-
  circuit when the operator leaves the fields empty, matching prior
  behaviour (only react to upstream 429/400).

  planBatches takes maxInputs+maxTokens args instead of consts, so
  the per-request caps come from cfg.

Backend (admin_embeddings, openapi.yaml, limits.go):
  Drop the documented_limits payload + voyage.KnownModelLimits
  table — the dashboard no longer renders a "cached snapshot",
  values come straight from the operator instead. Less code, fewer
  things to keep in sync.

Frontend:
  VoyageProviderForm: replace the read-only "Documented limits"
  card with a 2x2 grid of number inputs (RPM, TPM, max-inputs,
  max-tokens). Hint paragraph above the grid still mentions the
  public-docs baseline so the operator has a starting point, plus
  a link to the Voyage dashboard. Empty = no client-side
  enforcement (per-field).

  EmbeddingProviderSection: drop the documentedLimits prop +
  useMemo; populate the four new VoyageConfig fields from the
  persisted blob when loading the active provider.

Tests:
  - TestRateLimitRPMThrottlesRequests: 120 RPM with two sequential
    EmbedQuery calls — second waits ~500ms on the limiter.
  - TestRateLimitTPMThrottlesTokens: 600K TPM with two ~60K-token
    POSTs — second waits ~2s for the bucket to refill.
  - planBatches tests updated to pass explicit caps.
Two gaps in the prior token-aware planBatches were leaking real
400s to the indexer:

1. Our estimator (bytes/3) is conservative for prose but can
   under-count for dense code where 1 token ≈ 2 bytes. estimated
   100K could become actual 150K → POST → 400.

2. planBatches puts a single input larger than the cap in its own
   batch (no choice — splitting the text would corrupt the chunk).
   That POST would 400 unconditionally.

Both surfaced as the user-reported error:
  "The max allowed tokens per submitted batch is 120000.
   Your batch has 187609 tokens after truncation."

Fix: a new embedWithAdaptiveSplit wraps each embed() call. On a
400 whose message matches voyageBatchTooLargeRegex, it bisects the
batch in half and retries both halves — recursively, so a real
tokenizer miscalculation that's ~2x our estimate self-corrects in
a single extra round-trip. EmbedQuery and EmbedDocuments both go
through this path now.

Single-chunk edge: when the offending batch is already 1 input,
there is nothing to bisect. The provider returns a clear error
pointing the operator at the chunker upstream:

  "voyage: a single chunk produced 187609 tokens (cap 120000).
   Reduce the indexer's max chunk size or switch to a model with
   a higher per-request cap"

This is the right place to fail — silently dropping a chunk would
hurt search quality without telling anyone; truncating it would
change vector semantics; retrying forever would hang the indexer.

parseBatchTooLarge extracts (cap, actual) from Voyage's stable
error message so logs surface both numbers, letting an operator
diagnose persistent estimator drift (e.g. "real tokens always
1.5x our estimate → tighten MaxTokensPerRequest in the form").

Tests:
  - EmbedDocuments with 4 inputs and a stub that rejects any
    batch >1 input ends up running 4 singleton POSTs after
    bisection.
  - Single oversize input returns the clean error.
  - parseBatchTooLarge covers happy path + non-match cases (429,
    different 400 reasons).
The "Truncate over-length inputs server-side" switch maps to
Voyage's truncation API parameter. It only affects what happens
when a SINGLE chunk exceeds the model's per-input context window
(e.g. 32K tokens for voyage-code-3) — totally separate from the
120K-tokens-per-batch cap our adaptive bisect already handles.

Before: bare label, no hint, easy to confuse with the batch cap.
Now: a paragraph under the switch explains the trade-off (silent
truncation vs explicit 400) and explicitly notes the separation
from the batch limit.
Voyage's truncation:true silently drops content past the model's
per-input context window (32K tokens for voyage-code-3). For long
chunks that means the tail never makes it into the embedding —
search quality suffers without any signal at index time.

The ollama provider already has a sliding-window pipeline for the
same problem (TokenizeAndEmbed: tokenize → split at CLS+SEP
boundaries → embed all windows → average vectors). Mirror that for
voyage, byte-aligned instead of token-aligned since Voyage has no
public /tokenize endpoint.

Added:
  - Config.MaxInputBytes (default 30_000) — caps per-input byte
    length. Sized so worst-case 1 byte/token dense code still
    fits voyage-code-3's 32K window with margin.
  - splitOversizeInput(text, maxBytes) — slices at rune-aligned
    boundaries so we never cut a UTF-8 multi-byte sequence in
    half.
  - embedAndAverage(ctx, texts, inputType) — three-phase pipeline:
      Phase 1: expand each oversize input into N windows; remember
               spans (which window indices map to which original).
      Phase 2: planBatches + adaptive-bisect POST chain on the
               expanded slice (existing logic, unchanged).
      Phase 3: average sub-window vectors back to one per original
               input. Single-window inputs pass through unchanged
               (fast path, no allocations).
  - EmbedQuery and EmbedDocuments both go through embedAndAverage
    now.

Tests:
  - splitOversizeInput: under-cap singleton, ASCII over-cap 3-way
    split with byte-perfect rejoin, UTF-8 Cyrillic split that
    keeps every window valid UTF-8.
  - EmbedDocuments_AveragesOversizeInputWindows: stub returns
    [1.0, 2.0, 3.0] for the 3 windows of a 250-byte input under
    MaxInputBytes=100; provider must return averaged [2.0].
  - Existing TPM throttling test now sets MaxInputBytes=1M to
    keep its single 180K-byte input from being auto-split (that
    test exercises the rate-limit bucket, not the window-split).
Per Voyage's own docs (https://docs.voyageai.com/docs/tokenization):

  "one word roughly corresponds to 1.2 - 1.5 tokens on average;
   for rough estimates divide character count by 5"

Our previous bytesPerToken=3 over-counted for everything except the
densest code, causing planBatches to split into more sub-batches
than necessary. Switching to bytesPerToken=5 matches Voyage's
published rule of thumb and reduces spurious splits on ordinary
content.

The trade-off: dense code can run hotter (1 token ≈ 2-3 bytes
empirically), so our estimate may UNDER-count for very tight code
and let planBatches pack inputs tighter than the real cap permits.
That's fine — embedWithAdaptiveSplit detects the resulting 400
"max tokens per batch" response and bisects the batch in half. We
trade a few extra round-trips on the worst-case dense files for
materially fewer splits on the common case.

Better long-term path is to load Voyage's real HF tokenizer
(publicly available at huggingface.co/voyageai/voyage-code-3 as a
standard tokenizer.json), but that requires a CGO Rust dep
(daulet/tokenizers) — deferred to a follow-up that justifies the
Docker / distroless work.

Tests:
  - TestPlanBatches_SplitsByTokenBudget: bumped big text from
    500K → 600K bytes so it exceeds the 100K-token cap at the
    new divisor.
  - TestEmbedDocuments_SplitsByTokenBudget: bumped from 300K → 500K
    and disabled per-input window split (MaxInputBytes=1M) to keep
    the test focused on the batch-token cap.
  - TestRateLimitTPMThrottlesTokens: bumped 180K → 300K so the
    per-call cost still hits ~60K tokens at the new divisor.
The /5 heuristic comes from Voyage's English-prose docs, but cix is
primarily a code-indexer and dense source code is much denser per
token. Production logs against voyage-code-3 show ~1.4 bytes/token
on tight Go files — a 64-input batch estimating 51K tokens actually
charged 187K, triggering Voyage's 120K hard-cap 400 and forcing the
bisect safety net to fire on every batch even with a correctly
configured rate-limits section.

bytesPerToken=2 catches code reality with margin and over-estimates
prose by ~2.5x (more round-trips, never a 400). The
embedWithAdaptiveSplit bisect stays as a residual safety net.

Tests adjusted for the new divisor (240K/200K/120K byte text sizes
where the original called for 600K/500K/300K under /5).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… vector namespace

The storage layout derived both the SQLite path and the chroma dir from
the embedding model name (a Python-era artefact ported 1:1). Safe only
while the model was a fixed env var; runtime provider switching turned it
into a critical bug — switching to a higher-dim provider (e.g. voyage
2048) wrote vectors into the previous provider's collection, breaking
search with "vectors must have the same length" — and a footgun where a
model change spawned a parallel system DB with empty accounts.

Two-plane design:
- System DB is ONE permanent, model-INDEPENDENT file at cfg.SQLitePath
  (accounts + catalog + parsed code). The parsed-code tables are
  model-independent; vector search never reads them.
- The chroma vector store is namespaced by the ACTIVE provider identity
  via provider.StorageSlug(Provider.ID()), so vectors of different
  dimensions never share a collection. Switching back reuses the prior
  namespace without a reindex.

Changes:
- provider.StorageSlug: deterministic, idempotent filesystem slug from a
  Provider.ID() fingerprint.
- config: drop DynamicSQLitePath/DynamicChromaPersistDir; add
  ChromaDirForSlug; keep ModelSafeName + LegacyDynamicSQLitePath solely
  for the one-time adoption migration.
- internal/storage (new): boot-time, idempotent file migrations —
  AdoptLegacyModelDB (checkpoint-then-rename the legacy per-model DB onto
  the canonical path; fossil moved aside to *.pre-unify-*) and
  PrefixLegacyChromaDirs (rename pre-unification ollama dirs to the
  unified scheme; canonicalises the legacy suffix through StorageSlug so
  model names with '.'/':' still match the path the server opens — no
  silent orphaning).
- db.HasTables: helper to distinguish a real unified DB from a fossil.
- vectorstore.Holder: concurrency-safe swappable *Store wrapper +
  Interface satisfied by both *Store and *Holder; indexer/httpapi/repojobs
  hold the Interface.
- embeddings.Service.SwitchProvider now reopens the vector store under the
  new identity slug and atomically swaps it into the Holder (live, no
  restart); reopen failure keeps the old store and surfaces a loud error.
- main.go/server.go: orchestrate boot migrations, Holder wiring, and
  identity-namespaced storage display.

One-shot migration shims (ModelSafeName, LegacyDynamicSQLitePath,
AdoptLegacyModelDB, PrefixLegacyChromaDirs and their call sites) are
marked "LEGACY-MIGRATION (remove next release)" for deletion once all
deployments have booted on the unified layout.

Tests: StorageSlug (incl. special-char canonicalisation), dbmigrate
(adopt/fossil-aside/WAL-drain/idempotency), chromamigrate (strict
normalize/no-clobber/idempotency), Holder (-race swap-under-read), live
reopen (swap/failure-keeps-old/unwired-noop). go build/vet/test ./...
green; -race on vectorstore green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ze base URL

Three robustness fixes surfaced by pre-PR review:

- H2: validate each returned vector's length against the configured
  output_dimension (when >0) and reject a mismatch loudly. A model that
  silently ignores output_dimension (no Matryoshka support / typo'd
  model) would otherwise write wrong-width vectors deep into the store
  with no attribution. Also guard the oversize-input averaging path
  against windows of differing width, which would panic with
  index-out-of-range; it now returns a clean error.
- M4: accept int8 embeddings returned as a base64-packed byte string
  (some OpenAI-compatible proxies) in addition to the default JSON int
  array, so a proxy swap doesn't fail the whole batch.
- L4: trim a trailing slash from base_url in New() so url building never
  produces a double slash that stricter proxies can 404 on.

Tests: wrong-dimension rejection, inconsistent-window-dim error, int8
base64 dequant. Adjusted an existing input_type test whose 2-dim mock
response no longer matched its 1024 configured dimension.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gacy models

- L4: trim a trailing slash from the openai provider's base_url in New()
  so url building never emits a double slash that stricter
  OpenAI-compatible servers (vLLM/TEI behind a proxy) 404 on. Mirrors the
  same fix in the voyage provider.
- L5: migration 13 (indexed_with_model provider-prefix backfill) tested
  for the mere presence of a ':' to decide a row was already prefixed.
  A legacy Ollama-style model name like "nomic-embed-text:latest"
  contains a colon but is NOT prefixed, so it was wrongly skipped and
  left flagged "stale model" forever. Now tests for a known kind prefix
  (ollama:/openai:/voyage:) instead. NOTE: editing an unreleased
  migration in place — deployments that already applied v13 with the old
  logic won't reprocess, but the default model carries no colon so no
  shipped DB is affected.

Tests: openai trailing-slash path normalization; migration-13 colon-
bearing legacy row now gets the ollama: prefix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…03 gating tests

- M1: mustBeAdmin returns a nil authContext when CIX_AUTH_DISABLED=true
  (a legitimate deployment mode), and SwitchEmbeddingProvider then
  dereferenced user.User.ID for the audit field → handler-goroutine
  nil-panic. Resolve the actor through an `if user != nil` guard,
  matching the ac != nil pattern used elsewhere (tunnels.go).
- L2: add per-endpoint 403 gating tests for the three admin
  embedding-provider routes that lacked them (GET/PUT active, POST test),
  satisfying the project rule that a non-admin caller must be rejected on
  every new endpoint. Auth was already enforced (all four handlers gate
  on mustBeAdmin first); these tests pin it against regressions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cycle, supervisor

Three concurrency hazards surfaced by pre-PR review:

- H1: Restart reassigned the s.queue pointer with no lock while Embed*
  callers read it (acquire + defer-release) — an unsynchronized
  pointer read/write (data race) reachable via the dashboard
  "Save & Restart" during an active reindex. The queue pointer is now
  guarded by s.mu: acquireProvider snapshots the queue and returns it so
  Release targets the SAME instance the slot was taken from, Restart
  writes the new queue under the lock, and Status reads via a snapshot.
- M2: SwitchProvider and Restart had no mutual exclusion, so two admin
  actions could interleave their s.current / s.queue mutations or both
  tear down a provider. A new lifecycleMu serializes the two (Embed*
  paths don't take it).
- M3: supervisor.Stop read s.waiterDone bare while a crash-driven spawn()
  reassigns it under s.mu — data race, far more reachable now that
  SwitchProvider can Stop a provider at runtime. Stop now snapshots
  cmd + waiterDone together under the read lock.

Test: TestRestart_ConcurrentWithEmbeds_NoRace exercises the queue-swap
race under -race (6 embedders vs a restarter alternating the cap to force
pointer swaps). go test -race ./internal/embeddings/... green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gofmt the two ollama-package files whose field/map alignment drifted when
they were extracted/moved into provider/ollama on this branch (gguf.go
GGUFInputs struct, prefix.go QueryPrefixes map). Pure formatting, no
behavior change. Pre-existing gofmt drift in files this branch only
incidentally touches (e.g. vectorstore/store.go) is left alone to keep
the diff focused.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The dashboard 404'd ("project not found: hash=…") on local projects whose
host_path and stored path_hash legitimately diverge — e.g. a project keyed
as sha1("local:{machine}:{path}") while host_path stays the bare filesystem
path. The list/detail responses set PathHash by RE-deriving sha1(host_path),
which for those rows yields a hash that GetByHash (which resolves against
the stored path_hash column) never matches → 404.

host_path can't simply be rewritten to the namespaced key form: it is the
vector-store collection key (md5(host_path)) and the chunks/symbols key, so
changing it would orphan the existing index and force a reindex.

Fix: carry the stored path_hash column through projects.Project and return
it verbatim in both project responses (server.go ProjectSummary + Project),
instead of recomputing from host_path. The stored column is the single
source of truth the lookup already uses, so link hash == lookup hash for
local, external, and freshly-created projects alike. No data migration, no
reindex.

Pre-existing issue (per-machine namespacing + an incomplete manual re-key),
independent of the embedding-provider work; bundled here per request.

Test: a row whose stored path_hash differs from sha1(host_path) — Get/List
surface the stored hash and GetByHash round-trips it back to the project.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hand-rolled Sscanf-switch + printf-table for ~/.cix/config.yaml
with a tag-driven schema (`key`/`desc`/`default`/`validate`/`env`/`sensitive`
struct tags) that drives Load, Save, `config show`, `config set`, and a new
full-screen editor — one source of truth, no more drift across three files.

Loader: koanf/v2 (file+yaml + confmap defaults from tags + rawbytes legacy-
key normalization). Validator: go-playground/validator/v10, runs on every
mutation. Setter: reflective SetByPath with type-aware parsing and rollback
on validation failure. Show/keys: reflect over the schema walker — sensitive
leaves never bind their value to a named variable (CodeQL-safe).

Env overrides: CIX_SERVER / CIX_API_URL / CIX_API_KEY override server
selection / URL / key with flag > env > file precedence. Local-only —
never persisted. Designed for CI runners passing secrets via env.

TUI: new lazygit-style `cix config edit` and `cix config init` built on
bubbletea + bubbles + lipgloss — two-panel layout, vim-friendly nav
(h/j/k/l), inline textinput edits, server CRUD (a/d/m/t), help overlay.
Replaces the earlier huh-based form (which felt like a one-shot wizard,
not an editor).

On-disk format: legacy `api:` block still migrates to `servers:`. Legacy
lowercase viper keys (debouncems, excludepatterns, cachettl, autowatch,
batchsize) still normalize. The dead `server.port` / `server.cache_ttl`
block is dropped on next save — old files load without error.

New: `cix config keys` lists every settable key with default / env /
description from the schema. `doc/CLI_CONFIG.md` documents the full key
table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread cli/internal/config/loader_koanf.go Fixed
dvcdsys and others added 4 commits June 2, 2026 10:58
Add a "Servers" subsection to both copies of cix SKILL.md (skills/cix/
and plugins/cix/skills/cix/) and a "Targeting multiple cix servers"
block to plugins/cix/README.md. Explains the named-server / default_server
model, the --server flag, and the legacy api.url/api.key aliases, so
agents using the plugin know to use the default and only switch via
--server when the user names one.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oot resilience, gate

Addresses the full code review of feat/embedding-providers.

Blockers
- parity_test.go called the deleted Service.embedRaw, so `make test-gate`
  (the release embedding-parity gate) failed to compile. Restore via a
  build-tagged helper that reaches the live provider's EmbedRaw.
- SwitchProvider swapped the provider and the vector store as two
  non-atomic ops with the queue resumed in between, so a concurrent embed
  could write new-dimension vectors into the old provider's collection;
  on reopen failure it left a live new-provider / old-store pairing with
  no rollback. Now the queue stays blocked across both swaps (resumed via
  defer on every path) and a reopen failure rolls the provider swap back
  and stops the half-started provider — fail closed, never corrupt.
- Boot no longer hard-fails the whole server when the persisted provider's
  Start() (remote connect-test / sidecar spawn) fails. A bad build falls
  back to env ollama; a failed Start attaches the provider in a degraded,
  self-healing state. The HTTP server (dashboard/auth/all projects) stays
  up so an operator can recover instead of editing SQLite by hand.

Medium
- SwitchEmbeddingProvider now validates-then-persists: SwitchProvider runs
  first and only a successful swap is saved, so a transient switch failure
  can't leave the DB pointing at a provider that bricks the next boot.
- PrefixLegacyChromaDirs failure at boot is now fatal (was warn+continue),
  which silently orphaned the index behind a fresh empty namespace. Added
  vectorstore.WarnIfNamespaceOrphaned as a loud diagnostic when the active
  namespace is empty but sibling namespaces exist.
- GetSidecarStatus reported ready=false for healthy openai/voyage providers
  (StateRemote != "running"); treat StateRunning and StateRemote as ready.
- Restart now blocks the freshly-built queue during a concurrency-cap
  change so embeds get a clean 503 during the sidecar respawn instead of
  ErrSupervisor from a transiently-nil provider.

Minor / cleanup
- ollama TokenizeAndEmbed clamps windowSize to >= 1, guarding a CtxSize<=2
  misconfig that spun the split loop forever while holding a queue slot.
- Extract shared HTTP-provider plumbing (base-URL normalize, API-key
  resolve, Ready/Status) into provider/httpremote.go; openai and voyage
  now delegate instead of copy-pasting.

Tests: switch rollback + happy-path, namespace-orphan diagnostic. Full
suite, `go vet` (incl. embed_gate), and `-race` on embeddings + httpapi
all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ixes #8 class)

The flat single-slug scheme `chroma_<StorageSlug(id)>` flattened
provider+model into one string with `_` as the field separator — but `_`
is also what model-name normalization emits. That in-band separator made
the legacy-migration heuristic (`hasKnownPrefix`) ambiguous: a legacy
ollama model whose name normalized to `ollama_…`/`openai_…`/`voyage_…`
(e.g. `openai-community/x`) was misread as already-unified, skipped, and
its vectors silently orphaned behind a fresh empty namespace. Worse,
`ollama:foo` and a model literally named `ollama-foo` both produced the
exact same string — unfixable by any name parsing.

Make the separator structural instead: namespace the vector store as a
nested directory tree, one path segment per identity field:

    <ChromaPersistDir>/<kind>/<model-slug>[/<variant>]
    chroma/ollama/nomic_embed_text
    chroma/voyage/voyage_code_3/2048_int8
    chroma/openai/text_embedding_3_large/256

The provider kind is now its own directory level, derived from the kind
(not glued onto the model slug), so it can never collide with a model
name. `ollama:foo` → ollama/foo, `ollama:ollama-foo` → ollama/ollama_foo —
distinct. The #8 case resolves correctly: openai-community/x →
ollama/openai_community_x.

Changes
- provider.Provider gains StorageComponents() []string — each provider
  builds its path from its OWN structured fields (never by re-parsing the
  flattened ID), so the in-band-separator problem can't recur.
- config: ChromaDirForSlug(slug) → ChromaDirFor(components) (filepath.Join
  under the container).
- Service: StorageSlug() → StoragePath() []string; reopen + AttachVectorStore
  thread components.
- storage: PrefixLegacyChromaDirs (+ hasKnownPrefix prefix-guessing) →
  MigrateFlatChromaToNested. Now unambiguous: every flat `chroma_*` sibling
  is a legacy ollama dir (the old build was ollama-only) and moves to
  chroma/ollama/<StorageSlug(suffix)>/. StorageSlug(ModelSafeName(m)) ==
  StorageSlug(m), so it lands where the server resolves the identity — even
  for kind-looking model names. Existing main/prod ollama vectors migrate
  without a reindex. main.go relocates a legacy Python store at the
  container path first, then migrates (fail-closed).
- Drop vectorstore.WarnIfNamespaceOrphaned — the orphan ambiguity it
  warned about is gone, so the diagnostic is no longer needed.
- Drop the throwaway intermediate flat-unified scheme entirely (dev-branch
  only; no back-compat).

Tests: per-provider StorageComponents incl. the ollama foo vs ollama-foo
anti-collision; migration moves flat→nested incl. the kind-looking-name
(#8) regression, special-char normalization, idempotency, no-clobber,
python-backup skip; switch/reopen tests assert nested paths. Full suite,
`go vet` (incl. embed_gate), and `-race` all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… tests

Pre-merge review fixes for the pluggable-provider work:

- httpapi: reject kind=ollama on POST /admin/embedding-providers/{kind}/test.
  Start()ing a throw-away ollama provider would spawn a second llama-server
  child on the live sidecar's socket. The dashboard never tests ollama (it is
  configured via the runtime-config form), so fail fast with 400 instead of
  risking a competing sidecar from an ad-hoc admin call. Adds a gating test.

- embeddingscfg: add the previously-missing test suite for the provider
  persistence layer boot reads to pick the active provider — Get on a fresh
  DB, Save insert + update round-trips, empty-config, and the empty-kind /
  invalid-JSON rejections.

go vet ./... · go build ./... · go test ./internal/{embeddingscfg,httpapi}/... all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dvcdsys
Copy link
Copy Markdown
Owner Author

dvcdsys commented Jun 2, 2026

Code review summary

Reviewed focusing on the high-risk areas: provider abstraction, switch atomicity, concurrency, storage migrations, and endpoint authorization. The CLI config/TUI (cli/) and the React dashboard were only skimmed.

Verified locally on feat/embedding-providers: go build ./... ✅ · go vet ./internal/{embeddings,storage,httpapi,vectorstore}/... ✅ · targeted go test (storage / vectorstore / providers / httpapi / embeddingscfg) ✅ · go test -race ./internal/embeddings -run 'Switch|Restart|Provider' ✅ (0 races) · no stale Dynamic* path-helper refs in live code.

Strengths

  • Auth gating meets the project rule — all 4 admin endpoints gate on mustBeAdmin, with per-endpoint 403 tests for a viewer.
  • Concurrency is careful and race-testedlifecycleMu serializes switch/restart; the queue stays blocked across both the provider swap and the vector-store reopen, closing the wrong-dimension-write window.
  • Switch is atomic with rollback; a failed reopen rolls s.current back and stops the half-started provider (fail-closed).
  • Boot resilience — a failing remote connect-test degrades instead of bricking boot.
  • Secrets — only the env-var name is persisted; the key is read live.
  • Migrations are fail-closed, idempotent, with the kind-as-its-own-path-segment anti-collision design and regression tests.

Fixed pre-merge (commit 2a9e848)

  1. /admin/embedding-providers/{kind}/test now rejects kind=ollama (400). Start()ing a throw-away ollama provider would spawn a second llama-server child on the live sidecar's socket. The dashboard configures ollama via the runtime-config form, not /test. + gating test.
  2. Added the missing embeddingscfg test suite — the persistence layer boot reads to choose the active provider was previously untested. Covers Get/Save round-trips (insert + update), empty config, and empty-kind / invalid-JSON rejection.

Non-blocking follow-ups (optional)

  • embeddingscfg.Save is UPDATE-then-INSERT, not an atomic upsert. Two concurrent writers could both miss the UPDATE and race the INSERT (one fails on PK conflict). Non-corrupting and rare (admin actions); INSERT … ON CONFLICT(id) DO UPDATE would be race-free. (Confirmed runtimecfg writes only its own columns, so no embedding-column clobber.)
  • Degraded-boot inconsistency: if the persisted provider config is malformed, boot falls back to embeddings.New, which starts ollama and can still hard-fail boot (e.g. remote-only box with no GGUF) — slightly at odds with the "never brick boot" goal.
  • voyage.embedAndAverage: totalSplits is misnamed — it counts windows from oversize inputs (incremented by len(windows)), logged as split_windows. Functionally a >0 gate; cosmetic.
  • Averaged window vectors aren't re-normalized (voyage + ollama). Fine for cosine search (chromem normalizes); worth confirming no dot-product path relies on unit norm. Pre-existing pattern.
  • Holder.Swap discards the old *Store without Close — relies on chromem-go holding no file handles/goroutines (true today). A defensive Close-if-io.Closer would future-proof.
  • Scope: this PR also bundles the CLI config schema + TUI (d9742e8, ~20 files) and the projects path_hash fix (97fd9c4), which are unrelated to embedding providers and make review/bisect harder. Consider splitting such work next time.

Overall: solid, no blockers. 👍

CodeQL go/incorrect-integer-conversion (high) flagged
parseDefaultValue: strconv.ParseInt(raw, 10, 64) yields an int64 that
int(v) narrows, silently truncating on 32-bit builds. Parse with
bitSize 0 (platform int width) instead, so the value is bounded to int
range before conversion and an out-of-range tag errors (ok=false)
rather than wrapping.

Sibling parser in set.go is unaffected — it uses reflect.Value.SetInt
with an explicit OverflowInt guard and no narrowing conversion.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dvcdsys dvcdsys merged commit efc6e56 into main Jun 2, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants