Sanitize embedding text and surface request errors by jordanpartridge · Pull Request #26 · the-shit/vector

jordanpartridge · 2026-06-11T20:19:49Z

Problem

OllamaEmbeddings/OpenAiEmbeddings swallowed every request failure (catch (RequestException) → return []), so consumers only ever saw a vague "failed to generate embedding" with no cause. The actual failure being masked: bge-large has a 512-token BERT context, and unicode-dense content (emoji, box-drawing trees, smart punctuation) byte-fallbacks into so many tokens that Ollama's truncation miscounts and rejects the request with HTTP 400 the input length exceeds the context length — while plain prose of 18KB embeds fine. Those characters carry no retrieval signal anyway.

Changes

EmbeddingTextSanitizer — normalizes text before vectorization: strips emoji/pictographs/dingbats/box-drawing/arrow blocks, deletes zero-width chars and variation selectors, maps smart punctuation to ASCII, collapses leftover spacing. Lossy by design — only the embed request sees the sanitized text, stored content is untouched.
OllamaEmbeddings — sanitizes by default (sanitize: true constructor flag to opt out).
OpenAiEmbeddings — same flag, default off (8k context, robust tokenizer).
EmbeddingRequestException — replaces the silent empty-array return on request failure; carries model, HTTP status, and the provider's error message (OpenAI nested error.message, Ollama flat error, raw body fallback for non-JSON responses).

Verification

Full suite passes at 100% coverage; pint + rector clean.
End-to-end against production Ollama (odin) with a real document that previously failed:
- default: returns a 1024-dim vector
- sanitize: false: throws Embedding request failed for model [bge-large] (HTTP 400): the input length exceeds the context length

Breaking change

embedBatch()/embed() now throw on request failure instead of returning empty arrays. Callers that branched on [] === $vector (e.g. conduit-ui/knowledge's QdrantService) will get the descriptive exception directly — which is the point — but should be updated when bumping.

Ollama embeddings silently returned empty arrays on failures, masking errors like "input length exceeds the context length". Large unicode content (emoji, pictographs, box-drawing chars) blows through BERT-style tokenizers' context windows (bge-large: 512 tokens), triggering the truncation check after token expansion—silent failure by default. This commit: - Adds EmbeddingTextSanitizer to strip emoji, pictographs, zero-width chars, and smart punctuation; collapses the extra spaces left behind. Ollama sanitizes by default; OpenAI opts out for backward compatibility. - Adds EmbeddingRequestException to extract and surface provider errors (OpenAI nested error.message, Ollama flat error field, raw body fallback). Both embeddings clients now throw descriptively instead of silencing failures. - Covers all new behavior in tests; suite passes at 100% coverage. Verified end-to-end against production Ollama instance.

lexi-chief-of-staff

Ship it.

Breaking change is intentional and documented; callers that previously tested [] === $vector will now surface real errors.
Sanitizer applied only to the embed path, stored content untouched.
OpenAI keeps sanitize: false default (8k+ context), Ollama defaults true (512-token BERT).
Mock tests cover the new exception path, non-JSON fallback, and sanitization toggle.

nunomaduro/pao no longer publishes dev-main, which broke composer resolution in CI. Swap to the maintained laravel/pao fork and bump Pest to satisfy its version constraint.

lexi-chief-of-staff

Ship it.

Breaking change on failure path is explicit and documented; callers now get model + status + provider message instead of swallowing into [].
Sanitizer default-true only on Ollama (correct default given 512-token constraint); OpenAI stays opt-in.
Regression test in EmbeddingsTest asserts the new exception path and the 400 context-length message.
Dependency swap (nunomaduro/pao → laravel/pao, plus transitive upgrades) is isolated to composer.* and passes pint/rector.
No style or layout issues on the new sanitizer or exception classes.

lexi-chief-of-staff Bot approved these changes Jun 11, 2026

View reviewed changes

chore: replace abandoned nunomaduro/pao with laravel/pao

5f64876

nunomaduro/pao no longer publishes dev-main, which broke composer resolution in CI. Swap to the maintained laravel/pao fork and bump Pest to satisfy its version constraint.

lexi-chief-of-staff Bot approved these changes Jun 12, 2026

View reviewed changes

jordanpartridge merged commit 2be8a18 into main Jun 12, 2026
2 checks passed

jordanpartridge deleted the fix/embedding-sanitize-and-errors branch June 12, 2026 02:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Sanitize embedding text and surface request errors#26

Sanitize embedding text and surface request errors#26
jordanpartridge merged 2 commits into
mainfrom
fix/embedding-sanitize-and-errors

jordanpartridge commented Jun 11, 2026

Uh oh!

lexi-chief-of-staff Bot left a comment

Uh oh!

lexi-chief-of-staff Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jordanpartridge commented Jun 11, 2026

Problem

Changes

Verification

Breaking change

Uh oh!

lexi-chief-of-staff Bot left a comment

Choose a reason for hiding this comment

Uh oh!

lexi-chief-of-staff Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant