Sanitize embedding text and surface request errors#26
Merged
Conversation
Ollama embeddings silently returned empty arrays on failures, masking errors like "input length exceeds the context length". Large unicode content (emoji, pictographs, box-drawing chars) blows through BERT-style tokenizers' context windows (bge-large: 512 tokens), triggering the truncation check after token expansion—silent failure by default. This commit: - Adds EmbeddingTextSanitizer to strip emoji, pictographs, zero-width chars, and smart punctuation; collapses the extra spaces left behind. Ollama sanitizes by default; OpenAI opts out for backward compatibility. - Adds EmbeddingRequestException to extract and surface provider errors (OpenAI nested error.message, Ollama flat error field, raw body fallback). Both embeddings clients now throw descriptively instead of silencing failures. - Covers all new behavior in tests; suite passes at 100% coverage. Verified end-to-end against production Ollama instance.
There was a problem hiding this comment.
Ship it.
- Breaking change is intentional and documented; callers that previously tested
[] === $vectorwill now surface real errors. - Sanitizer applied only to the embed path, stored content untouched.
- OpenAI keeps
sanitize: falsedefault (8k+ context), Ollama defaultstrue(512-token BERT). - Mock tests cover the new exception path, non-JSON fallback, and sanitization toggle.
nunomaduro/pao no longer publishes dev-main, which broke composer resolution in CI. Swap to the maintained laravel/pao fork and bump Pest to satisfy its version constraint.
There was a problem hiding this comment.
Ship it.
- Breaking change on failure path is explicit and documented; callers now get model + status + provider message instead of swallowing into [].
- Sanitizer default-true only on Ollama (correct default given 512-token constraint); OpenAI stays opt-in.
- Regression test in EmbeddingsTest asserts the new exception path and the 400 context-length message.
- Dependency swap (nunomaduro/pao → laravel/pao, plus transitive upgrades) is isolated to composer.* and passes pint/rector.
- No style or layout issues on the new sanitizer or exception classes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
OllamaEmbeddings/OpenAiEmbeddingsswallowed every request failure (catch (RequestException) → return []), so consumers only ever saw a vague "failed to generate embedding" with no cause. The actual failure being masked: bge-large has a 512-token BERT context, and unicode-dense content (emoji, box-drawing trees, smart punctuation) byte-fallbacks into so many tokens that Ollama's truncation miscounts and rejects the request with HTTP 400the input length exceeds the context length— while plain prose of 18KB embeds fine. Those characters carry no retrieval signal anyway.Changes
EmbeddingTextSanitizer— normalizes text before vectorization: strips emoji/pictographs/dingbats/box-drawing/arrow blocks, deletes zero-width chars and variation selectors, maps smart punctuation to ASCII, collapses leftover spacing. Lossy by design — only the embed request sees the sanitized text, stored content is untouched.OllamaEmbeddings— sanitizes by default (sanitize: trueconstructor flag to opt out).OpenAiEmbeddings— same flag, default off (8k context, robust tokenizer).EmbeddingRequestException— replaces the silent empty-array return on request failure; carries model, HTTP status, and the provider's error message (OpenAI nestederror.message, Ollama flaterror, raw body fallback for non-JSON responses).Verification
sanitize: false: throwsEmbedding request failed for model [bge-large] (HTTP 400): the input length exceeds the context lengthBreaking change
embedBatch()/embed()now throw on request failure instead of returning empty arrays. Callers that branched on[] === $vector(e.g. conduit-ui/knowledge'sQdrantService) will get the descriptive exception directly — which is the point — but should be updated when bumping.