Skip to content

Sanitize embedding text and surface request errors#26

Merged
jordanpartridge merged 2 commits into
mainfrom
fix/embedding-sanitize-and-errors
Jun 12, 2026
Merged

Sanitize embedding text and surface request errors#26
jordanpartridge merged 2 commits into
mainfrom
fix/embedding-sanitize-and-errors

Conversation

@jordanpartridge

Copy link
Copy Markdown
Contributor

Problem

OllamaEmbeddings/OpenAiEmbeddings swallowed every request failure (catch (RequestException) → return []), so consumers only ever saw a vague "failed to generate embedding" with no cause. The actual failure being masked: bge-large has a 512-token BERT context, and unicode-dense content (emoji, box-drawing trees, smart punctuation) byte-fallbacks into so many tokens that Ollama's truncation miscounts and rejects the request with HTTP 400 the input length exceeds the context length — while plain prose of 18KB embeds fine. Those characters carry no retrieval signal anyway.

Changes

  • EmbeddingTextSanitizer — normalizes text before vectorization: strips emoji/pictographs/dingbats/box-drawing/arrow blocks, deletes zero-width chars and variation selectors, maps smart punctuation to ASCII, collapses leftover spacing. Lossy by design — only the embed request sees the sanitized text, stored content is untouched.
  • OllamaEmbeddings — sanitizes by default (sanitize: true constructor flag to opt out).
  • OpenAiEmbeddings — same flag, default off (8k context, robust tokenizer).
  • EmbeddingRequestException — replaces the silent empty-array return on request failure; carries model, HTTP status, and the provider's error message (OpenAI nested error.message, Ollama flat error, raw body fallback for non-JSON responses).

Verification

  • Full suite passes at 100% coverage; pint + rector clean.
  • End-to-end against production Ollama (odin) with a real document that previously failed:
    • default: returns a 1024-dim vector
    • sanitize: false: throws Embedding request failed for model [bge-large] (HTTP 400): the input length exceeds the context length

Breaking change

embedBatch()/embed() now throw on request failure instead of returning empty arrays. Callers that branched on [] === $vector (e.g. conduit-ui/knowledge's QdrantService) will get the descriptive exception directly — which is the point — but should be updated when bumping.

Ollama embeddings silently returned empty arrays on failures, masking
errors like "input length exceeds the context length". Large unicode
content (emoji, pictographs, box-drawing chars) blows through BERT-style
tokenizers' context windows (bge-large: 512 tokens), triggering the truncation
check after token expansion—silent failure by default.

This commit:
- Adds EmbeddingTextSanitizer to strip emoji, pictographs, zero-width chars,
  and smart punctuation; collapses the extra spaces left behind. Ollama
  sanitizes by default; OpenAI opts out for backward compatibility.
- Adds EmbeddingRequestException to extract and surface provider errors
  (OpenAI nested error.message, Ollama flat error field, raw body fallback).
  Both embeddings clients now throw descriptively instead of silencing failures.
- Covers all new behavior in tests; suite passes at 100% coverage.

Verified end-to-end against production Ollama instance.

@lexi-chief-of-staff lexi-chief-of-staff Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it.

  • Breaking change is intentional and documented; callers that previously tested [] === $vector will now surface real errors.
  • Sanitizer applied only to the embed path, stored content untouched.
  • OpenAI keeps sanitize: false default (8k+ context), Ollama defaults true (512-token BERT).
  • Mock tests cover the new exception path, non-JSON fallback, and sanitization toggle.

nunomaduro/pao no longer publishes dev-main, which broke composer
resolution in CI. Swap to the maintained laravel/pao fork and bump
Pest to satisfy its version constraint.

@lexi-chief-of-staff lexi-chief-of-staff Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it.

  • Breaking change on failure path is explicit and documented; callers now get model + status + provider message instead of swallowing into [].
  • Sanitizer default-true only on Ollama (correct default given 512-token constraint); OpenAI stays opt-in.
  • Regression test in EmbeddingsTest asserts the new exception path and the 400 context-length message.
  • Dependency swap (nunomaduro/pao → laravel/pao, plus transitive upgrades) is isolated to composer.* and passes pint/rector.
  • No style or layout issues on the new sanitizer or exception classes.

@jordanpartridge jordanpartridge merged commit 2be8a18 into main Jun 12, 2026
2 checks passed
@jordanpartridge jordanpartridge deleted the fix/embedding-sanitize-and-errors branch June 12, 2026 02:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant