Skip to content

Phase 4: Embedding Ecosystem#146

Open
oss-amikos wants to merge 32 commits intomainfrom
feature/phase-04-embedding-ecosystem
Open

Phase 4: Embedding Ecosystem#146
oss-amikos wants to merge 32 commits intomainfrom
feature/phase-04-embedding-ecosystem

Conversation

@oss-amikos
Copy link
Copy Markdown
Contributor

Summary

Phase 4: Embedding Ecosystem
Goal: Expand the embedding ecosystem with sparse/content interfaces, reranking functions, additional dense providers, and an auto-wiring registry.
Status: Verified (UAT 9/9 passed), reviewed (5 rounds of automated review)

This PR adds a complete embedding ecosystem to the Java client: sparse and multimodal embedding interfaces, reranking support, three new dense providers, two sparse providers, and a central registry for auto-wiring providers from server-side collection config.

Changes

Plan 04-01: Embedding Foundation Interfaces

Sparse and multimodal content embedding interfaces with bidirectional adapters.

Key files created:

  • SparseEmbeddingFunction — interface for sparse vector providers (BM25, SPLADE)
  • ContentEmbeddingFunction — interface for multimodal embedding providers
  • Content, Part, BinarySource — value types for multimodal content
  • Modality, Intent — enums for content typing and embedding intent
  • TextEmbeddingAdapter, ContentToTextAdapter — bidirectional adapters

Plan 04-02: Reranking Interface and Providers

Reranking function interface with Cohere and Jina implementations.

Key files created:

  • RerankingFunction — interface for document reranking
  • RerankResult — immutable value type for ranked results
  • CohereRerankingFunction — Cohere v2/rerank endpoint
  • JinaRerankingFunction — Jina v1/rerank endpoint

Plan 04-03: Dense Embedding Providers

Three new dense embedding providers: Gemini, Bedrock, and Voyage.

Key files created:

  • GeminiEmbeddingFunction — via Google GenAI SDK (optional dependency)
  • BedrockEmbeddingFunction — via AWS SDK BedrockRuntime (optional dependency)
  • VoyageEmbeddingFunction — via OkHttp REST calls

Plan 04-04: Sparse Embedding Providers

BM25 local sparse embeddings and ChromaCloud SPLADE remote sparse embeddings.

Key files created:

  • BM25EmbeddingFunction — local tokenize → stem → hash → TF-IDF pipeline
  • BM25Tokenizer, Murmur3, BM25StopWords — pipeline components
  • ChromaCloudSpladeEmbeddingFunction — remote SPLADE via Chroma Cloud API

Plan 04-05: EmbeddingFunctionRegistry

Central registry with three factory maps (dense, sparse, content) and auto-wiring from collection config.

Key files created:

  • EmbeddingFunctionRegistry — singleton registry with provider factory maps
  • UnsupportedEmbeddingProviderException — typed exception for unknown providers

Key files modified:

  • EmbeddingFunctionResolver — now delegates to registry instead of hardcoded dispatch

Requirements Addressed

  • EMB-05: Sparse embedding interfaces and providers
  • EMB-06: Multimodal content embedding interfaces
  • EMB-07: Additional dense embedding providers (Gemini, Bedrock, Voyage)
  • EMB-08: Embedding function registry with auto-wiring
  • RERANK-01: Reranking interface and providers

Verification

  • UAT: 9/9 tests passed
  • 5 rounds of automated code review (code quality, error handling, type design, test coverage)
  • All 114+ unit tests pass
  • Java 8 compatibility maintained
  • Thread-safe BM25Tokenizer and EmbeddingFunctionRegistry
  • Defensive response DTO parsing with index-context error messages
  • Input validation on all public API methods (null query/documents, missing API keys)

Test Plan

  • CI passes (build + all tests)
  • Verify optional dependencies (Gemini, Bedrock) don't break builds without their SDKs
  • Spot-check: BM25 embedding produces deterministic sparse vectors
  • Spot-check: Registry resolves built-in providers by name

All 20 open milestone 0.2.0 issues now have roadmap coverage.
Investigates BM25 tokenizer pipeline, ContentEmbeddingFunction interface,
EmbeddingFunctionRegistry pattern, Gemini/Bedrock/Voyage providers, and
Cohere/Jina reranking — with verified Maven coordinates and API wire formats.
…and adapters

- SparseEmbeddingFunction interface returning SparseVector (separate from EmbeddingFunction)
- ContentEmbeddingFunction interface with embedContents(List<Content>) for multimodal support
- Content/Part/BinarySource value types with static factories and builder pattern
- Modality and Intent enums with getValue()/fromValue()
- TextEmbeddingAdapter: wraps text-only EF as ContentEmbeddingFunction
- ContentToTextAdapter: wraps ContentEmbeddingFunction as text-only EF
- RerankingFunction interface with rerank(query, documents) returning List<RerankResult>
- RerankResult immutable value type with index, score, equals/hashCode/toString
- CohereRerankingFunction targeting Cohere v2/rerank endpoint with WithParam config
- JinaRerankingFunction targeting Jina v1/rerank endpoint with WithParam config
- Both providers sort results by descending relevance score
- Package-private static DEFAULT_BASE_API for WireMock test injection
…nd content types

- TestSparseEmbeddingFunction: 2 tests for embedQuery and embedDocuments
- TestContentEmbeddingFunction: 6 tests for default method, fromTextOnly, adapters
- TestContentTypes: 15 tests for Content, Part, BinarySource, Modality, Intent
- TestRerankResult: value type getters, equality, inequality, toString
- TestCohereRerankingFunction: success flow, 401 auth failure, auth header verification
- TestJinaRerankingFunction: success flow, 500 server error, model in request body
- All tests use WireMock with dynamic ports and WithParam.baseAPI for URL injection
- SUMMARY.md with 2 tasks, 12 files, 23 unit tests
- STATE.md updated with decisions and metrics
- ROADMAP.md progress updated
- Requirements EMB-05, EMB-06 marked complete
- GeminiEmbeddingFunction using Google GenAI SDK (optional dep)
- BedrockEmbeddingFunction using AWS SDK BedrockRuntime (optional dep)
- VoyageEmbeddingFunction using OkHttp REST calls (existing dep)
- All three follow WithParam configuration pattern
- Google GenAI and AWS SDK declared as optional Maven dependencies
- SUMMARY.md with execution results and deviation documentation
- STATE.md updated with metrics, decisions, and session info
- ROADMAP.md updated with plan progress
- REQUIREMENTS.md: RERANK-01 marked complete
- TestGeminiEmbeddingFunction: 5 tests (construction, default model, env key)
- TestBedrockEmbeddingFunction: 5 tests (construction, default model, custom region)
- TestVoyageEmbeddingFunction: 6 WireMock tests (embed docs/query, input_type, auth header, error)
- Fix Jackson version conflict: align jackson-core/annotations to 2.17.2 in dependencyManagement
- SUMMARY.md with 2 task commits, 2 deviations documented
- STATE.md updated with decisions and metrics
- ROADMAP.md updated with plan progress
- EMB-07 requirement marked complete
…g-ecosystem

# Conflicts:
#	.planning/STATE.md
…g-ecosystem

# Conflicts:
#	.planning/STATE.md
…g-ecosystem

# Conflicts:
#	.planning/REQUIREMENTS.md
#	.planning/STATE.md
- Add Murmur3 x86 32-bit inline hash (no Guava dependency)
- Add BM25StopWords with 174 NLTK English stop words matching Go client
- Add BM25Tokenizer with lowercase/split/stopword/stem pipeline using Snowball
- Add BM25EmbeddingFunction implementing SparseEmbeddingFunction with K=1.2, B=0.75
- Add snowball-stemmer 1.3.0.581.1 dependency to pom.xml
… Splade

- Add ChromaCloudSpladeEmbeddingFunction implementing SparseEmbeddingFunction
- Add CreateSparseEmbeddingRequest/Response DTOs for Chroma Cloud Splade API
- Add TestMurmur3 with 6 tests including known hash vectors
- Add TestBM25Tokenizer with 8 tests (stop words, stemming, edge cases)
- Add TestBM25EmbeddingFunction with 8 tests (sorted indices, determinism, custom params)
- Add TestChromaCloudSpladeEmbeddingFunction with 6 WireMock tests (auth, errors, multi-doc)
- Summary: BM25 local + ChromaCloudSplade remote sparse embedding providers
- 28 tests passing, 2 auto-fixed deviations documented
- STATE.md, ROADMAP.md, REQUIREMENTS.md updated
- 14 test methods covering singleton, dense/sparse/content resolution,
  custom registration, thread safety, and content fallback chain
- Dense providers: default, openai, cohere, huggingface/hf, ollama, google_genai, amazon_bedrock, voyageai
- Sparse providers: chroma_bm25, bm25 (alias), chromacloud_splade
- Content resolution: content factory first, then dense + TextEmbeddingAdapter fallback
- Thread-safe synchronized register/resolve methods
- EmbeddingFunctionResolver.resolve() delegates to registry
- buildParams/buildHuggingFaceParams made public for registry access
…nctionResolver

EmbeddingFunctionResolver.resolve() now checks spec.isKnownType() and wraps
registry exceptions with messages matching the pre-refactor API contract.
Updates ChromaHttpCollectionTest assertion to match new registry error format.
@oss-amikos oss-amikos requested a review from tazarov as a code owner April 2, 2026 07:03
The SchemaAndQueryTextsIntegrationTest asserted an exact error message
that changed when EmbeddingFunctionResolver was refactored to use
EmbeddingFunctionRegistry. Relaxed the assertion to check for the
provider name rather than the full message wording.
@tazarov
Copy link
Copy Markdown
Contributor

tazarov commented Apr 2, 2026

@claude PTAL

@claude
Copy link
Copy Markdown

claude bot commented Apr 2, 2026

Claude finished @tazarov's task in 2m 6s —— View job


PR Review: Phase 4 — Embedding Ecosystem

  • Review embedding foundation interfaces (SparseEmbeddingFunction, ContentEmbeddingFunction, adapters, content types)
  • Review reranking interface and providers (Cohere, Jina)
  • Review dense embedding providers (Gemini, Bedrock, Voyage)
  • Review sparse embedding providers (BM25, ChromaCloud SPLADE)
  • Review EmbeddingFunctionRegistry and resolver changes
  • Review tests
  • Review pom.xml dependency changes
  • Post final review summary

Overall this is a well-structured, comprehensive PR. The interface design is clean, the value types are properly immutable, input validation is thorough with good error messages, and the test coverage is solid with WireMock-based HTTP testing. Here are my findings:

Issues

1. ContentToTextAdapter.embedDocuments(String[]) — null delegates to null list (minor bug)

ContentToTextAdapter.java:64-66: When documents is String[] null, the method calls embedDocuments((List<String>) null), which will throw IllegalArgumentException("documents must not be null"). This works, but the behavior is inconsistent — passing null to the array overload gives a different exception type (IllegalArgumentException) than what the validation in embedDocuments(List) would produce if it were reached directly. This pattern is repeated in GeminiEmbeddingFunction, BedrockEmbeddingFunction, and VoyageEmbeddingFunction, where the list overload throws ChromaException but the null array path also throws IllegalArgumentException. Not a functional bug since null is rejected either way, but the exception type is inconsistent.

2. Cohere/Jina reranking implementations are nearly identical (design)

CohereRerankingFunction.java and JinaRerankingFunction.java are ~95% identical — same structure, same error handling, same sorting. The only differences are the provider name in error messages, the default URL, and model name. Consider extracting a shared AbstractRerankingFunction base class to reduce duplication (~100 lines of shared code).

Extract shared reranking base class →

3. Similarly, RerankRequest and RerankResponse are duplicated

tech.amikos.chromadb.reranking.cohere.RerankRequest and tech.amikos.chromadb.reranking.jina.RerankRequest are byte-for-byte identical (same for RerankResponse). These could be shared in the parent reranking package.

4. GeminiEmbeddingFunction embeds documents one-at-a-time (performance)

GeminiEmbeddingFunction.java:115-121: embedDocuments loops and calls client.models.embedContent() once per document, making N HTTP calls for N documents. If the Gemini SDK supports batch embedding, this would be a significant performance improvement. Worth checking if the SDK has a batch endpoint.

5. BedrockEmbeddingFunction also embeds one-at-a-time

BedrockEmbeddingFunction.java:144-159: Same per-document loop pattern. Titan Embed v2 supports batch via the texts field (list of strings). Using this would reduce API calls from N to 1.

6. BM25Tokenizer creates a new stemmer per tokenize() call

BM25Tokenizer.java:63: A new englishStemmer is allocated on every call to tokenize(). Since the class documents itself as thread-safe, making the stemmer a ThreadLocal<englishStemmer> would avoid per-call allocation while maintaining thread safety.

7. Snowball stemmer pulled in as a compile-scope dependency

pom.xml:155-158: com.github.rholder:snowball-stemmer is added as a compile dependency. Since BM25 is a sparse embedding feature that not all users need, consider making this <optional>true</optional> like Gemini/Bedrock, or at minimum document this in the PR description as a new required transitive dependency.

8. CreateSparseEmbeddingRequest.json() method is misleading

CreateSparseEmbeddingRequest.java:29-31: The json() method returns a MediaType, not JSON. This is confusing — it's never actually used (the class has a separate toJson() method that returns the JSON string). This looks like dead code left over from copy-paste from the Voyage request DTO pattern.

Remove dead json() method →

9. New Gson() allocations in request DTOs

RerankRequest.json() (both Cohere and Jina), CreateSparseEmbeddingRequest.toJson(), and CreateEmbeddingRequest.json() all do new Gson().toJson(this). Gson instances are thread-safe and reusable — a static field would avoid repeated allocation.

10. OkHttpClient instances are not shared

Each provider class (VoyageEmbeddingFunction, CohereRerankingFunction, JinaRerankingFunction, ChromaCloudSpladeEmbeddingFunction) creates its own new OkHttpClient(). OkHttp recommends sharing a single client to share the connection pool and thread pools. Since users may create multiple provider instances, this could lead to resource waste. Consider accepting an OkHttpClient in the constructor or using a shared static instance.

Minor / Nit

  • BM25EmbeddingFunction.java:28: Charset.forName("UTF-8") with comment about Java 8 compatibility — StandardCharsets.UTF_8 is available since Java 7, so the comment is misleading. Both work fine on Java 8.
  • Modality.java:28: After calling toLowerCase(Locale.ROOT), the regex [a-zA-Z0-9] uppercase check is redundant since input is already lowered. Very minor.
  • EmbeddingFunctionRegistry.java:102: The LinkedHashMap diamond operator <> is used in a few places — since this targets Java 8, that's fine, but some earlier code in this same PR uses explicit type parameters (e.g., new LinkedHashMap<String, DenseFactory>()). Minor inconsistency.

What Looks Good

  • Interface design: SparseEmbeddingFunction, ContentEmbeddingFunction, and RerankingFunction are clean, minimal interfaces. The bidirectional adapter pattern (TextEmbeddingAdapter / ContentToTextAdapter) is well thought out.
  • Content model: Content, Part, BinarySource are properly immutable with defensive copies. Builder validation is solid.
  • Registry pattern: EmbeddingFunctionRegistry with three factory maps, case-insensitive name resolution, and graceful handling of optional SDK dependencies via Class.forName checks is well designed.
  • Error messages: Consistently include model name and index context, making debugging straightforward.
  • Test coverage: Good use of WireMock for HTTP-based providers, thread safety tests for the registry, and thorough edge case coverage (null inputs, empty inputs, malformed responses).
  • BM25 implementation: Correct Murmur3, proper cross-client compatibility with Go/Python, sorted sparse vector output.

Summary

The PR is in good shape architecturally. The main actionable items are:

  1. Remove the dead CreateSparseEmbeddingRequest.json() method (clear bug/leftover)
  2. Consider making snowball-stemmer optional in pom.xml to avoid forcing a new transitive dep on all users
  3. Extract shared reranking base to reduce code duplication (optional but recommended)

Everything else is minor or a future optimization opportunity. The interfaces, value types, and registry are well designed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants