All 20 open milestone 0.2.0 issues now have roadmap coverage.
Investigates BM25 tokenizer pipeline, ContentEmbeddingFunction interface, EmbeddingFunctionRegistry pattern, Gemini/Bedrock/Voyage providers, and Cohere/Jina reranking — with verified Maven coordinates and API wire formats.
…and adapters
- SparseEmbeddingFunction interface returning SparseVector (separate from EmbeddingFunction)
- ContentEmbeddingFunction interface with embedContents(List<Content>) for multimodal support
- Content/Part/BinarySource value types with static factories and builder pattern
- Modality and Intent enums with getValue()/fromValue()
- TextEmbeddingAdapter: wraps text-only EF as ContentEmbeddingFunction
- ContentToTextAdapter: wraps ContentEmbeddingFunction as text-only EF

- RerankingFunction interface with rerank(query, documents) returning List<RerankResult>
- RerankResult immutable value type with index, score, equals/hashCode/toString
- CohereRerankingFunction targeting Cohere v2/rerank endpoint with WithParam config
- JinaRerankingFunction targeting Jina v1/rerank endpoint with WithParam config
- Both providers sort results by descending relevance score
- Package-private static DEFAULT_BASE_API for WireMock test injection

…nd content types
- TestSparseEmbeddingFunction: 2 tests for embedQuery and embedDocuments
- TestContentEmbeddingFunction: 6 tests for default method, fromTextOnly, adapters
- TestContentTypes: 15 tests for Content, Part, BinarySource, Modality, Intent

- TestRerankResult: value type getters, equality, inequality, toString
- TestCohereRerankingFunction: success flow, 401 auth failure, auth header verification
- TestJinaRerankingFunction: success flow, 500 server error, model in request body
- All tests use WireMock with dynamic ports and WithParam.baseAPI for URL injection

- SUMMARY.md with 2 tasks, 12 files, 23 unit tests
- STATE.md updated with decisions and metrics
- ROADMAP.md progress updated
- Requirements EMB-05, EMB-06 marked complete

- GeminiEmbeddingFunction using Google GenAI SDK (optional dep)
- BedrockEmbeddingFunction using AWS SDK BedrockRuntime (optional dep)
- VoyageEmbeddingFunction using OkHttp REST calls (existing dep)
- All three follow WithParam configuration pattern
- Google GenAI and AWS SDK declared as optional Maven dependencies

- SUMMARY.md with execution results and deviation documentation
- STATE.md updated with metrics, decisions, and session info
- ROADMAP.md updated with plan progress
- REQUIREMENTS.md: RERANK-01 marked complete

- TestGeminiEmbeddingFunction: 5 tests (construction, default model, env key)
- TestBedrockEmbeddingFunction: 5 tests (construction, default model, custom region)
- TestVoyageEmbeddingFunction: 6 WireMock tests (embed docs/query, input_type, auth header, error)
- Fix Jackson version conflict: align jackson-core/annotations to 2.17.2 in dependencyManagement

- SUMMARY.md with 2 task commits, 2 deviations documented
- STATE.md updated with decisions and metrics
- ROADMAP.md updated with plan progress
- EMB-07 requirement marked complete
…g-ecosystem # Conflicts: # .planning/STATE.md
…g-ecosystem # Conflicts: # .planning/STATE.md
…g-ecosystem # Conflicts: # .planning/REQUIREMENTS.md # .planning/STATE.md
- Add Murmur3 x86 32-bit inline hash (no Guava dependency)
- Add BM25StopWords with 174 NLTK English stop words matching Go client
- Add BM25Tokenizer with lowercase/split/stopword/stem pipeline using Snowball
- Add BM25EmbeddingFunction implementing SparseEmbeddingFunction with K=1.2, B=0.75
- Add snowball-stemmer 1.3.0.581.1 dependency to pom.xml

… Splade
- Add ChromaCloudSpladeEmbeddingFunction implementing SparseEmbeddingFunction
- Add CreateSparseEmbeddingRequest/Response DTOs for Chroma Cloud Splade API
- Add TestMurmur3 with 6 tests including known hash vectors
- Add TestBM25Tokenizer with 8 tests (stop words, stemming, edge cases)
- Add TestBM25EmbeddingFunction with 8 tests (sorted indices, determinism, custom params)
- Add TestChromaCloudSpladeEmbeddingFunction with 6 WireMock tests (auth, errors, multi-doc)

- Summary: BM25 local + ChromaCloudSplade remote sparse embedding providers
- 28 tests passing, 2 auto-fixed deviations documented
- STATE.md, ROADMAP.md, REQUIREMENTS.md updated

- 14 test methods covering singleton, dense/sparse/content resolution, custom registration, thread safety, and content fallback chain

- Dense providers: default, openai, cohere, huggingface/hf, ollama, google_genai, amazon_bedrock, voyageai
- Sparse providers: chroma_bm25, bm25 (alias), chromacloud_splade
- Content resolution: content factory first, then dense + TextEmbeddingAdapter fallback
- Thread-safe synchronized register/resolve methods
- EmbeddingFunctionResolver.resolve() delegates to registry
- buildParams/buildHuggingFaceParams made public for registry access

…nctionResolver
EmbeddingFunctionResolver.resolve() now checks spec.isKnownType() and wraps registry exceptions with messages matching the pre-refactor API contract. Updates ChromaHttpCollectionTest assertion to match new registry error format.
The SchemaAndQueryTextsIntegrationTest asserted an exact error message that changed when EmbeddingFunctionResolver was refactored to use EmbeddingFunctionRegistry. Relaxed the assertion to check for the provider name rather than the full message wording.
@claude PTAL
Claude finished @tazarov's task in 2m 6s
PR Review: Phase 4: Embedding Ecosystem
Overall this is a well-structured, comprehensive PR. The interface design is clean, the value types are properly immutable, input validation is thorough with good error messages, and the test coverage is solid with WireMock-based HTTP testing. Here are my findings:
Issues
1.
2. Cohere/Jina reranking implementations are nearly identical (design)
Extract shared reranking base class →
3. Similarly,
4.
5.
6.
7. Snowball stemmer pulled in as a compile-scope dependency
8.
9. New
10. OkHttpClient instances are not shared
Each provider class (
Minor / Nit
What Looks Good
Summary
The PR is in good shape architecturally. The main actionable items are:
Everything else is minor or a future optimization opportunity. The interfaces, value types, and registry are well designed.
Summary
Phase 4: Embedding Ecosystem
Goal: Expand the embedding ecosystem with sparse/content interfaces, reranking functions, additional dense providers, and an auto-wiring registry.
Status: Verified (UAT 9/9 passed), reviewed (5 rounds of automated review)
This PR adds a complete embedding ecosystem to the Java client: sparse and multimodal embedding interfaces, reranking support, three new dense providers, two sparse providers, and a central registry for auto-wiring providers from server-side collection config.
Changes
Plan 04-01: Embedding Foundation Interfaces
Sparse and multimodal content embedding interfaces with bidirectional adapters.
Key files created:
- SparseEmbeddingFunction: interface for sparse vector providers (BM25, SPLADE)
- ContentEmbeddingFunction: interface for multimodal embedding providers
- Content, Part, BinarySource: value types for multimodal content
- Modality, Intent: enums for content typing and embedding intent
- TextEmbeddingAdapter, ContentToTextAdapter: bidirectional adapters
Plan 04-02: Reranking Interface and Providers
Reranking function interface with Cohere and Jina implementations.
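To make the reranking contract concrete, here is a minimal, self-contained sketch of the shape described in this PR: a `rerank(query, documents)` method returning results sorted by descending score, each carrying the index of the original document. The nested types only mirror the names used in the PR. `RerankSketch` and `TermOverlapReranker` are hypothetical, and the term-overlap scoring is a toy stand-in for the remote Cohere/Jina calls, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the source code added by this PR.
final class RerankSketch {

    /** Immutable (index, score) pair, mirroring the described RerankResult. */
    public static final class RerankResult {
        private final int index;
        private final double score;

        public RerankResult(int index, double score) {
            this.index = index;
            this.score = score;
        }

        public int getIndex() { return index; }
        public double getScore() { return score; }

        @Override
        public String toString() {
            return "RerankResult{index=" + index + ", score=" + score + "}";
        }
    }

    /** Shape of the interface as described: query plus documents in, ranked results out. */
    public interface RerankingFunction {
        List<RerankResult> rerank(String query, List<String> documents);
    }

    /** Toy reranker (assumption): score = number of query terms contained in the document. */
    public static final class TermOverlapReranker implements RerankingFunction {
        @Override
        public List<RerankResult> rerank(String query, List<String> documents) {
            String[] terms = query.toLowerCase(Locale.ROOT).split("\\s+");
            List<RerankResult> results = new ArrayList<>();
            for (int i = 0; i < documents.size(); i++) {
                String doc = documents.get(i).toLowerCase(Locale.ROOT);
                int hits = 0;
                for (String term : terms) {
                    if (!term.isEmpty() && doc.contains(term)) {
                        hits++;
                    }
                }
                results.add(new RerankResult(i, hits));
            }
            // Sort by descending relevance score; List.sort is stable, so ties keep input order.
            results.sort((a, b) -> Double.compare(b.getScore(), a.getScore()));
            return Collections.unmodifiableList(results);
        }
    }
}
```

Callers map each result back to its document through `getIndex()`, so the input list itself is never reordered.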
Key files created:
- RerankingFunction: interface for document reranking
- RerankResult: immutable value type for ranked results
- CohereRerankingFunction: Cohere v2/rerank endpoint
- JinaRerankingFunction: Jina v1/rerank endpoint
Plan 04-03: Dense Embedding Providers
Three new dense embedding providers: Gemini, Bedrock, and Voyage.
Key files created:
- GeminiEmbeddingFunction: via Google GenAI SDK (optional dependency)
- BedrockEmbeddingFunction: via AWS SDK BedrockRuntime (optional dependency)
- VoyageEmbeddingFunction: via OkHttp REST calls
Plan 04-04: Sparse Embedding Providers
BM25 local sparse embeddings and ChromaCloud SPLADE remote sparse embeddings.
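As a rough illustration of the local pipeline, the sketch below lowercases and splits text, drops stop words, hashes each token with Murmur3 x86 32-bit, and applies BM25 term-frequency saturation with K=1.2 and B=0.75. Several details are assumptions that the PR text does not state: the fixed average document length, the six-word stop list standing in for the 174-word NLTK list, the masking of hashes to non-negative indices, and the omission of both Snowball stemming and any IDF factor. `Bm25Sketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; not the BM25EmbeddingFunction added by this PR.
final class Bm25Sketch {
    static final double K = 1.2;
    static final double B = 0.75;
    static final double AVG_DOC_LEN = 256.0; // assumption: fixed average document length
    static final Set<String> STOP_WORDS =    // tiny stand-in for the NLTK list
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    /** Murmur3 x86 32-bit over a byte array. */
    public static int murmur3(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h = seed;
        int blocks = data.length / 4;
        for (int i = 0; i < blocks; i++) {
            int k = (data[4 * i] & 0xff) | ((data[4 * i + 1] & 0xff) << 8)
                    | ((data[4 * i + 2] & 0xff) << 16) | ((data[4 * i + 3] & 0xff) << 24);
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }
        int tail = blocks * 4;
        int k = 0;
        switch (data.length & 3) { // remaining 1-3 bytes; cases fall through on purpose
            case 3: k ^= (data[tail + 2] & 0xff) << 16;
            case 2: k ^= (data[tail + 1] & 0xff) << 8;
            case 1: k ^= (data[tail] & 0xff);
                k *= c1;
                k = Integer.rotateLeft(k, 15);
                k *= c2;
                h ^= k;
            default: break;
        }
        h ^= data.length; // finalization mix
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Returns hashed-token index -> weight, ordered by ascending index. */
    public static TreeMap<Integer, Double> embed(String text) {
        Map<Integer, Integer> termFreq = new TreeMap<>();
        int docLen = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            // Stemming would happen here in the described pipeline; skipped in this sketch.
            int index = murmur3(token.getBytes(StandardCharsets.UTF_8), 0) & 0x7fffffff;
            termFreq.merge(index, 1, Integer::sum);
            docLen++;
        }
        TreeMap<Integer, Double> weights = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : termFreq.entrySet()) {
            double tf = e.getValue();
            double denom = tf + K * (1 - B + B * docLen / AVG_DOC_LEN);
            weights.put(e.getKey(), tf * (K + 1) / denom);
        }
        return weights;
    }
}
```

Because tokens are hashed rather than looked up in a vocabulary, the embedding is deterministic without any fitted state. The `TreeMap` keeps indices sorted, matching the sorted-indices property that the PR's tests check.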
Key files created:
- BM25EmbeddingFunction: local tokenize → stem → hash → TF-IDF pipeline
- BM25Tokenizer, Murmur3, BM25StopWords: pipeline components
- ChromaCloudSpladeEmbeddingFunction: remote SPLADE via Chroma Cloud API
Plan 04-05: EmbeddingFunctionRegistry
Central registry with three factory maps (dense, sparse, content) and auto-wiring from collection config.
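The content fallback chain described in the commits (content factory first, then a dense provider wrapped in an adapter) can be sketched as follows. This is a simplified illustration, not the PR's code. Contents are modelled as plain strings instead of the `Content` value type, the sparse map is left out for brevity, and the method names `registerDense`, `registerContent`, `resolveContent` and `lengthEmbedder` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; not the EmbeddingFunctionRegistry added by this PR.
final class RegistrySketch {

    public interface EmbeddingFunction {
        List<float[]> embedDocuments(List<String> texts);
    }

    /** Simplification (assumption): contents are plain strings, not multimodal Content. */
    public interface ContentEmbeddingFunction {
        List<float[]> embedContents(List<String> contents);
    }

    /** Wraps a text-only function so it can serve content requests. */
    public static final class TextEmbeddingAdapter implements ContentEmbeddingFunction {
        private final EmbeddingFunction delegate;

        public TextEmbeddingAdapter(EmbeddingFunction delegate) {
            this.delegate = delegate;
        }

        @Override
        public List<float[]> embedContents(List<String> contents) {
            return delegate.embedDocuments(contents);
        }
    }

    /** Stand-in for the typed exception named in the PR. */
    public static final class UnsupportedEmbeddingProviderException extends RuntimeException {
        public UnsupportedEmbeddingProviderException(String name) {
            super("Unsupported embedding provider: " + name);
        }
    }

    private final Map<String, Supplier<EmbeddingFunction>> dense = new HashMap<>();
    private final Map<String, Supplier<ContentEmbeddingFunction>> content = new HashMap<>();

    public synchronized void registerDense(String name, Supplier<EmbeddingFunction> factory) {
        dense.put(name, factory);
    }

    public synchronized void registerContent(String name,
                                             Supplier<ContentEmbeddingFunction> factory) {
        content.put(name, factory);
    }

    /** Content factory first; otherwise fall back to a dense provider behind the adapter. */
    public synchronized ContentEmbeddingFunction resolveContent(String name) {
        Supplier<ContentEmbeddingFunction> contentFactory = content.get(name);
        if (contentFactory != null) {
            return contentFactory.get();
        }
        Supplier<EmbeddingFunction> denseFactory = dense.get(name);
        if (denseFactory != null) {
            return new TextEmbeddingAdapter(denseFactory.get());
        }
        throw new UnsupportedEmbeddingProviderException(name);
    }

    /** Toy dense provider used for illustration: embeds each text as its length. */
    public static EmbeddingFunction lengthEmbedder() {
        return texts -> {
            List<float[]> out = new ArrayList<>();
            for (String t : texts) {
                out.add(new float[] {t.length()});
            }
            return out;
        };
    }
}
```

With this ordering, a provider that registers only a dense factory still serves content requests, and a native content factory, once registered under the same name, takes precedence.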
Key files created:
- EmbeddingFunctionRegistry: singleton registry with provider factory maps
- UnsupportedEmbeddingProviderException: typed exception for unknown providers
Key files modified:
- EmbeddingFunctionResolver: now delegates to registry instead of hardcoded dispatch
Requirements Addressed
Verification
Test Plan