Implement hybrid search with LLM re-ranking and corpus discovery #151
alexwelcing wants to merge 13 commits into main
Conversation
Replace the sycophantic Ship AI persona and basic vector RAG with a foundation for corpus-driven intelligence:
- Add hybrid search (BM25 full-text + vector semantic with Reciprocal Rank Fusion) via a new Supabase `hybrid_search()` function, replacing the single-vector `match_page_sections` approach
- Add LLM re-ranking with position-aware blending of retrieval vs. relevance scores
- Build the library corpus system: a `corpus_entry` table for AI-discovered external sources, auto-stored via CORPUS_ENTRY signal extraction during chat streaming
- Create a model abstraction layer (`lib/ai/llm-provider.ts`) to decouple from OpenAI, enabling a future HuggingFace model swap
- Strip the prescriptive sycophantic personality, replacing it with minimal behavioral directives that let the content (and eventually fine-tuning) drive the tone
- Add a corpus API endpoint for querying stored entries
- Add FTS indexes on both article sections and corpus entries

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
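The Reciprocal Rank Fusion merge described above can be sketched as follows. This is a minimal illustration, not the PR's actual `hybrid_search()` SQL; the input shapes (two ranked lists of ids) are assumptions.

```typescript
// Reciprocal Rank Fusion: merge two ranked result lists. Each list
// contributes 1 / (k + rank) per document; k = 60 dampens the advantage
// of top ranks so both retrievers get a meaningful vote.
function rrfFuse(
  ftsRanked: string[],
  vectorRanked: string[],
  k = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  const addList = (ranked: string[]) => {
    ranked.forEach((id, i) => {
      // rank is 1-based: the first hit contributes 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  };
  addList(ftsRanked);
  addList(vectorRanked);
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

A document ranked mid-list by both retrievers can outscore one ranked first by only one of them, which is the point of the fusion.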
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1dad9b5bde
processCorpusSignals(corpusSignals, sanitizedQuery, corpusProvider).catch(
  (err) => console.error('Background corpus processing failed:', err)
)
Keep corpus writes alive until background task finishes
onFinal starts processCorpusSignals(...) but immediately returns without awaiting it, so the promise is detached from the request lifecycle. In edge/serverless execution, the worker can be terminated as soon as the stream closes, which can cancel this write path and silently drop discovered corpus entries; this makes corpus discovery unreliable under normal production load.
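The pattern the review asks for can be sketched with a small local registry: detached promises are collected and awaited before the handler's lifecycle ends. On Vercel, the platform-provided `waitUntil` from `@vercel/functions` plays this role (an assumption about the deploy target); the sketch below is runnable anywhere.

```typescript
// Keep background work attached to the request lifecycle instead of
// fire-and-forget. Promises registered here are awaited before the
// worker is allowed to shut down.
const pending: Promise<unknown>[] = [];

function waitUntil(p: Promise<unknown>): void {
  // Swallow rejections so a failed background task cannot crash the response.
  pending.push(p.catch((err) => console.error('Background task failed:', err)));
}

async function flushBackgroundTasks(): Promise<void> {
  // Called (or platform-equivalent) before the worker terminates.
  await Promise.all(pending);
}
```

In the PR's code, `onFinal` would call `waitUntil(processCorpusSignals(...))` instead of detaching the promise.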
rrf_score:
  result.rrf_score * retrievalWeight + normalizedRerank * rerankWeight,
Put retrieval and rerank scores on the same scale
This blend combines rrf_score (typically a small reciprocal-rank value around hundredths) with normalizedRerank (0–1), so the rerank term dominates even when retrievalWeight is high. In practice that undermines the intended “trust retrieval more for top results” behavior and can reorder context mostly by LLM scoring noise; normalize both signals to comparable ranges before weighting.
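The fix the review suggests can be sketched as min-max normalizing the RRF scores into the rerank scores' 0-1 range before weighting. Field names and default weights here are illustrative, not the PR's actual code.

```typescript
interface Scored {
  rrf_score: number;    // small reciprocal-rank values, e.g. ~0.016-0.033
  rerank_score: number; // LLM relevance, already normalized to 0-1
}

// Blend retrieval and rerank signals on a comparable scale.
function blendScores(
  results: Scored[],
  retrievalWeight = 0.6,
  rerankWeight = 0.4
): number[] {
  const rrf = results.map((r) => r.rrf_score);
  const min = Math.min(...rrf);
  const range = Math.max(...rrf) - min || 1; // avoid divide-by-zero when all equal
  return results.map(
    (r) =>
      ((r.rrf_score - min) / range) * retrievalWeight +
      r.rerank_score * rerankWeight
  );
}
```

With both signals in 0-1, a high `retrievalWeight` actually lets retrieval win, rather than being drowned out by the larger-magnitude rerank term.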
- Replace emoji-laden copy in TerminalInterface and SearchDialog with neutral, direct language matching the new Ship AI persona
- Fix the broken isLoading check in SearchDialog (it matched the old 'great question' text that no longer appears)
- Fix the broken hasResponse/welcome-state checks in both components (they matched the old 'ready to chat whenever you are' greeting)
- Add URL validation for corpus entry signals (reject non-http URLs)
- Improve the corpus content fallback for entries with no summary

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
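The signal parsing plus URL validation described here can be sketched as follows. The `[[CORPUS_ENTRY:title::summary::url]]` format comes from this PR; the parsing code itself is illustrative.

```typescript
interface CorpusSignal {
  title: string;
  summary: string;
  url: string;
}

// Extract CORPUS_ENTRY markers from LLM output, rejecting any whose
// URL is missing, malformed, or not http(s) (e.g. javascript:, ftp:).
function parseCorpusSignals(text: string): CorpusSignal[] {
  const signals: CorpusSignal[] = [];
  const re = /\[\[CORPUS_ENTRY:(.+?)::(.+?)::(.+?)\]\]/g;
  for (const m of text.matchAll(re)) {
    const [, title, summary, url] = m;
    try {
      const parsed = new URL(url);
      if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') continue;
    } catch {
      continue; // not a valid URL at all
    }
    signals.push({ title: title.trim(), summary: summary.trim(), url });
  }
  return signals;
}
```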
Add VectorSpaceExplorer, a 3D visualization where articles are positioned by semantic axes (polarity=X, horizon=Y, topic=Z). Text input drives hybrid search, the camera flies to matching clusters, and results glow in proportion to their relevance score.

New components and changes:
- VectorSpaceExplorer.tsx: R3F component with ArticleNode (polarity-colored icosahedrons), SearchConnections (additive-blended lines between hits), SemanticAxes (labeled grid), SearchPulse (expanding ring on query), CameraNavigator (smooth lerp to the result centroid)
- /api/vector-explore: returns hybrid search results structured for 3D visualization (slug, heading, rrf_score)
- TerminalInterface chat tab split into a VECTOR SPACE / AI CHAT toggle; vector mode shows ranked results plus a "FLY" button, with AI chat mode preserved as a fallback
- ThreeSixty: renders VectorSpaceExplorer or InfiniteLibrary based on vectorExploreMode state, wiring search through the prop chain
- Pull in the latest TerminalInterface UX from main (floating window, keyboard hotkeys, drag support)

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
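The semantic-axes placement described above can be sketched as a direct mapping, assuming each article carries normalized polarity/horizon/topic scores in [-1, 1] (an assumption; the PR does not show the score source). The `scale` spreads nodes out in scene units.

```typescript
interface ArticleAxes {
  polarity: number; // -1..1 → X
  horizon: number;  // -1..1 → Y
  topic: number;    // -1..1 → Z
}

// Map semantic scores onto 3D scene coordinates.
function semanticAxesPosition(
  a: ArticleAxes,
  scale = 10
): [number, number, number] {
  return [a.polarity * scale, a.horizon * scale, a.topic * scale];
}
```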
Break the monolithic VectorSpaceExplorer into three layers:
- lib/3d/layoutAlgorithms.ts: 6 pure positioning functions (semantic-axes, sphere, galaxy, timeline, clusters, helix) with a registry and color utils
- components/3d/primitives/: generic reusable 3D building blocks (NodeRenderer with a state machine, AxisSystem with N-axis config)
- components/3d/vector-space/: domain-specific composables (VectorNode, VectorConnections, SearchPulse, VectorCamera) orchestrated by a slim VectorSpaceExplorer

Deletes the old 496-line monolith. All imports updated.

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The vector space explorer was buried 3 clicks deep (MENU → ASK AI → tiny toggle). Now it's a dedicated tab and a prominent button in the tablet quick-menu grid:
- Add VECTORS as its own tab (key 2) in TerminalInterface
- Add a cyan-accented VECTORS button to the tablet quick-menu grid
- Auto-activate 3D vector explore mode when entering the tab
- Remove the cramped vector/chat sub-toggle from the CHAT tab; the CHAT tab is now purely AI chat with a clean layout
- Renumber keyboard hotkeys (1-6) to match the new tab order

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
- Add dedicated `vectorInput` state so the VECTORS tab input doesn't share state with the CHAT tab's `chatInput`
- Replace the deprecated `onKeyPress` with `onKeyDown` for both the vector search and chat inputs (onKeyPress doesn't fire reliably)

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The FLY button appeared to do nothing because:
1. The VECTORS tab relied on the onVectorSearch prop chain (ThreeSixty -> InteractiveTablet -> TerminalInterface), which could be stale
2. When the API returned 0 results or errored, the UI showed the same placeholder text, indistinguishable from "nothing happened"

Now the VECTORS tab:
- Calls /api/vector-explore directly with its own fetch
- Has its own local state for results, loading, and errors
- Shows distinct states: initial, loading, results, no-results, error
- The FLY button shows "..." while searching and disables when empty
- Still notifies the parent via onVectorSearch for the 3D camera fly-to

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The vector DB contains embeddings for content chunks that may not correspond to actual published articles (e.g. corpus entries, stale data). This caused the VECTORS tab to show links to articles that 404'd.

Now the /api/vector-explore endpoint loads the article manifest and filters results so only slugs matching real MDX articles are returned.

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
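The manifest filter described above amounts to a set-membership check; a minimal sketch (result and manifest shapes are illustrative):

```typescript
interface Hit {
  slug: string;
  rrf_score: number;
}

// Keep only hits whose slug corresponds to a published MDX article,
// so the UI never links to a 404.
function filterToManifest(hits: Hit[], manifestSlugs: string[]): Hit[] {
  const published = new Set(manifestSlugs);
  return hits.filter((h) => published.has(h.slug));
}
```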
The VECTORS tab now renders search results as an interactive 3D scene:
- The query appears as a rotating golden octahedron at the center
- Results orbit as glowing icosahedron nodes, positioned by score (closer = higher relevance; color shifts cyan→gold)
- Connection lines link results to the query and to each other
- Nodes have hover states with billboard labels showing title + score
- Click any node to navigate to its article
- Background particle dust + auto-rotate for ambience
- OrbitControls for manual camera rotation/zoom
- A scanning-ring animation plays during search
- Distinct states: initial prompt, searching, no results, error

Uses the same R3F stack as the main 3D scene (drei, three). Dynamically imported with ssr: false.

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Vector search fix:
- The API now falls back to manifest text search when hybrid_search returns empty results or is unreachable. Multi-field matching against title, description, keywords, and domains with weighted scoring ensures the graph always populates with relevant nodes.
- Supabase/OpenAI failures degrade gracefully instead of showing NO MATCHES.

Fullscreen mode:
- An EXPAND button appears in the graph HUD when results are loaded
- Opens a fixed fullscreen portal with the same R3F scene
- Full viewport for orbit controls, zoom, and node interaction
- Bottom-anchored search bar for re-querying without leaving
- ESC key closes fullscreen (overriding terminal close when active)
- HUD shows node count and interaction hints

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
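The weighted manifest fallback above can be sketched as a per-field scoring pass. Field weights are illustrative, and the `domains` field is omitted here for brevity:

```typescript
interface ManifestEntry {
  slug: string;
  title: string;
  description: string;
  keywords: string[];
}

// Score an article by which metadata fields contain the query,
// weighting title matches highest. Used only when hybrid_search
// is empty or unreachable.
function fallbackScore(entry: ManifestEntry, query: string): number {
  const q = query.toLowerCase();
  let score = 0;
  if (entry.title.toLowerCase().includes(q)) score += 3;
  if (entry.description.toLowerCase().includes(q)) score += 2;
  if (entry.keywords.some((k) => k.toLowerCase().includes(q))) score += 1;
  return score;
}
```

Sorting the manifest by this score (dropping zero-score entries) always yields some nodes for the graph, which is the degraded-but-populated behavior the commit describes.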
Core ingestion engine that closes the self-learning loop:
Pipeline: URL → fetch → parse → chunk → embed → store
Components:
- lib/ai/corpus-ingest.ts: Full ingestion pipeline
- PDF parsing via pdf-parse v2 (class API)
- HTML extraction with progressive tag stripping
- Semantic chunking: split on headings → paragraphs → sentences
- LLM relevance gate (1-5 score, rejects spam/off-topic)
- Per-chunk summarization for better retrieval
- Deduplication via URL + content checksums
- 10MB fetch limit, 30s timeout
- pages/api/corpus/ingest.ts: Admin-authenticated endpoint
- POST single URL or batch of URLs
- Protected by withAdminAuth (x-admin-api-key header)
- lib/ai/corpus-manager.ts: Enhanced signal processing
- When Ship AI emits [[CORPUS_ENTRY:...]] with a URL,
now triggers deep ingestion in the background
- Stub entry saved first (fast), full ingestion follows
- Non-blocking: failures don't affect the chat response
- supabase/migrations: Ingestion observability
- corpus_ingestion_log table for tracking pipeline runs
- parent_url column on corpus_entry for chunk→parent linking
The loop is now closed: Ship AI discovers a source during
conversation → emits CORPUS_ENTRY signal → system fetches,
parses, chunks, evaluates relevance, embeds, and stores →
future searches find this knowledge → Ship AI becomes smarter.
https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
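The headings → paragraphs step of the semantic chunking above can be sketched as follows. The real pipeline also falls back to sentence splits; `maxLen` here is a character-count stand-in for whatever token budget the ingestion code uses (an assumption).

```typescript
// Two-pass chunker: split on markdown headings first, then greedily
// pack paragraphs of any oversize section up to maxLen characters.
function chunkText(text: string, maxLen = 1500): string[] {
  const chunks: string[] = [];
  // Pass 1: split before each markdown heading (#, ##, ... ######).
  for (const section of text.split(/\n(?=#{1,6}\s)/)) {
    if (section.length <= maxLen) {
      if (section.trim()) chunks.push(section.trim());
      continue;
    }
    // Pass 2: oversize section → pack paragraphs (blank-line separated).
    let current = '';
    for (const para of section.split(/\n\s*\n/)) {
      if (current && current.length + para.length + 2 > maxLen) {
        chunks.push(current.trim());
        current = '';
      }
      current += (current ? '\n\n' : '') + para;
    }
    if (current.trim()) chunks.push(current.trim());
  }
  return chunks;
}
```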
The static import chain vector-search.ts (Edge Runtime) → corpus-manager.ts → corpus-ingest.ts → pdf-parse pulled pdfjs-dist's eval/WebAssembly.compile into the Edge bundle, which webpack rejects.

Fix: use a dynamic import() for corpus-ingest in processCorpusSignals so the pdf-parse dependency is only loaded at runtime in a Node.js context and is never statically analyzed into the Edge bundle.

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Strip PDF parsing from the ingestion pipeline — PDFs are parsed
externally and passed as pre-parsed text via the `content` field
on IngestRequest. This eliminates the pdf-parse/pdfjs-dist
dependency entirely, removing the Edge Runtime eval issue at its
root instead of working around it with dynamic imports.
The ingest pipeline now handles:
- Web pages (fetched + HTML stripped)
- Plain text (fetched)
- Pre-parsed content (passed directly, skips fetch)
Usage for pre-parsed PDFs:
POST /api/corpus/ingest
{ "url": "https://.../paper.pdf", "title": "...", "content": "extracted text" }
https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
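The pre-parsed usage above can be exercised with a small request builder. The endpoint path and `x-admin-api-key` header come from this PR; the builder itself, and the example host and key, are illustrative.

```typescript
interface IngestDoc {
  url: string;
  title: string;
  content: string; // pre-parsed text; its presence skips the fetch step
}

// Build the fetch arguments for POST /api/corpus/ingest.
function buildIngestRequest(baseUrl: string, adminKey: string, doc: IngestDoc) {
  return {
    url: `${baseUrl}/api/corpus/ingest`,
    init: {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        'x-admin-api-key': adminKey, // checked by withAdminAuth
      },
      body: JSON.stringify(doc),
    },
  };
}

// Usage:
// const { url, init } = buildIngestRequest(base, key, doc);
// const res = await fetch(url, init);
```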
Summary
This PR introduces a comprehensive knowledge management and retrieval system for Ship AI, featuring hybrid BM25 + vector search with LLM-based re-ranking, automatic corpus discovery from LLM responses, and an abstracted LLM provider interface for future extensibility.
Key Changes
Search Infrastructure
- `hybrid_search` (SQL RPC): Combines full-text search (BM25) and vector semantic search using Reciprocal Rank Fusion (RRF) with configurable weights
- `corpus_entry`: New table for storing AI-discovered external sources with embeddings, FTS indexes, and deduplication by URL
- `tsvector` columns and GIN indexes on both article sections and corpus entries for efficient full-text search

LLM Provider Abstraction
- `llm-provider.ts`: Unified interface for LLM operations (embeddings, chat completion, moderation, re-ranking)
- Provider selection via the `LLM_PROVIDER` env variable

Corpus Discovery & Management
- `shipPersona.ts` updated to parse `[[CORPUS_ENTRY:title::summary::url]]` signals from LLM responses
- `corpus-manager.ts` handles async corpus entry storage with embedding generation
- Corpus API (`pages/api/corpus.ts`): Public read access to discovered corpus entries with full-text search

Vector Search Refactoring
- `createLLMProvider()` factory pattern
- `vector-search.ts` updated to use the hybrid search RPC instead of simple vector matching

Prompt & Persona Updates

Code Quality

Notable Implementation Details
- RRF scoring: `1 / (k + rank)` where k=60 (configurable) to fairly merge FTS and vector rankings

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
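The provider abstraction summarized above might look roughly like this. The interface methods and the `'stub'` backend are illustrative, not the actual `llm-provider.ts` surface; the real factory would key off the `LLM_PROVIDER` env variable and wire OpenAI (or later HuggingFace) backends.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Unified surface for the LLM operations the PR lists.
interface LLMProvider {
  embed(texts: string[]): Promise<number[][]>;
  chat(messages: ChatMessage[]): Promise<string>;
  rerank(query: string, docs: string[]): Promise<number[]>;
}

// Factory keyed by provider name; only a deterministic stub is wired
// here so the sketch runs without credentials.
function createLLMProvider(name = 'stub'): LLMProvider {
  if (name !== 'stub') {
    // Real implementations (OpenAI, HuggingFace, ...) would branch here.
    console.warn(`No backend wired for "${name}", using stub`);
  }
  return {
    embed: async (texts) => texts.map((t) => [t.length, 0, 0]),
    chat: async (messages) => `echo: ${messages[messages.length - 1].content}`,
    rerank: async (_query, docs) => docs.map(() => 0.5),
  };
}
```

Swapping providers then touches only the factory, not the search or chat call sites.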