
Implement hybrid search with LLM re-ranking and corpus discovery#151

Open
alexwelcing wants to merge 13 commits into main from claude/ship-ai-personality-upgrade-yvdAd
Conversation

@alexwelcing (Owner)

Summary

This PR introduces a comprehensive knowledge management and retrieval system for Ship AI, featuring hybrid BM25 + vector search with LLM-based re-ranking, automatic corpus discovery from LLM responses, and an abstracted LLM provider interface for future extensibility.

Key Changes

Search Infrastructure

  • Hybrid search function (hybrid_search SQL RPC): Combines full-text search (BM25) and vector semantic search using Reciprocal Rank Fusion (RRF) with configurable weights
  • Corpus table (corpus_entry): New table for storing AI-discovered external sources with embeddings, FTS indexes, and deduplication by URL
  • FTS indexes: Added generated tsvector columns and GIN indexes on both article sections and corpus entries for efficient full-text search

LLM Provider Abstraction

  • New llm-provider.ts: Unified interface for LLM operations (embeddings, chat completion, moderation, re-ranking)
  • OpenAI implementation: Wraps OpenAI API with structured methods; designed to support swapping in HuggingFace or other providers via LLM_PROVIDER env variable
  • Re-ranking capability: LLM scores top candidates for relevance and blends scores with retrieval rankings using position-aware weighting
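
The abstraction above can be sketched roughly as follows. The method names, the `LLM_PROVIDER` switch, and the stub implementation are assumptions for illustration, not the exact shipped interface in `llm-provider.ts`:

```typescript
// Illustrative sketch of a unified LLM provider interface.
export interface LLMProvider {
  embed(text: string): Promise<number[]>
  chat(messages: { role: string; content: string }[]): Promise<string>
  rerank(query: string, candidates: string[]): Promise<number[]>
}

// Placeholder OpenAI-backed implementation so the factory is self-contained.
const openAIProvider: LLMProvider = {
  async embed() { return [] },
  async chat() { return '' },
  async rerank(_query, candidates) { return candidates.map(() => 0) },
}

// Factory keyed off an env variable so an alternate backend (e.g. HuggingFace)
// can be swapped in without touching call sites.
export function createLLMProvider(
  name: string = process.env.LLM_PROVIDER ?? 'openai'
): LLMProvider {
  switch (name) {
    case 'openai':
      return openAIProvider
    default:
      throw new Error(`Unknown LLM provider: ${name}`)
  }
}
```

The payoff of the factory pattern is that `vector-search.ts` and `corpus-manager.ts` only depend on the interface, never on the OpenAI SDK directly.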

Corpus Discovery & Management

  • Signal extraction: Extended shipPersona.ts to parse [[CORPUS_ENTRY:title::summary::url]] signals from LLM responses
  • Background processing: New corpus-manager.ts handles async corpus entry storage with embedding generation
  • Fire-and-forget integration: Corpus signals are extracted and processed after streaming completes without blocking the response
  • Corpus API endpoint (pages/api/corpus.ts): Public read access to discovered corpus entries with full-text search
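
A minimal sketch of the signal extraction step, assuming a simple regex over the completed response text (the actual parser in shipPersona.ts may differ). It also mirrors the URL hardening from a later commit by rejecting non-http(s) URLs:

```typescript
// Parse [[CORPUS_ENTRY:title::summary::url]] signals from an LLM response.
export interface CorpusSignal {
  title: string
  summary: string
  url: string
}

const SIGNAL_RE = /\[\[CORPUS_ENTRY:(.+?)::(.+?)::(.+?)\]\]/g

export function extractCorpusSignals(text: string): CorpusSignal[] {
  const signals: CorpusSignal[] = []
  for (const m of text.matchAll(SIGNAL_RE)) {
    const [, title, summary, url] = m
    // Reject anything that is not a fetchable http(s) URL
    if (!/^https?:\/\//.test(url)) continue
    signals.push({ title: title.trim(), summary: summary.trim(), url: url.trim() })
  }
  return signals
}
```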

Vector Search Refactoring

  • Replaced direct OpenAI API calls with createLLMProvider() factory pattern
  • Updated vector-search.ts to use hybrid search RPC instead of simple vector matching
  • Improved context assembly with source attribution (article vs. corpus labels)
  • Increased token budget from 1500 to 2500 for richer context

Prompt & Persona Updates

  • Simplified system prompt to emphasize direct, substantive answers with source citation
  • Removed enthusiasm-focused language in favor of clarity and accuracy
  • Added corpus signal emission instructions for knowledge discovery
  • Updated initial chat greeting to be more neutral

Code Quality

  • Converted to consistent semicolon-free style (Prettier)
  • Added TypeScript interfaces for hybrid search results and LLM operations
  • Improved error handling with non-blocking corpus processing
  • Added detailed section comments for search pipeline stages

Notable Implementation Details

  • RRF scoring: Uses formula 1 / (k + rank) where k=60 (configurable) to fairly merge FTS and vector rankings
  • Position-aware re-ranking: Top 3 results trust retrieval more (75% weight) while lower positions rely more on LLM re-ranking (60% weight)
  • Graceful degradation: Re-ranking failures fall back to RRF order; corpus processing errors don't interrupt response streaming
  • Deduplication: Corpus entries are upsert-on-conflict by URL to prevent duplicate storage
  • Token budgeting: Context assembly respects a 2500-token limit with proper attribution labels
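
The RRF merge and position-aware blend above can be sketched as below. The `k=60` constant and the 75%/60% weights follow the PR text; the document-id inputs and function shapes are illustrative:

```typescript
const K = 60 // RRF smoothing constant, configurable per the PR

// RRF: each ranked list (FTS, vector) contributes 1 / (k + rank) per document.
export function rrfScores(rankedLists: string[][]): Map<string, number> {
  const scores = new Map<string, number>()
  for (const list of rankedLists) {
    list.forEach((id, i) => {
      // i is 0-based, so rank = i + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (K + i + 1))
    })
  }
  return scores
}

// Position-aware blend: top-3 candidates trust retrieval more (75%),
// lower positions lean more on the LLM re-rank score (60%).
export function blend(position: number, retrieval: number, rerank: number): number {
  const retrievalWeight = position < 3 ? 0.75 : 0.4
  return retrieval * retrievalWeight + rerank * (1 - retrievalWeight)
}
```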

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V

…reset

Replace the sycophantic Ship AI persona and basic vector RAG with a
foundation for corpus-driven intelligence:

- Add hybrid search (BM25 full-text + vector semantic with Reciprocal
  Rank Fusion) via new Supabase `hybrid_search()` function, replacing
  the single-vector `match_page_sections` approach
- Add LLM re-ranking with position-aware blending of retrieval vs
  relevance scores
- Build library corpus system: `corpus_entry` table for AI-discovered
  external sources, auto-stored via CORPUS_ENTRY signal extraction
  during chat streaming
- Create model abstraction layer (`lib/ai/llm-provider.ts`) to decouple
  from OpenAI, enabling future HuggingFace model swap
- Strip prescriptive sycophantic personality — replace with minimal
  behavioral directives that let the content (and eventually fine-tuning)
  drive the tone
- Add corpus API endpoint for querying stored entries
- Add FTS indexes on both article sections and corpus entries

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V

vercel Bot commented Feb 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: next-docs-search | Deployment: Ready | Actions: Preview, Comment | Updated (UTC): Mar 2, 2026 1:04am


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1dad9b5bde


Comment on lines +236 to +238
processCorpusSignals(corpusSignals, sanitizedQuery, corpusProvider).catch(
(err) => console.error('Background corpus processing failed:', err)
)


P1: Keep corpus writes alive until background task finishes

onFinal starts processCorpusSignals(...) but immediately returns without awaiting it, so the promise is detached from the request lifecycle. In edge/serverless execution, the worker can be terminated as soon as the stream closes, which can cancel this write path and silently drop discovered corpus entries; this makes corpus discovery unreliable under normal production load.
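
One way to address this is to hand the detached promise to the platform's keep-alive hook (e.g. a `waitUntil`-style callback on edge runtimes) instead of fire-and-forget. The helper below is a sketch under that assumption; the function name and the await-as-fallback behavior are not from the PR:

```typescript
// Keep a background task tied to the request lifecycle. If the runtime
// exposes a waitUntil hook, register the promise with it; otherwise
// await it so the write cannot be dropped when the worker is recycled.
export async function runInBackground(
  task: Promise<unknown>,
  waitUntil?: (p: Promise<unknown>) => void
): Promise<void> {
  const guarded = task.catch((err) =>
    console.error('Background corpus processing failed:', err)
  )
  if (waitUntil) {
    waitUntil(guarded) // runtime keeps the worker alive until it settles
  } else {
    await guarded // no hook available: block until the write completes
  }
}
```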


Comment on lines +121 to +122
rrf_score:
result.rrf_score * retrievalWeight + normalizedRerank * rerankWeight,


P2: Put retrieval and rerank scores on the same scale

This blend combines rrf_score (typically a small reciprocal-rank value around hundredths) with normalizedRerank (0–1), so the rerank term dominates even when retrievalWeight is high. In practice that undermines the intended “trust retrieval more for top results” behavior and can reorder context mostly by LLM scoring noise; normalize both signals to comparable ranges before weighting.
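
The suggested fix can be sketched as a min-max normalization that maps the RRF scores into [0, 1] before the weighted blend (the degenerate all-ties handling is an assumption):

```typescript
// Rescale a batch of RRF scores into [0, 1] so they are comparable
// with the LLM's normalized rerank scores before blending.
export function minMaxNormalize(scores: number[]): number[] {
  const lo = Math.min(...scores)
  const hi = Math.max(...scores)
  if (hi === lo) return scores.map(() => 1) // degenerate: all candidates tied
  return scores.map((s) => (s - lo) / (hi - lo))
}
```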


- Replace emoji-laden copy in TerminalInterface and SearchDialog with
  neutral, direct language matching the new Ship AI persona
- Fix broken isLoading check in SearchDialog (was matching old
  'great question' text that no longer appears)
- Fix broken hasResponse/welcome-state checks in both components
  (were matching old 'ready to chat whenever you are' greeting)
- Add URL validation for corpus entry signals (reject non-http URLs)
- Improve corpus content fallback for entries with no summary

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Add VectorSpaceExplorer — a 3D visualization where articles are
positioned by semantic axes (polarity=X, horizon=Y, topic=Z).
Text input drives hybrid search, camera flies to matching clusters,
and results glow proportional to relevance score.

New components and changes:
- VectorSpaceExplorer.tsx: R3F component with ArticleNode (polarity-
  colored icosahedrons), SearchConnections (additive-blended lines
  between hits), SemanticAxes (labeled grid), SearchPulse (expanding
  ring on query), CameraNavigator (smooth lerp to result centroid)
- /api/vector-explore: Returns hybrid search results structured for
  3D visualization (slug, heading, rrf_score)
- TerminalInterface chat tab split into VECTOR SPACE / AI CHAT
  toggle — vector mode shows ranked results + "FLY" button, AI chat
  mode preserved as fallback
- ThreeSixty: Renders VectorSpaceExplorer or InfiniteLibrary based
  on vectorExploreMode state, wires search through prop chain
- Pull in latest TerminalInterface UX from main (floating window,
  keyboard hotkeys, drag support)

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Break monolithic VectorSpaceExplorer into three layers:

- lib/3d/layoutAlgorithms.ts: 6 pure positioning functions (semantic-axes,
  sphere, galaxy, timeline, clusters, helix) with registry and color utils
- components/3d/primitives/: Generic reusable 3D building blocks
  (NodeRenderer with state machine, AxisSystem with N-axis config)
- components/3d/vector-space/: Domain-specific composables (VectorNode,
  VectorConnections, SearchPulse, VectorCamera) orchestrated by a slim
  VectorSpaceExplorer

Deletes the old 496-line monolith. All imports updated.
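
As a flavor of what a pure positioning function in lib/3d/layoutAlgorithms.ts might look like, here is a sketch of a sphere layout using a Fibonacci spiral. The `[x, y, z]` return shape and `radius` parameter are assumptions, not the shipped signature:

```typescript
// Evenly distribute `count` nodes on a sphere via the golden-angle spiral.
export function sphereLayout(count: number, radius = 10): [number, number, number][] {
  const golden = Math.PI * (3 - Math.sqrt(5)) // golden angle in radians
  const points: [number, number, number][] = []
  for (let i = 0; i < count; i++) {
    const y = count === 1 ? 0 : 1 - (2 * i) / (count - 1) // sweep y from 1 to -1
    const r = Math.sqrt(1 - y * y) // ring radius at this latitude
    const theta = golden * i
    points.push([
      Math.cos(theta) * r * radius,
      y * radius,
      Math.sin(theta) * r * radius,
    ])
  }
  return points
}
```

Keeping layouts as pure `(count) => positions` functions is what lets a registry swap between semantic-axes, galaxy, timeline, and the rest without touching the renderer.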

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The vector space explorer was buried 3 clicks deep (MENU → ASK AI →
tiny toggle). Now it's a dedicated tab and a prominent button in the
tablet quick-menu grid:

- Add VECTORS as its own tab (key 2) in TerminalInterface
- Add cyan-accented VECTORS button to tablet quick-menu grid
- Auto-activate 3D vector explore mode when entering the tab
- Remove the cramped vector/chat sub-toggle from the CHAT tab
- CHAT tab is now purely AI chat with clean layout
- Renumber keyboard hotkeys (1-6) to match new tab order

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
- Add dedicated `vectorInput` state so the VECTORS tab input doesn't
  share state with the CHAT tab's `chatInput`
- Replace deprecated `onKeyPress` with `onKeyDown` for both vector
  search and chat inputs (onKeyPress doesn't fire reliably)

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The FLY button appeared to do nothing because:
1. The VECTORS tab relied on the onVectorSearch prop chain (ThreeSixty
   -> InteractiveTablet -> TerminalInterface) which could be stale
2. When the API returned 0 results or errored, the UI showed the same
   placeholder text — indistinguishable from "nothing happened"

Now the VECTORS tab:
- Calls /api/vector-explore directly with its own fetch
- Has its own local state for results, loading, and errors
- Shows distinct states: initial, loading, results, no-results, error
- FLY button shows "..." while searching and disables when empty
- Still notifies parent via onVectorSearch for 3D camera flyto

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The vector DB contains embeddings for content chunks that may not
correspond to actual published articles (e.g. corpus entries, stale
data). This caused the VECTORS tab to show links to articles that
404'd.

Now the /api/vector-explore endpoint loads the article manifest and
filters results so only slugs matching real MDX articles are returned.

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The VECTORS tab now renders search results as an interactive 3D scene:

- Query appears as a rotating golden octahedron at center
- Results orbit as glowing icosahedron nodes, positioned by score
  (closer = higher relevance, color shifts cyan→gold)
- Connection lines link results to the query and to each other
- Nodes have hover states with billboard labels showing title + score
- Click any node to navigate to its article
- Background particle dust + auto-rotate for ambience
- OrbitControls for manual camera rotation/zoom
- Scanning ring animation plays during search
- Distinct states: initial prompt, searching, no results, error

Uses the same R3F stack as the main 3D scene (drei, three).
Dynamically imported with ssr: false.

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Vector search fix:
- API now falls back to manifest text search when hybrid_search
  returns empty results or is unreachable. Multi-field matching
  against title, description, keywords, and domains with weighted
  scoring ensures the graph always populates with relevant nodes.
- Supabase/OpenAI failures degrade gracefully instead of showing
  NO MATCHES.
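
The weighted multi-field fallback might look like the sketch below. The manifest field names and the 3/2/1 weights are assumptions about the shape of the data, not the shipped code:

```typescript
// Score a manifest entry against a query across weighted fields so the
// graph can populate even when hybrid_search is empty or unreachable.
interface ManifestEntry {
  slug: string
  title: string
  description: string
  keywords: string[]
}

export function fallbackScore(entry: ManifestEntry, query: string): number {
  const q = query.toLowerCase()
  let score = 0
  if (entry.title.toLowerCase().includes(q)) score += 3 // title hits weigh most
  if (entry.description.toLowerCase().includes(q)) score += 2
  if (entry.keywords.some((k) => k.toLowerCase().includes(q))) score += 1
  return score
}
```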

Fullscreen mode:
- EXPAND button appears in the graph HUD when results are loaded
- Opens a fixed fullscreen portal with the same R3F scene
- Full viewport for orbit controls, zoom, and node interaction
- Bottom-anchored search bar for re-querying without leaving
- ESC key closes fullscreen (overrides terminal close when active)
- HUD shows node count and interaction hints

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Core ingestion engine that closes the self-learning loop:

Pipeline: URL → fetch → parse → chunk → embed → store

Components:
- lib/ai/corpus-ingest.ts: Full ingestion pipeline
  - PDF parsing via pdf-parse v2 (class API)
  - HTML extraction with progressive tag stripping
  - Semantic chunking: split on headings → paragraphs → sentences
  - LLM relevance gate (1-5 score, rejects spam/off-topic)
  - Per-chunk summarization for better retrieval
  - Deduplication via URL + content checksums
  - 10MB fetch limit, 30s timeout

- pages/api/corpus/ingest.ts: Admin-authenticated endpoint
  - POST single URL or batch of URLs
  - Protected by withAdminAuth (x-admin-api-key header)

- lib/ai/corpus-manager.ts: Enhanced signal processing
  - When Ship AI emits [[CORPUS_ENTRY:...]] with a URL,
    now triggers deep ingestion in the background
  - Stub entry saved first (fast), full ingestion follows
  - Non-blocking: failures don't affect the chat response

- supabase/migrations: Ingestion observability
  - corpus_ingestion_log table for tracking pipeline runs
  - parent_url column on corpus_entry for chunk→parent linking

The loop is now closed: Ship AI discovers a source during
conversation → emits CORPUS_ENTRY signal → system fetches,
parses, chunks, evaluates relevance, embeds, and stores →
future searches find this knowledge → Ship AI becomes smarter.
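
The heading-then-paragraph chunking step can be sketched as below. The size threshold and splitting regexes are illustrative, and the sentence-level fallback from the pipeline is omitted for brevity:

```typescript
// Split text on markdown-style headings first, then break any oversized
// section on blank lines (paragraph boundaries).
export function chunkText(text: string, maxChars = 1200): string[] {
  const sections = text.split(/\n(?=#{1,6} )/)
  const chunks: string[] = []
  for (const section of sections) {
    if (section.length <= maxChars) {
      if (section.trim()) chunks.push(section.trim())
      continue
    }
    let current = ''
    for (const para of section.split(/\n\s*\n/)) {
      if (current && current.length + para.length > maxChars) {
        chunks.push(current.trim())
        current = ''
      }
      current += para + '\n\n'
    }
    if (current.trim()) chunks.push(current.trim())
  }
  return chunks
}
```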

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
…se eval

The static import chain vector-search.ts (Edge Runtime) → corpus-manager.ts
→ corpus-ingest.ts → pdf-parse pulled pdfjs-dist's eval/WebAssembly.compile
into the Edge bundle, which webpack rejects.

Fix: use dynamic import() for corpus-ingest in processCorpusSignals so the
pdf-parse dependency is only loaded at runtime in Node.js context, never
statically analyzed into the Edge bundle.

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Strip PDF parsing from the ingestion pipeline — PDFs are parsed
externally and passed as pre-parsed text via the `content` field
on IngestRequest. This eliminates the pdf-parse/pdfjs-dist
dependency entirely, removing the Edge Runtime eval issue at its
root instead of working around it with dynamic imports.

The ingest pipeline now handles:
- Web pages (fetched + HTML stripped)
- Plain text (fetched)
- Pre-parsed content (passed directly, skips fetch)

Usage for pre-parsed PDFs:
  POST /api/corpus/ingest
  { "url": "https://.../paper.pdf", "title": "...", "content": "extracted text" }

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V