
Implement hybrid search with LLM re-ranking and corpus discovery#151

Open
alexwelcing wants to merge 13 commits into main from claude/ship-ai-personality-upgrade-yvdAd
Conversation

@alexwelcing (Owner)

Summary

This PR introduces a comprehensive knowledge management and retrieval system for Ship AI, featuring hybrid BM25 + vector search with LLM-based re-ranking, automatic corpus discovery from LLM responses, and an abstracted LLM provider interface for future extensibility.

Key Changes

Search Infrastructure

  • Hybrid search function (hybrid_search SQL RPC): Combines full-text search (BM25) and vector semantic search using Reciprocal Rank Fusion (RRF) with configurable weights
  • Corpus table (corpus_entry): New table for storing AI-discovered external sources with embeddings, FTS indexes, and deduplication by URL
  • FTS indexes: Added generated tsvector columns and GIN indexes on both article sections and corpus entries for efficient full-text search

LLM Provider Abstraction

  • New llm-provider.ts: Unified interface for LLM operations (embeddings, chat completion, moderation, re-ranking)
  • OpenAI implementation: Wraps OpenAI API with structured methods; designed to support swapping in HuggingFace or other providers via LLM_PROVIDER env variable
  • Re-ranking capability: LLM scores top candidates for relevance and blends scores with retrieval rankings using position-aware weighting
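
The abstraction above can be sketched roughly as follows. The method names, the `LLM_PROVIDER` switch, and the stub implementation are assumptions for illustration, not the exact shipped interface in `llm-provider.ts`:

```typescript
// Illustrative sketch of a unified LLM provider interface.
export interface LLMProvider {
  embed(text: string): Promise<number[]>
  chat(messages: { role: string; content: string }[]): Promise<string>
  rerank(query: string, candidates: string[]): Promise<number[]>
}

// Placeholder OpenAI-backed implementation so the factory is self-contained.
const openAIProvider: LLMProvider = {
  async embed() { return [] },
  async chat() { return '' },
  async rerank(_query, candidates) { return candidates.map(() => 0) },
}

// Factory keyed off an env variable so an alternate backend (e.g. HuggingFace)
// can be swapped in without touching call sites.
export function createLLMProvider(
  name: string = process.env.LLM_PROVIDER ?? 'openai'
): LLMProvider {
  switch (name) {
    case 'openai':
      return openAIProvider
    default:
      throw new Error(`Unknown LLM provider: ${name}`)
  }
}
```

The payoff of the factory pattern is that `vector-search.ts` and `corpus-manager.ts` only depend on the interface, never on the OpenAI SDK directly.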

Corpus Discovery & Management

  • Signal extraction: Extended shipPersona.ts to parse [[CORPUS_ENTRY:title::summary::url]] signals from LLM responses
  • Background processing: New corpus-manager.ts handles async corpus entry storage with embedding generation
  • Fire-and-forget integration: Corpus signals are extracted and processed after streaming completes without blocking the response
  • Corpus API endpoint (pages/api/corpus.ts): Public read access to discovered corpus entries with full-text search
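
A minimal sketch of the signal extraction step, assuming a simple regex over the completed response text (the actual parser in shipPersona.ts may differ). It also mirrors the URL hardening from a later commit by rejecting non-http(s) URLs:

```typescript
// Parse [[CORPUS_ENTRY:title::summary::url]] signals from an LLM response.
export interface CorpusSignal {
  title: string
  summary: string
  url: string
}

const SIGNAL_RE = /\[\[CORPUS_ENTRY:(.+?)::(.+?)::(.+?)\]\]/g

export function extractCorpusSignals(text: string): CorpusSignal[] {
  const signals: CorpusSignal[] = []
  for (const m of text.matchAll(SIGNAL_RE)) {
    const [, title, summary, url] = m
    // Reject anything that is not a fetchable http(s) URL
    if (!/^https?:\/\//.test(url)) continue
    signals.push({ title: title.trim(), summary: summary.trim(), url: url.trim() })
  }
  return signals
}
```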

Vector Search Refactoring

  • Replaced direct OpenAI API calls with createLLMProvider() factory pattern
  • Updated vector-search.ts to use hybrid search RPC instead of simple vector matching
  • Improved context assembly with source attribution (article vs. corpus labels)
  • Increased token budget from 1500 to 2500 for richer context

Prompt & Persona Updates

  • Simplified system prompt to emphasize direct, substantive answers with source citation
  • Removed enthusiasm-focused language in favor of clarity and accuracy
  • Added corpus signal emission instructions for knowledge discovery
  • Updated initial chat greeting to be more neutral

Code Quality

  • Converted to consistent semicolon-free style (Prettier)
  • Added TypeScript interfaces for hybrid search results and LLM operations
  • Improved error handling with non-blocking corpus processing
  • Added detailed section comments for search pipeline stages

Notable Implementation Details

  • RRF scoring: Uses formula 1 / (k + rank) where k=60 (configurable) to fairly merge FTS and vector rankings
  • Position-aware re-ranking: Top 3 results trust retrieval more (75% weight) while lower positions rely more on LLM re-ranking (60% weight)
  • Graceful degradation: Re-ranking failures fall back to RRF order; corpus processing errors don't interrupt response streaming
  • Deduplication: Corpus entries are upsert-on-conflict by URL to prevent duplicate storage
  • Token budgeting: Context assembly respects a 2500-token limit with proper attribution labels
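
The RRF merge and position-aware blend above can be sketched as below. The `k=60` constant and the 75%/60% weights follow the PR text; the document-id inputs and function shapes are illustrative:

```typescript
const K = 60 // RRF smoothing constant, configurable per the PR

// RRF: each ranked list (FTS, vector) contributes 1 / (k + rank) per document.
export function rrfScores(rankedLists: string[][]): Map<string, number> {
  const scores = new Map<string, number>()
  for (const list of rankedLists) {
    list.forEach((id, i) => {
      // i is 0-based, so rank = i + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (K + i + 1))
    })
  }
  return scores
}

// Position-aware blend: top-3 candidates trust retrieval more (75%),
// lower positions lean more on the LLM re-rank score (60%).
export function blend(position: number, retrieval: number, rerank: number): number {
  const retrievalWeight = position < 3 ? 0.75 : 0.4
  return retrieval * retrievalWeight + rerank * (1 - retrievalWeight)
}
```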

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V

…reset

Replace the sycophantic Ship AI persona and basic vector RAG with a
foundation for corpus-driven intelligence:

- Add hybrid search (BM25 full-text + vector semantic with Reciprocal
  Rank Fusion) via new Supabase `hybrid_search()` function, replacing
  the single-vector `match_page_sections` approach
- Add LLM re-ranking with position-aware blending of retrieval vs
  relevance scores
- Build library corpus system: `corpus_entry` table for AI-discovered
  external sources, auto-stored via CORPUS_ENTRY signal extraction
  during chat streaming
- Create model abstraction layer (`lib/ai/llm-provider.ts`) to decouple
  from OpenAI, enabling future HuggingFace model swap
- Strip prescriptive sycophantic personality — replace with minimal
  behavioral directives that let the content (and eventually fine-tuning)
  drive the tone
- Add corpus API endpoint for querying stored entries
- Add FTS indexes on both article sections and corpus entries

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V

vercel Bot commented Feb 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: next-docs-search | Deployment: Ready | Actions: Preview, Comment | Updated (UTC): Mar 2, 2026 1:04am


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1dad9b5bde


Comment on lines +236 to +238
processCorpusSignals(corpusSignals, sanitizedQuery, corpusProvider).catch(
(err) => console.error('Background corpus processing failed:', err)
)


P1: Keep corpus writes alive until background task finishes

onFinal starts processCorpusSignals(...) but immediately returns without awaiting it, so the promise is detached from the request lifecycle. In edge/serverless execution, the worker can be terminated as soon as the stream closes, which can cancel this write path and silently drop discovered corpus entries; this makes corpus discovery unreliable under normal production load.
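
One way to address this is to hand the detached promise to the platform's keep-alive hook (e.g. a `waitUntil`-style callback on edge runtimes) instead of fire-and-forget. The helper below is a sketch under that assumption; the function name and the await-as-fallback behavior are not from the PR:

```typescript
// Keep a background task tied to the request lifecycle. If the runtime
// exposes a waitUntil hook, register the promise with it; otherwise
// await it so the write cannot be dropped when the worker is recycled.
export async function runInBackground(
  task: Promise<unknown>,
  waitUntil?: (p: Promise<unknown>) => void
): Promise<void> {
  const guarded = task.catch((err) =>
    console.error('Background corpus processing failed:', err)
  )
  if (waitUntil) {
    waitUntil(guarded) // runtime keeps the worker alive until it settles
  } else {
    await guarded // no hook available: block until the write completes
  }
}
```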


Comment on lines +121 to +122
rrf_score:
result.rrf_score * retrievalWeight + normalizedRerank * rerankWeight,


P2: Put retrieval and rerank scores on the same scale

This blend combines rrf_score (typically a small reciprocal-rank value around hundredths) with normalizedRerank (0–1), so the rerank term dominates even when retrievalWeight is high. In practice that undermines the intended “trust retrieval more for top results” behavior and can reorder context mostly by LLM scoring noise; normalize both signals to comparable ranges before weighting.
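
The suggested fix can be sketched as a min-max normalization that maps the RRF scores into [0, 1] before the weighted blend (the degenerate all-ties handling is an assumption):

```typescript
// Rescale a batch of RRF scores into [0, 1] so they are comparable
// with the LLM's normalized rerank scores before blending.
export function minMaxNormalize(scores: number[]): number[] {
  const lo = Math.min(...scores)
  const hi = Math.max(...scores)
  if (hi === lo) return scores.map(() => 1) // degenerate: all candidates tied
  return scores.map((s) => (s - lo) / (hi - lo))
}
```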


- Replace emoji-laden copy in TerminalInterface and SearchDialog with
  neutral, direct language matching the new Ship AI persona
- Fix broken isLoading check in SearchDialog (was matching old
  'great question' text that no longer appears)
- Fix broken hasResponse/welcome-state checks in both components
  (were matching old 'ready to chat whenever you are' greeting)
- Add URL validation for corpus entry signals (reject non-http URLs)
- Improve corpus content fallback for entries with no summary

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Add VectorSpaceExplorer — a 3D visualization where articles are
positioned by semantic axes (polarity=X, horizon=Y, topic=Z).
Text input drives hybrid search, camera flies to matching clusters,
and results glow proportional to relevance score.

New components and changes:
- VectorSpaceExplorer.tsx: R3F component with ArticleNode (polarity-
  colored icosahedrons), SearchConnections (additive-blended lines
  between hits), SemanticAxes (labeled grid), SearchPulse (expanding
  ring on query), CameraNavigator (smooth lerp to result centroid)
- /api/vector-explore: Returns hybrid search results structured for
  3D visualization (slug, heading, rrf_score)
- TerminalInterface chat tab split into VECTOR SPACE / AI CHAT
  toggle — vector mode shows ranked results + "FLY" button, AI chat
  mode preserved as fallback
- ThreeSixty: Renders VectorSpaceExplorer or InfiniteLibrary based
  on vectorExploreMode state, wires search through prop chain
- Pull in latest TerminalInterface UX from main (floating window,
  keyboard hotkeys, drag support)

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Break monolithic VectorSpaceExplorer into three layers:

- lib/3d/layoutAlgorithms.ts: 6 pure positioning functions (semantic-axes,
  sphere, galaxy, timeline, clusters, helix) with registry and color utils
- components/3d/primitives/: Generic reusable 3D building blocks
  (NodeRenderer with state machine, AxisSystem with N-axis config)
- components/3d/vector-space/: Domain-specific composables (VectorNode,
  VectorConnections, SearchPulse, VectorCamera) orchestrated by a slim
  VectorSpaceExplorer

Deletes the old 496-line monolith. All imports updated.
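
As a flavor of what a pure positioning function in lib/3d/layoutAlgorithms.ts might look like, here is a sketch of a sphere layout using a Fibonacci spiral. The `[x, y, z]` return shape and `radius` parameter are assumptions, not the shipped signature:

```typescript
// Evenly distribute `count` nodes on a sphere via the golden-angle spiral.
export function sphereLayout(count: number, radius = 10): [number, number, number][] {
  const golden = Math.PI * (3 - Math.sqrt(5)) // golden angle in radians
  const points: [number, number, number][] = []
  for (let i = 0; i < count; i++) {
    const y = count === 1 ? 0 : 1 - (2 * i) / (count - 1) // sweep y from 1 to -1
    const r = Math.sqrt(1 - y * y) // ring radius at this latitude
    const theta = golden * i
    points.push([
      Math.cos(theta) * r * radius,
      y * radius,
      Math.sin(theta) * r * radius,
    ])
  }
  return points
}
```

Keeping layouts as pure `(count) => positions` functions is what lets a registry swap between semantic-axes, galaxy, timeline, and the rest without touching the renderer.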

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The vector space explorer was buried 3 clicks deep (MENU → ASK AI →
tiny toggle). Now it's a dedicated tab and a prominent button in the
tablet quick-menu grid:

- Add VECTORS as its own tab (key 2) in TerminalInterface
- Add cyan-accented VECTORS button to tablet quick-menu grid
- Auto-activate 3D vector explore mode when entering the tab
- Remove the cramped vector/chat sub-toggle from the CHAT tab
- CHAT tab is now purely AI chat with clean layout
- Renumber keyboard hotkeys (1-6) to match new tab order

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
- Add dedicated `vectorInput` state so the VECTORS tab input doesn't
  share state with the CHAT tab's `chatInput`
- Replace deprecated `onKeyPress` with `onKeyDown` for both vector
  search and chat inputs (onKeyPress doesn't fire reliably)

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The FLY button appeared to do nothing because:
1. The VECTORS tab relied on the onVectorSearch prop chain (ThreeSixty
   -> InteractiveTablet -> TerminalInterface) which could be stale
2. When the API returned 0 results or errored, the UI showed the same
   placeholder text — indistinguishable from "nothing happened"

Now the VECTORS tab:
- Calls /api/vector-explore directly with its own fetch
- Has its own local state for results, loading, and errors
- Shows distinct states: initial, loading, results, no-results, error
- FLY button shows "..." while searching and disables when empty
- Still notifies parent via onVectorSearch for 3D camera flyto

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The vector DB contains embeddings for content chunks that may not
correspond to actual published articles (e.g. corpus entries, stale
data). This caused the VECTORS tab to show links to articles that
404'd.

Now the /api/vector-explore endpoint loads the article manifest and
filters results so only slugs matching real MDX articles are returned.

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
The VECTORS tab now renders search results as an interactive 3D scene:

- Query appears as a rotating golden octahedron at center
- Results orbit as glowing icosahedron nodes, positioned by score
  (closer = higher relevance, color shifts cyan→gold)
- Connection lines link results to the query and to each other
- Nodes have hover states with billboard labels showing title + score
- Click any node to navigate to its article
- Background particle dust + auto-rotate for ambience
- OrbitControls for manual camera rotation/zoom
- Scanning ring animation plays during search
- Distinct states: initial prompt, searching, no results, error

Uses the same R3F stack as the main 3D scene (drei, three).
Dynamically imported with ssr: false.

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Vector search fix:
- API now falls back to manifest text search when hybrid_search
  returns empty results or is unreachable. Multi-field matching
  against title, description, keywords, and domains with weighted
  scoring ensures the graph always populates with relevant nodes.
- Supabase/OpenAI failures degrade gracefully instead of showing
  NO MATCHES.
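
The weighted multi-field fallback might look like the sketch below. The manifest field names and the 3/2/1 weights are assumptions about the shape of the data, not the shipped code:

```typescript
// Score a manifest entry against a query across weighted fields so the
// graph can populate even when hybrid_search is empty or unreachable.
interface ManifestEntry {
  slug: string
  title: string
  description: string
  keywords: string[]
}

export function fallbackScore(entry: ManifestEntry, query: string): number {
  const q = query.toLowerCase()
  let score = 0
  if (entry.title.toLowerCase().includes(q)) score += 3 // title hits weigh most
  if (entry.description.toLowerCase().includes(q)) score += 2
  if (entry.keywords.some((k) => k.toLowerCase().includes(q))) score += 1
  return score
}
```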

Fullscreen mode:
- EXPAND button appears in the graph HUD when results are loaded
- Opens a fixed fullscreen portal with the same R3F scene
- Full viewport for orbit controls, zoom, and node interaction
- Bottom-anchored search bar for re-querying without leaving
- ESC key closes fullscreen (overrides terminal close when active)
- HUD shows node count and interaction hints

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Core ingestion engine that closes the self-learning loop:

Pipeline: URL → fetch → parse → chunk → embed → store

Components:
- lib/ai/corpus-ingest.ts: Full ingestion pipeline
  - PDF parsing via pdf-parse v2 (class API)
  - HTML extraction with progressive tag stripping
  - Semantic chunking: split on headings → paragraphs → sentences
  - LLM relevance gate (1-5 score, rejects spam/off-topic)
  - Per-chunk summarization for better retrieval
  - Deduplication via URL + content checksums
  - 10MB fetch limit, 30s timeout

- pages/api/corpus/ingest.ts: Admin-authenticated endpoint
  - POST single URL or batch of URLs
  - Protected by withAdminAuth (x-admin-api-key header)

- lib/ai/corpus-manager.ts: Enhanced signal processing
  - When Ship AI emits [[CORPUS_ENTRY:...]] with a URL,
    now triggers deep ingestion in the background
  - Stub entry saved first (fast), full ingestion follows
  - Non-blocking: failures don't affect the chat response

- supabase/migrations: Ingestion observability
  - corpus_ingestion_log table for tracking pipeline runs
  - parent_url column on corpus_entry for chunk→parent linking

The loop is now closed: Ship AI discovers a source during
conversation → emits CORPUS_ENTRY signal → system fetches,
parses, chunks, evaluates relevance, embeds, and stores →
future searches find this knowledge → Ship AI becomes smarter.
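
The heading-then-paragraph chunking step can be sketched as below. The size threshold and splitting regexes are illustrative, and the sentence-level fallback from the pipeline is omitted for brevity:

```typescript
// Split text on markdown-style headings first, then break any oversized
// section on blank lines (paragraph boundaries).
export function chunkText(text: string, maxChars = 1200): string[] {
  const sections = text.split(/\n(?=#{1,6} )/)
  const chunks: string[] = []
  for (const section of sections) {
    if (section.length <= maxChars) {
      if (section.trim()) chunks.push(section.trim())
      continue
    }
    let current = ''
    for (const para of section.split(/\n\s*\n/)) {
      if (current && current.length + para.length > maxChars) {
        chunks.push(current.trim())
        current = ''
      }
      current += para + '\n\n'
    }
    if (current.trim()) chunks.push(current.trim())
  }
  return chunks
}
```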

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
…se eval

The static import chain vector-search.ts (Edge Runtime) → corpus-manager.ts
→ corpus-ingest.ts → pdf-parse pulled pdfjs-dist's eval/WebAssembly.compile
into the Edge bundle, which webpack rejects.

Fix: use dynamic import() for corpus-ingest in processCorpusSignals so the
pdf-parse dependency is only loaded at runtime in Node.js context, never
statically analyzed into the Edge bundle.

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V
Strip PDF parsing from the ingestion pipeline — PDFs are parsed
externally and passed as pre-parsed text via the `content` field
on IngestRequest. This eliminates the pdf-parse/pdfjs-dist
dependency entirely, removing the Edge Runtime eval issue at its
root instead of working around it with dynamic imports.

The ingest pipeline now handles:
- Web pages (fetched + HTML stripped)
- Plain text (fetched)
- Pre-parsed content (passed directly, skips fetch)

Usage for pre-parsed PDFs:
  POST /api/corpus/ingest
  { "url": "https://.../paper.pdf", "title": "...", "content": "extracted text" }

https://claude.ai/code/session_01DwALWjfRL7pa9EswULKw2V