
Retrieval Issue: Documentation Drowning Out Code in Semantic Search Results #47

@iamDyeus

Description


Contextinator Architecture Analysis & Proposed Solutions

Current Architecture Overview

Your System Components

1. Ingestion Pipeline (ingestion/async_service.py)

  • Clones GitHub repositories
  • Orchestrates the chunking → embedding → storage pipeline
  • Uses async/concurrent processing for performance

2. Chunking System (chunking/)

  • AST Parser (ast_parser.py): Uses Tree-sitter to parse code files into Abstract Syntax Trees
  • Node Collector (node_collector.py): Collects semantic nodes (functions, classes, methods) from the AST
  • Chunk Service (chunk_service.py): Orchestrates file discovery, parsing, and chunking
  • Splitter (splitter.py): Splits large chunks based on token limits with overlap
  • Context Builder (context_builder.py): Adds metadata enrichment to chunks

3. Embedding System (embedding/embedding_service.py)

  • Uses OpenAI's text-embedding-3-large model
  • Generates embeddings for code chunks
  • Supports batch processing and async operations
  • Uses "enriched content" (metadata + code) for embedding

4. Vector Store (vectorstore/chroma_store.py)

  • ChromaDB for vector storage and similarity search
  • Supports both local persistence and server mode
  • Stores embeddings with metadata (file path, language, node type, etc.)

5. Retrieval/Search (tools/semantic_search.py)

  • Pure semantic similarity search in ChromaDB
  • Filters by: language, file_path, node_type, is_parent
  • Returns top-k results based on cosine similarity
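As a minimal sketch of what this retrieval step does (not the actual `semantic_search.py` implementation — the `semantic_search` function and chunk dict shape here are illustrative), metadata filtering followed by top-k cosine ranking looks like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_search(query_vec, chunks, k=5, **filters):
    """Return top-k chunks by cosine similarity, after metadata filtering.

    Each chunk is a dict with an 'embedding' plus metadata keys such as
    'language', 'file_path', 'node_type', 'is_parent'.
    """
    # Keep only chunks whose metadata matches every requested filter
    candidates = [
        c for c in chunks
        if all(c.get(key) == value for key, value in filters.items())
    ]
    # Rank the survivors by similarity to the query vector, best first
    scored = sorted(
        candidates,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return scored[:k]
```

ChromaDB does the filtering and ranking internally (via `where` clauses and its index), but the effective behavior is the same: filters narrow the candidate set, then similarity orders it.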

The Problem You're Facing

Issue: Documentation Drowns Out Code in Search Results

Symptoms:

  • In large repos with good documentation, 60-70% of top-k results are README.md, USAGE.md, and doc files
  • Actual code chunks get pushed out of results
  • Problem worse with:
    • Deep folder nesting
    • High documentation-to-code ratio
    • High-level/intent-based queries (vs. specific symbol searches)

Root Cause:
Documentation is written in natural language, which embeds much closer to natural language queries than code does.

Example:

  • Query: "how does authentication work?"
  • README section about auth: 🎯 High similarity (natural language → natural language)
  • Actual auth implementation code: ❌ Lower similarity (natural language → code syntax)

Your Current Approach:

Single Vector Space
├── Code chunks (with some metadata enrichment)
└── Documentation chunks (pure natural language)
     └── Documentation always wins similarity scores!

You DO have some enrichment (build_enriched_content adds metadata like file path, language, node type), but this isn't enough to overcome the natural language advantage of docs.


Solution 1: The Expert's "3D Spectral Indexing" Approach

Translation: What They're Actually Saying

The first commenter is proposing a multi-layered, context-aware architecture. Here's what they mean in plain English:

Core Concept: "Architectural Separation of Concerns"

Instead of treating all chunks equally, organize them into architectural layers based on purpose and structure.

The Three Dimensions

1. Sequence of Expressions

  • Linear code flow: line 1 → line 2 → line 3
  • Think: "execution order" or "temporal flow"

2. Nesting (Scope Hierarchy)

  • Depth in the code structure
  • Example:
    Module
    └── Class
        └── Method
            └── For Loop
                └── If Statement
    
  • You already capture this with your AST parent-child relationships!

3. Conditional Flow Routing

  • Branching: if/else, switch, try/catch
  • Concurrency: threads, async/await, callbacks
  • Control flow patterns

Two Perspectives

Perspective 1: AST Graph

  • Parse code into AST (✅ you already do this)
  • Resolve symbols and their scopes
  • Track references between blocks

Perspective 2: Architectural Summaries

  • Create summaries at each architectural layer
  • Example layers:
    • Function level: "This function authenticates users"
    • Class level: "This class handles user management"
    • Module level: "This module provides authentication services"
  • Index these summaries separately from raw code

"Spectral Indexing"

This is the fancy term for context-aware indexing. Instead of ranking purely by vector-space distance, consider:

For Code:

  • Not just "what does it say?" but "where does it live in the architecture?"
  • Weight by:
    • Symbol density: Code with lots of function/class definitions = higher importance
    • AST depth: Deeper nested code might be implementation details
    • Architectural role: Is this a public API? Internal helper? Data structure?

Analogy They Use:

Geographic indexing considers accessibility (roads, transport) not just distance

Applied to Code:

Index by architectural accessibility: "How relevant is this code to the user's architectural concern?"

Implementation Approach

┌─────────────────────────────────────────────────┐
│         Multi-Layer Index Architecture          │
└─────────────────────────────────────────────────┘

Layer 1: Documentation & High-Level Architecture
├── README sections
├── Architecture docs
├── Module-level summaries (generated from code)
└── [Natural language, high-level concepts]

Layer 2: API & Public Interfaces
├── Public classes/functions
├── Exported symbols
├── Function signatures with docstrings
└── [Mix of natural language + code structure]

Layer 3: Implementation Code
├── Function bodies
├── Private methods
├── Implementation details
└── [Raw code with metadata enrichment]

Layer 4: Low-Level Details
├── Deeply nested blocks
├── Helper functions
├── Data transformations
└── [Pure implementation, heavy metadata]

Query Routing Logic:

  1. Classify query intent based on language:

    • High-level keywords ("architecture", "how does", "overview") → Layer 1-2
    • Code-specific ("function", "class", "implement") → Layer 2-3
    • Symbol names (actual function/class names) → Layer 3-4
  2. Search relevant layers with different weights:

    • High-level query: 70% Layer 1, 30% Layer 2
    • Code query: 10% Layer 1, 40% Layer 2, 50% Layer 3
  3. Combine results with architectural context
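The routing step above can be sketched as a lookup from classified intent to per-layer weights. The intent labels and weight values here are illustrative (taken from the bullets above), not a tuned configuration:

```python
def layer_weights(intent: str) -> dict:
    """Map a classified query intent to per-layer search weights.

    Layers follow the architecture above: 1 = docs/high-level,
    2 = public API, 3 = implementation, 4 = low-level details.
    """
    table = {
        "high_level": {1: 0.7, 2: 0.3},
        "code":       {1: 0.1, 2: 0.4, 3: 0.5},
        "symbol":     {3: 0.5, 4: 0.5},
    }
    # Unknown intent: spread the search evenly across all layers
    return table.get(intent, {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25})
```

Each layer's results would then be scored as `similarity * weight` before merging, so a high-level query still surfaces implementation code when the similarity gap is large enough.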

Key Innovation:
Generate architectural summaries automatically from your AST:

  • For each function: "Function authenticate_user in auth.py validates user credentials"
  • For each class: "Class UserManager handles user CRUD operations"
  • For each module: "Module auth provides authentication and authorization"

Index these summaries as separate chunks in different collections/layers.
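A summary generator of this kind can be a small template over metadata you already collect from the AST. This is a sketch — the field names (`node_type`, `node_name`, `file_path`, `purpose`) are assumptions about your chunk schema, and `purpose` could come from the first docstring line or an LLM pass:

```python
def summarize_chunk(chunk: dict) -> str:
    """Build a one-line natural-language summary from AST chunk metadata.

    The summary is what gets embedded into the summary layer, so it is
    deliberately written as plain English rather than code.
    """
    kind = {
        "function_definition": "Function",
        "class_definition": "Class",
    }.get(chunk["node_type"], "Module")
    purpose = chunk.get("purpose", "no description available")
    return f"{kind} {chunk['node_name']} in {chunk['file_path']}: {purpose}"
```

Because these summaries are natural language, they compete with README text on equal footing — which is exactly the point.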


Solution 2: The Pragmatic "Query Routing + Boosting" Approach

Translation: What They're Actually Saying

This is a simpler, proven approach that doesn't require major re-architecture.

Core Strategy: Separate Collections + Smart Routing

1. Separate Vector Stores

Collection: "code_chunks"
├── All AST-parsed code chunks
├── Functions, classes, methods
└── Enriched with metadata

Collection: "docs_chunks"  
├── README files
├── USAGE guides
├── Architecture docs
└── Pure documentation

Collection: "hybrid_chunks" (optional)
├── Docstrings
├── Inline comments
├── API documentation generated from code
└── Mix of code + natural language
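Routing chunks into these collections can be as simple as dispatching on file extension plus a docstring flag. A sketch, assuming the extension sets below (adjust for your repos) and the collection names above:

```python
DOC_EXTENSIONS = {".md", ".rst", ".txt"}        # treated as documentation
CODE_EXTENSIONS = {".py", ".js", ".ts", ".go"}  # treated as code

def collection_for(file_path: str, is_docstring: bool = False) -> str:
    """Route a chunk to a collection name based on its source file.

    Docstrings and comments land in 'hybrid_chunks' even though they
    come from code files, since they mix code context with prose.
    """
    ext = "." + file_path.rsplit(".", 1)[-1] if "." in file_path else ""
    if is_docstring:
        return "hybrid_chunks"
    if ext in DOC_EXTENSIONS:
        return "docs_chunks"
    if ext in CODE_EXTENSIONS:
        return "code_chunks"
    return "docs_chunks"  # default: unknown text files read as prose
```

The ingestion pipeline would call this once per chunk and write to the matching ChromaDB collection instead of a single shared one.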

2. Query Intent Classification

Classify the query BEFORE searching to decide which collection(s) to hit.

Simple Heuristics (No ML needed):

def classify_query(query: str) -> str:
    code_signals = [
        "function", "class", "method", "implementation",
        "def ", "async ", "import", "return",
        ".py", ".js", ".ts",  # file extensions
        "()", "[]", "{}",     # code syntax
        "error:", "traceback"  # debugging
    ]
    
    doc_signals = [
        "how to", "what is", "overview", "architecture",
        "getting started", "tutorial", "guide",
        "why does", "explain", "documentation"
    ]
    
    code_score = sum(1 for signal in code_signals if signal in query.lower())
    doc_score = sum(1 for signal in doc_signals if signal in query.lower())
    
    if code_score > doc_score:
        return "code"
    elif doc_score > code_score:
        return "docs"
    else:
        return "hybrid"

3. Weighted Hybrid Search

Instead of hard filtering, query multiple collections and weight results:

if query_type == "code":
    results = (
        search("code_chunks", k=8, weight=1.0) +
        search("docs_chunks", k=2, weight=0.3)
    )
elif query_type == "docs":
    results = (
        search("docs_chunks", k=7, weight=1.0) +
        search("code_chunks", k=3, weight=0.5)
    )
else:  # hybrid
    results = (
        search("code_chunks", k=5, weight=0.8) +
        search("docs_chunks", k=5, weight=0.8)
    )

# Re-rank combined results by weighted score, highest first
final_results = sorted(results, key=lambda x: x.score * x.weight, reverse=True)[:10]

4. Boost Code Chunks with Better Context

Your current enrichment:

File: src/auth.py
Language: python
Type: function_definition
Symbol: authenticate_user
Lines: 45-67

def authenticate_user(username, password):
    # actual code...

Enhanced enrichment to compete with docs:

File: src/auth.py
Module: auth
Language: python  
Type: function_definition
Symbol: authenticate_user
Purpose: Validates user credentials against database
Parent: UserManager class
Dependencies: bcrypt, database
Lines: 45-67
Tags: authentication, security, user-validation

def authenticate_user(username, password):
    # actual code...

Add:

  • Purpose: One-sentence description (can be extracted from docstring or generated)
  • Module context: What module/package this belongs to
  • Dependencies: What this code imports/uses
  • Tags: Semantic tags (authentication, database, API, etc.)
  • Parent context: Already have this, make it more prominent
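Assembling that enhanced header is mechanical once the metadata exists. A sketch of the builder — the field names mirror the example header above and are assumptions about your chunk schema, not the actual `build_enriched_content` signature:

```python
def build_enriched_header(chunk: dict) -> str:
    """Render the metadata header prepended to a chunk before embedding.

    Only fields present in the chunk are emitted, so sparse chunks
    produce short headers rather than rows of empty labels.
    """
    field_order = [
        ("file_path", "File"), ("module", "Module"), ("language", "Language"),
        ("node_type", "Type"), ("node_name", "Symbol"), ("purpose", "Purpose"),
        ("parent", "Parent"), ("dependencies", "Dependencies"),
        ("lines", "Lines"), ("tags", "Tags"),
    ]
    lines = []
    for key, label in field_order:
        value = chunk.get(key)
        if value:
            if isinstance(value, (list, tuple)):
                value = ", ".join(map(str, value))
            lines.append(f"{label}: {value}")
    return "\n".join(lines)
```

The `Purpose` and `Tags` lines do the heavy lifting here: they inject natural language into the embedded text, which is what lets code chunks compete with documentation.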

5. Metadata Boosting at Retrieval Time

When retrieving results, boost scores based on metadata:

def boost_score(chunk, query, base_similarity):
    boost = 1.0
    
    # Boost if the query mentions the file name
    file_name = chunk['file_path'].rsplit('/', 1)[-1]
    if file_name in query:
        boost *= 1.5
    
    # Boost if the query mentions the symbol name (skip empty names,
    # which would otherwise match every query)
    node_name = chunk.get('node_name', '').lower()
    if node_name and node_name in query.lower():
        boost *= 2.0
    
    # Boost by symbol density (more defs = more important)
    symbol_density = chunk.get('symbol_count', 0) / max(chunk.get('line_count', 1), 1)
    if symbol_density > 0.1:  # High symbol density
        boost *= 1.3
    
    # Reduce boost for deeply nested code (implementation details)
    ast_depth = chunk.get('ast_depth', 0)
    if ast_depth > 5:
        boost *= 0.8
    
    return base_similarity * boost

Implementation Steps (Pragmatic Approach)

Phase 1: Separate Collections

  1. Modify chunk_repository() to tag chunks as "code" vs "docs"
  2. Store in separate ChromaDB collections
  3. Update semantic_search() to accept collection parameter

Phase 2: Query Classification

  1. Add simple intent classifier (keyword-based)
  2. Route queries to appropriate collection(s)
  3. Merge results with weights

Phase 3: Enhanced Enrichment

  1. Extract docstrings and add as "purpose" field
  2. Add symbol density calculation
  3. Add semantic tags based on imports/names
  4. Make parent context more prominent
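Step 3 above can start as a keyword map rather than anything ML-based. The mapping below is purely illustrative — the import-to-tag pairs are assumptions you'd tune per codebase:

```python
# Illustrative mapping from import names to semantic tags
TAG_RULES = {
    "bcrypt": "security", "hashlib": "security", "jwt": "authentication",
    "sqlalchemy": "database", "sqlite3": "database",
    "requests": "http", "fastapi": "api", "flask": "api",
}

def derive_tags(imports: list[str], symbol_name: str) -> list[str]:
    """Derive semantic tags from a chunk's imports and its symbol name."""
    # Tag by imports: any rule key appearing in an import name fires
    tags = {tag for imp in imports for key, tag in TAG_RULES.items() if key in imp}
    # Tag by symbol name substrings
    name = symbol_name.lower()
    for keyword, tag in [("auth", "authentication"), ("user", "user-management")]:
        if keyword in name:
            tags.add(tag)
    return sorted(tags)
```

Tags land in the enriched header and as ChromaDB metadata, so they help both embedding similarity and metadata filtering.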

Phase 4: Metadata Boosting

  1. Implement post-retrieval score boosting
  2. Use symbol matches, file paths, AST depth
  3. Re-rank before returning to LangGraph

Comparison: Expert vs. Pragmatic

| Aspect | Expert (3D Spectral) | Pragmatic (Query Routing) |
| --- | --- | --- |
| Complexity | High - requires architectural analysis | Low - mostly classification logic |
| Re-architecture | Significant changes needed | Minimal changes to existing code |
| Time to implement | Weeks | Days |
| Maintenance | Complex, many moving parts | Simple, straightforward |
| Flexibility | Very flexible, handles complex queries | Works well for common cases |
| Performance | Potentially better for complex codebases | Fast, proven approach |
| Best for | Large enterprise codebases, complex architectures | Most real-world RAG systems |

Recommended Implementation Path

Start with Pragmatic (Solution 2), Evolve to Expert (Solution 1)

Phase 1: Quick Wins (Week 1)

  • ✅ Separate code/docs collections
  • ✅ Simple query classification
  • ✅ Test with your current repos

Phase 2: Enhanced Context (Week 2)

  • ✅ Better metadata enrichment
  • ✅ Extract docstrings as purpose
  • ✅ Add semantic tags

Phase 3: Smart Boosting (Week 3)

  • ✅ Implement metadata-based boosting
  • ✅ Symbol density calculations
  • ✅ File path matching

Phase 4: Multi-Layer (Month 2+)

  • ✅ Add architectural summary generation
  • ✅ Create layer-based collections
  • ✅ Advanced query routing

Key Takeaways

What You're Doing Right ✅

  1. AST-based chunking - Much better than naive text splitting
  2. Parent-child relationships - Captures code structure
  3. Enriched content - Adding metadata to chunks
  4. Async pipeline - Good performance architecture

What Needs Improvement ❌

  1. Mixed vector space - Code and docs compete unfairly
  2. Insufficient context - Metadata exists but not prominent enough
  3. No query routing - All queries treated the same
  4. No post-retrieval boosting - Pure similarity, no business logic

The Core Insight 💡

The problem isn't your chunking or embeddings. It's that you're making code compete with natural language in a natural language game.

Solution: Either separate the game (different collections) or change the rules (better context + boosting).


Further Reading

Papers & Resources

  • "Code Search with Natural Language Queries" - GitHub's approach to code search
  • "ColBERT" - Late interaction for better code retrieval
  • "GraphCodeBERT" - Using code structure for better embeddings
  • Weights & Biases Blog: "Building RAG for Code" (practical guide)
  • LangChain Docs: Multi-query retrieval and ensemble retrieval

Similar Systems

  • GitHub Copilot: Uses separate indexes for code vs docs
  • Sourcegraph: Multi-tier search (symbol → code → docs)
  • OpenAI Codex: Separate fine-tuning for code understanding

Questions for Your Team

  1. How code-heavy are your typical queries?

    • If 80%+ are code-specific → Go hard on collection separation
    • If mixed → Hybrid approach with good routing
  2. Can you afford reranking latency?

    • Yes → Use heavier boosting/reranking
    • No → Keep it simple, just collection separation
  3. Do you have labeled query data?

    • Yes → Train a query classifier
    • No → Start with heuristics, collect data, improve
  4. How important is documentation context?

    • Very → Keep hybrid search, just weight it
    • Not much → Aggressive filtering for code queries

Last updated: January 7, 2026
