
Retrieval Issue: Documentation Drowning Out Code in Semantic Search Results #47

@iamDyeus

Description


Contextinator Architecture Analysis & Proposed Solutions

Current Architecture Overview

Your System Components

1. Ingestion Pipeline (ingestion/async_service.py)

  • Clones GitHub repositories
  • Orchestrates the chunking → embedding → storage pipeline
  • Uses async/concurrent processing for performance

2. Chunking System (chunking/)

  • AST Parser (ast_parser.py): Uses Tree-sitter to parse code files into Abstract Syntax Trees
  • Node Collector (node_collector.py): Collects semantic nodes (functions, classes, methods) from the AST
  • Chunk Service (chunk_service.py): Orchestrates file discovery, parsing, and chunking
  • Splitter (splitter.py): Splits large chunks based on token limits with overlap
  • Context Builder (context_builder.py): Adds metadata enrichment to chunks

3. Embedding System (embedding/embedding_service.py)

  • Uses OpenAI's text-embedding-3-large model
  • Generates embeddings for code chunks
  • Supports batch processing and async operations
  • Uses "enriched content" (metadata + code) for embedding

4. Vector Store (vectorstore/chroma_store.py)

  • ChromaDB for vector storage and similarity search
  • Supports both local persistence and server mode
  • Stores embeddings with metadata (file path, language, node type, etc.)

5. Retrieval/Search (tools/semantic_search.py)

  • Pure semantic similarity search in ChromaDB
  • Filters by: language, file_path, node_type, is_parent
  • Returns top-k results based on cosine similarity
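As a minimal sketch of what this retrieval step does (not the actual `semantic_search.py` implementation — the `semantic_search` function and chunk dict shape here are illustrative), metadata filtering followed by top-k cosine ranking looks like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_search(query_vec, chunks, k=5, **filters):
    """Return top-k chunks by cosine similarity, after metadata filtering.

    Each chunk is a dict with an 'embedding' plus metadata keys such as
    'language', 'file_path', 'node_type', 'is_parent'.
    """
    # Keep only chunks whose metadata matches every requested filter
    candidates = [
        c for c in chunks
        if all(c.get(key) == value for key, value in filters.items())
    ]
    # Rank the survivors by similarity to the query vector, best first
    scored = sorted(
        candidates,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return scored[:k]
```

ChromaDB does the filtering and ranking internally (via `where` clauses and its index), but the effective behavior is the same: filters narrow the candidate set, then similarity orders it.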

The Problem You're Facing

Issue: Documentation Drowns Out Code in Search Results

Symptoms:

  • In large repos with good documentation, 60-70% of top-k results are README.md, USAGE.md, and doc files
  • Actual code chunks get pushed out of results
  • Problem worse with:
    • Deep folder nesting
    • High documentation-to-code ratio
    • High-level/intent-based queries (vs. specific symbol searches)

Root Cause:
Documentation is written in natural language, which embeds much closer to natural language queries than code does.

Example:

  • Query: "how does authentication work?"
  • README section about auth: 🎯 High similarity (natural language → natural language)
  • Actual auth implementation code: ❌ Lower similarity (natural language → code syntax)

Your Current Approach:

Single Vector Space
├── Code chunks (with some metadata enrichment)
└── Documentation chunks (pure natural language)
     └── Documentation always wins similarity scores!

You DO have some enrichment (build_enriched_content adds metadata like file path, language, node type), but this isn't enough to overcome the natural language advantage of docs.


Solution 1: The Expert's "3D Spectral Indexing" Approach

Translation: What They're Actually Saying

The first commenter is proposing a multi-layered, context-aware architecture. Here's what they mean in plain English:

Core Concept: "Architectural Separation of Concerns"

Instead of treating all chunks equally, organize them into architectural layers based on purpose and structure.

The Three Dimensions

1. Sequence of Expressions

  • Linear code flow: line 1 → line 2 → line 3
  • Think: "execution order" or "temporal flow"

2. Nesting (Scope Hierarchy)

  • Depth in the code structure
  • Example:
    Module
    └── Class
        └── Method
            └── For Loop
                └── If Statement
    
  • You already capture this with your AST parent-child relationships!

3. Conditional Flow Routing

  • Branching: if/else, switch, try/catch
  • Concurrency: threads, async/await, callbacks
  • Control flow patterns

Two Perspectives

Perspective 1: AST Graph

  • Parse code into AST (✅ you already do this)
  • Resolve symbols and their scopes
  • Track references between blocks

Perspective 2: Architectural Summaries

  • Create summaries at each architectural layer
  • Example layers:
    • Function level: "This function authenticates users"
    • Class level: "This class handles user management"
    • Module level: "This module provides authentication services"
  • Index these summaries separately from raw code

"Spectral Indexing"

This is the fancy term for context-aware indexing. Instead of ranking purely by vector-space distance, consider:

For Code:

  • Not just "what does it say?" but "where does it live in the architecture?"
  • Weight by:
    • Symbol density: Code with lots of function/class definitions = higher importance
    • AST depth: Deeper nested code might be implementation details
    • Architectural role: Is this a public API? Internal helper? Data structure?

Analogy They Use:

Geographic indexing considers accessibility (roads, transport) not just distance

Applied to Code:

Index by architectural accessibility: "How relevant is this code to the user's architectural concern?"

Implementation Approach

┌─────────────────────────────────────────────────┐
│         Multi-Layer Index Architecture          │
└─────────────────────────────────────────────────┘

Layer 1: Documentation & High-Level Architecture
├── README sections
├── Architecture docs
├── Module-level summaries (generated from code)
└── [Natural language, high-level concepts]

Layer 2: API & Public Interfaces
├── Public classes/functions
├── Exported symbols
├── Function signatures with docstrings
└── [Mix of natural language + code structure]

Layer 3: Implementation Code
├── Function bodies
├── Private methods
├── Implementation details
└── [Raw code with metadata enrichment]

Layer 4: Low-Level Details
├── Deeply nested blocks
├── Helper functions
├── Data transformations
└── [Pure implementation, heavy metadata]

Query Routing Logic:

  1. Classify query intent based on language:

    • High-level keywords ("architecture", "how does", "overview") → Layer 1-2
    • Code-specific ("function", "class", "implement") → Layer 2-3
    • Symbol names (actual function/class names) → Layer 3-4
  2. Search relevant layers with different weights:

    • High-level query: 70% Layer 1, 30% Layer 2
    • Code query: 10% Layer 1, 40% Layer 2, 50% Layer 3
  3. Combine results with architectural context
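The routing step above can be sketched as a lookup from classified intent to per-layer weights. The intent labels and weight values here are illustrative (taken from the bullets above), not a tuned configuration:

```python
def layer_weights(intent: str) -> dict:
    """Map a classified query intent to per-layer search weights.

    Layers follow the architecture above: 1 = docs/high-level,
    2 = public API, 3 = implementation, 4 = low-level details.
    """
    table = {
        "high_level": {1: 0.7, 2: 0.3},
        "code":       {1: 0.1, 2: 0.4, 3: 0.5},
        "symbol":     {3: 0.5, 4: 0.5},
    }
    # Unknown intent: spread the search evenly across all layers
    return table.get(intent, {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25})
```

Each layer's results would then be scored as `similarity * weight` before merging, so a high-level query still surfaces implementation code when the similarity gap is large enough.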

Key Innovation:
Generate architectural summaries automatically from your AST:

  • For each function: "Function authenticate_user in auth.py validates user credentials"
  • For each class: "Class UserManager handles user CRUD operations"
  • For each module: "Module auth provides authentication and authorization"

Index these summaries as separate chunks in different collections/layers.
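A summary generator of this kind can be a small template over metadata you already collect from the AST. This is a sketch — the field names (`node_type`, `node_name`, `file_path`, `purpose`) are assumptions about your chunk schema, and `purpose` could come from the first docstring line or an LLM pass:

```python
def summarize_chunk(chunk: dict) -> str:
    """Build a one-line natural-language summary from AST chunk metadata.

    The summary is what gets embedded into the summary layer, so it is
    deliberately written as plain English rather than code.
    """
    kind = {
        "function_definition": "Function",
        "class_definition": "Class",
    }.get(chunk["node_type"], "Module")
    purpose = chunk.get("purpose", "no description available")
    return f"{kind} {chunk['node_name']} in {chunk['file_path']}: {purpose}"
```

Because these summaries are natural language, they compete with README text on equal footing — which is exactly the point.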


Solution 2: The Pragmatic "Query Routing + Boosting" Approach

Translation: What They're Actually Saying

This is a simpler, proven approach that doesn't require major re-architecture.

Core Strategy: Separate Collections + Smart Routing

1. Separate Vector Stores

Collection: "code_chunks"
├── All AST-parsed code chunks
├── Functions, classes, methods
└── Enriched with metadata

Collection: "docs_chunks"  
├── README files
├── USAGE guides
├── Architecture docs
└── Pure documentation

Collection: "hybrid_chunks" (optional)
├── Docstrings
├── Inline comments
├── API documentation generated from code
└── Mix of code + natural language
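Routing chunks into these collections can be as simple as dispatching on file extension plus a docstring flag. A sketch, assuming the extension sets below (adjust for your repos) and the collection names above:

```python
DOC_EXTENSIONS = {".md", ".rst", ".txt"}        # treated as documentation
CODE_EXTENSIONS = {".py", ".js", ".ts", ".go"}  # treated as code

def collection_for(file_path: str, is_docstring: bool = False) -> str:
    """Route a chunk to a collection name based on its source file.

    Docstrings and comments land in 'hybrid_chunks' even though they
    come from code files, since they mix code context with prose.
    """
    ext = "." + file_path.rsplit(".", 1)[-1] if "." in file_path else ""
    if is_docstring:
        return "hybrid_chunks"
    if ext in DOC_EXTENSIONS:
        return "docs_chunks"
    if ext in CODE_EXTENSIONS:
        return "code_chunks"
    return "docs_chunks"  # default: unknown text files read as prose
```

The ingestion pipeline would call this once per chunk and write to the matching ChromaDB collection instead of a single shared one.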

2. Query Intent Classification

Classify the query BEFORE searching to decide which collection(s) to hit.

Simple Heuristics (No ML needed):

def classify_query(query: str) -> str:
    code_signals = [
        "function", "class", "method", "implementation",
        "def ", "async ", "import", "return",
        ".py", ".js", ".ts",  # file extensions
        "()", "[]", "{}",     # code syntax
        "error:", "traceback"  # debugging
    ]
    
    doc_signals = [
        "how to", "what is", "overview", "architecture",
        "getting started", "tutorial", "guide",
        "why does", "explain", "documentation"
    ]
    
    code_score = sum(1 for signal in code_signals if signal in query.lower())
    doc_score = sum(1 for signal in doc_signals if signal in query.lower())
    
    if code_score > doc_score:
        return "code"
    elif doc_score > code_score:
        return "docs"
    else:
        return "hybrid"

3. Weighted Hybrid Search

Instead of hard filtering, query multiple collections and weight results:

if query_type == "code":
    results = (
        search("code_chunks", k=8, weight=1.0) +
        search("docs_chunks", k=2, weight=0.3)
    )
elif query_type == "docs":
    results = (
        search("docs_chunks", k=7, weight=1.0) +
        search("code_chunks", k=3, weight=0.5)
    )
else:  # hybrid
    results = (
        search("code_chunks", k=5, weight=0.8) +
        search("docs_chunks", k=5, weight=0.8)
    )

# Re-rank combined results by weighted score, highest first
final_results = sorted(results, key=lambda x: x.score * x.weight, reverse=True)[:10]

4. Boost Code Chunks with Better Context

Your current enrichment:

File: src/auth.py
Language: python
Type: function_definition
Symbol: authenticate_user
Lines: 45-67

def authenticate_user(username, password):
    # actual code...

Enhanced enrichment to compete with docs:

File: src/auth.py
Module: auth
Language: python  
Type: function_definition
Symbol: authenticate_user
Purpose: Validates user credentials against database
Parent: UserManager class
Dependencies: bcrypt, database
Lines: 45-67
Tags: authentication, security, user-validation

def authenticate_user(username, password):
    # actual code...

Add:

  • Purpose: One-sentence description (can be extracted from docstring or generated)
  • Module context: What module/package this belongs to
  • Dependencies: What this code imports/uses
  • Tags: Semantic tags (authentication, database, API, etc.)
  • Parent context: Already have this, make it more prominent
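Assembling that enhanced header is mechanical once the metadata exists. A sketch of the builder — the field names mirror the example header above and are assumptions about your chunk schema, not the actual `build_enriched_content` signature:

```python
def build_enriched_header(chunk: dict) -> str:
    """Render the metadata header prepended to a chunk before embedding.

    Only fields present in the chunk are emitted, so sparse chunks
    produce short headers rather than rows of empty labels.
    """
    field_order = [
        ("file_path", "File"), ("module", "Module"), ("language", "Language"),
        ("node_type", "Type"), ("node_name", "Symbol"), ("purpose", "Purpose"),
        ("parent", "Parent"), ("dependencies", "Dependencies"),
        ("lines", "Lines"), ("tags", "Tags"),
    ]
    lines = []
    for key, label in field_order:
        value = chunk.get(key)
        if value:
            if isinstance(value, (list, tuple)):
                value = ", ".join(map(str, value))
            lines.append(f"{label}: {value}")
    return "\n".join(lines)
```

The `Purpose` and `Tags` lines do the heavy lifting here: they inject natural language into the embedded text, which is what lets code chunks compete with documentation.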

5. Metadata Boosting at Retrieval Time

When retrieving results, boost scores based on metadata:

def boost_score(chunk, query, base_similarity):
    boost = 1.0
    
    # Boost if the query mentions the file name
    file_name = chunk['file_path'].rsplit('/', 1)[-1]
    if file_name in query:
        boost *= 1.5
    
    # Boost if the query mentions the symbol name (skip empty names,
    # which would otherwise match every query)
    node_name = chunk.get('node_name', '').lower()
    if node_name and node_name in query.lower():
        boost *= 2.0
    
    # Boost by symbol density (more defs = more important)
    symbol_density = chunk.get('symbol_count', 0) / max(chunk.get('line_count', 1), 1)
    if symbol_density > 0.1:  # High symbol density
        boost *= 1.3
    
    # Reduce boost for deeply nested code (implementation details)
    ast_depth = chunk.get('ast_depth', 0)
    if ast_depth > 5:
        boost *= 0.8
    
    return base_similarity * boost

Implementation Steps (Pragmatic Approach)

Phase 1: Separate Collections

  1. Modify chunk_repository() to tag chunks as "code" vs "docs"
  2. Store in separate ChromaDB collections
  3. Update semantic_search() to accept collection parameter

Phase 2: Query Classification

  1. Add simple intent classifier (keyword-based)
  2. Route queries to appropriate collection(s)
  3. Merge results with weights

Phase 3: Enhanced Enrichment

  1. Extract docstrings and add as "purpose" field
  2. Add symbol density calculation
  3. Add semantic tags based on imports/names
  4. Make parent context more prominent
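Step 3 above can start as a keyword map rather than anything ML-based. The mapping below is purely illustrative — the import-to-tag pairs are assumptions you'd tune per codebase:

```python
# Illustrative mapping from import names to semantic tags
TAG_RULES = {
    "bcrypt": "security", "hashlib": "security", "jwt": "authentication",
    "sqlalchemy": "database", "sqlite3": "database",
    "requests": "http", "fastapi": "api", "flask": "api",
}

def derive_tags(imports: list[str], symbol_name: str) -> list[str]:
    """Derive semantic tags from a chunk's imports and its symbol name."""
    # Tag by imports: any rule key appearing in an import name fires
    tags = {tag for imp in imports for key, tag in TAG_RULES.items() if key in imp}
    # Tag by symbol name substrings
    name = symbol_name.lower()
    for keyword, tag in [("auth", "authentication"), ("user", "user-management")]:
        if keyword in name:
            tags.add(tag)
    return sorted(tags)
```

Tags land in the enriched header and as ChromaDB metadata, so they help both embedding similarity and metadata filtering.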

Phase 4: Metadata Boosting

  1. Implement post-retrieval score boosting
  2. Use symbol matches, file paths, AST depth
  3. Re-rank before returning to LangGraph

Comparison: Expert vs. Pragmatic

| Aspect | Expert (3D Spectral) | Pragmatic (Query Routing) |
| --- | --- | --- |
| Complexity | High - requires architectural analysis | Low - mostly classification logic |
| Re-architecture | Significant changes needed | Minimal changes to existing code |
| Time to implement | Weeks | Days |
| Maintenance | Complex, many moving parts | Simple, straightforward |
| Flexibility | Very flexible, handles complex queries | Works well for common cases |
| Performance | Potentially better for complex codebases | Fast, proven approach |
| Best for | Large enterprise codebases, complex architectures | Most real-world RAG systems |

Recommended Implementation Path

Start with Pragmatic (Solution 2), Evolve to Expert (Solution 1)

Phase 1: Quick Wins (Week 1)

  • ✅ Separate code/docs collections
  • ✅ Simple query classification
  • ✅ Test with your current repos

Phase 2: Enhanced Context (Week 2)

  • ✅ Better metadata enrichment
  • ✅ Extract docstrings as purpose
  • ✅ Add semantic tags

Phase 3: Smart Boosting (Week 3)

  • ✅ Implement metadata-based boosting
  • ✅ Symbol density calculations
  • ✅ File path matching

Phase 4: Multi-Layer (Month 2+)

  • ✅ Add architectural summary generation
  • ✅ Create layer-based collections
  • ✅ Advanced query routing

Key Takeaways

What You're Doing Right ✅

  1. AST-based chunking - Much better than naive text splitting
  2. Parent-child relationships - Captures code structure
  3. Enriched content - Adding metadata to chunks
  4. Async pipeline - Good performance architecture

What Needs Improvement ❌

  1. Mixed vector space - Code and docs compete unfairly
  2. Insufficient context - Metadata exists but not prominent enough
  3. No query routing - All queries treated the same
  4. No post-retrieval boosting - Pure similarity, no business logic

The Core Insight 💡

The problem isn't your chunking or embeddings. It's that you're making code compete with natural language in a natural language game.

Solution: Either separate the game (different collections) or change the rules (better context + boosting).


Further Reading

Papers & Resources

  • "Code Search with Natural Language Queries" - GitHub's approach to code search
  • "ColBERT" - Late interaction for better code retrieval
  • "GraphCodeBERT" - Using code structure for better embeddings
  • Weights & Biases Blog: "Building RAG for Code" (practical guide)
  • LangChain Docs: Multi-query retrieval and ensemble retrieval

Similar Systems

  • GitHub Copilot: Uses separate indexes for code vs docs
  • Sourcegraph: Multi-tier search (symbol → code → docs)
  • OpenAI Codex: Separate fine-tuning for code understanding

Questions for Your Team

  1. How code-heavy are your typical queries?

    • If 80%+ are code-specific → Go hard on collection separation
    • If mixed → Hybrid approach with good routing
  2. Can you afford reranking latency?

    • Yes → Use heavier boosting/reranking
    • No → Keep it simple, just collection separation
  3. Do you have labeled query data?

    • Yes → Train a query classifier
    • No → Start with heuristics, collect data, improve
  4. How important is documentation context?

    • Very → Keep hybrid search, just weight it
    • Not much → Aggressive filtering for code queries

Last updated: January 7, 2026
