Contextinator Architecture Analysis & Proposed Solutions
Current Architecture Overview
Your System Components
1. Ingestion Pipeline (ingestion/async_service.py)
- Clones GitHub repositories
- Orchestrates the chunking → embedding → storage pipeline
- Uses async/concurrent processing for performance
2. Chunking System (chunking/)
- AST Parser (ast_parser.py): Uses Tree-sitter to parse code files into Abstract Syntax Trees
- Node Collector (node_collector.py): Collects semantic nodes (functions, classes, methods) from the AST
- Chunk Service (chunk_service.py): Orchestrates file discovery, parsing, and chunking
- Splitter (splitter.py): Splits large chunks based on token limits with overlap
- Context Builder (context_builder.py): Adds metadata enrichment to chunks
3. Embedding System (embedding/embedding_service.py)
- Uses OpenAI's text-embedding-3-large model
- Generates embeddings for code chunks
- Supports batch processing and async operations
- Uses "enriched content" (metadata + code) for embedding
4. Vector Store (vectorstore/chroma_store.py)
- ChromaDB for vector storage and similarity search
- Supports both local persistence and server mode
- Stores embeddings with metadata (file path, language, node type, etc.)
5. Retrieval/Search (tools/semantic_search.py)
- Pure semantic similarity search in ChromaDB
- Filters by: language, file_path, node_type, is_parent
- Returns top-k results based on cosine similarity
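For reference, a filtered top-k query against ChromaDB looks roughly like this; the collection name is illustrative, and in your pipeline the query vector would come from the OpenAI embedding service rather than Chroma's default embedder:

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")   # local persistence mode
collection = client.get_or_create_collection("chunks")   # illustrative name

results = collection.query(
    query_texts=["how does authentication work?"],  # embedded by the collection's embedder
    n_results=10,                                   # top-k
    where={"language": "python"},                   # metadata filter
)
```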
The Problem You're Facing
Issue: Documentation Drowns Out Code in Search Results
Symptoms:
- In large repos with good documentation, 60-70% of top-k results are README.md, USAGE.md, and other doc files
- Actual code chunks get pushed out of the results
- The problem is worse with:
  - Deep folder nesting
  - A high documentation-to-code ratio
  - High-level/intent-based queries (vs. specific symbol searches)
Root Cause:
Documentation is written in natural language, which embeds much closer to natural language queries than code does.
Example:
- Query: "how does authentication work?"
- README section about auth: 🎯 High similarity (natural language → natural language)
- Actual auth implementation code: ❌ Lower similarity (natural language → code syntax)
Your Current Approach:
Single Vector Space
├── Code chunks (with some metadata enrichment)
└── Documentation chunks (pure natural language)
└── Documentation always wins similarity scores!
You DO have some enrichment (build_enriched_content adds metadata like file path, language, node type), but this isn't enough to overcome the natural language advantage of docs.
Solution 1: The Expert's "3D Spectral Indexing" Approach
Translation: What They're Actually Saying
The first commenter is proposing a multi-layered, context-aware architecture. Here's what they mean in plain English:
Core Concept: "Architectural Separation of Concerns"
Instead of treating all chunks equally, organize them into architectural layers based on purpose and structure.
The Three Dimensions
1. Sequence of Expressions
- Linear code flow: line 1 → line 2 → line 3
- Think: "execution order" or "temporal flow"
2. Nesting (Scope Hierarchy)
- Depth in the code structure
- Example:
Module
└── Class
└── Method
└── For Loop
└── If Statement
- You already capture this with your AST parent-child relationships!
3. Conditional Flow Routing
- Branching: if/else, switch, try/catch
- Concurrency: threads, async/await, callbacks
- Control flow patterns
Two Perspectives
Perspective 1: AST Graph
- Parse code into AST (✅ you already do this)
- Resolve symbols and their scopes
- Track references between blocks
Perspective 2: Architectural Summaries
- Create summaries at each architectural layer
- Example layers:
- Function level: "This function authenticates users"
- Class level: "This class handles user management"
- Module level: "This module provides authentication services"
- Index these summaries separately from raw code
"Spectral Indexing"
This is the fancy term for context-aware indexing. Instead of ranking purely by geometric distance in vector space, consider:
For Code:
- Not just "what does it say?" but "where does it live in the architecture?"
- Weight by:
- Symbol density: Code with lots of function/class definitions = higher importance
- AST depth: Deeper nested code might be implementation details
- Architectural role: Is this a public API? Internal helper? Data structure?
Analogy They Use:
Geographic indexing considers accessibility (roads, transport), not just raw distance
Applied to Code:
Index by architectural accessibility: "How relevant is this code to the user's architectural concern?"
Implementation Approach
┌─────────────────────────────────────────────────┐
│ Multi-Layer Index Architecture │
└─────────────────────────────────────────────────┘
Layer 1: Documentation & High-Level Architecture
├── README sections
├── Architecture docs
├── Module-level summaries (generated from code)
└── [Natural language, high-level concepts]
Layer 2: API & Public Interfaces
├── Public classes/functions
├── Exported symbols
├── Function signatures with docstrings
└── [Mix of natural language + code structure]
Layer 3: Implementation Code
├── Function bodies
├── Private methods
├── Implementation details
└── [Raw code with metadata enrichment]
Layer 4: Low-Level Details
├── Deeply nested blocks
├── Helper functions
├── Data transformations
└── [Pure implementation, heavy metadata]
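To make the layer diagram concrete, a small classifier can assign each existing chunk to a layer using metadata the pipeline already produces; the thresholds and field names below are assumptions to adapt, not part of the current codebase:

```python
def assign_layer(chunk: dict) -> int:
    """Map a chunk to one of the four layers above (illustrative rules)."""
    if chunk["node_type"] in ("markdown", "module_summary"):
        return 1  # documentation and high-level architecture
    is_definition = chunk["node_type"] in ("class_definition", "function_definition")
    if is_definition and not chunk.get("node_name", "").startswith("_"):
        return 2  # public API surface
    if chunk.get("ast_depth", 0) <= 5:
        return 3  # implementation code
    return 4      # deeply nested, low-level details
```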
Query Routing Logic:
1. Classify query intent based on language:
   - High-level keywords ("architecture", "how does", "overview") → Layers 1-2
   - Code-specific terms ("function", "class", "implement") → Layers 2-3
   - Symbol names (actual function/class names) → Layers 3-4
2. Search the relevant layers with different weights:
   - High-level query: 70% Layer 1, 30% Layer 2
   - Code query: 10% Layer 1, 40% Layer 2, 50% Layer 3
3. Combine results with architectural context (see the sketch below)
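As a sketch of step 2, the per-layer weights can live in a table keyed by intent; `search_layer` is a hypothetical helper that queries one layer's collection and returns (chunk, similarity) pairs:

```python
LAYER_WEIGHTS = {
    "high_level": {1: 0.7, 2: 0.3},
    "code":       {1: 0.1, 2: 0.4, 3: 0.5},
}

def route_query(query: str, intent: str, k: int = 10):
    """Search each weighted layer and merge results by weighted similarity."""
    merged = []
    for layer, weight in LAYER_WEIGHTS[intent].items():
        for chunk, similarity in search_layer(layer, query, k):  # hypothetical helper
            merged.append((chunk, similarity * weight))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)[:k]
```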
Key Innovation:
Generate architectural summaries automatically from your AST:
- For each function: "Function authenticate_user in auth.py validates user credentials"
- For each class: "Class UserManager handles user CRUD operations"
- For each module: "Module auth provides authentication and authorization"
Index these summaries as separate chunks in different collections/layers.
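A first pass at summary generation needs no LLM; metadata and docstrings from the AST get you surprisingly far (an LLM can later produce richer one-liners). The field names here follow the chunk schema described earlier and are assumptions:

```python
def summarize_chunk(chunk: dict) -> str:
    """Build a natural-language summary line from AST metadata."""
    kind = chunk["node_type"].replace("_definition", "").capitalize()
    doc = (chunk.get("docstring") or "").strip().split("\n")[0]
    action = (doc[0].lower() + doc[1:]) if doc else "is defined here"
    return f"{kind} {chunk['node_name']} in {chunk['file_path']} {action}"

# e.g. -> "Function authenticate_user in auth.py validates user credentials"
```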
Solution 2: The Pragmatic "Query Routing + Boosting" Approach
Translation: What They're Actually Saying
This is a simpler, proven approach that doesn't require major re-architecture.
Core Strategy: Separate Collections + Smart Routing
1. Separate Vector Stores
Collection: "code_chunks"
├── All AST-parsed code chunks
├── Functions, classes, methods
└── Enriched with metadata
Collection: "docs_chunks"
├── README files
├── USAGE guides
├── Architecture docs
└── Pure documentation
Collection: "hybrid_chunks" (optional)
├── Docstrings
├── Inline comments
├── API documentation generated from code
└── Mix of code + natural language
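In ChromaDB the split is just two or three named collections on the same client; at ingestion time a simple file-type check decides where each chunk goes (the extension list and names are illustrative):

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

code_chunks = client.get_or_create_collection("code_chunks")
docs_chunks = client.get_or_create_collection("docs_chunks")
hybrid_chunks = client.get_or_create_collection("hybrid_chunks")  # optional

DOC_EXTENSIONS = (".md", ".rst", ".txt")

def target_collection(file_path: str):
    """Route a chunk to the docs or code collection by file extension."""
    return docs_chunks if file_path.endswith(DOC_EXTENSIONS) else code_chunks
```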
2. Query Intent Classification
Classify the query BEFORE searching to decide which collection(s) to hit.
Simple Heuristics (No ML needed):
```python
def classify_query(query: str) -> str:
    code_signals = [
        "function", "class", "method", "implementation",
        "def ", "async ", "import", "return",
        ".py", ".js", ".ts",    # file extensions
        "()", "[]", "{}",       # code syntax
        "error:", "traceback",  # debugging
    ]
    doc_signals = [
        "how to", "what is", "overview", "architecture",
        "getting started", "tutorial", "guide",
        "why does", "explain", "documentation",
    ]

    query_lower = query.lower()
    code_score = sum(1 for signal in code_signals if signal in query_lower)
    doc_score = sum(1 for signal in doc_signals if signal in query_lower)

    if code_score > doc_score:
        return "code"
    elif doc_score > code_score:
        return "docs"
    else:
        return "hybrid"
```
3. Weighted Hybrid Search
Instead of hard filtering, query multiple collections and weight results:
```python
# `search(collection, query, k, weight)` is a placeholder for a helper that
# queries one ChromaDB collection and returns result objects carrying the
# similarity as `.score` and the collection weight as `.weight`.
if query_type == "code":
    results = (
        search("code_chunks", query, k=8, weight=1.0) +
        search("docs_chunks", query, k=2, weight=0.3)
    )
elif query_type == "docs":
    results = (
        search("docs_chunks", query, k=7, weight=1.0) +
        search("code_chunks", query, k=3, weight=0.5)
    )
else:  # hybrid
    results = (
        search("code_chunks", query, k=5, weight=0.8) +
        search("docs_chunks", query, k=5, weight=0.8)
    )

# Re-rank combined results by weighted score, best first (assumes higher
# similarity = better; flip the sort if your store returns distances).
final_results = sorted(results, key=lambda x: x.score * x.weight, reverse=True)[:10]
```
4. Boost Code Chunks with Better Context
Your current enrichment:
```
File: src/auth.py
Language: python
Type: function_definition
Symbol: authenticate_user
Lines: 45-67

def authenticate_user(username, password):
    # actual code...
```
Enhanced enrichment to compete with docs:
```
File: src/auth.py
Module: auth
Language: python
Type: function_definition
Symbol: authenticate_user
Purpose: Validates user credentials against database
Parent: UserManager class
Dependencies: bcrypt, database
Lines: 45-67
Tags: authentication, security, user-validation

def authenticate_user(username, password):
    # actual code...
```
Add:
- Purpose: One-sentence description (can be extracted from docstring or generated)
- Module context: What module/package this belongs to
- Dependencies: What this code imports/uses
- Tags: Semantic tags (authentication, database, API, etc.)
- Parent context: You already have this; make it more prominent
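A minimal sketch of the extended enrichment, assuming the chunk dict carries these fields (the helper name build_enriched_content mirrors your existing function; the extra keys are hypothetical and should match your schema):

```python
def build_enriched_content(chunk: dict) -> str:
    """Prepend a metadata header to the raw code before embedding."""
    header_lines = [
        f"File: {chunk['file_path']}",
        f"Module: {chunk.get('module', '')}",
        f"Language: {chunk['language']}",
        f"Type: {chunk['node_type']}",
        f"Symbol: {chunk.get('node_name', '')}",
        f"Purpose: {chunk.get('purpose', '')}",
        f"Parent: {chunk.get('parent_context', '')}",
        f"Dependencies: {', '.join(chunk.get('dependencies', []))}",
        f"Tags: {', '.join(chunk.get('tags', []))}",
    ]
    # Drop lines whose value is empty so sparse chunks stay compact.
    header = "\n".join(line for line in header_lines if not line.endswith(": "))
    return f"{header}\n\n{chunk['code']}"
```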
5. Metadata Boosting at Retrieval Time
When retrieving results, boost scores based on metadata:
```python
def boost_score(chunk: dict, query: str, base_similarity: float) -> float:
    boost = 1.0

    # Boost if the query mentions the file path.
    if chunk["file_path"] in query:
        boost *= 1.5

    # Boost if the query mentions the symbol name (guard against empty
    # names: "" is a substring of everything, which would boost every chunk).
    node_name = chunk.get("node_name", "").lower()
    if node_name and node_name in query.lower():
        boost *= 2.0

    # Boost by symbol density (more definitions per line = more important).
    symbol_density = chunk.get("symbol_count", 0) / max(chunk.get("line_count", 1), 1)
    if symbol_density > 0.1:  # high symbol density
        boost *= 1.3

    # Penalize deeply nested code (likely implementation details).
    if chunk.get("ast_depth", 0) > 5:
        boost *= 0.8

    return base_similarity * boost
```
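Wiring this in is a re-rank over the raw hits before they go back to LangGraph; the (chunk, similarity) pair shape here is an assumption, adapt it to what semantic_search actually returns:

```python
def rerank(results: list[tuple[dict, float]], query: str, k: int = 10):
    """Re-rank (chunk, similarity) pairs by boosted score, best first."""
    scored = [(chunk, boost_score(chunk, query, sim)) for chunk, sim in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```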
Implementation Steps (Pragmatic Approach)
Phase 1: Separate Collections
- Modify chunk_repository() to tag chunks as "code" vs "docs"
- Store them in separate ChromaDB collections
- Update semantic_search() to accept a collection parameter
Phase 2: Query Classification
- Add simple intent classifier (keyword-based)
- Route queries to appropriate collection(s)
- Merge results with weights
Phase 3: Enhanced Enrichment
- Extract docstrings and add them as a "purpose" field (see the sketch after this list)
- Add symbol density calculation
- Add semantic tags based on imports/names
- Make parent context more prominent
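For the docstring extraction in the first item, the same idea in pure Python (your Tree-sitter path would use the equivalent node queries) is a few lines with the standard ast module:

```python
import ast

def extract_purposes(source: str, file_path: str = "<unknown>") -> dict[str, str]:
    """Map each function/class name to the first line of its docstring."""
    purposes = {}
    tree = ast.parse(source, filename=file_path)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc:
                purposes[node.name] = doc.strip().split("\n")[0]
    return purposes
```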
Phase 4: Metadata Boosting
- Implement post-retrieval score boosting
- Use symbol matches, file paths, AST depth
- Re-rank before returning to LangGraph
Comparison: Expert vs. Pragmatic
| Aspect | Expert (3D Spectral) | Pragmatic (Query Routing) |
| --- | --- | --- |
| Complexity | High - requires architectural analysis | Low - mostly classification logic |
| Re-architecture | Significant changes needed | Minimal changes to existing code |
| Time to Implement | Weeks | Days |
| Maintenance | Complex, many moving parts | Simple, straightforward |
| Flexibility | Very flexible, handles complex queries | Works well for common cases |
| Performance | Potentially better for complex codebases | Fast, proven approach |
| Best For | Large enterprise codebases, complex architectures | Most real-world RAG systems |
Recommended Implementation Path
Start with Pragmatic (Solution 2), Evolve to Expert (Solution 1)
Phase 1: Quick Wins (Week 1)
- ✅ Separate code/docs collections
- ✅ Simple query classification
- ✅ Test with your current repos
Phase 2: Enhanced Context (Week 2)
- ✅ Better metadata enrichment
- ✅ Extract docstrings as purpose
- ✅ Add semantic tags
Phase 3: Smart Boosting (Week 3)
- ✅ Implement metadata-based boosting
- ✅ Symbol density calculations
- ✅ File path matching
Phase 4: Multi-Layer (Month 2+)
- ✅ Add architectural summary generation
- ✅ Create layer-based collections
- ✅ Advanced query routing
Key Takeaways
What You're Doing Right ✅
- AST-based chunking - Much better than naive text splitting
- Parent-child relationships - Captures code structure
- Enriched content - Adding metadata to chunks
- Async pipeline - Good performance architecture
What Needs Improvement ❌
- Mixed vector space - Code and docs compete unfairly
- Insufficient context - Metadata exists but not prominent enough
- No query routing - All queries treated the same
- No post-retrieval boosting - Pure similarity, no business logic
The Core Insight 💡
The problem isn't your chunking or embeddings—it's that you're making code compete with natural language in a natural language game.
Solution: Either separate the game (different collections) or change the rules (better context + boosting).
Further Reading
Papers & Resources
- "Code Search with Natural Language Queries" - GitHub's approach to code search
- "ColBERT" - Late interaction for better code retrieval
- "GraphCodeBERT" - Using code structure for better embeddings
- Weights & Biases Blog: "Building RAG for Code" (practical guide)
- LangChain Docs: Multi-query retrieval and ensemble retrieval
Similar Systems
- GitHub Copilot: Uses separate indexes for code vs docs
- Sourcegraph: Multi-tier search (symbol → code → docs)
- OpenAI Codex: Separate fine-tuning for code understanding
Questions for Your Team
1. How code-heavy are your typical queries?
   - If 80%+ are code-specific → go hard on collection separation
   - If mixed → use the hybrid approach with good routing
2. Can you afford reranking latency?
   - Yes → use heavier boosting/reranking
   - No → keep it simple, just collection separation
3. Do you have labeled query data?
   - Yes → train a query classifier
   - No → start with heuristics, collect data, improve
4. How important is documentation context?
   - Very → keep hybrid search, just weight it
   - Not much → aggressive filtering for code queries
Last updated: January 7, 2026