Commit 01ef58d

Merge pull request #170 from m1rl0k/dense
Dense
2 parents ae1296e + 0ef77d7 commit 01ef58d

63 files changed

Lines changed: 7692 additions & 904 deletions

.github/workflows/ci.yml

Lines changed: 12 additions & 0 deletions
@@ -32,6 +32,14 @@ jobs:
           key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements*.txt', '**/pyproject.toml') }}
           restore-keys: |
             ${{ runner.os }}-pip-
+
+      - name: Cache embedding models
+        uses: actions/cache@v3
+        with:
+          path: ~/.cache/huggingface
+          key: ${{ runner.os }}-embeddings-bge-base-en-v1.5
+          restore-keys: |
+            ${{ runner.os }}-embeddings-
 
       - name: Install dependencies
         run: |
@@ -51,6 +59,10 @@ jobs:
           # Integration tests set their own unique collection names.
           # Unit tests mock Qdrant and don't need a real collection.
 
+      - name: Pre-download embedding model
+        run: |
+          python -c "from fastembed import TextEmbedding; m = TextEmbedding(model_name='BAAI/bge-base-en-v1.5'); list(m.embed(['test']))"
+
       - name: Run tests
         run: pytest -q
 

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -47,3 +47,7 @@ docs/FORMULAS.md
 # SvelteKit
 .svelte-kit/
 build/
+/ideas
+/events
+*.XxwubJkx
+.coverage

.skills/mcp-tool-selection/SKILL.md

Lines changed: 4 additions & 2 deletions
@@ -41,8 +41,10 @@ grep -rn "REDIS_HOST" . # Exact environment variable
 
 | Question Type | Tool |
 |--------------|------|
-| "Where is X implemented?" | MCP repo_search |
-| "How does authentication work?" | MCP context_answer |
+| "Where is X implemented?" | MCP `repo_search` |
+| "Who calls this and show code?" | MCP `symbol_graph` (hydrated w/ snippets) |
+| "How does authentication work?" | MCP `context_answer` |
+| "High-level module overview?" | MCP `info_request` (with explanations) |
 | "Does REDIS_HOST exist?" | Literal grep |
 | "Why did behavior change?" | `search_commits_for` + `change_history_for_path` |
 
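The table above maps question types to tools. As a rough illustration only, the same mapping can be written as an ordered list of keyword rules. The function name `pick_tool` and every regular expression below are hypothetical; the SKILL.md file does not define a router, only the table.

```python
import re

# (pattern, tool) pairs checked in order. The patterns are illustrative
# assumptions; only the tool names come from the table.
_ROUTES = [
    (re.compile(r"\bwho calls\b|\bcallers of\b", re.I), "symbol_graph"),
    (re.compile(r"\bwhy did\b.*\bchange", re.I), "search_commits_for + change_history_for_path"),
    (re.compile(r"\boverview\b", re.I), "info_request"),
    (re.compile(r"^how does\b", re.I), "context_answer"),
    # SCREAMING_SNAKE_CASE tokens look like env vars or constants: use literal grep.
    (re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b"), "grep"),
    (re.compile(r"^where is\b", re.I), "repo_search"),
]


def pick_tool(question: str, default: str = "repo_search") -> str:
    """Return the first tool whose pattern matches, else the default."""
    q = question.strip()
    for pattern, tool in _ROUTES:
        if pattern.search(q):
            return tool
    return default
```

Rule order matters here: the more specific graph and history patterns are tried before the generic "where is" rule, and anything unmatched falls through to `repo_search`.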

README.md

Lines changed: 10 additions & 0 deletions
@@ -226,6 +226,16 @@ Python, TypeScript/JavaScript, Go, Java, Rust, C#, PHP, Shell, Terraform, YAML,
 *Corpus: 20,604 code snippets | 500 queries | Pure dense retrieval, no reranking*
 *Jina-Code: jinaai/jina-embeddings-v2-base-code (code-specific, 8k context)*
 
+### CoIR Benchmark (Full Corpus, Dense Retrieval)
+
+| Benchmark | Corpus | Queries | NDCG@10 |
+|-----------|--------|---------|---------|
+| **CodeSearchNet-Python** | 280K | 14.9K | **74.37%** |
+| **CodeSearchNet-Go** | 280K | 14.9K | **74.51%** |
+| **CodeSearchNet-JavaScript** | 280K | 14.9K | **57.19%** |
+
+*Full CoIR corpus evaluation with dense retrieval (Jina-Code embeddings)*
+
 ---
 
 ## License

ctx-mcp-bridge/package-lock.json

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

ctx-mcp-bridge/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@context-engine-bridge/context-engine-mcp-bridge",
-  "version": "0.0.12",
+  "version": "0.0.13",
   "description": "Context Engine MCP bridge (http/stdio proxy combining indexer + memory servers)",
   "bin": {
     "ctxce": "bin/ctxce.js",

ctx-mcp-bridge/src/resultPathMapping.js

Lines changed: 14 additions & 7 deletions
@@ -258,10 +258,16 @@ function remapHitPaths(hit, workspaceRoot) {
   if (!containerPath && rawPath) {
     containerPath = rawPath;
   }
-  const relPath = computeWorkspaceRelativePath(containerPath, hostPath);
   const out = { ...hit };
-  if (relPath) {
-    out.rel_path = relPath;
+  // Respect server's rel_path if already provided and non-empty; only compute if missing
+  const serverRelPath = typeof hit.rel_path === "string" ? hit.rel_path.trim() : "";
+  if (serverRelPath) {
+    out.rel_path = serverRelPath;
+  } else {
+    const relPath = computeWorkspaceRelativePath(containerPath, hostPath);
+    if (relPath) {
+      out.rel_path = relPath;
+    }
   }
   // Remap related_paths nested under each hit (repo_search/hybrid_search emit this per result).
   try {
@@ -271,9 +277,10 @@ function remapHitPaths(hit, workspaceRoot) {
   } catch {
     // ignore
   }
-  if (workspaceRoot && relPath) {
+  const finalRelPath = out.rel_path || "";
+  if (workspaceRoot && finalRelPath) {
     try {
-      const relNative = _posixToNative(relPath);
+      const relNative = _posixToNative(finalRelPath);
       const candidate = path.join(workspaceRoot, relNative);
       const diagnostics = envTruthy(process.env.CTXCE_BRIDGE_PATH_DIAGNOSTICS, false);
       const strictClientPath = envTruthy(process.env.CTXCE_BRIDGE_CLIENT_PATH_STRICT, false);
@@ -315,8 +322,8 @@ function remapHitPaths(hit, workspaceRoot) {
   if (overridePath) {
     if (typeof out.client_path === "string" && out.client_path) {
       out.path = out.client_path;
-    } else if (relPath) {
-      out.path = relPath;
+    } else if (finalRelPath) {
+      out.path = finalRelPath;
     }
   }
   return out;
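
The change to `remapHitPaths` makes the bridge keep a non-empty `rel_path` sent by the server and compute one only when it is missing. The bridge itself is JavaScript; the following is a simplified Python rendering of that precedence rule for illustration. `choose_rel_path` and its `compute` callback are hypothetical names, with `compute` standing in for `computeWorkspaceRelativePath`.

```python
from typing import Any, Callable, Optional


def choose_rel_path(hit: dict[str, Any], compute: Callable[[], Optional[str]]) -> str:
    """Pick the rel_path for a hit: server value first, computed value second."""
    server_rel = hit.get("rel_path")
    # Only a non-empty string from the server counts; other types are ignored.
    if isinstance(server_rel, str) and server_rel.strip():
        return server_rel.strip()
    # Otherwise fall back to the locally computed path ("" means none found).
    return compute() or ""
```

Passing the computation as a callback keeps it lazy, matching the diff, where the path is computed only in the `else` branch.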

docker-compose.yml

Lines changed: 8 additions & 0 deletions
@@ -433,6 +433,9 @@ services:
       - LEX_SPARSE_NAME=${LEX_SPARSE_NAME:-}
       # Pattern vectors for structural code similarity
       - PATTERN_VECTORS=${PATTERN_VECTORS:-}
+      # Graph edges for symbol relationships
+      - INDEX_GRAPH_EDGES=${INDEX_GRAPH_EDGES:-1}
+      - INDEX_GRAPH_EDGES_MODE=${INDEX_GRAPH_EDGES_MODE:-symbol}
     volumes:
       - workspace_pvc:/work:rw
       - codebase_pvc:/work/.codebase:rw
@@ -469,6 +472,7 @@ services:
       - QWEN3_QUERY_INSTRUCTION=${QWEN3_QUERY_INSTRUCTION:-1}
       - QWEN3_INSTRUCTION_TEXT=${QWEN3_INSTRUCTION_TEXT}
       - WATCH_ROOT=${WATCH_ROOT:-/work}
+      # - WATCH_USE_POLLING=${WATCH_USE_POLLING:-1} SET on MAC OSx
       - HOST_INDEX_PATH=/work
       - QDRANT_TIMEOUT=${QDRANT_TIMEOUT:-60}
       # Chunking config - use ${VAR:-} to properly inherit from .env (not host shell)
@@ -490,6 +494,10 @@ services:
       - LEX_SPARSE_NAME=${LEX_SPARSE_NAME:-}
       # Pattern vectors for structural code similarity
       - PATTERN_VECTORS=${PATTERN_VECTORS:-}
+      # Graph edges for symbol relationships
+      - INDEX_GRAPH_EDGES=${INDEX_GRAPH_EDGES:-1}
+      - INDEX_GRAPH_EDGES_MODE=${INDEX_GRAPH_EDGES_MODE:-symbol}
+      - GRAPH_BACKFILL_ENABLED=${GRAPH_BACKFILL_ENABLED:-1}
     volumes:
       - workspace_pvc:/work:rw
       - codebase_pvc:/work/.codebase:rw

docs/ARCHITECTURE.md

Lines changed: 74 additions & 0 deletions
@@ -129,6 +129,80 @@ Production-ready MCP (Model Context Protocol) retrieval stack unifying code inde
 - **Auto-Detection**: Identifies retry patterns, resource cleanup, filter loops
 - **Requires**: `PATTERN_VECTORS=1` to enable
 
+#### Symbol Graph & Code Relationships
+
+**Graph Edge Storage** (`scripts/ingest/graph_edges.py`)
+
+Context Engine maintains pre-computed graph edges in dedicated Qdrant collections for fast symbol navigation. During indexing, call and import relationships are extracted and stored separately from code chunks.
+
+- **Separate Collections**: Each base collection `<name>` has a companion `<name>_graph` collection
+- **Payload-Only Storage**: Graph collections store edges as indexed payloads (no vectors)
+- **Edge Types**:
+  - `calls`: Function/method call relationships
+  - `imports`: Module/symbol import relationships
+
+**Edge Schema:**
+```json
+{
+  "caller_symbol": "process_data",
+  "callee_symbol": "validate_input",
+  "caller_path": "src/handlers/processor.py",
+  "edge_type": "calls",
+  "repo": "my-project",
+  "start_line": 45,
+  "language": "python"
+}
+```
+
+The schema provides both granularity levels for agentic workflows:
+- `caller_path`: File path for immediate agent action (view, edit)
+- `caller_symbol`: Function/method name for understanding which function makes the call
+
+**Fast Indexed Queries:**
+- `get_callers(symbol)`: Find all files/functions that call a symbol
+- `get_callees(symbol)`: Find all functions a symbol calls
+- `get_importers(module)`: Find all files importing a module
+
+**AST Analyzer** (`scripts/ast_analyzer.py`)
+
+Tree-sitter-based multi-language AST analysis for semantic code understanding:
+
+- **Symbol Extraction**: Functions, classes, methods with signatures, docstrings, decorators
+- **Call Graph Construction**: Maps caller → callee relationships with enclosing function context
+- **Dependency Tracking**: Extracts imports and module dependencies
+- **Semantic Chunking**: Splits code at function/class boundaries (not arbitrary line counts)
+
+**Supported Languages:**
+| Language | Package |
+|----------|---------|
+| Python | `tree-sitter-python` |
+| JavaScript | `tree-sitter-javascript` |
+| TypeScript | `tree-sitter-typescript` |
+| Go | `tree-sitter-go` |
+| Rust | `tree-sitter-rust` |
+| Java | `tree-sitter-java` |
+| C/C++ | `tree-sitter-c`, `tree-sitter-cpp` |
+| C# | `tree-sitter-c-sharp` |
+| Ruby | `tree-sitter-ruby` |
+| Bash | `tree-sitter-bash` |
+
+**Symbol Graph MCP Tool** (`scripts/mcp_impl/symbol_graph.py`)
+
+Provides the `symbol_graph()` MCP tool for navigating code relationships:
+
+- **Query Types**: `callers`, `definition`, `importers`
+- **Hydration**: Results include actual code snippets fetched from the main collection
+- **Fallback**: When graph queries return empty, falls back to semantic search
+- **Multi-Strategy Matching**: Exact match → variant match → substring match
+
+**Intent Classification** (`scripts/intent_classifier.py`)
+
+Semantic query routing using embedding similarity to exemplars:
+
+- **Intent Categories**: `GRAPH`, `SEMANTIC`, `IDENTIFIER`, `HYBRID`
+- **Confidence Scoring**: Routes to appropriate search strategy based on query type
+- **Keyword Fallback**: Pattern-based classification when embeddings unavailable
+
 ### 5. Learning Reranker System (Optional)
 
 The Learning Reranker is an **optional** self-improving ranking system that learns from search patterns to provide increasingly relevant results over time. It is enabled by default but can be disabled via `RERANK_LEARNING=0` and `RERANK_EVENTS_ENABLED=0` environment variables. See [Configuration](CONFIGURATION.md#learning-reranker) for all options.
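
The ARCHITECTURE.md change describes edges stored as indexed payloads and three lookups: `get_callers`, `get_callees`, and `get_importers`. The sketch below imitates those lookups with plain dictionaries so the access pattern is visible without a Qdrant instance. It is not the code in `scripts/ingest/graph_edges.py`: the `Edge` and `EdgeIndex` classes are hypothetical, and storing the imported module in `callee_symbol` for `imports` edges is an assumption the diff does not confirm.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One graph edge, using the field names from the Edge Schema."""
    caller_symbol: str
    callee_symbol: str
    caller_path: str
    edge_type: str  # "calls" or "imports"
    repo: str
    start_line: int
    language: str


class EdgeIndex:
    """Hypothetical in-memory stand-in for a payload-indexed `<name>_graph` collection."""

    def __init__(self) -> None:
        # Two lookup tables play the role of payload indexes on callee and caller.
        self._by_callee = defaultdict(list)
        self._by_caller = defaultdict(list)

    def add(self, edge: Edge) -> None:
        self._by_callee[(edge.edge_type, edge.callee_symbol)].append(edge)
        self._by_caller[(edge.edge_type, edge.caller_symbol)].append(edge)

    def get_callers(self, symbol: str) -> list:
        """Edges whose callee is `symbol` (each carries caller_path and caller_symbol)."""
        return list(self._by_callee.get(("calls", symbol), []))

    def get_callees(self, symbol: str) -> list:
        """Names of the functions that `symbol` calls."""
        return sorted({e.callee_symbol for e in self._by_caller.get(("calls", symbol), [])})

    def get_importers(self, module: str) -> list:
        """Paths of files importing `module`.

        Assumption: import edges store the imported module in callee_symbol.
        """
        return sorted({e.caller_path for e in self._by_callee.get(("imports", module), [])})


index = EdgeIndex()
index.add(Edge("process_data", "validate_input", "src/handlers/processor.py",
               "calls", "my-project", 45, "python"))
index.add(Edge("<module>", "json", "src/handlers/processor.py",
               "imports", "my-project", 1, "python"))
```

Keying both tables by `(edge_type, symbol)` keeps call and import edges from colliding when a module and a function share a name.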

docs/CLAUDE.example.md

Lines changed: 9 additions & 1 deletion
@@ -96,6 +96,14 @@ These rules are NOT optional - favor qdrant-indexer tooling at all costs over ex
   - Good for: "find retry loops with exponential backoff", "try: ... except: logger.error()", "error handling patterns".
   - Cross-language: Python pattern can match Go/Rust/Java with similar control flow.
   - Note: Returns error if pattern detection module is not available.
+- symbol_graph:
+  - Use for: structural navigation (callers, definitions, importers).
+  - Think: "who calls this function?", "where is this class defined?".
+  - **Note**: Results are "hydrated" with ~500-char source snippets for immediate context.
+- info_request:
+  - Use for: rapid broad discovery and architectural overviews.
+  - Good for: "how does the reranker work?", "overview of database modules".
+  - Tip: Set `include_explanation=true` for NL summaries and `include_relationships=true` for dependencies.
 
 Advanced lineage workflow (code + history):
 
@@ -148,4 +156,4 @@ These rules are NOT optional - favor qdrant-indexer tooling at all costs over ex
   blended code + memory results instead of calling repo_search and memory.memory_find
   separately.
 - Treat expand_query and the expand flag on context_answer as expensive options:
-  only use them after a normal search/answer attempt failed to find good context.
+  only use them after a normal search/answer attempt failed to find good context.