Merged
12 changes: 10 additions & 2 deletions README.md
@@ -75,8 +75,8 @@ codememory serve
codememory search "where is the auth logic?"

# Git graph (rollout build)
-codememory git-init --repo /absolute/path/to/repo --mode local --full-history
-codememory git-sync --repo /absolute/path/to/repo --incremental
+codememory git-init --repo /absolute/path/to/repo
+codememory git-sync --repo /absolute/path/to/repo --full
codememory git-status --repo /absolute/path/to/repo --json
```

@@ -136,6 +136,14 @@ Full workflow and options: [docs/TOOL_USE_ANNOTATION.md](docs/TOOL_USE_ANNOTATIO
| `get_file_dependencies(file_path, domain="code")` | Returns imports and dependents for a file |
| `identify_impact(file_path, max_depth=3, domain="code")` | Blast radius analysis for changes |
| `get_file_info(file_path, domain="code")` | File structure overview (classes, functions) |
| `create_memory_entities(entities)` | Create or update agent-authored memory nodes in Neo4j |
| `create_memory_relations(relations)` | Create typed relationships between memory nodes |
| `add_memory_observations(observations)` | Append observation strings to existing memory nodes |
| `delete_memory_entities(entity_names)` | Delete memory nodes by name |
| `delete_memory_relations(relations)` | Delete typed relationships between memory nodes |
| `delete_memory_observations(observations)` | Remove observation strings from memory nodes |
| `search_memory_nodes(query, limit=5)` | Search memory nodes by name, type, and observations |
| `read_memory_graph()` | Read a summary of the current memory graph |
| `get_git_file_history(file_path, limit=20, domain="git")` | File-level commit history and ownership signals (git rollout) |
| `get_commit_context(sha, include_diff_stats=true)` | Commit metadata and change statistics (git rollout) |
| `find_recent_risky_changes(path_or_symbol, window_days, domain="hybrid")` | Recent high-risk changes using hybrid signals (git rollout) |
126 changes: 117 additions & 9 deletions docs/API.md
@@ -253,7 +253,7 @@ $ codememory serve

**Server behavior:**
- Runs until interrupted (Ctrl+C)
-- Exposes 4 MCP tools (see [MCP Tools](#mcp-tools))
+- Exposes MCP tools for code graph queries, git graph queries, and agent-authored memory writes (see [MCP Tools](#mcp-tools))
- Uses local config or environment variables
- Graceful shutdown on SIGTERM/SIGINT

@@ -846,30 +846,137 @@ print(f"Cost: ${metrics['cost_usd']:.4f}")

##### `semantic_search()`

-Perform vector similarity search.
+Perform vector similarity search with optional multi-repo filtering.

```python
-def semantic_search(self, query: str, limit: int = 5) -> List[Dict]
+def semantic_search(
+    self,
+    query: str,
+    limit: int = 5,
+    repo_id: Optional[str] = None
+) -> List[Dict]
```

**Parameters:**
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `query` | str | Yes | - | Natural language search query |
| `limit` | int | No | 5 | Maximum results to return |
| `repo_id` | Optional[str] | No | None | Restrict results to a specific repo. Falls back to `self.repo_id` if set. |

**Behavior when `repo_id` is active:**
- Over-fetches `limit × 3` candidates from the vector index
- Adds a `WHERE entity.repo_id = $repo_id` filter after the DESCRIBE hop
- Calls `_rerank_results()` to score and trim to `limit`

**Returns:**
```python
[
{
"name": "authenticate",
"sig": "src/auth.py:authenticate",
-"score": 0.92,
+"score": 0.92, # raw vector similarity (0–1)
+"final_score": 0.94, # 0.9×vector_score + structural_bonus
"text": "def authenticate(username, password):..."
},
...
]
```

- `final_score` is always present when `repo_id` filtering is active (via `_rerank_results()`).

Comment on lines +878 to +887

Copilot AI Apr 8, 2026

The documentation claims `final_score` is only present when `repo_id` filtering is active, and that `score` is always raw vector similarity (0–1). In the implementation (`KnowledgeGraphBuilder._rerank_results`), `final_score` is added for every result (including when `repo_id` is not set), and when the vector path falls back to fulltext search the returned `score` comes from Neo4j fulltext scoring (not guaranteed to be 0–1). Please update the contract here to match actual behavior, or adjust the code to only add `final_score` / only rerank when repo filtering is active.

**Example:**
```python
-results = builder.semantic_search("JWT validation", limit=3)
+results = builder.semantic_search("JWT validation", limit=3, repo_id="my-service")
 for r in results:
-    print(f"{r['name']} - Score: {r['score']:.2f}")
+    print(f"{r['name']} - Score: {r['score']:.2f} Final: {r['final_score']:.2f}")
```

---

##### `_rerank_results()`

Private method. Re-scores a candidate list by combining vector similarity with graph connectivity bonuses, then trims to `limit`.

```python
def _rerank_results(self, results: List[Dict], limit: int) -> List[Dict]
```

**Parameters:**
| Parameter | Type | Description |
|-----------|------|-------------|
| `results` | List[Dict] | Candidate results (over-fetched, each with a `score` field) |
| `limit` | int | Final number of results to return |

**Scoring formula:**
```
final_score = 0.9 × vector_score + structural_bonus
```

**Connectivity bonuses (structural_bonus):**
| Relation | Bonus |
|----------|-------|
| `calls_out` | +0.05 |
| `called_by` | +0.05 |
| `methods` | +0.03 |

**Behavior:**
- Sorts descending by `final_score`
- Trims list to `limit`
- Adds `final_score` key to each result dict

**GDS upgrade path:** Replace heuristic bonuses with `entity.pagerank` from `gds.pageRank.write()` once GDS is available.

**Note:** This is a private method — call `semantic_search()` directly; it invokes `_rerank_results()` internally.
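
The scoring rule described above can be sketched in a few lines. This is only an illustration, not the code in `graph.py`: the function name `rerank_results` is hypothetical, and it assumes each candidate dict carries `calls_out`, `called_by`, and `methods` collections, with each bonus granted once when the corresponding collection is non-empty.

```python
from typing import Dict, List

# Bonus weights copied from the connectivity table above.
STRUCTURAL_BONUSES = {"calls_out": 0.05, "called_by": 0.05, "methods": 0.03}


def rerank_results(results: List[Dict], limit: int) -> List[Dict]:
    """Hypothetical stand-in for _rerank_results()."""
    reranked = []
    for result in results:
        # Assumption: one bonus per relation kind that has at least one entry.
        bonus = sum(
            weight for key, weight in STRUCTURAL_BONUSES.items() if result.get(key)
        )
        reranked.append({**result, "final_score": 0.9 * result["score"] + bonus})
    # Highest final_score first, then trim to the requested size.
    reranked.sort(key=lambda r: r["final_score"], reverse=True)
    return reranked[:limit]


candidates = [
    {
        "name": "authenticate",
        "score": 0.92,
        "calls_out": ["hash_password"],
        "called_by": ["login"],
    },
    {"name": "auth_helper", "score": 0.95},
]
top = rerank_results(candidates, limit=1)
```

Under these assumptions a well-connected function can outrank a candidate with a slightly higher raw vector score, which appears to be the intent of the structural bonus.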

---

##### `search_memory_nodes()`

Search the graph for memory nodes (agent-authored notes and observations) with optional repo filtering. Returns both outgoing and incoming relations for each result.

```python
def search_memory_nodes(
    self,
    query: str,
    limit: int = 5,
    repo_id: Optional[str] = None
) -> List[Dict]
```

**Parameters:**
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `query` | str | Yes | - | Natural language search query |
| `limit` | int | No | 5 | Maximum results to return |
| `repo_id` | Optional[str] | No | None | Restrict results to a specific repo. Falls back to `self.repo_id` if set. |

**Returns:**
```python
[
{
"name": "note_about_auth",
"sig": "memory:note_about_auth",
"score": 0.88,
"text": "Authentication flow requires...",
"outgoing_relations": [
{"target": "src/auth.py:authenticate", "relation_type": "REFERENCES"}
],
"incoming_relations": [
{"source": "src/api/routes/auth.py", "relation_type": "DOCUMENTED_BY"}
]
},
...
]
```

**`incoming_relations` format:** `[{"source": str, "relation_type": str}, ...]`
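
For callers who want static checking, the return shape above could be described with `TypedDict`. The class names and the `incoming_sources` helper below are hypothetical and not part of the documented API; the field names come from the example payload.

```python
from typing import List, TypedDict


class OutgoingRelation(TypedDict):
    target: str
    relation_type: str


class IncomingRelation(TypedDict):
    source: str
    relation_type: str


class MemoryNodeResult(TypedDict):
    name: str
    sig: str
    score: float
    text: str
    outgoing_relations: List[OutgoingRelation]
    incoming_relations: List[IncomingRelation]


def incoming_sources(node: MemoryNodeResult) -> List[str]:
    """Collect the `source` of every incoming relation on one result."""
    return [rel["source"] for rel in node["incoming_relations"]]


example: MemoryNodeResult = {
    "name": "note_about_auth",
    "sig": "memory:note_about_auth",
    "score": 0.88,
    "text": "Authentication flow requires...",
    "outgoing_relations": [
        {"target": "src/auth.py:authenticate", "relation_type": "REFERENCES"}
    ],
    "incoming_relations": [
        {"source": "src/api/routes/auth.py", "relation_type": "DOCUMENTED_BY"}
    ],
}
sources = incoming_sources(example)
```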

**Example:**
```python
nodes = builder.search_memory_nodes("auth flow notes", limit=5, repo_id="my-service")
for n in nodes:
    print(f"{n['name']} ({len(n['incoming_relations'])} incoming)")
```

---
@@ -1129,7 +1236,8 @@ def get_indexing_config(self) -> Dict[str, Any]
{
"name": str, # Entity name
"sig": str, # Entity signature
-"score": float, # Similarity (0-1)
+"score": float, # Raw vector similarity (0–1)
+"final_score": float, # Reranked score: 0.9×score + structural_bonus (present when repo_id filtering is active)
"text": str # Code snippet
}
```
@@ -1146,5 +1254,5 @@ def get_indexing_config(self) -> Dict[str, Any]

---

-**API Version:** 1.0.0
-**Last Updated:** 2025-02-09
+**API Version:** 1.1.0
+**Last Updated:** 2026-04-05
29 changes: 29 additions & 0 deletions docs/ARCHITECTURE.md
@@ -235,6 +235,35 @@ FOR (n:Function|Class|File) ON EACH [n.name, n.docstring, n.path]

---

## Multi-Repo Partitioning (repo_id)

CodeMemory supports multiple repositories in a single Neo4j database using `repo_id` partitioning.

### Identity Model

| Node | Old identity | New identity |
|------|-------------|--------------|
| File | `path` (global) | `(repo_id, path)` (composite) |
| Function | `signature` (global) | `(repo_id, signature)` (composite) |
| Class | `qualified_name` (global) | `(repo_id, qualified_name)` (composite) |
| Memory | `name` (global) | `(repo_id, name)` (composite) |

A `Repository` anchor node (`{repo_id, root_path}`) is also created per repo.
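
The effect of the composite identity can be pictured with an in-memory analogy. This is not the Neo4j schema or any CodeMemory code: the `files` dict and `upsert_file` helper are hypothetical and only show why `(repo_id, path)` keeps same-path files from different repositories apart.

```python
from typing import Dict, Optional, Tuple

# Composite identity for a File node: (repo_id, path).
FileKey = Tuple[Optional[str], str]

files: Dict[FileKey, Dict] = {}


def upsert_file(repo_id: Optional[str], path: str, props: Dict) -> FileKey:
    """Create the entry if the composite key is new, otherwise update it."""
    key = (repo_id, path)
    files.setdefault(key, {}).update(props)
    return key


# Two repositories that both contain src/auth.py no longer collide.
upsert_file("service-a", "src/auth.py", {"lines": 120})
upsert_file("service-b", "src/auth.py", {"lines": 45})
# Re-indexing service-a updates its own node only.
upsert_file("service-a", "src/auth.py", {"lines": 130})
```

With the old global `path` identity, the second call would have overwritten the first repository's node.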

### Backward Compatibility

When `CODEMEMORY_REPO` is not set, `repo_id` is `None` and all queries omit the repo filter — identical to the pre-partitioning behavior.
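
One way to picture this fallback is a helper that emits the repo filter only when a `repo_id` is present. Both functions below are hypothetical sketches, not the project's code; treating an empty `CODEMEMORY_REPO` value as unset is also an assumption.

```python
from typing import Any, Dict, Mapping, Optional, Tuple


def resolve_repo_id(env: Mapping[str, str]) -> Optional[str]:
    """Read CODEMEMORY_REPO; assumption: an empty value counts as unset."""
    return env.get("CODEMEMORY_REPO") or None


def build_entity_filter(repo_id: Optional[str]) -> Tuple[str, Dict[str, Any]]:
    """Return (WHERE clause, query parameters); both are empty without a repo_id."""
    if repo_id is None:
        return "", {}
    return "WHERE entity.repo_id = $repo_id", {"repo_id": repo_id}


legacy_clause, legacy_params = build_entity_filter(resolve_repo_id({}))
scoped_clause, scoped_params = build_entity_filter(
    resolve_repo_id({"CODEMEMORY_REPO": "my-service"})
)
```

Passing the value as a query parameter rather than interpolating it into the Cypher string keeps the filter safe for arbitrary repo names.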

### Retrieval Model

When `repo_id` is active, `semantic_search()` over-fetches by 3x, filters by `entity.repo_id`, then applies structural reranking (`_rerank_results()`) before returning the final result set. This prevents worktree pollution (multiple indexed copies of the same function appearing in results).
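
The over-fetch, filter, and trim flow can be sketched with an in-memory list standing in for the vector index. Everything named here (`search_with_repo_filter`, `fake_fetch`, the sample index) is hypothetical, and ordering by raw `score` is only a placeholder for the structural reranking step.

```python
from typing import Callable, Dict, List, Optional

OVERFETCH_FACTOR = 3  # "over-fetches by 3x", per the paragraph above


def search_with_repo_filter(
    fetch: Callable[[int], List[Dict]],
    limit: int,
    repo_id: Optional[str] = None,
) -> List[Dict]:
    """Hypothetical illustration of the over-fetch -> filter -> trim flow."""
    if repo_id is None:
        # Assumption: without partitioning there is no filter and no over-fetch.
        return fetch(limit)[:limit]
    candidates = fetch(limit * OVERFETCH_FACTOR)
    same_repo = [c for c in candidates if c.get("repo_id") == repo_id]
    # Placeholder for _rerank_results(): order by raw score only.
    same_repo.sort(key=lambda c: c["score"], reverse=True)
    return same_repo[:limit]


# Fake index: the same function indexed from two worktrees.
INDEX = [
    {"sig": "src/auth.py:authenticate", "repo_id": "my-service", "score": 0.92},
    {"sig": "src/auth.py:authenticate", "repo_id": "my-service-worktree", "score": 0.92},
    {"sig": "src/auth.py:verify_token", "repo_id": "my-service", "score": 0.81},
]


def fake_fetch(k: int) -> List[Dict]:
    """Return the first k entries of the fake index."""
    return INDEX[:k]


hits = search_with_repo_filter(fake_fetch, limit=2, repo_id="my-service")
```

Fetching only `limit` candidates before filtering would have returned a single `my-service` hit here, since the worktree copy occupies one of the two slots; over-fetching leaves room for the filter to discard duplicates.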

### GDS Upgrade Path

When Aura API credentials are available (`gds.aura.api.credentials(clientId, clientSecret)`), replace the heuristic structural bonus in `_rerank_results()` with GDS-computed `entity.pagerank`. See comments in `graph.py` near `_rerank_results()`.

---

## 4-Pass Ingestion Pipeline

The ingestion pipeline processes code in 4 sequential passes to build the complete graph.
8 changes: 4 additions & 4 deletions docs/FIELD_TEST_TEMPLATE.md
@@ -28,8 +28,8 @@ codememory index
codememory status --json

# 2) Git graph setup + sync
-codememory git-init --repo /absolute/path/to/repo --mode local --full-history
-codememory git-sync --repo /absolute/path/to/repo --incremental
+codememory git-init --repo /absolute/path/to/repo
+codememory git-sync --repo /absolute/path/to/repo --full
codememory git-status --repo /absolute/path/to/repo --json

# 3) Optional MCP checks (domain routing)
@@ -62,7 +62,7 @@ Record exact values from command output.
### Performance

- `codememory index` elapsed time:
-- `codememory git-sync --incremental` elapsed time:
+- `codememory git-sync` elapsed time:
- Embedding calls:
- Token usage:
- Estimated cost:
@@ -71,7 +71,7 @@ Record exact values from command output.

- [ ] PASS / FAIL: `git-init` succeeds with expected repo metadata.
- [ ] PASS / FAIL: first `git-sync` ingests history and sets checkpoint.
-- [ ] PASS / FAIL: second `git-sync --incremental` with no new commits reports zero new commits.
+- [ ] PASS / FAIL: second `git-sync` with no new commits reports zero new commits.
- [ ] PASS / FAIL: `git-status --json` returns stable envelope (`ok`, `error`, `data`, `metrics`).
- [ ] PASS / FAIL: code graph queries still work with git graph enabled.
- [ ] PASS / FAIL: `domain="code"` queries return expected code entities.
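
The envelope check in the list above could be automated with a small script. This is a sketch only: the helper name and the sample payload are made up for illustration, and just the four top-level keys (`ok`, `error`, `data`, `metrics`) come from the checklist.

```python
import json
from typing import List

REQUIRED_KEYS = ("ok", "error", "data", "metrics")


def missing_envelope_keys(raw: str) -> List[str]:
    """Return the required top-level keys absent from a JSON envelope."""
    payload = json.loads(raw)
    return [key for key in REQUIRED_KEYS if key not in payload]


# Made-up payload; real `git-status --json` output will carry actual values.
sample = '{"ok": true, "error": null, "data": {}, "metrics": {}}'
complete = missing_envelope_keys(sample)
partial = missing_envelope_keys('{"ok": true}')
```

An empty list means the envelope is structurally stable; any names returned can be recorded as the reason for a FAIL.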
22 changes: 10 additions & 12 deletions docs/GIT_GRAPH.md
@@ -52,17 +52,12 @@ Use explicit domain routing in MCP tool calls:
Initialize git graph metadata and checkpoint state for a repository.

```bash
-codememory git-init \
-  --repo /absolute/path/to/repo \
-  --mode local \
-  --full-history
+codememory git-init --repo /absolute/path/to/repo
```

Common options:
 - `--repo PATH`
-- `--mode local|local+github`
-- `--full-history`
-- `--since <rev>`
+- `--json`

Expected output (human-readable):

@@ -78,14 +73,17 @@ Checkpoint: <HEAD_SHA>
Sync commits from git history into the git graph.

```bash
-codememory git-sync --repo /absolute/path/to/repo --incremental
+# Initial full backfill
+codememory git-sync --repo /absolute/path/to/repo --full
+
+# Later incremental updates
+codememory git-sync --repo /absolute/path/to/repo
```

Common options:
- `--repo PATH`
-- `--incremental`
+- `--full`
 - `--from-ref <ref>`
-- `--json`

Expected output (human-readable):

@@ -153,8 +151,8 @@ Expected JSON envelope:
Quick validation sequence:

```bash
-codememory git-init --repo /absolute/path/to/repo --mode local --full-history
-codememory git-sync --repo /absolute/path/to/repo --incremental
+codememory git-init --repo /absolute/path/to/repo
+codememory git-sync --repo /absolute/path/to/repo --full
codememory git-status --repo /absolute/path/to/repo --json
```

2 changes: 1 addition & 1 deletion docs/MCP_INTEGRATION.md
@@ -776,7 +776,7 @@ Before refactoring:
codememory search "function_name"
codememory impact path/to/file.py
# Optional git graph sync (git-enabled builds)
-codememory git-sync --repo /absolute/path/to/repo --incremental
+codememory git-sync --repo /absolute/path/to/repo
```

### 5. Keep Index Updated