29 Mar 05:58

35ce275

v0.8.0 Latest

Ranking v2 — From PageRank to Multi-Signal Composite Search

The Problem

codebase-memory-mcp v1 used a single signal for ranking: PageRank. You'd search for "payment processing" and get back whatever had the highest PageRank score among text matches. This worked for well-connected hub nodes but failed badly for:

Concept queries ("authentication and session management") — PageRank doesn't know that "authentication" means OauthMiddleware
Cross-file exploration ("complete order creation flow") — PageRank ranks individual nodes, not flows
Vocabulary gaps — code is named postOrd, users search for "create order"

The result: on a 15-case benchmark, the old system scored 30 out of a possible ~200. Most concept and cross-file queries returned irrelevant results.

The New Architecture

Multi-Signal Composite Ranking

Instead of PageRank alone, ranking is now a weighted combination of 5 independent signals:

score = W_PPR(0.35)         × Personalized PageRank
      + W_BM25(0.30)        × FTS5 BM25 text relevance
      + W_COCHANGE(0.20)    × Co-change frequency
      + W_BETWEENNESS(0.15) × Betweenness centrality
      + W_AUTHORITY(0.10)   × In-degree authority (HITS)

Each signal captures a different aspect of relevance:

Signal	What it measures	Helps with
Personalized PageRank	Graph proximity to query-relevant seed nodes	Finding related code through call/import edges
BM25	Text match quality (name, qualified_name, file_path, search_terms)	Direct name matches, prefix matching
Co-change	Files that change together in git history	Finding coupled code across modules
Betweenness centrality	Nodes that sit on many shortest paths in the graph	Identifying integration points, middleware, shared utilities
In-degree authority	Number of incoming edges (callers/importers)	Ranking genuinely important code over stubs and auto-generated files
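
As a rough illustration of the weighted sum in this section, the sketch below combines five per-node signals in C. The signals_t struct and composite_score function are hypothetical names, not taken from store.c. The sketch assumes every signal has already been normalized to [0, 1]; the release notes do not say how that is done. One thing a real implementation would have to handle is that SQLite's bm25() returns numerically smaller values for better matches, so the raw value cannot be used directly.

```c
#include <stddef.h>

/* Weights copied from the release notes. They sum to 1.10, not 1.0,
 * so the composite is a relative ranking score, not a probability. */
#define W_PPR         0.35
#define W_BM25        0.30
#define W_COCHANGE    0.20
#define W_BETWEENNESS 0.15
#define W_AUTHORITY   0.10

/* Hypothetical container: one node's signals, each assumed to be
 * pre-normalized to [0, 1] with "higher is better". */
typedef struct {
    double ppr;
    double bm25;
    double cochange;
    double betweenness;
    double authority;
} signals_t;

/* Weighted linear combination of the five signals. */
double composite_score(const signals_t *s) {
    return W_PPR         * s->ppr
         + W_BM25        * s->bm25
         + W_COCHANGE    * s->cochange
         + W_BETWEENNESS * s->betweenness
         + W_AUTHORITY   * s->authority;
}
```

Because the combination is linear, a node with a perfect text match but no graph support can reach at most 0.30, which is why candidates with several moderate signals can outrank a pure name match.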

FTS5 Search Pipeline

The text search layer was completely rebuilt:

Prefix matching — Query "payment" becomes payment* in FTS5, matching CamelCase-concatenated tokens like paymentmappingservice (from PaymentMappingService). This was the single biggest improvement.
CamelCase splitting — New search_terms column stores split forms: OauthMiddleware is indexed as "OauthMiddleware Oauth Middleware". Now middleware* finds it. BM25 weight 0.25 (low enough to not dilute primary name matches).
Stop word filtering — English stop words plus common code verbs (checks, creates, handles, gets, finds) are stripped before FTS5 query building. Without this, "checks*" matched hundreds of checkXxx functions.
Per-file result cap — FNV-1a hash tracks file paths; any single file is limited to 3 FTS results. Prevents large files (like _ide_helper.php with 5000+ stubs) from flooding the candidate set. General algorithm, no hardcoded exclusions.
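
Two of the pipeline steps above can be sketched in a few lines of C: stop-word filtering with prefix expansion, and the per-file cap. This is a minimal sketch under assumptions, not the code in store.c. The function names, the sample stop-word list (only the five code verbs come from the notes), joining terms with a space, and comparing files by hash alone are all assumptions made for illustration.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PER_FILE_CAP   3     /* from the notes */
#define FILE_TRACK_CAP 128   /* from the notes */

static const char *STOP_WORDS[] = {
    "a", "an", "and", "the", "of", "for", "to", "in",   /* sample English stop words */
    "checks", "creates", "handles", "gets", "finds"     /* code verbs named in the notes */
};

static int is_stop_word(const char *tok) {
    size_t n = sizeof STOP_WORDS / sizeof STOP_WORDS[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(tok, STOP_WORDS[i]) == 0) return 1;
    return 0;
}

/* Lowercases each alphanumeric token, drops stop words, appends '*'.
 * Writes e.g. "payment* processing*" to out; returns the terms kept. */
int build_prefix_query(const char *query, char *out, size_t out_size) {
    char tok[64];
    size_t len = 0, used = 0;
    int terms = 0;
    if (out_size == 0) return 0;
    out[0] = '\0';
    for (const char *p = query; ; p++) {
        if (*p != '\0' && isalnum((unsigned char)*p)) {
            if (len < sizeof tok - 1) tok[len++] = (char)tolower((unsigned char)*p);
            continue;
        }
        if (len > 0) {                      /* a token just ended */
            tok[len] = '\0';
            if (!is_stop_word(tok)) {
                int w = snprintf(out + used, out_size - used, "%s%s*",
                                 terms ? " " : "", tok);
                if (w < 0 || (size_t)w >= out_size - used) {
                    out[used] = '\0';       /* out of room: drop the partial term */
                    break;
                }
                used += (size_t)w;
                terms++;
            }
            len = 0;
        }
        if (*p == '\0') break;
    }
    return terms;
}

typedef struct { uint32_t hash; int count; int used; } file_slot_t;

/* 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 16777619u; }
    return h;
}

/* Returns 1 if a result from this file may be kept, 0 once the file has
 * reached PER_FILE_CAP. Linear probing; paths are compared by hash only,
 * so two colliding paths would share one counter (a simplification). */
int file_cap_admit(file_slot_t table[FILE_TRACK_CAP], const char *path) {
    uint32_t h = fnv1a(path);
    for (uint32_t i = 0; i < FILE_TRACK_CAP; i++) {
        file_slot_t *slot = &table[(h + i) % FILE_TRACK_CAP];
        if (!slot->used) {
            slot->used = 1; slot->hash = h; slot->count = 1;
            return 1;
        }
        if (slot->hash == h) {
            if (slot->count >= PER_FILE_CAP) return 0;
            slot->count++;
            return 1;
        }
    }
    return 1;  /* table full: admit uncapped instead of dropping results */
}
```

With a zero-initialized table, the first three results from _ide_helper.php are admitted and the fourth is rejected, without the file name appearing anywhere in the code.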

Personalized PageRank (PPR)

PPR replaced global PageRank. Instead of a static, query-independent rank, PPR is seeded from the top 10 FTS hits and propagates through call/import/inheritance edges with per-type weights:

CALLS=1.0, INHERITS=0.9, HTTP_CALLS=0.8, IMPORTS=0.7, ...

PPR runs for 15 iterations with a damping factor of 0.85. Because the seeds come from the query's own FTS hits, the graph signal is query-dependent: searching for "payment" propagates rank from payment-related nodes, not from globally popular nodes.
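
The propagation described here can be sketched as a plain power iteration. This is an illustrative version under stated assumptions, not the project's implementation: edge_t, personalized_pagerank, and MAX_NODES are hypothetical names, rank is assumed to flow in the direction of each edge, the restart distribution is assumed uniform over the seeds, and rank stranded on nodes without out-edges is assumed to return to the seeds. Only the iteration count, damping factor, and edge weights come from the notes.

```c
#include <stddef.h>

#define PPR_ITERATIONS 15     /* from the notes */
#define PPR_DAMPING    0.85   /* from the notes */
#define MAX_NODES      64     /* hypothetical bound for this sketch */

/* weight is the per-type edge weight, e.g. CALLS=1.0, IMPORTS=0.7.
 * Node indices are assumed to be valid (0 <= from, to < n). */
typedef struct { int from; int to; double weight; } edge_t;

/* is_seed[i] != 0 marks seed nodes (the top FTS hits).
 * On success rank[] sums to about 1. Returns 0, or -1 on bad input. */
int personalized_pagerank(int n, const edge_t *edges, int m,
                          const int *is_seed, double *rank) {
    double out_weight[MAX_NODES] = {0.0};
    double teleport[MAX_NODES];
    double next[MAX_NODES];
    int seed_count = 0;

    if (n <= 0 || n > MAX_NODES) return -1;
    for (int i = 0; i < n; i++) if (is_seed[i]) seed_count++;
    if (seed_count == 0) return -1;

    /* Restart distribution: uniform over the seeds, zero elsewhere. */
    for (int i = 0; i < n; i++) {
        teleport[i] = is_seed[i] ? 1.0 / seed_count : 0.0;
        rank[i] = teleport[i];
    }
    for (int e = 0; e < m; e++) out_weight[edges[e].from] += edges[e].weight;

    for (int it = 0; it < PPR_ITERATIONS; it++) {
        double dangling = 0.0;
        for (int i = 0; i < n; i++) {
            next[i] = 0.0;
            if (out_weight[i] <= 0.0) dangling += rank[i];
        }
        /* Push each node's rank along its out-edges, split by edge weight. */
        for (int e = 0; e < m; e++) {
            const edge_t *ed = &edges[e];
            if (out_weight[ed->from] > 0.0)
                next[ed->to] += rank[ed->from] * ed->weight / out_weight[ed->from];
        }
        /* Restart mass and dangling mass both go back to the seeds. */
        for (int i = 0; i < n; i++)
            rank[i] = PPR_DAMPING * (next[i] + dangling * teleport[i])
                    + (1.0 - PPR_DAMPING) * teleport[i];
    }
    return 0;
}
```

In a graph where a seed node has a CALLS edge to one neighbor and an IMPORTS edge to another, the callee ends up with more rank than the import, and nodes unreachable from any seed stay at exactly zero.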

Betweenness Centrality

Brandes' algorithm computes betweenness centrality across the entire call graph. Nodes that sit on many shortest paths (middleware, shared services, base controllers) score higher. This is precomputed at index time and stored in node_scores.betweenness.
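
For readers unfamiliar with Brandes' algorithm, here is a compact sketch for an unweighted, directed graph. It is not the code in store.c: the adjacency matrix, the MAX_NODES bound, and the function name are assumptions chosen to keep the example short, and the scores are left unnormalized. A real call graph would more likely use adjacency lists.

```c
#include <stddef.h>

#define MAX_NODES 32   /* hypothetical bound for this sketch */

/* adj[v][w] != 0 means a directed edge v -> w.
 * bc[] receives unnormalized betweenness. Returns 0, or -1 on bad input. */
int betweenness(int n, int adj[MAX_NODES][MAX_NODES], double *bc) {
    int order[MAX_NODES], dist[MAX_NODES];
    double sigma[MAX_NODES], delta[MAX_NODES];

    if (n <= 0 || n > MAX_NODES) return -1;
    for (int i = 0; i < n; i++) bc[i] = 0.0;

    for (int s = 0; s < n; s++) {
        int head = 0, tail = 0;
        for (int i = 0; i < n; i++) { dist[i] = -1; sigma[i] = 0.0; delta[i] = 0.0; }
        dist[s] = 0;
        sigma[s] = 1.0;
        order[tail++] = s;

        /* BFS from s: sigma[w] counts the shortest paths from s to w.
         * order[] doubles as the queue and as the visit-order record. */
        while (head < tail) {
            int v = order[head++];
            for (int w = 0; w < n; w++) {
                if (!adj[v][w]) continue;
                if (dist[w] < 0) { dist[w] = dist[v] + 1; order[tail++] = w; }
                if (dist[w] == dist[v] + 1) sigma[w] += sigma[v];
            }
        }

        /* Reverse BFS order: push each node's dependency onto its
         * shortest-path predecessors. k > 0 skips the source itself. */
        for (int k = tail - 1; k > 0; k--) {
            int w = order[k];
            for (int v = 0; v < n; v++)
                if (adj[v][w] && dist[v] >= 0 && dist[v] + 1 == dist[w])
                    delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w]);
            bc[w] += delta[w];
        }
    }
    return 0;
}
```

On a three-node chain 0 -> 1 -> 2, only the middle node lies on a shortest path between two other nodes, so it is the only one with a nonzero score. That is the property that makes middleware and shared services stand out.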

In-Degree Authority (Simplified HITS)

This signal is inspired by Kleinberg's HITS algorithm. Instead of running the full hub/authority iteration, it uses a simplified version: count the incoming edges per node and normalize to [0,1]. Nodes called by many others are treated as authoritative; auto-generated stubs with 0 callers receive no authority score.
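
The simplified authority signal amounts to an in-degree count followed by a rescale. The sketch below shows one way to do that; dividing by the maximum in-degree is an assumption, since the notes only say the value is normalized to [0,1], and edge_t and indegree_authority are hypothetical names.

```c
#include <stddef.h>

typedef struct { int from; int to; } edge_t;

/* authority[v] = in_degree(v) / max in_degree over all nodes,
 * so the most-referenced node scores 1.0 and uncalled nodes score 0.0. */
void indegree_authority(int n, const edge_t *edges, int m, double *authority) {
    double max_in = 0.0;

    for (int i = 0; i < n; i++) authority[i] = 0.0;

    /* Count incoming edges (callers/importers) per node. */
    for (int e = 0; e < m; e++) {
        int t = edges[e].to;
        if (t < 0 || t >= n) continue;   /* ignore out-of-range targets */
        authority[t] += 1.0;
    }

    for (int i = 0; i < n; i++)
        if (authority[i] > max_in) max_in = authority[i];
    if (max_in == 0.0) return;           /* no edges: every score stays 0 */

    for (int i = 0; i < n; i++) authority[i] /= max_in;
}
```

Unlike full HITS, this needs a single pass over the edges and no iteration, at the cost of treating every caller as equally important.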

Explore Mode FTS Fallback

The explore mode (for broad area queries like "order creation flow") previously used only regex matching. When regex finds 0 results, it now falls back to cbm_store_ranked_search with a limit of 20 results. With this fallback, every C-tier cross-file query went from scoring 0 to scoring points.

Compact Output

Removed debug fields (ppr, bm25, betweenness, composite_score) from the locate JSON response. The LLM only needs: name, file, type, line. Results are sorted by rank — position conveys importance. This saved ~800 bytes per query.

Development Process

25 bounded iterations using the autoresearch methodology. Each iteration: modify one thing → build → run 2683 unit tests (guard) → score against 15 benchmark cases → keep or discard.

Score Progression

Iter  Score  Delta  Status   What
 0      30    —     base     PageRank-only ranking
 1-7    —     —     discard  Weight tuning, LIKE fallback — no improvement
 8      46   +16    keep     FTS5 prefix queries (word*)
 9      59   +13    keep     Stop word filtering
10      60    +1    keep     Per-file cap (FNV-1a hash)
11      -1   -61    discard  Combined changes — catastrophic
12      72   +12    keep     Context tool + explore FTS fallback
13      73    +1    keep     In-degree authority (HITS)
14      93   +20    keep     Locate results 20→10
15     111   +18    keep     CamelCase splitting (search_terms)
16     112    +1    keep     Per-file cap 5→3
17     152   +40    discard  Synonym table — hardcodes project knowledge
18-22   —     —     discard  Weight tuning, neighbors, PPR iterations
23     120    +8    keep     Remove debug score fields
25     123    +3    keep     Remove composite score field

Key lessons:

7 failed iterations before the first improvement. Pure weight tuning doesn't work when the right files aren't in the candidate set.
Always test changes in isolation. Iteration 11 combined two +2 changes and got -61.
Don't hardcode. Synonym tables and file exclusions scored well but were project-specific. Per-file caps and prefix queries are general.
Output efficiency matters. 31 of 123 points came from reducing output bytes, not improving ranking.

LLM End-to-End Validation

The same 15 cases were run through Claude Code with and without codebase-memory-mcp:

	No MCP (grep/glob)	With MCP
PASS	10	11
PARTIAL	4	3
FAIL	1	1
Cost	$4.56	$5.03
Turns	88	131

MCP's advantage is modest because Claude Code is already good at grep/glob searching. The real value: MCP gives direction on the first call — the LLM then spends turns reading code deeply rather than searching blindly. On concept queries (B-tier), MCP consistently surfaces files the LLM wouldn't find via grep alone.

Parameters Reference

// BM25 column weights (src/store/store.c)
bm25(node_fts, 10.0, 5.0, 1.0, 0.25)  // name, qualified_name, file_path, search_terms

// Composite weights
W_PPR         = 0.35
W_BM25        = 0.30
W_COCHANGE    = 0.20
W_BETWEENNESS = 0.15
W_AUTHORITY   = 0.10

// FTS pipeline
PER_FILE_CAP        = 3     // max results per file in FTS candidate set
FILE_TRACK_CAP      = 128   // hash table size for file tracking
FTS_CANDIDATE_LIMIT = 500   // SQL LIMIT on FTS5 query

// PPR
seed_count  = 10    // top FTS hits used as PPR seeds
iterations  = 15
damping     = 0.85

// Output
locate_results   = 10
explore_fallback = 20

Files Changed

src/store/store.c — CamelCase splitting (camel_case_split, build_search_terms), FTS5 schema migration with backfill, per-file cap, stop word filtering, prefix query builder, in-degree authority, betweenness centrality, composite scoring
src/mcp/mcp.c — Locate output compaction, explore FTS fallback, result count tuning
src/store/store.h — cbm_ranked_result_t typedef cleanup
benchmarks/ — 15 A/B/C test cases, score_ranking.sh scoring script, run_llm_bench.sh LLM harness, autoresearch_cases.json, result archives, viewer.html

Assets 25

Asset	SHA-256	Size	Uploaded
checksums.txt	sha256:da2746115804e6e299c00678f351b281a952e60252e1412073d4d856b3c26516	1.04 KB	2026-03-29T05:56:05Z
checksums.txt.bundle	sha256:b99d637aa2550ab61c0f663a1412c63c614dc13e3886da44ce22fd34142c1b5d	8.36 KB	2026-03-29T05:56:05Z
codebase-memory-mcp-darwin-amd64.tar.gz	sha256:a08967ff1380d7e89c34df4e908c02e9a2438842b197677ce93347936939f66f	12.5 MB	2026-03-29T05:56:05Z
codebase-memory-mcp-darwin-amd64.tar.gz.bundle	sha256:5a8e9c32fee68c100cafb1bdae3299088b5843d7a066f7b1699be06f0e4a4abd	8.36 KB	2026-03-29T05:56:05Z
codebase-memory-mcp-darwin-arm64.tar.gz	sha256:aeb98684b7cca3552ab185d0ff5807a9ec8c0ac697a6e774a6243b92c938a994	12.3 MB	2026-03-29T05:56:05Z
codebase-memory-mcp-darwin-arm64.tar.gz.bundle	sha256:220caf91f3b847ed6a6de5064a73106438661b4416530a7517919e0c0e91e1cf	8.36 KB	2026-03-29T05:56:05Z
codebase-memory-mcp-linux-amd64.tar.gz	sha256:3fe376b3598a19fc84b0356431ec50a525f1454f4cf8c6457131d8797b8cb58a	11.7 MB	2026-03-29T05:56:05Z
codebase-memory-mcp-linux-amd64.tar.gz.bundle	sha256:35297730fd1261c95fcab39f48532cb4c7eb103c0730139ac83f88e5bce3fd27	8.36 KB	2026-03-29T05:56:05Z
codebase-memory-mcp-linux-arm64.tar.gz	sha256:0b2629b4772db23b1387467efb2748964148248f6983ef68f2c75743c58bb61d	11.5 MB	2026-03-29T05:56:05Z
codebase-memory-mcp-linux-arm64.tar.gz.bundle	sha256:3bb1b7b1401920bc50384ce571cce01dcfaa236c51e344b40e3c7b50fd29609e	8.36 KB	2026-03-29T05:56:05Z
Source code (zip)	-	-	2026-03-29T05:13:14Z
Source code (tar.gz)	-	-	2026-03-29T05:13:14Z

27 Mar 11:48

github-actions

v0.7.0

6308e57

v0.7.0

What's New

Fork Features

Auto-index on first tool call — automatically indexes a project when it's not yet indexed, with symlink-aware session detection
Search & trace parameter wiring — search, trace, path_filter, and edge_types parameters that were silently ignored are now properly wired up

From Upstream (v0.5.7)

C++ SEGV fix — NULL deref in LSP type resolver on large header files
Fast→full mode detection — auto-enable UI for ui-variant binary
VT gate hardening — 120min timeout, all files block equally
README — updated install docs with --skip-config, Windows improvements
TEST_PLAN v8 — 66 languages, MATLAB/Lean/FORM/Magma/Wolfram/K8s/Kustomize

Install

One-line install (macOS / Linux):

curl -fsSL https://raw.githubusercontent.com/maplenk/codebase-memory-mcp/main/install.sh | bash

Windows (PowerShell):

powershell -ExecutionPolicy ByPass -c "irm https://raw.githubusercontent.com/maplenk/codebase-memory-mcp/main/install.ps1 | iex"

Claude Code — just say:

Install this MCP server: https://github.com/maplenk/codebase-memory-mcp

Options: --ui (graph visualization), --skip-config (binary only)

Full Changelog: v0.6.0...v0.7.0

Security Verification

All release binaries have been independently verified:

VirusTotal — scanned by 70+ antivirus engines:

Binary	Scan
install.sh	View Report
install.ps1	View Report
codebase-memory-mcp-windows-amd64.exe	View Report
codebase-memory-mcp-ui-windows-amd64.exe	View Report
codebase-memory-mcp-ui-linux-arm64	View Report
codebase-memory-mcp-ui-linux-amd64	View Report
codebase-memory-mcp-ui-darwin-arm64	View Report
codebase-memory-mcp-ui-darwin-amd64	View Report
codebase-memory-mcp-linux-arm64	View Report
codebase-memory-mcp-linux-amd64	View Report
codebase-memory-mcp-darwin-arm64	View Report
codebase-memory-mcp-darwin-amd64	View Report
LICENSE	View Report

Build Provenance (SLSA) — cryptographic proof each binary was built by GitHub Actions from this repo:

gh attestation verify <downloaded-file> --repo maplenk/codebase-memory-mcp

SBOM — Software Bill of Materials (sbom.json) lists all vendored dependencies.

See SECURITY.md for full details.

26 Mar 11:20

github-actions

v0.6.0

63dd989

v0.6.0

First independent fork release with 8 new MCP tools (get_architecture_summary, get_key_symbols, get_impact_analysis, explore, understand, prepare_change, get_session_context, get_session_summary), PageRank ranking, token budgets, and proactive session hints. See README for details.

Releases: maplenk/codebase-memory-mcp
