perf: fix build hang on large repos by replacing per-row queries with… by jindalarpit · Pull Request #268 · tirth8205/code-review-graph

jindalarpit · 2026-04-13T18:03:55Z

Fix build hang on large repos (#189)

Problem

On large repositories (50K+ files, 333K+ nodes), code-review-graph build hangs indefinitely after community detection completes. The build appears stuck with no output after logging "Community detection complete: 34822 communities". This affects both CLI and MCP usage.

Root Cause

_compute_summaries() in code_review_graph/tools/build.py was running per-row SQL queries in a loop:

Risk index: 2 × N individual COUNT(*) queries (one for caller count, one for test coverage) for every Function/Class/Test node. On a 333K-node graph with 1.8M edges, this meant ~666K individual SQL queries.
Community summaries: 1 heavy JOIN + GROUP BY + ORDER BY query per community (~35K queries for 34,822 communities).
Flow snapshots: 1 SELECT per node ID in each flow path.
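
For illustration, the per-row pattern described above can be sketched as follows. This is not the code from build.py: the `nodes`/`edges` schema, the edge kinds `CALLS` and `TESTED_BY`, and the function names are all assumptions made for the example. It only shows why the query count grows as 2 × N.

```python
import sqlite3

# Hypothetical schema for illustration; the real tables and columns in
# code-review-graph may differ.
SCHEMA = """
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, kind TEXT);
CREATE TABLE edges (source_id INTEGER, target_id INTEGER, kind TEXT);
"""


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory graph: 3 nodes, 3 edges."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [(1, "parse", "Function"), (2, "Parser", "Class"), (3, "test_parse", "Test")],
    )
    conn.executemany(
        "INSERT INTO edges VALUES (?, ?, ?)",
        [(2, 1, "CALLS"), (3, 1, "CALLS"), (3, 1, "TESTED_BY")],
    )
    return conn


def per_row_risk_inputs(conn: sqlite3.Connection):
    """Slow pattern: two COUNT(*) round trips for every node."""
    node_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM nodes WHERE kind IN ('Function', 'Class', 'Test')"
        )
    ]
    results = {}
    queries = 0
    for node_id in node_ids:
        callers = conn.execute(
            "SELECT COUNT(*) FROM edges WHERE target_id = ? AND kind = 'CALLS'",
            (node_id,),
        ).fetchone()[0]
        tests = conn.execute(
            "SELECT COUNT(*) FROM edges WHERE target_id = ? AND kind = 'TESTED_BY'",
            (node_id,),
        ).fetchone()[0]
        queries += 2  # grows linearly with the number of nodes
        results[node_id] = (callers, tests > 0)
    return results, queries
```

With three nodes the loop issues six COUNT(*) queries; at 333K nodes the same loop issues the ~666K queries quoted above.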

Fix

Replaced all per-row queries with bulk operations:

Phase	Before	After
Risk index	2 × N queries (~666K)	2 bulk `GROUP BY` aggregates
Community summaries	1 JOIN per community (~35K)	1 window function query
Flow snapshots	1 SELECT per path node	1 bulk load of all node names
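
A minimal sketch of the risk-index row of the table, under the same assumed schema (`nodes`, `edges`, and the edge kinds `CALLS` and `TESTED_BY` are hypothetical names, not taken from the repository). Two GROUP BY aggregates replace the per-node COUNT(*) queries, and the per-node work becomes dictionary and set lookups.

```python
from __future__ import annotations

import sqlite3

# Hypothetical schema for illustration; not the project's actual schema.
SCHEMA = """
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, kind TEXT);
CREATE TABLE edges (source_id INTEGER, target_id INTEGER, kind TEXT);
"""


def bulk_risk_inputs(conn: sqlite3.Connection) -> dict[int, tuple[int, bool]]:
    """Return {node_id: (caller_count, is_tested)} using a fixed number of queries."""
    # Aggregate 1: caller count for every node that has at least one caller.
    caller_counts = dict(
        conn.execute(
            "SELECT target_id, COUNT(*) FROM edges "
            "WHERE kind = 'CALLS' GROUP BY target_id"
        )
    )
    # Aggregate 2: the set of nodes that have at least one test edge.
    tested = {
        row[0]
        for row in conn.execute(
            "SELECT target_id FROM edges "
            "WHERE kind = 'TESTED_BY' GROUP BY target_id"
        )
    }
    # One listing query; the rest is in-memory lookups.
    node_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM nodes WHERE kind IN ('Function', 'Class', 'Test')"
        )
    ]
    # Nodes absent from caller_counts simply have zero callers.
    return {nid: (caller_counts.get(nid, 0), nid in tested) for nid in node_ids}


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory graph: 3 nodes, 3 edges."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [(1, "parse", "Function"), (2, "Parser", "Class"), (3, "test_parse", "Test")],
    )
    conn.executemany(
        "INSERT INTO edges VALUES (?, ?, ?)",
        [(2, 1, "CALLS"), (3, 1, "CALLS"), (3, 1, "TESTED_BY")],
    )
    return conn
```

The number of SQL statements no longer depends on the number of nodes, at the cost of holding the two aggregate results in memory.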

Also added logger.info() calls at each postprocessing phase so users can see what stage the build is in, instead of silence after community detection.
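
The phase logging could be structured roughly like this. The logger name, the message wording, and the `run_phases` helper are assumptions for the sketch; the PR only states that logger.info() calls were added at each postprocessing phase.

```python
import logging
import time

# Hypothetical logger name; the project may configure logging differently.
logger = logging.getLogger("code_review_graph.build")


def run_phases(phases):
    """Run (name, callable) pairs in order, logging the start and duration of each."""
    timings = {}
    for name, fn in phases:
        logger.info("Postprocessing: %s started", name)
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
        logger.info("Postprocessing: %s finished in %.2fs", name, timings[name])
    return timings
```

Logging a line before each phase starts, not only after it ends, is what lets a user tell a slow phase apart from a hung process.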

Changes

code_review_graph/tools/build.py: Rewrote _compute_summaries() to use bulk aggregate queries and in-memory lookups instead of per-row SQL. Added progress logging for each phase.
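
As a rough sketch of the in-memory lookup used for flow snapshots: load every node name once, then resolve each flow path against the resulting dictionary. The `nodes` table layout, the shape of `flows`, and both function names are assumptions for the example.

```python
import sqlite3


def load_node_names(conn: sqlite3.Connection) -> dict:
    """One bulk query: map every node id to its name."""
    return dict(conn.execute("SELECT id, name FROM nodes"))


def resolve_flow_paths(flows: dict, names: dict) -> dict:
    """Replace node ids in each flow path with names, without touching the database."""
    return {
        flow_name: [names.get(node_id, f"<unknown:{node_id}>") for node_id in path]
        for flow_name, path in flows.items()
    }


def make_demo_db() -> sqlite3.Connection:
    """Hypothetical minimal nodes table for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?)",
        [(1, "main"), (2, "parse"), (3, "emit")],
    )
    return conn
```

Ids missing from the table resolve to a placeholder here; whether the real code skips or flags such ids is not stated in the PR.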

Testing

All 740 existing tests pass. Linting clean.

pytest tests/ --tb=short -q   # 740 passed
ruff check code_review_graph/  # All checks passed

… bulk aggregates

_compute_summaries() was running 2×N individual COUNT(*) queries for the risk_index (one for caller count, one for test coverage per node). On a 333K-node graph with 1.8M edges, this meant ~666K individual SQL queries, causing the build to appear hung after community detection completed.

Fix: replace per-row queries with two bulk GROUP BY aggregates (caller counts + tested nodes), then iterate in-memory. Also bulk-load node name mappings for flow_snapshots and use a single window-function query for community top-symbols instead of one JOIN per community.

Added progress logging at each postprocessing phase so users can see what stage the build is in.

Fixes tirth8205#189
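
One way the single window-function query for community top-symbols might look, assuming a `community_id` column on `nodes` and ranking symbols by incoming `CALLS` edges. The schema, the ranking criterion, and the function name are all hypothetical; the PR does not show the actual query. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical query: rank symbols inside each community by caller count,
# then keep the top `limit` rows per community in one statement.
TOP_SYMBOLS_SQL = """
WITH degree AS (
    SELECT n.id, n.name, n.community_id, COUNT(e.target_id) AS callers
    FROM nodes n
    LEFT JOIN edges e ON e.target_id = n.id AND e.kind = 'CALLS'
    WHERE n.community_id IS NOT NULL
    GROUP BY n.id, n.name, n.community_id
),
ranked AS (
    SELECT community_id, name,
           ROW_NUMBER() OVER (
               PARTITION BY community_id
               ORDER BY callers DESC, name
           ) AS rn
    FROM degree
)
SELECT community_id, name FROM ranked WHERE rn <= ? ORDER BY community_id, rn
"""


def top_symbols_per_community(conn: sqlite3.Connection, limit: int = 3) -> dict:
    """Return {community_id: [symbol names, best first]} from a single query."""
    top = {}
    for community_id, name in conn.execute(TOP_SYMBOLS_SQL, (limit,)):
        top.setdefault(community_id, []).append(name)
    return top


def make_demo_db() -> sqlite3.Connection:
    """Tiny graph with two communities, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, community_id INTEGER);
        CREATE TABLE edges (source_id INTEGER, target_id INTEGER, kind TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [(1, "a", 1), (2, "b", 1), (3, "c", 1), (4, "d", 2), (5, "e", 2)],
    )
    conn.executemany(
        "INSERT INTO edges VALUES (?, ?, ?)",
        [(2, 1, "CALLS"), (3, 1, "CALLS"), (1, 2, "CALLS"), (4, 5, "CALLS")],
    )
    return conn
```

ROW_NUMBER() with PARTITION BY does the per-community "top N" selection that would otherwise need one JOIN + GROUP BY + ORDER BY query per community.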

panga · 2026-04-13T19:39:19Z

I can confirm this fix resolves the issue. In a larger repository with 5000+ files, the build no longer hangs.

tirth8205 · 2026-04-14T11:58:08Z

Superseded by #184 which covers all 3 sections with comprehensive tests and passing CI.

tirth8205 closed this Apr 14, 2026

jindalarpit wants to merge 1 commit into tirth8205:main from jindalarpit:fix/build-hang-large-repos
