perf: fix build hang on large repos by replacing per-row queries with… #268

Closed
jindalarpit wants to merge 1 commit into tirth8205:main from jindalarpit:fix/build-hang-large-repos

@jindalarpit
Contributor

Fix build hang on large repos (#189)

Problem

On large repositories (50K+ files, 333K+ nodes), code-review-graph build hangs indefinitely after community detection completes. The build appears stuck with no output after logging "Community detection complete: 34822 communities". This affects both CLI and MCP usage.

Root Cause

_compute_summaries() in code_review_graph/tools/build.py was running per-row SQL queries in a loop:

  • Risk index: 2 × N individual COUNT(*) queries (one for caller count, one for test coverage) for every Function/Class/Test node. On a 333K-node graph with 1.8M edges, this meant ~666K individual SQL queries.
  • Community summaries: 1 heavy JOIN + GROUP BY + ORDER BY query per community (~35K queries for 34,822 communities).
  • Flow snapshots: 1 SELECT per node ID in each flow path.
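To make the cost of the risk-index loop concrete, here is a minimal sketch of the per-row pattern described above. It is not the code from build.py: the `nodes`/`edges` schema, the `CALLS` and `TESTED_BY` edge kinds, and both function names are hypothetical stand-ins chosen for illustration. The point is only that the loop issues two SELECTs per node, so the query count grows as 2 × N.

```python
import sqlite3


def build_demo_db(n: int) -> sqlite3.Connection:
    """Create an in-memory graph with n Function nodes (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, kind TEXT)")
    conn.execute("CREATE TABLE edges (source_id INTEGER, target_id INTEGER, kind TEXT)")
    conn.executemany(
        "INSERT INTO nodes (id, name, kind) VALUES (?, ?, 'Function')",
        [(i, f"fn_{i}") for i in range(n)],
    )
    return conn


def count_per_row_queries(conn: sqlite3.Connection) -> int:
    """Run the per-row pattern and return how many SELECTs the loop issued."""
    issued = 0
    node_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM nodes WHERE kind IN ('Function', 'Class', 'Test')"
        )
    ]
    for node_id in node_ids:
        # Query 1 of 2 for this node: how many callers does it have?
        conn.execute(
            "SELECT COUNT(*) FROM edges WHERE kind = 'CALLS' AND target_id = ?",
            (node_id,),
        ).fetchone()
        # Query 2 of 2 for this node: is it covered by any test?
        conn.execute(
            "SELECT COUNT(*) FROM edges WHERE kind = 'TESTED_BY' AND source_id = ?",
            (node_id,),
        ).fetchone()
        issued += 2
    return issued
```

With 333K qualifying nodes the same loop would issue roughly 666K statements, which matches the figure quoted in the first bullet.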

Fix

Replaced all per-row queries with bulk operations:

| Phase | Before | After |
|---|---|---|
| Risk index | 2 × N queries (~666K) | 2 bulk GROUP BY aggregates |
| Community summaries | 1 JOIN per community (~35K) | 1 window function query |
| Flow snapshots | 1 SELECT per path node | 1 bulk load of all node names |
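The first two rows of the table can be sketched roughly as follows. This is an illustration under assumptions, not the queries from the PR: the schema, the edge kinds, the function names, and the choice to rank community symbols by caller count are all hypothetical. The window-function query needs SQLite 3.25 or newer.

```python
import sqlite3


def build_demo_graph() -> sqlite3.Connection:
    """Create a tiny in-memory graph using a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE nodes (
            id INTEGER PRIMARY KEY, name TEXT, kind TEXT, community_id INTEGER
        );
        CREATE TABLE edges (source_id INTEGER, target_id INTEGER, kind TEXT);
        INSERT INTO nodes VALUES
            (1, 'parse', 'Function', 0), (2, 'load', 'Function', 0),
            (3, 'main', 'Function', 1), (4, 'test_parse', 'Test', 1);
        INSERT INTO edges VALUES
            (2, 1, 'CALLS'), (3, 1, 'CALLS'), (3, 2, 'CALLS'),
            (1, 4, 'TESTED_BY');
        """
    )
    return conn


def bulk_risk_inputs(conn: sqlite3.Connection):
    """Two aggregate queries replace the 2 x N per-node COUNT(*) lookups."""
    # One pass over the edges yields the caller count of every called node.
    caller_counts = dict(
        conn.execute(
            "SELECT target_id, COUNT(*) FROM edges "
            "WHERE kind = 'CALLS' GROUP BY target_id"
        )
    )
    # One pass yields the set of nodes that have at least one test edge.
    tested = {
        row[0]
        for row in conn.execute(
            "SELECT source_id FROM edges "
            "WHERE kind = 'TESTED_BY' GROUP BY source_id"
        )
    }
    return caller_counts, tested


def top_symbols_per_community(conn: sqlite3.Connection, limit: int = 3):
    """One window-function query replaces one JOIN per community."""
    rows = conn.execute(
        """
        WITH degree AS (
            SELECT n.id, n.community_id, n.name,
                   COUNT(e.target_id) AS callers
            FROM nodes AS n
            LEFT JOIN edges AS e
                   ON e.target_id = n.id AND e.kind = 'CALLS'
            WHERE n.community_id IS NOT NULL
            GROUP BY n.id, n.community_id, n.name
        ),
        ranked AS (
            SELECT community_id, name,
                   ROW_NUMBER() OVER (
                       PARTITION BY community_id
                       ORDER BY callers DESC, name
                   ) AS rn
            FROM degree
        )
        SELECT community_id, name FROM ranked
        WHERE rn <= ? ORDER BY community_id, rn
        """,
        (limit,),
    )
    top = {}
    for community_id, name in rows:
        top.setdefault(community_id, []).append(name)
    return top
```

After the bulk queries run, the per-node work becomes dictionary and set lookups such as `caller_counts.get(node_id, 0)` and `node_id in tested`, so the number of SQL statements no longer depends on the number of nodes or communities.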

Also added logger.info() calls at each postprocessing phase so users can see what stage the build is in, instead of silence after community detection.
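One possible way to structure that kind of phase logging is sketched below. The PR only states that logger.info() calls were added; the `log_phase` context manager, the logger name, and the message wording here are assumptions made for illustration.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name; the real module would use its own logger.
logger = logging.getLogger("postprocess_demo")


@contextmanager
def log_phase(name: str):
    """Log when a postprocessing phase starts and how long it took."""
    logger.info("Postprocessing: %s started", name)
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the phase raises, so the log shows where it stopped.
        elapsed = time.perf_counter() - start
        logger.info("Postprocessing: %s finished in %.2fs", name, elapsed)
```

Wrapping each stage, for example `with log_phase("risk index"): ...`, gives users a start line and an end line per phase, so a slow stage is visible as a missing "finished" message.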

Changes

  • code_review_graph/tools/build.py: Rewrote _compute_summaries() to use bulk aggregate queries and in-memory lookups instead of per-row SQL. Added progress logging for each phase.
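The "in-memory lookups" part of this change, as it applies to flow snapshots, might look something like the sketch below. The `nodes` table layout, the shape of the `flows` mapping, and the function names are hypothetical; the sketch only shows the idea of loading all node names once and resolving paths from a dictionary.

```python
import sqlite3


def build_name_db() -> sqlite3.Connection:
    """Create a tiny in-memory nodes table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO nodes (id, name) VALUES (?, ?)",
        [(1, "main"), (2, "load"), (3, "parse")],
    )
    return conn


def resolve_flow_paths(conn: sqlite3.Connection, flows: dict) -> dict:
    """Map each flow's node IDs to names using a single bulk query."""
    # One SELECT loads every id -> name pair into memory.
    names = dict(conn.execute("SELECT id, name FROM nodes"))
    return {
        flow: [names.get(node_id, f"<unknown:{node_id}>") for node_id in path]
        for flow, path in flows.items()
    }
```

For example, `resolve_flow_paths(build_name_db(), {"startup": [1, 2, 3]})` resolves the whole path without issuing one SELECT per node ID. The trade-off is holding all node names in memory at once, which for a few hundred thousand short strings is usually modest.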

Testing

All 740 existing tests pass. Linting clean.

pytest tests/ --tb=short -q   # 740 passed
ruff check code_review_graph/  # All checks passed

… bulk aggregates

_compute_summaries() was running 2×N individual COUNT(*) queries for the
risk_index (one for caller count, one for test coverage per node). On a
333K-node graph with 1.8M edges, this meant ~666K individual SQL queries,
causing the build to appear hung after community detection completed.

Fix: replace per-row queries with two bulk GROUP BY aggregates (caller
counts + tested nodes), then iterate in-memory. Also bulk-load node
name mappings for flow_snapshots and use a single window-function query
for community top-symbols instead of one JOIN per community.

Added progress logging at each postprocessing phase so users can see
what stage the build is in.

Fixes tirth8205#189
@panga

panga commented Apr 13, 2026

I can confirm this fix resolves the issue. In a larger repository with 5000+ files, the build no longer hangs.

@tirth8205
Owner

Superseded by #184, which covers all 3 sections with comprehensive tests and passing CI.

@tirth8205 tirth8205 closed this Apr 14, 2026
