perf: fix build hang on large repos by replacing per-row queries with… #268
Closed
jindalarpit wants to merge 1 commit into tirth8205:main
… bulk aggregates

_compute_summaries() was running 2×N individual COUNT(*) queries for the risk_index (one for caller count, one for test coverage per node). On a 333K-node graph with 1.8M edges, this meant ~666K individual SQL queries, causing the build to appear hung after community detection completed.

Fix: replace per-row queries with two bulk GROUP BY aggregates (caller counts + tested nodes), then iterate in-memory. Also bulk-load node name mappings for flow_snapshots and use a single window-function query for community top-symbols instead of one JOIN per community.

Added progress logging at each postprocessing phase so users can see what stage the build is in.

Fixes tirth8205#189
I can confirm this fix resolves the issue. In a larger repository with 5000+ files, the build no longer hangs.
Owner
Superseded by #184, which covers all 3 sections with comprehensive tests and passing CI.
Fix build hang on large repos (#189)
Problem
On large repositories (50K+ files, 333K+ nodes), the code-review-graph build command hangs indefinitely after community detection completes. The build appears stuck with no output after logging "Community detection complete: 34822 communities". This affects both CLI and MCP usage.
Root Cause
_compute_summaries() in code_review_graph/tools/build.py was running per-row SQL queries in a loop:
- risk_index: 2×N individual COUNT(*) queries (one for caller count, one for test coverage) for every Function/Class/Test node. On a 333K-node graph with 1.8M edges, this meant ~666K individual SQL queries.
- flow_snapshots: one SELECT per node ID in each flow path.
- community top-symbols: one JOIN per community.
Fix
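Replaced all per-row queries with bulk operations:
- risk_index: two bulk GROUP BY aggregates (caller counts and tested nodes), followed by in-memory iteration.
- flow_snapshots: node name mappings bulk-loaded up front.
- community top-symbols: a single window-function query instead of one JOIN per community.
Also added logger.info() calls at each postprocessing phase so users can see what stage the build is in, instead of silence after community detection.
Changes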
code_review_graph/tools/build.py: Rewrote _compute_summaries() to use bulk aggregate queries and in-memory lookups instead of per-row SQL. Added progress logging for each phase.
Testing
All 740 existing tests pass. Linting clean.
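
For readers unfamiliar with the window-function approach used for community top-symbols, a rough sketch follows. It is only an illustration under assumptions: the nodes table, its community_id and caller_count columns, the ranking criterion, and the top_symbols() helper are all hypothetical and not taken from build.py. It also assumes the bundled SQLite is version 3.25 or newer, which is when window functions became available.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema and data; the ranking column (caller_count) is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT,
                        community_id INTEGER, caller_count INTEGER);
    INSERT INTO nodes VALUES
        (1, 'parse_file', 1, 5), (2, 'load_config', 1, 3), (3, 'helper', 1, 1),
        (4, 'render', 2, 2), (5, 'main', 2, 7);
""")

# ROW_NUMBER() ranks symbols inside each community, so one statement
# replaces a separate "top N" query per community.
TOP_SYMBOLS_SQL = """
    SELECT community_id, name FROM (
        SELECT community_id, name,
               ROW_NUMBER() OVER (
                   PARTITION BY community_id
                   ORDER BY caller_count DESC, name
               ) AS rn
        FROM nodes
    )
    WHERE rn <= ?
    ORDER BY community_id, rn
"""


def top_symbols(conn, limit=2):
    """Return {community_id: [names]} with at most `limit` names per community."""
    top = defaultdict(list)
    for community_id, name in conn.execute(TOP_SYMBOLS_SQL, (limit,)):
        top[community_id].append(name)
    return dict(top)


result = top_symbols(conn, limit=2)
```

With 34822 communities, as in the log line quoted under Problem, this shape issues one statement in total rather than one per community.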