Summary
After community detection completes, the build pipeline hangs for minutes (or never finishes) on large repositories. The root cause is two N+1 query loops inside `_compute_summaries` in `tools/build.py`.
Root Cause
Community summaries (`community_summaries` block)
For each community, two separate queries are executed:
```python
# 1 query per community — edge join for top symbols
top_symbols = conn.execute(
    "SELECT n.name FROM nodes n "
    "LEFT JOIN edges e1 ON e1.source_qualified = n.qualified_name "
    "LEFT JOIN edges e2 ON e2.target_qualified = n.qualified_name "
    "WHERE n.community_id = ? AND n.kind != 'File' "
    "GROUP BY n.id ORDER BY COUNT(e1.id) + COUNT(e2.id) DESC LIMIT 5",
    (cid,),
).fetchall()

# 1 query per community — file paths for purpose prefix
file_rows = conn.execute(
    "SELECT DISTINCT file_path FROM nodes WHERE community_id = ? LIMIT 20",
    (cid,),
).fetchall()
```
On a repo with ~500 communities → ~1,000 SQL round-trips.
Risk index (`risk_index` block)
For each node (`Function`, `Class`, `Test`), three separate queries are executed:
```python
for n in nodes:
    # query 1: caller count
    caller_count = conn.execute(
        "SELECT COUNT(*) FROM edges WHERE target_qualified = ? AND kind = 'CALLS'",
        (qn,),
    ).fetchone()[0]
    # query 2: test coverage
    tested = conn.execute(
        "SELECT COUNT(*) FROM edges WHERE source_qualified = ? AND kind = 'TESTED_BY'",
        (qn,),
    ).fetchone()[0]
    # query 3: per-row insert
    conn.execute("INSERT OR REPLACE INTO risk_index ...", (...))
```
On a repo with ~17,705 nodes → ~53,000 SQL round-trips — the primary bottleneck.
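As a quick sanity check, the N+1 pattern can be made visible with `sqlite3`'s trace callback on a toy table (the table and names here are simplified stand-ins, not the project's real schema):

```python
# Sketch: count SQL round-trips with sqlite3's trace callback to confirm
# the N+1 pattern. The table is a simplified stand-in for the real edges table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (target_qualified TEXT, kind TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, 'CALLS')",
    [(f"pkg.fn{i}",) for i in range(100)],
)

statements = []
conn.set_trace_callback(statements.append)  # records every executed statement

# N+1 style: one query per node
for i in range(100):
    conn.execute(
        "SELECT COUNT(*) FROM edges WHERE target_qualified = ? AND kind = 'CALLS'",
        (f"pkg.fn{i}",),
    ).fetchone()
per_node_queries = len(statements)

statements.clear()
# Batched style: one aggregate query for all nodes
rows = conn.execute(
    "SELECT target_qualified, COUNT(*) FROM edges WHERE kind = 'CALLS' "
    "GROUP BY target_qualified"
).fetchall()
batched_queries = len(statements)

print(per_node_queries, batched_queries)  # → 100 1
```

The same data comes back either way; only the number of round-trips changes.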
Proposed Fix
Precompute community summaries in 3 queries
```python
# Assumes these imports at module level:
#   import json
#   from collections import defaultdict
#   from os.path import commonprefix

# 1. Aggregate top symbols per community
top_sym_rows = conn.execute("""
    SELECT n.community_id, n.name,
           COUNT(e1.id) + COUNT(e2.id) AS edge_count
    FROM nodes n
    LEFT JOIN edges e1 ON e1.source_qualified = n.qualified_name
    LEFT JOIN edges e2 ON e2.target_qualified = n.qualified_name
    WHERE n.community_id IS NOT NULL AND n.kind != 'File'
    GROUP BY n.community_id, n.id
    ORDER BY n.community_id, edge_count DESC
""").fetchall()

top_syms_by_comm = defaultdict(list)
for row in top_sym_rows:
    if len(top_syms_by_comm[row[0]]) < 5:
        top_syms_by_comm[row[0]].append(row[1])

# 2. Fetch all file paths in one pass
file_rows = conn.execute("""
    SELECT community_id, file_path FROM nodes WHERE community_id IS NOT NULL
""").fetchall()

paths_by_comm = defaultdict(list)
for fr in file_rows:
    paths_by_comm[fr[0]].append(fr[1])

# 3. Build rows + batch insert
rows_to_insert = []
for r in communities:
    cid, cname, csize, clang = r[0], r[1], r[2], r[3]
    key_syms = json.dumps(top_syms_by_comm.get(cid, []))
    paths = paths_by_comm.get(cid, [])
    purpose = ""
    if paths:
        prefix = commonprefix(paths)
        if "/" in prefix:
            purpose = prefix.rsplit("/", 1)[0].split("/")[-1]
    rows_to_insert.append((cid, cname, purpose, key_syms, csize, clang or ""))

conn.executemany("""
    INSERT OR REPLACE INTO community_summaries
        (community_id, name, purpose, key_symbols, size, dominant_language)
    VALUES (?, ?, ?, ?, ?, ?)
""", rows_to_insert)
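A toy, self-contained check that the batched aggregation preserves the per-community top-symbol ordering. The schema here is a simplified stand-in for the real `nodes`/`edges` tables, with three functions in one community:

```python
# Verify the batched top-symbols query ranks symbols by connection count.
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, qualified_name TEXT,
                    kind TEXT, community_id INTEGER, file_path TEXT);
CREATE TABLE edges (id INTEGER PRIMARY KEY, source_qualified TEXT,
                    target_qualified TEXT, kind TEXT);
""")
conn.executemany(
    "INSERT INTO nodes (name, qualified_name, kind, community_id, file_path) "
    "VALUES (?, ?, 'Function', ?, ?)",
    [
        ("hot", "a.hot", 1, "src/a.py"),
        ("warm", "a.warm", 1, "src/a.py"),
        ("cold", "a.cold", 1, "src/a.py"),
    ],
)
# "hot" has 3 callers, "warm" has 1, "cold" has none
conn.executemany(
    "INSERT INTO edges (source_qualified, target_qualified, kind) "
    "VALUES (?, ?, 'CALLS')",
    [("x", "a.hot"), ("y", "a.hot"), ("z", "a.hot"), ("x", "a.warm")],
)

top_sym_rows = conn.execute("""
    SELECT n.community_id, n.name,
           COUNT(e1.id) + COUNT(e2.id) AS edge_count
    FROM nodes n
    LEFT JOIN edges e1 ON e1.source_qualified = n.qualified_name
    LEFT JOIN edges e2 ON e2.target_qualified = n.qualified_name
    WHERE n.community_id IS NOT NULL AND n.kind != 'File'
    GROUP BY n.community_id, n.id
    ORDER BY n.community_id, edge_count DESC
""").fetchall()

top_syms_by_comm = defaultdict(list)
for cid, name, _count in top_sym_rows:
    if len(top_syms_by_comm[cid]) < 5:
        top_syms_by_comm[cid].append(name)

print(top_syms_by_comm[1])  # → ['hot', 'warm', 'cold']
```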
Precompute risk index in 3 queries
```python
# 1. Caller counts per qualified name
caller_counts = {
    row[0]: row[1]
    for row in conn.execute("""
        SELECT target_qualified, COUNT(*)
        FROM edges WHERE kind = 'CALLS'
        GROUP BY target_qualified
    """).fetchall()
}

# 2. All tested qualified names
tested_qns = {
    row[0]
    for row in conn.execute("""
        SELECT source_qualified FROM edges WHERE kind = 'TESTED_BY'
    """).fetchall()
}

# 3. Compute risk in Python + batch insert
risk_rows = []
for n in nodes:
    nid, qn, name = n[0], n[1], n[2]
    caller_count = caller_counts.get(qn, 0)
    coverage = "tested" if qn in tested_qns else "untested"
    sec_relevant = 1 if any(kw in name.lower() for kw in security_kw) else 0
    risk = min(
        (0.3 if caller_count > 10 else 0.15 if caller_count > 3 else 0.0)
        + (0.3 if coverage == "untested" else 0.0)
        + (0.4 if sec_relevant else 0.0),
        1.0,
    )
    risk_rows.append((nid, qn, risk, caller_count, coverage, sec_relevant))

conn.executemany("""
    INSERT OR REPLACE INTO risk_index
        (node_id, qualified_name, risk_score, caller_count,
         test_coverage, security_relevant, last_computed)
    VALUES (?, ?, ?, ?, ?, ?, datetime('now'))
""", risk_rows)
```
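For reference, the scoring logic above can be lifted into a pure function for a quick sanity check (thresholds and weights copied from the proposed code; the boolean parameters stand in for the `tested_qns` lookup and the `security_kw` match):

```python
# Pure-function version of the risk formula, for sanity-checking the weights.
def risk_score(caller_count: int, tested: bool, sec_relevant: bool) -> float:
    fan_in = 0.3 if caller_count > 10 else 0.15 if caller_count > 3 else 0.0
    untested = 0.0 if tested else 0.3
    security = 0.4 if sec_relevant else 0.0
    return min(fan_in + untested + security, 1.0)

print(risk_score(0, True, False))   # 0.0 — low fan-in, tested, not security-relevant
print(risk_score(12, False, True))  # 1.0 — capped at 1.0 (0.3 + 0.3 + 0.4)
print(risk_score(5, False, False))  # ≈ 0.45 — moderate fan-in, untested
```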
Expected Outcome
| Metric | Before | After |
| --- | --- | --- |
| SQL queries in `_compute_summaries` | ~54,000 | ~6 |
| Execution time | Hangs / minutes | Seconds |
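A rough, machine-dependent sketch of the insert-side difference: per-row `execute` calls versus one `executemany` over the same data. The table is a simplified stand-in for `risk_index`, and absolute timings will vary, but both paths must produce identical row counts:

```python
# Compare per-row inserts with a single executemany (simplified schema).
import sqlite3
import time

rows = [(i, f"pkg.fn{i}", 0.5) for i in range(20000)]

def per_row(conn):
    for r in rows:
        conn.execute("INSERT OR REPLACE INTO t VALUES (?, ?, ?)", r)

def batched(conn):
    conn.executemany("INSERT OR REPLACE INTO t VALUES (?, ?, ?)", rows)

timings = {}
for label, fn in [("per-row", per_row), ("batched", batched)]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (node_id INTEGER PRIMARY KEY, qn TEXT, risk REAL)")
    start = time.perf_counter()
    fn(conn)
    conn.commit()
    timings[label] = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    assert count == len(rows)  # both paths insert the same rows

print({k: f"{v:.3f}s" for k, v in timings.items()})
```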
Affected File
`code_review_graph/tools/build.py` — `_compute_summaries()`
This was tested on my company's Java codebase (2,987 files, ~17,705 nodes), where the fix is effective. Thanks for providing this repository.