Summary
After community detection completes, the build pipeline hangs for minutes (or never finishes) on large repositories. The root cause is two N+1 query loops inside `_compute_summaries` in `tools/build.py`.
Root Cause
Community summaries (`community_summaries` block)
For each community, two separate queries are executed:
```python
# 1 query per community — edge join for top symbols
top_symbols = conn.execute(
    "SELECT n.name FROM nodes n "
    "LEFT JOIN edges e1 ON e1.source_qualified = n.qualified_name "
    "LEFT JOIN edges e2 ON e2.target_qualified = n.qualified_name "
    "WHERE n.community_id = ? AND n.kind != 'File' "
    "GROUP BY n.id ORDER BY COUNT(e1.id) + COUNT(e2.id) DESC LIMIT 5",
    (cid,),
).fetchall()

# 1 query per community — file paths for purpose prefix
file_rows = conn.execute(
    "SELECT DISTINCT file_path FROM nodes WHERE community_id = ? LIMIT 20",
    (cid,),
).fetchall()
```
On a repo with ~500 communities → ~1,000 SQL round-trips.
Risk index (`risk_index` block)
For each node (`Function`, `Class`, `Test`), three separate queries are executed:
```python
for n in nodes:
    # query 1: caller count
    caller_count = conn.execute(
        "SELECT COUNT(*) FROM edges WHERE target_qualified = ? AND kind = 'CALLS'",
        (qn,),
    ).fetchone()[0]
    # query 2: test coverage
    tested = conn.execute(
        "SELECT COUNT(*) FROM edges WHERE source_qualified = ? AND kind = 'TESTED_BY'",
        (qn,),
    ).fetchone()[0]
    # query 3: per-row insert
    conn.execute("INSERT OR REPLACE INTO risk_index ...", (...))
```
On a repo with ~17,705 nodes → ~53,000 SQL round-trips — the primary bottleneck.
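As a quick sanity check, the N+1 pattern can be made visible with `sqlite3`'s trace callback on a toy table (the table and names here are simplified stand-ins, not the project's real schema):

```python
# Sketch: count SQL round-trips with sqlite3's trace callback to confirm
# the N+1 pattern. The table is a simplified stand-in for the real edges table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (target_qualified TEXT, kind TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, 'CALLS')",
    [(f"pkg.fn{i}",) for i in range(100)],
)

statements = []
conn.set_trace_callback(statements.append)  # records every executed statement

# N+1 style: one query per node
for i in range(100):
    conn.execute(
        "SELECT COUNT(*) FROM edges WHERE target_qualified = ? AND kind = 'CALLS'",
        (f"pkg.fn{i}",),
    ).fetchone()
per_node_queries = len(statements)

statements.clear()
# Batched style: one aggregate query for all nodes
rows = conn.execute(
    "SELECT target_qualified, COUNT(*) FROM edges WHERE kind = 'CALLS' "
    "GROUP BY target_qualified"
).fetchall()
batched_queries = len(statements)

print(per_node_queries, batched_queries)  # → 100 1
```

The same data comes back either way; only the number of round-trips changes.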
Proposed Fix
Precompute community summaries in 3 queries
```python
# Assumes these imports at module level:
#   import json
#   from collections import defaultdict
#   from os.path import commonprefix

# 1. Aggregate top symbols per community
top_sym_rows = conn.execute("""
    SELECT n.community_id, n.name,
           COUNT(e1.id) + COUNT(e2.id) AS edge_count
    FROM nodes n
    LEFT JOIN edges e1 ON e1.source_qualified = n.qualified_name
    LEFT JOIN edges e2 ON e2.target_qualified = n.qualified_name
    WHERE n.community_id IS NOT NULL AND n.kind != 'File'
    GROUP BY n.community_id, n.id
    ORDER BY n.community_id, edge_count DESC
""").fetchall()

top_syms_by_comm = defaultdict(list)
for row in top_sym_rows:
    if len(top_syms_by_comm[row[0]]) < 5:
        top_syms_by_comm[row[0]].append(row[1])

# 2. Fetch all file paths in one pass
file_rows = conn.execute("""
    SELECT community_id, file_path FROM nodes WHERE community_id IS NOT NULL
""").fetchall()

paths_by_comm = defaultdict(list)
for fr in file_rows:
    paths_by_comm[fr[0]].append(fr[1])

# 3. Build rows + batch insert
rows_to_insert = []
for r in communities:
    cid, cname, csize, clang = r[0], r[1], r[2], r[3]
    key_syms = json.dumps(top_syms_by_comm.get(cid, []))
    paths = paths_by_comm.get(cid, [])
    purpose = ""
    if paths:
        prefix = commonprefix(paths)
        if "/" in prefix:
            purpose = prefix.rsplit("/", 1)[0].split("/")[-1]
    rows_to_insert.append((cid, cname, purpose, key_syms, csize, clang or ""))

conn.executemany("""
    INSERT OR REPLACE INTO community_summaries
        (community_id, name, purpose, key_symbols, size, dominant_language)
    VALUES (?, ?, ?, ?, ?, ?)
""", rows_to_insert)
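A toy, self-contained check that the batched aggregation preserves the per-community top-symbol ordering. The schema here is a simplified stand-in for the real `nodes`/`edges` tables, with three functions in one community:

```python
# Verify the batched top-symbols query ranks symbols by connection count.
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, qualified_name TEXT,
                    kind TEXT, community_id INTEGER, file_path TEXT);
CREATE TABLE edges (id INTEGER PRIMARY KEY, source_qualified TEXT,
                    target_qualified TEXT, kind TEXT);
""")
conn.executemany(
    "INSERT INTO nodes (name, qualified_name, kind, community_id, file_path) "
    "VALUES (?, ?, 'Function', ?, ?)",
    [
        ("hot", "a.hot", 1, "src/a.py"),
        ("warm", "a.warm", 1, "src/a.py"),
        ("cold", "a.cold", 1, "src/a.py"),
    ],
)
# "hot" has 3 callers, "warm" has 1, "cold" has none
conn.executemany(
    "INSERT INTO edges (source_qualified, target_qualified, kind) "
    "VALUES (?, ?, 'CALLS')",
    [("x", "a.hot"), ("y", "a.hot"), ("z", "a.hot"), ("x", "a.warm")],
)

top_sym_rows = conn.execute("""
    SELECT n.community_id, n.name,
           COUNT(e1.id) + COUNT(e2.id) AS edge_count
    FROM nodes n
    LEFT JOIN edges e1 ON e1.source_qualified = n.qualified_name
    LEFT JOIN edges e2 ON e2.target_qualified = n.qualified_name
    WHERE n.community_id IS NOT NULL AND n.kind != 'File'
    GROUP BY n.community_id, n.id
    ORDER BY n.community_id, edge_count DESC
""").fetchall()

top_syms_by_comm = defaultdict(list)
for cid, name, _count in top_sym_rows:
    if len(top_syms_by_comm[cid]) < 5:
        top_syms_by_comm[cid].append(name)

print(top_syms_by_comm[1])  # → ['hot', 'warm', 'cold']
```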
Precompute risk index in 3 queries
```python
# 1. Caller counts per qualified name
caller_counts = {
    row[0]: row[1]
    for row in conn.execute("""
        SELECT target_qualified, COUNT(*)
        FROM edges WHERE kind = 'CALLS'
        GROUP BY target_qualified
    """).fetchall()
}

# 2. All tested qualified names
tested_qns = {
    row[0]
    for row in conn.execute("""
        SELECT source_qualified FROM edges WHERE kind = 'TESTED_BY'
    """).fetchall()
}

# 3. Compute risk in Python + batch insert
risk_rows = []
for n in nodes:
    nid, qn, name = n[0], n[1], n[2]
    caller_count = caller_counts.get(qn, 0)
    coverage = "tested" if qn in tested_qns else "untested"
    sec_relevant = 1 if any(kw in name.lower() for kw in security_kw) else 0
    risk = min(
        (0.3 if caller_count > 10 else 0.15 if caller_count > 3 else 0.0)
        + (0.3 if coverage == "untested" else 0.0)
        + (0.4 if sec_relevant else 0.0),
        1.0,
    )
    risk_rows.append((nid, qn, risk, caller_count, coverage, sec_relevant))

conn.executemany("""
    INSERT OR REPLACE INTO risk_index
        (node_id, qualified_name, risk_score, caller_count,
         test_coverage, security_relevant, last_computed)
    VALUES (?, ?, ?, ?, ?, ?, datetime('now'))
""", risk_rows)
```
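For reference, the scoring logic above can be lifted into a pure function for a quick sanity check (thresholds and weights copied from the proposed code; the boolean parameters stand in for the `tested_qns` lookup and the `security_kw` match):

```python
# Pure-function version of the risk formula, for sanity-checking the weights.
def risk_score(caller_count: int, tested: bool, sec_relevant: bool) -> float:
    fan_in = 0.3 if caller_count > 10 else 0.15 if caller_count > 3 else 0.0
    untested = 0.0 if tested else 0.3
    security = 0.4 if sec_relevant else 0.0
    return min(fan_in + untested + security, 1.0)

print(risk_score(0, True, False))   # 0.0 — low fan-in, tested, not security-relevant
print(risk_score(12, False, True))  # 1.0 — capped at 1.0 (0.3 + 0.3 + 0.4)
print(risk_score(5, False, False))  # ≈ 0.45 — moderate fan-in, untested
```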
Expected Outcome
| Metric | Before | After |
| --- | --- | --- |
| SQL queries in `_compute_summaries` | ~54,000 | ~6 |
| Execution time | Hangs / minutes | Seconds |
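A rough, machine-dependent sketch of the insert-side difference: per-row `execute` calls versus one `executemany` over the same data. The table is a simplified stand-in for `risk_index`, and absolute timings will vary, but both paths must produce identical row counts:

```python
# Compare per-row inserts with a single executemany (simplified schema).
import sqlite3
import time

rows = [(i, f"pkg.fn{i}", 0.5) for i in range(20000)]

def per_row(conn):
    for r in rows:
        conn.execute("INSERT OR REPLACE INTO t VALUES (?, ?, ?)", r)

def batched(conn):
    conn.executemany("INSERT OR REPLACE INTO t VALUES (?, ?, ?)", rows)

timings = {}
for label, fn in [("per-row", per_row), ("batched", batched)]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (node_id INTEGER PRIMARY KEY, qn TEXT, risk REAL)")
    start = time.perf_counter()
    fn(conn)
    conn.commit()
    timings[label] = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    assert count == len(rows)  # both paths insert the same rows

print({k: f"{v:.3f}s" for k, v in timings.items()})
```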
Affected File
`code_review_graph/tools/build.py` — `_compute_summaries()`
This was tested on my company's Java codebase (2,987 files, ~17,705 nodes), where the fix is effective. Thanks for providing this repository.