
fix(bench): sample symbols in embedding benchmark to prevent CI timeouts#927

Merged
carlos-alm merged 3 commits into main from fix/embedding-benchmark-timeout
Apr 15, 2026

Conversation

@carlos-alm
Contributor

Summary

  • The embedding benchmark grew from ~1,095 to 7,128 symbols, causing all model workers to exceed the 30-min timeout during the sequential search phase
  • Add deterministic symbol sampling (seeded PRNG, cap at 1,500 symbols) so the search evaluation stays bounded as the codebase grows
  • Embedding phase still covers all DB symbols — only the search evaluation loop is sampled
  • Bump job-level timeout from 195→240 min to match realistic 7×30 min worst-case

Test plan

  • Trigger benchmark.yml via workflow_dispatch with version: dev and verify the embedding-benchmark job completes successfully
  • Verify the worker logs show Sampling 1500 of N symbols (deterministic seed=42)
  • Confirm EMBEDDING-BENCHMARKS.md is updated with valid results via auto-PR

The codebase grew from ~1095 to 7128 symbols, causing every model worker
to exceed the 30-min timeout during the search phase. Add deterministic
sampling (seeded PRNG, cap 1500 symbols) so runtime stays bounded as the
repo grows. Embedding still covers all DB symbols; only the search
evaluation is sampled.
@claude

claude bot commented Apr 14, 2026

Claude finished @carlos-alm's task


PR Review: Embedding Benchmark Timeout Fix

Todo List

  • Read and analyze changed files
  • Conduct root cause analysis
  • Review configuration changes (timeout increase)
  • Check test integrity and sampling approach
  • Evaluate code quality
  • Assess backlog compliance
  • Provide final assessment

Root Cause Analysis

Problem Being Solved: Legitimate scaling issue — codebase growth from ~1,095 to 7,128 symbols caused all embedding model workers to exceed the 30-min timeout during sequential search evaluation. This is not hiding symptoms but addressing genuine resource constraints.

Why This Approach:

  • Correct targeting: Sampling only affects the search evaluation loop (scripts/embedding-benchmark.ts:114-124), not the embedding generation phase
  • Preserves benchmark integrity: All DB symbols still get embedded; only the search recall measurement is sampled
  • Deterministic: Seeded PRNG ensures reproducible results across CI runs
  • Proportional timeout: 195→240 min increase aligns with 7×30min + setup headroom calculation

Risk Assessment: LOW — This fixes a real scalability bottleneck without compromising measurement quality.

Backlog Compliance

  • Zero-dep: ✓ — No new runtime dependencies added
  • Foundation-aligned: ✓ — Fully compliant, enhances existing benchmark infrastructure
  • Problem-fit: 4/5 — Improves CI reliability for embedding benchmarks, which help validate AI search accuracy
  • Breaking: No — Purely internal benchmark optimization, no API changes
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None. This is a well-executed fix for a legitimate scaling problem.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Solid engineering solution to a real CI scaling problem. Deterministic sampling preserves benchmark validity while ensuring CI reliability. Timeout increase is mathematically justified. Implementation is clean and well-documented.

Minor suggestions for future:

  • Consider monitoring symbol growth trends to anticipate when the next scaling adjustment might be needed
  • Could document the sampling methodology in benchmark result metadata for transparency

@greptile-apps
Contributor

greptile-apps bot commented Apr 14, 2026

Greptile Summary

This PR fixes a CI timeout regression in the embedding benchmark by adding deterministic symbol sampling (mulberry32 PRNG, Fisher-Yates shuffle, seed=42, capped at 1,500) that applies only to the search-recall evaluation loop, leaving the buildEmbeddings phase covering all ~7,128 DB symbols. The job-level timeout is bumped from 195→240 min to match the new 7×30-min worst-case budget.

Confidence Score: 5/5

Safe to merge — the sampling logic is correct and bounded, the timeout math is consistent, and the one open concern (ambiguous symbols field in output JSON) is a non-blocking P2.

The PRNG (mulberry32) and Fisher-Yates shuffle are implemented correctly; the input order is stable (ORDER BY file, line) making the sample reproducible; buildEmbeddings is unaffected by the sample; and the timeout bump is arithmetically sound. The only remaining finding is a cosmetic clarity issue in the output JSON's symbols field, which doesn't affect correctness or CI behavior.
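The sampling mechanism described here (stable ORDER BY input, mulberry32 PRNG, Fisher-Yates shuffle, seed=42, cap 1500) can be sketched roughly as follows. The names seededShuffle and MAX_SYMBOLS appear in the review comments; the bodies below are an illustrative reconstruction, not the actual scripts/embedding-benchmark.ts code:

```typescript
const MAX_SYMBOLS = 1500;
const SEED = 42;

// mulberry32: tiny deterministic PRNG yielding floats in [0, 1).
// Same seed => same sequence on every run.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded PRNG. Because the input
// order is stable (ORDER BY file, line), the shuffled order — and hence
// the sample — is reproducible across CI runs.
function seededShuffle<T>(items: T[], seed: number): T[] {
  const rand = mulberry32(seed);
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Small symbol sets pass through untouched; large ones are capped.
function sampleSymbols<T>(symbols: T[]): T[] {
  if (symbols.length <= MAX_SYMBOLS) return symbols;
  return seededShuffle(symbols, SEED).slice(0, MAX_SYMBOLS);
}
```

Shuffling a copy and slicing keeps the embedding-phase input untouched, which matches the reviewers' point that only the search-recall loop sees the sampled set.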

scripts/embedding-benchmark.ts — the output JSON symbols field is worth clarifying to distinguish sampled vs. total symbol counts.
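One way the ambiguity could be resolved — reporting both counts plus the seed, per Greptile's clarity finding and Claude's metadata suggestion. The field names below are hypothetical, not the actual output schema:

```typescript
// Hypothetical disambiguated output shape: separate fields for the full
// DB symbol count (embedded) and the sampled count (searched), plus the
// seed so the sample is reproducible from the metadata alone.
interface BenchmarkOutput {
  symbolsTotal: number;   // all DB symbols covered by buildEmbeddings
  symbolsSampled: number; // symbols used in the search-recall loop
  samplingSeed: number;   // PRNG seed used for the deterministic sample
}

const output: BenchmarkOutput = {
  symbolsTotal: 7128,
  symbolsSampled: 1500,
  samplingSeed: 42,
};
console.log(JSON.stringify(output));
```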

Important Files Changed

  • scripts/embedding-benchmark.ts — Adds deterministic symbol sampling (mulberry32 + Fisher-Yates, seed=42, cap=1500) before the search recall loop; the embedding phase is unaffected. The output JSON's symbols field will now report the sampled count (1500), not the full DB count (~7128), which may mislead benchmark readers.
  • .github/workflows/benchmark.yml — Bumps the job-level timeout from 195 to 240 min with an updated comment. The math is consistent: 7 models × 30 min worst-case = 210 min, plus 30 min setup headroom = 240 min.
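The timeout arithmetic above implies a workflow change along these lines. The job name and runner are assumptions; only the timeout-minutes value and its rationale come from this PR:

```yaml
jobs:
  embedding-benchmark:        # hypothetical job name
    runs-on: ubuntu-latest    # assumed runner
    # 7 models × 30 min worst-case = 210 min, + 30 min setup headroom = 240 min
    timeout-minutes: 240      # bumped from 195 in this PR
```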

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Worker process start] --> B[loadSymbols\nORDER BY file, line]
    B --> C{symbols.length\n> MAX_SYMBOLS\n1500?}
    C -- No --> D[Use all symbols]
    C -- Yes --> E[seededShuffle seed=42\nFisher-Yates + mulberry32]
    E --> F[.slice 0, 1500]
    F --> D
    D --> G[buildEmbeddings\nall DB symbols via SQLite]
    G --> H[Search recall loop\nover sampled 1500 symbols]
    H --> I[Compute hits@1/3/5/10]
    I --> J[stdout: JSON\nsymbols: 1500, result]
    J --> K[Parent collects\nsymbolCount = last data.symbols]
    K --> L[Output JSON\nsymbols: 1500 sampled count]

Reviews (2): Last reviewed commit: "fix(bench): correct stale timing estimat..."

Comment on lines +22 to +26
* Cap symbol count so CI stays under the per-model timeout.
* At ~1500 symbols on a CPU-only runner, embed + search takes ~12-15 min
* per model — well within the 45-min timeout with headroom to spare.
*/
const MAX_SYMBOLS = 1500;
Contributor


P2 Stale timing estimates in JSDoc

The comment says "embed + search takes ~12-15 min per model" and "well within the 45-min timeout", but TIMEOUT_MS is 1_800_000 ms = 30 min, not 45 min. The TIMEOUT_MS comment also separately estimates embed at ~18 min and search at ~5 min (~23 min total), which contradicts the 12-15 min claim here. Because buildEmbeddings runs over all DB symbols (~7,128), not just the sampled 1,500, the 23-min estimate is more plausible.

Suggested change

Before:
  * Cap symbol count so CI stays under the per-model timeout.
  * At ~1500 symbols on a CPU-only runner, embed + search takes ~12-15 min
  * per model — well within the 45-min timeout with headroom to spare.
  */
  const MAX_SYMBOLS = 1500;

After:
  /**
  * Cap symbol count so CI stays under the per-model timeout.
  * At ~1500 symbols on a CPU-only runner, search evaluation takes ~5 min;
  * embedding all DB symbols takes ~18 min — ~23 min total, within the 30-min timeout.
  */

Contributor Author


Fixed in e7b1008 — updated the JSDoc to match the actual TIMEOUT_MS (30 min) and the realistic timing breakdown (~18 min embed + ~5 min search = ~23 min total). The stale "12-15 min" and "45-min timeout" references are gone.


@github-actions
Contributor

github-actions bot commented Apr 14, 2026

Codegraph Impact Analysis

1 function changed, 1 caller affected, across 1 file

  • seededShuffle in scripts/embedding-benchmark.ts:79 (1 transitive callers)

@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 20ad454 into main Apr 15, 2026
26 checks passed
@carlos-alm carlos-alm deleted the fix/embedding-benchmark-timeout branch April 15, 2026 01:17
@github-actions github-actions bot locked and limited conversation to collaborators Apr 15, 2026