
fix(bench): sample symbols in embedding benchmark to prevent CI timeouts#927

Merged
carlos-alm merged 3 commits into main from fix/embedding-benchmark-timeout
Apr 15, 2026

Conversation

@carlos-alm
Contributor

Summary

  • The embedding benchmark grew from ~1,095 to 7,128 symbols, causing all model workers to exceed the 30-min timeout during the sequential search phase
  • Add deterministic symbol sampling (seeded PRNG, cap at 1,500 symbols) so the search evaluation stays bounded as the codebase grows
  • Embedding phase still covers all DB symbols — only the search evaluation loop is sampled
  • Bump job-level timeout from 195→240 min to match realistic 7×30 min worst-case

Test plan

  • Trigger benchmark.yml via workflow_dispatch with version: dev and verify the embedding-benchmark job completes successfully
  • Verify the worker logs show Sampling 1500 of N symbols (deterministic seed=42)
  • Confirm EMBEDDING-BENCHMARKS.md is updated with valid results via auto-PR

The codebase grew from ~1095 to 7128 symbols, causing every model worker
to exceed the 30-min timeout during the search phase. Add deterministic
sampling (seeded PRNG, cap 1500 symbols) so runtime stays bounded as the
repo grows. Embedding still covers all DB symbols; only the search
evaluation is sampled.
@claude

claude bot commented Apr 14, 2026

Claude finished @carlos-alm's task


PR Review: Embedding Benchmark Timeout Fix

Todo List

  • Read and analyze changed files
  • Conduct root cause analysis
  • Review configuration changes (timeout increase)
  • Check test integrity and sampling approach
  • Evaluate code quality
  • Assess backlog compliance
  • Provide final assessment

Root Cause Analysis

Problem Being Solved: Legitimate scaling issue — codebase growth from ~1,095 to 7,128 symbols caused all embedding model workers to exceed the 30-min timeout during sequential search evaluation. This is not hiding symptoms but addressing genuine resource constraints.

Why This Approach:

  • Correct targeting: Sampling only affects the search evaluation loop (scripts/embedding-benchmark.ts:114-124), not the embedding generation phase
  • Preserves benchmark integrity: All DB symbols still get embedded; only the search recall measurement is sampled
  • Deterministic: Seeded PRNG ensures reproducible results across CI runs
  • Proportional timeout: 195→240 min increase aligns with 7×30min + setup headroom calculation

Risk Assessment: LOW — This fixes a real scalability bottleneck without compromising measurement quality.

Backlog Compliance

  • Zero-dep: ✓ — No new runtime dependencies added
  • Foundation-aligned: ✓ — Fully compliant, enhances existing benchmark infrastructure
  • Problem-fit: 4/5 — Improves CI reliability for embedding benchmarks, which help validate AI search accuracy
  • Breaking: No — Purely internal benchmark optimization, no API changes
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None. This is a well-executed fix for a legitimate scaling problem.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Solid engineering solution to a real CI scaling problem. Deterministic sampling preserves benchmark validity while ensuring CI reliability. Timeout increase is mathematically justified. Implementation is clean and well-documented.

Minor suggestions for future:

  • Consider monitoring symbol growth trends to anticipate when the next scaling adjustment might be needed
  • Could document the sampling methodology in benchmark result metadata for transparency

@greptile-apps
Contributor

greptile-apps bot commented Apr 14, 2026

Greptile Summary

This PR fixes a CI timeout regression in the embedding benchmark by adding deterministic symbol sampling (mulberry32 PRNG, Fisher-Yates shuffle, seed=42, capped at 1,500) that applies only to the search-recall evaluation loop, leaving the buildEmbeddings phase covering all ~7,128 DB symbols. The job-level timeout is bumped from 195→240 min to match the new 7×30-min worst-case budget.

Confidence Score: 5/5

Safe to merge — the sampling logic is correct and bounded, the timeout math is consistent, and the one open concern (ambiguous symbols field in output JSON) is a non-blocking P2.

The PRNG (mulberry32) and Fisher-Yates shuffle are implemented correctly; the input order is stable (ORDER BY file, line) making the sample reproducible; buildEmbeddings is unaffected by the sample; and the timeout bump is arithmetically sound. The only remaining finding is a cosmetic clarity issue in the output JSON's symbols field, which doesn't affect correctness or CI behavior.
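The sampling mechanism described here (stable ORDER BY input, mulberry32 PRNG, Fisher-Yates shuffle, seed=42, cap 1500) can be sketched roughly as follows. The names seededShuffle and MAX_SYMBOLS appear in the review comments; the bodies below are an illustrative reconstruction, not the actual scripts/embedding-benchmark.ts code:

```typescript
const MAX_SYMBOLS = 1500;
const SEED = 42;

// mulberry32: tiny deterministic PRNG yielding floats in [0, 1).
// Same seed => same sequence on every run.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded PRNG. Because the input
// order is stable (ORDER BY file, line), the shuffled order — and hence
// the sample — is reproducible across CI runs.
function seededShuffle<T>(items: T[], seed: number): T[] {
  const rand = mulberry32(seed);
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Small symbol sets pass through untouched; large ones are capped.
function sampleSymbols<T>(symbols: T[]): T[] {
  if (symbols.length <= MAX_SYMBOLS) return symbols;
  return seededShuffle(symbols, SEED).slice(0, MAX_SYMBOLS);
}
```

Shuffling a copy and slicing keeps the embedding-phase input untouched, which matches the reviewers' point that only the search-recall loop sees the sampled set.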

scripts/embedding-benchmark.ts — the output JSON symbols field is worth clarifying to distinguish sampled vs. total symbol counts.
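One way the ambiguity could be resolved — reporting both counts plus the seed, per Greptile's clarity finding and Claude's metadata suggestion. The field names below are hypothetical, not the actual output schema:

```typescript
// Hypothetical disambiguated output shape: separate fields for the full
// DB symbol count (embedded) and the sampled count (searched), plus the
// seed so the sample is reproducible from the metadata alone.
interface BenchmarkOutput {
  symbolsTotal: number;   // all DB symbols covered by buildEmbeddings
  symbolsSampled: number; // symbols used in the search-recall loop
  samplingSeed: number;   // PRNG seed used for the deterministic sample
}

const output: BenchmarkOutput = {
  symbolsTotal: 7128,
  symbolsSampled: 1500,
  samplingSeed: 42,
};
console.log(JSON.stringify(output));
```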

Important Files Changed

  • scripts/embedding-benchmark.ts — Adds deterministic symbol sampling (mulberry32 + Fisher-Yates, seed=42, cap=1500) before the search recall loop; the embedding phase is unaffected. The output JSON's symbols field will now report the sampled count (1500), not the full DB count (~7128), which may mislead benchmark readers.
  • .github/workflows/benchmark.yml — Bumps the job-level timeout from 195 to 240 min with an updated comment. The math is consistent: 7 models × 30 min worst-case = 210 min, plus 30 min setup headroom = 240 min.
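The timeout arithmetic above implies a workflow change along these lines. The job name and runner are assumptions; only the timeout-minutes value and its rationale come from this PR:

```yaml
jobs:
  embedding-benchmark:        # hypothetical job name
    runs-on: ubuntu-latest    # assumed runner
    # 7 models × 30 min worst-case = 210 min, + 30 min setup headroom = 240 min
    timeout-minutes: 240      # bumped from 195 in this PR
```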

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Worker process start] --> B[loadSymbols\nORDER BY file, line]
    B --> C{symbols.length\n> MAX_SYMBOLS\n1500?}
    C -- No --> D[Use all symbols]
    C -- Yes --> E[seededShuffle seed=42\nFisher-Yates + mulberry32]
    E --> F[.slice 0, 1500]
    F --> D
    D --> G[buildEmbeddings\nall DB symbols via SQLite]
    G --> H[Search recall loop\nover sampled 1500 symbols]
    H --> I[Compute hits@1/3/5/10]
    I --> J[stdout: JSON\nsymbols: 1500, result]
    J --> K[Parent collects\nsymbolCount = last data.symbols]
    K --> L[Output JSON\nsymbols: 1500 sampled count]

Reviews (2): Last reviewed commit: "fix(bench): correct stale timing estimat..."

Comment on lines +22 to +26
* Cap symbol count so CI stays under the per-model timeout.
* At ~1500 symbols on a CPU-only runner, embed + search takes ~12-15 min
* per model — well within the 45-min timeout with headroom to spare.
*/
const MAX_SYMBOLS = 1500;
Contributor


P2 Stale timing estimates in JSDoc

The comment says "embed + search takes ~12-15 min per model" and "well within the 45-min timeout", but TIMEOUT_MS is 1_800_000 ms = 30 min, not 45 min. The TIMEOUT_MS comment also separately estimates embed at ~18 min and search at ~5 min (~23 min total), which contradicts the 12-15 min claim here. Because buildEmbeddings runs over all DB symbols (~7,128), not just the sampled 1,500, the 23-min estimate is more plausible.

Suggested change

Before:
  * Cap symbol count so CI stays under the per-model timeout.
  * At ~1500 symbols on a CPU-only runner, embed + search takes ~12-15 min
  * per model — well within the 45-min timeout with headroom to spare.
  */
  const MAX_SYMBOLS = 1500;

After:
  /**
  * Cap symbol count so CI stays under the per-model timeout.
  * At ~1500 symbols on a CPU-only runner, search evaluation takes ~5 min;
  * embedding all DB symbols takes ~18 min — ~23 min total, within the 30-min timeout.
  */

Contributor Author


Fixed in e7b1008 — updated the JSDoc to match the actual TIMEOUT_MS (30 min) and the realistic timing breakdown (~18 min embed + ~5 min search = ~23 min total). The stale "12-15 min" and "45-min timeout" references are gone.


@github-actions
Contributor

github-actions bot commented Apr 14, 2026

Codegraph Impact Analysis

1 function changed, 1 caller affected, across 1 file

  • seededShuffle in scripts/embedding-benchmark.ts:79 (1 transitive callers)

@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 20ad454 into main Apr 15, 2026
26 checks passed
@carlos-alm carlos-alm deleted the fix/embedding-benchmark-timeout branch April 15, 2026 01:17
@github-actions github-actions bot locked and limited conversation to collaborators Apr 15, 2026