Skip to content

feat: add CI performance benchmark suite#384

Closed
benvinegar wants to merge 5 commits into
mainfrom
feat/ci-performance-benchmarks
Closed

feat: add CI performance benchmark suite#384
benvinegar wants to merge 5 commits into
mainfrom
feat/ci-performance-benchmarks

Conversation

@benvinegar

Copy link
Copy Markdown
Member

Summary

  • add a shared benchmark runner with JSON output, median/p95 aggregation, thresholds, and PR-vs-main comparison
  • add benchmark coverage for startup/loading, patch parsing, render/layout planning, highlight prefetch, large-stream rendering/scrolling, memory, and optional competitor tools
  • update the benchmark workflow to run on PRs, post/update a curated PR comment, upload raw artifacts, and fail on meaningful regressions

Validation

  • bun run format:check
  • bun run typecheck
  • bun run lint
  • bun test
  • bun run bench -- --samples 3 --include-competitors --out benchmarks/results/local.json

This PR description was generated by Pi using OpenAI GPT-5

@greptile-apps

greptile-apps Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces a comprehensive CI benchmark suite covering startup, patch parsing, render/layout planning, highlight prefetch, large-stream rendering, memory, and optional competitor comparisons. It adds a shared runner (run.ts) with JSON output and median/p95 aggregation, a comparison script that enforces regression thresholds, and a GitHub Actions workflow that runs head and base in sibling worktrees, posts a curated PR comment, and gates merges on meaningful regressions.

  • New benchmark scripts (working-tree-load, changeset-parse, render-layout, memory, competitors) expand coverage well beyond the prior three scripts; fixtures are deterministic and cleaned up via try/finally.
  • classifyMetric in benchmark-result.ts has an unreachable competitor_ branch — all competitor timing metrics match the earlier _ms check, so they get comparable: true with a 15% regression threshold in the serialized JSON; the behavior is currently masked by a name-match guard in compareMetric but the JSON artifacts are misleading and the setup is fragile.
  • comment-pr.ts fetches only the first 100 PR comments with no pagination, so the marker lookup silently misses the existing benchmark comment on active PRs and posts a duplicate on every run.

Confidence Score: 3/5

Safe to merge after the two functional fixes; the pagination bug will silently flood active PRs with duplicate benchmark comments from day one, and the misclassification in classifyMetric embeds incorrect metadata in every JSON artifact.

The pagination bug in comment-pr.ts will create a new benchmark comment on every CI run for any PR with more than 100 existing comments, rather than updating the existing one — a silent but immediately visible problem for busy PRs. The classifyMetric ordering issue means every competitor timing metric is written to JSON artifacts as comparable: true with a regression threshold, which is only papered over by a defensive name-check in compareMetric; if that guard is ever simplified the CI gate for competitor metrics would break unexpectedly.

benchmarks/lib/benchmark-result.ts (classifyMetric branch ordering) and benchmarks/comment-pr.ts (comment pagination) need fixes before the benchmark infrastructure can be relied on.

Important Files Changed

Filename Overview
benchmarks/lib/benchmark-result.ts Core types and aggregation utilities; classifyMetric has an unreachable competitor_ branch (dead code causing incorrect comparable: true in JSON artifacts) and informational field is never populated.
benchmarks/comment-pr.ts GitHub PR comment upsert script; comment search is capped at 100 with no pagination — duplicate comments will be posted on PRs with >100 comments.
benchmarks/compare.ts Comparison and markdown generation; logic is sound but relies on metric.name.includes("competitor_") as a belt-and-suspenders fix for the misclassification in classifyMetric.
benchmarks/run.ts Orchestration runner; spawns each script N times, aggregates samples, and writes JSON. Straightforward and correct.
.github/workflows/benchmarks.yml CI workflow; head+base runs, comparison, PR comment, and regression gate. fetch-depth: 0 is applied on all triggers (push, manual) not just PRs, but this is minor.
benchmarks/lib/fixtures.ts Deterministic synthetic fixture helpers; creates temp git repos and unified patches. Cleanup is guarded by try/finally where it matters.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[push / pull_request / workflow_dispatch] --> B[Checkout full history]
    B --> C[bun install]
    C --> D[Run head benchmarks\nbun run bench --samples N --out head.json]
    D --> E{PR event?}
    E -- No --> F[Publish step summary\nhead.txt]
    E -- Yes --> G[git worktree add origin/main\ncp benchmarks → base worktree\nbun run benchmarks/run.ts --out base.json]
    G --> H[Compare head vs base\nbench:compare → comparison.json + summary.md]
    H --> I[Publish step summary\nsummary.md]
    I --> J[Comment benchmark summary on PR\nbench:comment-pr --body summary.md]
    J --> K{compare step failed?}
    K -- Yes --> L[exit 1 — block merge]
    K -- No --> M[Upload artifacts\nbenchmark-results/*]
    F --> M
    L --> M
Loading
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
benchmarks/lib/benchmark-result.ts:68-98
The `startsWith("competitor_")` guard is unreachable dead code: every competitor timing metric ends with `_ms` and is caught by the first branch, giving it `comparable: true` with a 15% regression threshold. The intent of `comparable: false` for competitors is only saved by the `metric.name.includes("competitor_")` fall-through in `compareMetric`. If that guard is ever removed or refactored, competitor regressions would start failing CI. The fix is to check the competitor prefix before the generic `_ms` check.

```suggestion
  if (name.startsWith("competitor_")) {
    return { unit: "ms", comparable: false };
  }

  if (name.endsWith("_ms")) {
    return {
      unit: "ms",
      comparable: true,
      threshold: { maxRegressionRatio: 1.15, minAbsoluteRegression: 5 },
    };
  }

  if (
    name.startsWith("is_") ||
    name.endsWith("_ready_before_move") ||
    name.endsWith("_available")
  ) {
    return { unit: "boolean", comparable: false };
  }

  if (name.includes("rss") || name.includes("heap")) {
    return {
      unit: "bytes",
      comparable: true,
      threshold: { maxRegressionRatio: 1.2, minAbsoluteRegression: 8 * 1024 * 1024 },
    };
  }

  if (name.endsWith("_bytes")) {
    return { unit: "bytes", comparable: false };
  }
```

### Issue 2 of 3
benchmarks/comment-pr.ts:62-65
The comment search is capped at 100 with no pagination. On any active PR with more than 100 comments the marker lookup silently returns nothing, so every subsequent benchmark run posts a brand-new comment instead of updating the existing one.

```suggestion
async function fetchAllComments(
  repository: string,
  pullRequestNumber: number,
): Promise<Array<{ id: number; body?: string }>> {
  const all: Array<{ id: number; body?: string }> = [];
  let page = 1;
  while (true) {
    const batch = (await githubRequest(
      `/repos/${repository}/issues/${pullRequestNumber}/comments?per_page=100&page=${page}`,
    )) as Array<{ id: number; body?: string }>;
    all.push(...batch);
    if (batch.length < 100) break;
    page += 1;
  }
  return all;
}

const comments = await fetchAllComments(repository, pullRequestNumber);
const existing = comments.find((comment) => comment.body?.includes(marker));
```

### Issue 3 of 3
benchmarks/lib/benchmark-result.ts:16-18
The `informational` field is declared in `BenchmarkMetricResult` but `aggregateMetric` never sets it, so it is always `undefined` in every serialized artifact and in every comparison row. The `metric.informational` branch in `compareMetric` is therefore dead code — it can never flip a metric to `"informational"` status. Either populate the field in `aggregateMetric` or remove it from the interface to avoid misleading future readers.

```suggestion
  comparable: boolean;
  source: string;
```

Reviews (1): Last reviewed commit: "feat: add CI performance benchmark suite" | Re-trigger Greptile

Comment thread benchmarks/lib/benchmark-result.ts Outdated
Comment on lines +68 to +98
if (name.endsWith("_ms")) {
return {
unit: "ms",
comparable: true,
threshold: { maxRegressionRatio: 1.15, minAbsoluteRegression: 5 },
};
}

if (
name.startsWith("is_") ||
name.endsWith("_ready_before_move") ||
name.endsWith("_available")
) {
return { unit: "boolean", comparable: false };
}

if (name.includes("rss") || name.includes("heap")) {
return {
unit: "bytes",
comparable: true,
threshold: { maxRegressionRatio: 1.2, minAbsoluteRegression: 8 * 1024 * 1024 },
};
}

if (name.endsWith("_bytes")) {
return { unit: "bytes", comparable: false };
}

if (name.startsWith("competitor_")) {
return { unit: "ms", comparable: false };
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The startsWith("competitor_") guard is unreachable dead code: every competitor timing metric ends with _ms and is caught by the first branch, giving it comparable: true with a 15% regression threshold. The intent of comparable: false for competitors is only saved by the metric.name.includes("competitor_") fall-through in compareMetric. If that guard is ever removed or refactored, competitor regressions would start failing CI. The fix is to check the competitor prefix before the generic _ms check.

Suggested change
if (name.endsWith("_ms")) {
return {
unit: "ms",
comparable: true,
threshold: { maxRegressionRatio: 1.15, minAbsoluteRegression: 5 },
};
}
if (
name.startsWith("is_") ||
name.endsWith("_ready_before_move") ||
name.endsWith("_available")
) {
return { unit: "boolean", comparable: false };
}
if (name.includes("rss") || name.includes("heap")) {
return {
unit: "bytes",
comparable: true,
threshold: { maxRegressionRatio: 1.2, minAbsoluteRegression: 8 * 1024 * 1024 },
};
}
if (name.endsWith("_bytes")) {
return { unit: "bytes", comparable: false };
}
if (name.startsWith("competitor_")) {
return { unit: "ms", comparable: false };
}
if (name.startsWith("competitor_")) {
return { unit: "ms", comparable: false };
}
if (name.endsWith("_ms")) {
return {
unit: "ms",
comparable: true,
threshold: { maxRegressionRatio: 1.15, minAbsoluteRegression: 5 },
};
}
if (
name.startsWith("is_") ||
name.endsWith("_ready_before_move") ||
name.endsWith("_available")
) {
return { unit: "boolean", comparable: false };
}
if (name.includes("rss") || name.includes("heap")) {
return {
unit: "bytes",
comparable: true,
threshold: { maxRegressionRatio: 1.2, minAbsoluteRegression: 8 * 1024 * 1024 },
};
}
if (name.endsWith("_bytes")) {
return { unit: "bytes", comparable: false };
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: benchmarks/lib/benchmark-result.ts
Line: 68-98

Comment:
The `startsWith("competitor_")` guard is unreachable dead code: every competitor timing metric ends with `_ms` and is caught by the first branch, giving it `comparable: true` with a 15% regression threshold. The intent of `comparable: false` for competitors is only saved by the `metric.name.includes("competitor_")` fall-through in `compareMetric`. If that guard is ever removed or refactored, competitor regressions would start failing CI. The fix is to check the competitor prefix before the generic `_ms` check.

```suggestion
  if (name.startsWith("competitor_")) {
    return { unit: "ms", comparable: false };
  }

  if (name.endsWith("_ms")) {
    return {
      unit: "ms",
      comparable: true,
      threshold: { maxRegressionRatio: 1.15, minAbsoluteRegression: 5 },
    };
  }

  if (
    name.startsWith("is_") ||
    name.endsWith("_ready_before_move") ||
    name.endsWith("_available")
  ) {
    return { unit: "boolean", comparable: false };
  }

  if (name.includes("rss") || name.includes("heap")) {
    return {
      unit: "bytes",
      comparable: true,
      threshold: { maxRegressionRatio: 1.2, minAbsoluteRegression: 8 * 1024 * 1024 },
    };
  }

  if (name.endsWith("_bytes")) {
    return { unit: "bytes", comparable: false };
  }
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — competitor metrics are now classified before generic _ms metrics, so optional competitor timings stay non-comparable without relying on the comparison fallback.

This comment was generated by Pi using OpenAI GPT-5

Comment thread benchmarks/comment-pr.ts Outdated
Comment on lines +62 to +65
const comments = (await githubRequest(
`/repos/${repository}/issues/${pullRequestNumber}/comments?per_page=100`,
)) as Array<{ id: number; body?: string }>;
const existing = comments.find((comment) => comment.body?.includes(marker));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The comment search is capped at 100 with no pagination. On any active PR with more than 100 comments the marker lookup silently returns nothing, so every subsequent benchmark run posts a brand-new comment instead of updating the existing one.

Suggested change
const comments = (await githubRequest(
`/repos/${repository}/issues/${pullRequestNumber}/comments?per_page=100`,
)) as Array<{ id: number; body?: string }>;
const existing = comments.find((comment) => comment.body?.includes(marker));
async function fetchAllComments(
repository: string,
pullRequestNumber: number,
): Promise<Array<{ id: number; body?: string }>> {
const all: Array<{ id: number; body?: string }> = [];
let page = 1;
while (true) {
const batch = (await githubRequest(
`/repos/${repository}/issues/${pullRequestNumber}/comments?per_page=100&page=${page}`,
)) as Array<{ id: number; body?: string }>;
all.push(...batch);
if (batch.length < 100) break;
page += 1;
}
return all;
}
const comments = await fetchAllComments(repository, pullRequestNumber);
const existing = comments.find((comment) => comment.body?.includes(marker));
Prompt To Fix With AI
This is a comment left during a code review.
Path: benchmarks/comment-pr.ts
Line: 62-65

Comment:
The comment search is capped at 100 with no pagination. On any active PR with more than 100 comments the marker lookup silently returns nothing, so every subsequent benchmark run posts a brand-new comment instead of updating the existing one.

```suggestion
async function fetchAllComments(
  repository: string,
  pullRequestNumber: number,
): Promise<Array<{ id: number; body?: string }>> {
  const all: Array<{ id: number; body?: string }> = [];
  let page = 1;
  while (true) {
    const batch = (await githubRequest(
      `/repos/${repository}/issues/${pullRequestNumber}/comments?per_page=100&page=${page}`,
    )) as Array<{ id: number; body?: string }>;
    all.push(...batch);
    if (batch.length < 100) break;
    page += 1;
  }
  return all;
}

const comments = await fetchAllComments(repository, pullRequestNumber);
const existing = comments.find((comment) => comment.body?.includes(marker));
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the PR comment updater now paginates issue comments until all pages are fetched before searching for the benchmark marker.

This comment was generated by Pi using OpenAI GPT-5

Comment on lines +16 to +18
comparable: boolean;
informational?: boolean;
source: string;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The informational field is declared in BenchmarkMetricResult but aggregateMetric never sets it, so it is always undefined in every serialized artifact and in every comparison row. The metric.informational branch in compareMetric is therefore dead code — it can never flip a metric to "informational" status. Either populate the field in aggregateMetric or remove it from the interface to avoid misleading future readers.

Suggested change
comparable: boolean;
informational?: boolean;
source: string;
comparable: boolean;
source: string;
Prompt To Fix With AI
This is a comment left during a code review.
Path: benchmarks/lib/benchmark-result.ts
Line: 16-18

Comment:
The `informational` field is declared in `BenchmarkMetricResult` but `aggregateMetric` never sets it, so it is always `undefined` in every serialized artifact and in every comparison row. The `metric.informational` branch in `compareMetric` is therefore dead code — it can never flip a metric to `"informational"` status. Either populate the field in `aggregateMetric` or remove it from the interface to avoid misleading future readers.

```suggestion
  comparable: boolean;
  source: string;
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — removed the unused informational field and the dead comparison branch so informational status is derived from comparable: false.

This comment was generated by Pi using OpenAI GPT-5

@github-actions

github-actions Bot commented May 29, 2026

Copy link
Copy Markdown

Hunk benchmark results

❌ One or more benchmarks regressed beyond the configured threshold.

Base: e453a75c3a5f · Head: 9f2c01529465

Key Hunk benchmarks

Benchmark Base median PR median Delta Threshold Status
bootstrap-load/file_pair_bootstrap_ms 45.82 ms 46.52 ms +0.70 ms (+1.5%) +15% and +5.00 ms
bootstrap-load/git_bootstrap_ms 102.0 ms 105.0 ms +3.09 ms (+3.0%) +15% and +5.00 ms
changeset-parse/balanced_changeset_parse_patch_ms 16.34 ms 16.72 ms +0.38 ms (+2.3%) +15% and +5.00 ms
changeset-parse/large_single_file_parse_patch_ms 6.17 ms 6.29 ms +0.12 ms (+1.9%) +15% and +5.00 ms
changeset-parse/many_small_files_parse_patch_ms 11.47 ms 11.30 ms -0.17 ms (-1.5%) +15% and +5.00 ms
highlight-prefetch/next_file_ready_ms 110.7 ms 303.2 ms +192.4 ms (+173.8%) +15% and +5.00 ms
highlight-prefetch/selected_startup_ms 51.28 ms 57.38 ms +6.10 ms (+11.9%) +15% and +5.00 ms
large-stream/cold_first_frame_ms 104.3 ms 124.5 ms +20.24 ms (+19.4%) +15% and +5.00 ms
large-stream/warm_first_frame_ms 76.25 ms 87.46 ms +11.21 ms (+14.7%) +15% and +5.00 ms
large-stream/windowed_scroll_ticks_ms 6358.5 ms 6418.9 ms +60.40 ms (+0.9%) +15% and +5.00 ms
render-layout/balanced_stream_review_plan_ms 14.91 ms 14.49 ms -0.42 ms (-2.8%) +15% and +5.00 ms
render-layout/large_single_file_geometry_ms 34.10 ms 40.91 ms +6.81 ms (+20.0%) +15% and +5.00 ms
render-layout/large_single_file_review_plan_ms 27.58 ms 21.96 ms -5.62 ms (-20.4%) +15% and +5.00 ms
render-layout/many_small_files_review_plan_ms 10.97 ms 11.13 ms +0.16 ms (+1.5%) +15% and +5.00 ms
working-tree-load/large_worktree_load_ms 180.8 ms 182.2 ms +1.46 ms (+0.8%) +15% and +5.00 ms
working-tree-load/medium_worktree_load_ms 54.89 ms 56.52 ms +1.63 ms (+3.0%) +15% and +5.00 ms
working-tree-load/small_worktree_load_ms 17.12 ms 19.26 ms +2.14 ms (+12.5%) +15% and +5.00 ms
working-tree-load/untracked_few_large_load_ms 90.33 ms 87.54 ms -2.79 ms (-3.1%) +15% and +5.00 ms
working-tree-load/untracked_many_small_load_ms 203.4 ms 204.3 ms +0.89 ms (+0.4%) +15% and +5.00 ms

20 additional comparable Hunk metrics were checked but hidden to keep this comment readable. See the workflow artifacts for full JSON and text output.

Raw JSON and text logs are available in the benchmark workflow artifacts.

@benvinegar

Copy link
Copy Markdown
Member Author

Closing this for now. The benchmark harness is useful, but running it as PR CI is still too noisy/heavy; we'll keep the idea as local benchmark coverage instead.\n\nThis comment was generated by Pi using OpenAI GPT-5

@benvinegar benvinegar closed this May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant