feat: add CI performance benchmark suite by benvinegar · Pull Request #384 · modem-dev/hunk

benvinegar · 2026-05-29T22:53:56Z

Summary

add a shared benchmark runner with JSON output, median/p95 aggregation, thresholds, and PR-vs-main comparison
add benchmark coverage for startup/loading, patch parsing, render/layout planning, highlight prefetch, large-stream rendering/scrolling, memory, and optional competitor tools
update the benchmark workflow to run on PRs, post/update a curated PR comment, upload raw artifacts, and fail on meaningful regressions

Validation

bun run format:check
bun run typecheck
bun run lint
bun test
bun run bench -- --samples 3 --include-competitors --out benchmarks/results/local.json

This PR description was generated by Pi using OpenAI GPT-5

greptile-apps · 2026-05-29T22:59:06Z

Greptile Summary

This PR introduces a comprehensive CI benchmark suite covering startup, patch parsing, render/layout planning, highlight prefetch, large-stream rendering, memory, and optional competitor comparisons. It adds a shared runner (run.ts) with JSON output and median/p95 aggregation, a comparison script that enforces regression thresholds, and a GitHub Actions workflow that runs head and base in sibling worktrees, posts a curated PR comment, and gates merges on meaningful regressions.

New benchmark scripts (working-tree-load, changeset-parse, render-layout, memory, competitors) expand coverage well beyond the prior three scripts; fixtures are deterministic and cleaned up via try/finally.
classifyMetric in benchmark-result.ts has an unreachable competitor_ branch — all competitor timing metrics match the earlier _ms check, so they get comparable: true with a 15% regression threshold in the serialized JSON; the behavior is currently masked by a name-match guard in compareMetric but the JSON artifacts are misleading and the setup is fragile.
comment-pr.ts fetches only the first 100 PR comments with no pagination, so the marker lookup silently misses the existing benchmark comment on active PRs and posts a duplicate on every run.

Confidence Score: 3/5

Safe to merge after the two functional fixes; the pagination bug will silently flood active PRs with duplicate benchmark comments from day one, and the misclassification in classifyMetric embeds incorrect metadata in every JSON artifact.

The pagination bug in comment-pr.ts will create a new benchmark comment on every CI run for any PR with more than 100 existing comments, rather than updating the existing one — a silent but immediately visible problem for busy PRs. The classifyMetric ordering issue means every competitor timing metric is written to JSON artifacts as comparable: true with a regression threshold, which is only papered over by a defensive name-check in compareMetric; if that guard is ever simplified the CI gate for competitor metrics would break unexpectedly.

benchmarks/lib/benchmark-result.ts (classifyMetric branch ordering) and benchmarks/comment-pr.ts (comment pagination) need fixes before the benchmark infrastructure can be relied on.

Important Files Changed

Filename	Overview
benchmarks/lib/benchmark-result.ts	Core types and aggregation utilities; `classifyMetric` has an unreachable `competitor_` branch (dead code causing incorrect `comparable: true` in JSON artifacts) and `informational` field is never populated.
benchmarks/comment-pr.ts	GitHub PR comment upsert script; comment search is capped at 100 with no pagination — duplicate comments will be posted on PRs with >100 comments.
benchmarks/compare.ts	Comparison and markdown generation; logic is sound but relies on `metric.name.includes("competitor_")` as a belt-and-suspenders fix for the misclassification in `classifyMetric`.
benchmarks/run.ts	Orchestration runner; spawns each script N times, aggregates samples, and writes JSON. Straightforward and correct.
.github/workflows/benchmarks.yml	CI workflow; head+base runs, comparison, PR comment, and regression gate. `fetch-depth: 0` is applied on all triggers (push, manual) not just PRs, but this is minor.
benchmarks/lib/fixtures.ts	Deterministic synthetic fixture helpers; creates temp git repos and unified patches. Cleanup is guarded by `try/finally` where it matters.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[push / pull_request / workflow_dispatch] --> B[Checkout full history]
    B --> C[bun install]
    C --> D[Run head benchmarks\nbun run bench --samples N --out head.json]
    D --> E{PR event?}
    E -- No --> F[Publish step summary\nhead.txt]
    E -- Yes --> G[git worktree add origin/main\ncp benchmarks → base worktree\nbun run benchmarks/run.ts --out base.json]
    G --> H[Compare head vs base\nbench:compare → comparison.json + summary.md]
    H --> I[Publish step summary\nsummary.md]
    I --> J[Comment benchmark summary on PR\nbench:comment-pr --body summary.md]
    J --> K{compare step failed?}
    K -- Yes --> L[exit 1 — block merge]
    K -- No --> M[Upload artifacts\nbenchmark-results/*]
    F --> M
    L --> M

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
benchmarks/lib/benchmark-result.ts:68-98
The `startsWith("competitor_")` guard is unreachable dead code: every competitor timing metric ends with `_ms` and is caught by the first branch, giving it `comparable: true` with a 15% regression threshold. The intent of `comparable: false` for competitors is only saved by the `metric.name.includes("competitor_")` fall-through in `compareMetric`. If that guard is ever removed or refactored, competitor regressions would start failing CI. The fix is to check the competitor prefix before the generic `_ms` check.

```suggestion
  if (name.startsWith("competitor_")) {
    return { unit: "ms", comparable: false };
  }

  if (name.endsWith("_ms")) {
    return {
      unit: "ms",
      comparable: true,
      threshold: { maxRegressionRatio: 1.15, minAbsoluteRegression: 5 },
    };
  }

  if (
    name.startsWith("is_") ||
    name.endsWith("_ready_before_move") ||
    name.endsWith("_available")
  ) {
    return { unit: "boolean", comparable: false };
  }

  if (name.includes("rss") || name.includes("heap")) {
    return {
      unit: "bytes",
      comparable: true,
      threshold: { maxRegressionRatio: 1.2, minAbsoluteRegression: 8 * 1024 * 1024 },
    };
  }

  if (name.endsWith("_bytes")) {
    return { unit: "bytes", comparable: false };
  }
```

### Issue 2 of 3
benchmarks/comment-pr.ts:62-65
The comment search is capped at 100 with no pagination. On any active PR with more than 100 comments the marker lookup silently returns nothing, so every subsequent benchmark run posts a brand-new comment instead of updating the existing one.

```suggestion
async function fetchAllComments(
  repository: string,
  pullRequestNumber: number,
): Promise<Array<{ id: number; body?: string }>> {
  const all: Array<{ id: number; body?: string }> = [];
  let page = 1;
  while (true) {
    const batch = (await githubRequest(
      `/repos/${repository}/issues/${pullRequestNumber}/comments?per_page=100&page=${page}`,
    )) as Array<{ id: number; body?: string }>;
    all.push(...batch);
    if (batch.length < 100) break;
    page += 1;
  }
  return all;
}

const comments = await fetchAllComments(repository, pullRequestNumber);
const existing = comments.find((comment) => comment.body?.includes(marker));
```

### Issue 3 of 3
benchmarks/lib/benchmark-result.ts:16-18
The `informational` field is declared in `BenchmarkMetricResult` but `aggregateMetric` never sets it, so it is always `undefined` in every serialized artifact and in every comparison row. The `metric.informational` branch in `compareMetric` is therefore dead code — it can never flip a metric to `"informational"` status. Either populate the field in `aggregateMetric` or remove it from the interface to avoid misleading future readers.

```suggestion
  comparable: boolean;
  source: string;
```

_{Reviews (1): Last reviewed commit: "feat: add CI performance benchmark suite" | Re-trigger Greptile}

greptile-apps · 2026-05-29T22:59:10Z

+  if (name.endsWith("_ms")) {
+    return {
+      unit: "ms",
+      comparable: true,
+      threshold: { maxRegressionRatio: 1.15, minAbsoluteRegression: 5 },
+    };
+  }
+
+  if (
+    name.startsWith("is_") ||
+    name.endsWith("_ready_before_move") ||
+    name.endsWith("_available")
+  ) {
+    return { unit: "boolean", comparable: false };
+  }
+
+  if (name.includes("rss") || name.includes("heap")) {
+    return {
+      unit: "bytes",
+      comparable: true,
+      threshold: { maxRegressionRatio: 1.2, minAbsoluteRegression: 8 * 1024 * 1024 },
+    };
+  }
+
+  if (name.endsWith("_bytes")) {
+    return { unit: "bytes", comparable: false };
+  }
+
+  if (name.startsWith("competitor_")) {
+    return { unit: "ms", comparable: false };
+  }


The startsWith("competitor_") guard is unreachable dead code: every competitor timing metric ends with _ms and is caught by the first branch, giving it comparable: true with a 15% regression threshold. The intent of comparable: false for competitors is only saved by the metric.name.includes("competitor_") fall-through in compareMetric. If that guard is ever removed or refactored, competitor regressions would start failing CI. The fix is to check the competitor prefix before the generic _ms check.

Suggested change

if (name.endsWith("_ms")) {

return {

unit: "ms",

comparable: true,

threshold: { maxRegressionRatio: 1.15, minAbsoluteRegression: 5 },

};

}

if (

name.startsWith("is_") ||

name.endsWith("_ready_before_move") ||

name.endsWith("_available")

) {

return { unit: "boolean", comparable: false };

}

if (name.includes("rss") || name.includes("heap")) {

return {

unit: "bytes",

comparable: true,

threshold: { maxRegressionRatio: 1.2, minAbsoluteRegression: 8 * 1024 * 1024 },

};

}

if (name.endsWith("_bytes")) {

return { unit: "bytes", comparable: false };

}

if (name.startsWith("competitor_")) {

return { unit: "ms", comparable: false };

}

if (name.startsWith("competitor_")) {

return { unit: "ms", comparable: false };

}

if (name.endsWith("_ms")) {

return {

unit: "ms",

comparable: true,

threshold: { maxRegressionRatio: 1.15, minAbsoluteRegression: 5 },

};

}

if (

name.startsWith("is_") ||

name.endsWith("_ready_before_move") ||

name.endsWith("_available")

) {

return { unit: "boolean", comparable: false };

}

if (name.includes("rss") || name.includes("heap")) {

return {

unit: "bytes",

comparable: true,

threshold: { maxRegressionRatio: 1.2, minAbsoluteRegression: 8 * 1024 * 1024 },

};

}

if (name.endsWith("_bytes")) {

return { unit: "bytes", comparable: false };

}

Prompt To Fix With AI

This is a comment left during a code review. Path: benchmarks/lib/benchmark-result.ts Line: 68-98 Comment: The `startsWith("competitor_")` guard is unreachable dead code: every competitor timing metric ends with `_ms` and is caught by the first branch, giving it `comparable: true` with a 15% regression threshold. The intent of `comparable: false` for competitors is only saved by the `metric.name.includes("competitor_")` fall-through in `compareMetric`. If that guard is ever removed or refactored, competitor regressions would start failing CI. The fix is to check the competitor prefix before the generic `_ms` check. ```suggestion if (name.startsWith("competitor_")) { return { unit: "ms", comparable: false }; } if (name.endsWith("_ms")) { return { unit: "ms", comparable: true, threshold: { maxRegressionRatio: 1.15, minAbsoluteRegression: 5 }, }; } if ( name.startsWith("is_") || name.endsWith("_ready_before_move") || name.endsWith("_available") ) { return { unit: "boolean", comparable: false }; } if (name.includes("rss") || name.includes("heap")) { return { unit: "bytes", comparable: true, threshold: { maxRegressionRatio: 1.2, minAbsoluteRegression: 8 * 1024 * 1024 }, }; } if (name.endsWith("_bytes")) { return { unit: "bytes", comparable: false }; } ``` How can I resolve this? If you propose a fix, please make it concise.

Fixed — competitor metrics are now classified before generic _ms metrics, so optional competitor timings stay non-comparable without relying on the comparison fallback.

This comment was generated by Pi using OpenAI GPT-5

greptile-apps · 2026-05-29T22:59:11Z

+const comments = (await githubRequest(
+  `/repos/${repository}/issues/${pullRequestNumber}/comments?per_page=100`,
+)) as Array<{ id: number; body?: string }>;
+const existing = comments.find((comment) => comment.body?.includes(marker));


The comment search is capped at 100 with no pagination. On any active PR with more than 100 comments the marker lookup silently returns nothing, so every subsequent benchmark run posts a brand-new comment instead of updating the existing one.

Suggested change

const comments = (await githubRequest(

`/repos/${repository}/issues/${pullRequestNumber}/comments?per_page=100`,

)) as Array<{ id: number; body?: string }>;

const existing = comments.find((comment) => comment.body?.includes(marker));

async function fetchAllComments(

repository: string,

pullRequestNumber: number,

): Promise<Array<{ id: number; body?: string }>> {

const all: Array<{ id: number; body?: string }> = [];

let page = 1;

while (true) {

const batch = (await githubRequest(

`/repos/${repository}/issues/${pullRequestNumber}/comments?per_page=100&page=${page}`,

)) as Array<{ id: number; body?: string }>;

all.push(...batch);

if (batch.length < 100) break;

page += 1;

}

return all;

}

const comments = await fetchAllComments(repository, pullRequestNumber);

const existing = comments.find((comment) => comment.body?.includes(marker));

Prompt To Fix With AI

This is a comment left during a code review. Path: benchmarks/comment-pr.ts Line: 62-65 Comment: The comment search is capped at 100 with no pagination. On any active PR with more than 100 comments the marker lookup silently returns nothing, so every subsequent benchmark run posts a brand-new comment instead of updating the existing one. ```suggestion async function fetchAllComments( repository: string, pullRequestNumber: number, ): Promise<Array<{ id: number; body?: string }>> { const all: Array<{ id: number; body?: string }> = []; let page = 1; while (true) { const batch = (await githubRequest( `/repos/${repository}/issues/${pullRequestNumber}/comments?per_page=100&page=${page}`, )) as Array<{ id: number; body?: string }>; all.push(...batch); if (batch.length < 100) break; page += 1; } return all; } const comments = await fetchAllComments(repository, pullRequestNumber); const existing = comments.find((comment) => comment.body?.includes(marker)); ``` How can I resolve this? If you propose a fix, please make it concise.

Fixed — the PR comment updater now paginates issue comments until all pages are fetched before searching for the benchmark marker.

This comment was generated by Pi using OpenAI GPT-5

greptile-apps · 2026-05-29T22:59:12Z

+  comparable: boolean;
+  informational?: boolean;
+  source: string;


The informational field is declared in BenchmarkMetricResult but aggregateMetric never sets it, so it is always undefined in every serialized artifact and in every comparison row. The metric.informational branch in compareMetric is therefore dead code — it can never flip a metric to "informational" status. Either populate the field in aggregateMetric or remove it from the interface to avoid misleading future readers.

Suggested change

comparable: boolean;

informational?: boolean;

source: string;

comparable: boolean;

source: string;

Prompt To Fix With AI

This is a comment left during a code review. Path: benchmarks/lib/benchmark-result.ts Line: 16-18 Comment: The `informational` field is declared in `BenchmarkMetricResult` but `aggregateMetric` never sets it, so it is always `undefined` in every serialized artifact and in every comparison row. The `metric.informational` branch in `compareMetric` is therefore dead code — it can never flip a metric to `"informational"` status. Either populate the field in `aggregateMetric` or remove it from the interface to avoid misleading future readers. ```suggestion comparable: boolean; source: string; ``` How can I resolve this? If you propose a fix, please make it concise.

Fixed — removed the unused informational field and the dead comparison branch so informational status is derived from comparable: false.

This comment was generated by Pi using OpenAI GPT-5

github-actions · 2026-05-29T23:17:27Z

Hunk benchmark results

❌ One or more benchmarks regressed beyond the configured threshold.

Base: e453a75c3a5f · Head: 9f2c01529465

Key Hunk benchmarks

Benchmark	Base median	PR median	Delta	Threshold	Status
bootstrap-load/file_pair_bootstrap_ms	45.82 ms	46.52 ms	+0.70 ms (+1.5%)	+15% and +5.00 ms	✅
bootstrap-load/git_bootstrap_ms	102.0 ms	105.0 ms	+3.09 ms (+3.0%)	+15% and +5.00 ms	✅
changeset-parse/balanced_changeset_parse_patch_ms	16.34 ms	16.72 ms	+0.38 ms (+2.3%)	+15% and +5.00 ms	✅
changeset-parse/large_single_file_parse_patch_ms	6.17 ms	6.29 ms	+0.12 ms (+1.9%)	+15% and +5.00 ms	✅
changeset-parse/many_small_files_parse_patch_ms	11.47 ms	11.30 ms	-0.17 ms (-1.5%)	+15% and +5.00 ms	✅
highlight-prefetch/next_file_ready_ms	110.7 ms	303.2 ms	+192.4 ms (+173.8%)	+15% and +5.00 ms	❌
highlight-prefetch/selected_startup_ms	51.28 ms	57.38 ms	+6.10 ms (+11.9%)	+15% and +5.00 ms	✅
large-stream/cold_first_frame_ms	104.3 ms	124.5 ms	+20.24 ms (+19.4%)	+15% and +5.00 ms	❌
large-stream/warm_first_frame_ms	76.25 ms	87.46 ms	+11.21 ms (+14.7%)	+15% and +5.00 ms	✅
large-stream/windowed_scroll_ticks_ms	6358.5 ms	6418.9 ms	+60.40 ms (+0.9%)	+15% and +5.00 ms	✅
render-layout/balanced_stream_review_plan_ms	14.91 ms	14.49 ms	-0.42 ms (-2.8%)	+15% and +5.00 ms	✅
render-layout/large_single_file_geometry_ms	34.10 ms	40.91 ms	+6.81 ms (+20.0%)	+15% and +5.00 ms	❌
render-layout/large_single_file_review_plan_ms	27.58 ms	21.96 ms	-5.62 ms (-20.4%)	+15% and +5.00 ms	✅
render-layout/many_small_files_review_plan_ms	10.97 ms	11.13 ms	+0.16 ms (+1.5%)	+15% and +5.00 ms	✅
working-tree-load/large_worktree_load_ms	180.8 ms	182.2 ms	+1.46 ms (+0.8%)	+15% and +5.00 ms	✅
working-tree-load/medium_worktree_load_ms	54.89 ms	56.52 ms	+1.63 ms (+3.0%)	+15% and +5.00 ms	✅
working-tree-load/small_worktree_load_ms	17.12 ms	19.26 ms	+2.14 ms (+12.5%)	+15% and +5.00 ms	✅
working-tree-load/untracked_few_large_load_ms	90.33 ms	87.54 ms	-2.79 ms (-3.1%)	+15% and +5.00 ms	✅
working-tree-load/untracked_many_small_load_ms	203.4 ms	204.3 ms	+0.89 ms (+0.4%)	+15% and +5.00 ms	✅

20 additional comparable Hunk metrics were checked but hidden to keep this comment readable. See the workflow artifacts for full JSON and text output.

Raw JSON and text logs are available in the benchmark workflow artifacts.

benvinegar · 2026-05-30T00:58:05Z

Closing this for now. The benchmark harness is useful, but running it as PR CI is still too noisy/heavy; we'll keep the idea as local benchmark coverage instead.\n\nThis comment was generated by Pi using OpenAI GPT-5

feat: add CI performance benchmark suite

59e9a8e

greptile-apps Bot reviewed May 29, 2026

View reviewed changes

benvinegar added 2 commits May 29, 2026 19:02

perf: trim default benchmark suite

ca0092b

fix: harden benchmark metadata handling

16e0f80

benvinegar added 2 commits May 29, 2026 20:39

ci: speed up pull request benchmarks

5e816a5

ci: make PR benchmarks non-blocking

fe94ea1

benvinegar closed this May 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add CI performance benchmark suite#384

feat: add CI performance benchmark suite#384
benvinegar wants to merge 5 commits into
mainfrom
feat/ci-performance-benchmarks

benvinegar commented May 29, 2026

Uh oh!

greptile-apps Bot commented May 29, 2026

Uh oh!

greptile-apps Bot May 29, 2026

Uh oh!

benvinegar May 29, 2026

Uh oh!

greptile-apps Bot May 29, 2026

Uh oh!

benvinegar May 29, 2026

Uh oh!

greptile-apps Bot May 29, 2026

Uh oh!

benvinegar May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026 •

edited

Loading

Uh oh!

benvinegar commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

benvinegar commented May 29, 2026

Summary

Validation

Uh oh!

greptile-apps Bot commented May 29, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

benvinegar May 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

benvinegar May 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

benvinegar May 29, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Hunk benchmark results

Key Hunk benchmarks

Uh oh!

benvinegar commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 29, 2026 •

edited

Loading