ci: gate releases on benchmark regressions by benvinegar · Pull Request #429 · modem-dev/hunk

benvinegar · 2026-06-13T17:04:35Z

Summary

Add bench:release and bench:release:compare scripts for versioned benchmark snapshots in benchmarks/release/bench-<version>.json.
Add a release benchmark comparison gate that fails only on material regressions using metric-specific relative + absolute thresholds.
Wire the release workflow so binary builds and publishing wait for the benchmark gate, with an explicit manual override reason for workflow-dispatch releases.
Document release benchmark prep/backfill expectations.

Notes

This PR does not commit generated baseline snapshots. Before the next release, backfill at least one prior stable release benchmark and commit the new release snapshot with the release-prep change.

Tests

bun run typecheck
bun run format:check
bun run lint
bun test ./scripts/compare-release-benchmarks.test.ts
bun test ./scripts ./benchmarks/lib
bun test

This PR description was generated by Pi using OpenAI GPT-5

greptile-apps · 2026-06-13T17:09:24Z

Greptile Summary

This PR adds a release benchmark gate to the existing CI workflow: committed versioned snapshots in benchmarks/release/ are compared against the previous stable release, and the binary build + publish jobs are blocked if a material regression is detected. A manual workflow_dispatch override is available but requires an explicit written reason.

New scripts (run-release-benchmark.ts, compare-release-benchmarks.ts) handle snapshot generation and dual-threshold regression detection (relative ratio + absolute floor), with the comparison logic fully unit-tested.
Workflow change inserts a release-benchmark-gate job ahead of build-binaries; for tag-push events the gate is a hard block, while workflow_dispatch releases can be manually overridden only when a non-empty reason string is provided.
Data model additions enrich BenchmarkRunResult with optional packageVersion and runtime fields for auditability.

Confidence Score: 4/5

Safe to merge; the gate is additive and will not affect any release until a benchmark snapshot is committed.

The core comparison logic is well-tested and the workflow wiring looks correct. Two minor issues exist: an argument-order dependency in run-release-benchmark.ts where --version after --out silently discards the custom path, and the threshold column in the CI markdown table omitting the unit from the absolute value. Neither affects the gate's ability to block real regressions in normal usage.

scripts/run-release-benchmark.ts (--version/--out ordering) and scripts/compare-release-benchmarks.ts (formatThreshold unit display)

Important Files Changed

Filename	Overview
scripts/compare-release-benchmarks.ts	Core comparison logic — version parsing, regression detection, and markdown report. Logic is correct. Minor display issue: formatThreshold omits the unit from the absolute part.
scripts/run-release-benchmark.ts	Thin wrapper that spawns benchmarks/run.ts and writes the versioned snapshot. Latent arg-order bug: --version after --out silently overwrites the explicit output path.
.github/workflows/release-prebuilt-npm.yml	Adds release-benchmark-gate job with correct continue-on-error expression, manual-override reason guard, and proper needs wiring to build-binaries.
scripts/compare-release-benchmarks.test.ts	Tests cover version selection, material-regression threshold, pass/fail comparisons, and missing-base/missing-head edge cases. All cases look correct.
benchmarks/lib/benchmark-result.ts	Adds BenchmarkRuntimeInfo and optional packageVersion/runtime fields to BenchmarkRunResult; adds BenchmarkComparisonRow/Result interfaces. Non-breaking additive change.
benchmarks/run.ts	Enriches the benchmark output with packageVersion and runtime info. Straightforward async top-level await additions.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[push v* tag / workflow_dispatch] --> B[release-benchmark-gate]
    B --> C{bench:release:compare}
    C -- head file missing --> D[❌ Throw: missing head snapshot]
    C -- no previous baseline --> E[❌ Throw: no baseline]
    C -- comparison done --> F{comparison.failed?}
    F -- No --> G[✅ Gate passes]
    F -- Yes + push event --> H[❌ Job fails — release blocked]
    F -- Yes + workflow_dispatch\nallow_benchmark_regression=true --> I{reason provided?}
    I -- No --> J[❌ Require reason step fails]
    I -- Yes --> K[✅ Override accepted, summary updated]
    G --> L[build-binaries]
    K --> L
    L --> M[stage-release]
    M --> N[publish]

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
scripts/run-release-benchmark.ts:41-52
**`--version` silently clobbers an explicit `--out`**

When `--version` is parsed, `options.out` is unconditionally overwritten with the computed release path. If a caller passes `--out custom/path.json --version 0.16.0`, the `--out` value is discarded and the file goes to `benchmarks/release/bench-0.16.0.json` instead. Only the reverse order (`--version` first, then `--out`) works correctly. A simple guard — only update `options.out` from `--version` when `--out` has not yet been explicitly set — would fix the ordering dependency.

### Issue 2 of 2
scripts/compare-release-benchmarks.ts:282-292
**Absolute threshold displayed without its unit**

`formatThreshold` formats the absolute part with `formatNumber` but never appends the metric unit, so a timing threshold renders as `+15% and +5.00` instead of `+15% and +5ms`, and the memory threshold becomes `+20% and +8388608.0` instead of `+20% and +8MiB`. Reviewers scanning the CI summary table won't immediately know whether the absolute value is in milliseconds or raw bytes, which defeats the column's purpose.

_{Reviews (1): Last reviewed commit: "ci: gate releases on benchmark regressio..." | Re-trigger Greptile}

greptile-apps · 2026-06-13T17:09:27Z

+      index += 1;
+      continue;
+    }
+
+    if (arg === "--samples") {
+      options.samples = Number(readArgValue(args, index));
+      index += 1;
+      continue;
+    }
+
+    if (arg === "--out") {
+      options.out = path.resolve(readArgValue(args, index));


--version silently clobbers an explicit --out

When --version is parsed, options.out is unconditionally overwritten with the computed release path. If a caller passes --out custom/path.json --version 0.16.0, the --out value is discarded and the file goes to benchmarks/release/bench-0.16.0.json instead. Only the reverse order (--version first, then --out) works correctly. A simple guard — only update options.out from --version when --out has not yet been explicitly set — would fix the ordering dependency.

Prompt To Fix With AI

This is a comment left during a code review. Path: scripts/run-release-benchmark.ts Line: 41-52 Comment: **`--version` silently clobbers an explicit `--out`** When `--version` is parsed, `options.out` is unconditionally overwritten with the computed release path. If a caller passes `--out custom/path.json --version 0.16.0`, the `--out` value is discarded and the file goes to `benchmarks/release/bench-0.16.0.json` instead. Only the reverse order (`--version` first, then `--out`) works correctly. A simple guard — only update `options.out` from `--version` when `--out` has not yet been explicitly set — would fix the ordering dependency. How can I resolve this? If you propose a fix, please make it concise.

Fixed by preserving an explicitly provided --out path even when --version appears later, and added a regression test for that argument ordering.

_{Responded by Pi using OpenAI GPT-5.}

greptile-apps · 2026-06-13T17:09:28Z

+  return unit === "bytes" ? "B" : unit;
+}
+
+function formatThreshold(threshold: BenchmarkThreshold | undefined) {
+  if (!threshold) {
+    return "—";
+  }
+
+  return `+${((threshold.maxRegressionRatio - 1) * 100).toFixed(0)}% and +${formatNumber(
+    threshold.minAbsoluteRegression,
+  )}`;


Absolute threshold displayed without its unit

formatThreshold formats the absolute part with formatNumber but never appends the metric unit, so a timing threshold renders as +15% and +5.00 instead of +15% and +5ms, and the memory threshold becomes +20% and +8388608.0 instead of +20% and +8MiB. Reviewers scanning the CI summary table won't immediately know whether the absolute value is in milliseconds or raw bytes, which defeats the column's purpose.

Prompt To Fix With AI

This is a comment left during a code review. Path: scripts/compare-release-benchmarks.ts Line: 282-292 Comment: **Absolute threshold displayed without its unit** `formatThreshold` formats the absolute part with `formatNumber` but never appends the metric unit, so a timing threshold renders as `+15% and +5.00` instead of `+15% and +5ms`, and the memory threshold becomes `+20% and +8388608.0` instead of `+20% and +8MiB`. Reviewers scanning the CI summary table won't immediately know whether the absolute value is in milliseconds or raw bytes, which defeats the column's purpose. How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fixed by rendering threshold absolute values with units in the markdown summary: milliseconds for timing thresholds and MiB for byte thresholds, with test coverage.

_{Responded by Pi using OpenAI GPT-5.}

greptile-apps Bot reviewed Jun 13, 2026

View reviewed changes

benvinegar added 2 commits June 13, 2026 13:12

ci: gate releases on benchmark regressions

fe57dc0

fix: polish release benchmark gate output

cb87d6d

benvinegar force-pushed the feat/release-benchmark-gate branch from f6a6772 to cb87d6d Compare June 13, 2026 17:13

benvinegar merged commit 1ad5d36 into main Jun 13, 2026
10 of 11 checks passed

benvinegar deleted the feat/release-benchmark-gate branch June 13, 2026 17:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci: gate releases on benchmark regressions#429

ci: gate releases on benchmark regressions#429
benvinegar merged 2 commits into
mainfrom
feat/release-benchmark-gate

benvinegar commented Jun 13, 2026

Uh oh!

greptile-apps Bot commented Jun 13, 2026

Uh oh!

greptile-apps Bot Jun 13, 2026

Uh oh!

benvinegar Jun 13, 2026

Uh oh!

greptile-apps Bot Jun 13, 2026

Uh oh!

benvinegar Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

benvinegar commented Jun 13, 2026

Summary

Notes

Tests

Uh oh!

greptile-apps Bot commented Jun 13, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

benvinegar Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

benvinegar Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant