Skip to content

ci: gate releases on benchmark regressions#429

Merged
benvinegar merged 2 commits into
mainfrom
feat/release-benchmark-gate
Jun 13, 2026
Merged

ci: gate releases on benchmark regressions#429
benvinegar merged 2 commits into
mainfrom
feat/release-benchmark-gate

Conversation

@benvinegar

Copy link
Copy Markdown
Member

Summary

  • Add bench:release and bench:release:compare scripts for versioned benchmark snapshots in benchmarks/release/bench-<version>.json.
  • Add a release benchmark comparison gate that fails only on material regressions using metric-specific relative + absolute thresholds.
  • Wire the release workflow so binary builds and publishing wait for the benchmark gate, with an explicit manual override reason for workflow-dispatch releases.
  • Document release benchmark prep/backfill expectations.

Notes

This PR does not commit generated baseline snapshots. Before the next release, backfill at least one prior stable release benchmark and commit the new release snapshot with the release-prep change.

Tests

  • bun run typecheck
  • bun run format:check
  • bun run lint
  • bun test ./scripts/compare-release-benchmarks.test.ts
  • bun test ./scripts ./benchmarks/lib
  • bun test

This PR description was generated by Pi using OpenAI GPT-5

@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a release benchmark gate to the existing CI workflow: committed versioned snapshots in benchmarks/release/ are compared against the previous stable release, and the binary build + publish jobs are blocked if a material regression is detected. A manual workflow_dispatch override is available but requires an explicit written reason.

  • New scripts (run-release-benchmark.ts, compare-release-benchmarks.ts) handle snapshot generation and dual-threshold regression detection (relative ratio + absolute floor), with the comparison logic fully unit-tested.
  • Workflow change inserts a release-benchmark-gate job ahead of build-binaries; for tag-push events the gate is a hard block, while workflow_dispatch releases can be manually overridden only when a non-empty reason string is provided.
  • Data model additions enrich BenchmarkRunResult with optional packageVersion and runtime fields for auditability.

Confidence Score: 4/5

Safe to merge; the gate is additive and will not affect any release until a benchmark snapshot is committed.

The core comparison logic is well-tested and the workflow wiring looks correct. Two minor issues exist: an argument-order dependency in run-release-benchmark.ts where --version after --out silently discards the custom path, and the threshold column in the CI markdown table omitting the unit from the absolute value. Neither affects the gate's ability to block real regressions in normal usage.

scripts/run-release-benchmark.ts (--version/--out ordering) and scripts/compare-release-benchmarks.ts (formatThreshold unit display)

Important Files Changed

Filename Overview
scripts/compare-release-benchmarks.ts Core comparison logic — version parsing, regression detection, and markdown report. Logic is correct. Minor display issue: formatThreshold omits the unit from the absolute part.
scripts/run-release-benchmark.ts Thin wrapper that spawns benchmarks/run.ts and writes the versioned snapshot. Latent arg-order bug: --version after --out silently overwrites the explicit output path.
.github/workflows/release-prebuilt-npm.yml Adds release-benchmark-gate job with correct continue-on-error expression, manual-override reason guard, and proper needs wiring to build-binaries.
scripts/compare-release-benchmarks.test.ts Tests cover version selection, material-regression threshold, pass/fail comparisons, and missing-base/missing-head edge cases. All cases look correct.
benchmarks/lib/benchmark-result.ts Adds BenchmarkRuntimeInfo and optional packageVersion/runtime fields to BenchmarkRunResult; adds BenchmarkComparisonRow/Result interfaces. Non-breaking additive change.
benchmarks/run.ts Enriches the benchmark output with packageVersion and runtime info. Straightforward async top-level await additions.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[push v* tag / workflow_dispatch] --> B[release-benchmark-gate]
    B --> C{bench:release:compare}
    C -- head file missing --> D[❌ Throw: missing head snapshot]
    C -- no previous baseline --> E[❌ Throw: no baseline]
    C -- comparison done --> F{comparison.failed?}
    F -- No --> G[✅ Gate passes]
    F -- Yes + push event --> H[❌ Job fails — release blocked]
    F -- Yes + workflow_dispatch\nallow_benchmark_regression=true --> I{reason provided?}
    I -- No --> J[❌ Require reason step fails]
    I -- Yes --> K[✅ Override accepted, summary updated]
    G --> L[build-binaries]
    K --> L
    L --> M[stage-release]
    M --> N[publish]
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
scripts/run-release-benchmark.ts:41-52
**`--version` silently clobbers an explicit `--out`**

When `--version` is parsed, `options.out` is unconditionally overwritten with the computed release path. If a caller passes `--out custom/path.json --version 0.16.0`, the `--out` value is discarded and the file goes to `benchmarks/release/bench-0.16.0.json` instead. Only the reverse order (`--version` first, then `--out`) works correctly. A simple guard — only update `options.out` from `--version` when `--out` has not yet been explicitly set — would fix the ordering dependency.

### Issue 2 of 2
scripts/compare-release-benchmarks.ts:282-292
**Absolute threshold displayed without its unit**

`formatThreshold` formats the absolute part with `formatNumber` but never appends the metric unit, so a timing threshold renders as `+15% and +5.00` instead of `+15% and +5ms`, and the memory threshold becomes `+20% and +8388608.0` instead of `+20% and +8MiB`. Reviewers scanning the CI summary table won't immediately know whether the absolute value is in milliseconds or raw bytes, which defeats the column's purpose.

Reviews (1): Last reviewed commit: "ci: gate releases on benchmark regressio..." | Re-trigger Greptile

Comment on lines +41 to +52
index += 1;
continue;
}

if (arg === "--samples") {
options.samples = Number(readArgValue(args, index));
index += 1;
continue;
}

if (arg === "--out") {
options.out = path.resolve(readArgValue(args, index));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 --version silently clobbers an explicit --out

When --version is parsed, options.out is unconditionally overwritten with the computed release path. If a caller passes --out custom/path.json --version 0.16.0, the --out value is discarded and the file goes to benchmarks/release/bench-0.16.0.json instead. Only the reverse order (--version first, then --out) works correctly. A simple guard — only update options.out from --version when --out has not yet been explicitly set — would fix the ordering dependency.

Prompt To Fix With AI
This is a comment left during a code review.
Path: scripts/run-release-benchmark.ts
Line: 41-52

Comment:
**`--version` silently clobbers an explicit `--out`**

When `--version` is parsed, `options.out` is unconditionally overwritten with the computed release path. If a caller passes `--out custom/path.json --version 0.16.0`, the `--out` value is discarded and the file goes to `benchmarks/release/bench-0.16.0.json` instead. Only the reverse order (`--version` first, then `--out`) works correctly. A simple guard — only update `options.out` from `--version` when `--out` has not yet been explicitly set — would fix the ordering dependency.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by preserving an explicitly provided --out path even when --version appears later, and added a regression test for that argument ordering.

Responded by Pi using OpenAI GPT-5.

Comment on lines +282 to +292
return unit === "bytes" ? "B" : unit;
}

function formatThreshold(threshold: BenchmarkThreshold | undefined) {
if (!threshold) {
return "—";
}

return `+${((threshold.maxRegressionRatio - 1) * 100).toFixed(0)}% and +${formatNumber(
threshold.minAbsoluteRegression,
)}`;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Absolute threshold displayed without its unit

formatThreshold formats the absolute part with formatNumber but never appends the metric unit, so a timing threshold renders as +15% and +5.00 instead of +15% and +5ms, and the memory threshold becomes +20% and +8388608.0 instead of +20% and +8MiB. Reviewers scanning the CI summary table won't immediately know whether the absolute value is in milliseconds or raw bytes, which defeats the column's purpose.

Prompt To Fix With AI
This is a comment left during a code review.
Path: scripts/compare-release-benchmarks.ts
Line: 282-292

Comment:
**Absolute threshold displayed without its unit**

`formatThreshold` formats the absolute part with `formatNumber` but never appends the metric unit, so a timing threshold renders as `+15% and +5.00` instead of `+15% and +5ms`, and the memory threshold becomes `+20% and +8388608.0` instead of `+20% and +8MiB`. Reviewers scanning the CI summary table won't immediately know whether the absolute value is in milliseconds or raw bytes, which defeats the column's purpose.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by rendering threshold absolute values with units in the markdown summary: milliseconds for timing thresholds and MiB for byte thresholds, with test coverage.

Responded by Pi using OpenAI GPT-5.

@benvinegar benvinegar force-pushed the feat/release-benchmark-gate branch from f6a6772 to cb87d6d Compare June 13, 2026 17:13
@benvinegar benvinegar merged commit 1ad5d36 into main Jun 13, 2026
10 of 11 checks passed
@benvinegar benvinegar deleted the feat/release-benchmark-gate branch June 13, 2026 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant