Fix ~30s setup + post step hangs caused by dangling gRPC HTTP/2 sessions#99
Merged
Conversation
bruce-y
approved these changes
Apr 25, 2026
Both the setup and post steps were each spending ~30s of dead air after the action's JS work had finished. Pre-fix, every job using setup-docker-builder paid ~60s of pure wall-clock latency for nothing. Cause: createBlacksmithAgentClient() built a fresh @connectrpc/connect-node gRPC transport on every call, opening an HTTP/2 session via http2.connect() that was never close()d or abort()ed. Open HTTP/2 sessions hold a ref'd handle on the Node event loop, so the action's process couldn't exit until the Blacksmith agent's server-side idle timeout (~30s) closed the connection for it. This happened twice per job: once in setup (reportMetric + getStickyDisk) and once in post (reportMetric + commitStickyDisk). Fix: - Cache one Http2SessionManager + gRPC client per process and reuse it across all reporter calls. - Add closeBlacksmithAgentClient() that aborts the cached session. - Call it at the end of both the main and post action bodies, followed by setImmediate(() => process.exit(process.exitCode ?? 0)) as a belt-and-suspenders against any other stray open handle (axios keep-alive, fs watcher, etc.). Also adds .github/workflows/step-duration-regression.yml: a regression test that exercises the action against a Blacksmith runner, queries the Actions API for the resulting step timings, and fails CI if either the setup or post step exceeds 5s. Triggers on PRs, push to main, manual dispatch, and once daily so we also catch upstream regressions when the action source itself hasn't changed. A/B vs main on the validation run: setup 32s -> 3s, post 30s -> 1s.
616b62f to
5933660
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Both the setup and post steps of
setup-docker-builderwere each spending ~30s of dead air after the action's actual work had finished. Pre-fix, every job using this action paid ~60s of pure wall-clock latency.Root cause:
reporter.createBlacksmithAgentClient()built a fresh@connectrpc/connect-nodegRPC transport on every call, which opens an HTTP/2 session viahttp2.connect(). Those sessions were neverclose()d orabort()ed.Http2SessionManagerdefaults topingIdleConnection: falseand a 15-minuteidleConnectionTimeoutMs, and an open HTTP/2 session holds a ref'd handle on the Node event loop — so the action's Node process couldn't exit until the Blacksmith agent's server-side idle timeout (~30s) closed the connection for it.Fix
Http2SessionManager+ gRPC client per Node process and reuse it acrossreportMetric/commitStickyDisk/getStickyDisk. Reduces TCP/HTTP2 churn during normal operation.closeBlacksmithAgentClient()that aborts the cached session manager.setImmediate(() => process.exit(process.exitCode ?? 0))belt-and-suspenders at the end of each, so any future stray open handle (axios keep-alive, fs watcher, etc.) cannot silently regress this. Preserves any failure exit code set viacore.setFailed.dist/.Validation (A/B from this branch, pre-merge)
Same workflow, identical steps, only the action source differs:
useblacksmith/setup-docker-builder@main./(this branch)Run: https://github.com/useblacksmith/setup-docker-builder/actions/runs/24923828961
Regression test
Adds
.github/workflows/step-duration-regression.yml:Setup Docker Builder under testandPost Setup Docker Builder under test, and fails if either exceedsSTEP_MAX_SECONDS(5s; pre-fix runs were ~32s each, post-fix ~1–3s).main,workflow_dispatch, and once daily so we also catch upstream regressions (e.g. an agent-side change re-introducing an idle-connection delay) when the action source itself hasn't changed.Http2Session/ axios keep-alive / fs watcher) without baking the diagnosis into the test.Test plan
pnpm test)pnpm typecheck)dist/rebuilt and committed@mainshows ~30s drop on each step (run linked above)$GITHUB_STEP_SUMMARYNote
Medium Risk
Adds a new CI gate that can fail PRs based on GitHub Actions step-timing data, which may be somewhat flaky due to runner/API variability, but it’s isolated to workflow logic and doesn’t change runtime production code paths.
Overview
Adds a new GitHub Actions workflow,
step-duration-regression.yml, that exercises the action on a Blacksmith runner and then queries the Actions API to compute the durations of the setup step and its corresponding post step.The workflow writes a step-duration table to
$GITHUB_STEP_SUMMARYand fails CI if either duration is missing or exceeds the configured threshold (STEP_MAX_SECONDS, default5).Reviewed by Cursor Bugbot for commit 5933660. Bugbot is set up for automated code reviews on this repo. Configure here.