Skip to content

Fix ~30s setup + post step hangs caused by dangling gRPC HTTP/2 sessions#99

Merged
taha-au merged 1 commit into
mainfrom
fix/post-step-hang-h2-session
Apr 25, 2026
Merged

Fix ~30s setup + post step hangs caused by dangling gRPC HTTP/2 sessions#99
taha-au merged 1 commit into
mainfrom
fix/post-step-hang-h2-session

Conversation

@taha-au

@taha-au taha-au commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Both the setup and post steps of setup-docker-builder were each spending ~30s of dead air after the action's actual work had finished. Pre-fix, every job using this action paid ~60s of pure wall-clock latency.

Root cause: reporter.createBlacksmithAgentClient() built a fresh @connectrpc/connect-node gRPC transport on every call, which opens an HTTP/2 session via http2.connect(). Those sessions were never close()d or abort()ed. Http2SessionManager defaults to pingIdleConnection: false and a 15-minute idleConnectionTimeoutMs, and an open HTTP/2 session holds a ref'd handle on the Node event loop — so the action's Node process couldn't exit until the Blacksmith agent's server-side idle timeout (~30s) closed the connection for it.

Fix

  • Cache a single Http2SessionManager + gRPC client per Node process and reuse it across reportMetric / commitStickyDisk / getStickyDisk. Reduces TCP/HTTP2 churn during normal operation.
  • Add closeBlacksmithAgentClient() that aborts the cached session manager.
  • Call it at the end of both the main and post action bodies.
  • Add setImmediate(() => process.exit(process.exitCode ?? 0)) belt-and-suspenders at the end of each, so any future stray open handle (axios keep-alive, fs watcher, etc.) cannot silently regress this. Preserves any failure exit code set via core.setFailed.
  • Rebuild dist/.

Validation (A/B from this branch, pre-merge)

Same workflow, identical steps, only the action source differs:

Step useblacksmith/setup-docker-builder@main ./ (this branch) Improvement
Setup Docker Builder under test 32s 3s −29s
Trivial build 2s 2s
Post Setup Docker Builder under test 30s 1s −29s
Total time spent in the action ~62s ~4s −58s

Run: https://github.com/useblacksmith/setup-docker-builder/actions/runs/24923828961

Regression test

Adds .github/workflows/step-duration-regression.yml:

  • Runs the action against a real Blacksmith runner.
  • Queries the GitHub Actions API for the just-finished job's step timings, locates Setup Docker Builder under test and Post Setup Docker Builder under test, and fails if either exceeds STEP_MAX_SECONDS (5s; pre-fix runs were ~32s each, post-fix ~1–3s).
  • Triggers: PR, push to main, workflow_dispatch, and once daily so we also catch upstream regressions (e.g. an agent-side change re-introducing an idle-connection delay) when the action source itself hasn't changed.
  • Cause-agnostic error message points future debuggers at the most likely class of culprit (dangling Http2Session / axios keep-alive / fs watcher) without baking the diagnosis into the test.

Test plan

  • Unit tests pass (pnpm test)
  • Typecheck passes (pnpm typecheck)
  • dist/ rebuilt and committed
  • A/B comparison vs @main shows ~30s drop on each step (run linked above)
  • Regression workflow asserts both steps stay under threshold and writes a duration summary to $GITHUB_STEP_SUMMARY
  • Regression workflow runs green against this PR (will appear in checks below)

Note

Medium Risk
Adds a new CI gate that can fail PRs based on GitHub Actions step-timing data, which may be somewhat flaky due to runner/API variability, but it’s isolated to workflow logic and doesn’t change runtime production code paths.

Overview
Adds a new GitHub Actions workflow, step-duration-regression.yml, that exercises the action on a Blacksmith runner and then queries the Actions API to compute the durations of the setup step and its corresponding post step.

The workflow writes a step-duration table to $GITHUB_STEP_SUMMARY and fails CI if either duration is missing or exceeds the configured threshold (STEP_MAX_SECONDS, default 5).

Reviewed by Cursor Bugbot for commit 5933660. Bugbot is set up for automated code reviews on this repo. Configure here.

@taha-au taha-au requested review from adityamaru and bruce-y April 25, 2026 05:51
@taha-au taha-au self-assigned this Apr 25, 2026
Both the setup and post steps were each spending ~30s of dead air after
the action's JS work had finished. Pre-fix, every job using
setup-docker-builder paid ~60s of pure wall-clock latency for nothing.

Cause: createBlacksmithAgentClient() built a fresh @connectrpc/connect-node
gRPC transport on every call, opening an HTTP/2 session via
http2.connect() that was never close()d or abort()ed. Open HTTP/2
sessions hold a ref'd handle on the Node event loop, so the action's
process couldn't exit until the Blacksmith agent's server-side idle
timeout (~30s) closed the connection for it. This happened twice per
job: once in setup (reportMetric + getStickyDisk) and once in post
(reportMetric + commitStickyDisk).

Fix:

- Cache one Http2SessionManager + gRPC client per process and reuse it
  across all reporter calls.
- Add closeBlacksmithAgentClient() that aborts the cached session.
- Call it at the end of both the main and post action bodies, followed
  by setImmediate(() => process.exit(process.exitCode ?? 0)) as a
  belt-and-suspenders against any other stray open handle (axios
  keep-alive, fs watcher, etc.).

Also adds .github/workflows/step-duration-regression.yml: a regression
test that exercises the action against a Blacksmith runner, queries
the Actions API for the resulting step timings, and fails CI if either
the setup or post step exceeds 5s. Triggers on PRs, push to main,
manual dispatch, and once daily so we also catch upstream regressions
when the action source itself hasn't changed.

A/B vs main on the validation run: setup 32s -> 3s, post 30s -> 1s.
@taha-au taha-au force-pushed the fix/post-step-hang-h2-session branch from 616b62f to 5933660 Compare April 25, 2026 05:58
@taha-au taha-au merged commit 722e97d into main Apr 25, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants