Summary
In the post-job step, shutdownBuildkitd() runs sudo pkill -TERM buildkitd and awaits it directly. pkill exits with code 1 when no process matches. If buildkitd has already exited by the time the post-step runs, that exit-1 rejects the promise, shutdownBuildkitd() throws, and the action reports a fatal cleanup failure — which then skips the sticky-disk cache commit. So a perfectly successful build silently loses its layer cache for that run.
This isn't only cosmetic noise: the skipped commit defeats the action's main purpose (persisting the BuildKit cache on the sticky disk).
Environment
- Action: useblacksmith/setup-docker-builder@v1 (resolves to v1.8.0, tag sha a592b831ebb20e68f7cf47329cf2c3c67b8a7655)
- buildkitd: 0.29.3-blacksmith
- Followed by docker/bake-action@v7; max-cache-size-mb: "30720"
Observed logs (post-job cleanup)
Starting buildkitd with command: nohup sudo buildkitd --debug --config=buildkitd.toml ... &
buildkitd daemon started successfully with PID 4159
buildkitd version: 0.29.3-blacksmith
...
Post job cleanup.
buildkitd addr: tcp://127.0.0.1:1234
buildkitd process: 4159
Sending SIGTERM to buildkitd for graceful shutdown
##[error]error shutting down buildkitd process: Command failed: sudo pkill -TERM buildkitd
##[error]Cleanup failed: Command failed: sudo pkill -TERM buildkitd
##[warning]Skipping sticky disk commit due to cleanup error: Command failed: sudo pkill -TERM buildkitd
The build itself succeeded; only the post-step cleanup "failed". The job is green overall, but the two red ##[error] annotations are misleading and the cache commit is dropped.
Root cause
Decompiled from dist/index.js (v1.8.0), lightly reformatted:
async function shutdownBuildkitd() {
  const TIMEOUT = 3e4;
  try {
    info("Sending SIGTERM to buildkitd for graceful shutdown");
    await exec(`sudo pkill -TERM buildkitd`); // ← throws when pkill exits 1 (no match)
    const start = Date.now();
    while (Date.now() - start < TIMEOUT) {
      try {
        const { stdout } = await exec("pgrep buildkitd");
        debug(`buildkitd process still running with PID: ${stdout.trim()}, waiting...`);
        await new Promise(r => setTimeout(r, 300));
      } catch (e) {
        if (e.code === 1) { info("buildkitd successfully shutdown gracefully"); return } // ← already handles "no process"
        throw e;
      }
    }
    // ... SIGKILL fallback ...
  }
}
The initial await exec("sudo pkill -TERM buildkitd") is intolerant of pkill's exit code 1 (= "no processes matched"). When buildkitd has already exited before the post-step — idle reap, or a crash (the action even ships a logBuildkitdCrashLogs() helper, so this is anticipated) — the pkill returns 1, the promise rejects, and the whole shutdown is treated as an error.
Notably, the very next loop already treats "no buildkitd process" (pgrep exit 1) as the success case (buildkitd successfully shutdown gracefully). The initial pkill just needs the same tolerance.
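For reference, pgrep and pkill share the same exit-status convention, which is why the same tolerance applies to both calls. The sketch below encodes the statuses as listed in the procps-ng man page; describePkillExit is a hypothetical helper written for this report, not something found in the action.

```javascript
// Exit statuses as documented in the procps-ng pgrep/pkill man page.
// describePkillExit is a hypothetical name used only for illustration.
function describePkillExit(code) {
  switch (code) {
    case 0: return "matched";      // at least one process matched (and was signalled)
    case 1: return "no-match";     // no process matched, or none could be signalled
    case 2: return "syntax-error"; // bad command-line syntax
    case 3: return "fatal-error";  // e.g. out of memory
    default: return "unknown";     // anything else is outside the documented set
  }
}

console.log(describePkillExit(1)); // prints "no-match"
```

Only status 1 means "nothing to terminate"; statuses 2 and 3 indicate a real problem and should stay fatal.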
Suggested fix
Treat exit code 1 from the initial pkill -TERM buildkitd as "already gone → success", mirroring the existing pgrep handling. For example:
try {
  await exec(`sudo pkill -TERM buildkitd`);
} catch (e) {
  if (e.code === 1) { info("buildkitd already stopped"); return; } // nothing to terminate
  throw e;
}
(Equivalently, sudo pkill -TERM buildkitd || true, though catching exit 1 specifically keeps real failures fatal.) This would clear the spurious ##[error] annotations and, more importantly, stop dropping the sticky-disk commit when the daemon exited on its own.
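To make the proposed behavior easy to check in isolation, here is a self-contained sketch of the tolerant call. It assumes the action's exec rejects with an error whose code property is the exit status, as the existing pgrep handling implies. terminateBuildkitd and fakeExec are hypothetical names, and exec and info are injected so the snippet runs without the action's real helpers. One caveat to verify: as far as I know, sudo itself also exits with 1 on a configuration or permission problem, so exit 1 may not always mean "no match" when the command is run through sudo.

```javascript
// Hypothetical wrapper around the initial SIGTERM; not the action's real code.
// `exec` and `info` are passed in so the sketch can run against a stub.
async function terminateBuildkitd(exec, info) {
  try {
    await exec("sudo pkill -TERM buildkitd");
    return "signalled"; // pkill matched and signalled at least one process
  } catch (e) {
    if (e && e.code === 1) {
      // Exit 1: no process matched, so there is nothing left to terminate.
      info("buildkitd already stopped");
      return "already-stopped";
    }
    throw e; // any other failure stays fatal
  }
}

// Stub imitating a promisified exec that rejects on a non-zero exit status.
function fakeExec(exitCode) {
  return async (cmd) => {
    if (exitCode !== 0) {
      const err = new Error(`Command failed: ${cmd}`);
      err.code = exitCode; // assumed shape: exit status exposed as `code`
      throw err;
    }
    return { stdout: "", stderr: "" };
  };
}
```

With the stub, terminateBuildkitd(fakeExec(1), console.log) resolves to "already-stopped" instead of rejecting, while fakeExec(2) still rejects, so the caller can go on to the sticky-disk commit only in the benign case.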
Bug report drafted with assistance from Claude Code.