Contributing guidelines and issue reporting guide
Well-formed report checklist
Description of bug
Disclaimer: this bug report was written via Claude, but the behavior matches my experience. I have a custom snapshotter (among other things), and this build https://github.com/clipper-registry/blog-buildkit-benchmark/actions/runs/27998059001/job/82864320244 should have many cached layers, but does not.
Bug description
With a lazy/remote snapshotter (stargz/eStargz, overlaybd, …) and a multi-stage
build where a later stage consumes an earlier, different-base stage via
COPY --from, --cache-to mode=max silently drops the consumed stage's cache
layers. On a later build, a step in that stage that should be restored from cache
re-runs instead. No error is reported — the export "succeeds" with a degraded
cache.
Reproduction
Requires a lazy snapshotter; this uses the upstream stargz snapshotter and the
public ghcr.io/stargz-containers eStargz images.
# 1. buildkit with the stargz snapshotter
docker buildx create --name stargz --driver docker-container \
  --driver-opt image=moby/buildkit:latest \
  --buildkitd-flags "--oci-worker-snapshotter=stargz"

# 2. Multi-stage: build stage on one esgz base, final stage on a DIFFERENT esgz
#    base, COPY --from, plus a downstream step (ARG BUST) that forces a re-run.
cat > Dockerfile <<'EOF'
FROM ghcr.io/stargz-containers/ubuntu:22.04-esgz AS build
RUN echo stable > /stable
ARG BUST=0
RUN echo "$BUST" > /bust && cat /stable > /combined
FROM ghcr.io/stargz-containers/alpine:3.15.3-esgz
COPY --from=build /combined /combined
EOF

# 3. First build: exports the cache
docker buildx build --builder stargz --platform linux/amd64 --build-arg BUST=1 \
  --cache-to type=local,dest=./cache,mode=max --output type=cacheonly .

# 4. Drop local state, then rebuild changing only BUST. The second RUN must
#    re-run, which requires the first RUN's filesystem state to be restored.
docker buildx prune -af --builder stargz
docker buildx build --builder stargz --platform linux/amd64 --build-arg BUST=2 \
  --cache-from type=local,src=./cache --output type=cacheonly .
Expected
On the second build, RUN echo stable > /stable (unchanged) is CACHED.
Actual
#8 [build 2/3] RUN echo stable > /stable
#8 DONE 0.4s <-- re-runs instead of CACHED
Its cache layer was dropped during the first build's export, so it cannot be
restored when the downstream BUST step re-runs. (Inspecting ./cache's
application/vnd.buildkit.cacheconfig.v0 config confirms the build-stage record
has an empty layers field; only the final COPY record has a layer.)
The same Dockerfile with an eager snapshotter, or with the same base image in both
stages, caches correctly.
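The config inspection mentioned above can be scripted. The following is a minimal sketch in Go, not part of BuildKit. It assumes the config blob has already been located by hand (for a local cache it sits under ./cache/blobs/sha256/, reachable from ./cache/index.json) and that its JSON has a top-level records array whose entries carry digest and layers fields, as the observation above implies. The struct and the recordsWithoutLayers helper are illustrative names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// cacheConfig mirrors only the fields this sketch needs. The field names are
// an assumption about the application/vnd.buildkit.cacheconfig.v0 layout.
type cacheConfig struct {
	Records []struct {
		Digest string            `json:"digest"`
		Layers []json.RawMessage `json:"layers"`
	} `json:"records"`
}

// recordsWithoutLayers returns the digests of all records that reference no
// layer, in the order they appear in the config.
func recordsWithoutLayers(data []byte) ([]string, error) {
	var cfg cacheConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	var missing []string
	for _, r := range cfg.Records {
		if len(r.Layers) == 0 {
			missing = append(missing, r.Digest)
		}
	}
	return missing, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: inspect <path-to-cacheconfig-blob>")
		os.Exit(2)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	missing, err := recordsWithoutLayers(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, d := range missing {
		fmt.Println("no layers:", d)
	}
}
```

A record without layers is not necessarily a bug on its own, so the useful comparison is between the output for this cache and the output for a cache exported with an eager snapshotter from the same Dockerfile.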
Root cause
- The cache exporter recurses per record into cross-stage deps
  (solver/exporter.go ExportTo).
- Loading a record's result (worker/cacheresult.go LoadRemotes →
  Worker.LoadRef → CacheManager.Get) runs checkLazyProviders
  (cache/manager.go), which returns NeedsRemoteProviderError for any lazy
  ancestor blob lacking a DescHandler.
- LoadRef recovers those handlers from CacheOptGetterOf(ctx), the
  withAncestorCacheOpts getter, which resolves handlers from the ancestor
  states of whatever state it was rooted at.
- The regression (051818cf3): the exporter now sets that getter once, at
  the outermost export, and all nested records inherit it
  (if CacheOptGetterOf(ctx) == nil && e.recordCtxOpts != nil). Since
  withAncestorCacheOpts walks only one record's ancestry, the getter rooted at
  the final stage's result does not reach the build stage's source op, so the
  build-stage result's lazy base-layer handlers are unresolvable.
- LoadRef therefore still fails with NeedsRemoteProviderError; ExportTo
  returns it; the deps loop swallows it (if err != nil { continue }, added in
  the same PR series to avoid failing export on subbranch errors). The
  build-stage result is dropped from the cache.
Eager snapshotters are unaffected (base blobs are materialized, isLazy == false,
so no handler is needed).
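The chain above can be illustrated with a toy model. This is a sketch only: every type and function in it (state, record, ancestorGetter, export, sampleGraph, exportAll) is a hypothetical stand-in rather than BuildKit code, and the graph encodes the claim made above (the final stage's ancestry does not include the build stage's source op) rather than verifying it.

```go
package main

import (
	"errors"
	"fmt"
)

// state is a stand-in for a solver state: it may hold handlers for lazy
// blobs and links to the states it descends from.
type state struct {
	name      string
	handlers  map[string]bool
	ancestors []*state
}

// getter reports whether a handler for the given blob can be resolved.
type getter func(blob string) bool

// ancestorGetter walks the ancestry of a single root state only.
func ancestorGetter(root *state) getter {
	return func(blob string) bool {
		stack := []*state{root}
		for len(stack) > 0 {
			s := stack[len(stack)-1]
			stack = stack[:len(stack)-1]
			if s.handlers[blob] {
				return true
			}
			stack = append(stack, s.ancestors...)
		}
		return false
	}
}

// record is a stand-in for an exported cache record whose result depends on
// one lazy base blob and which may have cross-stage deps.
type record struct {
	name     string
	st       *state
	lazyBlob string
	deps     []*record
}

var errNeedsProvider = errors.New("needs remote provider")

// export recurses into deps first. With reroot=false the getter is set once
// at the outermost record and inherited; with reroot=true it is re-rooted at
// each record's own state.
func export(r *record, g getter, reroot bool, out *[]string) error {
	if g == nil || reroot {
		g = ancestorGetter(r.st)
	}
	for _, d := range r.deps {
		if err := export(d, g, reroot, out); err != nil {
			continue // the dep's error is swallowed and the dep is dropped
		}
	}
	if !g(r.lazyBlob) {
		return errNeedsProvider
	}
	*out = append(*out, r.name)
	return nil
}

// sampleGraph models two stages on different lazy bases joined by COPY --from.
func sampleGraph() *record {
	ubuntu := &state{name: "ubuntu-source", handlers: map[string]bool{"ubuntu-base": true}}
	alpine := &state{name: "alpine-source", handlers: map[string]bool{"alpine-base": true}}
	buildSt := &state{name: "build-run", ancestors: []*state{ubuntu}}
	finalSt := &state{name: "final-copy", ancestors: []*state{alpine}}
	build := &record{name: "build", st: buildSt, lazyBlob: "ubuntu-base"}
	return &record{name: "final", st: finalSt, lazyBlob: "alpine-base", deps: []*record{build}}
}

// exportAll returns the names of the records that made it into the export.
func exportAll(root *record, reroot bool) []string {
	var out []string
	_ = export(root, nil, reroot, &out)
	return out
}

func main() {
	fmt.Println("set once:  ", exportAll(sampleGraph(), false))
	fmt.Println("per record:", exportAll(sampleGraph(), true))
}
```

In this model the set-once variant exports only the final record, while the per-record variant exports both build and final, which mirrors the dropped build-stage result described above.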
Suggested fix
Re-root the opt-getter at each record's own state (the pre-051818cf3 behavior),
matching the exporter's per-record recursion:
  mainCtx := ctx
- if CacheOptGetterOf(ctx) == nil && e.recordCtxOpts != nil {
+ if e.recordCtxOpts != nil {
      ctx = e.recordCtxOpts(ctx)
  }
A getter is only ever queried for a record's own ancestor blobs, so per-record
re-rooting resolves exactly what each record needs. (If the perf intent of
051818cf3 matters, an alternative is to keep "set once" but make
withAncestorCacheOpts traverse cross-stage edges.) It would also help to log,
rather than silently swallow, the subbranch export error so this failure mode is
visible.
With the above change, the reproduction's second build reports
RUN echo stable > /stable ... CACHED.