mode=max cache export silently drops build-stage layers with a lazy snapshotter + multi-stage COPY --from #6893

@kyle-basis

Contributing guidelines and issue reporting guide

Well-formed report checklist

  • I have found a bug that the documentation does not mention anything about my problem
  • I have found a bug that there are no open or closed issues that are related to my problem
  • I have provided version/information about my environment and done my best to provide a reproducer

Description of bug

Disclaimer: this bug report was written via Claude, but the behavior matches my experience. I have a custom snapshotter (among other things), and this build https://github.com/clipper-registry/blog-buildkit-benchmark/actions/runs/27998059001/job/82864320244 should have many cached layers, but does not.

Bug description

With a lazy/remote snapshotter (stargz/eStargz, overlaybd, …) and a multi-stage
build where a later stage consumes an earlier, different-base stage via
COPY --from, --cache-to mode=max silently drops the consumed stage's cache
layers. On a later build, a step in that stage that should be restored from cache
re-runs instead. No error is reported — the export "succeeds" with a degraded
cache.

Reproduction

Requires a lazy snapshotter; this uses the upstream stargz snapshotter and the
public ghcr.io/stargz-containers eStargz images.

# 1. buildkit with the stargz snapshotter
docker buildx create --name stargz --driver docker-container \
  --driver-opt image=moby/buildkit:latest \
  --buildkitd-flags "--oci-worker-snapshotter=stargz"

# 2. Multi-stage: build stage on one esgz base, final stage on a DIFFERENT esgz
#    base, COPY --from, plus a downstream step (ARG BUST) that forces a re-run.
cat > Dockerfile <<'EOF'
FROM ghcr.io/stargz-containers/ubuntu:22.04-esgz AS build
RUN echo stable > /stable
ARG BUST=0
RUN echo "$BUST" > /bust && cat /stable > /combined
FROM ghcr.io/stargz-containers/alpine:3.15.3-esgz
COPY --from=build /combined /combined
EOF

# 3. First build — exports the cache
docker buildx build --builder stargz --platform linux/amd64 --build-arg BUST=1 \
  --cache-to type=local,dest=./cache,mode=max --output type=cacheonly .

# 4. Drop local state, then rebuild changing only BUST. The second RUN must
#    re-run, which requires the first RUN's filesystem state to be restored.
docker buildx prune -af --builder stargz
docker buildx build --builder stargz --platform linux/amd64 --build-arg BUST=2 \
  --cache-from type=local,src=./cache --output type=cacheonly .

Expected

On the second build, RUN echo stable > /stable (unchanged) is CACHED.

Actual

#8 [build 2/3] RUN echo stable > /stable
#8 DONE 0.4s          <-- re-runs instead of CACHED

Its cache layer was dropped during the first build's export, so it cannot be
restored when the downstream BUST step re-runs. (Inspecting ./cache's
application/vnd.buildkit.cacheconfig.v0 config confirms the build-stage record
has an empty layers field; only the final COPY record has a layer.)

The same Dockerfile with an eager snapshotter, or with the same base image in both
stages, caches correctly.

Root cause

  1. The cache exporter recurses per record into cross-stage deps
    (solver/exporter.go ExportTo).
  2. Loading a record's result (worker/cacheresult.go LoadRemotes →
    Worker.LoadRef → CacheManager.Get) runs checkLazyProviders
    (cache/manager.go), which returns NeedsRemoteProviderError for any lazy
    ancestor blob lacking a DescHandler.
  3. LoadRef recovers those handlers from CacheOptGetterOf(ctx) — the
    withAncestorCacheOpts getter, which resolves handlers from the ancestor states
    of whatever state it was rooted at.
  4. The regression (051818cf3): the exporter now sets that getter once, at
    the outermost export, and all nested records inherit it
    (if CacheOptGetterOf(ctx) == nil && e.recordCtxOpts != nil). Since
    withAncestorCacheOpts walks only one record's ancestry, the getter rooted at
    the final stage's result does not reach the build stage's source op, so the
    build-stage result's lazy base-layer handlers are unresolvable.
  5. LoadRef therefore still fails with NeedsRemoteProviderError; ExportTo
    returns it; the deps loop swallows it (if err != nil { continue }, added in
    the same PR series to avoid failing export on subbranch errors). The
    build-stage result is dropped from the cache.

Eager snapshotters are unaffected (base blobs are materialized, isLazy == false,
so no handler is needed).

Suggested fix

Re-root the opt-getter at each record's own state (the pre-051818cf3 behavior),
matching the exporter's per-record recursion:

 	mainCtx := ctx
-	if CacheOptGetterOf(ctx) == nil && e.recordCtxOpts != nil {
+	if e.recordCtxOpts != nil {
 		ctx = e.recordCtxOpts(ctx)
 	}

A getter is only ever queried for a record's own ancestor blobs, so per-record
re-rooting resolves exactly what each record needs. (If the perf intent of
051818cf3 matters, an alternative is to keep "set once" but make
withAncestorCacheOpts traverse cross-stage edges.) It would also help to log,
rather than silently swallow, the subbranch export error so this failure mode is
visible.

With the above change, the reproduction's second build reports
RUN echo stable > /stable ... CACHED.
