storage: add splitfdstream for efficient layer transfer with reflink support#651
giuseppe wants to merge 12 commits into containers:main
Conversation
(force-pushed 4cca2a3 → 3b6f591)
mtrmac left a comment:
- 6% of the code base for an extremely niche use case :/
I’m afraid I can’t spend time on this now; and anyway, some other work on pkg/archive needs to happen first, and I expect the rebase to be somewhat non-trivial.
What other work would block this feature? I don't love the amount of extra code either, but it is almost all in new APIs/packages that are experimental and don't leak too much into the rest of the library. We have circled around this problem for a long time, and in the end this seems like an ~ok solution: it avoids exposing too much of the containers/storage internals by augmenting the tar format that is already plugged in everywhere.
(force-pushed 3b6f591 → 26cd4b6)
To add more context: the goal of this feature is to be able to copy a layer, similarly to what we would do with … There are other interesting use cases enabled by an RPC, like facilitating usage from different languages (Rust in our case), or injecting it into a build container to grab R/O access to layers/images, but those are out of scope for now. I realize the complexity is not trivial just to get access to the raw files, but what alternatives do we have? From the store API PoV we could just extend it to return a map … Any suggestions?
(Just to avoid a misunderstanding, I didn’t really read the PR. I have, at this point, ~no opinion on RPC vs. Go API.)
storage/pkg/archive/archive.go (outdated)

```go
return &tarReaderIterator{
	tr:     tr,
	trBuf:  trBuf,
	buffer: make([]byte, 1<<20),
```
Seems like 1<<20 should be in a const somewhere with a rationale?
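A hedged sketch of what the reviewer is asking for; the constant name and the rationale comment are illustrative, not taken from the PR:

```go
package main

import "fmt"

// copyBufferSize is the chunk size used when reading file content out of
// the tar stream. 1 MiB is large enough to amortize read syscall overhead
// without holding excessive memory per reader. (Illustrative name only.)
const copyBufferSize = 1 << 20 // 1 MiB

func main() {
	buf := make([]byte, copyBufferSize)
	fmt.Println(len(buf))
}
```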
```go
// splitFDStreamSocketDescriptor is the fd for the Unix socket used to
// receive file descriptors via SCM_RIGHTS in the re-exec child.
const splitFDStreamSocketDescriptor = 5
```
Hardcoding this seems a bit fragile, can't we pass it as an env var or a CLI arg?
```go
// FDs are streamed to the child process one-at-a-time over a Unix socket
// using SCM_RIGHTS, avoiding EMFILE from inheriting too many FDs at exec.
```
It doesn't have to be truly one at a time, see also https://github.com/cgwalters/jsonrpc-fdpass?tab=readme-ov-file#41-fd-batching
In my initial PoC work here I handled this by just sending the splitfdstream inline data itself as a fd (could be O_TMPFILE or memfd or pipe), and then sending all the associated fds over jsonrpc-fdpass which automatically handles buffering.
Ensuring we process just one recvmsg() at a time (only ~200 fds per message on Linux) seems really reasonable and unlikely to get anywhere close to modern fd limits.
So if we do choose jsonrpc-fdpass, I think we also need to standardize this "splitfdstream serialization".
```go
	return err
}

// Try copy_file_range - kernel-level copy, more efficient than userspace
```
Surprising we weren't doing this before.
storage/pkg/splitfdstream/fdpass.go (outdated)
```go
// ReadLine reads bytes until a newline is encountered, using recvmsg.
// Any FDs received are closed since line-reading doesn't expect them.
func (p *FDPasser) ReadLine() ([]byte, error) {
```
Why would one want this?
```go
// JSON-RPC 2.0 protocol constants
const (
	JSONRPCVersion = "2.0"
```
BTW see also https://github.com/cgwalters/jsonrpc-fdpass-go
(force-pushed 23ed544 → 8d74f84)
Still a PoC, so you can skip the review as it is not yet ready, but we can use it to analyze the complexity. Moved to use https://github.com/cgwalters/jsonrpc-fdpass-go. Now the code size is almost 50% of the last revision.
(force-pushed 8d74f84 → 51a6c01)
Packit jobs failed. @containers/packit-build please check.
(force-pushed 1b087bb → bc6c5ff)
(force-pushed a0876b6 → d77fbc4)
(force-pushed d77fbc4 → 6d605ac)
(force-pushed 6d605ac → 0ebb56c)
Ready for review.
cgwalters left a comment:
Assisted-by: OpenCode (Claude Opus 4)
Cross-referenced with composefs/composefs-rs#218 (the Rust consumer side of this protocol).
Anything marked AI: I looked at it at least briefly and it seemed likely valid, but this is not necessarily a true deep verification.
Bigger picture: Can we agree that the high level interface to this would be through something like skopeo proxy? It'd be useful to have a sketch of what that looks like as it would likely inform the API shapes here.
I think in theory we could omit write support here in a first pass (it might simplify the code a bit) - while we do want to optimize containers-storage ➡️ containers-storage: I think the more pressing thing is containers-storage: ➡️ composefs.
```go
// File not found in diff directory (e.g., naiveDiff was used),
// write content inline from the tar stream in chunks
remaining := header.Size
for remaining > 0 {
```
AI: Important: Protocol mismatch — this loop writes multiple splitfdstream prefix+data pairs for files >1MB (one per iteration calling WriteInline). However, the reader in pkg/archive/splitfdstream.go's Next() reads exactly one content prefix per file entry. On the second iteration, the reader will interpret the next chunk's prefix as a tar header, causing desync/parse errors. Either accumulate all content into a single WriteInline call, or update the reader to handle multiple inline chunks per entry.
storage/pkg/splitfdstream/types.go (outdated)
```go
	if err != nil {
		continue
	}
	if strings.Contains(string(data), `"config"`) {
```
AI: Important: This heuristic is fragile. A manifest list could contain the string "config" in annotations or other fields. The correct approach is to unmarshal the JSON and check for a top-level "config" key (or check mediaType). This was also flagged in the composefs-rs PR as needing to match heuristics between Go and Rust sides — an ad-hoc string check makes that harder to keep in sync.
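A sketch of the suggested structural check; `isImageManifest` is an illustrative name, and a production version would likely also consult mediaType as the comment suggests:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isImageManifest reports whether raw JSON looks like a single-arch image
// manifest by checking for a top-level "config" key, rather than a
// substring match that a manifest list could trip via annotations.
func isImageManifest(data []byte) bool {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(data, &top); err != nil {
		return false
	}
	_, ok := top["config"]
	return ok
}

func main() {
	manifest := []byte(`{"schemaVersion":2,"config":{"digest":"sha256:abc"}}`)
	// A manifest list that merely *mentions* "config" in an annotation:
	list := []byte(`{"schemaVersion":2,"manifests":[],"annotations":{"note":"has config word"}}`)
	fmt.Println(isImageManifest(manifest), isImageManifest(list))
}
```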
```go
	return
}

streamData, err := io.ReadAll(stream)
```
AI: Important: This reads the entire stream into memory. The overlay driver already buffers it in a bytes.Buffer (overlay_splitfdstream.go:152), so layer data is held in memory twice. For large layers this could cause OOM. Consider streaming directly from the io.ReadCloser into the memfd via io.Copy instead, and also refactoring GetSplitFDStream to return a streaming interface rather than buffering into bytes.Buffer.
storage/pkg/splitfdstream/types.go (outdated)
```go
// Walk the layer chain using store.LayerParent
var layerIDs []string
layerID := topLayerID
for layerID != "" && len(layerIDs) < 50 {
```
AI: The magic number 50 is undocumented. Images can have more than 50 layers (especially multi-stage Dockerfiles with many RUN commands). If exceeded, the chain is silently truncated. This should be a named constant with a comment explaining the rationale, and exceeding it should produce an error or warning rather than silent truncation.
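A hedged sketch of the suggested shape; the constant value and the names are illustrative:

```go
package main

import "fmt"

// maxLayerChainDepth bounds the parent walk purely as a cycle/corruption
// guard; the value here is illustrative. Exceeding it is reported as an
// error instead of silently truncating the chain.
const maxLayerChainDepth = 500

// layerChain walks parent links from topLayerID; parent returns "" at the
// base layer (a stand-in for store.LayerParent).
func layerChain(topLayerID string, parent func(string) string) ([]string, error) {
	var ids []string
	for id := topLayerID; id != ""; id = parent(id) {
		if len(ids) >= maxLayerChainDepth {
			return nil, fmt.Errorf("layer chain from %s exceeds %d entries; possible cycle",
				topLayerID, maxLayerChainDepth)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	parents := map[string]string{"c": "b", "b": "a", "a": ""}
	ids, err := layerChain("c", func(id string) string { return parents[id] })
	fmt.Println(ids, err)
}
```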
```go
defer unix.Close(parentSocketFD)
for _, fd := range fds {
	rights := unix.UnixRights(int(fd.Fd()))
	if err := unix.Sendmsg(parentSocketFD, []byte{0}, rights, nil, 0); err != nil {
```
AI: Important: Sendmsg errors are silently swallowed (just returns from the goroutine). If sending FDs fails partway through, the child process blocks forever waiting for remaining FDs in Recvmsg, leading to a deadlock. The error should be captured and used to kill the child process or close the socket to unblock it.
Separately, PR_SET_PDEATHSIG is not set on the child process, which was specifically called out in the composefs-rs PR as needed to avoid process leaks if the parent crashes. (This is a pre-existing pattern across chrootarchive, so may be best addressed separately.)
```go
	Flags:   unix.O_RDONLY | unix.O_CLOEXEC | unix.O_PATH,
	Resolve: unix.RESOLVE_NO_SYMLINKS | unix.RESOLVE_BENEATH,
})
if err != nil {
```
AI: When Openat2 returns any error, the code does continue to try the next diffDir. This silently ignores errors like EACCES, ENOMEM, etc. Only ENOENT (and possibly ELOOP for RESOLVE_NO_SYMLINKS) should trigger continue; other errors should be returned.
Also, RESOLVE_NO_SYMLINKS is a behavior change from the old procfs approach which followed symlinks. If any legitimate composefs path contains a symlink component, this will silently skip the entry. Worth documenting that this is intentional.
```go
// HandleConnection handles a single client connection.
func (s *JSONRPCServer) HandleConnection(conn *net.UnixConn) {
	s.connections.Add(1)
```
AI: This Add(1) is called inside HandleConnection, which is called in a goroutine from both acceptConnections and SplitFDStreamSocket. There's a race with Stop() -> connections.Wait(): if Stop() is called between the goroutine being spawned and Add(1) executing, Wait() returns prematurely. The Add(1) should happen before the goroutine is spawned.
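The fix is mechanical: bump the WaitGroup on the accepting side, before the goroutine is spawned, so Stop()'s Wait() can never observe a zero count while a handler is still starting. A minimal sketch with illustrative names:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type server struct {
	connections sync.WaitGroup
	served      atomic.Int64
}

func (s *server) handleConnection(id int) {
	defer s.connections.Done()
	s.served.Add(1) // stand-in for real connection handling
	_ = id
}

// accept increments the WaitGroup *before* spawning each handler, closing
// the race window where Wait() returns between `go` and Add(1).
func (s *server) accept(ids []int) {
	for _, id := range ids {
		s.connections.Add(1) // before the goroutine, not inside the handler
		go s.handleConnection(id)
	}
}

func (s *server) stop() {
	s.connections.Wait() // now guaranteed to see every in-flight handler
}

func main() {
	var s server
	s.accept([]int{1, 2, 3})
	s.stop()
	fmt.Println(s.served.Load())
}
```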
```go
	xattrs[xattrKey] = base64.StdEncoding.EncodeToString([]byte(v))
}
modTime := hdr.ModTime
return minimal.FileMetadata{
```
AI: This is a near-duplicate of minimal.NewFileMetadata (in compression.go), which also handles AccessTime and ChangeTime via timeIfNotZero — those are missing here. Composefs dump format emits these times, so dumps from GenerateDumpFromTarHeaders will silently lose them. Consider reusing minimal.NewFileMetadata to avoid the duplication and the data loss.
```go
if err != nil {
	return err
}
if n == 0 {
```
AI (edited): Consider returning io.ErrUnexpectedEOF when remaining > 0 after the loop
```go
clientConn.Close()

// Initialize server if not already created
if s.jsonRPCServer == nil {
```
AI: (low) This lazy initialization appears protected by the graphLock acquired in startUsingGraphDriver. It would help to add a comment noting that graphLock serializes access here, especially since the struct has a FIXME about fields that need locking — listing jsonRPCServer under the graphLock-protected section would clarify the invariant.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Extend the store with splitfdstream capabilities exposed via a UNIX socket for JSON-RPC communication.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Implement the SplitFDStreamDriver interface for the overlay driver, enabling efficient layer operations with reflink support.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe looks like at least a rebase is needed here.
This needs to be rebased on top of #677 anyway; I'll need to rework it on top of that.
Add a new `splitfdstream` package that enables efficient container layer transfer between storage instances by separating tar metadata from file content. File content is passed as file descriptors.