Add Compression best practices guide#52968
Add Compression best practices guide#52968alinpahontu2912 wants to merge 2 commits intodotnet:mainfrom
Conversation
docs/standard/io/snippets/zip-tar-best-practices/csharp/Program.cs
Outdated
Show resolved
Hide resolved
docs/standard/io/snippets/zip-tar-best-practices/csharp/Program.cs
Outdated
Show resolved
Hide resolved
docs/standard/io/snippets/zip-tar-best-practices/csharp/Program.cs
Outdated
Show resolved
Hide resolved
rzikm
left a comment
There was a problem hiding this comment.
It's getting better, few additional comments.
| continue; | ||
|
|
||
| // Normalize and validate the path, same as the ZIP example. | ||
| string destPath = Path.GetFullPath(Path.Join(fullDestDir, entry.Name)); |
There was a problem hiding this comment.
Some samples use Path.Join, some Path.Combine, is there a meaningful difference between the two?
| // Create the parent directory and any missing intermediate directories. | ||
| Directory.CreateDirectory(Path.GetDirectoryName(destPath)!); | ||
| using var fileStream = File.Create(destPath); | ||
| entry.DataStream.CopyTo(fileStream); |
There was a problem hiding this comment.
What about ExtractToFile (both for Tar and Zip), is it safe for them to use them after the validation above?
| void TarStreamingRead(Stream archiveStream) | ||
| { | ||
| using var reader = new TarReader(archiveStream); | ||
| TarEntry? entry; | ||
| while ((entry = reader.GetNextEntry()) is not null) | ||
| { | ||
| if (entry.DataStream is not null) | ||
| { | ||
| string safePath = "output.bin"; | ||
| // Copy now — the stream becomes invalid after the next GetNextEntry() call | ||
| using var fileStream = File.Create(safePath); | ||
| entry.DataStream.CopyTo(fileStream); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
I am not sure what this code snippet is supposed to illustrate
| ### Convenience APIs (one-shot operations) | ||
|
|
||
| - `ZipFile.CreateFromDirectory` / `ZipFile.ExtractToDirectory`—create or extract an entire archive in one call. | ||
| - `TarFile.CreateFromDirectory` / `TarFile.ExtractToDirectory`—same for TAR. | ||
| - Best for: simple workflows with trusted input, quick scripts, build tooling. | ||
|
|
||
| ### Streaming APIs (entry-by-entry control) | ||
|
|
||
| - `ZipArchive`—open an archive, iterate entries, read or write selectively. | ||
| - `TarReader` / `TarWriter`—sequential entry-by-entry access. | ||
| - Best for: large archives, selective extraction, untrusted input, custom processing. | ||
|
|
||
| If you control the archive source (your own build output, known-safe backups), the convenience APIs are the simplest choice. If the archive comes from an external source (user uploads, downloads, network transfers), use the streaming APIs with the safety checks described in this article. | ||
|
|
There was a problem hiding this comment.
this section is missing ExtractTo* helpers on ZipArchiveEntry or TarEntry
|
|
||
| ### ZipArchiveMode.Update loads entries into memory | ||
|
|
||
| Don't use `ZipArchiveMode.Update` for large or untrusted archives. When you open a `ZipArchive` in `Update` mode and call `Open()` or `OpenAsync()` on an entry, its uncompressed data is loaded into a `MemoryStream` to support in-place modifications. Accessing entry metadata (such as `FullName`, `Length`, or `ExternalAttributes`) does not trigger decompression. For large or malicious archives, opening entry content streams can cause `OutOfMemoryException`. |
There was a problem hiding this comment.
It would help to mention here that users should check Length before opening the entry
|
|
||
| Don't use `ZipArchiveMode.Update` for large or untrusted archives. When you open a `ZipArchive` in `Update` mode and call `Open()` or `OpenAsync()` on an entry, its uncompressed data is loaded into a `MemoryStream` to support in-place modifications. Accessing entry metadata (such as `FullName`, `Length`, or `ExternalAttributes`) does not trigger decompression. For large or malicious archives, opening entry content streams can cause `OutOfMemoryException`. | ||
|
|
||
| Additionally, when you open a `ZipArchive` in `Read` mode with an **unseekable** stream (for example, a network stream), the runtime copies the entire stream into a `MemoryStream` up front to enable seeking through the central directory. |
There was a problem hiding this comment.
| Additionally, when you open a `ZipArchive` in `Read` mode with an **unseekable** stream (for example, a network stream), the runtime copies the entire stream into a `MemoryStream` up front to enable seeking through the central directory. | |
| Additionally, when you open a `ZipArchive` in `Read` mode with an **unseekable** stream (for example, a network stream), the runtime buffers the entire archive contents in memory to enable seeking through the central directory. |
There was a problem hiding this comment.
Also, this sentence is in ZipArchiveMode.Update loads entries into memory subsection, but talks about Read mode, so some restructuring is needed (maybe remove the subsection heading?)
| - **ZIP:** Unix permissions are stored in the upper 16 bits of `ExternalAttributes`. When extracting on Unix via `ExtractToDirectory` or `ExtractToFile`, the runtime restores ownership permissions (read/write/execute for user/group/other), subject to the process umask. SetUID, SetGID, and StickyBit are stripped. Permissions are not applied if the upper bits are zero. This happens when the ZIP was created on Windows, because the Windows runtime sets `DefaultFileExternalAttributes` to `0`. On Windows, these attributes are always ignored during extraction. | ||
| - **TAR:** The `TarEntry.Mode` property represents `UnixFileMode` and can store all 12 permission bits (read/write/execute for user/group/other, plus SetUID, SetGID, and StickyBit). When extracting on Unix via `ExtractToDirectory` or `ExtractToFile`, the runtime applies only the 9 ownership bits (rwx for user/group/other), subject to the process umask. SetUID, SetGID, and StickyBit are stripped for security. | ||
|
|
||
| When processing untrusted archives, validate `TarEntry.Mode` before extracting. An archive could set executable permissions on files that should not be executable. |
There was a problem hiding this comment.
The examples above don't touch/apply Mode at all, that is probably related to the lack of mention of ExtractToFile helpers in the earlier parts of the article.
There was a problem hiding this comment.
An archive could set executable permissions on files that should not be executable.
That sentence is weird, the archive author clearly may have wanted some files to be executable, so who really decides that the files should not be executable?
I would reword it simply as a warning along the lines that untrusted archives may contain malicious executable files
|
|
||
| ### File name sanitization differs by platform | ||
|
|
||
| On Windows, when using `ExtractToDirectory`, the runtime replaces control characters and ``"*:<>?|`` with underscores in entry names. On Unix, only null characters are replaced. Archive entries with names like `file:name.txt` are renamed to `file_name.txt` on Windows but extracted as-is on Unix. The per-entry APIs (`Open()`, `ExtractToFile()`) do not perform any name sanitization. |
There was a problem hiding this comment.
The per-entry APIs (
Open(),ExtractToFile()) do not perform any name sanitization.
Do you think that we should mention entry.ExtractToFile(entry.Name) is safe only if user properly validates Name/Mode/...?
|
|
||
| ## Data integrity | ||
|
|
||
| ZIP entries include a CRC-32 checksum that you can use to verify data hasn't been corrupted or tampered with. |
There was a problem hiding this comment.
I think you should also mention TAR CRC in the first paragraph, now users might assume that this section is Zip-only and skip the rest of it.
Summary
Add a guide explaining how to best work with Zip and Tar archives in .NET.
Internal previews