Skip to content

Add Compression best practices guide#52968

Draft
alinpahontu2912 wants to merge 2 commits intodotnet:mainfrom
alinpahontu2912:zip_tar_bestpractices
Draft

Add Compression best practices guide#52968
alinpahontu2912 wants to merge 2 commits intodotnet:mainfrom
alinpahontu2912:zip_tar_bestpractices

Conversation

@alinpahontu2912
Copy link
Copy Markdown
Member

@alinpahontu2912 alinpahontu2912 commented Apr 10, 2026

Summary

Add a guide explaining how to best work with Zip and Tar archives in .NET.


Internal previews

📄 File 🔗 Preview link
docs/fundamentals/toc.yml docs/fundamentals/toc
docs/standard/io/zip-tar-best-practices.md Best practices for working with ZIP and TAR archives in .NET

Copy link
Copy Markdown
Member

@rzikm rzikm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's getting better, few additional comments.

continue;

// Normalize and validate the path, same as the ZIP example.
string destPath = Path.GetFullPath(Path.Join(fullDestDir, entry.Name));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some samples use Path.Join, some Path.Combine, is there a meaningful difference between the two?

Comment on lines +189 to +192
// Create the parent directory and any missing intermediate directories.
Directory.CreateDirectory(Path.GetDirectoryName(destPath)!);
using var fileStream = File.Create(destPath);
entry.DataStream.CopyTo(fileStream);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about ExtractToFile (both for Tar and Zip), is it safe for them to use them after the validation above?

Comment on lines +243 to +257
void TarStreamingRead(Stream archiveStream)
{
using var reader = new TarReader(archiveStream);
TarEntry? entry;
while ((entry = reader.GetNextEntry()) is not null)
{
if (entry.DataStream is not null)
{
string safePath = "output.bin";
// Copy now — the stream becomes invalid after the next GetNextEntry() call
using var fileStream = File.Create(safePath);
entry.DataStream.CopyTo(fileStream);
}
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what this code snippet is supposed to illustrate

Comment on lines +32 to +45
### Convenience APIs (one-shot operations)

- `ZipFile.CreateFromDirectory` / `ZipFile.ExtractToDirectory`—create or extract an entire archive in one call.
- `TarFile.CreateFromDirectory` / `TarFile.ExtractToDirectory`—same for TAR.
- Best for: simple workflows with trusted input, quick scripts, build tooling.

### Streaming APIs (entry-by-entry control)

- `ZipArchive`—open an archive, iterate entries, read or write selectively.
- `TarReader` / `TarWriter`—sequential entry-by-entry access.
- Best for: large archives, selective extraction, untrusted input, custom processing.

If you control the archive source (your own build output, known-safe backups), the convenience APIs are the simplest choice. If the archive comes from an external source (user uploads, downloads, network transfers), use the streaming APIs with the safety checks described in this article.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this section is missing ExtractTo* helpers on ZipArchiveEntry or TarEntry


### ZipArchiveMode.Update loads entries into memory

Don't use `ZipArchiveMode.Update` for large or untrusted archives. When you open a `ZipArchive` in `Update` mode and call `Open()` or `OpenAsync()` on an entry, its uncompressed data is loaded into a `MemoryStream` to support in-place modifications. Accessing entry metadata (such as `FullName`, `Length`, or `ExternalAttributes`) does not trigger decompression. For large or malicious archives, opening entry content streams can cause `OutOfMemoryException`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help to mention here that users should check Length before opening the entry


Don't use `ZipArchiveMode.Update` for large or untrusted archives. When you open a `ZipArchive` in `Update` mode and call `Open()` or `OpenAsync()` on an entry, its uncompressed data is loaded into a `MemoryStream` to support in-place modifications. Accessing entry metadata (such as `FullName`, `Length`, or `ExternalAttributes`) does not trigger decompression. For large or malicious archives, opening entry content streams can cause `OutOfMemoryException`.

Additionally, when you open a `ZipArchive` in `Read` mode with an **unseekable** stream (for example, a network stream), the runtime copies the entire stream into a `MemoryStream` up front to enable seeking through the central directory.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Additionally, when you open a `ZipArchive` in `Read` mode with an **unseekable** stream (for example, a network stream), the runtime copies the entire stream into a `MemoryStream` up front to enable seeking through the central directory.
Additionally, when you open a `ZipArchive` in `Read` mode with an **unseekable** stream (for example, a network stream), the runtime buffers the entire archive contents in memory to enable seeking through the central directory.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this sentence is in ZipArchiveMode.Update loads entries into memory subsection, but talks about Read mode, so some restructuring is needed (maybe remove the subsection heading?)

- **ZIP:** Unix permissions are stored in the upper 16 bits of `ExternalAttributes`. When extracting on Unix via `ExtractToDirectory` or `ExtractToFile`, the runtime restores ownership permissions (read/write/execute for user/group/other), subject to the process umask. SetUID, SetGID, and StickyBit are stripped. Permissions are not applied if the upper bits are zero. This happens when the ZIP was created on Windows, because the Windows runtime sets `DefaultFileExternalAttributes` to `0`. On Windows, these attributes are always ignored during extraction.
- **TAR:** The `TarEntry.Mode` property represents `UnixFileMode` and can store all 12 permission bits (read/write/execute for user/group/other, plus SetUID, SetGID, and StickyBit). When extracting on Unix via `ExtractToDirectory` or `ExtractToFile`, the runtime applies only the 9 ownership bits (rwx for user/group/other), subject to the process umask. SetUID, SetGID, and StickyBit are stripped for security.

When processing untrusted archives, validate `TarEntry.Mode` before extracting. An archive could set executable permissions on files that should not be executable.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The examples above don't touch/apply Mode at all, that is probably related to the lack of mention of ExtractToFile helpers in the earlier parts of the article.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An archive could set executable permissions on files that should not be executable.

That sentence is weird, the archive author clearly may have wanted some files to be executable, so who really decides that the files should not be executable?

I would reword it simply as a warning along the lines that untrusted archives may contain malicious executable files


### File name sanitization differs by platform

On Windows, when using `ExtractToDirectory`, the runtime replaces control characters and ``"*:<>?|`` with underscores in entry names. On Unix, only null characters are replaced. Archive entries with names like `file:name.txt` are renamed to `file_name.txt` on Windows but extracted as-is on Unix. The per-entry APIs (`Open()`, `ExtractToFile()`) do not perform any name sanitization.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The per-entry APIs (Open(), ExtractToFile()) do not perform any name sanitization.

Do you think that we should mention entry.ExtractToFile(entry.Name) is safe only if user properly validates Name/Mode/...?


## Data integrity

ZIP entries include a CRC-32 checksum that you can use to verify data hasn't been corrupted or tampered with.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should also mention TAR CRC in the first paragraph, now users might assume that this section is Zip-only and skip the rest of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants