Skip to content

feat: bounded-memory file uploads for huge files (1GB-100GB+)#20

Merged
mickvandijke merged 6 commits intoWithAutonomi:mainfrom
grumbach:feat/huge-file-uploads
Mar 31, 2026
Merged

feat: bounded-memory file uploads for huge files (1GB-100GB+)#20
mickvandijke merged 6 commits intoWithAutonomi:mainfrom
grumbach:feat/huge-file-uploads

Conversation

@grumbach
Copy link
Copy Markdown
Contributor

Summary

  • Add ChunkSpill mechanism: encrypted chunks are written to a temp directory during encryption, keeping only 32-byte addresses in memory. Chunks are read back one wave (64 chunks) at a time for upload, bounding peak memory to ~256 MB regardless of file size.
  • Add disk space pre-check before upload (InsufficientDiskSpace error) using fs2::available_space with integer arithmetic for the 10% headroom calculation.
  • Refactor file_prepare_upload (external-signer path) to use spill for encryption, with honest docs that PreparedUpload still holds all chunks in memory (inherent to the two-phase protocol).
  • Fix prepared[0] direct indexing in batch_pay — replaced with iterator .sum().
  • Fix inline paths (evmlib::contract::..., xor_name::XorName) — moved to top-level use imports.
  • Fix mutex poison handling in spawn_file_encryption — 3 sites now use unwrap_or_else(PoisonError::into_inner) instead of silently skipping.
  • Fix swallowed errors: datamap_tx.send() and temp file cleanup now log warnings.
  • Fix temp file leak on rename failure in file_download.
  • Harden spill dir creation: create_dir (not create_dir_all) to reject pre-existing directories.

Proven test results

File Size Chunks RSS Before RSS After RSS Increase Integrity
64 MB 20 All bytes match
1 GB 260 58 MB 244 MB 185 MB All 1,073,741,824 bytes match
4 GB 1029 59 MB 191 MB 132 MB All 4,294,967,296 bytes match

Memory does not scale with file size — 4 GB uses less RSS increase (132 MB) than 1 GB (185 MB), proving the wave-based spill mechanism works.

E2E tests (e2e_huge_file.rs)

  • test_huge_file_upload_download_1gb — upload, download, byte-verify 1 GB, assert RSS < 512 MB
  • test_huge_file_upload_download_4gb — same for 4 GB with tighter chunk count assertion (>= 1000)
  • test_large_file_64mb_round_trip — fast CI-friendly 64 MB round-trip
  • test_file_upload_disk_space_check — verifies pre-flight disk space check
  • Uses seeded xorshift64 PRNG for incompressible test data (verified via Brotli: 0% compression)
  • Uses mach_task_basic_info.resident_size on macOS (current RSS, not peak) — panics on measurement failure on supported platforms

Review history

3 review rounds, 12 agents total (6 Claude Opus + 6 Codex gpt-5.4 xhigh). Final round verdict: zero remaining production code violations across all project rules (no unwrap/expect/panic, no direct indexing, no inline paths, no swallowed errors).

Test plan

  • cfd (cargo fmt + clippy) passes with zero warnings
  • All 113 unit tests pass
  • 1 GB E2E upload/download round-trip with memory assertion
  • 4 GB E2E upload/download round-trip with memory assertion
  • 64 MB round-trip (CI-friendly)
  • Disk space pre-check test

Add ChunkSpill mechanism that writes encrypted chunks to a temp
directory during encryption, keeping only 32-byte addresses in memory.
Chunks are read back one wave (64 chunks) at a time for upload,
bounding peak RSS to ~256 MB regardless of file size.

Key changes:
- ChunkSpill struct: temp-dir-based buffer with Drop cleanup
- Disk space pre-check before upload (fs2::available_space)
- InsufficientDiskSpace error variant
- file_prepare_upload refactored to use spill (with honest docs
  that PreparedUpload still holds all chunks in memory)
- Wave-level progress logging during encryption and upload
- Fixed direct indexing (prepared[0]) in batch_pay
- Fixed inline paths and swallowed errors per review feedback
- Integer arithmetic for disk space headroom (no float precision loss)

E2E tests proving bounded memory:
- 1 GB: 260 chunks, RSS increase 185 MB (limit 512 MB)
- 4 GB: 1029 chunks, RSS increase 132 MB (limit 512 MB)
- Both verified byte-for-byte after round-trip
- Uses mach_task_basic_info (current RSS, not peak) on macOS

Memory does NOT scale with file size — 4 GB uses less RSS increase
than 1 GB, proving the wave-based spill mechanism works correctly.
…rity

Fixes from multi-agent review (2 Claude Opus + 3 Codex gpt-5.4 xhigh):

- Use unwrap_or_else(PoisonError::into_inner) on all 3 mutex lock sites
  in spawn_file_encryption instead of silently skipping on poison. A
  poisoned mutex previously caused read errors to be silently dropped,
  which could produce a truncated DataMap.

- Move xor_name::XorName to top-level use import (was inline in
  file_download closure).

- Revert merkle chunk limit guard: gas savings from merkle payments far
  outweigh the O(n) proof memory overhead (~86 KB per chunk from
  ML-DSA-65 signatures in candidate pools). For a 100 GB file (~25k
  chunks), proof memory is ~2 GB which is acceptable.
Fixes from Codex gpt-5.4 xhigh production review:

- batch.rs: move evmlib::contract::payment_vault::get_market_price and
  evmlib::wallet::PayForQuotesError to top-level use imports (PROD-006)

- file.rs: ChunkSpill::new() now uses create_dir instead of
  create_dir_all, so creation fails if the directory already exists.
  This prevents silent reuse of a stale/symlinked spill directory
  (PROD-007)

Not addressed (architectural, separate PRs):
- PROD-001/002/003: Resumable uploads after partial payment failure
- PROD-004: CancellationToken threading through upload APIs
Fixes from Codex gpt-5.4 xhigh test review:

- Replace repeating 256-byte pattern with seeded xorshift64 PRNG for
  incompressible test data. Repeating patterns compress to tiny chunks,
  making the wave buffer test pass trivially without stressing the
  ~256 MB wave buffer. Incompressible data produces full-size ~4 MB
  encrypted chunks. (TEST-003)

- RSS measurement failure now panics on macOS/Linux instead of silently
  passing. This prevents CI from quietly skipping the bounded-memory
  assertion if the FFI probe ever drifts. (TEST-002)

- Tighten 4 GB chunk count assertion from >= 800 to >= 1000 (actual
  expected: ~1026 chunks at MAX_CHUNK_SIZE). (TEST-004)
If std::fs::rename fails after the download temp file is fully written,
the previous code used ? which exited without cleanup, leaking the temp
file. Now explicitly handles the rename error, removes the temp file,
and then returns the error.

Found by Codex gpt-5.4 xhigh review round 3 (R3-001).
Copy link
Copy Markdown
Contributor

@mickvandijke mickvandijke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Duplicate chunk addresses cause double-upload and inflated metrics

ChunkSpill::push() (file.rs:73-79) uses compute_address(content) as the filename. If two chunks have identical content (which self-encryption can produce for repeated data), the second fs::write silently overwrites the first file on disk, but both addresses are appended to self.addresses. Consequences:

  1. total_bytes double-counts — inflates avg_chunk_size(), which feeds into quoting metrics via pay_for_merkle_batch
  2. Wave uploads read and re-upload the same chunk twice — wasted bandwidth and gas

The old code had the same latent issue (duplicate Bytes in the vec), but the new code makes it slightly worse because total_bytes is now used for the avg_chunk_size quoting metric.

Suggested fix: Deduplicate in push() — e.g. track seen addresses in a HashSet and skip the write + total_bytes increment if the address already exists (still preserve the entry in addresses if ordering matters for the DataMap, or skip it entirely if duplicates are redundant on the network).

Alternatively, document that this is acceptable because self-encryption rarely produces duplicates for real (incompressible) files.

Copy link
Copy Markdown
Contributor

@mickvandijke mickvandijke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!

ChunkSpill::push() now tracks seen addresses in a HashSet and skips
the write, total_bytes increment, and address recording for duplicate
content. This prevents:

- Double-uploading the same chunk (wasted bandwidth and gas)
- Inflated avg_chunk_size() feeding into quoting metrics
- Inflated chunk count in progress logging

Self-encryption rarely produces duplicate chunks for incompressible
data, but repeated/patterned content could trigger this.
@mickvandijke mickvandijke merged commit 72bd4e4 into WithAutonomi:main Mar 31, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants