fix: large directory upload (100K files) #1640

Draft
fforbeck wants to merge 6 commits into main from fix/large-dir-upload

Conversation


@fforbeck fforbeck commented Apr 30, 2025

PR Summary: Fix Stack Overflow in Large Dataset Processing

Context

This PR addresses a series of stack overflow issues encountered when processing large datasets (100K+ files). Because the exact conditions that caused the "Maximum call stack size exceeded" error were difficult to reproduce locally, I took a trial-and-error approach, working closely with the user to validate each fix. I focused on the code paths most likely to cause a stack overflow, based on the error patterns and user feedback.

The fixes were implemented and tested incrementally:

  1. First, addressing the UnixFS directory processing
  2. Then the ShardedDAGIndex archive process
  3. Finally, the Index Add process

1. UnixFS Directory Processing

Problem

  • The UnixFSDirectoryBuilder in unixfs.js was using recursive directory building
  • With 100K files, this created a huge structure in memory
  • The stack overflow occurred during the finalize operation
  • This prevented the successful processing of large directory structures

Solution

  • Replaced recursive directory building with an iterative approach
  • Implemented explicit stack/queue for directory traversal
  • Process directories and files in a breadth-first manner
  • Maintain proper ordering while avoiding deep recursion
  • Added memory limits and checks during processing
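The iterative approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual `unixfs.js` code: `finalizeIteratively` and the node shape are hypothetical. The key idea is that an explicit queue replaces the call stack, so directory depth no longer grows JS stack frames:

```javascript
// Hypothetical sketch of iterative, breadth-first directory traversal.
// A recursive finalize would add one stack frame per directory level;
// here depth only grows the queue array, never the call stack.
function finalizeIteratively(root) {
  const queue = [root]
  const ordered = []
  // Walk by index instead of shift() to keep each step O(1).
  for (let i = 0; i < queue.length; i++) {
    const node = queue[i]
    ordered.push(node.name)
    for (const child of node.children ?? []) queue.push(child)
  }
  return ordered
}
```

With this shape, a directory chain 100,000 levels deep traverses fine, whereas a naive recursive walk typically overflows the V8 stack around a depth of ~10,000 frames.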

Tests

  • ✅ The user tested the upload; the process progressed past directory encoding and failed at a later stage, confirming this fix resolved the original overflow.

2. ShardedDAGIndex Archive Process

Problem

  • The archive function in sharded-dag-index.js was processing all shards and slices at once
  • With 100K files, this created large arrays in memory during sorting operations
  • The stack overflow occurred when trying to sort these large arrays
  • This prevented successful uploads of datasets larger than ~50K files

Solution

  • Implemented batch processing with configurable thresholds:
    const LARGE_DATASET_ARCHIVE_THRESHOLD = 50_000
    const ARCHIVE_BATCH_SIZE = 10_000
  • Process shards in batches of 10,000
  • Process slices within each shard in batches
  • Sort operations are performed on smaller chunks of data
  • Maintains data integrity while preventing memory issues
  • Preserves the sequential process for fewer than 50K shards
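A rough sketch of the batched sort described above (illustrative only; `sortInBatches` and `mergeSorted` are hypothetical helpers, not the actual `sharded-dag-index.js` code, though `ARCHIVE_BATCH_SIZE` matches the constant named in this PR). Entries are sorted in fixed-size batches and the sorted batches merged pairwise, so no single `sort` call operates on the full 100K+ element array:

```javascript
// Hypothetical sketch: sort large entry lists in bounded batches.
const ARCHIVE_BATCH_SIZE = 10_000

function sortInBatches(entries, compare) {
  let batches = []
  for (let i = 0; i < entries.length; i += ARCHIVE_BATCH_SIZE) {
    batches.push(entries.slice(i, i + ARCHIVE_BATCH_SIZE).sort(compare))
  }
  // Pairwise-merge sorted batches until a single sorted list remains.
  while (batches.length > 1) {
    const merged = []
    for (let i = 0; i < batches.length; i += 2) {
      merged.push(
        i + 1 < batches.length
          ? mergeSorted(batches[i], batches[i + 1], compare)
          : batches[i]
      )
    }
    batches = merged
  }
  return batches[0] ?? []
}

// Standard two-way merge of pre-sorted arrays.
function mergeSorted(a, b, compare) {
  const out = []
  let i = 0
  let j = 0
  while (i < a.length && j < b.length) {
    out.push(compare(a[i], b[j]) <= 0 ? a[i++] : b[j++])
  }
  while (i < a.length) out.push(a[i++])
  while (j < b.length) out.push(b[j++])
  return out
}
```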

Tests

  • Waiting for user feedback

3. Index Add Process

Problem

  • The add function in upload-api/src/index/add.js was processing all shard allocations concurrently
  • With 100K files, this created 100K concurrent promises
  • The JavaScript stack was overwhelmed by the number of concurrent operations
  • This caused a stack overflow during the index registration phase

Solution

  • Implemented batch processing for shard allocations:
    const ALLOCATION_BATCH_SIZE = 10_000
  • Process shard allocations in batches of 10,000
  • Use Promise.all within each batch for concurrent processing
  • Maintain proper error handling and propagation
  • Prevents stack overflow while maintaining good performance
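The batching pattern above can be sketched as follows. This is an illustration, not the actual `add.js` code: `processAllocationsInBatches` and `allocateShard` are hypothetical names standing in for the per-shard work, while `ALLOCATION_BATCH_SIZE` matches the constant named in this PR. Only one batch of promises is in flight at a time, instead of 100K concurrent operations:

```javascript
// Hypothetical sketch: cap concurrency by awaiting Promise.all per batch.
const ALLOCATION_BATCH_SIZE = 10_000

async function processAllocationsInBatches(shards, allocateShard) {
  const results = []
  for (let i = 0; i < shards.length; i += ALLOCATION_BATCH_SIZE) {
    const batch = shards.slice(i, i + ALLOCATION_BATCH_SIZE)
    // Promise.all rejects on the first failure, so errors in any batch
    // propagate to the caller unchanged.
    results.push(...(await Promise.all(batch.map(allocateShard))))
  }
  return results
}
```

Because each batch is awaited before the next starts, at most `ALLOCATION_BATCH_SIZE` promises exist concurrently, while results still come back in input order.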

Tests

  • Waiting for user feedback

@fforbeck fforbeck force-pushed the fix/large-dir-upload branch from 4696938 to 8e6cabb on April 30, 2025 12:18
@fforbeck fforbeck self-assigned this Apr 30, 2025
Comment thread packages/blob-index/src/sharded-dag-index.js Outdated
@fforbeck fforbeck force-pushed the fix/large-dir-upload branch from 0b72a0f to 5ec8fe8 on May 1, 2025 14:23
@fforbeck fforbeck requested a review from alanshaw May 2, 2025 14:07