Parallel dataset downloads #66

Draft
jhamon wants to merge 4 commits into main from cursor/SDK-326-parallel-dataset-downloads-43fd


jhamon commented Feb 3, 2026

Implement parallel downloading for multi-file datasets to significantly improve loading performance.

Previously, datasets with multiple parquet files were downloaded serially, leading to long load times (e.g., 70-80 seconds for 10 files). This change uses a ThreadPoolExecutor to download files concurrently, reducing load times by 3-4x and making better use of network bandwidth. The number of parallel workers is configurable, existing error handling is preserved, and files are still returned in the correct order.
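For readers who want a concrete picture of the approach, here is a minimal sketch of the pattern described above, not the code in this PR. The names `load_all`, `fetch`, and `fake_fetch` are hypothetical, and the sketch keeps results in order by remembering each file's position at submit time, whereas the commits below describe extracting an index from the filename.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def load_all(paths: list[str], fetch: Callable[[str], bytes],
             max_workers: int = 4) -> list[bytes]:
    """Fetch every path and return the payloads in the order of `paths`."""
    # Serial path: a pool gains nothing for a single file or a single worker.
    if len(paths) <= 1 or max_workers <= 1:
        return [fetch(p) for p in paths]

    results = {}
    with ThreadPoolExecutor(max_workers=min(max_workers, len(paths))) as pool:
        # Remember each file's position so completion order does not matter.
        futures = {pool.submit(fetch, p): i for i, p in enumerate(paths)}
        for future in as_completed(futures):
            # result() re-raises any exception raised in the worker thread.
            results[futures[future]] = future.result()
    return [results[i] for i in range(len(paths))]


def fake_fetch(path: str) -> bytes:
    """Stand-in for a real download; the first file is deliberately slowest."""
    time.sleep(0.05 if path.endswith("0") else 0.0)
    return path.encode()


print(load_all(["f0", "f1", "f2"], fake_fetch))  # [b'f0', b'f1', b'f2']
```

Threads suit this workload because downloads are I/O-bound, so the workers spend most of their time waiting on the network rather than competing for the interpreter.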


Linear Issue: SDK-326


cursoragent and others added 2 commits February 3, 2026 19:38
- Add max_parallel_downloads configuration to Cache class (default: 4)
- Extract download logic into _download_and_read_parquet helper function
- Implement parallel downloads using ThreadPoolExecutor
- Maintain file ordering by extracting index from filenames
- Keep serial processing for single files or when max_workers=1
- Update progress bar to track completion of parallel downloads

Co-authored-by: jhamon <jhamon@pinecone.io>

- Add unit tests for parallel download configuration
- Test file index extraction from filenames
- Verify correct ordering of files after parallel downloads
- Test both serial and parallel processing paths
- Verify max_workers limits are respected
- All 8 new tests pass

Co-authored-by: jhamon <jhamon@pinecone.io>
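As an illustration of the "extract index from filenames" idea mentioned in these commits, the following is a sketch under assumptions rather than the helper added by this PR. The filename scheme (a run of digits immediately before `.parquet`) and the names `file_index` and `sort_by_index` are hypothetical. It shows why a numeric key is needed: a plain string sort would place `part-10` before `part-2`.

```python
import re

# Assumed naming scheme: digits directly before the ".parquet" suffix.
_INDEX_RE = re.compile(r"(\d+)(?=\.parquet$)")


def file_index(name: str) -> int:
    """Return the numeric index embedded in a parquet filename."""
    match = _INDEX_RE.search(name)
    if match is None:
        raise ValueError(f"no numeric index found in {name!r}")
    return int(match.group(1))


def sort_by_index(names: list[str]) -> list[str]:
    """Order filenames by their numeric index rather than lexicographically."""
    return sorted(names, key=file_index)


names = ["part-10.parquet", "part-2.parquet", "part-1.parquet"]
print(sort_by_index(names))
# ['part-1.parquet', 'part-2.parquet', 'part-10.parquet']
```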
cursor bot commented Feb 3, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.

cursoragent and others added 2 commits February 3, 2026 19:45
Co-authored-by: jhamon <jhamon@pinecone.io>

- Remove unused imports (Mock, call, pytest)
- Fix import ordering
- Replace typing.Tuple with builtin tuple for Python 3.10+ compatibility
- Update .gitignore to exclude __pycache__ directories

Co-authored-by: jhamon <jhamon@pinecone.io>
cursor bot force-pushed the cursor/SDK-326-parallel-dataset-downloads-43fd branch from 59a671d to 2f65aa5 on February 3, 2026 19:48