Parallel dataset downloads by jhamon · Pull Request #66 · pinecone-io/pinecone-datasets

jhamon · 2026-02-03T19:44:16Z

Implement parallel downloading for multi-file datasets to significantly improve loading performance.

Previously, datasets with multiple parquet files were downloaded serially, leading to long load times (e.g., 70-80 seconds for 10 files). This change uses ThreadPoolExecutor to download files concurrently, cutting load times by 3-4x and making better use of network bandwidth. It adds a configurable number of parallel workers, preserves the existing error handling, and keeps files in the correct order.

Linear Issue: SDK-326

- Add max_parallel_downloads configuration to Cache class (default: 4)
- Extract download logic into _download_and_read_parquet helper function
- Implement parallel downloads using ThreadPoolExecutor
- Maintain file ordering by extracting index from filenames
- Keep serial processing for single files or when max_workers=1
- Update progress bar to track completion of parallel downloads

Co-authored-by: jhamon <jhamon@pinecone.io>
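The approach in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `extract_file_index`, `load_all`, and the `read_one` callback are hypothetical names, and the filename pattern (index = last run of digits in the file name) is an assumption. Only `ThreadPoolExecutor` and `as_completed` are real standard-library APIs.

```python
import re
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_file_index(path: str) -> int:
    """Hypothetical helper: use the last run of digits in the file name as its index."""
    name = path.rsplit("/", 1)[-1]
    digits = re.findall(r"\d+", name)
    return int(digits[-1]) if digits else 0


def load_all(paths, read_one, max_parallel_downloads=4):
    """Read every path with read_one and return results ordered by file index.

    read_one stands in for a helper such as _download_and_read_parquet;
    its real signature is not shown in the PR description.
    """
    # Serial path: a single file, or parallelism disabled. Input order is kept.
    if len(paths) <= 1 or max_parallel_downloads <= 1:
        return [read_one(p) for p in paths]

    indexed = []
    with ThreadPoolExecutor(max_workers=max_parallel_downloads) as pool:
        futures = {pool.submit(read_one, p): p for p in paths}
        # Futures finish in arbitrary order, so remember each file's index.
        for future in as_completed(futures):
            path = futures[future]
            # result() re-raises any exception raised inside the worker.
            indexed.append((extract_file_index(path), future.result()))

    # Restore a deterministic order regardless of completion order.
    indexed.sort(key=lambda pair: pair[0])
    return [result for _, result in indexed]


if __name__ == "__main__":
    demo_paths = [f"s3://bucket/ds/part-{i}.parquet" for i in (2, 0, 1)]
    print(load_all(demo_paths, lambda p: p.rsplit("/", 1)[-1]))
```

Sorting after the fact keeps the worker function unaware of ordering. A progress bar could be advanced once per iteration of the `as_completed` loop.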

- Add unit tests for parallel download configuration
- Test file index extraction from filenames
- Verify correct ordering of files after parallel downloads
- Test both serial and parallel processing paths
- Verify max_workers limits are respected
- All 8 new tests pass

Co-authored-by: jhamon <jhamon@pinecone.io>
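One way a worker-limit check like the one listed above could be written is to count how many tasks run at the same time. The sketch below is an assumption about the general technique, not the tests in this PR; `ConcurrencyProbe` and `peak_concurrency` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Hypothetical fake download that records the peak number of concurrent calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, item):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.02)  # simulate network latency so calls can overlap
        with self._lock:
            self._active -= 1
        return item


def peak_concurrency(n_tasks, max_workers):
    """Run n_tasks fake downloads and return (peak concurrency, results)."""
    probe = ConcurrencyProbe()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, whatever the completion order.
        results = list(pool.map(probe, range(n_tasks)))
    return probe.peak, results


if __name__ == "__main__":
    peak, results = peak_concurrency(n_tasks=10, max_workers=3)
    assert peak <= 3
    assert results == list(range(10))
```

Asserting `peak <= max_workers` rather than equality keeps the check from depending on thread scheduling.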

cursor · 2026-02-03T19:44:17Z

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.


cursoragent and others added 2 commits February 3, 2026 19:38

cursoragent and others added 2 commits February 3, 2026 19:45

style: apply ruff formatting to parallel download files

fa021c1

Co-authored-by: jhamon <jhamon@pinecone.io>

fix: resolve linting issues

2f65aa5

- Remove unused imports (Mock, call, pytest)
- Fix import ordering
- Replace typing.Tuple with builtin tuple for Python 3.10+ compatibility
- Update .gitignore to exclude __pycache__ directories

Co-authored-by: jhamon <jhamon@pinecone.io>

cursor bot force-pushed the cursor/SDK-326-parallel-dataset-downloads-43fd branch from 59a671d to 2f65aa5 Compare February 3, 2026 19:48

jhamon wants to merge 4 commits into main from cursor/SDK-326-parallel-dataset-downloads-43fd
