feat: add parallel downloading for multi-file datasets #67

Open
jhamon wants to merge 4 commits into main from jhamon/sdk-326-add-parallel-downloading-for-multi-file-datasets
Conversation

jhamon (Contributor) commented Feb 3, 2026

Problem

Datasets with multiple parquet files download serially (one at a time), creating a significant performance bottleneck. For datasets with 10+ files, total load time can exceed 70-80 seconds when each file takes 7-8 seconds to download.

Observed Performance:

  • 10 files × 7-8s each = 70-80 seconds total
  • Network bandwidth underutilized
  • Poor user experience for large datasets

Solution

Implemented parallel downloading using Python's ThreadPoolExecutor to download multiple parquet files simultaneously.

Changes

Core Implementation

pinecone_datasets/cfg.py:

  • Added max_parallel_downloads configuration (default: 4)
  • Configurable via PINECONE_DATASETS_MAX_PARALLEL_DOWNLOADS environment variable
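This is not the actual `cfg.py` code, but a minimal sketch of the described behavior: read the worker count from the environment variable named above, fall back to the default of 4 on a missing or malformed value, and never go below 1. The function and constant names are hypothetical.

```python
import os

DEFAULT_MAX_PARALLEL_DOWNLOADS = 4

def get_max_parallel_downloads() -> int:
    """Read the worker count from the environment, defaulting to 4."""
    raw = os.environ.get("PINECONE_DATASETS_MAX_PARALLEL_DOWNLOADS")
    if raw is None:
        return DEFAULT_MAX_PARALLEL_DOWNLOADS
    try:
        value = int(raw)
    except ValueError:
        # A garbage value should not crash dataset loading.
        return DEFAULT_MAX_PARALLEL_DOWNLOADS
    return max(1, value)  # clamp so max_workers is always valid
```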

pinecone_datasets/dataset_fsreader.py:

  • Added _download_and_read_parquet() helper function to encapsulate download logic
  • Implemented parallel download with ThreadPoolExecutor
  • Automatic fallback to serial execution for single files or when max_workers=1
  • Maintains all existing error handling and retry logic
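The loading strategy above can be sketched as follows. This is not the SDK's actual code: `read_one` stands in for the real `_download_and_read_parquet` helper, and `pool.map` (which returns results in submission order) is used here rather than `as_completed`.

```python
from concurrent.futures import ThreadPoolExecutor

def load_parquet_files(paths, read_one, max_workers=4):
    # Serial fallback: a thread pool adds overhead for one file or one worker.
    if len(paths) <= 1 or max_workers <= 1:
        return [read_one(p) for p in paths]
    workers = min(max_workers, len(paths))  # cap worker count at file count
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, keeping file order stable.
        return list(pool.map(read_one, paths))
```

In a real reader, `read_one` would download to the cache and return a DataFrame, and the results would then be concatenated.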

pinecone_datasets/cache.py and pinecone_datasets/fs.py:

  • Added show_progress parameter to control progress bar display
  • Disables individual file progress bars during parallel downloads to reduce visual clutter
  • Outer "Loading documents" progress bar tracks overall file completion

Testing

tests/unit/test_parallel_downloads.py:

  • 7 new comprehensive tests covering:
    • Parallel execution with multiple files
    • Serial fallback for single files
    • Serial execution when max_workers=1
    • Error handling in parallel context
    • Worker count capping
    • Helper function behavior

Test Results:

  • ✅ All 197 tests pass (190 existing + 7 new)
  • ✅ Backward compatible
  • ✅ No breaking changes

Documentation

README.md:

  • Added "Download Performance" section documenting:
    • Local caching behavior
    • Parallel download feature
    • Progress feedback
    • Resumable downloads
    • Configuration options (environment variables and programmatic API)

Performance Impact

Before: 10 files × 7-8s = 70-80 seconds
After: ~20-25 seconds (3-4× faster)

Better network bandwidth utilization makes large dataset loading significantly faster.
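The claimed numbers check out under a simple idealized model: with 4 workers, 10 equal-cost files complete in ceil(10 / 4) = 3 "waves" of 7-8 s each, i.e. roughly 21-24 s, matching the observed ~20-25 s. A sketch of that back-of-envelope model (assuming equal file times and no bandwidth contention):

```python
import math

def estimated_total_seconds(n_files, per_file_s, workers):
    # Files complete in "waves" of `workers` at a time, under the idealized
    # assumption that every file takes the same time to download.
    return math.ceil(n_files / workers) * per_file_s
```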

Example Output

Loading documents: 100%|████████| 10/10 [00:23<00:00, 2.31s/it]

Instead of:

Loading documents:  50%|████    | 5/10 [00:38<00:37, 7.59s/it]

Configuration

Users can control parallel download behavior:

# Increase parallelism for faster downloads
export PINECONE_DATASETS_MAX_PARALLEL_DOWNLOADS=8

# Disable parallelism (serial downloads)
export PINECONE_DATASETS_MAX_PARALLEL_DOWNLOADS=1
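The exact shape of the programmatic API mentioned in the README isn't shown in this description, but setting the environment variable from Python before the library reads its configuration is an equivalent approach:

```python
import os

# Must be set before pinecone_datasets reads its configuration.
os.environ["PINECONE_DATASETS_MAX_PARALLEL_DOWNLOADS"] = "8"
```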

Related

  • Closes SDK-326
  • Part of SDK-319 (Download Progress, Resumable Downloads, and Dataset Rebuilding)
  • Builds on SDK-320 (caching) and SDK-321 (progress feedback)

Made with Cursor


Note

Medium Risk
Touches dataset loading/caching paths and introduces concurrency; risks include nondeterministic file ordering, increased memory/IO pressure, and new edge cases around error propagation/progress display.

Overview
Speeds up loading datasets composed of many parquet shards by downloading/reading shards in parallel via ThreadPoolExecutor, with an automatic serial fallback for single-file datasets or when concurrency is set to 1.

Adds Cache.max_parallel_downloads (env PINECONE_DATASETS_MAX_PARALLEL_DOWNLOADS) and threads this through caching by introducing a show_progress flag on get_cached_path/CacheManager.get_cached_path so per-file progress bars can be suppressed during parallel runs in favor of an overall "Loading …" progress bar.

Updates docs to describe caching/parallelism configuration, adds unit coverage for parallel download behavior, and includes a manual test_download_progress.py script.

Written by Cursor Bugbot for commit a6d2ce3. This will update automatically on new commits.

jhamon and others added 3 commits February 3, 2026 14:41
Implement parallel downloads using ThreadPoolExecutor to significantly
improve loading performance for datasets with multiple parquet files.

Changes:
- Add max_parallel_downloads configuration to cfg.py (default: 4)
- Extract _download_and_read_parquet helper function
- Implement parallel download logic with ThreadPoolExecutor
- Add show_progress parameter to control progress bar display
- Update get_cached_path to support disabling progress during parallel ops
- Fix test mock to handle new show_progress parameter

Performance:
- 3-4× faster for multi-file datasets (70s → 20-25s for 10 files)
- Automatically falls back to serial for single files
- Configurable via PINECONE_DATASETS_MAX_PARALLEL_DOWNLOADS env var

Testing:
- All 190 tests pass (3 skipped)
- Maintains backward compatibility
- Preserves all error handling and retry logic

Related to SDK-326

Co-authored-by: Cursor <cursoragent@cursor.com>
jhamon added the enhancement (New feature or request) label Feb 3, 2026
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

except Exception as e:
path = future_to_path[future]
logger.error(f"Failed to download {path}: {e}")
raise
Parallel downloads lose deterministic file ordering

Medium Severity

The parallel download implementation uses as_completed() which yields futures in completion order rather than submission order. This causes dataframes to be appended to dfs in an arbitrary order based on which downloads finish first, rather than in the original file order from glob. For datasets split across multiple parquet files (e.g., part-0.parquet, part-1.parquet), this silently changes the row ordering in the final concatenated DataFrame compared to the serial execution path.
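One way to fix the ordering issue described above is to submit futures in file order and collect results by index, rather than appending in `as_completed()` order. A sketch (not the SDK's code; `read_one` stands in for the real download-and-read helper):

```python
from concurrent.futures import ThreadPoolExecutor

def read_in_order(paths, read_one, max_workers=4):
    results = [None] * len(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit in file order and remember each future's original index.
        future_to_index = {pool.submit(read_one, p): i
                           for i, p in enumerate(paths)}
        for future, i in future_to_index.items():
            results[i] = future.result()  # re-raises any download error
    return results
```

Downloads still overlap, but the final list (and thus the concatenated DataFrame) matches the glob order of the serial path.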


else:
full_path = path
# Download to cache and read from local path
local_path = get_cached_path(full_path, fs, show_progress=False)
Serial downloads lose byte-level progress feedback

Low Severity

The _download_and_read_parquet helper hardcodes show_progress=False when calling get_cached_path. This suppresses byte-level download progress for both parallel AND serial downloads. While correct for parallel mode (avoiding visual clutter), this causes a regression for serial downloads where users previously saw byte-level progress. For single-file datasets or when max_parallel_downloads=1, users now only see the outer file-level tqdm which provides no meaningful feedback during a long download.
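A minimal fix for this regression is to thread the parallelism decision into the helper instead of hardcoding `show_progress=False`. In this sketch, `get_cached_path_fn` stands in for the real `get_cached_path`, and the `parallel` flag is hypothetical:

```python
def download_and_read(path, get_cached_path_fn, parallel=False):
    # Keep byte-level progress for serial downloads; suppress it only when
    # several files download at once and their bars would interleave.
    return get_cached_path_fn(path, show_progress=not parallel)
```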

Additional Locations (1)

