Added attributes to load dataset function #40

Merged: WerWojtas merged 24 commits into master from werwojtas-features on Jan 6, 2026
@WerWojtas (Collaborator) commented Jan 6, 2026

📝 Description

Adds new parameters to the load_dataset function.
New parameters

  • split: Load only a specific split (e.g., "train", "validation", "test")
  • name: Filter files by matching a pattern in the filename/path
  • streaming: Return file paths instead of loading data into memory
  • download_mode: Control caching behavior (:reuse_dataset_if_exists, :force_redownload)
  • verification_mode: Control validation checks (:basic_checks, :no_checks)
  • num_proc: Number of parallel processes for faster loading
  • cache_dir: Custom cache directory location
  • offline: Only use cached files, no network requests
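
As an illustration of how the options above fit together, here is a minimal sketch of normalizing and validating them as an Elixir keyword list. It is not the code from this PR: the module name LoadOptsSketch, the function normalize/1, the default values, and the error tuples are all assumptions made for the example. Only the option names and the allowed atoms for download_mode and verification_mode come from the list above.

```elixir
defmodule LoadOptsSketch do
  # Hypothetical helper, not part of the library.
  # The default values below are assumptions; the PR description does not state them.
  @defaults [
    split: nil,
    name: nil,
    streaming: false,
    download_mode: :reuse_dataset_if_exists,
    verification_mode: :basic_checks,
    num_proc: 1,
    cache_dir: nil,
    offline: false
  ]

  # Allowed atoms as listed in the PR description.
  @download_modes [:reuse_dataset_if_exists, :force_redownload]
  @verification_modes [:basic_checks, :no_checks]

  def normalize(opts) when is_list(opts) do
    # Keyword.validate!/2 raises ArgumentError on unknown keys
    # and fills in the defaults for keys that were not given.
    opts = Keyword.validate!(opts, @defaults)

    cond do
      opts[:download_mode] not in @download_modes ->
        {:error, {:invalid_download_mode, opts[:download_mode]}}

      opts[:verification_mode] not in @verification_modes ->
        {:error, {:invalid_verification_mode, opts[:verification_mode]}}

      not (is_integer(opts[:num_proc]) and opts[:num_proc] > 0) ->
        {:error, {:invalid_num_proc, opts[:num_proc]}}

      true ->
        {:ok, opts}
    end
  end
end

# Usage: explicit options override the assumed defaults.
{:ok, opts} = LoadOptsSketch.normalize(split: "train", num_proc: 4)
opts[:split]
opts[:download_mode]

# An unsupported mode is reported as an error tuple instead of raising.
LoadOptsSketch.normalize(download_mode: :always)
```

Validating the whole keyword list in one place keeps typos in option names from being silently ignored, which matters when several options (for example offline and download_mode) interact.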

🎯 Type of Changes

  • ✨ New feature

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds new parameters to the load_dataset function to provide more control over dataset loading behavior, including download/cache control, verification settings, parallel processing, and streaming support.

Key Changes:

  • Added download_mode parameter to control caching behavior (:reuse_dataset_if_exists, :force_redownload)
  • Added verification_mode parameter to control validation checks (:basic_checks, :no_checks)
  • Added num_proc parameter for parallel dataset downloading and loading
  • Implemented streaming mode to progressively fetch data without full download
  • Added comprehensive tests for all new features
  • Updated documentation with examples and usage patterns

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 21 comments.

| File | Description |
| --- | --- |
| lib/elixir_datasets.ex | Implemented core filtering, parallel processing, and streaming functionality for dataset loading |
| lib/huggingface/hub.ex | Added support for download_mode and verification_mode parameters in cached download logic |
| lib/elixir_datasets/utils/loader.ex | Added parallel processing support with num_proc parameter for loading datasets |
| test/elixir_datasets_test.exs | Added extensive tests for split filtering, streaming, parallel processing, and new parameters |
| test/huggingface/hub_test.exs | Added tests for download_mode and verification_mode options |
| examples/example_1.livemd | Added comprehensive documentation and examples demonstrating all new features |

WeronikaW-REM and others added 14 commits January 6, 2026 20:40
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@WerWojtas WerWojtas merged commit 370fb2c into master Jan 6, 2026
1 check passed
@WerWojtas WerWojtas deleted the werwojtas-features branch January 6, 2026 21:07

3 participants