chore(deps): update dependency datasets to v4 - autoclosed#105

Closed
dreadnode-renovate-bot[bot] wants to merge 1 commit into main from renovate/datasets-4.x

dreadnode-renovate-bot[bot] commented on Jul 16, 2025

This PR contains the following updates:

| Package  | Change                               | Age | Confidence |
| -------- | ------------------------------------ | --- | ---------- |
| datasets | `>=3.5.0,<4.0.0` -> `>=4.0.0,<4.1.0` | age | confidence |
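
To make the effect of the constraint change concrete, here is a minimal sketch that checks a few plain `X.Y.Z` versions against the old and new ranges. The helper names `parse_version` and `in_range` are hypothetical, and the parsing is deliberately simplified (no pre-release or post-release handling); a real project would normally rely on its package manager's resolver instead.

```python
def parse_version(text: str) -> tuple:
    """Turn a plain 'X.Y.Z' string into a tuple of ints (simplified parsing)."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Return True when lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Ranges taken from the table above: (inclusive lower bound, exclusive upper bound).
OLD_RANGE = ("3.5.0", "4.0.0")
NEW_RANGE = ("4.0.0", "4.1.0")

for candidate in ("3.6.0", "4.0.0", "4.1.0"):
    print(
        candidate,
        "old range:", in_range(candidate, *OLD_RANGE),
        "new range:", in_range(candidate, *NEW_RANGE),
    )
```

Under these assumptions, 3.6.0 satisfies only the old range, 4.0.0 satisfies only the new one, and 4.1.0 satisfies neither.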

Release Notes

huggingface/datasets (datasets)

v4.0.0

Compare Source

New Features

Build streaming data pipelines in a few lines of code!

```python
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```


* Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

```python
# Faster push to Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
```

```python
# Syntax:
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```

* Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
  - Enables streaming only the ranges you need!

```python
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

  - Requires `torch>=2.7.0` and FFmpeg >= 4
  - Not available for Windows yet, but it is coming soon. In the meantime, please use `datasets<4.0`
  - Load audio data with `AudioDecoder`:

```python
audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
```

  - Load video data with `VideoDecoder`:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```
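
For code that has to handle both the decoder-style access and the older dict-style access shown in the release notes, a small duck-typed helper is one option. The sketch below is an assumption-laden illustration: `audio_samples_and_rate`, `FakeSamples`, and `FakeDecoder` are hypothetical names, and the stand-in classes only mimic the `get_all_samples()`, `.data`, and `.sample_rate` members quoted above so the example runs without `datasets`, `torch`, or `torchcodec` installed.

```python
from typing import Any, Tuple


def audio_samples_and_rate(audio: Any) -> Tuple[Any, int]:
    """Return (samples, sampling_rate) for either access style.

    Hypothetical helper; not part of datasets or torchcodec.
    """
    if hasattr(audio, "get_all_samples"):
        # Decoder-style object, as described in the release notes for 4.0.0.
        samples = audio.get_all_samples()
        return samples.data, samples.sample_rate
    # Dict-style value, matching the older syntax.
    return audio["array"], audio["sampling_rate"]


# Stand-ins that imitate only the members used above (assumed shapes and values).
class FakeSamples:
    data = [[0.0, 0.25, -0.25]]
    sample_rate = 16000


class FakeDecoder:
    def get_all_samples(self) -> FakeSamples:
        return FakeSamples()


legacy_audio = {"array": [0.0, 0.25, -0.25], "sampling_rate": 16000}

for value in (FakeDecoder(), legacy_audio):
    samples, rate = audio_samples_and_rate(value)
    print(type(value).__name__, rate)
```

The two branches may return differently typed or shaped sample containers, so downstream code should not assume they are interchangeable without checking.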

Breaking changes

  • Remove scripts altogether by @lhoestq in #7592

    • trust_remote_code is no longer supported
  • Torchcodec decoding by @TyTodd in #7616

    • torchcodec replaces soundfile for audio decoding
    • torchcodec replaces decord for video decoding
  • Replace Sequence by List by @lhoestq in #7634

    • Introduction of the List type
    from datasets import Features, List, Value
    
    features = Features({
        "texts": List(Value("string")),
        "four_paragraphs": List(Value("string"), length=4)
    })
    • Sequence was a legacy type from TensorFlow Datasets that converted a list of dicts into a dict of lists. It is no longer a type; it is now a utility that returns a List or a dict depending on the subfeature
    from datasets import Sequence, Value
    
    Sequence(Value("string"))  # List(Value("string"))
    Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
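
The "list of dicts into a dict of lists" layout change mentioned for the legacy Sequence type can be pictured in plain Python. This is only a sketch of the data layout, not the library's implementation; the function name `list_of_dicts_to_dict_of_lists` is hypothetical, and it assumes every row has the same keys.

```python
from typing import Any, Dict, List


def list_of_dicts_to_dict_of_lists(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Regroup row-wise records into column-wise lists (assumes uniform keys)."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


# Row-wise layout: one dict per item.
rows = [
    {"text": "first", "label": 0},
    {"text": "second", "label": 1},
]

# Column-wise layout: one list per key.
columns = list_of_dicts_to_dict_of_lists(rows)
print(columns)
```

Code that relied on the column-wise layout produced by a Sequence of dicts is the kind of code worth re-checking when moving to the List type.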

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.6.0...4.0.0


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

dreadnode-renovate-bot[bot] added the area/python and type/digest labels on Jul 16, 2025
dreadnode-renovate-bot[bot] force-pushed the renovate/datasets-4.x branch 2 times, most recently from 167adf0 to 85840e1 on July 27, 2025 00:23
dreadnode-renovate-bot[bot] force-pushed the renovate/datasets-4.x branch 2 times, most recently from 9ac5d0c to 567e8bb on August 24, 2025 00:21
| datasource | package  | from  | to    |
| ---------- | -------- | ----- | ----- |
| pypi       | datasets | 3.6.0 | 4.0.0 |
dreadnode-renovate-bot[bot] changed the title from "chore(deps): update dependency datasets to v4" to "chore(deps): update dependency datasets to v4 - autoclosed" on Sep 7, 2025
dreadnode-renovate-bot[bot] deleted the renovate/datasets-4.x branch on September 7, 2025 00:21
Labels

area/python: Changes to Python package configuration and dependencies
type/digest: Dependency digest updates
