
Mixture of datasets #138

Open
m1kush wants to merge 19 commits into main from mixture-of-datasets

Conversation


@m1kush m1kush commented Jan 1, 2026

This pull request refactors the dataset configuration system to support mixtures of datasets and standardizes the way tokenization functions are specified across all configuration files. It introduces a new dataloader that can handle multiple datasets with configurable weights, updates all relevant config files to use this new system, and replaces custom tokenization functions with a unified, parameterized tokenizer configuration.

Key changes include:

Multi-dataset support and dataloader refactor

  • Replaced the single-dataset dataloader configuration with a new get_mixture_of_datasets_dataloader, allowing specification of multiple datasets and their weights for both training and evaluation. This affects the default.yaml, c4.yaml, fineweb.yaml, and local_dummy.yaml configs.
  • Added a new example configuration smollm_corpus.yaml demonstrating the use of multiple datasets with different weights and a specific tokenizer.
  • Introduced a test configuration dataset_mixture_test.yaml to verify the new dataset mixture functionality.
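A config in the new format might look roughly like the sketch below. This is illustrative only: the exact key names (datasets, weight, tokenize_fn, model_name), the Hydra-style _target_ convention, and the dataset identifiers are all assumptions, since the actual contents of default.yaml and smollm_corpus.yaml are not shown in this description.

```yaml
# Hypothetical sketch of a dataset-mixture config in the style this PR
# introduces. All key names and values here are guesses, not the real schema.
dataloader:
  _target_: core.datasets.get_mixture_of_datasets_dataloader
  datasets:
    - name: HuggingFaceFW/fineweb-edu
      weight: 0.7
    - name: allenai/c4
      weight: 0.3
  tokenize_fn:
    _target_: core.datasets.get_tokenize_fn
    model_name: HuggingFaceTB/SmolLM-135M
  batch_size: 32
  sequence_length: 2048
```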

Tokenization function standardization

  • Updated all model and project configs to use the new get_tokenize_fn function with explicit model_name parameters, replacing previous custom or model-specific tokenization functions. This change affects all pc_project configs, as well as the Llama and SmolLM configs.
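The factory pattern described here can be sketched as follows. This is a minimal illustration, not the real src/core/datasets.py implementation: the load_tokenizer indirection and the max_length parameter are assumptions added to keep the sketch self-contained and testable.

```python
from typing import Callable, Dict, List

# Hypothetical sketch of a get_tokenize_fn factory in the spirit of this PR.
# The real function's signature is not shown in the PR description; the
# injectable loader and truncation behavior here are illustrative guesses.
def get_tokenize_fn(
    model_name: str,
    load_tokenizer: Callable[[str], Callable[[str], List[int]]],
    max_length: int = 2048,
) -> Callable[[Dict[str, str]], Dict[str, List[int]]]:
    """Return a tokenize function bound to one model's tokenizer."""
    # Load the tokenizer once; the returned closure reuses it per example.
    tokenizer = load_tokenizer(model_name)

    def tokenize_fn(example: Dict[str, str]) -> Dict[str, List[int]]:
        ids = tokenizer(example["text"])[:max_length]  # truncate to max_length
        return {"input_ids": ids}

    return tokenize_fn
```

With Hugging Face models, the loader would presumably wrap something like AutoTokenizer.from_pretrained(model_name).encode, but that wiring is a guess at how the real code resolves model_name.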

Backward compatibility and migration

  • Updated all existing dataset configurations to use the new multi-dataset format, ensuring backward compatibility and a smooth migration path for future extensions.

These changes make the configuration system more flexible, easier to maintain, and ready for more complex training setups involving multiple datasets and unified tokenization logic.

Copilot AI review requested due to automatic review settings January 1, 2026 15:06

Copilot AI left a comment


Pull request overview

This pull request refactors the dataset configuration system to support training on mixtures of multiple datasets with configurable weights. It replaces dataset-specific tokenization functions with a unified get_tokenize_fn factory function and introduces a new MixtureOfDatasets class that can sample from multiple datasets according to specified proportions.

Key changes:

  • Unified tokenization through get_tokenize_fn with model-specific configuration
  • New MixtureOfDatasets class enabling weighted sampling from multiple datasets
  • Updated all configuration files to use the new get_mixture_of_datasets_dataloader interface
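Weighted sampling of the kind the review describes can be sketched in a few lines. This is not the real MixtureOfDatasets from src/core/datasets.py: the constructor signature, the seeding, and the exhaustion handling (here, cycling each dataset forever) are assumptions made to keep the example small and runnable.

```python
import random
from itertools import cycle
from typing import Iterable, Iterator, Sequence

# Hypothetical sketch of a MixtureOfDatasets that yields examples from
# several datasets in proportion to their weights.
class MixtureOfDatasets:
    def __init__(self, datasets: Sequence[Iterable], weights: Sequence[float], seed: int = 0):
        if len(datasets) != len(weights):
            raise ValueError("need exactly one weight per dataset")
        total = sum(weights)
        self._weights = [w / total for w in weights]   # normalize to proportions
        self._iters = [iter(cycle(d)) for d in datasets]  # restart exhausted datasets
        self._rng = random.Random(seed)

    def __iter__(self) -> Iterator:
        while True:
            # Pick a source dataset according to the mixture weights,
            # then yield its next example.
            i = self._rng.choices(range(len(self._iters)), weights=self._weights)[0]
            yield next(self._iters[i])
```

For example, drawing 1000 items from a two-dataset mixture with weights [0.8, 0.2] should yield roughly 80% of the items from the first dataset.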

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 16 comments.

Reviewed files:

  • src/tests/test_tokenize_fn.py: new test file validating tokenization behavior across different models (GPT-2, Llama, SmolLM)
  • src/tests/__init__.py: new __init__.py for the tests package
  • src/core/datasets.py: core refactor; removed model-specific tokenize functions, added the get_tokenize_fn factory, renamed AbstractDataset to GenericDataset, removed FineWebEduDataset and C4Dataset, added the MixtureOfDatasets class and the get_mixture_of_datasets_dataloader function
  • configs/pc_project/*.yaml: updated 14 project configs to use get_tokenize_fn with explicit model_name parameters
  • configs/_trainer/llama.yaml: updated trainer config to use the unified tokenization function
  • configs/_dataset/default.yaml: changed from get_dataloader to get_mixture_of_datasets_dataloader with an expanded parameter set
  • configs/_dataset/c4.yaml: updated to the new mixture format with multiple datasets and weights
  • configs/_dataset/fineweb.yaml: updated to the new mixture format (has a structural issue)
  • configs/_dataset/local_dummy.yaml: updated to the new mixture format for local testing
  • configs/_dataset/smollm_corpus.yaml: new example config demonstrating a multi-dataset mixture of 4 datasets
  • configs/dataset_mixture_test.yaml: new test configuration for validating the dataset mixture functionality


madragonse previously approved these changes Mar 6, 2026

@madragonse madragonse left a comment


lgtm + maybe apply some of the LLM reviewer's suggestions?

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@m1kush m1kush marked this pull request as draft March 18, 2026 21:53
@m1kush m1kush marked this pull request as ready for review March 18, 2026 22:37

m1kush commented Mar 18, 2026

@madragonse fixed

@madragonse madragonse self-requested a review March 20, 2026 12:34

@madragonse madragonse left a comment


lgtm, also needs PR #169 to log experiments to wandb
