
Mixture of datasets #138

Open
m1kush wants to merge 19 commits into main from mixture-of-datasets

Conversation


@m1kush m1kush commented Jan 1, 2026

This pull request refactors the dataset configuration system to support mixtures of datasets and standardizes the way tokenization functions are specified across all configuration files. It introduces a new dataloader that can handle multiple datasets with configurable weights, updates all relevant config files to use this new system, and replaces custom tokenization functions with a unified, parameterized tokenizer configuration.

Key changes include:

Multi-dataset support and dataloader refactor

  • Replaced the single-dataset dataloader configuration with a new get_mixture_of_datasets_dataloader, allowing specification of multiple datasets and their weights for both training and evaluation. This affects the default.yaml, c4.yaml, fineweb.yaml, and local_dummy.yaml configs.
  • Added a new example configuration smollm_corpus.yaml demonstrating the use of multiple datasets with different weights and a specific tokenizer.
  • Introduced a test configuration dataset_mixture_test.yaml to verify the new dataset mixture functionality.
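A config in the new format might look roughly like the sketch below. This is illustrative only: the exact key names (datasets, weight, tokenize_fn, model_name), the Hydra-style _target_ convention, and the dataset identifiers are all assumptions, since the actual contents of default.yaml and smollm_corpus.yaml are not shown in this description.

```yaml
# Hypothetical sketch of a dataset-mixture config in the style this PR
# introduces. All key names and values here are guesses, not the real schema.
dataloader:
  _target_: core.datasets.get_mixture_of_datasets_dataloader
  datasets:
    - name: HuggingFaceFW/fineweb-edu
      weight: 0.7
    - name: allenai/c4
      weight: 0.3
  tokenize_fn:
    _target_: core.datasets.get_tokenize_fn
    model_name: HuggingFaceTB/SmolLM-135M
  batch_size: 32
  sequence_length: 2048
```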

Tokenization function standardization

  • Updated all model and project configs to use the new get_tokenize_fn function with explicit model_name parameters, replacing previous custom or model-specific tokenization functions. This change affects all pc_project configs, as well as the Llama and SmolLM configs.
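The factory pattern described here can be sketched as follows. This is a minimal illustration, not the real src/core/datasets.py implementation: the load_tokenizer indirection and the max_length parameter are assumptions added to keep the sketch self-contained and testable.

```python
from typing import Callable, Dict, List

# Hypothetical sketch of a get_tokenize_fn factory in the spirit of this PR.
# The real function's signature is not shown in the PR description; the
# injectable loader and truncation behavior here are illustrative guesses.
def get_tokenize_fn(
    model_name: str,
    load_tokenizer: Callable[[str], Callable[[str], List[int]]],
    max_length: int = 2048,
) -> Callable[[Dict[str, str]], Dict[str, List[int]]]:
    """Return a tokenize function bound to one model's tokenizer."""
    # Load the tokenizer once; the returned closure reuses it per example.
    tokenizer = load_tokenizer(model_name)

    def tokenize_fn(example: Dict[str, str]) -> Dict[str, List[int]]:
        ids = tokenizer(example["text"])[:max_length]  # truncate to max_length
        return {"input_ids": ids}

    return tokenize_fn
```

With Hugging Face models, the loader would presumably wrap something like AutoTokenizer.from_pretrained(model_name).encode, but that wiring is a guess at how the real code resolves model_name.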

Backward compatibility and migration

  • Updated all existing dataset configurations to use the new multi-dataset format, ensuring backward compatibility and a smooth migration path for future extensions.

These changes make the configuration system more flexible, easier to maintain, and ready for more complex training setups involving multiple datasets and unified tokenization logic.

Copilot AI review requested due to automatic review settings January 1, 2026 15:06

Copilot AI left a comment


Pull request overview

This pull request refactors the dataset configuration system to support training on mixtures of multiple datasets with configurable weights. It replaces dataset-specific tokenization functions with a unified get_tokenize_fn factory function and introduces a new MixtureOfDatasets class that can sample from multiple datasets according to specified proportions.

Key changes:

  • Unified tokenization through get_tokenize_fn with model-specific configuration
  • New MixtureOfDatasets class enabling weighted sampling from multiple datasets
  • Updated all configuration files to use the new get_mixture_of_datasets_dataloader interface
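Weighted sampling of the kind the review describes can be sketched in a few lines. This is not the real MixtureOfDatasets from src/core/datasets.py: the constructor signature, the seeding, and the exhaustion handling (here, cycling each dataset forever) are assumptions made to keep the example small and runnable.

```python
import random
from itertools import cycle
from typing import Iterable, Iterator, Sequence

# Hypothetical sketch of a MixtureOfDatasets that yields examples from
# several datasets in proportion to their weights.
class MixtureOfDatasets:
    def __init__(self, datasets: Sequence[Iterable], weights: Sequence[float], seed: int = 0):
        if len(datasets) != len(weights):
            raise ValueError("need exactly one weight per dataset")
        total = sum(weights)
        self._weights = [w / total for w in weights]   # normalize to proportions
        self._iters = [iter(cycle(d)) for d in datasets]  # restart exhausted datasets
        self._rng = random.Random(seed)

    def __iter__(self) -> Iterator:
        while True:
            # Pick a source dataset according to the mixture weights,
            # then yield its next example.
            i = self._rng.choices(range(len(self._iters)), weights=self._weights)[0]
            yield next(self._iters[i])
```

For example, drawing 1000 items from a two-dataset mixture with weights [0.8, 0.2] should yield roughly 80% of the items from the first dataset.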

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 16 comments.

Reviewed files:

  • src/tests/test_tokenize_fn.py: new test file validating tokenization behavior across different models (GPT-2, Llama, SmolLM)
  • src/tests/__init__.py: new __init__.py for the tests package
  • src/core/datasets.py: core refactor; removed model-specific tokenize functions, added the get_tokenize_fn factory, renamed AbstractDataset to GenericDataset, removed FineWebEduDataset and C4Dataset, added the MixtureOfDatasets class and the get_mixture_of_datasets_dataloader function
  • configs/pc_project/*.yaml: updated 14 project configs to use get_tokenize_fn with explicit model_name parameters
  • configs/_trainer/llama.yaml: updated trainer config to use the unified tokenization function
  • configs/_dataset/default.yaml: changed from get_dataloader to get_mixture_of_datasets_dataloader with an expanded parameter set
  • configs/_dataset/c4.yaml: updated to the new mixture format with multiple datasets and weights
  • configs/_dataset/fineweb.yaml: updated to the new mixture format (has a structural issue)
  • configs/_dataset/local_dummy.yaml: updated to the new mixture format for local testing
  • configs/_dataset/smollm_corpus.yaml: new example config demonstrating a multi-dataset mixture of 4 datasets
  • configs/dataset_mixture_test.yaml: new test configuration for validating the dataset mixture functionality


madragonse previously approved these changes Mar 6, 2026

@madragonse madragonse left a comment


lgtm + maybe apply some of the LLM reviewer's suggestions?

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@m1kush m1kush marked this pull request as draft March 18, 2026 21:53
@m1kush m1kush marked this pull request as ready for review March 18, 2026 22:37

m1kush commented Mar 18, 2026

@madragonse fixed

@madragonse madragonse self-requested a review March 20, 2026 12:34

@madragonse madragonse left a comment


lgtm, also needs PR #169 to log experiments to wandb
