Open
…refactor dataset classes
…mproved consistency
…oader and standardize tokenize_fn across models
…ataloader settings
Pull request overview
This pull request refactors the dataset configuration system to support training on mixtures of multiple datasets with configurable weights. It replaces dataset-specific tokenization functions with a unified get_tokenize_fn factory function and introduces a new MixtureOfDatasets class that can sample from multiple datasets according to specified proportions.
Key changes:
- Unified tokenization through
get_tokenize_fnwith model-specific configuration - New
MixtureOfDatasetsclass enabling weighted sampling from multiple datasets - Updated all configuration files to use the new
get_mixture_of_datasets_dataloaderinterface
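The weighted-sampling idea behind the new class can be sketched as follows. This is a minimal illustration, not the PR's implementation: the class name MixtureOfDatasets comes from the PR, but the constructor signature, seeding, and exhaustion policy below are assumptions.

```python
import random
from typing import Iterator, Sequence


class MixtureOfDatasets:
    """Sample items from several iterable datasets according to weights.

    Minimal sketch: the real class in src/core/datasets.py may differ in
    signature, seeding, and exhaustion behavior.
    """

    def __init__(self, datasets: Sequence[Iterator], weights: Sequence[float], seed: int = 0):
        if len(datasets) != len(weights):
            raise ValueError("need exactly one weight per dataset")
        total = sum(weights)
        self.datasets = list(datasets)
        self.probs = [w / total for w in weights]  # normalize weights to proportions
        self.rng = random.Random(seed)

    def __iter__(self):
        while True:
            # pick a source dataset with probability proportional to its weight
            idx = self.rng.choices(range(len(self.datasets)), weights=self.probs, k=1)[0]
            try:
                yield next(self.datasets[idx])
            except StopIteration:
                # one possible policy: stop as soon as any source runs dry
                return
```

With weights (0.7, 0.3), roughly 70% of the yielded items would come from the first dataset; normalizing in the constructor means the configured weights need not sum to 1.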
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| src/tests/test_tokenize_fn.py | New test file validating tokenization behavior across different models (GPT-2, Llama, SmolLM) |
| src/tests/__init__.py | New __init__.py file marking the tests package |
| src/core/datasets.py | Core refactor: removed model-specific tokenize functions, added get_tokenize_fn factory, renamed AbstractDataset to GenericDataset, removed FineWebEduDataset and C4Dataset, added MixtureOfDatasets class and get_mixture_of_datasets_dataloader function |
| configs/pc_project/*.yaml | Updated 14 project configs to use get_tokenize_fn with explicit model_name parameters |
| configs/_trainer/llama.yaml | Updated trainer config to use unified tokenization function |
| configs/_dataset/default.yaml | Changed from get_dataloader to get_mixture_of_datasets_dataloader with expanded parameter set |
| configs/_dataset/c4.yaml | Updated to new mixture format with multiple datasets and weights |
| configs/_dataset/fineweb.yaml | Updated to new mixture format (has structural issue) |
| configs/_dataset/local_dummy.yaml | Updated to new mixture format for local testing |
| configs/_dataset/smollm_corpus.yaml | New example config demonstrating multi-dataset mixture with 4 datasets |
| configs/dataset_mixture_test.yaml | New test configuration for validating dataset mixture functionality |
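The factory pattern implied by get_tokenize_fn (a single function returning a model-configured tokenizer, replacing per-model tokenize functions) might look roughly like this. The function name and the model_name parameter appear in the PR; the body, the toy vocabulary lookup, and the max_length default are illustrative assumptions.

```python
from typing import Callable, Dict, List

# Toy vocabularies standing in for real tokenizers; the actual factory
# presumably loads a model-specific tokenizer (GPT-2, Llama, SmolLM).
_VOCABS: Dict[str, Dict[str, int]] = {
    "gpt2": {"hello": 0, "world": 1},
}


def get_tokenize_fn(model_name: str, max_length: int = 512) -> Callable[[str], List[int]]:
    """Return a tokenize function configured for the given model name."""
    vocab = _VOCABS.get(model_name)
    if vocab is None:
        raise ValueError(f"unknown model: {model_name}")

    def tokenize_fn(text: str) -> List[int]:
        # whitespace split is a toy stand-in for real subword tokenization;
        # unknown tokens map to a single out-of-vocabulary id
        ids = [vocab.get(token, len(vocab)) for token in text.split()]
        return ids[:max_length]

    return tokenize_fn
```

The closure captures the model configuration once, so configs only need to name the factory and pass model_name, as the updated YAML files do.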
madragonse previously approved these changes · Mar 6, 2026
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…tting in datasets.py and test_tokenize_fn.py
…datasets_dataloader for improved clarity
…ey are provided and not empty
Contributor · Author
@madragonse fixed
madragonse (Collaborator) reviewed on Mar 20, 2026 and left a comment:
lgtm, also needs PR #169 to log experiments to wandb
This pull request refactors the dataset configuration system to support mixtures of datasets and standardizes the way tokenization functions are specified across all configuration files. It introduces a new dataloader that can handle multiple datasets with configurable weights, updates all relevant config files to use this new system, and replaces custom tokenization functions with a unified, parameterized tokenizer configuration.
Key changes include:
Multi-dataset support and dataloader refactor

- Introduced get_mixture_of_datasets_dataloader, allowing specification of multiple datasets and their weights for both training and evaluation. This affects the default.yaml, c4.yaml, fineweb.yaml, and local_dummy.yaml configs. [1] [2] [3] [4]
- Added a new example config, smollm_corpus.yaml, demonstrating the use of multiple datasets with different weights and a specific tokenizer.
- Added a test config, dataset_mixture_test.yaml, to verify the new dataset mixture functionality.

Tokenization function standardization

- Standardized all configs on a unified get_tokenize_fn function with explicit model_name parameters, replacing previous custom or model-specific tokenization functions. This change affects all pc_project configs, as well as the Llama and SmolLM configs. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14]

Backward compatibility and migration
These changes make the configuration system more flexible, easier to maintain, and ready for more complex training setups involving multiple datasets and unified tokenization logic.
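To make the new interface concrete, here is a hedged sketch of how a dataloader over a weighted mixture might wire datasets, weights, and a tokenize function into batches. The name get_mixture_of_datasets_dataloader comes from the PR; the parameters, batching policy, and stopping behavior shown here are assumptions, not the repository's actual implementation.

```python
import random
from typing import Callable, Iterator, List, Sequence


def get_mixture_of_datasets_dataloader(
    datasets: Sequence[List[str]],
    weights: Sequence[float],
    tokenize_fn: Callable[[str], list],
    batch_size: int = 2,
    seed: int = 0,
) -> Iterator[list]:
    """Yield batches of tokenized samples drawn from a weighted dataset mixture."""
    rng = random.Random(seed)
    total = sum(weights)
    probs = [w / total for w in weights]  # normalize weights to proportions
    iters = [iter(d) for d in datasets]
    batch: list = []
    while True:
        # choose which dataset to draw the next sample from
        idx = rng.choices(range(len(iters)), weights=probs, k=1)[0]
        try:
            batch.append(tokenize_fn(next(iters[idx])))
        except StopIteration:
            break  # simplest policy: stop when the chosen source is exhausted
        if len(batch) == batch_size:
            yield batch
            batch = []
```

A config in the new format would then just name this function, list the dataset sources with their weights, and point tokenize_fn at the get_tokenize_fn factory.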