A lightweight language model pretraining framework using vanilla PyTorch and HuggingFace Accelerate for distributed training.
- Vanilla PyTorch: Pure PyTorch implementation without heavy frameworks
- HuggingFace Accelerate: Seamless distributed training support (multi-GPU, mixed precision)
- Hydra Configuration: YAML-based configuration with composable model presets
- Mixed Precision Training: Built-in bf16 support for faster training
- Gradient Accumulation: Efficient training with large effective batch sizes
- Dataset Packing: Efficient sequence packing to maximize token utilization
- Automatic Logging: GPU metrics and training stats via trackio
- Custom Tokenizers: Support for custom tokenizer training and usage
- Checkpoint Management: Top-K checkpoint saving based on validation loss
- Evaluation Benchmarks: Built-in support for HumanEval, IFEVAL, and MMLU
The model configuration currently supports only the Qwen3 LLM architecture. The framework provides predefined model size presets under configs/model/:
| Preset | Layers | Hidden Size | FFN Size | Head Dim | Heads |
|---|---|---|---|---|---|
| qwen_tiny | 4 | 768 | 1024 | 64 | 4 |
| qwen_small | 8 | 768 | 1024 | 64 | 8 |
| qwen_medium | 15 | 768 | 1024 | 64 | 8 |
| qwen_large | 20 | 768 | 1024 | 128 | 8 |
| qwen_xlarge | 20 | 768 | 1024 | 64 | 8 |
All presets use Qwen/Qwen3-0.6B as the base model config reference and support customizable vocabulary size via custom tokenizers.
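To get a rough sense of how the presets compare in size, the parameter count can be estimated from the table columns. The sketch below is an approximation under stated assumptions, not the framework's own accounting: it assumes a Qwen3-style decoder layer (bias-free attention projections, per-head query/key norms, a gated MLP with three projections) and tied input/output embeddings. The `kv_heads` and `vocab_size` values are hypothetical, since the table does not list them and the vocabulary depends on your tokenizer.

```python
def estimate_qwen3_params(layers, hidden, ffn, head_dim, heads,
                          kv_heads, vocab_size, tie_embeddings=True):
    """Approximate parameter count for a Qwen3-style decoder (assumed layout)."""
    # Attention: q_proj and o_proj use all heads; k_proj and v_proj use kv_heads.
    attn = hidden * heads * head_dim          # q_proj
    attn += 2 * hidden * kv_heads * head_dim  # k_proj + v_proj
    attn += heads * head_dim * hidden         # o_proj
    # Gated MLP: gate_proj, up_proj, down_proj.
    mlp = 3 * hidden * ffn
    # Two RMSNorms over the hidden size plus q/k norms over the head dimension.
    norms = 2 * hidden + 2 * head_dim
    per_layer = attn + mlp + norms

    embeddings = vocab_size * hidden
    lm_head = 0 if tie_embeddings else vocab_size * hidden
    final_norm = hidden
    return layers * per_layer + embeddings + lm_head + final_norm


if __name__ == "__main__":
    # Values from the preset table; kv_heads and vocab_size are assumptions.
    presets = {
        "qwen_tiny": dict(layers=4, hidden=768, ffn=1024, head_dim=64, heads=4),
        "qwen_small": dict(layers=8, hidden=768, ffn=1024, head_dim=64, heads=8),
    }
    for name, cfg in presets.items():
        total = estimate_qwen3_params(kv_heads=cfg["heads"], vocab_size=32_000, **cfg)
        print(f"{name}: ~{total / 1e6:.1f}M parameters")
```

With a small hidden size, the embedding table accounts for a large share of the total, so the vocabulary size of a custom tokenizer has a noticeable effect on model size.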
pretrain/
├── configs/
│ ├── train.yaml # Main training config (Hydra)
│ └── model/ # Model size presets
│ ├── qwen_tiny.yaml
│ ├── qwen_small.yaml
│ ├── qwen_medium.yaml
│ ├── qwen_large.yaml
│ └── qwen_xlarge.yaml
├── src/pretrain/
│ ├── config.py # Training configuration dataclass
│ ├── trainer.py # Training loops and utilities
│ ├── model.py # Data loading and model definitions
│ ├── checkpoint/ # Checkpoint management (top-K saving)
│ └── evaluation/ # Benchmark evaluation (HumanEval, IFEVAL, MMLU)
├── main.py # Entry point for training
├── get_data.py # Data download script
├── tokenize_data.py # Tokenization script
├── pack_data.py # Dataset packing script
├── launch.sh # Accelerate launch helper
├── data/ # Raw data storage
├── tokenized_data/ # Preprocessed tokenized datasets
└── tokenizer/ # Custom tokenizer files
Training is configured via Hydra YAML files. The main config is configs/train.yaml, which composes a model preset via defaults:
defaults:
- model: qwen_small # or qwen_tiny, qwen_medium, qwen_large, qwen_xlarge
tokenizer_path: "tokenizer/"
# Training
learning_rate: 1e-4
batch_size: 16
num_epochs: 1
max_grad_norm: 1.0
# Optimizer
warmup_steps: 500
# Validation
val_check_interval: 1000
val_size: 15000
# Data
max_seq_length: 512
use_packed_data: true
# Accelerate
mixed_precision: "bf16"
gradient_accumulation_steps: 1
# Checkpointing
save_top_k: 3
save_every_n_steps: 500
# Logging
project_name: "scaling-laws"
auto_log_gpu: true
You can override any parameter from the command line:
python main.py batch_size=32 learning_rate=3e-4 model=qwen_medium
Download and prepare your training data:
python get_data.py
This downloads bilingual Greek-English text data from HuggingFace Hub.
Tokenize your dataset:
python tokenize_data.py
Pack tokenized sequences for efficient training (eliminates padding waste):
python pack_data.py
Run training with your configuration:
# Single GPU
python main.py
# Multi-GPU with Accelerate
accelerate launch main.py
# With config overrides
python main.py model=qwen_large learning_rate=3e-4
# Configure accelerate (first time)
accelerate config
To launch distributed training across multiple GPUs:
# Launch on all available GPUs
accelerate launch --multi_gpu main.py
# Launch on specific number of GPUs
accelerate launch --num_processes 4 main.py
Training uses bfloat16 by default for faster computation and lower memory usage.
Gradient accumulation simulates larger batch sizes without out-of-memory (OOM) errors:
gradient_accumulation_steps=4 # Effective batch_size = 16 * 4 = 64
A linear warmup scheduler stabilizes the start of training:
warmup_steps=500
Validation runs automatically during training:
val_check_interval=1000 # Run validation every 1000 steps
Dataset packing combines multiple examples into fixed-length sequences to achieve near 100% token utilization (no padding waste). Enable with use_packed_data: true.
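The packing idea can be sketched in a few lines. This is a minimal illustration, not the logic in pack_data.py: it assumes examples are joined with an end-of-sequence token and that any leftover tokens shorter than one block are dropped. The function name `pack_sequences` and the `eos_token_id` parameter are hypothetical.

```python
from typing import Iterable, List


def pack_sequences(examples: Iterable[List[int]], max_seq_length: int,
                   eos_token_id: int) -> List[List[int]]:
    """Concatenate tokenized examples and cut them into fixed-length blocks."""
    buffer: List[int] = []
    packed: List[List[int]] = []
    for input_ids in examples:
        # Append the example followed by a separator so boundaries stay visible.
        buffer.extend(input_ids)
        buffer.append(eos_token_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= max_seq_length:
            packed.append(buffer[:max_seq_length])
            buffer = buffer[max_seq_length:]
    # Assumption: the trailing partial block is discarded rather than padded.
    return packed


if __name__ == "__main__":
    blocks = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]],
                            max_seq_length=4, eos_token_id=0)
    print(blocks)
```

Every emitted block has exactly max_seq_length tokens, so no positions are spent on padding; the trade-off is that a single block may contain pieces of several examples.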
The top-K best checkpoints, ranked by validation loss, are saved automatically. Older or worse checkpoints are removed to save disk space.
GPU utilization, memory, and power are logged automatically via trackio.
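As a rough illustration of top-K retention, the sketch below tracks checkpoints by validation loss and reports which paths should be deleted. It is a sketch under assumptions, not the code in src/pretrain/checkpoint/: the class name `TopKTracker` and its `update` method are hypothetical, and the actual file deletion is left to the caller.

```python
from typing import List, Tuple


class TopKTracker:
    """Keep the k checkpoints with the lowest validation loss."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.kept: List[Tuple[float, str]] = []  # (val_loss, path), best first

    def update(self, val_loss: float, path: str) -> List[str]:
        """Register a checkpoint and return the paths that fell out of the top k."""
        self.kept.append((val_loss, path))
        # Stable sort by loss: on ties, the earlier checkpoint ranks first.
        self.kept.sort(key=lambda item: item[0])
        evicted = [p for _, p in self.kept[self.k:]]
        self.kept = self.kept[:self.k]
        return evicted


if __name__ == "__main__":
    tracker = TopKTracker(k=3)  # mirrors save_top_k: 3
    for step, loss in [(1000, 3.2), (2000, 2.9), (3000, 3.0), (4000, 2.7)]:
        to_delete = tracker.update(loss, f"checkpoints/step_{step}")
        print(step, to_delete)
```

Note that a new checkpoint that is worse than all retained ones is itself returned for deletion, so the caller can skip writing it or remove it right away.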
The framework includes built-in evaluation benchmarks, configurable in the YAML config:
- HumanEval: Code generation benchmark
- IFEVAL: Instruction-following evaluation
- MMLU: Multiple-choice knowledge assessment
Each benchmark supports configurable sample count, temperature, and max generation tokens. Results are saved to eval_results/.
Training metrics are logged using trackio, which provides:
- Training loss, learning rate, tokens processed
- Validation loss at regular intervals
- GPU utilization, memory usage, and power consumption
- Web-based dashboard for visualization
Metrics logged:
- train_loss: Training loss per batch
- val_loss: Validation loss
- learning_rate: Current learning rate
- tokens_passed: Total tokens processed
- GPU metrics (when auto_log_gpu: true)
The project expects tokenized datasets in HuggingFace Datasets format with:
- input_ids: Tokenized sequences
- Train/test splits stored in tokenized_data/train_data/
- Packed data support for configurable sequence lengths (via max_seq_length, default 512)
- Optimizer: AdamW with beta=(0.9, 0.95), weight_decay=0.1, eps=1e-10
- Scheduler: Linear warmup
- Gradient Clipping: Max norm of 1.0
- Sequence Length: Configurable via max_seq_length (default 512)
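The linear warmup listed above can be expressed as a multiplier on the base learning rate. The sketch below is one plausible form, assuming the rate rises from zero to its full value over warmup_steps and stays constant afterwards; the source does not say what happens after warmup, so that part is an assumption. The function names are hypothetical.

```python
def linear_warmup_factor(step: int, warmup_steps: int) -> float:
    """Multiplier in [0, 1] that grows linearly during warmup."""
    if warmup_steps <= 0:
        return 1.0  # no warmup configured
    # Assumption: constant learning rate once warmup has finished.
    return min(1.0, step / warmup_steps)


def learning_rate_at(step: int, base_lr: float = 1e-4,
                     warmup_steps: int = 500) -> float:
    """Learning rate at a given optimizer step (defaults mirror train.yaml)."""
    return base_lr * linear_warmup_factor(step, warmup_steps)


if __name__ == "__main__":
    for step in (0, 125, 250, 500, 1000):
        print(step, learning_rate_at(step))
```

A function with this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, for example via `functools.partial(linear_warmup_factor, warmup_steps=500)`, alongside an AdamW optimizer configured with the settings listed above.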