This project implements a modular pipeline for fine-tuning Small Language Models (SLMs) on multilingual summarization tasks using parallel Wikipedia articles.
The codebase is organized into modular components:
- dataset.py: Extract Wikipedia articles (in parallel across all 5 languages)
- generate_summaries.py: Generate summaries using a large language model
- convert_to_hf_dataset.py: Create HuggingFace datasets from raw data and summaries
- finetune_model_*.py: Fine-tune a model on the summarization task
- compare_model_*.py: Compare the fine-tuned model with the base model
- model_utils.py: Model loading/saving utilities
- data_utils.py: Data processing utilities
- metrics_utils.py: Metrics computation and visualization
Install dependencies:
pip install -r requirements.txt

First, extract parallel Wikipedia articles:
python dataset.py --languages fr de ja ru --num-documents 5000 --output-dir Data/directory/raw

Next, generate summaries for each language:
python generate_summaries.py \
    --raw-data-dir Data/directory/raw \
    --summaries-dir Data/directory/summaries \
    --languages fr de ja ru \
    --model-name Qwen/Qwen2.5-7B-Instruct

Finally, convert the raw data and summaries to a HuggingFace dataset:
python convert_to_hf_dataset.py \
    --raw-data-dir Data/directory/raw \
    --summaries-dir Data/directory/summaries \
    --hf-dataset-dir Data/directory/hf_dataset \
    --languages fr de ja ru

Fine-tune a small language model on a specific language:
python finetune_model.py \
    --model-name google/mt5-base \
    --dataset-path Data/directory/hf_dataset \
    --language fr \
    --output-dir models \
    --batch-size 4 \
    --num-epochs 3

Options:
- Use --no-lora to disable LoRA (use full fine-tuning)
- Use --lora-r and --lora-alpha to configure LoRA parameters
- Use --learning-rate to adjust the learning rate
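
To give a sense of what --lora-r and --lora-alpha control, here is a minimal sketch of the standard LoRA arithmetic: for a frozen weight matrix of shape d_out x d_in, LoRA trains two low-rank factors with r * (d_in + d_out) parameters in total, and the learned update is scaled by lora_alpha / r. The function names and the 768 x 768 layer size are illustrative assumptions and are not taken from finetune_model.py.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two LoRA factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


if __name__ == "__main__":
    d_in = d_out = 768  # hypothetical layer size, chosen only for illustration
    full = d_in * d_out  # parameters updated by full fine-tuning of this layer
    for r in (4, 8, 16):
        lora = lora_trainable_params(d_in, d_out, r)
        share = 100 * lora / full
        print(f"r={r:>2}: {lora} LoRA params vs {full} full ({share:.2f}%)")
```

A larger rank gives the adapter more capacity at the cost of more trainable parameters, which is the trade-off behind --lora-r.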
Evaluate the fine-tuned model:
python evaluate_model.py \
    --model-path models/fr_mt5-base_20250302_123456/final_model \
    --dataset-path Data/directory/hf_dataset \
    --language fr

For LoRA models, use:
python evaluate_model.py \
    --model-path models/fr_mt5-base_lora_20250302_123456/final_model \
    --dataset-path Data/directory/hf_dataset \
    --language fr \
    --lora \
    --base-model google/mt5-base

Compare your fine-tuned model against baselines:
python compare_baselines.py \
    --config model_config.json \
    --dataset-path Data/directory/hf_dataset \
    --language fr \
    --output-dir model_comparison

Here, model_config.json defines the models to compare:
{
  "models": [
    {
      "name": "google/mt5-base",
      "type": "huggingface",
      "display_name": "mT5-Base (Baseline)"
    },
    {
      "name": "./models/fr_mt5-base_20250302_123456/final_model",
      "type": "finetuned",
      "display_name": "Fine-tuned mT5-Base"
    }
  ]
}

For Colab compatibility:
- Keep batch sizes small (2-4) to fit in GPU memory
- Use LoRA fine-tuning (enabled by default)
- For large models, reduce max_input_length and max_target_length
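
Batch size and sequence length work together here: when sequences are padded to the maximum length, the number of token positions processed per training step is bounded by batch_size * (max_input_length + max_target_length). The sketch below computes that bound. The length values are hypothetical, since the scripts' defaults are not stated here, and actual memory use also depends on the model and on attention cost, which grows faster than linearly with sequence length.

```python
def max_tokens_per_step(batch_size: int,
                        max_input_length: int,
                        max_target_length: int) -> int:
    """Upper bound on padded token positions (input + target) in one batch."""
    return batch_size * (max_input_length + max_target_length)


if __name__ == "__main__":
    # Hypothetical settings, for illustration only.
    settings = [
        (4, 512, 128),
        (2, 512, 128),
        (2, 256, 64),
    ]
    for batch_size, max_in, max_out in settings:
        bound = max_tokens_per_step(batch_size, max_in, max_out)
        print(f"batch={batch_size}, input={max_in}, target={max_out}: "
              f"at most {bound} token positions per step")
```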
Example Colab setup:
# Clone the repository first so that requirements.txt is available
!git clone https://github.com/shashuat/multilingual-summarization.git
%cd multilingual-summarization
# Install requirements
!pip install -r requirements.txt
# Run the fine-tuning
!python finetune_model.py \
    --model-name google/mt5-base \
    --dataset-path ./Data/directory/hf_dataset \
    --language fr \
    --batch-size 2 \
    --num-epochs 3

Models that work well on Google Colab (16GB GPU):
- google/mt5-base (580M parameters, multilingual)
Models are evaluated using:
- ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L)
- LLM-based evaluation (Gemma3-27b)
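
As a rough illustration of what ROUGE-1 measures, the sketch below computes unigram-overlap precision, recall, and F1 between a candidate summary and a reference summary. It is a simplified stand-in and not the implementation in metrics_utils.py: it tokenizes on whitespace (which would not work for Japanese without a proper tokenizer) and skips stemming.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
    print({name: round(value, 4) for name, value in scores.items()})
```

ROUGE-2 follows the same pattern with bigrams in place of single words, while ROUGE-L is based on the longest common subsequence instead of n-gram counts.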
Project structure:
├── data/
│ ├── raw/ # Raw Wikipedia articles
│ ├── summaries/ # Generated summaries
│ └── hf_dataset/ # Processed HuggingFace datasets
├── models/ # Fine-tuned models
├── model_comparison/ # Model comparison results
├── dataset.py # Wikipedia data extraction script
├── generate_summaries.py # Summary generation script
├── convert_to_hf_dataset.py # Dataset conversion script
├── model_utils.py # Model utilities
├── data_utils.py # Data processing utilities
├── metrics_utils.py # Metrics computation utilities
├── finetune_model.py # Fine-tuning script
├── evaluate_model.py # Evaluation script
├── compare_baselines.py # Model comparison script
└── requirements.txt # Project dependencies