Text Normalization Project
A text normalization implementation that combines T5 models with rule-based preprocessing. This repository provides a complete workflow for data cleaning, augmentation, model training, and inference.
This project implements a text normalization pipeline featuring:
| Component | Description |
|---|---|
| Model | T5 transformer from HuggingFace |
| Preprocessing | Rule-based cleaning and augmentation |
| Output | Normalized text with quality metrics |
- Data cleaning and preprocessing
- Data augmentation
- Model training with early stopping
- Comprehensive evaluation metrics
- Command-line interface
git clone git@github.com:amitsou/T5-TextNormalizer.git
cd T5-TextNormalizer

Create and activate the virtual environment:

python -m venv venv
source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Place your CSV file in the following directory:
root/
├── data/
│   └── raw_data/
│       └── normalization_assessment_dataset_10k.csv
└── ...

Process your dataset using the preparation script. This script splits the dataset into train/test/val sets using an 80/10/10 ratio.
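
As a rough illustration of an 80/10/10 split, the sketch below shuffles a list of rows and cuts it into three parts. The function name `split_dataset`, the fixed seed, and the rounding behavior are assumptions made for this example; the repository's preparation script may handle these details differently.

```python
import random


def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows and split them into train/val/test lists.

    Hypothetical helper: the seed and rounding are illustrative choices,
    not taken from the repository's preparation script.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


if __name__ == "__main__":
    # Integers stand in for dataset rows.
    train, val, test = split_dataset(range(10_000))
    print(len(train), len(val), len(test))
```

Shuffling with a fixed seed keeps the split reproducible across runs, which matters when comparing models trained at different times.
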
Retraining T5 relies on data augmentation, which the --augment argument controls during the data preparation phase.
Running the script with --prepare --augment N creates N additional copies of each training example, each with controlled modifications.
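
The README does not list which modifications are applied, so the sketch below uses a few hypothetical ones (case changes, spacing around slashes, dropping one character) to show the general shape of "N additional copies per example". The names `perturb` and `augment` are invented for this illustration and are not the project's API.

```python
import random


def perturb(text, rng):
    """Apply one small, randomly chosen modification (illustrative only)."""
    choice = rng.choice(["upper", "lower", "space_slash", "drop_char"])
    if choice == "upper":
        return text.upper()
    if choice == "lower":
        return text.lower()
    if choice == "space_slash":
        return text.replace("/", " / ")
    # drop_char: remove one character if the text has more than one
    if len(text) > 1:
        i = rng.randrange(len(text))
        return text[:i] + text[i + 1:]
    return text


def augment(examples, n_copies, seed=0):
    """Return each (raw, target) pair followed by n_copies noisy variants.

    Only the raw side is perturbed; the target stays unchanged so the
    model learns to map noisy inputs to the same normalized output.
    """
    rng = random.Random(seed)
    augmented = []
    for raw, target in examples:
        augmented.append((raw, target))
        for _ in range(n_copies):
            augmented.append((perturb(raw, rng), target))
    return augmented


if __name__ == "__main__":
    data = [("Pixouu/Abdou Gambetta/Copyright Control", "Pixouu/Abdou Gambetta")]
    for raw, target in augment(data, n_copies=2):
        print(repr(raw), "->", repr(target))
```

Under these assumptions, --augment 2 would triple the training set: each original example plus two modified copies.
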
For example, to create two augmented copies of each training example:

python main.py --prepare --augment 2

Train the T5 model:
To customize the model's parameters, adjust the model_params.py file. The current configuration is tuned for an RTX 3060 GPU, and the model is trained for 5 epochs.
python main.py --train
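
Training uses early stopping with a limit of 5 epochs. The following is a minimal sketch of how such a check is commonly written, not the project's actual training loop. The `patience` and `min_delta` values and the use of validation loss as the monitored quantity are assumptions; the real settings live in model_params.py.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience      # assumed value, for illustration
        self.min_delta = min_delta    # minimum decrease that counts as progress
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def train(val_losses, max_epochs=5, patience=2):
    """Simulate a training loop over precomputed validation losses.

    Returns (number of epochs run, best validation loss seen).
    """
    stopper = EarlyStopping(patience=patience)
    for epoch, val_loss in enumerate(val_losses[:max_epochs], start=1):
        # A real loop would train for one epoch and compute val_loss here.
        if stopper.step(val_loss):
            return epoch, stopper.best_loss
    return min(len(val_losses), max_epochs), stopper.best_loss


if __name__ == "__main__":
    epochs_run, best = train([1.0, 0.8, 0.9, 0.95, 0.7])
    print(f"stopped after {epochs_run} epochs, best val loss {best}")
```

In practice the checkpoint with the best validation loss is usually saved whenever `best_loss` updates, so stopping early does not discard the best model.
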
Evaluate the model:
python main.py --test --samples 100

The script will:
- Load the test data
- Randomly select 100 samples
- Generate predictions for each sample
- Calculate all metrics
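
The steps above can be sketched as a small evaluation loop. This is an illustration under assumptions: `evaluate_sample`, its parameters, and the fixed seed are hypothetical, and `predict_fn` stands in for whatever function wraps the trained model.

```python
import random


def evaluate_sample(pairs, predict_fn, metric_fns, n_samples=100, seed=0):
    """Average each metric over a random sample of (input, target) pairs.

    pairs:      list of (raw_text, target_text) tuples
    predict_fn: callable mapping raw_text to a predicted string
    metric_fns: dict of name -> callable(prediction, target) -> float
    """
    rng = random.Random(seed)
    sample = rng.sample(pairs, min(n_samples, len(pairs)))
    if not sample:
        return {name: 0.0 for name in metric_fns}

    totals = {name: 0.0 for name in metric_fns}
    for raw, target in sample:
        prediction = predict_fn(raw)
        for name, fn in metric_fns.items():
            totals[name] += fn(prediction, target)

    return {name: total / len(sample) for name, total in totals.items()}


if __name__ == "__main__":
    # Toy data and a trivial "model" that only strips whitespace.
    pairs = [(" Nico Gomez ", "Nico Gomez"), ("A/B", "A")]
    exact_match = lambda pred, target: float(pred == target)
    print(evaluate_sample(pairs, str.strip, {"exact_match": exact_match}))
```

Passing the metrics in as a dictionary keeps the loop independent of any particular metric implementation.
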
Metrics reported:

The calculated metrics provide different perspectives on the model's performance:

- BLEU Score: Measures how well the predicted text matches the ground truth, focusing on word order and accuracy
- Character Accuracy: Shows the percentage of characters that match exactly between prediction and ground truth
- Word Accuracy: Measures the percentage of words that match exactly
- Normalized Edit Distance: Shows how many operations (insertions, deletions, substitutions) are needed to transform the prediction into the ground truth
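
To make three of these metrics concrete, here is one possible way to compute them in plain Python. The exact definitions used by the project are not stated, so the choices below (position-wise matching, dividing by the longer length) are assumptions. BLEU is omitted because it is typically computed with an existing library rather than by hand.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction, target):
    """Edit distance divided by the longer length (0.0 means identical)."""
    longest = max(len(prediction), len(target))
    return edit_distance(prediction, target) / longest if longest else 0.0


def positional_accuracy(pred_items, target_items):
    """Fraction of positions where items match, over the longer length."""
    longest = max(len(pred_items), len(target_items))
    if longest == 0:
        return 1.0
    matches = sum(p == t for p, t in zip(pred_items, target_items))
    return matches / longest


def character_accuracy(prediction, target):
    """Position-wise character match rate (assumed definition)."""
    return positional_accuracy(prediction, target)


def word_accuracy(prediction, target):
    """Position-wise match rate over whitespace-separated words (assumed)."""
    return positional_accuracy(prediction.split(), target.split())


if __name__ == "__main__":
    pred, gold = "Pawe Jaboski", "Paweł Jabłoński"
    print(normalized_edit_distance(pred, gold))
    print(character_accuracy(pred, gold), word_accuracy(pred, gold))
```

Note that position-wise accuracy drops sharply after a single missing character, because every later position shifts. Edit distance is more forgiving in that case, which is one reason to report both.
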
Run inference on standard example inputs:
python app.py --inference

This command generates normalized text output for the built-in examples:
| RAW TEXT | PREDICTED NORMALIZED |
|---|---|
| Pixouu/Abdou Gambetta/Copyright Control | Pixouu/Abdou Gambetta |
| Mike Hoyer/JERRY CHESNUT/SONY/ATV MUSIC PUBLISHING (UK) LIMITED | Correct and normalize: Mike Ho |
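
For readers who want to see what a single inference call might look like, here is a hedged sketch using the Hugging Face `transformers` library. The prompt prefix is inferred from the text echoed in the second row above, and `t5-small` is only a placeholder checkpoint name; neither is confirmed as the project's actual configuration. The function names are hypothetical.

```python
PROMPT_PREFIX = "Correct and normalize: "  # assumed from the echoed output


def build_prompt(raw_text):
    """Prepend the task prefix to a raw input string."""
    return PROMPT_PREFIX + raw_text.strip()


def normalize(raw_text, model_name="t5-small", max_new_tokens=64):
    """Generate normalized text with a T5 checkpoint.

    `model_name` is a placeholder: point it at a fine-tuned checkpoint
    directory. Requires the `transformers` and `torch` packages.
    """
    # Imported lazily so build_prompt works without the heavy dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer(
        build_prompt(raw_text), return_tensors="pt", truncation=True
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(build_prompt("Pixouu/Abdou Gambetta/Copyright Control"))
```

A truncated or prompt-echoing prediction like the one in the second row can sometimes be related to generation length limits or to the prefix leaking into the target during training, so both are worth checking when debugging.
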
| Metric | Score |
|---|---|
| BLEU Score | 0.1658 |
| Character-level Accuracy | 0.4467 |
| Word-level Accuracy | 0.4431 |
| Normalized Edit Distance | 0.4197 |
| Input | Predicted | Actual |
|---|---|---|
| Paweł Jabłoński | Pawe Jaboski | Paweł Jabłoński |
| Yuki Kishida/Kentaro Sonoda | No Prediction | Yuki Kishida/Kentaro Sonoda |
| R.K.M./Nico Gomez/Universal Music publish GmbH/Universal Music Publishing N.V./Universal Music Publishing Gmbh | Nico Gomez | Nico Gomez |
The project was developed on the following hardware:
- Laptop with 32GB RAM
- Intel Core i7 Processor
- NVIDIA RTX 3060 (6GB VRAM)
Because of these hardware constraints, particularly the 6GB of GPU memory, running large deep learning models and large-scale experiments efficiently was difficult.
As a result, some training and evaluation runs were configured to reduce VRAM consumption, and certain large-scale experiments were not feasible on this setup.
To learn more about the task, consider reading the following: