GitHub - amitsou/T5-TextNormalizer

Text Normalization Project

An implementation of text normalization that combines a T5 model with rule-based preprocessing. This repository provides a complete workflow for data cleaning, augmentation, model training, evaluation, and inference.

Overview

This project implements a text normalization pipeline featuring:

Component	Description
Model	T5 transformer from HuggingFace
Preprocessing	Rule-based cleaning and augmentation
Output	Normalized text with quality metrics

Key Features

Data cleaning and preprocessing
Data augmentation
Model training with early stopping
Comprehensive evaluation metrics
Command-line interface

Installation

Clone Repository

git clone git@github.com:amitsou/T5-TextNormalizer.git
cd T5-TextNormalizer

Set Up Environment

Create and activate the virtual environment:

python -m venv venv
source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Data Preparation

Dataset Setup

Place your CSV file in the following directory:

root/
├── data/
│   └── raw_data/
│       └── normalization_assessment_dataset_10k.csv
└── ...

Data Processing

Process your dataset using the preparation script. The script splits the dataset into train, validation, and test sets using an 80/10/10 split.
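
The 80/10/10 split can be illustrated with a minimal sketch. This is not the repository's code: the function name split_rows, the fixed seed, and the choice to shuffle before splitting are assumptions made for illustration.

```python
import random


def split_rows(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows and split them into train/val/test partitions.

    Hypothetical helper: the fractions follow the README's 80/10/10 rule,
    everything else (name, seed, shuffling) is an assumption.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# Integers stand in for the 10k CSV rows.
train, val, test = split_rows(range(10_000))
print(len(train), len(val), len(test))
```

Seeding the shuffle keeps the partitions reproducible across runs, which matters when augmented copies are later derived from the training partition only.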

The T5 model is retrained on augmented data, and the --augment argument controls augmentation during the data preparation phase.

When you run the script with --prepare --augment N, it creates N additional copies of each training example with controlled modifications.

For example:

python main.py --prepare --augment 2
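
To make the effect of --augment N concrete, here is a hedged sketch of producing N modified copies per training example. The README does not say which modifications the repository applies, so the perturbations below (case changes and spacing around "/") are purely illustrative, and the names perturb and augment are hypothetical.

```python
import random


def perturb(text, rng):
    """Apply one randomly chosen surface-level change (illustrative only)."""
    choice = rng.choice(["lower", "upper", "spaces"])
    if choice == "lower":
        return text.lower()
    if choice == "upper":
        return text.upper()
    # "spaces": pad the "/" separators with spaces.
    return text.replace("/", " / ")


def augment(pairs, n_copies, seed=0):
    """Return each (raw, target) pair followed by n_copies perturbed variants."""
    rng = random.Random(seed)
    out = []
    for raw, target in pairs:
        out.append((raw, target))
        for _ in range(n_copies):
            # Only the raw input is modified; the normalized target is kept.
            out.append((perturb(raw, rng), target))
    return out


pairs = [("Pixouu/Abdou Gambetta/Copyright Control", "Pixouu/Abdou Gambetta")]
augmented = augment(pairs, n_copies=2)
print(len(augmented))  # 1 original + 2 augmented copies
```

With --augment 2, a scheme like this would triple the number of training pairs while leaving the validation and test sets untouched.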

Usage

Training

Train the T5 model:

To customize the model's parameters, adjust the model_params.py file. The current configuration is tuned for an RTX 3060 GPU, and the model is trained for 5 epochs.

python main.py --train
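
The feature list mentions early stopping. The sketch below shows one common patience-based form of it. The class name, the patience and min_delta parameters, and the loss values are assumptions for illustration; they are not taken from model_params.py.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses for a 5-epoch run, for illustration only.
val_losses = [0.90, 0.70, 0.72, 0.71, 0.69]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)
```

In this run the loss fails to beat the best value in epochs 3 and 4, so training stops before the fifth epoch.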

Testing

Evaluate the model:

python main.py --test --samples 100

The script will:

Load the test data
Randomly select 100 samples
Generate predictions for each sample
Calculate all metrics
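
The four steps above can be sketched as a small loop. The function evaluate_sample and the stand-in predictor are hypothetical; a real run would call the trained T5 model where dummy_predict is used here.

```python
import random


def evaluate_sample(test_pairs, predict, n_samples=100, seed=0):
    """Pick n_samples random (raw, target) pairs and predict each one.

    Returns a list of (raw, prediction, target) triples.
    """
    rng = random.Random(seed)
    k = min(n_samples, len(test_pairs))  # guard against small test sets
    sample = rng.sample(test_pairs, k)
    return [(raw, predict(raw), target) for raw, target in sample]


def dummy_predict(raw):
    """Stand-in for the model: keep only the first "/"-separated name."""
    return raw.split("/")[0]


# Synthetic test data, for illustration only.
test_pairs = [(f"Artist {i}/Publisher {i}", f"Artist {i}") for i in range(500)]
results = evaluate_sample(test_pairs, dummy_predict, n_samples=100)
exact = sum(pred == target for _, pred, target in results)
print(len(results), exact)
```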

Metrics reported. Each metric gives a different perspective on the model's performance:

BLEU Score: Measures how well the predicted text matches the ground truth, focusing on word order and accuracy
Character Accuracy: Shows the percentage of characters that match exactly between prediction and ground truth
Word Accuracy: Measures the percentage of words that match exactly
Normalized Edit Distance: Counts the operations (insertions, deletions, substitutions) needed to transform the prediction into the ground truth, scaled by text length; lower is better
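
As a rough illustration of three of these metrics, the sketch below uses only the standard library. The exact definitions used by the repository are not stated, so the position-wise matching and the normalization by the longer string are assumptions. BLEU is omitted because it is usually computed with an existing library.

```python
def char_accuracy(pred, truth):
    """Fraction of positions that match, relative to the longer sequence."""
    if not pred and not truth:
        return 1.0
    matches = sum(p == t for p, t in zip(pred, truth))
    return matches / max(len(pred), len(truth))


def word_accuracy(pred, truth):
    """Same position-wise comparison, applied to whitespace-separated words."""
    return char_accuracy(pred.split(), truth.split())


def edit_distance(a, b):
    """Levenshtein distance using the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, truth):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(pred), len(truth))
    return edit_distance(pred, truth) / longest if longest else 0.0


# A prediction/ground-truth pair taken from the example predictions.
pred, truth = "Pawe Jaboski", "Paweł Jabłoński"
print(char_accuracy(pred, truth))
print(word_accuracy(pred, truth))
print(normalized_edit_distance(pred, truth))
```

Position-wise character accuracy is sensitive to a single dropped character, since every later position shifts. This is one reason edit distance is reported alongside it.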

Inference

Run inference on standard example inputs:

python main.py --inference

This command generates normalized text output for the built-in examples.
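
For orientation, the flags used in the commands above could be wired together with argparse roughly as follows. This is a sketch, not the contents of main.py: the help strings and the default values (0 for --augment, 100 for --samples) are assumptions.

```python
import argparse


def build_parser():
    """Build a parser exposing the flags shown in this README."""
    parser = argparse.ArgumentParser(description="T5 text normalization pipeline")
    parser.add_argument("--prepare", action="store_true",
                        help="clean and split the raw dataset")
    parser.add_argument("--augment", type=int, default=0, metavar="N",
                        help="extra augmented copies per training example")
    parser.add_argument("--train", action="store_true",
                        help="train the T5 model")
    parser.add_argument("--test", action="store_true",
                        help="evaluate the model on test data")
    parser.add_argument("--samples", type=int, default=100,
                        help="number of random test samples to evaluate")
    parser.add_argument("--inference", action="store_true",
                        help="run inference on the built-in examples")
    return parser


# Equivalent to: python main.py --prepare --augment 2
args = build_parser().parse_args(["--prepare", "--augment", "2"])
print(args.prepare, args.augment, args.train)
```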

Output Examples

Inference examples:

RAW TEXT	PREDICTED NORMALIZED
Pixouu/Abdou Gambetta/Copyright Control	Pixouu/Abdou Gambetta
Mike Hoyer/JERRY CHESNUT/SONY/ATV MUSIC PUBLISHING (UK) LIMITED	Correct and normalize: Mike Ho

Model Test/Evaluation Examples:

Evaluation Results Table

Metric	Score
BLEU Score	0.1658
Character-level Accuracy	0.4467
Word-level Accuracy	0.4431
Normalized Edit Distance	0.4197

Example Predictions Table

Input	Predicted	Actual
Paweł Jabłoński	Pawe Jaboski	Paweł Jabłoński
Yuki Kishida/Kentaro Sonoda	No Prediction	Yuki Kishida/Kentaro Sonoda
R.K.M./Nico Gomez/Universal Music publish GmbH/Universal Music Publishing N.V./Universal Music Publishing Gmbh	Nico Gomez	Nico Gomez

Hardware Limitations

The development setup has the following hardware:

Laptop with 32GB RAM
Intel Core i7 Processor
NVIDIA RTX 3060 (6GB VRAM)

Because of these constraints, particularly the 6GB of GPU memory, large deep learning models and large-scale experiments could not be run efficiently.

As a result, some training and evaluation runs were configured for lower VRAM consumption, and certain large-scale experiments were not feasible on this setup.

To learn more about the task, see the following:

Text Normalization Report
