Detect when noisy labels will break a safety classifier before retraining on bad data.
Machine learning systems often retrain classifiers as new labeled data arrives. If those labels contain too much noise, the retrained model silently degrades.
NoiseCliff is a pre-training quality gate that checks a label batch and estimates whether the dataset is too noisy to safely retrain on.
Instead of discovering label problems after training fails, the gate detects them before the training job starts.
NoiseCliff includes two things:

- A CLI quality gate that checks annotation batches and blocks retraining if the data appears too noisy.
- A set of experiments measuring how classifiers degrade as label noise increases.
```bash
python gate/check.py --batch week_47_labels.csv --baseline toxicchat
```

Example output:

```
NoiseCliff Quality Gate
Batch: week_47_labels.csv
Samples: 2,847
Toxic rate: 9.4%
Estimated noise rate: 24.3%
Estimated PR-AUC if trained: 0.481 (baseline 0.628)
Collapse risk: DANGER
Safe to retrain: NO
```
The gate can run automatically in CI before any training job begins.
Experiments show classifier performance degrades gradually up to roughly 20% label noise, after which it collapses rapidly:

- Below the threshold: ~0.02 PR-AUC lost per 5% noise interval
- Beyond it: ~0.09 PR-AUC lost per 5% interval

This creates a noise cliff where model reliability drops sharply.
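As a rough back-of-the-envelope illustration (not the gate's actual estimator), the two slopes above can be turned into a piecewise estimate of expected PR-AUC at a given noise rate, using the 0.628 clean baseline reported in the experiments below:

```python
def estimated_pr_auc(noise_rate, clean_pr_auc=0.628, cliff=0.20):
    """Rough piecewise estimate: ~0.02 PR-AUC lost per 5% noise below the
    cliff, ~0.09 per 5% beyond it. Illustrative only."""
    if noise_rate <= cliff:
        return clean_pr_auc - 0.02 * (noise_rate / 0.05)
    below = cliff / 0.05                    # intervals before the cliff
    beyond = (noise_rate - cliff) / 0.05    # intervals past the cliff
    return clean_pr_auc - 0.02 * below - 0.09 * beyond

print(estimated_pr_auc(0.10))  # ~0.588
print(estimated_pr_auc(0.40))  # ~0.188, near the observed 0.185
```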
Standard metrics like F1 macro can hide this failure, especially on imbalanced datasets.
In production ML systems:
- annotation pipelines drift
- weak labels get mixed with human labels
- guidelines change over time
- reviewers disagree
All of these introduce label noise.
Without monitoring, models can be retrained on corrupted data and degrade silently.
NoiseCliff acts as a data quality guardrail for ML pipelines.
Clone the repository:

```bash
git clone https://github.com/DevelopedBy-Siva/noise-cliff.git
cd noise-cliff
pip install -r requirements.txt
```

Initialize the baseline:

```bash
python gate/bootstrap_toxicchat_baseline.py
```

Check a label batch:

```bash
python gate/check.py --batch labels.csv
```

CSV format:

```
text,label
"This prompt is toxic",1
"This prompt is safe",0
```
Basic usage:

```bash
python gate/check.py --batch labels.csv --baseline toxicchat
```

Useful options:

| Flag | Description |
|---|---|
| `--save` | Save report as JSON |
| `--json` | Output machine-readable report |
| `--ci` | Exit with error code if unsafe |
| `--explain` | Show suspicious examples |
Example:

```bash
python gate/check.py \
  --batch week_47_labels.csv \
  --baseline toxicchat \
  --save
```

NoiseCliff can block retraining jobs automatically.
Example GitHub Actions step:
- name: NoiseCliff quality gate
run: |
python gate/check.py \
--batch data/new_labels.csv \
--baseline toxicchat \
--ciExit codes:
| Code | Meaning |
|---|---|
| 0 | Safe batch |
| 1 | Unsafe batch (CI fail) |
| 2 | Invalid input |
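Outside of GitHub Actions, the same exit codes can gate a retraining script directly. A sketch, with the retraining call left as a placeholder:

```python
import subprocess
import sys

# Run the gate in CI mode: exit code 0 = safe, 1 = unsafe batch, 2 = invalid input.
result = subprocess.run([
    "python", "gate/check.py",
    "--batch", "data/new_labels.csv",
    "--baseline", "toxicchat",
    "--ci",
])

if result.returncode == 0:
    print("Batch looks safe, starting retraining...")
    # launch_retraining_job()  # placeholder for your training entry point
elif result.returncode == 1:
    sys.exit("NoiseCliff flagged the batch as unsafe; retraining blocked.")
else:
    sys.exit("NoiseCliff could not check the batch (invalid input).")
```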
The gate estimates noise without external ground truth.
Steps:
- Train a lightweight TF-IDF + Logistic Regression probe
- Run 3-fold cross-validation
- Measure uncertainty signals from predictions
Signals used:
| Signal | Meaning |
|---|---|
| Prediction entropy | noisy labels produce uncertain predictions |
| Margin | clean labels produce confident predictions |
| Near-threshold mass | noise concentrates predictions near boundary |
| Label distribution drift | dataset shifts may indicate annotation issues |
These signals are compared against calibrated baseline curves to estimate the noise rate.
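A simplified sketch of that probe (not the project's exact implementation): it fits TF-IDF + Logistic Regression with 3-fold cross-validation and computes the entropy, margin, and near-threshold signals from the out-of-fold probabilities; the comparison against calibrated baseline curves and the drift signal are omitted:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labels.csv")

# Lightweight probe: TF-IDF features + logistic regression.
probe = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))

# Out-of-fold probability of the toxic class from 3-fold cross-validation.
proba = cross_val_predict(
    probe, df["text"], df["label"], cv=3, method="predict_proba"
)[:, 1]

# Uncertainty signals: noisy labels tend to push predictions toward the boundary.
eps = 1e-12
entropy = -(proba * np.log(proba + eps) + (1 - proba) * np.log(1 - proba + eps))
margin = 2 * np.abs(proba - 0.5)                           # 0 = uncertain, 1 = confident
near_threshold = float((np.abs(proba - 0.5) < 0.1).mean())  # mass near the boundary

print(f"mean prediction entropy: {entropy.mean():.3f}")
print(f"mean margin:             {margin.mean():.3f}")
print(f"near-threshold mass:     {near_threshold:.3f}")
print(f"toxic rate in batch:     {df['label'].mean():.3f}")
```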
The gate ships with a baseline calibrated for ToxicChat.
If your dataset differs significantly, calibrate a new baseline.
```bash
python gate/calibrate.py --csv my_data.csv --name my_dataset
```

Then use it:

```bash
python gate/check.py --batch new_labels.csv --baseline my_dataset
```

Calibration:

- injects controlled noise levels
- measures PR-AUC degradation
- finds the dataset's tipping point
- stores calibration curves
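Conceptually, calibration looks roughly like the sweep below, here sketched with the same lightweight TF-IDF + Logistic Regression probe rather than the project's own training code (the file name and split are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("my_data.csv")  # same text,label format as above
train_text, test_text, train_y, test_y = train_test_split(
    df["text"], df["label"].to_numpy(), test_size=0.3,
    stratify=df["label"], random_state=0,
)

rng = np.random.default_rng(0)
for noise in [0.0, 0.05, 0.10, 0.20, 0.30, 0.40]:
    # Inject symmetric noise: flip a random fraction of the training labels.
    noisy_y = train_y.copy()
    flip = rng.random(len(noisy_y)) < noise
    noisy_y[flip] = 1 - noisy_y[flip]

    model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
    model.fit(train_text, noisy_y)
    pr_auc = average_precision_score(test_y, model.predict_proba(test_text)[:, 1])
    # The tipping point is roughly where PR-AUC starts falling much faster per step.
    print(f"noise={noise:.0%}  PR-AUC={pr_auc:.3f}")
```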
Validated across 9 injected noise levels × 3 seeds on ToxicChat.
| Injected | Estimated | Error |
|---|---|---|
| 0% | 0.000 | 0.000 |
| 5% | 0.050 | ~0 |
| 10% | 0.104 | +0.004 |
| 20% | 0.195 | -0.005 |
| 30% | 0.311 | +0.011 |
| 40% | 0.357 | -0.043 |
Mean absolute error: 1.1 percentage points
Operational zones:
| Zone | Noise |
|---|---|
| CLEAN | <10% |
| WATCH | 10–18% |
| DANGER | 18–30% |
| CRITICAL | >30% |
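A trivial helper for mapping an estimated noise rate onto these zones; the exact boundary handling is an assumption:

```python
def noise_zone(noise_rate):
    """Map an estimated noise rate (0-1) onto a NoiseCliff operational zone."""
    if noise_rate < 0.10:
        return "CLEAN"
    if noise_rate < 0.18:
        return "WATCH"
    if noise_rate <= 0.30:
        return "DANGER"
    return "CRITICAL"

print(noise_zone(0.243))  # DANGER, matching the example report above
```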
The project includes experiments evaluating classifier robustness to label noise. Two datasets were used: one to validate the experimental setup, and ToxicChat, a real-world dataset of user prompts from the Vicuna LLM demo:

- 10,165 prompts
- 7.33% toxic
- 2.01% jailbreak prompts
- labeled by multiple annotators
Primary metric: PR-AUC (appropriate for imbalanced classification).
| Noise | LogReg PR-AUC | DeBERTa PR-AUC |
|---|---|---|
| 0% | 0.628 | 0.845 |
| 10% | 0.605 | 0.804 |
| 20% | 0.532 | 0.750 |
| 40% | 0.185 | 0.243 |
Performance degrades slowly until ~20% noise, then collapses rapidly.
On imbalanced datasets, F1 macro can increase even while the classifier gets worse.
Example from the experiments: F1 macro rises from 0.606 → 0.656 while PR-AUC drops from 0.627 → 0.382.
Monitoring only F1 hides model failure.
PR-AUC reflects ranking quality across thresholds and is more reliable for toxicity classification.
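A small sketch of computing both metrics with scikit-learn, so PR-AUC is tracked alongside (or instead of) a thresholded F1:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

def report(y_true, y_score, threshold=0.5):
    """Thresholded F1 macro plus threshold-free PR-AUC (average precision)."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "pr_auc": average_precision_score(y_true, y_score),
    }

# Toy imbalanced batch (1 = toxic).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.2, 0.7, 0.6])
print(report(y_true, y_score))
```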
Cleaning approaches evaluated:
| Strategy | Result |
|---|---|
| Loss filtering | consistent recovery |
| Heuristic filtering | small improvements |
| Confidence filtering | dangerous on imbalanced data |
At 40% noise, no strategy fully recovers the classifier.
Preventing noisy retraining batches is safer than post-hoc fixes.
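For context, the simplest form of loss filtering looks roughly like this: drop the examples whose labels a cross-validated probe finds most surprising. The 10% cutoff and the probe are illustrative, not the project's exact recipe:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labels.csv")
probe = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
proba = cross_val_predict(probe, df["text"], df["label"], cv=3, method="predict_proba")[:, 1]

# Per-example cross-entropy of the assigned label under the out-of-fold prediction.
eps = 1e-12
loss = -np.where(df["label"] == 1, np.log(proba + eps), np.log(1 - proba + eps))

# Drop the 10% of examples the probe finds most surprising (likely mislabeled).
keep = loss < np.quantile(loss, 0.90)
cleaned = df[keep]
print(f"kept {keep.sum()} of {len(df)} examples")
```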
Experiments show weak labels can sometimes help. On ToxicChat, training on all data (weak labels included) outperforms training on smaller human-only datasets, because weak labels from moderation models still contain signal.
Noise characteristics matter as much as noise amount.
Assume:

- 100k daily interactions
- 7.3% toxic prompts

That means roughly 7,300 toxic prompts daily.
Training with 20% label noise reduces PR-AUC:
0.628 → 0.532
At 40% noise, ranking approaches random.
Both missed toxic prompts and false blocks increase.
The ~20% noise threshold is where operational risk begins to accelerate.
```
noise-cliff/
  gate/         # quality gate CLI
  models/       # classifiers
  experiments/  # experiment scripts
  noise/        # noise injection utilities
  training/     # training pipeline
  evaluation/   # evaluation metrics
  cleaning/     # label cleaning strategies
  results/      # experiment outputs
```
Run the experiments:

```bash
python experiments/find_tipping_point.py
python experiments/run_noise_sweep.py
python experiments/run_cleaning.py
python experiments/run_quantity_vs_quality.py
```
Python 3.10, PyTorch, HuggingFace Transformers, scikit-learn, pandas, matplotlib