Detect when noisy labels will break a safety classifier before retraining on bad data.
Machine learning systems often retrain classifiers as new labeled data arrives. If those labels contain too much noise, the retrained model silently degrades.
NoiseCliff is a pre-training quality gate that checks a label batch and estimates whether the dataset is too noisy to safely retrain on.
Instead of discovering label problems after training fails, the gate detects them before the training job starts.
NoiseCliff includes two things:

- A CLI quality gate that checks annotation batches and blocks retraining if the data appears too noisy.
- A set of experiments measuring how classifiers degrade as label noise increases.
```bash
python gate/check.py --batch week_47_labels.csv --baseline toxicchat
```

Example output:

```
NoiseCliff Quality Gate
Batch: week_47_labels.csv
Samples: 2,847
Toxic rate: 9.4%
Estimated noise rate: 24.3%
Estimated PR-AUC if trained: 0.481 (baseline 0.628)
Collapse risk: DANGER
Safe to retrain: NO
```
The gate can run automatically in CI before any training job begins.
Experiments show classifier performance degrades gradually up to roughly 20% label noise, after which it collapses rapidly:

- Below the threshold: ~0.02 PR-AUC lost per 5% noise interval
- Beyond it: ~0.09 PR-AUC lost per 5% interval

This creates a noise cliff where model reliability drops sharply.
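As a rough back-of-the-envelope illustration (not the gate's actual estimator), the two slopes above can be turned into a piecewise estimate of expected PR-AUC at a given noise rate, using the 0.628 clean baseline reported in the experiments below:

```python
def estimated_pr_auc(noise_rate, clean_pr_auc=0.628, cliff=0.20):
    """Rough piecewise estimate: ~0.02 PR-AUC lost per 5% noise below the
    cliff, ~0.09 per 5% beyond it. Illustrative only."""
    if noise_rate <= cliff:
        return clean_pr_auc - 0.02 * (noise_rate / 0.05)
    below = cliff / 0.05                    # intervals before the cliff
    beyond = (noise_rate - cliff) / 0.05    # intervals past the cliff
    return clean_pr_auc - 0.02 * below - 0.09 * beyond

print(estimated_pr_auc(0.10))  # ~0.588
print(estimated_pr_auc(0.40))  # ~0.188, near the observed 0.185
```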
Standard metrics like F1 macro can hide this failure, especially on imbalanced datasets.
In production ML systems:
- annotation pipelines drift
- weak labels get mixed with human labels
- guidelines change over time
- reviewers disagree
All of these introduce label noise.
Without monitoring, models can be retrained on corrupted data and degrade silently.
NoiseCliff acts as a data quality guardrail for ML pipelines.
Clone the repository:

```bash
git clone https://github.com/DevelopedBy-Siva/noise-cliff.git
cd noise-cliff
pip install -r requirements.txt
```

Initialize the baseline:

```bash
python gate/bootstrap_toxicchat_baseline.py
```

Check a label batch:

```bash
python gate/check.py --batch labels.csv
```

CSV format:

```
text,label
"This prompt is toxic",1
"This prompt is safe",0
```
Basic usage:

```bash
python gate/check.py --batch labels.csv --baseline toxicchat
```

Useful options:

| Flag | Description |
|---|---|
| `--save` | Save report as JSON |
| `--json` | Output machine-readable report |
| `--ci` | Exit with error code if unsafe |
| `--explain` | Show suspicious examples |
Example:

```bash
python gate/check.py \
  --batch week_47_labels.csv \
  --baseline toxicchat \
  --save
```

NoiseCliff can block retraining jobs automatically.
Example GitHub Actions step:
- name: NoiseCliff quality gate
run: |
python gate/check.py \
--batch data/new_labels.csv \
--baseline toxicchat \
--ciExit codes:
| Code | Meaning |
|---|---|
| 0 | Safe batch |
| 1 | Unsafe batch (CI fail) |
| 2 | Invalid input |
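Outside of GitHub Actions, the same exit codes can gate a retraining script directly. A sketch, with the retraining call left as a placeholder:

```python
import subprocess
import sys

# Run the gate in CI mode: exit code 0 = safe, 1 = unsafe batch, 2 = invalid input.
result = subprocess.run([
    "python", "gate/check.py",
    "--batch", "data/new_labels.csv",
    "--baseline", "toxicchat",
    "--ci",
])

if result.returncode == 0:
    print("Batch looks safe, starting retraining...")
    # launch_retraining_job()  # placeholder for your training entry point
elif result.returncode == 1:
    sys.exit("NoiseCliff flagged the batch as unsafe; retraining blocked.")
else:
    sys.exit("NoiseCliff could not check the batch (invalid input).")
```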
The gate estimates noise without external ground truth.
Steps:
- Train a lightweight TF-IDF + Logistic Regression probe
- Run 3-fold cross-validation
- Measure uncertainty signals from predictions
Signals used:
| Signal | Meaning |
|---|---|
| Prediction entropy | noisy labels produce uncertain predictions |
| Margin | clean labels produce confident predictions |
| Near-threshold mass | noise concentrates predictions near boundary |
| Label distribution drift | dataset shifts may indicate annotation issues |
These signals are compared against calibrated baseline curves to estimate the noise rate.
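A simplified sketch of that probe (not the project's exact implementation): it fits TF-IDF + Logistic Regression with 3-fold cross-validation and computes the entropy, margin, and near-threshold signals from the out-of-fold probabilities; the comparison against calibrated baseline curves and the drift signal are omitted:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labels.csv")

# Lightweight probe: TF-IDF features + logistic regression.
probe = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))

# Out-of-fold probability of the toxic class from 3-fold cross-validation.
proba = cross_val_predict(
    probe, df["text"], df["label"], cv=3, method="predict_proba"
)[:, 1]

# Uncertainty signals: noisy labels tend to push predictions toward the boundary.
eps = 1e-12
entropy = -(proba * np.log(proba + eps) + (1 - proba) * np.log(1 - proba + eps))
margin = 2 * np.abs(proba - 0.5)                           # 0 = uncertain, 1 = confident
near_threshold = float((np.abs(proba - 0.5) < 0.1).mean())  # mass near the boundary

print(f"mean prediction entropy: {entropy.mean():.3f}")
print(f"mean margin:             {margin.mean():.3f}")
print(f"near-threshold mass:     {near_threshold:.3f}")
print(f"toxic rate in batch:     {df['label'].mean():.3f}")
```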
The gate ships with a baseline calibrated for ToxicChat.
If your dataset differs significantly, calibrate a new baseline.
```bash
python gate/calibrate.py --csv my_data.csv --name my_dataset
```

Then use it:

```bash
python gate/check.py --batch new_labels.csv --baseline my_dataset
```

Calibration:

- injects controlled noise levels
- measures PR-AUC degradation
- finds the dataset's tipping point
- stores calibration curves
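Conceptually, calibration looks roughly like the sweep below, here sketched with the same lightweight TF-IDF + Logistic Regression probe rather than the project's own training code (the file name and split are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("my_data.csv")  # same text,label format as above
train_text, test_text, train_y, test_y = train_test_split(
    df["text"], df["label"].to_numpy(), test_size=0.3,
    stratify=df["label"], random_state=0,
)

rng = np.random.default_rng(0)
for noise in [0.0, 0.05, 0.10, 0.20, 0.30, 0.40]:
    # Inject symmetric noise: flip a random fraction of the training labels.
    noisy_y = train_y.copy()
    flip = rng.random(len(noisy_y)) < noise
    noisy_y[flip] = 1 - noisy_y[flip]

    model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
    model.fit(train_text, noisy_y)
    pr_auc = average_precision_score(test_y, model.predict_proba(test_text)[:, 1])
    # The tipping point is roughly where PR-AUC starts falling much faster per step.
    print(f"noise={noise:.0%}  PR-AUC={pr_auc:.3f}")
```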
Validated across 9 injected noise levels × 3 seeds on ToxicChat.
| Injected | Estimated | Error |
|---|---|---|
| 0% | 0.000 | 0.000 |
| 5% | 0.050 | ~0 |
| 10% | 0.104 | +0.004 |
| 20% | 0.195 | -0.005 |
| 30% | 0.311 | +0.011 |
| 40% | 0.357 | -0.043 |
Mean absolute error: 1.1 percentage points
Operational zones:
| Zone | Noise |
|---|---|
| CLEAN | <10% |
| WATCH | 10–18% |
| DANGER | 18–30% |
| CRITICAL | >30% |
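A trivial helper for mapping an estimated noise rate onto these zones; the exact boundary handling is an assumption:

```python
def noise_zone(noise_rate):
    """Map an estimated noise rate (0-1) onto a NoiseCliff operational zone."""
    if noise_rate < 0.10:
        return "CLEAN"
    if noise_rate < 0.18:
        return "WATCH"
    if noise_rate <= 0.30:
        return "DANGER"
    return "CRITICAL"

print(noise_zone(0.243))  # DANGER, matching the example report above
```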
The project includes experiments evaluating classifier robustness to label noise. Two datasets were used: one to validate the experimental setup, and ToxicChat, a real-world dataset of user prompts from the Vicuna LLM demo:

- 10,165 prompts
- 7.33% toxic
- 2.01% jailbreak prompts
- labeled by multiple annotators
Primary metric: PR-AUC (appropriate for imbalanced classification).
| Noise | LogReg PR-AUC | DeBERTa PR-AUC |
|---|---|---|
| 0% | 0.628 | 0.845 |
| 10% | 0.605 | 0.804 |
| 20% | 0.532 | 0.750 |
| 40% | 0.185 | 0.243 |
Performance degrades slowly until ~20% noise, then collapses rapidly.
On imbalanced datasets, F1 macro can increase even while the classifier gets worse.
Example from the experiments: F1 macro rises from 0.606 → 0.656 while PR-AUC drops from 0.627 → 0.382.
Monitoring only F1 hides model failure.
PR-AUC reflects ranking quality across thresholds and is more reliable for toxicity classification.
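A small sketch of computing both metrics with scikit-learn, so PR-AUC is tracked alongside (or instead of) a thresholded F1:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

def report(y_true, y_score, threshold=0.5):
    """Thresholded F1 macro plus threshold-free PR-AUC (average precision)."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "pr_auc": average_precision_score(y_true, y_score),
    }

# Toy imbalanced batch (1 = toxic).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.2, 0.7, 0.6])
print(report(y_true, y_score))
```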
Cleaning approaches evaluated:
| Strategy | Result |
|---|---|
| Loss filtering | consistent recovery |
| Heuristic filtering | small improvements |
| Confidence filtering | dangerous on imbalanced data |
At 40% noise, no strategy fully recovers the classifier.
Preventing noisy retraining batches is safer than post-hoc fixes.
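For context, the simplest form of loss filtering looks roughly like this: drop the examples whose labels a cross-validated probe finds most surprising. The 10% cutoff and the probe are illustrative, not the project's exact recipe:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labels.csv")
probe = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
proba = cross_val_predict(probe, df["text"], df["label"], cv=3, method="predict_proba")[:, 1]

# Per-example cross-entropy of the assigned label under the out-of-fold prediction.
eps = 1e-12
loss = -np.where(df["label"] == 1, np.log(proba + eps), np.log(1 - proba + eps))

# Drop the 10% of examples the probe finds most surprising (likely mislabeled).
keep = loss < np.quantile(loss, 0.90)
cleaned = df[keep]
print(f"kept {keep.sum()} of {len(df)} examples")
```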
Experiments show weak labels can sometimes help. On ToxicChat, training on all data (weak labels included) outperforms training on smaller human-only datasets, because weak labels from moderation models still contain signal.
Noise characteristics matter as much as noise amount.
Assume:

- 100k daily interactions
- 7.3% toxic prompts

That means roughly 7,300 toxic prompts daily.
Training with 20% label noise reduces PR-AUC:
0.628 → 0.532
At 40% noise, ranking approaches random.
Both missed toxic prompts and false blocks increase.
The ~20% noise threshold is where operational risk begins to accelerate.
```
noise-cliff/
  gate/         # quality gate CLI
  models/       # classifiers
  experiments/  # experiment scripts
  noise/        # noise injection utilities
  training/     # training pipeline
  evaluation/   # evaluation metrics
  cleaning/     # label cleaning strategies
  results/      # experiment outputs
```
Run the experiments:

```bash
python experiments/find_tipping_point.py
python experiments/run_noise_sweep.py
python experiments/run_cleaning.py
python experiments/run_quantity_vs_quality.py
```
Python 3.10, PyTorch, HuggingFace Transformers, scikit-learn, pandas, matplotlib