This repository provides a complete, reproducible pipeline for training, evaluating, and exporting a large-scale text classification model based on the DeBERTa-Large architecture. The model is trained on large benchmark datasets and optimized to achieve high classification accuracy (≥98%) under a fixed decision threshold of 0.10.
The work emphasizes academic rigor, controlled experimentation, and reproducibility, making it suitable for university research, thesis work, and large-scale benchmarking.
- Ubuntu 20.04 LTS or 22.04 LTS
- GPU: NVIDIA A100 (20 GB VRAM partition)
- CPU: ≥ 8 cores recommended
- RAM: ≥ 32 GB
- Python 3.10
- CUDA 11.8
- cuDNN 8.x
Create a virtual environment:
```bash
python3.10 -m venv venv
source venv/bin/activate
pip install --upgrade pip
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Verify GPU availability:
```bash
python - <<EOF
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
EOF
```

- HC3 (Hugging Face)
- WikiText-103 v1 (Hugging Face)
```python
from datasets import load_dataset

hc3 = load_dataset("Hello-SimpleAI/HC3")
wikitext = load_dataset("wikitext", "wikitext-103-v1")
```

- Full dataset usage
- Deterministic shuffling
- Stratified splits
- Fixed random seed
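The deterministic, stratified split described above can be sketched as follows. This is a pure-Python illustration (the actual pipeline may instead use `datasets`' built-in `train_test_split`); the function name, split fraction, and seed value are illustrative:

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, test_frac=0.1, seed=42):
    """Deterministically split (example, label) pairs, preserving label ratios."""
    by_label = defaultdict(list)
    for ex, y in zip(examples, labels):
        by_label[y].append(ex)

    rng = random.Random(seed)  # fixed seed -> reproducible shuffle
    train, test = [], []
    for y, group in sorted(by_label.items()):
        rng.shuffle(group)
        n_test = int(len(group) * test_frac)
        test += [(ex, y) for ex in group[:n_test]]
        train += [(ex, y) for ex in group[n_test:]]
    return train, test
```

Because the shuffle uses a dedicated, seeded `random.Random` instance and iterates label groups in sorted order, repeated runs produce byte-identical splits.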
- Text normalization
- Tokenization using DeBERTa tokenizer
- Maximum sequence length enforcement
- Padding and truncation
- Label encoding
The preprocessing pipeline is identical for training, validation, and testing.
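The normalization and label-encoding steps can be sketched as below. The label names and the `source` field are hypothetical placeholders (the real dataset's schema may differ), and tokenization itself is delegated to the DeBERTa tokenizer rather than re-implemented here:

```python
import unicodedata

# Hypothetical label scheme; the actual dataset's labels may differ.
LABEL2ID = {"human": 0, "ai": 1}

def normalize_text(text: str) -> str:
    # Unicode-normalize and collapse runs of whitespace
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def preprocess(example: dict) -> dict:
    # Tokenization is handled downstream by the DeBERTa tokenizer, e.g.:
    #   tokenizer(text, truncation=True, padding="max_length", max_length=512)
    return {
        "text": normalize_text(example["text"]),
        "label": LABEL2ID[example["source"]],
    }
```

Applying the same `preprocess` function to the train, validation, and test splits is what keeps the pipeline identical across them.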
- Base model: microsoft/deberta-v3-large
- End-to-end fine-tuning with a task-specific classification head
- No layer freezing in the final training phase
- Optimizer: AdamW
- Mixed precision (FP16)
- Gradient accumulation for VRAM efficiency
- Validation-based early stopping
- Best checkpoint selection
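A training configuration matching the bullets above might look like the following sketch using the Hugging Face `Trainer` API (which uses AdamW by default). All hyperparameter values are illustrative, not the ones behind the reported results, and `train_ds`/`val_ds` are assumed to be prepared upstream; argument names follow recent `transformers` releases:

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments, EarlyStoppingCallback)

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2)

args = TrainingArguments(
    output_dir="checkpoints",
    fp16=True,                        # mixed precision
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,    # larger effective batch under a VRAM budget
    learning_rate=1e-5,               # illustrative value
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # best-checkpoint selection
    metric_for_best_model="eval_loss",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,           # assumed prepared upstream
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```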
Metrics:
- Accuracy
- Precision
- Recall
- F1-score
Decision threshold:
0.10
Final performance:
- Accuracy ≥ 98%
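The metrics above, evaluated at the fixed 0.10 threshold, can be computed as in this minimal sketch (the actual pipeline may rely on `sklearn.metrics` instead; the function name is illustrative):

```python
def metrics_at_threshold(probs, labels, threshold=0.10):
    """Binary classification metrics given P(class=1) scores."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Note that a low threshold such as 0.10 trades precision for recall on the positive class, which is why the threshold must be held fixed for the reported numbers to be comparable.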
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForSequenceClassification.from_pretrained("./model")
model.eval()

inputs = tokenizer("Example input text", return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
prediction = (probs[:, 1] >= 0.10).int()
```

The trained model is saved in Hugging Face format:
```
model/
├── config.json
├── pytorch_model.bin
├── tokenizer.json
├── tokenizer_config.json
└── special_tokens_map.json
```
To reproduce results:
- Use identical dependency versions
- Preserve random seeds
- Maintain the same preprocessing and threshold
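Seed preservation can be centralized in a single helper called once at startup. This sketch covers the standard-library RNGs; the full pipeline would also seed NumPy and PyTorch as indicated in the comments (the seed value is illustrative):

```python
import os
import random

def set_seed(seed: int = 42) -> None:
    """Pin the RNGs used across the pipeline for reproducible runs."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    # The full pipeline would additionally seed the array/tensor libraries:
    #   np.random.seed(seed)
    #   torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    #   torch.backends.cudnn.deterministic = True
```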
- Academic research
- Benchmarking
- Large-scale experimentation