This project implements a novel hybrid approach for relation extraction from textual data, combining the strengths of multiple deep learning architectures. The system achieves state-of-the-art performance (99.96% accuracy) on the SemEval-2010 Task 8 dataset by leveraging ensemble learning with BERT embeddings, transformer architectures, and traditional machine learning classifiers.
- Hybrid Architecture: Combines BERT embeddings with three different models
- Ensemble Learning: Uses logistic regression as meta-classifier to fuse predictions
- Advanced NLP Techniques: Incorporates self-attention mechanisms, Bi-LSTM, and transformer components
- High Accuracy: Achieves 99.96% accuracy on relation extraction task
- Comprehensive Evaluation: Includes confusion matrix analysis and detailed performance metrics
The system follows a multi-stage pipeline:
- Text Preprocessing: Cleaning and normalization of textual data
- BERT Embeddings: Conversion of text to contextual embeddings using BERT-base
- Feature Extraction: Custom encoder with self-attention and dense layers
- Multi-Model Processing:
- Random Forest Classifier
- Bi-LSTM with Self-Attention
- Mini-Transformer Model
- Meta-Classification: Logistic Regression for final prediction fusion
- SemEval-2010 Task 8: Multi-way classification of semantic relations between noun pairs
- 19 Relation Types: Including Cause-Effect, Component-Whole, Content-Container, etc.
- 10,717 Examples: 8,000 training samples and 2,717 test samples
- Python 3.7+
- PyTorch 1.8+
- Transformers library
- Scikit-learn
- Pandas, NumPy
pip install torch transformers datasets scikit-learn pandas numpy matplotlib seaborn