The RWD Classifier is a natural language processing system that analyzes text and classifies it according to a structured dictionary of right-wing discourse themes and subthemes. It fine-tunes a BERT model for text classification, using a specialized dictionary that categorizes terms by their ideological weight and their association with right-wing perspectives.
- Thematic Classification: Identifies right-wing discourse themes and subthemes in text
- Weighted Dictionary: Uses a curated dictionary whose terms are weighted on a scale from +2 to -2
- BERT-based Model: Fine-tunes a pretrained BERT transformer (bert-base-uncased by default) for classification
- Batch Processing: Can analyze individual texts or process entire Excel files
- Explainable AI: Provides supporting evidence (key terms and descriptions) for classifications
- Python 3.7 or higher
- pip package manager
Clone the repository:

    git clone https://github.com/codestreamhubio/rw-discourse-classifier.git
    cd rw-discourse-classifier

Install required packages:

    pip install -r requirements.txt

To train a new classification model using your dictionary:
```bash
python train_model.py
```
This uses the default settings and reads the dictionary from RWDictionary.xlsx.
For custom training:
```bash
python train_model.py --input_file RWDictionary.xlsx --model_name bert-base-uncased --epochs 15 --batch_size 16 --output_dir rwd_classifier
```
- `--input_file`: Path to Excel dictionary file (default: RWDictionary.xlsx)
- `--model_name`: Pretrained BERT model name (default: bert-base-uncased)
- `--epochs`: Number of training epochs (default: 15)
- `--batch_size`: Training batch size (default: 16)
- `--max_length`: Maximum token sequence length (default: 128)
- `--output_dir`: Directory to save trained model (default: rwd_classifier)
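As a rough illustration of how these options fit together, the sketch below declares them with Python's standard `argparse` module, using the flag names and defaults listed above. The function name `build_train_parser` is hypothetical, and the actual parsing code in `train_model.py` may be organized differently.

```python
import argparse


def build_train_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented train_model.py options."""
    parser = argparse.ArgumentParser(description="Train the RWD classifier")
    parser.add_argument("--input_file", default="RWDictionary.xlsx",
                        help="Path to Excel dictionary file")
    parser.add_argument("--model_name", default="bert-base-uncased",
                        help="Pretrained BERT model name")
    parser.add_argument("--epochs", type=int, default=15,
                        help="Number of training epochs")
    parser.add_argument("--batch_size", type=int, default=16,
                        help="Training batch size")
    parser.add_argument("--max_length", type=int, default=128,
                        help="Maximum token sequence length")
    parser.add_argument("--output_dir", default="rwd_classifier",
                        help="Directory to save trained model")
    return parser


if __name__ == "__main__":
    # Override two options; everything else keeps its documented default.
    args = build_train_parser().parse_args(["--epochs", "5", "--max_length", "256"])
    print(vars(args))
```

Any option that is omitted on the command line falls back to the default shown in the list.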
To analyze an Excel file containing text data:
```bash
python analyze_text.py
```
This uses the defaults: it reads input_data.xlsx and, because no output path is given, writes the results back to the same file.
For custom analysis:
```bash
python analyze_text.py --input input_data.xlsx --output output_data.xlsx --text_column Text --model_path rwd_classifier
```
- `--input`: Input Excel file path (default: input_data.xlsx)
- `--output`: Output Excel file path (default: overwrites input file)
- `--text_column`: Column name containing text to classify (default: 'Text')
- `--model_path`: Path to model directory (default: rwd_classifier)
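The batch flow implied by these options can be sketched in plain Python. Here spreadsheet rows are stood in for by a list of dictionaries, and `stub_classifier` is a placeholder for the trained model. All function names, the classifier signature, and the exact output column names are assumptions made for illustration, not the script's actual internals.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, str]
# Hypothetical classifier signature: text -> (theme, subtheme, description).
Classifier = Callable[[str], Tuple[str, str, str]]


def resolve_output_path(input_path: str, output_path: Optional[str] = None) -> str:
    """Fall back to the input path when no output path is given."""
    return output_path or input_path


def annotate_rows(rows: List[Row], classify: Classifier,
                  text_column: str = "Text") -> List[Row]:
    """Return copies of the rows with three prediction fields added."""
    annotated = []
    for row in rows:
        if text_column not in row:
            raise KeyError(f"Column {text_column!r} not found in row")
        theme, subtheme, description = classify(row[text_column])
        annotated.append({
            **row,
            "Predicted Theme": theme,
            "Predicted Subtheme": subtheme,
            "Subtheme Description": description,
        })
    return annotated


def stub_classifier(text: str) -> Tuple[str, str, str]:
    """Placeholder for the trained model; always returns the same labels."""
    return ("Nationalism", "Border Security", "placeholder description")


if __name__ == "__main__":
    rows = [{"Text": "first example text"}, {"Text": "second example text"}]
    print(resolve_output_path("input_data.xlsx"))
    for row in annotate_rows(rows, stub_classifier):
        print(row)
```

Because the output path falls back to the input path, passing `--output` explicitly is the way to keep the original file unchanged.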
The system requires an Excel dictionary file with two sheets:
One sheet contains terms organized by:
- Theme (e.g., "Nationalism", "Traditional Values")
- Sub-theme (e.g., "Border Security", "Family Structure")
- Weight categories:
  - +2 (Strongly Supports RW View)
  - +1 (Moderately Supports RW View)
  - 0 (Neutral/Ambiguous)
  - -1 (Moderately Opposes RW View)
  - -2 (Strongly Opposes RW View)

The other sheet contains detailed descriptions for each sub-theme.
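To make the dictionary layout concrete, the sketch below models it in memory and flattens it into one record per term. The nested-dict shape, the `TermEntry` class, and the sample terms are hypothetical stand-ins. The real data lives in the Excel sheets, whose exact column layout is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List

# Weight categories as documented above.
WEIGHT_LABELS = {
    2: "Strongly Supports RW View",
    1: "Moderately Supports RW View",
    0: "Neutral/Ambiguous",
    -1: "Moderately Opposes RW View",
    -2: "Strongly Opposes RW View",
}


@dataclass(frozen=True)
class TermEntry:
    term: str
    theme: str
    subtheme: str
    weight: int


def flatten_dictionary(
    nested: Dict[str, Dict[str, Dict[int, List[str]]]]
) -> List[TermEntry]:
    """Turn theme -> subtheme -> weight -> terms into flat term records."""
    entries = []
    for theme, subthemes in nested.items():
        for subtheme, by_weight in subthemes.items():
            for weight, terms in by_weight.items():
                if weight not in WEIGHT_LABELS:
                    raise ValueError(f"Unsupported weight: {weight}")
                for term in terms:
                    entries.append(TermEntry(term.strip(), theme, subtheme, weight))
    return entries


if __name__ == "__main__":
    # Placeholder terms only; real terms come from the dictionary file.
    sample = {
        "Nationalism": {
            "Border Security": {2: ["example term A"], -1: ["example term B"]},
        },
    }
    for entry in flatten_dictionary(sample):
        print(entry, "->", WEIGHT_LABELS[entry.weight])
```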
The classifier provides:
- Predicted Theme: Broad ideological category
- Predicted Subtheme: Specific discourse element
- Subtheme Description: Explanation of the subtheme
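One plausible way to assemble these three fields is sketched below. It assumes the model scores each subtheme and that the theme and description are then looked up from the dictionary. The source does not confirm this flow, and the function name, lookup tables, and theme-to-subtheme pairings are hypothetical.

```python
from typing import Dict, Sequence


def build_prediction(
    scores: Sequence[float],
    classes: Sequence[str],
    subtheme_to_theme: Dict[str, str],
    descriptions: Dict[str, str],
) -> Dict[str, str]:
    """Pick the highest-scoring subtheme and attach its theme and description."""
    if len(scores) != len(classes):
        raise ValueError("scores and classes must have the same length")
    best = max(range(len(scores)), key=lambda i: scores[i])
    subtheme = classes[best]
    return {
        "Predicted Theme": subtheme_to_theme[subtheme],
        "Predicted Subtheme": subtheme,
        "Subtheme Description": descriptions[subtheme],
    }


if __name__ == "__main__":
    classes = ["Border Security", "Family Structure"]
    # Assumed pairings, for illustration only.
    themes = {"Border Security": "Nationalism",
              "Family Structure": "Traditional Values"}
    descriptions = {"Border Security": "placeholder description 1",
                    "Family Structure": "placeholder description 2"}
    print(build_prediction([0.3, 1.7], classes, themes, descriptions))
```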
- Base Model: BERT (bert-base-uncased)
- Classification Head: Single linear layer
- Training: Fine-tuned with AdamW optimizer
- Learning Rate: 2e-5 with 500 warmup steps
- Tokenization: BERT WordPiece tokenizer
- Sequence Length: 128 tokens (truncated/padded)
- Label Encoding: sklearn LabelEncoder
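Two of these details can be illustrated without any deep-learning dependencies. The sketch below mimics what sklearn's `LabelEncoder` does (sorted unique labels mapped to integer ids) and computes a learning rate with 500 linear warmup steps up to 2e-5. The linear decay after warmup is an assumption, since the schedule after warmup is not specified here, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def encode_labels(labels: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Map labels to integer ids over sorted unique classes, as LabelEncoder does."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


def learning_rate(step: int, total_steps: int,
                  base_lr: float = 2e-5, warmup_steps: int = 500) -> float:
    """Linear warmup to base_lr, then (assumed) linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    classes, ids = encode_labels(["Family Structure", "Border Security",
                                  "Family Structure"])
    print(classes, ids)
    for step in (0, 250, 500, 1250, 2000):
        print(step, learning_rate(step, total_steps=2000))
```

The rate starts at zero, reaches its peak at step 500, and then shrinks for the rest of training.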