codestreamhubio/rw-discourse-classifier

Right-Wing Discourse Classifier

Overview

The RWD Classifier is a natural language processing system that classifies text against a structured dictionary of right-wing discourse themes and subthemes. It fine-tunes BERT for text classification on that dictionary, in which each term carries a weight reflecting how strongly it supports or opposes right-wing perspectives.

Key Features

  • Thematic Classification: Identifies right-wing discourse themes and subthemes in text
  • Weighted Dictionary: Uses a carefully curated dictionary with weighted terms (+2 to -2 scale)
  • BERT-based Model: Fine-tunes a pretrained BERT transformer (bert-base-uncased by default) for classification
  • Batch Processing: Can analyze individual texts or process entire Excel files
  • Explainable AI: Provides supporting evidence (key terms and descriptions) for classifications

Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager

Steps

  1. Clone the repository:

```bash
git clone https://github.com/codestreamhubio/rw-discourse-classifier.git
cd rw-discourse-classifier
```

  2. Install required packages:

```bash
pip install -r requirements.txt
```

Usage

Training the Model

To train a new classification model using your dictionary:

```bash
python train_model.py
```

With no arguments, the script uses the default settings and reads the dictionary from RWDictionary.xlsx.

For custom training:

```bash
python train_model.py --input_file RWDictionary.xlsx --model_name bert-base-uncased --epochs 15 --batch_size 16 --output_dir rwd_classifier
```

Arguments:

  • --input_file: Path to Excel dictionary file (default: RWDictionary.xlsx)
  • --model_name: Pretrained BERT model name (default: bert-base-uncased)
  • --epochs: Number of training epochs (default: 15)
  • --batch_size: Training batch size (default: 16)
  • --max_length: Maximum token sequence length (default: 128)
  • --output_dir: Directory to save trained model (default: rwd_classifier)
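
The flags and defaults listed above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the contents of train_model.py: the function name `build_train_parser` and the help strings are assumptions, and only the flag names and default values come from the list above.

```python
import argparse


def build_train_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented training flags."""
    parser = argparse.ArgumentParser(description="Train the RWD classifier")
    parser.add_argument("--input_file", default="RWDictionary.xlsx",
                        help="Path to Excel dictionary file")
    parser.add_argument("--model_name", default="bert-base-uncased",
                        help="Pretrained BERT model name")
    parser.add_argument("--epochs", type=int, default=15,
                        help="Number of training epochs")
    parser.add_argument("--batch_size", type=int, default=16,
                        help="Training batch size")
    parser.add_argument("--max_length", type=int, default=128,
                        help="Maximum token sequence length")
    parser.add_argument("--output_dir", default="rwd_classifier",
                        help="Directory to save trained model")
    return parser


# No flags given: every documented default applies.
defaults = build_train_parser().parse_args([])

# Override a subset of flags; the others keep their defaults.
custom = build_train_parser().parse_args(["--epochs", "5", "--max_length", "256"])
```

Any flag left off the command line keeps its default, so `--max_length` can be set without repeating the other arguments.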

Analyzing Text

To analyze an Excel file containing text data:

```bash
python analyze_text.py
```

With no arguments, the script reads input_data.xlsx and writes the results back to the same file.

For custom analysis:

```bash
python analyze_text.py --input input_data.xlsx --output output_data.xlsx --text_column Text --model_path rwd_classifier
```

Arguments:

  • --input: Input Excel file path (default: input_data.xlsx)
  • --output: Output Excel file path (default: overwrites input file)
  • --text_column: Column name containing text to classify (default: 'Text')
  • --model_path: Path to model directory (default: rwd_classifier)
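
Because `--output` defaults to overwriting the input file, the fallback is worth spelling out. The sketch below shows one way that behaviour could be implemented. The helper name `resolve_analysis_args` is hypothetical and the actual logic in analyze_text.py may differ.

```python
import argparse


def resolve_analysis_args(argv):
    """Hypothetical helper: parse the documented flags and resolve --output."""
    parser = argparse.ArgumentParser(description="Classify texts in an Excel file")
    parser.add_argument("--input", default="input_data.xlsx")
    parser.add_argument("--output", default=None)
    parser.add_argument("--text_column", default="Text")
    parser.add_argument("--model_path", default="rwd_classifier")
    args = parser.parse_args(argv)
    # Documented default: when no output path is given, overwrite the input file.
    if args.output is None:
        args.output = args.input
    return args


in_place = resolve_analysis_args(["--input", "posts.xlsx"])
separate = resolve_analysis_args(["--input", "posts.xlsx", "--output", "labeled.xlsx"])
```

Passing `--output` explicitly keeps the original spreadsheet untouched, which is the safer choice when the input is the only copy of the data.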

Dictionary Structure

The system requires an Excel dictionary file with two sheets:

1. Weighted Sheet

Contains terms organized by:

  • Theme (e.g., "Nationalism", "Traditional Values")
  • Sub-theme (e.g., "Border Security", "Family Structure")
  • Weight categories:
    • +2 (Strongly Supports RW View)
    • +1 (Moderately Supports RW View)
    • 0 (Neutral/Ambiguous)
    • -1 (Moderately Opposes RW View)
    • -2 (Strongly Opposes RW View)

2. Typology Sheet

Contains detailed descriptions for each sub-theme.
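
To make the Weighted sheet layout concrete, the sketch below flattens such a sheet into one labeled example per term. It assumes a layout the README does not confirm: one row per sub-theme, one column per weight category named "+2" through "-2", and comma-separated terms in each cell. The column names and the placeholder terms are illustrative only. Rows are plain dicts here, such as those produced by reading the sheet with pandas and calling `to_dict("records")`.

```python
# Assumed column name -> numeric weight (hypothetical layout).
WEIGHT_COLUMNS = {"+2": 2, "+1": 1, "0": 0, "-1": -1, "-2": -2}


def flatten_weighted_rows(rows):
    """Turn one-row-per-sub-theme records into one example per term."""
    examples = []
    for row in rows:
        for column, weight in WEIGHT_COLUMNS.items():
            cell = row.get(column) or ""  # missing or empty cell -> no terms
            for term in cell.split(","):
                term = term.strip()
                if term:
                    examples.append({
                        "term": term,
                        "theme": row["Theme"],
                        "subtheme": row["Sub-theme"],
                        "weight": weight,
                    })
    return examples


# Placeholder terms, not taken from the real dictionary.
rows = [{
    "Theme": "Nationalism",
    "Sub-theme": "Border Security",
    "+2": "placeholder term a, placeholder term b",
    "0": "placeholder term c",
    "-2": "placeholder term d",
}]
examples = flatten_weighted_rows(rows)
```

Each resulting record pairs a term with its theme, sub-theme, and weight, which is the shape a text classifier needs for training.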

Output Interpretation

The classifier provides:

  • Predicted Theme: Broad ideological category
  • Predicted Subtheme: Specific discourse element
  • Subtheme Description: Explanation of the subtheme
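
The three output fields can be assembled from a single predicted sub-theme, as sketched below. This assumes the model predicts the sub-theme and that the theme and description are then looked up from the dictionary. The README does not state which label is predicted directly, so treat the lookup direction, the theme pairings, and the names `SUBTHEME_TO_THEME`, `TYPOLOGY`, and `build_output_row` as hypothetical.

```python
# Hypothetical lookups built from the Weighted and Typology sheets.
SUBTHEME_TO_THEME = {
    "Border Security": "Nationalism",
    "Family Structure": "Traditional Values",
}
TYPOLOGY = {
    "Border Security": "<description from the Typology sheet>",
    "Family Structure": "<description from the Typology sheet>",
}


def build_output_row(text, predicted_subtheme):
    """Attach the three documented output fields to one input text."""
    return {
        "Text": text,
        "Predicted Theme": SUBTHEME_TO_THEME[predicted_subtheme],
        "Predicted Subtheme": predicted_subtheme,
        # Fall back to an empty string if the Typology sheet has no entry.
        "Subtheme Description": TYPOLOGY.get(predicted_subtheme, ""),
    }


row = build_output_row("some input text", "Border Security")
```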

Technical Details

Model Architecture

  • Base Model: BERT (bert-base-uncased)
  • Classification Head: Single linear layer
  • Training: Fine-tuned with AdamW optimizer
  • Learning Rate: 2e-5 with 500 warmup steps

Data Processing

  • Tokenization: BERT WordPiece tokenizer
  • Sequence Length: 128 tokens (truncated/padded)
  • Label Encoding: sklearn LabelEncoder
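
The two fixed-size conventions above can be illustrated in plain Python. This is a stand-in for the BERT tokenizer and sklearn's LabelEncoder, not a replacement for them. The helper names are hypothetical and the content token IDs are arbitrary placeholders. The special-token IDs (0 for [PAD], 101 for [CLS], 102 for [SEP]) are those of the bert-base-uncased vocabulary.

```python
MAX_LENGTH = 128
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # bert-base-uncased special tokens


def pad_or_truncate(content_ids, max_length=MAX_LENGTH):
    """Wrap content in [CLS]/[SEP], cut to max_length, pad the rest."""
    # Reserve two positions for [CLS] and [SEP] before truncating.
    ids = [CLS_ID] + content_ids[: max_length - 2] + [SEP_ID]
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, attention_mask + [0] * padding


def encode_labels(labels):
    """Map string labels to integers by sorted class order, as LabelEncoder does."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


short_ids, short_mask = pad_or_truncate([2023, 2003])           # padded
long_ids, long_mask = pad_or_truncate(list(range(1000, 1300)))  # truncated
label_ids, classes = encode_labels(
    ["Family Structure", "Border Security", "Family Structure"]
)
```

The attention mask marks real tokens with 1 and padding with 0, so the model ignores the padded positions. The sorted class list is what lets a predicted integer be mapped back to a sub-theme name at analysis time.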

License
