A public repository containing datasets and code for the paper *HateXScore: A Metric Suite for Evaluating Reasoning Quality in Hate Speech Explanations* (EACL 2026).
HateXScore is a metric suite for evaluating the quality of model-generated explanations in hate speech detection. It focuses on whether an explanation clearly states a conclusion, quotes relevant evidence faithfully, identifies the targeted protected group, and remains logically consistent with the model prediction.
This repository provides a modular implementation of HateXScore with the following components:
- HTC: Hate-Type Check
- QF: Quotation Faithfulness
- TGI: Target-Group Identification
- CC: Consistency Check
The final HateXScore is computed from these four sub-metrics. By default, all metrics have equal weight, and the repository also supports configurable metric weights.
```
hatexscore/
├── __init__.py
├── htc.py
├── qf.py
├── tgi.py
├── cc.py
└── utils.py
```
- `hatexscore/htc.py`: conclusion detection logic and label extraction utilities
- `hatexscore/qf.py`: quotation overlap extraction, masking, probability estimation, and quotation faithfulness scoring
- `hatexscore/tgi.py`: protected-group matching with language-aware tokenization and lemmatization
- `hatexscore/cc.py`: consistency rule between prediction, QF, and TGI
- `hatexscore/utils.py`: evaluator class, CLI entry point, dataset loading, protected-group lists, and final score aggregation
- Supports English, Chinese, and Korean
- Supports multiple protected-group inventories
- Works on JSONL reasoning outputs paired with CSV datasets
- Keeps the original evaluation logic while exposing configurable final metric weights
- Designed to align with the HateXScore paper workflow
Create and activate a Python environment first, then install the dependencies used in the current implementation.
```
pip install numpy pandas spacy jieba fuzzysearch openai konlpy
```

You may also need:

- a spaCy English model such as `en_core_web_sm`
- Java installed for `konlpy` in Korean settings
Example:
```
python -m spacy download en_core_web_sm
```

The current implementation expects:
- A JSONL file containing generated reasoning outputs
- A CSV file containing the original input text and gold labels
Each JSONL line should contain fields used by the script such as:
{"ID": .., "text": "...", "raw": "...model explanation...", "label": "hateful", "flag": true}The CSV schema depends on the dataset selected with --dataset. The script already contains dataset-specific column mappings for:
- `implicit`
- `hatexplain`
- `hatecheck`
- `toxicn`
- `hasoc`
- `kold`
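As a hedged sketch of the expected input (the loader function below is illustrative and not part of the package; field names follow the JSONL example above), the reasoning outputs can be read like this:

```python
import json

def load_reasoning_outputs(path):
    """Read one JSON object per non-empty line from a JSONL file.

    Illustrative helper only; the repository's own loader lives in
    hatexscore/utils.py and may differ in detail.
    """
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            samples.append(json.loads(line))
    return samples
```

Each returned dict would then carry the `"ID"`, `"text"`, `"raw"`, `"label"`, and `"flag"` fields shown above, ready to be paired with the corresponding CSV row by ID.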
Because the code uses package-relative imports, run it from the parent directory of hatexscore with module mode:
```
python -m hatexscore.utils \
  --dataset hatexplain \
  --data_path /path/to/reason_output.json \
  --input_csv /path/to/input.csv \
  --output_dir /path/to/output_dir \
  --model gpt \
  --protected_group un_en \
  --lang en
```

The original code used a simple average across the four metrics. This repository keeps the same default behavior by setting all weights to 1.0, but also lets you adjust the final aggregation.
Available arguments:
- `--weight_htc`
- `--weight_qf`
- `--weight_tgi`
- `--weight_cc`
Example:
```
python -m hatexscore.utils \
  --dataset hatexplain \
  --data_path /path/to/reason_output.json \
  --input_csv /path/to/input.csv \
  --output_dir /path/to/output_dir \
  --model gpt \
  --protected_group un_en \
  --lang en \
  --weight_htc 1.0 \
  --weight_qf 2.0 \
  --weight_tgi 1.0 \
  --weight_cc 1.0
```

The final score is computed as a weighted average:
```
HateXScore = (HTC * w_htc + QF * w_qf + TGI * w_tgi + CC * w_cc)
             / (w_htc + w_qf + w_tgi + w_cc)
```
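In plain Python, this aggregation amounts to the following (the function name is hypothetical and not part of the package API):

```python
def aggregate_hatexscore(scores, weights):
    """Weighted average of the four sub-metric scores.

    `scores` and `weights` map metric names ("HTC", "QF", "TGI", "CC")
    to floats. With all weights at 1.0 this reduces to the simple
    average used by the original code.
    """
    total_weight = sum(weights.values())
    return sum(scores[m] * weights[m] for m in scores) / total_weight
```

For example, with sub-metric scores of 1.0, 0.5, 1.0, and 1.0 and equal weights, the result is the plain mean 0.875; doubling `w_qf` pulls the score down toward the QF value.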
The current implementation includes several built-in inventories:
- `facebook`
- `youtube`
- `un_en`
- `un_zh`
- `un_kr`

These are selected through `--protected_group`.
For each sample, the script writes a JSON object containing:
- input text
- reasoning
- gold label
- predicted label
- per-metric scores
- final HateXScore
The output file is written to `{output_dir}/{dataset}_metric_{model}.json`.
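To summarize a finished run, the per-sample objects can be read back and averaged. This is a hedged sketch: it assumes the output file is a JSON array and that the final score is stored under a key named `"HateXScore"`; check the actual output file for the real layout and field name before relying on it.

```python
import json

def mean_final_score(results_path, score_key="HateXScore"):
    """Average the final score over all evaluated samples.

    Assumes a JSON array of per-sample objects; `score_key` is an
    assumed field name, not confirmed by the repository.
    """
    with open(results_path, encoding="utf-8") as f:
        results = json.load(f)
    scores = [r[score_key] for r in results if score_key in r]
    return sum(scores) / len(scores) if scores else None
```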
You can also import the evaluator directly:
```python
from hatexscore import ReasoningMetricsEvaluator

weights = {
    "HTC": 1.0,
    "Quotation Faithfulness": 1.0,
    "TGI": 1.0,
    "Consistency Check": 1.0,
}

evaluator = ReasoningMetricsEvaluator(
    language="en",
    metric_weights=weights,
    runtime_args=args,
)
```

Then evaluate a sample:
```python
sample = {
    "text": "example text",
    "reasoning": "example explanation",
    "gold_label": "hateful",
    "prediction": "hateful",
}

result = evaluator.evaluate_sample(sample, target_group_list)
```

- The current implementation preserves the original logic of the provided codebase as closely as possible.
- `QF` uses the OpenRouter-compatible `OpenAI` client configured in the source code.
- If you run `python utils.py` directly inside the `hatexscore/` directory, relative imports may fail; use `python -m hatexscore.utils` instead.
- Some dependencies in the original larger script are not needed in this simplified modular repo.
If you use this repository in academic work, please cite the HateXScore paper.
```bibtex
@article{hu2026hatexscore,
  title={HateXScore: A Metric Suite for Evaluating Reasoning Quality in Hate Speech Explanations},
  author={Hu, Yujia and Lee, Roy Ka-Wei},
  journal={arXiv preprint arXiv:2601.13547},
  year={2026}
}
```

This repository is intended for research use on hate speech detection and explanation evaluation. It may process sensitive or offensive text as part of the evaluation pipeline.