This repository is the official implementation of our ACL Findings 2025 paper *Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity*.
Set up the environment:

```bash
conda env create -f environment.yaml
conda activate snne
pip install flash-attn==2.6.1 --no-build-isolation
pip install -e .
```

For almost all tasks, the dataset is downloaded automatically from the Hugging Face Datasets library the first time a script runs. The only exception is BioASQ (task b, BioASQ11, 2023), whose data must be downloaded manually and saved to `./data/bioasq/training11b.json`.
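Before running the QA generation on BioASQ, a quick check can confirm that the manually downloaded file is where the scripts expect it. This is a minimal sketch and not part of the repository: the helper `check_bioasq_file` is hypothetical, and it assumes the BioASQ training JSON stores its items in a top-level `questions` list whose entries each carry a `body` field. That is how BioASQ training files are commonly laid out, but this README does not confirm it, so adjust the check if your copy differs.

```python
import json
from pathlib import Path

# Location the repository expects (taken from this README).
BIOASQ_PATH = Path("./data/bioasq/training11b.json")


def check_bioasq_file(path: Path = BIOASQ_PATH) -> int:
    """Hypothetical helper: sanity-check the manually downloaded BioASQ file.

    ASSUMPTION (not confirmed by the README): the JSON has a top-level
    "questions" list, and each entry is a dict with a "body" string.
    Returns the number of questions found; raises if anything looks off.
    """
    if not path.is_file():
        raise FileNotFoundError(f"BioASQ data not found at {path}")
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    questions = data.get("questions") if isinstance(data, dict) else None
    if not isinstance(questions, list):
        raise ValueError("Expected a top-level 'questions' list (assumed format)")
    # Count entries that do not look like question records.
    malformed = sum(1 for q in questions if not isinstance(q, dict) or "body" not in q)
    if malformed:
        raise ValueError(f"{malformed} question(s) lack a 'body' field")
    return len(questions)
```

Run `check_bioasq_file()` from the repository root: it returns the number of questions when the file is present and well-formed, and raises `FileNotFoundError` or `ValueError` otherwise.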
Generate model responses for each task:

- QA
  ```bash
  ./scripts/generate/generate_qa.sh
  ```
- Summarization
  ```bash
  ./scripts/generate/generate_summarization.sh
  ```
- Translation
  ```bash
  ./scripts/generate/generate_translation.sh
  ```

Compute the uncertainty scores:

- SNNE and WSNNE
  ```bash
  ./scripts/compute/compute_snne.sh
  ```
- Graph baselines (SumEigv, Deg, Eccen) + NumSet + LexSim
  ```bash
  ./scripts/compute/compute_graph_baselines.sh
  ```
- KLE
  ```bash
  ./scripts/compute/compute_kle.sh
  ```
- LUQ
  ```bash
  ./scripts/compute/compute_luq.sh
  ```
- SAR
  ```bash
  ./scripts/compute/compute_sar.sh
  ```
- Eigenscore
  ```bash
  ./scripts/compute/compute_eigenscore.sh
  ```

Evaluate the results:

- SE, NE, DSE, and pTrue: open the Jupyter notebook `notebooks/evaluation.ipynb`, set the `wandb_id` variable in the second cell to the ID assigned to your run, and run all cells.
- Other methods: open the CSV files in the corresponding `*_results` folder to find the evaluation metrics.
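To read the metrics from all `*_results` folders at once instead of opening each CSV by hand, a small script along these lines can help. It is a sketch, not part of the repository: `collect_results` is a hypothetical helper, it assumes the `*_results` folders sit directly in the directory you pass in (by default the current working directory, e.g. the repository root), and it makes no assumption about column names, since this README does not document the CSV layout. Only the Python standard library is used.

```python
import csv
from pathlib import Path


def collect_results(root: str = ".") -> dict[str, list[dict[str, str]]]:
    """Hypothetical helper: load every CSV inside any `*_results` folder under `root`.

    ASSUMPTION: result folders live directly under `root` and match `*_results`.
    Returns a mapping from each CSV path (relative to `root`) to its rows,
    where each row is a dict keyed by that file's own header.
    """
    root_path = Path(root)
    results: dict[str, list[dict[str, str]]] = {}
    # Only directories count; files that happen to match the pattern are skipped.
    for folder in sorted(p for p in root_path.glob("*_results") if p.is_dir()):
        # Recurse so CSVs in nested subfolders are picked up too.
        for csv_path in sorted(folder.rglob("*.csv")):
            with csv_path.open(newline="", encoding="utf-8") as f:
                rows = list(csv.DictReader(f))
            results[str(csv_path.relative_to(root_path))] = rows
    return results
```

For example, `for name, rows in collect_results().items(): print(name, rows)` prints each results file followed by its rows. Keeping each row as a plain dict avoids hard-coding metric names that may differ between methods.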
If you have any questions about the code or the paper, feel free to email Dang Nguyen (nguyentuanhaidang@gmail.com). If you run into problems using the code or want to report a bug, please open an issue and describe the problem in as much detail as possible so we can help you more quickly.
Please cite our paper if you find this repo helpful in your work:

```bibtex
@article{nguyen2025beyond,
  title={Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity},
  author={Nguyen, Dang and Payani, Ali and Mirzasoleiman, Baharan},
  journal={In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2025}
}
```

The structure of this repo is largely based on semantic_uncertainty. The graph baselines are adapted from UQ-NLG, while the summarization and translation parts are adapted from lm-polygraph. We are very grateful to their authors for open-sourcing their code.