Citation Benchmark is a research project that proposes a comprehensive framework for evaluating how well large language models (LLMs) generate and utilize reference citations. The system introduces novel metrics and automated pipelines to assess the factual accuracy and attribution quality of LLM-generated text, minimizing the need for costly and subjective human judgment.
This project was developed as part of the NLP Final Project at the Computer Engineering Department, Sharif University of Technology, Tehran.
- Establish a standardized benchmark procedure for evaluating citation quality across different LLMs
- Introduce novel automated metrics that leverage correlation-based assessments of sentence fragments within documents
- Optimize input processing by reducing token count through relevant snippets, summaries, and named-entity-based document reconstruction
- Develop structured input formats that improve the consistency and accuracy of model outputs
- Propose a new citation evaluation metric compatible with Perplexity.AI and Microsoft Copilot models
The framework follows a multi-stage pipeline:
Extract cited sentences along with their reference URLs from LLM responses. Supports multiple formats:
- Perplexity.AI: bracketed citation numbers (e.g., `[1][2]`)
- Microsoft Copilot: UTF-8 superscript-based citation markers
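As a rough illustration, both formats can be parsed with simple pattern matching. The `extract_citations` helper below is hypothetical (the notebook's actual parsing may differ), and it assumes each Copilot superscript digit marks a separate single-digit citation index:

```python
import re

# Hypothetical extraction helper; Metric.ipynb's actual parsing may differ.
BRACKET_CITATION = re.compile(r"\[(\d+)\]")  # Perplexity.AI style: [1][2]
SUPERSCRIPTS = "¹²³⁴⁵⁶⁷⁸⁹⁰"
TO_DIGITS = str.maketrans(SUPERSCRIPTS, "1234567890")

def extract_citations(sentence: str) -> list[int]:
    """Return the citation indices attached to one sentence."""
    # Perplexity.AI: bracketed numbers such as [1] or [3]
    indices = [int(m) for m in BRACKET_CITATION.findall(sentence)]
    # Microsoft Copilot: UTF-8 superscript markers; each superscript digit is
    # treated here as a separate citation index (an assumption about the format)
    indices += [int(ch.translate(TO_DIGITS)) for ch in sentence if ch in SUPERSCRIPTS]
    return indices

print(extract_citations("The Nile is about 6,650 km long.[1][3]"))  # -> [1, 3]
print(extract_citations("The Nile is about 6,650 km long.¹³"))      # -> [1, 3]
```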
Each cited passage is decomposed into independent, self-contained atomic facts using a referee LLM (GPT-4o Mini). This ensures fine-grained verification at the statement level.
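A minimal sketch of this step, assuming an OpenAI-style client and a hypothetical prompt (the project routes referee calls through the Poe API, and its exact wording may differ):

```python
from openai import OpenAI

# Illustrative client only; the project queries GPT-4o Mini via the Poe API.
client = OpenAI()

# Hypothetical decomposition prompt.
DECOMPOSE_PROMPT = (
    "Break the following passage into a numbered list of independent, "
    "self-contained atomic facts. Resolve all pronouns so each fact is "
    "verifiable on its own.\n\nPassage: {passage}"
)

def decompose(passage: str, model: str = "gpt-4o-mini") -> list[str]:
    """Ask the referee LLM to split a cited passage into atomic facts."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": DECOMPOSE_PROMPT.format(passage=passage)}],
        temperature=0,
    )
    lines = response.choices[0].message.content.splitlines()
    # Parse "1. fact" style lines back into plain strings
    return [line.split(".", 1)[1].strip()
            for line in lines
            if line.strip() and line.strip()[0].isdigit() and "." in line]
```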
Citation URLs are scraped to extract the visible text content of the referenced web pages using a custom Webscraper class built with BeautifulSoup.
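A minimal sketch of what such a class might look like; the project's actual `Webscraper` likely handles more edge cases (encodings, retries, rate limiting):

```python
import requests
from bs4 import BeautifulSoup

class Webscraper:
    """Sketch of the scraping step: fetch a page and keep only visible text."""

    def __init__(self, timeout: int = 10):
        self.timeout = timeout
        self.headers = {"User-Agent": "Mozilla/5.0 (citation-benchmark)"}

    def fetch_visible_text(self, url: str) -> str:
        response = requests.get(url, headers=self.headers, timeout=self.timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Drop elements that never contribute visible text
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()
        # Collapse the remaining text into whitespace-normalized form
        return " ".join(soup.get_text(separator=" ").split())
```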
Each atomic fact is verified against its cited reference content using LLM-based entailment checks. A binary vector is produced for each set of facts:
- `1`: the fact is supported by the cited reference
- `0`: the fact is not supported by the cited reference
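A hedged sketch of the verification step, reusing the referee `client` from the decomposition sketch above; the actual prompt and answer parsing may differ:

```python
# Hypothetical entailment prompt for the referee LLM.
ENTAILMENT_PROMPT = (
    "Reference:\n{reference}\n\n"
    "Claim: {fact}\n\n"
    "Does the reference support the claim? Answer strictly YES or NO."
)

def verify_facts(facts: list[str], reference_text: str,
                 model: str = "gpt-4o-mini") -> list[int]:
    """Return a binary vector: 1 if a fact is supported by the reference, else 0."""
    vector = []
    for fact in facts:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": ENTAILMENT_PROMPT.format(
                reference=reference_text, fact=fact)}],
            temperature=0,
        )
        answer = response.choices[0].message.content.strip().upper()
        vector.append(1 if answer.startswith("YES") else 0)
    return vector
```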
Standard metrics including Citation Recall, Citation Precision, ROUGE-L, STR-EM, and QA-based accuracy are computed using the ALCE evaluation framework, enhanced with an NLI-based AutoAIS model (google/t5_xxl_true_nli_mixture).
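The AutoAIS check can be sketched as follows, assuming the TRUE model's `premise:`/`hypothesis:` input format, where the model generates "1" for entailment. Note that this is an ~11B-parameter checkpoint needing substantial GPU memory, and `device_map="auto"` additionally requires the `accelerate` package:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

AUTOAIS = "google/t5_xxl_true_nli_mixture"
tokenizer = T5Tokenizer.from_pretrained(AUTOAIS)
model = T5ForConditionalGeneration.from_pretrained(AUTOAIS, device_map="auto")

def autoais_entails(premise: str, hypothesis: str) -> bool:
    """True if the cited passage (premise) entails the statement (hypothesis)."""
    inputs = tokenizer(f"premise: {premise} hypothesis: {hypothesis}",
                       return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=10)
    return tokenizer.decode(outputs[0], skip_special_tokens=True) == "1"
```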
The main pipeline (MainPipeline.ipynb) is structured into four core components:
| Component | Description |
|---|---|
| Utils | Text normalization, citation removal, GPU memory management, prompt formatting, and model-loading utilities |
| Searcher | Within-document retrieval using TF-IDF or GTR dense retrieval (`sentence-transformers/gtr-t5-large`); see the sketch after this table |
| Run | End-to-end inference pipeline supporting OpenAI API, Azure, and local HuggingFace models (OPT, LLaMA, Vicuna) with ICL demonstrations |
| Eval | Comprehensive evaluation including ROUGE, STR-EM, QA-F1, Citation Recall/Precision, and MAUVE scores |
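As a rough illustration of the Searcher's sparse path, the hypothetical `tfidf_search` helper below ranks a document's passages against a query using scikit-learn (already in the dependency list); the dense path would swap this for GTR embeddings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_search(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k passages most similar to the query under TF-IDF."""
    vectorizer = TfidfVectorizer()
    passage_matrix = vectorizer.fit_transform(passages)  # (n_passages, vocab)
    query_vector = vectorizer.transform([query])         # (1, vocab)
    scores = cosine_similarity(query_vector, passage_matrix)[0]
    ranked = scores.argsort()[::-1][:top_k]              # highest scores first
    return [passages[i] for i in ranked]
```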
```
citation-benchmark/
├── MainPipeline.ipynb        # Main inference & evaluation pipeline (ALCE-based)
├── Metric.ipynb              # Novel citation metric pipeline (Perplexity/Copilot)
├── query_llms/               # Utility scripts for querying LLMs via Poe API
│   ├── poe-api.py            # Basic async Poe API wrapper example
│   └── poe_api_wrapper.ipynb # Notebook for LLM querying via poe-api-wrapper
├── Report/                   # LaTeX source for the academic report
│   ├── paper.tex             # Main LaTeX document
│   ├── paper.bib             # Bibliography references
│   ├── sections/             # Report sections (abstract, intro, related work, etc.)
│   └── Makefile              # Build system for the report PDF
├── LICENSE                   # MIT License
└── README.md                 # This file
```
The framework is validated using the following diverse datasets:
| Dataset | Description |
|---|---|
| ASQA | Ambiguous factoid questions requiring long-form answers with citations |
| QAMPARI | Questions with multiple entity answers |
| ELI5 | Open-ended "Explain Like I'm 5" questions |
| Wikidata5m | Knowledge graph-based attribution evaluation |
The following models are used for generation, as referees, or as evaluation targets:

| Model | Type |
|---|---|
| GPT-3.5 Turbo / GPT-4 | OpenAI API |
| LLaMA 2 / LLaMA 3 (8B) | Local (HuggingFace) |
| OPT-6.7B | Local (HuggingFace) |
| Vicuna | Local (HuggingFace) |
| GPT-4o Mini | Via Poe API (for referee tasks) |
| Perplexity.AI / Copilot | Evaluated as target systems |
- Python 3.10+
- CUDA-compatible GPU (recommended: ≥16 GB VRAM)
- Hugging Face account (for gated models like LLaMA)
```bash
# Clone the repository
git clone https://github.com/NLP-Final-Projects/citation-benchmark.git
cd citation-benchmark

# Install dependencies
pip install torch transformers spacy scikit-learn rouge-score nltk openai poe-api-wrapper beautifulsoup4 requests bitsandbytes safetensors

# Download spaCy model
python -m spacy download en_core_web_sm

# Download ALCE dataset
wget https://huggingface.co/datasets/princeton-nlp/ALCE-data/resolve/main/ALCE-data.tar
tar xvf ALCE-data.tar && mv ALCE-data data && rm ALCE-data.tar
```

Open `MainPipeline.ipynb` in Jupyter or Google Colab and follow the step-by-step cells for:
- Data preparation & document summarization
- Prompt generation with ICL demonstrations (see the sketch after this list)
- Model inference
- Comprehensive evaluation
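For the prompt-generation step, a hypothetical ALCE-style assembly might look like the following; the actual templates live in the notebook's Utils component and may be worded differently:

```python
# Hypothetical document template in the numbered ALCE style.
DOC_TEMPLATE = "Document [{idx}](Title: {title}): {text}"

def build_prompt(question: str, docs: list[dict], demos: list[str]) -> str:
    """Concatenate ICL demonstrations, numbered documents, and the question."""
    doc_block = "\n".join(
        DOC_TEMPLATE.format(idx=i + 1, title=d["title"], text=d["text"])
        for i, d in enumerate(docs)
    )
    instruction = ("Instruction: Write an accurate, concise answer using only "
                   "the documents below, citing them with markers like [1][2].")
    return "\n\n".join(demos + [instruction, doc_block,
                                f"Question: {question}", "Answer:"])
```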
Open Metric.ipynb to run the novel citation evaluation pipeline:
- Extract citations from LLM responses (Perplexity.AI / Copilot format)
- Decompose into atomic facts
- Scrape reference web pages
- Verify facts against references
- Compute citation accuracy scores
Using the ASQA dataset with OPT-6.7B (1-shot, 3 documents):
| Metric | Score |
|---|---|
| STR-EM | 22.33 |
| STR-HIT | 8.00 |
| ROUGE-Lsum | 29.92 |
| Citation Recall | 4.20 |
| Citation Precision | 5.67 |
| Avg. Output Length | 87.7 words |
A novel contribution is the use of named-entity extraction combined with LLaMA 3 for document reconstruction (instead of traditional summarization), which achieved higher reference-finding metrics.
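A minimal sketch of this idea, assuming spaCy's `en_core_web_sm` for entity extraction (installed in the Quick Start step above) and a hypothetical prompt for the LLaMA 3 reconstruction step; the project's actual prompt and entity handling may differ:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def entities_of(document: str) -> list[str]:
    """Collect unique named entities with their labels."""
    seen, entities = set(), []
    for ent in nlp(document).ents:
        if ent.text not in seen:
            seen.add(ent.text)
            entities.append(f"{ent.text} ({ent.label_})")
    return entities

def reconstruction_prompt(document: str) -> str:
    """Hypothetical prompt asking LLaMA 3 to rebuild a passage around the entities."""
    return ("Using only these named entities: " + "; ".join(entities_of(document))
            + "\nWrite a short, factual passage conveying what the source "
              "document says about them.")
```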
| Name | Email |
|---|---|
| Ilia Hashemi Rad | iliahashemirad@gmail.com |
| Ali Nazari | ali.nazari.8102@gmail.com |
| Shayan Salehi | s.salehi1381@gmail.com |
| Seyed Mohammad Yousef Najafi | najafim2002@gmail.com |
| Amir Mohammad Fakhimi | fakhimi.amirmohamad@gmail.com |
If you use this work in your research, please cite:
```bibtex
@misc{citation-benchmark2024,
  title={Citation Benchmark: A Framework for Evaluating Citation Quality in Generative Language Models},
  author={Hashemi Rad, Ilia and Nazari, Ali and Salehi, Shayan and Najafi, Seyed Mohammad Yousef and Fakhimi, Amir Mohammad},
  year={2024},
  institution={Sharif University of Technology}
}
```

- ALCE: Enabling LLMs to Generate Text with Citations (Princeton NLP)
- poe-api-wrapper: Python library for querying LLMs via Poe
- Hugging Face: Model hosting and datasets
- Sharif University of Technology: Academic support
This project is licensed under the MIT License; see the LICENSE file for details.
⭐ If you find this project useful, please consider giving it a star! ⭐