Skip to content

minsing-jin/KoHalluLens

ย 
ย 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

58 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

KoHalluLens: LLM Hallucination Evaluation Benchmark in Korean

HalluLens: LLM Hallucination Benchmark๋ฅผ ํ•œ๊ตญ์–ด adaptation์„ ํ•˜์—ฌ ๋ชจ๋ธ์˜ Hallucination์„ ํ‰๊ฐ€ํ•˜๋Š” ๊ธฐ๋Šฅ์„ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ“Œ Original Project Links

GitHub HalluLens ย  arXiv Paper

Authors:
Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang,
Nicola Cancedda, Pascale Fung

๐Ÿ“‘ Table of Contents

๐Ÿ˜ตโ€๐Ÿ’ซ LLM Hallucination ์œ ํ˜•

LLM Hallucination Taxonomy

Extrinsic Hallucination: ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ์ผ์น˜ํ•˜์ง€ ์•Š๋Š” ์ƒ์„ฑ ๊ฒฐ๊ณผ๋ฌผ์ž…๋‹ˆ๋‹ค. ์ด๋Š” ์ž…๋ ฅ๋œ ๋ฌธ๋งฅ(context)์— ์˜ํ•ด ๋’ท๋ฐ›์นจ๋  ์ˆ˜๋„, ๋ฐ˜๋ฐ•๋  ์ˆ˜๋„ ์—†์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ํ™˜๊ฐ์€ ๋ชจ๋ธ์ด (์ž‘์—… ์ง€์‹œ์— ๊ธฐ๋ฐ˜ํ•œ ์ž์œ  ํ˜•์‹ ํ…์ŠคํŠธ ๋“ฑ) ์ƒˆ๋กœ์šด ์ฝ˜ํ…์ธ ๋ฅผ ์ƒ์„ฑํ•˜๊ฑฐ๋‚˜ ์ง€์‹์˜ ๊ฒฉ์ฐจ๋ฅผ ๋ฉ”์šฐ๋ ค ํ•  ๋•Œ ์ž์ฃผ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ํ•™์Šต ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ์ง€์‹์„ ํก์ˆ˜ํ•˜๋Š” ๋ชจ๋ธ์˜ ํ•œ๊ณ„์™€ ์ž์‹ ์˜ ์ง€์‹ ๊ฒฝ๊ณ„๋ฅผ ์ธ์‹ํ•˜์ง€ ๋ชปํ•˜๋Š” ๋Šฅ๋ ฅ์ด ๋ถ€์กฑํ•จ์„ ๋ฐ˜์˜ํ•ฉ๋‹ˆ๋‹ค.

Intrinsic Hallucination: **์ž…๋ ฅ๋œ ๋ฌธ๋งฅ(context)**๊ณผ ์ผ์น˜ํ•˜์ง€ ์•Š๋Š” ์ƒ์„ฑ ๊ฒฐ๊ณผ๋ฌผ์ž…๋‹ˆ๋‹ค. ๋ชจ๋ธ์ด ์ž…๋ ฅ ๋ฌธ๋งฅ์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ดํ•ดํ•˜์ง€ ๋ชปํ•  ๋•Œ, ์ž…๋ ฅ ์งˆ์˜(query)์™€ ๋ชจ์ˆœ๋˜๊ฑฐ๋‚˜ ์›๋ณธ ์ž…๋ ฅ ์งˆ์˜์— ์˜ํ•ด ๋’ท๋ฐ›์นจ๋˜์ง€ ์•Š๋Š” ๋‚ด์šฉ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์ถ”๋ก  ์‹œ์ (inference-time)์— ์ผ๊ด€์„ฑ์„ ์œ ์ง€ํ•˜์ง€ ๋ชปํ•˜๋Š” ๋ชจ๋ธ์˜ ๋Šฅ๋ ฅ์ด ๋ถ€์กฑํ•จ์„ ๋ฐ˜์˜ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿงช ์ฃผ์š” ํ‰๊ฐ€ํ•ญ๋ชฉ

Extrinsic Hallucination

  1. PreciseWikiQA: ๋ชจ๋ธ์ด trainํ•œ ๋ฐ์ดํ„ฐ ๋‚ด ์ง€์‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์งง๊ณ  ์‚ฌ์‹ค ํ™•์ธ์„ ์š”๊ตฌํ•˜๋Š” ์งˆ์˜์— ๋Œ€ํ•œ ๋ชจ๋ธ์˜ ํ™˜๊ฐ(hallucination) ์ˆ˜์ค€์„ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์งˆ๋ฌธ์€ ํ•™์Šต ๋ฐ์ดํ„ฐ ๋ฒ”์œ„ ๋‚ด๋กœ ํ•œ์ •๋ฉ๋‹ˆ๋‹ค.
  2. LongWiki: ๋ชจ๋ธ์˜ ํ•™์Šต ๋ฐ์ดํ„ฐ ๋‚ด ์ง€์‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์žฅ๋ฌธ(long-form) ์ฝ˜ํ…์ธ  ์ƒ์„ฑ์‹œ ๋ชจ๋ธ์˜ ํ™˜๊ฐ ์ˆ˜์ค€์„ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
  3. NonExistentRefusal: ๊ทธ๋Ÿด๋“ฏํ•˜๊ฒŒ ๋“ค๋ฆฌ์ง€๋งŒ ์‹ค์ œ๋กœ๋Š” ์กด์žฌํ•˜์ง€ ์•Š๋Š” ์‚ฌ๋ก€์™€ ๊ฐ™์ด, ํ•™์Šต ๋ฐ์ดํ„ฐ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋Š” ์ง€์‹์— ๋Œ€ํ•œ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋ฐ›์•˜์„ ๋•Œ ๋ชจ๋ธ์ด ํ™˜๊ฐ ์ •๋ณด(์ง€์–ด๋‚ธ ์ •๋ณด)๋ฅผ ์ƒ์„ฑํ•  ๊ฐ€๋Šฅ์„ฑ์„ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. (์ด๋ฅผ ์œ„ํ•ด) ๋™๋ฌผ, ์‹๋ฌผ, ๊ธฐ์—…, ๋ธŒ๋žœ๋“œ ๋“ฑ ๋‹ค์–‘ํ•œ ์˜์—ญ์—์„œ ๊ทธ๋Ÿด๋“ฏํ•˜๊ฒŒ ๋“ค๋ฆฌ๋Š”, ์กด์žฌํ•˜์ง€ ์•Š๋Š” ๊ฐœ์ฒด๋ช…์„ ์ƒ์„ฑํ•˜์—ฌ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋‘ ๊ฐ€์ง€ ํ•˜์œ„ ์ž‘์—…์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค: (i) MixedEntities (ii) GeneratedEntities

Table 1: Extrinsic hallucination evaluation results on three HalluLens tasks โ€“ PreciseWikiQA, LongWiki, and NonExistentEntities โ€“ in percentage (average of three trials of evaluation). Hallu refers to Hallucinated when not refused, a ratio of answers include incorrect answers when it did not refuse. Correct refers to total correct answer rate, where refusal is considered to be incorrect. False Accept. refers to false acceptance rate, likelihood of model fails to prevent from hallucination on nonexistent entities.

cf)

  • โš ๏ธ์ฃผ์˜: ๋ณธ benchmark๋Š” ๋ชจ๋ธ์ด Wikipedia ์ง€์‹์„ ํ•™์Šตํ–ˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋ธ์ด ์œ„ํ‚คํ”ผ๋””์•„ ์ง€์‹์„ ํ•™์Šตํ•˜์ง€ ์•Š์•˜๋‹ค๋ฉด, ํ‰๊ฐ€ ๊ฒฐ๊ณผ๊ฐ€ ์™œ๊ณก๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Intrinsic Hallucination์€ ํ˜„์žฌ KoHalluLens์—์„œ ๋‹ค๋ฃจ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

๐Ÿƒ ์‹คํ–‰ ๋ฐฉ๋ฒ• (How to Run)

๐Ÿ› ๏ธ Installation

git clone https://github.com/facebookresearch/HalluLens.git
cd HalluLens

conda create --name hallulens python==3.12 #3.8.17
conda activate hallulens

[uv ์„ค์น˜์‹œ]

pip install uv
uv sync

๐Ÿ›ข๏ธ Getting ready with data

We provide script to download all data needed for all three tasks. This code will download all the data that you need for HalluLens. All data will be downloded under the /data folder.

โš ๏ธ ๋ฐ์ดํ„ฐ ์ค€๋น„์‹œ ์ฐธ๊ณ ์‚ฌํ•ญ

Wikipedia dump is large (~16GB), so please make sure you have enough space. And it may not be able to download from this codes.
์ฐธ๊ณ : en-wiki-20230401.db ํŒŒ์ผ์€ ์ง์ ‘ ๋‹ค์šด๋กœ๋“œ ํ›„ ์ง€์ • ๊ฒฝ๋กœ์— ๋„ฃ์–ด์ฃผ์…”์•ผ ํ•ฉ๋‹ˆ๋‹ค. (์ƒ์„ธ ๋‚ด์šฉ์€ ์•„๋ž˜ 'ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ ๋‹ค์šด๋กœ๋“œ' ์ฐธ๊ณ )

bash scripts/download_data.sh

It include as follow:

Getting ready with LLM inference.

[Together ai setup]
  • togther ai api key .env ํŒŒ์ผ์— ์„ค์ •
  • inference_method ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ 'together'๋กœ ์„ค์ •

[VLLM inference setup]

Set up your own inference method and replace it in function custom_api utils/lm.py

  • For our experiments, we used model checkpoints from Huggingface and hosted through vLLM package -- which you can directly use the default setup call_vllm_api. Refer to VLLM blog for details. For example:
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8
  • We have set the test set prompt generators and LLM evaluator to be same as our experiment set ups. We recommend to use same set up to replicate the results.

๐Ÿ”ฌ Run Evaluation

Overview

All scripts for each task is in scripts. There are mainly three steps for each tasks:

  1. do_generate_prompt : It generates test prompt for each task under the folder of data
  2. do_inference: This argument enables the inference of your model
  3. do_eval: Evalaution for each tasks.

By default, all three steps will be conducted when you run the scripts below. If you want the separate step, you can comment out the step you want to skip.

Task 1: PreciseWikiQA

tasks/shortform/precise_wikiqa.py

bash scripts/task1_precisewikiqa.sh

Task 2: LongWiki

tasks/longwiki/longwiki_main.py

bash scripts/task2_longwiki.sh

Task 3: NonExistentRefusal

There are two subtasks:

(1) MixedEntities

tasks/refusal_test/nonsense_mixed_entities.py

bash scripts/task3-1_mixedentities.sh

(2) GeneratedEntities

tasks/refusal_test/round_robin_nonsene_name.py

Prerequisite: set your keys for BRAVE_API_KEY and OPENAI_KEY.

  • Note: We used Brave Search API for search function. You can either use it with your own access key or your preferred API.
bash scripts/task3-2_generatedentities.sh

โš ๏ธ Notice

(0) API setting

[Mandatory]

  1. together ai api
  2. brave search api
  3. openai api

[Optional]

  • Anthropic ai api
  • grok api
  • other api keys for custom llm hosting

(1) ๋ฐ์ดํ„ฐ ์ค€๋น„ (Getting ready with data)

  • ๋ฐ์ดํ„ฐ ๋‹ค์šด๋กœ๋“œ

    • โญ๏ธ ์ค‘์š”!!: donwload.sh๋กœ ๋ฐ์ดํ„ฐ ๋‹ค์šด๋กœ๋“œ ์‹œ enwiki-20230401.db ํŒŒ์ผ์ด ์ •์ƒ์ ์œผ๋กœ ๋ฐ›์•„์ง€์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • ์‹คํŒจ์‹œ ์ด ๋งํฌ ์—์„œ ์ง์ ‘ ๋‹ค์šด๋กœ๋“œํ•ด ์ฃผ์„ธ์š”. 20GB๋กœ ๋งค์šฐ ํฝ๋‹ˆ๋‹ค.
    • ๋‹ค์šด๋กœ๋“œํ•œ ํŒŒ์ผ์€ ๋ฐ˜๋“œ์‹œ ๋‹ค์Œ ๊ฒฝ๋กœ(defalut path์ž„)์— ์ €์žฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
      • ๊ฒฝ๋กœ: hallulens/data/wiki_data/.cache/enwiki-20230401.db
  • Data Download

    • We provide a script to download all data needed for the three tasks. This will download all necessary data into the /data folder.
    • โš ๏ธNotice: The Wikipedia dump is large (~16GB), so please ensure you have enough space. The download may fail via the script.
      bash scripts/download_data.sh
    • This script includes:

(2) Customization & Configuration

  • VLLM ์‚ฌ์šฉ ๋ฐ ๋ชจ๋ธ ๋ณ€๊ฒฝ:
    • inference_method ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ 'vllm'์œผ๋กœ ๋ณ€๊ฒฝํ•˜๊ณ , model์— ํ—ˆ๊น…ํŽ˜์ด์Šค ๋ชจ๋ธ๋ช…์„ ์ž…๋ ฅํ•˜์„ธ์š”.
  • LLM as Judge ๋ฐฉ์‹ ๋ณ€๊ฒฝ (VLLM, Custom ๋“ฑ):
    • ์ฝ”๋“œ ๋‚ด call_together_api ํ•จ์ˆ˜๋ฅผ call_vllm_api ๋˜๋Š” custom_api ํ•จ์ˆ˜๋กœ hallulens ํŒŒ์ผ์—์„œ ์ „์ฒด ๋ณ€๊ฒฝํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ถ”ํ›„ ๋” ์œ ์—ฐํ•œ ์„ค์ • ๋ฐฉ๋ฒ•์„ ์ œ๊ณตํ•  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.
  • ์ƒˆ๋กœ์šด LLM ํ˜ธ์ŠคํŒ… ๋ฐฉ์‹ ์ถ”๊ฐ€:
    • hallulens/utils/lm.py ํŒŒ์ผ์˜ custom_api์™€ generate ํ•จ์ˆ˜๋ฅผ ์ˆ˜์ •ํ•˜์—ฌ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

(3) Troubleshooting

  • Together.ai Rate Limit or gpt Rate Limit: OpenAI api, together.ai ํ˜ธ์ŠคํŒ… ์‚ฌ์šฉ ์‹œ API ์š”์ฒญ ์ œํ•œ(Rate Limit)์ด ๋ฐœ์ƒํ•˜์—ฌ ์†๋„๋ฅผ ๋‚ฎ์ท„์Šต๋‹ˆ๋‹ค. Max_worker ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋†’์ด๊ฑฐ๋‚˜ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ค„์ด๋ฉด Rate Limit์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋ชจ๋ธ ์‚ฌ์ด์ฆˆ๊ฐ€ ์ž‘๊ฑฐ๋‚˜ ์„ฑ๋Šฅ ๋‚ฎ์€ ๋ชจ๋ธ์˜ ํ‰๊ฐ€ ๋ถˆ๊ฐ€๋Šฅ ๊ฐ€๋Šฅ์„ฑ: ์„ฑ๋Šฅ์ด ๋‚ฎ์€ ๋ชจ๋ธ์€ ํ‰๊ฐ€ ๊ฐ€๋Šฅํ•œ ๋‹ต๋ณ€ ํ˜•์‹(์˜ฌ๋ฐ”๋ฅธ Json ํ˜•ํƒœ)์„ ์ƒ์„ฑํ•˜์ง€ ๋ชปํ•ด longwiki_qa ๋˜๋Š” precise_wikiqa ํ‰๊ฐ€๊ฐ€ ์‹คํŒจํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • precise_wikiqa Abstain ๋ฌธ์ œ: precise_wikiqa ํƒœ์Šคํฌ์—์„œ ๋ชจ๋ธ ์ถ”๋ก  ์‹คํŒจ๋‚˜ abstain ๋ฌธ์ œ๊ฐ€ ๋ฐ˜๋ณต๋œ๋‹ค๋ฉด, ๋ถˆ์™„์ „ํ•˜๊ฒŒ ์ƒ์„ฑ๋œ output ํด๋”์˜ ๋Œ€์ƒ ๋ชจ๋ธ ๊ฒฐ๊ณผ๋ฌผ(.jsonl ํŒŒ์ผ)์„ ์‚ญ์ œ ํ›„ ๋‹ค์‹œ ์‹œ๋„ํ•ด ์ฃผ์„ธ์š”. ์ด์ „์˜ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฌผ์„ ๊ณ„์† ์ฐธ์กฐํ•˜์—ฌ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ“œ Citation

@article{bang2025hallulens,
      title={HalluLens: LLM Hallucination Benchmark}, 
      author={Yejin Bang and Ziwei Ji and Alan Schelten and Anthony Hartshorn and Tara Fowler and Cheng Zhang and Nicola Cancedda and Pascale Fung},
      year={2025},
      eprint={2504.17550},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.17550}, 
}

๐Ÿชช License

The majority of HalluLens is licensed under CC-BY-NC. However, portions of the project are available under separate license terms:

  • FActScore is licensed under the MIT license.
  • VeriScore is licensed under the Apache 2.0 license.

About

Korean adaptation of HalluLens

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 98.5%
  • Shell 1.5%