AudioTrust is a large-scale benchmark designed to evaluate the multifaceted trustworthiness of Audio Large Language Models (ALLMs). It examines model behavior across six critical dimensions:
- [2026-01-26] AudioTrust got accepted to ICLR'26! 🚀
- [2025-09-30] Added support for Kimi-Audio, Step-Fun, Step-Audio2, OpenS2S, and Qwen2.5-Omni.
- [2025-05-16] We release the AudioTrust benchmark! 🚀
- 🔍 Overview
- 📁 Repository Structure
- 📦 Dataset Description
- 🧪 Scripts Overview
- 🚀 Quick Start
- 📊 Benchmark Tasks
- 📌 Citation
- 🙏 Acknowledgements
- 📬 Contact
## 🔍 Overview

- 🎯 Hallucination: Fabricating content unsupported by audio
- 🛡️ Robustness: Performance under audio degradation
- 🧑‍💻 Authentication: Resistance to speaker spoofing/cloning
- 🕵️ Privacy: Avoiding leakage of personal/private content
- ⚖️ Fairness: Consistency across demographic factors
- 🚨 Safety: Generating safe, non-toxic, legal content
The benchmark provides:
- ✅ Expert-annotated prompts across six sub-datasets
- 🔬 Model-vs-model evaluation with judge LLMs (e.g., GPT-4o)
- 📈 Baseline results and reproducible evaluation scripts
## 📁 Repository Structure

```
AudioTrust/
├── assets/                  # Logo and visual assets
├── audio_evals/             # Core evaluation engine
│   ├── agg/                 # Metric aggregation logic
│   ├── dataset/             # Dataset preprocessing
│   ├── evaluator/           # Scoring logic
│   ├── process/, models/, prompt/, lib/  # Support code
│   ├── eval_task.py         # Evaluation controller
│   ├── isolate.py           # Single-model inference
│   ├── recorder.py          # Output logging
│   ├── registry.py          # Registry entrypoint
│   └── utils.py             # Shared utilities
│
├── registry/                # Modular registry structure
│   ├── agg/, dataset/, eval_task/, evaluator/, model/, prompt/, process/, recorder/
│
├── scripts/                 # Shell scripts per task
│   └── hallucination/
│       ├── inference/
│       └── evaluation/
├── data/                    # Audio files organized by task
│   ├── hallucination/, robustness/, privacy/, fairness/, authentication/, safety/
├── res/                     # Outputs and logs
├── tests/, utils/           # Tests and preprocessing
├── main.py                  # Main execution entry
├── requirments.txt
├── requirments-offline-model.txt
└── README.md
```

## 📦 Dataset Description

- Language: English
- Audio Format: WAV, mono, 16kHz
- Size: ~10.4GB across 6 sub-datasets
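Given the expected format above (WAV, mono, 16 kHz), a local clip can be sanity-checked with Python's standard-library `wave` module before running inference. `check_audio_format` and `example.wav` are illustrative names, not part of AudioTrust:

```python
import wave

# Hypothetical helper (not part of AudioTrust): verify a clip matches the
# benchmark's expected format -- WAV, mono, 16 kHz.
def check_audio_format(path: str) -> bool:
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000

# Example: write a 1-second silent clip in the expected format, then check it.
with wave.open("example.wav", "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 16-bit PCM
    wav.setframerate(16000)  # 16 kHz
    wav.writeframes(b"\x00\x00" * 16000)

print(check_audio_format("example.wav"))  # True
```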
Each sample includes:
- `Audio`: decoded waveform (when using the Hugging Face loader)
- `AudioPath`: path to the original WAV file
- `InferencePrompt`: prompt used for model response generation
- `EvaluationPrompt`: prompt for the evaluator model
- `Ref`: reference (expected) answer for scoring
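The per-sample layout can be sketched as a plain dict. The field names follow the README; the values below are invented stand-ins, not real dataset content:

```python
# A sketch of the per-sample schema; values are illustrative placeholders.
sample = {
    "Audio": [0.0] * 16000,                      # decoded waveform (1 s at 16 kHz)
    "AudioPath": "data/hallucination/0001.wav",  # hypothetical path
    "InferencePrompt": "Describe what you hear in the audio.",
    "EvaluationPrompt": "Score the response for groundedness in the audio.",
    "Ref": "A dog barking twice.",
}

# Typical split of the fields: the inference prompt goes to the target model
# with the audio; the evaluation prompt and Ref go to the judge afterwards.
inference_input = (sample["AudioPath"], sample["InferencePrompt"])
judge_input = (sample["EvaluationPrompt"], sample["Ref"])
```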
Sub-datasets:
{hallucination, robustness, authentication, privacy, fairness, safety}
## 🧪 Scripts Overview

Each subtask contains:
| Folder | Purpose |
|---|---|
| `inference/` | Use a target model (e.g., Gemini) to generate responses |
| `evaluation/` | Use an evaluator model (e.g., GPT-4o) to assess the generated outputs |
This supports model-vs-model evaluation pipelines.
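The two-stage flow can be sketched as below. Both model functions are placeholders standing in for real API calls, and the exact-match judge is a deliberate simplification of an LLM judge such as GPT-4o:

```python
# Sketch of the model-vs-model pipeline: a target model answers from audio,
# then a judge scores the answer against the reference.
def target_model(audio_path: str, prompt: str) -> str:
    # Placeholder for a real ALLM call (e.g., Gemini).
    return f"response to {prompt!r} for {audio_path}"

def judge_model(eval_prompt: str, response: str, ref: str) -> int:
    # Placeholder for a real judge (e.g., GPT-4o); here: exact match.
    return 1 if response == ref else 0

def run_pipeline(samples) -> float:
    scores = []
    for s in samples:
        response = target_model(s["AudioPath"], s["InferencePrompt"])       # inference/
        scores.append(judge_model(s["EvaluationPrompt"], response, s["Ref"]))  # evaluation/
    return sum(scores) / len(scores)

samples = [{
    "AudioPath": "data/hallucination/0001.wav",
    "InferencePrompt": "What sound is this?",
    "EvaluationPrompt": "Is the response grounded in the audio?",
    "Ref": "A dog barking.",
}]
print(run_pipeline(samples))  # 0.0 -- the exact-match placeholder never matches
```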
```
scripts/hallucination/
├── inference/
│   └── gemini-2.5-pro.sh
└── evaluation/
    └── gpt-4o.sh
```

## 🚀 Quick Start

```bash
git clone https://github.com/JusperLee/AudioTrust.git
cd AudioTrust
pip install -r requirments.txt
```

Or for offline model use:
```bash
pip install -r requirments-offline-model.txt
```

```python
from datasets import load_dataset

dataset = load_dataset("JusperLee/AudioTrust", split="hallucination")
```

If you plan to run the evaluation scripts that expect a local data/ folder, first materialize the Hugging Face dataset into the required directory structure:

```bash
python utils/materialize_hf_audio.py --dataset-path JusperLee/AudioTrust
```

```bash
# Make sure your API keys are set before running:
export OPENAI_API_KEY=your-openai-api-key
export GOOGLE_API_KEY=your-google-api-key

# Step 1: Run inference with Gemini
bash scripts/hallucination/inference/gemini-2.5-pro.sh

# Step 2: Run evaluation using GPT-4o
bash scripts/hallucination/evaluation/gpt-4o.sh
```

Or directly with Python:
```bash
export OPENAI_API_KEY=your-openai-api-key
python main.py \
    --dataset hallucination-content_mismatch \
    --prompt hallucination-inference-content-mismatch-exp1-v1 \
    --model gemini-1.5-pro
```

## 📊 Benchmark Tasks

| Task | Metric | Description |
|---|---|---|
| Hallucination Detection | Accuracy / Recall | Groundedness of response in audio |
| Robustness Evaluation | Accuracy / Δ Score | Performance drop under corruption |
| Authentication Testing | Attack Success Rate | Resistance to spoofing / voice cloning |
| Privacy Leakage | Leakage Rate | Disclosure of private content |
| Fairness Auditing | Bias Index | Demographic response disparity |
| Safety Assessment | Violation Score | Generation of unsafe or harmful content |
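Two of the metrics above reduce to simple arithmetic, sketched here for illustration. The judgment labels and accuracy figures are invented examples, and the function names are not part of the AudioTrust codebase:

```python
# Illustrative metric sketches (hypothetical helpers, not AudioTrust API).
def attack_success_rate(judgments) -> float:
    """Fraction of spoofing attempts accepted by the model (1 = attack succeeded)."""
    return sum(judgments) / len(judgments)

def delta_score(clean_acc: float, corrupted_acc: float) -> float:
    """Performance drop under audio corruption; larger means less robust."""
    return clean_acc - corrupted_acc

print(attack_success_rate([1, 0, 0, 1]))        # 0.5
print(round(delta_score(0.92, 0.74), 2))        # 0.18
```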
## 📌 Citation

```bibtex
@inproceedings{li2025audiotrust,
  title={AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models},
  author={Li, Kai and Shen, Can and Liu, Yile and Han, Jirui and Zheng, Kelong and Zou, Xuechao and Wang, Zhe and Du, Xingjian and Zhang, Shun and Luo, Hanjun and others},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
```

## 🙏 Acknowledgements

We gratefully acknowledge UltraEval-Audio for providing the core infrastructure that inspired and supported parts of this benchmark.
## 📬 Contact

For questions or collaboration inquiries:

- Kai Li: tsinghua.kaili@gmail.com
- Xinfeng Li: lxfmakeit@gmail.com
- Project Page: coming soon


