🎧 AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

📜 ICLR'26 | 🎶 Demo | 🤗 Dataset


🔍 Overview

AudioTrust is a large-scale benchmark designed to evaluate the multifaceted trustworthiness of Audio Large Language Models (ALLMs). It examines model behavior across six critical dimensions:

  • 🎯 Hallucination: Fabricating content unsupported by the audio
  • 🛡️ Robustness: Performance under audio degradation
  • 🧑‍💻 Authentication: Resistance to speaker spoofing/cloning
  • 🕵️ Privacy: Avoiding leakage of personal/private content
  • ⚖️ Fairness: Consistency across demographic factors
  • 🚨 Safety: Generating safe, non-toxic, legal content


The benchmark provides:

  • ✅ Expert-annotated prompts across six sub-datasets
  • 🔬 Model-vs-model evaluation with judge LLMs (e.g., GPT-4o)
  • 📈 Baseline results and reproducible evaluation scripts

📁 Repository Structure

AudioTrust/
├── assets/                        # Logo and visual assets
├── audio_evals/                  # Core evaluation engine
│   ├── agg/                      # Metric aggregation logic
│   ├── dataset/                  # Dataset preprocessing
│   ├── evaluator/                # Scoring logic
│   ├── process/, models/, prompt/, lib/  # Support code
│   ├── eval_task.py              # Evaluation controller
│   ├── isolate.py                # Single model inference
│   ├── recorder.py               # Output logging
│   ├── registry.py               # Registry entrypoint
│   └── utils.py                  # Shared utilities
│
├── registry/                     # Modular registry structure
│   ├── agg/, dataset/, eval_task/, evaluator/, model/, prompt/, process/, recorder/
│
├── scripts/                      # Shell scripts per task
│   └── hallucination/
│       ├── inference/
│       └── evaluation/
├── data/                         # Organized audio files by task
│   ├── hallucination/, robustness/, privacy/, fairness/, authentication/, safety/
├── res/                          # Outputs and logs
├── tests/, utils/                # Tests and preprocessing
├── main.py                       # Main execution entry
├── requirments.txt
├── requirments-offline-model.txt
└── README.md

📦 Dataset Description

  • Language: English
  • Audio Format: WAV, mono, 16kHz
  • Size: ~10.4GB across 6 sub-datasets

Each sample includes:

  • Audio: decoded waveform (if using Hugging Face loader)
  • AudioPath: path to original WAV file
  • InferencePrompt: prompt used for model response generation
  • EvaluationPrompt: prompt for evaluator model
  • Ref: reference (expected) answer for scoring

Sub-datasets:

  • {hallucination, robustness, authentication, privacy, fairness, safety}
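To make the per-sample schema above concrete, here is a minimal sketch; all field values are invented placeholders, and only the field names follow the schema listed in this README:

```python
# Sketch of the per-sample record layout described above.
# All field values here are invented placeholders; only the field
# names follow the schema listed in this README.
sample = {
    "AudioPath": "data/hallucination/example_0001.wav",  # path to the original WAV
    "InferencePrompt": "Describe what you hear in the audio clip.",
    "EvaluationPrompt": "Judge whether the response is grounded in the audio.",
    "Ref": "Expected reference answer for scoring.",
}

# The Hugging Face loader additionally exposes an "Audio" field with
# the decoded waveform alongside these keys.
print(sorted(sample))
```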

🧪 Scripts Overview

Each subtask contains:

| Folder | Purpose |
| --- | --- |
| `inference/` | Use a target model (e.g., Gemini) to generate responses |
| `evaluation/` | Use an evaluator model (e.g., GPT-4o) to assess generated outputs |

This supports model-vs-model evaluation pipelines.
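The two-stage flow can be sketched as follows; `run_target_model` and `run_judge` are hypothetical stand-ins, not this repository's actual API:

```python
# Sketch of the model-vs-model pipeline: a target model answers from the
# audio, then a judge model scores that answer against a reference.
# Both model functions are hypothetical placeholders.

def run_target_model(audio_path: str, prompt: str) -> str:
    """Stand-in for a target-model call (e.g., Gemini) on the audio clip."""
    return f"response for {audio_path}"

def run_judge(response: str, eval_prompt: str, ref: str) -> float:
    """Stand-in for a judge-model call (e.g., GPT-4o) returning a score in [0, 1]."""
    return 1.0 if ref in response else 0.0

def evaluate_sample(sample: dict) -> float:
    response = run_target_model(sample["AudioPath"], sample["InferencePrompt"])
    return run_judge(response, sample["EvaluationPrompt"], sample["Ref"])

score = evaluate_sample({
    "AudioPath": "data/hallucination/example.wav",
    "InferencePrompt": "What is said in the audio?",
    "EvaluationPrompt": "Score the response against the reference.",
    "Ref": "example.wav",
})
print(score)  # 1.0
```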

🧩 Example: Hallucination Task

scripts/hallucination/
├── inference/
│   └── gemini-2.5-pro.sh
└── evaluation/
    └── gpt-4o.sh

🚀 Quick Start

1. Install Dependencies

git clone https://github.com/JusperLee/AudioTrust.git
cd AudioTrust
pip install -r requirments.txt

Or for offline model use:

pip install -r requirments-offline-model.txt

2. Load Dataset from Hugging Face

from datasets import load_dataset
dataset = load_dataset("JusperLee/AudioTrust", split="hallucination")

Materialize the HF dataset to the project data/ layout

If you plan to run the evaluation scripts that expect a local data/ folder, first materialize the Hugging Face dataset into the required directory structure:

python utils/materialize_hf_audio.py --dataset-path JusperLee/AudioTrust

3. Run Inference and Evaluation

# Make sure your API keys are set before running:
export OPENAI_API_KEY=your-openai-api-key
export GOOGLE_API_KEY=your-google-api-key

# Step 1: Run inference with Gemini
bash scripts/hallucination/inference/gemini-2.5-pro.sh

# Step 2: Run evaluation using GPT-4o
bash scripts/hallucination/evaluation/gpt-4o.sh

Or directly with Python:

export OPENAI_API_KEY=your-openai-api-key
python main.py \
  --dataset hallucination-content_mismatch \
  --prompt hallucination-inference-content-mismatch-exp1-v1 \
  --model gemini-1.5-pro

📊 Benchmark Tasks

| Task | Metric | Description |
| --- | --- | --- |
| Hallucination Detection | Accuracy / Recall | Groundedness of responses in the audio |
| Robustness Evaluation | Accuracy / Δ Score | Performance drop under corruption |
| Authentication Testing | Attack Success Rate | Resistance to spoofing / voice cloning |
| Privacy Leakage | Leakage Rate | Whether the model leaks private content |
| Fairness Auditing | Bias Index | Demographic response disparity |
| Safety Assessment | Violation Score | Generation of unsafe or harmful content |
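As a worked example of two of these metrics, here is a sketch that aggregates judge verdicts into a privacy leakage rate and a robustness Δ score; all numbers are invented for illustration:

```python
# Sketch: aggregating per-sample judge verdicts into two of the
# metrics above. All numbers are invented for illustration.

def leakage_rate(verdicts: list[bool]) -> float:
    """Privacy: fraction of samples where the judge flagged a leak."""
    return sum(verdicts) / len(verdicts)

def delta_score(clean_acc: float, corrupted_acc: float) -> float:
    """Robustness: accuracy drop when the audio is degraded."""
    return clean_acc - corrupted_acc

print(leakage_rate([True, False, False, True]))  # 0.5
print(round(delta_score(0.90, 0.72), 2))         # 0.18
```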

📌 Citation

@inproceedings{li2025audiotrust,
  title={AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models},
  author={Li, Kai and Shen, Can and Liu, Yile and Han, Jirui and Zheng, Kelong and Zou, Xuechao and Wang, Zhe and Du, Xingjian and Zhang, Shun and Luo, Hanjun and others},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

🙏 Acknowledgements

We gratefully acknowledge UltraEval-Audio for providing the core infrastructure that inspired and supported parts of this benchmark.

📬 Contact

For questions or collaboration inquiries:
