Kai Li*, Jintao Cheng*, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu
Tsinghua University, Shanda AI, Johns Hopkins University
*Equal contribution
Completed during Kai Li's internship at Shanda AI.
arXiv 2026 | Demo | Dataset | Space
- [2026-03-09] We release `infer_audiosep.py` for one-command AudioSep inference in Hive. The script automatically downloads the config and checkpoints from ShandaAI/AudioSep-hive.
- [2026-03-09] We release `infer_flowsep.py` for one-command FlowSep inference in Hive, with automatic config and checkpoint download from ShandaAI/FlowSep-hive.
- [2026-03-09] We release `app.py`, a unified Gradio demo that supports both AudioSep-hive and FlowSep-hive in a single interface.
- [2026-03-09] A community Hugging Face Space is available at JusperLee/Hive for a quick interactive demo.
- [2026-02-09] Thanks to @faiteamartaliius for using this codebase to synthesize data and publicly sharing a third-party Hive-style dataset: faiteamartaliius/Hive.
- Abstract
- Repository Structure
- Hive Dataset
- Data Collection Pipeline
- Inference
- Gradio App
- Citation
- License
- Acknowledgments
Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from unconstrained mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: ubiquitous in-the-wild datasets contain weak labels and severe event co-occurrence. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence noise by mining high-purity single-event segments from unconstrained recordings and synthesizing mixtures via semantically consistent strategies. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2k hours of audio. Experimental results demonstrate that, despite using only ~0.2% of the data scale of million-hour baselines, models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks such as MUSDB18-HQ and USS-Bench. These findings highlight that prioritizing supervision purity enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs.
```
.
├── hive_dataset/                      # Hive Dataset generation and curation
│   ├── mix_from_metadata/             # Generate mixtures from metadata
│   │   ├── mix_from_metadata.py
│   │   └── dataset_paths.json
│   ├── mix_curation/                  # Data curation for mix audio
│   │   ├── mix_data_curation.py
│   │   └── ontology.json
│   ├── README.md                      # Dataset documentation
│   ├── requirements.txt
│   └── LICENSE
├── pipeline/                          # Single-Event Data Collection Pipeline
│   ├── code/                          # Pipeline scripts
│   │   ├── 01_audio_chunking.py
│   │   ├── 02_filter_single_label.py
│   │   ├── 03_filter_single_event_qwen.py
│   │   ├── 04_audioset_label_audiotag.py
│   │   ├── 05_leaf_label_qwen.py
│   │   └── 06_superres_apollo.py
│   ├── data/                          # Pipeline data directories
│   ├── ontology/                      # AudioSet ontologies
│   ├── icefall/                       # AudioTag model repository
│   ├── Apollo/                        # Apollo model repository
│   ├── requirements.txt               # Pipeline dependencies
│   └── README.md                      # Pipeline documentation
├── LICENCE                            # MIT License
└── README.md
```
Hive is a high-quality synthetic dataset with 2,442 hours of raw audio and 19.6M mixtures for Universal Sound Separation.
Features:
- 283 sound categories from AudioSet ontology
- Semantically consistent mixing logic
- 44.1kHz sample rate
Please refer to `hive_dataset/` for details.
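The semantically consistent mixing can be sketched as SNR-controlled summation of single-event clips. The helper below is a minimal illustration assuming mono numpy waveforms in [-1, 1]; it is not the exact logic of `mix_from_metadata.py`, and the SNR-based gain is an assumption about how mixtures are scaled.

```python
import numpy as np

def mix_at_snr(target: np.ndarray, interference: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix two mono waveforms so the result has the requested
    target-to-interference SNR (in dB)."""
    target_power = np.mean(target ** 2)
    interference_power = np.mean(interference ** 2)
    # Gain that scales the interference to sit snr_db below the target.
    gain = np.sqrt(target_power / (interference_power * 10.0 ** (snr_db / 10.0)))
    mixture = target + gain * interference
    # Peak-normalize only if the sum would clip.
    peak = np.max(np.abs(mixture))
    return mixture / peak if peak > 1.0 else mixture
```

A category-aware sampler (e.g. rejecting pairs of semantically conflicting AudioSet labels) would sit in front of a routine like this; see `hive_dataset/mix_from_metadata/` for the actual implementation.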
An automated 6-step pipeline for mining high-purity single-event audio from weakly-labeled sources.
Pipeline Stages:
1. Audio Chunking - Split long audio into segments
2. Single Label Filtering - Remove multi-label samples
3. Single Event Filtering - Verify acoustic purity with Qwen3-Omni
4. AudioSet Label Tagging - Assign ontology labels with AudioTag
5. Leaf Label Classification - Refine to leaf nodes with Qwen3-Omni
6. Audio Super-Resolution - Upsample to 44.1kHz with Apollo
Please refer to `pipeline/` for details.
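The first stage amounts to cutting long recordings into fixed-length segments. A minimal sketch, assuming non-overlapping chunks and a numpy waveform (the chunk length and remainder handling here are illustrative assumptions, not the actual settings of `01_audio_chunking.py`):

```python
import numpy as np

def chunk_audio(wave: np.ndarray, sr: int, chunk_sec: float = 10.0) -> list:
    """Split a mono waveform into non-overlapping fixed-length segments,
    dropping a trailing remainder shorter than chunk_sec."""
    chunk_len = int(chunk_sec * sr)
    n_chunks = len(wave) // chunk_len
    return [wave[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
```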
Hive provides two inference scripts with automatic checkpoint/config download from Hugging Face:
- `infer_audiosep.py` -> ShandaAI/AudioSep-hive
- `infer_flowsep.py` -> ShandaAI/FlowSep-hive
```shell
cd Hive
pip install torch torchaudio librosa pyyaml pytorch-lightning huggingface_hub gradio
```

AudioSep inference:

```shell
python infer_audiosep.py \
    --audio_file /path/to/mixture.wav \
    --text "acoustic guitar" \
    --output_file /path/to/audiosep_output.wav
```

FlowSep inference:

```shell
python infer_flowsep.py \
    --audio_file /path/to/mixture.wav \
    --text "acoustic guitar" \
    --output_file /path/to/flowsep_output.wav
```

`app.py` launches an interactive local demo with both models in one UI:
- Model choices: `AudioSep-hive`, `FlowSep-hive`
- Input: mixed audio + text query
- Output: separated waveform
Run:
```shell
cd Hive
python app.py
```

Then open the local Gradio URL printed in the terminal.
If you use this code or the Hive Dataset, please cite:
```bibtex
@article{li2026semantically,
  title={A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation},
  author={Li, Kai and Cheng, Jintao and Zeng, Chang and Yan, Zijun and Wang, Helin and Su, Zixiong and Zheng, Bo and Hu, Xiaolin},
  journal={arXiv preprint arXiv:2601.22599},
  year={2026}
}
```

This project is licensed under the Apache License 2.0. See LICENSE for details.
- Qwen3-Omni: Apache 2.0
- AudioTag: Apache 2.0
- Apollo: Check model repository for specific license
The Hive dataset is a collaborative achievement built upon the foundation of the open-source audio community. We extend our deepest gratitude to the researchers and organizations who curated the twelve foundational datasets. Their work provides the essential long-tailed acoustic space for advancing Universal Sound Separation.
We gratefully acknowledge the following core datasets which provided the majority of our high-fidelity clips:
- BBC Sound Effects (369,603 clips, 1,020.62h) - Professional-grade recordings with broadcast-level fidelity under Remix License
- AudioSet (326,890 clips, 896.61h) - Large-scale benchmark from YouTube under CC BY (Google)
- VGGSound (115,191 clips, 319.10h) - Real-world acoustic diversity under CC BY 4.0 (University of Oxford)
- FreeSound (17,451 clips, 46.90h) - Rich crowdsourced soundscapes under CC0/BY/BY-NC (MTG-UPF)
Our sincere thanks go to the following datasets for providing the raw source audio that forms the specialized domains of the Hive Dataset:
Music & Speech:
- MUSIC21 (32,701 clips, 90.28h) - Solo and ensemble instruments for harmonic structure modeling
- Voicebank-DEMAND (12,376 clips, 9.94h) - Clean speech signals under CC BY 4.0
- FSD50K (636 clips, 0.80h) - Finely annotated subset based on AudioSet ontology
Environmental & Events:
- ClothoV2 (14,759 clips, 38.19h) - Audio captioning dataset with rich temporal evolution
- AVE (3,054 clips, 6.91h) - Audio-visual event localization under CC BY-NC-SA
- SoundBible (2,501 clips, 5.78h) - Curated short clips under CC BY 4.0
- DCASE (1,969 clips, 5.46h) - Acoustic scene detection challenges
- ESC50 (1,433 clips, 1.99h) - Environmental sound classification benchmark under CC BY-NC 3.0
All source data were processed in strict accordance with their respective licenses (e.g., CC BY, CC0, Remix License). An automated data collection pipeline was employed to ensure that only semantically aligned and single-label pure segments were extracted, respecting the original intent of the data contributors while enhancing their utility for sound separation tasks.
Important Note: This repository releases only the metadata (JSON files containing mixing parameters and source references) for reproducibility. We do not redistribute the original audio files from the source datasets. Users must independently download and prepare the source datasets according to their respective licenses and terms of use.
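Since only metadata is released, reproducing the mixtures means mapping each source reference back to a locally downloaded dataset. A minimal sketch of that resolution step; the field names (`mix_id`, `sources`, `dataset`, `relative_path`) are hypothetical placeholders, not the actual metadata schema:

```python
import json
import os

def resolve_sources(metadata_path: str, dataset_roots: dict) -> list:
    """Resolve source references in a metadata JSON file to local audio paths.

    `dataset_roots` maps dataset names (e.g. "AudioSet") to local directories
    where the user has downloaded the corresponding source audio.
    """
    with open(metadata_path) as f:
        records = json.load(f)
    resolved = []
    for rec in records:
        paths = [os.path.join(dataset_roots[s["dataset"]], s["relative_path"])
                 for s in rec["sources"]]
        resolved.append({"mix_id": rec["mix_id"], "source_paths": paths})
    return resolved
```

Consult the released metadata files for the real schema before adapting this.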
We thank all original contributors for their invaluable service to the scientific community.
