[CVPR 2026] Official PyTorch implementation of "Hear What Matters! Text-conditioned Selective Video-to-Audio Generation".
arXiv preprint: https://arxiv.org/abs/2512.02650
Keywords: Video-to-Audio Generation, Selective Sound Generation, Multimodal Deep Learning.
We assume a miniforge environment with:
- Python 3.9+
- PyTorch 2.6.0+ and the matching torchvision/torchaudio (choose your CUDA build at https://pytorch.org/; pip install recommended)
1. Install the prerequisites:
```shell
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
# Or any other CUDA version that your GPUs/driver support.
conda install ffmpeg=6.1.0 x264 -c conda-forge  # optional
```
2. Clone our repository:
```shell
git clone https://github.com/jnwnlee/SelVA.git
```
3. Install with pip:
```shell
cd SelVA
pip install -e .
```
(If you encounter a `File "setup.py" not found` error, upgrade pip with `pip install --upgrade pip`.)
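Before installing the heavier dependencies, it can help to fail fast on the Python requirement. A minimal sketch (not part of the repo):

```python
import sys

# The repo requires Python 3.9+; check before installing heavier dependencies.
assert sys.version_info >= (3, 9), (
    "Python 3.9+ required, found " + sys.version.split()[0]
)
print("Python version OK:", sys.version.split()[0])
```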
Pretrained models:
The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in selva/utils/download_utils.py.
The models are also available at https://huggingface.co/jnwnlee/SelVA/tree/main. Place the SelVA weights in the ./weights/ folder and the external weights in the ./ext_weights/ folder.
Refer to MODELS.md for further details.
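If you download weights manually, you can verify them against the MD5 checksums in selva/utils/download_utils.py. A small helper sketch (the weight path below is illustrative, not the actual filename):

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the checksum listed in
# selva/utils/download_utils.py, e.g.:
#   md5sum("./weights/selva_small_16k.pth")  # path is illustrative
```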
By default, these scripts use the small_16k model.
In our experiments, inference only takes around 4GB of GPU memory (in 16-bit mode).
```shell
python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
```
The output (audio in .flac format and video in .mp4 format) will be saved in ./output.
See demo.py for more options.
The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but a large deviation from the training duration may reduce quality.
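To run the demo over a folder of videos, a thin wrapper around the command above can be used. A sketch, assuming the flags shown in the demo command; the paths and prompt are illustrative:

```python
import subprocess
from pathlib import Path

def run_demo(video_dir: str, prompt: str, duration: int = 8,
             dry_run: bool = False) -> list:
    """Invoke demo.py once per .mp4 file in video_dir.

    With dry_run=True, only build and return the command lines
    without executing them.
    """
    cmds = []
    for video in sorted(Path(video_dir).glob("*.mp4")):
        cmd = ["python", "demo.py", f"--duration={duration}",
               f"--video={video}", "--prompt", prompt]
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmds

# Example (illustrative paths/prompt):
#   run_demo("./my_videos", "dog barking")
```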
See TRAINING.md. (TBA)
See EVAL.md.
SelVA was trained on VGGSound. The pretrained Synchformer was trained on AudioSet. The pretrained MMAudio was trained on several datasets, including AudioSet, Freesound, VGGSound, AudioCaps, and WavCaps. These datasets are subject to specific licenses, which can be accessed on their respective websites. Please follow the corresponding licenses and guidelines when using them.
- add training and example bench.
- add model variants.
- add VGG-MonoAudio.
```bibtex
@article{selva,
  title={Hear What Matters! Text-conditioned Selective Video-to-Audio Generation},
  author={Lee, Junwon and Nam, Juhan and Lee, Jiyoung},
  journal={arXiv preprint arXiv:2512.02650},
  year={2025}
}
```
- av-benchmark for benchmarking results.
- kadtk for KAD calculation.
We sincerely thank the authors for open-sourcing the following repos:
- MMAudio
- Synchformer
- Make-An-Audio 2 for the 16kHz BigVGAN pretrained model and the VAE architecture
- BigVGAN
- EDM2 for the magnitude-preserving network architecture
