[CVPR 2026] Official PyTorch implementation of "Hear What Matters! Text-conditioned Selective Video-to-Audio Generation".
arXiv preprint: https://arxiv.org/abs/2512.02650
Keywords: Video-to-Audio Generation, Selective Sound Generation, Multimodal Deep Learning.
We assume a miniforge environment with:
- Python 3.9+
- PyTorch 2.6.0+ and the matching torchvision/torchaudio (choose your CUDA build at https://pytorch.org/; pip install recommended)
1. Install the prerequisites:
```shell
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
# Or any other CUDA version that your GPUs/driver support.
conda install ffmpeg=6.1.0 x264 -c conda-forge  # optional
```
2. Clone our repository:
```shell
git clone https://github.com/jnwnlee/SelVA.git
```
3. Install with pip:
```shell
cd SelVA
pip install -e .
```
(If you encounter a `File "setup.py" not found` error, upgrade pip with `pip install --upgrade pip`.)
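Before installing the heavier dependencies, it can help to fail fast on the Python requirement. A minimal sketch (not part of the repo):

```python
import sys

# The repo requires Python 3.9+; check before installing heavier dependencies.
assert sys.version_info >= (3, 9), (
    "Python 3.9+ required, found " + sys.version.split()[0]
)
print("Python version OK:", sys.version.split()[0])
```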
Pretrained models:
The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in selva/utils/download_utils.py.
The models are also available at https://huggingface.co/jnwnlee/SelVA/tree/main. Place the SelVA weights in the ./weights/ folder and the external weights in the ./ext_weights/ folder.
Refer to MODELS.md for further details.
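If you download weights manually, you can verify them against the MD5 checksums in selva/utils/download_utils.py. A small helper sketch (the weight path below is illustrative, not the actual filename):

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the checksum listed in
# selva/utils/download_utils.py, e.g.:
#   md5sum("./weights/selva_small_16k.pth")  # path is illustrative
```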
By default, these scripts use the small_16k model.
In our experiments, inference only takes around 4GB of GPU memory (in 16-bit mode).
```shell
python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
```
The output (audio in .flac format and video in .mp4 format) will be saved in ./output.
See demo.py for more options.
The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but a large deviation from the training duration may reduce quality.
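To run the demo over a folder of videos, a thin wrapper around the command above can be used. A sketch, assuming the flags shown in the demo command; the paths and prompt are illustrative:

```python
import subprocess
from pathlib import Path

def run_demo(video_dir: str, prompt: str, duration: int = 8,
             dry_run: bool = False) -> list:
    """Invoke demo.py once per .mp4 file in video_dir.

    With dry_run=True, only build and return the command lines
    without executing them.
    """
    cmds = []
    for video in sorted(Path(video_dir).glob("*.mp4")):
        cmd = ["python", "demo.py", f"--duration={duration}",
               f"--video={video}", "--prompt", prompt]
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmds

# Example (illustrative paths/prompt):
#   run_demo("./my_videos", "dog barking")
```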
See TRAINING.md. (TBA)
See EVAL.md.
SelVA was trained on VGGSound. The pretrained Synchformer was trained on AudioSet. The pretrained MMAudio was trained on several datasets, including AudioSet, Freesound, VGGSound, AudioCaps, and WavCaps. These datasets are subject to specific licenses, which can be accessed on their respective websites. Please follow the corresponding licenses and guidelines when using them.
- add training and example bench.
- add model variants.
- add VGG-MonoAudio.
```bibtex
@article{selva,
  title={Hear What Matters! Text-conditioned Selective Video-to-Audio Generation},
  author={Lee, Junwon and Nam, Juhan and Lee, Jiyoung},
  journal={arXiv preprint arXiv:2512.02650},
  year={2025}
}
```
- av-benchmark for benchmarking results.
- kadtk for KAD calculation.
We sincerely thank the authors for open-sourcing the following repos:
- MMAudio
- Synchformer
- Make-An-Audio 2 for the 16kHz BigVGAN pretrained model and the VAE architecture
- BigVGAN
- EDM2 for the magnitude-preserving network architecture
