SelVA: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

[CVPR 2026] Official PyTorch implementation of "Hear What Matters! Text-conditioned Selective Video-to-Audio Generation".
arXiv preprint: arXiv:2512.02650.
Keywords: Video-to-Audio Generation, Selective Sound Generation, Multimodal Deep Learning.


SelVA Demo Video

Installation

Prerequisites

We assume a miniforge environment.

  • Python 3.9+
  • PyTorch 2.6.0+ and the corresponding torchvision/torchaudio (choose the build for your CUDA version at https://pytorch.org/; pip install recommended)

1. Install prerequisites:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
conda install ffmpeg=6.1.0 x264 -c conda-forge # optional

(Or any other CUDA version that your GPU/driver supports.)

2. Clone our repository:

git clone https://github.com/jnwnlee/SelVA.git

3. Install with pip:

cd SelVA
pip install -e .

(If you encounter a File "setup.py" not found error, upgrade pip with pip install --upgrade pip.)

Pretrained models:

The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in selva/utils/download_utils.py.
The models are also available at https://huggingface.co/jnwnlee/SelVA/tree/main. Place SelVA weights in the ./weights/ folder and external weights in the ./ext_weights/ folder.
Refer to MODELS.md for further details.
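If you download weights manually from Hugging Face, you may want to verify them against the MD5 checksums listed in selva/utils/download_utils.py. A minimal sketch using only the Python standard library; the weight filename and expected digest below are placeholders, not the real values:

```python
import hashlib
from pathlib import Path

def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    # Placeholder path/digest: take the real values from selva/utils/download_utils.py
    weight = Path("./weights/selva_small_16k.pth")
    expected = "<md5 from download_utils.py>"
    if weight.exists():
        assert md5sum(weight) == expected, f"checksum mismatch for {weight}"
```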

Demo

By default, these scripts use the small_16k model. In our experiments, inference takes only around 4 GB of GPU memory (in 16-bit mode).

python demo.py --duration=8 --video=<path to video> --prompt "your prompt" 

The output (audio in .flac format and video in .mp4 format) will be saved in ./output. See demo.py for more options. The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but a large deviation from the training duration may result in lower quality.
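To run the demo over many clips, the command above can be assembled per video and invoked with subprocess. A sketch under stated assumptions: the --duration/--video/--prompt flags are exactly those shown above, while the ./videos folder and the per-clip prompt mapping are hypothetical:

```python
import subprocess
from pathlib import Path

def build_demo_cmd(video: Path, prompt: str, duration: int = 8) -> list[str]:
    """Assemble the demo.py invocation shown in the README."""
    return [
        "python", "demo.py",
        f"--duration={duration}",
        f"--video={video}",
        "--prompt", prompt,
    ]

if __name__ == "__main__":
    # Hypothetical batch: one prompt per video stem, clips under ./videos/
    prompts = {"dog_park": "a dog barking", "street": "car engine revving"}
    for video in sorted(Path("./videos").glob("*.mp4")):
        cmd = build_demo_cmd(video, prompts.get(video.stem, ""))
        subprocess.run(cmd, check=True)  # outputs are saved under ./output
```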

Training

See TRAINING.md. (TBA)

Inference and Evaluation

See EVAL.md.

Training Datasets

SelVA was trained on VGGSound. The pretrained Synchformer was trained on AudioSet. The pretrained MMAudio was trained on several datasets, including AudioSet, Freesound, VGGSound, AudioCaps, and WavCaps. These datasets are subject to specific licenses, which can be accessed on their respective websites. Please follow the corresponding licenses and guidelines when using them.

Updates

  • Add training and example bench.
  • Add model variants.
  • Add VGG-MonoAudio.

Citation

@article{selva,
  title={Hear What Matters! Text-conditioned Selective Video-to-Audio Generation},
  author={Lee, Junwon and Nam, Juhan and Lee, Jiyoung},
  journal={arXiv preprint arXiv:2512.02650},
  year={2025}
}

Relevant Repositories

Acknowledgement

We sincerely thank the authors for open-sourcing the following repos:
