We introduce MOVA (MOSS Video and Audio), a foundation model designed to break the "silent era" of open-source video generation. Unlike cascaded pipelines that generate sound as an afterthought, MOVA synthesizes video and audio simultaneously, keeping the two modalities tightly aligned.
🌟Key Highlights
- Native Bimodal Generation: Moves beyond clunky cascaded pipelines. MOVA generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation.
- Precise Lip-Sync & Sound FX: Achieves state-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects.
- Fully Open-Source: In a field dominated by closed-source models (Sora 2, Veo 3, Kling), we are releasing model weights, inference code, training pipelines, and LoRA fine-tuning scripts.
- Asymmetric Dual-Tower Architecture: Leverages the power of pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich modality interaction.
- 2026/01/29: 🎉We released MOVA, an open-source foundation model for high-fidelity synchronized video–audio generation!!!
MOVA.Demo.mp4
Single person speech:
single_person.mp4
Multi-person speech:
multi_person.mp4
View more demos on our website.
conda create -n mova python=3.13 -y
conda activate mova
pip install -e .
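Optionally, run a quick sanity check to confirm that the environment picked up a CUDA-enabled PyTorch build before downloading any weights (a minimal sketch; the exact PyTorch/CUDA versions depend on your installation):

```bash
# Optional sanity check (sketch): verify that PyTorch is installed and can see a GPU.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```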
| Model | Download Link | Note |
|---|---|---|
| MOVA-360p | 🤗 Huggingface | Supports TI2VA |
| MOVA-720p | 🤗 Huggingface | Supports TI2VA |
hf download OpenMOSS-Team/MOVA-360p --local-dir /path/to/MOVA-360p
hf download OpenMOSS-Team/MOVA-720p --local-dir /path/to/MOVA-720p
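If you prefer to fetch the weights from Python rather than the `hf` CLI, `huggingface_hub.snapshot_download` downloads the same repository (a sketch; substitute your own `local_dir`):

```python
# Sketch: programmatic alternative to the `hf download` commands above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="OpenMOSS-Team/MOVA-720p",
    local_dir="/path/to/MOVA-720p",  # use this path as CKPT_PATH in the inference commands below
)
```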
Generate a video of single-person speech:
export CP_SIZE=1
export CKPT_PATH=/path/to/MOVA-720p/
torchrun \
--nproc_per_node=$CP_SIZE \
scripts/inference_single.py \
--ckpt_path $CKPT_PATH \
--cp_size $CP_SIZE \
--height 720 \
--width 1280 \
--prompt "A man in a blue blazer and glasses speaks in a formal indoor setting, framed by wooden furniture and a filled bookshelf. Quiet room acoustics underscore his measured tone as he delivers his remarks. At one point, he says, \"I would also say that this election in Germany wasn’t surprising.\"" \
--ref_path "./assets/single_person.jpg" \
--output_path "./data/samples/single_person.mp4" \
--seed 42 \
--offload cpu
Generate a video of multi-person speech:
export CP_SIZE=1
export CKPT_PATH=/path/to/MOVA-720p/
torchrun \
--nproc_per_node=$CP_SIZE \
scripts/inference_single.py \
--ckpt_path $CKPT_PATH \
--cp_size $CP_SIZE \
--height 720 \
--width 1280 \
--prompt "The scene shows a man and a child walking together through a park, surrounded by open greenery and a calm, everyday atmosphere. As they stroll side by side, the man turns his head toward the child and asks with mild curiosity, in English, \"What do you want to do when you grow up?\" The boy answers with clear confidence, saying, \"A bond trader. That's what Don does, and he took me to his office.\" The man lets out a soft chuckle, then responds warmly, \"It's a good profession.\" as their walk continues at an unhurried pace, the conversation settling into a quiet, reflective moment." \
--ref_path "./assets/multi_person.png" \
--output_path "./data/samples/multi_person.mp4" \
--seed 42 \
--offload cpu
Please refer to the inference script for details on the remaining arguments.
- `--offload cpu`: component-wise CPU offload that reduces VRAM usage; typically slower and uses more host RAM.
- `--offload group`: finer-grained layer-wise (group) offload that usually achieves lower VRAM usage, but is slower and puts more pressure on host RAM (see the benchmark table below).
- `--remove_video_dit`: frees the stage-1 video_dit once the pipeline switches to the low-noise video_dit_2, saving roughly 28 GB of host RAM when offloading is enabled.
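For example, on a GPU with limited VRAM the two memory-saving options can be combined with the single-person command above (a sketch, assuming `--remove_video_dit` is a boolean switch; the prompt is a placeholder for the full prompt shown earlier):

```bash
# Sketch: lowest-VRAM configuration from the options above
# (layer-wise group offload plus freeing the stage-1 video DiT).
export CP_SIZE=1
export CKPT_PATH=/path/to/MOVA-720p/
torchrun \
    --nproc_per_node=$CP_SIZE \
    scripts/inference_single.py \
    --ckpt_path $CKPT_PATH \
    --cp_size $CP_SIZE \
    --height 720 \
    --width 1280 \
    --prompt "..." \
    --ref_path "./assets/single_person.jpg" \
    --output_path "./data/samples/single_person.mp4" \
    --seed 42 \
    --offload group \
    --remove_video_dit
```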
We provide inference benchmarks for generating an 8-second 360p video under different offloading strategies. Note that actual performance may vary depending on hardware configurations, driver versions, and PyTorch/CUDA builds.
| Offload Strategy | VRAM (GB) | Host RAM (GB) | Hardware | Step Time (s) |
|---|---|---|---|---|
| Component-wise offload | 48 | 66.7 | RTX 4090 | 37.5 |
| Component-wise offload | 48 | 66.7 | H100 | 9.0 |
| Layerwise (group offload) | 12 | 76.7 | RTX 4090 | 42.3 |
| Layerwise (group offload) | 12 | 76.7 | H100 | 22.8 |
We also support NPUs. For more details about NPU training/inference, please refer to this document.
We evaluate our model through both objective benchmarks and subjective human evaluations.
We provide a quantitative comparison of audiovisual generation performance on Verse-Bench. The Audio and AV-Align metrics are evaluated on all subsets; the Lip Sync and Speech metrics are evaluated on Verse-Bench Set3; and ASR Acc is evaluated on a multi-speaker subset proposed by our team. Boldface and underlined numbers indicate the best and second-best results, respectively.
The largest performance gap appears on the lip-sync task, where MOVA shows a clear advantage: with Dual CFG enabled, MOVA-720p achieves an LSE-D of 7.094 and an LSE-C of 7.452. MOVA also attains the best result on the cpCER metric, which reflects speech recognition accuracy and speaker-switching accuracy.
Below are the Elo scores and win rates comparing MOVA to existing open-source models.
sglang generate \
--model-path OpenMOSS-Team/MOVA-720p \
--prompt "A man in a blue blazer and glasses speaks in a formal indoor setting, \
framed by wooden furniture and a filled bookshelf. \
Quiet room acoustics underscore his measured tone as he delivers his remarks. \
At one point, he says, \"I would also say that this election in Germany wasn’t surprising.\"" \
--image-path "https://github.com/OpenMOSS/MOVA/raw/main/assets/single_person.jpg" \
--adjust-frames false \
--num-gpus 8 \
--ring-degree 2 \
--ulysses-degree 4 \
--num-frames 193 \
--fps 24 \
--seed 67 \
--num-inference-steps 25 \
--enable-torch-compile \
--save-output
The following commands show how to launch LoRA training in different modes; for detailed memory and performance numbers, see the LoRA Resource & Performance Reference section below.
- Model checkpoints: Download MOVA weights to your local path and update the `diffusion_pipeline` section of the corresponding config.
- Dataset: Configure your video+audio dataset and transforms in the `data` section of the corresponding config (e.g., `mova_train_low_resource.py`); see `mova/datasets/video_audio_dataset.py` for the expected fields.
- Environment: Use the same environment as inference, then install training-only extras: `pip install -e ".[train]"` (includes `torchcodec` and `bitsandbytes`).
- Configs: Choose one of the training configs below and edit LoRA, optimizer, and scheduler settings as needed (a rough sketch of these fields follows this list).

Low-resource LoRA (single GPU):
- Config: `configs/training/mova_train_low_resource.py`
- Script: `bash scripts/training_scripts/example/low_resource_train.sh`

Accelerate LoRA (1 GPU):
- Config: `configs/training/mova_train_accelerate.py`
- Script: `bash scripts/training_scripts/example/accelerate_train.sh`

Accelerate + FSDP LoRA (8 GPUs):
- Config: `configs/training/mova_train_accelerate_8gpu.py`
- Accelerate config: `configs/training/accelerate/fsdp_8gpu.yaml`
- Script: `bash scripts/training_scripts/example/accelerate_train_8gpu.sh`

All hyper-parameters (LoRA rank/alpha, target modules, optimizer, offload strategy, etc.) are defined in the corresponding config files; the example scripts only take the config path as input.
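As a rough illustration of what you would edit, the fragment below mimics the kind of fields such a config exposes. The field names here are assumptions for illustration only; check `configs/training/mova_train_low_resource.py` for the actual schema.

```python
# Purely illustrative sketch -- these field names are assumptions, not the real MOVA config schema.
# Edit the corresponding entries in configs/training/mova_train_low_resource.py instead.
lora = dict(
    rank=64,                                            # LoRA rank
    alpha=64,                                           # LoRA scaling factor
    target_modules=["to_q", "to_k", "to_v", "to_out"],  # example attention-projection names
)

optimizer = dict(
    type="AdamW8bit",   # 8-bit AdamW from bitsandbytes (installed via the [train] extra)
    lr=1e-4,
    weight_decay=1e-2,
)

lr_scheduler = dict(type="cosine", warmup_steps=100)
```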
All peak usage numbers below are measured on 360p, 8-second video training settings and will vary with resolution, duration, and batch size.
| Mode | VRAM (GB/GPU) | Host RAM (GB) | Hardware | Step Time (s) |
|---|---|---|---|---|
| Low-resource LoRA (single GPU) | ≈18 | ≈80 | RTX 4090 | 600 |
| Accelerate LoRA (1 GPU) | ≈100 | ≥128 | H100 | N/A |
| Accelerate + FSDP LoRA (8 GPUs) | ≈50 | ≥128 | H100 | 22.2 |
Note: Training 8-second 360p videos on RTX 4090 is not recommended due to high resource requirements and slow training speed. We strongly suggest reducing video resolution (e.g., 240p) or total frame count to accelerate training and reduce resource consumption.
- Checkpoints
- Multi-GPU inference
- LoRA fine-tuning
- Ascend NPU fine-tuning
- Ascend NPU inference
- SGLang Integration
- Technical Report
- Generation Workflow
- Diffusers Integration
We would like to thank the contributors to Wan, SGLang, diffusers, HuggingFace, DiffSynth-Studio, and HunyuanVideo-Foley for their great open-source work, which has been very helpful to this project.



