🌐 Project Page • 📄 Paper • 🤗 Hugging Face • 🚀 T3-Video (4K) • 📚 Citation
UltraGen is a novel video generation framework that enables efficient, end-to-end native high-resolution video synthesis. This is the official implementation of our AAAI 2026 paper.
- 🎯 First Native 4K Model: Achieves native high-quality 4K video generation, eliminating "pseudo-high-resolution" limitations
- ⚡ Efficient Architecture: Hierarchical dual-branch attention with 4.78× speedup for 4K and 2.69× speedup for 1080P
- 🏆 Superior Quality: State-of-the-art performance across all metrics (HD-FVD, HD-MSE, HD-LPIPS, CLIP scores)
- 💡 Novel Design: Global-local attention decomposition that solves the O((T·H·W)²) complexity bottleneck
For more details, visit our project page or read the paper.
- 🎬 Native High-Resolution: 1080P/2K/4K video generation without super-resolution pipeline
- ⚡ Hierarchical Attention: Dual-branch architecture (local + global) for efficient computation
- 🎨 Flexible Control: Fine-grained control over generation parameters
- 🔧 Easy to Use: Simple one-line command interface
- 🚀 Production Ready: Optimized for both research and practical applications
- 🌐 Project Page: https://sjtuplayer.github.io/projects/UltraGen/
- 📄 Paper: arXiv:2510.18775
- 🚀 T3-Video (4K): 10x+ faster 4K generation
# Clone repository
git clone https://github.com/your-org/UltraGen.git
cd UltraGen
# Create environment
conda create -n ultragen python=3.10 -y
conda activate ultragen
# Install PyTorch
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118
# Install dependencies
pip install -e .
Hugging Face: https://huggingface.co/JTUplayer/Ultragen
# Download base model (required)
python -c "from modelscope import snapshot_download; \
snapshot_download('Wan-AI/Wan2.1-T2V-1.3B', local_dir='checkpoints/Wan2.1-T2V-1.3B')"
# Download finetuned checkpoint from Hugging Face
# Visit https://huggingface.co/JTUplayer/Ultragen for the latest model
huggingface-cli download JTUplayer/Ultragen --local-dir checkpoints/ultragen
# Simple one-liner
bash inference.sh "The video captures a breathtaking view of a mountainous landscape at sunrise, with a sea of clouds enveloping the valleys and rolling hills."Output: outputs/video_0000.mp4
# Underwater scene
bash inference.sh "The video captures an underwater scene featuring a clownfish swimming near a sea anemone in a vibrant coral reef environment"
# Urban landscape
bash inference.sh "The video showcases a stunning aerial view of a modern cityscape with a mix of historic and contemporary architecture."
# Natural phenomena
bash inference.sh "The video captures the intense and fiery eruption of a volcano, showcasing the raw power of nature as molten lava flows and spews into the air."
# Time-lapse
bash inference.sh "A time-lapse video captures a cityscape at night with a lightning strike illuminating the sky above a busy highway."Note: UltraGen works best with landscape and scenic content. Complex human actions and fast movements may have limited support in the current version.
# Use Python API for more control
python tools/inference/generate.py \
--model_dir checkpoints/Wan2.1-T2V-1.3B \
--checkpoint checkpoints/ultragen_1080p.ckpt \
--prompt "Your amazing prompt" \
--output_dir outputs/custom
# Create prompts.txt with one prompt per line
cat > prompts.txt << EOF
The video captures a breathtaking view of a mountainous landscape at sunrise, with a sea of clouds enveloping the valleys and rolling hills.
The video showcases a stunning aerial view of a modern cityscape with a mix of historic and contemporary architecture.
The video captures an underwater scene featuring a clownfish swimming near a sea anemone in a vibrant coral reef environment
EOF
# Generate all videos
python tools/inference/generate.py \
--checkpoint checkpoints/ultragen_1080p.ckpt \
--model_dir checkpoints/Wan2.1-T2V-1.3B \
--prompt_file prompts.txt
UltraGen features a hierarchical dual-branch attention architecture (sketched in code after the component list below):
- **Global-Local Attention Decomposition**
  - Local branch: High-fidelity regional details
  - Global branch: Overall semantic consistency
  - Avoids O((T·H·W)²) complexity
- **Spatially Compressed Global Modeling**
  - Efficient learning of global dependencies
  - Reduced computational overhead
- **Hierarchical Cross-Window Local Attention**
  - Enhanced information flow across windows
  - Lower computational cost
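The decomposition can be illustrated with a minimal PyTorch sketch. This is not the UltraGen implementation; the module name, window size, and pooling-based global compression below are illustrative assumptions only:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchAttention(nn.Module):
    # Illustrative global-local decomposition (NOT the official UltraGen module).
    # Local branch: full attention inside small spatial windows -> regional detail.
    # Global branch: attention over a spatially pooled grid -> semantic consistency.
    def __init__(self, dim, heads=8, window=8, pool=4):
        super().__init__()
        self.window, self.pool = window, pool
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):
        # x: (B, T, H, W, C) latent tokens; H and W assumed divisible by window and pool.
        B, T, H, W, C = x.shape
        w, p = self.window, self.pool

        # Local branch: independent attention within each w x w spatial window.
        loc = x.reshape(B, T, H // w, w, W // w, w, C).permute(0, 1, 2, 4, 3, 5, 6)
        loc = loc.reshape(-1, w * w, C)
        loc, _ = self.local_attn(loc, loc, loc)
        loc = loc.reshape(B, T, H // w, W // w, w, w, C).permute(0, 1, 2, 4, 3, 5, 6)
        loc = loc.reshape(B, T, H, W, C)

        # Global branch: spatially compress, attend jointly over all frames, then upsample.
        g = F.avg_pool2d(x.reshape(B * T, H, W, C).permute(0, 3, 1, 2), p)  # (B*T, C, H/p, W/p)
        gh, gw = g.shape[-2:]
        g = g.permute(0, 2, 3, 1).reshape(B, T * gh * gw, C)
        g, _ = self.global_attn(g, g, g)
        g = g.reshape(B * T, gh, gw, C).permute(0, 3, 1, 2)
        g = F.interpolate(g, size=(H, W), mode="nearest")
        g = g.permute(0, 2, 3, 1).reshape(B, T, H, W, C)

        # Fuse both branches per token.
        return self.fuse(torch.cat([loc, g], dim=-1))

# Smoke test on a tiny latent grid.
x = torch.randn(1, 4, 32, 32, 64)
print(DualBranchAttention(dim=64)(x).shape)  # torch.Size([1, 4, 32, 32, 64])

Both branches scale with the number of windows and the size of the compressed grid rather than with the square of the full token count, which is the point of the decomposition.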
Compared to baseline Wan2.1:
| Method | Resolution | Speedup | HD-FVD ↓ | HD-MSE ↑ | HD-LPIPS ↑ |
|---|---|---|---|---|---|
| Wan2.1 | 1080P | 1.00× | 245.37 | 375.82 | 0.5201 |
| UltraGen | 1080P | 2.69× | 214.12 | 390.19 | 0.5455 |
| Wan2.1 | 4K | 1.00× | 486.29 | 362.45 | 0.6102 |
| UltraGen | 4K | 4.78× | 424.61 | 386.01 | 0.6450 |
Key Improvements:
- 🚀 2.69× faster at 1080P, 4.78× faster at 4K (see the cost estimate below)
- 📊 Better quality across all metrics (lower FVD, higher MSE and LPIPS)
- ⚡ Efficient hierarchical attention without sacrificing quality
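The speedups reflect the avoided quadratic attention cost. For a rough sense of scale, the snippet below estimates what a single full-attention pass would face at native 4K; the 4× temporal / 8× spatial VAE compression and 2×2 patchification are assumptions typical of Wan-style DiTs, not UltraGen's exact configuration:

# Back-of-envelope estimate of full self-attention cost at native 4K.
# The VAE/patchify factors below are assumptions for illustration only.
frames, height, width = 81, 2160, 3840          # raw 4K clip (T, H, W)
t_down, s_down, patch = 4, 8, 2                 # assumed temporal/spatial compression
tokens = (frames // t_down + 1) * (height // (s_down * patch)) * (width // (s_down * patch))
print(f"sequence length ~ {tokens:,}")          # ~680,400 tokens
print(f"attention pairs ~ {tokens ** 2:.2e}")   # ~4.6e11 pairwise interactions per layer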
For more technical details, see our paper.
For even faster 4K generation (10x+ acceleration), check out T3-Video:
- Transform Trained Transformer architecture
- Native 4K support (3840x2176, 81 frames)
- More stable and consistent results
- Built on UltraGen base
Paper: Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10×
Recommended Dataset: We recommend using UltraVideo - a high-quality UHD 4K video dataset with comprehensive captions (NeurIPS 2025).
# Download UltraVideo dataset
huggingface-cli download --repo-type dataset APRIL-AIGC/UltraVideo \
--local-dir ./data/UltraVideo --resume-download
Custom Dataset Format (a quick consistency check is sketched after the layout below):
dataset/
├── videos/
│ ├── video001.mp4
│ ├── video002.mp4
│ └── ...
└── captions.json # {"video001.mp4": "description", ...}
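Before training on a custom dataset, it can help to verify that every video has a caption and vice versa. The sketch below only assumes the layout above (a videos/ folder plus captions.json keyed by filename); it is not part of the UltraGen tooling:

import json
from pathlib import Path

def check_dataset(root: str) -> None:
    # Verify the videos/ folder and captions.json agree with each other.
    root = Path(root)
    videos = {p.name for p in (root / "videos").glob("*.mp4")}
    captions = json.loads((root / "captions.json").read_text(encoding="utf-8"))

    missing_captions = sorted(videos - captions.keys())
    missing_videos = sorted(captions.keys() - videos)
    if missing_captions:
        print("videos without captions:", missing_captions)
    if missing_videos:
        print("captions without videos:", missing_videos)
    print(f"{len(videos & captions.keys())} usable video/caption pairs")

check_dataset("dataset")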
python tools/training/train.py \
--task train \
--dataset_path data/UltraVideo \
--output_path experiments/my_model \
--dit_path checkpoints/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors \
--train_architecture full \
--height 1088 --width 1920 --num_frames 81 \
--learning_rate 1e-4 \
--use_gradient_checkpointing
# 8 GPUs on one machine
torchrun --nproc_per_node=8 tools/training/train.py \
--task train \
--dataset_path data/UltraVideo \
--output_path experiments/my_model \
--dit_path checkpoints/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors \
--train_architecture full \
--learning_rate 1e-4 \
--use_gradient_checkpointing
For large-scale training across multiple machines, use scripts/training/run.sh:
# 1. Create a hosts file listing all machine IPs (one per line)
cat > hosts.txt << EOF
192.168.1.10
192.168.1.11
192.168.1.12
192.168.1.13
EOF
# 2. Configure the training script
# Edit scripts/training/run.sh to set:
# - N_NODES: number of machines
# - DATASET_PATH: path to UltraVideo or your dataset
# - DIT_PATH: base model checkpoint path
# - OUTPUT_PATH: where to save trained models
# - Training hyperparameters (learning rate, resolution, etc.)
# 3. Launch distributed training on each node
bash scripts/training/run.sh
Training Configuration:
- Automatic node rank detection and master node setup (sketched below)
- Supports both 1080P and 4K training (adjust resolution parameters)
- Compatible with pssh for parallel node deployment
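One way the automatic node rank detection could work is to match the local IP against the hosts file, with the first entry acting as the master. This is only an illustrative sketch of that idea, not the actual contents of scripts/training/run.sh:

import socket
from pathlib import Path

def detect_rank(hosts_file: str = "hosts.txt", master_port: int = 29500):
    # Derive (node_rank, master_addr, master_port) from a hosts file.
    # Assumes one IP per line; the first line is treated as the master node.
    hosts = [h.strip() for h in Path(hosts_file).read_text().splitlines() if h.strip()]
    local_ip = socket.gethostbyname(socket.gethostname())
    node_rank = hosts.index(local_ip)   # raises ValueError if this host is not listed
    return node_rank, hosts[0], master_port

rank, master_addr, master_port = detect_rank()
# These values would then feed torchrun, e.g.:
# torchrun --nnodes=<number of hosts> --node_rank=<rank> \
#          --master_addr=<master_addr> --master_port=<master_port> ...
print(rank, master_addr, master_port)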
UltraGen/
├── README.md # This file
├── inference.sh # Quick inference script
├── tools/
│ ├── inference/
│ │ └── generate.py # Main inference script
│ └── training/
│       └── train.py # Training script
├── diffsynth/ # Core library
├── checkpoints/ # Model checkpoints
├── outputs/ # Generated videos
└── prompts_example.txt # Example prompts
- 🚀 T3-Video: Transform Trained Transformer for 10x+ faster 4K generation
- 🎨 DiffSynth Studio: Diffusion synthesis framework
- 📹 Wan2.1: Base video generation model
We thank the contributors of DiffSynth Studio, Wan-Video, and the open-source community for their valuable work.
Apache License 2.0 - See LICENSE file for details.
If you find UltraGen useful in your research, please cite:
@inproceedings{hu2026ultragen,
title={UltraGen: High-Resolution Video Generation with Hierarchical Attention},
author={Hu, Teng and Zhang, Jiangning and Su, Zihan and Yi, Ran},
booktitle={AAAI Conference on Artificial Intelligence},
year={2026}
}
Paper: https://arxiv.org/abs/2510.18775
Project Page: https://sjtuplayer.github.io/projects/UltraGen/
If you use T3-Video for 4K generation:
@misc{t3video,
title={Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10×},
author={Jiangning Zhang and Junwei Zhu and Teng Hu and Yabiao Wang and Donghao Luo and Weijian Cao and Zhenye Gan and Xiaobin Hu and Zhucun Xue and Chengjie Wang},
year={2025},
eprint={2512.13492},
archivePrefix={arXiv}
}
- Issues: GitHub Issues
- Project Page: https://sjtuplayer.github.io/projects/UltraGen/
- Paper: arXiv:2510.18775
UltraGen: High-Resolution Video Generation with Hierarchical Attention
AAAI 2026
Made with ❤️ by Shanghai Jiao Tong University & Zhejiang University