💡 Tip: If you find this repository's structure or content difficult to understand, visit deepwiki for a comprehensive, detailed explanation.
FlashVSR-Pro is an enhanced, production-ready re-implementation of the real-time diffusion-based video super-resolution algorithm introduced in the paper "FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution".
Original Paper: Zhuang, J., Guo, S., Cai, X., Li, X., Liu, Y., Yuan, C., & Xue, T. (2025). FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution. arXiv preprint arXiv:2510.12747.
Paper Link: https://arxiv.org/abs/2510.12747
This project is not the official code release but an independent, refactored implementation focused on improved usability, additional features, and better compatibility for real-world deployment.
This project builds upon the core FlashVSR algorithm and introduces several key improvements:
- 🧩 Unified Inference Script: A single, parameterized `infer.py` script replaces the multiple original scripts (`full`, `tiny`, `tiny-long`), simplifying the user interface.
- 🎵 Audio Track Preservation: Automatically detects and preserves the audio track from the input video in the super-resolved output. (Inspiration for this feature was drawn from the FlashVSR_plus project.)
- 💾 Tiled Inference for Reduced VRAM: Implements a tiling mechanism for the DiT model, significantly lowering GPU memory requirements and enabling processing of higher-resolution videos or operation on GPUs with limited VRAM. (The concept for tiled DiT inference was inspired by the FlashVSR_plus project.)
- 🐳 Optimized Docker Container: A fully configured Dockerfile that automatically sets up the complete environment, including Conda environment activation upon container startup.
- 🔧 Automated Block-Sparse Attention Installation: Optimizes and automates the installation of the Block-Sparse-Attention backend within the Docker build process. This eliminates the manual compilation complexity encountered in the original implementation, ensuring a seamless setup experience. My specific improvements to Block-Sparse-Attention are documented in this PR: mit-han-lab/Block-Sparse-Attention#16.
- ⚡ Enterprise-Grade Performance Optimizations: Engineered for high-end GPUs (A100/H100/H200, etc.), this project features a Zero-Copy Pipeline, Hardware Acceleration (TF32/cuDNN), Asynchronous Execution, and NVENC Video Encoding, maximizing throughput and eliminating bottlenecks for production-grade video processing.
- 🎯 Strict Precision Alignment (Duration & Resolution): Implements smart padding strategies to overcome algorithmic block-size constraints (e.g., 8n+1 frames, mod-128 resolution) without data loss.
  - Exact Duration: Ensures the output frame count perfectly matches the input, solving the "missing seconds" issue and guaranteeing perfect audio sync (e.g., input 5.0 s → output 5.0 s).
  - Exact Resolution: Guarantees the output resolution is strictly `Scale × Input` (e.g., 1280x740 → 2560x1480), using reflective padding instead of cropping to prevent pixel loss.
- 📊 Concise & Professional Logging: Optimized console output that filters out internal noise and presents key metrics (real FPS, resolution scaling) cleanly, making it suitable for production monitoring.
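The padding arithmetic behind the precision-alignment feature can be sketched as follows. This is an illustrative reconstruction of the constraints described above (8n+1 frames, resolution rounded up to a multiple of 128), not the actual code in `infer.py`; the helper name `pad_targets` is hypothetical:

```python
import math

def pad_targets(num_frames: int, height: int, width: int, scale: float = 2.0):
    """Compute padded sizes satisfying the block constraints, plus the
    final output size the result is cropped back to (no data loss)."""
    # Smallest frame count of the form 8n + 1 that covers the input
    n = math.ceil(max(num_frames - 1, 0) / 8)
    padded_frames = 8 * n + 1
    # Round spatial dims up to the next multiple of 128 (reflective padding)
    padded_h = math.ceil(height / 128) * 128
    padded_w = math.ceil(width / 128) * 128
    # The output is cropped back to exactly scale x input
    return padded_frames, (padded_h, padded_w), (int(height * scale), int(width * scale))

# A 150-frame 1280x740 clip at 2x upscale pads to 153 frames at 768x1280,
# then the output is cropped back to 1480x2560:
print(pad_targets(150, 740, 1280))  # -> (153, (768, 1280), (1480, 2560))
```

Because the padding is reflective and the output is cropped to exactly `scale × input`, neither frames nor pixels are lost to the block-size constraints.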
- Real-Time Performance: Achieves ~17 FPS for 768 × 1408 videos on a single A100 GPU.
- One-Step Diffusion: Efficient streaming framework based on a distilled one-step diffusion model.
- State-of-the-Art Quality: Combines Locality-Constrained Sparse Attention and a Tiny Conditional Decoder for high-fidelity results.
- Scalability: Reliably scales to ultra-high resolutions.
| Mode | VAE Used | Description |
|---|---|---|
| `full` | wan2.1 | High quality, high VRAM. Ideal for quality-critical tasks. |
| `tiny` | tcd | Balanced quality, lower VRAM. Ideal for efficient real-time processing. |
| `tiny-long` | tcd | Balanced quality, lower VRAM. Optimized for long videos. |
Required VAE models are expected in `./models/FlashVSR-v1.1`.
| VAE | File | Direct Download Link |
|---|---|---|
| Wan2.1 | `models/FlashVSR-v1.1/Wan2.1_VAE.pth` | Download |
| TCDecoder | `models/FlashVSR-v1.1/TCDecoder.ckpt` | Download |
Usage Examples:

```shell
# High quality (full mode)
python infer.py -i inputs/input.mp4 -o results/ --mode full

# Fast inference (tiny mode)
python infer.py -i inputs/input.mp4 -o results/ --mode tiny

# Long video processing (tiny-long mode)
python infer.py -i inputs/long_input.mp4 -o results/ --mode tiny-long
```

📖 For detailed installation instructions, see INSTALLATION.md
FlashVSR-Pro is primarily designed for Linux systems with NVIDIA GPUs. This is due to:
- Block-Sparse-Attention Backend: The core dependency requires CUDA kernel compilation, which is optimized and extensively tested on Linux.
- Docker Environment: The recommended installation uses Docker with Ubuntu base image for consistent environment setup.
- CUDA Compilation: Requires NVIDIA CUDA Toolkit (11.6+) and proper development toolchain, which is more straightforward on Linux.
- GPU: NVIDIA GPU with CUDA compute capability 8.0+ (Ampere architecture or newer)
- Recommended: A100, H100, H200, RTX 3090, RTX 4090
- Minimum: RTX 3060, RTX 3070, or similar Ampere/Ada/Hopper GPU
- VRAM:
- Minimum 8GB for low-resolution videos with tiling enabled
- 16GB+ recommended for high-quality processing
- 24GB+ for 4K video processing
- CUDA: Version 11.6 or higher (12.x recommended)
- System RAM: 16GB+ recommended
While FlashVSR-Pro is optimized for Linux, Windows users have the following options:
1. Docker Desktop (Recommended for Windows)
   - Install Docker Desktop for Windows with the WSL 2 backend
   - Install the NVIDIA Container Toolkit for WSL 2
   - Follow the Docker installation steps below
   - This provides a Linux environment within Windows with GPU passthrough
2. Windows Subsystem for Linux (WSL 2)
   - Install WSL 2 with Ubuntu 22.04
   - Install the CUDA toolkit in WSL 2
   - Follow the Linux installation instructions within WSL
   - Note: GPU passthrough must be properly configured
3. Native Windows Installation (Advanced)
   - Requires manual compilation of Block-Sparse-Attention on Windows
   - Needs Visual Studio 2019+ with C++17 support
   - Needs the CUDA Toolkit for Windows
   - May encounter compatibility issues not tested by the maintainers
   - Not officially supported; use at your own risk
Important: If you've been struggling with installation on Windows, we strongly recommend using Docker Desktop with WSL 2, as this provides the most reliable experience.
The easiest way to run FlashVSR-Pro is using the provided Docker container, which includes automated setup for the Block-Sparse-Attention backend. This is the recommended method for both Linux and Windows (via WSL 2) users.
- Install Docker
- Install NVIDIA Container Toolkit for GPU support.
- Ensure `git-lfs` is installed on your host system to clone the model weights.
- Install Docker Desktop for Windows with WSL 2 backend enabled
- Install NVIDIA CUDA on WSL 2
- Ensure `git-lfs` is installed: `sudo apt-get install git-lfs` (in a WSL Ubuntu terminal)
- Run all subsequent commands in a WSL 2 Ubuntu terminal
```shell
git clone https://github.com/LujiaJin/FlashVSR-Pro.git
cd FlashVSR-Pro
docker build -t flashvsr-pro:latest .
```

Note: The Dockerfile automatically handles the compilation and installation of the optimized Block-Sparse-Attention backend, eliminating manual configuration.
Before running the container, download the required model weights.
```shell
# Download the FlashVSR model weights (version 1.1)
git lfs clone https://huggingface.co/JunhaoZhuang/FlashVSR-v1.1 ./models/FlashVSR-v1.1
```

The container is configured to automatically activate the `flashvsr` Conda environment upon startup. Make sure the host's `models/` directory already contains the necessary model weight files, and mount it into the container at startup.
```shell
# Basic run with interactive shell
docker run --gpus all -it --rm \
  -v $(pwd):/workspace/FlashVSR-Pro \
  flashvsr-pro:latest

# You will be dropped into a shell with the `(flashvsr)` environment active.
# Verify by running: `which python`
```

- `--mode`: Inference mode selection.
  - `full`: Uses `WanModel` + `WanVideoVAE`. Highest quality, but requires significant VRAM (~14 GB+ for 2 s of 720p without tiling).
  - `tiny`: Uses `WanModel` + `TCDecoder`. Balanced speed and quality.
  - `tiny-long`: Optimized streaming inference for long videos.
- `--tile-dit`: Enable tiling for the DiT model. Drastically reduces VRAM usage by processing the latent space in blocks. Essential for high-resolution inference on consumer GPUs.
- `--tile-size`: Tile size in pixels (default: `256`).
- `--overlap`: Overlap between tiles (default: `24`).
- `--tile-vae`: Enable tiling for the VAE decoder. Allows decoding very large frames by splitting the latent representation before decoding.
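The tiling layout implied by `--tile-size` and `--overlap` can be sketched as follows. The helper `tile_starts` is a hypothetical illustration of covering one axis with overlapping tiles, not the project's actual `tile_utils.py` logic:

```python
def tile_starts(length: int, tile: int = 256, overlap: int = 24):
    """Start offsets of overlapping tiles that cover `length` pixels."""
    if length <= tile:
        return [0]  # a single tile covers everything
    stride = tile - overlap
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # keep the last tile flush with the edge
    return starts

# A 720-pixel axis with 256-pixel tiles and 24-pixel overlap:
print(tile_starts(720))  # [0, 232, 464]
```

The overlap regions are where adjacent tiles are blended, which is why a larger `--overlap` reduces visible seams at the cost of extra compute.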
- `--fps INTEGER`: Force a specific frame rate for the output video. By default, tries to match the input video's FPS, or uses 30 for image sequences.
- `--quality INTEGER`: Output video quality (0-10). Default: `10` (high). Maps to FFmpeg CRF values.
- `--color-fix`: Enable AdaIN/wavelet-based post-processing color correction to match the input's color tones.
- `--keep-audio`: Transfer the audio track from the input to the output video.
- `--dtype`: Precision format. Options: `bf16` (default, recommended), `fp16`, `fp32`.
- `--device`: Processing device. Default: `cuda`.
- `--scale`: Upscaling factor. Default: `2.0`.
- `--sparse-ratio`: Controls attention sparsity. Default: `2.0`. Lower values (e.g., 1.5) are faster but potentially less stable.
- `--kv-ratio`: KV cache ratio. Default: `3.0`.
- `--local-range`: Local attention window size. Default: `11`.
- `--seed`: Random seed for reproducibility.
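The idea behind `--color-fix` (matching the output's color statistics to the input) can be sketched as an AdaIN-style per-channel normalization. This is a minimal approximation for illustration; the project's actual AdaIN/wavelet implementation is more involved:

```python
import numpy as np

def adain_color_fix(out_img: np.ndarray, ref_img: np.ndarray) -> np.ndarray:
    """Shift each channel of `out_img` to match the mean/std of `ref_img`."""
    out = out_img.astype(np.float64)
    ref = ref_img.astype(np.float64)
    for c in range(out.shape[-1]):
        o_mu, o_std = out[..., c].mean(), out[..., c].std() + 1e-8
        r_mu, r_std = ref[..., c].mean(), ref[..., c].std()
        out[..., c] = (out[..., c] - o_mu) / o_std * r_std + r_mu
    return np.clip(out, 0, 255).astype(np.uint8)

# Usage: match a super-resolved frame to the (upscaled) input frame's tones
# fixed = adain_color_fix(sr_frame, lr_frame_upscaled)  # HxWx3 uint8 arrays
```

Matching first- and second-order statistics per channel is enough to correct the global color drift that diffusion models sometimes introduce, without touching spatial detail.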
| Argument | Description | Default |
|---|---|---|
| `-i, --input` | Path to input video or image folder. | Required |
| `-o, --output` | Output directory or file path. | `./results` |
| `--mode` | Inference mode: `full` (Wan VAE), `tiny` (TCD), `tiny-long`. | `tiny` |
| `--keep-audio` | Preserve audio from the input video (if present). | `False` |
| `--tile-dit` | Enable memory-efficient tiled DiT inference. | `False` |
| `--tile-vae` | Enable tiled decoding for the VAE (full mode only). | `False` |
| `--tile-size` | Size of each tile when tiling is enabled. | `256` |
| `--overlap` | Overlap between tiles to reduce seams. | `24` |
| `--scale` | Super-resolution scale factor. | `2.0` |
| `--seed` | Random seed for reproducible results. | `0` |
Note: The original FlashVSR is primarily designed and tested for 4x super-resolution. While other scales are supported, for optimal quality and stability, using --scale 4.0 is recommended.
For a full list of arguments, run python infer.py --help.
The main interface is the unified infer.py script.
```shell
# Basic upscaling (tiny mode - balanced quality/speed)
python infer.py -i ./inputs/example0.mp4 -o ./results/ --mode tiny

# Full mode (highest quality, requires more VRAM)
python infer.py -i ./inputs/example0.mp4 -o ./results/ --mode full

# Tiny-long mode for long videos
python infer.py -i ./inputs/example4.mp4 -o ./results/ --mode tiny-long
```

```shell
# Preserve the audio track from the input video
python infer.py -i inputs/input_with_audio.mp4 -o ./results/ --mode tiny --keep-audio

# Use tiled DiT inference to reduce VRAM usage (enables running on smaller GPUs)
python infer.py -i inputs/large_input.mp4 -o ./results/ --mode tiny --tile-dit --tile-size 256 --overlap 24

# Use tiled VAE for lower memory usage in full mode
python infer.py -i inputs/input.mp4 -o ./results/ --mode full --tile-vae

# Combine multiple enhancements
python infer.py -i inputs/large_input_with_audio.mp4 -o ./results/ --mode full --tile-vae --tile-dit --keep-audio
```

FlashVSR-Pro includes comprehensive performance enhancements designed for high-end GPUs (A100/H100/H200/RTX 4090, etc.) and high-throughput production environments.
- Mechanism: Implements a fully GPU-resident data pipeline. Input frames are read in batches and immediately transferred to VRAM. Resizing, cropping, and normalization all happen on the GPU.
- Benefit: Eliminates the "CPU -> GPU -> CPU" round-trip bottleneck found in standard implementations. Input tensors stay on the GPU throughout the entire inference lifecycle.
- TF32 Support: Explicitly enables TensorFloat-32 (TF32) for matrix multiplications and convolutions on Ampere+ architectures.
- cuDNN Benchmarking: Activates `torch.backends.cudnn.benchmark = True`, allowing the driver to auto-tune convolution algorithms for the specific input resolution during the first pass.
- Impact: Significantly boosts raw compute throughput for FP32/BF16 operations without code changes.
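These correspond to standard PyTorch backend switches; a minimal sketch of enabling them is below (the exact call sites in FlashVSR-Pro may differ):

```python
import torch

# TensorFloat-32 for matmuls and convolutions
# (effective on Ampere+ GPUs; a harmless no-op elsewhere)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Let cuDNN auto-tune convolution algorithms for the first-seen input shape
torch.backends.cudnn.benchmark = True
```

Note that `cudnn.benchmark` pays a one-time tuning cost on the first pass, which is why it suits video pipelines where every frame has the same resolution.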
- Mechanism: Removes unnecessary `torch.cuda.synchronize()` calls from the critical inference path.
- Benefit: Allows the CPU to issue CUDA kernels ahead of execution, fully filling the GPU's instruction pipeline and maximizing utilization.
- Mechanism: Bypasses the standard PIL (Python Imaging Library) conversion stack. The pipeline passes raw NumPy arrays directly to the video encoder (FFmpeg).
- Benefit: Saves thousands of CPU cycles per second by avoiding the "Tensor -> NumPy -> PIL Image -> NumPy -> FFmpeg" conversion chain.
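This can be sketched as an FFmpeg invocation that consumes raw frames over stdin. The helper name and flag choices here are illustrative assumptions, not the project's actual encoder setup:

```python
def ffmpeg_rawvideo_cmd(width: int, height: int, fps: int, out_path: str,
                        use_nvenc: bool = False) -> list:
    """Build an ffmpeg command that reads raw RGB24 frames from stdin,
    skipping the Tensor -> PIL -> file round-trip entirely."""
    codec = "h264_nvenc" if use_nvenc else "libx264"
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",                    # frames arrive on stdin
        "-c:v", codec, "-pix_fmt", "yuv420p",
        out_path,
    ]

# Feed frames with: proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
# then proc.stdin.write(frame.tobytes()) for each HxWx3 uint8 array.
print(ffmpeg_rawvideo_cmd(1280, 720, 30, "out.mp4"))
```

The `use_nvenc` switch illustrates where a hardware encoder would be selected; a robust pipeline should first test-encode a frame and fall back to `libx264` if NVENC is unavailable, as described for the NVENC feature below.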
- Mechanism: Automatically detects and utilizes NVIDIA's NVENC hardware encoder when available. It includes a robust fallback mechanism that performs a functional test of the encoder before usage, ensuring stability even in varied container environments.
- Benefit: Offloads video compression from the CPU to the GPU's dedicated media engine. This drastically reduces CPU load and I/O latency during the final video saving stage, ensuring that high-throughput inference isn't bottlenecked by disk write speeds.
Symptom: Compilation errors, missing CUDA libraries, or general installation failures on Windows.
Solution:
- Recommended: Use Docker Desktop with WSL 2 backend (see Platform Support section above)
- Ensure you're running commands in WSL 2 Ubuntu terminal, not PowerShell or CMD
- Verify the GPU is accessible in WSL: `nvidia-smi` should work in the WSL terminal
- If using native Windows: ensure Visual Studio 2019+ with C++17 support is installed
- Check that CUDA Toolkit matches PyTorch CUDA version
Symptom: Docker commands fail with permission errors.
Solution:
- Add your user to the docker group: `sudo usermod -aG docker $USER`
- Log out and back in for the change to take effect
- Or prefix docker commands with `sudo`
Symptom: nvidia-smi fails in WSL or Docker can't access GPU.
Solution:
- Update Windows to latest version (GPU support requires Windows 11 or Windows 10 21H2+)
- Install latest NVIDIA GPU driver for Windows (not Linux driver)
- Verify in WSL: `nvidia-smi` should show your GPU
- Ensure Docker Desktop has WSL 2 integration enabled
- Check Docker GPU access: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`
- Use `--tile-dit` and `--tile-vae` to enable tiled inference.
- Decrease `--tile-size` (e.g., from 256 to 128).
- Use `--dtype fp16` to reduce memory usage.
- Process shorter video segments or lower-resolution inputs.
- Ensure ffmpeg is installed on the host system (or in WSL for Windows users).
- Check whether the input video contains an audio stream: `ffprobe -i input.mp4`
- Verify ffmpeg codec support: `ffmpeg -codecs | grep h264`
- Make sure the model weights were downloaded with git-lfs: `git lfs pull`
- If the download is incomplete, files may be text pointers instead of actual weights
- Verify model file integrity (files should be several GB in size)
- Check that the weight files are in the correct directory: `./models/FlashVSR-v1.1/`
- Ensure git-lfs was installed BEFORE cloning: `git lfs install`
- The Docker build process should handle this automatically. If building manually, ensure you have CUDA 12.1+ and the correct PyTorch version installed.
- Reference the fix I released in: mit-han-lab/Block-Sparse-Attention#16
- On Windows: Compilation may fail due to missing Visual Studio C++ build tools
- Check available CUDA architectures match your GPU compute capability
- Ensure you're using CUDA 12.x for best performance
- Enable TF32: automatically enabled on Ampere+ GPUs
- Check GPU utilization: `nvidia-smi` should show near 100% during processing
- Verify you're not CPU-bottlenecked by checking CPU usage
- Consider using faster storage (SSD) for large video files
A: Yes, but with specific requirements. Windows is not the primary development platform, but you can use FlashVSR-Pro on Windows through:
- Docker Desktop with WSL 2 (Recommended) - Provides Linux environment with GPU access
- WSL 2 directly - Run Ubuntu in Windows and follow Linux instructions
- Native Windows (Advanced, not officially supported) - Requires manual compilation
See the Platform Support section for detailed instructions.
A: The Block-Sparse-Attention dependency requires CUDA kernel compilation, which is primarily tested on Linux. The Docker/WSL 2 approach avoids these compilation issues by using a pre-configured Linux environment.
A: You need an NVIDIA GPU with compute capability 8.0 or higher (Ampere generation or newer). Recommended GPUs include:
- Consumer: RTX 3060, RTX 3070, RTX 3090, RTX 4070, RTX 4090
- Professional: A100, H100, H200, A6000, RTX 6000 Ada
A: No. FlashVSR-Pro requires NVIDIA GPUs with CUDA support due to the Block-Sparse-Attention kernels being written specifically for NVIDIA CUDA.
A:
- Minimum: 8GB with tiling enabled (`--tile-dit --tile-vae`)
- Recommended: 16GB for 1080p/1440p video processing
- Optimal: 24GB+ for 4K videos without tiling
A: No. FlashVSR-Pro requires NVIDIA CUDA, which is not available on macOS. Apple Silicon (M1/M2/M3) uses different GPU architecture and is not compatible.
A: We recommend the Docker approach as it handles all dependencies automatically:
- Install Docker Desktop (with WSL 2 if on Windows)
- Follow the Quick Start with Docker guide
- This eliminates most installation issues related to CUDA, compilation, and dependencies
A: Yes! Cloud platforms with NVIDIA GPUs (Google Colab with T4/A100, AWS with GPU instances, etc.) work well. Use the Docker method or follow Linux installation instructions.
```
FlashVSR-Pro/
├── .gitmodules              # Git submodule configuration
├── Block-Sparse-Attention/  # Git submodule: sparse attention backend (with automated build)
├── models/                  # Model weights directory
│   ├── FlashVSR-v1.1/       # Model weights v1.1
│   └── prompt_tensor/       # Pre-computed text prompt embeddings
├── diffsynth/               # Core library (ModelManager, pipelines)
├── inputs/                  # Default directory for input videos/images
├── results/                 # Default directory for output videos
├── utils/                   # Enhanced utilities module
│   ├── __init__.py
│   ├── utils.py             # Core utilities (Causal_LQ4x_Proj, etc.)
│   ├── TCDecoder.py         # Tiny Conditional Decoder for 'tiny' mode
│   ├── audio_utils.py       # Audio preservation functions
│   ├── tile_utils.py        # Tiled inference for low VRAM
│   └── vae_manager.py       # VAE manager for multiple-VAE support
├── infer.py                 # Main unified inference script
├── Dockerfile               # Container definition with auto-activation
├── entrypoint.sh            # Container entry script
├── requirements.txt         # Python dependencies
├── setup.py                 # Package setup for the `diffsynth` module
├── LICENSE                  # Project license file
└── README.md                # This file
```
This project is released under the same license (the Apache License) as the original FlashVSR implementation. Please see the LICENSE file in the original repository for details.
If you use the FlashVSR algorithm in your research, please cite the original FlashVSR paper:
```bibtex
@article{zhuang2025flashvsr,
  title={FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution},
  author={Zhuang, Junhao and Guo, Shi and Cai, Xin and Li, Xiaohui and Liu, Yihao and Yuan, Chun and Xue, Tianfan},
  journal={arXiv preprint arXiv:2510.12747},
  year={2025}
}
```

If you use this implementation (FlashVSR-Pro) in your work, please also cite:
- This repository: https://github.com/LujiaJin/FlashVSR-Pro
- The optimized Block-Sparse-Attention backend: https://github.com/LujiaJin/Block-Sparse-Attention
- The core algorithm is from the original FlashVSR authors.
- Inspiration for the audio preservation and tiled DiT inference features came from the community project FlashVSR_plus.
- The automated build and optimization of the Block-Sparse-Attention backend is a contribution of this project, with improvements documented in mit-han-lab/Block-Sparse-Attention#16.
Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.
Happy Super-Resolution! 🚀