Takuya Murakawa, Takumi Fukuzawa, Ning Ding, Toru Tamaki
Nagoya Institute of Technology
IWAIT 2026
Comparison video: `comparison.mp4`
- Create and activate a virtual environment using Python 3.12 (or 3.10+):

  ```bash
  python3.12 -m venv .venv
  source .venv/bin/activate
  ```

- Install all dependencies from the `requirements.txt` file:

  ```bash
  pip install -r requirements.txt
  ```

Note: Make sure you have Python 3.10 or later installed. Our testing environment uses Python 3.12.3 with PyTorch 2.8.0+cu128 and CUDA 13.1.
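If you want to confirm that your environment matches, you can check the installed versions, for example:

```bash
python --version
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```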
If you encounter the following error during setup:
```
ImportError: cannot import name 'cached_download' from 'huggingface_hub'
```

Run the following command to fix it:

```bash
pip install huggingface-hub==0.25.2
```

Reference: Stack Overflow - ImportError: cannot import name 'cached_download' from 'huggingface_hub'
Before you can run the project, you need to download the following:
- Pre-trained Stable Diffusion Model Weights:

  We use the VAE encoder and decoder from the Stable Diffusion model. To get the pre-trained Stable Diffusion v1.5 weights, download them from the following link:
  https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5

  ```bash
  huggingface-cli download stable-diffusion-v1-5/stable-diffusion-v1-5 \
      --local-dir ./stable-diffusion-v1-5
  ```
- M3DDM+ Model Checkpoints:

  To get the pre-trained M3DDM+ model weights, download them from the Hugging Face repository:
  https://huggingface.co/MurakawaTakuya/M3DDM-Plus

  ```bash
  huggingface-cli download MurakawaTakuya/M3DDM-Plus \
      --local-dir ./M3DDM-Plus
  ```
After downloading the models, your directory should look like this:
```
M3DDM-Plus/
├── src/                          # Source code
│   ├── inference.py
│   ├── evaluate.py
│   ├── train.py
│   ├── model/
│   └── pipelines/
├── stable-diffusion-v1-5/        # SD v1.5 weights (VAE + scheduler)
│   ├── scheduler/
│   │   └── scheduler_config.json
│   └── vae/
│       ├── config.json
│       └── diffusion_pytorch_model.bin
├── M3DDM-Plus/                   # M3DDM+ model weights
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── sample/                       # Sample videos (optional)
    ├── bear.mp4
    └── ...
```
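As an optional sanity check that the Stable Diffusion VAE weights downloaded correctly, here is a minimal sketch (assuming `diffusers` is installed via `requirements.txt`; the path matches the layout above):

```python
# Sanity check: load the SD v1.5 VAE from the downloaded weights.
# This only verifies the files are readable; it is not part of the project's scripts.
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("./stable-diffusion-v1-5", subfolder="vae")
print(f"VAE loaded with {sum(p.numel() for p in vae.parameters()):,} parameters")
```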
```mermaid
flowchart LR
    train.py -->|instantiates| evaluate.py
    evaluate.py -->|instantiates| inference.py
```
- `inference.py` runs video outpainting on a single input video.
- `evaluate.py` crops each video in a dataset by a specified ratio, runs outpainting to reconstruct the cropped region, and computes metrics (MSE, PSNR, SSIM, LPIPS, BMSE) against the original.
- `train.py` trains the model. At the end of each epoch, it optionally calls `evaluate.py` to run outpainting on a real dataset (separate from the loss-based validation step), so you can visually and quantitatively track generation quality during training.
Takes a single input video and expands it to a specified aspect ratio using outpainting.
Try with Samples
The sample videos in the sample/ directory are taken from the DAVIS dataset. You can quickly test the model using these clips.
```bash
CUDA_VISIBLE_DEVICES=0 python src/inference.py \
    --input_video_path "sample/bear.mp4" \
    --pretrained_sd_dir "stable-diffusion-v1-5" \
    --video_outpainting_model_dir "M3DDM-Plus" \
    --output_dir "sample/output/bear" \
    --target_ratio_list "1:1" \
    --output_size 256
```

You can run the inference code with the following command:
```bash
CUDA_VISIBLE_DEVICES=0 python src/inference.py \
    --input_video_path "path/to/input_video.mp4" \
    --pretrained_sd_dir "stable-diffusion-v1-5" \
    --video_outpainting_model_dir "M3DDM-Plus" \
    --output_dir "path/to/output_directory" \
    --target_ratio_list "1:1" \
    --output_size 256
```

Parameters
- `video_outpainting_model_dir`: The directory where the video-outpainting model weights are stored.
- `target_ratio_list`: The aspect ratio for the output video. You can input a single value such as "1:1", "16:9", or "9:16", or a list like "16:9,9:16". For better results, we recommend inputting a single value.
Inference requires approximately 13GB of VRAM at 256x256 resolution on a single NVIDIA RTX 8000. (Increasing the number of frames does not increase GPU memory usage.)
To save GPU memory, you can use PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
Also, using --enable_attention_slicing will reduce memory consumption at the cost of inference speed.
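For example, combining both memory-saving options with the sample command above (this is simply the same invocation with the environment variable and flag added):

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_VISIBLE_DEVICES=0 python src/inference.py \
    --input_video_path "sample/bear.mp4" \
    --pretrained_sd_dir "stable-diffusion-v1-5" \
    --video_outpainting_model_dir "M3DDM-Plus" \
    --output_dir "sample/output/bear" \
    --target_ratio_list "1:1" \
    --output_size 256 \
    --enable_attention_slicing
```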
Training fine-tunes the M3DDM+ model weights. The VAE (from Stable Diffusion v1.5) is frozen throughout; only the 3D UNet loaded from video_outpainting_model_dir is updated.
You can run the training code with the following command:
```bash
CUDA_VISIBLE_DEVICES=1 python src/train.py \
    --data_dir "path/to/dataset/directory" \
    --size 128 \
    --epochs 5 \
    --lr 1e-5 \
    --pretrained_sd_dir "stable-diffusion-v1-5" \
    --video_model_dir "M3DDM-Plus" \
    --gpus 1 \
    --output_dir "output" \
    --max_samples 10000 \
    --eval_video_dir "path/to/evaluation_video_directory" \
    --eval_crop_ratio 0.25 \
    --eval_crop_axis "horizontal" \
    --eval_target_ratio_list "16:9" \
    --limit_val_batches 1000
```

Parameters
- `data_dir`: The directory where the training data is stored. It should contain `/train` and `/val` directories (see the layout sketch below).
- `video_model_dir`: The directory where the video-outpainting model weights are stored.
- `output_dir`: The directory where the training results will be saved.
- `max_samples`: The maximum number of samples to use for training.
- `eval_video_dir`: The directory where the evaluation data is stored.
- `eval_crop_ratio`: The ratio by which each evaluation video is cropped before outpainting.
- `eval_crop_axis`: The axis along which the evaluation videos are cropped.
- `eval_target_ratio_list`: The aspect ratio for the output video. You can input a single value such as "1:1", "16:9", or "9:16", or a list like "16:9,9:16". For better results, we recommend inputting a single value.
- `limit_val_batches`: The number of videos to use for validation.
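For reference, a minimal sketch of the expected `data_dir` layout, assuming videos are placed under the two splits (file names are only illustrative):

```
dataset/
├── train/
│   ├── video_0001.mp4
│   └── ...
└── val/
    ├── video_0101.mp4
    └── ...
```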
Use --disable_validation to disable validation.
Training requires approximately 28GB of VRAM at 128x128 resolution on a single NVIDIA RTX 8000.
To reduce GPU memory usage, you can enable --enable_unet_gradient_checkpointing, which will reduce memory consumption at the cost of training speed.
Takes a folder of videos, crops each by a specified ratio to simulate a narrower input, runs outpainting to reconstruct the cropped region, and computes metrics (MSE, PSNR, SSIM, LPIPS, BMSE) by comparing the generated output against the original.
```bash
CUDA_VISIBLE_DEVICES=0 python src/evaluate.py \
    --video_dir "path/to/data/directory" \
    --pretrained_sd_dir "stable-diffusion-v1-5" \
    --video_outpainting_model_dir "M3DDM-Plus" \
    --target_ratio_list "16:9" \
    --crop_ratio 0.25 \
    --crop_axis "horizontal" \
    --output_size 256 \
    --limit_outpainting_frames -1
```

Parameters
- `video_dir`: The directory where the evaluation data is stored.
- `pretrained_sd_dir`: The directory where the pre-trained Stable Diffusion model weights are stored.
- `video_outpainting_model_dir`: The directory where the video-outpainting model weights are stored.
- `target_ratio_list`: The aspect ratio for the output video. You can input a single value such as "1:1", "16:9", or "9:16", or a list like "16:9,9:16". For better results, we recommend inputting a single value.
- `crop_ratio`: The ratio by which each evaluation video is cropped before outpainting.
- `crop_axis`: The axis along which the evaluation videos are cropped.
- `output_size`: The size of the output video.
- `limit_outpainting_frames`: The number of frames to use for outpainting. Use `-1` to use all frames.
Evaluation requires the same amount of VRAM as inference; the total runtime is roughly the inference time multiplied by the number of evaluation videos.
To save GPU memory, you can use PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
Also, using --enable_attention_slicing will reduce memory consumption at the cost of inference speed.
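For reference, here is a minimal sketch of how per-frame MSE/PSNR/SSIM can be computed (using scikit-image, which may need to be installed separately). This is an illustration only, not the repository's `evaluate.py`, which also reports LPIPS and BMSE:

```python
# Illustrative per-frame metrics between an original frame and a generated frame.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(original: np.ndarray, generated: np.ndarray) -> dict:
    """original, generated: uint8 arrays of shape (H, W, 3) with identical sizes."""
    mse = float(np.mean((original.astype(np.float64) - generated.astype(np.float64)) ** 2))
    return {
        "mse": mse,
        "psnr": peak_signal_noise_ratio(original, generated, data_range=255),
        "ssim": structural_similarity(original, generated, channel_axis=-1, data_range=255),
    }
```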
This project uses Comet ML for experiment tracking and logging.
Add --disable_comet (or -dc) to disable logging to Comet.
Comet configuration uses two configuration files:
- `./.comet.config` in this project directory
- `~/.comet.config` in your home directory
For more details, refer to the Comet configuration documentation.
Important: Do not write API keys directly in code.
Create ~/.comet.config with settings common to all your projects as follows:
```ini
[comet]
api_key=XXXXXHereIsYourAPIKeyXXXXXXXX
workspace=your_workspace_name

[comet_logging]
hide_api_key=True
```

- Set your Comet API key and default workspace
- Set `hide_api_key=True` to prevent API keys from appearing in logs
Copy the example configuration file .comet.config.example and name it .comet.config:
```bash
cp .comet.config.example .comet.config
```

Then edit `./.comet.config` with your data:
```ini
[comet]
workspace=your_workspace_name  # Change to your workspace name (comet user name)
project_name=M3DDM-Plus

[comet_logging]
file=comet_logs/comet_{project}_{datetime}.log  # Change the path to your desired location (optional)
```

- Settings here override those in `~/.comet.config`
If you find our work helpful, please consider giving the repo a ⭐.
Please consider citing our paper if you found our work interesting and useful.
```bibtex
@inproceedings{murakawa_IWAIT2026_M3DDMPlus,
  title={M3DDM+: An improved video outpainting by a modified masking strategy},
  author={Murakawa, Takuya and Fukuzawa, Takumi and Ding, Ning and Tamaki, Toru},
  booktitle={Proceedings of the International Workshop on Advanced Imaging Technology (IWAIT)},
  year={2026}
}
```

Please feel free to reach out to us:
- Email: t.murakawa.080@nitech.jp and tamaki.toru@nitech.ac.jp
The inference and pipeline code is based on the published code of M3DDM-Video-Outpainting. The original training and evaluation code is not published, so we reproduced it from the M3DDM paper and modified it for our proposed method.