
M3DDM+: An Improved Video Outpainting by a Modified Masking Strategy

Takuya Murakawa, Takumi Fukuzawa, Ning Ding, Toru Tamaki
Nagoya Institute of Technology

IWAIT 2026

arXiv Project Page Hugging Face

Comparison video: comparison.mp4

Environment Setup

  1. Create and activate a virtual environment using Python 3.12 (or 3.10+):
python3.12 -m venv .venv
source .venv/bin/activate
  2. Install all dependencies from the requirements.txt file:
pip install -r requirements.txt

Note: Make sure you have Python 3.10 or later installed. Our testing environment uses Python 3.12.3 with PyTorch 2.8.0+cu128 and CUDA 13.1.

Troubleshooting

If you encounter the following error during setup:

ImportError: cannot import name 'cached_download' from 'huggingface_hub'

Run the following command to fix it:

pip install huggingface-hub==0.25.2

Reference: Stack Overflow - ImportError: cannot import name 'cached_download' from 'huggingface_hub'

Download Models

Before you can run the project, you need to download the following:

  1. Pre-trained Stable Diffusion Model Weights:

    We use the VAE encoder and decoder from the Stable Diffusion model. Download the pre-trained Stable Diffusion v1.5 weights from the following link:
    https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5

    huggingface-cli download stable-diffusion-v1-5/stable-diffusion-v1-5 \
      --local-dir ./stable-diffusion-v1-5
  2. M3DDM+ Model Checkpoints:

    Download the pre-trained M3DDM+ model weights from the Hugging Face repository:
    https://huggingface.co/MurakawaTakuya/M3DDM-Plus

    huggingface-cli download MurakawaTakuya/M3DDM-Plus \
      --local-dir ./M3DDM-Plus

Directory Structure

After downloading the models, your directory should look like this:

M3DDM-Plus/
├── src/                        # Source code
│   ├── inference.py
│   ├── evaluate.py
│   ├── train.py
│   ├── model/
│   └── pipelines/
├── stable-diffusion-v1-5/      # SD v1.5 weights (VAE + scheduler)
│   ├── scheduler/
│   │   └── scheduler_config.json
│   └── vae/
│       ├── config.json
│       └── diffusion_pytorch_model.bin
├── M3DDM-Plus/                 # M3DDM+ model weights
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── sample/                     # Sample videos (optional)
    ├── bear.mp4
    └── ...
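After downloading, you can sanity-check the layout with a short script. This helper is illustrative and not part of the repository; the file list simply mirrors the tree above (sample/ is optional and not checked).

```python
from pathlib import Path

# Expected model files, mirroring the directory tree above.
EXPECTED = [
    "stable-diffusion-v1-5/scheduler/scheduler_config.json",
    "stable-diffusion-v1-5/vae/config.json",
    "stable-diffusion-v1-5/vae/diffusion_pytorch_model.bin",
    "M3DDM-Plus/config.json",
    "M3DDM-Plus/diffusion_pytorch_model.bin",
]

def missing_files(root: str, expected=EXPECTED) -> list[str]:
    """Return the expected model files that are missing under `root`."""
    return [p for p in expected if not (Path(root) / p).exists()]

if __name__ == "__main__":
    missing = missing_files(".")
    print("OK" if not missing else f"Missing: {missing}")
```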

Code Dependency

flowchart LR
    train.py -->|instantiates| evaluate.py
    evaluate.py -->|instantiates| inference.py

  • inference.py runs video outpainting on a single input video.
  • evaluate.py crops each video in a dataset by a specified ratio, runs outpainting to reconstruct the cropped region, and computes metrics (MSE, PSNR, SSIM, LPIPS, BMSE) against the original.
  • train.py trains the model. At the end of each epoch, it optionally calls evaluate.py to run outpainting on a real dataset — separate from the loss-based validation step — so you can visually and quantitatively track generation quality during training.

Inference

Takes a single input video and expands it to a specified aspect ratio using outpainting.

Try with Samples

The sample videos in the sample/ directory are taken from the DAVIS dataset. You can quickly test the model using these clips.

CUDA_VISIBLE_DEVICES=0 python src/inference.py \
  --input_video_path "sample/bear.mp4" \
  --pretrained_sd_dir "stable-diffusion-v1-5" \
  --video_outpainting_model_dir "M3DDM-Plus" \
  --output_dir "sample/output/bear" \
  --target_ratio_list "1:1" \
  --output_size 256

For your own videos, run the inference code with the following command:

CUDA_VISIBLE_DEVICES=0 python src/inference.py \
  --input_video_path "path/to/input_video.mp4" \
  --pretrained_sd_dir "stable-diffusion-v1-5" \
  --video_outpainting_model_dir "M3DDM-Plus" \
  --output_dir "path/to/output_directory" \
  --target_ratio_list "1:1" \
  --output_size 256

Parameters

  • video_outpainting_model_dir: The directory where the video-outpainting model weights are stored.
  • target_ratio_list: The aspect ratio for the output video. You can input a single value such as "1:1", "16:9", or "9:16", or you can input a list like "16:9,9:16". For better results, we recommend inputting a single value.
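For reference, a ratio string such as "16:9,9:16" splits into (width, height) pairs. The sketch below shows that parsing as illustrative arithmetic only; the actual CLI parsing lives in src/inference.py and may differ.

```python
def parse_ratio_list(spec: str) -> list[tuple[int, int]]:
    """Parse a ratio string such as "16:9,9:16" into (width, height) pairs."""
    ratios = []
    for item in spec.split(","):
        w, h = item.strip().split(":")  # each entry is "W:H"
        ratios.append((int(w), int(h)))
    return ratios

print(parse_ratio_list("16:9,9:16"))  # [(16, 9), (9, 16)]
```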

GPU Memory

Inference requires approximately 13GB of VRAM at 256x256 resolution on a single NVIDIA RTX 8000. Increasing the number of frames does not increase GPU memory usage.
To save GPU memory, you can use PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. Also, using --enable_attention_slicing will reduce memory consumption at the cost of inference speed.

Training

Training fine-tunes the M3DDM+ model weights. The VAE (from Stable Diffusion v1.5) is frozen throughout; only the 3D UNet loaded from video_outpainting_model_dir is updated.

You can run the training code with the following command:

CUDA_VISIBLE_DEVICES=1 python src/train.py \
  --data_dir "path/to/dataset/directory" \
  --size 128 \
  --epochs 5 \
  --lr 1e-5 \
  --pretrained_sd_dir "stable-diffusion-v1-5" \
  --video_model_dir "M3DDM-Plus" \
  --gpus 1 \
  --output_dir "output" \
  --max_samples 10000 \
  --eval_video_dir "path/to/evaluation_video_directory" \
  --eval_crop_ratio 0.25 \
  --eval_crop_axis "horizontal" \
  --eval_target_ratio_list "16:9" \
  --limit_val_batches 1000

Parameters

  • data_dir: The directory where the training data is stored. The directory should contain /train and /val directories.
  • video_model_dir: The directory where the video-outpainting model weights are stored.
  • output_dir: The directory where the training results will be saved.
  • max_samples: The maximum number of samples to use for training.
  • eval_video_dir: The directory where the evaluation data is stored.
  • eval_crop_ratio: The fraction of each evaluation video that is cropped away and then reconstructed by outpainting.
  • eval_crop_axis: The axis along which the evaluation videos are cropped ("horizontal" or "vertical").
  • eval_target_ratio_list: The aspect ratio for the output video. You can input a single value such as "1:1", "16:9", or "9:16", or you can input a list like "16:9,9:16". For better results, we recommend inputting a single value.
  • limit_val_batches: The number of videos to use for validation.

Use --disable_validation to disable validation.

GPU Memory

Training requires approximately 28GB of VRAM at 128x128 resolution on a single NVIDIA RTX 8000.
To reduce GPU memory usage, you can enable --enable_unet_gradient_checkpointing, which will reduce memory consumption at the cost of training speed.

Evaluation

Takes a folder of videos, crops each by a specified ratio to simulate a narrower input, runs outpainting to reconstruct the cropped region, and computes metrics (MSE, PSNR, SSIM, LPIPS, BMSE) by comparing the generated output against the original.
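Of the listed metrics, PSNR follows directly from MSE. A minimal sketch of that relationship, assuming pixel values normalized to [0, 1] (the repository's own metric code may compute it differently):

```python
import math

def psnr(mse: float, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB, derived from mean squared error."""
    if mse == 0:
        return float("inf")  # identical videos
    return 10.0 * math.log10((max_val ** 2) / mse)

print(round(psnr(0.01), 6))  # 20.0 dB for MSE = 0.01
```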

CUDA_VISIBLE_DEVICES=0 python src/evaluate.py \
  --video_dir "path/to/data/directory" \
  --pretrained_sd_dir "stable-diffusion-v1-5" \
  --video_outpainting_model_dir "M3DDM-Plus" \
  --target_ratio_list "16:9" \
  --crop_ratio 0.25 \
  --crop_axis "horizontal" \
  --output_size 256 \
  --limit_outpainting_frames -1

Parameters

  • video_dir: The directory where the evaluation data is stored.
  • pretrained_sd_dir: The directory where the pre-trained stable diffusion model weights are stored.
  • video_outpainting_model_dir: The directory where the video-outpainting model weights are stored.
  • target_ratio_list: The aspect ratio for the output video. You can input a single value such as "1:1", "16:9", or "9:16", or you can input a list like "16:9,9:16". For better results, we recommend inputting a single value.
  • crop_ratio: The fraction of each video that is cropped away and then reconstructed by outpainting (e.g. 0.25 removes 25% of the frame).
  • crop_axis: The axis along which the videos are cropped ("horizontal" or "vertical").
  • output_size: The size of the output video.
  • limit_outpainting_frames: The number of frames to use for outpainting. Use -1 to use all frames.
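As an illustration of how crop_ratio and crop_axis interact, the arithmetic below computes the simulated narrow-input size. This is a sketch of the idea only; it assumes "horizontal" crops along the width, so check src/evaluate.py for the exact behavior.

```python
def cropped_size(width: int, height: int, crop_ratio: float, crop_axis: str) -> tuple[int, int]:
    """Frame size after cropping away `crop_ratio` of the frame along `crop_axis`.

    Assumption: "horizontal" removes a fraction of the width,
    "vertical" removes a fraction of the height.
    """
    if crop_axis == "horizontal":
        return int(width * (1 - crop_ratio)), height
    if crop_axis == "vertical":
        return width, int(height * (1 - crop_ratio))
    raise ValueError(f"unknown crop_axis: {crop_axis!r}")

print(cropped_size(256, 256, 0.25, "horizontal"))  # (192, 256)
```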

GPU Memory

Evaluation requires the same amount of VRAM and time as inference, multiplied by the number of evaluation videos. To save GPU memory, you can use PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. Also, using --enable_attention_slicing will reduce memory consumption at the cost of inference speed.

Logging

This project uses Comet ML for experiment tracking and logging.

Add --disable_comet (or -dc) to disable logging to Comet.

Comet Configuration

Comet reads two configuration files:

  • ./.comet.config in this project directory
  • ~/.comet.config in your home directory

For more details, refer to the Comet configuration documentation.

Important: Do not write API keys directly in code.

Global Configuration (Home Directory)

Create ~/.comet.config with settings common to all your projects as follows:

[comet]
api_key=XXXXXHereIsYourAPIKeyXXXXXXXX
workspace=your_workspace_name

[comet_logging]
hide_api_key=True
  • Set your Comet API key and default workspace
  • Set hide_api_key=True to prevent API keys from appearing in logs

Project Configuration

Copy the example configuration file .comet.config.example and name it .comet.config:

cp .comet.config.example .comet.config

Then edit ./.comet.config with your data:

[comet]
workspace=your_workspace_name # Change to your workspace name (comet user name)
project_name=M3DDM-Plus

[comet_logging]
file=comet_logs/comet_{project}_{datetime}.log # Change the path to your desired location (optional)
  • Settings here override those in ~/.comet.config

Citation

If you find our work helpful, please ⭐ the repo.

Please consider citing our paper if you found our work interesting and useful.

@inproceedings{murakawa_IWAIT2026_M3DDMPlus,
  title={M3DDM+: An improved video outpainting by a modified masking strategy},
  author={Murakawa, Takuya and Fukuzawa, Takumi and Ding, Ning and Tamaki, Toru},
  booktitle={Proceedings of the International Workshop on Advanced Imaging Technology (IWAIT)},
  year={2026}
}

Contact us

Please feel free to reach out to us:

Acknowledgement

The inference and pipeline code is based on the published code of M3DDM-Video-Outpainting. Because the original training and evaluation code is not published, we reproduced it from the M3DDM paper and modified it for our proposed method.

