# An unofficial PyTorch implementation of *Masked Autoencoders Are Scalable Vision Learners*

This repository is based on MAE-pytorch, thanks very much!

I'm running the extensive experiments mentioned in the paper; the reproduced performance still differs slightly from the original.
- implement the finetune process
- reuse the model in `modeling_pretrain.py`
- calculate the normalized pixel target
- add the cls token in the encoder
- visualization of the reconstructed image
- kNN and linear probing
- 2D sine-cosine position embeddings
- Fine-tuning semantic segmentation on Cityscapes & ADE20K
- Fine-tuning instance segmentation on COCO
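
The "normalized pixel target" item above follows the MAE recipe of regressing each masked patch against its own per-patch normalized pixels. A minimal NumPy sketch of that target computation (function names `patchify` and `normalized_pixel_target` are illustrative, not the repository's actual API):

```python
import numpy as np

def patchify(img, patch_size=16):
    """Split an (H, W, C) image into (N, patch_size*patch_size*C) flat patches."""
    h, w, c = img.shape
    gh, gw = h // patch_size, w // patch_size
    x = img[:gh * patch_size, :gw * patch_size]
    x = x.reshape(gh, patch_size, gw, patch_size, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)

def normalized_pixel_target(img, patch_size=16, eps=1e-6):
    """Normalize each patch by its own mean and variance, as in the
    'norm_pix_loss' variant of MAE."""
    patches = patchify(img.astype(np.float64), patch_size)
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)
```

Each row of the result has roughly zero mean and unit variance, so the reconstruction loss weights every patch's local contrast equally.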
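
The 2D sine-cosine position embeddings listed above are the fixed (non-learned) embeddings MAE uses: half of the channels encode the patch row index and half the column index, each with a standard 1D sine-cosine scheme. A hedged sketch (helper names are mine, not the repository's):

```python
import numpy as np

def sincos_1d(embed_dim, pos):
    """1D sine-cosine embedding of positions `pos`; embed_dim must be even."""
    omega = 1.0 / 10000 ** (np.arange(embed_dim // 2) / (embed_dim / 2.0))
    out = np.outer(pos, omega)                                 # (N, D/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)  # (N, D)

def sincos_2d(embed_dim, grid_size):
    """Fixed 2D sine-cosine position embedding over a grid_size x grid_size
    patch grid: first half of channels from the row index, second half
    from the column index."""
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    emb_y = sincos_1d(embed_dim // 2, ys.ravel())
    emb_x = sincos_1d(embed_dim // 2, xs.ravel())
    return np.concatenate([emb_y, emb_x], axis=1)  # (grid_size**2, embed_dim)
```

For ViT-Base at 224x224 with 16x16 patches, `sincos_2d(768, 14)` gives a `(196, 768)` table that is added to the patch tokens and never trained.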
```bash
pip install -r requirements.txt
```
- Pretrain & Finetune

```bash
bash pretrain.sh
```
- Visualization of reconstruction

```bash
# Set the path to save images
OUTPUT_DIR='output/'
# path to the image for visualization
IMAGE_PATH='files/ILSVRC2012_val_00031649.JPEG'
# path to the pretrained model
MODEL_PATH='/path/to/pretrain/checkpoint.pth'
# Note: only pretrained models with normalized pixel targets are currently supported
python run_mae_vis.py ${IMAGE_PATH} ${OUTPUT_DIR} ${MODEL_PATH}
```

| model | pretrain | finetune | accuracy | log | weight |
|---|---|---|---|---|---|
| vit-base | 800e (normed pixel) | 100e | 83.2% | - | - |
I would really appreciate your star!