Better with Less: Tackling Heterogeneous Multi-Modal Image Joint Pretraining via Conditioned and Degraded Masked Autoencoder
Motivation | CoDe-MAE | OSPretrain-1M | Analysis | Results | Usage & Resources | License | Acknowledgements
Geospatially registered optical and SAR image pairs are pivotal for optical-SAR joint representation learning/pretraining. Existing methods mainly focus on medium-resolution (MR) imagery such as Sentinel-1/2 and rely on either explicit contrastive alignment objectives (e.g., CROMA and SwinSSL) or implicit cross-modal masked image modeling (e.g., msGFM). However, high-resolution (HR) images are essential for a finer-grained understanding of Earth, and these rigid alignment strategies fall into the InfoMin trap, discarding modality-unique specifics/physics.
To teach the model better modality synergy, i.e., a better understanding of inter-modal homogeneity and intra-modal heterogeneity, we propose the Conditioned and Degraded MAE (CoDe-MAE). Anchored by optical knowledge distillation and a shared architecture that establish a robust semantic baseline, CoDe-MAE shifts from conventional rigid alignment to a paradigm of better synergy with less alignment. To bridge the severe heterogeneity of HR optical-SAR imagery, it introduces two synergistic mechanisms: Conditioned Contrastive Learning (CCL), which selectively aligns shared semantics via cross-attention without over-constraining the original representations, and Cross-modal Degraded Reconstruction (CDR), which avoids ill-posed full-information recovery by predicting spectrally degraded targets. The singular value spectrum of the learned features makes representation collapse, and our non-destructive synergy, explicit: conventional rigid alignment causes severe dimensional collapse (rapid spectral decay), whereas CCL acts as a soft bottleneck that preserves the baseline's capacity and CDR further enhances it. Together, CoDe-MAE achieves the highest effective feature rank (a significantly flatter spectral tail).
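To make the collapse diagnostic concrete, here is a minimal sketch (our illustration, not the repository's code) that computes the normalized log singular value spectrum and an entropy-based effective rank from a batch of pooled encoder features; it assumes `features` is an `(N, D)` tensor of embeddings.

```python
import torch
import torch.nn.functional as F

def singular_value_spectrum(features: torch.Tensor) -> torch.Tensor:
    """Normalized log singular value spectrum of an (N, D) feature matrix.

    A steep decay signals dimensional collapse; a flat tail signals a
    high effective feature rank.
    """
    z = features - features.mean(dim=0, keepdim=True)  # center per dimension
    z = F.normalize(z, dim=1)                          # unit-norm rows
    s = torch.linalg.svdvals(z)                        # singular values, descending
    return torch.log(s / s[0] + 1e-12)                 # normalize by the largest value

def effective_rank(features: torch.Tensor) -> float:
    """Entropy-based effective rank (Roy & Vetterli, 2007)."""
    s = torch.linalg.svdvals(features - features.mean(dim=0, keepdim=True))
    p = s / s.sum()                                    # spectrum as a distribution
    return torch.exp(-(p * torch.log(p + 1e-12)).sum()).item()
```

Under this measure, a rapidly decaying spectrum corresponds to a low effective rank (collapse), while the flatter tail we report for CoDe-MAE corresponds to a higher one.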
Pretrained on our curated OSPretrain-1M dataset, CoDe-MAE establishes new SOTA performance across diverse single- and dual-modal downstream tasks, substantially outperforming foundation models pretrained on vastly larger datasets.
Following FG-MAE and SwinSSL, we evaluate 10-shot classification on three HR (PIE, DDHR-SK, WHU) and three MR (DFC20, BigEarthNet-MM, EuroSAT-MM) datasets to gauge intrinsic representation quality without full fine-tuning bias. Utilizing a modality-shared backbone, CoDe-MAE consistently outperforms existing reproduced dual-modal foundation models across both modalities, confirming that our framework extracts superior representations while circumventing feature collapse.
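For reference, the protocol amounts to fitting a linear classifier on frozen features with 10 labeled samples per class. The sketch below uses illustrative names (`encoder`, `support_x`, etc.), not the repository's API:

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def ten_shot_linear_probe(encoder, support_x, support_y, query_x, query_y):
    """10-shot linear probe on frozen features; all names are illustrative.

    support_x: images with 10 labeled examples per class
    query_x:   held-out images used to score the probe
    """
    encoder.eval()
    feats_s = encoder(support_x).cpu().numpy()  # (10 * num_classes, D)
    feats_q = encoder(query_x).cpu().numpy()
    clf = LogisticRegression(max_iter=1000).fit(feats_s, support_y.numpy())
    return (clf.predict(feats_q) == query_y.numpy()).mean()  # accuracy
```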
We further assess complex bi-modal/temporal building damage assessment on the BRIGHT dataset (IEEE DFC25-T2 split), where CoDe-MAE surpasses the previous SOTA dual-modal model, MaRS, by a notable 2.26 mIoU, demonstrating robust applicability to real-world disaster response.
Joint optical-SAR models typically suffer negative transfer on single-modal tasks, as rigid alignment compromises distinct physical priors. To rigorously test our capacity to prevent this representation collapse, we evaluate CoDe-MAE against specialized single-modal models on established benchmarks.
We evaluate classification (AID, RESISC45), detection (DIOR), and segmentation (LoveDA). CoDe-MAE establishes new Base-level SOTA performance. Despite pretraining on merely 1M samples, an order of magnitude fewer than OpticalRS-13M and SkySense-21.5M, CoDe-MAE outperforms larger-scale optical-exclusive models (e.g., surpassing SelectiveMAE-L by 1.2 mAP$_{50}$ on DIOR and 1.01 mIoU on LoveDA), demonstrating remarkable data efficiency.
On SAR-only target classification (FUSAR-SHIP, MSTAR, SAR-ACD) and detection (SARDet-100K), CoDe-MAE achieves SOTA across all datasets, crucially surpassing specialized SAR-specific models (e.g., SARMAE, SARATR-X). Outperforming these domain experts confirms that our strategy preserves and enhances the modality-unique microwave scattering characteristics essential for SAR interpretation.
We provide OSPretrain-1M, the code, and most weights (final and ablation models, both pretrained and fine-tuned) via BaiduNetDisk.
- Data Attribution: OSPretrain-1M is a curated collection derived from 15 public sources. We gratefully acknowledge the creators. By downloading this dataset, you agree to comply with the respective original licenses and to cite the respective papers.
1. Pretraining: See pretrain/README.md for a quick start.
2. Linear probing: See linearprobing/README.md.
3. Classification (single-modal benchmarks): See classification/README.md.
4. Object detection: See detection/README.md.
5. Semantic segmentation: See segmentation/README.md.
6. DFC25-T2 (BRIGHT): See bright/README.md.
We deeply acknowledge the open-source community for making the constituent datasets of OSPretrain-1M available.
This repository builds upon the following open-source works:
zhangxiaosong18/hivit
sunsmarterjie/iTPN
waterdisappear/SARATR-X
waterdisappear/SAR-JEPA
MiliLab/SelectiveMAE
ChenHongruixuan/BRIGHT
- Code: Licensed under the Apache License 2.0.
- Dataset: OSPretrain-1M is provided for academic research purposes only.
- If you have any questions, please contact us via pbow16@nudt.edu.cn.
- If you find our work useful, please give us a star 🌟 on GitHub and cite our paper in the following BibTeX format:
@ARTICLE{peng2026codemae,
  author={Peng, Bowen and Liu, Yongxiang and Zhou, Jie and Chen, Xiaodong and Yu, Xiaogang and Liu, Li},
  journal={arXiv preprint arXiv:2604.16952},
  title={{Better with Less}: Tackling Heterogeneous Multi-Modal Image Joint Pretraining via Conditioned and Degraded Masked Autoencoder},
  year={2026},
}