TokenFusion Description

TokenFusion is a multimodal token fusion method tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments are conducted on a variety of homogeneous and heterogeneous modalities and demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images.
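The substitution step described above can be illustrated with a minimal sketch in plain Python. It is not the code in models/modules.py. The function name `token_exchange`, the identity projection, and the threshold value are assumptions made for illustration. The importance scores, which a trained model would predict, are passed in directly, and the two modalities are assumed to be spatially aligned so that tokens at the same index correspond.

```python
from typing import Callable, List, Sequence

Token = List[float]


def identity_projection(token: Sequence[float]) -> Token:
    """Placeholder for a learned inter-modal projection (assumption)."""
    return list(token)


def token_exchange(
    tokens_a: Sequence[Sequence[float]],
    tokens_b: Sequence[Sequence[float]],
    scores_a: Sequence[float],
    project_b_to_a: Callable[[Sequence[float]], Token] = identity_projection,
    threshold: float = 0.02,  # illustrative value, not taken from this README
) -> List[Token]:
    """Replace low-scoring tokens of modality A with projected tokens of B.

    A token whose importance score falls below `threshold` is treated as
    uninformative and is substituted by the projected token at the same
    position in the other modality. All other tokens are kept unchanged.
    """
    fused = []
    for tok_a, tok_b, score in zip(tokens_a, tokens_b, scores_a):
        if score < threshold:
            fused.append(project_b_to_a(tok_b))
        else:
            fused.append(list(tok_a))
    return fused


# Toy example: three 2-dimensional tokens per modality.
rgb_tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
depth_tokens = [[-1.0, -1.0], [-2.0, -2.0], [-3.0, -3.0]]
rgb_scores = [0.9, 0.001, 0.5]  # the second RGB token is scored as uninformative

fused = token_exchange(rgb_tokens, depth_tokens, rgb_scores)
print(fused)
```

In this toy run only the second RGB token is swapped for its depth counterpart. The number and order of tokens stay the same, which is consistent with the description's point that the single-modal transformer architecture remains largely intact.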

Paper: Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang. Multimodal Token Fusion for Vision Transformers. In CVPR 2022.

Model architecture

The overall architecture of TokenFusion is shown below:

Dataset

Dataset used: NYUDv2

Dataset size: RGB images and depth images, with labels for 40 segmentation classes
- Train: 795 samples
- Test: 654 samples
Data format: image files
- Note: Data is processed in utils/datasets.py

Environment Requirements

- Hardware (GPU)
- Framework
  - MindSpore
- For more information, please check the resources below:
  - MindSpore Tutorials
  - MindSpore Python API

Script Description

Script and Sample Code

.TokenFusion
├── README.md               # descriptions about TokenFusion
├── models
│   ├── mix_transformer.py  # definition of backbone model
│   ├── segformer.py        # definition of segmentation model
│   └── modules.py          # TokenFusion operations
├── utils
│   ├── datasets.py         # data loader
│   ├── helpers.py          # utility functions
│   ├── transforms.py       # data preprocessing functions
│   └── meter.py            # utility functions
├── eval.py                 # evaluation interface
├── cfg.py                  # configuration file
└── config.py               # configuration file

Training process

To Be Done

Evaluation Process

Launch

# inference example
python eval.py --checkpoint_path [CHECKPOINT_PATH]

The checkpoint can be downloaded here or from MindSpore Hub.

Result

result: IoU=54.8, ckpt= ./tokenfusion_ascend_v180_nyudv2_research_cv_acc54.8.ckpt

Parameters	Ascend
Model	TokenFusion
Model Version	tokenfusion_seg_mitb3_nyudv2
Resource	Ascend 910
Uploaded Date	2022-08-10
MindSpore Version	1.8.0
Dataset	NYUDv2
Outputs	probability
Accuracy	1pc: 54.8%
Speed	1pc: 1 s/step
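For readers unfamiliar with the metric, the sketch below shows one common way to compute a mean IoU score from flattened predictions and labels. It assumes that the reported IoU is the mean over the 40 segmentation classes, which this README does not state explicitly. The function name `mean_iou` and the `ignore_label` value of 255 are hypothetical, and this is not the code used by eval.py.

```python
from typing import List, Sequence


def mean_iou(
    preds: Sequence[int],
    labels: Sequence[int],
    num_classes: int,
    ignore_label: int = 255,  # hypothetical "void" label
) -> float:
    """Mean intersection-over-union across classes present in the data."""
    # confusion[t][p] counts pixels with true class t predicted as class p.
    confusion: List[List[int]] = [[0] * num_classes for _ in range(num_classes)]
    for pred, label in zip(preds, labels):
        if label == ignore_label:
            continue
        confusion[label][pred] += 1

    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        denom = true_pos + false_pos + false_neg
        if denom > 0:  # skip classes that never occur
            ious.append(true_pos / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with 3 classes; the last pixel is ignored.
example_labels = [0, 0, 1, 1, 2, 255]
example_preds = [0, 1, 1, 1, 2, 0]
score = mean_iou(example_preds, example_labels, num_classes=3)
print(f"mIoU = {100 * score:.1f}")
```

In the toy example the per-class IoUs are 1/2, 2/3 and 1, so the mean is 13/18.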

Description of Random Situation

We set the seed inside datasets.py.
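As a rough illustration of what fixing the seed involves, the sketch below seeds Python's `random` module and, when they are installed, NumPy and MindSpore. The helper name `set_global_seed` and the seed value are hypothetical. The actual seed and the calls used in datasets.py are not shown in this README.

```python
import random


def set_global_seed(seed: int) -> None:
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import mindspore as ms  # optional dependency

        ms.set_seed(seed)
    except ImportError:
        pass


# Seeding twice with the same value reproduces the same random draws.
set_global_seed(0)  # hypothetical seed value
first_draw = random.random()
set_global_seed(0)
second_draw = random.random()
print(first_draw == second_draw)
```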

ModelZoo Homepage

Please check the official homepage.
