We distribute models pretrained on Conceptual Captions. We share ViLBERT, LXMERT and VL-BERT pretrained as originally presented in their papers, as well as the weights for ViLBERT, LXMERT, VL-BERT, VisualBERT and UNITER pretrained in our controlled setup. For the latter, we release the weights that achieved the highest average downstream performance when fine-tuned once.
| Model | VQAv2 | RefCOCO+ | NLVR2 | Flickr30k IR | Flickr30k TR |
|---|---|---|---|---|---|
| ViLBERT | 66.68 | 70.49 | 74.26 | 58.90 | 75.50 |
| LXMERT | 67.98 | 71.58 | - | - | - |
| VL-BERT | 67.44 | 71.00 | - | - | - |
| ViLBERT (CTRL) | 68.97 | 70.53 | 72.24 | 60.34 | 78.80 |
| LXMERT (CTRL) | 67.52 | 70.49 | 71.09 | 58.62 | 74.90 |
| VL-BERT (CTRL) | 68.23 | 71.23 | 73.22 | 57.62 | 70.90 |
| VisualBERT (CTRL) | 69.03 | 70.02 | 72.70 | 61.48 | 75.20 |
| UNITER (CTRL) | 68.67 | 71.45 | 73.73 | 60.54 | 76.40 |
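
Assuming the released checkpoints are standard PyTorch weight files, the minimal sketch below shows how one might inspect a download before fine-tuning. The file name is hypothetical; substitute the path of the checkpoint you actually downloaded.

```python
import torch

# Hypothetical file name: replace with the path of the checkpoint you downloaded.
ckpt_path = "vilbert_ctrl_conceptual_captions.bin"

# Assuming the file stores a plain PyTorch state dict; if the weights are
# wrapped in an outer dictionary (e.g. under a "model" key), index into it first.
state_dict = torch.load(ckpt_path, map_location="cpu")

# Print a few parameter names and shapes to sanity-check the download.
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```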
Models are defined in configuration files (see config/ for some examples). Rather than specifying monolithic Transformer layers, we specify attention and feed-forward sub-layers for each modality, which makes it quick to extend the proposed architectures. In particular, the following sub-layers are defined (a config sketch follows the list):
- `tt_attn_sublayers`: text-text attention sub-layers
- `tv_attn_sublayers`: text-vision attention sub-layers (text used as query, vision as context)
- `vt_attn_sublayers`: vision-text attention sub-layers (vision used as query, text as context)
- `vv_attn_sublayers`: vision-vision attention sub-layers
- `t_ff_sublayers`: feed-forward sub-layers for the text modality
- `v_ff_sublayers`: feed-forward sub-layers for the vision modality
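
As an illustration of the format, the snippet below sketches how these fields might look, written here as a Python dict. The indices are invented for a small dual-stream example and are not taken from any file in config/; they assume each field is a list of sub-layer indices.

```python
# Illustrative only: invented indices for a small dual-stream model,
# assuming each field lists the indices of the sub-layers of that type.
sublayer_config = {
    "tt_attn_sublayers": [0, 1, 2, 3],           # text attends to text
    "tv_attn_sublayers": [4, 6],                 # text queries, vision context
    "vt_attn_sublayers": [5, 7],                 # vision queries, text context
    "vv_attn_sublayers": [4, 6],                 # vision attends to vision
    "t_ff_sublayers": [0, 1, 2, 3, 4, 5, 6, 7],  # text feed-forward sub-layers
    "v_ff_sublayers": [4, 5, 6, 7],              # vision feed-forward sub-layers
}
```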
In addition, the following parameters control parameter sharing across modalities (see the sketch after the list):
- `shared_sublayers`: sub-layers that share parameters between modalities
- `single_ln_sublayers`: sub-layers in which text and vision tensors are concatenated and fed into a single LayerNorm (LN) layer
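
Continuing the illustrative sketch above (again with invented indices), the sharing options could be added as follows:

```python
# Illustrative only: extend the invented config above with sharing options.
sublayer_config.update({
    # sub-layers whose parameters are shared between the text and vision streams
    "shared_sublayers": [4, 5, 6, 7],
    # sub-layers where text and vision states are concatenated and passed
    # through a single LayerNorm instead of one per modality
    "single_ln_sublayers": [],
})
```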
Finally, `bert_layer2attn_sublayer` and `bert_layer2ff_sublayer` are used to load text-only BERT layers into the corresponding VOLTA sub-layers.
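
A hedged sketch of these two fields, assuming each one maps a text-only BERT layer index to the VOLTA sub-layer it should initialise (the indices below are invented, not taken from any released config):

```python
# Illustrative only: assumed format mapping BERT layer index -> VOLTA sub-layer index,
# used when initialising the text stream from a pretrained text-only BERT.
sublayer_config.update({
    "bert_layer2attn_sublayer": {"0": 0, "1": 1, "2": 2, "3": 3},
    "bert_layer2ff_sublayer": {"0": 0, "1": 1, "2": 2, "3": 3},
})
```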
The following figure shows how these sub-layers are used to construct ViLBERT:
