Transformers (Bugliarello et al., 2021; Hendricks et al., 2021), we will not focus on improving performance through ensembling as in Yan et al. (2021b). Rather, we utilize combinations of VEs as the setting for answering our research questions.
We cover three popular classes of VEs in our experiments: 1) object detection models providing a feature representation of salient image parts containing objects (Region) (Anderson et al., 2018), 2) CNN models computing a feature map of the image for grid features (Grid), and 3) Vision Transformers (ViT) (Dosovitskiy et al., 2021) computing contextualized patch features of the image (Patch).
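To make the three feature types concrete, the following sketch shows how a token sequence could be obtained for each of them with standard PyTorch/torchvision components. The backbones, dummy boxes, and dimensions are illustrative assumptions and not the exact extractors used in our experiments.

```python
# Minimal sketch of the three VE token types (Region, Grid, Patch).
# Backbone choices (torchvision ResNet-50, a ViT-style patch embedding)
# are illustrative assumptions, not the extractors used in the paper.
import torch
import torchvision
from torchvision.ops import roi_align

image = torch.randn(1, 3, 224, 224)            # a single RGB image

# --- Grid: CNN feature map, flattened into one token per cell --------
resnet = torchvision.models.resnet50(weights=None)
cnn_trunk = torch.nn.Sequential(*list(resnet.children())[:-2])
fmap = cnn_trunk(image)                        # (1, 2048, 7, 7)
grid_tokens = fmap.flatten(2).transpose(1, 2)  # (1, 49, 2048)

# --- Region: pooled features for detected object boxes ---------------
# Boxes would normally come from an object detector such as Faster
# R-CNN; here we use dummy boxes in image coordinates, mapped to the
# feature map via spatial_scale.
boxes = [torch.tensor([[10., 10., 120., 150.], [50., 60., 200., 210.]])]
region_feats = roi_align(fmap, boxes, output_size=(1, 1),
                         spatial_scale=7 / 224)
region_tokens = region_feats.flatten(1).unsqueeze(0)   # (1, 2, 2048)

# --- Patch: ViT-style linear projection of non-overlapping patches ---
patch_embed = torch.nn.Conv2d(3, 768, kernel_size=16, stride=16)
patch_tokens = patch_embed(image).flatten(2).transpose(1, 2)  # (1, 196, 768)
# A real ViT additionally adds position embeddings and runs a
# Transformer encoder to contextualize these patch tokens.
```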
As the impact of different VEs can vary heavily with the downstream domain and task type, we probe all combinations of the three VEs on six different V+L tasks, covering retrieval, Q&A, and reasoning.
To investigate VE complementarity and feature utilization, we analyze 1) the attention patterns across modalities and VEs, and 2) the dependency on specific VEs when performing VE-dropout during training and inference. While multi-VE setups seem to perform better than single-VE ones (which could partially be attributed to the increased parameter count), we consistently observe performance gaps between different multi-VE configurations (e.g., a gap as large as 8.9 points on the same task) and no single winning combination for all task types. Our analysis of attention patterns across the different VEs reveals that the distinctive information encoded in the VEs is important for different tasks, and that the model composes the representations by enriching a dominant VE with complementary information from the other VEs.
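As a concrete illustration of the VE-dropout analysis, the following sketch masks all tokens of one randomly chosen VE in the multi-VE input so the model cannot rely on a single extractor; the function and argument names are ours and purely illustrative.

```python
# Illustrative VE-dropout sketch on a multi-VE input: with probability p,
# all tokens of one randomly chosen VE are zeroed and masked out.
# Names are illustrative, not taken from a specific codebase.
import random
import torch

def ve_dropout(ve_tokens, ve_masks, p=0.15, training=True):
    """ve_tokens: list of (B, N_i, D) tensors, one per visual encoder.
    ve_masks: list of (B, N_i) attention masks (1 = attend, 0 = ignore)."""
    if training and len(ve_tokens) > 1 and random.random() < p:
        drop = random.randrange(len(ve_tokens))
        ve_tokens, ve_masks = list(ve_tokens), list(ve_masks)
        ve_tokens[drop] = torch.zeros_like(ve_tokens[drop])
        ve_masks[drop] = torch.zeros_like(ve_masks[drop])
    # Concatenate the (possibly dropped) VE sequences along the token axis.
    return torch.cat(ve_tokens, dim=1), torch.cat(ve_masks, dim=1)
```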
To sum up, our results and analysis suggest that VEs trained with different objectives, architectures, and data can have a high impact on the model's V+L task performance. We cannot rely on simple ensemble effects to improve performance; selecting and repurposing off-the-shelf VEs is non-trivial, which emphasizes the necessity of designing VEs explicitly for V+L tasks in the future.
2 Related Work
Multimodal Transformer Architectures. Multimodal Transformer architectures can be divided into single-stream and dual-stream models (Bugliarello et al., 2021). Single-stream Transformers take the concatenated visual and text tokens as input and process them in a modality-agnostic manner, i.e., the self-attention jointly attends over the tokens of both modalities. Dual-stream models use separate Transformers for each modality, which are either connected through a co-attention mechanism (Tan and Bansal, 2019; Lu et al., 2019), combined by a single-stream model on top (Singh et al., 2022; Kamath et al., 2021), or linked asymmetrically, with the image model output used for cross-attention in the text model (Li et al., 2021, 2022).
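A minimal single-stream encoder can be sketched as follows: text and visual tokens, tagged with a modality embedding, are concatenated and processed by one shared Transformer whose self-attention jointly attends over both modalities. Dimensions and module choices are illustrative and do not correspond to a particular pre-trained model.

```python
# Minimal single-stream sketch: one shared Transformer over the
# concatenation of text and visual tokens. Illustrative dimensions only.
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    def __init__(self, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        # Modality (token type) embeddings: index 0 = text, 1 = visual.
        self.type_embed = nn.Embedding(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, N_t, D), visual_tokens: (B, N_v, D)
        text = text_tokens + self.type_embed.weight[0]
        vis = visual_tokens + self.type_embed.weight[1]
        joint = torch.cat([text, vis], dim=1)   # (B, N_t + N_v, D)
        return self.encoder(joint)              # jointly attended sequence
```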
The Faster R-CNN (Ren et al., 2015) object detector has been the dominant choice for multimodal models as a Region VE, with most methods using it as a static feature extractor (Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2020; Chen et al., 2020; Gan et al., 2020; Li et al., 2020b; Zhang et al., 2021; Cho et al., 2021), the notable exception being Su et al. (2020), who backpropagate through the Faster R-CNN model. Less popular VEs are Grid (Huang et al., 2020; Kamath et al., 2021; Yan et al., 2021a; Shen et al., 2022; Eichenberg et al., 2022) and Patch (Kim et al., 2021; Wang et al., 2022; Eichenberg et al., 2022). In contrast to Region VEs, Grid and Patch VEs are commonly fine-tuned on the target V+L task, with Yan et al. (2021a) being a notable exception. Following Bugliarello et al. (2021) and Hendricks et al. (2021), we focus on single-stream models, as they have been shown to perform on par with dual-stream models while being easier to extend to multi-VE setups.
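The two usage modes contrasted above can be sketched as follows: a frozen VE acts as a static feature extractor outside the gradient flow, while a fine-tuned VE receives gradients from the downstream V+L loss. The ResNet backbone here is an illustrative stand-in for the actual Region, Grid, or Patch encoders.

```python
# Sketch: frozen (static) VE vs. fine-tuned VE. Illustrative backbone only.
import torch
import torchvision

# Frozen VE: parameters are excluded from optimization and the forward
# pass runs without gradients (static feature extraction).
frozen_ve = torchvision.models.resnet50(weights=None).eval()
frozen_ve.requires_grad_(False)
with torch.no_grad():
    static_out = frozen_ve(torch.randn(2, 3, 224, 224))

# Fine-tuned VE: parameters stay trainable and are updated by gradients
# flowing back from the downstream V+L objective.
tuned_ve = torchvision.models.resnet50(weights=None)
optimizer = torch.optim.AdamW(tuned_ve.parameters(), lr=1e-5)
out = tuned_ve(torch.randn(2, 3, 224, 224))
out.sum().backward()      # stand-in for the real task loss
optimizer.step()
```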
Comparing and Combining VEs. Recently, several works have compared different VEs for V+L tasks. Jiang et al. (2020) compare Region and Grid VEs for visual QA tasks, showing that training data, objectives, and other factors all affect downstream task performance. Shen et al. (2022) and Eichenberg et al. (2022) compare different pre-trained Grid and Patch VEs built on CLIP (Radford et al., 2021). Zhang et al. (2021) compare different design choices for Region VEs with Grid VEs trained on the same data. Dai et al. (2023) compare how different VEs influence object hallucination in caption generation. Closest to our work is Yan et al. (2021b). While they also experiment with combining representations of Grid, Patch, and Region VEs, they only focus on the Visual Question Answering (VQA; Goyal et al., 2017) dataset and only use the combination of all three VEs. Our work provides a more in-depth evaluation of different multi-VE setups while experimenting with six diverse tasks, and shows that different combinations work best for different tasks.
Analysis of Multimodal Transformers. Our anal-