One does not fit all!
On the Complementarity of Vision Encoders for Vision and Language Tasks

Gregor Geigle*, Chen Cecilia Liu, Jonas Pfeiffer†, Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP Lab)
Department of Computer Science and Hessian Center for AI (hessian.AI)
Technical University of Darmstadt
www.ukp.tu-darmstadt.de
Abstract

Current multimodal models, aimed at solving Vision and Language (V+L) tasks, predominantly repurpose Vision Encoders (VE) as feature extractors. While many VEs (of different architectures, trained on different data and objectives) are publicly available, they are not designed for the downstream V+L tasks. Nonetheless, most current work assumes that a single pre-trained VE can serve as a general-purpose encoder. In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary, i.e. if providing the model with features from multiple VEs can improve the performance on a target task, and how they are combined. We exhaustively experiment with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns. Our analyses suggest that diverse VEs complement each other, resulting in improved downstream V+L task performance, where the improvements are not due to simple ensemble effects (i.e. the performance does not always improve when increasing the number of encoders). We demonstrate that future VEs, which are not repurposed but explicitly designed for V+L tasks, have the potential of improving performance on the target V+L tasks.
1 Introduction

The dominant strategy for solving Vision+Language (V+L) tasks involves using Transformer models (Vaswani et al., 2017) that jointly attend over the representations of the respective modalities (Lu et al., 2019; Su et al., 2020; Li et al., 2020b; Chen et al., 2020; Huang et al., 2020, inter alia). While representation learning for the text modality is comparatively straightforward using token embeddings,[1] image representations are more difficult to learn. Given an image, a common approach is to use pre-trained Vision Encoders (VE), where the VE's output features are passed as inputs, together with the text embeddings, into a Transformer model. The attention mechanism then learns a cross-modal representation space over the text and image features to solve the target V+L task.

* Gregor is now affiliated with WüNLP & Computer Vision Lab, CAIDAS, University of Würzburg.
† Jonas is now affiliated with Google DeepMind.
[1] But still far from solved, especially in multilingual settings (Rust et al., 2021; Clark et al., 2022; Xue et al., 2022).
Consequently, the success of a multimodal model builds heavily on the features extracted from a VE and is thus highly dependent on the VE's architecture, training objectives (e.g. image classification, image encoding, object detection, etc.), and pre-training data. This dependency is most pronounced for multimodal models that utilize VEs as static feature extractors (i.e. the weights of the VE are frozen), but it also holds for models that are trained end-to-end, as the biases introduced by the architecture, objectives, and data of the VE remain.
Since many computer vision models can be repurposed as VEs for V+L tasks, a few prior works have focused on identifying individual VEs that perform the best on downstream tasks (Jiang et al., 2020; Shen et al., 2022; Eichenberg et al., 2022; Zhang et al., 2021). A common assumption is that a single pre-trained VE can perform the best for a target task or even serve as a general-purpose encoder for a wide range of V+L tasks. However, a natural question arises: to what extent is this assumption correct? Given that all VEs differ in architecture, objectives, and pre-training data, we hypothesize that the extracted features of multiple different VEs encode complementary information.
In this work, we focus on answering: 1) Do different VEs encode complementary features? 2) How are features from different VEs utilized by Transformers? We provide comprehensive analyses for multi-VE models and test whether combining VEs is beneficial over a single-VE setup under the viewpoint of feature complementarity. Similar to prior work that analyzed other components of V+L Transformers (Bugliarello et al., 2021; Hendricks et al., 2021), we will not focus on improving the performance through ensembling like Yan et al. (2021b). Rather, we utilize combinations of VEs as the setting for answering our research questions.
We cover three popular classes of VEs in our experiments: 1) object detection models providing a feature representation of salient image parts containing objects (Region) (Anderson et al., 2018), 2) CNN models computing a feature map of the image for grid features (Grid), and 3) Vision Transformers (ViT) (Dosovitskiy et al., 2021) computing contextualized patch features of the image (Patch). As different VEs may suit different downstream domains and task types, we probe all combinations of the three VEs on six different V+L tasks, covering retrieval, Q&A, and reasoning.
To investigate the VE complementarity and feature utilization, we analyze 1) the attention patterns across modalities and VEs, and 2) the dependency on specific VEs when performing VE-dropout during training and inference. While multi-VE models seem to perform better than single-VE models (which could partially be attributed to the increased parameter count), we consistently observe performance gaps between different multi-VE configurations (e.g. a gap as large as 8.9 points for the same task) and no single winning combination for all task types. Our analysis of the attention patterns across the different VEs reveals that the distinctive information encoded in the VEs is important for different tasks, and the model composes the representations by enriching a dominant VE with complementary information from the other VEs.
To sum up, our results and analysis suggest that VEs with different training objectives, architectures, and data can have a high impact on the model's V+L task performance. We cannot rely on simple ensemble effects to improve performance; selecting and repurposing off-the-shelf VEs is non-trivial, which emphasizes the necessity to design VEs explicitly for V+L tasks in the future.
2 Related Work

Multimodal Transformer Architectures. Multimodal Transformer architectures can be divided into single-stream and dual-stream models (Bugliarello et al., 2021). The single-stream Transformer takes the concatenated visual and text tokens as input and processes them in a modality-agnostic way, i.e. the self-attention jointly attends over the tokens of both modalities. Dual-stream models use separate Transformers for each modality that are connected through a co-attention mechanism (Tan and Bansal, 2019; Lu et al., 2019), concatenated into a single-stream model on top (Singh et al., 2022; Kamath et al., 2021), or the image model output is used asymmetrically for cross-attention in the text model (Li et al., 2021, 2022).
The Faster R-CNN (Ren et al., 2015) object detector has been the dominant choice for multimodal models as a Region VE, where most methods propose to use it as a static feature extractor (Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2020; Chen et al., 2020; Gan et al., 2020; Li et al., 2020b; Zhang et al., 2021; Cho et al., 2021), with the notable exception being Su et al. (2020) who backpropagate through the Faster R-CNN model. Less popular VEs are Grid (Huang et al., 2020; Kamath et al., 2021; Yan et al., 2021a; Shen et al., 2022; Eichenberg et al., 2022) and Patch (Kim et al., 2021; Wang et al., 2022; Eichenberg et al., 2022). In contrast to Region VEs, Grid and Patch VEs are commonly fine-tuned on the target V+L task, with the notable exception being Yan et al. (2021a). Following Bugliarello et al. (2021); Hendricks et al. (2021), we focus on single-stream models as they have been shown to perform on par with dual-stream models while being easier to extend to multi-VE setups.
Comparing and Combining VEs. Recently, several works have aimed to compare different VEs for V+L tasks. Jiang et al. (2020) compare Region and Grid for visual QA tasks, showing that training data, objectives, and other factors all affect downstream task performance. Shen et al. (2022); Eichenberg et al. (2022) compare different pre-trained Grid and Patch VEs building on CLIP (Radford et al., 2021). Zhang et al. (2021) compare different design choices for Region VEs with Grid VEs trained on the same data. Dai et al. (2023) compare how different VEs influence object hallucination in caption generation. Closest to our work is the work by Yan et al. (2021b). While they also experiment with combining representations of Grid, Patch, and Region VEs, they only focus on the Visual Question Answering (VQA; Goyal et al., 2017) dataset and only use the combination of all three VEs. Our work provides a more in-depth evaluation of different multi-VE setups while experimenting with six diverse tasks, and shows that different combinations work best for each task.
Analysis of Multimodal Transformers. Our analysis methods draw inspiration from recent works that probe and analyze pre-trained multimodal Transformers for a better understanding of their different components (Bugliarello et al., 2021; Cao et al., 2020; Li et al., 2020a; Frank et al., 2021; Hendricks et al., 2021). Cao et al. (2020) propose a range of different probing tasks to understand the inner workings of multimodal models. Li et al. (2020a) analyze how accurately the attention heads of pre-trained models can perform visual grounding. Frank et al. (2021) mask parts of the text and image input and measure how the prediction performance changes for the respective other modality to test how symmetric the learned cross-modal connection is. Bugliarello et al. (2021); Hendricks et al. (2021) evaluate and disentangle which components of multimodal pre-training proposed in different works are important for their success. While previous work has only focused on models with a Region VE, we also experiment with Grid and Patch VEs.

In summary, our work is the first in-depth study of multimodal Transformers that use multiple VEs.
3 Multimodal Multi-VE Transformers

Cross-modal attention has recently become the dominant strategy to learn multimodal representations with V+L Transformers. In this work, we follow Bugliarello et al. (2021) and focus on the single-stream architecture, which shares the attention components across all modalities, i.e. the concatenated visual and text tokens are processed in a modality-agnostic way. This architecture achieves state-of-the-art results (Bugliarello et al., 2021) while being easily extendable to multiple VEs by concatenating all vision tokens.[2] Figure 1 illustrates our architecture.
[2] We concatenate all VEs for analysis in §5. In practice, concatenation increases the sequence length and incurs a high computational cost. More efficient methods like resampling (Alayrac et al., 2022) can be explored in future work.

[Figure 1: Our Multi-VE Architecture. Each VE produces a list of visual tokens, which are passed through MLPs and concatenated with the text embeddings. The Transformer is modality-agnostic and attends over all tokens. We freeze the VEs during training and only optimize the MLPs, embeddings, and the Transformer.]

Multimodal input representations. The raw data for a V+L task consists of either discrete tokens/characters (text modality) or high-resolution pixel values (image modality). To extract dense representations of the respective modalities, we follow the standard pre-processing strategies: the text modality is tokenized using word-piece tokenization (Devlin et al., 2019) and mapped to the corresponding dense embedding representations. At the input to the first Transformer layer, positional embeddings are added to the respective token embeddings. For the vision modality, pre-trained VEs are utilized which encode the raw pixel values of the respective image into dense high-dimensional feature vectors. These VEs can either encode designated sections (e.g. Region) or an entire image (e.g. Grid and Patch). The extracted feature vectors are then passed through a multi-layer perceptron (MLP), and subsequently into the Transformer. This procedure can be repeated for any number of VEs of interest. In other words, the image features (from multiple VEs) and the text embeddings are concatenated and jointly passed through a shared Transformer model which learns to attend over the multimodal representations.
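To make this wiring concrete, the following is a minimal PyTorch sketch of the input construction; it is not the authors' released code. The plain nn.TransformerEncoder and its hyperparameters stand in for the BERT-base Transformer used in the paper, and the per-VE feature dimensions follow Table 1.

```python
import torch
import torch.nn as nn


class MultiVETransformer(nn.Module):
    """Single-stream multi-VE model: frozen VE features are projected by per-VE
    MLPs into the Transformer's embedding space and concatenated with the text
    token embeddings; attention is modality-agnostic over the full sequence."""

    def __init__(self, ve_dims=None, vocab_size=30522, hidden=768,
                 layers=12, heads=12, max_len=512):
        super().__init__()
        if ve_dims is None:
            # Feature dimensions of the three VEs (see Table 1).
            ve_dims = {"region": 2054, "grid": 2560, "patch": 768}
        # One randomly initialized 2-layer MLP per VE; the VEs themselves stay frozen.
        self.ve_mlps = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, hidden))
            for name, dim in ve_dims.items()
        })
        self.tok_emb = nn.Embedding(vocab_size, hidden)   # word-piece embeddings
        self.pos_emb = nn.Embedding(max_len, hidden)      # positional embeddings for the text tokens
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, dim_feedforward=4 * hidden,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, text_ids, ve_features):
        # text_ids: (B, T) word-piece ids with [CLS] at position 0.
        # ve_features: dict {ve_name: (B, N_v, D_v)} of pre-extracted, frozen VE outputs.
        positions = torch.arange(text_ids.size(1), device=text_ids.device).unsqueeze(0)
        text = self.tok_emb(text_ids) + self.pos_emb(positions)
        visual = [self.ve_mlps[name](feats) for name, feats in ve_features.items()]
        sequence = torch.cat([text] + visual, dim=1)      # (B, T + sum of N_v, hidden)
        return self.encoder(sequence)                     # joint attention over text and all VE tokens
```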
V+L task training. We place a classification head on the output of the [CLS] token (following Devlin et al. (2019)) and fine-tune the model with cross-entropy loss on the training data of the target task.
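A correspondingly minimal fine-tuning step, reusing the MultiVETransformer sketch above, could look as follows; the answer-vocabulary size and optimizer settings are placeholders rather than the paper's hyperparameters (those are listed in their Appendix A).

```python
import torch
import torch.nn as nn

model = MultiVETransformer()               # sketch from above; VE features arrive pre-extracted
num_answers = 3129                         # hypothetical answer-vocabulary size, task-dependent
head = nn.Linear(768, num_answers)         # classification head on the [CLS] output
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=5e-5)   # placeholder optimizer settings

def training_step(text_ids, ve_features, labels):
    hidden = model(text_ids, ve_features)      # (B, seq_len, 768)
    logits = head(hidden[:, 0])                # [CLS] sits at position 0 of the text segment
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                            # only the MLPs, embeddings, and Transformer get gradients
    optimizer.step()
    return loss.item()
```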
4 Experiments

We evaluate the impact of three different VEs on six downstream V+L tasks to assess the complementarity of different image representations. Here, we experiment with all possible combinations of the three VEs (i.e. single-VE, 2-VE, and 3-VE setups).[3] To fairly compare the information stored in the respective VEs, we only fine-tune the multimodal models on the target V+L task in order to circumvent potentially beneficial information leaking into the multimodal model from auxiliary tasks. We therefore initialize all models with BERT weights (Devlin et al., 2019) (base size). We note, however, that gains can be achieved when pre-training the multimodal model on auxiliary data prior to fine-tuning on the target V+L task (Tan and Bansal, 2019; Lu et al., 2019; Chen et al., 2020, inter alia).

[3] While we report the results of single VEs, we do not aim to show that one VE outperforms the others, as this would require a more controlled experimental setup, e.g. with matched training datasets and training objectives amongst other factors (Jiang et al., 2020; Zhang et al., 2021), which is outside the scope of this work.

VE     | Model                                | # Train images | Training tasks                                               | # V. Tok. | Dim.
Region | VinVL (Zhang et al., 2021)           | 2.5M (1)       | bounding box prediction, object & attribute classification  | 36        | 2054
Grid   | CLIP RN50x4 (Radford et al., 2021)   | 400M (2)       | image-text contrastive loss                                  | 36        | 2560
Patch  | CLIP ViT/B-32 (Radford et al., 2021) | 400M (2)       | image-text contrastive loss                                  | 49        | 768

Table 1: The three VE models used in our experiments with the number of pre-training images, training objectives, the number of visual tokens (V. Tok.), and the output feature dimension. Train datasets: (1) combination of multiple object detection datasets (see Zhang et al., 2021); (2) web-crawled & cleaned image-caption pairs (proprietary).
4.1 Vision Encoders

We follow the standard approach and repurpose three pre-trained vision models as VEs. In a best-effort attempt at a fair setup, we use the current best publicly available models of similar sizes. Each VE has a designated, randomly initialized 2-layer perceptron (MLP) that maps its representations to the input space of the Transformer and is trained on the target V+L task along with the multimodal Transformer weights. We keep the VE weights frozen during training. For a full summary of the VEs including pre-training data, the number of extracted tokens, and feature dimensions, see Table 1.
Region VE. We utilize Faster R-CNN (Ren et al., 2015), an object detection model that outputs a list of bounding boxes and feature vectors for Regions of Interest, i.e. salient parts of the image that likely contain an object. Here we select the pre-trained VinVL object detector (Zhang et al., 2021),[4] which outperforms previous object detectors on V+L tasks. We follow Li et al. (2020b); Zhang et al. (2021) and concatenate each extracted feature vector with the corresponding normalized box coordinates and width/height. We extract the top-36 regions from the VinVL object detector.

[4] Not to be confused with their Transformer.
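As an illustration (not the VinVL extraction pipeline itself), the following sketch assumes the ROI features and boxes have already been extracted and that the detector features are 2048-dimensional; concatenating them with six box-geometry values yields the 2054-dimensional Region tokens listed in Table 1.

```python
import torch

def region_tokens(roi_feats, boxes, img_w, img_h):
    """roi_feats: (36, 2048) ROI features from the detector for the top-36 regions.
    boxes: (36, 4) absolute (x1, y1, x2, y2) box coordinates.
    Returns (36, 2054) Region tokens: features + normalized coordinates + width/height."""
    scale = torch.tensor([img_w, img_h, img_w, img_h], dtype=roi_feats.dtype)
    norm_boxes = boxes / scale                              # coordinates normalized to [0, 1]
    wh = norm_boxes[:, 2:] - norm_boxes[:, :2]              # normalized box width and height
    return torch.cat([roi_feats, norm_boxes, wh], dim=-1)   # 2048 + 4 + 2 = 2054 dimensions
```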
Grid VE. Grid VEs linearize the grid feature map of a CNN (before final pooling or classification layers) into a list of visual tokens. Each visual token corresponds to a specific part of the image, with image features on different scales (through different pooling operations and convolution sizes throughout the CNN) encoded in it. We use adaptive max pooling[5] on the feature map to reduce the number of tokens to 36 per image. We use the CLIP CNN (RN50x4) (Radford et al., 2021) as initialization, given its recent success on V+L tasks (Shen et al., 2022; Eichenberg et al., 2022; Alayrac et al., 2022).

[5] torch.nn.AdaptiveMaxPool2d
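A sketch of the Grid tokenization, assuming access to the pre-pooling feature map of the CLIP RN50x4 backbone (2560 channels); adaptive max pooling to a 6×6 grid yields the 36 tokens of Table 1.

```python
import torch.nn as nn

pool = nn.AdaptiveMaxPool2d((6, 6))   # torch.nn.AdaptiveMaxPool2d, as in footnote 5

def grid_tokens(feature_map):
    """feature_map: (B, 2560, H, W) CLIP RN50x4 CNN feature map, taken before
    its attention-pooling head. Returns (B, 36, 2560) Grid tokens."""
    pooled = pool(feature_map)                 # (B, 2560, 6, 6): 6 x 6 = 36 grid cells
    return pooled.flatten(2).transpose(1, 2)   # linearize the grid into a token sequence
```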
Patch VE. Patch VEs use the contextualized output representations of a Vision Transformer (ViT) (Dosovitskiy et al., 2021) as visual tokens. The ViT splits an image into uniform patches, which are used as input tokens. Different from a CNN, the ViT tokens are fixed in size throughout the model, but they have a global receptive field through the ViT's attention mechanism. We exclude the ViT's special classification token from the Transformer input. We also utilize the CLIP-based ViT model (ViT/B-32) (Radford et al., 2021) for our Patch VE. We extract all 49 tokens for the CLIP ViT due to their smaller feature dimension size.
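A sketch of the Patch tokenization with the Hugging Face CLIPVisionModel, assuming openai/clip-vit-base-patch32 matches the paper's CLIP ViT/B-32 checkpoint; dropping the classification token leaves the 49 patch tokens of dimension 768 from Table 1.

```python
import torch
from transformers import CLIPVisionModel

# Assumed checkpoint name for the paper's CLIP ViT/B-32 encoder.
vit = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
for p in vit.parameters():
    p.requires_grad = False            # the VE stays frozen

@torch.no_grad()
def patch_tokens(pixel_values):
    """pixel_values: (B, 3, 224, 224) preprocessed images.
    Returns (B, 49, 768) contextualized patch tokens."""
    out = vit(pixel_values=pixel_values)
    return out.last_hidden_state[:, 1:, :]   # drop the special classification token at index 0
```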
4.2 Tasks

We experiment with a set of six V+L tasks: image-text retrieval (Flickr30k (Young et al., 2014) and MSCOCO (Lin et al., 2014)), visual question answering (GQA (Hudson and Manning, 2019) and VQA 2.0 (Goyal et al., 2017)), visual entailment (SNLI-VE (Xie et al., 2019)), and meme classification (Hateful Memes (Kiela et al., 2020)). For all experiments we report the mean and standard deviation over three random seeds and present training details and hyperparameters in Appendix A. We train all models on a single Nvidia V100 GPU; training a single model configuration (all tasks, three seeds) takes approximately 10 GPU days.
4.3 Results & Discussion

We report the results on the six tasks with all possible combinations of VEs in Table 2.

Vision Encoders     | Flickr30k R@1 | MSCOCO R@1   | GQA Acc.      | VQA Acc.      | SNLI-VE Acc.  | Hateful M. AUROC
Region              | 57.46 ±2.74   | 50.79 ±3.28  | 55.32 ±0.33 * | 65.73 ±0.54 * | 76.57 ±0.10   | 74.83 ±0.73
Grid                | 66.93 ±3.59 * | 58.30 ±3.56 *| 51.51 ±0.17   | 62.99 ±1.25   | 77.32 ±0.11 * | 79.03 ±0.27 *
Patch               | 54.99 ±6.00   | 46.30 ±2.07  | 51.56 ±0.44   | 62.96 ±0.71   | 76.32 ±0.09   | 75.78 ±1.57
Region+Grid         | 63.43 ±5.85   | 54.87 ±6.88  | 55.08 ±0.44   | 66.30 ±1.52   | 77.66 ±0.11   | 78.68 ±1.82
Region+Patch        | 58.60 ±4.44   | 58.73 ±4.02 **| 55.58 ±0.09 **| 67.05 ±0.42 **| 76.60 ±0.19   | 75.87 ±0.63
Grid+Patch          | 67.53 ±2.07 **| 56.44 ±3.80  | 51.55 ±0.16   | 62.64 ±0.30   | 77.39 ±0.35   | 79.88 ±0.95 **
Region+Grid+Patch   | 62.30 ±2.04   | 58.33 ±2.51  | 54.39 ±0.59   | 66.82 ±1.57   | 77.87 ±0.24 **| 78.81 ±0.38
With VE-Dropout Training (§5.5)
Region+Grid         | 55.11 ±13.44  | 54.28 ±4.97  | 54.91 ±0.26   | 64.72 ±3.93   | 77.07 ±0.21   | 75.75 ±0.93
Region+Patch        | 51.53 ±7.75   | 52.07 ±4.44  | 54.46 ±0.70   | 65.07 ±0.63   | 76.42 ±0.15   | 73.57 ±0.69
Grid+Patch          | 63.13 ±4.16   | 55.56 ±3.92  | 51.80 ±0.25   | 60.62 ±1.38   | 77.41 ±0.16   | 77.30 ±0.43

Table 2: Mean and standard deviation over three seeds. Metrics: for retrieval the average recall at 1 (R@1) over image-to-text and text-to-image retrieval, for Hateful Memes AUROC, and accuracy otherwise. The best single-VE setup per task is marked with *; the best multi-VE setup, which is also the overall best score per task, is marked with **. We also report the results for VE-dropout training (see §5.5).

No "one encoder to rule them all". When comparing the results of the single-VE models, there is no clear single winning VE that outperforms all other VEs across all tasks. While Region VE models perform best for the QA tasks, the Grid VE outperforms the others for the remaining tasks. We hypothesize that the object-centric regions are useful for QA tasks, which focus on specific elements, while the uniform grid encoding might be useful for retrieval and other tasks that look at the entire image. Isolating why certain VEs are useful for specific tasks, and the role of training objectives, data, and architecture, requires a controlled setup of training VEs from scratch (Jiang et al., 2020), which we leave to future work. Interestingly, the Patch VE never achieves the best performance, which aligns with previous findings by Shen et al. (2022); Eichenberg et al. (2022). These single-VE results demonstrate that each VE encodes different types of information that impact the downstream performance.
VEs can complement each other. When combining the representations from different VEs, we witness improvements across all V+L tasks. Interestingly, MSCOCO benefits greatly from combining the two weakest VEs (i.e. Region and Patch), surpassing their corresponding single-VE results by 7.94 and 12.43 points respectively and achieving the best performance on this task. Although the Patch VE never achieves the best performance in single-VE setups, it provides complementary information in combination with the best performing VE, achieving the best overall performance for many tasks.[6] However, simply using more encoders does not guarantee improvements, as is evident from the 3-encoder model: while it consistently achieves near-best results, it is rarely the best model (only on 1 out of 6 tasks), i.e. model performance does not improve monotonically with the number of VEs. Hence, it is unlikely that the improvements are due to an ensemble effect.

[6] Grid+Patch performs better than Grid for Flickr30k, Region+Patch performs better than Region for GQA, etc.
One does not fit all. In summary, we see that neither one VE alone nor a fixed combination of VEs gives the best results for the entire breadth of V+L tasks. This shows the current limitations of repurposed vision encoders and highlights the need for encoders designed specifically for V+L tasks.
5 Analysis

To better understand how the representations are combined in different multi-VE setups, we analyze the flow of attention, phrase-to-image grounding, and the robustness to dropping VEs at test time. For simplicity, we overload the term 'cross-modality' to include both VE-text and VE-VE interactions. In what follows, we present the analysis for the best performing model combinations and provide a full list of results in Appendix B.
5.1 CLS Attention Flow

The CLS token can be seen as the fused representation of the modalities that is used for the final classification (Cao et al., 2020). We can thus estimate which VEs are important for classification by considering which VEs the CLS token attends to.[7]

[7] This is only an estimate, because the modalities also combine information through attention.
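As an illustration of this estimate (a simplification of the analysis, with hypothetical span bookkeeping), one can sum the attention mass that the [CLS] token assigns to the token span of each VE and to the text:

```python
import torch

def cls_attention_flow(attn, spans, cls_pos=0):
    """attn: (layers, heads, seq_len, seq_len) attention weights for one example.
    spans: dict mapping 'text' and VE names to (start, end) index ranges in the input sequence.
    Returns the attention mass flowing from [CLS] to each span, averaged over layers and heads."""
    cls_attn = attn[:, :, cls_pos, :]                        # (layers, heads, seq_len)
    return {name: cls_attn[..., start:end].sum(-1).mean().item()
            for name, (start, end) in spans.items()}

# Hypothetical usage with 20 text tokens followed by 36 Region and 36 Grid tokens:
# flows = cls_attention_flow(attn, {"text": (0, 20), "region": (20, 56), "grid": (56, 92)})
```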