Transformers (Bugliarello et al., 2021; Hendricks et al., 2021), we will not focus on improving performance through ensembling as in Yan et al. (2021b). Rather, we utilize combinations of VEs as the setting for answering our research questions.
We cover three popular classes of VEs in our experiments: 1) object detection models providing a feature representation of salient image parts containing objects (Region) (Anderson et al., 2018), 2) CNN models computing a feature map of the image for grid features (Grid), and 3) Vision Transformers (ViT) (Dosovitskiy et al., 2021) computing contextualized patch features of the image (Patch).
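To make the three feature types concrete, the following sketch shows how a token sequence could be obtained for each of them with standard PyTorch/torchvision components. The backbones, dummy boxes, and dimensions are illustrative assumptions and not the exact extractors used in our experiments.

```python
# Minimal sketch of the three VE token types (Region, Grid, Patch).
# Backbone choices (torchvision ResNet-50, a ViT-style patch embedding)
# are illustrative assumptions, not the extractors used in the paper.
import torch
import torchvision
from torchvision.ops import roi_align

image = torch.randn(1, 3, 224, 224)            # a single RGB image

# --- Grid: CNN feature map, flattened into one token per cell --------
resnet = torchvision.models.resnet50(weights=None)
cnn_trunk = torch.nn.Sequential(*list(resnet.children())[:-2])
fmap = cnn_trunk(image)                        # (1, 2048, 7, 7)
grid_tokens = fmap.flatten(2).transpose(1, 2)  # (1, 49, 2048)

# --- Region: pooled features for detected object boxes ---------------
# Boxes would normally come from an object detector such as Faster
# R-CNN; here we use dummy boxes in image coordinates, mapped to the
# feature map via spatial_scale.
boxes = [torch.tensor([[10., 10., 120., 150.], [50., 60., 200., 210.]])]
region_feats = roi_align(fmap, boxes, output_size=(1, 1),
                         spatial_scale=7 / 224)
region_tokens = region_feats.flatten(1).unsqueeze(0)   # (1, 2, 2048)

# --- Patch: ViT-style linear projection of non-overlapping patches ---
patch_embed = torch.nn.Conv2d(3, 768, kernel_size=16, stride=16)
patch_tokens = patch_embed(image).flatten(2).transpose(1, 2)  # (1, 196, 768)
# A real ViT additionally adds position embeddings and runs a
# Transformer encoder to contextualize these patch tokens.
```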
As the impact of different VEs can vary heavily with the downstream domain and task type, we probe all combinations of the three VEs on six different V+L tasks, covering retrieval, Q&A, and reasoning.
To investigate VE complementarity and feature utilization, we analyze 1) the attention patterns across modalities and VEs, and 2) the dependency on specific VEs when performing VE-dropout during training and inference. While multi-VE setups seem to perform better than single-VE ones (which could partially be attributed to the increased parameter count), we consistently observe performance gaps between different multi-VE configurations (e.g., a gap as large as 8.9 points on the same task) and no single winning combination for all task types. Our analysis of attention patterns across the different VEs reveals that the distinctive information encoded in the VEs is important for different tasks, and that the model composes the representations by enriching a dominant VE with complementary information from the other VEs.
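As a concrete illustration of the VE-dropout analysis, the following sketch masks all tokens of one randomly chosen VE in the multi-VE input so the model cannot rely on a single extractor; the function and argument names are ours and purely illustrative.

```python
# Illustrative VE-dropout sketch on a multi-VE input: with probability p,
# all tokens of one randomly chosen VE are zeroed and masked out.
# Names are illustrative, not taken from a specific codebase.
import random
import torch

def ve_dropout(ve_tokens, ve_masks, p=0.15, training=True):
    """ve_tokens: list of (B, N_i, D) tensors, one per visual encoder.
    ve_masks: list of (B, N_i) attention masks (1 = attend, 0 = ignore)."""
    if training and len(ve_tokens) > 1 and random.random() < p:
        drop = random.randrange(len(ve_tokens))
        ve_tokens, ve_masks = list(ve_tokens), list(ve_masks)
        ve_tokens[drop] = torch.zeros_like(ve_tokens[drop])
        ve_masks[drop] = torch.zeros_like(ve_masks[drop])
    # Concatenate the (possibly dropped) VE sequences along the token axis.
    return torch.cat(ve_tokens, dim=1), torch.cat(ve_masks, dim=1)
```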
To sum up, our results and analysis suggest that VEs trained with different objectives, architectures, and data can have a high impact on the model's V+L task performance. We cannot rely on simple ensemble effects to improve performance; selecting and repurposing off-the-shelf VEs is non-trivial, which emphasizes the necessity of designing VEs explicitly for V+L tasks in the future.
2 Related Work
Multimodal Transformer Architectures. Multimodal Transformer architectures can be divided into single-stream and dual-stream models (Bugliarello et al., 2021). Single-stream Transformers take the concatenated visual and text tokens as input and process them in a modality-agnostic manner, i.e., the self-attention jointly attends over the tokens of both modalities. Dual-stream models use separate Transformers for each modality, which are either connected through a co-attention mechanism (Tan and Bansal, 2019; Lu et al., 2019), combined by a single-stream model on top (Singh et al., 2022; Kamath et al., 2021), or linked asymmetrically, with the image model output used for cross-attention in the text model (Li et al., 2021, 2022).
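A minimal single-stream encoder can be sketched as follows: text and visual tokens, tagged with a modality embedding, are concatenated and processed by one shared Transformer whose self-attention jointly attends over both modalities. Dimensions and module choices are illustrative and do not correspond to a particular pre-trained model.

```python
# Minimal single-stream sketch: one shared Transformer over the
# concatenation of text and visual tokens. Illustrative dimensions only.
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    def __init__(self, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        # Modality (token type) embeddings: index 0 = text, 1 = visual.
        self.type_embed = nn.Embedding(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, N_t, D), visual_tokens: (B, N_v, D)
        text = text_tokens + self.type_embed.weight[0]
        vis = visual_tokens + self.type_embed.weight[1]
        joint = torch.cat([text, vis], dim=1)   # (B, N_t + N_v, D)
        return self.encoder(joint)              # jointly attended sequence
```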
The Faster R-CNN (Ren et al., 2015) object detector has been the dominant choice for multimodal models as a Region VE, with most methods using it as a static feature extractor (Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2020; Chen et al., 2020; Gan et al., 2020; Li et al., 2020b; Zhang et al., 2021; Cho et al., 2021), the notable exception being Su et al. (2020), who backpropagate through the Faster R-CNN model. Less popular VEs are Grid (Huang et al., 2020; Kamath et al., 2021; Yan et al., 2021a; Shen et al., 2022; Eichenberg et al., 2022) and Patch (Kim et al., 2021; Wang et al., 2022; Eichenberg et al., 2022). In contrast to Region VEs, Grid and Patch VEs are commonly fine-tuned on the target V+L task, with Yan et al. (2021a) being a notable exception. Following Bugliarello et al. (2021) and Hendricks et al. (2021), we focus on single-stream models, as they have been shown to perform on par with dual-stream models while being easier to extend to multi-VE setups.
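The two usage modes contrasted above can be sketched as follows: a frozen VE acts as a static feature extractor outside the gradient flow, while a fine-tuned VE receives gradients from the downstream V+L loss. The ResNet backbone here is an illustrative stand-in for the actual Region, Grid, or Patch encoders.

```python
# Sketch: frozen (static) VE vs. fine-tuned VE. Illustrative backbone only.
import torch
import torchvision

# Frozen VE: parameters are excluded from optimization and the forward
# pass runs without gradients (static feature extraction).
frozen_ve = torchvision.models.resnet50(weights=None).eval()
frozen_ve.requires_grad_(False)
with torch.no_grad():
    static_out = frozen_ve(torch.randn(2, 3, 224, 224))

# Fine-tuned VE: parameters stay trainable and are updated by gradients
# flowing back from the downstream V+L objective.
tuned_ve = torchvision.models.resnet50(weights=None)
optimizer = torch.optim.AdamW(tuned_ve.parameters(), lr=1e-5)
out = tuned_ve(torch.randn(2, 3, 224, 224))
out.sum().backward()      # stand-in for the real task loss
optimizer.step()
```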
Comparing and Combining VEs. Recently, several works have compared different VEs for V+L tasks. Jiang et al. (2020) compare Region and Grid VEs for visual QA tasks, showing that training data, objectives, and other factors all affect downstream task performance. Shen et al. (2022) and Eichenberg et al. (2022) compare different pre-trained Grid and Patch VEs built on CLIP (Radford et al., 2021). Zhang et al. (2021) compare different design choices for Region VEs with Grid VEs trained on the same data. Dai et al. (2023) compare how different VEs influence object hallucination in caption generation. Closest to our work is Yan et al. (2021b). While they also experiment with combining representations of Grid, Patch, and Region VEs, they only focus on the Visual Question Answering (VQA; Goyal et al., 2017) dataset and only use the combination of all three VEs. Our work provides a more in-depth evaluation of different multi-VE setups while experimenting with six diverse tasks, and shows that different combinations work best for different tasks.
Analysis of Multimodal Transformers. Our anal-