Compressing and Debiasing Vision-Language Pre-Trained Models
for Visual Question Answering
Qingyi Si1,2∗, Yuanxin Liu3∗, Zheng Lin1,2†
Peng Fu1, Yanan Cao1,2, Weiping Wang1
1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3National Key Laboratory for Multimedia Information Processing,
School of Computer Science, Peking University
{siqingyi,linzheng,fupeng,caoyanan,wangweiping}@iie.ac.cn, liuyuanxin@stu.pku.edu.cn
Abstract
Despite the excellent performance of vision-language pre-trained models (VLPs) on the conventional VQA task, they still suffer from two problems: First, VLPs tend to rely on language biases in datasets and fail to generalize to out-of-distribution (OOD) data. Second, they are inefficient in terms of memory footprint and computation. Although promising progress has been made in both problems, most existing works tackle them independently. To facilitate the application of VLP to VQA tasks, it is imperative to jointly study VLP compression and OOD robustness, which, however, has not yet been explored. This paper investigates whether a VLP can be compressed and debiased simultaneously by searching sparse and robust subnetworks. To this end, we systematically study the design of a training and compression pipeline to search the subnetworks, as well as the assignment of sparsity to different modality-specific modules. Our experiments involve 3 VLPs, 2 compression methods, 4 training methods, 2 datasets and a range of sparsity levels. Our results show that there indeed exist sparse and robust subnetworks, which are competitive with the debiased full VLP and clearly outperform the debiasing SoTAs with fewer parameters on the OOD datasets VQA-CP v2 and VQA-VS.1
1 Introduction
Visual Question Answering (VQA) (Antol et al., 2015) is an important task at the intersection of CV and NLP. In the last decade, deep neural networks have made promising progress in VQA. However, recent studies (Agrawal et al., 2016; Manjunatha et al., 2019) have found that VQA models are prone to dataset biases. As a result, they always suffer from sharp performance drops when faced with out-of-distribution (OOD) test datasets, whose answer distributions are different from the training set.
∗ Equal contribution.
† Corresponding author: Zheng Lin.
1 The code can be found at https://github.com/PhoebusSi/Compress-Robust-VQA.
[Figure 1 plot: OOD accuracy (45%-67%) vs. number of parameters (0-200M). Points shown: RUBi, VGQE, LPF(S-MRL), CF-VQA(S-MRL), DLR, LMH, CF-VQA(UpDn), LPF(UpDn), CGE, lxmert, lxmert(lmh), lxmert(lpf), and Ours at 10%-90% sparsity.]
Figure 1: Comparison of accuracy and model sizes
with debiasing SoTAs on VQA-CP v2. The green and
cyan lines represent our "lxmert(lpf) + mask train(lmh)"
and "lxmert(lmh) + mask train(lmh)", respectively, with
modality-specific sparsity.
Although large-scale vision-language pre-trained models (VLPs) achieve further improvements on the in-distribution (ID) VQA benchmark (Goyal et al., 2017), they also fail to address the dataset-bias problem (Agrawal et al., 2018), e.g., lxmert (Tan and Bansal, 2019) suffers a 23.26% drop between ID and OOD accuracy. At the same time, the improvement brought by VLPs is partly due to their large model size, which increases the computational cost of deploying VQA models.
To facilitate the application of VLPs to VQA tasks, the two problems should be addressed simultaneously. However, existing research mostly focuses on each of them separately.
The dataset-bias problem in VQA is well studied by numerous debiasing methods based on conventional small-scale models (Anderson et al., 2018; Cadene et al., 2019). Their main solution (Cadene et al., 2019; Clark et al., 2019; Liang et al., 2021b; Mahabadi and Henderson, 2019) is to regularize the loss according to the bias degree of training samples. In terms of the increased computational cost, a line of recent efforts has been made to compress pre-trained language models (PLMs) in the NLP field (Chen et al., 2020b; Li et al., 2020a,b; Liang et al., 2021a; Liu et al., 2021, 2022; Prasanna et al., 2020) and VLPs for visual-linguistic tasks (Fang et al., 2021; Gan et al., 2022). They show that large-scale PLMs and VLPs can be compressed into lightweight models without degrading performance. Refer to App. A for more related work.

arXiv:2210.14558v2 [cs.CV] 11 Oct 2023
This paper jointly studies the compression and debiasing problems of VLP for the VQA task. To this end, we combine existing debiasing and pruning methods to establish a training and compression pipeline, and conduct extensive experiments with the pre-trained lxmert, the most popular VLP in VQA, under different OOD settings. We show that there exist sparse lxmert subnetworks that are more robust than the full model, which suggests that the goals of OOD robustness and computational efficiency can be achieved simultaneously.
We also present a comprehensive study on the design of the training and compression pipeline, as well as the assignment of sparsity to different model modules, to identify subnetworks with better OOD generalization. Our findings highlight the importance of: 1) employing a two-stage training and compression pipeline and integrating the debiasing objective throughout the entire process; 2) when two debiasing methods both work well with the full model, training the full model with the relatively poor-performing one and compressing it with the better one; 3) assigning modality-specific sparsity to different modules of VLP.
Our main contributions are as follows: (1) We present the first (to our knowledge) systematic study of sparsity and OOD robustness for VLPs. (2) Our empirical studies on the training and compression pipeline and sparsity assignment can serve as a valuable guideline for the future design of VLP subnetwork searching methods. (3) We obtain subnetworks that outperform existing debiasing SoTAs in terms of the trade-off between accuracy and model size on the OOD datasets VQA-CP v2 and VQA-VS (see Fig. 1, Tab. 1 and Tab. 2).
2 Method
2.1 VLP Architecture and Subnetworks
This section takes lxmert as an example to introduce how we extract subnetworks. Lxmert contains an embedding layer, a visual fc layer, a pooler layer, a VQA-specific classifier and a stack of Transformer layers, which involve three encoders: the language encoder (L_enc), the object relationship encoder (R_enc) and the cross-modality encoder (C_enc).
We adopt unstructured pruning to obtain a compressed version (i.e., a subnetwork) of the original VLPs. Specifically, given a VLP f(θ) with parameters θ, we apply a binary pruning mask m ∈ {0,1}^|θ| to the model parameters, which gives rise to f(m ⊙ θ), where ⊙ is the element-wise product. The parameters to be pruned are:

θ_pr = {W_emb, W_vis-fc, W_plr} ∪ θ_Lenc ∪ θ_Renc ∪ θ_Xenc   (1)

where W_emb, W_vis-fc and W_plr are the weights of the embedding layer, visual fc layer and pooler layer, and θ_Lenc, θ_Renc and θ_Xenc are the parameters of the Transformer layers. More details of lxmert can be found in App. B.1. Another model, visualBERT (Li et al., 2019), which is also used in our experiments, is introduced in App. B.2.
2.2 Pruning Methods
We consider two representative pruning methods, i.e., magnitude-based pruning (Han et al., 2015) and mask training (Louizos et al., 2018; Ramanujan et al., 2020; Sanh et al., 2020; Sehwag et al., 2020).
Magnitude-based Pruning approximates the importance of model parameters by their absolute values and eliminates the less important ones. We adopt the basic version of magnitude-based pruning, i.e., one-shot magnitude pruning (OMP). OMP can optionally be combined with further fine-tuning of the pruned subnetwork to recover the performance drop.
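As a minimal sketch (not the authors' released code), OMP at sparsity s can be expressed as selecting a binary mask from the weights with the smallest magnitude; the toy weights below are hypothetical:

```python
def magnitude_mask(weights, sparsity):
    """One-shot magnitude pruning (OMP): keep the (1 - sparsity)
    fraction of weights with the largest absolute value."""
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)            # number of weights to remove
    threshold = flat[k - 1] if k > 0 else float("-inf")
    # m[i] = 1 keeps weight i; m[i] = 0 prunes it
    return [0 if abs(w) <= threshold else 1 for w in weights]

weights = [0.3, -0.05, 0.8, 0.01, -0.6, 0.2]
mask = magnitude_mask(weights, sparsity=0.5)     # [1, 0, 1, 0, 1, 0]
pruned = [m * w for m, w in zip(mask, weights)]  # the subnetwork f(m ⊙ θ)
```

In the paper's notation, `pruned` corresponds to m ⊙ θ; further fine-tuning (the optional "bce/lmh ft" step) would then continue training only these surviving weights.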
Mask Training directly optimizes the binary pruning mask m towards the given objectives. Specifically, each weight matrix W ∈ R^{d_i×d_o} is associated with two mask matrices, namely a binary mask m ∈ {0,1}^{d_i×d_o} and a real-valued mask m̂ ∈ R^{d_i×d_o}. In the forward propagation, m is computed from m̂ through binarization:

m_{i,j} = 1 if m̂_{i,j} ≥ φ, else 0   (2)

where φ is the threshold. Then, the original weight matrix W is replaced with a pruned one m ⊙ W. When it comes to backward propagation, we follow (Liu et al., 2022; Mallya et al., 2018; Radiya-Dixit and Wang, 2020; Zhao et al., 2020) and use the straight-through estimator (Bengio et al., 2013) to estimate the gradients of m̂ using the gradients of m, and then update m̂ as m̂ ← m̂ − η(∂L/∂m), where η is the learning rate.
We initialize m̂ according to the magnitudes of the pre-trained weights of lxmert. This strategy is shown to be more effective than random initialization for pre-trained language models (Liu et al., 2022; Radiya-Dixit and Wang, 2020) and we also validate this in our experiments with lxmert (see App. C.2). Specifically, m̂ is initialized as:

m̂_{i,j} = 0 if W_{i,j} is pruned by OMP, else α × φ   (3)

where α ≥ 1 is a hyper-parameter. At initialization, we set the threshold φ = 0.01 (any other value with the same order of magnitude should also be fine). To ensure that the subnetwork satisfies the given sparsity, φ is re-computed every t_m training steps.
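The forward binarization (Eq. 2) and the periodic threshold recomputation can be sketched numerically as follows. This is an illustrative sketch with made-up values, not the paper's implementation; in particular, the gradient dL/dm is a hypothetical placeholder standing in for what backpropagation through the straight-through estimator would supply:

```python
def binarize(m_hat, phi):
    """Eq. 2: m[i] = 1 where the real-valued mask reaches threshold phi."""
    return [1 if v >= phi else 0 for v in m_hat]

def recompute_threshold(m_hat, sparsity):
    """Pick phi so that exactly `sparsity` of the entries binarize to 0,
    keeping the subnetwork at the target sparsity (done every t_m steps)."""
    ordered = sorted(m_hat)
    k = int(len(ordered) * sparsity)   # number of entries to zero out
    return ordered[k] if k < len(ordered) else float("inf")

m_hat = [0.03, 0.002, 0.05, 0.008, 0.02, 0.001]
phi = recompute_threshold(m_hat, sparsity=0.5)   # keep the top 50%
m = binarize(m_hat, phi)

# Straight-through update: the gradient w.r.t. the binary m is applied
# to the real-valued m_hat directly (grad_m is a hypothetical dL/dm).
grad_m = [0.1, -0.2, 0.0, -0.3, 0.05, -0.1]
eta = 0.1
m_hat = [v - eta * g for v, g in zip(m_hat, grad_m)]
```

Because binarization has zero gradient almost everywhere, the straight-through estimator is what makes the mask trainable at all; re-deriving phi from the sorted m̂ values is one simple way to hold the sparsity constraint.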
2.3 Debiasing Methods
The debiasing methods in VQA usually contain a main model and a biased model. The biased model, which learns the language bias, is used to measure the bias degree of training samples and adjust the training loss for the main model. We experiment with SoTA debiasing methods, i.e., LMH (Clark et al., 2019), RUBi (Cadene et al., 2019) and LPF (Liang et al., 2021b), of which LMH is widely studied for the OOD scenario of VQA (Chen et al., 2020a; Liang et al., 2020; Si et al., 2021) and NLU (Jia and Liang, 2017; McCoy et al., 2019; Schuster et al., 2019; Zhang et al., 2019). For comparison, we also describe the binary cross-entropy here.
Binary Cross-Entropy (BCE) computes the cross-entropy between the predicted distribution p_m (from the main model) and the soft target score t of each ground-truth answer, as:

L_bce = −[t · log(δ(p_m)) + (1 − t) · log(1 − δ(p_m))]   (4)

where δ denotes the sigmoid function.
Learned-Mixin +H (LMH) adds a biased model to learn biases during training, as follows:

p̂_deb = softmax(log(p_m) + g(h) · log(p_b)),  g(h) = softplus(w · h)   (5)
where p_b and p_m are the predicted distributions of the biased model and the main model, respectively. g(h) determines how much to trust the learned biases, based on lxmert's last hidden representation h. Following (Clark et al., 2019), we directly use the answers' frequency under each question type as p_b (see footnote 2). To prevent p_b from being ignored, LMH also adds an entropy penalty item R to the final loss:

L_lmh = −[t · log(δ(p̂_deb)) + (1 − t) · log(1 − δ(p̂_deb))] + R   (6)
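As a numeric sketch of the LMH ensemble in Eq. 5 (the toy distributions and the scalar w·h below are made-up values, not from the paper):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def softplus(x):
    return math.log(1.0 + math.exp(x))

def lmh_ensemble(p_m, p_b, wh):
    """Eq. 5: combine main-model and biased-model predictions in log space;
    g(h) = softplus(w·h) scales how much the learned bias is trusted."""
    g = softplus(wh)
    logits = [math.log(pm) + g * math.log(pb) for pm, pb in zip(p_m, p_b)]
    return softmax(logits)

p_m = [0.2, 0.5, 0.3]   # main model distribution (toy)
p_b = [0.7, 0.2, 0.1]   # bias model: answer frequency per question type (toy)
p_deb = lmh_ensemble(p_m, p_b, wh=0.0)   # g = softplus(0) = ln 2
```

Training against p̂_deb (Eq. 6) means the main model gets little credit for answers the bias model already predicts, which is what discourages it from exploiting the language prior.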
RUBi adopts a training strategy similar to LMH to regularize the main model's probability, and uses the standard cross-entropy as the training loss:

p̂_deb = softmax(p_m · δ(p_b)),  L_rubi = −(1/N) Σ_{k=1}^{N} log(p̂_deb)[a_k]   (7)

LPF measures the bias degree as α_k = p_b[a_k] to regularize the loss of the main model:

L_lpf = −(1/N) Σ_{k=1}^{N} (1 − α_k)^γ · log(softmax(p_m))[a_k]   (8)

where γ is a tunable hyper-parameter.
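A toy sketch of the LPF reweighting in Eq. 8 (distributions are made-up): the focal-style factor (1 − α_k)^γ shrinks the loss on samples the bias model already answers confidently, so training concentrates on the less biased samples.

```python
def lpf_weight(p_b, answer_idx, gamma=2.0):
    """(1 - alpha_k)^gamma with alpha_k = p_b[a_k] (Eq. 8)."""
    alpha = p_b[answer_idx]
    return (1.0 - alpha) ** gamma

# Bias model is confident on sample A (alpha = 0.9) but unsure on B (alpha = 0.1),
# so sample A's loss is scaled down far more than sample B's.
w_biased = lpf_weight([0.9, 0.05, 0.05], answer_idx=0)
w_unbiased = lpf_weight([0.1, 0.45, 0.45], answer_idx=0)
```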
2.4 Problem Formulation
Given the pre-trained lxmert f(θ_pt), our goal is to find a subnetwork f(m ⊙ θ_ft) that satisfies a target sparsity level s and maximizes the OOD performance:

max_{m, θ_ft} E_OOD(f(m ⊙ θ_ft)),  s.t. ∥m∥_0 / |θ_pr| = 1 − s   (9)

where E_OOD denotes OOD evaluation, ∥·∥_0 is the L0 norm and |θ_pr| is the total number of parameters in θ_pr. This goal is achieved by searching for the optimal m and θ_ft in model training and compression.
Eq. 9 only specifies the overall sparsity. In this work, we also explore a finer-grained control over sparsity, which allocates different sparsity to different modules of lxmert, given that the overall sparsity is satisfied. Concretely, we consider three modules from different modalities, i.e., the language module, the visual module and the cross-modality module. The constraint in the optimization problem is then rewritten as (see footnote 3):

s.t. ∥m_Lan∥_0 / |θ_Lan| = 1 − s_L,  ∥m_Vis∥_0 / |θ_Vis| = 1 − s_R,  ∥m_X∥_0 / |θ_Xenc| = 1 − s_X,
s_L · |θ_Lan| / |θ_pr| + s_R · |θ_Vis| / |θ_pr| + s_X · |θ_Xenc| / |θ_pr| = s   (10)
where θ_Lan = θ_Lenc ∪ {W_emb}, θ_Vis = θ_Renc ∪ {W_vis-fc} and θ_Xenc are the model parameters of the language module, visual module and cross-modality encoder, respectively. m_Lan, m_Vis and m_X are the binary masks for the three modules, respectively. s_L, s_R and s_X are the target sparsity levels for the three modules, respectively.

2 We use the same p_b in our implementation of LMH, RUBi and LPF. More details of LMH can be found in App. B.3.
3 For simplicity, the pooler layer's parameters (0.5M) are not included in Eq. 10. We directly set it to the target sparsity s.

[Figure 2 plot: accuracy-vs-sparsity curves for "lxmert(bce/lmh) + mask train(bce/lmh)", "lxmert(bce/lmh) + OMP + bce/lmh ft" and the full lxmert(bce/lmh) baselines.]
Figure 2: Results of subnetworks from the BCE fine-tuned lxmert (left) and from the LMH fine-tuned lxmert (right) on VQA-CP v2. "lxmert(bce/lmh)" denotes full model fine-tuning in Stage1, "mask train(bce/lmh)" and "OMP" denote pruning in Stage2. "bce/lmh ft" denotes further fine-tuning in Stage3. "Gap" denotes the improvement of mask train(bce/lmh) over full lxmert(bce/lmh). The shadowed areas denote standard deviations. These abbreviations are used throughout this paper. Detailed performance on three question types is shown in App. C.1.
If not otherwise specified, we set the sparsity of every weight matrix to the target sparsity. For example, if s = 70% and there is no modality-specific constraint, then all weight matrices are at 70% sparsity (uniform sparsity). If s_L = 50%, then all weight matrices in θ_Lan are at 50% sparsity, while s_R and s_X could be different (modality-specific sparsity).
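Given two of the modality-specific sparsities, the third is fixed by the overall budget in Eq. 10. A small sketch of that bookkeeping (the parameter counts are made-up, not lxmert's real module sizes):

```python
def solve_third_sparsity(s, s_l, s_r, n_lan, n_vis, n_x):
    """Solve s_X from the budget constraint of Eq. 10:
    s_L*|θ_Lan|/|θ_pr| + s_R*|θ_Vis|/|θ_pr| + s_X*|θ_Xenc|/|θ_pr| = s."""
    n_pr = n_lan + n_vis + n_x
    return (s * n_pr - s_l * n_lan - s_r * n_vis) / n_x

# Toy parameter counts in millions (hypothetical, not lxmert's actual sizes):
n_lan, n_vis, n_x = 100, 50, 50
s_x = solve_third_sparsity(s=0.7, s_l=0.6, s_r=0.7,
                           n_lan=n_lan, n_vis=n_vis, n_x=n_x)
```

Here keeping the language module denser (s_L = 60%) forces the cross-modality module to be sparser (s_X = 90%) so the overall 70% sparsity still holds.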
2.5 Training and Compression Pipeline
We define two notations: F_L(f(θ)) denotes training f(θ) using loss L ∈ {L_bce, L_lmh}. P^p_L(f(θ)) denotes pruning f(θ) using method p ∈ {OMP, mask train} and loss L (if applicable), which outputs a pruning mask m. A typical training and compression pipeline involves three stages:
Stage1: Full Model Fine-tuning. The pre-trained lxmert f(θ_pt) is fine-tuned using loss L, which produces f(θ_ft) = F_L(f(θ_pt)).
Stage2: Model Compression. The fine-tuned lxmert f(θ_ft) is compressed and we get the subnetwork f(m ⊙ θ_ft), where m = P^p_L(f(θ_ft)).
Stage3: Further Fine-tuning (optional). The subnetwork f(m ⊙ θ_ft) is further fine-tuned using loss L, which gives f(m ⊙ θ′_ft) = F_L(f(m ⊙ θ_ft)).
3 Experiments
In this section, we mainly investigate three questions: (1) How does compression affect lxmert's OOD generalization ability? (2) How to design the training and pruning pipeline to achieve a good sparsity-performance trade-off? (3) How to assign sparsity to different modality-specific modules?
3.1 Datasets, Model and Implementation
We conduct experiments on the OOD benchmarks VQA-CP v2 (Agrawal et al., 2018) and VQA-VS (Si et al., 2022b), which evaluate the robustness of VQA systems, with the accuracy-based evaluation metric (Antol et al., 2015). A more detailed discussion of the difference between the two datasets is given in Sec. 3.5. We thoroughly study the above three questions on VQA-CP v2, which is widely used in the literature on debiasing VQA systems (refer to Sec. 3.2, 3.3 and 3.4). Then, based on the findings, we further explore the more challenging VQA-VS (Si et al., 2022b) (refer to Sec. 3.5). For the VLP, we adopt the lxmert-base-uncased model (Tan and Bansal, 2019) released by huggingface (Wolf et al., 2020). All results are averaged over 4 random seeds. More information about the model and implementation details is given in App. B.4.
3.2 Effect of Compression on OOD Accuracy
Subnetworks from BCE Fine-tuned lxmert. We compress the BCE fine-tuned lxmert using OMP and mask training, and introduce either L_bce or L_lmh in the pruning process (for mask training) or the further fine-tuning process (for OMP).
The results are shown in the upper row of Fig. 2. We can derive several observations: 1) When no debiasing methods are used, the subnetworks of "mask train(bce)" and "OMP + bce ft" improve over the full lxmert by 1.35% ~ 2.79%, even at up to 70% sparsity. This implies that lxmert is overparameterized and pruning may remove some parameters related to the bias features. 2) "mask train(lmh)" and "OMP + lmh ft" achieve a further performance boost, exceeding the full lxmert by a large margin (11.05% ~ 14.02%). Since mask training does not change the value of parameters, the