Compressing and Debiasing Vision-Language Pre-Trained Models
for Visual Question Answering
Qingyi Si1,2∗, Yuanxin Liu3∗, Zheng Lin1,2†
Peng Fu1, Yanan Cao1,2, Weiping Wang1
1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3National Key Laboratory for Multimedia Information Processing,
School of Computer Science, Peking University
{siqingyi,linzheng,fupeng,caoyanan,wangweiping}@iie.ac.cn, liuyuanxin@stu.pku.edu.cn
Abstract
Despite the excellent performance of vision-language pre-trained models (VLPs) on the conventional VQA task, they still suffer from two problems: First, VLPs tend to rely on language biases in datasets and fail to generalize to out-of-distribution (OOD) data. Second, they are inefficient in terms of memory footprint and computation. Although promising progress has been made in both problems, most existing works tackle them independently. To facilitate the application of VLP to VQA tasks, it is imperative to jointly study VLP compression and OOD robustness, which, however, has not yet been explored. This paper investigates whether a VLP can be compressed and debiased simultaneously by searching sparse and robust subnetworks. To this end, we systematically study the design of a training and compression pipeline to search the subnetworks, as well as the assignment of sparsity to different modality-specific modules. Our experiments involve 3 VLPs, 2 compression methods, 4 training methods, 2 datasets and a range of sparsity levels. Our results show that there indeed exist sparse and robust subnetworks, which are competitive with the debiased full VLP and clearly outperform the debiasing SoTAs with fewer parameters on the OOD datasets VQA-CP v2 and VQA-VS.1
1 Introduction
Visual Question Answering (VQA) (Antol et al., 2015) is an important task at the intersection of CV and NLP. In the last decade, deep neural networks have made promising progress in VQA. However, recent studies (Agrawal et al., 2016; Manjunatha et al., 2019) have found that VQA models are prone to dataset biases. As a result, they always suffer from sharp performance drops when faced with out-of-distribution (OOD) test datasets, whose answer distributions are different from the training set.
∗ Equal contribution.
† Corresponding author: Zheng Lin.
1 The code can be found at https://github.com/PhoebusSi/Compress-Robust-VQA.
[Figure 1 plot: OOD accuracy (45%-67%) vs. number of parameters (0-200M). Points shown: RUBi, VGQE, LPF(S-MRL), CF-VQA(S-MRL), DLR, LMH, CF-VQA(UpDn), LPF(UpDn), CGE, lxmert, lxmert(lmh), lxmert(lpf), and Ours at 10%-90% sparsity.]
Figure 1: Comparison of accuracy and model sizes
with debiasing SoTAs on VQA-CP v2. The green and
cyan lines represent our "lxmert(lpf) + mask train(lmh)"
and "lxmert(lmh) + mask train(lmh)", respectively, with
modality-specific sparsity.
Although large-scale vision-language pre-trained models (VLPs) achieve further improvements on the in-distribution (ID) VQA benchmark (Goyal et al., 2017), they also fail to address the dataset-bias problem (Agrawal et al., 2018), e.g., lxmert (Tan and Bansal, 2019) suffers a 23.26% drop between ID and OOD accuracy. At the same time, the improvement brought by VLPs is partly due to their large model size, which increases the computational cost of deploying VQA models.
To facilitate the application of VLPs to VQA tasks, the two problems should be addressed simultaneously. However, existing research mostly focuses on each of them separately.
The dataset-bias problem in VQA is well studied by numerous debiasing methods based on conventional small-scale models (Anderson et al., 2018; Cadene et al., 2019). Their main solution (Cadene et al., 2019; Clark et al., 2019; Liang et al., 2021b; Mahabadi and Henderson, 2019) is to regularize the loss according to the bias degree of training samples. In terms of the increased computational cost, a line of recent efforts has been made to compress pre-trained language models (PLMs) in the NLP field (Chen et al., 2020b; Li et al., 2020a,b; Liang et al., 2021a; Liu et al., 2021, 2022; Prasanna et al., 2020) and VLPs for visual-linguistic tasks (Fang et al., 2021; Gan et al., 2022). They show that large-scale PLMs and VLPs can be compressed into lightweight models without degrading performance. Refer to App. A for more related work.

arXiv:2210.14558v2 [cs.CV] 11 Oct 2023
This paper jointly studies the compression and debiasing problems of VLP for the VQA task. To this end, we combine existing debiasing and pruning methods to establish a training and compression pipeline, and conduct extensive experiments with the pre-trained lxmert, the most popular VLP in VQA, under different OOD settings. We show that there exist sparse lxmert subnetworks that are more robust than the full model, which suggests that the goals of OOD robustness and computational efficiency can be achieved simultaneously.
We also present a comprehensive study on the design of the training and compression pipeline, as well as the assignment of sparsity to different model modules, to identify subnetworks with better OOD generalization. Our findings highlight the importance of: 1) employing a two-stage training and compression pipeline and integrating the debiasing objective throughout the entire process; 2) when two debiasing methods both work well with the full model, training the full model with the relatively poor-performing one and compressing it with the better one; 3) assigning modality-specific sparsity to different modules of VLP.
Our main contributions are as follows: (1) We present the first (to our knowledge) systematic study of sparsity and OOD robustness for VLPs. (2) Our empirical studies on the training and compression pipeline and sparsity assignment can serve as a valuable guideline for the future design of VLP subnetwork searching methods. (3) We obtain subnetworks that outperform existing debiasing SoTAs in terms of the trade-off between accuracy and model size on the OOD datasets VQA-CP v2 and VQA-VS (see Fig. 1, Tab. 1 and Tab. 2).
2 Method
2.1 VLP Architecture and Subnetworks
This section takes lxmert as an example to introduce how we extract subnetworks. Lxmert contains an embedding layer, a visual fc layer, a pooler layer, a VQA-specific classifier and a stack of Transformer layers, which involve three encoders: the language encoder (L_enc), the object relationship encoder (R_enc) and the cross-modality encoder (C_enc).
We adopt unstructured pruning to obtain a compressed version (i.e., a subnetwork) of the original VLPs. Specifically, given a VLP f(θ) with parameters θ, we apply a binary pruning mask m ∈ {0,1}^|θ| to the model parameters, which gives rise to f(m ⊙ θ), where ⊙ is the element-wise product. The parameters to be pruned are:

θ_pr = {W_emb, W_vis-fc, W_plr} ∪ θ_Lenc ∪ θ_Renc ∪ θ_Xenc   (1)

where W_emb, W_vis-fc and W_plr are the weights of the embedding layer, visual fc layer and pooler layer, and θ_Lenc, θ_Renc and θ_Xenc are the parameters of the Transformer layers. More details of lxmert can be found in App. B.1. Another model, visualBERT (Li et al., 2019), which is also used in our experiments, is introduced in App. B.2.
2.2 Pruning Methods
We consider two representative pruning methods, i.e., magnitude-based pruning (Han et al., 2015) and mask training (Louizos et al., 2018; Ramanujan et al., 2020; Sanh et al., 2020; Sehwag et al., 2020).
Magnitude-based Pruning approximates the importance of model parameters by their absolute values and eliminates the less important ones. We adopt the basic version of magnitude-based pruning, i.e., one-shot magnitude pruning (OMP). OMP can optionally be combined with further fine-tuning of the pruned subnetwork to recover the performance drop.
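As a minimal sketch (not the authors' released code), OMP at sparsity s can be expressed as selecting a binary mask from the weights with the smallest magnitude; the toy weights below are hypothetical:

```python
def magnitude_mask(weights, sparsity):
    """One-shot magnitude pruning (OMP): keep the (1 - sparsity)
    fraction of weights with the largest absolute value."""
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)            # number of weights to remove
    threshold = flat[k - 1] if k > 0 else float("-inf")
    # m[i] = 1 keeps weight i; m[i] = 0 prunes it
    return [0 if abs(w) <= threshold else 1 for w in weights]

weights = [0.3, -0.05, 0.8, 0.01, -0.6, 0.2]
mask = magnitude_mask(weights, sparsity=0.5)     # [1, 0, 1, 0, 1, 0]
pruned = [m * w for m, w in zip(mask, weights)]  # the subnetwork f(m ⊙ θ)
```

In the paper's notation, `pruned` corresponds to m ⊙ θ; further fine-tuning (the optional "bce/lmh ft" step) would then continue training only these surviving weights.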
Mask Training directly optimizes the binary pruning mask m towards the given objectives. Specifically, each weight matrix W ∈ R^{d_i×d_o} is associated with two mask matrices, namely a binary mask m ∈ {0,1}^{d_i×d_o} and a real-valued mask m̂ ∈ R^{d_i×d_o}. In the forward propagation, m is computed from m̂ through binarization:

m_{i,j} = 1 if m̂_{i,j} ≥ φ, else 0   (2)

where φ is the threshold. Then, the original weight matrix W is replaced with a pruned one m ⊙ W. When it comes to backward propagation, we follow (Liu et al., 2022; Mallya et al., 2018; Radiya-Dixit and Wang, 2020; Zhao et al., 2020) and use the straight-through estimator (Bengio et al., 2013) to estimate the gradients of m̂ using the gradients of m, and then update m̂ as m̂ ← m̂ − η(∂L/∂m), where η is the learning rate.
We initialize m̂ according to the magnitudes of the pre-trained weights of lxmert. This strategy is shown to be more effective than random initialization for pre-trained language models (Liu et al., 2022; Radiya-Dixit and Wang, 2020) and we also validate this in our experiments with lxmert (see App. C.2). Specifically, m̂ is initialized as:

m̂_{i,j} = 0 if W_{i,j} is pruned by OMP, else α × φ   (3)

where α ≥ 1 is a hyper-parameter. At initialization, we set the threshold φ = 0.01 (any other value with the same order of magnitude should also be fine). To ensure that the subnetwork satisfies the given sparsity, φ is re-computed every t_m training steps.
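The forward binarization (Eq. 2) and the periodic threshold recomputation can be sketched numerically as follows. This is an illustrative sketch with made-up values, not the paper's implementation; in particular, the gradient dL/dm is a hypothetical placeholder standing in for what backpropagation through the straight-through estimator would supply:

```python
def binarize(m_hat, phi):
    """Eq. 2: m[i] = 1 where the real-valued mask reaches threshold phi."""
    return [1 if v >= phi else 0 for v in m_hat]

def recompute_threshold(m_hat, sparsity):
    """Pick phi so that exactly `sparsity` of the entries binarize to 0,
    keeping the subnetwork at the target sparsity (done every t_m steps)."""
    ordered = sorted(m_hat)
    k = int(len(ordered) * sparsity)   # number of entries to zero out
    return ordered[k] if k < len(ordered) else float("inf")

m_hat = [0.03, 0.002, 0.05, 0.008, 0.02, 0.001]
phi = recompute_threshold(m_hat, sparsity=0.5)   # keep the top 50%
m = binarize(m_hat, phi)

# Straight-through update: the gradient w.r.t. the binary m is applied
# to the real-valued m_hat directly (grad_m is a hypothetical dL/dm).
grad_m = [0.1, -0.2, 0.0, -0.3, 0.05, -0.1]
eta = 0.1
m_hat = [v - eta * g for v, g in zip(m_hat, grad_m)]
```

Because binarization has zero gradient almost everywhere, the straight-through estimator is what makes the mask trainable at all; re-deriving phi from the sorted m̂ values is one simple way to hold the sparsity constraint.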
2.3 Debiasing Methods
The debiasing methods in VQA usually contain a main model and a biased model. The biased model, which learns the language bias, is used to measure the bias degree of training samples and adjust the training loss for the main model. We experiment with SoTA debiasing methods, i.e., LMH (Clark et al., 2019), RUBi (Cadene et al., 2019) and LPF (Liang et al., 2021b), of which LMH is widely studied for the OOD scenario of VQA (Chen et al., 2020a; Liang et al., 2020; Si et al., 2021) and NLU (Jia and Liang, 2017; McCoy et al., 2019; Schuster et al., 2019; Zhang et al., 2019). For comparison, we also describe the binary cross-entropy here.
Binary Cross-Entropy (BCE) computes the cross-entropy between the predicted distribution p_m (from the main model) and the soft target score t of each ground-truth answer, as:

L_bce = −[t · log(δ(p_m)) + (1 − t) · log(1 − δ(p_m))]   (4)

where δ denotes the sigmoid function.
Learned-Mixin +H (LMH) adds a biased model to learn biases during training, as follows:

p̂_deb = softmax(log(p_m) + g(h) · log(p_b)),  g(h) = softplus(w · h)   (5)
where p_b and p_m are the predicted distributions of the biased model and the main model, respectively. g(h) determines how much to trust the learned biases, based on lxmert's last hidden representation h. Following (Clark et al., 2019), we directly use the answers' frequency under each question type as p_b (see footnote 2). To prevent p_b from being ignored, LMH also adds an entropy penalty item R to the final loss:

L_lmh = −[t · log(δ(p̂_deb)) + (1 − t) · log(1 − δ(p̂_deb))] + R   (6)
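As a numeric sketch of the LMH ensemble in Eq. 5 (the toy distributions and the scalar w·h below are made-up values, not from the paper):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def softplus(x):
    return math.log(1.0 + math.exp(x))

def lmh_ensemble(p_m, p_b, wh):
    """Eq. 5: combine main-model and biased-model predictions in log space;
    g(h) = softplus(w·h) scales how much the learned bias is trusted."""
    g = softplus(wh)
    logits = [math.log(pm) + g * math.log(pb) for pm, pb in zip(p_m, p_b)]
    return softmax(logits)

p_m = [0.2, 0.5, 0.3]   # main model distribution (toy)
p_b = [0.7, 0.2, 0.1]   # bias model: answer frequency per question type (toy)
p_deb = lmh_ensemble(p_m, p_b, wh=0.0)   # g = softplus(0) = ln 2
```

Training against p̂_deb (Eq. 6) means the main model gets little credit for answers the bias model already predicts, which is what discourages it from exploiting the language prior.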
RUBi adopts a training strategy similar to LMH to regularize the main model's probability, and uses the standard cross-entropy as the training loss:

p̂_deb = softmax(p_m · δ(p_b)),  L_rubi = −(1/N) Σ_{k=1}^{N} log(p̂_deb)[a_k]   (7)

LPF measures the bias degree as α_k = p_b[a_k] to regularize the loss of the main model:

L_lpf = −(1/N) Σ_{k=1}^{N} (1 − α_k)^γ · log(softmax(p_m))[a_k]   (8)

where γ is a tunable hyper-parameter.
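A toy sketch of the LPF reweighting in Eq. 8 (distributions are made-up): the focal-style factor (1 − α_k)^γ shrinks the loss on samples the bias model already answers confidently, so training concentrates on the less biased samples.

```python
def lpf_weight(p_b, answer_idx, gamma=2.0):
    """(1 - alpha_k)^gamma with alpha_k = p_b[a_k] (Eq. 8)."""
    alpha = p_b[answer_idx]
    return (1.0 - alpha) ** gamma

# Bias model is confident on sample A (alpha = 0.9) but unsure on B (alpha = 0.1),
# so sample A's loss is scaled down far more than sample B's.
w_biased = lpf_weight([0.9, 0.05, 0.05], answer_idx=0)
w_unbiased = lpf_weight([0.1, 0.45, 0.45], answer_idx=0)
```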
2.4 Problem Formulation
Given the pre-trained lxmert f(θ_pt), our goal is to find a subnetwork f(m ⊙ θ_ft) that satisfies a target sparsity level s and maximizes the OOD performance:

max_{m, θ_ft} E_OOD(f(m ⊙ θ_ft)),  s.t. ∥m∥_0 / |θ_pr| = 1 − s   (9)

where E_OOD denotes OOD evaluation, ∥·∥_0 is the L0 norm and |θ_pr| is the total number of parameters in θ_pr. This goal is achieved by searching for the optimal m and θ_ft in model training and compression.
Eq. 9 only specifies the overall sparsity. In this work, we also explore a finer-grained control over sparsity, which allocates different sparsity to different modules of lxmert, given that the overall sparsity is satisfied. Concretely, we consider three modules from different modalities, i.e., the language module, the visual module and the cross-modality module. The constraint in the optimization problem is then rewritten as (see footnote 3):

s.t. ∥m_Lan∥_0 / |θ_Lan| = 1 − s_L,  ∥m_Vis∥_0 / |θ_Vis| = 1 − s_R,  ∥m_X∥_0 / |θ_Xenc| = 1 − s_X,
s_L · |θ_Lan| / |θ_pr| + s_R · |θ_Vis| / |θ_pr| + s_X · |θ_Xenc| / |θ_pr| = s   (10)
where θ_Lan = θ_Lenc ∪ {W_emb}, θ_Vis = θ_Renc ∪ {W_vis-fc} and θ_Xenc are the model parameters of the language module, visual module and cross-modality encoder, respectively. m_Lan, m_Vis and m_X are the binary masks for the three modules, respectively. s_L, s_R and s_X are the target sparsity levels for the three modules, respectively.

2 We use the same p_b in our implementation of LMH, RUBi and LPF. More details of LMH can be found in App. B.3.
3 For simplicity, the pooler layer's parameters (0.5M) are not included in Eq. 10. We directly set it to the target sparsity s.

[Figure 2 plot: accuracy-vs-sparsity curves for "lxmert(bce/lmh) + mask train(bce/lmh)", "lxmert(bce/lmh) + OMP + bce/lmh ft" and the full lxmert(bce/lmh) baselines.]
Figure 2: Results of subnetworks from the BCE fine-tuned lxmert (left) and from the LMH fine-tuned lxmert (right) on VQA-CP v2. "lxmert(bce/lmh)" denotes full model fine-tuning in Stage1, "mask train(bce/lmh)" and "OMP" denote pruning in Stage2. "bce/lmh ft" denotes further fine-tuning in Stage3. "Gap" denotes the improvement of mask train(bce/lmh) over full lxmert(bce/lmh). The shadowed areas denote standard deviations. These abbreviations are used throughout this paper. Detailed performance on three question types is shown in App. C.1.
If not otherwise specified, we set the sparsity of every weight matrix to the target sparsity. For example, if s = 70% and there is no modality-specific constraint, then all weight matrices are at 70% sparsity (uniform sparsity). If s_L = 50%, then all weight matrices in θ_Lan are at 50% sparsity, while s_R and s_X could be different (modality-specific sparsity).
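Given two of the modality-specific sparsities, the third is fixed by the overall budget in Eq. 10. A small sketch of that bookkeeping (the parameter counts are made-up, not lxmert's real module sizes):

```python
def solve_third_sparsity(s, s_l, s_r, n_lan, n_vis, n_x):
    """Solve s_X from the budget constraint of Eq. 10:
    s_L*|θ_Lan|/|θ_pr| + s_R*|θ_Vis|/|θ_pr| + s_X*|θ_Xenc|/|θ_pr| = s."""
    n_pr = n_lan + n_vis + n_x
    return (s * n_pr - s_l * n_lan - s_r * n_vis) / n_x

# Toy parameter counts in millions (hypothetical, not lxmert's actual sizes):
n_lan, n_vis, n_x = 100, 50, 50
s_x = solve_third_sparsity(s=0.7, s_l=0.6, s_r=0.7,
                           n_lan=n_lan, n_vis=n_vis, n_x=n_x)
```

Here keeping the language module denser (s_L = 60%) forces the cross-modality module to be sparser (s_X = 90%) so the overall 70% sparsity still holds.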
2.5 Training and Compression Pipeline
We define two notations: F_L(f(θ)) denotes training f(θ) using loss L ∈ {L_bce, L_lmh}. P^p_L(f(θ)) denotes pruning f(θ) using method p ∈ {OMP, mask train} and loss L (if applicable), which outputs a pruning mask m. A typical training and compression pipeline involves three stages:
Stage1: Full Model Fine-tuning. The pre-trained lxmert f(θ_pt) is fine-tuned using loss L, which produces f(θ_ft) = F_L(f(θ_pt)).
Stage2: Model Compression. The fine-tuned lxmert f(θ_ft) is compressed and we get the subnetwork f(m ⊙ θ_ft), where m = P^p_L(f(θ_ft)).
Stage3: Further Fine-tuning (optional). The subnetwork f(m ⊙ θ_ft) is further fine-tuned using loss L, which gives f(m ⊙ θ′_ft) = F_L(f(m ⊙ θ_ft)).
3 Experiments
In this section, we mainly investigate three questions: (1) How does compression affect lxmert's OOD generalization ability? (2) How to design the training and pruning pipeline to achieve a good sparsity-performance trade-off? (3) How to assign sparsity to different modality-specific modules?
3.1 Datasets, Model and Implementation
We conduct experiments on the OOD benchmarks VQA-CP v2 (Agrawal et al., 2018) and VQA-VS (Si et al., 2022b), which evaluate the robustness of VQA systems, with the accuracy-based evaluation metric (Antol et al., 2015). A more detailed discussion of the difference between the two datasets is given in Sec. 3.5. We thoroughly study the above three questions on VQA-CP v2, which is widely used in the literature on debiasing VQA systems (refer to Sec. 3.2, 3.3 and 3.4). Then, based on the findings, we further explore the more challenging VQA-VS (Si et al., 2022b) (refer to Sec. 3.5). For the VLP, we adopt the lxmert-base-uncased model (Tan and Bansal, 2019) released by huggingface (Wolf et al., 2020). All results are averaged over 4 random seeds. More information about the model and implementation details is given in App. B.4.
3.2 Effect of Compression on OOD Accuracy
Subnetworks from BCE Fine-tuned lxmert. We compress the BCE fine-tuned lxmert using OMP and mask training, and introduce either L_bce or L_lmh in the pruning process (for mask training) or the further fine-tuning process (for OMP).
The results are shown in the upper row of Fig. 2. We can derive several observations: 1) When no debiasing methods are used, the subnetworks of "mask train(bce)" and "OMP + bce ft" improve over the full lxmert by 1.35% ~ 2.79%, even at up to 70% sparsity. This implies that lxmert is overparameterized and pruning may remove some parameters related to the bias features. 2) "mask train(lmh)" and "OMP + lmh ft" achieve a further performance boost, exceeding the full lxmert by a large margin (11.05% ~ 14.02%). Since mask training does not change the value of parameters, the