field (Chen et al., 2020b; Li et al., 2020a,b; Liang et al., 2021a; Liu et al., 2021, 2022; Prasanna et al., 2020) and VLPs for visual-linguistic tasks (Fang et al., 2021; Gan et al., 2022). They show that large-scale PLMs and VLPs can be compressed into lightweight models without degrading performance. Refer to App. A for more related work.
This paper jointly studies the compression and
debiasing problems of VLP for the VQA task. To
this end, we combine the existing debiasing and
pruning methods to establish a training and com-
pression pipeline, and conduct extensive experi-
ments with the pre-trained lxmert, which is the
most popular VLP in VQA, under different OOD
settings. We show that there exist sparse lxmert subnetworks that are more robust than the full model, suggesting that OOD robustness and computational efficiency can be achieved simultaneously.
We also present a comprehensive study on the design of the training and compression pipeline, as well as the assignment of sparsity to different model modules, to identify subnetworks with better OOD generalization. Our findings highlight the importance of: 1) employing a two-stage training and compression pipeline and integrating the debiasing objective throughout the entire process; 2) when two debiasing methods both work well with the full model, training the full model with the relatively weaker method and compressing it with the stronger one; 3) assigning modality-specific sparsity to different modules of the VLP.
Our main contributions are as follows: (1) We
present the first (to our knowledge) systematic
study on sparsity and OOD robustness for VLPs.
(2) Our empirical studies on the training and com-
pression pipeline and sparsity assignment can serve
as a valuable guideline for the future design of VLP
subnetwork searching methods. (3) We obtain sub-
networks that outperform existing debiasing So-
TAs in terms of the trade-off between accuracy
and model size on OOD datasets VQA-CP v2 and
VQA-VS (see Fig. 1, Tab. 1and Tab. 2).
2 Method
2.1 VLP Architecture and Subnetworks
This section takes lxmert as an example to introduce how we extract subnetworks. Lxmert contains an embedding layer, a visual fc layer, a pooler layer, a VQA-specific classifier and a stack of Transformer layers, which involve three encoders: the language encoder ($L_{enc}$), the object relationship encoder ($R_{enc}$) and the cross-modality encoder ($C_{enc}$).
We adopt unstructured pruning to obtain a compressed version (i.e., a subnetwork) of the original VLPs. Specifically, given a VLP $f(\theta)$ with parameters $\theta$, we apply a binary pruning mask $m \in \{0, 1\}^{|\theta|}$ to the model parameters, which gives rise to $f(m \odot \theta)$, where $\odot$ is the element-wise product. The parameters to be pruned are:

$$\theta_{pr} = \{W_{emb}, W_{vis\text{-}fc}, W_{plr}\} \cup \theta_{L_{enc}} \cup \theta_{R_{enc}} \cup \theta_{C_{enc}} \quad (1)$$
where $W_{emb}$, $W_{vis\text{-}fc}$ and $W_{plr}$ are the weights of the embedding layer, the visual fc layer and the pooler layer, and $\theta_{L_{enc}} \cup \theta_{R_{enc}} \cup \theta_{C_{enc}}$ are the parameters of the Transformer layers. More details of lxmert can be found in App. B.1. Another model, visualBERT (Li et al., 2019), which is also used in our experiments, is introduced in App. B.2.
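To illustrate Eq. 1, the following is a minimal PyTorch sketch of applying such binary masks to a model's parameters; the function name and the mask-dictionary interface are our own illustrative assumptions, not lxmert's actual code.

```python
import torch

def apply_pruning_masks(model: torch.nn.Module, masks: dict) -> None:
    """Compute f(m ⊙ θ): zero out pruned weights in place.

    `masks` maps parameter names to binary {0, 1} tensors of the same
    shape. Only parameters in θ_pr (embedding, visual fc, pooler and
    Transformer-layer weights) would receive a mask.
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])  # element-wise product m ⊙ θ
```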
2.2 Pruning Methods
We consider two representative pruning methods,
i.e., magnitude-based pruning (Han et al.,2015)
and mask training (Louizos et al.,2018;Ramanujan
et al.,2020;Sanh et al.,2020;Sehwag et al.,2020).
Magnitude-based Pruning approximates the im-
portance of model parameters based on their abso-
lute values and eliminates the less important ones.
We adopt the basic version of magnitude-based
pruning, i.e., one-shot magnitude pruning (OMP).
OMP can optionally be combined with further fine-
tuning of the pruned subnetwork to recover the
performance drop.
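As an illustration, one way to construct the OMP mask for a single weight tensor is sketched below; we assume per-tensor (local) pruning here, though the mask could equally be computed over all of $\theta_{pr}$ globally.

```python
import torch

def omp_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """One-shot magnitude pruning: zero out the smallest-|w| entries.

    Returns a binary mask with roughly `sparsity` fraction of zeros.
    """
    num_prune = int(weight.numel() * sparsity)
    if num_prune == 0:
        return torch.ones_like(weight)
    # Threshold at the num_prune-th smallest absolute value.
    threshold = weight.abs().flatten().kthvalue(num_prune).values
    return (weight.abs() > threshold).float()
```

The resulting mask is then applied as $f(m \odot \theta)$, optionally followed by fine-tuning of the surviving weights.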
Mask Training directly optimizes the binary pruning mask $m$ towards the given objectives. Specifically, each weight matrix $W \in \mathbb{R}^{d_i \times d_o}$ is associated with two mask matrices, namely a binary mask $m \in \{0, 1\}^{d_i \times d_o}$ and a real-valued mask $\hat{m} \in \mathbb{R}^{d_i \times d_o}$. In the forward propagation, $m$ is computed from $\hat{m}$ through binarization:

$$m_{i,j} = \begin{cases} 1 & \text{if } \hat{m}_{i,j} \ge \phi \\ 0 & \text{else} \end{cases} \quad (2)$$
where $\phi$ is the threshold. Then, the original weight matrix $W$ is replaced with a pruned one $m \odot W$.
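To make Eq. 2 concrete, below is a minimal PyTorch sketch of this binarized forward pass, together with the straight-through backward pass described next; the class name and the masked-linear usage in the comments are our own illustrative assumptions, not lxmert's actual implementation.

```python
import torch

class Binarize(torch.autograd.Function):
    """Forward: threshold the real-valued mask m_hat at phi (Eq. 2).
    Backward: straight-through estimator, i.e., the gradient w.r.t.
    the binary mask m is passed to m_hat unchanged."""

    @staticmethod
    def forward(ctx, m_hat, phi):
        return (m_hat >= phi).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # no gradient for the threshold phi

# Inside a masked layer's forward pass (illustrative):
#   m = Binarize.apply(self.m_hat, phi)           # binary mask (Eq. 2)
#   out = torch.nn.functional.linear(x, m * self.W, self.bias)
```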
When it comes to backward propagation, we follow (Liu et al., 2022; Mallya et al., 2018; Radiya-Dixit and Wang, 2020; Zhao et al., 2020) and use the straight-through estimator (Bengio et al., 2013) to estimate the gradients of $\hat{m}$ using the gradients of