
of adapters through the lens of network pruning (Mozer and Smolensky, 1989; Janowsky, 1989), which reduces the size of neural networks by pruning redundant parameters and training the remaining ones, thereby improving network efficiency. We call such pruned adapters SparseAdapter.
Specifically, we systematically investigate five representative pruning methods in §2.2 to examine at what sparse ratio adapters can maintain their effectiveness. Note that, to preserve the efficient nature of adapters, we prune all adapters at initialization so that no extra computational cost is incurred. We find that ① SparseAdapter can achieve comparable (or even better) performance than standard adapters when the sparse ratio reaches up to 80%. Such encouraging performance holds even under random pruning (see Figure 2) on the GLUE benchmark (Wang et al., 2018).
Based on these insights, we introduce a frustratingly easy setting, namely Large-Sparse, for SparseAdapter. We find that ② scaling up the bottleneck dimension of SparseAdapter with a correspondingly larger sparse ratio (to keep the same parameter budget, e.g., 2× dimension scaling with a 50% sparse ratio) effectively yields significant improvements by augmenting the model capacity.
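For intuition on the parameter budget, a standard bottleneck adapter contributes roughly 2 × d × r weights (one down- and one up-projection; biases omitted). Under this rough count, and with illustrative values d = 768 and r = 64 chosen here for the example rather than taken from our experiments, the following Python sketch shows that doubling r while pruning 50% of the weights leaves the number of nonzero parameters unchanged:

# Rough nonzero-parameter budget of a bottleneck adapter (biases omitted):
# down-projection (d x r) + up-projection (r x d) = 2 * d * r weights.
def adapter_nonzero_params(d: int, r: int, sparsity: float) -> int:
    return int(2 * d * r * (1.0 - sparsity))

d = 768  # model dimension of, e.g., BERT-base (illustrative choice)
standard = adapter_nonzero_params(d, r=64, sparsity=0.0)       # standard adapter
large_sparse = adapter_nonzero_params(d, r=128, sparsity=0.5)  # 2x dimension, 50% sparse
assert standard == large_sparse  # identical budget, larger effective capacity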
We validate our proposed SparseAdapter on five advanced adapters, i.e., Houlsby (Houlsby et al., 2019), Pfeiffer (Pfeiffer et al., 2020b), LoRA (Hu et al., 2021), MAM Adapter (He et al., 2022), and AdapterFusion (Pfeiffer et al., 2021), spanning both natural language understanding (GLUE and SQuAD) and generation (XSum) benchmarks. We show that with proper sparsity, e.g., 40%, SparseAdapter consistently outperforms the corresponding baselines. With our Large-Sparse setting, SparseAdapter can even outperform full fine-tuning by a clear margin, e.g., 79.6 vs. 79.0 in Figure 1.
2 Methodology
Motivation. Adapters are bottleneck modules plugged into PLMs, with bottleneck dimension r and model dimension d. In standard Adapter Tuning, only the adapter layers are trainable while the parameters of the original model are frozen, so the number of trainable parameters determines the capacity of the adapters. The common recipe to augment this capacity is to increase the bottleneck dimension, which requires more computation and thus runs against the original intention of adapters.
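For concreteness, a minimal PyTorch sketch of such a bottleneck module (a down-projection from d to r, a nonlinearity, an up-projection back to d, and a residual connection) is given below. This is only an illustrative sketch; the exact placement, activation, and parameterization differ across the adapter variants studied in this work.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Minimal bottleneck adapter: d -> r -> d with a residual connection.
    # Only these weights are trained; the surrounding PLM stays frozen.
    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)  # down-projection to bottleneck dimension r
        self.act = nn.ReLU()
        self.up = nn.Linear(r, d)    # up-projection back to model dimension d

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen PLM representation intact.
        return h + self.up(self.act(self.down(h)))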
To check whether augmenting adapters by increasing their parameters is the optimal choice, we revisit the defining property of adapters, i.e., parameter efficiency, by pruning the redundant parameters. As shown in Figure 2, randomly pruned adapters can achieve comparable or even better performance than standard adapters, which indicates the existence of redundant parameters. The comparable performance holds even under 80% sparsity. This preliminary study urges us to investigate research questions ① and ②. We approach them by systematically investigating the effects of different pruning methods.
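As a rough illustration of the random-pruning baseline in Figure 2 (a simplified sketch, not necessarily the exact implementation), a fixed binary mask can be sampled for each adapter weight matrix at initialization and enforced throughout fine-tuning:

import torch

def random_prune_(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Zero out a `sparsity` fraction of entries in place and return the mask.
    mask = (torch.rand_like(weight) >= sparsity).float()
    weight.data.mul_(mask)
    return mask

# The mask must also be enforced during training, e.g. by masking the gradient
# so pruned entries never become nonzero again:
# weight.register_hook(lambda grad: grad * mask)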
Figure 2: Comparison between randomly pruned adapters and standard adapters on datasets from GLUE.
Figure 3: Schematic comparison of (a) standard Adapter Tuning and (b) our proposed SparseAdapter Tuning.
2.1 Pruning Adapters at Initialization
As shown in Figure 3, we intend to prune out redundant parameters and then fine-tune the SparseAdapter, instead of directly tuning all parameters (standard Adapter Tuning). By pruning adapters at initialization, we can discard the redundant parameters at an early stage and avoid the time-consuming iterative pruning process (Frankle and Carbin, 2018). Specifically, considering an adapter with weights w_l inserted in the layer