their applicability to more complicated vision tasks is unclear. On the other hand, the single-task
adaptation methods [7, 10, 11, 12, 8] still need to learn and store task-wise parameters, and the
number of trainable parameters increases with the number of tasks.
Therefore, in this paper, we first conduct a thorough study of how existing successful parameter-efficient methods from NLP perform on vision tasks, particularly on more challenging dense vision tasks (e.g., semantic segmentation, surface normal estimation). Second, based on our findings, we design a novel parameter-efficient method for adaptation to multiple dense vision tasks. Our method leverages shared modules across tasks and encourages the model to use shared information in a more parameter-efficient manner.
To start with, we evaluate the existing parameter-efficient NLP methods on dense vision problems. We choose to apply these methods to hierarchical vision transformers (HVTs), considering their state-of-the-art results on many per-pixel vision tasks [13, 14]. Through our extensive studies, we find two limitations in these works. First, adapter-based methods [11, 15], which have shown strong
performance on NLP parameter-efficient adaptation benchmarks, cannot be efficiently integrated with
HVTs. This is because the parameter usage of adapters in later transformer blocks grows quadratically
with respect to the layer scale (see Fig. 1c). Second, the state-of-the-art multi-task parameter-efficient method [16] applies a hyper-network to produce the weights of adapters and shares information across different NLP tasks, but we find that it inherently requires a large number of trainable parameters in the hyper-network (see Sec. 4.1 for further discussion).
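To make the first limitation concrete, the following back-of-the-envelope sketch illustrates the quadratic growth. It assumes a standard bottleneck adapter (a down-projection of width d to d/r followed by an up-projection back to d) and Swin-Tiny-like channel widths; these choices are illustrative assumptions, and the exact numbers depend on the backbone and adapter design.

    # Rough adapter parameter count per HVT stage (illustrative assumption:
    # bottleneck adapter with reduction ratio r, biases ignored).
    def adapter_params(d, r=4):
        bottleneck = d // r
        return 2 * d * bottleneck  # down-projection + up-projection weights

    stage_widths = [96, 192, 384, 768]  # channel width doubles at every HVT stage
    for d in stage_widths:
        print(f"width {d:4d}: ~{adapter_params(d):,} params per adapter")
    # Doubling the channel width roughly quadruples the adapter size, so the
    # later stages dominate the trainable-parameter budget (cf. Fig. 1c).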
To address the above limitations, we propose Polyhistor-Lite, which consists of two main components: Decomposed HyperNetworks and Layer-wise Scaling Kernels. These two components reduce the trainable parameters in two complementary ways: parameter reduction for the hyper-networks in the multi-tasking architecture, and parameter reduction for the adapters used in HVTs.
Specifically, to reduce the parameter usage of the multi-task architecture, we decompose a hyper-network into a pair of separate hyper-networks. Unlike the existing approach, where a relatively large hyper-network produces a long vector that is reshaped into the adapter weight matrix, our decomposed hyper-networks individually produce two low-rank matrices that are multiplied to construct the adapter weights. As a result, we can rely on this low-rank approximation to reduce the parameter usage of the hyper-network while maintaining its performance on downstream tasks. In addition, to enable the hyper-networks to be shared across layers in HVTs, we factorize an adapter weight matrix into two kernels: Template Kernels and Scaling Kernels. These two kernels are combined via the Kronecker product to fit adapters of different sizes; this is achieved by controlling the sizes of the Scaling Kernels based on the scaling of the layer/adapter (while using Template Kernels of the same size across layers). In this way, the number of adapter weight parameters can be effectively reduced with minimal sacrifice in accuracy on downstream tasks. A minimal sketch of both constructions is given below.
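The following PyTorch-style sketch illustrates the two ideas; the module structure, task-embedding dimension, rank, and template size are illustrative assumptions rather than the exact implementation, and the scaling kernel shown here is a fixed placeholder where a learnable, layer-specific kernel would be used in practice.

    import torch
    import torch.nn as nn

    class DecomposedHyperNetwork(nn.Module):
        """Two small hyper-networks emit low-rank factors whose product
        forms a template adapter weight (illustrative sketch)."""
        def __init__(self, task_emb_dim, template_dim, rank):
            super().__init__()
            self.head_a = nn.Linear(task_emb_dim, template_dim * rank)  # -> A: (template_dim, rank)
            self.head_b = nn.Linear(task_emb_dim, rank * template_dim)  # -> B: (rank, template_dim)
            self.template_dim, self.rank = template_dim, rank

        def forward(self, task_emb):
            a = self.head_a(task_emb).view(self.template_dim, self.rank)
            b = self.head_b(task_emb).view(self.rank, self.template_dim)
            return a @ b  # low-rank product used as the shared template weight

    def scale_to_layer(template_w, layer_scale):
        """Kronecker product with a layer-wise scaling kernel grows the shared
        template to the channel width of a given HVT stage."""
        scaling_kernel = torch.ones(layer_scale, layer_scale)  # learnable per layer in practice
        return torch.kron(scaling_kernel, template_w)

    # Example: one template per adapter, reused across stages with 1x, 2x, ... widths.
    hypernet = DecomposedHyperNetwork(task_emb_dim=32, template_dim=96, rank=4)
    task_emb = torch.randn(32)
    template_w = hypernet(task_emb)           # (96, 96)
    stage2_w = scale_to_layer(template_w, 2)  # (192, 192) for a stage with doubled width
    print(template_w.shape, stage2_w.shape)

Because only the small hyper-network heads and the per-layer scaling kernels are trained, the parameter count stays nearly constant as more tasks and larger stages are added, which is the intended trade-off.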
To benchmark the problem, we construct a unified framework with the same implementation details and provide a comprehensive and fair comparison of existing NLP parameter-efficient adaptation methods on our multi-task dense vision problems. We also demonstrate that, with the integration of our proposed Decomposed HyperNetworks and Layer-wise Scaling Kernels, we achieve a much better trade-off between trainable parameters and accuracy compared to existing methods. Specifically, most existing methods struggle to match the performance of the simple baseline that individually fine-tunes the entire network for each task, whereas our method outperforms this baseline while training fewer than 10% of the parameters in a model. Compared with the state-of-the-art multi-tasking parameter-efficient adaptation method, Hyperformer [16], our method achieves competitive performance improvement with a ∼90% reduction
in the trainable parameters of their method. Interestingly, we also observe that our proposed method brings a larger performance improvement when applied to a network pre-trained on a larger dataset (ImageNet-22k). We will publicly release our code to facilitate future research.
To sum up, we list our contributions as follows:
• To the best of our knowledge, we are the first to address parameter-efficient multi-task adaptation for vision tasks. We develop a unified framework to benchmark several parameter-efficient NLP fine-tuning methods on dense vision tasks.
• We propose a novel method, Polyhistor-Lite, that achieves significant performance improvements with very few trainable parameters compared to existing methods.