Polyhistor: Parameter-Efficient Multi-Task
Adaptation for Dense Vision Tasks
Yen-Cheng Liu
Georgia Tech
ycliu@gatech.edu
Chih-Yao Ma
Meta
cyma@meta.com
Junjiao Tian
Georgia Tech
jtian73@gatech.edu
Zijian He
Meta
zijian@meta.com
Zsolt Kira
Georgia Tech
zkira@gatech.edu
Polyhistor: someone gifted or learned in multiple disciplines.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract
Adapting large-scale pretrained models to various downstream tasks via fine-tuning
is a standard method in machine learning. Recently, parameter-efficient fine-tuning
methods show promise in adapting a pretrained model to different tasks while
training only a few parameters. Despite their success, most existing methods
are proposed in Natural Language Processing tasks with language Transformers, and adaptation to Computer Vision tasks with Vision Transformers remains
under-explored, especially for dense vision tasks. Further, in multi-task settings,
individually fine-tuning and storing separate models for different tasks is inefficient.
In this work, we provide an extensive multi-task parameter-efficient benchmark and
examine existing parameter-efficient fine-tuning NLP methods for vision tasks. Our
results on four different dense vision tasks show that existing methods cannot be efficiently integrated due to the hierarchical nature of Hierarchical Vision
Transformers. To overcome this issue, we propose Polyhistor and Polyhistor-Lite,
consisting of Decomposed HyperNetworks and Layer-wise Scaling Kernels, to
share information across different tasks with a few trainable parameters. This
leads to favorable performance improvements against existing parameter-efficient
methods while using fewer trainable parameters. Specifically, Polyhistor achieves
competitive accuracy compared to the state-of-the-art while using only 10% of their trainable parameters. Furthermore, our methods show larger performance
gains when large networks and more pretraining data are used.
1 Introduction
Foundation models trained with large-scale datasets have shown success in adapting to a variety of downstream NLP and vision tasks [1]. As state-of-the-art foundation models grow to billions or even trillions of parameters [2, 3, 4, 5, 6], individually fine-tuning all parameters of the model
wastes significant computational resources. Further, for multi-task models, both fine-tuning and storing separate models for multiple tasks become infeasible on devices with low computation resources.
To alleviate this issue, several works [7, 8, 9] have proposed parameter-efficient fine-tuning methods to derive a better trade-off between trainable parameters and accuracy on downstream tasks. By training only a small number of parameters, these existing methods can substantially narrow the accuracy gap compared to the baseline that fine-tunes all parameters. However, these existing approaches
mainly focus on NLP tasks [10, 11, 12] or single-task adaptation on image classification [9], and their applicability to more complicated vision tasks is unclear. On the other hand, the single-task
adaptation methods [7, 10, 11, 12, 8] still need to learn and store task-wise parameters, and the number of trainable parameters increases with respect to the number of tasks.
Therefore, in this paper, we first conduct a thorough study on how the existing successful parameter-efficient methods for NLP tasks perform on vision tasks, particularly on more challenging dense vision tasks (e.g., semantic segmentation, normals estimation). Second, based on our findings, we design a novel parameter-efficient method for adaptation to multiple dense vision tasks. Our method leverages shared modules across tasks and encourages the model to use shared information in a more parameter-efficient manner.
To start, we evaluate the existing parameter-efficient NLP methods on dense vision problems. We chose to apply these methods to Hierarchical Vision Transformers (HVTs) considering their state-of-the-art results on many per-pixel vision tasks [13, 14]. Through our extensive studies, we find two limitations in these works. First, adapter-based methods [11, 15], which have shown strong performance on NLP parameter-efficient adaptation benchmarks, cannot be efficiently integrated with HVTs. This is because the parameter usage of adapters in later transformer blocks grows quadratically with respect to the layer scale (see Fig. 1c). Second, the state-of-the-art multi-task parameter-efficient method [16] applies a hyper-network to produce the weights of adapters and shares information across different NLP tasks, while we find it inherently requires a large number of trainable parameters in the hyper-network (see Sec. 4.1 for further discussion).
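To make the first limitation concrete, the back-of-the-envelope sketch below (our illustration, not numbers from the paper; it assumes Swin-Tiny-like channel widths of 96/192/384/768 and a bottleneck fixed to a quarter of the channel width) counts the parameters of a standard adapter at each block scale of an HVT.

```python
# Rough parameter count of one bottleneck adapter (W_down + W_up) per HVT stage.
# Channel widths follow a Swin-Tiny-like configuration and the bottleneck size
# n = d / 4; both are illustrative assumptions, not values from the paper.
channel_widths = [96, 192, 384, 768]   # adapter input dimension d per stage
reduction = 4                          # assumed bottleneck ratio d / n

for stage, d in enumerate(channel_widths, start=1):
    n = d // reduction
    params = 2 * d * n                 # W_down (d x n) plus W_up (n x d)
    print(f"stage {stage}: d={d:4d}, n={n:3d}, adapter params = {params:,}")

# Because d doubles at every stage and n scales with d, the per-adapter cost
# grows roughly 4x per stage, i.e., quadratically in the block scale.
```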
To address the above limitations, we propose Polyhistor-Lite, which consists of two main components: Decomposed Lite-HyperNetworks and Layer-wise Scaling Kernels. These two components reduce the trainable parameters in two respects: parameter reduction for the hyper-networks in the multi-tasking architecture, and parameter reduction for the adapters used in HVTs.
Specifically, to reduce the parameter usage of the multi-task architecture, we decompose a hyper-network into a pair of separate hyper-networks. Unlike the existing approach, where a relatively large hyper-network produces a long vector that is reshaped into the adapter's weight matrix, our decomposed hyper-networks individually produce two low-rank matrices that are multiplied to construct the adapter weights. As a result, we can rely on this low-rank approximation to reduce the parameter usage in the hyper-network while maintaining its performance on downstream tasks. In addition, to enable the hyper-networks to be shared across layers in HVTs, we factorize an adapter weight matrix into two kernels, Template Kernels and Scaling Kernels. These two kernels are combined via the Kronecker product to fit different adapter sizes, which is achieved by controlling the size of the Scaling Kernel based on the scale of the layer/adapter (and using Template Kernels of the same size across layers). In this way, the parameters of the adapter weights can be effectively reduced with minimal sacrifice in accuracy on downstream tasks.
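A minimal sketch of the layer-wise scaling idea follows (our illustration under assumed sizes, not the released implementation): a single small Template Kernel is shared across layers, and a per-layer Scaling Kernel, whose size is chosen from that layer's width, expands it via the Kronecker product into an adapter weight of the required shape.

```python
import torch

# Shared Template Kernel: one small matrix reused at every layer
# (the 8x8 size is an assumption for illustration).
template = torch.randn(8, 8)

# Target adapter weight shapes (d x n) per HVT stage, assuming channel widths
# 96/192/384/768 and a bottleneck of d / 4 (illustrative numbers only).
target_shapes = [(96, 24), (192, 48), (384, 96), (768, 192)]

for d, n in target_shapes:
    # Per-layer Scaling Kernel: sized so that kron(scaling, template) is d x n.
    scaling = torch.randn(d // 8, n // 8)
    weight = torch.kron(scaling, template)   # Kronecker product -> full adapter weight
    assert weight.shape == (d, n)
    print(f"d={d:4d}: scaling {tuple(scaling.shape)} x template (8, 8) -> weight {tuple(weight.shape)}")

# Only the small per-layer scaling kernels (plus the shared template) would be
# trained, so the cost stays far below the d x n parameters of a plain adapter.
```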
To benchmark the problem, we construct a unified framework with the same implementation details
and provide a comprehensive and fair comparison between existing parameter-efficient adaptation
works in NLP on our multi-tasking dense vision problems. We also demonstrate that, with the
integration of our proposed Decomposed HyperNetworks and Layer-wise Scaling Kernels, we can
achieve a much better trade-off between trainable parameters and accuracies compared to the existing
methods. Specifically, most existing methods struggle to match the performance of the simple baseline, which individually fine-tunes the entire network for each task, whereas our method achieves better results than this baseline while training less than 10% of the parameters in a model. Compared with the state-of-the-art multi-task parameter-efficient adaptation method, Hyperformer [16], our method achieves competitive performance improvements with a 90% reduction in trainable parameters relative to their method. Interestingly, we also observe that our proposed method brings larger performance improvements when applied to a network pre-trained on a larger dataset (ImageNet-22k). We will publicly release our code to facilitate future research.
To sum up, we list our contributions as follows:
- To the best of our knowledge, we are the first to address parameter-efficient multi-task adaptation for vision tasks. We develop a unified framework to benchmark several parameter-efficient fine-tuning NLP methods on dense vision tasks.
- We propose a novel method, Polyhistor-Lite, that achieves significant performance improvements with very few trainable parameters compared to existing methods.
- We observe that our method can bring further performance improvements when applied to models pre-trained on a larger dataset or with larger backbones.
2 Related Works
Parameter-efficient Learning aims to adapt a pre-trained model to a new task by training only a small number of parameters. The most straightforward method is to freeze the pre-trained encoder and only fine-tune the last layer, but, in terms of downstream-task accuracy, this remains far from full fine-tuning. Thus, to achieve a better trade-off between accuracy and the number of tunable parameters, several works [7, 10, 11, 12, 16, 9, 8] have proposed more parameter-efficient methods, and we summarize these works in the following paragraphs.
Single-Task Parameter-efficient Adaptation.
Several works build upon the Adapter [11], which is a bottleneck-like module that is placed across the architecture and trained while the rest of the original model is frozen. By changing the dimension of the hidden vectors, one can easily control the trade-off between trainable parameters and accuracy. For example, Houlsby et al. [11] propose to apply two adapter modules placed after the attention layers and the MLP layers respectively, while Pfeiffer et al. [15] only use adapters after the MLP layers and show better parameter efficiency. Furthermore, the PHM-Layer [12] learns two types of matrices, one "slow" matrix shared across layers and the other "fast" matrix learned individually in different layers, to produce the adapter weight via the Kronecker product [17]. Compacter [12] further reduces the parameters by decomposing the slow matrix into two rank-one vectors. Different from their goal of sharing the slow matrix across layers, we apply the Kronecker product to efficiently scale up adapters to different layer scales.
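As a rough rendering of that kind of composition (our sketch with assumed sizes, not the PHM-Layer or Compacter code; Compacter's additional rank-one decomposition is omitted), a full adapter projection can be assembled as a sum of Kronecker products of small shared matrices and larger layer-specific matrices:

```python
import torch

d, k = 768, 192   # input and output dimensions of one adapter projection (assumed)
n = 4             # number of Kronecker terms, dividing both d and k (assumed)

# Small "slow" matrices, intended to be shared across layers.
slow = [torch.randn(n, n) for _ in range(n)]
# Larger "fast" matrices, learned individually per layer.
fast = [torch.randn(d // n, k // n) for _ in range(n)]

# A sum of Kronecker products assembles the full d x k projection weight.
W = sum(torch.kron(A, B) for A, B in zip(slow, fast))
assert W.shape == (d, k)
print(W.shape)   # torch.Size([768, 192])
```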
In addition, there are other parameter-efficient learning works. BitFit [7] shows that simply tuning the biases in all layers improves over linear probing. Some other works fine-tune learnable vectors, such as learnable vectors in the input word embeddings [18] or learnable vectors integrated with the keys/values in each layer of transformers [19]. LoRA [10] produces two low-rank matrices, which are multiplied and serve as a residual of the attention weight matrices. While the above methods show favorable results using fewer trainable parameters, the goal of these works is single-task adaptation.
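A compact sketch of the LoRA idea (our illustration with assumed sizes, not the official implementation): the pretrained weight stays frozen, and only the two low-rank factors, whose product acts as a residual update, are trained.

```python
import torch

d, r = 768, 8                              # hidden size and low rank (assumed values)
W_frozen = torch.randn(d, d)               # pretrained projection weight, kept frozen

# Trainable low-rank factors; the "up" factor is zero-initialized so the residual
# starts at zero and fine-tuning begins from the pretrained behavior.
lora_down = torch.randn(d, r, requires_grad=True)
lora_up = torch.zeros(r, d, requires_grad=True)

x = torch.randn(2, d)                      # a small batch of token representations
y = x @ W_frozen + x @ lora_down @ lora_up # low-rank product serves as a residual
print(y.shape)                             # torch.Size([2, 768])
```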
Multi-Task Parameter-efficient Adaptation.
When multiple tasks are learned jointly, one can share homogeneous information across different tasks and reduce parameter usage by removing duplicated learned features. To this end, Hyperformer [16] introduces a hyper-network, which takes task embeddings as input and produces the weights of the adapters for different tasks. Since only the parameters in the hyper-network need to be trained, the number of trainable parameters in the task-wise components can thus be reduced in the multi-tasking setup. On the other hand, Sung et al. [20] show that simply adding a single adapter to a language transformer and sharing the adapter across tasks can achieve promising results in vision-and-language cross-modal tasks (e.g., image captioning).
Parameter-efficient Adaptation for Vision.
Despite the promising results, most parameter-efficient learning methods are evaluated on language transformers and NLP benchmarks, and parameter-efficient learning on Vision Transformers [9] is still an under-explored topic. A recent work, Visual Prompt Tuning (VPT) [9], initiates the study of parameter-efficient learning on Vision Transformers; it follows the idea of prompt tuning in language tasks and prepends and fine-tunes some extra learnable vectors in the input space of pre-trained Vision Transformers. VPT focuses on single-task adaptation, while our work focuses on multi-task adaptation.
To fairly compare different parameter-efficient learning methods, He et al. [21] present an empirical study and re-evaluate parameter-efficient learning methods (BitFit, Adapters, Prefix Tuning, and LoRA) under the same experimental configuration for NLP tasks. Inspired by their work, we implement the aforementioned parameter-efficient NLP methods (and include more recent works [12, 16, 9]) on our dense vision tasks, conduct comparative experiments, and fairly compare these methods.
3 Background
Hierarchical Vision Transformers.
The Vision Transformer [22] is based on transformer architectures [22] and operates on a set of patch tokens obtained from an image. As a variant of the Vision Transformer, the Hierarchical Vision Transformer [13, 14, 23, 24, 25, 26] produces multi-scale feature representations, and its hierarchical structure extracts fine-grained information and better handles images with scale and size variation.

Figure 1: Illustration of (a) the Hierarchical Vision Transformer and (b) the Adapter. (c) When applying adapters in a Hierarchical Vision Transformer, the number of parameters grows quadratically with respect to the block scale. Note that C indicates the dimension of adapter input vectors, n is the bottleneck size of adapters, and d represents the input size of adapters.

These properties contribute to the promising results in several
per-pixel vision tasks, including semantic segmentation [13, 14, 26], depth estimation [27], and saliency detection [28]. As shown in Fig. 1a, a Hierarchical Vision Transformer (HVT) consists of several transformer layers, and each transformer layer is mainly composed of an attention layer and an MLP layer. Different from other transformers (e.g., ViT [22]), a distinct characteristic of HVTs is their pyramid-like feature maps generated from different transformer blocks, as shown in Fig. 1a.
Adapters.
Several parameter-efficient adaptation works [11, 15, 12, 21] build upon the Adapter [11], which is a bottleneck-like module placed in transformer layers, as shown in Fig. 1b. These layers are learnable parameters, while the rest of the model is frozen during fine-tuning. The Adapter $f_a(\cdot)$ consists of a down-projection layer $W_{\mathrm{down}} \in \mathbb{R}^{d \times n}$, a non-linearity function $\delta(\cdot)$, an up-projection layer $W_{\mathrm{up}} \in \mathbb{R}^{n \times d}$, and a skip connection from the input of the adapter $h_{\mathrm{in}} \in \mathbb{R}^{d}$:

$$h_{\mathrm{out}} = f_a(h_{\mathrm{in}}; W) = \delta(h_{\mathrm{in}} W_{\mathrm{down}}) W_{\mathrm{up}} + h_{\mathrm{in}}, \tag{1}$$

where $h_{\mathrm{out}} \in \mathbb{R}^{d}$ is the output of the adapter and $W = [W_{\mathrm{down}}; W_{\mathrm{up}}^{\top}] \in \mathbb{R}^{d \times 2n}$ represents all learnable parameters in the adapter.
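For concreteness, a minimal PyTorch rendering of Eq. (1) is given below; this is our sketch, and the ReLU non-linearity and the example sizes are assumptions rather than choices taken from the paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter of Eq. (1): h_out = delta(h_in W_down) W_up + h_in."""

    def __init__(self, d: int, n: int):
        super().__init__()
        self.down = nn.Linear(d, n, bias=False)   # W_down in R^{d x n}
        self.act = nn.ReLU()                      # the non-linearity delta (assumed ReLU)
        self.up = nn.Linear(n, d, bias=False)     # W_up in R^{n x d}

    def forward(self, h_in: torch.Tensor) -> torch.Tensor:
        # Down-project, apply the non-linearity, up-project, then add the skip connection.
        return self.up(self.act(self.down(h_in))) + h_in

# Usage: only the adapter is trained while the surrounding transformer stays frozen.
adapter = Adapter(d=768, n=48)               # illustrative sizes
h = torch.randn(4, 196, 768)                 # (batch, tokens, channels)
out = adapter(h)                             # same shape as the input
print(out.shape)                             # torch.Size([4, 196, 768])
```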
4 Method
Problem Setting.
Given a Hierarchical Vision Transformer pre-trained on large-scale image datasets (e.g., ImageNet [29]), our goal is to train a small number of parameters and adapt the model to the multi-tasking setting, where training data for $N$ tasks are available during the training stage. Following the existing works in NLP, the criteria for parameter-efficient multi-task learning include the accuracy on downstream tasks and the number of trainable parameters.
Method Overview.
We aim to improve the parameter efficiency in two aspects: (1) to efficiently
share homogeneous information across tasks via lightweight hyper-networks (Section 4.1) and
(2) to efficiently scale up adapter weights in different transformer blocks of Hierarchical Vision
Transformers (Section 4.2). These two components are combined to improve the trade-off between
accuracy and training parameters in multi-tasking per-pixel vision tasks (Section 4.3).
4.1 Polyhistor: Decomposed Lightweight Hyper-networks for Multi-task Adaptation
With the goal of jointly adapting multiple NLP tasks in a parameter-efficient manner, a prior work, Hyperformer [16], builds upon a group of adapters in different tasks and extracts task-sharing information via a hyper-network shared across different tasks. Specifically, a group of task- and layer-wise adapters with weight parameters $\{W_l^t \mid t = 1, \ldots, N;\ l = 1, \ldots, L\}$ are individually inserted into each layer $l$ of the model with $L$ layers for all $N$ tasks. Then, instead of individually learning the