their applicability to more complicated vision tasks is unclear. On the other hand, the single-task
adaptation methods [7, 10, 11, 12, 8] still need to learn and store task-wise parameters, and the
number of trainable parameters increases with the number of tasks.
Therefore, in this paper, we first conduct a thorough study of how existing successful parameter-efficient methods from NLP perform on vision tasks, particularly on more challenging dense vision tasks (e.g., semantic segmentation, surface normal estimation). Second, based on our findings, we design a novel parameter-efficient method for adaptation to multiple dense vision tasks. Our method leverages shared modules across tasks and encourages the model to use shared information in a more parameter-efficient manner.
To start with, we evaluate the existing parameter-efficient NLP methods on dense vision problems. We choose to apply these methods to hierarchical vision transformers (HVTs), considering their state-of-the-art results on many per-pixel vision tasks [13, 14]. Through our extensive studies, we find two limitations in these works. First, adapter-based methods [11, 15], which have shown strong
performance on NLP parameter-efficient adaptation benchmarks, cannot be efficiently integrated with
HVTs. This is because the parameter usage of adapters in later transformer blocks grows quadratically
with respect to the layer scale (see Fig. 1c). Second, the state-of-the-art multi-task parameter-efficient method [16] applies a hyper-network to produce the weights of adapters and shares information across different NLP tasks, but we find that it inherently requires a large number of trainable parameters in the hyper-network (see Sec. 4.1 for further discussion).
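To make the first limitation concrete, the following back-of-the-envelope sketch illustrates the quadratic growth. It assumes a standard bottleneck adapter (a down-projection of width d to d/r followed by an up-projection back to d) and Swin-Tiny-like channel widths; these choices are illustrative assumptions, and the exact numbers depend on the backbone and adapter design.

    # Rough adapter parameter count per HVT stage (illustrative assumption:
    # bottleneck adapter with reduction ratio r, biases ignored).
    def adapter_params(d, r=4):
        bottleneck = d // r
        return 2 * d * bottleneck  # down-projection + up-projection weights

    stage_widths = [96, 192, 384, 768]  # channel width doubles at every HVT stage
    for d in stage_widths:
        print(f"width {d:4d}: ~{adapter_params(d):,} params per adapter")
    # Doubling the channel width roughly quadruples the adapter size, so the
    # later stages dominate the trainable-parameter budget (cf. Fig. 1c).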
To address the above limitations, we propose Polyhistor-Lite, which consists of two main components: Decomposed HyperNetworks and Layer-wise Scaling Kernels. These two components reduce the trainable parameters in two complementary ways: parameter reduction for the hyper-networks in the multi-tasking architecture, and parameter reduction for the adapters used in HVTs.
Specifically, to reduce the parameter usage of the multi-task architecture, we decompose a hyper-network into a pair of separate hyper-networks. Unlike the existing approach, where a relatively large hyper-network produces a long vector that is reshaped into the adapter weight matrix, our decomposed hyper-networks individually produce two low-rank matrices that are multiplied to construct the adapter weights. As a result, we can rely on this low-rank approximation to reduce the parameter usage of the hyper-network while maintaining its performance on downstream tasks. In addition, to enable the hyper-networks to be shared across layers in HVTs, we factorize an adapter weight matrix into two kernels: Template Kernels and Scaling Kernels. These two kernels are combined via the Kronecker product to fit adapters of different sizes; this is achieved by controlling the sizes of the Scaling Kernels based on the scaling of the layer/adapter (while using Template Kernels of the same size across layers). In this way, the number of adapter weight parameters can be effectively reduced with minimal sacrifice in accuracy on downstream tasks. A minimal sketch of both constructions is given below.
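The following PyTorch-style sketch illustrates the two ideas; the module structure, task-embedding dimension, rank, and template size are illustrative assumptions rather than the exact implementation, and the scaling kernel shown here is a fixed placeholder where a learnable, layer-specific kernel would be used in practice.

    import torch
    import torch.nn as nn

    class DecomposedHyperNetwork(nn.Module):
        """Two small hyper-networks emit low-rank factors whose product
        forms a template adapter weight (illustrative sketch)."""
        def __init__(self, task_emb_dim, template_dim, rank):
            super().__init__()
            self.head_a = nn.Linear(task_emb_dim, template_dim * rank)  # -> A: (template_dim, rank)
            self.head_b = nn.Linear(task_emb_dim, rank * template_dim)  # -> B: (rank, template_dim)
            self.template_dim, self.rank = template_dim, rank

        def forward(self, task_emb):
            a = self.head_a(task_emb).view(self.template_dim, self.rank)
            b = self.head_b(task_emb).view(self.rank, self.template_dim)
            return a @ b  # low-rank product used as the shared template weight

    def scale_to_layer(template_w, layer_scale):
        """Kronecker product with a layer-wise scaling kernel grows the shared
        template to the channel width of a given HVT stage."""
        scaling_kernel = torch.ones(layer_scale, layer_scale)  # learnable per layer in practice
        return torch.kron(scaling_kernel, template_w)

    # Example: one template per adapter, reused across stages with 1x, 2x, ... widths.
    hypernet = DecomposedHyperNetwork(task_emb_dim=32, template_dim=96, rank=4)
    task_emb = torch.randn(32)
    template_w = hypernet(task_emb)           # (96, 96)
    stage2_w = scale_to_layer(template_w, 2)  # (192, 192) for a stage with doubled width
    print(template_w.shape, stage2_w.shape)

Because only the small hyper-network heads and the per-layer scaling kernels are trained, the parameter count stays nearly constant as more tasks and larger stages are added, which is the intended trade-off.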
To benchmark the problem, we construct a unified framework with the same implementation details and provide a comprehensive and fair comparison of existing NLP parameter-efficient adaptation methods on our multi-task dense vision problems. We also demonstrate that, with the integration of our proposed Decomposed HyperNetworks and Layer-wise Scaling Kernels, we achieve a much better trade-off between trainable parameters and accuracy compared to existing methods. Specifically, most existing methods struggle to match the performance of the simple baseline that individually fine-tunes the entire network for each task, whereas our method outperforms this baseline while training fewer than 10% of the parameters in a model. Compared with the state-of-the-art multi-tasking parameter-efficient adaptation method, Hyperformer [16], our method achieves competitive performance improvement with a ∼90% reduction
in the trainable parameters of their method. Interestingly, we also observe that our proposed method brings a larger performance improvement when applied to a network pre-trained on a larger dataset (ImageNet-22k). We will publicly release our code to facilitate future research.
To sum up, we list our contributions as follows:
• To the best of our knowledge, we are the first to address parameter-efficient multi-task adaptation for vision tasks. We develop a unified framework to benchmark several parameter-efficient NLP fine-tuning methods on dense vision tasks.
• We propose a novel method, Polyhistor-Lite, that achieves significant performance improvements with very few trainable parameters compared to existing methods.