M3ViT: Mixture-of-Experts Vision Transformer
for Efficient Multi-task Learning
with Model-Accelerator Co-design
Hanxue Liang1, Zhiwen Fan1, Rishov Sarkar2, Ziyu Jiang3, Tianlong Chen1, Kai Zou4, Yu Cheng5, Cong Hao2, Zhangyang Wang1
1University of Texas at Austin, 2Georgia Institute of Technology
3Texas A&M University, 4Protagolabs Inc, 5Microsoft Research
{haliang,zhiwenfan,tianlong.chen,atlaswang}@utexas.edu
{rishov.sarkar,callie.hao}@gatech.edu,jiangziyu@tamu.edu
kz@protagolabs.com,yu.cheng@microsoft.com
Abstract
Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. However, when deploying MTL onto real-world systems that are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks, and the challenge is amplified when a growing number of tasks have to be squeezed into one compact model; (ii) at inference, current MTL regimes have to activate nearly the entire model even to execute a single task. Yet most real systems demand only one or two tasks at each moment, and switch between tasks as needed; such “all tasks activated” inference is therefore highly inefficient and non-scalable.
In this paper, we present a model-accelerator co-design framework that enables efficient on-device MTL and tackles both training and inference bottlenecks. Our framework, dubbed M3ViT, customizes mixture-of-experts (MoE) layers into a vision transformer (ViT) backbone for MTL, and sparsely activates task-specific experts during training, which effectively disentangles the parameter spaces to avoid different tasks' training conflicts. Then, at inference with any task of interest, the same design allows activating only the task-corresponding sparse “expert” pathway, instead of the full model. Our model design is further enhanced by hardware-level innovations, in particular a novel computation reordering scheme tailored for memory-constrained MTL that achieves zero-overhead switching between tasks and can scale to any number of experts. Extensive experiments on the PASCAL-Context [1] and NYUD-v2 [2] datasets, at both the software and hardware levels, demonstrate the effectiveness of the proposed design. When executing single-task inference, M3ViT achieves higher accuracies than encoder-focused MTL methods, while reducing inference FLOPs by 88%. When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.40×, while achieving energy efficiency up to 9.23× higher than a comparable FPGA baseline.
Code is available at: https://github.com/VITA-Group/M3ViT.
1 Introduction
Vision Transformers (ViTs) [3, 4, 5, 6], as the latest performant deep models, have achieved impressive performance on various computer vision tasks [7, 8, 9]. These models are specially trained or tested
for only one or a few tasks; however, many real-world applications require one compact system that can handle many different tasks efficiently, and often need to swiftly switch between tasks on demand. For example, an autonomous driving system [10] needs to perform and switch between many tasks such as drivable area estimation, lane detection, pedestrian detection, and scene classification: both single-task inference and cross-task switching need to happen at ultra-low latency. As another example, smart-home indoor robots [11] are expected to address semantic segmentation, navigation, tracking, or other tasks in varying contexts, with very limited on-board resources. Multi-task learning (MTL) [12, 13, 14] solves multiple tasks simultaneously within a single model and learns improved feature representations [15] shared by related tasks [16, 17]. Therefore, accomplishing realistic, efficient MTL is becoming a key enabler for building real-time sophisticated AI systems.
Despite the promise, challenges persist in building an efficient MTL model suitable for real-world applications: (1) During training, prior works [18, 19, 20] indicate that competition among tasks may degrade MTL, since the same weights might receive conflicting update directions from different tasks. Specifically, [19] reveals that negative cosine similarities between different tasks' gradients are detrimental, and [21, 22] confirm that conflicting gradients not only slow down convergence but also bias the learned representations against some tasks. This only gets worse on compact models, owing to their limited modeling capacity. To tackle cross-task conflicts, solutions have been proposed that vary the learning rates of different tasks [20], use “cross-stitch” sharing [23], or re-balance task gradients [19, 24, 20, 25]. However, they either require task-specific designs or significantly increase model complexity, which contradicts our efficiency goal.
(2) At inference, existing MTL regimes typically activate the entire backbone model unconditionally. However, many real systems only need to call upon one or a few tasks at each moment, hence the “all activated” inference is heavily inefficient and non-scalable. For example, current regimes [14, 23, 26, 27] have to activate the whole gigantic ResNet [28] encoder even just to execute a single monocular depth estimation task. If the number of tasks scales up [29] and the backbone keeps growing bigger, the “per task” inference efficiency of the resulting MTL model could become catastrophically poor.
To tackle these bottlenecks, we propose a model-accelerator co-design framework that enables efficient on-device MTL. Specifically, at the software level, we propose to adapt mixture-of-experts (MoE) layers [30, 31] into the MTL backbone, as MoE can adaptively divide-and-conquer the entire model capacity into smaller sub-models [30, 32]. Here, we replace the dense feed-forward network in the ViT with sparsely activated MoE experts (MLPs). A task-dependent gating network is trained to select a subset of experts for each input token, conditioned on the task. During training, this task-dependent routing effectively disentangles the parameter spaces, balancing feature reuse while automatically avoiding different tasks' training conflicts. Meanwhile, at the inference stage for any task of interest, this design naturally allows sparse activation of only the experts corresponding to that task instead of the full model, thus achieving highly sparse and efficient inference for the specific task.
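To make the routing concrete, the listing below sketches how a dense ViT feed-forward block can be replaced by sparsely activated, task-routed experts, with one router per task. It is a minimal PyTorch sketch for illustration only; the class and parameter names (MultiGateMoELayer, num_experts, top_k, hidden_dim) are our own and do not reflect M3ViT's exact configuration or released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGateMoELayer(nn.Module):
    """Replaces the dense FFN of a ViT block with task-routed experts (illustrative sketch)."""
    def __init__(self, dim, hidden_dim, num_experts=16, num_tasks=2, top_k=4):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small MLP; together they replace the single dense FFN.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )
        # One router (gating network) per task: expert selection is task-dependent.
        self.routers = nn.ModuleList(nn.Linear(dim, num_experts) for _ in range(num_tasks))

    def forward(self, x, task_id):
        # x: (batch, tokens, dim); task_id picks the router of the currently active task.
        weights = F.softmax(self.routers[task_id](x), dim=-1)        # (B, T, num_experts)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)          # keep only the top-k experts
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)           # renormalize their gate weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]                                # (B, T) expert index per token
            w = topk_w[..., slot].unsqueeze(-1)                      # (B, T, 1) gate weight per token
            for e, expert in enumerate(self.experts):
                mask = idx == e                                      # tokens routed to expert e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out

At inference time only one task_id is active, so only the experts selected by that task's router are executed; the remaining experts contribute no computation, which is the source of the single-task FLOP savings.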
At the hardware level, we propose a novel computation reordering mechanism tailored for memory-constrained MTL and MoE, which allows scaling up to any number of experts and achieves zero-overhead switching between tasks. Specifically, based on the ViT, we push tokens into per-expert queues to enable expert-by-expert rather than token-by-token computation. We then implement a double-buffered computation strategy that hides the memory access latency required to load each expert's weights from off-chip memory, regardless of task-specific expert selection. This design naturally incurs no overhead for switching between frames or tasks on the FPGA.
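As a software illustration of this reordering (not the actual FPGA implementation), the sketch below groups tokens into per-expert queues and then fetches each expert's weights exactly once for all tokens in its queue. The helpers load_expert_weights and compute_expert are hypothetical stand-ins for the hardware's weight-loading and MAC stages, and the double-buffering that overlaps the next weight load with the current computation is a hardware detail not modeled here.

from collections import defaultdict

def expert_by_expert(tokens, expert_assignment, load_expert_weights, compute_expert):
    # tokens: list of token vectors; expert_assignment[i] is the expert chosen for token i.
    # 1) Push tokens into per-expert queues instead of processing them token-by-token.
    queues = defaultdict(list)
    for position, expert_id in enumerate(expert_assignment):
        queues[expert_id].append(position)
    # 2) Process expert-by-expert: only one expert's weights are resident at a time.
    outputs = [None] * len(tokens)
    for expert_id, positions in queues.items():
        weights = load_expert_weights(expert_id)      # single off-chip fetch per expert
        for position in positions:
            outputs[position] = compute_expert(weights, tokens[position])
    return outputs

Because expert weights are streamed in per frame anyway, switching to a different task merely changes which experts get fetched, so task switching incurs no additional overhead on top of the per-expert loads.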
To validate the effectiveness, we evaluate our performance gain using the ViT-small backbone on the NYUD-v2 and PASCAL-Context datasets. On the NYUD-v2 dataset with two tasks, our model achieves comparable results to encoder-focused MTL methods while reducing FLOPs for single-task execution by 71%. When we evaluate on the PASCAL-Context dataset with more tasks, our model achieves even better performance (2.71 vs. 0.60) and reduces inference FLOPs by 88%. We find that the MTL performance gain brought by MoE layers consistently increases as the task count grows. When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.40× while achieving energy efficiency (measured as the product of latency and power) up to 9.23× higher than comparable FPGA baselines and up to 10.79× higher than the GPU implementation. Our contributions are outlined below:
- We target the problem of efficient MTL and adopt the more realistic inference setting (activating one task at a time, while switching between tasks). We introduce MoE as a unified tool to attain two goals: resolving cross-task training conflicts (better MTL performance) and sparsely activating paths for single-task inference (better efficiency). Specifically for MTL, each MoE layer is accompanied by a task-dependent gating network that makes expert selections conditioned on the current task.
- We implement the proposed MTL MoE ViT framework on a hardware platform of one Xilinx ZCU104 FPGA, which enables us to exploit a memory-efficient computation reordering scheme that consolidates per-expert multiply-and-accumulate (MAC) operations such that only one expert's weights are needed on-chip at a time. Our design is scalable to any number of experts while incurring no frame-switching or task-switching overhead.
- We conduct extensive experiments to validate inference effectiveness in both accuracy and on-edge efficiency metrics. Our framework, dubbed M3ViT, achieves higher accuracies than encoder-focused MTL methods, while reducing inference FLOPs by 88%; on hardware, it reduces the memory requirement by 2.40× and costs up to 9.23× and 10.79× less energy compared to the FPGA and GPU baselines, respectively.
2 Related Works
Multi-task Learning
The generic multi-task learning problem has a long history of study. Some non-deep-learning methods use distance metrics [33, 34, 35] or probabilistic priors [36, 37, 38, 39] to model the common information among tasks. With the emergence of deep learning, MTL [14, 40, 23, 41, 42, 43] is performed to learn shared representations among tasks. The emergence of ViT further makes it possible to extend the task range from vision-only tasks to tasks in other modalities (e.g., text, audio) [44, 45, 46, 47, 48]. Current MTL models can be roughly categorized into two types based on where the task interactions take place in the network. Encoder-focused architectures [23, 40, 26, 27] only share information in the encoder, before decoding each task with an independent task-specific head. Cross-stitch networks [23] introduce linear combinations of features in each layer. NDDR-CNN [26] improves on this with dimensionality reduction. MTAN [27] leverages an attention mechanism to share information between tasks. TAPS [49] adapts a base model to a new task by modifying a small task-specific subset of layers. The second type, decoder-focused models [42, 43, 50, 51], make initial task predictions in the decoder and then leverage features from these initial predictions to further improve each output. Although they report higher performance, these models consume a large number of FLOPs, according to [14]. This makes it difficult to deploy them onto real-world systems that are often resource-constrained or latency-sensitive. Moreover, they need to execute all tasks to obtain the initial predictions, which is heavily inefficient in the common scenario where only one or a few tasks are needed. Hence, we focus on the encoder-focused architecture in this work. Many methods [25, 20, 52, 27] have also been proposed to handle the MTL training-conflict problem.
Mixture of Experts (MoE)
MoE contains a series of sub-models (i.e., experts) and performs conditional computation in an input-dependent fashion [53, 54, 55, 56, 57], based on learned or deterministic routing policies [58, 57]. Traditional dense MoEs suffer from intensive computational costs since they select all experts [59]. Recent studies [30, 60, 61] in natural language processing (NLP) propose sparse MoE, which sparsely activates a few experts during both training and inference, thus substantially reducing the cost and enabling gigantic language models with up to trillions of parameters [61]. Unfortunately, such sparse gating still suffers from unstable training and imbalanced selection among experts; various solutions have been proposed from regularization [62, 60, 61] and optimization [63, 64] perspectives. Moreover, MoE has drawn increasing popularity in computer vision [59, 65, 66, 67, 68, 69, 70], where it mainly targets considerably smaller network backbones than those in NLP. For instance, [67] and [68] formulate the channels and kernels of convolutional layers as experts and establish the MoE framework. Several pioneering investigations also explore MoE for multi-task learning, which are related to this work. In particular, [17, 71, 72] introduce task-specific gating networks to choose different parts of the model for processing information from each task. They demonstrate the possibility of using MoE to solve MTL problems in cases such as classification of medical signals [71], digit images (MNIST) [72], and recommendation systems [17]. We make a further attempt to adapt MoE into a compact model for dense-prediction multi-task learning, along with software-hardware co-design.
Vision Transformer
There is growing interest in exploring the use of transformers [73, 3] for computer vision tasks since their success in natural language processing [73, 74, 75], including image generation [76, 77], generative adversarial networks [78, 79], image classification [76, 3, 80, 81, 82, 83, 84], semantic segmentation [8, 85], object detection [6, 86], 3D data processing [87, 88, 89], novel view synthesis [90, 91], and many others [92, 93, 94, 95].
Hardware
FPGA acceleration of Transformer-based models has attracted increasing attention. Pioneering works [96, 97, 98, 99] note that transformers are computation- and memory-intensive and too large to fit on an FPGA. Therefore, various model compression methods have been proposed, such as activation quantization, token pruning, block-circulant matrices (BCM) for weights, block-balanced weight pruning, and column-balanced block weight pruning. Such compression methods are lossy and require compression-aware training to regain accuracy. To the best of our knowledge, there is no existing FPGA accelerator for MoE in a Transformer-based model. The MoE mechanism poses great challenges for FPGAs, since it requires swift expert switching between tokens and frames, which may introduce significant memory and parameter-loading overhead. In this work, however, we propose a novel expert-by-expert computation-reordering approach that reduces this overhead to a negligible level regardless of the number of experts, and does not require model compression or re-training.
3 Method
Overview
We first describe the standard Vision Transformer and MoE, and then present the proposed MoE ViT design for MTL. To enable dynamic adaptation between different tasks with minimum overhead on the FPGA, we detail the hardware implementation. Figure 1 shows the whole framework.
Figure 1: The overall structure of the proposed M3ViT pipeline, showing (a) the MoE ViT design and (b) the hardware design. The input image is split into fixed-size patches, embedded, and combined with position embeddings. In training, the MTL MoE ViT adaptively activates the model by sparsely selecting relevant experts using its task-dependent routers. During inference, only one task is performed at a time. The hardware collects all patches allocated to each expert and processes them expert-by-expert with the “load parameters” and “compute expert” modules.
Figure 2: The proposed two variants of MTL MoE layers. (a) Multi-gate MoE layer design: each task selects its experts using its own router. (b) Task-conditioned MoE layer design: all tasks share one router, while a task-specific embedding is concatenated with the token embedding to form the input of the shared router.
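For concreteness, the sketch below shows the task-conditioned router of Figure 2(b), assuming a PyTorch-style implementation; the class name, task_dim, and other details are illustrative rather than taken from the paper's released code.

import torch
import torch.nn as nn

class TaskConditionedRouter(nn.Module):
    # A single gate shared by all tasks; the task identity enters through a learned embedding.
    def __init__(self, dim, num_experts, num_tasks, task_dim=64):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, task_dim)
        self.gate = nn.Linear(dim + task_dim, num_experts)

    def forward(self, x, task_id):
        # x: (batch, tokens, dim); task_id: integer id of the currently active task.
        t = self.task_embed(torch.as_tensor(task_id, device=x.device))   # (task_dim,)
        t = t.expand(x.shape[0], x.shape[1], -1)                         # broadcast to every token
        return self.gate(torch.cat([x, t], dim=-1))                      # (batch, tokens, num_experts)

Compared with the multi-gate variant in Figure 2(a), the shared gate keeps the router parameter count essentially constant as tasks are added: each new task contributes only one extra task embedding vector.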
3.1 Task-dependent MoE ViT Design
Vision Transformer
The representative Vision Transformer architecture [3] first splits the input image into non-overlapping patches and projects the patches to a higher hidden dimension using one linear projection layer.