M3ViT: Mixture-of-Experts Vision Transformer
for Efficient Multi-task Learning
with Model-Accelerator Co-design
Hanxue Liang1, Zhiwen Fan1, Rishov Sarkar2, Ziyu Jiang3, Tianlong Chen1, Kai Zou4, Yu Cheng5, Cong Hao2, Zhangyang Wang1
1University of Texas at Austin, 2Georgia Institute of Technology
3Texas A&M University, 4Protagolabs Inc, 5Microsoft Research
{haliang,zhiwenfan,tianlong.chen,atlaswang}@utexas.edu
{rishov.sarkar,callie.hao}@gatech.edu,jiangziyu@tamu.edu
kz@protagolabs.com,yu.cheng@microsoft.com
Abstract
Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. However, when deploying MTL onto real-world systems that are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks, and the challenge is amplified when a growing number of tasks have to be squeezed into one compact model; (ii) at inference, current MTL regimes have to activate nearly the entire model even to execute a single task. Yet most real systems demand only one or two tasks at each moment, and switch between tasks as needed; such “all tasks activated” inference is therefore highly inefficient and non-scalable.
In this paper, we present a model-accelerator co-design framework that enables efficient on-device MTL and tackles both training and inference bottlenecks. Our framework, dubbed M3ViT, customizes mixture-of-experts (MoE) layers into a vision transformer (ViT) backbone for MTL, and sparsely activates task-specific experts during training, which effectively disentangles the parameter spaces to avoid different tasks' training conflicts. Then, at inference with any task of interest, the same design allows activating only the task-corresponding sparse “expert” pathway, instead of the full model. Our model design is further enhanced by hardware-level innovations, in particular a novel computation reordering scheme tailored for memory-constrained MTL that achieves zero-overhead switching between tasks and can scale to any number of experts. Extensive experiments on the PASCAL-Context [1] and NYUD-v2 [2] datasets, at both the software and hardware levels, demonstrate the effectiveness of the proposed design. When executing single-task inference, M3ViT achieves higher accuracies than encoder-focused MTL methods, while reducing inference FLOPs by 88%. When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.40×, while achieving energy efficiency up to 9.23× higher than a comparable FPGA baseline.
Code is available at: https://github.com/VITA-Group/M3ViT.
1 Introduction
Vision Transformers (ViTs) [3, 4, 5, 6], as the latest performant deep models, have achieved impressive performance on various computer vision tasks [7, 8, 9]. These models are specially trained or tested
for only one or a few tasks; however, many real-world applications require one compact system that can handle many different tasks efficiently, and often need to swiftly switch between tasks on demand. For example, an autonomous driving system [10] needs to perform and switch between many tasks such as drivable area estimation, lane detection, pedestrian detection, and scene classification: both single-task inference and cross-task switching need to happen at ultra-low latency. As another example, smart-home indoor robots [11] are expected to address semantic segmentation, navigation, tracking, or other tasks in varying contexts, with very limited on-board resources. Multi-task learning (MTL) [12, 13, 14] solves multiple tasks simultaneously within a single model and learns improved feature representations [15] shared by related tasks [16, 17]. Therefore, accomplishing realistic, efficient MTL is becoming a key enabler for building real-time sophisticated AI systems.
Despite the promise, challenges persist in building an efficient MTL model suitable for real-world applications: (1) During training, prior works [18, 19, 20] indicate that competition among tasks may degrade MTL, since the same weights might receive conflicting update directions from different tasks. Specifically, [19] reveals that negative cosine similarities between different tasks' gradients are detrimental, and [21, 22] confirm that conflicting gradients not only slow down convergence but also bias the learned representations against some tasks. This only gets worse on compact models, owing to their limited modeling capacity. To tackle cross-task conflicts, solutions have been proposed that vary the learning rates of different tasks [20], use “cross-stitch” sharing [23], or re-balance task gradients [19, 24, 20, 25]. However, they either require task-specific designs or significantly increase model complexity, which contradicts our efficiency goal.
(2) At inference, existing MTL regimes typically activate the entire backbone model unconditionally. However, many real systems only need to call upon one or a few tasks at each moment, hence the “all activated” inference is heavily inefficient and non-scalable. For example, current regimes [14, 23, 26, 27] have to activate the whole gigantic ResNet [28] encoder even just to execute a single monocular depth estimation task. If the number of tasks scales up [29] and the backbone keeps growing bigger, the “per task” inference efficiency of the resulting MTL model could become catastrophically poor.
To tackle these bottlenecks, we propose a model-accelerator co-design framework that enables efficient on-device MTL. Specifically, at the software level, we propose to adapt mixture-of-experts (MoE) layers [30, 31] into the MTL backbone, as MoE can adaptively divide-and-conquer the entire model capacity into smaller sub-models [30, 32]. Here, we replace the dense feed-forward network in the ViT with sparsely activated MoE experts (MLPs). A task-dependent gating network is trained to select a subset of experts for each input token, conditioned on the task. During training, this task-dependent routing effectively disentangles the parameter spaces, balancing feature reuse while automatically avoiding different tasks' training conflicts. Meanwhile, at the inference stage for any task of interest, this design naturally allows sparse activation of only the experts corresponding to that task instead of the full model, thus achieving highly sparse and efficient inference for the specific task.
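To make the routing concrete, the listing below sketches how a dense ViT feed-forward block can be replaced by sparsely activated, task-routed experts, with one router per task. It is a minimal PyTorch sketch for illustration only; the class and parameter names (MultiGateMoELayer, num_experts, top_k, hidden_dim) are our own and do not reflect M3ViT's exact configuration or released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGateMoELayer(nn.Module):
    """Replaces the dense FFN of a ViT block with task-routed experts (illustrative sketch)."""
    def __init__(self, dim, hidden_dim, num_experts=16, num_tasks=2, top_k=4):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small MLP; together they replace the single dense FFN.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )
        # One router (gating network) per task: expert selection is task-dependent.
        self.routers = nn.ModuleList(nn.Linear(dim, num_experts) for _ in range(num_tasks))

    def forward(self, x, task_id):
        # x: (batch, tokens, dim); task_id picks the router of the currently active task.
        weights = F.softmax(self.routers[task_id](x), dim=-1)        # (B, T, num_experts)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)          # keep only the top-k experts
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)           # renormalize their gate weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]                                # (B, T) expert index per token
            w = topk_w[..., slot].unsqueeze(-1)                      # (B, T, 1) gate weight per token
            for e, expert in enumerate(self.experts):
                mask = idx == e                                      # tokens routed to expert e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out

At inference time only one task_id is active, so only the experts selected by that task's router are executed; the remaining experts contribute no computation, which is the source of the single-task FLOP savings.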
At the hardware level, we propose a novel computation reordering mechanism tailored for memory-constrained MTL and MoE, which allows scaling up to any number of experts and achieves zero-overhead switching between tasks. Specifically, based on the ViT, we push tokens into per-expert queues to enable expert-by-expert rather than token-by-token computation. We then implement a double-buffered computation strategy that hides the memory access latency required to load each expert's weights from off-chip memory, regardless of task-specific expert selection. This design naturally incurs no overhead for switching between frames or tasks on the FPGA.
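As a software illustration of this reordering (not the actual FPGA implementation), the sketch below groups tokens into per-expert queues and then fetches each expert's weights exactly once for all tokens in its queue. The helpers load_expert_weights and compute_expert are hypothetical stand-ins for the hardware's weight-loading and MAC stages, and the double-buffering that overlaps the next weight load with the current computation is a hardware detail not modeled here.

from collections import defaultdict

def expert_by_expert(tokens, expert_assignment, load_expert_weights, compute_expert):
    # tokens: list of token vectors; expert_assignment[i] is the expert chosen for token i.
    # 1) Push tokens into per-expert queues instead of processing them token-by-token.
    queues = defaultdict(list)
    for position, expert_id in enumerate(expert_assignment):
        queues[expert_id].append(position)
    # 2) Process expert-by-expert: only one expert's weights are resident at a time.
    outputs = [None] * len(tokens)
    for expert_id, positions in queues.items():
        weights = load_expert_weights(expert_id)      # single off-chip fetch per expert
        for position in positions:
            outputs[position] = compute_expert(weights, tokens[position])
    return outputs

Because expert weights are streamed in per frame anyway, switching to a different task merely changes which experts get fetched, so task switching incurs no additional overhead on top of the per-expert loads.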
To validate the effectiveness, we evaluate our performance gain using the ViT-small backbone on the NYUD-v2 and PASCAL-Context datasets. On the NYUD-v2 dataset with two tasks, our model achieves comparable results to encoder-focused MTL methods while reducing FLOPs for single-task execution by 71%. When we evaluate on the PASCAL-Context dataset with more tasks, our model achieves even better performance (2.71 vs. 0.60) and reduces inference FLOPs by 88%. We find that the MTL performance gain brought by MoE layers consistently increases as the task count grows. When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.40× while achieving energy efficiency (measured as the product of latency and power) up to 9.23× higher than comparable FPGA baselines and up to 10.79× higher than the GPU implementation. Our contributions are outlined below:
- We target the problem of efficient MTL and adopt the more realistic inference setting (activating one task at a time, while switching between tasks). We introduce MoE as a unified tool to attain two goals: resolving cross-task training conflicts (better MTL performance) and sparsely activating paths for single-task inference (better efficiency). Specifically for MTL, each MoE layer is accompanied by a task-dependent gating network that makes expert selections conditioned on the current task.
- We implement the proposed MTL MoE ViT framework on a hardware platform of one Xilinx ZCU104 FPGA, which enables us to exploit a memory-efficient computation reordering scheme that consolidates per-expert multiply-and-accumulate (MAC) operations such that only one expert's weights are needed on-chip at a time. Our design is scalable to any number of experts while incurring no frame-switching or task-switching overhead.
- We conduct extensive experiments to validate inference effectiveness in both accuracy and on-edge efficiency metrics. Our framework, dubbed M3ViT, achieves higher accuracies than encoder-focused MTL methods, while reducing inference FLOPs by 88%; on hardware, it reduces the memory requirement by 2.40× and costs up to 9.23× and 10.79× less energy compared to the FPGA and GPU baselines, respectively.
2 Related Works
Multi-task Learning
The generic multi-task learning problem has a long history of study. Some non-deep-learning methods use distance metrics [33, 34, 35] or probabilistic priors [36, 37, 38, 39] to model the common information among tasks. With the emergence of deep learning, MTL [14, 40, 23, 41, 42, 43] is performed to learn shared representations among tasks. The emergence of ViT further makes it possible to extend the task range from vision-only tasks to tasks in other modalities (e.g., text, audio) [44, 45, 46, 47, 48]. Current MTL models can be roughly categorized into two types based on where the task interactions take place in the network. Encoder-focused architectures [23, 40, 26, 27] only share information in the encoder, before decoding each task with an independent task-specific head. Cross-stitch networks [23] introduce linear combinations of features in each layer. NDDR-CNN [26] improves on this with dimensionality reduction. MTAN [27] leverages an attention mechanism to share information between tasks. TAPS [49] adapts a base model to a new task by modifying a small task-specific subset of layers. The second type, decoder-focused models [42, 43, 50, 51], make initial task predictions in the decoder and then leverage features from these initial predictions to further improve each output. Although they report higher performance, these models consume a large number of FLOPs, according to [14]. This makes it difficult to deploy them onto real-world systems that are often resource-constrained or latency-sensitive. Moreover, they need to execute all tasks to obtain the initial predictions, which is heavily inefficient in the common scenario where only one or a few tasks are needed. Hence, we focus on the encoder-focused architecture in this work. Many methods [25, 20, 52, 27] have also been proposed to handle the MTL training-conflict problem.
Mixture of Experts (MoE)
MoE contains a series of sub-models (i.e., experts) and performs conditional computation in an input-dependent fashion [53, 54, 55, 56, 57], based on learned or deterministic routing policies [58, 57]. Traditional dense MoEs suffer from intensive computational costs since they select all experts [59]. Recent studies [30, 60, 61] in natural language processing (NLP) propose sparse MoE, which sparsely activates a few experts during both training and inference, thus substantially reducing the cost and enabling gigantic language models with up to trillions of parameters [61]. Unfortunately, such sparse gating still suffers from unstable training and imbalanced selection among experts; various solutions have been proposed from regularization [62, 60, 61] and optimization [63, 64] perspectives. Moreover, MoE has drawn increasing popularity in computer vision [59, 65, 66, 67, 68, 69, 70], where it mainly targets considerably smaller network backbones than those in NLP. For instance, [67] and [68] formulate the channels and kernels of convolutional layers as experts and establish the MoE framework. Several pioneering investigations also explore MoE for multi-task learning, which are related to this work. In particular, [17, 71, 72] introduce task-specific gating networks to choose different parts of the model for processing information from each task. They demonstrate the possibility of using MoE to solve MTL problems in cases such as classification of medical signals [71], digit images (MNIST) [72], and recommendation systems [17]. We make a further attempt to adapt MoE into a compact model for dense-prediction multi-task learning, along with software-hardware co-design.
Vision Transformer
There is growing interest in exploring the use of transformers [73, 3] for computer vision tasks since their success in natural language processing [73, 74, 75], including image generation [76, 77], generative adversarial networks [78, 79], image classification [76, 3, 80, 81, 82, 83, 84], semantic segmentation [8, 85], object detection [6, 86], 3D data processing [87, 88, 89], novel view synthesis [90, 91], and many others [92, 93, 94, 95].
Hardware
FPGA acceleration of Transformer-based models has attracted increasing attention. Pioneering works [96, 97, 98, 99] note that transformers are computation- and memory-intensive and too large to fit on an FPGA. Therefore, various model compression methods have been proposed, such as activation quantization, token pruning, block-circulant matrices (BCM) for weights, block-balanced weight pruning, and column-balanced block weight pruning. Such compression methods are lossy and require compression-aware training to regain accuracy. To the best of our knowledge, there is no existing FPGA accelerator for MoE in a Transformer-based model. The MoE mechanism poses great challenges for FPGAs, since it requires swift expert switching between tokens and frames, which may introduce significant memory and parameter-loading overhead. In this work, however, we propose a novel expert-by-expert computation-reordering approach that reduces this overhead to a negligible level regardless of the number of experts, and does not require model compression or re-training.
3 Method
Overview
We first describe the standard Vision Transformer and MoE, and then present the proposed MoE ViT design for MTL. To enable dynamic adaptation between different tasks with minimum overhead on the FPGA, we detail the hardware implementation. Figure 1 shows the whole framework.
Figure 1: The overall structure of the proposed M3ViT pipeline, showing (a) the MoE ViT design and (b) the hardware design. The input image is split into fixed-size patches, embedded, and combined with position embeddings. In training, the MTL MoE ViT adaptively activates the model by sparsely selecting relevant experts using its task-dependent routers. During inference, only one task is performed at a time. The hardware collects all patches allocated to each expert and processes them expert-by-expert with the “load parameters” and “compute expert” modules.
Figure 2: The proposed two variants of MTL MoE layers. (a) Multi-gate MoE layer design: each task selects its experts using its own router. (b) Task-conditioned MoE layer design: all tasks share one router, while a task-specific embedding is concatenated with the token embedding to form the input of the shared router.
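For concreteness, the sketch below shows the task-conditioned router of Figure 2(b), assuming a PyTorch-style implementation; the class name, task_dim, and other details are illustrative rather than taken from the paper's released code.

import torch
import torch.nn as nn

class TaskConditionedRouter(nn.Module):
    # A single gate shared by all tasks; the task identity enters through a learned embedding.
    def __init__(self, dim, num_experts, num_tasks, task_dim=64):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, task_dim)
        self.gate = nn.Linear(dim + task_dim, num_experts)

    def forward(self, x, task_id):
        # x: (batch, tokens, dim); task_id: integer id of the currently active task.
        t = self.task_embed(torch.as_tensor(task_id, device=x.device))   # (task_dim,)
        t = t.expand(x.shape[0], x.shape[1], -1)                         # broadcast to every token
        return self.gate(torch.cat([x, t], dim=-1))                      # (batch, tokens, num_experts)

Compared with the multi-gate variant in Figure 2(a), the shared gate keeps the router parameter count essentially constant as tasks are added: each new task contributes only one extra task embedding vector.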
3.1 Task-dependent MoE ViT Design
Vision Transformer
The representative Vision Transformer architecture [3] first splits the input image into non-overlapping patches and projects the patches to a higher hidden dimension using one linear projection layer.