
for only one or a few tasks; however, many real-world applications require one compact system that can handle many different tasks efficiently, and often need to swiftly switch between tasks on demand. For example, an autonomous driving system [10] needs to perform and switch between many tasks such as drivable area estimation, lane detection, pedestrian detection, and scene classification: clearly, both single-task inference and cross-task switching need to happen at ultra-low latency. As another example, smart-home indoor robots [11] are expected to address semantic segmentation, navigation, tracking, and other tasks in varying contexts, with very limited on-board resources. Multi-task learning (MTL) [12, 13, 14] solves multiple tasks simultaneously within a single model and learns improved feature representations [15] shared by related tasks [16, 17]. Therefore, accomplishing realistic, efficient MTL is becoming a key enabler for building real-time, sophisticated AI systems.
Despite the promise, challenges persist in building an efficient MTL model suitable for real-world applications: ① During training, prior works [18, 19, 20] indicate that competition between different tasks during training may degrade MTL, since the same weights might receive, and be confused by, conflicting update directions. Specifically, [19] reveals that negative cosine similarities between different tasks' gradients are detrimental. [21, 22] confirm that conflicting gradients not only slow down convergence but also bias the learned representations against some tasks. This only gets worse on compact models owing to their limited modeling capacity. To tackle these cross-task conflicts, solutions have been proposed that vary the learning rates of different tasks [20], use "cross-stitch" sharing [23], or re-balance task gradients [19, 24, 20, 25]. However, they either require task-specific designs or significantly increase model complexity, which contradicts our efficiency goal.
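For concreteness, the gradient-conflict criterion discussed above can be sketched as follows in PyTorch-style code: it computes the pairwise cosine similarity between per-task gradients of the shared parameters, where negative entries indicate conflicting tasks. The function name and argument interface are illustrative assumptions, not part of our framework.

```python
import torch
import torch.nn.functional as F

def cross_task_gradient_cosine(shared_params, task_losses):
    """Pairwise cosine similarity between per-task gradients of shared weights.

    Negative entries correspond to the conflicting-gradient regime described
    above. `shared_params` and `task_losses` are hypothetical handles into an
    MTL model: the shared parameters and one scalar loss per task.
    """
    grads = []
    for loss in task_losses:
        g = torch.autograd.grad(loss, shared_params, retain_graph=True, allow_unused=True)
        grads.append(torch.cat([p.reshape(-1) for p in g if p is not None]))
    n = len(grads)
    sims = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            sims[i, j] = F.cosine_similarity(grads[i], grads[j], dim=0)
    return sims
```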
② At inference, existing MTL regimes typically activate the entire backbone model unconditionally. However, many real systems only need to call upon one or a few tasks at each moment, so the "all-activated" inference is heavily inefficient and non-scalable. For example, current regimes [14, 23, 26, 27] have to activate the whole gigantic ResNet [28] encoder even to execute a single task such as monocular depth estimation. If the number of tasks scales up [29] and the backbone keeps growing bigger, the per-task inference efficiency of the resultant MTL model could become catastrophically poor.
To tackle these bottlenecks, we propose a model-accelerator co-design framework that enables efficient on-device MTL. Specifically, at the software level, we propose to adapt mixture-of-experts (MoE) layers [30, 31] into the MTL backbone, as MoE can adaptively divide-and-conquer the entire model capacity into smaller sub-models [30, 32]. Here, we replace the dense feed-forward network in the ViT with sparsely activated MoE experts (MLPs). A task-dependent gating network is trained to select a subset of experts for each input token, conditioned on the task. During training, this task-dependent routing effectively disentangles the parameter space, balancing feature reuse while automatically avoiding training conflicts between tasks. Meanwhile, at inference with any task of interest, this design naturally allows sparse activation of only the experts corresponding to that task instead of the full model, thus achieving highly sparse and efficient inference for the specific task.
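A minimal PyTorch-style sketch of such a task-conditioned MoE layer is given below, assuming one gating head per task and top-k routing; the class name, gating interface, and top_k value are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskMoE(nn.Module):
    """Sparse MoE layer standing in for the dense FFN of a ViT block.

    A task-dependent gate scores the experts for every token and only the
    top-k experts are executed, so single-task inference touches only a
    small subset of the layer's parameters. Names and top_k are illustrative.
    """
    def __init__(self, dim, hidden_dim, num_experts, num_tasks, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])
        # One gating head per task: routing is conditioned on the task id.
        self.gates = nn.ModuleList([nn.Linear(dim, num_experts) for _ in range(num_tasks)])
        self.top_k = top_k

    def forward(self, x, task_id):
        # x: (batch, tokens, dim)
        logits = self.gates[task_id](x)                   # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)    # route each token to its top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[..., k] == e                   # tokens assigned to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```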
At the hardware level, we propose a novel computation reordering mechanism tailored for memory-constrained MTL and MoE, which allows scaling up to any number of experts and achieves zero-overhead switching between tasks. Specifically, based on the ViT, we push tokens into per-expert queues to enable expert-by-expert rather than token-by-token computation. We then implement a double-buffered computation strategy that hides the memory access latency of loading each expert's weights from off-chip memory, regardless of task-specific expert selection. This design naturally incurs no overhead for switching between frames or tasks on the FPGA.
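The following software sketch (not the FPGA implementation itself) illustrates the reordering and double-buffering schedule under stated assumptions: tokens are first binned into per-expert queues, and while one expert's queue is being computed, the next expert's weights are prefetched from off-chip memory. The loader and compute interfaces are hypothetical.

```python
def moe_layer_schedule(tokens, expert_ids, weight_loader, compute):
    """Process one MoE layer expert-by-expert instead of token-by-token.

    tokens           : token activations held on-chip
    expert_ids       : expert_ids[i] is the expert selected for tokens[i]
    weight_loader(e) : starts an off-chip fetch of expert e's weights and
                       returns a handle whose .wait() blocks until done
    compute(w, toks) : runs the expert MLP on a batch of queued tokens
    """
    # 1) Push every token into the queue of its selected expert.
    queues = {}
    for tok, e in zip(tokens, expert_ids):
        queues.setdefault(e, []).append(tok)
    if not queues:
        return []

    # 2) Double buffering: prefetch the next expert's weights while the
    #    current expert is being computed, hiding off-chip memory latency.
    order = sorted(queues)
    outputs = []
    pending = weight_loader(order[0])
    for i, e in enumerate(order):
        weights = pending.wait()
        if i + 1 < len(order):
            pending = weight_loader(order[i + 1])  # overlap fetch with compute
        outputs.extend(compute(weights, queues[e]))
    return outputs
```

Because the schedule depends only on which experts are selected, not on which task selected them, swapping tasks between frames reuses the same pipeline with no extra setup cost.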
To validate the effectiveness, we evaluate our performance gain using the ViT-small backbone on the NYUD-v2 and PASCAL-Context datasets. On the NYUD-v2 dataset with two tasks, our model achieves results comparable to encoder-focused MTL methods while reducing FLOPs by 71% for single-task execution. When we evaluate on the PASCAL-Context dataset with more tasks, our model achieves even better performance (2.71 vs. 0.60) and reduces inference FLOPs by 88%. We find that the MTL performance gain brought by MoE layers consistently increases as the task count grows. When implemented on a Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.40× while achieving energy efficiency (as the product of latency and power) up to 9.23× higher than comparable FPGA baselines and up to 10.79× higher than a GPU implementation. Our contributions are outlined below:
• We target the problem of efficient MTL, and adopt the more realistic inference setting (activating one task at a time, while switching between tasks). We introduce MoE as the unified tool to attain two goals: resolving cross-task training conflicts (better MTL performance) and sparsely activating paths for single-task inference (better efficiency).