
tion, in-field robot navigation, etc. As more and more neural models are deployed into the real world, interest in developing edge-efficient architectures for dense predictions has been growing continually.
However, designing fast and efficient dense prediction models for edge devices is challenging. First of all, pixel-level predictions such as semantic segmentation and depth estimation are fundamentally slower than other popular vision tasks such as image classification or object detection. This is because, after encoding the input images into low-spatial-resolution features, these networks must upsample them back to produce high-resolution output masks. In fact, dense estimation can be several times or even an order of magnitude slower than its counterparts, depending on the specific model, hardware, and target resolution. Thus, real-time dense prediction models are not only non-trivial to design but can also easily become a latency bottleneck in systems that consume their outputs. These problems are intensified for edge applications on platforms like the Coral TPU [13], where computational resources are limited yet low latency is still required, e.g., to inform users or drive subsequent tasks in real time.
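To make this cost asymmetry concrete, the following PyTorch sketch (ours, purely illustrative; the tensor shapes are assumptions, not any particular architecture from this paper) contrasts a classification head, which collapses spatial resolution into a single vector, with a dense prediction head, which must upsample encoder features back to full resolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feats = torch.randn(1, 256, 32, 64)   # encoder output at 1/16 of a 512x1024 input

# Classification: one pooled vector per image -> cheap output.
cls_head = nn.Linear(256, 1000)
logits = cls_head(feats.mean(dim=(2, 3)))          # shape: (1, 1000)

# Dense prediction: per-pixel logits at full resolution -> expensive output.
seg_head = nn.Conv2d(256, 19, kernel_size=1)       # e.g., 19 Cityscapes classes
masks = F.interpolate(seg_head(feats), size=(512, 1024),
                      mode="bilinear", align_corners=False)
print(logits.shape, masks.shape)                   # (1, 1000) vs. (1, 19, 512, 1024)
```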
Second, developing models for these edge environments is costly and hard to scale in practice. On one hand, the architectural design process requires a significant amount of time, human labor, and expertise, with development ranging from a few months to a couple of years. On the other hand, edge applications may require deployment on various platforms, including cell phones, robots, drones, and more. Unfortunately, optimal designs discovered for one hardware platform may not generalize to another. Together, these issues pose challenges to the development of fast and efficient models for on-edge dense predictions.
To tackle these problems, our first key insight is that Multi-Task Learning of Dense Predictions (MTL-DP or MT-DP) and hardware-aware Neural Architecture Search (h-NAS) can work in synergy, not only benefiting each other but also significantly improving accuracy and computational efficiency. To the best of our knowledge, our framework, named EDNAS¹, is the first to successfully exploit this synergistic relationship between NAS and MTL for dense predictions. Indeed, on one hand, state-of-the-art methods for multi-task dense predictions [4, 22, 36, 40, 53, 58, 66], in which related tasks are learned jointly, mostly focus on learning how to share a fixed set of model components effectively among tasks but do not consider whether such a set is itself optimal for MTL to begin with. Moreover, these works typically study large models targeting powerful graphics accelerators such as the V100 GPU for inference and are not readily suitable for edge applications. On the other hand, NAS methods aim to automatically learn an optimal set of neural components and their connections. However, the current literature often focuses on either simpler tasks such as classification [7, 33, 62] or single-task training setups [19, 34]. In contrast, we learn MTL-DP and NAS jointly and leverage their strengths to tackle the aforementioned issues simultaneously, resulting in a novel and improved approach to efficient dense predictions for the edge.

¹Short for “Edge-Efficient Dense Predictions via Multi-Task NAS”
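As a rough illustration of how hardware awareness can enter such a search, the sketch below folds multi-task accuracy and measured on-device latency into a single reward, in the style of MnasNet-like trade-offs. The function name, metric names, weights, and 30 ms target are our hypothetical choices for illustration, not the actual EDNAS objective:

```python
# Illustrative hardware-aware, multi-task search reward (not the exact
# EDNAS objective): candidates that are accurate on all tasks but slow
# on the target edge device are penalized.
def mtl_nas_reward(task_scores, latency_ms, target_ms=30.0, w=-0.07):
    # task_scores: per-task metrics, e.g. {"seg_miou": 0.71, "depth_acc": 0.80}
    # latency_ms:  latency measured on the target device (e.g., Coral TPU)
    avg_score = sum(task_scores.values()) / len(task_scores)
    return avg_score * (latency_ms / target_ms) ** w  # soft latency penalty

reward = mtl_nas_reward({"seg_miou": 0.71, "depth_acc": 0.80}, latency_ms=42.0)
```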
Our second key insight is that the standard depth estimation training used in MTL-DP can produce significant fluctuation in evaluation accuracy. Indeed, our analysis reveals a potential for undesirably large variance in both absolute and relative depth errors. We hypothesize that this is caused by the standard depth training practice of relying solely on the L1 loss function. This can significantly and negatively affect the accuracy of MT-DP evaluation, as arbitrary “improvement” (or “degradation”) can manifest purely because of random fluctuation in the relative error. It is important to raise awareness of and appropriately address this issue, as segmentation and depth are arguably two of the most commonly jointly learned and used tasks in edge applications. To this end, we propose JAReD, an easy-to-adopt augmented loss that jointly and directly optimizes for both relative and absolute depth errors. The proposed loss is highly effective at simultaneously reducing noisy fluctuations and boosting overall prediction accuracy.
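A minimal PyTorch sketch in the spirit of this idea (the exact JAReD formulation is given later in the paper; the `alpha` weight and `eps` stabilizer here are our illustrative assumptions) augments the standard absolute-error term with a relative-error term:

```python
import torch

def augmented_depth_loss(pred, target, alpha=1.0, eps=1e-6):
    # Standard practice optimizes only the absolute (L1) error; adding a
    # relative-error term directly optimizes both evaluation metrics.
    abs_err = (pred - target).abs()
    rel_err = abs_err / (target.abs() + eps)  # eps avoids division by zero
    return abs_err.mean() + alpha * rel_err.mean()

pred, target = torch.rand(8, 1, 64, 64) * 10, torch.rand(8, 1, 64, 64) * 10
loss = augmented_depth_loss(pred, target)
```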
We conduct extensive evaluations on Cityscapes [14] and NYUv2 [50] to demonstrate the effectiveness and robustness of EDNAS and the JAReD loss. Experimental results indicate that our methods can yield significant gains of up to +8.5% and +10.9% DP accuracy, respectively, considerably higher than the previous state of the art, with only 1/10th of the parameter and FLOP counts (Fig. 1).
2. Background and Related Work
In general, dense prediction models are often designed manually, in isolation, or without the constraints of limited edge computation [10, 27, 34, 35]. Specifically, works on multi-task learning for dense predictions (MTL-DP) [4, 5, 20, 22, 53, 58] often take a fixed base architecture such as DeepLab [9] and focus on learning to effectively share components, e.g., via cross-task communication modules [5, 20], adaptive tree-like branching [4, 22, 58], layer skipping [53], etc. (Fig. 2).
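For orientation, a generic hard-parameter-sharing baseline of the kind these methods build on might look as follows (our illustrative sketch; the channel width and head designs are assumptions, and the cited methods learn how to branch and share rather than fixing it by hand):

```python
import torch.nn as nn

class SharedBackboneMTL(nn.Module):
    # Hard parameter sharing: one backbone shared across tasks, plus a
    # lightweight task-specific head per output (segmentation, depth).
    def __init__(self, backbone, feat_ch=256, num_classes=19):
        super().__init__()
        self.backbone = backbone                            # shared
        self.seg_head = nn.Conv2d(feat_ch, num_classes, 1)  # task-specific
        self.depth_head = nn.Conv2d(feat_ch, 1, 1)          # task-specific

    def forward(self, x):
        feats = self.backbone(x)
        return self.seg_head(feats), self.depth_head(feats)
```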
On the other hand, neural architecture search (NAS) studies have, until recently, focused mostly on either image classification problems [1, 7, 29, 33, 39, 62] or learning tasks in isolation [19, 34, 54, 67]. Few have explored architecture search for the joint training of dense prediction tasks. However, as mentioned earlier, edge efficiency can potentially benefit from combining MTL-DP and NAS. To the best of our knowledge, our study is the first to report successful joint optimization of these two learning paradigms for dense predictions. Next, we give an overview of the most relevant efforts in the two domains of MTL and NAS. For more details, please refer to