
ASAP: Accurate semantic segmentation for real
time performance
Jae Hyun Park
AI tech team
Lotte Data Communication Company
Seoul, South Korea
jaehyun-park@lotte.net
Su Bin Lee
AI tech team
Lotte Data Communication Company
Seoul, South Korea
leesubin@lotte.net
Eon Kim
AI tech team
Lotte Data Communication Company
Seoul, South Korea
eon.kim@lotte.net
Byeong Jun Moon
AI tech team
Lotte Data Communication Company
Seoul, South Korea
bj_moon@lotte.net
Da Been Yu
AI tech team
Lotte Data Communication Company
Seoul, South Korea
db.yu@lotte.net
Yeon Seung Yu
AI tech team
Lotte Data Communication Company
Seoul, South Korea
yys4000@lotte.net
Jung Hwan Kim
AI tech team
Lotte Data Communication Company
Seoul, South Korea
jhwan_kim@lotte.net
Abstract—Feature fusion modules for encoder features and self-
attention modules have been widely adopted in semantic segmentation.
However, these modules are computationally costly, which limits
their use in real-time environments. In addition,
segmentation performance is limited in autonomous driving
environments with a lot of contextual information perpendicular
to the road surface, such as people, buildings, and general objects.
In this paper, we propose an efficient feature fusion method,
Feature Fusion with Different Norms (FFDN), that utilizes the rich
global context of multi-level scales, together with a vertical pooling
module placed before self-attention that preserves most contextual
information while reducing the complexity of global context encoding in
the vertical direction. In this way, we handle the
properties of representations in global space while reducing the additional
computational cost. We also analyze the low performance in
challenging cases, including small and vertically featured objects.
We achieve a mean Intersection-over-Union (mIoU) of 73.1 and
191 frames per second (FPS), which are comparable to
state-of-the-art results on the Cityscapes test set.
Index Terms—semantic segmentation, deep learning
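For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of how mIoU is commonly defined: the per-class ratio of intersection to union between predicted and ground-truth labels, averaged over the classes that occur. The function name `mean_iou` and the toy labels are illustrative assumptions; this is not the official Cityscapes evaluation script, and it does not handle ignore labels.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in pred or target.

    pred, target: flat sequences of integer class labels of equal length.
    Illustrative only: no ignore label, no dataset-level accumulation.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel belongs to both the predicted and true region of class p.
            intersection[p] += 1
            union[p] += 1
        else:
            # Pixel enlarges the union of both the predicted and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Toy example with three classes (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(score)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5 (up to floating-point rounding).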
I. INTRODUCTION
Semantic segmentation is a per-pixel classification task that
assigns a class label to every pixel. It has been widely
researched in fields such as biomedicine and human-machine
interaction [1], [2].
In particular, segmentation used in autonomous driving,
for tasks such as depth estimation and free-space detection,
operates in real time and requires both fast inference and high
accuracy. To improve inference speed, aligned feature maps at
adjacent levels have been used to balance performance and
inference speed in the segmentation task [3], and a ladder-style
lightweight decoder has been designed for upsampling features of
low spatial resolution [4].
To achieve high accuracy, segmentation models require
global contextual information and the ability to exploit multi-
level semantics. Some studies include a self-attention module,
which helps the model concentrate on contextual features [5] and
thereby improves accuracy. Other studies propose a feature fusion
module, which combines multi-level features [6], [7]. However, these
modules, which contain convolution-based operations to fuse
multi-level features, incur high computational complexity
and memory usage.
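To give a rough sense of scale for the cost noted above, the sketch below counts the entries of the pairwise affinity matrix that a dense dot-product self-attention layer builds over a feature map, which grows quadratically with the number of spatial positions. The feature-map sizes and the pooling factor are hypothetical values chosen for illustration and are not taken from this paper; the function `attention_matrix_entries` is likewise an illustrative helper.

```python
def attention_matrix_entries(height: int, width: int) -> int:
    """Entries in the (H*W) x (H*W) affinity matrix of dense self-attention."""
    positions = height * width
    return positions * positions


# Hypothetical feature map: a 1024x2048 image downsampled by a factor of 32.
full = attention_matrix_entries(32, 64)

# Hypothetical: the height axis is pooled by a factor of 4 before attention.
pooled = attention_matrix_entries(8, 64)

print(full, pooled, full // pooled)  # prints "4194304 262144 16"
```

Because the cost is quadratic in the number of positions, shrinking one spatial axis by a factor of 4 reduces the affinity matrix by a factor of 16 in this example.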
To reduce the amount of computation without
sacrificing accuracy, we exploit normalization
techniques in the feature fusion of semantic segmentation. In U-GAT-
IT [1], spatial and semantic contents are considered adequately
by using adaptive normalizations [12], [13] to reflect image
content such as style and geometry information. Inspired by
these approaches, we propose an efficient Feature Fusion
with Different Norms (FFDN), in which layer normalization
and instance normalization are used to aggregate features
of different layers, as shown in Fig. 2. These normalization
methods allow the segmentation model to obtain the exact object
location from spatial information and the detailed parts of an object
from content information with low computational complexity.
FFDN receives multi-level features obtained from a simply
modified FPN (*FPN) as input and combines them to capture
global properties of representations. The *FPN is shown in
Fig. 1.
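The two normalizations named above differ only in the set of values over which the mean and variance are computed. The following minimal sketch illustrates that difference on a single feature map stored as C channels of flattened H*W values. It is not the authors' FFDN module: the function names are illustrative, the learnable scale and shift parameters are omitted, and no fusion step is shown.

```python
from statistics import fmean, pvariance


def _normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mu = fmean(values)
    var = pvariance(values, mu)
    return [(v - mu) / (var + eps) ** 0.5 for v in values]


def instance_norm(feature):
    """Normalize each channel separately over its spatial positions.

    feature: list of C channels, each a flat list of H*W floats.
    """
    return [_normalize(channel) for channel in feature]


def layer_norm(feature):
    """Normalize all channels and positions of one sample jointly."""
    flat = [v for channel in feature for v in channel]
    normed = _normalize(flat)
    size = len(feature[0])
    return [normed[i * size:(i + 1) * size] for i in range(len(feature))]


# Toy feature map with two channels of very different magnitude.
feature = [[1.0, 3.0], [10.0, 30.0]]
print(instance_norm(feature))
print(layer_norm(feature))
```

In this toy input, instance normalization maps both channels to approximately [-1, 1], removing the per-channel magnitude, whereas layer normalization keeps the relative scale between channels because the statistics are shared across them.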
One of the challenging problems in semantic segmentation
is handling objects that extend in a specific direction, such as
vertical or diagonal ones (e.g., people, poles). Since a general convolution or
pooling operation uses uniform kernels with the same height
arXiv:2210.01323v1 [cs.CV] 4 Oct 2022