
the image. Recently, deep learning has had an enormous impact on the field of computer vision, including the semantic segmentation task. Many deep learning models are based on fully convolutional networks (FCNs) [12], whose encoder gradually reduces the spatial resolution and captures more semantic context of an image through larger receptive fields. However, CNNs still have limited receptive fields, which leads to missing context features. To overcome this limitation, PSPNet [13] proposed the Pyramid Pooling Module to enhance global context, while DeepLab [14] utilized atrous convolution to expand receptive fields.
B. Vision Transformer
Transformer [15] is a deep neural network architecture originally proposed for the machine translation task in natural language processing. Nowadays, the Transformer has a huge influence on natural language processing and many other domains, including computer vision. Vision Transformer (ViT) [16] was the first model to successfully apply the Transformer to computer vision, dividing an image into patches, treating each patch as a token, and feeding the resulting sequence into a Transformer. Following this success, PVT [17], Swin [18], and SegFormer [19] were designed as hierarchical Transformers that generate feature maps at different scales to enhance local features, thus improving performance on dense prediction tasks such as detection and segmentation.
C. Polyp Segmentation
Recent advances in deep learning have helped solve med-
ical semantic segmentation tasks effectively, including polyp
segmentation. However, it remains a challenging problem due to the characteristics of medical images and the fact that polyps come in different sizes, shapes, textures, and colors. U-net [20] is a well-known medical image segmentation model that uses skip connections to preserve local features between the encoder and decoder. Inspired by U-net, most models for medical image segmentation [21], [22], [23], [5], [24], [25] use an architecture containing a CNN backbone as the encoder and a decoder that fuses features at different scales to generate the segmentation result. PraNet [26] and SFA [27] focus on the distinction between polyp boundaries and the background to improve segmentation performance. CaraNet [28] designs an attention module for small medical object segmentation. Since the successful application of Transformers in computer vision, recent methods [29], [30], [8] have used the Transformer as a backbone for the polyp segmentation task and yielded promising results. Other methods [31], [7], [32] use both Transformer and CNN backbones for feature extraction and combine features from the two branches, thus enhancing the segmentation result.
III. PROPOSED METHOD
In this section, we describe the proposed LAPFormer in
detail. An overview of our model is presented in Fig. 1.
A. Encoder
We choose MiT (Mix Transformer) proposed in [19] as our
segmentation encoder for two main reasons. First, it consists of
a novel hierarchically structured Transformer encoder which
produces multi-level multi-scale feature outputs. Second, MiT
uses a convolutional kernel instead of Positional Encoding
(PE), thereby avoiding the performance drop that occurs when the testing resolution differs from the training resolution. Convolutional layers are argued to provide positional information for Transformers more effectively than other methods. Furthermore, MiT uses small image patches of size 4×4, which have been shown to benefit dense prediction tasks such as semantic segmentation.
Assume that we have an input image $X$ with spatial dimensions $H \times W \times 3$ (representing the height, width, and channels). MiT generates four feature levels $f_i$ with resolution $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times C_i$, where $i \in \{1, 2, 3, 4\}$ and $C_{i+1}$ is larger than $C_i$. MiT has six variants that share the same architecture but differ in size, from MiT-B0 to MiT-B5.
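For concreteness, the shape of this feature pyramid can be traced with a short sketch. The input resolution and per-scale channel widths below are illustrative assumptions (roughly matching a small MiT variant), not a definitive configuration:

```python
# Sketch of the multi-scale feature shapes an MiT-style hierarchical encoder
# produces for an H x W x 3 input; values below are illustrative assumptions.
import torch

H, W = 352, 352                    # assumed training resolution
channels = [32, 64, 160, 256]      # assumed channel widths C_1..C_4

x = torch.randn(1, 3, H, W)        # input image X

# Each level f_i has resolution (H / 2^{i+1}) x (W / 2^{i+1}) with C_i channels,
# so spatial size shrinks while channel width grows as i increases.
for i, c in enumerate(channels, start=1):
    h_i, w_i = H // 2 ** (i + 1), W // 2 ** (i + 1)
    print(f"f{i}: {c} x {h_i} x {w_i}")
# f1: 32 x 88 x 88, f2: 64 x 44 x 44, f3: 160 x 22 x 22, f4: 256 x 11 x 11
```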
B. Progressive Feature Fusion
Dense prediction is a well-known class of tasks in computer vision that includes object detection and semantic segmentation. Objects of vastly different scales can appear in the same image. Therefore, multi-scale features are heavily required
to generate good results. The most popular way to utilize
multi-scale features is to construct a feature pyramid network
(FPN) [33]. However, as pointed out in [34] and [35], there are large semantic gaps between features from non-adjacent scales: features two scale levels apart are weakly correlated, while features from adjacent scales are highly correlated.
In FPN, features from the upper scale are upsampled and then directly added to features from the lower scale. We argue that this is sub-optimal: when fusing features, they should be further processed before being passed to the next level. We propose the Progressive Feature Fusion module, which progressively fuses features from upper scales down to lower scales, thereby reducing the information gap between the low-resolution, semantically rich feature maps and the high-resolution, semantically poor ones.
Instead of fusing feature maps with an addition operation, we use concatenation. Feature maps from all scales are upsampled to (H/4, W/4). Then feature maps from the upper scale are progressively fused with those from the lower scale as follows:
$$F(x_i, x_{i-1}) = \mathrm{Linear}([x_i, x_{i-1}])$$
where $x_i$ denotes the features from scale $i$, $x_{i-1}$ denotes the features from one scale lower than $i$, $[\cdot]$ is the concatenation operation, and Linear is a fully-connected layer.
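The sketch below shows one way this fusion scheme could be implemented in PyTorch. The common embedding width, the per-scale input projections, and the use of 1×1 convolutions as the Linear layers are our assumptions for illustration and may differ from the actual LAPFormer implementation:

```python
# A minimal sketch of Progressive Feature Fusion under the stated assumptions:
# every scale is upsampled to (H/4, W/4), then fused top-down with
# F(x_i, x_{i-1}) = Linear([x_i, x_{i-1}]), where Linear acts on channels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveFeatureFusion(nn.Module):
    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=128):
        super().__init__()
        # Project each scale to a common channel width (assumption).
        self.proj = nn.ModuleList(nn.Conv2d(c, embed_dim, 1) for c in in_channels)
        # One fusion layer per adjacent pair; a 1x1 conv is a per-pixel
        # fully-connected layer over the concatenated channels.
        self.fuse = nn.ModuleList(nn.Conv2d(2 * embed_dim, embed_dim, 1) for _ in range(3))

    def forward(self, feats):           # feats = [f1, f2, f3, f4], f1 has the highest resolution
        size = feats[0].shape[-2:]      # (H/4, W/4)
        x = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
             for p, f in zip(self.proj, feats)]
        fused = x[3]                    # start from the most semantic, lowest-resolution scale
        outs = [fused]
        for i in range(2, -1, -1):      # progressively fuse toward the highest-resolution scale
            fused = self.fuse[i](torch.cat([fused, x[i]], dim=1))
            outs.append(fused)
        return outs                     # all fused maps, each at (H/4, W/4)
```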
C. Aggregation for Prediction
In a conventional encoder-decoder architecture, after performing multi-scale feature fusion over the feature pyramid, other models [20], [36], [23], [29] often make predictions only on the last output feature map, which has the highest resolution (Fig. 2a). Others adopt auxiliary predictions during training [37], [28], but at test time they still predict only from the highest-resolution feature map (Fig. 2b).
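As a point of reference, the conventional schemes of Fig. 2a and Fig. 2b can be sketched as follows. This illustrates prior decoders, not our aggregation; the head structure and channel widths are assumptions:

```python
# Sketch of conventional prediction: a single head on the highest-resolution
# fused map (Fig. 2a), optionally with auxiliary heads used only during training (Fig. 2b).
import torch.nn as nn
import torch.nn.functional as F


class ConventionalPredictionHead(nn.Module):
    def __init__(self, embed_dim=128, num_classes=1, use_aux=True):
        super().__init__()
        self.main_head = nn.Conv2d(embed_dim, num_classes, 1)
        self.aux_heads = (nn.ModuleList(nn.Conv2d(embed_dim, num_classes, 1) for _ in range(3))
                          if use_aux else None)

    def forward(self, fused_maps, out_size):
        # fused_maps[-1] is the final, highest-resolution decoder feature map.
        main = F.interpolate(self.main_head(fused_maps[-1]), size=out_size,
                             mode="bilinear", align_corners=False)
        if self.training and self.aux_heads is not None:
            # Auxiliary predictions supervise intermediate maps during training only.
            aux = [F.interpolate(h(m), size=out_size, mode="bilinear", align_corners=False)
                   for h, m in zip(self.aux_heads, fused_maps[:-1])]
            return main, aux
        return main                     # at test time, only the highest-resolution map is used
```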
We argue that predicting only on the highest-resolution feature map is sub-optimal. This forces the highest-resolution feature map to carry a lot of information, but information