the Attention-to-Mask (ATM) module. More specifically, we employ a transformer block that takes
the learnable class tokens as queries and the spatial feature maps as keys and values. A
dot-product operator computes the similarity maps between queries and keys. We encourage regions
belonging to the same category to produce larger similarity values for the corresponding category
(i.e., a specific class token). Fig. 1 visualizes the similarity maps between the features and the ‘Table’
and ‘Chair’ tokens. By simply applying a Sigmoid operation, we convert the similarity maps into
the masks. Meanwhile, following the design of a typical transformer block, a Softmax operation is
also applied to the similarity maps to obtain the cross-attention maps. The ‘Table’ and ‘Chair’ tokens are
then updated as in a regular transformer decoder, by a weighted sum of the values with the cross-attention
maps as the weights. Since the masks are a byproduct of the regular attention computation,
the operation involves negligible extra computation.
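To make the mechanism concrete, below is a minimal PyTorch sketch of the core ATM computation. The module name `ATM`, the single-head formulation, and the projection layout are our own illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ATM(nn.Module):
    """Minimal Attention-to-Mask sketch: one cross-attention step that
    yields both updated class tokens and per-class mask predictions."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.scale = embed_dim ** -0.5
        # One learnable token per category.
        self.class_tokens = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, feats: torch.Tensor):
        # feats: (B, N, C) flattened spatial feature map from the ViT.
        B = feats.shape[0]
        q = self.q_proj(self.class_tokens).unsqueeze(0).expand(B, -1, -1)  # (B, K, C)
        k = self.k_proj(feats)                                             # (B, N, C)
        v = self.v_proj(feats)                                             # (B, N, C)

        # Similarity between each class token and each spatial location.
        sim = torch.einsum('bkc,bnc->bkn', q, k) * self.scale  # (B, K, N)

        # Sigmoid converts the similarity maps directly into class masks
        # (reshape to (B, K, H, W) outside for the spatial prediction).
        masks = sim.sigmoid()

        # Softmax over the spatial axis gives the regular cross-attention
        # maps, used to update the class tokens as in a standard decoder.
        attn = sim.softmax(dim=-1)                             # (B, K, N)
        tokens = torch.einsum('bkn,bnc->bkc', attn, v)         # (B, K, C)
        return tokens, masks
```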
Building upon this efficient ATM module, we propose a new semantic segmentation paradigm with
the plain ViT structure, dubbed SegViT. In this paradigm, several ATM modules are applied to
different layers, and the final segmentation mask is obtained by summing their outputs.
SegViT outperforms its ViT-based counterparts at a lower computational cost. However,
compared with previous encoder-decoder structures that use hierarchical networks as encoders, ViT
backbones are generally heavier as encoders. To further reduce the computational cost, we employ a
Shrunk structure consisting of query-based down-sampling (QD) and query-based up-sampling (QU).
QD can be inserted into the ViT backbone to halve the feature resolution, while QU runs in parallel
with the backbone to recover it. The Shrunk structure, together with the ATM module as the
decoder, reduces computation by up to 40% while maintaining competitive performance.
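As a rough illustration of the cascade, the following sketch (reusing the `ATM` module above) attaches ATM heads to several ViT layers and sums their mask predictions. The tapped layer indices and class/module names are assumptions for illustration, not the paper's configuration:

```python
from typing import List
import torch.nn as nn

class SegViTSketch(nn.Module):
    """Illustrative cascade: ATM heads tap intermediate ViT layers and
    their per-class mask predictions are summed for the final output."""

    def __init__(self, vit_blocks: nn.ModuleList, embed_dim: int,
                 num_classes: int, tap_layers: List[int]):
        super().__init__()
        self.blocks = vit_blocks
        self.tap_layers = set(tap_layers)  # e.g. [5, 11] for ViT-Base (assumed)
        self.heads = nn.ModuleList(
            ATM(embed_dim, num_classes) for _ in tap_layers)

    def forward(self, x):
        # x: (B, N, C) patch embeddings.
        mask_sum, head_idx = 0, 0
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.tap_layers:
                _, masks = self.heads[head_idx](x)  # (B, K, N)
                mask_sum = mask_sum + masks
                head_idx += 1
        # Sum over layers; reshape to (B, K, H, W) for the final mask.
        return mask_sum
```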
We summarize our main contributions as follows:
• We propose an Attention-to-Mask (ATM) decoder module that is both effective and efficient for semantic segmentation. For the first time, we utilize the spatial information in attention maps to generate mask predictions for each category, which can serve as a new paradigm for semantic segmentation.
• We apply our ATM decoder module to plain, non-hierarchical ViT backbones in a cascade manner and design a structure, dubbed SegViT, that achieves 55.2% mIoU on the competitive ADE20K dataset, the best and lightest among methods that use ViT backbones. We also benchmark our method on the PASCAL-Context dataset (65.3% mIoU) and the COCO-Stuff-10K dataset (50.3% mIoU) and achieve new state-of-the-art performance.
• We further explore the architecture of ViT backbones and devise a Shrunk structure that reduces the overall computational cost while maintaining competitive performance. This alleviates the disadvantage of ViT backbones, which are usually more computationally intensive than their hierarchical counterparts. Our Shrunk version of SegViT reaches 55.1% mIoU on the ADE20K dataset at a computational cost of 373.5 GFLOPs, about 40% lower than the original SegViT (637.9 GFLOPs).
2 Related Work
Semantic segmentation.
Semantic segmentation, which requires pixel-level classification of an
input image, is a fundamental task in computer vision. Fully Convolutional Networks (FCNs) used
to be the dominant approach to this task. Early per-pixel approaches such as [9, 10] assign a
class label to each pixel based on per-pixel probabilities. To enlarge the receptive field, several
approaches [11, 12] propose dilated convolutions or apply spatial pyramid pooling to capture
contextual information at multiple scales. With the introduction of attention mechanisms, [13, 14, 6]
replace the feature aggregation performed by convolutions and pooling with attention to better capture
long-range dependencies.
Recent works [15, 8, 16] decouple the per-pixel classification process. They reformulate
segmentation with a fixed number of learnable tokens, which act as weights for the transformations
applied to the feature maps. Bipartite matching, rather than per-pixel cross-entropy, is used to allow
overlaps between masks, and the learnable tokens dynamically generate the classification probabilities. This
paradigm enables the classification process to be conducted globally and relieves the decoder of
per-pixel classification, which, as a result, is more precise and the performance is