SegViT: Semantic Segmentation with Plain Vision Transformers

Bowen Zhang1, Zhi Tian2, Quan Tang4, Xiangxiang Chu2, Xiaolin Wei2, Chunhua Shen3, Yifan Liu1
1The University of Adelaide, Australia  2Meituan Inc.  3Zhejiang University, China  4South China University of Technology, China
The first two authors contributed equally. CS is the corresponding author, e-mail: chunhua@me.com
Abstract
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose SegViT. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. In contrast, we make use of a fundamental component of the ViT, the attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to segmentation masks. Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on the COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to 40% of the computation while maintaining competitive performance. The code is available at https://github.com/zbwxp/SegVit.
1 Introduction
Semantic segmentation is a dense prediction task in computer vision that requires pixel-level classification of an input image. Fully Convolutional Networks (FCNs) [1] are widely used in recent state-of-the-art methods. This paradigm includes a deep convolutional neural network as the encoder/backbone and a segmentation-oriented decoder to provide dense predictions. A 1×1 convolutional layer is usually applied to a representative feature map to obtain the pixel-level predictions. To achieve higher performance, previous works [2-4] focus on enriching the context information or fusing multi-scale information. However, the correlations among spatial locations are hard to model explicitly in FCNs due to the limited receptive field.
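As a rough illustration of this per-pixel classification paradigm, the decoder ends in a 1×1 convolution that maps a C-channel feature map to per-class logits, which are upsampled and argmax-ed per pixel. The following is a minimal PyTorch sketch under assumed sizes, not the code of any particular method:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: batch 2, 256-channel feature map at 1/4 resolution of a 512x512 image.
features = torch.randn(2, 256, 128, 128)
num_classes = 150  # e.g. ADE20K

classifier = nn.Conv2d(256, num_classes, kernel_size=1)  # the 1x1 per-pixel classifier
logits = classifier(features)                            # (2, 150, 128, 128)
logits = F.interpolate(logits, size=(512, 512), mode='bilinear', align_corners=False)
prediction = logits.argmax(dim=1)                        # (2, 512, 512): one class index per pixel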
Recently, Vision Transformers (ViTs) [5], which make use of the spatial attention mechanism, have been introduced to the field of computer vision. Unlike typical convolution-based backbones, the ViT has a plain and non-hierarchical architecture that keeps the resolution of the feature maps all the way through. The lack of a down-sampling process (excluding tokenizing the image) changes how semantic segmentation architectures are built on a ViT backbone. Various semantic segmentation methods [6-8] based on ViT backbones have achieved promising performance due to the powerful representations learned from the pre-trained backbones. However, the potential of the attention mechanism is not fully explored.
Different from the previous per-pixel classification paradigm [6-8], we consider learning a meaningful class token and then finding the local patches with higher similarity to it. To achieve this goal, we propose
the Attention-to-Mask (ATM) module. More specifically, we employ a transformer block that takes the learnable class tokens as queries and transfers the spatial feature maps into keys and values. A dot-product operator calculates the similarity maps between queries and keys. We encourage regions belonging to the same category to generate larger similarity values for the corresponding category (i.e., a specific class token). Fig. 1 visualizes the similarity maps between the features and the 'Table' and 'Chair' tokens. By simply applying a Sigmoid operation, we can transfer the similarity maps to the masks. Meanwhile, following the design of a typical transformer block, a Softmax operation is also applied to the similarity maps to get the cross-attention maps. The 'Table' and 'Chair' tokens are then updated as in any regular transformer decoder, by a weighted sum of the values with the cross-attention maps as the weights. Since the mask is a byproduct of the regular attentive calculations, negligible extra computation is involved.
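The following minimal PyTorch sketch illustrates this dual use of the similarity map. It is our reading of the ATM idea, not the authors' released implementation: the shapes, the single attention head (the paper sums similarity maps over multiple heads), and the layer names are simplifying assumptions.

import torch
import torch.nn as nn

B, L, C, K = 2, 1024, 256, 150                   # batch, patch tokens, channels, classes (assumed)

class_tokens = nn.Parameter(torch.randn(K, C))   # learnable class tokens, used as queries
features = torch.randn(B, L, C)                  # spatial tokens from a ViT layer: keys and values

to_q, to_k, to_v = nn.Linear(C, C), nn.Linear(C, C), nn.Linear(C, C)
q = to_q(class_tokens).unsqueeze(0).expand(B, -1, -1)   # (B, K, C)
k, v = to_k(features), to_v(features)                   # (B, L, C)

similarity = q @ k.transpose(-2, -1) / C ** 0.5         # (B, K, L): dot-product similarity

# Softmax over the spatial dimension gives the usual cross-attention used to update the class tokens.
attn = similarity.softmax(dim=-1)
updated_class_tokens = attn @ v                          # (B, K, C)

# Sigmoid on the very same similarity map yields one mask per class token, almost for free.
masks = similarity.sigmoid().reshape(B, K, 32, 32)       # (B, K, H/P, W/P); here 32*32 = L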
Building upon this efficient ATM module, we propose a new semantic segmentation paradigm with the plain ViT structure, dubbed SegViT. In this paradigm, several ATM modules are employed on different layers, and the final segmentation mask is obtained by adding the outputs from the different layers together. SegViT outperforms its ViT-based counterparts with less computational cost. However, compared with previous encoder-decoder structures that use hierarchical networks as encoders, ViT backbones used as encoders are generally heavier. To further reduce the computational cost, we employ a Shrunk structure consisting of query-based down-sampling (QD) and query-based up-sampling (QU). QD can be inserted into the ViT backbone to reduce the resolution by half, and QU is used in parallel with the backbone to recover the resolution. The Shrunk structure, together with the ATM module as the decoder, can reduce the computation by up to 40% while maintaining competitive performance.
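A rough, hypothetical sketch of what query-based resampling can look like is given below: cross-attention in which the number of queries sets the number of output tokens, so fewer queries down-sample and more queries up-sample. The choice of queries, sizes, and the shared attention module are assumptions of ours; the paper's actual QD/QU design may differ in detail.

import torch
import torch.nn as nn

C = 256
attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)

def query_based_resample(queries, tokens):
    # Cross-attention: the output has as many tokens as there are queries,
    # so fewer queries down-sample (QD) and more queries up-sample (QU).
    out, _ = attn(query=queries, key=tokens, value=tokens)
    return out

tokens = torch.randn(2, 1024, C)      # 32x32 tokens from the backbone

# QD: take every 4th token as queries, a crude stand-in for halving each spatial dimension.
qd_queries = tokens[:, ::4, :]        # (2, 256, C) -> 16x16 resolution
low_res = query_based_resample(qd_queries, tokens)

# QU: high-resolution queries (learnable parameters in practice) attend back to the low-res stream.
qu_queries = torch.randn(2, 1024, C)
high_res = query_based_resample(qu_queries, low_res)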
We summarize our main contributions as follows:
• We propose an Attention-to-Mask (ATM) decoder module that is effective and efficient for semantic segmentation. For the first time, we utilize the spatial information in attention maps to generate mask predictions for each category, which can work as a new paradigm for semantic segmentation.
• We apply our ATM decoder module to plain, non-hierarchical ViT backbones in a cascade manner and design a structure, namely SegViT, that achieves 55.2% mIoU on the competitive ADE20K dataset, which is the best and lightest among methods that use ViT backbones. We also benchmark our method on the PASCAL-Context dataset (65.3% mIoU) and the COCO-Stuff-10K dataset (50.3% mIoU) and achieve new state-of-the-art performance.
• We further explore the architecture of ViT backbones and work out a Shrunk structure to apply to the backbone, reducing the overall computational cost while still maintaining competitive performance. This alleviates the disadvantage of ViT backbones, which are usually more computationally intensive than their hierarchical counterparts. Our Shrunk version of SegViT reaches 55.1% mIoU on the ADE20K dataset with a computational cost of 373.5 GFLOPs, about 40% less than the original SegViT (637.9 GFLOPs).
2 Related Work
Semantic segmentation. Semantic segmentation, which requires pixel-level classification of an input image, is a fundamental task in computer vision. Fully Convolutional Networks (FCNs) used to be the dominant approach to this task. Initial per-pixel approaches such as [9, 10] attribute a class label to each pixel based on the per-pixel probability. To enlarge the receptive field, several approaches [11, 12] propose dilated convolutions or apply spatial pyramid pooling to capture contextual information at multiple scales. With the introduction of attention mechanisms, [13, 14, 6] replace the feature merging conducted by convolutions and pooling with attention to better capture long-range dependencies.
Recent works [15, 8, 16] decouple the per-pixel classification process. They reconstruct the structure by using a fixed number of learnable tokens and use them as weights for the transformation applied to the feature maps. Binary matching, rather than cross-entropy, is used to allow overlaps, and the learnable tokens are used to dynamically generate the classification probabilities. This paradigm enables the classification process to be conducted globally and alleviates the burden on the decoder to perform per-pixel classification; as a result, it is more precise and the performance is generally better. However, for those methods, the feature map is still computed in a static manner, usually requiring feature-merging modules such as FPN [4].
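A schematic of this decoupled, token-based style of prediction is sketched below with assumed shapes; it is a generic sketch of the paradigm described above, not the code of any specific method.

import torch

B, N, C, K = 2, 100, 256, 150       # batch, learnable tokens, channels, classes (assumed)
H, W = 128, 128

tokens = torch.randn(B, N, C)       # a fixed number of learnable tokens after the decoder
features = torch.randn(B, C, H, W)  # the (statically computed) feature map

# Each token acts as a dynamic weight applied to the feature map -> one (possibly overlapping) mask per token.
masks = torch.einsum('bnc,bchw->bnhw', tokens, features).sigmoid()

# Each token also predicts a class distribution; predictions are matched to ground truth
# per mask rather than per pixel, and the two parts combine into a semantic map.
class_logits = torch.randn(B, N, K)  # stand-in for a linear classifier on the tokens
semantic = torch.einsum('bnk,bnhw->bkhw', class_logits.softmax(-1), masks)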
Figure 1: The overall concept of our Attention-to-Mask decoder. (Panels, left to right: Similarity, Attention, Mask, GT/Image; rows: the 'Table' and 'Chair' tokens.) In a typical attentive process, the dot-product is first calculated between queries and keys to measure the similarity (as illustrated on the left). If the similarity map is applied with a Softmax operation on the spatial dimension, the output is the typical attention map (multiple heads are summed together). However, if the same similarity map is applied with a per-pixel Sigmoid operation, it produces a mask that indicates the area with a certain similarity. Based on the assumption that the tokens within the same category have higher similarity, we can train a token vector to have high similarity with tokens of the specific category and low similarity elsewhere. In the meantime, this process does not violate the attention mechanism. Thus, it can proceed alongside the original transformer layers.
Transformers for vision. Attention-based transformer backbones have become powerful alternatives to standard convolution-based networks for image classification tasks. The original ViT [5] is a plain, non-hierarchical architecture. Various hierarchical transformers such as [17-21] have been presented afterwards. These methods inherit some designs from convolution-based networks, such as hierarchical structures, pooling, and down-sampling with convolutions. As a result, they can be used as straightforward replacements for convolution-based networks and combined with previous decoder heads for tasks such as semantic segmentation.
Plain-backbone decoders. High-resolution feature maps generated by backbones are important for dense prediction tasks such as semantic segmentation. Typical hierarchical transformers use feature-merging techniques such as FPN [4] or dilated backbones to generate high-resolution feature maps. However, for plain, non-hierarchical transformer backbones, the resolution remains the same for all layers. SETR [6] proposed a simple strategy that treats the transformer outputs from a sequence-to-sequence perspective to solve segmentation tasks. Segmenter [8] joins randomly initialized class embeddings and the transformer patch embeddings together and applies several self-attention layers to the joint token sequence to obtain updated class embeddings and patch embeddings for semantic prediction. In our study, we consider learning a class token and then finding local patches with higher similarities with the help of the attention map, making the inference process more direct and efficient.
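For contrast, a rough sketch of the joint-sequence idea is given below. It is our simplified reading of that family of decoders, not Segmenter's actual implementation; the sizes and the use of two generic encoder layers are assumptions.

import torch
import torch.nn as nn

B, L, C, K = 2, 1024, 256, 150                 # assumed sizes

patch_tokens = torch.randn(B, L, C)            # patch embeddings from the ViT backbone
class_embed = nn.Parameter(torch.randn(1, K, C)).expand(B, -1, -1)  # randomly initialized class embeddings

layer = nn.TransformerEncoderLayer(d_model=C, nhead=8, batch_first=True)

joint = torch.cat([patch_tokens, class_embed], dim=1)   # (B, L + K, C): joint token sequence
joint = layer(layer(joint))                             # a couple of self-attention layers over the joint sequence

patches, classes = joint[:, :L], joint[:, L:]
# Per-pixel class scores follow from the similarity between updated patch and class embeddings.
scores = patches @ classes.transpose(-2, -1)            # (B, L, K)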
3 Method
3.1 Encoder
Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, a plain vision transformer backbone reshapes it into a sequence of tokens $F_0 \in \mathbb{R}^{L \times C}$, where $L = HW/P^2$, $P$ is the patch size and $C$ is the number of channels. Learnable position embeddings of the same size as $F_0$ are added to capture the positional information. The token sequence $F_0$ is then processed by $m$ transformer layers to obtain the output. We define the output tokens of each layer as $[F_1, F_2, \dots, F_m]$, each in $\mathbb{R}^{L \times C}$. Typically, a transformer layer consists of a multi-head self-attention block followed by a point-wise multilayer perceptron block, with layer norm in between and a residual connection added afterwards. The transformer layers are stacked repetitively several times. For a plain vision transformer such as ViT, no other modules are involved, and the number of tokens does not change from layer to layer.
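A minimal sketch of this tokenization and layer stacking is shown below, with assumed ViT-Base-like sizes; standard ViT implementations differ in details such as normalization placement.

import torch
import torch.nn as nn

H = W = 512
P, C, m = 16, 768, 12                      # patch size, channels, number of layers (assumed)
L = (H // P) * (W // P)                    # L = HW / P^2 = 1024 tokens

patch_embed = nn.Conv2d(3, C, kernel_size=P, stride=P)   # tokenize the image into PxP patches
pos_embed = nn.Parameter(torch.zeros(1, L, C))            # learnable position embeddings
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=C, nhead=12, batch_first=True) for _ in range(m)
)

image = torch.randn(2, 3, H, W)
F0 = patch_embed(image).flatten(2).transpose(1, 2) + pos_embed   # (2, L, C)

outputs = []                               # [F1, ..., Fm], one (L, C) token map per layer
x = F0
for layer in layers:
    x = layer(x)                           # the token count never changes in a plain ViT
    outputs.append(x)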