the Attention-to-Mask (ATM) module. More specifically, we employ a transformer block that takes
the learnable class tokens as queries and the spatial feature maps as keys and values. A
dot-product operator computes the similarity maps between queries and keys. We encourage regions
belonging to the same category to produce larger similarity values for the corresponding category
(i.e., a specific class token). Fig. 1 visualizes the similarity maps between the features and the ‘Table’
and ‘Chair’ tokens. By simply applying a Sigmoid operation, we convert the similarity maps into
the masks. Meanwhile, following the design of a typical transformer block, a Softmax operation is
also applied to the similarity maps to obtain the cross-attention maps. The ‘Table’ and ‘Chair’ tokens are
then updated as in a regular transformer decoder, by a weighted sum of the values with the cross-attention
maps as the weights. Since the masks are a byproduct of the regular attention computation,
the operation involves negligible extra computation.
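To make the mechanism concrete, below is a minimal PyTorch sketch of the core ATM computation. The module name `ATM`, the single-head formulation, and the projection layout are our own illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ATM(nn.Module):
    """Minimal Attention-to-Mask sketch: one cross-attention step that
    yields both updated class tokens and per-class mask predictions."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.scale = embed_dim ** -0.5
        # One learnable token per category.
        self.class_tokens = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, feats: torch.Tensor):
        # feats: (B, N, C) flattened spatial feature map from the ViT.
        B = feats.shape[0]
        q = self.q_proj(self.class_tokens).unsqueeze(0).expand(B, -1, -1)  # (B, K, C)
        k = self.k_proj(feats)                                             # (B, N, C)
        v = self.v_proj(feats)                                             # (B, N, C)

        # Similarity between each class token and each spatial location.
        sim = torch.einsum('bkc,bnc->bkn', q, k) * self.scale  # (B, K, N)

        # Sigmoid converts the similarity maps directly into class masks
        # (reshape to (B, K, H, W) outside for the spatial prediction).
        masks = sim.sigmoid()

        # Softmax over the spatial axis gives the regular cross-attention
        # maps, used to update the class tokens as in a standard decoder.
        attn = sim.softmax(dim=-1)                             # (B, K, N)
        tokens = torch.einsum('bkn,bnc->bkc', attn, v)         # (B, K, C)
        return tokens, masks
```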
Building upon this efficient ATM module, we propose a new semantic segmentation paradigm with
the plain ViT structure, dubbed SegViT. In this paradigm, several ATM modules are applied to
different layers, and the final segmentation mask is obtained by summing their outputs.
SegViT outperforms its ViT-based counterparts at a lower computational cost. However,
compared with previous encoder-decoder structures that use hierarchical networks as encoders, ViT
backbones are generally heavier as encoders. To further reduce the computational cost, we employ a
Shrunk structure consisting of query-based down-sampling (QD) and query-based up-sampling (QU).
QD can be inserted into the ViT backbone to halve the feature resolution, while QU runs in parallel
with the backbone to recover it. The Shrunk structure, together with the ATM module as the
decoder, reduces computation by up to 40% while maintaining competitive performance.
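As a rough illustration of the cascade, the following sketch (reusing the `ATM` module above) attaches ATM heads to several ViT layers and sums their mask predictions. The tapped layer indices and class/module names are assumptions for illustration, not the paper's configuration:

```python
from typing import List
import torch.nn as nn

class SegViTSketch(nn.Module):
    """Illustrative cascade: ATM heads tap intermediate ViT layers and
    their per-class mask predictions are summed for the final output."""

    def __init__(self, vit_blocks: nn.ModuleList, embed_dim: int,
                 num_classes: int, tap_layers: List[int]):
        super().__init__()
        self.blocks = vit_blocks
        self.tap_layers = set(tap_layers)  # e.g. [5, 11] for ViT-Base (assumed)
        self.heads = nn.ModuleList(
            ATM(embed_dim, num_classes) for _ in tap_layers)

    def forward(self, x):
        # x: (B, N, C) patch embeddings.
        mask_sum, head_idx = 0, 0
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.tap_layers:
                _, masks = self.heads[head_idx](x)  # (B, K, N)
                mask_sum = mask_sum + masks
                head_idx += 1
        # Sum over layers; reshape to (B, K, H, W) for the final mask.
        return mask_sum
```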
We summarize our main contributions as follows:
• We propose an Attention-to-Mask (ATM) decoder module that is both effective and efficient for semantic segmentation. For the first time, we utilize the spatial information in attention maps to generate mask predictions for each category, which can serve as a new paradigm for semantic segmentation.
• We apply our ATM decoder module to plain, non-hierarchical ViT backbones in a cascade manner and design a structure, dubbed SegViT, that achieves 55.2% mIoU on the competitive ADE20K dataset, the best and lightest among methods that use ViT backbones. We also benchmark our method on the PASCAL-Context dataset (65.3% mIoU) and the COCO-Stuff-10K dataset (50.3% mIoU) and achieve new state-of-the-art performance.
• We further explore the architecture of ViT backbones and devise a Shrunk structure that reduces the overall computational cost while maintaining competitive performance. This alleviates the disadvantage of ViT backbones, which are usually more computationally intensive than their hierarchical counterparts. Our Shrunk version of SegViT reaches 55.1% mIoU on the ADE20K dataset at a computational cost of 373.5 GFLOPs, about 40% lower than the original SegViT (637.9 GFLOPs).
2 Related Work
Semantic segmentation.
Semantic segmentation, which requires pixel-level classification of an
input image, is a fundamental task in computer vision. Fully Convolutional Networks (FCNs) used
to be the dominant approach to this task. Early per-pixel approaches such as [9, 10] assign a
class label to each pixel based on per-pixel probabilities. To enlarge the receptive field, several
approaches [11, 12] propose dilated convolutions or apply spatial pyramid pooling to capture
contextual information at multiple scales. With the introduction of attention mechanisms, [13, 14, 6]
replace the feature aggregation performed by convolutions and pooling with attention to better capture
long-range dependencies.
Recent works [15, 8, 16] decouple the per-pixel classification process. They reformulate
segmentation with a fixed number of learnable tokens, which act as weights for the transformations
applied to the feature maps. Bipartite matching, rather than per-pixel cross-entropy, is used to allow
overlaps between masks, and the learnable tokens dynamically generate the classification probabilities. This
paradigm enables the classification process to be conducted globally and relieves the decoder of
per-pixel classification, which, as a result, is more precise and the performance is