stems from the locality of convolution operations, which leaves the model with an insufficient receptive field in the absence of a hierarchy. To enlarge the receptive field, ConvNets have to progressively downsample feature maps to capture more global contextual information. Therefore, they often require a feature pyramid network such as FPN [27] to aggregate multi-scale representations for high-quality segmentation.
However, this reasoning no longer applies to a plain ViT, in which global information can be captured from the first self-attention block onward. Because all feature maps in the ViT are of the same resolution, the motivation for an FPN-like feature pyramid also no longer holds. The above reasoning is supported by the recent finding that a plain ViT can serve as a strong backbone for object detection [25]. This finding indicates that a general-purpose ViT backbone might be suitable for other tasks as well, which would decouple pretraining from finetuning and transfer the benefits of readily available pretrained ViT models (e.g., MAE [21]) to these tasks.
However, although this design is simple and has proven effective, it has not yet been explored in interactive segmentation. In this work, we propose SimpleClick, the first plain-backbone method for interactive segmentation. The core of SimpleClick is a plain ViT backbone that maintains single-scale representations throughout. We use only the last feature map from the plain backbone to build a simple feature pyramid for segmentation, largely decoupling the general-purpose backbone from the segmentation-specific modules. To make SimpleClick more efficient, we use a light-weight MLP decoder to transform the simple feature pyramid into a segmentation mask (see Sec. 3 for details).
We extensively evaluate our method on 10 public benchmarks, including both natural and medical images. With the plain backbone pretrained with MAE [21], our method achieves 4.15 NoC@90 on SBD (i.e., on average 4.15 clicks are needed to reach 90% IoU), outperforming the previous best method by 21.8% without a complex FPN-like design or local refinement. We demonstrate the generalizability of our method by out-of-domain evaluation on medical images. We further analyze the computational efficiency of SimpleClick, highlighting its suitability as a practical annotation tool.
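For clarity, NoC@T denotes the average number of simulated clicks required to reach T% IoU, capped at a fixed budget. The following is a minimal sketch of this standard evaluation loop, assuming a hypothetical predictor `model` and click simulator `next_click` (a common simulator is sketched in Sec. 2); the 20-click budget follows common practice.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def noc(model, next_click, image, gt_mask, iou_thresh=0.90, max_clicks=20):
    """Number of Clicks (NoC) for one instance: simulated clicks needed
    to reach `iou_thresh` IoU. `model(image, clicks)` must return a binary
    mask; `next_click(pred, gt)` must return the next simulated click."""
    clicks, pred = [], np.zeros_like(gt_mask, dtype=bool)
    for n in range(1, max_clicks + 1):
        clicks.append(next_click(pred, gt_mask))
        pred = model(image, clicks)
        if iou(pred, gt_mask) >= iou_thresh:
            return n
    return max_clicks  # convention: report the budget if never reached

# NoC@90 for a benchmark is the mean of noc(...) over all of its instances.
```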
Our main contributions are:
• We propose SimpleClick, the first plain-backbone method for interactive image segmentation.
• SimpleClick achieves state-of-the-art performance on natural images and shows strong generalizability on medical images.
• SimpleClick meets the computational efficiency requirement for a practical annotation tool, highlighting its readiness for real-world applications.
2. Related Work
Interactive Image Segmentation Interactive image segmentation is a longstanding problem for which increasingly capable approaches have been proposed. Early works [6,16,18,39] tackle this problem using graphs defined over image pixels. However, these methods focus only on low-level image features and therefore tend to have difficulty with complex objects.
Thriving on large datasets, ConvNets [10,29,42,46,47] have evolved into the dominant architecture for high-quality interactive segmentation. ConvNet-based methods have explored various interaction types, such as bounding boxes [46], polygons [1], clicks [42], and scribbles [44]. Click-based approaches are the most common due to their simplicity and well-established training and evaluation protocols. Xu et al. [47] first proposed a click simulation strategy that has been adopted by follow-up work [10,33,42] (see the sketch after this paragraph).
DEXTR [35] extracts a target object from four user-specified extreme points (the left-most, right-most, top, and bottom pixels). FCA-Net [30] demonstrates the critical role of the first click for better segmentation. Recently, ViTs have been applied to interactive segmentation. FocalClick [10] uses SegFormer [45] as the backbone network and achieves state-of-the-art segmentation results with high computational efficiency. iSegFormer [32] uses a Swin Transformer [34] as the backbone network for interactive segmentation on medical images. Besides these contributions on backbones, some works explore elaborate refinement modules built upon the backbone. FocalClick [10] and FocusCut [29] propose similar local refinement modules for high-quality segmentation. PseudoClick [33] proposes a click-imitation mechanism that estimates the next click to further reduce human annotation cost. Our method differs from all previous click-based methods in its plain, non-hierarchical ViT backbone, enjoying the benefits of readily available pretrained ViT models (e.g., MAE [21]).
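To make the simulation protocol referenced above concrete, the sketch below shows a widely used variant: the next click is placed at the interior point of the largest misclassified region, located with a distance transform. This is an illustrative rendering of the common protocol, not the exact procedure of any single method.

```python
import numpy as np
from scipy import ndimage

def next_click(pred: np.ndarray, gt: np.ndarray):
    """Simulate the next user click from the current prediction error.

    Finds the largest connected error region (false negatives or false
    positives) and clicks the point in it farthest from the region's
    border. Returns ((row, col), is_positive), or None if there is no error.
    """
    error = np.logical_xor(pred, gt)
    labels, num = ndimage.label(error)
    if num == 0:
        return None
    sizes = ndimage.sum(error, labels, index=range(1, num + 1))
    region = labels == (np.argmax(sizes) + 1)
    dist = ndimage.distance_transform_edt(region)  # distance to region border
    r, c = np.unravel_index(np.argmax(dist), dist.shape)
    return (int(r), int(c)), bool(gt[r, c])  # positive click if on the object
```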
Vision Transformers for Non-Interactive Segmentation Recently, ViT-based approaches [17,24,43,45,49] have shown competitive performance on segmentation tasks compared to ConvNets. The original ViT [13] is a non-hierarchical architecture that maintains only single-scale feature maps throughout. SETR [51] and Segmenter [43] use the original ViT as the encoder for semantic segmentation. To allow for more efficient segmentation, the Swin Transformer [34] reintroduces a computational hierarchy into the original ViT architecture using shifted-window attention, leading to a highly efficient hierarchical ViT backbone. SegFormer [45] designs hierarchical feature representations based on the original ViT using overlapped patch merging, combined with a light-weight MLP decoder for efficient segmentation. HRViT [17] integrates a high-resolution multi-branch architecture with ViTs to learn multi-scale representations. Recently, the original ViT has been reintroduced as a competitive backbone for semantic segmentation [8] and object detection [25], with the aid of MAE [21] pretraining and window attention. Inspired by