SimpleClick: Interactive Image Segmentation with Simple Vision Transformers
Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer
University of North Carolina at Chapel Hill
https://github.com/uncbiag/SimpleClick
Abstract
Click-based interactive image segmentation aims at extracting objects with a limited number of user clicks. A hierarchical backbone is the de-facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to serve as a foundation model that can be finetuned for downstream tasks without redesigning a hierarchical backbone for pretraining. Although this design is simple and has been proven effective, it has not yet been explored for interactive image segmentation. To fill this gap, we propose SimpleClick, the first interactive segmentation method that leverages a plain backbone. Based on this backbone, we introduce a symmetric patch embedding layer that encodes clicks into the backbone with minor modifications to the backbone itself. With the plain backbone pretrained as a masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance. Remarkably, our method achieves 4.15 NoC@90 on SBD, a 21.8% improvement over the previous best result. Extensive evaluation on medical images demonstrates the generalizability of our method. We further develop an extremely tiny ViT backbone for SimpleClick and provide a detailed computational analysis, highlighting its suitability as a practical annotation tool.
1. Introduction
The goal of interactive image segmentation is to obtain high-quality pixel-level annotations with limited user interaction such as clicking. Interactive image segmentation approaches have been widely applied to annotate large-scale image datasets, which drive the success of deep models in various applications, including video understanding [5,48], self-driving [7], and medical imaging [31,40]. Much research has been devoted to exploring interactive image segmentation with different interaction types, such as bounding boxes [46], polygons [1], clicks [42], scribbles [44], and their combinations [50]. Among them, the click-based approach is the most common due to its simplicity and well-established training and evaluation protocols.
[Figure 1: bubble chart of NoC@90 on the SBD dataset for Swin Transformer, SegFormer, ResNet, HRNet, and Plain ViT backbones.]
Figure 1. Interactive segmentation results on SBD [20]. The metric “NoC@90” denotes the number of clicks required to obtain 90% IoU. The area of each bubble is proportional to the FLOPs of a model variant (Tab. 5). We show that plain ViTs outperform all hierarchical backbones for interactive image segmentation at a moderate computational cost.
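To make the evaluation protocol concrete, the following is a minimal sketch of how the NoC@k metric reported in Figure 1 can be computed from a per-click IoU trace; the 20-click budget is an assumption based on common practice, not a value stated in this excerpt.

def noc_at_threshold(ious, threshold=0.90, max_clicks=20):
    """Number of clicks needed to reach an IoU threshold.

    `ious[i]` is the IoU of the predicted mask after click i+1.
    If the threshold is never reached within `max_clicks`, the
    click budget itself is returned (a common convention).
    """
    for click, iou in enumerate(ious[:max_clicks], start=1):
        if iou >= threshold:
            return click
    return max_clicks

# Example: the 90% IoU threshold is first reached after the 4th click.
print(noc_at_threshold([0.55, 0.72, 0.86, 0.91, 0.93]))  # -> 4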
Recent advances in click-based approaches mainly lie in two orthogonal directions: 1) the development of more effective backbone networks and 2) the exploration of more elaborate refinement modules built upon the backbone. For the former direction, different hierarchical backbones, including both ConvNets [29,42] and ViTs [10,32], have been developed for interactive segmentation. For the latter direction, various refinement modules, including local refinement [10,29] and click imitation [33], have been proposed to further boost segmentation performance. In this work, we delve into the former direction and focus on exploring a plain backbone for interactive segmentation.
A hierarchical backbone is the predominant architecture for current interactive segmentation methods. This design is deeply rooted in ConvNets, represented by ResNet [22], and has been adopted by ViTs, represented by the Swin Transformer [34]. The motivation for a hierarchical backbone stems from the locality of convolution operations, which leads to an insufficient receptive field without the hierarchy. To increase the receptive field, ConvNets have to progressively downsample feature maps to capture more global contextual information. Therefore, they often require a feature pyramid network such as FPN [27] to aggregate multi-scale representations for high-quality segmentation. However, this reasoning no longer applies to a plain ViT, in which global information can be captured from the first self-attention block. Because all feature maps in the ViT are of the same resolution, the motivation for an FPN-like feature pyramid also falls away. The above reasoning is supported by a recent finding that a plain ViT can serve as a strong backbone for object detection [25]. This finding indicates that a general-purpose ViT backbone might be suitable for other tasks as well, which would decouple pretraining from finetuning and transfer the benefits of readily available pretrained ViT models (e.g. MAE [21]) to these tasks. However, although this design is simple and has been proven effective, it has not yet been explored for interactive segmentation.
In this work, we propose SimpleClick, the first plain-backbone method for interactive segmentation. The core of SimpleClick is a plain ViT backbone that maintains single-scale representations throughout. We use only the last feature map from the plain backbone to build a simple feature pyramid for segmentation, largely decoupling the general-purpose backbone from the segmentation-specific modules. To make SimpleClick more efficient, we use a lightweight MLP decoder to transform the simple feature pyramid into a segmentation (see Sec. 3 for details).
We extensively evaluate our method on 10 public benchmarks, including both natural and medical images. With the plain backbone pretrained as a MAE [21], our method achieves 4.15 NoC@90 on SBD, which outperforms the previous best method by 21.8% without a complex FPN-like design and local refinement. We demonstrate the generalizability of our method by out-of-domain evaluation on medical images. We further analyze the computational efficiency of SimpleClick, highlighting its suitability as a practical annotation tool.
Our main contributions are:
• We propose SimpleClick, the first plain-backbone method for interactive image segmentation.
• SimpleClick achieves state-of-the-art performance on natural images and shows strong generalizability on medical images.
• SimpleClick meets the computational efficiency requirement for a practical annotation tool, highlighting its readiness for real-world applications.
2. Related Work
Interactive Image Segmentation
Interactive image segmentation is a longstanding problem for which increasingly better solution approaches have been proposed. Early works [6,16,18,39] tackle this problem using graphs defined over image pixels. However, these methods focus only on low-level image features, and therefore tend to have difficulty with complex objects.
Thriving on large datasets, ConvNets [10,29,42,46,47] have evolved into the dominant architecture for high-quality interactive segmentation. ConvNet-based methods have explored various interaction types, such as bounding boxes [46], polygons [1], clicks [42], and scribbles [44]. Click-based approaches are the most common due to their simplicity and well-established training and evaluation protocols. Xu et al. [47] first proposed a click simulation strategy that has been adopted by follow-up work [10,33,42]. DEXTR [35] extracts a target object by specifying its four extreme points (left-most, right-most, top, and bottom pixels). FCA-Net [30] demonstrates the critical role of the first click for better segmentation. Recently, ViTs have been applied to interactive segmentation. FocalClick [10] uses SegFormer [45] as the backbone network and achieves state-of-the-art segmentation results with high computational efficiency. iSegFormer [32] uses a Swin Transformer [34] as the backbone network for interactive segmentation on medical images. Besides contributions on backbones, some works explore elaborate refinement modules built upon the backbone. FocalClick [10] and FocusCut [29] propose similar local refinement modules for high-quality segmentation. PseudoClick [33] proposes a click-imitation mechanism that estimates the next click to further reduce human annotation cost. Our method differs from all previous click-based methods in its plain, non-hierarchical ViT backbone, enjoying the benefits of readily available pretrained ViT models (e.g. MAE [21]).
Vision Transformers for Non-Interactive Segmentation
Recently, ViT-based approaches [17,24,43,45,49] have shown competitive performance on segmentation tasks compared to ConvNets. The original ViT [13] is a non-hierarchical architecture that maintains only single-scale feature maps throughout. SETR [51] and Segmenter [43] use the original ViT as the encoder for semantic segmentation. To allow for more efficient segmentation, the Swin Transformer [34] reintroduces a computational hierarchy into the original ViT architecture using shifted window attention, leading to a highly efficient hierarchical ViT backbone. SegFormer [45] designs hierarchical feature representations based on the original ViT using overlapped patch merging, combined with a lightweight MLP decoder for efficient segmentation. HRViT [17] integrates a high-resolution multi-branch architecture with ViTs to learn multi-scale representations. Recently, the original ViT has been reintroduced as a competitive backbone for semantic segmentation [8] and object detection [25], with the aid of MAE [21] pretraining and window attention. Inspired by this finding, we explore using a plain ViT as the backbone network for interactive segmentation.
Model        ViT Backbone    Conv. Neck    MLP Head
Ours-ViT-B   83.0 (89.3%)    9.0 (9.7%)    0.9 (1.0%)
Ours-ViT-L   290.8 (94.3%)   16.5 (5.3%)   1.1 (0.4%)
Ours-ViT-H   604.0 (95.7%)   25.8 (4.1%)   1.3 (0.2%)
Table 1. Number of parameters of our models. The unit is million; the percentage of total parameters is shown in brackets. Most parameters are used by the ViT backbone.
3. Method
Our goal is not to propose new modules, but to adapt a plain-ViT backbone for interactive segmentation with minimal modifications, so as to enjoy the readily available pretrained ViT weights. Sec. 3.1 introduces the main modules of SimpleClick. Sec. 3.2 describes the training and inference details of our method.
3.1. Network Architecture
Adaptation of Plain-ViT Backbone
We use a plain ViT [13] as our backbone network, which maintains only single-scale feature maps throughout. The patch embedding layer divides the input image into non-overlapping fixed-size patches (e.g. 16×16 for ViT-B); each patch is flattened and linearly projected to a fixed-length vector (e.g. 768 for ViT-B). The resulting sequence of vectors is fed into a stack of Transformer blocks (e.g. 12 for ViT-B) for self-attention.
We implement SimpleClick with three backbones: ViT-B, ViT-L, and ViT-H (Tab. 1 shows the number of parameters for the three backbones). The three backbones were pretrained on ImageNet-1k as MAEs [21]. We adapt the pretrained backbones to higher-resolution inputs during finetuning using non-shifting window attention aided by a few global self-attention blocks (e.g. 2 for ViT-B), as introduced in ViTDet [25]. Since the last feature map is subject to all the attention blocks, it should have the strongest representation. Therefore, we only use the last feature map to build a simple multi-scale feature pyramid.
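For concreteness, the following is a minimal PyTorch sketch of the plain-ViT patch embedding step described above, assuming ViT-B hyperparameters (16×16 patches, 768-dim tokens) and an example 448×448 input; the class and variable names are illustrative and do not refer to the released SimpleClick code.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 16x16 patches and linearly
    project each patch to a 768-dim token (ViT-B settings)."""
    def __init__(self, in_chans=3, patch_size=16, embed_dim=768):
        super().__init__()
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, 768, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, N, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 448, 448))
print(tokens.shape)  # torch.Size([1, 784, 768]); fed to the Transformer blocks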
Simple Feature Pyramid
For a hierarchical backbone, a feature pyramid is commonly produced by an FPN [27] to combine features from different stages. For the plain backbone, a feature pyramid can be generated in a much simpler way: by a set of parallel convolutional or deconvolutional layers using only the last feature map of the backbone. As shown in Fig. 2, given the input ViT feature map, a multi-scale feature map can be produced by four convolutions with different strides. Although the effectiveness of this simple feature pyramid design was first demonstrated in ViTDet [25] for object detection, we show in this work that it is also effective for interactive segmentation. We also propose several additional variants (Fig. 6) as part of an ablation study (Sec. 4.4).
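The sketch below illustrates one plausible realization of this simple feature pyramid, loosely following the ViTDet recipe of parallel (de)convolutions applied to the stride-16 ViT feature map; the exact layer configuration (channel widths, activations) is an assumption, not the paper's precise implementation.

import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Produce 4 feature maps (strides 4, 8, 16, 32) from the single
    stride-16 ViT feature map, using parallel (de)convolutions rather
    than an FPN over multiple backbone stages."""
    def __init__(self, vit_dim=768, out_dim=256):
        super().__init__()
        self.to_stride4 = nn.Sequential(                  # 1/16 -> 1/4
            nn.ConvTranspose2d(vit_dim, vit_dim // 2, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(vit_dim // 2, out_dim, 2, stride=2))
        self.to_stride8 = nn.ConvTranspose2d(vit_dim, out_dim, 2, stride=2)
        self.to_stride16 = nn.Conv2d(vit_dim, out_dim, 1)
        self.to_stride32 = nn.Conv2d(vit_dim, out_dim, 2, stride=2)

    def forward(self, x):                                 # x: (B, 768, H/16, W/16)
        return [self.to_stride4(x), self.to_stride8(x),
                self.to_stride16(x), self.to_stride32(x)]

pyramid = SimpleFeaturePyramid()(torch.randn(1, 768, 28, 28))
print([tuple(f.shape) for f in pyramid])
# [(1, 256, 112, 112), (1, 256, 56, 56), (1, 256, 28, 28), (1, 256, 14, 14)]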
All-MLP Segmentation Head
We implement a lightweight segmentation head using only MLP layers. It takes in the simple feature pyramid and produces a segmentation probability map¹ at 1/4 scale, followed by an upsampling operation to recover the original resolution. Note that this segmentation head avoids computationally demanding components and accounts for at most 1% of the model parameters (Tab. 1). The key insight is that, with a powerful pretrained backbone, a lightweight segmentation head is sufficient for interactive segmentation. The proposed all-MLP segmentation head works in three steps. First, each feature map from the simple feature pyramid goes through an MLP layer that transforms it to an identical channel dimension (i.e. C2 in Fig. 2). Second, all feature maps are upsampled to the same resolution (i.e. 1/4 in Fig. 2) for concatenation. Third, the concatenated features are fused by another MLP layer to produce a single-channel feature map, followed by a sigmoid function to obtain a segmentation probability map, which is then converted to a binary segmentation using a predefined threshold (i.e. 0.5).
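The three steps above can be summarized in the short sketch below; the hidden dimension and the use of 1×1 convolutions as per-pixel MLPs are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPHead(nn.Module):
    """Project each pyramid level to a common channel dimension with a
    per-pixel MLP (1x1 conv), upsample everything to the 1/4-scale map,
    concatenate, fuse to one channel, and apply a sigmoid."""
    def __init__(self, in_dim=256, hidden_dim=256, num_levels=4):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(in_dim, hidden_dim, 1) for _ in range(num_levels))
        self.fuse = nn.Conv2d(hidden_dim * num_levels, 1, 1)

    def forward(self, pyramid):
        target = pyramid[0].shape[-2:]            # the 1/4-scale map comes first
        feats = [F.interpolate(p(f), size=target, mode='bilinear',
                               align_corners=False)
                 for p, f in zip(self.proj, pyramid)]
        return torch.sigmoid(self.fuse(torch.cat(feats, dim=1)))  # (B, 1, H/4, W/4)

head = AllMLPHead()
pyramid = [torch.randn(1, 256, s, s) for s in (112, 56, 28, 14)]
prob = head(pyramid)
mask = prob > 0.5              # binarize with the 0.5 threshold
print(prob.shape, mask.dtype)  # torch.Size([1, 1, 112, 112]) torch.bool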
Symmetric Patch Embedding and Beyond
To fuse human clicks into the plain backbone, we introduce a patch embedding layer that is symmetric to the patch embedding layer in the backbone, followed by element-wise feature addition. The user clicks are encoded in a two-channel disk map, one channel for positive clicks and the other for negative clicks. Positive clicks should be placed on the foreground, while negative clicks should be placed on the background. The previous segmentation and the two-channel click map are concatenated into a three-channel map for patch embedding. The two symmetric embedding layers operate on the image and the concatenated three-channel map, respectively. Both inputs are patchified, flattened, and projected to two vector sequences of the same dimension, which are added element-wise before being fed into the self-attention blocks.
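Below is a minimal sketch of this click-encoding path, assuming a fixed disk radius for rasterizing clicks and ViT-B embedding sizes; the radius value and all names are illustrative and not taken from the released code.

import torch
import torch.nn as nn

def clicks_to_disk_map(pos_clicks, neg_clicks, size, radius=5):
    """Rasterize clicks into a 2-channel map: disks of `radius` pixels
    around positive clicks (channel 0) and negative clicks (channel 1)."""
    H, W = size
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    disk_map = torch.zeros(2, H, W)
    for channel, clicks in enumerate((pos_clicks, neg_clicks)):
        for (y, x) in clicks:
            disk_map[channel][(ys - y) ** 2 + (xs - x) ** 2 <= radius ** 2] = 1.0
    return disk_map

class SymmetricEmbed(nn.Module):
    """Two patch-embedding layers with identical geometry: one for the RGB
    image, one for [previous mask, positive clicks, negative clicks].
    Their token sequences are summed before the Transformer blocks."""
    def __init__(self, embed_dim=768, patch_size=16):
        super().__init__()
        self.image_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)
        self.click_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)

    def forward(self, image, prev_mask, click_map):
        guidance = torch.cat([prev_mask, click_map], dim=1)    # (B, 3, H, W)
        tokens = self.image_embed(image) + self.click_embed(guidance)
        return tokens.flatten(2).transpose(1, 2)               # (B, N, 768)

clicks = clicks_to_disk_map([(100, 120)], [(40, 60)], (448, 448))
tokens = SymmetricEmbed()(torch.randn(1, 3, 448, 448),
                          torch.zeros(1, 1, 448, 448),
                          clicks.unsqueeze(0))
print(tokens.shape)  # torch.Size([1, 784, 768])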
3.2. Training and Inference Settings
Backbone Pretraining
Our backbone models are pretrained as MAEs [21] on ImageNet-1K [11]. In MAE pretraining, the ViT models reconstruct randomly masked pixels of images while learning a universal representation. This simple self-supervised approach turns out to be an efficient and scalable way to pretrain ViT models [21]. In this work, we do not perform pretraining ourselves. Instead, we simply use the readily available pretrained MAE weights from [21].
End-to-end Finetuning
With the pretrained backbone, we finetune our model end-to-end on the interactive segmentation task. The finetuning pipeline can be briefly described as follows. First, we automatically simulate clicks based on the current segmentation and the gold standard segmentation, without a human in the loop providing the clicks. Specifically,
¹ This probability map may be miscalibrated and can be improved by calibration approaches [12].