stems from the locality of convolution operations, which leaves the model with an insufficient receptive field in the absence of a hierarchy. To enlarge the receptive field, ConvNets have to progressively downsample feature maps to capture more global contextual information. Therefore, they often require a feature pyramid network such as FPN [27] to aggregate multi-scale representations for high-quality segmentation.
However, this reasoning no longer applies to a plain ViT, in which global information can be captured from the first self-attention block onward. Because all feature maps in the ViT are of the same resolution, the motivation for an FPN-like feature pyramid also no longer holds. The above reasoning is supported by the recent finding that a plain ViT can serve as a strong backbone for object detection [25]. This finding indicates that a general-purpose ViT backbone might be suitable for other tasks as well, which would decouple pretraining from finetuning and transfer the benefits of readily available pretrained ViT models (e.g., MAE [21]) to these tasks.
However, although this design is simple and has proven effective, it has not yet been explored in interactive segmentation. In this work, we propose SimpleClick, the first plain-backbone method for interactive segmentation. The core of SimpleClick is a plain ViT backbone that maintains single-scale representations throughout. We use only the last feature map from the plain backbone to build a simple feature pyramid for segmentation, largely decoupling the general-purpose backbone from the segmentation-specific modules. To make SimpleClick more efficient, we use a light-weight MLP decoder to transform the simple feature pyramid into a segmentation mask (see Sec. 3 for details).
We extensively evaluate our method on 10 public benchmarks, including both natural and medical images. With the plain backbone pretrained with MAE [21], our method achieves 4.15 NoC@90 on SBD (i.e., on average 4.15 clicks are needed to reach 90% IoU), outperforming the previous best method by 21.8% without a complex FPN-like design or local refinement. We demonstrate the generalizability of our method by out-of-domain evaluation on medical images. We further analyze the computational efficiency of SimpleClick, highlighting its suitability as a practical annotation tool.
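For clarity, NoC@T denotes the average number of simulated clicks required to reach T% IoU, capped at a fixed budget. The following is a minimal sketch of this standard evaluation loop, assuming a hypothetical predictor `model` and click simulator `next_click` (a common simulator is sketched in Sec. 2); the 20-click budget follows common practice.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def noc(model, next_click, image, gt_mask, iou_thresh=0.90, max_clicks=20):
    """Number of Clicks (NoC) for one instance: simulated clicks needed
    to reach `iou_thresh` IoU. `model(image, clicks)` must return a binary
    mask; `next_click(pred, gt)` must return the next simulated click."""
    clicks, pred = [], np.zeros_like(gt_mask, dtype=bool)
    for n in range(1, max_clicks + 1):
        clicks.append(next_click(pred, gt_mask))
        pred = model(image, clicks)
        if iou(pred, gt_mask) >= iou_thresh:
            return n
    return max_clicks  # convention: report the budget if never reached

# NoC@90 for a benchmark is the mean of noc(...) over all of its instances.
```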
Our main contributions are:
• We propose SimpleClick, the first plain-backbone method for interactive image segmentation.
• SimpleClick achieves state-of-the-art performance on natural images and shows strong generalizability on medical images.
• SimpleClick meets the computational efficiency requirement for a practical annotation tool, highlighting its readiness for real-world applications.
2. Related Work
Interactive Image Segmentation Interactive image segmentation is a longstanding problem for which increasingly capable approaches have been proposed. Early works [6,16,18,39] tackle this problem using graphs defined over image pixels. However, these methods focus only on low-level image features and therefore tend to have difficulty with complex objects.
Thriving on large datasets, ConvNets [10,29,42,46,47] have evolved into the dominant architecture for high-quality interactive segmentation. ConvNet-based methods have explored various interaction types, such as bounding boxes [46], polygons [1], clicks [42], and scribbles [44]. Click-based approaches are the most common due to their simplicity and well-established training and evaluation protocols. Xu et al. [47] first proposed a click simulation strategy that has been adopted by follow-up work [10,33,42] (see the sketch after this paragraph).
DEXTR [35] extracts a target object from four user-specified extreme points (the left-most, right-most, top, and bottom pixels). FCA-Net [30] demonstrates the critical role of the first click for better segmentation. Recently, ViTs have been applied to interactive segmentation. FocalClick [10] uses SegFormer [45] as the backbone network and achieves state-of-the-art segmentation results with high computational efficiency. iSegFormer [32] uses a Swin Transformer [34] as the backbone network for interactive segmentation on medical images. Besides these contributions on backbones, some works explore elaborate refinement modules built upon the backbone. FocalClick [10] and FocusCut [29] propose similar local refinement modules for high-quality segmentation. PseudoClick [33] proposes a click-imitation mechanism that estimates the next click to further reduce human annotation cost. Our method differs from all previous click-based methods in its plain, non-hierarchical ViT backbone, enjoying the benefits of readily available pretrained ViT models (e.g., MAE [21]).
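To make the simulation protocol referenced above concrete, the sketch below shows a widely used variant: the next click is placed at the interior point of the largest misclassified region, located with a distance transform. This is an illustrative rendering of the common protocol, not the exact procedure of any single method.

```python
import numpy as np
from scipy import ndimage

def next_click(pred: np.ndarray, gt: np.ndarray):
    """Simulate the next user click from the current prediction error.

    Finds the largest connected error region (false negatives or false
    positives) and clicks the point in it farthest from the region's
    border. Returns ((row, col), is_positive), or None if there is no error.
    """
    error = np.logical_xor(pred, gt)
    labels, num = ndimage.label(error)
    if num == 0:
        return None
    sizes = ndimage.sum(error, labels, index=range(1, num + 1))
    region = labels == (np.argmax(sizes) + 1)
    dist = ndimage.distance_transform_edt(region)  # distance to region border
    r, c = np.unravel_index(np.argmax(dist), dist.shape)
    return (int(r), int(c)), bool(gt[r, c])  # positive click if on the object
```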
Vision Transformers for Non-Interactive Segmentation Recently, ViT-based approaches [17,24,43,45,49] have shown competitive performance on segmentation tasks compared to ConvNets. The original ViT [13] is a non-hierarchical architecture that maintains only single-scale feature maps throughout. SETR [51] and Segmenter [43] use the original ViT as the encoder for semantic segmentation. To allow for more efficient segmentation, the Swin Transformer [34] reintroduces a computational hierarchy into the original ViT architecture using shifted-window attention, leading to a highly efficient hierarchical ViT backbone. SegFormer [45] designs hierarchical feature representations based on the original ViT using overlapped patch merging, combined with a light-weight MLP decoder for efficient segmentation. HRViT [17] integrates a high-resolution multi-branch architecture with ViTs to learn multi-scale representations. Recently, the original ViT has been reintroduced as a competitive backbone for semantic segmentation [8] and object detection [25], with the aid of MAE [21] pretraining and window attention. Inspired by