Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
Feng Liang*1, Bichen Wu2, Xiaoliang Dai2, Kunpeng Li2, Yinan Zhao2, Hang Zhang† 3,
Peizhao Zhang2, Peter Vajda2, Diana Marculescu1
1The University of Texas at Austin, 2Meta Reality Labs, 3Cruise
{jeffliang,dianam}@utexas.edu,{wbc,stzpz,vajdap}@meta.com
https://jeff-liangf.github.io/projects/ovseg
Abstract
Open-vocabulary semantic segmentation aims to seg-
ment an image into semantic regions according to text de-
scriptions, which may not have been seen during train-
ing. Recent two-stage methods first generate class-agnostic
mask proposals and then leverage pre-trained vision-
language models, e.g., CLIP, to classify masked regions. We
identify the performance bottleneck of this paradigm to be
the pre-trained CLIP model, since it does not perform well
on masked images. To address this, we propose to finetune
CLIP on a collection of masked image regions and their
corresponding text descriptions. We collect training data by
mining an existing image-caption dataset (e.g., COCO Cap-
tions), using CLIP to match masked image regions to nouns
in the image captions. Compared with the more precise and
manually annotated segmentation labels with fixed classes
(e.g., COCO-Stuff), we find our noisy but diverse dataset
can better retain CLIP’s generalization ability. Along with
finetuning the entire model, we utilize the “blank” areas in
masked images using a method we dub mask prompt tuning.
Experiments demonstrate mask prompt tuning brings signif-
icant improvement without modifying any weights of CLIP,
and it can further improve a fully finetuned model. In par-
ticular, when trained on COCO and evaluated on ADE20K-
150, our best model achieves 29.6% mIoU, which is +8.5%
higher than the previous state-of-the-art. For the first time,
open-vocabulary generalist models match the performance
of supervised specialist models in 2017 without dataset-
specific adaptations.
1. Introduction
Semantic segmentation aims to group pixels into mean-
ingful regions with corresponding semantic categories. Al-
though remarkable progress has been made [6,7,9,29,41],
modern semantic segmentation models are mainly trained
with pre-defined categories, failing to generalize to unseen
classes. On the contrary, humans understand scenes in an
open-vocabulary manner, typically with thousands of cate-
gories [2]. To approach human-level perception, this paper
studies open-vocabulary semantic segmentation where the
model segments an image by arbitrary categories described
by texts.
*Work done during an internship at Meta Reality Labs.
†Work done while at Meta Reality Labs.
Vision-language models, e.g., CLIP [35], learn rich
multi-modal features from billion-scale image-text pairs.
Witnessing its superior open-vocabulary classification abil-
ity, prior works propose to use pre-trained vision-language
models for open-vocabulary segmentation [11,16,23,40].
Among them, two-stage approaches have shown great po-
tential: they first generate class-agnostic mask propos-
als and then leverage pre-trained CLIP to perform open-
vocabulary classification (see Figure 1(b)). Their success
relies on two assumptions: (1) the model can generate class-
agnostic mask proposals; (2) pre-trained CLIP can transfer
its classification performance to masked image proposals.
To examine these two assumptions, we conduct the fol-
lowing analysis. First, we assume an “oracle” mask gener-
ator and an ordinary CLIP classifier. We use ground-truth
masks as region proposals, and feed masked images to a
pre-trained CLIP for classification. This model only reaches
an mIoU of 20.1% on the ADE20K-150 dataset. Next, we
assume an “oracle” classifier but an ordinary mask proposal
generator – a MaskFormer [9] pre-trained on the COCO
dataset. We first extract masked region proposals, then com-
pare each region with ground-truth object masks, find the
object with the highest overlap, and assign the object la-
bel to this extracted region. This model, despite imper-
fect region proposals, reaches a significantly higher mIoU
of 66.5%.
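To make the first setting concrete, below is a minimal sketch of classifying one masked region crop with an off-the-shelf CLIP model. It assumes the OpenAI clip package and an already-prepared masked crop (see the preprocessing sketch further below); the class list and prompt template are illustrative, not the full ADE20K-150 protocol.

# Sketch of the "oracle masks + vanilla CLIP" analysis: zero-shot
# classification of a single masked region crop with pre-trained CLIP.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# "masked_crop.png" stands in for a region cropped around a ground-truth
# mask with background pixels zeroed out, as in Figure 1(b).
image = preprocess(Image.open("masked_crop.png")).unsqueeze(0).to(device)

# Illustrative subset of class names; the real analysis uses all 150 classes.
class_names = ["bridge", "sky", "building", "tree", "water"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)  # (1, num_classes)

print("predicted class:", class_names[probs.argmax().item()])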
This analysis clearly shows that pre-trained CLIP cannot
perform satisfactory classification on masked images,
and it is the performance bottleneck of two-stage open-
vocabulary segmentation models. We hypothesize that this
is caused by the significant domain gap between masked
images and CLIP’s training images. CLIP is pre-trained on
natural images with minimal data augmentation [35]. On
the other hand, mask proposals are cropped and re-sized
from original images, and are further corrupted by noisy
segmentation masks; see examples in Figure 1(b).
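The masked inputs described above can be produced with a few lines of preprocessing; the helper below is a hypothetical sketch (not the authors' released code) that crops an image to a mask's bounding box, blanks out background pixels, and resizes the result to CLIP's input resolution.

import numpy as np
from PIL import Image

def masked_crop(image: Image.Image, mask: np.ndarray, size: int = 224) -> Image.Image:
    """Hypothetical helper: crop `image` to the bounding box of a binary `mask`
    (H x W, 1 = foreground), zero out background pixels, and resize to
    `size` x `size`, mimicking the masked images fed to CLIP in Figure 1(b)."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = np.asarray(image)[y0:y1, x0:x1].copy()
    crop *= mask[y0:y1, x0:x1, None].astype(crop.dtype)  # blank out background
    return Image.fromarray(crop).resize((size, size))

The zeroed-out background produced here is exactly the kind of blank area that motivates the mask prompt tuning introduced later.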
Figure 1. (a) CLIP is pre-trained on natural images with little data augmentation. (b) Two-stage open-vocabulary semantic segmentation
approaches first generate class-agnostic mask proposals and then leverage pre-trained CLIP to perform open-vocabulary classification. The
input to the CLIP model consists of cropped masked images, which have a huge domain gap from natural images. (c) Our analysis reveals that
pre-trained CLIP does not work well on masked images.
To address this, we propose to adapt CLIP by finetun-
ing it on masked images and corresponding text labels. One
direct solution is to use segmentation labels, e.g., from the
COCO-Stuff dataset. However, this leads to poor general-
ization to unseen classes (Section 4.3.1). Such manually
annotated masks are accurate, but the classes are limited to a
closed set (e.g., 171 classes for COCO-Stuff). We hypothe-
size that the lack of text diversity causes the finetuned CLIP
to lose the generalization ability to open vocabulary con-
cepts. Instead, we collect training data by mining an ex-
isting image-caption dataset (e.g., COCO Captions). Given
an image-caption pair, we first extract nouns in the caption,
and generate class-agnostic masked region proposals using
a pre-trained segmentation model. Then, with a pre-trained
CLIP model, we assign the best-matching proposal to each
extracted noun. By learning from these weakly-supervised
alignments between masked images and novel categories,
the adapted CLIP better retains its generalization ability for
open vocabulary classification.
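As a rough sketch of this mining step, the snippet below assumes NLTK for noun extraction, a hypothetical class-agnostic proposal_generator, the masked_crop helper from the earlier sketch, and the OpenAI clip package; none of these names come from the paper's released code.

import torch
import clip
import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' resources

def extract_nouns(caption: str) -> list:
    """Return the nouns in a caption via simple part-of-speech tagging."""
    tokens = nltk.word_tokenize(caption)
    return [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]

@torch.no_grad()
def mine_mask_category_pairs(image, caption, proposal_generator,
                             clip_model, preprocess, device="cuda"):
    """Assign each caption noun to its best-matching masked proposal."""
    nouns = extract_nouns(caption)                    # e.g. ["cat", "ground"]
    masks = proposal_generator(image)                 # hypothetical: list of binary masks
    crops = [masked_crop(image, m) for m in masks]    # helper from the earlier sketch
    img_feat = clip_model.encode_image(
        torch.stack([preprocess(c) for c in crops]).to(device))
    txt_feat = clip_model.encode_text(
        clip.tokenize([f"a photo of a {n}" for n in nouns]).to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    best = (img_feat @ txt_feat.T).argmax(dim=0)      # best proposal index per noun
    return [(masks[best[j].item()], nouns[j]) for j in range(len(nouns))]

Each resulting (mask, noun) pair becomes one weakly-supervised training example for adapting CLIP.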
The next question is how to finetune CLIP effectively.
The most notable difference between a masked image and
a natural image is that background pixels in a masked im-
age are masked out, leading to many blank areas, which
will be converted to “zero tokens” when fed into CLIP’s
transformer. Such tokens not only contain no useful in-
formation, but also introduce a domain distribution shift to the
model (since such tokens do not exist in natural images) and
cause performance degradation. To mitigate this, we pro-
pose mask prompt tuning, à la visual prompt tuning [20].
When tokenizing a masked image, we replace the “zero to-
kens” with learnable prompt tokens. During finetuning, we
either train prompts only and freeze CLIP’s weights, or train
both of them. We find that mask prompt tuning alone sig-
nificantly improves CLIP’s performance on masked images.
This is a crucial property for multi-task scenarios where we
cannot change CLIP’s weights since they are shared with other
tasks. When combined with full model finetuning, mask
prompt tuning can further improve the performance by a
non-trivial margin (see Section 4.3.2).
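A minimal sketch of the token-replacement mechanism inside a ViT-style patch embedding is shown below, assuming learnable prompts of the same shape as the patch tokens; this illustrates the idea rather than reproducing the paper's implementation.

import torch
import torch.nn as nn

class MaskPromptPatchEmbed(nn.Module):
    """Illustrative ViT patch embedding with mask prompt tuning: patch tokens
    that fall entirely inside the blanked-out region are replaced by
    learnable prompt tokens instead of becoming "zero tokens"."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_size = patch_size
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learnable prompt token per patch position (could also be shared).
        self.mask_prompt = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x, mask):
        # x: (B, 3, H, W) masked image; mask: (B, 1, H, W) binary, 1 = foreground.
        tokens = self.proj(x).flatten(2).transpose(1, 2)          # (B, N, C)
        # A patch is "blank" if it contains no foreground pixels.
        patch_fg = nn.functional.max_pool2d(mask.float(), self.patch_size)
        blank = (patch_fg.flatten(1) == 0).unsqueeze(-1)          # (B, N, 1)
        # Replace blank patch tokens with the learnable mask prompts.
        return torch.where(blank, self.mask_prompt.expand_as(tokens), tokens)

For prompt-only tuning, one would freeze CLIP's weights and optimize only mask_prompt; for joint finetuning, both are updated.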
In our experiments, we measure open-vocabulary
segmentation performance on holdout segmentation
datasets in a “zero-shot” manner – we do not adapt the
model for each evaluation dataset. We train our model us-
ing the COCO-Stuff dataset [5] with captions from [8], and test
on the challenging ADE20K (A-150, A-847 for 150/847 cate-
gories) [43], Pascal Context (PC-59, PC-459 for 59/459 cat-
egories) [33] and PASCAL VOC (PAS-20) [15] benchmarks. Our best
model achieves 29.6% mIoU on A-150, which is +8.5%
higher than the state-of-the-art OpenSeg [16] under the same set-
ting. On the more challenging A-847 and PC-459, our model
sets a new state of the art of 9.0% and 12.4% mIoU, sur-
passing the previous best solution by +2.7% and +3.4%.
Moreover, for the first time, we show open-vocabulary gen-
eralist models can match the performance of supervised
specialist models [6,29,45] in 2017 without dataset-specific
adaptations.
In summary, our contributions include: (1) Our anal-
ysis reveals that pre-trained CLIP does not perform well
on mask proposals, making it the performance bottleneck
of two-stage approaches. (2) We collect diverse mask-
category pairs from captions to adapt CLIP for masked im-
ages and retain its generalization ability. (3) We propose
mask prompt tuning specifically for masked image adap-
tation. This method does not change CLIP’s weights, en-
abling multi-task weight sharing. (4) For the first time, we
show open-vocabulary generalist models can match the per-
formance of supervised specialist models in 2017 without
dataset-specific adaptations.
2. Related Work
Pre-trained vision-language models [19,25,35,36]
connect visual concepts with textual descriptions. Pre-
trained CLIP [35] has strong open-vocabulary classifica-
tion ability, i.e., classifying an image with arbitrary cate-
gories described by language. Pre-trained CLIP has em-
powered many computer vision tasks with the language
ability, such as image manipulation [34], image genera-
tion [10], object detection [17,42] and image segmenta-
tion [11,12,16,21,23,31,39,40]. Our work is similar to
RegionCLIP [42], which adapts CLIP for object detection
by finetuning on region proposals. Our method differs from
RegionCLIP in that (1) we adapt CLIP to process masked im-
ages while RegionCLIP processes complete region crops;
(2) we leverage blank areas in masked images and propose
mask prompt tuning, which adapts CLIP without changing
its weights. This enables sharing CLIP’s weights with other
tasks in multi-task scenarios. This is not supported by Re-
gionCLIP.
Open-vocabulary segmentation aims to understand an
image with arbitrary categories described by texts. Pio-
neering work ZS3Net [4] uses generative models to syn-
thesize pixel-level features from the word embeddings of un-
seen classes. SPNet [37] utilizes word embeddings, e.g.,
word2vec [32], to align the semantic meaning with visual
features. GroupViT [38] groups segmentation masks di-
rectly from text supervision. More recently, researchers
propose to leverage the pre-trained CLIP [35] for open-
vocabulary semantic segmentation. LSeg [23] aligns pixel
embeddings to the text embedding of the corresponding se-
mantic class, which is generated by CLIP’s text encoder.
Unlike pixel-level LSeg, OpenSeg [16] proposes to align
the segment-level visual features with text embedding via
region-word grounding. Our approach falls into the cate-
gory of two-stage approaches, such as ZSSeg [40] and Zeg-
Former [11]: they first generate class-agnostic mask pro-
posals and then utilize pre-trained CLIP to perform open-
vocabulary classification. Unlike ZSSeg and ZegFormer
which directly use the original CLIP for masked image clas-
sification, we adapt CLIP to improve performance.
Prompt tuning is a strategy to adapt large-scale pre-
trained models to new tasks. The idea originated from nat-
ural language processing [22,24,27], and recent work ex-
tends prompt tuning to computer vision. CoOp [44] prepends
learnable vectors to the category words to adapt
CLIP to many recognition tasks. Textual prompt tuning
is also widely used in open-vocabulary object detection [14]
and semantic segmentation [40]. Our mask prompt tuning is
more relevant to prompt tuning in the visual domain [1,20]
where learnable vectors are applied to the image domain.
Unlike visual prompt tuning [20] that inserts additional to-
kens before the actual image tokens, we replace masked to-
kens with learnable prompts. Furthermore, mask prompt
tuning brings additional improvement over a fully finetuned
model (Section 4.3.2). Such additional improvements have
not been reported by prior work.
Figure 2. Two-stage approaches consist of a segmentation model, e.g., MaskFormer, and a CLIP model. First, the modified MaskFormer is trained with CLIP’s text embeddings so that it can perform open-vocabulary segmentation (Section 3.1). We then use the pre-trained segmentation model to generate class-agnostic proposals and align the proposals with nouns extracted from the corresponding captions (Section 3.2). After collecting diverse mask-category pairs, we finetune CLIP with the proposed mask prompt tuning (Section 3.3).

3. Method
In this section, we first revisit the two-stage open-
vocabulary segmentation methods [11,40]. Then we discuss
how to obtain a dataset of mask-category pairs to finetune
CLIP. Last, we discuss the proposed mask prompt tuning
technique to adapt CLIP for masked images.
3.1. Two-stage models for open-vocabulary semantic segmentation
Our two-stage open-vocabulary semantic segmentation
model is shown in Figure 2. It consists of a segmentation
model that generates mask proposals, and an open vocabu-
lary classification model.
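To make the classification stage concrete, the sketch below scores N proposal embeddings against CLIP text embeddings of K class names plus a "no object" embedding, yielding the N x (K+1) class distribution shown in Figure 2; the function and its temperature value are assumed for illustration, not taken from the released code.

import torch
import torch.nn.functional as F

def classify_proposals(proposal_emb, text_emb, no_object_emb, temperature=0.01):
    """proposal_emb: (N, C) embeddings from the segmentation model.
    text_emb:      (K, C) CLIP text embeddings, one per class name.
    no_object_emb: (1, C) learned embedding for the "no object" class.
    Returns an (N, K+1) distribution over classes for each proposal."""
    class_emb = torch.cat([text_emb, no_object_emb], dim=0)   # (K+1, C)
    proposal_emb = F.normalize(proposal_emb, dim=-1)
    class_emb = F.normalize(class_emb, dim=-1)
    logits = proposal_emb @ class_emb.T / temperature         # (N, K+1)
    return logits.softmax(dim=-1)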
Following [11,40], we choose MaskFormer [9] as the
segmentation model. Unlike per-pixel segmentation mod-