
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
Feng Liang*1, Bichen Wu2, Xiaoliang Dai2, Kunpeng Li2, Yinan Zhao2, Hang Zhang†3,
Peizhao Zhang2, Peter Vajda2, Diana Marculescu1
1The University of Texas at Austin, 2Meta Reality Labs, 3Cruise
{jeffliang,dianam}@utexas.edu, {wbc,stzpz,vajdap}@meta.com
*Work done during an internship at Meta Reality Labs. †Work done while at Meta Reality Labs.
https://jeff-liangf.github.io/projects/ovseg
Abstract
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find that our noisy but diverse dataset better retains CLIP’s generalization ability. Along with finetuning the entire model, we exploit the “blank” areas in masked images with a method we dub mask prompt tuning. Experiments demonstrate that mask prompt tuning brings significant improvement without modifying any weights of CLIP, and that it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of supervised specialist models from 2017 without dataset-specific adaptations.
1. Introduction
Semantic segmentation aims to group pixels into meaningful regions with corresponding semantic categories. Although remarkable progress has been made [6,7,9,29,41], modern semantic segmentation models are mainly trained with pre-defined categories and fail to generalize to unseen classes. In contrast, humans understand scenes in an open-vocabulary manner, typically with thousands of categories [2]. To approach human-level perception, this paper studies open-vocabulary semantic segmentation, where the model segments an image into arbitrary categories described by texts.
Vision-language models, e.g., CLIP [35], learn rich multi-modal features from billion-scale image-text pairs. Motivated by their superior open-vocabulary classification ability, prior works propose to use pre-trained vision-language models for open-vocabulary segmentation [11,16,23,40]. Among them, two-stage approaches have shown great potential: they first generate class-agnostic mask proposals and then leverage pre-trained CLIP to perform open-vocabulary classification (see Figure 1(b)). Their success relies on two assumptions: (1) the model can generate class-agnostic mask proposals, and (2) pre-trained CLIP can transfer its classification performance to masked image proposals.
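For concreteness, the snippet below is a minimal sketch of this second stage (not the authors' implementation), assuming the OpenAI clip package and a hypothetical, already-cropped masked proposal image:

```python
# Minimal sketch of the second stage: zero-shot classification of a masked
# region crop with pre-trained CLIP. Assumes the OpenAI `clip` package;
# `masked_crop` (a PIL image of the masked proposal) and `class_names`
# are hypothetical inputs, not the paper's actual code.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

def classify_region(masked_crop: Image.Image, class_names: list[str]) -> str:
    image_input = preprocess(masked_crop).unsqueeze(0).to(device)
    text_input = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_input)
    # Cosine similarity between the masked crop and each class prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)
    return class_names[int(probs.argmax())]
```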
To examine these two assumptions, we conduct the following analysis. First, we assume an “oracle” mask generator and an ordinary CLIP classifier. We use ground-truth masks as region proposals and feed masked images to a pre-trained CLIP for classification. This model only reaches an mIoU of 20.1% on the ADE20K-150 dataset. Next, we assume an “oracle” classifier but an ordinary mask proposal generator, namely a MaskFormer [9] pre-trained on the COCO dataset. We first extract masked region proposals, then compare each region with the ground-truth object masks, find the object with the highest overlap, and assign that object’s label to the extracted region. This model, despite imperfect region proposals, reaches a significantly higher mIoU of 66.5%.
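The “oracle” classifier setup can be sketched as follows, assuming hypothetical lists of binary proposal masks, ground-truth masks, and their labels (illustrative code, not the paper's evaluation script):

```python
# Sketch of the "oracle classifier": each predicted proposal receives the
# label of the ground-truth mask it overlaps most, measured by IoU.
import numpy as np

def oracle_labels(proposal_masks, gt_masks, gt_labels):
    assigned = []
    for prop in proposal_masks:          # each mask: binary HxW numpy array
        ious = []
        for gt in gt_masks:
            inter = np.logical_and(prop, gt).sum()
            union = np.logical_or(prop, gt).sum()
            ious.append(inter / union if union > 0 else 0.0)
        assigned.append(gt_labels[int(np.argmax(ious))])
    return assigned
```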
This analysis clearly shows that pre-trained CLIP cannot perform satisfactory classification on masked images, and that it is the performance bottleneck of two-stage open-vocabulary segmentation models. We hypothesize that this is caused by the significant domain gap between masked images and CLIP’s training images. CLIP is pre-trained on natural images with minimal data augmentation [35]. Mask proposals, on the other hand, are cropped and re-sized from the original images and further corrupted by noisy segmentation masks; see examples in Figure 1(b).
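A sketch of how such a masked proposal might be constructed is shown below, assuming a hypothetical image array and binary proposal mask; the zeroed background and re-sizing illustrate the source of the domain gap:

```python
# Illustrative construction of a masked image proposal (hypothetical
# helper, not the paper's code): `image` is an HxWx3 uint8 array and
# `mask` a binary HxW array from a proposal generator.
import numpy as np
from PIL import Image

def make_masked_crop(image: np.ndarray, mask: np.ndarray, size: int = 224) -> Image.Image:
    # Zero out pixels outside the proposal, leaving "blank" background.
    masked = image * mask[..., None].astype(image.dtype)
    # Crop to the proposal's bounding box and re-size to CLIP's input
    # resolution; both steps move the input away from the natural images
    # CLIP was pre-trained on.
    ys, xs = np.nonzero(mask)
    crop = masked[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return Image.fromarray(crop).resize((size, size))
```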