Perceptual Grouping in Contrastive Vision-Language Models
Kanchana Ranasinghe*, Brandon McKinzie, Sachin Ravi,
Yinfei Yang, Alexander Toshev, Jonathon Shlens
Apple
kranasinghe@cs.stonybrook.edu
Abstract
Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentation, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
1. Introduction
Learning a representation for visual imagery requires resolving not only what resides within an image, but also where that information resides [72]. In many applications, knowledge of where information resides is sometimes more important than a precise description of the content [33, 98]. Hence, our ability to learn more generic and robust visual representations requires learning the geometry of visual semantics, and how visual information may be grounded by specific regions of the visual field.
Figure 1: Semantic localization in contrastive VLMs. We measure the ability of vision-language models to predict a label at each spatial position in a zero-shot manner, based on the similarity of location tokens to the corresponding language tokens, on selected examples. CLIP / ALIGN [50, 85] have minimal understanding of the spatial location of individual objects (row 4). Our proposed CLIPpy (row 3) predicts the label at locations that correspond closely to human annotation for semantic segmentation (row 2). All predictions were performed with no access to any segmentation data during training or inference. More visualizations in App. B.

*Work performed as part of an Apple internship. †Work performed at Apple.

While recent vision-language models trained under weak supervision demonstrate a remarkable ability to learn generic and transferable visual representations [50, 85, 117, 24], they
showcase a profound inability to associate visual content with individual objects (Fig. 1, bottom row). In other words, models trained on large weakly-supervised data have a limited ability to group together visually related content [36]. Because the representations have a poor understanding of where an object resides, they easily conflate background with foreground content. Hence, the learned representations are unable to learn the spatial layout of a scene [97, 101], and are susceptible to learning spurious correlations between a semantic label and extraneous content [91, 65].
Recent work [113, 114] attempts to bridge this gap through grouping mechanisms under the same weakly supervised training paradigm, but focuses more on foreground objects (neglecting background classes). Another direction is task-specific unsupervised fine-tuning [126, 26], which loses the generic and transferable nature of these representations.
In this work, we explore vision-language models that learn from similar weakly labeled data, but (a) retain the generic and transferable nature of features, and (b) learn where all (background and foreground) visual content resides within an image. Unlike previous attempts using grouping-specific architectures [113, 114] or dense human annotations [36, 38, 57], we explore a minimal set of modifications to existing CLIP models [85] that leads to grouping of visual imagery while retaining their weakly supervised and scalable training procedure. We find that two small adjustments – employing specific pretraining strategies and adjusting spatial feature aggregation – result in models that are equally effective in zero-shot image recognition, but also retain spatial information regarding object locations (see Fig. 1, 3rd row).
The resulting model, termed CLIPpy, exhibits perceptual grouping – that is, the ability to select and combine related visual signals into semantically meaningful regions [110, 72, 89]. Endowing learned representations with perceptual grouping – whether in a bottom-up (based solely on visual content) or top-down (guided by external information, language in this case) manner – has been a long-standing goal in computer vision [70, 71]. In this work, our key contributions are as follows:
• We identify a systematic failure of contrastive vision-language models [85, 50] to properly identify where objects reside within an image and to group semantically related content.
• We design a minimal set of changes that endows these models with perceptual grouping, resulting in state-of-the-art zero-shot segmentation without training on any segmentation data or performing task-specific fine-tuning.
• We show that the emergence of localization ability in our models uniquely leads to robustness to counterfactual manipulations. The degree of robustness matches, if not surpasses, that of previous state-of-the-art supervised learning methods employing specialized training methodologies.
2. Related Work
Vision-language models for grounding. Contrastive language image pre-training (CLIP) [85] led to a range of follow-up work performing open-vocabulary detection [38, 51, 58, 59, 120, 28] or segmentation [36, 57, 122].
While these methods leverage dense human annotations for training, an alternate line of work [113, 114, 126, 115, 22] attempts to learn alignment between regions of images and language with only image-level noisy captions for supervision. This weak supervision allows better scalability (to more data), leading to more generic and transferable representations. In fact, multiple such works [113, 114, 126, 26, 57] perform zero-shot semantic segmentation. However, unlike [113, 114], which are geared to segment a fixed count of foreground objects, our proposed CLIPpy can better segment arbitrary object counts and background classes. In contrast to [126], which uses generic image-level features, CLIPpy explicitly learns local features during training. Moreover, CLIPpy requires no dense human annotations or task-specific fine-tuning, in contrast to [26, 57]. We also highlight that [113, 114, 26] perform grouping independent of language at inference; in contrast, CLIPpy can group conditioned on language, capturing variable object boundaries for different language prompts.

Component        CLIP [85]   CLIP†       CLIPpy
Image Backbone   ViT-B/16    ViT-B/16    ViT-B/16
Text Backbone    T-B         T-5         T-5
Image Init       Random      Random      DINO
Text Init        Random      Random      Sent. T-5
Image Pooling    CLS         CLS         Max
Text Pooling     Avg         Avg         Avg
Dataset          400M*       CC-12M      CC-12M
VOC mIoU (%)     16.4        17.5        50.8 (+33.3)
VOC JS (%)       28.6        37.3        47.5 (+10.2)

Table 1: We highlight the minimal differences of CLIPpy from CLIP. CLIP† is our implementation following training settings identical to CLIPpy. * indicates OpenAI private data.
Multiple contemporary works also explore directions similar to CLIPpy, leveraging pre-trained vision-language models for various grouping tasks under weak supervision (no pixel-level annotation) [123, 68, 13, 75, 9, 52]. Combining self-supervised methods in which grouping emerges [12] with CLIP models [85] for cross-modal alignment is explored in [123], gaining notable improvements at object boundaries. A clustering mechanism containing learnable centres, similar to [113], is combined with reconstruction and super-pixel alignment losses to achieve grouping in [68]. Learning decoder networks over a frozen CLIP backbone [85] with text-to-image-patch similarity losses is explored in [13, 75], resulting in similar grouping behaviour. In contrast to these methods, which rely on contrastive vision-language training for grouping to emerge, recent works [9, 52] also showcase how text-to-image generative models (particularly Stable Diffusion [90]) can be leveraged to perform visual grouping.
Zero-shot semantic segmentation. A form of top-down grouping, this relatively new task [124, 48, 111, 8, 79, 45, 60, 3, 95] attempts to segment unseen classes, usually after a supervised training phase often involving dense-annotation-based supervision. Following two early representative works [111, 8], most later approaches [60, 39, 40, 53, 95, 102] formulate the task as a pixel-level zero-shot classification problem with a closed-set vocabulary. While CLIPpy follows a similar pixel-based formulation, in contrast, our method requires no dense human annotations for supervision, no task-specific fine-tuning, and is open-vocabulary. Recent work [26, 58] also explores region-level classification leveraging pre-trained CLIP models [85], but unlike CLIPpy performs grouping independent of language during inference.
[Figure 2 diagram: image encoder → H × W × D spatial representation → aggregation across space → 1 × D image embedding; text encoder ("bird") → 1 × D text embedding; contrastive loss between the two. Highlighted choices: image encoder initialization (random / supervised / self-supervised pretraining) and text encoder initialization (random / T5 pretraining).]
Figure 2: Architecture diagram. Images and captions are separately embedded into Euclidean spaces, where image features are spatially aggregated. A contrastive loss trains the aggregated image embedding to be close to the caption embedding. We demonstrate that two minimal design decisions (indicated in green) are of paramount importance for CLIP [85] models to perform perceptual grouping under image-level weak supervision.
Unsupervised segmentation. Analogous to bottom-up grouping, these works perform class-agnostic segmentation within the visual modality with no explicit language alignment [12, 41, 31, 49, 73]. This topic has a long, rich history in human visual perception [110] and computer vision [70], and has been explored as a means of generalizing to new visual domains [84, 71]. It is this goal that most closely inspires our work. Early efforts group pixels based on known spatially-local affinities [19, 96, 88], with subsequent methods leading to region proposal networks for object detection [103] and advances in semantic segmentation [1]. Recent methods employ self-supervision to learn perceptual grouping [18, 41] or object-centric groupings [29, 66, 109, 4, 44]. Our proposed CLIPpy demonstrates competitive performance, but additionally aligns groups to the language modality explicitly.
Learning robust visual representations. For a long time, ImageNet [23] accuracy was believed to provide a reasonable proxy for the quality of learned visual representations [37, 55]. However, recent work highlights notable deficiencies in such learned representations [34, 87, 54], including sensitivity to low-level textures, failure under domain shift, and reliance on spurious correlations. These failures inspired a large literature on mitigating spurious correlations [91, 65, 2] by focusing on new optimization techniques. Progress on this issue may address parallel issues in fairness [21]. Resulting methods have largely focused on synthetic data, re-balancing data, and shaping learned embeddings [76, 65]. Nonetheless, theoretical results suggest pessimistic bounds unless additional structure informs the problem (see refs. in [91]). Therein, the structured output predictions of the proposed CLIPpy provide another promising solution.
3. Methodology
We first set the stage by discussing established core architectures and the contrastive learning formulation. Next, we discuss the modifications that are the focus of the analysis in this work. In particular, we discuss aggregation options, pre-training alternatives, and token sub-sampling.
3.1. Architecture and Training
We provide a quick overview of our architecture (Fig. 2). Consider a batch size N, spatial height H, spatial width W, and depth D. X is a tensor of shape [N, H, W, D] and is the output of an image encoder. Y is a tensor of shape [N, D] and is the output of a text encoder.
Language Model. We employ a strong language model baseline derived from the transformer architecture [104] and implemented in T5 [86]. T5 models use an encoder-decoder architecture trained with a generative span-corruption task, and have achieved state-of-the-art results on a broad range of NLP tasks including GLUE [106] and SuperGLUE [105]. We use only the encoder and discard the decoder. We employ T5-Base, which consists of 12 transformer layers, 12 attention heads, and a token channel dimension of 768.
Image Model. We explore two architectures for image featurization, CNN-based and Vision Transformers, although we focus the majority of the work on the latter. First, we employ the EfficientNet architecture [100] as a highly performant CNN architecture, which has been used previously in vision-language models; the specifics of the meta-architecture were derived from considerations based on neural architecture search. Second, we employ the Vision Transformer (ViT) architecture [27]. We refer the reader to [27, 104] for details. Briefly, ViT is largely inherited from the NLP literature and consists of a hierarchical associative memory. Each layer, termed a transformer, is composed of a Multi-headed Self-Attention (MSA) layer followed by a 2-layer feed-forward multi-layer perceptron (MLP). The primary parameter of ViT is the patch size P, specifying the P×P patch of pixels constituting a token in the architecture.
Contrastive Representation Learning. Let x_i and y_i denote the image and text embeddings (post aggregation) of the i-th example in the batch. A contrastive loss may be specified as the cross entropy across a batch [85, 50]. The cross entropy is calculated between a one-hot encoding specifying the correspondence between the image and text examples, and a softmax-normalized distribution specifying the dot-product similarity between image and text embeddings:
L = \underbrace{-\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(x_i^\top y_i/\tau)}{\sum_{j=1}^{N}\exp(x_i^\top y_j/\tau)}}_{\text{image-to-text}} \; \underbrace{-\;\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(y_i^\top x_i/\tau)}{\sum_{j=1}^{N}\exp(y_i^\top x_j/\tau)}}_{\text{text-to-image}}
The normalization for the image-to-text and text-to-image similarity is computed by summing over the potential matches (indexed by j) to the text and image examples within a batch, respectively. Note that τ is the temperature of the softmax used for the normalization.
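As a concrete reference, the following is a minimal PyTorch sketch of this symmetric contrastive objective. The L2 normalization of embeddings and the fixed temperature value are assumptions that follow the common CLIP recipe, not details taken from this paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     tau: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss over a batch.

    image_emb: [N, D] aggregated image embeddings (x_i).
    text_emb:  [N, D] text embeddings (y_i).
    tau:       softmax temperature (assumed fixed here).
    """
    # L2-normalize so the dot product is a cosine similarity (standard CLIP recipe).
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] similarity matrix; entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / tau

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Image-to-text and text-to-image cross entropies, as in the equation above
    # (cross_entropy with mean reduction supplies the 1/N factor).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return loss_i2t + loss_t2i
```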
3.2. Aggregation
The goal of the aggregation method is to collapse the image embedding from an [H, W, D] tensor to a D-dimensional vector. Average pooling across space is an established technique for ensuring that the final embedding is independent of the image resolution [99, 67], and has been adopted for CNN-based architectures in vision-language models [50]. Alternatively, maximum pooling has been explored, in particular with success for point clouds [83] and image-audio models [42]. Another approach, typical for ViT and borrowed from language modeling [25], is the class token (CLS), which is prepended to the image patch tokens [27]. A class token learns an embedding that aggregates information across all patch tokens in order to predict the image label. The class token may be used to summarize the content of an entire image for ViT-based models [85, 12]. Subsequent work in vision-language models has explored learned pooling strategies [15, 115], heuristically selecting a set of similar neighbors [118], or learning attention-based mechanisms [117].
In this work we systematically explore these aggregation
strategies. In early experiments we found that many complex
strategies for aggregation yielded poor results (App. A.2).
We found that the application of max pooling across the
spatial dimensions – while extremely simple – was also by
far the most effective (Sec. 4.5). We hypothesize that the
success of max pooling may be due to the gradient updates
being focused solely on a single spatial location, and not
spread across all spatial dimensions.
Why Max Pooling? In particular, the max pooling operation allows the pre-aggregation features (shaped [N, H, W, D]) to determine the spatial locations that receive gradient updates at each step, conditioned on the input images. Across different images containing a common object at different spatial locations, the model must select a conservative and minimal set of spatial locations for gradient updates. At the same time, given the cross-modal contrastive training objective, the aggregated feature of each such image must be aligned towards a common language concept (i.e. one related to the common object). We hypothesize that gradient updates at the common object's spatial location are the simplest optimization for the training objective in this case, leading to the observed perceptual grouping.
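To make the aggregation choices concrete, the sketch below collapses an [N, H, W, D] feature map into an [N, D] embedding with average pooling, max pooling, or a CLS token. This is a simplified illustration under our own naming, not the exact CLIPpy code.

```python
from typing import Optional
import torch

def aggregate(features: torch.Tensor,
              mode: str = "max",
              cls_token: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Collapse a [N, H, W, D] spatial feature map into a [N, D] embedding.

    mode: "avg" (average pool), "max" (max pool, the choice used by CLIPpy),
          or "cls" (return a separately computed class token of shape [N, D]).
    """
    n, h, w, d = features.shape
    tokens = features.reshape(n, h * w, d)       # [N, HW, D]
    if mode == "avg":
        return tokens.mean(dim=1)
    if mode == "max":
        # Only the arg-max location per channel receives gradient, which is
        # the property hypothesized above to encourage perceptual grouping.
        return tokens.max(dim=1).values
    if mode == "cls":
        assert cls_token is not None             # [N, D] from the ViT
        return cls_token
    raise ValueError(f"unknown aggregation mode: {mode}")
```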
3.3. Pretraining
Language Model. For a better sentence-level representation, we utilize pre-training from Sentence-T5 [78], which adapts a T5 encoder to sentence-level embedding using a contrastive objective. We select Sentence-T5 over auto-regressive models such as [25, 7] because this contrastive loss is aligned with our setup. The model is trained on the Stanford Natural Language Inference (SNLI) dataset with 275K examples focused on entailment questions [6, 32].
Image Model. We investigate initializing the image model with several methods. First, we investigate initializing the image model using supervised pre-training and removing the final layer used for logistic regression [37, 55]. We next investigate self-supervised methods derived from self-distillation (e.g. [12]). We focused on this latter direction because such models demonstrate impressive performance in terms of localization [12, 41]. All image pre-training is performed on the ImageNet-1K [23] dataset.
Suitable Visual Pre-training. The visual encoder representation space can be viewed as containing per-image features (post-aggregation) versus per-spatial-location features (pre-aggregation). We hypothesize that the semantic boundaries of this representation space should operate at the latter granularity to induce perceptual grouping. Furthermore, we suggest that initializations facilitating the former will be detrimental to grouping behaviour. In particular, visual pre-training strategies that separate image-level representations by semantics (e.g. supervised ImageNet pre-training) will diminish perceptual grouping, while self-supervised pre-training strategies focused on more granular within-image representations (e.g. [12]) will tend to enhance it. This hypothesis is empirically validated in ablations (see Table 8).
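For reference, one plausible way to realize the self-supervised initialization discussed above is to load DINO [12] ViT-B/16 weights from the public torch.hub entry point of the DINO repository. The exact checkpoint and name mapping used for CLIPpy may differ; treat this as an illustrative sketch.

```python
import torch

def init_image_encoder_from_dino(image_encoder: torch.nn.Module) -> None:
    """Initialize a ViT-B/16 image encoder from DINO self-supervised weights.

    Uses the public torch.hub entry point from facebookresearch/dino; the
    paper's exact checkpoint handling is not specified here and may differ.
    """
    dino_vit = torch.hub.load("facebookresearch/dino:main", "dino_vitb16")
    dino_state = dino_vit.state_dict()

    # Copy parameters whose names and shapes match, leaving the rest
    # (e.g. projection layers added for contrastive training) at their
    # default initialization.
    own = image_encoder.state_dict()
    compatible = {k: v for k, v in dino_state.items()
                  if k in own and own[k].shape == v.shape}
    own.update(compatible)
    image_encoder.load_state_dict(own)
```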
3.4. Visual Token Sub-Sampling
Motivated by vision transformers' ability to process sequences of a length different from that seen at train time, we generate higher-resolution segmentations during inference by sampling more image patches. In order to increase robustness to such varying resolution, we utilize up to 2× higher-resolution images during training but randomly drop 80% of visual tokens to minimize additional compute overhead (similar to [43, 62]). While improving segmentations, this also provides training stability, possibly due to its regularizing effect (see App. D).
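A minimal sketch of this token-dropping scheme is shown below; the function name, the uniform sampling distribution, and the assumption that positional embeddings are added before dropping are illustrative choices rather than details from the paper.

```python
import torch

def subsample_tokens(tokens: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """Randomly keep `keep_ratio` of visual patch tokens during training.

    tokens: [N, L, D] patch tokens (positional embeddings already added)
            from an up-to-2x higher-resolution image.
    Returns [N, L_keep, D] with L_keep = int(L * keep_ratio),
    i.e. roughly 80% of tokens dropped for keep_ratio = 0.2.
    """
    n, l, d = tokens.shape
    l_keep = max(1, int(l * keep_ratio))
    # Independent random permutation per example; keep the first l_keep indices.
    idx = torch.rand(n, l, device=tokens.device).argsort(dim=1)[:, :l_keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))
```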
3.5. Inference
CLIPpy performs inference under 3 different settings: (a) classification, (b) bottom-up grouping, and (c) top-down grouping. On the visual modality, the first utilizes a spatially aggregated single per-image token, while the latter two utilize sets of per-region tokens. Classification follows the zero-shot analyses from [85], where the model is prompted at inference for a selection of labels (see App. I for prompts). Bottom-up grouping follows a form of spectral clustering inspired by [12] (refer to their demo): PCA on image features (from the visual encoder, pre-aggregation) gives the top n (=8) principal components, which are used as cluster centers; each of those same image features is then assigned to one of the n clusters based on proximity (cosine similarity) to the centers, resulting in n clusters (or groups). Top-down grouping employs zero-shot analysis similar to [85], but at each spatial location, using the per-region tokens. This is similar to [35] and generates predictions across space, exploiting the transitive property of our aggregation operations.
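The two grouping modes can be illustrated as follows; the function names and the assumption of a single [H, W, D] feature map per image are ours, and prompt construction for the label embeddings is left outside the sketch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def topdown_segment(spatial_features: torch.Tensor,
                    label_embeddings: torch.Tensor) -> torch.Tensor:
    """Zero-shot per-location labeling (top-down grouping).

    spatial_features: [H, W, D] pre-aggregation features for one image.
    label_embeddings: [C, D] text embeddings, one per candidate label
                      (e.g. from prompts such as "a photo of a {label}").
    Returns an [H, W] map of predicted label indices.
    """
    h, w, d = spatial_features.shape
    feats = F.normalize(spatial_features.reshape(-1, d), dim=-1)   # [HW, D]
    labels = F.normalize(label_embeddings, dim=-1)                 # [C, D]
    sims = feats @ labels.t()                                      # [HW, C]
    return sims.argmax(dim=-1).reshape(h, w)

@torch.no_grad()
def bottomup_group(spatial_features: torch.Tensor, n_groups: int = 8) -> torch.Tensor:
    """Class-agnostic grouping (bottom-up): PCA components as cluster centers.

    spatial_features: [H, W, D] pre-aggregation features for one image.
    Returns an [H, W] map of group indices in [0, n_groups).
    """
    h, w, d = spatial_features.shape
    feats = spatial_features.reshape(-1, d)                        # [HW, D]
    # Top-n principal directions of the features serve as cluster centers.
    _, _, v = torch.pca_lowrank(feats, q=n_groups)                 # v: [D, n]
    centers = F.normalize(v.t(), dim=-1)                           # [n, D]
    sims = F.normalize(feats, dim=-1) @ centers.t()                # [HW, n]
    return sims.argmax(dim=-1).reshape(h, w)
```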
4. Experiments
Experimental Setup. We train our models on two datasets: Conceptual Captions 12M (CC-12M) [14] and High Quality Image Text Pairs (HQITP-134M), consisting of 12 million and 134 million image-text pairs, respectively (see App. C for details). For both datasets, text is tokenized, and images are resized and center cropped to 224×224 pixels. We report results on EfficientNet-B5 as employed by ALIGN [50], and ViT-B/16 as employed by CLIP [85], although we focus more on the latter. We train models on 32 GPUs across 4 machines with PyTorch [80]. See App. D for more details.
We evaluate across image classification, localization, and robustness tasks. For image classification, we employ the validation splits of ImageNet [23] and ImageNet-v2 [87], and for robustness we employ the test split of Waterbirds [91]. These datasets contain 1000, 1000, and 3 classes, respectively. For segmentation tasks, we employ the validation splits of PASCAL VOC [30], ADE20K [125, 17], COCO [64], COCO (Obj) [64], and Cityscapes [20]. These datasets contain 20, 150, 133, 80, and 27 labels, respectively.
Baselines for comparison. Given that most competitive baselines are trained on private datasets, we first attempt to reproduce results by training models on a corpus of image-text pairs. In more detail, we train on the public CC-12M dataset [14] to provide reproducible numbers and observe competitive performance given our data limitations. We also train on the larger HQITP-134M dataset to verify scalability.
We first measure the performance of CLIP [85] and ALIGN [50] on zero-shot image classification on ImageNet and ImageNet-v2. Table 2 highlights these results. We take this as a starting point for subsequent work. In the following experiments we attempt to address the following questions:
• What are the limitations of current vision-language models? (Fig. 1)
• Do we observe perceptual grouping in vision-language models? (Tabs. 3, 4 and 6)
• How resilient are vision-language models to counterfactual manipulations? (Fig. 4)
• How important is each of the proposed model modifications? (Tabs. 7 to 10)
Model            Dataset        IN     IN-v2
ALIGN [50]       ALIGN-1800M    76.4   70.1
CLIP [85]        CLIP-400M      65.5   60.8
CLIP†            CC-12M         46.0   40.3
GroupViT [113]   CC-12M+YFCC    42.9   -
GroupViT†        CC-12M         25.6   23.8
CLIPpy           CC-12M         45.3   40.0
ALIGN†           HQITP-134M     51.1   45.6
CLIP†            HQITP-134M     61.4   56.4
CLIPpy           HQITP-134M     60.3   54.8

Table 2: CLIPpy achieves competitive zero-shot image recognition. IN and IN-v2 denote ImageNet and ImageNet-v2 accuracy, respectively. † indicates our implementation. [50] is evaluated at 640×640; others are evaluated at 224×224. CLIPpy shows ±0.5 and ±0.9 IN accuracy (5 runs) on CC-12M and HQITP-134M, respectively.
4.1. Limitations of vision-language models
Visual representations learned in vision-language models exhibit an impressive ability to generalize across tasks [85, 50]. However, they also exhibit a profound shortcoming – the learned visual representations maintain minimal information about where an object resides, failing to properly recognize which parts of an image constitute an object.
Fig. 1 (bottom row) showcases a failure of a CLIP model; namely, the model improperly conflates visual content not associated with an object with the actual object. This can be observed by measuring the similarity of the embedding at each spatial location with a label set, using the method in [35] (Sec. 3.5). One consistently observes that the central object of interest is incorrectly predicted to reside at every spatial location. For instance, in the left example, the CLIP model predicts that a bird resides at every spatial location. In a CNN architecture, where spatial information is inherently preserved, we observe some improvement, but the larger issue of poor localization remains (see App. E for details).
This failure of vision-language models to properly understand the spatial organization of information is consistent with earlier observations. Ablation experiments in ViT models demonstrated that removing positional embeddings minimally detriments predictive performance [27, 77, 121, 97]. Without positional information, ViT models effectively learn representations as a "bag of image patches", ignoring the spatial organization.
In contrast, if we perform the same analysis on CLIPpy, we see that the model retains significant spatial information (Fig. 1, 3rd row). We take these visualizations as an impetus for further investigation. In particular, we start by quantifying the ability of the model to arbitrarily group together semantically related pixels, and compare this to previous works.