Perceptual Grouping in Contrastive Vision-Language Models
Kanchana Ranasinghe*, Brandon McKinzie, Sachin Ravi,
Yinfei Yang, Alexander Toshev, Jonathon Shlens
Apple
kranasinghe@cs.stonybrook.edu
Abstract
Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentation, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
1. Introduction
Learning a representation for visual imagery requires resolving not only what resides within an image, but also where that information resides [72]. In many applications, knowledge of where information resides is sometimes more important than a precise description of the content [33, 98]. Hence, our ability to learn more generic and robust visual representations requires learning the geometry of visual semantics, and how visual information may be grounded by specific regions of the visual field.
Figure 1: Semantic localization in contrastive VLMs. We measure the ability of vision-language models to predict a label at each spatial position in a zero-shot manner, based on the similarity of location tokens to the corresponding language tokens, on selected examples. CLIP / ALIGN [50, 85] have minimal understanding of the spatial location of individual objects (row 4). Our proposed CLIPpy (row 3) predicts the label at locations that correspond closely to human annotation for semantic segmentation (row 2). All predictions were performed with no access to any segmentation data during training or inference. More visualizations in App. B.

*Work performed as part of an Apple internship. †Work performed at Apple.

While recent vision-language models trained under weak supervision demonstrate a remarkable ability to learn generic and transferable visual representations [50, 85, 117, 24], they
showcase a profound inability to associate visual content with individual objects (Fig. 1, bottom row). In other words, models trained on large weakly-supervised data have a limited ability to group together visually related content [36]. Because the representations have a poor understanding of where an object resides, they easily conflate background with foreground content. Hence, the learned representations are unable to learn the spatial layout of a scene [97, 101], and are susceptible to learning spurious correlations between a semantic label and extraneous content [91, 65].
Recent work [113, 114] attempts to bridge this gap through grouping mechanisms under the same weakly supervised training paradigm, but focuses more on foreground objects (neglecting background classes). Another direction is task-specific unsupervised fine-tuning [126, 26], which loses the generic and transferable nature of these representations.
In this work, we explore vision-language models that learn from similar weakly labeled data, but (a) retain the generic and transferable nature of features, and (b) learn where all (background and foreground) visual content resides within an image. Unlike previous attempts using grouping-specific architectures [113, 114] or dense human annotations [36, 38, 57], we explore a minimal set of modifications to existing CLIP models [85] that leads to grouping of visual imagery while retaining their weakly supervised and scalable training procedure. We find that two small adjustments – employing specific pretraining strategies and adjusting spatial feature aggregation – result in models that are equally effective in zero-shot image recognition, but also retain spatial information regarding object locations (see Fig. 1, 3rd row).
The resulting model, termed CLIPpy, exhibits perceptual grouping – that is, the ability to select and combine related visual signals into semantically meaningful regions [110, 72, 89]. Endowing learned representations with perceptual grouping – whether in a bottom-up (based solely on visual content) or top-down (guided by external information, language in this case) manner – has been a long-standing goal in computer vision [70, 71]. In this work, our key contributions are as follows:
• We identify a systematic failure of contrastive vision-language models [85, 50] to properly identify where objects reside within an image and to group semantically related content.
• We design a minimal set of changes that endows these models with perceptual grouping, resulting in state-of-the-art zero-shot segmentation without training on any segmentation data or performing task-specific fine-tuning.
• We show that the emergence of localization ability in our models uniquely leads to robustness to counterfactual manipulations. The degree of robustness matches, if not surpasses, that of previous state-of-the-art supervised learning methods employing specialized training methodologies.
2. Related Work
Vision-language models for grounding. Contrastive language image pre-training (CLIP) [85] led to a range of follow-up work performing open-vocabulary detection [38, 51, 58, 59, 120, 28] or segmentation [36, 57, 122].
While these methods leverage dense human annotations for training, an alternate line of work [113, 114, 126, 115, 22] attempts to learn alignment between regions of images and language with only image-level noisy captions for supervision. This weak supervision allows better scalability (to more data), leading to more generic and transferable representations. In fact, multiple such works [113, 114, 126, 26, 57] perform zero-shot semantic segmentation. However, unlike [113, 114], which are geared to segment a fixed count of foreground objects, our proposed CLIPpy can better segment arbitrary object counts and background classes. In contrast to [126], which uses generic image-level features, CLIPpy explicitly learns local features during training. Moreover, CLIPpy requires no dense human annotations or task-specific fine-tuning, in contrast to [26, 57]. We also highlight that [113, 114, 26] perform grouping independent of language at inference; in contrast, CLIPpy can group conditioned on language, capturing variable object boundaries for different language prompts.

Component        CLIP [85]   CLIP†       CLIPpy
Image Backbone   ViT-B/16    ViT-B/16    ViT-B/16
Text Backbone    T-B         T-5         T-5
Image Init       Random      Random      DINO
Text Init        Random      Random      Sent. T-5
Image Pooling    CLS         CLS         Max
Text Pooling     Avg         Avg         Avg
Dataset          400M*       CC-12M      CC-12M
VOC mIoU (%)     16.4        17.5        50.8 (+33.3)
VOC JS (%)       28.6        37.3        47.5 (+10.2)

Table 1: We highlight the minimal differences of CLIPpy from CLIP. CLIP† is our implementation following training settings identical to CLIPpy. * indicates OpenAI private data.
Multiple contemporary works also explore directions similar to CLIPpy, leveraging pre-trained vision-language models for various grouping tasks under weak supervision (no pixel-level annotation) [123, 68, 13, 75, 9, 52]. Combining self-supervised methods in which grouping emerges [12] with CLIP models [85] for cross-modal alignment is explored in [123], gaining notable improvements at object boundaries. A clustering mechanism containing learnable centres, similar to [113], is combined with reconstruction and super-pixel alignment losses to achieve grouping in [68]. Learning decoder networks over a frozen CLIP backbone [85] with text-to-image-patch similarity losses is explored in [13, 75], resulting in similar grouping behaviour. In contrast to these methods, which rely on contrastive vision-language training for grouping to emerge, recent works [9, 52] also showcase how text-to-image generative models (particularly Stable Diffusion [90]) can be leveraged to perform visual grouping.
Zero-shot semantic segmentation. A form of top-down grouping, this relatively new task [124, 48, 111, 8, 79, 45, 60, 3, 95] attempts to segment unseen classes, usually after a supervised training phase often involving dense-annotation-based supervision. Following two early representative works [111, 8], most later approaches [60, 39, 40, 53, 95, 102] formulate the task as a pixel-level zero-shot classification problem with a closed-set vocabulary. While CLIPpy follows a similar pixel-based formulation, in contrast, our method requires no dense human annotations for supervision, no task-specific fine-tuning, and is open-vocabulary. Recent work [26, 58] also explores region-level classification leveraging pre-trained CLIP models [85], but unlike CLIPpy performs grouping independent of language during inference.
[Figure 2 diagram: image encoder → H × W × D spatial representation → aggregation across space → 1 × D image embedding; text encoder ("bird") → 1 × D text embedding; contrastive loss between the two. Highlighted choices: image encoder initialization (random / supervised / self-supervised pretraining) and text encoder initialization (random / T5 pretraining).]
Figure 2: Architecture diagram. Images and captions are separately embedded into Euclidean spaces, where image features are spatially aggregated. A contrastive loss trains the aggregated image embedding to be close to the caption embedding. We demonstrate that two minimal design decisions (indicated in green) are of paramount importance for CLIP [85] models to perform perceptual grouping under image-level weak supervision.
Unsupervised segmentation. Analogous to bottom-up grouping, these works perform class-agnostic segmentation within the visual modality with no explicit language alignment [12, 41, 31, 49, 73]. This topic has a long, rich history in human visual perception [110] and computer vision [70], and has been explored as a means of generalizing to new visual domains [84, 71]. It is this goal that most closely inspires our work. Early efforts group pixels based on known spatially-local affinities [19, 96, 88], with subsequent methods leading to region proposal networks for object detection [103] and advances in semantic segmentation [1]. Recent methods employ self-supervision to learn perceptual grouping [18, 41] or object-centric groupings [29, 66, 109, 4, 44]. Our proposed CLIPpy demonstrates competitive performance, but additionally aligns groups to the language modality explicitly.
Learning robust visual representations. For a long time, ImageNet [23] accuracy was believed to provide a reasonable proxy for the quality of learned visual representations [37, 55]. However, recent work highlights notable deficiencies in such learned representations [34, 87, 54], including sensitivity to low-level textures, failure under domain shift, and reliance on spurious correlations. These failures inspired a large literature on mitigating spurious correlations [91, 65, 2] by focusing on new optimization techniques. Progress on this issue may address parallel issues in fairness [21]. Resulting methods have largely focused on synthetic data, re-balancing data, and shaping learned embeddings [76, 65]. Nonetheless, theoretical results suggest pessimistic bounds unless additional structure informs the problem (see refs. in [91]). Therein, the structured output predictions of the proposed CLIPpy provide another promising solution.
3. Methodology
We first set the stage by discussing established core architectures and the contrastive learning formulation. Next, we discuss the modifications that are the focus of the analysis in this work. In particular, we discuss aggregation options, pre-training alternatives, and token sub-sampling.
3.1. Architecture and Training
We provide a quick overview of our architecture (Fig. 2). Consider a batch size N, spatial height H, spatial width W, and depth D. X is a tensor of shape [N, H, W, D] and is the output of an image encoder. Y is a tensor of shape [N, D] and is the output of a text encoder.
Language Model. We employ a strong language model baseline derived from the transformer architecture [104] and implemented in T5 [86]. T5 models use an encoder-decoder architecture trained with a generative span-corruption task, and have achieved state-of-the-art results on a broad range of NLP tasks including GLUE [106] and SuperGLUE [105]. We use only the encoder and discard the decoder. We employ T5-Base, which consists of 12 transformer layers, 12 attention heads, and a token channel dimension of 768.
Image Model. We explore two architectures for image featurization, CNN-based and Vision Transformers, although we focus the majority of the work on the latter. First, we employ the EfficientNet architecture [100] as a highly performant CNN architecture, which has been used previously in vision-language models; the specifics of the meta-architecture were derived from considerations based on neural architecture search. Second, we employ the Vision Transformer (ViT) architecture [27]. We refer the reader to [27, 104] for details. Briefly, ViT is largely inherited from the NLP literature and consists of a hierarchical associative memory. Each layer, termed a transformer, is composed of a Multi-headed Self-Attention (MSA) layer followed by a 2-layer feed-forward multi-layer perceptron (MLP). The primary parameter of ViT is the patch size P, specifying the P×P patch of pixels constituting a token in the architecture.
Contrastive Representation Learning. Let x_i and y_i denote the image and text embeddings (post aggregation) of the i-th example in the batch. A contrastive loss may be specified as the cross entropy across a batch [85, 50]. The cross entropy is calculated between a one-hot encoding specifying the correspondence between the image and text examples, and a softmax-normalized distribution specifying the dot-product similarity between image and text embeddings:
L = \underbrace{-\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(x_i^\top y_i/\tau)}{\sum_{j=1}^{N}\exp(x_i^\top y_j/\tau)}}_{\text{image-to-text}} \; \underbrace{-\;\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(y_i^\top x_i/\tau)}{\sum_{j=1}^{N}\exp(y_i^\top x_j/\tau)}}_{\text{text-to-image}}
The normalization for the image-to-text and text-to-image similarity is computed by summing over the potential matches (indexed by j) to the text and image examples within a batch, respectively. Note that τ is the temperature of the softmax used for the normalization.
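As a concrete reference, the following is a minimal PyTorch sketch of this symmetric contrastive objective. The L2 normalization of embeddings and the fixed temperature value are assumptions that follow the common CLIP recipe, not details taken from this paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     tau: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss over a batch.

    image_emb: [N, D] aggregated image embeddings (x_i).
    text_emb:  [N, D] text embeddings (y_i).
    tau:       softmax temperature (assumed fixed here).
    """
    # L2-normalize so the dot product is a cosine similarity (standard CLIP recipe).
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] similarity matrix; entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / tau

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Image-to-text and text-to-image cross entropies, as in the equation above
    # (cross_entropy with mean reduction supplies the 1/N factor).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return loss_i2t + loss_t2i
```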
3.2. Aggregation
The goal of the aggregation method is to collapse the image embedding from an [H, W, D] tensor to a D-dimensional vector. Average pooling across space is an established technique for ensuring that the final embedding is independent of the image resolution [99, 67], and has been adopted for CNN-based architectures in vision-language models [50]. Alternatively, maximum pooling has been explored, in particular with success for point clouds [83] and image-audio models [42]. Another approach, typical for ViT and borrowed from language modeling [25], is the class token (CLS), which is prepended to the image patch tokens [27]. A class token learns an embedding that aggregates information across all patch tokens in order to predict the image label. The class token may be used to summarize the content of an entire image for ViT-based models [85, 12]. Subsequent work in vision-language models has explored learned pooling strategies [15, 115], heuristically selecting a set of similar neighbors [118], or learning attention-based mechanisms [117].
In this work we systematically explore these aggregation
strategies. In early experiments we found that many complex
strategies for aggregation yielded poor results (App. A.2).
We found that the application of max pooling across the
spatial dimensions – while extremely simple – was also by
far the most effective (Sec. 4.5). We hypothesize that the
success of max pooling may be due to the gradient updates
being focused solely on a single spatial location, and not
spread across all spatial dimensions.
Why Max Pooling? In particular, the max pooling operation allows the pre-aggregation features (shaped [N, H, W, D]) to determine the spatial locations that receive gradient updates at each step, conditioned on the input images. Across different images containing a common object at different spatial locations, the model must select a conservative and minimal set of spatial locations for gradient updates. At the same time, given the cross-modal contrastive training objective, the aggregated feature of each such image must be aligned towards a common language concept (i.e. one related to the common object). We hypothesize that gradient updates at the common object's spatial location are the simplest optimization for the training objective in this case, leading to the observed perceptual grouping.
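To make the aggregation choices concrete, the sketch below collapses an [N, H, W, D] feature map into an [N, D] embedding with average pooling, max pooling, or a CLS token. This is a simplified illustration under our own naming, not the exact CLIPpy code.

```python
from typing import Optional
import torch

def aggregate(features: torch.Tensor,
              mode: str = "max",
              cls_token: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Collapse a [N, H, W, D] spatial feature map into a [N, D] embedding.

    mode: "avg" (average pool), "max" (max pool, the choice used by CLIPpy),
          or "cls" (return a separately computed class token of shape [N, D]).
    """
    n, h, w, d = features.shape
    tokens = features.reshape(n, h * w, d)       # [N, HW, D]
    if mode == "avg":
        return tokens.mean(dim=1)
    if mode == "max":
        # Only the arg-max location per channel receives gradient, which is
        # the property hypothesized above to encourage perceptual grouping.
        return tokens.max(dim=1).values
    if mode == "cls":
        assert cls_token is not None             # [N, D] from the ViT
        return cls_token
    raise ValueError(f"unknown aggregation mode: {mode}")
```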
3.3. Pretraining
Language Model. For a better sentence-level representation, we utilize pre-training from Sentence-T5 [78], which adapts a T5 encoder to sentence-level embedding using a contrastive objective. We select Sentence-T5 over auto-regressive models such as [25, 7] because this contrastive loss is aligned with our setup. The model is trained on the Stanford Natural Language Inference (SNLI) dataset with 275K examples focused on entailment questions [6, 32].
Image Model. We investigate initializing the image model with several methods. First, we investigate initializing the image model using supervised pre-training and removing the final layer used for logistic regression [37, 55]. We next investigate self-supervised methods derived from self-distillation (e.g. [12]). We focused on this latter direction because such models demonstrate impressive performance in terms of localization [12, 41]. All image pre-training is performed on the ImageNet-1K [23] dataset.
Suitable Visual Pre-training. The visual encoder representation space can be viewed as containing per-image features (post-aggregation) versus per-spatial-location features (pre-aggregation). We hypothesize that the semantic boundaries of this representation space should operate at the latter granularity to induce perceptual grouping. Furthermore, we suggest that initializations facilitating the former will be detrimental to grouping behaviour. In particular, visual pre-training strategies that separate image-level representations by semantics (e.g. supervised ImageNet pre-training) will diminish perceptual grouping, while self-supervised pre-training strategies focused on more granular within-image representations (e.g. [12]) will tend to enhance it. This hypothesis is empirically validated in ablations (see Table 8).
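For reference, one plausible way to realize the self-supervised initialization discussed above is to load DINO [12] ViT-B/16 weights from the public torch.hub entry point of the DINO repository. The exact checkpoint and name mapping used for CLIPpy may differ; treat this as an illustrative sketch.

```python
import torch

def init_image_encoder_from_dino(image_encoder: torch.nn.Module) -> None:
    """Initialize a ViT-B/16 image encoder from DINO self-supervised weights.

    Uses the public torch.hub entry point from facebookresearch/dino; the
    paper's exact checkpoint handling is not specified here and may differ.
    """
    dino_vit = torch.hub.load("facebookresearch/dino:main", "dino_vitb16")
    dino_state = dino_vit.state_dict()

    # Copy parameters whose names and shapes match, leaving the rest
    # (e.g. projection layers added for contrastive training) at their
    # default initialization.
    own = image_encoder.state_dict()
    compatible = {k: v for k, v in dino_state.items()
                  if k in own and own[k].shape == v.shape}
    own.update(compatible)
    image_encoder.load_state_dict(own)
```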
3.4. Visual Token Sub-Sampling
Motivated by vision transformers' ability to process sequences of a length different from that seen at train time, we generate higher-resolution segmentations during inference by sampling more image patches. In order to increase robustness to such varying resolution, we utilize up to 2× higher-resolution images during training but randomly drop 80% of visual tokens to minimize additional compute overhead (similar to [43, 62]). While improving segmentations, this also provides training stability, possibly due to its regularizing effect (see App. D).
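A minimal sketch of this token-dropping scheme is shown below; the function name, the uniform sampling distribution, and the assumption that positional embeddings are added before dropping are illustrative choices rather than details from the paper.

```python
import torch

def subsample_tokens(tokens: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """Randomly keep `keep_ratio` of visual patch tokens during training.

    tokens: [N, L, D] patch tokens (positional embeddings already added)
            from an up-to-2x higher-resolution image.
    Returns [N, L_keep, D] with L_keep = int(L * keep_ratio),
    i.e. roughly 80% of tokens dropped for keep_ratio = 0.2.
    """
    n, l, d = tokens.shape
    l_keep = max(1, int(l * keep_ratio))
    # Independent random permutation per example; keep the first l_keep indices.
    idx = torch.rand(n, l, device=tokens.device).argsort(dim=1)[:, :l_keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))
```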
3.5. Inference
CLIPpy performs inference under 3 different settings: (a) classification, (b) bottom-up grouping, and (c) top-down grouping. On the visual modality, the first utilizes a spatially aggregated single per-image token, while the latter two utilize sets of per-region tokens. Classification follows the zero-shot analyses from [85], where the model is prompted at inference for a selection of labels (see App. I for prompts). Bottom-up grouping follows a form of spectral clustering inspired by [12] (refer to their demo): PCA on image features (from the visual encoder, pre-aggregation) gives the top n (=8) principal components, which are used as cluster centers; each of those same image features is then assigned to one of the n clusters based on proximity (cosine similarity) to the centers, resulting in n clusters (or groups). Top-down grouping employs zero-shot analysis similar to [85], but at each spatial location, using the per-region tokens. This is similar to [35] and generates predictions across space, exploiting the transitive property of our aggregation operations.
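The two grouping modes can be illustrated as follows; the function names and the assumption of a single [H, W, D] feature map per image are ours, and prompt construction for the label embeddings is left outside the sketch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def topdown_segment(spatial_features: torch.Tensor,
                    label_embeddings: torch.Tensor) -> torch.Tensor:
    """Zero-shot per-location labeling (top-down grouping).

    spatial_features: [H, W, D] pre-aggregation features for one image.
    label_embeddings: [C, D] text embeddings, one per candidate label
                      (e.g. from prompts such as "a photo of a {label}").
    Returns an [H, W] map of predicted label indices.
    """
    h, w, d = spatial_features.shape
    feats = F.normalize(spatial_features.reshape(-1, d), dim=-1)   # [HW, D]
    labels = F.normalize(label_embeddings, dim=-1)                 # [C, D]
    sims = feats @ labels.t()                                      # [HW, C]
    return sims.argmax(dim=-1).reshape(h, w)

@torch.no_grad()
def bottomup_group(spatial_features: torch.Tensor, n_groups: int = 8) -> torch.Tensor:
    """Class-agnostic grouping (bottom-up): PCA components as cluster centers.

    spatial_features: [H, W, D] pre-aggregation features for one image.
    Returns an [H, W] map of group indices in [0, n_groups).
    """
    h, w, d = spatial_features.shape
    feats = spatial_features.reshape(-1, d)                        # [HW, D]
    # Top-n principal directions of the features serve as cluster centers.
    _, _, v = torch.pca_lowrank(feats, q=n_groups)                 # v: [D, n]
    centers = F.normalize(v.t(), dim=-1)                           # [n, D]
    sims = F.normalize(feats, dim=-1) @ centers.t()                # [HW, n]
    return sims.argmax(dim=-1).reshape(h, w)
```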
4. Experiments
Experimental Setup. We train our models on two datasets: Conceptual Captions 12M (CC-12M) [14] and High Quality Image Text Pairs (HQITP-134M), consisting of 12 million and 134 million image-text pairs, respectively (see App. C for details). For both datasets, text is tokenized, and images are resized and center cropped to 224×224 pixels. We report results on EfficientNet-B5 as employed by ALIGN [50], and ViT-B/16 as employed by CLIP [85], although we focus more on the latter. We train models on 32 GPUs across 4 machines with PyTorch [80]. See App. D for more details.
We evaluate across image classification, localization, and robustness tasks. For image classification, we employ the validation splits of ImageNet [23] and ImageNet-v2 [87], and for robustness we employ the test split of Waterbirds [91]. These datasets contain 1000, 1000, and 3 classes, respectively. For segmentation tasks, we employ the validation splits of PASCAL VOC [30], ADE20K [125, 17], COCO [64], COCO (Obj) [64], and Cityscapes [20]. These datasets contain 20, 150, 133, 80, and 27 labels, respectively.
Baselines for comparison. Given that most competitive baselines are trained on private datasets, we first attempt to reproduce results by training models on a corpus of image-text pairs. In more detail, we train on the public CC-12M dataset [14] to provide reproducible numbers and observe competitive performance given our data limitations. We also train on the larger HQITP-134M dataset to verify scalability.
We first measure the performance of CLIP [85] and ALIGN [50] on zero-shot image classification on ImageNet and ImageNet-v2. Table 2 highlights these results. We take this as a starting point for subsequent work. In the following experiments we attempt to address the following questions:
• What are the limitations of current vision-language models? (Fig. 1)
• Do we observe perceptual grouping in vision-language models? (Tabs. 3, 4 and 6)
• How resilient are vision-language models to counterfactual manipulations? (Fig. 4)
• How important is each of the proposed model modifications? (Tabs. 7 to 10)
Model            Dataset        IN     IN-v2
ALIGN [50]       ALIGN-1800M    76.4   70.1
CLIP [85]        CLIP-400M      65.5   60.8
CLIP†            CC-12M         46.0   40.3
GroupViT [113]   CC-12M+YFCC    42.9   -
GroupViT†        CC-12M         25.6   23.8
CLIPpy           CC-12M         45.3   40.0
ALIGN†           HQITP-134M     51.1   45.6
CLIP†            HQITP-134M     61.4   56.4
CLIPpy           HQITP-134M     60.3   54.8

Table 2: CLIPpy achieves competitive zero-shot image recognition. IN and IN-v2 denote ImageNet and ImageNet-v2 accuracy, respectively. † indicates our implementation. [50] is evaluated at 640×640; others are evaluated at 224×224. CLIPpy shows ±0.5 and ±0.9 IN accuracy (5 runs) on CC-12M and HQITP-134M, respectively.
4.1. Limitations of vision-language models
Visual representations learned in vision-language models exhibit an impressive ability to generalize across tasks [85, 50]. However, they also exhibit a profound shortcoming – the learned visual representations maintain minimal information about where an object resides, failing to properly recognize which parts of an image constitute an object.
Fig. 1 (bottom row) showcases a failure of a CLIP model; namely, the model improperly conflates visual content not associated with an object with the actual object. This can be observed by measuring the similarity of the embedding at each spatial location with a label set, using the method in [35] (Sec. 3.5). One consistently observes that the central object of interest is incorrectly predicted to reside at every spatial location. For instance, in the left example, the CLIP model predicts that a bird resides at every spatial location. In a CNN architecture, where spatial information is inherently preserved, we observe some improvement, but the larger issue of poor localization remains (see App. E for details).
This failure of vision-language models to properly understand the spatial organization of information is consistent with earlier observations. Ablation experiments in ViT models demonstrated that removing positional embeddings minimally detriments predictive performance [27, 77, 121, 97]. Without positional information, ViT models effectively learn representations as a "bag of image patches", ignoring the spatial organization.
In contrast, if we perform the same analysis on CLIPpy, we see that the model retains significant spatial information (Fig. 1, 3rd row). We take these visualizations as an impetus for further investigation. In particular, we start by quantifying the ability of the model to arbitrarily group together semantically related pixels, and compare this to previous works.