
objects (neglecting background classes). Another direction is task-specific unsupervised fine-tuning [126, 26], which loses the generic and transferable nature of these representations.
In this work, we explore vision-language models that learn from similar weakly labeled data, but a) retain the generic and transferable nature of features, and b) learn where all (background and foreground) visual content resides within an image. Unlike previous attempts using grouping-specific architectures [113, 114] or dense human annotations [36, 38, 57], we explore a minimal set of modifications to existing CLIP models [85] that leads to grouping of visual imagery while retaining their weakly supervised and scalable training procedure. We find that two small adjustments, employing specific pretraining strategies and adjusting spatial feature aggregation, result in models that are equally effective in zero-shot image recognition but also retain spatial information regarding object locations (see Fig. 1, 3rd row).
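As a concrete illustration of the second adjustment, the sketch below contrasts the standard CLS-token aggregation with per-channel max pooling over the ViT patch tokens (cf. the Image Pooling row in Tab. 1). It is a minimal sketch rather than our training code; the tensor shapes and the function name are assumptions made only for illustration.

```python
# Minimal sketch of the spatial aggregation change (illustrative only;
# shapes and the function name are assumptions, not the training code).
import torch

def pool_image_features(cls_token: torch.Tensor,     # (B, D) class token
                        patch_tokens: torch.Tensor,  # (B, N, D) spatial tokens
                        mode: str = "max") -> torch.Tensor:
    """Aggregate ViT outputs into a single image embedding."""
    if mode == "cls":                    # standard CLIP-style aggregation
        return cls_token
    if mode == "max":                    # per-channel max over spatial tokens
        return patch_tokens.max(dim=1).values
    raise ValueError(f"unknown pooling mode: {mode}")
```

Because the max is taken independently per channel, individual patch tokens contribute directly to the pooled embedding used by the contrastive loss, which encourages the spatial tokens themselves to stay aligned with the text embedding space.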
The resulting model, termed CLIPpy, exhibits perceptual grouping: the ability to select and combine related visual signals into semantically meaningful regions [110, 72, 89]. Endowing learned representations with perceptual grouping, whether in a bottom-up (based solely on visual content) or top-down (guided by external information, language in this case) manner, has been a long-standing goal in computer vision [70, 71]. In this work, our key contributions are as follows:
• We identify a systematic failure of contrastive vision-language models [85, 50] to properly identify where objects reside within an image and to group semantically related content.
• We design a minimal set of changes to endow these models with perceptual grouping, resulting in state-of-the-art zero-shot segmentation without training on any segmentation data or performing task-specific fine-tuning.
• The localization ability that emerges in our models uniquely leads to robustness to counterfactual manipulations. The degree of robustness matches, if not surpasses, that of previous state-of-the-art supervised learning methods employing specialized training methodologies.
2. Related Work
Vision-language models for grounding. Contrastive language-image pre-training (CLIP) [85] led to a range of follow-up work performing open-vocabulary detection [38, 51, 58, 59, 120, 28] or segmentation [36, 57, 122].
While these methods leverage dense human annotations for training, an alternate line of work [113, 114, 126, 115, 22] attempts to learn alignment between regions of images and language using only noisy image-level captions for supervision. This weak supervision allows better scalability (to more data), leading to more generic and transferable representations. In fact, multiple such works [113, 114, 126, 26, 57] perform zero-shot semantic segmentation.
However, unlike [113, 114], which are geared to segmenting a fixed count of foreground objects, our proposed CLIPpy can better segment arbitrary object counts and background classes. In contrast to [126], which uses generic image-level features, CLIPpy explicitly learns local features during training. Moreover, CLIPpy requires no dense human annotations or task-specific fine-tuning, in contrast to [26, 57]. We also highlight that [113, 114, 26] perform grouping independent of language at inference, whereas CLIPpy can group conditioned on language, capturing variable object boundaries for different language prompts.

Component        CLIP [85]   CLIP†      CLIPpy
Image Backbone   ViT-B/16    ViT-B/16   ViT-B/16
Text Backbone    T-B         T-5        T-5
Image Init       Random      Random     DINO
Text Init        Random      Random     Sent T-5
Image Pooling    CLS         CLS        Max
Text Pooling     Avg         Avg        Avg
Dataset          300M∗       CC-12M     CC-12M
VOC mIoU (%)     16.4        17.5       50.8 (+33.3)
VOC JS (%)       28.6        37.3       47.5 (+10.2)

Table 1: We highlight the minimal differences of CLIPpy from CLIP. CLIP† is our implementation following training settings identical to CLIPpy. ∗ indicates OpenAI private data.
Multiple contemporary works also explore similar directions to CLIPpy, leveraging pre-trained vision-language models for various grouping tasks under weak supervision (no pixel-level annotation) [123, 68, 13, 75, 9, 52]. Combining self-supervised methods from which grouping emerges [12] with CLIP models [85] for cross-modal alignment is explored in [123], gaining notable improvements at object boundaries. A clustering mechanism containing learnable centres, similar to [113], is combined with reconstruction and super-pixel alignment losses to achieve grouping in [68]. Learning decoder networks over a frozen CLIP backbone [85] with text-to-image-patch similarity losses is explored in [13, 75], resulting in similar grouping behaviour. In contrast to these methods, which rely on contrastive vision-language training for grouping to emerge, recent works [9, 52] also showcase how text-to-image generative models (particularly Stable Diffusion [90]) can be leveraged to perform visual grouping.
Zero-shot semantic segmentation. A form of top-down grouping, this relatively new task [124, 48, 111, 8, 79, 45, 60, 3, 95] attempts to segment unseen classes, usually after a supervised training phase that often involves dense annotations. Following two early representative works [111, 8], most later approaches [60, 39, 40, 53, 95, 102] formulate the task as a pixel-level zero-shot classification problem with a closed-set vocabulary. While CLIPpy follows a similar pixel-based formulation, in contrast, our method requires no dense human annotations for supervision, no task-specific fine-tuning, and is open-vocabulary. Recent work [26, 58] also explores region-level classification leveraging pre-trained CLIP models [85], but unlike CLIPpy performs grouping independent of language during inference.
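For concreteness, a minimal sketch of this pixel-level formulation is given below: every spatial feature is assigned the class prompt whose text embedding it is most similar to. The function name, the tensor shapes, and the assumption that spatial and text features already share an embedding space are illustrative, not a prescription of the exact inference pipeline.

```python
# Minimal sketch of pixel-level zero-shot classification (illustrative only).
import torch
import torch.nn.functional as F

def zero_shot_segment(spatial_feats: torch.Tensor,  # (H, W, D) image features
                      text_feats: torch.Tensor      # (C, D) one row per class prompt
                      ) -> torch.Tensor:
    """Return an (H, W) map of per-location class indices."""
    img = F.normalize(spatial_feats, dim=-1)     # unit norm -> cosine similarity
    txt = F.normalize(text_feats, dim=-1)
    sim = torch.einsum("hwd,cd->hwc", img, txt)  # similarity to each prompt
    return sim.argmax(dim=-1)                    # nearest prompt per location
```

In practice the spatial feature map is coarser than the input image (e.g. one feature per 16x16 patch for a ViT-B/16), so the similarity map is upsampled to pixel resolution before the argmax; changing the set of class prompts changes the resulting grouping, which is what makes the procedure open-vocabulary and language-conditioned.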