CLIP also Understands Text: Prompting CLIP for Phrase Understanding
An Yan†, Jiacheng Li†, Wanrong Zhu¶, Yujie Lu¶,
William Yang Wang¶, Julian McAuley†
†UC San Diego, ¶UC Santa Barbara
{ayan, j9li, jmcauley}@ucsd.edu
{wanrongzhu,yujielu,william}@cs.ucsb.edu
Abstract
Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision. CLIP and its visual encoder have been explored on various vision and language tasks and achieve strong zero-shot or transfer learning performance. However, the application of its text encoder solely for text understanding has been less explored. In this paper, we find that the text encoder of CLIP actually demonstrates strong ability for phrase understanding, and can even significantly outperform popular language models such as BERT with a properly designed prompt. Extensive experiments validate the effectiveness of our method across different datasets and domains on entity clustering and entity set expansion tasks.
1 Introduction
Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021) is a recent model proposed to learn visual concepts from natural language supervision. It consists of a visual encoder and a text encoder, and learns visual representations by aligning images and text through a contrastive loss. CLIP has demonstrated strong zero-shot open-set image classification capability with 400 million pre-training image-text pairs crawled from the web.
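As a point of reference, the alignment objective can be written as a symmetric InfoNCE loss over a batch of paired image and text embeddings. The following is a minimal PyTorch sketch of that loss; in the original model the temperature is a learned parameter, and it is fixed here purely for illustration.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) outputs of the two encoders for matched pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                 # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)    # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                     # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)                 # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)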
However, despite its success on computer vision and multimodal tasks (Shen et al., 2021), few studies explore the application of its text encoder to downstream text understanding tasks. Recently, Hsu et al. (2022) empirically found that CLIP performs poorly when applied directly to natural language understanding tasks. One potential reason is that CLIP is not trained with language modeling losses (e.g., masked language modeling, MLM), which have proven to be crucial for language understanding (Devlin et al., 2019; Liu et al., 2019). But since the visual encoder has benefited from language supervision, one might naturally ask: does the text encoder also benefit from visual supervision?
In this paper, we show that even though CLIP is pre-trained without explicit token-, word-, or phrase-level supervision, with a simple and effective prompting method its text encoder can be used directly for phrase understanding, and can significantly outperform popular language models trained with masked language modeling (e.g., BERT), or even with phrase-specific learning objectives such as Phrase-BERT (Wang et al., 2021) and UCTopic (Li et al., 2022), on several phrase-related datasets from different domains. Specifically, we automatically generate an instance-level prompt for each phrase by probing the knowledge of a language model. The CLIP text encoder then encodes each phrase together with its prompt to obtain the final representation. We evaluate these representations directly on two phrase understanding tasks without further fine-tuning.
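To make this concrete, the sketch below illustrates the idea with off-the-shelf Hugging Face checkpoints: a masked language model is probed with a hypothetical template ("<phrase> is a [MASK].") to obtain a type word, the type word is folded into an instance-level prompt, and the prompt is encoded with the CLIP text encoder. The template and checkpoints here are illustrative assumptions, not necessarily the exact configuration used in our experiments.

import torch
from transformers import pipeline, CLIPTokenizer, CLIPTextModelWithProjection

# Illustrative checkpoints; the exact models and templates we use may differ.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def phrase_embedding(phrase):
    # Probe the masked LM for a type word, e.g. "aspirin is a [MASK]." -> "drug".
    probe = f"{phrase} is a {fill_mask.tokenizer.mask_token}."
    type_word = fill_mask(probe, top_k=1)[0]["token_str"].strip()
    # Fold the predicted type into an instance-level prompt and encode it with CLIP's text encoder.
    prompt = f"{phrase}, a kind of {type_word}."
    inputs = clip_tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return clip_text(**inputs).text_embeds.squeeze(0)  # projected text embedding

print(phrase_embedding("aspirin").shape)  # e.g. torch.Size([512])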
As a result, the CLIP text encoder achieves an average absolute improvement of 6.4% in accuracy (70.3% vs. 76.7%) on the entity clustering task (CoNLL2003, BC5CDR, W-NUT2017) and an improvement of 9.8% in mean average precision (56.9% vs. 66.7%) on the entity set expansion task (WIKI), compared with the best-performing language models.
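For the clustering evaluation, a standard protocol is to run k-means over the frozen phrase representations and map predicted clusters to gold entity types with the Hungarian algorithm before computing accuracy. The sketch below shows this generic setup; it is an assumption about the common protocol rather than a verbatim copy of our evaluation code.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def clustering_accuracy(embeddings, gold_labels, n_types):
    # embeddings: (n_phrases, dim) array; gold_labels: integer entity-type ids in [0, n_types).
    preds = KMeans(n_clusters=n_types, n_init=10).fit_predict(embeddings)
    # Count co-occurrences of predicted clusters and gold types.
    counts = np.zeros((n_types, n_types), dtype=np.int64)
    for p, g in zip(preds, gold_labels):
        counts[p, g] += 1
    # Hungarian matching: assign each cluster to the gold type that maximizes total overlap.
    rows, cols = linear_sum_assignment(counts.max() - counts)
    return counts[rows, cols].sum() / len(gold_labels)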
Overall, our contributions are as follows:
• We are the first to show that a text encoder trained with only image-text contrastive learning can achieve competitive or even better results on downstream text understanding tasks compared to popular language models pre-trained with MLM.
• We design an automatic prompting method with a language model as the knowledge base to boost performance on phrase understanding for both language models and CLIP.
• We conduct comprehensive experiments to demonstrate the effectiveness of our method,