CLIP also Understands Text: Prompting CLIP for Phrase Understanding
An Yan, Jiacheng Li, Wanrong Zhu, Yujie Lu,
William Yang Wang, Julian McAuley
UC San Diego, UC Santa Barbara
{ayan, j9li, jmcauley}@ucsd.edu
{wanrongzhu,yujielu,william}@cs.ucsb.edu
Abstract
Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision. CLIP and its visual encoder have been explored on various vision and language tasks and achieve strong zero-shot or transfer learning performance. However, the application of its text encoder solely for text understanding has been less explored. In this paper, we find that the text encoder of CLIP actually demonstrates a strong ability for phrase understanding, and can even significantly outperform popular language models such as BERT with a properly designed prompt. Extensive experiments validate the effectiveness of our method across different datasets and domains on entity clustering and entity set expansion tasks.
1 Introduction
Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021) is a recent model proposed to learn visual concepts from natural language supervision. It consists of a visual encoder and a text encoder, and learns visual representations by aligning images and text through a contrastive loss. CLIP has demonstrated strong zero-shot open-set image classification capability with 400 million pre-training image-text pairs crawled from the web.
However, despite its success on computer vision and multimodal tasks (Shen et al., 2021), few studies explore the application of its text encoder to downstream text understanding tasks. Recently, Hsu et al. (2022) empirically found that CLIP performs poorly when applied directly to natural language understanding tasks. One potential reason is that CLIP is not trained with language modeling losses (e.g., masked language modeling, MLM), which have proven crucial for language understanding (Devlin et al., 2019; Liu et al., 2019). But since the visual encoder has benefited from language supervision, one might naturally ask: does the text encoder also benefit from visual supervision?
In this paper, we show that even though CLIP is pre-trained without explicit token-, word-, or phrase-level supervision, with a simple and effective prompting method the CLIP text encoder can be directly used for phrase understanding, and can significantly outperform popular language models trained with masked language modeling (e.g., BERT) or even phrase-specific learning objectives such as Phrase-BERT (Wang et al., 2021) and UCTopic (Li et al., 2022) on several phrase-related datasets from different domains. Specifically, we automatically generate instance-level prompts for each phrase by probing the knowledge of a language model. Then, the text encoder of CLIP encodes phrases with their corresponding prompts to obtain the final representations. We evaluate these representations directly on two phrase understanding tasks without further fine-tuning.
Consequently, the CLIP text encoder achieves an average absolute improvement of 6.4% (70.3% vs. 76.7% accuracy) on the entity clustering task (CoNLL2003, BC5CDR, W-NUT2017) and an improvement of 9.8% (56.9% vs. 66.7% mean average precision) on the entity set expansion task (WIKI), compared with the best-performing language models.
Overall, our contributions are as follows:

• We are the first to show that a text encoder trained with only image-text contrastive learning can achieve competitive or even better results on downstream text understanding tasks compared to popular language models pre-trained with MLM.

• We design an automatic prompting method with a language model as the knowledge base to boost performance on phrase understanding for both language models and CLIP.

• We conduct comprehensive experiments to demonstrate the effectiveness of our method, and analyze why CLIP performs well for these tasks across different domains.
[Figure 1: Illustration of our framework for phrase understanding with the text encoder of CLIP. (a) CLIP: Contrastive Language-Image Pretraining; (b) Domain-aware Prompting for CLIP.]
2 Methodology
2.1 Preliminary: CLIP
CLIP is a powerful vision-language model with strong performance on zero-shot open-set image classification. As shown in Figure 1a, it consists of two encoders: a ResNet (He et al., 2016) or ViT (Dosovitskiy et al., 2020) image encoder and a Transformer text encoder. Given an image and a sequence of words, it transforms them into feature vectors $V$ and $T$, respectively.
Then the model is pretrained with contrastive
losses between two modalities:
$$\mathcal{L}_{\mathrm{CLIP}} = \tfrac{1}{2}\left(\mathcal{L}_{vt} + \mathcal{L}_{tv}\right) \quad (1)$$

where, given a mini-batch of $N$ samples, with $\mathrm{sim}$ denoting cosine similarity and $\tau$ the temperature, the contrastive loss $\mathcal{L}_{vt}$ (with a symmetric definition for $\mathcal{L}_{tv}$) is formulated as:

$$\mathcal{L}_{vt} = -\log \frac{\exp\left(\mathrm{sim}(V_i, T_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(V_i, T_j)/\tau\right)}. \quad (2)$$
Note that this loss does not inject token- or word-level supervision, but mainly focuses on learning a joint representation space for images and text, where those that are paired together in the training data would ideally be close to each other in the latent embedding space.
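To make this concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(2), assuming $V$ and $T$ are already L2-normalized image and text feature matrices from the two encoders; the batch size, feature dimension, and temperature value are illustrative assumptions, not settings from this paper.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(V: torch.Tensor, T: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # Cosine similarities sim(V_i, T_j) for every image-text pair in the batch.
    logits = V @ T.t() / tau                                  # shape (N, N)
    targets = torch.arange(V.size(0), device=V.device)        # positives on the diagonal
    # L_vt: for each image i, its paired text is the positive among all N texts.
    # cross_entropy averages the per-sample loss of Eq. (2) over the batch.
    loss_vt = F.cross_entropy(logits, targets)
    # L_tv: the symmetric direction, from texts to images.
    loss_tv = F.cross_entropy(logits.t(), targets)
    # Eq. (1): average of the two directions.
    return 0.5 * (loss_vt + loss_tv)

# Example with random, normalized features.
V = F.normalize(torch.randn(8, 512), dim=-1)
T = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_contrastive_loss(V, T).item())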
Surprisingly, we find that CLIP, though trained only with paired images and queries, provides a fine-grained understanding of phrases when combined with a simple and effective prompting method, which we introduce below.
2.2 Domain-Aware Prompting
After pre-training on large-scale image and text data, CLIP can be readily leveraged for image classification via prompting. Based on the prompt engineering in the original CLIP paper (Radford et al., 2021), "A photo of a [label]" is a good default template, which helps specify that the text refers to the content of an image. This template improves the zero-shot performance of CLIP on image classification. We first follow the same template design to prompt CLIP, which we empirically show can also greatly improve performance on phrase understanding tasks.
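As an illustration, the default template can be applied to phrases and encoded with the CLIP text encoder as in the short sketch below; the Hugging Face checkpoint name and the example phrases are assumptions for illustration, not the exact setup of this paper.

import torch
from transformers import CLIPModel, CLIPTokenizer

# Assumed checkpoint; the paper's exact CLIP variant may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["United States", "aspirin"]
# Default CLIP template from Radford et al. (2021), applied to each phrase.
prompts = [f"A photo of a {p}." for p in phrases]

with torch.no_grad():
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    reps = model.get_text_features(**inputs)   # (num_phrases, projection_dim)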
However, simply using this template could lead to sub-optimal representations, as the semantics of phrases vary vastly across domains. Recent work (Zhou et al., 2021) found that adding a domain keyword for a dataset can improve the image classification performance of CLIP. These domain keywords are usually hand-crafted; to automate this process and build robust phrase representations across different domains, we probe the knowledge of a language model to identify the domain of each phrase and design an automatic approach to generating instance-level domain-aware keywords.
Formally, given a phrase $p_i$, we use "$p_i$ is a [mask]" as the template and ask a language model (e.g., BERT) to fill in the mask token. We then use the top-$K$ predictions $\{d_i^1, d_i^2, \ldots, d_i^K\}$ from the language model to construct a prompt "A photo of $p_i$. A $d_i^1$, ..., $d_i^K$" for CLIP. For example, as shown in Figure 1b, given the phrase "United States", the language model generates keywords such as country, republic, and nation. These keywords then form the domain-aware prompt "A photo of United States. A country, republic, nation", which is used as input to the CLIP text encoder.
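A minimal sketch of this procedure is given below, assuming BERT as the masked language model and the Hugging Face transformers implementations of both models; the checkpoint names, the choice of K = 3, and the exact punctuation of the template are illustrative assumptions rather than the paper's precise settings.

import torch
from transformers import CLIPModel, CLIPTokenizer, pipeline

# Assumed models: BERT as the knowledge base, a standard CLIP checkpoint as the encoder.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def domain_aware_prompt(phrase: str, k: int = 3) -> str:
    # Probe the LM with "p_i is a [mask]." and keep the top-K predictions
    # as domain keywords d_i^1, ..., d_i^K.
    preds = fill_mask(f"{phrase} is a {fill_mask.tokenizer.mask_token}.", top_k=k)
    keywords = [p["token_str"].strip() for p in preds]
    # e.g. "A photo of United States. A country, republic, nation."
    return f"A photo of {phrase}. A {', '.join(keywords)}."

def encode_phrases(phrases):
    # Encode the domain-aware prompts with the CLIP text encoder only.
    prompts = [domain_aware_prompt(p) for p in phrases]
    inputs = clip_tokenizer(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        return clip_model.get_text_features(**inputs)

reps = encode_phrases(["United States", "aspirin"])   # phrase representations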
2.3 Phrase Understanding
After designing the prompting method, we can directly use the text encoder of CLIP for phrase understanding, by feeding it prompted phrases and using the output encodings as phrase representations.
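For instance, these representations can be clustered directly without any fine-tuning, reusing the encode_phrases helper from the previous sketch; the use of K-Means, the normalization step, and the example phrases below are assumptions for illustration rather than the paper's exact evaluation protocol.

import torch.nn.functional as F
from sklearn.cluster import KMeans

# Reuses encode_phrases() from the previous sketch.
phrases = ["United States", "Germany", "aspirin", "ibuprofen"]
reps = F.normalize(encode_phrases(phrases), dim=-1).cpu().numpy()
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(reps)
print(dict(zip(phrases, labels)))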