CLIP also Understands Text: Prompting CLIP for Phrase Understanding
An Yan†, Jiacheng Li†, Wanrong Zhu¶, Yujie Lu¶,
William Yang Wang¶, Julian McAuley†
†UC San Diego, ¶UC Santa Barbara
{ayan, j9li, jmcauley}@ucsd.edu
{wanrongzhu,yujielu,william}@cs.ucsb.edu
Abstract
Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision. CLIP and its visual encoder have been explored on various vision and language tasks and achieve strong zero-shot or transfer learning performance. However, the application of its text encoder solely for text understanding has been less explored. In this paper, we find that the text encoder of CLIP actually demonstrates strong ability for phrase understanding, and can even significantly outperform popular language models such as BERT with a properly designed prompt. Extensive experiments validate the effectiveness of our method across different datasets and domains on entity clustering and entity set expansion tasks.
1 Introduction
Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021) is a recent model proposed to learn visual concepts from natural language supervision. It consists of a visual encoder and a text encoder, and learns visual representations by aligning images and text through a contrastive loss. CLIP has demonstrated strong zero-shot open-set image classification capability with 400 million pre-training image-text pairs crawled from the web.
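As a point of reference, the alignment objective can be written as a symmetric InfoNCE loss over a batch of paired image and text embeddings. The following is a minimal PyTorch sketch of that loss; in the original model the temperature is a learned parameter, and it is fixed here purely for illustration.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) outputs of the two encoders for matched pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                 # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)    # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                     # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)                 # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)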
However, despite its success on computer vision and multimodal tasks (Shen et al., 2021), few studies explore the application of its text encoder to downstream text understanding tasks. Recently, Hsu et al. (2022) empirically found that CLIP performs poorly when applied directly to natural language understanding tasks. One potential reason is that CLIP is not trained with language modeling losses (e.g., masked language modeling, MLM), which have proven to be crucial for language understanding (Devlin et al., 2019; Liu et al., 2019). But since the visual encoder has benefited from language supervision, one might naturally ask: does the text encoder also benefit from visual supervision?
In this paper, we show that even though CLIP is pre-trained without explicit token-, word-, or phrase-level supervision, with a simple and effective prompting method its text encoder can be used directly for phrase understanding, and can significantly outperform popular language models trained with masked language modeling (e.g., BERT), or even with phrase-specific learning objectives such as Phrase-BERT (Wang et al., 2021) and UCTopic (Li et al., 2022), on several phrase-related datasets from different domains. Specifically, we automatically generate an instance-level prompt for each phrase by probing the knowledge of a language model. The CLIP text encoder then encodes each phrase together with its prompt to obtain the final representation. We evaluate these representations directly on two phrase understanding tasks without further fine-tuning.
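To make this concrete, the sketch below illustrates the idea with off-the-shelf Hugging Face checkpoints: a masked language model is probed with a hypothetical template ("<phrase> is a [MASK].") to obtain a type word, the type word is folded into an instance-level prompt, and the prompt is encoded with the CLIP text encoder. The template and checkpoints here are illustrative assumptions, not necessarily the exact configuration used in our experiments.

import torch
from transformers import pipeline, CLIPTokenizer, CLIPTextModelWithProjection

# Illustrative checkpoints; the exact models and templates we use may differ.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def phrase_embedding(phrase):
    # Probe the masked LM for a type word, e.g. "aspirin is a [MASK]." -> "drug".
    probe = f"{phrase} is a {fill_mask.tokenizer.mask_token}."
    type_word = fill_mask(probe, top_k=1)[0]["token_str"].strip()
    # Fold the predicted type into an instance-level prompt and encode it with CLIP's text encoder.
    prompt = f"{phrase}, a kind of {type_word}."
    inputs = clip_tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return clip_text(**inputs).text_embeds.squeeze(0)  # projected text embedding

print(phrase_embedding("aspirin").shape)  # e.g. torch.Size([512])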
As a result, the CLIP text encoder achieves an average absolute improvement of 6.4% in accuracy (70.3% vs. 76.7%) on the entity clustering task (CoNLL2003, BC5CDR, W-NUT2017) and an improvement of 9.8% in mean average precision (56.9% vs. 66.7%) on the entity set expansion task (WIKI), compared with the best-performing language models.
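For the clustering evaluation, a standard protocol is to run k-means over the frozen phrase representations and map predicted clusters to gold entity types with the Hungarian algorithm before computing accuracy. The sketch below shows this generic setup; it is an assumption about the common protocol rather than a verbatim copy of our evaluation code.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def clustering_accuracy(embeddings, gold_labels, n_types):
    # embeddings: (n_phrases, dim) array; gold_labels: integer entity-type ids in [0, n_types).
    preds = KMeans(n_clusters=n_types, n_init=10).fit_predict(embeddings)
    # Count co-occurrences of predicted clusters and gold types.
    counts = np.zeros((n_types, n_types), dtype=np.int64)
    for p, g in zip(preds, gold_labels):
        counts[p, g] += 1
    # Hungarian matching: assign each cluster to the gold type that maximizes total overlap.
    rows, cols = linear_sum_assignment(counts.max() - counts)
    return counts[rows, cols].sum() / len(gold_labels)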
Overall, our contributions are as follows:
• We are the first to show that a text encoder trained with only image-text contrastive learning can achieve competitive or even better results on downstream text understanding tasks compared to popular language models pre-trained with MLM.
• We design an automatic prompting method with a language model as the knowledge base to boost performance on phrase understanding for both language models and CLIP.
• We conduct comprehensive experiments to demonstrate the effectiveness of our method,