prediction accuracy for zero-shot prediction and supervised image classification tasks on average, and an over 2% improvement in retrieval precision. Details are in §4.
2 Related Work
Vision-text representation learning has been shown to yield good visual representations (Joulin et al., 2016; Li et al., 2017; Sariyildiz et al., 2020; Desai and Johnson, 2021; Kim et al., 2021; Wang et al., 2021a). However, all of these methods rely on paired images and captions from the general domain, e.g., Flickr (Joulin et al., 2016) and COCO Captions (Desai and Johnson, 2021). Moreover, they do not support cross-modal retrieval and hence cannot perform zero-shot prediction.
Many works propose to learn visual-semantic embeddings for vision-text retrieval (Liu et al., 2019; Wu et al., 2019; Lu et al., 2019; Huang et al., 2020; Chen et al., 2021) via attention or object detection models, via vision-text contrastive learning (Zhang et al., 2020; Jia et al., 2021; Yuan et al., 2021; Yu et al., 2022), or via multiple vision and text supervisions (Singh et al., 2021; Li et al., 2022). They all target the general domain, where a near-infinite supply of web images and captions is available, which dwarfs the scale of medical image-text data. This scarcity hinders self-supervised contrastive learning (CL) for large vision-text transformers in the medical domain. Although remedies such as data augmentation (Li et al., 2021) and knowledge graphs (Shen et al., 2022) have been proposed, the amount of data they rely on is still far larger than what is available for medical data.
Medical image-text representation learning has also been investigated with contrastive learning (Zhang et al., 2020; Huang et al., 2021; Wang et al., 2021b). Nonetheless, these methods all operate on paired medical images and texts, so they still face the data scarcity challenge. Moreover, they all suffer from false-negative noise when adopting noise contrastive estimation (NCE) (Van den Oord et al., 2018) to perform instance discrimination (Wu et al., 2018), which undermines representation quality (Arora et al., 2019; Zheng et al., 2021). Our work bridges this gap by making full use of all available medical data to support medical image-text pre-training, and by harnessing medical knowledge tailored to eliminate false negatives in contrastive learning, thereby improving pre-training data efficiency.
3 Method
In this section, we present the technical details of MedCLIP, following the flow in Fig. 3. MedCLIP consists of three components: (1) knowledge extraction, which builds the semantic similarity matrix; (2) vision and text encoders, which extract embeddings; and (3) a semantic matching loss, which trains the whole model.
3.1 Vision and Text Encoder
MedCLIP consists of one vision encoder and one text encoder.
Vision Encoder.
We encode images into embeddings $v \in \mathbb{R}^D$ using a vision encoder $E_{img}$. A projection head then maps the raw embeddings to $v_p \in \mathbb{R}^P$:
$$v = E_{img}(x_{img}), \tag{1a}$$
$$v_p = f_v(v), \tag{1b}$$
where $f_v$ is the projection head of the vision encoder.
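As a concrete illustration of Eq. (1), below is a minimal PyTorch sketch of the vision branch. The ResNet-50 backbone and the projection dimension $P = 512$ are illustrative assumptions, not choices taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class VisionEncoder(nn.Module):
    """Eq. (1): v = E_img(x_img), then v_p = f_v(v)."""
    def __init__(self, proj_dim: int = 512):
        super().__init__()
        backbone = resnet50(weights=None)          # E_img (assumed backbone)
        feat_dim = backbone.fc.in_features         # D
        backbone.fc = nn.Identity()                # expose the raw embedding v
        self.backbone = backbone
        self.proj = nn.Linear(feat_dim, proj_dim)  # f_v: R^D -> R^P

    def forward(self, x_img: torch.Tensor) -> torch.Tensor:
        v = self.backbone(x_img)   # v in R^D
        v_p = self.proj(v)         # v_p in R^P
        return v_p
```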
Text Encoder.
We create clinically meaningful text embeddings $t \in \mathbb{R}^M$ using a text encoder. We project them to $t_p \in \mathbb{R}^P$ as
$$t = E_{txt}(x_{txt}), \tag{2a}$$
$$t_p = f_t(t), \tag{2b}$$
where $f_t$ is the projection head and $E_{txt}$ denotes the text encoder. This yields the same embedding dimension $P$ as the vision encoder, suitable for contrastive learning.
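The text branch of Eq. (2) mirrors the vision branch. The sketch below assumes a generic BERT checkpoint from Hugging Face Transformers and [CLS]-token pooling; both are illustrative assumptions. The key point is that both projection heads map into the same $P$-dimensional space, so image-text similarities can be computed directly.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class TextEncoder(nn.Module):
    """Eq. (2): t = E_txt(x_txt), then t_p = f_t(t)."""
    def __init__(self, model_name: str = "bert-base-uncased", proj_dim: int = 512):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # E_txt (assumed checkpoint)
        hidden = self.encoder.config.hidden_size              # M
        self.proj = nn.Linear(hidden, proj_dim)                # f_t: R^M -> R^P

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        t = out.last_hidden_state[:, 0]   # [CLS] pooling as t (assumed pooling)
        t_p = self.proj(t)                # t_p in R^P
        return t_p

# Because v_p and t_p share the dimension P, pairwise image-text similarity
# can be computed directly, e.g.:
# sim = torch.nn.functional.cosine_similarity(v_p.unsqueeze(1), t_p.unsqueeze(0), dim=-1)
```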
3.2 Decouple Image-Text Pairs with Medical
Knowledge Extractor
Paired medical image-text datasets are orders of magnitude smaller than general-domain paired image-text data (e.g., from the internet), due to the significant expense of supplying high-quality annotations by medical specialists as well as privacy and legal concerns. To enhance medical multi-modal learning, we want to make full use of all existing medical image-text, image-only, and text-only datasets. The challenge is that CLIP-like contrastive learning is infeasible for image-only and text-only datasets. In addition, we want to uncover all positive pairs to eliminate false negatives.
Suppose we have $n$ paired image-text samples, $m$ labeled images, and $h$ medical sentences. Previous methods are only able to use the $n$ paired samples.
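To make the benefit of decoupling concrete, the sketch below counts candidate training pairs under the assumption (ours, for illustration) that once images and texts are decoupled, any of the $n + m$ images can be scored against any of the $n + h$ sentences through the semantic similarity matrix.

```python
def usable_pairs(n: int, m: int, h: int) -> tuple[int, int]:
    """Candidate training pairs before vs. after decoupling.

    n: paired image-text samples, m: labeled images, h: medical sentences.
    The (n + m) * (n + h) count assumes every image can be matched against
    every sentence via the semantic similarity matrix -- an illustrative
    assumption, not a figure quoted from the paper.
    """
    paired_only = n                   # previous methods: only the n true pairs
    decoupled = (n + m) * (n + h)     # candidate pairs after decoupling
    return paired_only, decoupled

# Hypothetical sizes: 1,000 pairs, 5,000 labeled images, 2,000 sentences.
print(usable_pairs(1_000, 5_000, 2_000))  # (1000, 18000000)
```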