MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
Zifeng Wang1, Zhenbang Wu1, Dinesh Agarwal1,3, Jimeng Sun1,2
1Department of Computer Science, University of Illinois Urbana-Champaign
2Carle Illinois College of Medicine, University of Illinois Urbana-Champaign
3Adobe
{zifengw2, zw12, jimeng}@illinois.edu, diagarwa@adobe.com
Abstract
Existing vision-text contrastive learning methods such as CLIP (Radford et al., 2021) aim to match paired image and caption embeddings while pushing unpaired ones apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general image-caption datasets collected from the internet. Moreover, previous methods encounter many false negatives, i.e., images and reports from separate patients may carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, thus scaling the usable training data combinatorially at low cost. We also propose to replace the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We show that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP wins over the state-of-the-art method (using 200K data).1
1 Introduction
Medical images such as X-rays, CTs, and MRIs are commonly used to diagnose, monitor, or treat medical conditions in clinical practice (FDA, 2022). With the rapid growth of medical images and corresponding report data, researchers have developed various deep learning models to support clinical decision making (Çallı et al., 2021).
Recently, large-scale image-text pre-training, e.g., CLIP (Radford et al., 2021), has achieved considerable success in the computer vision and natural language processing domains. CLIP is trained to predict the correct matching within a batch of image and text training examples.
1 Our code is available at https://github.com/RyanWangZf/MedCLIP.
Figure 1: Zero-shot performance of MedCLIP, ConVIRT (Zhang et al., 2020), and GLoRIA (Huang et al., 2021) when using different amounts of data for pre-training. ConVIRT and GLoRIA are trained on the MIMIC-CXR (369K) and CheXpert (191K) datasets, respectively. Our method yields higher ACC than GLoRIA while using roughly 1/10 of the pre-training data.
The joint training of image and text representations on large-scale image-text pairs yields transferable representations and supports flexible downstream tasks. Inspired by the success of CLIP, we believe the knowledge jointly learned from medical images and reports should be helpful for downstream clinical tasks.
However, adopting vision-text pre-training in the medical domain is a non-trivial task due to (1) CLIP's (Radford et al., 2021) data-hungry nature: CLIP is trained on a dataset of 400M image-text pairs collected from the internet, while the total number of publicly available medical images and reports is orders of magnitude smaller; and (2) the specificity of medical images and reports: compared to general domains (e.g., "cat" vs. "dog"), the differences within the medical domain are more subtle and fine-grained (e.g., "pneumonia" vs. "consolidation"). In a nutshell, it is necessary to (1) address the data insufficiency issue and (2) capture the subtle yet crucial medical meanings.
[Figure 2 graphic: an anchor image with example report sentences (e.g., "New left lower lobe opacity suggestive of left lower lobe pneumonia"), true-positive and false-negative pairings, a negative image, and the medical image-only, image-text, and text-only datasets.]
Figure 2: Demonstration of challenges in medical image-text contrastive learning. (1) Pre-training data only includes paired images and texts, while many more image-only and text-only datasets are ignored. (2) False negatives appear: for an anchor image, previous methods treat paired texts (i.e., reports from the same patient's study) as positives and unpaired texts (i.e., reports from other patients' studies) as negatives. However, the negative texts can describe the same symptoms as the anchor texts.
Existing works try to tackle the challenges above in different ways. ConVIRT (Zhang et al., 2020) jointly trains the vision and text encoders with paired medical images and reports via a bidirectional contrastive objective; GLoRIA (Huang et al., 2021) further models both the global and local interactions between medical images and reports to capture the pathology meanings from specific image regions. However, both works have significant limitations, as illustrated in Fig. 2.
Limited usable data. Most medical image datasets provide only diagnostic labels instead of the raw reports. However, both works require paired images and reports, leaving a vast number of medical image-only and text-only datasets unused.
False negatives in contrastive learning. Both methods try to push image and text embeddings from different patients apart. However, even though some reports do not belong to the target patient's study, they can still describe the same symptoms and findings. Simply treating the other reports as negative samples brings noise into the supervision and confuses the model.
To handle the above challenges, we propose a simple yet effective approach, namely MedCLIP. It makes the following contributions:
Decoupling images and texts for contrastive learning. We extend the pre-training to cover the massive unpaired image-only and text-only datasets, which scales the amount of training data in a combinatorial manner. This opens a new direction for expanding multi-modal learning based on medical knowledge instead of expensively scaling up data.
Eliminating false negatives via medical knowledge. We observe that images and reports from separate patients' studies may carry the same semantics but are falsely treated as negatives by previous methods. Hence, we design a soft semantic matching loss that uses the medical semantic similarity between each image and report as the supervision signal (a minimal illustrative sketch follows this list). This approach equips the model with the ability to capture subtle yet crucial medical meanings.
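Below is a minimal sketch of the soft semantic matching idea, assuming L2-normalized image and text embeddings and a precomputed medical semantic similarity matrix (e.g., derived from shared clinical entity labels). The function name, temperature value, and symmetric formulation are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def soft_semantic_matching_loss(img_emb, txt_emb, sem_sim, temperature=0.07):
    # img_emb: (N, P) L2-normalized image embeddings
    # txt_emb: (M, P) L2-normalized text embeddings
    # sem_sim: (N, M) medical semantic similarity between every image and every text
    logits = img_emb @ txt_emb.t() / temperature       # predicted image-text similarity
    # Soft targets come from medical knowledge rather than one-hot pairing, so reports
    # from other patients that describe the same findings are no longer treated as negatives.
    targets_i2t = F.softmax(sem_sim, dim=1)
    targets_t2i = F.softmax(sem_sim.t(), dim=1)
    loss_i2t = -(targets_i2t * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets_t2i * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In contrast to InfoNCE, whose target distribution places all probability mass on the single paired sample, the soft targets spread mass over every text whose medical semantics overlap with the image.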
We conduct a comprehensive evaluation of MedCLIP across four public datasets. Results show that MedCLIP achieves extremely high data efficiency, as shown in Fig. 1. Our method obtains better performance than the state-of-the-art GLoRIA (Huang et al., 2021) using only 10% of the pre-training data. Extensive experiments verify MedCLIP's transferability to various downstream tasks. It wins over baselines by a large margin: over 10% improvement in prediction ACC on average for zero-shot prediction and supervised image classification, and over 2% improvement in retrieval precision. Details are in §4.
2 Related Works
Vision-text representation learning has been shown to produce good visual representations (Joulin et al., 2016; Li et al., 2017; Sariyildiz et al., 2020; Desai and Johnson, 2021; Kim et al., 2021; Wang et al., 2021a). However, all of these methods work on paired images and captions from the general domain, e.g., Flickr (Joulin et al., 2016) and COCO Captions (Desai and Johnson, 2021). Moreover, these methods do not support cross-modal retrieval and hence do not support zero-shot prediction either.
Many works propose to learn visual-semantic embeddings for vision-text retrieval (Liu et al., 2019; Wu et al., 2019; Lu et al., 2019; Huang et al., 2020; Chen et al., 2021) using attention or object detection models, vision-text contrastive learning (Zhang et al., 2020; Jia et al., 2021; Yuan et al., 2021; Yu et al., 2022), or multiple forms of vision and text supervision (Singh et al., 2021; Li et al., 2022). They all work on the general domain, where near-infinite web images and captions are available, which dwarfs the scale of medical image-text data. This challenge hinders self-supervised contrastive learning for large vision-text transformers. Although remedies such as data augmentation (Li et al., 2021) and knowledge graphs (Shen et al., 2022) have been proposed, the amount of data they use is still far larger than what is available in the medical domain.
Medical image-text representation learning has also been investigated through contrastive learning (Zhang et al., 2020; Huang et al., 2021; Wang et al., 2021b). Nonetheless, these works all rely on paired medical images and texts and thus still face the data-scarcity challenge. Moreover, they all suffer from false-negative noise when adopting noise contrastive estimation (NCE) (Van den Oord et al., 2018) to perform instance discrimination (Wu et al., 2018), which undermines representation quality (Arora et al., 2019; Zheng et al., 2021). Our work bridges the gap by making full use of all available medical data to support medical image-text pre-training, and by harnessing medical knowledge tailored to eliminate false negatives in contrastive learning, thereby improving pre-training data efficiency.
3 Method
In this section, we present the technical details of MedCLIP following the flow in Fig. 3. MedCLIP consists of three components: (1) knowledge extraction, which builds the semantic similarity matrix; (2) vision and text encoders, which extract embeddings; and (3) a semantic matching loss that trains the whole model.
3.1 Vision and Text Encoder
MedCLIP consists of one vision encoder and one text encoder.
Vision Encoder. We encode images into embeddings $\mathbf{v} \in \mathbb{R}^D$ using a vision encoder $E_{img}$. A projection head then maps the raw embeddings to $\mathbf{v}_p \in \mathbb{R}^P$:
$$\mathbf{v} = E_{img}(x_{img}), \tag{1a}$$
$$\mathbf{v}_p = f_v(\mathbf{v}), \tag{1b}$$
where $f_v$ is the projection head of the vision encoder.
Text Encoder. We create clinically meaningful text embeddings $\mathbf{t} \in \mathbb{R}^M$ using a text encoder. We then project them to $\mathbf{t}_p \in \mathbb{R}^P$ as
$$\mathbf{t} = E_{txt}(x_{txt}), \tag{2a}$$
$$\mathbf{t}_p = f_t(\mathbf{t}), \tag{2b}$$
where $f_t$ is the projection head and $E_{txt}$ denotes the text encoder. This gives the same embedding dimension $P$ as the vision encoder, suitable for contrastive learning.
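For concreteness, the sketch below shows one way Eqs. (1a)-(2b) could be wired together in PyTorch. The backbone modules (assumed to return pooled feature vectors), the hidden sizes D and M, the projection dimension P, and the final L2 normalization are placeholder assumptions, not the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, vision_backbone, text_backbone, D=2048, M=768, P=512):
        super().__init__()
        self.E_img = vision_backbone        # maps an image x_img to v in R^D
        self.E_txt = text_backbone          # maps a report x_txt to t in R^M
        self.f_v = nn.Linear(D, P)          # vision projection head, Eq. (1b)
        self.f_t = nn.Linear(M, P)          # text projection head, Eq. (2b)

    def forward(self, x_img, x_txt):
        v = self.E_img(x_img)                       # Eq. (1a)
        t = self.E_txt(x_txt)                       # Eq. (2a)
        v_p = F.normalize(self.f_v(v), dim=-1)      # project to shared P-dim space
        t_p = F.normalize(self.f_t(t), dim=-1)      # (L2 normalization is an assumption)
        return v_p, t_p
```

Mapping both modalities into the same P-dimensional space is what allows image and text embeddings to be compared directly with a dot product during contrastive training and retrieval.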
3.2 Decouple Image-Text Pairs with Medical Knowledge Extractor
Paired medical image-text datasets are orders of magnitude smaller than general paired image-text data (e.g., collected from the internet) due to the significant expense of high-quality annotation by medical specialists as well as privacy and legal concerns. To enhance medical multi-modal learning, we want to make full use of all existing medical image-text, image-only, and text-only datasets. The challenge is that CLIP-like contrastive learning is infeasible for image-only and text-only datasets. In addition, we want to identify all possible positive pairs to eliminate false negatives.
Suppose we have $n$ paired image-text samples, $m$ labeled images, and $h$ medical sentences. Previous methods are only able to use the $n$ paired samples.
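To illustrate how decoupling can scale the usable supervision, the toy example below assumes that every image and every sentence can be tagged with multi-hot labels over a small set of clinical entities; the entity vocabulary, labels, and helper function here are hypothetical, not the paper's actual knowledge extractor. Once such labels exist, any image can be compared with any sentence, regardless of whether they came from the same patient.

```python
import numpy as np

# Hypothetical entity vocabulary used only for this illustration.
ENTITIES = ["pneumonia", "consolidation", "pleural effusion", "rib fracture"]

def semantic_similarity(img_labels, txt_labels, eps=1e-8):
    # img_labels: (num_images, K) multi-hot entity labels extracted for images
    # txt_labels: (num_sentences, K) multi-hot entity labels extracted from sentences
    # Returns a (num_images, num_sentences) cosine-similarity matrix that can pair
    # every image with every sentence, including images and texts that were never paired.
    img_n = img_labels / (np.linalg.norm(img_labels, axis=1, keepdims=True) + eps)
    txt_n = txt_labels / (np.linalg.norm(txt_labels, axis=1, keepdims=True) + eps)
    return img_n @ txt_n.T

# Toy example: 2 images and 3 sentences labeled over the 4 hypothetical entities.
img_labels = np.array([[1, 0, 0, 0],     # image showing pneumonia
                       [0, 0, 1, 1]])    # image showing effusion and rib fracture
txt_labels = np.array([[1, 1, 0, 0],     # sentence mentioning pneumonia/consolidation
                       [0, 0, 1, 0],     # sentence mentioning pleural effusion
                       [0, 0, 0, 0]])    # sentence with no listed finding
print(semantic_similarity(img_labels, txt_labels))  # 2 x 3 soft targets from unpaired data
```

Viewed this way, the supervision grows combinatorially, roughly on the order of $(n+m) \times (n+h)$ image-sentence combinations rather than only the $n$ original pairs, though the exact counting depends on how the knowledge extractor is applied.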