prediction accuracy for zero-shot prediction and supervised image classification tasks on average, and an over 2% improvement in retrieval precision. Details are in §4.
2 Related Work
Vision-text representation learning has been shown to yield good visual representations (Joulin et al., 2016; Li et al., 2017; Sariyildiz et al., 2020; Desai and Johnson, 2021; Kim et al., 2021; Wang et al., 2021a). However, all of these methods rely on paired images and captions from the general domain, e.g., Flickr (Joulin et al., 2016) and COCO Captions (Desai and Johnson, 2021). Moreover, they do not support cross-modal retrieval and hence cannot perform zero-shot prediction.
Many works propose to learn visual-semantic embeddings for vision-text retrieval (Liu et al., 2019; Wu et al., 2019; Lu et al., 2019; Huang et al., 2020; Chen et al., 2021) via attention or object detection models, via vision-text contrastive learning (Zhang et al., 2020; Jia et al., 2021; Yuan et al., 2021; Yu et al., 2022), or via multiple vision and text supervisions (Singh et al., 2021; Li et al., 2022). They all target the general domain, where a near-infinite supply of web images and captions is available, which dwarfs the scale of medical image-text data. This scarcity hinders self-supervised contrastive learning (CL) for large vision-text transformers in the medical domain. Although remedies such as data augmentation (Li et al., 2021) and knowledge graphs (Shen et al., 2022) have been proposed, the amount of data they rely on is still far larger than what is available for medical data.
Medical image-text representation learning has also been investigated with contrastive learning (Zhang et al., 2020; Huang et al., 2021; Wang et al., 2021b). Nonetheless, these methods all operate on paired medical images and texts, so they still face the data scarcity challenge. Moreover, they all suffer from false-negative noise when adopting noise contrastive estimation (NCE) (Van den Oord et al., 2018) to perform instance discrimination (Wu et al., 2018), which undermines representation quality (Arora et al., 2019; Zheng et al., 2021). Our work bridges this gap by making full use of all available medical data to support medical image-text pre-training, and by harnessing medical knowledge tailored to eliminate false negatives in contrastive learning, thereby improving pre-training data efficiency.
3 Method
In this section, we present the technical details of MedCLIP, following the flow in Fig. 3. MedCLIP consists of three components: (1) knowledge extraction, which builds the semantic similarity matrix; (2) vision and text encoders, which extract embeddings; and (3) a semantic matching loss, which trains the whole model.
3.1 Vision and Text Encoder
MedCLIP consists of one vision encoder and one text encoder.
Vision Encoder.
We encode images into embeddings $v \in \mathbb{R}^D$ using a vision encoder $E_{img}$. A projection head then maps the raw embeddings to $v_p \in \mathbb{R}^P$:
$$v = E_{img}(x_{img}), \tag{1a}$$
$$v_p = f_v(v), \tag{1b}$$
where $f_v$ is the projection head of the vision encoder.
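As a concrete illustration of Eq. (1), below is a minimal PyTorch sketch of the vision branch. The ResNet-50 backbone and the projection dimension $P = 512$ are illustrative assumptions, not choices taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class VisionEncoder(nn.Module):
    """Eq. (1): v = E_img(x_img), then v_p = f_v(v)."""
    def __init__(self, proj_dim: int = 512):
        super().__init__()
        backbone = resnet50(weights=None)          # E_img (assumed backbone)
        feat_dim = backbone.fc.in_features         # D
        backbone.fc = nn.Identity()                # expose the raw embedding v
        self.backbone = backbone
        self.proj = nn.Linear(feat_dim, proj_dim)  # f_v: R^D -> R^P

    def forward(self, x_img: torch.Tensor) -> torch.Tensor:
        v = self.backbone(x_img)   # v in R^D
        v_p = self.proj(v)         # v_p in R^P
        return v_p
```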
Text Encoder.
We create clinically meaningful text embeddings $t \in \mathbb{R}^M$ using a text encoder. We project them to $t_p \in \mathbb{R}^P$ as
$$t = E_{txt}(x_{txt}), \tag{2a}$$
$$t_p = f_t(t), \tag{2b}$$
where $f_t$ is the projection head and $E_{txt}$ denotes the text encoder. This yields the same embedding dimension $P$ as the vision encoder, suitable for contrastive learning.
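The text branch of Eq. (2) mirrors the vision branch. The sketch below assumes a generic BERT checkpoint from Hugging Face Transformers and [CLS]-token pooling; both are illustrative assumptions. The key point is that both projection heads map into the same $P$-dimensional space, so image-text similarities can be computed directly.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class TextEncoder(nn.Module):
    """Eq. (2): t = E_txt(x_txt), then t_p = f_t(t)."""
    def __init__(self, model_name: str = "bert-base-uncased", proj_dim: int = 512):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # E_txt (assumed checkpoint)
        hidden = self.encoder.config.hidden_size              # M
        self.proj = nn.Linear(hidden, proj_dim)                # f_t: R^M -> R^P

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        t = out.last_hidden_state[:, 0]   # [CLS] pooling as t (assumed pooling)
        t_p = self.proj(t)                # t_p in R^P
        return t_p

# Because v_p and t_p share the dimension P, pairwise image-text similarity
# can be computed directly, e.g.:
# sim = torch.nn.functional.cosine_similarity(v_p.unsqueeze(1), t_p.unsqueeze(0), dim=-1)
```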
3.2 Decouple Image-Text Pairs with Medical
Knowledge Extractor
Paired medical image-text datasets are orders of magnitude smaller than general-domain paired image-text data (e.g., from the internet), due to the significant expense of supplying high-quality annotations by medical specialists as well as privacy and legal concerns. To enhance medical multi-modal learning, we want to make full use of all existing medical image-text, image-only, and text-only datasets. The challenge is that CLIP-like contrastive learning is infeasible for image-only and text-only datasets. In addition, we want to uncover all positive pairs to eliminate false negatives.
Suppose we have $n$ paired image-text samples, $m$ labeled images, and $h$ medical sentences. Previous methods are only able to use the $n$ paired samples.
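To make the benefit of decoupling concrete, the sketch below counts candidate training pairs under the assumption (ours, for illustration) that once images and texts are decoupled, any of the $n + m$ images can be scored against any of the $n + h$ sentences through the semantic similarity matrix.

```python
def usable_pairs(n: int, m: int, h: int) -> tuple[int, int]:
    """Candidate training pairs before vs. after decoupling.

    n: paired image-text samples, m: labeled images, h: medical sentences.
    The (n + m) * (n + h) count assumes every image can be matched against
    every sentence via the semantic similarity matrix -- an illustrative
    assumption, not a figure quoted from the paper.
    """
    paired_only = n                   # previous methods: only the n true pairs
    decoupled = (n + m) * (n + h)     # candidate pairs after decoupling
    return paired_only, decoupled

# Hypothetical sizes: 1,000 pairs, 5,000 labeled images, 2,000 sentences.
print(usable_pairs(1_000, 5_000, 2_000))  # (1000, 18000000)
```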