
stance (image). Performance decreasing here (and not for text perturbations) is consistent with the hypothesis that the attention map primarily reflects the image modality, but not the text. This is further supported by the observation that the model pays little attention to clear positional cue words such as “left” and “right” when constructing the attention map: performance drops only negligibly under the Swap Left Right perturbation. Finally, swapping in other sentences (even from different reports) yields almost no difference in performance.
4 Can We Improve Alignments?
The above results indicate that image attention is
unintuitive and less sensitive to the text modality
than might be expected. Next we propose simple
methods to try to improve image/text alignments.
4.1 Models
All models build on the GLoRIA architecture except the baseline UNITER, for which we make no modifications other than re-training from scratch on the MIMIC-CXR/Chest ImaGenome dataset.7 In the results, GLoRIA refers to weights fit using the CheXpert dataset, released by Huang et al. (2021). We do not have access to the reports associated with this dataset, so we do not use it for training or evaluation, but we do make comparisons to the original (released) GLoRIA model trained on it.
We also retrain our own GLoRIA model on the MIMIC-CXR/ImaGenome dataset; we call this GLoRIA Retrained. While the two datasets are similar in size and content, CheXpert has many more positive cases of conditions than MIMIC-CXR/ImaGenome (8.86% of CheXpert images are labeled as having “No Findings”; in the ImaGenome dataset, reports associated with 21.80% of train images do not contain a sentence labeled “abnormal”). Given this difference in the number of positive cases, we train a Retrained+Abnormal model variant on the subset of MIMIC-CXR/ImaGenome sentence/image pairs featuring an “abnormal” sentence.
We also train models in which we adopt masking strategies intended to improve localization, hypothesizing that this might prevent over-reliance on textual artifacts that allow the model to ignore text that localizes. Our
Retrained+Word Masking model randomly replaces words in the input with [MASK] tokens during training with 30% probability.8 For our Retrained+Clinical Masking model, we randomly swap clinical entity spans found using a SciSpaCy entity linker (Neumann et al., 2019) for [MASK] tokens with 50% probability.
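To make the two masking schemes concrete, the sketch below shows one way they could be applied to a sentence before it is passed to the text encoder. This is a minimal illustration under our own assumptions: the function names and tokenization interface are ours, and we rely only on the entity spans produced by a SciSpaCy pipeline, not the exact linker configuration used here.

```python
import random

MASK = "[MASK]"

def word_mask(tokens, p=0.30):
    # Retrained+Word Masking: each word is independently replaced
    # with [MASK] with probability p.
    return [MASK if random.random() < p else t for t in tokens]

def clinical_mask(sentence, nlp, p=0.50):
    # Retrained+Clinical Masking: clinical entity spans detected by a
    # SciSpaCy pipeline (`nlp`, e.g. en_core_sci_sm with an entity
    # linker added) are swapped for [MASK] with probability p.
    doc = nlp(sentence)
    out, cursor = [], 0
    for ent in doc.ents:
        out.append(sentence[cursor:ent.start_char])
        out.append(MASK if random.random() < p else ent.text)
        cursor = ent.end_char
    out.append(sentence[cursor:])
    return "".join(out)
```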
Many sentences in a report will not refer to any particular region in the image. We therefore propose the Retrained+“No Attn” Token model, which concatenates a special “No Attn” token parameter vector to the set of local image embeddings just before attention is induced. This allows the model to attend to this special vector rather than any of the local image embeddings, effectively indicating that there is no good match.
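A minimal sketch of this mechanism, assuming a GLoRIA-style attention step in which a text embedding attends over local image embeddings; the module name, shapes, and dot-product scoring are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class LocalAttentionWithNoAttn(nn.Module):
    # Attention over local image embeddings plus a learned "No Attn" slot.
    def __init__(self, dim):
        super().__init__()
        # Learned parameter vector concatenated to the local image embeddings.
        self.no_attn_token = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, text_emb, image_locals):
        # text_emb:     (B, dim)     one embedding per word/sentence
        # image_locals: (B, R, dim)  R local image region embeddings
        B = image_locals.size(0)
        keys = torch.cat(
            [image_locals, self.no_attn_token.expand(B, -1, -1)], dim=1
        )  # (B, R + 1, dim)
        scores = torch.einsum("bd,brd->br", text_emb, keys)
        # The final slot lets the model place attention mass on "no good
        # match" instead of spreading it over image regions.
        return scores.softmax(dim=-1)  # (B, R + 1)
```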
We also consider a setting in which we assume a
small amount of supervision (annotations linking
image regions to texts). We finetune a model to
produce high attention on the annotated regions of
interest, i.e., we supervise attention. We employ an
alignment loss $\mathcal{L}_{\text{alignment}}(s, \ell) = \sum_p s_p \ell_p$ using the pixel-wise scores $s$ derived from the attention9 and the segmentation labels $\ell$ (Section 2.3). We
train on a batch of 30 examples for up to 500 steps
with early stopping on an additional 30-example
validation set using a patience of 25 steps. This
might be viewed as “few-shot alignment”, where
we use a small number of annotated examples to try
to make the model more interpretable by improving
image and text alignments.
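As a sketch, the alignment objective could be computed as below, assuming the attention map has already been upsampled to pixel resolution and renormalized to sum to one (footnote 9); the tensor names are ours, and the sign convention in the final comment is our reading of “produce high attention on the annotated regions,” not a detail stated in the text.

```python
import torch

def alignment_score(attn, labels):
    # attn:   (B, H, W) pixel-wise attention scores s (upsampled and
    #         renormalized to sum to 1 per example)
    # labels: (B, H, W) binary segmentation labels l for the regions
    #         of interest
    # Returns the batch mean of sum_p s_p * l_p.
    return (attn * labels).flatten(1).sum(dim=1).mean()

# Our reading: to encourage high attention on the labeled regions, the
# optimizer would maximize this quantity, e.g. by minimizing its negative:
# loss = -alignment_score(attn, labels)
```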
Finally, as a point of reference we train Retrained+Rand Sents in the same style as the Retrained model except that all sentences are replaced with random sentences. This deprives the model of any meaningful training signal, which otherwise comes entirely through the pairing of images and texts. This variant provides a baseline to help contextualize results. For all models, we use early stopping with a patience of 10 epochs.10
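For completeness, a generic sketch of patience-based early stopping of the kind used here; the validation metric, epoch budget, and function names are our assumptions.

```python
def train_with_early_stopping(model, run_epoch, validate,
                              patience=10, max_epochs=100):
    # Stop once the validation loss has failed to improve for
    # `patience` consecutive epochs.
    best_loss, epochs_since_best = float("inf"), 0
    for epoch in range(max_epochs):
        run_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, epochs_since_best = val_loss, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break
    return model
```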
4.2 Results and Discussion
4.2.1 Localization Metrics
Table 3 might seem to imply that UNITER performs best. However, we emphasize that this is not
7 We re-train from scratch because: (1) unlike in the original model, we are not feeding in features from Fast-RCNN, but instead using flattened pixels from a bounding box; and (2) we would like a fair comparison to the GLoRIA variants, which are also re-trained from scratch.
8 We choose the high value of 30% because, without hyperparameter tuning of this probability, we would like to see a significant impact when comparing to the baseline.
9 In this case, we also renormalize again after upsampling so that the pixel scores sum to 1.
10 For all models we report results on the last epoch before the early stopping condition is reached.