That’s the Wrong Lung! Evaluating and Improving the Interpretability
of Unsupervised Multimodal Encoders for Medical Data
Denis Jered McInerney
Northeastern University
mcinerney.de@northeastern.edu
Geoffrey Young
Brigham and Women’s Hospital
gsyoung@bwh.harvard.edu
Jan-Willem van de Meent
University of Amsterdam
j.w.vandemeent@uva.nl
Byron C. Wallace
Northeastern University
b.wallace@northeastern.edu
Abstract

Pretraining multimodal models on Electronic Health Records (EHRs) provides a means of learning representations that can transfer to downstream tasks with minimal supervision. Recent multimodal models induce soft local alignments between image regions and sentences. This is of particular interest in the medical domain, where alignments might highlight regions in an image relevant to specific phenomena described in free-text. While past work has suggested that attention "heatmaps" can be interpreted in this manner, there has been little evaluation of such alignments. We compare alignments from a state-of-the-art multimodal (image and text) model for EHR with human annotations that link image regions to sentences. Our main finding is that the text often has a weak or unintuitive influence on attention; alignments do not consistently reflect basic anatomical information. Moreover, synthetic modifications — such as substituting "left" for "right" — do not substantially influence highlights. Simple techniques such as allowing the model to opt out of attending to the image and few-shot finetuning show promise in terms of their ability to improve alignments with very little or no supervision. We make our code and checkpoints open-source.1

1 https://github.com/dmcinerney/gloria
1 Introduction
There has been a flurry of recent work on model architectures and self-supervised training objectives for multimodal representation learning, both generally (Li et al., 2019; Tan and Bansal, 2019; Huang et al., 2020; Su et al., 2020; Chen et al., 2020) and for medical data specifically (Wang et al., 2018; Chauhan et al., 2020; Li et al., 2020). These methods yield representations that permit efficient learning on various multimodal downstream tasks (e.g., classification, captioning).
Given the inherently multimodal nature of much medical data — e.g., in radiology, images and text are naturally paired — there has been particular interest in designing multimodal models for Electronic Health Records (EHRs) data. However, one of the factors that currently stands in the way of broader adoption is interpretability. Neural models that map image-text pairs to shared representations are opaque. Consequently, doctors have no way of knowing whether such models rely on meaningful clinical signals or data artifacts (Zech et al., 2018).

Recent work has proposed models that soft-align text snippets to image regions. This may afford a type of interpretability by allowing practitioners to inspect what the model has "learned", or allow more efficient identification of relevant regions. Past work has presented illustrative multimodal "saliency" maps in which such models highlight plausible regions. But such highlights also risk providing a false sense that the model "understands" more than it actually does, and irrelevant highlights would be antithetical to the goal of efficiency in clinical decision support.
Multimodal models may fail in a few obvious ways: they may focus on the wrong part of an image, fail to localize by producing a high-entropy attention distribution, or localize too much and miss a larger region of interest. However, even when image attention appears reasonable, it may not in actuality reflect both modalities. Figure 1 shows an example. Here the model ostensibly succeeds at identifying the image region relevant to the given text (left). One may be tempted to conclude that the model has "understood" the text and indicated the corresponding region. But this may be misleading: we can see that the same model yields a similar attention pattern when provided text with radically different semantics (e.g., when swapping "right" with "left"), or when provided sentences referencing an abnormality in another region.
Our contributions are as follows. (i) We appraise the interpretability of soft alignments induced between images and texts by existing neural multimodal models for radiology, both retrospectively and via manual radiologist assessments. To the best of our knowledge, this is the first such evaluation. (ii) We propose methods that improve the ability of multimodal models for EHR to intuitively align image regions with texts.

[Figure 1: Alignment failures often occur when the model (overly) focuses on one aspect of the image, largely ignoring the text. The figure compares attention for one image paired with several report sentences ("Right sided pleural effusion.", "There is blunting of the right posterior costophrenic angle...", "Left greater than right bibasilar opacities...", "Right greater than left bibasilar opacities...") and with left/right-swapped variants. (Note: images are "mirrored", so right and left are flipped.)]
2 Preliminaries
We aim to evaluate the localization abilities of multimodal models for EHR. For this we focus on the recently proposed GLoRIA model (Huang et al., 2021), which is representative of state-of-the-art, transformer-based multimodal architectures and accompanying pre-training methods. For completeness we also analyze (a modified version of) UNITER (Chen et al., 2020). We next review details of these models, and then discuss the datasets we use to evaluate the alignments they induce.
2.1 GLoRIA
GLoRIA uses Clinical BERT (Alsentzer et al., 2019) as a text encoder and ResNet (He et al., 2016) as an image encoder. Unlike prior work, GLoRIA does not assume an image can be partitioned into different objects, which is important because pre-trained object detectors are not readily available for X-ray images. GLoRIA passes a CNN over the image to yield local region representations. This is useful because a finding within an X-ray described in a report will usually appear in only a small region of the corresponding image (Huang et al., 2021). GLoRIA exploits this intuition via a local contrastive loss term in the objective.
We assume a dataset of instances comprising an image $x_v$ and a sentence from the corresponding report $x_t$, and the model consumes this to produce a set of local embeddings and a global embedding per modality: $v_l \in \mathbb{R}^{M \times D}$, $v_g \in \mathbb{R}^{D}$, $t_l \in \mathbb{R}^{N \times D}$, and $t_g \in \mathbb{R}^{D}$. To construct the local contrastive loss, an attention mechanism (Bahdanau et al., 2014) is applied to local image embeddings, queried by the local text embeddings. This induces a soft alignment between the local vectors of each mode:

$$a_{ij} = \frac{\exp(t_{l_i}^\top v_{l_j} / \tau)}{\sum_{k=1}^{M} \exp(t_{l_i}^\top v_{l_k} / \tau)} \quad (1)$$

where $t_{l_i}$ is the $i$-th local text embedding, $v_{l_j}$ the $j$-th local image embedding, and $\tau$ is a temperature hyperparameter.
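To make Eq. 1 concrete, the sketch below computes the same soft alignment with generic tensors. The function name, the grid size, and the temperature value of 0.1 are illustrative placeholders, not GLoRIA's exact implementation.

import torch
import torch.nn.functional as F

def local_alignment(t_l, v_l, tau=0.1):
    # t_l: (N, D) local text embeddings; v_l: (M, D) local image embeddings.
    # Returns an (N, M) matrix whose row i is the attention of word piece i
    # over the M image regions, i.e., a_ij in Eq. 1.
    sim = t_l @ v_l.T                     # (N, M) dot-product similarities
    return F.softmax(sim / tau, dim=-1)   # normalize over image regions

# Toy example: N=12 word pieces, M=19*19 image regions (placeholder grid), D=768.
t_l = torch.randn(12, 768)
v_l = torch.randn(19 * 19, 768)
a = local_alignment(t_l, v_l)             # a[i, j] corresponds to a_ij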
2.2 UNITER
Despite the challenges inherent to adopting "general-domain" multimodal models for this domain (discussed in Appendix A.1), we modify UNITER to serve as an additional model for analysis. We provide details regarding how we have implemented UNITER in Appendix A.2, but note here that this requires ground-truth bounding boxes as inputs, which means that (a) results with respect to most metrics (which measure overlap with target bounding boxes) for UNITER will be artificially high, and (b) we could not use this method in practice, because it requires a set of reference bounding boxes as input (including at inference time). We include this for completeness.
2.3 Data and Metrics
Data

Our retrospective evaluation of localization abilities is made possible by the MIMIC-CXR (Johnson et al., 2019a,b) and Chest ImaGenome (Wu et al., 2021) datasets. MIMIC-CXR comprises chest X-rays and corresponding radiology reports. ImaGenome includes 1000 manually annotated image/report pairs,2 with bounding boxes for anatomical locations, links between referring sentences and image bounding boxes, and a set of conditions and positive/negative context annotations3 associated with each sentence/bounding box pair.

2 Annotations were automatically derived then cleaned.
3 Here, context refers to whether the condition is negated in the text (negative) or not (positive).

Table 1: Localization performance of GLoRIA.
AUROC   Avg. P   IOU@5/10/30%
69.07   51.68    3.79/6.69/20.10
Metrics

We quantify the degree to which attention highlights the region to which a text snippet refers by comparing average attention over an input sentence, $x_j = \frac{1}{N} \sum_{i=1}^{N} a_{ij}$, with reference annotated bounding boxes associated with the sentence.

We use several metrics to measure the alignment between soft attention weights and bounding boxes. We create scores $s \in \mathbb{R}^{P}$ for each of the $P$ pixels based on the attention weight assigned to the image region the pixel belongs to. Specifically, for GLoRIA we use upsampling with bilinear interpolation to distribute attention over pixels. For UNITER, we score pixels by taking a max over attention scores for the bounding boxes that contain the pixel (scores for pixels not in any bounding box are 0). We use the bounding boxes to create a segmentation label $\ell \in \mathbb{R}^{P}$, where $\ell_p = 1$ if pixel $p$ is in any of the bounding boxes and $\ell_p = 0$ otherwise. Given pixel-level scores $s$ and pixel-level segmentation labels $\ell$, we can compute the AUROC, Average Precision, and Intersection Over Union (IOU) at varying pixel percentile thresholds for the ranking ordered by $s$ (see Section A.4).
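A minimal sketch of these localization metrics, assuming a region-level attention grid and a binary box mask; the grid size, image size, and percentile thresholds are placeholders, and scikit-learn is assumed for AUROC/average precision.

import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score, average_precision_score

def localization_metrics(attn_grid, box_mask, percents=(5, 10, 30)):
    # attn_grid: (h, w) attention over image regions; box_mask: (H, W) binary
    # mask built from the reference bounding boxes. Returns AUROC, average
    # precision, and IOU at the given pixel-percentile thresholds.
    H, W = box_mask.shape
    # Bilinear upsampling distributes region attention over pixels (GLoRIA case).
    scores = F.interpolate(
        torch.as_tensor(attn_grid, dtype=torch.float32)[None, None],
        size=(H, W), mode="bilinear", align_corners=False)[0, 0].numpy()
    s, lab = scores.ravel(), box_mask.ravel().astype(int)
    auroc = roc_auc_score(lab, s)
    avg_p = average_precision_score(lab, s)
    ious = {}
    for p in percents:
        # Predict the top p% of pixels ranked by score.
        pred = s >= np.percentile(s, 100 - p)
        ious[p] = (pred & (lab == 1)).sum() / ((pred | (lab == 1)).sum() + 1e-8)
    return auroc, avg_p, ious

# Example: a 19x19 attention map scored against a 224x224 box mask.
attn = np.random.rand(19, 19)
mask = np.zeros((224, 224)); mask[40:120, 60:180] = 1
print(localization_metrics(attn, mask))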
We also adopt a simple, interpretable metric to capture the accuracy of similarity scores assigned to pairs of images and texts. Specifically, we use a simpler version of the text retrieval task from Huang et al. (2021): we report the percentage of the time that the similarity between an image and a sentence from the corresponding report is greater than the similarity between the image and a random sentence taken from a different report in the dataset. This allows us to interpret 50% as the mean value of a totally random similarity measure.
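A sketch of this pairing accuracy, assuming some similarity function sim(image, sentence) is available (e.g., a global image-text similarity); the helper and variable names are our own.

import random

def pairing_accuracy(pairs, sim, seed=0):
    # pairs: list of (image, sentence) drawn from matching image/report pairs.
    # sim: callable returning a scalar similarity for (image, sentence).
    # Returns the fraction of instances where the matching sentence scores
    # higher than a sentence taken from a different report.
    rng = random.Random(seed)
    sentences = [s for _, s in pairs]
    correct = 0
    for i, (image, sent) in enumerate(pairs):
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        correct += sim(image, sent) > sim(image, sentences[j])
    return correct / len(pairs)

# Toy usage with a dummy random similarity; the result should hover near 0.5,
# the chance level referenced in the text.
pairs = [(f"img{i}", f"sentence {i}") for i in range(100)]
print(pairing_accuracy(pairs, lambda image, sent: random.random()))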
3 Are Alignment Weights Accurate?
We first use the metrics defined above to evaluate the pretrained, publicly available weights for GLoRIA (Huang et al., 2021). Table 1 reports the metrics used to evaluate localization on the gold split of the ImaGenome dataset.
AUROC scores are well over 50%, suggesting reasonable localization performance. IOU scores are small, which is expected, as target bounding boxes tend to be much larger than the actual regions of interest and serve more to detect errors when highlighted regions are far from where they should be; this is further supported by the relatively high average precision scores.4 However, while seemingly promising, our results below suggest that the attention patterns here may be less multimodal than one might expect.

4 In Section B.1, we address this with a modified evaluation that drops some large bounding boxes in the labels.

[Figure 2: Examples of each perturbation for a given instance. Text perturbations of the original sentence "Small right pleural effusion is stable." include Swap Left Right ("Small left pleural effusion is stable.") and Random Sentence ("The lungs are hyperinflated but clear of consolidation."); bounding-box perturbations include Shuffle in Report (equivalent to shuffling report sentences) and Random BBoxes. (Synth w/ Swapped Conditions example in Appendix 10.)]
We next focus on evaluating the degree to which these patterns actually reflect the associated text. To this end we perturb instances in ways that ought to shift the attention pattern (Section 3.1), e.g., by replacing "right" with "left" in the text. We then identify data subsets in Section 3.2 comprising "complex" instances, where we expect the image and text to be closely correlated at a local level.
3.1 Perturbations
Figure 2 shows examples of the perturbations, which include: swapping "left" with "right" (Swap Left Right); shuffling the target bounding boxes for sentences within the same report at random (Shuffle in Report); replacing sentences in a report with other sentences randomly drawn from the rest of the dataset (Random Sentences); replacing target bounding boxes with other bounding boxes randomly sampled from the dataset (Random BBoxes);5 and swapping the correct conditions in a synthetically created prompt with random conditions (Synth w/ Swapped Conditions). We include additional details about synthetic sentences and perturbations in Appendices A.3 and A.5.

5 Shuffle in Report bounding boxes will still correspond to valid and noteworthy anatomical regions, but Random BBoxes bounding boxes will not correspond to valid anatomical regions at all.
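As an illustration, the text perturbations amount to simple string-level edits of report sentences. The sketch below implements a Swap Left Right and a Random Sentence perturbation; the function names and the word-boundary regex are our own, not necessarily the paper's exact preprocessing.

import random
import re

def swap_left_right(sentence):
    # Swap the words "left" and "right" (case-insensitively), preserving capitalization.
    def repl(m):
        word = m.group(0)
        swapped = "left" if word.lower() == "right" else "right"
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b(left|right)\b", repl, sentence, flags=re.IGNORECASE)

def random_sentence(sentence, corpus, rng=random):
    # Replace a sentence with one drawn from elsewhere in the dataset.
    return rng.choice([s for s in corpus if s != sentence])

print(swap_left_right("Small right pleural effusion is stable."))
# -> "Small left pleural effusion is stable."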
Under these perturbations, we would expect a well-behaved model to shift its attention distribution over the image accordingly, resulting in a decrease in localization scores (overlap with the original reference bounding boxes). The Random BBoxes perturbation in particular targets the degree to which the attention relies specifically on the image modality, because here the "target" bounding boxes have been replaced with bounding boxes associated with random other images. By contrast, all other perturbations should measure the degree to which the model is sensitive to changes in the text (even Shuffle in Report, which is equivalent to shuffling the sentences in a report).

If attention maps reflect alignments with input texts, then under these perturbations one should expect large negative differences in performance (Δ metric) relative to observed performance using the unperturbed data. For all but Random BBoxes, if the performance does not change much (Δ metric ≈ 0), this suggests the attention maps are somewhat invariant to the text modality.
3.2 Subsets
We perform granular evaluations using specific data subsets, including: (1) Abnormal: instances with an abnormality; (2) One Lung: instances in which only one side of the chest X-ray (left or right) is referenced; and (3) Most Diverse Report BBoxes (MDRB): instances with a lot of diversity in the labels for sentences in the same report. Details are in Appendix A.6.
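As a rough illustration of the One Lung subset, a laterality filter over sentence text could look like the sketch below; this keyword heuristic is our simplification, and the paper's exact criterion is the one described in Appendix A.6.

def mentions_one_lung(sentence):
    # True if the sentence references exactly one side (left xor right);
    # a rough text-only proxy for the One Lung subset.
    text = sentence.lower()
    return ("left" in text) != ("right" in text)

print(mentions_one_lung("Small right pleural effusion is stable."))       # True
print(mentions_one_lung("Right greater than left bibasilar opacities."))  # False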
Intuitively, some of the perturbations in Section 3.1 should mainly affect certain subsets: Swap Left Right should most impact the One Lung subset, Shuffle in Report should mainly affect MDRB, and Random Sentences, Random BBoxes, and Synth w/ Swapped Conditions should primarily affect Abnormal examples.
3.3 Annotations for Post-hoc Evaluation
We enlist a domain expert (radiologist) to conduct annotations to complement our retrospective quantitative evaluations. We elicit judgements on a five-point Likert scale regarding the recall, precision, and "intuitiveness" of image highlights induced for text snippets.6 More details are in the Appendix, including annotation instructions (Section A.7) and a screenshot of the interface (Figure 11).

6 For recall and precision, points on the Likert scale are intended to correspond to buckets of 0-20, 20-40, 40-60, 60-80, and 80-100 percent, respectively.

[Figure 3: For each perturbation (Swap Left Right, Shuffle in Report, Random Sentences, Random BBoxes), the change in localization performance (Δ AUROC) of GLoRIA on the subset it should most affect (One Lung, MDRB, Abnormal). The only significant effect is from evaluating on random labels.]

Table 2: GLoRIA localization performance on subsets.
Subset     AUROC   Avg. P   IOU@5/10/30%
Abnormal   69.51   48.29    4.10/7.25/19.05
One Lung   65.48   38.68    4.43/8.05/20.54
MDRB       65.01   36.96    3.56/6.37/16.92
3.4 Results
We first evaluate performance on the subsets described in Section 3.2. This establishes a baseline with respect to which we can take differences observed under perturbations. We report results in Table 2. We observe that the model performs significantly worse on both the One Lung and MDRB subsets (which we view as "harder") in terms of AUROC and Average Precision, supporting this disaggregated evaluation.
Manual evaluation results of 3.1, 1.8, and 1.7 for recall, precision, and intuitiveness, respectively, indicate that GLoRIA produces unintuitive heatmaps that have poor precision and middling recall. Because GLoRIA was trained on the CheXpert dataset and we perform these evaluations on ImaGenome, the change in dataset may be one cause of poor performance; in Section 4 we report how retraining on the ImaGenome dataset affects these scores.
To measure the sensitivity of model attention to changes in the text, we report differences in localization performance in Figure 3. Specifically, this is the difference in model performance (Δ AUROC) achieved using (a) the original (unperturbed) sentences and (b) sentences perturbed as described in Section 3.1. We show results for each perturbation on the subsets it should most affect (Section 3.2), leaving the full results for the appendix (Figure 14).

The only real decrease in performance observed is under the Random BBoxes perturbation, which entails swapping out the target bounding box for an instance with one associated with some other instance (image). Performance decreasing here (and not for text perturbations) is consistent with the hypothesis that the attention map primarily reflects the image modality, but not the text. This is further supported by the observation that the model pays little mind to clear positional cue words such as "left" and "right" when constructing the attention map; witness the negligible drop in performance under the Swap Left Right perturbation. Finally, swapping in other sentences (even from different reports) yields almost no performance difference.
4 Can We Improve Alignments?
The above results indicate that image attention is
unintuitive and less sensitive to the text modality
than might be expected. Next we propose simple
methods to try to improve image/text alignments.
4.1 Models
All models build on the GLoRIA architecture except the baseline UNITER, for which we perform no modifications except to re-train from scratch on the MIMIC-CXR/Chest ImaGenome dataset.7 In the results, GLoRIA refers to weights fit using the CheXpert dataset, released by Huang et al. (2021). We do not have access to the reports associated with this dataset, so we do not use it for training or evaluation, but we do make comparisons to the original (released) GLoRIA model trained on it.

7 We re-train from scratch because: (1) unlike in the original model, we are not feeding in features from Fast-RCNN, but instead using flattened pixels from a bounding box; and (2) we would like a fair comparison to the GLoRIA variants, which are also re-trained from scratch.
We also retrain our own GLoRIA model on the MIMIC-CXR/ImaGenome dataset; we call this GLoRIA Retrained. While the two datasets are similar in size and content, CheXpert has many more positive cases of conditions than MIMIC-CXR/ImaGenome (8.86% of CheXpert images are labeled as having "No Findings"; in the ImaGenome dataset, reports associated with 21.80% of train images do not contain a sentence labeled "abnormal"). Given this difference in the number of positive cases, we train a Retrained+Abnormal model variant on the subset of MIMIC-CXR/ImaGenome sentence/image pairs featuring an "abnormal" sentence.
We also train models in which we adopt masking strategies intended to improve localization, hypothesizing that this might prevent over-reliance on text artifacts that might allow the model to ignore text that localizes. Our Retrained+Word Masking model randomly replaces words in the input with [MASK] tokens during training with 30% probability.8 For our Retrained+Clinical Masking model, we randomly swap clinical entity spans found using a SciSpaCy entity linker (Neumann et al., 2019) for [MASK] tokens with 50% probability.

8 We choose the high value of 30% here because, without allowing hyperparameter tuning of this probability, we would like to see a significant impact when comparing to the baseline.
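A minimal sketch of the word-masking variant, operating on tokenized sentences; the helper name and the list-of-strings representation are illustrative assumptions, while the 30% rate matches the text. The clinical-masking variant is analogous, but masks entity spans identified by a SciSpaCy entity linker instead of individual tokens.

import random

def mask_words(tokens, mask_token="[MASK]", prob=0.30, rng=random):
    # Randomly replace input tokens with [MASK] during training.
    # tokens: list of word-piece strings for one sentence.
    return [mask_token if rng.random() < prob else tok for tok in tokens]

print(mask_words(["small", "right", "pleural", "effusion", "is", "stable"]))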
Many sentences in a report will not refer to any particular region in the image. We therefore propose the Retrained+"No Attn" Token model, which concatenates a special "No Attn" token parameter vector to the set of local image embeddings just before attention is induced. This allows the model to attend to this special vector, rather than any of the local image embeddings, effectively indicating that there is no good match.
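A sketch of the "No Attn" token idea: a learned vector is appended to the local image embeddings so the softmax can route attention mass away from all real regions. The module, dimension, and temperature names here are our own, not GLoRIA's internals.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoAttnAlignment(nn.Module):
    def __init__(self, dim=768, tau=0.1):
        super().__init__()
        self.no_attn = nn.Parameter(torch.randn(1, dim))  # learned "no match" slot
        self.tau = tau

    def forward(self, t_l, v_l):
        # t_l: (N, D) local text embeddings; v_l: (M, D) local image embeddings.
        # Returns (N, M+1) attention; the last column is the "No Attn" slot.
        v_ext = torch.cat([v_l, self.no_attn], dim=0)         # (M+1, D)
        return F.softmax(t_l @ v_ext.T / self.tau, dim=-1)

# Usage: attention over M real regions plus one opt-out slot; attn[:, :-1]
# would be used for image alignment/heatmaps.
layer = NoAttnAlignment(dim=768)
attn = layer(torch.randn(12, 768), torch.randn(361, 768))     # (12, 362)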
We also consider a setting in which we assume a small amount of supervision (annotations linking image regions to texts). We finetune a model to produce high attention on the annotated regions of interest, i.e., we supervise attention. We employ an alignment loss $\mathcal{L}_{\text{alignment}}(s, \ell) = -\sum_p s_p \ell_p$ using the pixel-wise scores $s$ derived from the attention9 and the segmentation labels $\ell$ (Section 2.3). We train on a batch of 30 examples for up to 500 steps with early stopping on an additional 30-example validation set, using a patience of 25 steps. This might be viewed as "few-shot alignment", where we use a small number of annotated examples to try to make the model more interpretable by improving image and text alignments.

9 In this case, we also renormalize again after upsampling so that the pixel scores sum to 1.
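A sketch of this attention supervision, assuming pixel scores have already been upsampled and renormalized to sum to 1 (footnote 9); the negative sign reflects the assumption that the loss is minimized to push attention mass onto annotated pixels, and the toy shapes are placeholders.

import torch

def alignment_loss(pixel_scores, seg_label):
    # pixel_scores: (H, W) attention-derived scores, renormalized to sum to 1.
    # seg_label: (H, W) binary mask built from annotated bounding boxes.
    # Minimizing the negative overlap encourages high attention inside the boxes.
    return -(pixel_scores * seg_label).sum()

# Toy example on a 224x224 grid.
scores = torch.softmax(torch.randn(224 * 224), dim=0).reshape(224, 224)
label = torch.zeros(224, 224); label[50:150, 60:160] = 1.0
loss = alignment_loss(scores, label)   # backpropagated through the attention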
Finally, as a point of reference we train Retrained+Rand Sents in the same style as the Retrained model, except that all sentences are replaced with random sentences. This deprives the model of any meaningful training signal, which otherwise comes entirely through the pairing of images and texts. This variant provides a baseline to help contextualize results. For all models, we use early stopping with a patience of 10 epochs.10

10 For all models we report results on the last epoch before the early stopping condition is reached.
4.2 Results and Discussion
4.2.1 Localization Metrics
Table 3 might seem to imply that UNITER performs best. However, we emphasize that this is not