
stance (image). Performance decreasing here (and not for text perturbations) is consistent with the hypothesis that the attention map primarily reflects the image modality, but not the text. This is further supported by the observation that the model pays little attention to clear positional cue words such as “left” and “right” when constructing the attention map: performance drops only negligibly under the Swap Left Right perturbation. Finally, swapping in other sentences (even from different reports) yields almost no difference in performance.
4 Can We Improve Alignments?
The above results indicate that image attention is
unintuitive and less sensitive to the text modality
than might be expected. Next we propose simple
methods to try to improve image/text alignments.
4.1 Models
All models build on the GLoRIA architecture except the baseline UNITER, for which we make no modifications other than re-training from scratch on the MIMIC-CXR/Chest ImaGenome dataset.7 In the results, GLoRIA refers to weights fit using the CheXpert dataset, released by Huang et al. (2021). We do not have access to the reports associated with this dataset, so we do not use it for training or evaluation, but we do make comparisons to the original (released) GLoRIA model trained on it.
We also retrain our own GLoRIA model on the MIMIC-CXR/ImaGenome dataset; we call this GLoRIA Retrained. While the two datasets are similar in size and content, CheXpert has many more positive cases of conditions than MIMIC-CXR/ImaGenome (8.86% of CheXpert images are labeled as having “No Findings”; in the ImaGenome dataset, reports associated with 21.80% of train images do not contain a sentence labeled “abnormal”). Given this difference in the number of positive cases, we train a Retrained+Abnormal model variant on the subset of MIMIC-CXR/ImaGenome sentence/image pairs featuring an “abnormal” sentence.
We also train models in which we adopt masking strategies intended to improve localization, hypothesizing that this might prevent over-reliance on textual artifacts that allow the model to ignore text that localizes. Our
Retrained+Word Masking model randomly replaces words in the input with [MASK] tokens during training with 30% probability.8 For our Retrained+Clinical Masking model, we randomly swap clinical entity spans found using a SciSpaCy entity linker (Neumann et al., 2019) for [MASK] tokens with 50% probability.
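To make the two masking schemes concrete, the sketch below shows one way they could be applied to a sentence before it is passed to the text encoder. This is a minimal illustration under our own assumptions: the function names and tokenization interface are ours, and we rely only on the entity spans produced by a SciSpaCy pipeline, not the exact linker configuration used here.

```python
import random

MASK = "[MASK]"

def word_mask(tokens, p=0.30):
    # Retrained+Word Masking: each word is independently replaced
    # with [MASK] with probability p.
    return [MASK if random.random() < p else t for t in tokens]

def clinical_mask(sentence, nlp, p=0.50):
    # Retrained+Clinical Masking: clinical entity spans detected by a
    # SciSpaCy pipeline (`nlp`, e.g. en_core_sci_sm with an entity
    # linker added) are swapped for [MASK] with probability p.
    doc = nlp(sentence)
    out, cursor = [], 0
    for ent in doc.ents:
        out.append(sentence[cursor:ent.start_char])
        out.append(MASK if random.random() < p else ent.text)
        cursor = ent.end_char
    out.append(sentence[cursor:])
    return "".join(out)
```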
Many sentences in a report will not refer to any particular region in the image. We therefore propose the Retrained+“No Attn” Token model, which concatenates a special “No Attn” token parameter vector to the set of local image embeddings just before attention is induced. This allows the model to attend to this special vector rather than any of the local image embeddings, effectively indicating that there is no good match.
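A minimal sketch of this mechanism, assuming a GLoRIA-style attention step in which a text embedding attends over local image embeddings; the module name, shapes, and dot-product scoring are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class LocalAttentionWithNoAttn(nn.Module):
    # Attention over local image embeddings plus a learned "No Attn" slot.
    def __init__(self, dim):
        super().__init__()
        # Learned parameter vector concatenated to the local image embeddings.
        self.no_attn_token = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, text_emb, image_locals):
        # text_emb:     (B, dim)     one embedding per word/sentence
        # image_locals: (B, R, dim)  R local image region embeddings
        B = image_locals.size(0)
        keys = torch.cat(
            [image_locals, self.no_attn_token.expand(B, -1, -1)], dim=1
        )  # (B, R + 1, dim)
        scores = torch.einsum("bd,brd->br", text_emb, keys)
        # The final slot lets the model place attention mass on "no good
        # match" instead of spreading it over image regions.
        return scores.softmax(dim=-1)  # (B, R + 1)
```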
We also consider a setting in which we assume a
small amount of supervision (annotations linking
image regions to texts). We finetune a model to
produce high attention on the annotated regions of
interest, i.e., we supervise attention. We employ an
alignment loss $\mathcal{L}_{\text{alignment}}(s, \ell) = \sum_p s_p \ell_p$ using the pixel-wise scores $s$ derived from the attention9 and the segmentation labels $\ell$ (Section 2.3). We
train on a batch of 30 examples for up to 500 steps
with early stopping on an additional 30-example
validation set using a patience of 25 steps. This
might be viewed as “few-shot alignment”, where
we use a small number of annotated examples to try
to make the model more interpretable by improving
image and text alignments.
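As a sketch, the alignment objective could be computed as below, assuming the attention map has already been upsampled to pixel resolution and renormalized to sum to one (footnote 9); the tensor names are ours, and the sign convention in the final comment is our reading of “produce high attention on the annotated regions,” not a detail stated in the text.

```python
import torch

def alignment_score(attn, labels):
    # attn:   (B, H, W) pixel-wise attention scores s (upsampled and
    #         renormalized to sum to 1 per example)
    # labels: (B, H, W) binary segmentation labels l for the regions
    #         of interest
    # Returns the batch mean of sum_p s_p * l_p.
    return (attn * labels).flatten(1).sum(dim=1).mean()

# Our reading: to encourage high attention on the labeled regions, the
# optimizer would maximize this quantity, e.g. by minimizing its negative:
# loss = -alignment_score(attn, labels)
```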
Finally, as a point of reference we train Retrained+Rand Sents in the same style as the Retrained model except that all sentences are replaced with random sentences. This deprives the model of any meaningful training signal, which otherwise comes entirely through the pairing of images and texts. This variant provides a baseline to help contextualize results. For all models, we use early stopping with a patience of 10 epochs.10
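For completeness, a generic sketch of patience-based early stopping of the kind used here; the validation metric, epoch budget, and function names are our assumptions.

```python
def train_with_early_stopping(model, run_epoch, validate,
                              patience=10, max_epochs=100):
    # Stop once the validation loss has failed to improve for
    # `patience` consecutive epochs.
    best_loss, epochs_since_best = float("inf"), 0
    for epoch in range(max_epochs):
        run_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, epochs_since_best = val_loss, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break
    return model
```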
4.2 Results and Discussion
4.2.1 Localization Metrics
Table 3 might seem to imply that UNITER performs best. However, we emphasize that this is not
7 We re-train from scratch because: (1) unlike in the original model, we are not feeding in features from Fast-RCNN, but instead using flattened pixels from a bounding box; and (2) we would like a fair comparison to the GLoRIA variants, which are also re-trained from scratch.
8 We choose the high value of 30% because, without hyperparameter tuning of this probability, we would like to see a significant impact when comparing to the baseline.
9 In this case, we also renormalize again after upsampling so that the pixel scores sum to 1.
10 For all models we report results on the last epoch before the early stopping condition is reached.