Meta-learning Pathologies from Radiology Reports using Variance Aware Prototypical Networks

Arijit Sehanobish, Kawshik Kannan, Nabila Abraham, Anasuya Das, Benjamin Odry (equal contribution)
Covera Health
New York City, NY
{arijit.sehanobish, kawshik.kannan, nabila.abraham, anasuya.das, benjamin.odry}@coverahealth.com

arXiv:2210.13979v2 [cs.LG] 10 Nov 2022
Abstract
Large pretrained Transformer-based language models like BERT and GPT have changed the landscape of Natural Language Processing (NLP). However, fine-tuning such models still requires a large number of training examples for each target task, so annotating multiple datasets and training these models on various downstream tasks becomes time-consuming and expensive. In this work, we propose a simple extension of Prototypical Networks for few-shot text classification. Our main idea is to replace the class prototypes by Gaussians and to introduce a regularization term that encourages the examples to be clustered near the appropriate class centroids. Experimental results show that our method outperforms various strong baselines on 13 public and 4 internal datasets. Furthermore, we use the class distributions as a tool for detecting potential out-of-distribution (OOD) data points during deployment.
1 Introduction
Pretrained Transformer-based language models (PLMs) have achieved great success on many NLP tasks (Devlin et al., 2019; Brown et al., 2020), but still need a large number of in-domain labeled examples for finetuning (Yogatama et al., 2019). Learning to learn (Lake et al., 2015a; Schmidhuber, 1987; Bengio et al., 1997) from limited supervision is an important problem with widespread application in areas where obtaining labeled data can be difficult or expensive. To that end, meta-learning methods have been proposed as effective solutions for few-shot learning (Hospedales et al., 2020). Current applications of such meta-learning methods have shown improved performance in few-shot learning for vision tasks, such as learning to classify new image classes within a similar dataset.
Namely, on classical few-shot image classification benchmarks, the training tasks are sampled from a "single" larger dataset (e.g., Omniglot (Lake et al., 2015b) and miniImageNet (Vinyals et al., 2016)), and the label space has the same task structure for all tasks. There has been a similar trend of such classical methods in NLP as well (Geng et al., 2019). In contrast, in text classification tasks, the set of source tasks available during training and target tasks during evaluation can range from sentiment analysis to grammatical acceptability judgment (Bansal et al., 2020a,b). In recent works (Wang et al., 2021), the authors use a range of different source tasks (different not only in terms of input domain, but also task structure, i.e., label semantics and number of labels) for meta-training and show successful performance on a wide range of downstream tasks. In spite of this success, meta-training on various source tasks is quite challenging: it requires resisting overfitting to particular source tasks, given the few-shot nature of the training, while still allowing task-specific adaptation, given the distinct nature of the tasks (Roelofs et al., 2019).
However, in medical NLP, collecting large numbers of diverse labeled datasets is difficult. In our institution, we collect high-quality labeled radiology reports (which are always used as held-out test data) and use them to train our internal annotators, who then annotate our unlabeled data. This training process is expensive and time-consuming. Our annotation process is described in section A. Thus a natural question is: if we have a large labeled dataset consisting of many classes, can we use it to meta-train a model that can be applied to a large number of downstream datasets where we have little to no training examples? This is a challenging problem, as the reports can be structured differently depending on the report type, and there can be substantial variation in writing style across radiologists from different institutions. Our main goal is to build a set of extensible pipelines that can generalize to new pathologies, typically in new sub-specialties, while also generalizing across different health systems. In addition, the exact definition of the pathologies and their severity can change depending on the clinical use case. This makes fully supervised approaches that rely on large labeled datasets expensive. Having few-shot capabilities allows us to annotate a handful of cases and rapidly expand the list of pathologies we can detect and classify. In addition, we can use our approach to generate pseudo-labels for rare pathologies and enrich our validation and test sets for annotation by an in-house clinical team. Lastly, our approach can be extended to support patient search and define custom cohorts of patients.
Our contributions in this work are the following:
(1) We develop a novel loss function that extends the vanilla prototypical networks and introduces a regularization term that encourages tight clustering of examples near the class prototypes.
(2) We meta-train our models on a large labeled dataset of shoulder MRI reports (single domain) and show good performance on 4 diverse downstream classification tasks on radiology reports on the knee, cervical spine, and chest. In addition to our internal datasets, we show superior performance of our method on 13 public benchmarks over well-known methods like Leopard. Our model is very simple to train, easy to deploy unlike gradient-based methods, and requires just a few additional lines of code over a vanilla prototypical network trainer.
(3) We deploy our system and use the dataset statistics to flag out-of-distribution (OOD) cases.
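The OOD flagging in contribution (3) can be illustrated with a simple distance threshold: a query whose variance-aware distance to its nearest class Gaussian exceeds a cutoff is flagged for review. This is a hedged sketch only; the function name, the threshold, and the exact flagging rule are our assumptions, not details from the deployed system.

```python
import numpy as np

def flag_ood(query_emb: np.ndarray, means: np.ndarray,
             variances: np.ndarray, threshold: float) -> bool:
    # Distance to each class Gaussian with diagonal covariance:
    # d^2 = ||m_c - q||^2 + Tr(Sigma_c).
    d2 = ((means - query_emb) ** 2).sum(axis=1) + variances.sum(axis=1)
    # If even the nearest class Gaussian is farther than `threshold`,
    # treat the query as potentially out-of-distribution.
    return bool(d2.min() > threshold)
```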
2 Related Work
There are three common approaches to meta-learning: metric-based, model-based, and optimization-based. Model-agnostic meta-learning (MAML) (Finn et al., 2017) is an optimization-based approach to meta-learning which is agnostic to the model architecture and task specification. Over the years, several variants of the method have shown that it is an ideal candidate for learning to learn from diverse tasks (Nichol et al., 2018; Raghu et al., 2019; Bansal et al., 2020b). However, to solve a new task, MAML-type methods require training a new classification layer for the task. In contrast, metric-based approaches, such as prototypical networks (Vinyals et al., 2016; Snell et al., 2017), being non-parametric in nature, can handle a varied number of classes and thus can be easily deployed. Given the simple nature of prototypical networks, a lot of work has been done to improve them (Allen et al., 2019; Zhang et al., 2019; Ding et al., 2022; Wang et al., 2021).
Prototypical networks usually construct a class prototype (mean) from the support vectors to describe each class and, given a query example, assign the class whose prototype is closest to the query vector. In (Allen et al., 2019), the authors use a mixture of Gaussians to describe the class-conditional distribution, and in (Zhang et al., 2019), the authors try to model an unknown general class distribution. In (Ding et al., 2022), the authors use spherical Gaussians and a KL-divergence-type function between the Gaussians to compute the function d in equation 2. However, the function used by these authors is not a true metric, i.e., it does not satisfy the triangle inequality. The triangle inequality is implicitly important since we use this function as a form of distance that we optimize, so it makes sense to use a true metric. In this work, we replace it by the Wasserstein distance, which is a true metric, and add a regularization term that encourages the L2 norm of the covariance matrices to be small, encouraging the class examples to be clustered close to the centroid. One of our main reasons for working with Gaussians is the closed-form formula for the Wasserstein distance between them.
Few-shot learning (FSL) in the medical domain has mostly focused on computer vision (Singh et al., 2021). Only a few works have applied FSL in medical NLP (Ge et al., 2022), and most of those focus on different tasks on MIMIC-III (Johnson et al., 2016), which is a single-domain dataset (patients from the ICU and one hospital system). To the best of our knowledge, ours is the first study to successfully apply FSL to a diverse set of medical datasets (diverse in terms of tasks and patient populations).
3 Datasets
All our internal datasets are MRI radiology reports detailing various pathologies in different body parts. Our models are meta-trained on a dataset of shoulder pathologies collected from 74 unique and de-identified institutions in the United States. 60 labels are chosen for training and 20 novel labels are chosen for validation. The number of training labels is similar to some well-known image datasets (Lake et al., 2015b; Vinyals et al., 2016; Wah et al., 2011). This diverse dataset has a rich label space detailing multiple structures in the shoulder, granular pathologies, and their severity levels in each structure. The relationship between the granularity/severity of these pathologies at different structures can be leveraged for other pathologies in different body parts and may lead to successful transfer to various downstream tasks. The labels are split such that all pathologies in a given structure appear in either training or validation, but not both. More details about the label space can be found in section B. Figure 1 and table 1 show the distribution of labels, and an example of this dataset can be found in figure 4.

Figure 1: Histogram showing the label distribution in (left) train and (right) validation dataset.

Our meta-learner is applied to 4 downstream binary classification tasks spanning different sub-specialties (cancer screening, musculoskeletal radiology, and neuro-radiology) that are both common and clinically important. The statistics for each task are given in table 2:
(1) High-risk cancer screening for lung nodules: the Fleischner guidelines (Nair et al., 2018) bucket patients at high risk of lung cancer and requiring follow-up imaging immediately or within 3 months as High Risk; we consider all other patients Low Risk. (2) Complete Anterior Cruciate Ligament (ACL) tear (Grade 3) vs. not complete ACL tear. (3) Acute ACL tear (MRI examination performed within 6 weeks of injury and typified by the presence of diffuse or focal increased signal within the ligament) vs. not acute ACL tear (Dimond et al., 1998). (4) Severe vs. not severe neural foraminal stenosis in the cervical spine, as severe foraminal stenosis may indicate nerve impingement, which is clinically significant. An acute ACL tear refers to the age of the tear/injury, whereas a complete tear refers to the integrity of the ligament. Our testing datasets are diverse and sampled from different institutions: the knee, lung, and cervical datasets are sampled from 50, 4, and 65 institutions respectively, and our annotation process is described in Appendix A. Examples of these datasets can be found in figure 10 (knee), figure 6 (lung), and figure 8 (cervical).

Split        Number of examples   Min   Max    Average
Train        34595                79    6379   567
Validation   5754                 44    1138   303

Table 1: Statistics of our meta-training and meta-validation dataset, where min/max/average refer to the min/max/average number of examples per label.
Task                         Validation Distribution            Test Distribution
Lung Nodule                  Low Risk: 233, High Risk: 30       Low Risk: 347, High Risk: 46
Knee ACL Acute Tear          Normal: 258, Acute Tear: 48        Normal: 439, Acute Tear: 93
Knee ACL Complete Tear       Normal: 263, Complete Tear: 44     Normal: 429, Complete Tear: 103
Neural Foraminal Stenosis    Normal: 215, Abnormal: 43          Normal: 789, Abnormal: 91

Table 2: Statistics of our downstream testing datasets
4 Workflow

Figure 2: Overview of our workflow. A report is passed through a report segmenter, which splits it into sentences and extracts the relevant portion of the text for downstream classification. The relevant text is passed through our model, and we use the pre-computed prototypes and class variances to assign a label to the query point.

Our workflow consists of the following parts. A report is first de-identified according to HIPAA regulations and passed through a sentence parser (e.g., spaCy (Honnibal et al., 2020)) that splits the report into sentences. In the shoulder dataset, each of these sentences is labeled with the appropriate structure and severity label, and we filter out sentences that do not have such a label. We first train a meta-learner in an episodic fashion on this dataset and choose the best model based on meta-validation accuracy.
For our downstream tasks, we use a body-part-specific custom data processor to collect sentences related to a given structure (the ACL in the knee, different vertebrae in the cervical spine, the entire impression section for lung reports) and concatenate them to create a paragraph describing all the pathologies in the structure of interest. A detailed description of the preprocessing for different body parts is presented in Appendix C. The concatenated text from the validation set of each task is passed to our trained meta-learner to generate the relevant class statistics (mean and variance). We then perform pathology classification on the test set using our trained meta-learner and the saved class statistics. The downstream tasks are similar to the shoulder task in the sense that pathology classification is performed on a sequence of sentences that all pertain to the same anatomical structure. Thus our approach needs to learn the language that describes the severity of a pathology for a specific anatomical structure.
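To make the workflow concrete, its core steps (sentence splitting, structure-specific filtering, and variance-aware nearest-class assignment from saved statistics) can be sketched as below. This is a minimal illustration, not our production pipeline: the regex splitter stands in for spaCy, the keyword filter stands in for our custom data processors, and the embedding model is omitted (the classifier takes a pre-computed query embedding).

```python
import re
import numpy as np

def split_sentences(report: str) -> list[str]:
    # Naive splitter standing in for a spaCy sentence parser.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]

def extract_structure_text(report: str, keywords: tuple[str, ...]) -> str:
    # Keep only sentences mentioning the structure of interest
    # (e.g. "ACL" for knee reports) and concatenate them.
    sents = [s for s in split_sentences(report)
             if any(k.lower() in s.lower() for k in keywords)]
    return " ".join(sents)

def classify(query_emb: np.ndarray, means: np.ndarray,
             variances: np.ndarray) -> int:
    # Assign the class minimizing the variance-aware distance of
    # equation (3): d^2 = ||m_c - q||^2 + Tr(Sigma_c), diagonal Sigma_c.
    d2 = ((means - query_emb) ** 2).sum(axis=1) + variances.sum(axis=1)
    return int(d2.argmin())
```

The class means and variances would be the statistics pre-computed once from each task's validation set, then reused at test time.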
We would like to shed some light on the complexity of the language we encounter. Since our dataset is sourced from multiple health systems, and not all reports follow a standard structure, there is a large amount of variation in the language describing the same diagnosis. For example, a severe tear can be referred to as a rupture, or only the size of a nodule may be mentioned without specifying that it is low risk (see Appendix C for more examples). Furthermore, most of our pipelines attempt to classify the different severities of a given pathology, and the language describing severity can vary. While it might be possible to construct a rule-based system to extract the diagnoses and severities we are interested in, it would be difficult to generalize as we expand to more diagnoses as well as to new health systems.
5 Prototypical Networks

Prototypical Networks, or ProtoNets (Snell et al., 2017), use an embedding function f_\theta to encode each input into an M-dimensional feature vector. A prototype is defined for every class c \in \mathcal{L} as the mean of the set of embedded support data samples S_c for the given class, i.e.,

v_c = \frac{1}{|S_c|} \sum_{(x_i, y_i) \in S_c} f_\theta(x_i). \qquad (1)

The distribution over classes for a given test input x is a softmax over the negative distances between the test data embedding and the prototype vectors:

P(y = c \mid x) = \mathrm{softmax}\big(-d(f_\theta(x), v_c)\big) = \frac{\exp\big(-d(f_\theta(x), v_c)\big)}{\sum_{c' \in \mathcal{L}} \exp\big(-d(f_\theta(x), v_{c'})\big)}, \qquad (2)

where d can be any (differentiable) distance function. The loss function is the negative log-likelihood: \mathcal{L}(\theta) = -\log P_\theta(y = c \mid x).

ProtoNets are simple and easy to train and deploy. However, the mean alone is used to capture the entire class-conditional distribution, thus losing a lot of information about the underlying distribution. A lot of work (Ding et al., 2022; Allen et al., 2019; Zhang et al., 2019) has focused on improving ProtoNets by taking this observation into account. We extend ProtoNets by incorporating the variance (2nd moment) of the distribution and using a distributional distance, the 2-Wasserstein metric, directly generalizing the vanilla ProtoNets.
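As a concrete illustration, the prototype computation of equation (1) and the softmax over negative distances of equation (2) can be sketched as follows, using squared Euclidean distance as d. This is a minimal NumPy sketch; the function names and array shapes are our own, not from the paper's code.

```python
import numpy as np

def prototypes(support_emb: np.ndarray, support_y: np.ndarray,
               n_classes: int) -> np.ndarray:
    # Equation (1): each prototype is the mean of the embedded
    # support examples belonging to its class.
    return np.stack([support_emb[support_y == c].mean(axis=0)
                     for c in range(n_classes)])

def class_probs(query_emb: np.ndarray, protos: np.ndarray) -> np.ndarray:
    # Equation (2): softmax over negative squared Euclidean distances
    # between the query embedding and each prototype.
    d2 = ((protos - query_emb) ** 2).sum(axis=1)
    logits = -d2
    e = np.exp(logits - logits.max())  # subtract max for stability
    return e / e.sum()
```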
5.1 Variance Aware ProtoNets

In this work, we model each class-conditional distribution as a Gaussian. The main question then is: how do we match a query example with a distribution? The simplest option is to treat the query example as a Dirac distribution. With that formulation in mind, recall that the Wasserstein-Bures metric between Gaussians (m_i, \Sigma_i), i = 1, 2, is given by

d^2 = \|m_1 - m_2\|^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2 \big( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \big)^{1/2} \right).

Given (x_i, y_i) \in S_c, where S_c is the support set of examples belonging to class c, we compute the mean m_c and covariance matrix \Sigma_c; the Wasserstein distance between this Gaussian and a query vector q (i.e., a Dirac) then boils down to

d^2 = \|m_c - q\|^2 + \mathrm{Tr}(\Sigma_c). \qquad (3)

The above formula shows that we can simplify our conditional distribution to a Gaussian with a diagonal covariance matrix. This brings the space complexity of storing the covariance matrix down from O(n^2) to O(n). Note that this is a direct generalization of the vanilla prototypical networks, as the vanilla prototypical networks can be interpreted as computing the Wasserstein distance (i.e., the simple L2 distance) between two Dirac distributions (the mean of the conditional distribution and the query sample). We also propose another variant of the above
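The diagonal-covariance distance of equation (3), together with a regularizer on the norm of the covariances as described in section 2, can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the regularization weight reg and the exact form of the regularizer are our own choices for illustration, not values from the paper.

```python
import numpy as np

def gaussian_stats(support_emb: np.ndarray, support_y: np.ndarray,
                   n_classes: int):
    # Per-class mean and diagonal covariance: O(n) storage per class.
    means = np.stack([support_emb[support_y == c].mean(axis=0)
                      for c in range(n_classes)])
    variances = np.stack([support_emb[support_y == c].var(axis=0)
                          for c in range(n_classes)])
    return means, variances

def wasserstein2_dirac(query: np.ndarray, mean: np.ndarray,
                       var: np.ndarray) -> float:
    # Equation (3): squared 2-Wasserstein distance between a diagonal
    # Gaussian N(mean, diag(var)) and a Dirac mass at the query:
    # d^2 = ||m_c - q||^2 + Tr(Sigma_c).
    return float(((mean - query) ** 2).sum() + var.sum())

def loss(query: np.ndarray, label: int, means: np.ndarray,
         variances: np.ndarray, reg: float = 0.1) -> float:
    # Negative log-likelihood over the softmax of negative distances,
    # plus a regularizer on the L2 norm of the (diagonal) covariances
    # to encourage tight clustering. `reg` is an assumed hyperparameter.
    d2 = np.array([wasserstein2_dirac(query, m, v)
                   for m, v in zip(means, variances)])
    logits = -d2
    log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    return float(-log_probs[label] + reg * np.sqrt((variances ** 2).sum()))
```

Setting all variances to zero recovers the vanilla ProtoNet distance, reflecting the generalization argument above.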