Meta-learning Pathologies from Radiology Reports using Variance Aware Prototypical Networks

Arijit Sehanobish, Kawshik Kannan, Nabila Abraham, Anasuya Das, Benjamin Odry (equal contribution)
Covera Health
New York City, NY
{arijit.sehanobish, kawshik.kannan, nabila.abraham, anasuya.das, benjamin.odry}@coverahealth.com

arXiv:2210.13979v2 [cs.LG] 10 Nov 2022
Abstract
Large pretrained Transformer-based language models like BERT and GPT have changed the landscape of Natural Language Processing (NLP). However, fine-tuning such models still requires a large number of training examples for each target task, so annotating multiple datasets and training these models on various downstream tasks becomes time-consuming and expensive. In this work, we propose a simple extension of Prototypical Networks for few-shot text classification. Our main idea is to replace the class prototypes by Gaussians and to introduce a regularization term that encourages the examples to be clustered near the appropriate class centroids. Experimental results show that our method outperforms various strong baselines on 13 public and 4 internal datasets. Furthermore, we use the class distributions as a tool for detecting potential out-of-distribution (OOD) data points during deployment.
1 Introduction
Pretrained Transformer-based language models (PLMs) have achieved great success on many NLP tasks (Devlin et al., 2019; Brown et al., 2020), but still need a large number of in-domain labeled examples for finetuning (Yogatama et al., 2019). Learning to learn (Lake et al., 2015a; Schmidhuber, 1987; Bengio et al., 1997) from limited supervision is an important problem with widespread application in areas where obtaining labeled data can be difficult or expensive. To that end, meta-learning methods have been proposed as effective solutions for few-shot learning (Hospedales et al., 2020). Current applications of such meta-learning methods have shown improved performance in few-shot learning for vision tasks, such as learning to classify new image classes within a similar dataset.
Namely, on classical few-shot image classification benchmarks, the training tasks are sampled from a "single" larger dataset (e.g., Omniglot (Lake et al., 2015b) and miniImageNet (Vinyals et al., 2016)), and the label space has the same task structure for all tasks. There has been a similar trend of such classical methods in NLP as well (Geng et al., 2019). In contrast, in text classification tasks, the set of source tasks available during training and target tasks during evaluation can range from sentiment analysis to grammatical acceptability judgment (Bansal et al., 2020a,b). In recent works (Wang et al., 2021), the authors use a range of different source tasks (different not only in terms of input domain, but also task structure, i.e., label semantics and number of labels) for meta-training and show successful performance on a wide range of downstream tasks. In spite of this success, meta-training on various source tasks is quite challenging: it requires resisting overfitting to particular source tasks, given the few-shot nature of the training, while still allowing task-specific adaptation, given the distinct nature of the tasks (Roelofs et al., 2019).
However, in medical NLP, collecting large numbers of diverse labeled datasets is difficult. In our institution, we collect high-quality labeled radiology reports (which are always used as held-out test data) and use them to train our internal annotators, who then annotate our unlabeled data. This training process is expensive and time-consuming. Our annotation process is described in section A. Thus a natural question is: if we have a large labeled dataset consisting of many classes, can we use it to meta-train a model that can be applied to a large number of downstream datasets where we have little to no training examples? This is a challenging problem, as the reports can be structured differently depending on the report type, and there can be substantial variation in writing style across radiologists from different institutions. Our main goal is to build a set of extensible pipelines that can generalize to new pathologies, typically in new sub-specialties, while also generalizing across different health systems. In addition, the exact definition of the pathologies and their severity can change depending on the clinical use case. This makes fully supervised approaches that rely on large labeled datasets expensive. Having few-shot capabilities allows us to annotate a handful of cases and rapidly expand the list of pathologies we can detect and classify. In addition, we can use our approach to generate pseudo-labels for rare pathologies and enrich our validation and test sets for annotation by an in-house clinical team. Lastly, our approach can be extended to support patient search and define custom cohorts of patients.
Our contributions in this work are the following:
(1) We develop a novel loss function that extends the vanilla prototypical networks and introduces a regularization term that encourages tight clustering of examples near the class prototypes.
(2) We meta-train our models on a large labeled dataset of shoulder MRI reports (single domain) and show good performance on 4 diverse downstream classification tasks on radiology reports on the knee, cervical spine, and chest. In addition to our internal datasets, we show superior performance of our method on 13 public benchmarks over well-known methods like Leopard. Our model is very simple to train, easy to deploy unlike gradient-based methods, and requires just a few additional lines of code over a vanilla prototypical network trainer.
(3) We deploy our system and use the dataset statistics to flag out-of-distribution (OOD) cases.
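The OOD flagging in contribution (3) can be illustrated with a simple distance threshold: a query whose variance-aware distance to its nearest class Gaussian exceeds a cutoff is flagged for review. This is a hedged sketch only; the function name, the threshold, and the exact flagging rule are our assumptions, not details from the deployed system.

```python
import numpy as np

def flag_ood(query_emb: np.ndarray, means: np.ndarray,
             variances: np.ndarray, threshold: float) -> bool:
    # Distance to each class Gaussian with diagonal covariance:
    # d^2 = ||m_c - q||^2 + Tr(Sigma_c).
    d2 = ((means - query_emb) ** 2).sum(axis=1) + variances.sum(axis=1)
    # If even the nearest class Gaussian is farther than `threshold`,
    # treat the query as potentially out-of-distribution.
    return bool(d2.min() > threshold)
```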
2 Related Work
There are three common approaches to meta-learning: metric-based, model-based, and optimization-based. Model-agnostic meta-learning (MAML) (Finn et al., 2017) is an optimization-based approach to meta-learning which is agnostic to the model architecture and task specification. Over the years, several variants of the method have shown that it is an ideal candidate for learning to learn from diverse tasks (Nichol et al., 2018; Raghu et al., 2019; Bansal et al., 2020b). However, to solve a new task, MAML-type methods require training a new classification layer for the task. In contrast, metric-based approaches, such as prototypical networks (Vinyals et al., 2016; Snell et al., 2017), being non-parametric in nature, can handle a varied number of classes and thus can be easily deployed. Given the simple nature of prototypical networks, a lot of work has been done to improve them (Allen et al., 2019; Zhang et al., 2019; Ding et al., 2022; Wang et al., 2021).
Prototypical networks usually construct a class prototype (mean) from the support vectors to describe each class and, given a query example, assign the class whose prototype is closest to the query vector. In (Allen et al., 2019), the authors use a mixture of Gaussians to describe the class-conditional distribution, and in (Zhang et al., 2019), the authors try to model an unknown general class distribution. In (Ding et al., 2022), the authors use spherical Gaussians and a KL-divergence-type function between the Gaussians to compute the function d in equation 2. However, the function used by these authors is not a true metric, i.e., it does not satisfy the triangle inequality. The triangle inequality is implicitly important since we use this function as a form of distance that we optimize, so it makes sense to use a true metric. In this work, we replace it by the Wasserstein distance, which is a true metric, and add a regularization term that encourages the L2 norm of the covariance matrices to be small, encouraging the class examples to be clustered close to the centroid. One of our main reasons for working with Gaussians is the closed-form formula for the Wasserstein distance between them.
Few-shot learning (FSL) in the medical domain has mostly focused on computer vision (Singh et al., 2021). Only a few works have applied FSL in medical NLP (Ge et al., 2022), and most of those focus on different tasks on MIMIC-III (Johnson et al., 2016), which is a single-domain dataset (patients from the ICU and one hospital system). To the best of our knowledge, ours is the first study to successfully apply FSL to a diverse set of medical datasets (diverse in terms of tasks and patient populations).
3 Datasets
All our internal datasets are MRI radiology reports detailing various pathologies in different body parts. Our models are meta-trained on a dataset of shoulder pathologies collected from 74 unique and de-identified institutions in the United States. 60 labels are chosen for training and 20 novel labels are chosen for validation. The number of training labels is similar to some well-known image datasets (Lake et al., 2015b; Vinyals et al., 2016; Wah et al., 2011). This diverse dataset has a rich label space detailing multiple structures in the shoulder, granular pathologies, and their severity levels in each structure. The relationship between the granularity/severity of these pathologies at different structures can be leveraged for other pathologies in different body parts and may lead to successful transfer to various downstream tasks. The labels are split such that all pathologies in a given structure appear in either training or validation, but not both. More details about the label space can be found in section B. Figure 1 and table 1 show the distribution of labels, and an example of this dataset can be found in figure 4.

Figure 1: Histogram showing the label distribution in (left) train and (right) validation dataset.

Our meta-learner is applied to 4 downstream binary classification tasks spanning different sub-specialties (cancer screening, musculoskeletal radiology, and neuro-radiology) that are both common and clinically important. The statistics for each task are given in table 2:
(1) High-risk cancer screening for lung nodules: the Fleischner guidelines (Nair et al., 2018) bucket patients at high risk of lung cancer and requiring follow-up imaging immediately or within 3 months as High Risk; we consider all other patients Low Risk. (2) Complete Anterior Cruciate Ligament (ACL) tear (Grade 3) vs. not complete ACL tear. (3) Acute ACL tear (MRI examination performed within 6 weeks of injury and typified by the presence of diffuse or focal increased signal within the ligament) vs. not acute ACL tear (Dimond et al., 1998). (4) Severe vs. not severe neural foraminal stenosis in the cervical spine, as severe foraminal stenosis may indicate nerve impingement, which is clinically significant. An acute ACL tear refers to the age of the tear/injury, whereas a complete tear refers to the integrity of the ligament. Our testing datasets are diverse and sampled from different institutions: the knee, lung, and cervical datasets are sampled from 50, 4, and 65 institutions respectively, and our annotation process is described in Appendix A. Examples of these datasets can be found in figure 10 (knee), figure 6 (lung), and figure 8 (cervical).

Split        Number of examples   Min   Max    Average
Train        34595                79    6379   567
Validation   5754                 44    1138   303

Table 1: Statistics of our meta-training and meta-validation dataset, where min/max/average refer to the min/max/average number of examples per label.
Task                         Validation Distribution            Test Distribution
Lung Nodule                  Low Risk: 233, High Risk: 30       Low Risk: 347, High Risk: 46
Knee ACL Acute Tear          Normal: 258, Acute Tear: 48        Normal: 439, Acute Tear: 93
Knee ACL Complete Tear       Normal: 263, Complete Tear: 44     Normal: 429, Complete Tear: 103
Neural Foraminal Stenosis    Normal: 215, Abnormal: 43          Normal: 789, Abnormal: 91

Table 2: Statistics of our downstream testing datasets
4 Workflow

Figure 2: Overview of our workflow. A report is passed through a report segmenter, which splits it into sentences and extracts the relevant portion of the text for downstream classification. The relevant text is passed through our model, and we use the pre-computed prototypes and class variances to assign a label to the query point.

Our workflow consists of the following parts. A report is first de-identified according to HIPAA regulations and passed through a sentence parser (e.g., spaCy (Honnibal et al., 2020)) that splits the report into sentences. In the shoulder dataset, each of these sentences is labeled with the appropriate structure and severity label, and we filter out sentences that do not have such a label. We first train a meta-learner in an episodic fashion on this dataset and choose the best model based on meta-validation accuracy.
For our downstream tasks, we use a body-part-specific custom data processor to collect sentences related to a given structure (the ACL in the knee, different vertebrae in the cervical spine, the entire impression section for lung reports) and concatenate them to create a paragraph describing all the pathologies in the structure of interest. A detailed description of the preprocessing for different body parts is presented in Appendix C. The concatenated text from the validation set of each task is passed to our trained meta-learner to generate the relevant class statistics (mean and variance). We then perform pathology classification on the test set using our trained meta-learner and the saved class statistics. The downstream tasks are similar to the shoulder task in the sense that pathology classification is performed on a sequence of sentences that all pertain to the same anatomical structure. Thus our approach needs to learn the language that describes the severity of a pathology for a specific anatomical structure.
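To make the workflow concrete, its core steps (sentence splitting, structure-specific filtering, and variance-aware nearest-class assignment from saved statistics) can be sketched as below. This is a minimal illustration, not our production pipeline: the regex splitter stands in for spaCy, the keyword filter stands in for our custom data processors, and the embedding model is omitted (the classifier takes a pre-computed query embedding).

```python
import re
import numpy as np

def split_sentences(report: str) -> list[str]:
    # Naive splitter standing in for a spaCy sentence parser.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]

def extract_structure_text(report: str, keywords: tuple[str, ...]) -> str:
    # Keep only sentences mentioning the structure of interest
    # (e.g. "ACL" for knee reports) and concatenate them.
    sents = [s for s in split_sentences(report)
             if any(k.lower() in s.lower() for k in keywords)]
    return " ".join(sents)

def classify(query_emb: np.ndarray, means: np.ndarray,
             variances: np.ndarray) -> int:
    # Assign the class minimizing the variance-aware distance of
    # equation (3): d^2 = ||m_c - q||^2 + Tr(Sigma_c), diagonal Sigma_c.
    d2 = ((means - query_emb) ** 2).sum(axis=1) + variances.sum(axis=1)
    return int(d2.argmin())
```

The class means and variances would be the statistics pre-computed once from each task's validation set, then reused at test time.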
We would like to shed some light on the complexity of the language we encounter. Since our dataset is sourced from multiple health systems, and not all reports follow a standard structure, there is a large amount of variation in the language describing the same diagnosis. For example, a severe tear can be referred to as a rupture, or only the size of a nodule may be mentioned without specifying that it is low risk (see Appendix C for more examples). Furthermore, most of our pipelines attempt to classify the different severities of a given pathology, and the language describing severity can vary. While it might be possible to construct a rule-based system to extract the diagnoses and severities we are interested in, it would be difficult to generalize as we expand to more diagnoses as well as to new health systems.
5 Prototypical Networks

Prototypical Networks, or ProtoNets (Snell et al., 2017), use an embedding function f_\theta to encode each input into an M-dimensional feature vector. A prototype is defined for every class c \in \mathcal{L} as the mean of the set of embedded support data samples S_c for the given class, i.e.,

v_c = \frac{1}{|S_c|} \sum_{(x_i, y_i) \in S_c} f_\theta(x_i). \qquad (1)

The distribution over classes for a given test input x is a softmax over the negative distances between the test data embedding and the prototype vectors:

P(y = c \mid x) = \mathrm{softmax}\big(-d(f_\theta(x), v_c)\big) = \frac{\exp\big(-d(f_\theta(x), v_c)\big)}{\sum_{c' \in \mathcal{L}} \exp\big(-d(f_\theta(x), v_{c'})\big)}, \qquad (2)

where d can be any (differentiable) distance function. The loss function is the negative log-likelihood: \mathcal{L}(\theta) = -\log P_\theta(y = c \mid x).

ProtoNets are simple and easy to train and deploy. However, the mean alone is used to capture the entire class-conditional distribution, thus losing a lot of information about the underlying distribution. A lot of work (Ding et al., 2022; Allen et al., 2019; Zhang et al., 2019) has focused on improving ProtoNets by taking this observation into account. We extend ProtoNets by incorporating the variance (2nd moment) of the distribution and using a distributional distance, the 2-Wasserstein metric, directly generalizing the vanilla ProtoNets.
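As a concrete illustration, the prototype computation of equation (1) and the softmax over negative distances of equation (2) can be sketched as follows, using squared Euclidean distance as d. This is a minimal NumPy sketch; the function names and array shapes are our own, not from the paper's code.

```python
import numpy as np

def prototypes(support_emb: np.ndarray, support_y: np.ndarray,
               n_classes: int) -> np.ndarray:
    # Equation (1): each prototype is the mean of the embedded
    # support examples belonging to its class.
    return np.stack([support_emb[support_y == c].mean(axis=0)
                     for c in range(n_classes)])

def class_probs(query_emb: np.ndarray, protos: np.ndarray) -> np.ndarray:
    # Equation (2): softmax over negative squared Euclidean distances
    # between the query embedding and each prototype.
    d2 = ((protos - query_emb) ** 2).sum(axis=1)
    logits = -d2
    e = np.exp(logits - logits.max())  # subtract max for stability
    return e / e.sum()
```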
5.1 Variance Aware ProtoNets

In this work, we model each class-conditional distribution as a Gaussian. The main question then is: how do we match a query example with a distribution? The simplest option is to treat the query example as a Dirac distribution. With that formulation in mind, recall that the Wasserstein-Bures metric between Gaussians (m_i, \Sigma_i), i = 1, 2, is given by

d^2 = \|m_1 - m_2\|^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2 \big( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \big)^{1/2} \right).

Given (x_i, y_i) \in S_c, where S_c is the support set of examples belonging to class c, we compute the mean m_c and covariance matrix \Sigma_c; the Wasserstein distance between this Gaussian and a query vector q (i.e., a Dirac) then boils down to

d^2 = \|m_c - q\|^2 + \mathrm{Tr}(\Sigma_c). \qquad (3)

The above formula shows that we can simplify our conditional distribution to a Gaussian with a diagonal covariance matrix. This brings the space complexity of storing the covariance matrix down from O(n^2) to O(n). Note that this is a direct generalization of the vanilla prototypical networks, as the vanilla prototypical networks can be interpreted as computing the Wasserstein distance (i.e., the simple L2 distance) between two Dirac distributions (the mean of the conditional distribution and the query sample). We also propose another variant of the above
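The diagonal-covariance distance of equation (3), together with a regularizer on the norm of the covariances as described in section 2, can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the regularization weight reg and the exact form of the regularizer are our own choices for illustration, not values from the paper.

```python
import numpy as np

def gaussian_stats(support_emb: np.ndarray, support_y: np.ndarray,
                   n_classes: int):
    # Per-class mean and diagonal covariance: O(n) storage per class.
    means = np.stack([support_emb[support_y == c].mean(axis=0)
                      for c in range(n_classes)])
    variances = np.stack([support_emb[support_y == c].var(axis=0)
                          for c in range(n_classes)])
    return means, variances

def wasserstein2_dirac(query: np.ndarray, mean: np.ndarray,
                       var: np.ndarray) -> float:
    # Equation (3): squared 2-Wasserstein distance between a diagonal
    # Gaussian N(mean, diag(var)) and a Dirac mass at the query:
    # d^2 = ||m_c - q||^2 + Tr(Sigma_c).
    return float(((mean - query) ** 2).sum() + var.sum())

def loss(query: np.ndarray, label: int, means: np.ndarray,
         variances: np.ndarray, reg: float = 0.1) -> float:
    # Negative log-likelihood over the softmax of negative distances,
    # plus a regularizer on the L2 norm of the (diagonal) covariances
    # to encourage tight clustering. `reg` is an assumed hyperparameter.
    d2 = np.array([wasserstein2_dirac(query, m, v)
                   for m, v in zip(means, variances)])
    logits = -d2
    log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    return float(-log_probs[label] + reg * np.sqrt((variances ** 2).sum()))
```

Setting all variances to zero recovers the vanilla ProtoNet distance, reflecting the generalization argument above.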