Using Bottleneck Adapters to Identify Cancer in Clinical Notes under
Low-Resource Constraints
Omid Rohanian1,6, Hannah Jauncey3, Mohammadmahdi Nouriborji5,6, Vinod Kumar Chauhan1,
Bronner P. Gonçalves2, Christiana Kartsonaki2, ISARIC Clinical Characterisation Group2†, Laura Merson2,
David Clifton1,4
1Department of Engineering Science, University of Oxford, Oxford, UK
2ISARIC, Pandemic Sciences Institute, University of Oxford, Oxford, UK
3Infectious Diseases Data Observatory (IDDO), University of Oxford, UK
4Oxford-Suzhou Centre for Advanced Research, Suzhou, China
5Sharif University of Technology, Tehran, Iran
6NLPie Research, Oxford, UK
{omid.rohanian,david.clifton,vinod.kumar}@eng.ox.ac.uk
{hannah.jauncey,laura.merson,bronner.goncalves}@ndm.ox.ac.uk
m.nouriborji@nlpie.com
christiana.kartsonaki@dph.ox.ac.uk
†Please refer to Appendix A.4 for the full list of collaborators.
Abstract
Processing information locked within clinical health records is a challenging task that remains an active area of research in biomedical NLP. In this work, we evaluate a broad set of machine learning techniques, ranging from simple RNNs to specialised transformers such as BioBERT, on a dataset containing clinical notes along with a set of annotations indicating whether a sample is cancer-related or not.
Furthermore, we specifically employ efficient fine-tuning methods from NLP, namely bottleneck adapters and prompt tuning, to adapt the models to our specialised task. Our evaluations suggest that fine-tuning a frozen BERT model pre-trained on natural language with bottleneck adapters outperforms all other strategies, including full fine-tuning of the specialised BioBERT model. Based on our findings, we suggest that using bottleneck adapters in low-resource situations with limited access to labelled data or processing capacity could be a viable strategy in biomedical text mining. The code used in the experiments will be made available at [LINK ANONYMIZED].
1 Introduction
Clinical notes contain important information about patients, their current state, and their medical history. Automatic processing of these notes and the terms that appear in them helps researchers classify them into standard conditions that can also be looked up in medical knowledge-bases. In combination with other medical signals, this information has been shown to be useful in predicting in-hospital mortality rate (Deznabi et al., 2021), prolonged mechanical ventilation (Huang et al., 2020), or clinical outcome (van Aken et al., 2021), among others.
In this work, we looked at a real clinical notes database and designed a pilot experiment in which a set of different ML models were used to predict whether a clinical note is cancer-related or not. The motivation behind this experiment is to help clinicians and data curators automatically search for and identify notes that signal a particular medical condition, instead of relying solely on laborious human annotation and keyword-based search. The promise of ML lies in automating this task reasonably close to human-level performance and ultimately expanding this work to include other conditions in a multi-class scenario. Ideally, a model would be able to identify cancer types not seen during training and would have some understanding of context and grammar so as to be sensitive to negation.
Contributions
In this work, we targeted the task of disease identification within a clinical notes dataset. We tested a range of different models, including RNN-based and transformer-based architectures, to tackle this problem. We particularly focused on efficient fine-tuning approaches to adapt our pre-trained models to the biomedical task. The novelty of this work is
in the successful application of bottleneck adapters to the cancer identification task, which to the best of our knowledge has not been explored before. We compare this method with multiple other strong baselines and conduct experiments and analyses to evaluate these different approaches. The systems developed in this study, and those that will follow in related future work, will be added to the data curation system of a biomedical database with the aim of enabling automatic processing of clinical notes in real EHR data.
2 Pre-Trained Transformers and Fine-Tuning
In recent years, the Transformer architecture (Vaswani et al., 2017) and large language models (LMs) have become the staple baseline for many NLP tasks. The conventional paradigm is to first pre-train an LM on a large corpus of general text (e.g. Wikipedia) with a pre-training objective such as masked or causal language modeling, and then fine-tune the LM on downstream tasks.
In our task, we focus on transformers pre-trained with the Masked Language Modeling (MLM) objective. In MLM, a portion of the text is masked out and the objective of the model is to learn to reconstruct the masked portion based on the available context. The most commonly used model pre-trained with MLM is BERT (Devlin et al., 2019).
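The MLM objective described above can be illustrated with a few lines of Hugging Face code. This snippet is not part of the paper's experiments; the model choice and the example sentence are purely illustrative.

```python
# Minimal illustration of masked language modeling with a BERT model
# pre-trained on general text, using the Hugging Face fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model reconstructs the masked token from the surrounding context.
for prediction in fill_mask("The patient was diagnosed with [MASK] cancer."):
    print(prediction["token_str"], round(prediction["score"], 3))
```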
Despite BERT’s promising results on many downstream NLP tasks, it has been shown that large LMs pre-trained on generic text do not always perform well in specialised domains such as biomedicine (Lee et al., 2020; Gururangan et al., 2020). The standard approach, therefore, is to pre-train models on corpora that are related to the target domain. BioBERT (Lee et al., 2020) is an example of an LM trained on specialised data. It is trained on a large corpus of general and biomedical texts, making it a strong model for biomedical text mining.
2.1 Efficient Fine-Tuning Methods
The benefits of fine-tuning large LMs for downstream applications are offset by a significant computational cost. Some LMs, for example, include more than 100 billion parameters, making their fine-tuning costly. Furthermore, complete fine-tuning may be ineffective when the amount of training data is small or differs from the initial domain that the model was trained on, which might result in catastrophic forgetting.
As a response to these limitations, more efficient fine-tuning approaches have been developed, among which prompt tuning (Sec. 2.3) and bottleneck adapters (Sec. 2.2) are two of the most effective and well-known.
2.2 Bottleneck Adapters
Bottleneck Adapters (BAs) (Houlsby et al., 2019; Pfeiffer et al., 2021; Rücklé et al., 2020; Pfeiffer et al., 2020) are Multi-Layer Perceptron (MLP) blocks made up of a down-projection dense layer, an activation function, and an up-projection dense layer with a residual connection. These blocks are inserted between the frozen attention and feed-forward blocks of a pre-trained LM, and only these modules are updated during fine-tuning. This method has proven to be effective in terms of both computational and parameter efficiency.
Houlsby et al. (2019) showed that by training only around 3% of the parameters, BERT trained with adapters can achieve competitive results compared to complete fine-tuning. Adapter tuning can be expressed in the equation below, where $X_i$ is the output of the frozen attention or MLP component of the $i$-th layer of the pre-trained LM:

$$O_i = f_{\text{up}}(\text{Activation}(f_{\text{down}}(X_i))) + X_i \quad (1)$$
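Equation (1) maps directly onto a small module. The following PyTorch sketch is a minimal illustration of a bottleneck adapter, assuming a hidden size of 768 and the reduction factor of 16 used later in this work; it is not the authors' implementation.

```python
# A minimal sketch of the bottleneck adapter in Eq. (1); class and
# dimension names are illustrative assumptions.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int = 768, reduction_factor: int = 16):
        super().__init__()
        bottleneck_size = hidden_size // reduction_factor
        self.down = nn.Linear(hidden_size, bottleneck_size)   # f_down
        self.activation = nn.ReLU()                           # Activation
        self.up = nn.Linear(bottleneck_size, hidden_size)     # f_up

    def forward(self, x_i: torch.Tensor) -> torch.Tensor:
        # O_i = f_up(Activation(f_down(X_i))) + X_i (residual connection)
        return self.up(self.activation(self.down(x_i))) + x_i

# Example: adapting the output of one frozen transformer sub-block.
hidden_states = torch.randn(2, 128, 768)   # (batch, seq_len, hidden)
adapter = BottleneckAdapter()
adapted = adapter(hidden_states)           # same shape as the input
```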
2.3 Prompt Tuning
Another efficient method of fine-tuning is called Prompt Tuning (PT) (Li and Liang, 2021; Lester et al., 2021). PT is mostly used for autoregressive LMs such as GPT (Brown et al., 2020). In this approach, a set of learnable vectors (a prompt) is concatenated with the original input and passed to the LM. During fine-tuning, the objective is to learn a prompt that encodes task-specific knowledge for the downstream task while the original model parameters are kept frozen. In some variations of PT, instead of concatenating a set of learnable vectors with the input before passing it to the model once, a set of prompts is learned for each individual attention layer of the pre-trained LM (Li and Liang, 2021). The PT approach used in this study can be expressed in the equation below, where $\text{Attention}_i$ is the attention block of the $i$-th layer of the pre-trained transformer, and $P^k_i$ and $P^v_i$ denote the learnable prompts for keys and values respectively:

$$O_i = \text{Attention}_i(Q_i, [P^k_i, K_i], [P^v_i, V_i]) \quad (2)$$
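As a rough illustration of Equation (2), the sketch below prepends learnable key and value prompts to a single frozen attention layer. The class name, tensor shapes, and the use of PyTorch's scaled_dot_product_attention are assumptions made for illustration rather than the implementation used in this study.

```python
# Illustrative sketch of per-layer key/value prompts (Eq. 2) for a single
# attention layer; only the prompt parameters would be trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixedAttention(nn.Module):
    def __init__(self, hidden_size: int = 768, prompt_length: int = 30):
        super().__init__()
        # Learnable prompts P^k_i and P^v_i for this layer.
        self.prompt_keys = nn.Parameter(torch.randn(prompt_length, hidden_size))
        self.prompt_values = nn.Parameter(torch.randn(prompt_length, hidden_size))

    def forward(self, q, k, v):
        batch = q.size(0)
        # Prepend the prompts to the frozen model's keys and values:
        # [P^k_i, K_i] and [P^v_i, V_i].
        k = torch.cat([self.prompt_keys.expand(batch, -1, -1), k], dim=1)
        v = torch.cat([self.prompt_values.expand(batch, -1, -1), v], dim=1)
        # O_i = Attention_i(Q_i, [P^k_i, K_i], [P^v_i, V_i])
        return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(2, 128, 768)   # (batch, seq_len, hidden)
out = PrefixedAttention()(q, k, v)     # (2, 128, 768)
```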
2.4 Bottleneck Adapters in the Biomedical Domain
BAs are increasingly used for efficient knowledge extraction and domain adaptation due to their parameter efficiency and low computational cost. Following this trend, some works in the biomedical domain have used adapters to insert task-specific knowledge into LMs via pre-training (Grover, 2021; Lu et al., 2021), or have employed them in layer adaptation for developing compact biomedical models (Nouriborji et al., 2022).
3 Challenges of Identifying Cancer-Related Records
Clinical notes usually contain abbreviated and non-standard language. A single concept like cancer is mentioned in different ways depending on the cancer subtype. The same subtype might have both a scientific and a commonly known variant, and both can appear in the text. Grammar is sometimes broken and the language can appear cryptic. Another issue is the prevalence of misspellings, which further complicates this task.
There are also words that co-occur with a condition and can easily confound the model. For instance, words like ‘breast’ and ‘lung’, which are not specific to cancer, appear frequently in cancer-related samples, and the model can mistake them for a cancer signal. Another important issue is negation. If a condition is ruled out, ideally a model should not return positive. However, since most rows that are classed as positive in the dataset include the token ‘cancer’, an example like ‘not cancer’ could be mistaken as positive. Encoding awareness of negation into the model is a challenge, since it is known that pre-trained LMs lack an innate ability to handle negation (Hosseini et al., 2021).
4 Dataset and Annotation
The dataset in this pilot experiment was provided by ISARIC, a global initiative that, among other things, provides tools and resources to facilitate clinical research.¹
¹The ISARIC COVID-19 Data Platform is a global partnership of more than 1,700 institutions across more than 60 countries. Accreditation of the individuals, institutions and funders that contributed to this effort can be found in the supplementary material. These partners have combined data and expertise to accelerate the pandemic response and improve patient outcomes. For more information on ISARIC, see https://isaric.org. Data are available for access via application to the Data Access Committee at www.iddo.org/covid-19.
[Figure 1: a pre-trained transformer block with Bottleneck Adapters inserted after the multi-head attention and feed-forward sub-layers; each adapter consists of a feed-forward down-projection, an activation, and a feed-forward up-projection with a residual connection.]
Figure 1: The overall architecture of adapter tuning. Note that the original parameters of the pre-trained model are kept frozen during fine-tuning and only the Bottleneck Adapters in between the attention and feed-forward layers are updated.
The larger dataset contains 125,381 rows corresponding to clinical notes related to different conditions and patients. For the purposes of this experiment, a portion of this data was annotated for the presence of cancer. The annotated subset contains 2,563 rows with cancer labels, out of which 343 are repeated notes where the doctors have written the same cancer-related note for a different patient. The human experts who tagged the data for cancer had access to a set of cancer-related terms to guide them in the annotation. The negative cohort of 3K rows was generated by removing from the larger dataset any row that contained keywords that could signal cancer definitively or with very high probability. The details of the lists and more information on the annotation scheme are included in the appendix (A.1).
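As a rough sketch of the negative-cohort construction described above, the following snippet filters a hypothetical pandas DataFrame with an assumed note_text column using an illustrative keyword list; the actual terms and annotation scheme are those given in Appendix A.1.

```python
# Illustrative construction of the negative cohort by keyword filtering.
import pandas as pd

cancer_keywords = ["cancer", "carcinoma", "lymphoma", "melanoma", "tumour"]

def build_negative_cohort(notes: pd.DataFrame, n_samples: int = 3000) -> pd.DataFrame:
    pattern = "|".join(cancer_keywords)
    # Drop any row containing a keyword that could signal cancer, then
    # sample the negative cohort from the remaining rows.
    mask = notes["note_text"].str.contains(pattern, case=False, na=False)
    return notes.loc[~mask].sample(n=n_samples, random_state=0)
```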
5 Experiments
The experiments in this work are divided into two categories, namely attention-based and RNN-based methods. We conducted all our experiments on an internal cancer detection dataset with ~6k labelled samples, with roughly equal instances in each class, and evaluated them on a gold standard consisting of 1k samples, 31 of which were positive and the rest negative. Note the distributional shift between the training and test sets, which reflects the real clinical setting under which the models are expected to perform.
5.1 Baselines
We used three baselines in this work, all of which are RNN-based. The initial weights of the embedding layer in all the baselines come from Chen et al. (2019), a word2vec model pre-trained on medical data. The first model is a simple Bi-LSTM, the second uses a 1D convolution before the Bi-LSTM (CNN-Bi-LSTM), and the final model adds a multi-head self-attention layer after the CNN-Bi-LSTM model (CNN-Bi-LSTM-Att). All models are trained for 24 epochs with a batch size of 64.
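For concreteness, a minimal PyTorch sketch of the strongest baseline (CNN-Bi-LSTM-Att) might look as follows; the layer sizes and classifier head are illustrative assumptions, and in the actual setup the embedding layer would be initialised with the pre-trained medical word2vec vectors of Chen et al. (2019).

```python
# Illustrative CNN-Bi-LSTM-Att baseline: 1D convolution, Bi-LSTM, then a
# multi-head self-attention layer, followed by mean pooling and a classifier.
import torch
import torch.nn as nn

class CNNBiLSTMAtt(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 200, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attention = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                               batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 2)   # cancer vs. not cancer

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)                      # (batch, seq, embed)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)   # 1D convolution
        x, _ = self.bilstm(x)                              # (batch, seq, 2*hidden)
        x, _ = self.attention(x, x, x)                     # self-attention
        return self.classifier(x.mean(dim=1))              # pool and classify

logits = CNNBiLSTMAtt(vocab_size=30000)(torch.randint(0, 30000, (4, 256)))
```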
5.2 Approach
Our aim was to improve upon the strong RNN baselines through efficient fine-tuning of pre-trained transformers, namely BERT (Devlin et al., 2019) and BioBERT (Lee et al., 2020). Three fine-tuning approaches were tried: full fine-tuning, tuning with BAs (Sec. 2.2), and PT (Sec. 2.3).
5.2.1 Tuning with Adapters
The BA used in this work is from Houlsby et al. (2019) and is implemented using Adapter Hub (Pfeiffer et al., 2020). The reduction factor of the adapter is set to 16 and its activation function is ReLU. The adapters are placed after the attention and feed-forward layers of each transformer block while the parameters of the model are kept frozen. The overall architecture of the model used in this work is depicted in Figure 1.
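A sketch of this configuration using the AdapterHub adapter-transformers library is shown below; exact class and method names vary between library versions, so this should be read as an illustration of the setup rather than the authors' exact code.

```python
# Illustrative adapter tuning with AdapterHub: Houlsby-style adapters,
# reduction factor 16, ReLU activation, everything else frozen.
from transformers import AutoModelForSequenceClassification
from transformers.adapters import HoulsbyConfig

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

adapter_config = HoulsbyConfig(reduction_factor=16, non_linearity="relu")
model.add_adapter("cancer", config=adapter_config)
# Freeze the pre-trained weights; only the adapter (and the classification
# head) receives gradient updates.
model.train_adapter("cancer")
model.set_active_adapters("cancer")
```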
5.2.2 Tuning with Prompts
For PT, the approach from Li and Liang (2021) with a prompt size of 30 is used, implemented with the Adapter Hub library (Pfeiffer et al., 2020). In this approach, a set of prompts is learned for each attention layer of the frozen language model.
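Analogously, the snippet below is a hedged sketch of the prefix-tuning setup with a prompt size of 30, again using the AdapterHub library; class names may differ across library versions.

```python
# Illustrative prefix tuning: 30 key/value prompt vectors are learned per
# attention layer while the original model parameters stay frozen.
from transformers import AutoModelForSequenceClassification
from transformers.adapters import PrefixTuningConfig

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

model.add_adapter("cancer_prefix", config=PrefixTuningConfig(prefix_length=30))
model.train_adapter("cancer_prefix")
model.set_active_adapters("cancer_prefix")
```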
5.2.3 Encoding Knowledge of Negation and Uncertainty
Negation is not understood by default by any of the models explored in this work. For instance, the phrases ‘Evidence of lung cancer’ and ‘No Evidence of lung cancer’ are both predicted as cancer-related, since the negative samples in the training set do not include negation patterns for cancer. To encode some understanding of negation into the model, we analysed the larger dataset and identified a number of different ways a condition can be ruled out with varying degrees of certainty (full list available at A.2). We used these examples to generate synthetic negative samples that include a cancer-related term (e.g. ‘not lymphoma’ or ‘Melanoma not formally diagnosed’).
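A simple way to realise this augmentation is to combine negation templates with cancer-related terms; the sketch below uses illustrative templates and terms, while the full lists used in the paper are given in Appendix A.2.

```python
# Illustrative generation of synthetic negative samples from negation
# templates and cancer-related terms.
import random

negation_templates = [
    "not {term}",
    "no evidence of {term}",
    "{term} not formally diagnosed",
    "{term} ruled out",
]
cancer_terms = ["lymphoma", "melanoma", "lung cancer", "breast cancer"]

def generate_synthetic_negatives(n_samples: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    return [
        rng.choice(negation_templates).format(term=rng.choice(cancer_terms))
        for _ in range(n_samples)
    ]

# e.g. 250 or 500 extra negatives, as in the experiments reported below.
extra_negatives = generate_synthetic_negatives(250)
```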
6 Results
The results reported in Table 1 are the best out of three successive runs. For each approach, the hyperparameters that seemed to work best during training were kept fixed for all the runs. Full fine-tuning was done with 5 epochs and a learning rate of 2e-5. Tuning with BAs was done with 10 epochs and a learning rate of 1e-3. PT was used with 10 epochs and a learning rate of 1e-4. All approaches used a batch size of 64, the AdamW optimizer, a weight decay of 0.01, and a cosine scheduler. As can be seen, the best performing model is BERT trained with adapters (including variants which are equipped with some notion of negation, as explained in Sec. 5.2.3).
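The shared optimisation setup can be reproduced with standard utilities; the sketch below is illustrative only (for example, the warm-up length is an assumption not stated in the paper).

```python
# Illustrative optimisation setup: AdamW with weight decay 0.01 and a
# cosine learning-rate schedule, as described above.
import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model: torch.nn.Module, lr: float, num_training_steps: int):
    # e.g. lr=1e-3 for adapter tuning, 2e-5 for full fine-tuning, 1e-4 for PT.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
    return optimizer, scheduler
```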
Analysing the outputs of individual models, we found that the majority of positive labels in the test set are correctly identified by most models. The bottleneck, however, is the false positives that arise due to the presence of certain words (e.g. ‘diagnosed with’, ‘lung’, ‘breast’, etc.) that co-occur with cancer and can cause models to incorrectly label an instance as positive. The best model had only 4 false positives and no false negatives. The values for the confusion matrices of all the models are provided in A.3.
To alleviate the false positive issue, using the method explained in Sec. 5.2.3, we trained our best model (BERT with adapter tuning) with an additional 250 and 500 generated negative samples. The model was subsequently able to predict cases such as ‘neither cancer nor covid’, ‘lung infection but no cancer’, and ‘diagnosed with covid but not cancer’ correctly, with only minor performance drops.
A point of strength in all the models was their ability to correctly identify cancer, even for rare cancer types that had not occurred in the training set. This generalisation to unseen cancer types indicates that the models can effectively use information from the pre-trained resources they rely upon.
7 Conclusion
In this work, we trained and tested a number of classification approaches as part of a preliminary experiment on a dataset of clinical notes annotated for the presence of cancer. We compared a number of RNN models utilising pre-trained biomedical embeddings with two different pre-trained transformer-based models that were fine-tuned in separate ways. We also addressed the issue of negation by integrating negation patterns into the negative training samples. Our find-