in the successful application of bottleneck adapters to the cancer identification task, which, to the best of our knowledge, has not been explored before. We compare this method with multiple other strong baselines and conduct experiments and analyses to evaluate the different approaches. The systems developed in this study, and those that will follow in related future work, will be added to the data curation system of a biomedical database with the aim of enabling automatic processing of clinical notes in real EHR data.
2 Pre-Trained Transformers and
Fine-Tuning
In recent years, the Transformer architecture (Vaswani et al., 2017) and large language models (LMs) have become the staple baseline for many NLP tasks. The conventional paradigm is to first pre-train an LM on a large corpus of general text (e.g., Wikipedia) with a pre-training objective such as masked or causal language modeling, and then fine-tune the LM on downstream tasks.
For our task, we focus on Transformers pre-trained with the Masked Language Modeling (MLM) objective. In MLM, a portion of the input text is masked out and the model learns to reconstruct the masked tokens from the available context. The most widely used model pre-trained with MLM is BERT (Devlin et al., 2019).
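For illustration only, the following minimal Python sketch shows how a model pre-trained with the MLM objective reconstructs a masked token from its context. The Hugging Face transformers API, the bert-base-uncased checkpoint, and the example sentence are assumptions made purely for this sketch and are not part of the systems developed in this study.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load a generic BERT checkpoint pre-trained with the MLM objective.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Mask one token and let the model reconstruct it from the surrounding context.
text = f"The patient was diagnosed with {tokenizer.mask_token} cancer."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most probable token at the masked position.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))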
Despite BERT's promising results on many downstream NLP tasks, it has been shown that large LMs pre-trained on generic text do not always perform well in specialized domains such as biomedical tasks (Lee et al., 2020; Gururangan et al., 2020). The standard approach, therefore, is to pre-train models on corpora related to the target domain. BioBERT (Lee et al., 2020) is an example of an LM trained on specialized data: it is trained on a large corpus of general and biomedical texts, making it a strong model for biomedical text mining.
2.1 Efficient Fine-Tuning Methods
The benefits of fine-tuning large LMs for downstream applications are offset by a significant computational cost. Some LMs, for example, contain more than 100 billion parameters, making their fine-tuning costly. Furthermore, complete fine-tuning may be ineffective when the training data are scarce or differ from the domain the model was originally trained on, which might result in catastrophic forgetting.
In response to these limitations, more efficient fine-tuning approaches have been developed, among which prompt tuning (Section 2.3) and bottleneck adapters (Section 2.2) are two of the most effective and well-known.
Bottleneck Adapters (BAs) (Houlsby et al., 2019; Pfeiffer et al., 2021; Rücklé et al., 2020; Pfeiffer et al., 2020) are Multi-Layer Perceptron (MLP) blocks composed of a down-projection dense layer, an activation function, and an up-projection dense layer with a residual connection. These blocks are inserted between the frozen attention and feed-forward blocks of a pre-trained LM, and only these adapter modules are updated during fine-tuning. This method has proven effective in terms of both computational and parameter efficiency.
Houlsby et al. (2019) showed that by training only around 3% of the parameters, BERT with adapters can achieve results competitive with complete fine-tuning. Adapter tuning can be expressed by the equation below, where $X_i$ is the output of the frozen attention or MLP component of the $i$-th layer of the pre-trained LM:
$$O_i = f_{\text{up}}\big(\text{Activation}(f_{\text{down}}(X_i))\big) + X_i \quad (1)$$
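To make Equation (1) concrete, the following is a minimal PyTorch sketch of a bottleneck adapter block. The hidden size of 768, the bottleneck dimension of 64, and the GELU activation are illustrative assumptions, not the exact configuration used in this study.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter as in Eq. (1): down-project, activate,
    up-project, and add a residual connection to the sub-layer output."""

    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.f_down = nn.Linear(hidden_size, bottleneck_size)  # down-projection
        self.activation = nn.GELU()                            # non-linearity (assumed)
        self.f_up = nn.Linear(bottleneck_size, hidden_size)    # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x is the output of a frozen attention or MLP block (X_i in Eq. 1).
        return self.f_up(self.activation(self.f_down(x))) + x

# During fine-tuning, only the adapter parameters (and typically a task head)
# receive gradients; the pre-trained LM weights stay frozen.
adapter = BottleneckAdapter()
hidden_states = torch.randn(1, 128, 768)  # (batch, sequence, hidden)
out = adapter(hidden_states)

Because the adapter output is added back to its input through the residual connection, initializing the adapter near zero leaves the pre-trained model's behavior approximately unchanged at the start of fine-tuning.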
2.3 Prompt Tuning
Another efficient method of fine-tuning is Prompt Tuning (PT) (Li and Liang, 2021; Lester et al., 2021). PT is mostly used for autoregressive LMs such as GPT (Brown et al., 2020). In this approach, a set of learnable vectors (a prompt) is concatenated with the original input and passed to the LM. During fine-tuning, the objective is to learn a prompt that encodes task-specific knowledge for the downstream task while the original model parameters are kept frozen. In some variations of PT, instead of concatenating a single set of learnable vectors with the input before passing it to the model, a set of prompts is learned for each individual attention layer of the pre-trained LM (Li and Liang, 2021). The PT approach used in this study can be expressed by the equation below, where $\text{Attention}_i$ is the attention block of the $i$-th layer of the pre-trained Transformer, and $P^k_i$ and $P^v_i$ denote the learnable prompts for the keys and values, respectively:
$$O_i = \text{Attention}_i\big(Q_i,\, [P^k_i, K_i],\, [P^v_i, V_i]\big) \quad (2)$$
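As a sketch of Equation (2) under assumed dimensions, rather than the exact implementation used here, the snippet below prepends learnable key and value prompts to the keys and values of a single (single-head) attention layer. The prompt length of 20 and hidden size of 768 are assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedAttention(nn.Module):
    """Single-head attention with learnable key/value prompts, as in Eq. (2):
    P_k and P_v are prepended to the keys and values of the frozen layer."""

    def __init__(self, hidden_size: int = 768, prompt_length: int = 20):
        super().__init__()
        # Only these prompt parameters are trainable; q, k, v come from the frozen model.
        self.prompt_k = nn.Parameter(torch.randn(prompt_length, hidden_size) * 0.02)
        self.prompt_v = nn.Parameter(torch.randn(prompt_length, hidden_size) * 0.02)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, hidden) produced by the frozen projections.
        batch = q.size(0)
        p_k = self.prompt_k.unsqueeze(0).expand(batch, -1, -1)
        p_v = self.prompt_v.unsqueeze(0).expand(batch, -1, -1)
        k = torch.cat([p_k, k], dim=1)  # [P_k, K_i]
        v = torch.cat([p_v, v], dim=1)  # [P_v, V_i]
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ v

# Example usage with random activations standing in for a frozen layer.
attn = PromptedAttention()
q = k = v = torch.randn(2, 128, 768)
out = attn(q, k, v)  # (2, 128, 768)

Since the prompts only extend the keys and values, the output sequence length is unchanged and the frozen model's weights never receive gradients; only the per-layer prompt parameters are updated during fine-tuning.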