allows us to observe the effect of transfer learning
on target task data efficiency.
2.1 Transfer Method
Task transfer can be accomplished in various ways; we use sequential task transfer (McCloskey and Cohen, 1989; Ruder et al., 2019). The sequential method trains the model in two distinct steps: (1) train the model on a source task, then (2) train the model on the target task. What differentiates the sequential method from alternatives such as multitask and multitask/fine-tuning transfer learning (Caruana, 1994) is that the model never trains on the source and target tasks at the same time. This is ideal because it allows large institutions, which are less limited by computing power and resources, to annotate data, train large models, and publish those models; smaller groups can then use these published source-task models as the base for their target-task models. Because training on the source and target tasks happens in separate steps, this is sequential learning.
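As a rough sketch of this two-step recipe (the toy model, the synthetic batches, and the train_on_task helper below are illustrative stand-ins, not our released code):

```python
import torch
import torch.nn as nn

def train_on_task(model: nn.Module, batches, epochs: int = 1) -> None:
    """Hypothetical helper: a standard supervised fine-tuning loop."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in batches:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()

# Step 1: train on the source task (typically done once by a well-resourced
# group) and publish the resulting checkpoint.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # stand-in for BERT/T5
source_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(10)]
train_on_task(model, source_batches)
torch.save(model.state_dict(), "source_task_model.pt")

# Step 2: a smaller group loads the published weights and fine-tunes on the
# target task; the source and target tasks are never trained jointly.
model.load_state_dict(torch.load("source_task_model.pt"))
target_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(10)]
train_on_task(model, target_batches)
```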
2.2 Dataset
We use the Friends dialog dataset (see Appendix A for details) for our experiments. 70% of the
dialogues in the dataset are used for training, 15%
for validation, and 15% for testing. From there
the training and validation dialogues are divided
into 20%, 40%, 60%, 80%, and 100% data splits
such that every dialog in a smaller split appears in
each larger split. For example, a dialog in the 40%
training split must also appear in the 60%, 80%
and 100% splits but may or may not appear in the
20% split. In this way, larger data divisions are
guaranteed to have at least as much information
as smaller data divisions, even if certain dialogues
inherently hold more information. Not every sample is annotated for every task, so some tasks have more samples than others. Additionally, the splits are consistent across tasks: if a dialog is in the test split for one task, it is also in the test split for every other task it is annotated on. See Appendix A.1 for exact data
split counts.
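One simple way to build such nested splits is to shuffle the dialogues once with a fixed seed and take growing prefixes. The sketch below illustrates this idea; the function name, seed, and dialog IDs are hypothetical, not taken from our preprocessing code.

```python
import random

def nested_splits(dialog_ids, fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Return {fraction: list of dialog ids} where every smaller split is a
    subset of every larger split (prefixes of one fixed shuffle)."""
    ids = list(dialog_ids)
    random.Random(seed).shuffle(ids)  # one fixed ordering shared by all splits
    return {f: ids[: round(f * len(ids))] for f in fractions}

splits = nested_splits(range(1000))
assert set(splits[0.4]) <= set(splits[0.6]) <= set(splits[1.0])
```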
2.3 Models
To see if results differed across pretrained mod-
els, we ran our experiments on both BERT-base
from Devlin et al. (2019) (110 million parameters)
and T5-base from Raffel et al. (2020) (220 million
parameters). On top of the base BERT model, we trained a small classification layer, unique to each task, to convert the output of the base model into a valid output for that task. T5 converts all tasks
into a text-to-text format and requires no additional
classification layer.
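As an illustrative sketch (not our released model code), the per-task head on BERT can be a single linear layer over the [CLS] representation:

```python
import torch.nn as nn
from transformers import AutoModel

class BertForTask(nn.Module):
    """BERT-base encoder plus a small task-specific classification head."""
    def __init__(self, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # representation of the [CLS] token
        return self.head(cls)               # logits in the task's label space
```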
3 Experiment Methodology
3.1 Evaluation and Stopping Mechanism
Each task has one or more metrics it can be eval-
uated on (see Appendix A.1 for metric details).
When a model is evaluated on a task with multiple metrics, its performance is the average of its scores across that task's metrics.
For each experiment, training stopped when the
validation metric hadn’t improved in 4 consecutive
epochs. The validation metric was calculated on the same percentage of the validation data as was used for training. For instance, when training on the 40% training split, the validation metric was calculated on the 40% validation split. We saved the model checkpoint
that had the highest validation metric. Then, after
the training was complete, we evaluated the best
model checkpoint on 100% of the testing data (re-
gardless of the percentage of the data used during
training) and used that score in our analysis.
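A sketch of this stopping mechanism, assuming hypothetical train_one_epoch and evaluate callables supplied by the experiment harness:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate, patience=4):
    """Train until the validation metric has not improved for `patience`
    consecutive epochs, then restore and return the best checkpoint.
    `train_one_epoch(model)` and `evaluate(model)` (which returns the mean
    over the task's metrics on the validation split) are hypothetical."""
    best_metric, best_state, stale_epochs = float("-inf"), None, 0
    while stale_epochs < patience:
        train_one_epoch(model)
        metric = evaluate(model)
        if metric > best_metric:
            best_metric = metric
            best_state = copy.deepcopy(model.state_dict())
            stale_epochs = 0
        else:
            stale_epochs += 1
    model.load_state_dict(best_state)  # best checkpoint, later scored on 100% of the test data
    return model, best_metric
```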
3.2 Hyperparameter Search
For each task on each pretrained model type, we
ran a hyperparameter search across the learning rates $10^{-4}$ and $10^{-5}$ and the batch sizes 10, 30, 60, and 120. We used 100% of the data for training
and validation and did not use task transfer. Ev-
ery two epochs, the hyperparameter combinations
were evaluated using 100% of the validation data
and the worst 45% were removed so that the search did not waste time on hyperparameter combinations already
known to be subpar. The search stopped after ev-
ery hyperparameter combination had either been
removed or stopped by the stopping mechanism.
Once finished, the hyperparameter search saved the
hyperparameter combination that had the highest
validation metric on 100% of the validation data.
These saved hyperparameters were then used in
our experiments regardless of whether or not task
transfer was used or which percentage of the data
was being used for training. See Appendix B for
the results of the hyperparameter search.
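For illustration, the pruning schedule can be sketched as a successive-halving-style loop; the step_two_epochs, validate, and is_stopped hooks are hypothetical stand-ins for our training harness:

```python
import math
from itertools import product

LEARNING_RATES = [1e-4, 1e-5]
BATCH_SIZES = [10, 30, 60, 120]

def hyperparameter_search(step_two_epochs, validate, is_stopped):
    """Prune-as-you-go grid search over learning rate x batch size.
    `step_two_epochs(cfg)` trains a configuration for two more epochs,
    `validate(cfg)` scores it on 100% of the validation data, and
    `is_stopped(cfg)` reports the early-stopping criterion; all three are
    hypothetical hooks into the training harness."""
    survivors = list(product(LEARNING_RATES, BATCH_SIZES))
    best_metric = {}
    while survivors:
        for cfg in survivors:
            step_two_epochs(cfg)
            best_metric[cfg] = max(best_metric.get(cfg, float("-inf")), validate(cfg))
        # Drop configurations halted by the stopping mechanism, then the worst 45%.
        survivors = [c for c in survivors if not is_stopped(c)]
        survivors.sort(key=best_metric.get, reverse=True)
        survivors = survivors[: math.ceil(0.55 * len(survivors))]
    return max(best_metric, key=best_metric.get)  # best combination overall
```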