An Exploration of Data Efficiency
in Intra-Dataset Task Transfer for Dialog Understanding
Josiah Ross, Luke Yoffe, Alon Albalak, William Yang Wang
University of California, Santa Barbara
Abstract
Transfer learning is an exciting area of Natural
Language Processing that has the potential to
both improve model performance and increase
data efficiency. This study explores the effects
of varying quantities of target task training
data on sequential transfer learning in the dia-
log domain. We hypothesize that a model can
utilize the information learned from a source
task to better learn a target task, thereby reduc-
ing the number of target task training samples
required. Unintuitively, our data shows that of-
ten target task training data size has minimal
effect on how sequential transfer learning per-
forms compared to the same model without
transfer learning1. Our results lead us to be-
lieve that this unexpected result could be due
to the effects of catastrophic forgetting, moti-
vating further work into methods that prevent
such forgetting.
1 Introduction
Large annotated datasets are needed to train state-
of-the-art NLP models. Models pretrained on self-
supervised language modeling tasks such as T5
(Raffel et al., 2020) and BERT (Devlin et al., 2019)
have improved accuracy and data efficiency on
downstream tasks, and it has been demonstrated
that this concept can be pushed further through su-
pervised task transfer (Pruksachatkun et al., 2020).
Transfer learning is a technique where a ma-
chine learning model can use the knowledge it
has learned from one task or domain in order to
better perform in a different task or domain (Pan
and Yang, 2010). This study explores task transfer
(transfer learning between tasks), specifically intra-
dataset task transfer, where the source and target
tasks are annotated on the same dataset (Albalak
et al., 2022). We decided to focus on intra-dataset
task transfer in order to specifically study the effect
of varying the amount of target task data used on
task transfer, without allowing our results to be
affected by switching domains.

1 We used github.com/josiahnross/TLiDB to run our experiments.
We hope that our work and future work on task
transfer can make NLP more accessible and effi-
cient. Task transfer allows larger institutions to
use their vast resources to train models such as
BERT on a supervised task and publish the result-
ing model for others to start from. For example,
if someone wanted to train a model to perform
reading comprehension, it might be better to start
from a BERT model already trained on emotion
recognition than the base BERT model.
We hypothesize that a model needs less target
task data in order to have similar performance to a
model trained without task transfer since, through
task transfer, a model can use what it learns from
the source task to learn the target task more ef-
ficiently. Additionally, we theorize that when a
model has access to large quantities of target task
data, transfer learning would be less effective, since
the target task data would contain enough knowl-
edge on its own. Our goal for this paper is to
explore intra-dataset task transfer’s effect on data
efficiency and determine whether or not our theo-
ries are correct. Contrary to our hypothesis, our
results show that sequential intra-domain task trans-
fer doesn’t necessarily improve data efficiency.
2 Dataset, Models, and Framework
In this section, we describe the dataset, models,
and framework of our experiments. Many of our
decisions regarding our choice of dataset, mod-
els, and our focus on intra-dataset task transfer are
based on the FETA paper (Albalak et al., 2022).
Additionally, we based our code on Albalak
(2022). However, unlike Albalak et al. (2022), who
ran their experiments using only 10% of the tar-
get task training and validation data, we ran our
experiments on 20%, 40%, 60%, 80% and 100%
of the target task training and validation data. This
allows us to observe the effect of transfer learning
on target task data efficiency.
2.1 Transfer Method
Task transfer can be accomplished in various ways,
but we decided to use sequential task transfer (Mc-
Closkey and Cohen, 1989; Ruder et al., 2019). The
sequential method involves training the model in
two distinct steps: (1) training the model on a
source task, then (2) training the model on the
target task. What differentiates the sequential
method from other methods, such as multitask and
multitask/fine-tuning transfer learning (Caruana,
1994), is that the model never trains on the source
and target task at the same time. This is ideal be-
cause it would allow large institutions that are less
limited by computing power and resources to an-
notate data, train large models, and then publish
those models. Then, smaller groups can use these
published source task models as the base for their
target task models. Since the training of the model
on the source and target tasks is in separate steps,
this is sequential learning.
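To make the two-step procedure concrete, the following is a minimal sketch of sequential task transfer in Python. Here `fit` and `score` are generic stand-ins for a fine-tuning routine and an evaluation routine, not functions from our codebase.

```python
from typing import Any, Callable, Iterable

# A minimal sketch of the two-step recipe, not the TLiDB training loop itself.
# `fit` and `score` stand in for whatever fine-tuning and evaluation routines are used.
def sequential_transfer(
    pretrained_model: Any,
    source_train: Iterable,
    target_train: Iterable,
    target_test: Iterable,
    fit: Callable[[Any, Iterable], Any],
    score: Callable[[Any, Iterable], float],
) -> float:
    source_model = fit(pretrained_model, source_train)  # step 1: fine-tune on the source task
    target_model = fit(source_model, target_train)      # step 2: fine-tune on the target task
    return score(target_model, target_test)             # the source task data is never revisited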
2.2 Dataset
We use the Friends dialog dataset (see Appendix
A for details) for our experiments. 70% of the
dialogues in the dataset are used for training, 15%
for validation, and 15% for testing. From there
the training and validation dialogues are divided
into 20%, 40%, 60%, 80%, and 100% data splits
such that every dialog in a smaller split appears in
each larger split. For example, a dialog in the 40%
training split must also appear in the 60%, 80%
and 100% splits but may or may not appear in the
20% split. In this way, larger data divisions are
guaranteed to have at least as much information
as smaller data divisions, even if certain dialogues
inherently hold more information. Not every sample
is annotated for every task, so some tasks have
more samples than others. Additionally, the data
splits are consistent across tasks, meaning
that if a dialog is in the testing split for one task, it is
also in the testing split for every other task it is
annotated on. See Appendix A.1 for exact data
split counts.
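As an illustration of how such nested splits can be built, the toy snippet below shuffles dialog identifiers once with a fixed seed and takes prefixes of increasing length; this is only a sketch of the idea, not the split code we actually used.

```python
import random

# A toy construction of nested splits: dialogs are shuffled once with a fixed seed,
# so every dialog in a smaller split also appears in every larger split.
def nested_splits(dialog_ids, fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    ids = list(dialog_ids)
    random.Random(seed).shuffle(ids)
    return {f: ids[: round(f * len(ids))] for f in fractions}

splits = nested_splits(range(100))
assert set(splits[0.2]).issubset(splits[0.4])  # smaller splits nest inside larger ones
```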
2.3 Models
To see if results differed across pretrained mod-
els, we ran our experiments on both BERT-base
from Devlin et al. (2019) (110 million parameters)
and T5-base from Raffel et al. (2020) (220 million
parameters). On top of the base BERT model a
small classification layer, unique to each task, was
trained to convert the output of the base model to
a valid output of the task. T5 converts all tasks
into a text-to-text format and requires no additional
classification layer.
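The sketch below shows one way to attach a task-specific classification head to a shared BERT encoder, using the Hugging Face transformers API purely for illustration; it is a simplification, not our exact model code.

```python
import torch
from torch import nn
from transformers import BertModel  # assumes the Hugging Face transformers package

# A minimal sketch of a shared BERT encoder with a small, task-specific
# classification head (not the exact TLiDB module).
class BertWithTaskHead(nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output
        return self.head(pooled)  # logits over the task's label set
```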
3 Experiment Methodology
3.1 Evaluation and Stopping Mechanism
Each task has one or more metrics it can be eval-
uated on (see Appendix A.1 for metric details).
When a model is evaluated on a task with multiple
metrics, its performance is based on that model’s
average performance over that task’s metrics.
For each experiment, training stopped when the
validation metric hadn’t improved in 4 consecutive
epochs. The validation metric was calculated using
the same percentage of the validation data as
was used for the training data. For instance, when
training on the 40% training split, the
validation metric was calculated on the 40%
validation split. We saved the model checkpoint
that had the highest validation metric. Then, after
the training was complete, we evaluated the best
model checkpoint on 100% of the testing data (re-
gardless of the percentage of the data used during
training) and used that score in our analysis.
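The stopping rule can be summarized by the following sketch, where `train_one_epoch` and `validate` are hypothetical helpers; patience is fixed at 4 epochs and the best checkpoint is returned for final testing.

```python
import copy

# A schematic of the stopping rule (patience of 4 epochs on the validation metric).
# `train_one_epoch` and `validate` are generic callables, not the exact TLiDB functions.
def train_with_early_stopping(model, train_data, val_data,
                              train_one_epoch, validate, patience=4):
    best_metric, best_model, epochs_since_best = float("-inf"), None, 0
    while epochs_since_best < patience:
        train_one_epoch(model, train_data)
        metric = validate(model, val_data)            # averaged over the task's metrics
        if metric > best_metric:
            best_metric, best_model = metric, copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
    return best_model  # this checkpoint is later evaluated on 100% of the test data
```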
3.2 Hyperparameter Search
For each task on each pretrained model type, we
ran a hyperparameter search across the learning
rates 10⁻⁴ and 10⁻⁵ and the batch sizes 10, 30, 60,
and 120. We used 100% of the data for training
and validation and did not use task transfer. Ev-
ery two epochs, the hyperparameter combinations
were evaluated using 100% of the validation data
and the worst 45% were removed so that the search
didn’t waste time training hyperparameters already
known to be subpar. The search stopped after ev-
ery hyperparameter combination had either been
removed or stopped by the stopping mechanism.
Once finished, the hyperparameter search saved the
hyperparameter combination that had the highest
validation metric on 100% of the validation data.
These saved hyperparameters were then used in
our experiments regardless of whether or not task
transfer was used or which percentage of the data
was being used for training. See Appendix B for
the results of the hyperparameter search.
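A simplified sketch of this pruning schedule is shown below; `train_two_epochs_and_validate` is a hypothetical callable that continues training a configuration for two more epochs and returns its latest validation metric, and the interaction with the per-configuration stopping mechanism is omitted for brevity.

```python
import itertools

# A simplified sketch of the pruning schedule: every two epochs the surviving
# configurations are scored on the full validation set and roughly the worst 45%
# are dropped. Not an exact reproduction of our search code.
def prune_search(train_two_epochs_and_validate,
                 learning_rates=(1e-4, 1e-5),
                 batch_sizes=(10, 30, 60, 120)):
    candidates = list(itertools.product(learning_rates, batch_sizes))
    while len(candidates) > 1:
        scores = {cfg: train_two_epochs_and_validate(cfg) for cfg in candidates}
        ranked = sorted(candidates, key=scores.get, reverse=True)
        candidates = ranked[: max(1, int(0.55 * len(ranked)))]  # keep roughly the best 55%
    return candidates[0]  # highest validation metric wins
```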
3.3 Source and Target Task Models
Once the hyperparameters were determined, the
source models were trained on each task using both
BERT and T5. The source models were all trained
using 100% of the training data. We decided not to
alter the number of samples each source task model
used to train because we wanted to focus on the
data efficiency of the target tasks rather than the
data efficiency of the source tasks. Since we were
exploring sequential transfer learning, the source
model only needs to be trained once but could be
used many times on different target tasks. There-
fore, the data efficiency on the source task is much
less important than the data efficiency on the target
task.
Additionally, we chose not to equalize the num-
ber of samples used in each task. Some tasks are
cheaper to annotate than others (per sample), and
we made the assumption that the dataset was cre-
ated by spending an equal amount of resources to
annotate each task. Under this assumption, any
differences in the number of samples between two
tasks can be attributed to the difference in cost per
sample of annotating those tasks. This assump-
tion only matters when we are comparing different
source tasks; it doesn't affect the trends we see
within a given source task.
After creating models trained on each source
task, we then trained a copy of each of these source
task models on every other task at every percentage
(20/40/60/80/100) of the target task data. These
models are our target task models.
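Schematically, the full experiment grid looks like the sketch below, where `fine_tune` and `evaluate` are hypothetical helpers; this is an outline of the loop structure rather than our actual driver script.

```python
import copy

# A schematic of the experiment grid: each source-task model is copied and
# fine-tuned on every other task at every target-data fraction.
def run_transfer_grid(source_models, tasks, fractions, fine_tune, evaluate):
    results = {}
    for source_task, source_model in source_models.items():
        for target_task in tasks:
            if target_task == source_task:
                continue                              # no transfer onto the same task
            for frac in fractions:                    # 0.2, 0.4, 0.6, 0.8, 1.0
                model = copy.deepcopy(source_model)   # fresh copy of the source model per run
                fine_tune(model, target_task, frac)
                results[(source_task, target_task, frac)] = evaluate(model, target_task)
    return results
```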
3.4 No Transfer Models
In addition to training the source models on 100% of
the training data, we also trained models on each
task with both BERT and T5 without task transfer
using 20, 40, 60, and 80% of the data. We ran
this experiment so that we could measure the effect
of task transfer by comparing the performance of
a model trained using task transfer to the perfor-
mance of a model trained using the same data but
with no task transfer.
3.5 Masked Language Modeling
Since the pretrained models we use (BERT
and T5) were pretrained on more general data
than our dialog dataset, we wanted to look into
the effect of continuing unsupervised learning on
our dataset, an approach that has previously shown positive
results in specific domains (Gu et al., 2021). BERT
was pretrained in part using masked language mod-
eling, where 15% of the input tokens are masked
and the model is tasked with predicting what the
masked tokens were.
To test if unsupervised learning on our dataset
would help, we added masked language modeling
to our list of BERT source tasks. Masked language
modeling is unsupervised, so we were able to sim-
ply use every dialog in each data split as our train-
ing, validation, and testing data (without having
to worry about annotating the data). Notably, the
15% of the input tokens that were masked were re-
chosen randomly at the beginning of each epoch.
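A minimal sketch of this masking step is shown below; it is simplified in that every selected token is replaced with the mask token (omitting BERT's 80/10/10 replacement rule and any handling of special or padding tokens), and a fresh mask is drawn on every call, i.e., every epoch.

```python
import torch

# A minimal sketch of epoch-level masking: 15% of positions are redrawn at random
# each time this function is called, and only those positions contribute to the loss.
def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id     # hide the chosen 15% of tokens
    labels[~mask] = -100                    # ignored by the standard cross-entropy loss
    return masked_inputs, labels
```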
3.6 Random Seeds
Both the initialization of the models and the train-
ing of the models relied on some randomness. We
ran each of our experiments 3 times using a differ-
ent training/initialization seed each time and we av-
eraged all results across the 3 seeds. Crucially, the
data splits were generated randomly but remained
the same across all experiments.
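The seeding and averaging protocol can be sketched as follows, with `run_experiment` as a hypothetical callable that trains a model and returns its test score; the data splits themselves are fixed and independent of these seeds.

```python
import random
import statistics

import numpy as np
import torch

# A sketch of the seeding and averaging protocol; the specific seed values are
# illustrative, not the ones used in our runs.
def average_over_seeds(run_experiment, seeds=(0, 1, 2)):
    scores = []
    for seed in seeds:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)    # controls initialization and training randomness
        scores.append(run_experiment())
    return statistics.mean(scores)  # reported results are means over the 3 seeds
```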
4 Results
We originally hypothesized that we would see a
downward trend in our results showing that when
less target task data is used, transfer learning is
more effective. However, this is not what Figure 1
shows. Our BERT results show a fairly flat trend
while our T5 results show an upward trend. We hy-
pothesize that these trends are due to catastrophic
forgetting (McCloskey and Cohen, 1989; Serrà
et al., 2018). When the model learns the source
task, it forgets some of what it learned during its
pretraining. To compensate for this forgetting, the
model needs a large amount of target task training
data to relearn any of the forgotten knowledge im-
portant to modeling the target task. This isn’t a
problem if the model isn’t performing task transfer
since the model will tend to only forget things that
aren’t relevant to the current task. If this theory is
true, we would see the difference in scores between
transfer and no transfer increase as the percent of
the target task training data increases.
While that is what we see in our T5 results, we
don’t see this in our BERT results. We hypothesize
that there are two forces at play in task transfer:
1. When training on a source task, some new
knowledge is learned, leading to better results
in low-data settings.

2. When training on a source task, some knowl-
edge learned during pretraining is forgotten,
leading to worse results in low-data settings.

Figure 1: How each source task performed on average on the BERT (left) and T5 (right) models. The x-axis
represents which percentage of the data was used for training and validation. The y-axis represents the average
difference between the target task performance using task transfer and target task performance without task transfer.
The performance of each source task is averaged over each valid target task (every other task besides masked
language modeling) and over the 3 training/initialization random seeds.
In the case of T5, (2) may have a stronger
effect, leading to the general upward trend, while in
the case of BERT, these effects are more balanced, so
they cancel each other out, resulting in a mostly flat
trend.
Our results show that, for many source tasks,
task transfer produces a negative effect. While
unfortunate, this result is somewhat expected since
sequential task transfer is particularly susceptible to
the effects of catastrophic forgetting (McCloskey
and Cohen, 1989). In the case of BERT, there
seem to be two distinct clusters of source tasks in
terms of their performance, but T5 does not have
this pattern. This suggests that different model
architectures are affected by learning and forgetting
in task transfer differently.
Interestingly, while masked language modeling
appears to be one of the best BERT source tasks, it
still had very little effect in comparison to not using
task transfer. This contradicts our expectation that
masked language modeling would always help by
adapting the BERT model from the domain it was
pretrained on to the Friends domain. We hypoth-
esize that we do not see this benefit
simply because BERT's pretraining is general
enough to understand the dialogues, so performing
masked language modeling on our domain has little
effect.
5 Conclusion
In this paper, we explored intra-dataset sequential
task transfer's effect on target task data efficiency. We found
that with BERT, adjusting the amount of target task
data used did not significantly affect task transfer’s
effectiveness. With T5, we found that the more
target task data we used, the more effective task
transfer was. Additionally, we found that out of
our 7 tasks, Personality Detection, Emory Emotion
Recognition, MELD Emotion Recognition, and
Masked Language Modeling on BERT had a neu-
tral effect in comparison to not using transfer learn-
ing, while the remaining tasks on BERT and T5 had
mostly negative results. We also showed that just
because a source task performs better than other
tasks on BERT does not mean it will on T5 or