allows us to observe the effect of transfer learning
on target task data efficiency.
2.1 Transfer Method
Task transfer can be accomplished in various ways; we use sequential task transfer (McCloskey and Cohen, 1989; Ruder et al., 2019). The sequential method trains the model in two distinct steps: (1) train the model on a source task, then (2) train the model on the target task. What differentiates the sequential method from alternatives such as multitask and multitask/fine-tuning transfer learning (Caruana, 1994) is that the model never trains on the source and target tasks at the same time. This is ideal because it allows large institutions, which are less limited by computing power and resources, to annotate data, train large models, and publish those models; smaller groups can then use these published source-task models as the base for their target-task models. Because training on the source and target tasks happens in separate steps, this is sequential learning.
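As a rough sketch of this two-step recipe (the toy model, the synthetic batches, and the train_on_task helper below are illustrative stand-ins, not our released code):

```python
import torch
import torch.nn as nn

def train_on_task(model: nn.Module, batches, epochs: int = 1) -> None:
    """Hypothetical helper: a standard supervised fine-tuning loop."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in batches:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()

# Step 1: train on the source task (typically done once by a well-resourced
# group) and publish the resulting checkpoint.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # stand-in for BERT/T5
source_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(10)]
train_on_task(model, source_batches)
torch.save(model.state_dict(), "source_task_model.pt")

# Step 2: a smaller group loads the published weights and fine-tunes on the
# target task; the source and target tasks are never trained jointly.
model.load_state_dict(torch.load("source_task_model.pt"))
target_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(10)]
train_on_task(model, target_batches)
```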
2.2 Dataset
We use the Friends dialog dataset (see Appendix A for details) for our experiments. 70% of the
dialogues in the dataset are used for training, 15%
for validation, and 15% for testing. From there
the training and validation dialogues are divided
into 20%, 40%, 60%, 80%, and 100% data splits
such that every dialog in a smaller split appears in
each larger split. For example, a dialog in the 40%
training split must also appear in the 60%, 80%
and 100% splits but may or may not appear in the
20% split. In this way, larger data divisions are
guaranteed to have at least as much information
as smaller data divisions, even if certain dialogues
inherently hold more information. Not every sample is annotated for every task, so some tasks have more samples than others. Additionally, the splits are consistent across tasks: if a dialog is in the test split for one task, it is also in the test split for every other task it is annotated on. See Appendix A.1 for exact data
split counts.
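One simple way to build such nested splits is to shuffle the dialogues once with a fixed seed and take growing prefixes. The sketch below illustrates this idea; the function name, seed, and dialog IDs are hypothetical, not taken from our preprocessing code.

```python
import random

def nested_splits(dialog_ids, fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Return {fraction: list of dialog ids} where every smaller split is a
    subset of every larger split (prefixes of one fixed shuffle)."""
    ids = list(dialog_ids)
    random.Random(seed).shuffle(ids)  # one fixed ordering shared by all splits
    return {f: ids[: round(f * len(ids))] for f in fractions}

splits = nested_splits(range(1000))
assert set(splits[0.4]) <= set(splits[0.6]) <= set(splits[1.0])
```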
2.3 Models
To see if results differed across pretrained mod-
els, we ran our experiments on both BERT-base
from Devlin et al. (2019) (110 million parameters)
and T5-base from Raffel et al. (2020) (220 million
parameters). On top of the base BERT model, we trained a small classification layer, unique to each task, to convert the output of the base model into a valid output for that task. T5 converts all tasks
into a text-to-text format and requires no additional
classification layer.
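As an illustrative sketch (not our released model code), the per-task head on BERT can be a single linear layer over the [CLS] representation:

```python
import torch.nn as nn
from transformers import AutoModel

class BertForTask(nn.Module):
    """BERT-base encoder plus a small task-specific classification head."""
    def __init__(self, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # representation of the [CLS] token
        return self.head(cls)               # logits in the task's label space
```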
3 Experiment Methodology
3.1 Evaluation and Stopping Mechanism
Each task has one or more metrics it can be eval-
uated on (see Appendix A.1 for metric details).
When a model is evaluated on a task with multiple metrics, its performance is the average of its scores across that task's metrics.
For each experiment, training stopped when the
validation metric hadn’t improved in 4 consecutive
epochs. The validation metric was calculated on the same percentage of the validation data as was used for training. For instance, when training on the 40% training split, the validation metric was calculated on the 40% validation split. We saved the model checkpoint
that had the highest validation metric. Then, after
the training was complete, we evaluated the best
model checkpoint on 100% of the testing data (re-
gardless of the percentage of the data used during
training) and used that score in our analysis.
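A sketch of this stopping mechanism, assuming hypothetical train_one_epoch and evaluate callables supplied by the experiment harness:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate, patience=4):
    """Train until the validation metric has not improved for `patience`
    consecutive epochs, then restore and return the best checkpoint.
    `train_one_epoch(model)` and `evaluate(model)` (which returns the mean
    over the task's metrics on the validation split) are hypothetical."""
    best_metric, best_state, stale_epochs = float("-inf"), None, 0
    while stale_epochs < patience:
        train_one_epoch(model)
        metric = evaluate(model)
        if metric > best_metric:
            best_metric = metric
            best_state = copy.deepcopy(model.state_dict())
            stale_epochs = 0
        else:
            stale_epochs += 1
    model.load_state_dict(best_state)  # best checkpoint, later scored on 100% of the test data
    return model, best_metric
```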
3.2 Hyperparameter Search
For each task on each pretrained model type, we
ran a hyperparameter search across the learning rates $10^{-4}$ and $10^{-5}$ and the batch sizes 10, 30, 60, and 120. We used 100% of the data for training
and validation and did not use task transfer. Ev-
ery two epochs, the hyperparameter combinations
were evaluated using 100% of the validation data
and the worst 45% were removed so that the search did not waste time on hyperparameter combinations already
known to be subpar. The search stopped after ev-
ery hyperparameter combination had either been
removed or stopped by the stopping mechanism.
Once finished, the hyperparameter search saved the
hyperparameter combination that had the highest
validation metric on 100% of the validation data.
These saved hyperparameters were then used in
our experiments regardless of whether or not task
transfer was used or which percentage of the data
was being used for training. See Appendix B for
the results of the hyperparameter search.
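For illustration, the pruning schedule can be sketched as a successive-halving-style loop; the step_two_epochs, validate, and is_stopped hooks are hypothetical stand-ins for our training harness:

```python
import math
from itertools import product

LEARNING_RATES = [1e-4, 1e-5]
BATCH_SIZES = [10, 30, 60, 120]

def hyperparameter_search(step_two_epochs, validate, is_stopped):
    """Prune-as-you-go grid search over learning rate x batch size.
    `step_two_epochs(cfg)` trains a configuration for two more epochs,
    `validate(cfg)` scores it on 100% of the validation data, and
    `is_stopped(cfg)` reports the early-stopping criterion; all three are
    hypothetical hooks into the training harness."""
    survivors = list(product(LEARNING_RATES, BATCH_SIZES))
    best_metric = {}
    while survivors:
        for cfg in survivors:
            step_two_epochs(cfg)
            best_metric[cfg] = max(best_metric.get(cfg, float("-inf")), validate(cfg))
        # Drop configurations halted by the stopping mechanism, then the worst 45%.
        survivors = [c for c in survivors if not is_stopped(c)]
        survivors.sort(key=best_metric.get, reverse=True)
        survivors = survivors[: math.ceil(0.55 * len(survivors))]
    return max(best_metric, key=best_metric.get)  # best combination overall
```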