cently proposed as a set of statistics collected during the course of a model's training to automatically evaluate dataset quality by identifying annotation artifacts. These statistics offer a three-dimensional view of a model's uncertainty towards each training example, classifying examples into distinct regions: easy, ambiguous, and hard for a model to learn.
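As a rough illustration (not the exact formulation from prior work), such statistics can be derived from the model's per-epoch predictions on its own training data. The sketch below assumes the gold-label probabilities and correctness flags are logged after each epoch; all names are illustrative:

```python
import numpy as np

def training_dynamics(gold_probs: np.ndarray, correct: np.ndarray):
    """Per-example training-dynamics statistics.

    gold_probs: (num_epochs, num_examples) probability assigned to the gold
                label after each training epoch.
    correct:    (num_epochs, num_examples) boolean flags, True when the
                prediction matched the gold label after that epoch.
    """
    confidence = gold_probs.mean(axis=0)   # mean gold-label probability
    variability = gold_probs.std(axis=0)   # spread of that probability over epochs
    correctness = correct.mean(axis=0)     # fraction of epochs answered correctly
    return confidence, variability, correctness
```

Under this view, high-confidence, low-variability examples fall in the easy region, low-confidence, low-variability ones in the hard region, and high-variability ones in the ambiguous region.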
We test a series of easy-to-hard curricula based on TD, namely TD-CL, with existing schedulers as well as novel modifications of them, and we also experiment with other task-specific and task-agnostic difficulty metrics. We report performance and training times in three settings: in-distribution (ID), out-of-distribution (OOD), and zero-shot (ZS) transfer to languages other than English. To the best of our knowl-
edge, no prior work on NLU considers the impact
of CL on all these settings. To consolidate our
findings, we evaluate models on different classifica-
tion tasks, including Natural Language Inference,
Paraphrase Identification, Commonsense Causal
Reasoning and Document Classification.
Our findings suggest that TD-CL improves zero-shot cross-lingual transfer by up to 1.2% over prior work and yields an average training speedup of 20%, up to 51% in certain cases. In ID settings CL has minimal to no impact, while in OOD settings models trained with TD-CL can boost performance by up to 8.5% on a different domain. Finally, TD provides more stable training than another task-specific metric (Cross-Review). On the other hand, heuristics can also offer improvements, especially when testing on a completely different domain.
2 Related Work
Curriculum Learning was initially mentioned in the work of Elman (1993), who demonstrated the importance of feeding neural networks with small/easy inputs at the early stages of training. The concept was later formalised by Bengio et al. (2009), who showed that training in an easy-to-hard ordering results in faster convergence and improved performance. In general, Curriculum Learning requires a difficulty metric (also known as the scoring function), used to rank training instances, and a scheduler (known as the pacing function), which decides when and how new examples of different difficulty should be introduced to the model.
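To make the two components concrete, the following minimal sketch sorts the data once with a scoring function and lets a pacing function control how much of the sorted data is available at each step. The names (score_fn, linear_pacing) are illustrative and not tied to any specific prior work:

```python
import random

def linear_pacing(step, total_steps, start_frac=0.2):
    """Pacing function: fraction of the sorted data exposed at a given step."""
    return min(1.0, start_frac + (1.0 - start_frac) * step / total_steps)

def curriculum_batches(dataset, score_fn, total_steps, batch_size):
    """Easy-to-hard curriculum: rank once with score_fn, expand the pool over time."""
    ordered = sorted(dataset, key=score_fn)  # scoring function: easiest first
    for step in range(total_steps):
        pool_size = max(batch_size, int(linear_pacing(step, total_steps) * len(ordered)))
        yield random.sample(ordered[:pool_size], batch_size)
```

Different curricula then amount to different choices of scoring function and pacing function within this loop.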
Example Difficulty was initially expressed via model loss in self-paced learning (Kumar et al., 2010; Jiang et al., 2015), increasing the contribution of harder training instances over time. This setting posed a challenge due to the fast-changing pace of the loss during training, so later approaches used human-intuitive difficulty metrics, such as sentence length or the presence of rare words (Platanios et al., 2019), to pre-compute the difficulty of training instances. However, as such metrics do not express difficulty from the model's perspective, model-based metrics have been proposed over the years, such as measuring the loss difference between two checkpoints (Xu et al., 2020b) or model translation variability (Wang et al., 2019b; Wan et al., 2020).
In our curricula we use training dynamics to measure example difficulty, i.e. metrics that consider difficulty from the perspective of a model towards a certain task. Example difficulty can also be estimated either in a static (offline) or a dynamic (online) manner: in the latter, training instances are evaluated and re-ordered at certain points during training, while in the former the difficulty of each example remains the same throughout. In our experiments we adopt the former and consider static example difficulties.
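For instance, static (offline) difficulties based on human-intuitive metrics of the kind used by Platanios et al. (2019) can be pre-computed once before training. The sketch below is only an illustration of such heuristics, not their exact formulation:

```python
from collections import Counter

def length_difficulty(sentences):
    """Heuristic: longer sentences are treated as harder."""
    return [len(s.split()) for s in sentences]

def rare_word_difficulty(sentences, min_count=5):
    """Heuristic: more rare words (corpus frequency below min_count) means harder."""
    counts = Counter(w for s in sentences for w in s.split())
    return [sum(counts[w] < min_count for w in s.split()) for s in sentences]
```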
Transfer Teacher CL is a particular family of such approaches that uses an external model (the teacher) to measure the difficulty of training examples. Notable works employ a simpler model as the teacher (Zhang et al., 2018) or a larger-sized model (Hacohen and Weinshall, 2019), as well as similar-sized learners trained on different subsets of the training data. These methods define example difficulty as the teacher model's perplexity (Zhou et al., 2020), the norm of the teacher's word embeddings (Liu et al., 2020), or the teacher's performance on a certain task (Xu et al., 2020a), or treat difficulty as a latent variable in a teacher model (Lalor and Yu, 2020). In the same vein, we also adopt Transfer Teacher CL, with teacher and student models of the same size and type. Unlike prior work, however, we take into account the behavior of the teacher during the course of its training to measure example difficulty, instead of considering its performance at the end of training or analysing its internal embeddings.
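The distinction can be illustrated with a small sketch (hypothetical names, assuming the teacher's per-epoch gold-label probabilities are logged) that contrasts an end-of-training teacher score with one averaged over the teacher's training:

```python
import numpy as np

def teacher_difficulty(teacher_gold_probs: np.ndarray, over_training: bool = True):
    """Difficulty from a same-sized teacher model.

    teacher_gold_probs: (num_epochs, num_examples) gold-label probabilities
    recorded after each of the teacher's training epochs.
    """
    if over_training:
        # Behavior across the whole of the teacher's training.
        return 1.0 - teacher_gold_probs.mean(axis=0)
    # Snapshot at the end of training only, as in many prior transfer-teacher setups.
    return 1.0 - teacher_gold_probs[-1]
```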
Moving on to Schedulers, these can be divided into discrete and continuous. Discrete schedulers, often referred to as bucketing, group training instances that share similar difficulties into distinct sets. Different configurations include accumulating buckets over time (Cirik et al., 2016), sampling a subset of data from each bucket (Xu et al., 2020a; Kocmi and Bojar, 2017) or more sophisti-