Training Dynamics for Curriculum Learning:
A Study on Monolingual and Cross-lingual NLU
Fenia Christopoulou, Gerasimos Lampouras, Ignacio Iacobacci
Huawei Noah’s Ark Lab, London, UK
{efstathia.christopoulou, gerasimos.lampouras, ignacio.iacobacci}@huawei.com
Abstract
Curriculum Learning (CL) is a technique for training models by ranking examples, typically in order of increasing difficulty, with the aim of accelerating convergence and improving generalisability. Current approaches for Natural Language Understanding (NLU) tasks use CL to improve in-distribution performance, often via heuristic-oriented or task-agnostic difficulty metrics. In this work, we instead employ CL for NLU by taking advantage of training dynamics as difficulty metrics, i.e. statistics that measure the behavior of the model at hand on specific task-data instances during training, and propose modifications of existing CL schedulers based on these statistics. Unlike existing works, we focus on evaluating models on in-distribution (ID), out-of-distribution (OOD) as well as zero-shot (ZS) cross-lingual transfer datasets. We show across several NLU tasks that CL with training dynamics can result in better performance, mostly in zero-shot cross-lingual transfer and OOD settings, with improvements of up to 8.5% in certain cases. Overall, our experiments indicate that training dynamics can lead to better-performing models with smoother training compared to other difficulty metrics, while being 20% faster on average. In addition, through analysis we shed light on the correlations of task-specific versus task-agnostic metrics.¹
1 Introduction
Transformer-based language models (Vaswani et al., 2017; Devlin et al., 2019, LMs) have recently achieved great success in a variety of NLP tasks (Wang et al., 2018, 2019a). However, generalisation to out-of-distribution (OOD) data and zero-shot cross-lingual transfer still remain a challenge (Linzen, 2020; Hu et al., 2020). Among existing techniques, improving OOD performance has been addressed by training with adversarial data (Yi et al., 2021), while better transfer across languages has been achieved by selecting appropriate languages to transfer from (Lin et al., 2019; Turc et al., 2021), employing meta-learning (Nooralahzadeh et al., 2020) or data alignment (Fang et al., 2020).

¹Code is available at https://github.com/huawei-noah/noah-research/tree/master/NLP/TD4CL
Contrasting with such approaches, which take advantage of additional training data, is Curriculum Learning (Bengio et al., 2009, CL), a technique that aims to train models using a specific ordering of the original training examples. This ordering typically follows an increasing difficulty trend, where easy examples are fed to the model first, moving towards harder instances. The intuition behind CL stems from human learning, as humans focus on simpler concepts before learning more complex ones, a procedure called shaping (Krueger and Dayan, 2009). Although curricula have primarily been used in Computer Vision (Hacohen and Weinshall, 2019; Wu et al., 2021) and Machine Translation (Zhang et al., 2019a; Platanios et al., 2019), only a handful of approaches incorporate CL into Natural Language Understanding tasks (Sachan and Xing, 2016; Tay et al., 2019; Lalor and Yu, 2020; Xu et al., 2020a).
Typically, CL requires a measure of difficulty for each example in the training set. Existing methods using CL in NLU tasks rely on heuristics such as sentence length, word rarity, or depth of the dependency tree (Platanios et al., 2019; Tay et al., 2019), metrics based on item-response theory (Lalor and Yu, 2020), or task-agnostic model metrics such as perplexity (Zhou et al., 2020). Such metrics have been employed to improve in-distribution performance on NLU or Machine Translation; however, their effect in other settings remains under-explored.
In this study, we instead propose to adopt training dynamics (Swayamdipta et al., 2020, TD) as difficulty measures for CL and fine-tune models with curricula on downstream tasks. TD were recently proposed as a set of statistics collected during the course of a model's training to automatically evaluate dataset quality by identifying annotation artifacts. These statistics offer a 3-dimensional view of a model's uncertainty towards each training example, classifying examples into distinct areas: easy, ambiguous and hard for a model to learn.
We test a series of easy-to-hard curricula using TD, namely TD-CL, with existing schedulers as well as novel modifications of them, and experiment with other task-specific and task-agnostic metrics. We report performance and training times in three settings: in-distribution (ID), out-of-distribution (OOD) and zero-shot (ZS) transfer to languages other than English. To the best of our knowledge, no prior work on NLU considers the impact of CL on all these settings. To consolidate our findings, we evaluate models on different classification tasks, including Natural Language Inference, Paraphrase Identification, Commonsense Causal Reasoning and Document Classification.
Our findings suggest that TD-CL provides better zero-shot cross-lingual transfer, up to 1.2% over prior work, and can gain an average speedup of 20%, up to 51% in certain cases. In ID settings CL has minimal to no impact, while in OOD settings models trained with TD-CL can boost performance by up to 8.5% on a different domain. Finally, TD provide more stable training compared to another task-specific metric (Cross-Review). On the other hand, heuristics can also offer improvements, especially when testing on a completely different domain.
2 Related Work
Curriculum Learning was initially mentioned in the work of Elman (1993), who demonstrated the importance of feeding neural networks small/easy inputs at the early stages of training. The concept was later formalised by Bengio et al. (2009), where training in an easy-to-hard ordering was shown to result in faster convergence and improved performance. In general, Curriculum Learning requires a difficulty metric (also known as the scoring function) used to rank training instances, and a scheduler (known as the pacing function) that decides when and how new examples, of different difficulty, should be introduced to the model.
Example Difficulty was initially expressed via model loss in self-paced learning (Kumar et al., 2010; Jiang et al., 2015), which increases the contribution of harder training instances over time. This setting posed a challenge due to the fast-changing pace of the loss during training, so later approaches used human-intuitive difficulty metrics, such as sentence length or the existence of rare words (Platanios et al., 2019), to pre-compute the difficulties of training instances. However, as such metrics do not express difficulty from the model's perspective, model-based metrics have been proposed over the years, such as measuring the loss difference between two checkpoints (Xu et al., 2020b) or model translation variability (Wang et al., 2019b; Wan et al., 2020). In our curricula we use training dynamics to measure example difficulty, i.e. metrics that consider difficulty from the perspective of a model towards a certain task. Example difficulty can also be estimated in either a static (offline) or dynamic (online) manner: in the latter, training instances are evaluated and re-ordered at certain points during training, while in the former the difficulty of each example remains the same throughout. In our experiments we adopt the former setting and consider static example difficulties.
Transfer Teacher CL is a particular family of such approaches that uses an external model (the teacher) to measure the difficulty of training examples. Notable works incorporate a simpler model as the teacher (Zhang et al., 2018) or a larger model (Hacohen and Weinshall, 2019), as well as similar-sized learners trained on different subsets of the training data. These methods have taken as example difficulty either the teacher model's perplexity (Zhou et al., 2020), the norm of the teacher model's word embeddings (Liu et al., 2020), or the teacher's performance on a certain task (Xu et al., 2020a), or simply regard difficulty as a latent variable in a teacher model (Lalor and Yu, 2020). In the same vein, we also incorporate Transfer Teacher CL via teacher and student models of the same size and type. Differently, however, we take into account the behavior of the teacher during the course of its training to measure example difficulty, instead of considering its performance at the end of training or analysing internal embeddings.
Moving on to Schedulers, these can be divided into discrete and continuous. Discrete schedulers, often referred to as bucketing, group training instances that share similar difficulties into distinct sets. Different configurations include accumulating buckets over time (Cirik et al., 2016), sampling a subset of data from each bucket (Xu et al., 2020a; Kocmi and Bojar, 2017), or more sophisticated sampling strategies (Zhang et al., 2018). In cases where the number of buckets is not obtained in a straightforward manner, methods either heuristically split examples (Zhang et al., 2018), adopt uniform splits (Xu et al., 2020a), or employ schedulers that are based on a continuous function. A characteristic approach is that of Platanios et al. (2019), where at each training step a monotonically increasing function chooses the amount of training data the model has access to, sorted by increasing difficulty. As we describe later on, we experiment with two established schedulers and propose modifications of them based on training dynamics.
Other tasks where CL has been employed include Question Answering (Sachan and Xing, 2016), Reading Comprehension (Tay et al., 2019) and other general NLU classification tasks (Lalor and Yu, 2020; Xu et al., 2020a). Others have developed modified curricula in order to train models for code-switching (Choudhury et al., 2017), anaphora resolution (Stojanovski and Fraser, 2019), relation extraction (Huang and Du, 2019), dialogue (Saito, 2018; Shen and Feng, 2020) and self-supervised Neural Machine Translation (Ruiter et al., 2020), while more advanced approaches combine CL with Reinforcement Learning in a collaborative teacher-student transfer curriculum (Kumar et al., 2019).
3 Methodology
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be a set of training data instances. A curriculum is comprised of two main elements: the difficulty metric, responsible for associating a training example with a score that represents a notion of difficulty, and the scheduler, which determines the type and number of available instances at each training step $t$. We experiment with three difficulty metrics derived from training dynamics and four schedulers: two are new contributions and the remaining are referenced from previous work.
3.1 Difficulty Metrics
As aforementioned, we use training dynamics (Swayamdipta et al., 2020), i.e. statistics originally introduced to analyse dataset quality, as difficulty metrics. The suitability of such statistics as difficulty measures for CL rests on three core aspects. First, training dynamics are straightforward: they can be easily obtained by training a single model on the target dataset and keeping statistics about its predictions on the training set. Second, training dynamics correlate well with model uncertainty and follow a similar trend to human (dis)agreement in data annotation, essentially combining the view of both worlds. Finally, training dynamics manifest a clear pattern, separating instances into distinct areas (easy, ambiguous and hard examples for a model to learn), which aligns well with the ideas behind Curriculum Learning.
The difficulty of an example $(x_i, y_i)$ is determined by a function $f$, where example $i$ is considered more difficult than example $j$ if $f(x_i, y_i) > f(x_j, y_j)$. We list three difficulty metrics that use statistics collected during the course of a model's training, as follows:
CONFIDENCE (CONF) of an example $x_i$ is the average probability assigned to the gold label $y_i$ by a model with parameters $\theta$ across a number of epochs $E$. This is a continuous metric, with higher values corresponding to easier examples.

$$f_{\text{CONF}}(x_i, y_i) = \mu_i = \frac{1}{E} \sum_{e=1}^{E} p_{\theta^{(e)}}(y_i \mid x_i) \quad (1)$$
CORRECTNESS (CORR) is the number of times a model classifies example $x_i$ correctly across its training. It takes values between $0$ and $E$. Higher correctness indicates examples that are easier for a model to learn.

$$f_{\text{CORR}}(x_i, y_i) = \sum_{e=1}^{E} o_i^{(e)}, \qquad o_i^{(e)} = \begin{cases} 1 & \text{if } \arg\max p_{\theta^{(e)}}(x_i) = y_i \\ 0 & \text{otherwise} \end{cases} \quad (2)$$
VARIABILITY (VAR) of an example $x_i$ is the standard deviation of the probabilities assigned to the gold label $y_i$ across $E$ epochs. It is a continuous metric, with higher values indicating greater uncertainty for a training example.

$$f_{\text{VAR}}(x_i, y_i) = \sqrt{\frac{\sum_{e=1}^{E} \left(p_{\theta^{(e)}}(y_i \mid x_i) - \mu_i\right)^2}{E}} \quad (3)$$
Confidence and correctness are the primary metrics that we use in our curricula, since low and high values correspond to hard and easy examples respectively. On the other hand, variability is used as an auxiliary metric, since only high scores clearly represent uncertain examples, while low scores offer no important information on their own.
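For concreteness, the following is a minimal sketch (in Python/NumPy, on toy data) of how Equations (1)-(3) can be computed from a matrix of per-epoch gold-label probabilities; the array shapes, variable names, and the binary-task shortcut used for correctness are our own illustrative assumptions, not from a released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
E, N = 5, 8                          # epochs, training examples (toy sizes)
# gold_probs[e, i] stands in for p_{theta^(e)}(y_i | x_i)
gold_probs = rng.uniform(0.1, 0.9, size=(E, N))

confidence = gold_probs.mean(axis=0)           # Eq. (1): mean over epochs
# Eq. (2): correct when argmax equals the gold label; for a binary task
# this reduces to p(gold) > 0.5, which we assume here for brevity.
correctness = (gold_probs > 0.5).sum(axis=0)
variability = gold_probs.std(axis=0)           # Eq. (3): population std

# Higher confidence/correctness -> easier example; higher variability ->
# more model uncertainty. An easy-to-hard ranking by confidence:
order = np.argsort(-confidence)
print(order, correctness[order])
```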
3.2 Schedulers
We consider both discrete and continuous schedulers. Each scheduler is paired with the metric to which it is best suited: the discrete correctness metric is combined with annealing, and the continuous confidence metric is combined with competence.
The ANNEALING (CORRANNEAL) scheduler, proposed by Xu et al. (2020a), assumes that training data are split into buckets $\{d_1 \subset \mathcal{D}, \ldots, d_K \subset \mathcal{D}\}$ with possibly different sizes $|d_i|$. In particular, we group examples into the same bucket if they have the same correctness score (see Equation (2)). In total, this results in $E+1$ buckets, which are sorted in order of increasing difficulty. Training starts with the easiest bucket. We then move on to the next bucket, also randomly selecting $1/(E+1)$ of the examples from each previous bucket. Following prior work, we train on each bucket for one epoch.
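A minimal sketch of this annealing procedure, under our reading of the description above, is shown below; the function and variable names are illustrative and not taken from a released codebase.

```python
import random
from collections import defaultdict

def annealing_schedule(example_ids, correctness, num_epochs, seed=0):
    """Yield one list of example ids per curriculum phase (one epoch each)."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for idx, c in zip(example_ids, correctness):
        buckets[c].append(idx)
    # Up to E+1 correctness values; highest correctness = easiest bucket,
    # and training starts from the easiest.
    ordered = [buckets[c] for c in sorted(buckets, reverse=True)]
    frac = 1.0 / (num_epochs + 1)    # replay fraction from earlier buckets
    for k, bucket in enumerate(ordered):
        phase = list(bucket)
        for prev in ordered[:k]:     # re-sample a slice of each seen bucket
            phase += rng.sample(prev, max(1, int(len(prev) * frac)))
        rng.shuffle(phase)
        yield phase

# Toy usage: 10 examples with correctness scores out of E=3 epochs.
ids = list(range(10))
corr = [3, 3, 2, 2, 2, 1, 1, 0, 0, 3]
for phase_num, phase in enumerate(annealing_schedule(ids, corr, num_epochs=3)):
    print(phase_num, phase)
```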
The COMPETENCE (CONFCOMP) scheduler was originally proposed by Platanios et al. (2019). Here, we sort examples based on the confidence metric (see Equation (1)) and use a monotonically increasing function to obtain the percentage of available training data at each step. The model can use only the top $K$ most confident examples, as instructed by this function. A mini-batch is then sampled uniformly from the available examples.
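Platanios et al. (2019) propose linear and square-root competence functions; the sketch below uses the square-root variant. The hyperparameter values (initial competence `c0`, curriculum length `T`, batch size) and variable names here are toy choices of ours.

```python
import numpy as np

def competence(t, T, c0=0.01):
    """Fraction of the difficulty-sorted data available at step t."""
    return min(1.0, np.sqrt(t * (1.0 - c0 ** 2) / T + c0 ** 2))

rng = np.random.default_rng(0)
N, T, batch_size = 1000, 500, 16
conf_scores = rng.uniform(size=N)        # Eq. (1) confidences (toy values)
sorted_ids = np.argsort(-conf_scores)    # easiest (most confident) first

for t in range(T):
    k = max(batch_size, int(competence(t, T) * N))
    available = sorted_ids[:k]           # top-k most confident examples
    batch = rng.choice(available, size=batch_size, replace=False)
    # ... one training step on `batch` would go here ...
print("final available fraction:", competence(T, T))
```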
In addition to these schedulers, we introduce the following modifications that take advantage of the variability metric. CORRECTNESS + VARIABILITY ANNEALING (CORR+VARANNEAL) is a modification of the Annealing scheduler, and CONFIDENCE + VARIABILITY COMPETENCE (CONF+VARCOMP) is a modification of the Competence scheduler. In both variations, instead of sampling uniformly across available examples, we give higher probability to instances with high variability scores (Equation (3)), essentially using two metrics instead of one. We assume that, since the model is more uncertain about such examples, further training on them can be beneficial. For all curricula, after the model has finished the curriculum stage, we resume training as normal, i.e. by random sampling of training instances.
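This modification amounts to a weighted draw over whatever subset of examples the underlying scheduler currently exposes. The sketch below normalises raw variability scores into sampling probabilities; this proportional scheme is one plausible reading of "higher probability to high-variability instances", not necessarily the exact weighting used in our experiments.

```python
import numpy as np

def sample_by_variability(available_ids, variability, batch_size, rng):
    """Draw a mini-batch with probability proportional to variability."""
    v = variability[available_ids].astype(float)
    probs = (v + 1e-8) / (v + 1e-8).sum()   # avoid zero-probability items
    return rng.choice(available_ids, size=batch_size,
                      replace=False, p=probs)

rng = np.random.default_rng(0)
variability = rng.uniform(size=100)          # Eq. (3) scores (toy values)
available = np.arange(40)                    # ids exposed by the scheduler
print(sample_by_variability(available, variability, batch_size=8, rng=rng))
```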
3.3 Transfer Teacher Curriculum Learning
In order to train a model (the student) with training dynamics provided by another model (the teacher), the latter must first be fine-tuned on a target dataset. In other words, the proposed metrics are used in a transfer teacher CL setting (Matiisen et al., 2019).
[Figure 1: Transfer Teacher Curriculum Learning used in our study. A teacher model determines the difficulty of training examples by collecting training dynamics during fine-tuning (Stage 1). The collected dynamics are converted into difficulty metrics and are given to a student model via a scheduler (Stage 2).]
The two-step procedure that we follow in this study is depicted in Figure 1. Initially, a model (the teacher) is fine-tuned on a target dataset and training dynamics are collected during the course of training. The collected dynamics are then converted into difficulty metrics, following Equations (1)-(3). In the second stage, the difficulty metrics and the original training data are fed into a scheduler that re-orders the examples according to their difficulty (in our case from easy to hard) and feeds them into another model (the student) that is the same size and type as the teacher.
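The control flow of the two stages can be summarised in a short, runnable sketch. The "teacher" below is a trivial synthetic probability generator standing in for a fine-tuned LM, and the pacing fractions are arbitrary; only the overall pipeline (collect dynamics with a teacher, derive difficulty metrics, schedule the student's data) mirrors Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, E = 20, 4                                   # examples, teacher epochs

def teacher_gold_probs():
    """Stage 1 stand-in: per-epoch gold-label probabilities, shape (E, N)."""
    base = rng.uniform(0.2, 0.8, size=N)       # per-example "easiness"
    drift = np.linspace(0.0, 0.2, E)[:, None]  # the model improves over epochs
    return np.clip(base[None, :] + drift, 0.0, 1.0)

probs = teacher_gold_probs()
metrics = {
    "confidence": probs.mean(0),               # Eq. (1)
    "correctness": (probs > 0.5).sum(0),       # Eq. (2), binary assumption
    "variability": probs.std(0),               # Eq. (3)
}

# Stage 2: a one-shot scheduler that sorts easy-to-hard by confidence and
# hands phases of growing size to the student's training loop.
order = np.argsort(-metrics["confidence"])
for frac in (0.25, 0.5, 1.0):                  # toy pacing function
    phase = order[: int(frac * N)]
    print(f"student phase: {len(phase)} easiest examples")
```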
4 Experimental Setup
4.1 Datasets
In this work we focus on four NLU classification tasks: Natural Language Inference, where given a premise and a hypothesis the task is to identify whether the hypothesis entails, contradicts, or is neutral with respect to the premise; Paraphrase Identification, where the task is to determine whether two sentences are paraphrases of one another; Commonsense Causal Reasoning, where given a premise, a question and a set of choices the task is to find the correct answer to the question based on the premise; and Document Classification, where each document should be assigned the correct category.
We aim for a comparison across three settings: in-distribution (ID), out-of-distribution (OOD) and zero-shot (ZS); hence, we select datasets that cover all these settings where possible. We use a small subset of the GLUE benchmark (Wang et al., 2018) covering the NLI task (RTE, QNLI and MNLI) and four cross-lingual datasets: XNLI (Conneau et al., 2018), PAWS-X (Yang et al., 2019) for paraphrase detection, XCOPA (Ponti et al.,