cently proposed as a set of statistics collected during the course of a model's training to automatically evaluate dataset quality by identifying annotation artifacts. These statistics offer a three-dimensional view of a model's uncertainty towards each training example, classifying examples into distinct regions: easy, ambiguous, and hard for a model to learn.
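As a rough illustration (not the exact formulation from prior work), such statistics can be derived from the model's per-epoch predictions on its own training data. The sketch below assumes the gold-label probabilities and correctness flags are logged after each epoch; all names are illustrative:

```python
import numpy as np

def training_dynamics(gold_probs: np.ndarray, correct: np.ndarray):
    """Per-example training-dynamics statistics.

    gold_probs: (num_epochs, num_examples) probability assigned to the gold
                label after each training epoch.
    correct:    (num_epochs, num_examples) boolean flags, True when the
                prediction matched the gold label after that epoch.
    """
    confidence = gold_probs.mean(axis=0)   # mean gold-label probability
    variability = gold_probs.std(axis=0)   # spread of that probability over epochs
    correctness = correct.mean(axis=0)     # fraction of epochs answered correctly
    return confidence, variability, correctness
```

Under this view, high-confidence, low-variability examples fall in the easy region, low-confidence, low-variability ones in the hard region, and high-variability ones in the ambiguous region.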
We test a series of easy-to-hard curricula based on TD, namely TD-CL, with existing schedulers as well as novel modifications of them, and we also experiment with other task-specific and task-agnostic difficulty metrics. We report performance and training times in three settings: in-distribution (ID), out-of-distribution (OOD), and zero-shot (ZS) transfer to languages other than English. To the best of our knowl-
edge, no prior work on NLU considers the impact
of CL on all these settings. To consolidate our
findings, we evaluate models on different classifica-
tion tasks, including Natural Language Inference,
Paraphrase Identification, Commonsense Causal
Reasoning and Document Classification.
Our findings suggest that TD-CL improves zero-shot cross-lingual transfer by up to 1.2% over prior work and yields an average training speedup of 20%, up to 51% in certain cases. In ID settings CL has minimal to no impact, while in OOD settings models trained with TD-CL can boost performance by up to 8.5% on a different domain. Finally, TD provides more stable training than another task-specific metric (Cross-Review). On the other hand, heuristics can also offer improvements, especially when testing on a completely different domain.
2 Related Work
Curriculum Learning was initially mentioned in the work of Elman (1993), who demonstrated the importance of feeding neural networks with small/easy inputs at the early stages of training. The concept was later formalised by Bengio et al. (2009), who showed that training in an easy-to-hard ordering results in faster convergence and improved performance. In general, Curriculum Learning requires a difficulty metric (also known as the scoring function), used to rank training instances, and a scheduler (known as the pacing function), which decides when and how new examples of different difficulty should be introduced to the model.
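To make the two components concrete, the following minimal sketch sorts the data once with a scoring function and lets a pacing function control how much of the sorted data is available at each step. The names (score_fn, linear_pacing) are illustrative and not tied to any specific prior work:

```python
import random

def linear_pacing(step, total_steps, start_frac=0.2):
    """Pacing function: fraction of the sorted data exposed at a given step."""
    return min(1.0, start_frac + (1.0 - start_frac) * step / total_steps)

def curriculum_batches(dataset, score_fn, total_steps, batch_size):
    """Easy-to-hard curriculum: rank once with score_fn, expand the pool over time."""
    ordered = sorted(dataset, key=score_fn)  # scoring function: easiest first
    for step in range(total_steps):
        pool_size = max(batch_size, int(linear_pacing(step, total_steps) * len(ordered)))
        yield random.sample(ordered[:pool_size], batch_size)
```

Different curricula then amount to different choices of scoring function and pacing function within this loop.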
Example Difficulty was initially expressed via model loss in self-paced learning (Kumar et al., 2010; Jiang et al., 2015), increasing the contribution of harder training instances over time. This setting posed a challenge due to the fast-changing pace of the loss during training, so later approaches used human-intuitive difficulty metrics, such as sentence length or the presence of rare words (Platanios et al., 2019), to pre-compute the difficulty of training instances. However, as such metrics do not express difficulty from the model's perspective, model-based metrics have been proposed over the years, such as measuring the loss difference between two checkpoints (Xu et al., 2020b) or model translation variability (Wang et al., 2019b; Wan et al., 2020).
In our curricula we use training dynamics to measure example difficulty, i.e. metrics that consider difficulty from the perspective of a model towards a certain task. Example difficulty can also be estimated either in a static (offline) or a dynamic (online) manner: in the latter, training instances are evaluated and re-ordered at certain points during training, while in the former the difficulty of each example remains the same throughout. In our experiments we adopt the former and consider static example difficulties.
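For instance, static (offline) difficulties based on human-intuitive metrics of the kind used by Platanios et al. (2019) can be pre-computed once before training. The sketch below is only an illustration of such heuristics, not their exact formulation:

```python
from collections import Counter

def length_difficulty(sentences):
    """Heuristic: longer sentences are treated as harder."""
    return [len(s.split()) for s in sentences]

def rare_word_difficulty(sentences, min_count=5):
    """Heuristic: more rare words (corpus frequency below min_count) means harder."""
    counts = Counter(w for s in sentences for w in s.split())
    return [sum(counts[w] < min_count for w in s.split()) for s in sentences]
```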
Transfer Teacher CL is a particular family of such approaches that uses an external model (the teacher) to measure the difficulty of training examples. Notable works employ a simpler model as the teacher (Zhang et al., 2018) or a larger-sized model (Hacohen and Weinshall, 2019), as well as similar-sized learners trained on different subsets of the training data. These methods define example difficulty as the teacher model's perplexity (Zhou et al., 2020), the norm of the teacher's word embeddings (Liu et al., 2020), or the teacher's performance on a certain task (Xu et al., 2020a), or treat difficulty as a latent variable in a teacher model (Lalor and Yu, 2020). In the same vein, we also adopt Transfer Teacher CL, with teacher and student models of the same size and type. Unlike prior work, however, we take into account the behavior of the teacher during the course of its training to measure example difficulty, instead of considering its performance at the end of training or analysing its internal embeddings.
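The distinction can be illustrated with a small sketch (hypothetical names, assuming the teacher's per-epoch gold-label probabilities are logged) that contrasts an end-of-training teacher score with one averaged over the teacher's training:

```python
import numpy as np

def teacher_difficulty(teacher_gold_probs: np.ndarray, over_training: bool = True):
    """Difficulty from a same-sized teacher model.

    teacher_gold_probs: (num_epochs, num_examples) gold-label probabilities
    recorded after each of the teacher's training epochs.
    """
    if over_training:
        # Behavior across the whole of the teacher's training.
        return 1.0 - teacher_gold_probs.mean(axis=0)
    # Snapshot at the end of training only, as in many prior transfer-teacher setups.
    return 1.0 - teacher_gold_probs[-1]
```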
Moving on to Schedulers, these can be divided into discrete and continuous. Discrete schedulers, often referred to as bucketing, group training instances that share similar difficulties into distinct sets. Different configurations include accumulating buckets over time (Cirik et al., 2016), sampling a subset of data from each bucket (Xu et al., 2020a; Kocmi and Bojar, 2017) or more sophisti-