2 Related Work
Example Hardness.
Several recent works quantify example hardness with various training-time metrics. Many of these metrics are based on first-split learning dynamics [8, 25, 27, 35, 43]. Other works have resorted to properties of deep networks such as compression ability [21] and prediction depth [5]. Carlini et al. [7] study metrics centered around model training such as confidence, ensemble agreement, adversarial robustness, holdout retraining, and accuracy under privacy-preserving training.
Closest in spirit to the SSFT studied in our paper are the efforts in [7, 47]. Crucially, Carlini et al. [7] study the KL divergence of the prediction vector after fine-tuning on a held-out set at a low learning rate, and do not draw any direct inference about the separation offered by their metric. Focusing on (first-split) forgetting dynamics, Toneva et al. [47] defined a metric based on the number of forgetting events during training and identified sets of unforgettable examples that are never misclassified once learned. In our work, we find complementary benefits of analysis based on first- and second-split dynamics.
Memorization of Data Points.
To capture the memorization ability of deep networks, recent work has studied their ability to memorize noise (or randomly labeled samples) [3, 48]. As opposed to the memorization of rare examples, the memorization of noisy samples hurts generalization and makes the classifier boundary more complex [15]. Conversely, a recent line of work argues that memorization of (atypical) data points is important for achieving optimal generalization performance when data is sampled from long-tailed distributions [6, 11, 15].
Simplicity Bias.
Another line of work argues that neural networks have a bias toward learning simple features [43], and often do not learn complex features even when the complex feature is more predictive of the true label than the simple one. This suggests that models end up memorizing (as they would noise) the few samples in the dataset that contain only the complex feature, and utilize the simple feature to correctly predict the other training examples [1, 32].
Label Noise.
Large-scale machine learning datasets are typically labeled with the help of human labelers [12] to facilitate supervised learning. A significant fraction of the labels in common machine learning datasets has been shown to be erroneous [39]. Learning under noisy labels is a long-studied problem [2, 26, 31, 37], and various recent methods have attempted to identify label noise [10, 23, 38, 40]. While the focus of our work is not to propose a new method in this long line of work, we show that the lens of forgetting time naturally distills out examples with noisy labels. Future work may benefit from augmenting our metric with state-of-the-art methods for label noise identification.
3 Method
The primary goal of our work is to characterize the hardness of different datapoints in a given dataset. Suppose we have a dataset $S_A = \{x_i, y_i\}_{i=1}^{n}$ such that $(x_i, y_i) \sim \mathcal{D}$. For the purpose of characterization, we augment each datapoint $(x_i, y_i) \in S_A$ with parameters $(\mathrm{fslt}_i, \mathrm{ssft}_i)$, where $\mathrm{fslt}_i$ quantifies the first-split learning time (FSLT) and $\mathrm{ssft}_i$ quantifies the second-split forgetting time (SSFT) of the sample. We next describe our proposed procedure for obtaining these parameters.
Procedure. We train a model $f$ on $S$ to minimize the empirical risk $\mathcal{L}(S; f) = \sum_i \ell(f(x_i), y_i)$. We use $f_A$ to denote a model $f$ (initialized with random weights) trained on $S_A$ until convergence (100% accuracy on $S_A$). We then train a model initialized with $f_A$ on a held-out split $S_B \sim \mathcal{D}^n$ until convergence; we denote this model by $f_{A \to B}$. To obtain the parameters $(\mathrm{fslt}_i, \mathrm{ssft}_i)$, we track the per-example predictions $\hat{y}^t_i$ at the end of every epoch $t$ of training. Unless specified otherwise, we train the model with cross-entropy loss using Stochastic Gradient Descent (SGD).
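To make this concrete, below is a minimal PyTorch-style sketch of the two-phase procedure with per-epoch prediction tracking. It is an illustration under the setup above, not our exact implementation; the helper `train_and_track` and the loader/optimizer names are hypothetical, and `track_loader` must iterate $S_A$ in a fixed order (no shuffling) so that column $i$ always refers to example $i$.

```python
import torch
import torch.nn.functional as F

def train_and_track(model, train_loader, track_loader, optimizer, num_epochs, device="cpu"):
    """Train `model` on `train_loader` with cross-entropy + SGD and, at the
    end of every epoch t, record the predicted label of each example in
    `track_loader`. Returns a (num_epochs, n) tensor of predictions."""
    history = []
    for epoch in range(num_epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)  # empirical risk L(S; f)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():  # per-example predictions \hat{y}^t_i at epoch t
            preds = [model(x.to(device)).argmax(dim=1).cpu() for x, _ in track_loader]
        history.append(torch.cat(preds))
    return torch.stack(history)

# Phase 1 (f_A): train from random initialization on S_A, tracking S_A itself.
#   preds_A = train_and_track(model, loader_A, loader_A_eval, opt, epochs_A)
# Phase 2 (f_{A->B}): continue from f_A on the held-out split S_B, still
# tracking the examples of S_A to observe when each one is forgotten.
#   preds_AB = train_and_track(model, loader_B, loader_A_eval, opt, epochs_B)
```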
Definition 1 (First-Split Learning Time). For $\{x_i, y_i\} \in S_A$, the learning time is defined as the earliest epoch during the training of a classifier $f$ on $S_A$ after which the sample is always classified correctly, i.e.,
$$\mathrm{fslt}_i = \arg\min_{t^*} \left( \hat{y}^t_{i,(A)} = y_i \;\; \forall\, t \ge t^* \right) \qquad \forall\, \{x_i, y_i\} \in S_A. \tag{1}$$
Definition 2 (Second-Split Forgetting Time). Let $\hat{y}^t_{i,(A \to B)}$ denote the prediction for sample $\{x_i, y_i\} \in S_A$ after training $f_{(A \to B)}$ for $t$ epochs on $S_B$. Then, for $\{x_i, y_i\} \in S_A$, the forgetting time is defined as the earliest epoch after which the sample is never classified correctly, i.e.,
$$\mathrm{ssft}_i = \arg\min_{t^*} \left( \hat{y}^t_{i,(A \to B)} \neq y_i \;\; \forall\, t \ge t^* \right) \qquad \forall\, \{x_i, y_i\} \in S_A. \tag{2}$$
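Given the per-epoch prediction matrices from the sketch above, both definitions reduce to finding the first epoch after which a boolean condition holds forever. Below is a minimal sketch of this computation; the helper `stable_time` is hypothetical, epochs are 0-indexed, and examples for which the condition never stabilizes (e.g., samples that are never forgotten) are assigned the total epoch count as an effectively infinite time, which is a simplifying assumption on our part.

```python
import torch

def stable_time(hits: torch.Tensor) -> torch.Tensor:
    """For a boolean (num_epochs, n) matrix `hits`, return for each example i
    the earliest epoch t* such that hits[t, i] holds for all t >= t*."""
    num_epochs = hits.shape[0]
    # suffix[t, i] is True iff hits[t', i] holds for every t' >= t:
    # reverse along the epoch axis, take a running minimum, reverse back.
    suffix = torch.flip(torch.cummin(torch.flip(hits.int(), [0]), dim=0).values, [0]).bool()
    t_star = suffix.int().argmax(dim=0)      # first True per column
    t_star[~suffix.any(dim=0)] = num_epochs  # condition never stabilizes
    return t_star

# Definition 1: FSLT from phase-1 predictions on S_A.
#   fslt = stable_time(preds_A == labels_A)
# Definition 2: SSFT from phase-2 predictions on S_A while training on S_B.
#   ssft = stable_time(preds_AB != labels_A)
```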