Characterizing Datapoints via Second-Split Forgetting
Pratyush Maini¹, Saurabh Garg¹, Zachary C. Lipton¹, J. Zico Kolter¹,²
¹Carnegie Mellon University, ²Bosch Center for AI
{pratyushmaini, zlipton}@cmu.edu; {sgarg2, zkolter}@cs.cmu.edu
Abstract
Researchers investigating example hardness have increasingly focused on the dynamics by which neural networks learn and forget examples throughout training. Popular metrics derived from these dynamics include (i) the epoch at which examples are first correctly classified; (ii) the number of times their predictions flip during training; and (iii) whether their prediction flips if they are held out. However, these metrics do not distinguish among examples that are hard for distinct reasons, such as membership in a rare subpopulation, being mislabeled, or belonging to a complex subpopulation. In this paper, we propose second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten as the network is fine-tuned on a randomly held-out partition of the data. Across multiple benchmark datasets and modalities, we demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly. By contrast, metrics that only consider first-split learning dynamics struggle to differentiate the two. At large learning rates, SSFT tends to be robust across architectures, optimizers, and random seeds. From a practical standpoint, SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes. Through theoretical analysis addressing overparameterized linear models, we provide insights into how the observed phenomena may arise.¹
1 Introduction
A growing literature has investigated metrics for characterizing the difficulty of training examples, driven by such diverse motivations as (i) deriving insights for how to reconcile the ability of deep neural networks to generalize [30] with their ability to memorize noise [15, 48]; (ii) identifying potentially mislabeled examples; and (iii) identifying notably challenging or rare sub-populations of examples. Some of these efforts have turned towards learning dynamics, with researchers noting that neural networks tend to learn cleanly labeled examples before mislabeled examples [17, 18, 33], and more generally tend to learn simpler patterns sooner, for several intuitive notions of simplicity [19, 35, 43]. Broadly, works in this area tend to characterize examples as belonging either to prototypical groups or memorized exceptions [7, 16, 25]. Adapting these intuitions to real datasets, Feldman [15] proposes rating the degree to which an example is memorized based on whether its predicted class flips when it is excluded from the training set. These and other works [8, 21, 35, 43, 47] have proposed many metrics for characterizing example difficulty, with Carlini et al. [7] comparing five such metrics. However, while many of these works distinguish some notion of easy versus hard samples, they seldom (i) offer tools for distinguishing among different types of hard examples; or (ii) explain theoretically why these metrics might be useful for distinguishing easy versus hard samples. Moreover, existing metrics tend to give similar scores to examples that are difficult for distinct reasons, e.g., membership in rare, complex, or mislabeled sub-populations.
¹Code for reproducing our experiments can be found at https://github.com/pratyushmaini/ssft.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
[Figure 1: examples plotted by learning time and forgetting time; labeled regions include Typical, Rare, Complex, and Mislabeled examples, as well as examples that are Never Forgotten.]
Figure 1: Overview of example separation offered by the unified view of learning and forgetting time.
In this paper, we propose to additionally consider a new metric, Second-Split Forgetting Time (SSFT),
calculated based on the forgetting dynamics that arise as training examples are forgotten when a
neural network continues to train on a second, randomly held out data partition. SSFT is defined as the
fine-tuning epoch after which a first-split training example is no longer classified correctly. We find
that SSFT identifies mislabeled examples remarkably well but does little to separate out under- versus
over-represented subpopulations. Conversely, metrics based on the (first-split) training dynamics
are more discriminative for separating these populations but less useful for detecting mislabeled
examples. We leverage the complementarity of first- and second-split metrics, showing that by jointly
visualizing the two, we can produce a richer characterization of the training examples.
In our experiments, we operationalize several notions of hard examples, namely: (i) mislabeled examples, for which the original label has been flipped to a randomly chosen incorrect label; (ii) rare examples, which belong to underrepresented subpopulations; and (iii) complex examples, which belong to subpopulations for which the classification task is more difficult (details in Section 3.2). We perform specific ablation studies with datasets complicated by just one type of hard example (Section 4.3), and show how SSFT can help to distinguish among these categories of examples. We observe that during second-split training, neural networks (i) first forget mislabeled examples from the first split; (ii) only slowly begin to forget rare examples (e.g., from underrepresented subpopulations) unique to the first training set; and (iii) do not forget complex examples.
This separation of hard example types has multiple practical applications. First, we can use the method to identify noisy labels: on CIFAR-10 with 10% added class noise, SSFT achieves 0.94 AUC for identifying mislabeled samples, while the first-split metrics range in AUC between 0.58 and 0.90. Second, the method can also help improve generalization in noisy data settings: while the removal of hard examples according to first-split learning time degrades the performance of the classifier, the removal of hard examples according to SSFT can actually improve generalization. This is especially beneficial when, e.g., training on synthetic data (produced by a generative model) or mislabeled data. Third, we show how SSFT can identify failure modes of machine learning models. For example, in a simplified task classifying between horses and airplanes in the CIFAR-10 dataset, we find that training examples containing horses with sky backgrounds and airplanes with green backgrounds are among the earliest forgotten, indicating that the model relies on the background as a spurious feature. Last, our metric is robust across multiple seeds, and the earliest forgotten examples are robust across architectures. Across multiple optimizers, SSFT distinguishes mislabeled samples, whereas first-split metrics appear more sensitive to the choice of optimizer.
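To make the noisy-label use case concrete, the following is a minimal sketch (not the paper's code) of how SSFT values could be scored against injected label noise with ROC AUC. The data here are synthetic placeholders; the only substantive convention, taken from the discussion above, is that a lower SSFT marks a more suspicious example.

```python
# Minimal sketch: using SSFT as a score for detecting mislabeled examples
# and measuring the separation with ROC AUC (synthetic placeholder data).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
is_mislabeled = rng.random(n) < 0.10             # 10% injected label noise

# Hypothetical SSFT values (epoch at which each example is forgotten during
# second-split training); clean examples are forgotten late or never.
ssft = np.where(is_mislabeled,
                rng.integers(1, 5, n),           # mislabeled: forgotten early
                rng.integers(20, 100, n))        # clean: forgotten late

# Mislabeled examples are forgotten EARLY, so the detection score is -SSFT.
auc = roc_auc_score(is_mislabeled, -ssft)
print(f"AUC for flagging mislabeled examples: {auc:.2f}")
```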
Finally, we investigate second-split dynamics theoretically, analyzing overparametrized linear models [46]. We introduce notions of mislabeled, rare, and complex examples appropriate to this toy model. Our analysis shows that mislabeled examples from the first split are forgotten quickly during second-split training, whereas rare examples are not. However, as we train for a long time, rare examples from the first split are eventually forgotten as the model converges to the minimum-norm solution on the second split, while predictions on complex examples remain accurate with high probability.
2 Related Work
Example Hardness. Several recent works quantify example hardness with various training-time metrics. Many of these metrics are based on first-split learning dynamics [8, 25, 27, 35, 43]. Other works have resorted to properties of deep networks such as compression ability [21] and prediction depth [5]. Carlini et al. [7] study metrics centered around model training such as confidence, ensemble agreement, adversarial robustness, holdout retraining, and accuracy under privacy-preserving training. Closest in spirit to the SSFT studied in our paper are the efforts in [7, 47]. Crucially, Carlini et al. [7] study the KL divergence of the prediction vector after fine-tuning on a held-out set at a low learning rate, and do not draw any direct inference about the separation offered by their metric. Focusing on (first-split) forgetting dynamics, Toneva et al. [47] defined a metric based on the number of forgetting events during training and identified sets of unforgettable examples that are never misclassified once learned. In our work, we find complementary benefits of analysis based on first- and second-split dynamics.
Memorization of Data Points. In order to capture the memorization ability of deep networks, their ability to memorize noise (or randomly labeled samples) has been studied in recent work [3, 48]. As opposed to the memorization of rare examples, the memorization of noisy samples hurts generalization and makes the classifier boundary more complex [15]. On the contrary, a recent line of work has argued that memorization of (atypical) data points is important for achieving optimal generalization performance when data is sampled from long-tailed distributions [6, 11, 15].
Simplicity Bias. Another line of work argues that neural networks have a bias toward learning simple features [43], and often do not learn complex features even when the complex feature is more predictive of the true label than the simple features. This suggests that models end up memorizing (through noise) the few samples in the dataset that contain the complex feature alone, and utilize the simple feature for correctly predicting the other training examples [1, 32].
Label Noise. Large-scale machine learning datasets are typically labeled with the help of human labelers [12] to facilitate supervised learning. It has been shown that a significant fraction of these labels are erroneous in common machine learning datasets [39]. Learning under noisy labels is a long-studied problem [2, 26, 31, 37]. Various recent methods have also attempted to identify label noise [10, 23, 38, 40]. While the focus of our work is not to propose a new method in this long line of work, we show that the view of forgetting time naturally distills out examples with noisy labels. Future work may benefit by augmenting our metric with SOTA methods for label noise identification.
3 Method
The primary goal of our work is to characterize the hardness of different datapoints in a given dataset. Suppose we have a dataset $S_A = \{x_i, y_i\}_{i=1}^{n}$ such that $(x_i, y_i) \sim \mathcal{D}$. For the purpose of characterization, we augment each datapoint $(x_i, y_i) \in S_A$ with parameters $(\mathrm{fslt}_i, \mathrm{ssft}_i)$, where $\mathrm{fslt}_i$ quantifies the first-split learning time (FSLT) and $\mathrm{ssft}_i$ quantifies the second-split forgetting time (SSFT) of the sample. To obtain these parameters, we next describe our proposed procedure.
Procedure. We train a model $f$ on $S$ to minimize the empirical risk $\mathcal{L}(S; f) = \sum_i \ell(f(x_i), y_i)$. We use $f_A$ to denote a model $f$ (initialized with random weights) trained on $S_A$ until convergence (100% accuracy on $S_A$). We then train a model initialized with $f_A$ on a held-out split $S_B \sim \mathcal{D}^n$ until convergence. We denote this model by $f_{AB}$. To obtain the parameters $(\mathrm{fslt}_i, \mathrm{ssft}_i)$, we track the per-example predictions $\hat{y}^{t}_i$ at the end of every epoch $t$ of training. Unless specified otherwise, we train the model with cross-entropy loss using Stochastic Gradient Descent (SGD).
Definition 1 (First-Split Learning Time). For $\{x_i, y_i\} \in S_A$, the learning time is defined as the earliest epoch during the training of a classifier $f$ on $S_A$ after which the example is always classified correctly, i.e.,
$$\mathrm{fslt}_i = \arg\min_{t^*} \left( \hat{y}^{t}_{i,(A)} = y_i \;\; \forall\, t \geq t^* \right) \quad \forall \{x_i, y_i\} \in S_A. \tag{1}$$
Definition 2 (Second-Split Forgetting Time). Let $\hat{y}^{t}_{i,(AB)}$ denote the prediction on sample $\{x_i, y_i\} \in S_A$ after training $f_{(AB)}$ for $t$ epochs on $S_B$. Then, for $\{x_i, y_i\} \in S_A$, the forgetting time is defined as the earliest epoch after which the example is never classified correctly, i.e.,
$$\mathrm{ssft}_i = \arg\min_{t^*} \left( \hat{y}^{t}_{i,(AB)} \neq y_i \;\; \forall\, t \geq t^* \right) \quad \forall \{x_i, y_i\} \in S_A. \tag{2}$$
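To make Definitions 1 and 2 operational, the following is a minimal sketch (not the authors' released implementation) that recovers $\mathrm{fslt}_i$ and $\mathrm{ssft}_i$ from logged per-epoch correctness matrices. The logging format and the infinity sentinel for examples that are never learned or never forgotten are assumptions of this sketch.

```python
# Minimal sketch: computing FSLT and SSFT from per-epoch prediction logs.
# correct_A[t, i] is 1 if example i of S_A is classified correctly at the end
# of epoch t while training on S_A; correct_B[t, i] is the same quantity while
# fine-tuning on S_B. NEVER is a sentinel meaning "no such epoch exists".
import numpy as np

NEVER = np.inf

def first_split_learning_time(correct_A: np.ndarray) -> np.ndarray:
    """Earliest epoch after which an example is ALWAYS classified correctly."""
    T, n = correct_A.shape
    fslt = np.full(n, NEVER)
    for i in range(n):
        # ok_from_t[t] is 1 iff the example is correct at every epoch >= t.
        ok_from_t = np.flip(np.cumprod(np.flip(correct_A[:, i].astype(bool))))
        hits = np.nonzero(ok_from_t)[0]
        if len(hits) > 0:
            fslt[i] = hits[0]          # 0-indexed epoch
    return fslt

def second_split_forgetting_time(correct_B: np.ndarray) -> np.ndarray:
    """Earliest second-split epoch after which an example is NEVER correct."""
    wrong_B = ~correct_B.astype(bool)
    return first_split_learning_time(wrong_B)   # same suffix logic on "wrong"

# Toy example with 3 examples over 4 epochs:
correct_A = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [1, 1, 1]])
correct_B = np.array([[1, 0, 1], [1, 0, 1], [1, 0, 1], [1, 0, 0]])
print(first_split_learning_time(correct_A))    # [1. 0. 2.]
print(second_split_forgetting_time(correct_B)) # [inf 0. 3.]
```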
[Figure 2(a): accuracy vs. number of training epochs for Mislabeled, Rare, Complex, and Typical examples, shown separately for first-split and second-split training. Figure 2(b): MNIST examples arranged by first-split learning time and second-split forgetting time.]
Figure 2: Rate of learning and forgetting of examples for different groups in the synthetic dataset. While first-split training is not able to distinguish between rare and complex examples, second-split training succeeds in distinguishing them. Additionally, second-split training separates mislabeled examples from the rest better than first-split training does. (b) Visualization of first-split learning and second-split forgetting times when training a LeNet model on the MNIST dataset.
3.1 Baseline Methods
We provide a brief description of metrics for example hardness considered in recent comparisons [25].
Number of Forgetting Events ($n_f$). An example $(x_i, y_i) \in S$ undergoes a forgetting event when the accuracy on the example decreases between two consecutive updates. Toneva et al. [47] analyzed the total number of such events $n_f$ during the training of a neural network to identify hard examples.
Cumulative Learning Accuracy ($\mathrm{acc}_l$). Jiang et al. [25] suggest that rather than using the learning time (Definition 1), using the number of epochs during training in which a machine learning model correctly classifies a given sample is a more stable metric for predicting example hardness.
Cumulative Learning Confidence ($\mathrm{conf}_l$). Similar to $\mathrm{acc}_l$, $\mathrm{conf}_l$ measures the cumulative softmax confidence of the model towards the correct class over the course of training.
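For reference, these three first-split statistics can be computed from the same kind of per-epoch logs used above. The sketch below is a generic illustration under that assumed logging format, not the implementations used in [25, 47] or in our experiments.

```python
# Sketch: first-split baseline metrics from per-epoch logs.
# correct[t, i]: 1 if example i is correct at the end of epoch t (first split).
# conf[t, i]: softmax probability assigned to the TRUE class of example i.
import numpy as np

def num_forgetting_events(correct: np.ndarray) -> np.ndarray:
    """n_f: number of correct -> incorrect transitions across epochs."""
    c = correct.astype(int)
    drops = (np.diff(c, axis=0) == -1)       # 1 -> 0 transitions between epochs
    return drops.sum(axis=0)

def cumulative_learning_accuracy(correct: np.ndarray) -> np.ndarray:
    """acc_l: number of epochs in which the example is classified correctly."""
    return correct.astype(int).sum(axis=0)

def cumulative_learning_confidence(conf: np.ndarray) -> np.ndarray:
    """conf_l: summed softmax confidence toward the correct class."""
    return conf.sum(axis=0)

correct = np.array([[0, 1], [1, 0], [1, 1], [0, 1]])
conf = np.array([[0.2, 0.6], [0.7, 0.3], [0.8, 0.9], [0.4, 0.95]])
print(num_forgetting_events(correct))         # [1 1]
print(cumulative_learning_accuracy(correct))  # [2 3]
print(cumulative_learning_confidence(conf))   # [2.1  2.75]
```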
3.2 Example Characterization
We characterize example hardness via three sources of learning difficulty. (i) Mislabeled Examples: we refer to mislabeled examples as those datapoints whose label has been flipped to an incorrect label uniformly at random. (ii) Rare Examples: we assume that rare examples belong to sub-populations of the original distribution that have a low probability of occurrence; in particular, there exist $O(1)$ examples from such sub-populations in a given dataset. In Section 4.3 we describe how we operationalize this notion in the case of the CIFAR-100 dataset. (iii) Complex Examples: these constitute samples drawn from sub-groups in the dataset that require either (1) a hypothesis class of high complexity, or (2) higher sample complexity, to be learnt relative to examples from the rest of the dataset. We leave the definition of complex samples mathematically imprecise, but with the same intuitive sense as in prior work [3, 43]. For instance, in a dataset composed of the union of MNIST and CIFAR-10 images, we would consider the subpopulation of CIFAR-10 images to be more complex.
4 Empirical Investigation of First- and Second-Split Training Dynamics
4.1 Experimental Setup
Datasets. We show results on a variety of image classification datasets: MNIST [13], CIFAR-10 [29], and Imagenette [22]. For experiments in the language domain, we use the SST-2 dataset [45]. For each of the datasets, we split the training set into two equal partitions $(S_A, S_B)$. For experiments with mislabeled examples, we simulate mislabeled examples by randomly selecting a subset of 10% of the examples from both partitions and changing their labels to an incorrect class.
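The label-noise protocol above can be sketched as follows. This is a generic illustration rather than the paper's released code; the function name, seed handling, and the placeholder label vector are assumptions of this sketch.

```python
# Sketch: injecting 10% symmetric label noise into a classification dataset.
import numpy as np

def inject_label_noise(labels: np.ndarray, num_classes: int,
                       noise_frac: float = 0.10, seed: int = 0):
    """Return noisy labels and the indices whose labels were flipped."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    n = len(labels)
    flip_idx = rng.choice(n, size=int(noise_frac * n), replace=False)
    for i in flip_idx:
        # choose an incorrect class uniformly at random
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels, flip_idx

clean = np.random.default_rng(1).integers(0, 10, size=50000)  # CIFAR-10-sized label vector
noisy, flipped = inject_label_noise(clean, num_classes=10)
print((noisy != clean).mean())                                # ~0.10
```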
Sentences in the SST-2 dataset with smallest forgetting time | Label
The director explores all three sides of his story with a sensitivity and an inquisitiveness reminiscent of Truffaut | Neg
Beneath the film's obvious determination to shock at any cost lies considerable skill and determination, backed by sheer nerve | Neg
This is a fragmented film, once a good idea that was followed by the bad idea to turn it into a movie | Pos
The holiday message of the 37-minute Santa vs. the Snowman leaves a lot to be desired. | Pos
Epps has neither the charisma nor the natural affability that has made Tucker a star | Pos
The bottom line is the piece works brilliantly | Neg
Alternative medicine obviously has its merits ... but Ayurveda does the field no favors | Pos
What could have easily become a cold, calculated exercise in postmodern pastiche winds up a powerful and deeply moving example of melodramatic moviemaking | Neg
Lacks depth | Pos
Certain to be distasteful to children and adults alike, Eight Crazy Nights is a total misfire | Pos
Table 1: First-split sentences that were forgotten by the 3rd epoch of second-split training of a BERT-base model on the SST-2 dataset. Notice that all of these forgotten examples are mislabeled.
Training Details. Unless otherwise specified, we train a ResNet-9 model [4] using the SGD optimizer with weight decay 5e-4 and momentum 0.9. We use a cyclic learning rate schedule [44] with a peak learning rate of 0.1 at the 10th epoch. We train for a maximum of 100 epochs or until we have 5 epochs of 100% training accuracy. We first train on $S_A$, and then, using the pre-initialized weights from stage 1, train on $S_B$ with the same learning parameters. All experiments can be performed on a single RTX 2080 Ti. Complete hyperparameter details are available in Appendix B.1.
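The two-stage protocol can be summarized by the PyTorch-style sketch below. It is a simplified illustration, not the released training script: the data loaders are assumed to yield (input, label, index) triples so that per-example predictions on $S_A$ can be logged after every epoch, and the optimizer settings only loosely mirror the configuration above.

```python
# Sketch of the two-stage protocol: train on S_A, then fine-tune on S_B while
# logging per-example correctness on S_A at the end of every epoch.
import torch
import torch.nn.functional as F

def run_stage(model, train_loader, eval_loader, epochs, lr=0.1, device="cuda"):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    correctness_log = []                      # one dict of per-example correctness per epoch
    for _ in range(epochs):
        model.train()
        for x, y, _ in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
        # log correctness of every S_A example at the end of this epoch
        model.eval()
        row = {}
        with torch.no_grad():
            for x, y, idx in eval_loader:     # eval_loader iterates over S_A
                pred = model(x.to(device)).argmax(dim=1).cpu()
                for i, p, t in zip(idx.tolist(), pred.tolist(), y.tolist()):
                    row[i] = int(p == t)
        correctness_log.append(row)
    return correctness_log

# Stage 1: random init, train on S_A; FSLT is read off this log.
# log_A = run_stage(model, loader_A, eval_loader_A, epochs=100)
# Stage 2: continue from the stage-1 weights, train on S_B; SSFT is read off this log.
# log_B = run_stage(model, loader_B, eval_loader_A, epochs=100)
```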
4.2 Learning-Forgetting Spectrum for various datasets
Synthetic Dataset. We consider data $(x, y)$ sampled from a mixture of multiple distributions $\mathcal{D}_g$, with $x \in \mathbb{R}^d$. $\mathcal{D}_g$ denotes the $g$-th group and has a sampling frequency $\pi_g$. Each group $\mathcal{D}_g$ is supported on $(\mathcal{X}_g, \{y_g\})$, i.e., the true label for all samples drawn from a given group is the same, and the examples in each group are non-overlapping. Each group is parametrized by a set of $k \ll d$ unique indices $\mathcal{I}_g \subset [d]$ such that $\mathcal{I}_i \cap \mathcal{I}_j = \emptyset$ for $i \neq j$. The discriminative characteristic of each group is the vector $u_g$ such that $[u_g]_i = 1$ if $i \in \mathcal{I}_g$ and $0$ otherwise, for all $i \in [d]$. Then, for any sample $(x, y) \in S$:
$$P(x \in \mathcal{X}_g) = \pi_g; \qquad x \mid \mathcal{X}_g \sim \mathcal{N}(0, \sigma^2 I_d) + \mu_g.$$
For our simulation, we consider a 10-class classification problem, with $\mu_g = 5$ for typical groups (a higher signal-to-noise ratio) and $\mu_g = 4$ for complex groups. For any sample drawn from a rare group, there are only $O(1)$ samples from that group in the entire dataset $(S_A \cup S_B)$. Mislabeled samples are generated only from the majority typical groups. In Figure 2a, we show the rate of learning and forgetting of examples from each of these categories. We note that in second-split training, the mislabeled examples are quickly forgotten and the complex examples are never forgotten; the rare examples are forgotten slowly. In Section 5 we theoretically justify the observations in the synthetic dataset and show that the rare examples are expected to be forgotten as we train for an infinite time. A sketch of this data-generating process appears below.
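The sketch below instantiates one plausible version of this generative process. The constants ($d$, $k$, $\sigma$, the number of groups, and the uniform group frequencies) are illustrative assumptions rather than the paper's exact configuration, and the sketch additionally assumes that $\mu_g$ is the indicator vector $u_g$ scaled by the group's mean value.

```python
# Sketch: sampling from the synthetic mixture of groups described above.
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma = 200, 5, 1.0                      # illustrative dimension / noise scale
num_groups, num_classes = 20, 10

# Disjoint index sets I_g and indicator vectors u_g.
perm = rng.permutation(d)
index_sets = [perm[g * k:(g + 1) * k] for g in range(num_groups)]
u = np.zeros((num_groups, d))
for g, idx in enumerate(index_sets):
    u[g, idx] = 1.0

labels_per_group = rng.integers(0, num_classes, num_groups)   # fixed label y_g per group
mean_scale = np.where(np.arange(num_groups) < 15, 5.0, 4.0)   # typical (5) vs. complex (4)
pi = np.full(num_groups, 1.0 / num_groups)                    # sampling frequencies pi_g
# Rare groups would be modeled by assigning tiny pi_g values so only O(1) samples appear.

def sample(n):
    g = rng.choice(num_groups, size=n, p=pi)                  # pick a group per example
    x = rng.normal(0.0, sigma, size=(n, d)) + mean_scale[g, None] * u[g]
    y = labels_per_group[g]
    return x, y, g

X, y, groups = sample(5000)
print(X.shape, y.shape)                        # (5000, 200) (5000,)
```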
Image Domain. In Figure 2b, we show representative examples in the four quadrants of the learning-forgetting spectrum. More specifically, we find that the examples forgotten fastest and learned last are mislabeled, and the ones learned early and never forgotten once learned are characteristic simple examples of the MNIST dataset. Examples in the first and third quadrants are seemingly atypical and ambiguous, respectively. Similar visualizations for other image datasets can be found in Appendix B.2.
Other Modalities. The forgetting and learning dynamics occur broadly across modalities apart from images. We repeat the same problem setup on the SST-2 [45] dataset for sentiment classification. We fine-tune a pre-trained BERT-base model [14] successively on two disjoint splits of the dataset. In Table 1, we provide a list of the earliest forgotten samples when we train a BERT model on the second split of the SST-2 dataset. The results suggest that SSFT is able to identify mislabeled samples.
4.3 Ablation Experiments
We design specific experimental setups to capture the three notions of hardness as defined in Section 3.