2 Related Work
Example Hardness.
Several recent works quantify example hardness with various training-time metrics. Many of these metrics are based on first-split learning dynamics [8, 25, 27, 35, 43]. Other works have resorted to properties of deep networks such as compression ability [21] and prediction depth [5]. Carlini et al. [7] study metrics centered around model training such as confidence, ensemble agreement, adversarial robustness, holdout retraining, and accuracy under privacy-preserving training.
Closest in spirit to the SSFT studied in our paper are the efforts in [7, 47]. Crucially, Carlini et al. [7] study the KL divergence of the prediction vector after fine-tuning on a held-out set at a low learning rate, and do not draw any direct inference about the separation offered by their metric. Focusing on (first-split) forgetting dynamics, Toneva et al. [47] defined a metric based on the number of forgetting events during training and identified sets of unforgettable examples that are never misclassified once learned. In our work, we find complementary benefits of analysis based on first- and second-split dynamics.
Memorization of Data Points.
To capture the memorization ability of deep networks, recent work has studied their ability to memorize noise (or randomly labeled samples) [3, 48]. As opposed to the memorization of rare examples, the memorization of noisy samples hurts generalization and makes the classifier boundary more complex [15]. Conversely, a recent line of work argues that memorization of (atypical) data points is important for achieving optimal generalization performance when data is sampled from long-tailed distributions [6, 11, 15].
Simplicity Bias.
Another line of work argues that neural networks have a bias toward learning simple features [43], and often do not learn complex features even when the complex feature is more predictive of the true label than the simple one. This suggests that models end up memorizing (as they would noise) the few samples in the dataset that contain only the complex feature, and utilize the simple feature to correctly predict the other training examples [1, 32].
Label Noise.
Large-scale machine learning datasets are typically labeled with the help of human labelers [12] to facilitate supervised learning. A significant fraction of the labels in common machine learning datasets has been shown to be erroneous [39]. Learning under noisy labels is a long-studied problem [2, 26, 31, 37], and various recent methods have attempted to identify label noise [10, 23, 38, 40]. While the focus of our work is not to propose a new method in this long line of work, we show that the lens of forgetting time naturally distills out examples with noisy labels. Future work may benefit from augmenting our metric with state-of-the-art methods for label noise identification.
3 Method
The primary goal of our work is to characterize the hardness of different datapoints in a given dataset. Suppose we have a dataset $S_A = \{x_i, y_i\}_{i=1}^{n}$ such that $(x_i, y_i) \sim \mathcal{D}$. For the purpose of characterization, we augment each datapoint $(x_i, y_i) \in S_A$ with parameters $(\mathrm{fslt}_i, \mathrm{ssft}_i)$, where $\mathrm{fslt}_i$ quantifies the first-split learning time (FSLT) and $\mathrm{ssft}_i$ quantifies the second-split forgetting time (SSFT) of the sample. We next describe our proposed procedure for obtaining these parameters.
Procedure. We train a model $f$ on $S$ to minimize the empirical risk $\mathcal{L}(S; f) = \sum_i \ell(f(x_i), y_i)$. We use $f_A$ to denote a model $f$ (initialized with random weights) trained on $S_A$ until convergence (100% accuracy on $S_A$). We then train a model initialized with $f_A$ on a held-out split $S_B \sim \mathcal{D}^n$ until convergence; we denote this model by $f_{A \to B}$. To obtain the parameters $(\mathrm{fslt}_i, \mathrm{ssft}_i)$, we track the per-example predictions $\hat{y}^t_i$ at the end of every epoch $t$ of training. Unless specified otherwise, we train the model with cross-entropy loss using Stochastic Gradient Descent (SGD).
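To make this concrete, below is a minimal PyTorch-style sketch of the two-phase procedure with per-epoch prediction tracking. It is an illustration under the setup above, not our exact implementation; the helper `train_and_track` and the loader/optimizer names are hypothetical, and `track_loader` must iterate $S_A$ in a fixed order (no shuffling) so that column $i$ always refers to example $i$.

```python
import torch
import torch.nn.functional as F

def train_and_track(model, train_loader, track_loader, optimizer, num_epochs, device="cpu"):
    """Train `model` on `train_loader` with cross-entropy + SGD and, at the
    end of every epoch t, record the predicted label of each example in
    `track_loader`. Returns a (num_epochs, n) tensor of predictions."""
    history = []
    for epoch in range(num_epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)  # empirical risk L(S; f)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():  # per-example predictions \hat{y}^t_i at epoch t
            preds = [model(x.to(device)).argmax(dim=1).cpu() for x, _ in track_loader]
        history.append(torch.cat(preds))
    return torch.stack(history)

# Phase 1 (f_A): train from random initialization on S_A, tracking S_A itself.
#   preds_A = train_and_track(model, loader_A, loader_A_eval, opt, epochs_A)
# Phase 2 (f_{A->B}): continue from f_A on the held-out split S_B, still
# tracking the examples of S_A to observe when each one is forgotten.
#   preds_AB = train_and_track(model, loader_B, loader_A_eval, opt, epochs_B)
```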
Definition 1 (First-Split Learning Time). For $\{x_i, y_i\} \in S_A$, the learning time is defined as the earliest epoch during the training of a classifier $f$ on $S_A$ after which the sample is always classified correctly, i.e.,
$$\mathrm{fslt}_i = \arg\min_{t^*} \left( \hat{y}^t_{i,(A)} = y_i \;\; \forall\, t \ge t^* \right) \qquad \forall\, \{x_i, y_i\} \in S_A. \tag{1}$$
Definition 2 (Second-Split Forgetting Time). Let $\hat{y}^t_{i,(A \to B)}$ denote the prediction for sample $\{x_i, y_i\} \in S_A$ after training $f_{(A \to B)}$ for $t$ epochs on $S_B$. Then, for $\{x_i, y_i\} \in S_A$, the forgetting time is defined as the earliest epoch after which the sample is never classified correctly, i.e.,
$$\mathrm{ssft}_i = \arg\min_{t^*} \left( \hat{y}^t_{i,(A \to B)} \neq y_i \;\; \forall\, t \ge t^* \right) \qquad \forall\, \{x_i, y_i\} \in S_A. \tag{2}$$
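Given the per-epoch prediction matrices from the sketch above, both definitions reduce to finding the first epoch after which a boolean condition holds forever. Below is a minimal sketch of this computation; the helper `stable_time` is hypothetical, epochs are 0-indexed, and examples for which the condition never stabilizes (e.g., samples that are never forgotten) are assigned the total epoch count as an effectively infinite time, which is a simplifying assumption on our part.

```python
import torch

def stable_time(hits: torch.Tensor) -> torch.Tensor:
    """For a boolean (num_epochs, n) matrix `hits`, return for each example i
    the earliest epoch t* such that hits[t, i] holds for all t >= t*."""
    num_epochs = hits.shape[0]
    # suffix[t, i] is True iff hits[t', i] holds for every t' >= t:
    # reverse along the epoch axis, take a running minimum, reverse back.
    suffix = torch.flip(torch.cummin(torch.flip(hits.int(), [0]), dim=0).values, [0]).bool()
    t_star = suffix.int().argmax(dim=0)      # first True per column
    t_star[~suffix.any(dim=0)] = num_epochs  # condition never stabilizes
    return t_star

# Definition 1: FSLT from phase-1 predictions on S_A.
#   fslt = stable_time(preds_A == labels_A)
# Definition 2: SSFT from phase-2 predictions on S_A while training on S_B.
#   ssft = stable_time(preds_AB != labels_A)
```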