THE DYNAMIC OF CONSENSUS IN DEEP NETWORKS
AND THE IDENTIFICATION OF NOISY LABELS
Daniel Shwartz, Uri Stern & Daphna Weinshall
School of Computer Science and Engineering
Hebrew University of Jerusalem
Jerusalem, Israel
{Daniel.Shwartz1,Uri.Stern,daphna}@mail.huji.ac.il
ABSTRACT
Deep neural networks have incredible capacity and expressibility, and can seem-
ingly memorize any training set. This introduces a problem when training in the
presence of noisy labels, as the noisy examples cannot be distinguished from clean
examples by the end of training. Recent research has dealt with this challenge
by utilizing the fact that deep networks seem to memorize clean examples much
earlier than noisy examples. Here we report a new empirical result: for each
example, when looking at the time it has been memorized by each model in an
ensemble of networks, the diversity seen in noisy examples is much larger than
the clean examples. We use this observation to develop a new method for noisy
labels filtration. The method is based on a statistics of the data, which captures
the differences in ensemble learning dynamics between clean and noisy data. We
test our method on three tasks: (i) noise amount estimation; (ii) noise filtration;
(iii) supervised classification. We show that our method improves over existing
baselines in all three tasks using a variety of datasets, noise models, and noise
levels. Aside from its improved performance, our method has two other advantages.
(i) Simplicity, which implies that no additional hyperparameters are introduced.
(ii) Our method is modular: it does not work in an end-to-end fashion, and can
therefore be used to clean a dataset for any other future usage.
1 INTRODUCTION
Deep neural networks dominate the state of the art in an ever increasing list of application domains,
but for the most part, this incredible success relies on very large datasets of annotated examples
available for training. Unfortunately, large amounts of high-quality annotated data are hard and
expensive to acquire, whereas cheap alternatives (obtained by way of crowd-sourcing or automatic
labeling, for example) often introduce noisy labels into the training set. By now there is much
empirical evidence that neural networks can memorize almost every training set, including ones with
noisy and even random labels (Zhang et al., 2017), which in turn increases the generalization error of
the model. As a result, the problems of identifying the existence of label noise and the separation of
noisy labels from clean ones, are becoming more urgent and therefore attract increasing attention.
Henceforth, we will call the set of examples in the training data whose labels are correct "clean data",
and the set of examples whose labels are incorrect "noisy data". While all labels can be eventually
learned by deep models, it has been empirically shown that most noisy datapoints are learned by
deep models late, after most of the clean data has already been learned (Arpit et al., 2017). Therefore,
many methods focus on the learning time of an example in order to classify it as noisy or clean, by
looking at its loss (Pleiss et al.,2020;Arazo et al.,2019) or loss per epoch (Li et al.,2020) in a single
model. However, these methods struggle to correctly classify clean and noisy datapoints that are
learned at the same time, or worse, noisy datapoints that are learned early. Additionally, many of
these methods work in an end-to-end manner, and thus neither provide noise level estimation nor do
they deliver separate sets of clean and noisy data for novel future usages.
Our first contribution is a new empirical result regarding the learning dynamics of an ensemble of
deep networks, showing that the dynamics is different when training with clean data vs. noisy data.
The dynamics of clean data has been studied in (Hacohen et al., 2020; Pliushch et al., 2021), where it
is reported that different deep models learn examples in the same order and pace. This means that
when training a few models and comparing their predictions, an (approximately) binary outcome
is seen at each epoch $e$: either all the networks correctly predict the example's label, or none of
them does. This further implies that for the most part, the distribution of predictions across points
is bimodal. Additionally, a variety of studies showed that the bias and variance of deep networks
decrease as network complexity grows (Nakkiran et al., 2021; Neal et al., 2018), providing
additional evidence that different deep networks learn the data at the same time.
Figure 1: With noisy labels, models show higher disagreement. The noisy examples are not only learned at a later stage, but each model learns each example at its own time.
In Section 3 we describe a new empirical result: when training an ensemble of deep models with noisy data, and in contrast to what happens when using clean data, different models learn different datapoints at different times (see Fig. 1). This empirical finding tells us that in an ensemble of networks, the learning dynamics of clean data and noisy data can be distinguished. When training such an ensemble with a mixture of clean and noisy data, the emerging dynamics reflects this observation, as well as the tendency of clean data to be learned faster, as previously observed.
In our second contribution, we use this result to develop a new algorithm for noise level estimation
and noise filtration, which we call DisagreeNet (see Section 4). Importantly, unlike most alternative
methods, our algorithm is simple (it does not introduce any new hyperparameters), parallelizable,
easy to integrate with any supervised or semi-supervised learning method and any loss function,
and does not rely on prior knowledge of the noise amount. When used for noise filtration, our
empirical study (see Section 5) shows the superiority of DisagreeNet as compared to the state of
the art, using different datasets, different noise models and different noise levels. When used for
supervised classification by way of pre-processing the training set prior to training a deep model, it
provides a significant boost in performance, more so than alternative methods.
Relation to prior art
Work on the dynamics of learning in deep models has received increased attention in recent years (e.g.,
Nguyen et al.,2020;Hacohen et al.,2020;Baldock et al.,2021). Our work adds a new observation
to this body of knowledge, which is seemingly unique to ensembles of deep models (as opposed to
ensembles of other commonly used classifiers). Thus, while there exist other methods that use
ensembles to handle label noise (e.g., Sabzevari et al.,2018;Feng et al.,2020;Chai et al.,2021;
de Moura et al.,2018), for the most part they cannot take advantage of this characteristic of deep
models, and as a result are forced to use additional knowledge, typically the availability of a clean
validation set and/or prior knowledge of the noise amount.
Work on deep learning with noisy labels (see Song et al. (2022) for a recent survey) can be coarsely
divided into two categories: general methods that use a modified loss or network architecture, and
methods that focus on noise identification. The first group includes methods that aim to estimate the
underlying noise transition matrix (Goldberger and Ben-Reuven,2016;Patrini et al.,2017), employ
a noise-robust loss (Ghosh et al.,2017;Zhang and Sabuncu,2018;Wang et al.,2019;Xu et al.,
2019), or achieve robustness to noise by way of regularization (Tanno et al.,2019;Jenni and Favaro,
2018). Methods in the second group, which is more in line with our approach, focus more directly
on noise identification. Some methods assume that clean examples are usually learned faster than
noisy examples (e.g. Liu et al.,2020). Others (Arazo et al.,2019;Li et al.,2020) generate soft labels
by interpolating the given labels and the model’s predictions during training. Yet other methods
(Jiang et al.,2018;Han et al.,2018;Malach and Shalev-Shwartz,2017;Yu et al.,2019;Lee and
Chung,2019), like our own, inspect an ensemble of networks, usually in order to transfer information
between networks and thus avoid agreement bias.
Notably, we also analyze the behavior of ensembles in order to identify the noisy examples, resembling
(Pleiss et al.,2020;Nguyen et al.,2019;Lee and Chung,2019). But unlike these methods, which track
the loss of the networks, we track the dynamics of the agreement between multiple networks over
epochs. We then show that this statistic is more effective and achieves superior results. Additionally
(and not less importantly), unlike these works, we do not assume prior knowledge of the noise amount
or the presence of a clean validation set, and do not introduce new hyper-parameters in our algorithm.
Recently, the emphasis has somewhat shifted to the use of semi-supervised learning and contrastive
learning (Li et al.,2020;Liu et al.,2020;Ortego et al.,2020;Wei et al.,2020;Yao et al.,2021;
Zheltonozhskii et al.,2022;Li et al.,2022;Karim et al.,2022). Semi-supervised learning is an
effective paradigm for the prediction of missing labels. This paradigm is especially useful when the
identification of noisy points cannot be done reliably, in which case it is advantageous to remove
labels whose likelihood to be true is not negligible. The effectiveness of semi-supervised learning in
providing reliable pseudo-labels for unlabeled points will compensate for the loss of clean labels.
However, semi-supervised learning is not universally practical as it often relies on the extraction of
effective representations based on unsupervised learning tasks, which typically introduces implicit
priors (e.g., that contrastive loss is appropriate). In contrast, our goal is to reliably identify noisy
points, to be subsequently removed. Thus, our method can be easily incorporated into any SOTA
method which uses supervised or semi-supervised learning (with or without contrastive learning),
and may provide benefit even when semi-supervised learning is not viable.
2 INTER-NETWORK AGREEMENT: DEFINITION AND SCORES
Measuring the similarity between deep models is not a trivial challenge, as modern deep neural
networks are complex functions defined by a huge number of parameters, which are invariant to
transformations hidden in the model’s architecture. Here we measure the similarity between deep
models in an ensemble by measuring inter-model prediction agreement at each datapoint. Accordingly,
in Section 2.2 we describe scores that are based on the state of the networks at each epoch $e$, while in
Section 2.3 we describe cumulative scores that integrate these states through many epochs. Practically
(see Section 4), our proposed method relies on the cumulative scores, which are shown empirically to
provide more accurate results in the noise filtration task. These scores promise added robustness, as it
is no longer necessary to identify the epoch at which the score is to be evaluated.
2.1 PRELIMINARIES
Notations    Let $f^e : \mathbb{R}^d \rightarrow [0,1]^{|C|}$ denote a deep model, trained with Stochastic Gradient Descent (SGD) for $e$ epochs on training set $X = \{(x_i, y_i)\}_{i=1}^{M}$, where $x_i \in \mathbb{R}^d$ denotes a single example and $y_i \in [C]$ its corresponding label. Let $\mathcal{F}^e(X) = \{f_1^e, \ldots, f_N^e\}$ denote an ensemble of $N$ such models, where each model $f_i^e$, $i \in [N]$, is initialized and trained independently on $X$.
Noise model    We analyze the training dynamics of an ensemble of models in the presence of label noise. Label noise is different from data noise (like image distortion or additive Gaussian noise). Here it is assumed that after the training set $X = \{(x_i, l_i)\}_{i=1}^{M}$ is sampled, the labels $\{l_i\}$ are corrupted by some noise function $g : [C] \rightarrow [C]$, and the training set becomes $X = \{(x_i, y_i)\}_{i=1}^{M}$, $y_i = g(l_i)$. The two most common models of label noise are termed symmetric noise and asymmetric noise (Patrini et al., 2017). In both cases it is assumed that some fixed percentage of the labels is corrupted by $g(l)$. With symmetric noise, $g(l)$ assigns any new label from the set $[C] \setminus \{l\}$ with equal probability. With asymmetric noise, $g(l)$ is a deterministic permutation function (see App. F for details). Note that the asymmetric noise model is considered much harder than the symmetric noise model.
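To make the two noise models concrete, below is a minimal NumPy sketch of such a corruption function $g$. The helper name corrupt_labels is ours, and the cyclic shift used in the asymmetric branch is an illustrative assumption; the exact permutation used in the paper is specified in its App. F.

```python
import numpy as np

def corrupt_labels(labels, num_classes, noise_rate, asymmetric=False, seed=0):
    """Corrupt a fixed fraction of labels under the symmetric or asymmetric
    noise model. Returns the corrupted labels and the indices of the
    examples whose labels were changed (the 'noisy data')."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    noisy_idx = rng.choice(len(labels), size=int(noise_rate * len(labels)),
                           replace=False)
    if asymmetric:
        # Deterministic permutation g(l); here a cyclic shift, which is an
        # assumption on our part (the paper's permutation is in its App. F).
        labels[noisy_idx] = (labels[noisy_idx] + 1) % num_classes
    else:
        # Symmetric noise: a uniform draw from [C] \ {l}, implemented by
        # shifting each corrupted label by a random non-zero offset.
        offsets = rng.integers(1, num_classes, size=len(noisy_idx))
        labels[noisy_idx] = (labels[noisy_idx] + offsets) % num_classes
    return labels, noisy_idx
```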
2.2 PER-EPOCH AGREEMENT SCORE
Following Hacohen et al. (2020), we define the True Positive Agreement (TPA) score of ensemble $\mathcal{F}^e(X)$ at each datapoint $(x, y)$ as
$$TPA(x, y; \mathcal{F}^e(X)) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[f_i^e(x) = y].$$
The TPA score measures the average accuracy of the models in the ensemble, when seeing $x$, after each model has been trained for exactly $e$ epochs on $X$. Note that $TPA$ measures the average accuracy of multiple models on one example, as opposed to the generalization error, which measures the average error of one model on multiple examples.
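For reference, once the hard predictions of the ensemble are recorded, the TPA score reduces to a one-line vectorized computation. This is a sketch under our own naming conventions (preds_e, tpa), assuming the predicted labels of all $N$ models at a single epoch $e$ are available:

```python
import numpy as np

def tpa(preds_e, y):
    """TPA(x, y; F^e(X)) for all M examples at one epoch.

    preds_e: (N, M) array of predicted labels, one row per model.
    y:       (M,)   array of the (possibly noisy) training labels.
    Returns an (M,) array with values in {0, 1/N, ..., 1}.
    """
    return (preds_e == y[None, :]).mean(axis=0)
```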
2.3 CUMULATIVE SCORES
When inspecting the dynamics of the TPA score on clean data, we see that at the beginning the
distribution of
{TPA(xi, yi)}
is concentrated around 0, and then quickly shifts to 1 as training
proceeds (see side panels in Fig. 2a). This implies that empirically, data is learned in a specific order
by all models in the ensemble. To measure this phenomenon we use the Ensemble Learning Pace
(ELP) score defined below, which essentially integrates the TPA score over a set of epochs E:
$$ELP(x, y) = \frac{1}{|E|} \sum_{e \in E} TPA(x, y; \mathcal{F}^e(X)) \tag{1}$$
$ELP(x, y)$ captures both the time of learning by a single model, and its consistency across models.
For example, if all the models learned the example early, the score would be high. It would be
significantly lower if some of them learned it later than others (see pseudo-code in App. C).
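A direct NumPy implementation of Eq. (1) follows. It is a minimal sketch, not the paper's own pseudo-code from App. C, and it assumes the ensemble's hard predictions were recorded at every epoch in $E$:

```python
import numpy as np

def elp(preds, y):
    """ELP(x, y) of Eq. (1): the TPA score averaged over the epochs in E.

    preds: (|E|, N, M) array of predicted labels over the recorded epochs,
           the N models, and the M training examples.
    y:     (M,) array of training labels.
    Returns an (M,) array; a high value means the example was learned early
    and consistently by all models, while a low value means late and/or
    inconsistent learning -- the signature of a noisy label.
    """
    return (preds == y[None, None, :]).mean(axis=(0, 1))
```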
In our study we evaluated two additional cumulative scores of inter-model agreement:

1. Cumulative loss:
$$CumLoss(x, y) = \frac{1}{N|E|} \sum_{i=1}^{N} \sum_{e \in E} CE(f_i^e(x), y)$$
Above, $CE$ denotes the cross-entropy function. This score is very similar to ELP, engaging the average of the cross-entropy loss instead of the accuracy indicator $\mathbb{1}[f_i^e(x) = y]$.

2. Area under the margin: following (Pleiss et al., 2020), the MeanMargin score is defined as follows:
$$MeanMargin(x, y) = \frac{1}{N|E|} \sum_{i=1}^{N} \sum_{e \in E} \left( [f_i^e(x)]_y - \max_{j \neq y} [f_i^e(x)]_j \right)$$
The MeanMargin score is the mean of the 'margin', the difference between the value of the ground-truth logit (before softmax) and the value of the otherwise maximal logit.
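Both scores admit a similarly compact implementation. The sketch below assumes the per-epoch softmax outputs (for CumLoss) and pre-softmax logits (for MeanMargin) were stored for every model; the array names probs and logits are our own:

```python
import numpy as np

def cum_loss(probs, y, eps=1e-12):
    """CumLoss(x, y): cross-entropy averaged over models and epochs.
    probs: (|E|, N, M, C) softmax outputs; y: (M,) labels."""
    p_true = probs[..., np.arange(len(y)), y]        # (|E|, N, M)
    return -np.log(p_true + eps).mean(axis=(0, 1))   # (M,)

def mean_margin(logits, y):
    """MeanMargin(x, y): ground-truth logit minus the largest other logit,
    averaged over models and epochs. logits: (|E|, N, M, C); y: (M,)."""
    m = np.arange(len(y))
    true_logit = logits[..., m, y]                   # (|E|, N, M)
    others = logits.copy()
    others[..., m, y] = -np.inf                      # mask out the true class
    return (true_logit - others.max(axis=-1)).mean(axis=(0, 1))
```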
3 THE DYNAMICS OF AGREEMENT: NEW EMPIRICAL OBSERVATION
In this section we analyze, both theoretically and empirically, how measures of inter-network agree-
ment may indicate the detrimental phenomenon of ’overfit’. Overfit is a condition that can occur
during the training of deep neural networks. It is characterized by the co-occurring decrease of train
error or loss and the increase of test error or loss. Recall that train loss is the quantity that is being
continuously minimized during the training of deep models, while the test error is the quantity linked
to generalization error. When these quantities change in opposite directions, training harms the final
performance and thus early stopping is recommended.
We begin by showing in Section 3.1 that in an ensemble of linear regression models, overfit and the
agreement between models are negatively correlated. When this is the case, an epoch in which the
agreement between networks reaches its maximal value is likely to indicate the beginning of overfit.
Our next goal is to examine the relevance of this result to deep learning in practice. Yet, somewhat
inexplicably, overfit rarely occurs in practice when deep learning is used for image recognition.
However, when label noise is introduced, significant overfit does occur.
Capitalizing on this observation, we report in Section 3.3 that when overfit occurs in the independent
training of an ensemble of deep networks, the agreement between the networks starts to decrease.
The approach we describe in Section 4 is motivated by these results: since it has been observed
that noisy data are memorized later than clean data, we hypothesize that overfit occurs when the
memorization of noisy labels becomes dominant. This suggests that measuring the dynamics of
agreement between networks, which is correlated with overfit as shown below, can be effectively
used for the identification of label noise.
3.1 OVERFIT AND AGREEMENT: THEORETICAL RESULT
Since deep learning models are not amenable to a rigorous theoretical analysis, and in order to gain
computational insight into such general phenomena as overfit, simpler models are sometimes analyzed
(e.g. Weinshall and Amir, 2020). Accordingly, in App. A we formally analyze the relation between
overfit and inter-model agreement in an ensemble of linear regression models. In this framework, it
can be shown that the two phenomena are negatively correlated, namely, increase in overfit implies
decrease in inter-model agreement. Thus, we prove (under some assumptions) the following result:
Theorem. Assume an ensemble of models obtained by solving linear regression with gradient descent and random initialization. If overfit increases at time $t$ in all the models in the ensemble, then the agreement between the models in the ensemble at time $t$ decreases.
3.2 MEASURING THE AGREEMENT BETWEEN MODELS
In order to obtain a score that captures the level of disagreement between networks, we inspect more
closely the distribution of $TPA(x, y; \mathcal{F}^e(X))$, defined in Section 2.2, over a sample of datapoints,
and analyze its dynamics as training proceeds. First, note that if all of the models in ensemble $\mathcal{F}^e(X)$
give identical predictions at each point, the TPA score would be either 0 (when all the networks predict
a false label) or 1 (when all the networks predict the correct label). In this case, the TPA distribution
is perfectly bimodal, with only two peaks at 0 and 1. If the predictions of the models at each point
are independent with mean accuracy $p$, then it can be readily shown that $N \cdot TPA$ is approximately a
binomial random variable, whose distribution is unimodal around $p$.
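This distinction is easy to verify numerically. The toy simulation below contrasts the two extremes: $N$ models that predict independently with accuracy $p$ (yielding a unimodal, binomial $N \cdot TPA$) vs. $N$ perfectly correlated models (yielding a bimodal TPA concentrated at 0 and 1):

```python
import numpy as np

N, M, p = 10, 100_000, 0.7
rng = np.random.default_rng(0)

# Independent models: each model is correct with probability p,
# independently per example, so N*TPA ~ Binomial(N, p) -- unimodal.
counts_indep = rng.binomial(N, p, size=M)

# Perfectly correlated models: all N models are jointly right or
# jointly wrong, so N*TPA is exactly 0 or N -- bimodal.
counts_corr = rng.binomial(1, p, size=M) * N

for k in (0, 3, 5, 7, 10):
    print(f"P(N*TPA = {k:2d}): independent {np.mean(counts_indep == k):.3f}"
          f"   correlated {np.mean(counts_corr == k):.3f}")
```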
Empirically, Hacohen et al. (2020) showed that in ensembles of deep models trained on ‘real’ datasets
as we use here, the TPA distribution is highly bimodal. Since commonly used measures of bimodality,
such as the Pearson bimodality score, are ill-fitted for the discrete TPA distribution, we measure
bimodality with the following Bimodal Index score:
$$BI(e) = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left[TPA(x_i, y_i; \mathcal{F}^e(X)) = 1\right]} \; + \; \sqrt{\frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left[TPA(x_i, y_i; \mathcal{F}^e(X)) = 0\right]} \tag{2}$$
$BI(e)$ measures how many examples are either correctly or incorrectly classified by all the models
in the ensemble, rewarding distributions where points are (roughly) equally divided between 0 and 1.
Here we use this score to measure the agreement between networks at epoch $e$.
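In code, $BI(e)$ reduces to counting the examples at the two extremes of the TPA distribution. A minimal sketch, reusing the tpa helper sketched in Section 2.2:

```python
import numpy as np

def bimodal_index(tpa_e):
    """BI(e) of Eq. (2) for one epoch.

    tpa_e: (M,) array of TPA values in [0, 1] at epoch e.
    Returns sqrt(fraction all-correct) + sqrt(fraction all-wrong); since
    sqrt is concave, the score is maximal when every example sits at one
    of the two modes and the modes are (roughly) balanced.
    """
    return np.sqrt(np.mean(tpa_e == 1.0)) + np.sqrt(np.mean(tpa_e == 0.0))
```

Tracking this quantity over epochs produces curves like the main panel of Fig. 2a.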
Figure 2: (a) Main panel: bimodality ($BI$, Y-axis) vs. epochs (X-axis) in an ensemble of 10 DenseNet networks, trained to classify Cifar10 with 20% symmetric noise; the maximum of BI is marked. Side panels: TPA distribution at 6 epochs (blue: clean examples, orange: noisy ones). (b) Scatter plots of test accuracy vs. train bimodality, measured by $BI(e)$ as defined in (2), where changes in color from blue to yellow correspond to advancing epochs.
When plotting the Bimodal Index (BI) of the TPA score as a function of the epochs (Fig. 2a),
we often see two distinct phases. Initially (phase 1), BI is monotonically increasing, namely, both test
accuracy and agreement are on the rise. We call this the ‘learning’ phase. Empirically, in this phase
most of the clean examples are being learned (or memorized), as can also be seen in the left side
panels of Fig. 2a (cf. Li et al., 2015). At some point BI may start to decrease, followed by another
possible ascent. This is phase 2, in which empirically the memorization of noisy examples dominates
the learning (see the right side panels of Fig. 2a). This fall and rise is explained by another set of
empirical observations, namely that noisy labels are not being learned in the same order by the models
of an ensemble (see App. B), which therefore predicts a decline in BI when noisy labels are being learned.