THE DYNAMIC OF CONSENSUS IN DEEP NETWORKS
AND THE IDENTIFICATION OF NOISY LABELS
Daniel Shwartz, Uri Stern & Daphna Weinshall
School of Computer Science and Engineering
Hebrew University of Jerusalem
Jerusalem, Israel
{Daniel.Shwartz1,Uri.Stern,daphna}@mail.huji.ac.il
ABSTRACT
Deep neural networks have incredible capacity and expressibility, and can seem-
ingly memorize any training set. This introduces a problem when training in the
presence of noisy labels, as the noisy examples cannot be distinguished from clean
examples by the end of training. Recent research has dealt with this challenge
by utilizing the fact that deep networks seem to memorize clean examples much
earlier than noisy examples. Here we report a new empirical result: for each
example, when looking at the time it has been memorized by each model in an
ensemble of networks, the diversity seen in noisy examples is much larger than
the clean examples. We use this observation to develop a new method for noisy
labels filtration. The method is based on a statistics of the data, which captures
the differences in ensemble learning dynamics between clean and noisy data. We
test our method on three tasks: (i) noise amount estimation; (ii) noise filtration;
(iii) supervised classification. We show that our method improves over existing
baselines in all three tasks using a variety of datasets, noise models, and noise
levels. Aside from its improved performance, our method has two other advantages.
(i) Simplicity, which implies that no additional hyperparameters are introduced.
(ii) Our method is modular: it does not work in an end-to-end fashion, and can
therefore be used to clean a dataset for any other future usage.
1 INTRODUCTION
Deep neural networks dominate the state of the art in an ever increasing list of application domains,
but for the most part, this incredible success relies on very large datasets of annotated examples
available for training. Unfortunately, large amounts of high-quality annotated data are hard and
expensive to acquire, whereas cheap alternatives (obtained by way of crowd-sourcing or automatic
labeling, for example) often introduce noisy labels into the training set. By now there is much
empirical evidence that neural networks can memorize almost every training set, including ones with
noisy and even random labels (Zhang et al., 2017), which in turn increases the generalization error of
the model. As a result, the problems of identifying the existence of label noise and the separation of
noisy labels from clean ones, are becoming more urgent and therefore attract increasing attention.
Henceforth, we will call the set of examples in the training data whose labels are correct "clean data",
and the set of examples whose labels are incorrect "noisy data". While all labels can be eventually
learned by deep models, it has been empirically shown that most noisy datapoints are learned by
deep models late, after most of the clean data has already been learned (Arpit et al., 2017). Therefore,
many methods focus on the learning time of an example in order to classify it as noisy or clean, by
looking at its loss (Pleiss et al.,2020;Arazo et al.,2019) or loss per epoch (Li et al.,2020) in a single
model. However, these methods struggle to correctly classify clean and noisy datapoints that are
learned at the same time, or worse, noisy datapoints that are learned early. Additionally, many of
these methods work in an end-to-end manner, and thus neither provide noise level estimation nor do
they deliver separate sets of clean and noisy data for novel future usages.
Our first contribution is a new empirical result regarding the learning dynamics of an ensemble of
deep networks, showing that the dynamics is different when training with clean data vs. noisy data.
The dynamics of clean data has been studied in (Hacohen et al., 2020; Pliushch et al., 2021), where it
is reported that different deep models learn examples in the same order and pace. This means that
when training a few models and comparing their predictions, an (approximately) binary outcome
is seen at each epoch $e$: either all the networks correctly predict the example's label, or none of
them does. This further implies that for the most part, the distribution of predictions across points
is bimodal. Additionally, a variety of studies showed that the bias and variance of deep networks
decrease as network complexity grows (Nakkiran et al., 2021; Neal et al., 2018), providing
additional evidence that different deep networks learn the data at the same time.
Figure 1: With noisy labels, models show higher disagreement. The noisy examples are not only learned at a later stage, but each model learns each example at its own time.
In Section 3 we describe a new empirical result: when training an ensemble of deep models with noisy data, and in contrast to what happens when using clean data, different models learn different datapoints at different times (see Fig. 1). This empirical finding tells us that in an ensemble of networks, the learning dynamics of clean data and noisy data can be distinguished. When training such an ensemble with a mixture of clean and noisy data, the emerging dynamics reflects this observation, as well as the tendency of clean data to be learned faster, as previously observed.
In our second contribution, we use this result to develop a new algorithm for noise level estimation
and noise filtration, which we call DisagreeNet (see Section 4). Importantly, unlike most alternative
methods, our algorithm is simple (it does not introduce any new hyperparameters), parallelizable,
easy to integrate with any supervised or semi-supervised learning method and any loss function,
and does not rely on prior knowledge of the noise amount. When used for noise filtration, our
empirical study (see Section 5) shows the superiority of DisagreeNet as compared to the state of
the art, using different datasets, different noise models and different noise levels. When used for
supervised classification by way of pre-processing the training set prior to training a deep model, it
provides a significant boost in performance, more so than alternative methods.
Relation to prior art
Work on the dynamics of learning in deep models has received increased attention in recent years (e.g.,
Nguyen et al.,2020;Hacohen et al.,2020;Baldock et al.,2021). Our work adds a new observation
to this body of knowledge, which is seemingly unique to ensembles of deep models (as opposed to
ensembles of other commonly used classifiers). Thus, while there exist other methods that use
ensembles to handle label noise (e.g., Sabzevari et al.,2018;Feng et al.,2020;Chai et al.,2021;
de Moura et al.,2018), for the most part they cannot take advantage of this characteristic of deep
models, and as a result are forced to use additional knowledge, typically the availability of a clean
validation set and/or prior knowledge of the noise amount.
Work on deep learning with noisy labels (see Song et al. (2022) for a recent survey) can be coarsely
divided into two categories: general methods that use a modified loss or network architecture, and
methods that focus on noise identification. The first group includes methods that aim to estimate the
underlying noise transition matrix (Goldberger and Ben-Reuven,2016;Patrini et al.,2017), employ
a noise-robust loss (Ghosh et al.,2017;Zhang and Sabuncu,2018;Wang et al.,2019;Xu et al.,
2019), or achieve robustness to noise by way of regularization (Tanno et al.,2019;Jenni and Favaro,
2018). Methods in the second group, which is more in line with our approach, focus more directly
on noise identification. Some methods assume that clean examples are usually learned faster than
noisy examples (e.g. Liu et al.,2020). Others (Arazo et al.,2019;Li et al.,2020) generate soft labels
by interpolating the given labels and the model’s predictions during training. Yet other methods
(Jiang et al.,2018;Han et al.,2018;Malach and Shalev-Shwartz,2017;Yu et al.,2019;Lee and
Chung,2019), like our own, inspect an ensemble of networks, usually in order to transfer information
between networks and thus avoid agreement bias.
Notably, we also analyze the behavior of ensembles in order to identify the noisy examples, resembling
(Pleiss et al.,2020;Nguyen et al.,2019;Lee and Chung,2019). But unlike these methods, which track
the loss of the networks, we track the dynamics of the agreement between multiple networks over
epochs. We then show that this statistic is more effective and achieves superior results. Additionally
(and not less importantly), unlike these works, we do not assume prior knowledge of the noise amount
or the presence of a clean validation set, and do not introduce new hyper-parameters in our algorithm.
Recently, the emphasis has somewhat shifted to the use of semi-supervised learning and contrastive
learning (Li et al.,2020;Liu et al.,2020;Ortego et al.,2020;Wei et al.,2020;Yao et al.,2021;
Zheltonozhskii et al.,2022;Li et al.,2022;Karim et al.,2022). Semi-supervised learning is an
effective paradigm for the prediction of missing labels. This paradigm is especially useful when the
identification of noisy points cannot be done reliably, in which case it is advantageous to remove
labels whose likelihood to be true is not negligible. The effectiveness of semi-supervised learning in
providing reliable pseudo-labels for unlabeled points will compensate for the loss of clean labels.
However, semi-supervised learning is not universally practical as it often relies on the extraction of
effective representations based on unsupervised learning tasks, which typically introduces implicit
priors (e.g., that contrastive loss is appropriate). In contrast, our goal is to reliably identify noisy
points, to be subsequently removed. Thus, our method can be easily incorporated into any SOTA
method which uses supervised or semi-supervised learning (with or without contrastive learning),
and may provide benefit even when semi-supervised learning is not viable.
2 INTER-NETWORK AGREEMENT: DEFINITION AND SCORES
Measuring the similarity between deep models is not a trivial challenge, as modern deep neural
networks are complex functions defined by a huge number of parameters, which are invariant to
transformations hidden in the model’s architecture. Here we measure the similarity between deep
models in an ensemble by measuring inter-model prediction agreement at each datapoint. Accordingly,
in Section 2.2 we describe scores that are based on the state of the networks at each epoch $e$, while in
Section 2.3 we describe cumulative scores that integrate these states through many epochs. Practically
(see Section 4), our proposed method relies on the cumulative scores, which are shown empirically to
provide more accurate results in the noise filtration task. These scores promise added robustness, as it
is no longer necessary to identify the epoch at which the score is to be evaluated.
2.1 PRELIMINARIES
Notations    Let $f^e : \mathbb{R}^d \rightarrow [0,1]^{|C|}$ denote a deep model, trained with Stochastic Gradient Descent (SGD) for $e$ epochs on training set $X = \{(x_i, y_i)\}_{i=1}^{M}$, where $x_i \in \mathbb{R}^d$ denotes a single example and $y_i \in [C]$ its corresponding label. Let $\mathcal{F}^e(X) = \{f_1^e, \ldots, f_N^e\}$ denote an ensemble of $N$ such models, where each model $f_i^e$, $i \in [N]$, is initialized and trained independently on $X$.
Noise model    We analyze the training dynamics of an ensemble of models in the presence of label noise. Label noise is different from data noise (like image distortion or additive Gaussian noise). Here it is assumed that after the training set $X = \{(x_i, l_i)\}_{i=1}^{M}$ is sampled, the labels $\{l_i\}$ are corrupted by some noise function $g : [C] \rightarrow [C]$, and the training set becomes $X = \{(x_i, y_i)\}_{i=1}^{M}$, $y_i = g(l_i)$. The two most common models of label noise are termed symmetric noise and asymmetric noise (Patrini et al., 2017). In both cases it is assumed that some fixed percentage of the labels is corrupted by $g(l)$. With symmetric noise, $g(l)$ assigns any new label from the set $[C] \setminus \{l\}$ with equal probability. With asymmetric noise, $g(l)$ is a deterministic permutation function (see App. F for details). Note that the asymmetric noise model is considered much harder than the symmetric noise model.
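To make the two noise models concrete, below is a minimal NumPy sketch of such a corruption function $g$. The helper name corrupt_labels is ours, and the cyclic shift used in the asymmetric branch is an illustrative assumption; the exact permutation used in the paper is specified in its App. F.

```python
import numpy as np

def corrupt_labels(labels, num_classes, noise_rate, asymmetric=False, seed=0):
    """Corrupt a fixed fraction of labels under the symmetric or asymmetric
    noise model. Returns the corrupted labels and the indices of the
    examples whose labels were changed (the 'noisy data')."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    noisy_idx = rng.choice(len(labels), size=int(noise_rate * len(labels)),
                           replace=False)
    if asymmetric:
        # Deterministic permutation g(l); here a cyclic shift, which is an
        # assumption on our part (the paper's permutation is in its App. F).
        labels[noisy_idx] = (labels[noisy_idx] + 1) % num_classes
    else:
        # Symmetric noise: a uniform draw from [C] \ {l}, implemented by
        # shifting each corrupted label by a random non-zero offset.
        offsets = rng.integers(1, num_classes, size=len(noisy_idx))
        labels[noisy_idx] = (labels[noisy_idx] + offsets) % num_classes
    return labels, noisy_idx
```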
2.2 PER-EPOCH AGREEMENT SCORE
Following Hacohen et al. (2020), we define the True Positive Agreement (TPA) score of ensemble $\mathcal{F}^e(X)$ at each datapoint $(x, y)$ as
$$TPA(x, y; \mathcal{F}^e(X)) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[f_i^e(x) = y].$$
The TPA score measures the average accuracy of the models in the ensemble, when seeing $x$, after each model has been trained for exactly $e$ epochs on $X$. Note that $TPA$ measures the average accuracy of multiple models on one example, as opposed to the generalization error, which measures the average error of one model on multiple examples.
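For reference, once the hard predictions of the ensemble are recorded, the TPA score reduces to a one-line vectorized computation. This is a sketch under our own naming conventions (preds_e, tpa), assuming the predicted labels of all $N$ models at a single epoch $e$ are available:

```python
import numpy as np

def tpa(preds_e, y):
    """TPA(x, y; F^e(X)) for all M examples at one epoch.

    preds_e: (N, M) array of predicted labels, one row per model.
    y:       (M,)   array of the (possibly noisy) training labels.
    Returns an (M,) array with values in {0, 1/N, ..., 1}.
    """
    return (preds_e == y[None, :]).mean(axis=0)
```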
2.3 CUMULATIVE SCORES
When inspecting the dynamics of the TPA score on clean data, we see that at the beginning the
distribution of
{TPA(xi, yi)}
is concentrated around 0, and then quickly shifts to 1 as training
proceeds (see side panels in Fig. 2a). This implies that empirically, data is learned in a specific order
by all models in the ensemble. To measure this phenomenon we use the Ensemble Learning Pace
(ELP) score defined below, which essentially integrates the TPA score over a set of epochs E:
$$ELP(x, y) = \frac{1}{|E|} \sum_{e \in E} TPA(x, y; \mathcal{F}^e(X)) \tag{1}$$
$ELP(x, y)$ captures both the time of learning by a single model, and its consistency across models.
For example, if all the models learned the example early, the score would be high. It would be
significantly lower if some of them learned it later than others (see pseudo-code in App. C).
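A direct NumPy implementation of Eq. (1) follows. It is a minimal sketch, not the paper's own pseudo-code from App. C, and it assumes the ensemble's hard predictions were recorded at every epoch in $E$:

```python
import numpy as np

def elp(preds, y):
    """ELP(x, y) of Eq. (1): the TPA score averaged over the epochs in E.

    preds: (|E|, N, M) array of predicted labels over the recorded epochs,
           the N models, and the M training examples.
    y:     (M,) array of training labels.
    Returns an (M,) array; a high value means the example was learned early
    and consistently by all models, while a low value means late and/or
    inconsistent learning -- the signature of a noisy label.
    """
    return (preds == y[None, None, :]).mean(axis=(0, 1))
```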
In our study we evaluated two additional cumulative scores of inter-model agreement:

1. Cumulative loss:
$$CumLoss(x, y) = \frac{1}{N|E|} \sum_{i=1}^{N} \sum_{e \in E} CE(f_i^e(x), y)$$
Above, $CE$ denotes the cross-entropy function. This score is very similar to ELP, engaging the average of the cross-entropy loss instead of the accuracy indicator $\mathbb{1}[f_i^e(x) = y]$.

2. Area under the margin: following (Pleiss et al., 2020), the MeanMargin score is defined as follows:
$$MeanMargin(x, y) = \frac{1}{N|E|} \sum_{i=1}^{N} \sum_{e \in E} \left( [f_i^e(x)]_y - \max_{j \neq y} [f_i^e(x)]_j \right)$$
The MeanMargin score is the mean of the 'margin', the difference between the value of the ground-truth logit (before softmax) and the value of the otherwise maximal logit.
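Both scores admit a similarly compact implementation. The sketch below assumes the per-epoch softmax outputs (for CumLoss) and pre-softmax logits (for MeanMargin) were stored for every model; the array names probs and logits are our own:

```python
import numpy as np

def cum_loss(probs, y, eps=1e-12):
    """CumLoss(x, y): cross-entropy averaged over models and epochs.
    probs: (|E|, N, M, C) softmax outputs; y: (M,) labels."""
    p_true = probs[..., np.arange(len(y)), y]        # (|E|, N, M)
    return -np.log(p_true + eps).mean(axis=(0, 1))   # (M,)

def mean_margin(logits, y):
    """MeanMargin(x, y): ground-truth logit minus the largest other logit,
    averaged over models and epochs. logits: (|E|, N, M, C); y: (M,)."""
    m = np.arange(len(y))
    true_logit = logits[..., m, y]                   # (|E|, N, M)
    others = logits.copy()
    others[..., m, y] = -np.inf                      # mask out the true class
    return (true_logit - others.max(axis=-1)).mean(axis=(0, 1))
```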
3 THE DYNAMICS OF AGREEMENT: NEW EMPIRICAL OBSERVATION
In this section we analyze, both theoretically and empirically, how measures of inter-network agree-
ment may indicate the detrimental phenomenon of ’overfit’. Overfit is a condition that can occur
during the training of deep neural networks. It is characterized by the co-occurring decrease of train
error or loss and the increase of test error or loss. Recall that train loss is the quantity that is being
continuously minimized during the training of deep models, while the test error is the quantity linked
to generalization error. When these quantities change in opposite directions, training harms the final
performance and thus early stopping is recommended.
We begin by showing in Section 3.1 that in an ensemble of linear regression models, overfit and the
agreement between models are negatively correlated. When this is the case, an epoch in which the
agreement between networks reaches its maximal value is likely to indicate the beginning of overfit.
Our next goal is to examine the relevance of this result to deep learning in practice. Yet, somewhat
inexplicably, overfit rarely occurs in practice when deep learning is used for image recognition.
However, when label noise is introduced, significant overfit does occur.
Capitalizing on this observation, we report in Section 3.3 that when overfit occurs in the independent
training of an ensemble of deep networks, the agreement between the networks starts to decrease.
The approach we describe in Section 4 is motivated by these results: since it has been observed
that noisy data are memorized later than clean data, we hypothesize that overfit occurs when the
memorization of noisy labels becomes dominant. This suggests that measuring the dynamics of
agreement between networks, which is correlated with overfit as shown below, can be effectively
used for the identification of label noise.
3.1 OVERFIT AND AGREEMENT: THEORETICAL RESULT
Since deep learning models are not amenable to a rigorous theoretical analysis, and in order to gain
computational insight into such general phenomena as overfit, simpler models are sometimes analyzed
(e.g. Weinshall and Amir, 2020). Accordingly, in App. A we formally analyze the relation between
overfit and inter-model agreement in an ensemble of linear regression models. In this framework, it
can be shown that the two phenomena are negatively correlated, namely, increase in overfit implies
decrease in inter-model agreement. Thus, we prove (under some assumptions) the following result:
Theorem. Assume an ensemble of models obtained by solving linear regression with gradient descent and random initialization. If overfit increases at time $t$ in all the models in the ensemble, then the agreement between the models in the ensemble at time $t$ decreases.
3.2 MEASURING THE AGREEMENT BETWEEN MODELS
In order to obtain a score that captures the level of disagreement between networks, we inspect more
closely the distribution of $TPA(x, y; \mathcal{F}^e(X))$, defined in Section 2.2, over a sample of datapoints,
and analyze its dynamics as training proceeds. First, note that if all of the models in ensemble $\mathcal{F}^e(X)$
give identical predictions at each point, the TPA score would be either 0 (when all the networks predict
a false label) or 1 (when all the networks predict the correct label). In this case, the TPA distribution
is perfectly bimodal, with only two peaks at 0 and 1. If the predictions of the models at each point
are independent with mean accuracy $p$, then it can be readily shown that $N \cdot TPA$ is approximately a
binomial random variable, whose distribution is unimodal around $p$.
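This distinction is easy to verify numerically. The toy simulation below contrasts the two extremes: $N$ models that predict independently with accuracy $p$ (yielding a unimodal, binomial $N \cdot TPA$) vs. $N$ perfectly correlated models (yielding a bimodal TPA concentrated at 0 and 1):

```python
import numpy as np

N, M, p = 10, 100_000, 0.7
rng = np.random.default_rng(0)

# Independent models: each model is correct with probability p,
# independently per example, so N*TPA ~ Binomial(N, p) -- unimodal.
counts_indep = rng.binomial(N, p, size=M)

# Perfectly correlated models: all N models are jointly right or
# jointly wrong, so N*TPA is exactly 0 or N -- bimodal.
counts_corr = rng.binomial(1, p, size=M) * N

for k in (0, 3, 5, 7, 10):
    print(f"P(N*TPA = {k:2d}): independent {np.mean(counts_indep == k):.3f}"
          f"   correlated {np.mean(counts_corr == k):.3f}")
```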
Empirically, Hacohen et al. (2020) showed that in ensembles of deep models trained on ‘real’ datasets
as we use here, the TPA distribution is highly bimodal. Since commonly used measures of bimodality,
such as the Pearson bimodality score, are ill-fitted for the discrete TPA distribution, we measure
bimodality with the following Bimodal Index score:
$$BI(e) = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left[TPA(x_i, y_i; \mathcal{F}^e(X)) = 1\right]} \; + \; \sqrt{\frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left[TPA(x_i, y_i; \mathcal{F}^e(X)) = 0\right]} \tag{2}$$
$BI(e)$ measures how many examples are either correctly or incorrectly classified by all the models
in the ensemble, rewarding distributions where points are (roughly) equally divided between 0 and 1.
Here we use this score to measure the agreement between networks at epoch $e$.
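In code, $BI(e)$ reduces to counting the examples at the two extremes of the TPA distribution. A minimal sketch, reusing the tpa helper sketched in Section 2.2:

```python
import numpy as np

def bimodal_index(tpa_e):
    """BI(e) of Eq. (2) for one epoch.

    tpa_e: (M,) array of TPA values in [0, 1] at epoch e.
    Returns sqrt(fraction all-correct) + sqrt(fraction all-wrong); since
    sqrt is concave, the score is maximal when every example sits at one
    of the two modes and the modes are (roughly) balanced.
    """
    return np.sqrt(np.mean(tpa_e == 1.0)) + np.sqrt(np.mean(tpa_e == 0.0))
```

Tracking this quantity over epochs produces curves like the main panel of Fig. 2a.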
Figure 2: (a) Main panel: bimodality ($BI$, Y-axis) vs. epochs (X-axis) in an ensemble of 10 DenseNet networks, trained to classify Cifar10 with 20% symmetric noise; the maximum of BI is marked. Side panels: TPA distribution at 6 epochs (blue: clean examples, orange: noisy ones). (b) Scatter plots of test accuracy vs. train bimodality, measured by $BI(e)$ as defined in (2), where changes in color from blue to yellow correspond to advancing epochs.
When plotting the Bimodal Index (BI) of the TPA score as a function of the epochs (Fig. 2a),
we often see two distinct phases. Initially (phase 1), BI is monotonically increasing, namely, both test
accuracy and agreement are on the rise. We call this the ‘learning’ phase. Empirically, in this phase
most of the clean examples are being learned (or memorized), as can also be seen in the left side
panels of Fig. 2a (cf. Li et al., 2015). At some point BI may start to decrease, followed by another
possible ascent. This is phase 2, in which empirically the memorization of noisy examples dominates
the learning (see the right side panels of Fig. 2a). This fall and rise is explained by another set of
empirical observations, namely that noisy labels are not being learned in the same order by the models
of an ensemble (see App. B), which therefore predicts a decline in BI when noisy labels are being learned.