
The dynamics of learning from clean data have been studied in (Hacohen et al., 2020; Pliushch et al., 2021), where it is reported that different deep models learn examples in the same order and at the same pace. This means that when training a few models and comparing their predictions, an (approximately) binary outcome is seen at each epoch e: either all the networks correctly predict the example's label, or none of them does. This further implies that, for the most part, the distribution of predictions across points is bimodal. Additionally, a variety of studies have shown that the bias and variance of deep networks decrease as network complexity grows (Nakkiran et al., 2021; Neal et al., 2018), providing additional evidence that different deep networks learn the data simultaneously.
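To make the notion of agreement concrete, the following is a minimal sketch (ours, not taken from the cited works) of how per-example agreement across an ensemble could be measured at a given epoch; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def ensemble_agreement(preds, labels):
    """Per-example agreement of an ensemble at one epoch.

    preds:  (n_models, n_examples) array of predicted class indices
    labels: (n_examples,) array of (possibly noisy) training labels
    Returns the fraction of models predicting each example's label.
    """
    correct = (preds == labels[None, :])   # (n_models, n_examples)
    return correct.mean(axis=0)            # value in [0, 1] per example

# On clean data the reported dynamics are (approximately) binary:
# at a given epoch most examples have agreement close to 0 or to 1,
# so the histogram of agreement values across examples is bimodal.
rng = np.random.default_rng(0)
preds = rng.integers(0, 10, size=(5, 1000))   # toy stand-in for 5 models
labels = rng.integers(0, 10, size=1000)
agree = ensemble_agreement(preds, labels)
hist, _ = np.histogram(agree, bins=np.linspace(0, 1, 6))
print(hist)  # counts per agreement bin; bimodal on real clean-data runs
```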
Figure 1: With noisy labels, models show higher disagreement. The noisy examples are not only learned at a later stage, but each model learns a given example at its own time.
In Section 3 we describe a new empirical result: when training an ensemble of deep models on noisy data, and in contrast to what happens with clean data, different models learn different datapoints at different times (see Fig. 1). This empirical finding tells us that in an ensemble of networks, the learning dynamics of clean data and noisy data can be distinguished. When training such an ensemble on a mixture of clean and noisy data, the emerging dynamics reflect this observation, as well as the previously observed tendency of clean data to be learned faster.
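One hypothetical way to quantify the timing effect of Fig. 1: define a model's "learning epoch" for an example as the first epoch from which the model predicts that example's label until the end of training, and compare the spread of these epochs across ensemble members. The criterion and names below are our illustrative choices, not the paper's definitions.

```python
import numpy as np

def learning_epochs(correct):
    """Per-model 'learning time' of each example.

    correct: (n_models, n_epochs, n_examples) boolean array,
             True where a model predicts the example's label at an epoch.
    Returns (n_models, n_examples): the first epoch from which each model
    stays correct until the end of training (n_epochs if never).
    """
    n_models, n_epochs, n_examples = correct.shape
    # suffix product over epochs: True at t iff correct from t to the end
    learned = np.flip(np.cumprod(np.flip(correct, axis=1), axis=1),
                      axis=1).astype(bool)
    # first True along the epoch axis; n_epochs if the example is never kept
    return np.where(learned.any(axis=1), learned.argmax(axis=1), n_epochs)

# The observation: for clean examples the per-model learning epochs nearly
# coincide (small std across models); for noisy examples they do not.
rng = np.random.default_rng(0)
correct = rng.random((5, 30, 1000)) < 0.7        # toy stand-in
spread = learning_epochs(correct).std(axis=0)    # per-example spread
print(spread.mean())
```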
In our second contribution, we use this result to develop a new algorithm for noise level estimation and noise filtration, which we call DisagreeNet (see Section 4). Importantly, unlike most alternative methods, our algorithm is simple (it does not introduce any new hyperparameters), parallelizable, easy to integrate with any supervised or semi-supervised learning method and any loss function, and does not rely on prior knowledge of the noise amount. When used for noise filtration, our empirical study (see Section 5) shows the superiority of DisagreeNet over the state of the art across different datasets, noise models, and noise levels. When used for supervised classification, by way of pre-processing the training set prior to training a deep model, it provides a significant boost in performance, larger than that of alternative methods.
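For intuition only, here is a sketch of the general recipe of disagreement-based filtration; it is not the paper's DisagreeNet (which is specified in Section 4 and estimates the noise level itself). The scoring rule, the function name, and taking a noise-level estimate as input are our assumptions.

```python
import numpy as np

def filter_by_agreement(agreement_per_epoch, noise_level):
    """Illustrative disagreement-based filtration (not the paper's DisagreeNet).

    agreement_per_epoch: (n_epochs, n_examples) per-example agreement scores,
                         e.g. ensemble_agreement() from the earlier sketch,
                         stacked over epochs
    noise_level:         estimated fraction of noisy labels in [0, 1]
    Returns a boolean mask over examples, True for presumably clean ones.
    """
    # Clean examples tend to reach high agreement early and keep it, while
    # noisy examples are learned late and at model-specific times, so their
    # agreement averaged over epochs is lower.
    score = agreement_per_epoch.mean(axis=0)
    n_drop = int(round(noise_level * score.size))
    if n_drop == 0:
        return np.ones(score.size, dtype=bool)
    threshold = np.partition(score, n_drop - 1)[n_drop - 1]
    return score > threshold

# Usage: keep ~80% of examples if the estimated noise level is 20%.
scores = np.random.default_rng(1).random((30, 1000))  # toy agreement traces
mask = filter_by_agreement(scores, noise_level=0.2)
print(mask.sum(), "examples kept")
```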
Relation to prior art
Work on the dynamics of learning in deep models has received increased attention in recent years (e.g., Nguyen et al., 2020; Hacohen et al., 2020; Baldock et al., 2021). Our work adds a new observation to this body of knowledge, one that is seemingly unique to ensembles of deep models (as opposed to ensembles of other commonly used classifiers). Thus, while there exist other methods that use ensembles to handle label noise (e.g., Sabzevari et al., 2018; Feng et al., 2020; Chai et al., 2021; de Moura et al., 2018), for the most part they cannot take advantage of this characteristic of deep models, and as a result are forced to use additional knowledge, typically the availability of a clean validation set and/or prior knowledge of the noise amount.
Work on deep learning with noisy labels (see Song et al. (2022) for a recent survey) can be coarsely divided into two categories: general methods that use a modified loss function or network architecture, and methods that focus on noise identification. The first group includes methods that aim to estimate the underlying noise transition matrix (Goldberger and Ben-Reuven, 2016; Patrini et al., 2017), employ a noise-robust loss (Ghosh et al., 2017; Zhang and Sabuncu, 2018; Wang et al., 2019; Xu et al., 2019), or achieve robustness to noise by way of regularization (Tanno et al., 2019; Jenni and Favaro, 2018). Methods in the second group, which is more in line with our approach, focus more directly on noise identification. Some methods assume that clean examples are usually learned faster than noisy examples (e.g., Liu et al., 2020). Others (Arazo et al., 2019; Li et al., 2020) generate soft labels by interpolating between the given labels and the model's predictions during training. Yet other methods (Jiang et al., 2018; Han et al., 2018; Malach and Shalev-Shwartz, 2017; Yu et al., 2019; Lee and Chung, 2019), like our own, inspect an ensemble of networks, usually in order to transfer information between networks and thus avoid agreement bias.
Notably, we also analyze the behavior of ensembles in order to identify the noisy examples, resembling (Pleiss et al., 2020; Nguyen et al., 2019; Lee and Chung, 2019). But unlike these methods, which track
the loss of the networks, we track the dynamics of the agreement between multiple networks over time.