only needs a single model and utilizes a mixup
training strategy based on the aggregated rep-
resentation from pre-trained models.
• We perform comprehensive experiments on three different types of text classification benchmarks under various noise settings, including the challenging instance-dependent noise, which is usually ignored in other works on textual data. The results demonstrate the superiority of our proposed method over strong baselines.
2 Related Work
Learning with Noisy Labels.
A direct yet effec-
tive idea to handle label noise is to find the noisy
samples and reduce their influence by resampling
or reweighting (Rolnick et al., 2017). Jiang et al. (2018) train an additional neural network to provide a curriculum that helps the StudentNet focus on samples whose labels are probably correct. Han et al. (2018) jointly train two deep neural networks and feed each model the top r% of samples with the lowest loss, as evaluated by the other model, in each mini-batch.
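To make this small-loss selection concrete, a minimal PyTorch-style sketch is given below; the networks net1/net2, the batch (x, y), and the keep_rate argument (the r% of the text) are illustrative assumptions rather than the original implementation.

```python
import torch
import torch.nn.functional as F

def small_loss_selection(net1, net2, x, y, keep_rate):
    # Each network keeps the keep_rate fraction of the mini-batch with the
    # smallest loss according to its peer, and is updated only on those samples.
    num_keep = int(keep_rate * x.size(0))
    loss1 = F.cross_entropy(net1(x), y, reduction="none")  # per-sample losses
    loss2 = F.cross_entropy(net2(x), y, reduction="none")
    idx1 = torch.argsort(loss2)[:num_keep]  # net2 selects small-loss samples for net1
    idx2 = torch.argsort(loss1)[:num_keep]  # net1 selects small-loss samples for net2
    update_loss1 = F.cross_entropy(net1(x[idx1]), y[idx1])
    update_loss2 = F.cross_entropy(net2(x[idx2]), y[idx2])
    return update_loss1, update_loss2
```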
Following Han et al. (2018), Yu et al. (2019) ex-
plore how disagreement can help the model. Some
researchers assume a transition from the ground-truth label distribution to the noisy label distribution and estimate a noise transition matrix to model it (Goldberger and Ben-Reuven, 2016). Northcutt et al. (2021) directly estimate the joint distribution matrix between noisy and true labels. Garg et al. (2021) use a fully connected layer to capture the distribution transition. However, most of these methods either need model ensembling or require cross-validation, which is time-consuming and requires maintaining multiple sets of model parameters.
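For concreteness, the class-conditional form of this transition-matrix idea (a standard formulation; the notation here is illustrative rather than taken from the cited works) models the observed noisy-label distribution as
\[
p(\tilde{y}=j \mid x) \;=\; \sum_{i=1}^{C} T_{ij}\, p(y=i \mid x), \qquad T_{ij} = p(\tilde{y}=j \mid y=i),
\]
so that a classifier $f(x) \approx p(y \mid x)$ can be trained on noisy labels through the corrected prediction $T^{\top} f(x)$, with $T$ either estimated in advance or learned jointly with the classifier.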
Some other works focus on designing a more
robust training strategy. Since DNNs trained with the cross-entropy loss tend to overfit noisy labels (Feng et al., 2021), some researchers redesign noise-robust loss functions (Wang et al., 2019b; Zhang and Sabuncu, 2018; Ghosh et al., 2017; Xu et al., 2019). When trained on noisy data, DNNs tend to learn from the clean data during an "early learning" phase before eventually memorizing the wrongly labeled data (Arpit et al., 2017; Zhang et al., 2021); based on this observation, Liu et al. (2020a) propose a simple regularization that capitalizes on early learning. Other works such as Xia et al. (2020) find that only part of the parameters are essential for generalization, which offers a new perspective on what difference exactly noisy labels make to the model's learning. These approaches treat all samples indiscriminately, so their performance is sometimes unsatisfactory under a high noise ratio.
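As one representative member of this family of noise-robust losses, the generalized cross entropy of Zhang and Sabuncu (2018) can be sketched in a few lines; the PyTorch code below is an illustrative implementation, and the value of q is only an example.

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    # GCE (Zhang and Sabuncu, 2018): L_q = (1 - p_y^q) / q.
    # q -> 0 recovers cross-entropy; q = 1 gives MAE, which is more
    # robust to mislabeled samples.
    probs = F.softmax(logits, dim=1)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # prob of the target class
    return ((1.0 - p_y.pow(q)) / q).mean()
```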
Some excellent work combines these two ideas
(Ding et al., 2018; Li et al., 2020). Garg et al. (2021) add an auxiliary noise model NM on top of the classifier to predict noisy labels and jointly train the classifier and the noise model through a de-noising
loss function. Cheng et al. (2021) progressively
sieve out corrupted examples and then leverage
semi-supervised learning.
Mixup Training.
Mixup training (Zhang et al.,
2018) is a widely used data-augmentation method
to alleviate memorization and sensitivity to adver-
sarial samples on visual data. It combines the in-
puts and targets of two random training samples to
generate augmented samples. However, applying mixup to textual data is challenging since linear interpolation of discrete inputs damages the semantic structure. Several works have explored mixup for text: Chen et al. (2020) propose to mix the hidden vectors in the last few encoder layers; Yoon et al. (2021) propose a new way to combine two texts, which can also be treated as a data-augmentation strategy. In this paper, we do not compare against these variants for the following reasons: (1) our EmbMix is simpler to use in practice, and there is little difference in the final performance of the various methods according to Chen et al. (2020); (2) some of the other methods require data augmentation, while EmbMix does not.
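For reference, a generic sketch of mixup applied at the representation level is shown below; it assumes fixed-size sentence representations and one-hot or soft label vectors, and is not the exact EmbMix implementation.

```python
import numpy as np
import torch

def mixup_representations(reps, targets, alpha=0.4):
    # Vanilla mixup (Zhang et al., 2018) applied to fixed-size representations:
    #   x_mix = lam * x_i + (1 - lam) * x_j,  y_mix = lam * y_i + (1 - lam) * y_j
    # reps:    (batch, dim) sentence-level representations
    # targets: (batch, num_classes) one-hot or soft label vectors
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(reps.size(0))
    mixed_reps = lam * reps + (1.0 - lam) * reps[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_reps, mixed_targets
```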
Proposed Method.
Since simply redesigning a robust loss function tends to perform poorly under a high noise ratio, we combine sample selection with robust training. Unlike previous work that needs model ensembling or cross-validation, we train a single network with dropout to reduce confirmation bias in self-training. We make the following improvements tailored to the characteristics of textual data: (1) Decision boundaries in image-classification tasks are relatively clear, whereas the main idea of the same text can vary with context, and sometimes there is even no absolutely correct label. We therefore iteratively fit a Gaussian Mixture Model (GMM) to the loss distribution and use the predicted soft labels to replace the labels of the fused data, rather than setting a threshold and arbitrarily discarding undesired samples at the beginning. (2) Unlike