
SelfMix: Robust Learning Against Textual Label Noise with
Self-Mixup Training
Dan Qiao1, Chenchen Dai1, Yuyang Ding1, Juntao Li1,
Qiang Chen2, Wenliang Chen1, Min Zhang1
1Institute of Computer Science and Technology, Soochow University, China
2Alibaba Group
{danqiao.jordan,morningcc125,yyding.me}@gmail.com
{ljt,wlchen,minzhang}@suda.edu.cn;lapu.cq@alibaba-inc.com
Abstract
The conventional success of text classification relies on annotated data, and the new paradigm of pre-trained language models (PLMs) still requires a few labeled data for downstream tasks. However, in real-world applications, label noise inevitably exists in training data, damaging the effectiveness, robustness, and generalization of models built on such data. Recently, remarkable achievements have been made to mitigate this dilemma for visual data, while only a few studies explore textual data. To fill this gap, we present SelfMix, a simple yet effective method for handling label noise in text classification tasks. SelfMix uses a Gaussian Mixture Model to separate samples and leverages semi-supervised learning. Unlike previous works that require multiple models, our method utilizes the dropout mechanism on a single model to reduce confirmation bias in self-training and introduces a textual-level mixup training strategy. Experimental results on three text classification benchmarks with different types of text show that our proposed method outperforms strong baselines designed for both textual and visual data under different noise ratios and noise types. Our code is available at https://github.com/noise-learning/SelfMix.
1 Introduction
The excellent performance of deep neural networks (DNNs) depends on data with high-quality annotations. However, data obtained from the real world is inevitably mixed with wrong labels (Guan et al., 2018; Aït-Sahalia et al., 2010; Liu et al., 2020b). Models trained on these noisy datasets would easily overfit the noisy labels (Algan and Ulusoy, 2020; Liu et al., 2020a), especially pre-trained large models (Zhang and Li, 2021), and the performance will be negatively affected.
Research on learning with noisy labels (LNL) has gained popularity. Previous work has revealed that clean samples and noisy samples play different roles in the training process and behave differently in terms of loss values, convergence speeds, etc. (Liu et al., 2020a). Different types of noise also affect training differently: class-conditional noise (CCN) can simulate the confusion between similar classes, while the effect of instance-dependent noise (IDN) is more complex.
Most current methods perform experiments on visual data, where label noise often goes against objective facts and is easy to distinguish. In NLP, by contrast, there may be disagreement even among expert annotators due to the complexity of semantic features and the subjectivity of language understanding. For example, given a piece of news about "The Economic Benefit of Competitive Sports to our Cities", it is hard to tell whether it belongs to Economic news or Sports news without fully understanding the contextual information. Although a few works pay attention to the natural language area, their methods are mostly based on trained-from-scratch models such as LSTM and Text-CNN (Garg et al., 2021; Jindal et al., 2019). However, PLMs might be a better choice since the whole training process is divided into two stages and wrong labels do not corrupt the pre-training process. Table 2 compares PLMs and traditional networks in terms of robustness against label noise.
In conclusion, it is vital to explore how to learn with noisy labels on textual data using robust PLMs as the base model. This paper proposes SelfMix, a self-distillation robust training method based on pre-trained models. Section 2 introduces related work and explains the motivation of our proposed method.
Our contributions can be concluded as follows:
• We propose SelfMix, a simple yet effective method for learning with noisy labels, which utilizes a self-training approach. Our method
only needs a single model and employs a mixup training strategy based on the aggregated representation from pre-trained models.
• We perform comprehensive experiments on three different types of text classification benchmarks under various noise settings, including the challenging instance-dependent noise, which is usually ignored by other works on textual data. The results demonstrate the superiority of our proposed method over strong baselines.
2 Related Work
Learning with Noisy Labels. A direct yet effective idea to handle label noise is to find the noisy samples and reduce their influence by resampling or reweighting (Rolnick et al., 2017). Jiang et al. (2018) train another neural network to provide a curriculum that helps StudentNet focus on the samples whose labels are probably correct. Han et al. (2018) jointly train two deep neural networks and feed each model the top r% small-loss samples in each mini-batch, as evaluated by the other model. Following Han et al. (2018), Yu et al. (2019) explore how disagreement can help the model. Some researchers believe that there exists a transition from the ground-truth label distribution to the noisy label distribution and estimate a noise transition matrix to absorb this transition (Goldberger and Ben-Reuven, 2016). Northcutt et al. (2021) directly estimate the joint distribution matrix between noisy labels and real labels. Garg et al. (2021) use a fully connected layer to capture the distribution transition. However, most of these methods either need model ensembling or require cross-validation, which is time-consuming and needs multiple sets of parameters.
Some other works focus on designing a more robust training strategy. Since DNNs with cross-entropy loss tend to overfit noisy labels (Feng et al., 2021), some researchers redesign noise-robust loss functions (Wang et al., 2019b; Zhang and Sabuncu, 2018; Ghosh et al., 2017; Xu et al., 2019). When trained on noisy data, DNNs tend to learn from the clean data during an "early learning" phase before eventually memorizing the wrong data (Arpit et al., 2017; Zhang et al., 2021), based on which Liu et al. (2020a) offer a simple regularization capitalizing on early learning. Other works such as Xia et al. (2020) find that only partial parameters are essential for generalization, which offers a new perspective on what difference the noisy labels exactly make to the model's learning. This kind of approach treats all samples indiscriminately; thus, the performance is sometimes unsatisfactory under a high noise ratio.
Some excellent works combine these two ideas (Ding et al., 2018; Li et al., 2020). Garg et al. (2021) add an auxiliary noise model NM over the classifier to predict noisy labels and jointly train the classifier and the noise model with a de-noising loss function. Cheng et al. (2021) progressively sieve out corrupted examples and then leverage semi-supervised learning.
Mixup Training. Mixup training (Zhang et al., 2018) is a widely used data-augmentation method to alleviate memorization and sensitivity to adversarial samples on visual data. It combines the inputs and targets of two random training samples to generate augmented samples. However, applying mixup to textual data is a great challenge since linear interpolations on discrete inputs damage the semantic structure. Some literature has explored textual mixup mechanisms: Chen et al. (2020) propose to mix the hidden vectors in the last few encoder layers; Yoon et al. (2021) find a new way to combine two texts, which can also be treated as a data augmentation strategy. In this paper, we do not make comparisons for the following reasons: (1) our EmbMix is simpler in practical use, and there is little difference in the final performance of various methods according to Chen et al. (2020); (2) some other methods need data augmentation while EmbMix does not.
Proposed Method. Since simply redesigning a robust loss function tends to perform poorly under a high noise ratio, we combine sample selection with robust training methods. Unlike previous work that needs model ensembling or uses cross-validation, we train a single network with dropout to reduce confirmation bias in self-training. We make the following improvements with respect to the characteristics of textual data: (1) The decision boundaries in image-classification tasks are relatively clear, whereas the main idea of the same text can vary under different contexts, and sometimes there is not even an absolutely correct label. So we iteratively use the Gaussian Mixture Model (GMM) to fit the loss distribution and use the predicted soft label to replace the label of the fusing data, rather than setting a threshold and arbitrarily discarding the undesired samples at the beginning. (2) Unlike the pixel input of visual data, the input of text is discrete. So for the separated data, we leverage a manifold mixup training strategy based on the aggregated representation from the PLMs.
3 Methodology
Figure 1: The overall framework of SelfMix
Figure 1 shows an overview of our proposed SelfMix. Our method first uses a GMM to select the samples that are more likely to be wrong and erases their original labels. Then we leverage semi-supervised learning to jointly train on a labeled set $\mathcal{X}$ (containing mostly clean samples) and an unlabeled set $\mathcal{U}$ (containing mostly noisy samples). We also introduce a manifold mixup strategy based on the hidden representation of the [CLS] token, named EmbMix.
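As a rough illustration of EmbMix (a sketch, not the authors' released implementation), the following PyTorch snippet interpolates the [CLS] representations of two samples and their soft labels with a coefficient drawn from a Beta distribution, which is the standard mixup recipe applied at the representation level; the `alpha` hyperparameter and the function name are our own assumptions.

```python
import numpy as np
import torch


def embmix(cls_a, cls_b, label_a, label_b, alpha=0.75):
    """Mix two [CLS] representations and their (soft) labels -- illustrative sketch only.

    cls_a, cls_b:     [batch, hidden] pooled [CLS] vectors from the PLM encoder
    label_a, label_b: [batch, num_classes] one-hot or soft label distributions
    """
    lam = float(np.random.beta(alpha, alpha))
    lam = max(lam, 1.0 - lam)  # keep the mixed sample closer to the first input
    mixed_cls = lam * cls_a + (1.0 - lam) * cls_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_cls, mixed_label
```

The mixed representation is then passed to the classifier head and trained against the mixed soft label.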
3.1 Preliminary
In real-world data collection, the observed labels are often corrupted. So the only difference between this task and the traditional text classification task is that a certain proportion of incorrect labels exists in the training samples. Let $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^{N}$ denote the original dataset, where $N$ is the number of samples, $x_i$ is the text of the $i$-th sample, and $y_i$ is the one-hot representation of the observed label of the $i$-th sample. For the base model, we denote $\theta$ as the parameters of the pre-trained encoder and $\varphi$ as the parameters of the MLP classifier head with 2 fully connected layers. The standard optimization method tries to minimize the empirical risk by applying the cross-entropy loss:

$$\mathcal{L}=\{\ell_i\}_{i=1}^{N}=\{-y_i^{\top}\log\big(p(x_i;\theta,\varphi)\big)\}_{i=1}^{N}, \quad (1)$$

where $p(x;\theta,\varphi)$ denotes the softmax probability of the model output. We first warm up the model using $\mathcal{L}$ to make it capable of preliminary classification without overfitting noisy labels, and then perform SelfMix for the remaining epochs.
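To make this setup concrete, here is a minimal sketch (under our own assumptions about model name, hidden size, and class count, not the paper's exact configuration) of a pre-trained encoder with parameters $\theta$, a 2-layer MLP classifier head with parameters $\varphi$, and the per-sample cross-entropy of Eq. (1) that the warm-up stage minimizes.

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel


class SelfMixClassifier(nn.Module):
    """Sketch of the base model: PLM encoder (theta) + 2-layer MLP head (phi)."""

    def __init__(self, plm_name="bert-base-uncased", num_classes=4, hidden=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)          # parameters theta
        self.head = nn.Sequential(                                   # parameters phi
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, num_classes)
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                            # [CLS] representation
        return self.head(cls)                                        # class logits


def per_sample_ce(logits, labels):
    # ell_i = -y_i^T log p(x_i; theta, phi); kept per sample for the later GMM fit
    return F.cross_entropy(logits, labels, reduction="none")
```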
3.2 Sample Selection
On noisy data, deep neural networks preferentially learn simple and logical samples first and reduce their loss; that is, noisy samples tend to have a higher loss in the early stage (Zhang et al., 2021). Preliminary experiments show that the loss distributions of clean and noisy samples during training tend to follow two Gaussian distributions, where the losses of clean samples have a smaller mean. Taking advantage of this training phenomenon, we apply the widely used Gaussian Mixture Model (Arazo et al., 2019) to distinguish noisy samples by feeding it the per-sample losses. For IDN, noisy labels depend on both the input features and the underlying true labels, so the noise in each class is different, making the loss scales of different classes vary greatly. Relatively high-loss samples in a low-loss class may then also be treated as clean samples. So we compute a class-regularization loss instead of the standard cross-entropy loss, which can better model the distributions under IDN. For each class $c$, the set $\mathcal{L}_c=\{\ell_i \mid y_i=c,\, i\in[N]\}$ contains the cross-entropy loss values of all samples with label $c$; $\mu_c$ and $\sigma_c$ denote the arithmetic mean and standard deviation of $\mathcal{L}_c$, respectively. Our regularization loss has the following form:

$$\mathcal{L}'=\{\ell'_i\}_{i=1}^{N}=\left\{\frac{\ell_i-\mu_{y_i}}{\sigma_{y_i}}\right\}_{i=1}^{N}. \quad (2)$$
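A minimal sketch of this class-wise normalization, assuming the per-sample cross-entropy losses and the observed (possibly noisy) labels have already been collected over the training set; the function and argument names are ours, not from the released code.

```python
import torch


def class_regularized_loss(losses, labels, num_classes, eps=1e-8):
    """Normalize each per-sample loss by the mean and std of its observed class (Eq. 2)."""
    reg = torch.zeros_like(losses)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            mu = losses[mask].mean()
            sigma = losses[mask].std(unbiased=False)
            reg[mask] = (losses[mask] - mu) / (sigma + eps)
    return reg
```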
We feed the loss $\mathcal{L}$ ($\mathcal{L}'$ for IDN) to a 2-component GMM and use the Expectation-Maximization (EM) algorithm to fit the GMM to the observations. Let $w_i=p(g\mid\ell'_i)$ represent the probability of the $i$-th sample belonging to the Gaussian component with the smaller mean, $g$, which can also be considered the clean probability due to the small-loss theory (Arpit et al., 2017). By setting a threshold $\tau$ on the probability $w_i$, we can divide the original dataset $\mathcal{D}$ into a labeled set $\mathcal{X}$ and an unlabeled set $\mathcal{U}$, where the labels of samples that are more likely to be wrong are erased:

$$\mathcal{X}=\{(x_i, y_i) \mid x_i\in\mathcal{D},\, w_i\ge\tau\},\qquad \mathcal{U}=\{x_i \mid x_i\in\mathcal{D},\, w_i<\tau\}. \quad (3)$$
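The selection step can be sketched with scikit-learn's two-component GMM: fit it on the per-sample losses (class-regularized for IDN), read off each sample's posterior under the lower-mean component as the clean probability $w_i$, and split the dataset by the threshold $\tau$. The default threshold and GMM settings below are illustrative assumptions, not the paper's reported values.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def gmm_split(losses, tau=0.5):
    """Return indices of the likely-clean (labeled) and likely-noisy (unlabeled) samples."""
    losses = np.asarray(losses, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))      # component g with the smaller mean
    w = gmm.predict_proba(losses)[:, clean_comp]          # clean probability w_i
    labeled_idx = np.where(w >= tau)[0]                   # keep original labels (set X)
    unlabeled_idx = np.where(w < tau)[0]                  # erase labels (set U)
    return labeled_idx, unlabeled_idx, w
```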
3.3 Semi-supervised Self-training
To make semi-supervised learning work better, we first pre-process the unlabeled set. For the unlabeled set $\mathcal{U}$, the original label is most likely wrong and has been discarded. Therefore, we generate a soft label $\hat{y}$ by sharpening the model's predicted distribution, making the distribution more confident.
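The preview cuts off here, so the paper's exact sharpening formula is not shown; a common choice (temperature sharpening as popularized in semi-supervised learning, e.g., MixMatch) raises the predicted distribution to the power 1/T and renormalizes, with T a hypothetical temperature hyperparameter.

```python
import torch


def sharpen(probs, T=0.5):
    """Sharpen a predicted class distribution; lower T yields a more confident soft label."""
    powered = probs ** (1.0 / T)
    return powered / powered.sum(dim=-1, keepdim=True)
```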