only needs a single model and utilizes a mixup
training strategy based on the aggregated rep-
resentation from pre-trained models.
• We perform comprehensive experiments on three different types of text classification benchmarks under various noise settings, including the challenging instance-dependent noise, which is usually ignored in other works on textual data. The results demonstrate the superiority of our proposed method over strong baselines.
2 Related Work
Learning with Noisy Labels.
A direct yet effec-
tive idea to handle label noise is to find the noisy
samples and reduce their influence by resampling
or reweighting (Rolnick et al., 2017). Jiang et al. (2018) train an additional neural network to provide a curriculum that helps the StudentNet focus on samples whose labels are probably correct. Han et al. (2018) jointly train two deep neural networks and feed each model the top r% of samples with the lowest loss, as evaluated by the other model, in each mini-batch.
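To make this small-loss selection concrete, a minimal PyTorch-style sketch is given below; the networks net1/net2, the batch (x, y), and the keep_rate argument (the r% of the text) are illustrative assumptions rather than the original implementation.

```python
import torch
import torch.nn.functional as F

def small_loss_selection(net1, net2, x, y, keep_rate):
    # Each network keeps the keep_rate fraction of the mini-batch with the
    # smallest loss according to its peer, and is updated only on those samples.
    num_keep = int(keep_rate * x.size(0))
    loss1 = F.cross_entropy(net1(x), y, reduction="none")  # per-sample losses
    loss2 = F.cross_entropy(net2(x), y, reduction="none")
    idx1 = torch.argsort(loss2)[:num_keep]  # net2 selects small-loss samples for net1
    idx2 = torch.argsort(loss1)[:num_keep]  # net1 selects small-loss samples for net2
    update_loss1 = F.cross_entropy(net1(x[idx1]), y[idx1])
    update_loss2 = F.cross_entropy(net2(x[idx2]), y[idx2])
    return update_loss1, update_loss2
```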
Following Han et al. (2018), Yu et al. (2019) ex-
plore how disagreement can help the model. Some
researchers assume a transition from the ground-truth label distribution to the noisy label distribution and estimate a noise transition matrix to model it (Goldberger and Ben-Reuven, 2016). Northcutt et al. (2021) directly estimate the joint distribution matrix between noisy and true labels. Garg et al. (2021) use a fully connected layer to capture the distribution transition. However, most of these methods either need model ensembling or require cross-validation, which is time-consuming and requires maintaining multiple sets of model parameters.
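For concreteness, the class-conditional form of this transition-matrix idea (a standard formulation; the notation here is illustrative rather than taken from the cited works) models the observed noisy-label distribution as
\[
p(\tilde{y}=j \mid x) \;=\; \sum_{i=1}^{C} T_{ij}\, p(y=i \mid x), \qquad T_{ij} = p(\tilde{y}=j \mid y=i),
\]
so that a classifier $f(x) \approx p(y \mid x)$ can be trained on noisy labels through the corrected prediction $T^{\top} f(x)$, with $T$ either estimated in advance or learned jointly with the classifier.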
Some other works focus on designing a more
robust training strategy. Since DNNs trained with the cross-entropy loss tend to overfit noisy labels (Feng et al., 2021), some researchers redesign noise-robust loss functions (Wang et al., 2019b; Zhang and Sabuncu, 2018; Ghosh et al., 2017; Xu et al., 2019). When trained on noisy data, DNNs tend to learn from the clean data during an "early learning" phase before eventually memorizing the wrongly labeled data (Arpit et al., 2017; Zhang et al., 2021); based on this observation, Liu et al. (2020a) propose a simple regularization that capitalizes on early learning. Other works such as Xia et al. (2020) find that only part of the parameters are essential for generalization, which offers a new perspective on what difference exactly noisy labels make to the model's learning. These approaches treat all samples indiscriminately, so their performance is sometimes unsatisfactory under a high noise ratio.
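As one representative member of this family of noise-robust losses, the generalized cross entropy of Zhang and Sabuncu (2018) can be sketched in a few lines; the PyTorch code below is an illustrative implementation, and the value of q is only an example.

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    # GCE (Zhang and Sabuncu, 2018): L_q = (1 - p_y^q) / q.
    # q -> 0 recovers cross-entropy; q = 1 gives MAE, which is more
    # robust to mislabeled samples.
    probs = F.softmax(logits, dim=1)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # prob of the target class
    return ((1.0 - p_y.pow(q)) / q).mean()
```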
Some excellent work combines these two ideas
(Ding et al., 2018; Li et al., 2020). Garg et al. (2021) add an auxiliary noise model NM on top of the classifier to predict noisy labels and jointly train the classifier and the noise model through a de-noising
loss function. Cheng et al. (2021) progressively
sieve out corrupted examples and then leverage
semi-supervised learning.
Mixup Training.
Mixup training (Zhang et al.,
2018) is a widely used data-augmentation method
to alleviate memorization and sensitivity to adver-
sarial samples on visual data. It combines the in-
puts and targets of two random training samples to
generate augmented samples. However, applying mixup to textual data is challenging since linear interpolation of discrete inputs damages the semantic structure. Several works have explored mixup for text: Chen et al. (2020) propose to mix the hidden vectors in the last few encoder layers; Yoon et al. (2021) propose a new way to combine two texts, which can also be treated as a data-augmentation strategy. In this paper, we do not compare against these variants for the following reasons: (1) our EmbMix is simpler to use in practice, and there is little difference in the final performance of the various methods according to Chen et al. (2020); (2) some of the other methods require data augmentation, while EmbMix does not.
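For reference, a generic sketch of mixup applied at the representation level is shown below; it assumes fixed-size sentence representations and one-hot or soft label vectors, and is not the exact EmbMix implementation.

```python
import numpy as np
import torch

def mixup_representations(reps, targets, alpha=0.4):
    # Vanilla mixup (Zhang et al., 2018) applied to fixed-size representations:
    #   x_mix = lam * x_i + (1 - lam) * x_j,  y_mix = lam * y_i + (1 - lam) * y_j
    # reps:    (batch, dim) sentence-level representations
    # targets: (batch, num_classes) one-hot or soft label vectors
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(reps.size(0))
    mixed_reps = lam * reps + (1.0 - lam) * reps[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_reps, mixed_targets
```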
Proposed Method.
Since simply redesigning a robust loss function tends to perform poorly under a high noise ratio, we combine sample selection with robust training. Unlike previous work that needs model ensembling or cross-validation, we train a single network with dropout to reduce confirmation bias in self-training. We make the following improvements tailored to the characteristics of textual data: (1) Decision boundaries in image-classification tasks are relatively clear, whereas the main idea of the same text can vary with context, and sometimes there is even no absolutely correct label. We therefore iteratively fit a Gaussian Mixture Model (GMM) to the loss distribution and use the predicted soft labels to replace the labels of the fused data, rather than setting a threshold and arbitrarily discarding undesired samples at the beginning. (2) Unlike