Improving Imbalanced Text Classification with
Dynamic Curriculum Learning
Xulong Zhang, Jianzong Wang∗, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
∗Corresponding author: Jianzong Wang, jzwang@188.com.
Abstract—Recent advances in pre-trained language models have improved the performance of text classification tasks. However, little attention has been paid to the priority scheduling strategy for samples during training. Humans acquire knowledge gradually, from easy to complex concepts, and the difficulty of the same material can also vary significantly across learning stages. Inspired by these insights, we propose a novel self-paced dynamic curriculum learning (SPDCL) method for imbalanced text classification, which evaluates sample difficulty by both linguistic characteristics and model capacity. Meanwhile, rather than using static curriculum learning as in existing research, SPDCL reorders and resamples the training data by the difficulty criterion at an adaptive easy-to-hard pace. Extensive experiments on several classification tasks show the effectiveness of the SPDCL strategy, especially for imbalanced datasets.
Index Terms—efficient curriculum learning, imbalanced text classification, self-paced learning, data augmentation, nuclear-norm
I. INTRODUCTION
Recent attention has been devoted to various training tasks for large neural networks, including pre-training, deep model architectures, and model compression [1–5]. Imbalanced training data is common in real applications and poses challenges for text classifiers. In this work, we study how to leverage curriculum learning (CL) to tackle imbalanced text classification.
CL [6], which mimics the human learning process from easy to hard concepts, can improve the generalization ability and convergence rate of deep neural networks. However, such strategies have been largely ignored for imbalanced text classification tasks because difficulty metrics like entropy are static and fixed: they ignore the fact that the difficulty of a data sample varies throughout training. Therefore, the sample importance weights should be adaptive rather than fixed. To address this challenge, we propose the SPDCL framework, which uses both the nuclear norm and model capacity to evaluate the difficulty of training samples. The nuclear norm, the sum of a matrix's singular values, is commonly used to constrain a matrix to be low-rank. From linear algebra, a low-rank matrix retains a large amount of the data's information, which can be used to restore the data and extract features. In our experiments, the nuclear norm of a training sample first decreases and then increases until it fluctuates stably as the model moves from the pre-training to the fine-tuning stage.
That is to say, for each sample, training is a process of first removing noise and then learning the deeper semantic features of the scene-specific data. For samples whose noise points are far from the normal values, or whose feature points are well distinguished, the nuclear norm changes more drastically; otherwise, the nuclear norm changes more smoothly. Specifically, simple examples are easier to recognize: even with only slight changes in their sentence features across training stages, their nuclear norms change more drastically.
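To make the criterion concrete, the following minimal sketch (our own illustration, not the authors' released code; the tensor shapes and variable names are assumptions) computes the nuclear norm of a sample's token-representation matrix as the sum of its singular values and takes the magnitude of its change between two training stages as the difficulty signal:

```python
import torch

def nuclear_norm(h: torch.Tensor) -> torch.Tensor:
    """Nuclear norm of a (seq_len, hidden_size) token-representation matrix,
    i.e. the sum of its singular values."""
    return torch.linalg.svdvals(h).sum()

# Hypothetical token representations of one sample at two training stages.
h_early = torch.randn(128, 768)   # e.g. from the pre-trained checkpoint
h_late = torch.randn(128, 768)    # e.g. after some fine-tuning epochs

# Under the criterion above, a more drastic change indicates an easier sample.
difficulty_signal = (nuclear_norm(h_late) - nuclear_norm(h_early)).abs()
```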
For each training epoch, we dynamically use the corresponding nuclear norms to calculate a difficulty score for every example in the training set. In each training phase, we calculate the change in the nuclear norm of each sample relative to its value at the current stage and assign the sample a new difficulty level. After reordering the samples from simple to complex in each round, we resample the data so that the model first learns simple samples and then faces gradually increasing difficulty.
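One possible realization of this per-epoch reordering and resampling is sketched below; the linear pacing schedule, the function name, and the choice to resample with replacement are our assumptions rather than the paper's exact procedure:

```python
import numpy as np

def spdcl_epoch_indices(norm_changes: np.ndarray, epoch: int, total_epochs: int,
                        rng: np.random.Generator) -> np.ndarray:
    """Order samples from easy to hard and resample them for one epoch.

    A larger nuclear-norm change marks an easier sample, so sorting by the
    negated change yields an easy-to-hard ordering. The admitted pool grows
    linearly with the epoch index (a simple self-paced schedule)."""
    order = np.argsort(-norm_changes)                 # easy -> hard
    frac = min(1.0, (epoch + 1) / total_epochs)       # pace of the curriculum
    admitted = order[: max(1, int(frac * len(order)))]
    # Resample (with replacement) to keep a constant number of steps per epoch.
    return rng.choice(admitted, size=len(order), replace=True)
```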
Our contributions can be summarized as follows: 1) We propose an adaptive sentence difficulty criterion consisting of both linguistically motivated features and learning- and task-dependent features. 2) We propose a novel dynamic CL strategy that re-calculates the difficulty criterion and resamples the data at each epoch. 3) We demonstrate the effectiveness of SPDCL when fine-tuning language models on a series of text classification tasks, including multi-class classification, multi-label classification, and text matching.
II. METHODOLOGY
To elaborate our method, we use the BERT-Base model [7] as our example. The same method is also compatible with many other encoders, such as RoBERTa [8]. Following BERT, the output vector of the [CLS] token can be used for classification and other tasks, since it represents the core features of the whole sentence. We therefore add a linear layer on top of the [CLS] sentence representation and fine-tune the entire model with our CL strategy. However, before each epoch we calculate the difficulty metric over all tokens, rather than only the [CLS] token, to measure the difficulty score of each input sample.
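A minimal sketch of such a fine-tuning setup, assuming the HuggingFace transformers BERT-Base checkpoint (the class name and the choice to also return the full hidden states are ours), could look like this:

```python
import torch.nn as nn
from transformers import AutoModel

class BertClsClassifier(nn.Module):
    """BERT encoder with a single linear layer on the [CLS] representation."""

    def __init__(self, num_labels: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state           # (batch, seq_len, hidden_size)
        logits = self.classifier(hidden[:, 0])   # position 0 holds the [CLS] token
        return logits, hidden                    # hidden is reused for the difficulty metric
```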
Formally, when we input the text sequence $\{[\mathrm{CLS}], x_1, \ldots, x_m, [\mathrm{SEP}]\}$ or $\{[\mathrm{CLS}], x_{11}, \ldots, x_{1m}, [\mathrm{SEP}], x_{21}, \ldots, x_{2n}, [\mathrm{SEP}]\}$ into the model, we obtain the output token matrix $\{h_{\mathrm{CLS}}, h_{x_1}, \ldots, h_{x_m}, h_{\mathrm{SEP}}\}$ or $\{h_{\mathrm{CLS}}, h_{x_{11}}, \ldots, h_{x_{1m}}, h_{\mathrm{SEP}}, h_{x_{21}}, \ldots, h_{x_{2n}}, h_{\mathrm{SEP}}\}$. Then, at every epoch during training, we compute the nuclear norm of the whole token-level representation matrix for each sample.
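For instance, the per-sample scores for one epoch could be gathered as in the sketch below, which reuses the hypothetical BertClsClassifier from above; padding and masking details are simplified, and the dataloader format is an assumption:

```python
import torch

@torch.no_grad()
def epoch_nuclear_norms(model, dataloader, device="cpu"):
    """Nuclear norm of the full token-representation matrix of every sample,
    computed once per epoch with the hypothetical BertClsClassifier above."""
    model.eval()
    scores = []
    for input_ids, attention_mask in dataloader:
        _, hidden = model(input_ids.to(device), attention_mask.to(device))
        for h in hidden:                               # h: (seq_len, hidden_size)
            scores.append(torch.linalg.svdvals(h).sum().item())
    return scores
```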