Improving Imbalanced Text Classification with
Dynamic Curriculum Learning
Xulong Zhang, Jianzong Wang∗, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
∗Corresponding author: Jianzong Wang, jzwang@188.com.
Abstract—Recent advances in pre-trained language models have improved the performance of text classification tasks. However, little attention has been paid to the priority scheduling strategy for samples during training. Humans acquire knowledge gradually, from easy to complex concepts, and the difficulty of the same material can also vary significantly across learning stages. Inspired by these insights, we propose a novel self-paced dynamic curriculum learning (SPDCL) method for imbalanced text classification, which evaluates sample difficulty by both linguistic characteristics and model capacity. Meanwhile, rather than using static curriculum learning as in existing research, SPDCL reorders and resamples the training data by the difficulty criterion at an adaptive easy-to-hard pace. Extensive experiments on several classification tasks show the effectiveness of the SPDCL strategy, especially for imbalanced datasets.
Index Terms—efficient curriculum learning, imbalanced text classification, self-paced learning, data augmentation, nuclear-norm
I. INTRODUCTION
Recent attention has been devoted to various training tasks for large neural networks, including pre-training, deep model architectures, and model compression [1–5]. Imbalanced training data is common in real applications and poses challenges for text classifiers. In this work, we study how to leverage curriculum learning (CL) to tackle imbalanced text classification.
CL [6], which mimics the human learning process from easy to hard concepts, can improve the generalization ability and convergence rate of deep neural networks. However, such strategies have been largely ignored for imbalanced text classification tasks because difficulty metrics like entropy are static and fixed: they ignore the fact that the difficulty of a data sample varies throughout training. Therefore, the sample importance weights should be adaptive rather than fixed. To address this challenge, we propose the SPDCL framework, which uses both the nuclear norm and model capacity to evaluate the difficulty of training samples. The nuclear norm, the sum of a matrix's singular values, is commonly used to constrain a matrix to be low-rank. From linear algebra, a low-rank matrix retains a large amount of the data's information, which can be used to restore the data and extract features. In our experiments, the nuclear norm of a training sample first decreases and then increases until it fluctuates stably as the model moves from the pre-training to the fine-tuning stage.
That is to say, for each sample, training is a process of first removing noise and then learning the deeper semantic features of the scene-specific data. For samples whose noise points are far from the normal values, or whose feature points are well distinguished, the nuclear norm changes more drastically; otherwise, the nuclear norm changes more smoothly. Specifically, simple examples are easier to recognize: even with only slight changes in their sentence features across training stages, their nuclear norms change more drastically.
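To make the criterion concrete, the following minimal sketch (our own illustration, not the authors' released code; the tensor shapes and variable names are assumptions) computes the nuclear norm of a sample's token-representation matrix as the sum of its singular values and takes the magnitude of its change between two training stages as the difficulty signal:

```python
import torch

def nuclear_norm(h: torch.Tensor) -> torch.Tensor:
    """Nuclear norm of a (seq_len, hidden_size) token-representation matrix,
    i.e. the sum of its singular values."""
    return torch.linalg.svdvals(h).sum()

# Hypothetical token representations of one sample at two training stages.
h_early = torch.randn(128, 768)   # e.g. from the pre-trained checkpoint
h_late = torch.randn(128, 768)    # e.g. after some fine-tuning epochs

# Under the criterion above, a more drastic change indicates an easier sample.
difficulty_signal = (nuclear_norm(h_late) - nuclear_norm(h_early)).abs()
```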
For each training epoch, we dynamically use the corresponding nuclear norms to calculate a difficulty score for every example in the training set. In each training phase, we calculate the change in the nuclear norm of each sample relative to its value at the current stage and assign the sample a new difficulty level. After reordering the samples from simple to complex in each round, we resample the data so that the model first learns simple samples and then faces gradually increasing difficulty.
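One possible realization of this per-epoch reordering and resampling is sketched below; the linear pacing schedule, the function name, and the choice to resample with replacement are our assumptions rather than the paper's exact procedure:

```python
import numpy as np

def spdcl_epoch_indices(norm_changes: np.ndarray, epoch: int, total_epochs: int,
                        rng: np.random.Generator) -> np.ndarray:
    """Order samples from easy to hard and resample them for one epoch.

    A larger nuclear-norm change marks an easier sample, so sorting by the
    negated change yields an easy-to-hard ordering. The admitted pool grows
    linearly with the epoch index (a simple self-paced schedule)."""
    order = np.argsort(-norm_changes)                 # easy -> hard
    frac = min(1.0, (epoch + 1) / total_epochs)       # pace of the curriculum
    admitted = order[: max(1, int(frac * len(order)))]
    # Resample (with replacement) to keep a constant number of steps per epoch.
    return rng.choice(admitted, size=len(order), replace=True)
```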
Our contributions can be summarized as follows: 1) We propose an adaptive sentence difficulty criterion consisting of both linguistically motivated features and learning- and task-dependent features. 2) We propose a novel dynamic CL strategy that re-calculates the difficulty criterion and resamples the data at each epoch. 3) We demonstrate the effectiveness of SPDCL when fine-tuning language models on a series of text classification tasks, including multi-class classification, multi-label classification, and text matching.
II. METHODOLOGY
To elaborate our method, we use the BERT-Base model [7] as our example. The same method is also compatible with many other encoders, such as RoBERTa [8]. Following BERT, the output vector of the [CLS] token can be used for classification and other tasks, since it represents the core features of the whole sentence. We therefore add a linear layer on top of the [CLS] sentence representation and fine-tune the entire model with our CL strategy. However, before each epoch we calculate the difficulty metric over all tokens, rather than only the [CLS] token, to measure the difficulty score of each input sample.
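A minimal sketch of such a fine-tuning setup, assuming the HuggingFace transformers BERT-Base checkpoint (the class name and the choice to also return the full hidden states are ours), could look like this:

```python
import torch.nn as nn
from transformers import AutoModel

class BertClsClassifier(nn.Module):
    """BERT encoder with a single linear layer on the [CLS] representation."""

    def __init__(self, num_labels: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state           # (batch, seq_len, hidden_size)
        logits = self.classifier(hidden[:, 0])   # position 0 holds the [CLS] token
        return logits, hidden                    # hidden is reused for the difficulty metric
```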
Formally, when we input the text sequence $\{[\mathrm{CLS}], x_1, \ldots, x_m, [\mathrm{SEP}]\}$ or $\{[\mathrm{CLS}], x_{11}, \ldots, x_{1m}, [\mathrm{SEP}], x_{21}, \ldots, x_{2n}, [\mathrm{SEP}]\}$ into the model, we obtain the output token matrix $\{h_{\mathrm{CLS}}, h_{x_1}, \ldots, h_{x_m}, h_{\mathrm{SEP}}\}$ or $\{h_{\mathrm{CLS}}, h_{x_{11}}, \ldots, h_{x_{1m}}, h_{\mathrm{SEP}}, h_{x_{21}}, \ldots, h_{x_{2n}}, h_{\mathrm{SEP}}\}$. Then, at every epoch during training, we compute the nuclear norm of the whole token-level representation matrix for each sample.
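For instance, the per-sample scores for one epoch could be gathered as in the sketch below, which reuses the hypothetical BertClsClassifier from above; padding and masking details are simplified, and the dataloader format is an assumption:

```python
import torch

@torch.no_grad()
def epoch_nuclear_norms(model, dataloader, device="cpu"):
    """Nuclear norm of the full token-representation matrix of every sample,
    computed once per epoch with the hypothetical BertClsClassifier above."""
    model.eval()
    scores = []
    for input_ids, attention_mask in dataloader:
        _, hidden = model(input_ids.to(device), attention_mask.to(device))
        for h in hidden:                               # h: (seq_len, hidden_size)
            scores.append(torch.linalg.svdvals(h).sum().item())
    return scores
```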