A Survey of Methods for Addressing Class Imbalance
in Deep-Learning Based Natural Language Processing
Sophie Henning1,2   William Beluch1   Alexander Fraser2   Annemarie Friedrich1
1Bosch Center for Artificial Intelligence, Renningen, Germany
2Center for Information and Language Processing, LMU Munich, Germany
sophieelisabeth.henning|william.beluch@de.bosch.com
fraser@cis.lmu.de
annemarie.friedrich@de.bosch.com
Abstract
Many natural language processing (NLP) tasks
are naturally imbalanced, as some target cate-
gories occur much more frequently than others
in the real world. In such scenarios, current
NLP models tend to perform poorly on less
frequent classes. Addressing class imbalance
in NLP is an active research topic, yet, finding
a good approach for a particular task and im-
balance scenario is difficult.
In this survey, the first overview on class im-
balance in deep-learning based NLP, we first
discuss various types of controlled and real-
world class imbalance. Our survey then covers
approaches that have been explicitly proposed
for class-imbalanced NLP tasks or, originating
in the computer vision community, have been
evaluated on them. We organize the methods
by whether they are based on sampling, data
augmentation, choice of loss function, staged
learning, or model design. Finally, we discuss
open problems and how to move forward.
1 Introduction
Class imbalance is a major problem in natural lan-
guage processing (NLP), because target category
distributions are almost always skewed in NLP
tasks. As illustrated by Figure 1, this often leads to
poor performance on minority classes. Which cate-
gories matter is highly task-specific and may even
depend on the intended downstream use. Develop-
ing methods that improve model performance in
imbalanced data settings has been an active area for
decades (e.g., Bruzzone and Serpico, 1997; Japkowicz et al., 2000; Estabrooks and Japkowicz, 2001; Park and Zhang, 2002; Tan, 2005), and is recently gaining momentum in the context of maturing neural approaches (e.g., Buda et al., 2018; Kang et al., 2020; Li et al., 2020; Yang et al., 2020; Jiang et al., 2021; Spangher et al., 2021). The problem is exacerbated when classes overlap in the feature space
(Lin et al., 2019; Tian et al., 2020). For example, in patent classification, technical categories differ largely in frequency, and the concepts mentioned in the different categories can be very similar.
Figure 1: Class imbalance has a negative effect on performance, especially for minority classes, in a variety of NLP tasks. Upper charts show label count distributions, lower charts show test/dev F1 by training instance count (lighter colors indicate fewer test/dev instances). All models are based on transformers. (a) Single-label relation classification on TACRED (Zhou and Chen, 2021). (b) Hierarchical multi-label patent classification (Pujari et al., 2021). (c) Implicit discourse relation classification (PDTB) (Shi and Demberg, 2019). (d) UD dependency parsing using RoBERTa on EWT (Grünewald et al., 2021).
On a large variety of NLP tasks, transformer
models such as BERT (Vaswani et al.,2017;Devlin
et al.,2019) outperform both their neural predeces-
sors and traditional models (Liu et al.,2019;Xie
et al.,2020;Mathew et al.,2021). Performance
for minority classes is also often higher when us-
ing self-supervised pre-trained models (e.g., Li and
Scarton,2020;Niklaus et al.,2021), which paral-
lels findings from computer vision (Liu et al.,2022).
However, the advent of BERT has not solved the class imbalance problem in NLP, as illustrated by Figure 1. Tänzer et al. (2022) find that on synthetically imbalanced named entity datasets with majority classes having thousands of examples, at least 25 instances are required to predict a class at all, and 100 examples to learn to predict it with some accuracy.
Figure 2: Instance counts per label follow different distributions: examples of class imbalance types. (a) Step imbalance, µ = 0.4, ρ = 10. (b) Linear imbalance, ρ = 10. (c) Long-tailed distribution.
Despite the relevance of class imbalance to NLP,
related surveys only exist in the computer vision
domain (Johnson and Khoshgoftaar,2019b;Zhang
et al.,2021b). Incorporating methods addressing
class imbalance can lead to performance gains of
up to 20%. Yet, NLP research often overlooks how
important this is in practical applications, where
minority classes may be of special interest.
Our contribution is to draw a clear landscape
of approaches applicable to deep-learning (DL)
based NLP. We set out with a problem defini-
tion (Sec. 2), and then organize approaches by
whether they are based on sampling, data aug-
mentation, choice of loss function, staged learn-
ing, or model design (Sec. 3). Our extensive sur-
vey finds that re-sampling, data augmentation, and
changing the loss function can be relatively simple
ways to increase performance in class-imbalanced settings and are thus straightforward choices for NLP practitioners (we provide practical advice on identifying potentially applicable class imbalance methods in the Appendix, Figure 3). Staged learning and model modifications are promising research directions, but they tend to be costlier in terms of implementation and/or computation. Moreover, we discuss particular challenges of non-standard classification settings, e.g., imbalanced multi-label classification and catch-all classes, and provide useful connections to related computer vision work. Finally, we outline promising directions for future research (Sec. 4).
Scope of this survey.
We focus on approaches
evaluated on or developed for neural methods.
Work from “traditional” NLP (e.g., Tomanek and
Hahn,2009;Li et al.,2011;Li and Nenkova,2014;
Kunchukuttan and Bhattacharyya,2015) as well as
Natural Language Generation (e.g., Nishino et al.,
2020) and Automatic Speech Recognition (e.g.,
Winata et al., 2020; Deng et al., 2022) are not addressed in this survey. Other types of imbalances
such as differently sized data sets of subtasks in
continual learning (Ahrens et al.,2021) or imbal-
anced regression (Yang et al.,2021) are also beyond
the scope of this survey. In Sec. 3.5, we briefly
touch upon the related area of few-shot learning
(Wang et al.,2020c).
Related surveys.
We review imbalance-specific
data augmentation approaches in Sec. 3.2. Feng
et al. (2021) give a broader overview of data aug-
mentation in NLP, Hedderich et al. (2021) provide
an overview of low-resource NLP, and Ramponi
and Plank (2020) discuss neural domain adaptation.
2 Problem Definition
Class imbalance refers to a classification setting in which one or multiple classes (minority classes) are considerably less frequent than others (majority classes). More concrete definitions, e.g.,
regarding the relative share up to which a class
is seen as a minority class, depend on the task,
dataset and labelset size. Much research focuses on
improving all minority classes equally while main-
taining or at least monitoring majority class perfor-
mance (e.g., Huang et al.,2021;Yang et al.,2020;
Spangher et al.,2021). We next discuss prototypi-
cal types of imbalance (Sec. 2.1) and then compare
controlled and real-world settings (Sec. 2.2).
2.1 Types of Imbalance
To systematically investigate the effect of imbal-
ance, Buda et al. (2018) define two prototypical
types of label distributions, which we explain next.
Step imbalance is characterized by the fraction of minority classes, µ, and the size ratio between majority and minority classes, ρ. Larger ρ values indicate more imbalanced data sets. In prototypical step imbalance, if there are multiple minority classes, all of them are equally sized; if there are several majority classes, they also have equal size. Figure 2a shows a step-imbalanced distribution with 40% of the classes being minority classes and an imbalance ratio of ρ = 10. NLP datasets with a large catch-all class, as they often arise in sequence tagging (see Sec. 2.2) or in relevance judgments in retrieval models, frequently resemble step-imbalanced distributions. The ρ ratio has also been reported in NLP, e.g., by Li et al. (2020), although more task-specific imbalance measures have been proposed, e.g., for single-label text classification (Tian et al., 2020). In linear imbalance, class size grows linearly with the imbalance ratio ρ (see Figure 2b), as, e.g., in the naturally imbalanced SICK dataset for natural language inference (Marelli et al., 2014).
Long-tailed
label distributions (Figure 2c) are
conceptually similar to linear imbalance. They
contain many data points for a small number of
classes (head classes), but only very few for the
rest of the classes (tail classes). These distributions
are common in computer vision tasks like instance
segmentation (e.g., Gupta et al.,2019a), but also
in multi-label text classification, for example with
the goal of assigning clinical codes (Mullenbach
et al.,2018), patent categories (Pujari et al.,2021),
or news and research topics (Huang et al.,2021).
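As a minimal illustration, the following Python sketch constructs per-class instance counts for the three prototypical distribution types, mirroring the examples in Figure 2. The parameter names (mu, rho, n_majority) and the exponential decay used for the long-tailed case are illustrative choices, not definitions from the literature.

```python
import numpy as np

def step_imbalance(n_classes, mu, rho, n_majority=1000):
    """Step imbalance: a fraction mu of the classes are minority classes,
    each rho times smaller than the (equally sized) majority classes."""
    n_min = int(round(mu * n_classes))
    return np.array([n_majority] * (n_classes - n_min)
                    + [int(n_majority / rho)] * n_min)

def linear_imbalance(n_classes, rho, n_majority=1000):
    """Linear imbalance: class sizes grow linearly from n_majority/rho
    (smallest class) to n_majority (largest class)."""
    return np.linspace(n_majority / rho, n_majority, n_classes).astype(int)

def long_tailed(n_classes, n_head=1000, decay=0.5):
    """Long-tailed distribution: a few head classes with many instances,
    many tail classes with very few (exponential decay is one possible choice)."""
    return np.maximum(1, (n_head * decay ** np.arange(n_classes)).astype(int))

print(step_imbalance(10, mu=0.4, rho=10))  # cf. Figure 2a: six classes of 1000, four of 100
print(linear_imbalance(10, rho=10))        # cf. Figure 2b
print(long_tailed(10))                     # cf. Figure 2c
```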
2.2 Controlled vs. Real-World Class
Imbalance
Most real-world label distributions in NLP tasks do
not perfectly match the prototypical distributions
proposed by Buda et al. (2018). Yet, awareness
of these settings helps practitioners to select ap-
propriate methods for their data set or problem by
comparing distribution plots. Using synthetically
imbalanced data sets, researchers can control for
more experimental factors and investigate several
scenarios at once. However, evaluating on naturally
imbalanced data provides evidence of a method’s
real-world effectiveness. Some recent studies com-
bine both types of evaluation (e.g., Tian et al.,2021;
Subramanian et al.,2021;Jang et al.,2021).
Many NLP tasks require treating a large, often heterogeneous catch-all class that contains all instances that are not of interest to the task, while
the remaining (minority) classes are approximately
same-sized. Examples include the “Outside” label
in IOB sequence tagging, or tweets that mention
products in contexts that are irrelevant to the an-
notated categories (Adel et al.,2017). Such real-
world settings often roughly follow a step imbal-
ance distribution, with the additional difficulty of
the catch-all class.
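As a small invented example, counting IOB tags in a toy corpus shows the typical catch-all pattern: the "O" tag dominates, while the entity classes are roughly equally small.

```python
from collections import Counter

# Invented toy corpus of IOB-tagged sentences: most tokens carry the catch-all "O" tag.
tagged_corpus = [
    [("Bosch", "B-ORG"), ("is", "O"), ("based", "O"), ("in", "O"),
     ("Renningen", "B-LOC"), (".", "O")],
    [("The", "O"), ("survey", "O"), ("was", "O"), ("written", "O"),
     ("at", "O"), ("LMU", "B-ORG"), ("Munich", "I-ORG"), (".", "O")],
]

tag_counts = Counter(tag for sentence in tagged_corpus for _, tag in sentence)
print(tag_counts)  # Counter({'O': 10, 'B-ORG': 2, 'B-LOC': 1, 'I-ORG': 1})
```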
2.3 Evaluation
As accuracy and micro-averages mostly reflect ma-
jority class performance, choosing a good evalu-
ation setting and metric is non-trivial. It is also
highly task-dependent: in many NLP tasks, recog-
nizing one or all minority classes well is at least
equally important as majority class performance.
For instance, non-hateful tweets are much more frequent on Twitter (Waseem and Hovy, 2016), but
recognizing hateful content is the key motivation of
hate speech detection. Which classes matter may
even depend on downstream considerations, i.e.,
the same named entity tagger might be used in one
application where a majority class matters, and an-
other where minority classes matter more. Several evaluation metrics have been designed to account for class-imbalanced settings, but no de facto standard exists. For example, balanced accuracy (Brodersen et al., 2010) corresponds to the average of per-class recall scores. It is often useful to record performance on all classes and to report macro-averages, which treat all classes equally.
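As a minimal illustration of why metric choice matters, the following sketch compares accuracy, balanced accuracy, and macro-F1 on an invented binary test set with 90 majority-class and 10 minority-class instances; it assumes scikit-learn is available.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Invented test set: 90 majority-class (0) and 10 minority-class (1) instances.
y_true = [0] * 90 + [1] * 10
# A classifier that almost always predicts the majority class
# (only the last two minority instances are recognized).
y_pred = [0] * 98 + [1] * 2

print(accuracy_score(y_true, y_pred))             # 0.92: dominated by the majority class
print(balanced_accuracy_score(y_true, y_pred))    # 0.60: mean of per-class recall
print(f1_score(y_true, y_pred, average="macro"))  # ~0.65: both classes weighted equally
```

On this toy set, accuracy paints an overly optimistic picture, while balanced accuracy and macro-F1 expose the weak minority-class recall.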
3 Methods for Addressing Class
Imbalance in NLP
In this section, we survey methods that either
have been explicitly proposed to address class-
imbalance issues in NLP or that have been em-
pirically shown to be applicable for NLP problems.
We provide an overview of which methods are ap-
plicable to a selection of NLP tasks in Appendix A.
3.1 Re-Sampling
To increase the importance of minority instances
in training, the label distribution can be changed
by various sampling strategies. Sampling can ei-
ther be executed once or repeatedly during training
(Pouyanfar et al., 2018). In random oversampling (ROS), randomly chosen minority instances are duplicated, whereas in random undersampling (RUS), randomly chosen majority instances are
removed from the dataset. ROS can lead to overfit-
ting and increases training times. RUS, however,
discards potentially valuable data, but has been
shown to work well in language-modeling objec-
tives (Mikolov et al.,2013).
When applied in DL, ROS outperforms RUS
both in synthetic step and linear imbalance (Buda
et al.,2018) and in binary and multi-class English
and Korean text classification (Juuti et al.,2020;
Akhbardeh et al.,2021;Jang et al.,2021). More
flexible variants, e.g., re-sampling only a tunable
share of classes (Tepper et al.,2020) or interpo-
lating between the (imbalanced) data distribution
and an almost perfectly balanced distribution (Ari-
vazhagan et al.,2019), can also further improve
results. Class-aware sampling (CAS, Shen et al.,
2016), also referred to as class-balanced sampling,
first chooses a class, and then an instance from
this class. Performance-based re-sampling dur-
ing training, following the idea of Pouyanfar et al.
(2018), works well in multi-class text classification
(Akhbardeh et al.,2021).
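As a minimal sketch of how such strategies can be realized for a single-label PyTorch dataset, the following code implements ROS by index duplication and approximates CAS with PyTorch's WeightedRandomSampler, which draws instances with probability inversely proportional to their class frequency (in expectation equivalent to first picking a class and then an instance from it). Function and variable names are illustrative choices.

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def ros_indices(labels):
    """Random oversampling (ROS): duplicate randomly chosen minority
    instances until every class reaches the size of the largest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    indices = []
    for cls in counts:
        cls_idx = [i for i, y in enumerate(labels) if y == cls]
        # Draw additional indices with replacement up to the majority class size.
        extra = torch.randint(len(cls_idx), (largest - len(cls_idx),)).tolist()
        indices += cls_idx + [cls_idx[j] for j in extra]
    return indices  # e.g., pass to torch.utils.data.Subset(dataset, indices)

def class_aware_loader(dataset, labels, batch_size=32):
    """Class-aware / class-balanced sampling (CAS): draw each instance with
    probability inversely proportional to its class frequency."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

Note that ROS as sketched here fixes the duplicated indices once before training, whereas the sampler-based variant re-draws instances in every epoch.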
Issues in multi-label classification.
In multi-
label classification, label dependencies between
majority and minority classes complicate sampling
approaches, as over-sampling an instance with a
minority label may simultaneously amplify the ma-
jority class count (Charte et al.,2015;Huang et al.,
2021). CAS also suffers from this issue, and addi-
tionally introduces within-class imbalance, as in-
stances of one class are selected with different prob-
abilities depending on the co-assigned labels (Wu
et al.,2020). Effective sampling in such settings is
still an open issue. Existing approaches monitor the
class distributions during sampling (Charte et al.,
2015) or assign instance-based sampling probabili-
ties (Gupta et al.,2019b;Wu et al.,2020).
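A small invented example illustrates why over-sampling is less effective in multi-label settings: duplicating every instance that carries the minority label also inflates the co-occurring majority label, so the imbalance ratio shrinks less than intended.

```python
from collections import Counter

# Invented multi-label data: the majority label "topic_common" frequently
# co-occurs with the minority label "topic_rare".
data = [{"topic_common"}] * 95 + [{"topic_common", "topic_rare"}] * 5

def label_counts(instances):
    counts = Counter()
    for labels in instances:
        counts.update(labels)
    return counts

print(label_counts(data))
# Counter({'topic_common': 100, 'topic_rare': 5})

# Naive oversampling of every instance that contains the minority label:
oversampled = data + [{"topic_common", "topic_rare"}] * 45
print(label_counts(oversampled))
# Counter({'topic_common': 145, 'topic_rare': 50})
# The minority label catches up, but the majority label grows as well.
```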
3.2 Data Augmentation
Increasing the amount of minority class data dur-
ing corpus construction, e.g., by writing additional
examples or selecting examples to be labeled using
Active Learning, can mitigate the class imbalance
problem to some extent (Cho et al.,2020;Ein-Dor
et al.,2020). However, this is particularly labo-
rious in naturally imbalanced settings as it may
require finding “the needle in the haystack,” or may
lead to biased minority class examples, e.g., due
to collection via keyword queries. Synthetically
generating additional minority instances thus is a
promising direction. In this section, we survey data
augmentation methods that have been explicitly
proposed to mitigate class imbalance and that have
been evaluated in combination with DL.
Text augmentation
generates new natural lan-
guage instances of minority classes, ranging from
simple string-based manipulations such as syn-
onym replacements to Transformer-based gener-
ation. Easy Data Augmentation (
EDA
,Wei and
Zou,2019), which uses dictionary-based synonym
replacements, random insertion, random swap, and
random deletion, has been shown to work well
in class-imbalanced settings (Jiang et al.,2021;
Jang et al.,2021;Juuti et al.,2020). Juuti et al.
(2020) generate new minority class instances for
English binary text classification using EDA and
embedding-based synonym replacements, and by
adding a random majority class sentence to a mi-
nority class document. They also prompt the pre-
trained language model GPT-2 (Radford et al.,
2019) with a minority class instance to generate
new minority class samples. Tepper et al. (2020)
evaluate generation with GPT-2 on English multi-
class text classification datasets, coupled with a
flexible balancing policy (see Sec. 3.1).
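As a simple illustration of string-based text augmentation in imbalanced settings, the following sketch implements two of the EDA operations (random swap and random deletion; synonym replacement would additionally require a synonym resource such as a dictionary) and applies them only to minority-class instances. The (token_list, label) data format and the function names are illustrative choices rather than the EDA reference implementation.

```python
import random

def random_swap(tokens, n=1):
    """Randomly swap the positions of two tokens, n times (EDA-style)."""
    tokens = tokens.copy()
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Delete each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def augment_minority(dataset, minority_labels, n_new=1):
    """Add n_new perturbed copies of every minority-class instance.

    dataset: list of (token_list, label) pairs -- an illustrative format.
    """
    new_instances = []
    for tokens, label in dataset:
        if label in minority_labels:
            for _ in range(n_new):
                aug = random_deletion(random_swap(tokens, n=1), p=0.1)
                new_instances.append((aug, label))
    return dataset + new_instances
```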
Similarly, Gaspers et al. (2020) combine
machine-translation based text augmentation with
dataset balancing to build a multi-task model. Both
the main and auxiliary tasks are German intent clas-
sification. Only the training data for the latter is
balanced and enriched with synthetic minority in-
stances. In a long-tailed multi-label setting, Zhang
et al. (2022) learn an attention-based text augmen-
tation that augments instances with text segments
that are relevant to tail classes, leading to small im-
provements. In general, transferring methods such
as EDA or backtranslation to multi-label settings
is difficult (Zhang et al.,2022,2020;Tang et al.,
2020).
Hidden space augmentation
generates new in-
stance vectors that are not directly associated
with a particular natural language string, leverag-
ing the representations of real examples. Using
representation-based augmentations to tackle class
imbalance is not tied to DL. SMOTE (Chawla et al.,
2002), which interpolates minority instances with
randomly chosen examples from their K-nearest
neighbours, is popular in traditional machine learn-
ing (Fernández et al.,2018), but leads to mixed
results in DL-based NLP (Ek and Ghanimifard,
2019;Tran and Litman,2021;Wei et al.,2022).
Inspired by CutMix (Yun et al.,2019), which cuts
and pastes a single pixel region in an image, TextCut (Jiang et al., 2021) randomly replaces small
parts of the BERT representation of one instance
with those of the other. In binary and multi-class
text classification experiments, TextCut improves
over non-augmented BERT and EDA.
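The following sketch illustrates the general idea of hidden space augmentation on precomputed instance embeddings (e.g., sentence representations from an encoder): a SMOTE-style interpolation between a minority instance and one of its same-class nearest neighbours, and a TextCut-style operation that replaces a random slice of one representation with the corresponding slice of another. Tensor shapes and parameter names are illustrative, and the code is a simplified sketch rather than the published methods.

```python
import torch

def smote_like(x, neighbors, alpha=None):
    """Interpolate a minority instance with a randomly chosen same-class
    nearest neighbour (SMOTE-style, applied in hidden space).

    x:         (d,) embedding of a minority instance
    neighbors: (k, d) embeddings of its k nearest same-class neighbours
    """
    nn = neighbors[torch.randint(neighbors.size(0), (1,))].squeeze(0)
    lam = torch.rand(1) if alpha is None else torch.tensor(alpha)
    return x + lam * (nn - x)

def textcut_like(x, other, width=0.1):
    """Replace a random contiguous slice of x's hidden dimensions with the
    corresponding slice from another instance's representation
    (a TextCut-style region replacement, applied to vectors here)."""
    d = x.size(0)
    w = max(1, int(width * d))
    start = torch.randint(d - w + 1, (1,)).item()
    out = x.clone()
    out[start:start + w] = other[start:start + w]
    return out
```

Both operations leave the encoder untouched and only create additional training vectors for the classifier head.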
Good-enough example extrapolation (GE3, Wei, 2021) and REPRINT (Wei et al., 2022) also operate in the original representation space. To synthesize a new minority instance, GE3 adds the vec-