A Survey of Methods for Addressing Class Imbalance
in Deep-Learning Based Natural Language Processing
Sophie Henning1,2   William Beluch1   Alexander Fraser2   Annemarie Friedrich1
1Bosch Center for Artificial Intelligence, Renningen, Germany
2Center for Information and Language Processing, LMU Munich, Germany
sophieelisabeth.henning|william.beluch@de.bosch.com
fraser@cis.lmu.de
annemarie.friedrich@de.bosch.com
Abstract
Many natural language processing (NLP) tasks
are naturally imbalanced, as some target cate-
gories occur much more frequently than others
in the real world. In such scenarios, current
NLP models tend to perform poorly on less
frequent classes. Addressing class imbalance
in NLP is an active research topic, yet, finding
a good approach for a particular task and im-
balance scenario is difficult.
In this survey, the first overview on class im-
balance in deep-learning based NLP, we first
discuss various types of controlled and real-
world class imbalance. Our survey then covers
approaches that have been explicitly proposed
for class-imbalanced NLP tasks or, originating
in the computer vision community, have been
evaluated on them. We organize the methods
by whether they are based on sampling, data
augmentation, choice of loss function, staged
learning, or model design. Finally, we discuss
open problems and how to move forward.
1 Introduction
Class imbalance is a major problem in natural lan-
guage processing (NLP), because target category
distributions are almost always skewed in NLP
tasks. As illustrated by Figure 1, this often leads to
poor performance on minority classes. Which cate-
gories matter is highly task-specific and may even
depend on the intended downstream use. Develop-
ing methods that improve model performance in
imbalanced data settings has been an active area for
decades (e.g., Bruzzone and Serpico, 1997; Japkowicz et al., 2000; Estabrooks and Japkowicz, 2001; Park and Zhang, 2002; Tan, 2005), and is recently gaining momentum in the context of maturing neural approaches (e.g., Buda et al., 2018; Kang et al., 2020; Li et al., 2020; Yang et al., 2020; Jiang et al., 2021; Spangher et al., 2021). The problem is exacerbated when classes overlap in the feature space
(Lin et al., 2019; Tian et al., 2020). For example, in patent classification, technical categories differ largely in frequency, and the concepts mentioned in the different categories can be very similar.
Figure 1: Class imbalance has a negative effect on performance, especially for minority classes, in a variety of NLP tasks. Upper charts show label count distributions, lower charts show test/dev F1 by training instance count (lighter colors indicate fewer test/dev instances). All models are based on transformers. (a) Single-label relation classification on TACRED (Zhou and Chen, 2021). (b) Hierarchical multi-label patent classification (Pujari et al., 2021). (c) Implicit discourse relation classification (PDTB) (Shi and Demberg, 2019). (d) UD dependency parsing using RoBERTa on EWT (Grünewald et al., 2021).
On a large variety of NLP tasks, transformer
models such as BERT (Vaswani et al.,2017;Devlin
et al.,2019) outperform both their neural predeces-
sors and traditional models (Liu et al.,2019;Xie
et al.,2020;Mathew et al.,2021). Performance
for minority classes is also often higher when us-
ing self-supervised pre-trained models (e.g., Li and
Scarton,2020;Niklaus et al.,2021), which paral-
lels findings from computer vision (Liu et al.,2022).
However, the advent of BERT has not solved the class imbalance problem in NLP, as illustrated by Figure 1. Tänzer et al. (2022) find that on synthetically imbalanced named entity datasets with majority classes having thousands of examples, at least 25 instances are required to predict a class at all, and 100 examples to learn to predict it with some accuracy.
Figure 2: Instance counts per label follow different distributions: examples of class imbalance types. (a) Step imbalance, µ = 0.4, ρ = 10. (b) Linear imbalance, ρ = 10. (c) Long-tailed distribution.
Despite the relevance of class imbalance to NLP,
related surveys only exist in the computer vision
domain (Johnson and Khoshgoftaar,2019b;Zhang
et al.,2021b). Incorporating methods addressing
class imbalance can lead to performance gains of
up to 20%. Yet, NLP research often overlooks how
important this is in practical applications, where
minority classes may be of special interest.
Our contribution is to draw a clear landscape
of approaches applicable to deep-learning (DL)
based NLP. We set out with a problem defini-
tion (Sec. 2), and then organize approaches by
whether they are based on sampling, data aug-
mentation, choice of loss function, staged learn-
ing, or model design (Sec. 3). Our extensive sur-
vey finds that re-sampling, data augmentation, and
changing the loss function can be relatively simple
ways to increase performance in class-imbalanced settings and are thus straightforward choices for NLP practitioners (we provide practical advice on identifying potentially applicable class imbalance methods in the Appendix, Figure 3). Staged learning and model modifications are promising research directions, but they tend to be costlier in terms of implementation and/or computation. Moreover, we discuss particular challenges of non-standard classification settings, e.g., imbalanced multi-label classification and catch-all classes, and provide useful connections to related computer vision work. Finally, we outline promising directions for future research (Sec. 4).
Scope of this survey.
We focus on approaches
evaluated on or developed for neural methods.
Work from “traditional” NLP (e.g., Tomanek and
Hahn,2009;Li et al.,2011;Li and Nenkova,2014;
Kunchukuttan and Bhattacharyya,2015) as well as
Natural Language Generation (e.g., Nishino et al.,
2020) and Automatic Speech Recognition (e.g.,
Winata et al., 2020; Deng et al., 2022) are not addressed in this survey. Other types of imbalances
such as differently sized data sets of subtasks in
continual learning (Ahrens et al.,2021) or imbal-
anced regression (Yang et al.,2021) are also beyond
the scope of this survey. In Sec. 3.5, we briefly
touch upon the related area of few-shot learning
(Wang et al.,2020c).
Related surveys.
We review imbalance-specific
data augmentation approaches in Sec. 3.2. Feng
et al. (2021) give a broader overview of data aug-
mentation in NLP, Hedderich et al. (2021) provide
an overview of low-resource NLP, and Ramponi
and Plank (2020) discuss neural domain adaptation.
2 Problem Definition
Class imbalance refers to a classification setting in which one or multiple classes (minority classes) are considerably less frequent than others (majority classes). More concrete definitions, e.g.,
regarding the relative share up to which a class
is seen as a minority class, depend on the task,
dataset and labelset size. Much research focuses on
improving all minority classes equally while main-
taining or at least monitoring majority class perfor-
mance (e.g., Huang et al.,2021;Yang et al.,2020;
Spangher et al.,2021). We next discuss prototypi-
cal types of imbalance (Sec. 2.1) and then compare
controlled and real-world settings (Sec. 2.2).
2.1 Types of Imbalance
To systematically investigate the effect of imbal-
ance, Buda et al. (2018) define two prototypical
types of label distributions, which we explain next.
Step imbalance is characterized by the fraction of minority classes, µ, and the size ratio between majority and minority classes, ρ. Larger ρ values indicate more imbalanced data sets. In prototypical step imbalance, if there are multiple minority classes, all of them are equally sized; if there are several majority classes, they also have equal size. Figure 2a shows a step-imbalanced distribution with 40% of the classes being minority classes and an imbalance ratio of ρ = 10. NLP datasets with a large catch-all class, as they often arise in sequence tagging (see Sec. 2.2) or in relevance judgments in retrieval models, frequently resemble step-imbalanced distributions. The ρ ratio has also been reported in NLP, e.g., by Li et al. (2020), although more task-specific imbalance measures have been proposed, e.g., for single-label text classification (Tian et al., 2020). In linear imbalance, class size grows linearly with the imbalance ratio ρ (see Figure 2b), as, e.g., in the naturally imbalanced SICK dataset for natural language inference (Marelli et al., 2014).
Long-tailed
label distributions (Figure 2c) are
conceptually similar to linear imbalance. They
contain many data points for a small number of
classes (head classes), but only very few for the
rest of the classes (tail classes). These distributions
are common in computer vision tasks like instance
segmentation (e.g., Gupta et al.,2019a), but also
in multi-label text classification, for example with
the goal of assigning clinical codes (Mullenbach
et al.,2018), patent categories (Pujari et al.,2021),
or news and research topics (Huang et al.,2021).
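As a minimal illustration, the following Python sketch constructs per-class instance counts for the three prototypical distribution types, mirroring the examples in Figure 2. The parameter names (mu, rho, n_majority) and the exponential decay used for the long-tailed case are illustrative choices, not definitions from the literature.

```python
import numpy as np

def step_imbalance(n_classes, mu, rho, n_majority=1000):
    """Step imbalance: a fraction mu of the classes are minority classes,
    each rho times smaller than the (equally sized) majority classes."""
    n_min = int(round(mu * n_classes))
    return np.array([n_majority] * (n_classes - n_min)
                    + [int(n_majority / rho)] * n_min)

def linear_imbalance(n_classes, rho, n_majority=1000):
    """Linear imbalance: class sizes grow linearly from n_majority/rho
    (smallest class) to n_majority (largest class)."""
    return np.linspace(n_majority / rho, n_majority, n_classes).astype(int)

def long_tailed(n_classes, n_head=1000, decay=0.5):
    """Long-tailed distribution: a few head classes with many instances,
    many tail classes with very few (exponential decay is one possible choice)."""
    return np.maximum(1, (n_head * decay ** np.arange(n_classes)).astype(int))

print(step_imbalance(10, mu=0.4, rho=10))  # cf. Figure 2a: six classes of 1000, four of 100
print(linear_imbalance(10, rho=10))        # cf. Figure 2b
print(long_tailed(10))                     # cf. Figure 2c
```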
2.2 Controlled vs. Real-World Class
Imbalance
Most real-world label distributions in NLP tasks do
not perfectly match the prototypical distributions
proposed by Buda et al. (2018). Yet, awareness
of these settings helps practitioners to select ap-
propriate methods for their data set or problem by
comparing distribution plots. Using synthetically
imbalanced data sets, researchers can control for
more experimental factors and investigate several
scenarios at once. However, evaluating on naturally
imbalanced data provides evidence of a method’s
real-world effectiveness. Some recent studies com-
bine both types of evaluation (e.g., Tian et al.,2021;
Subramanian et al.,2021;Jang et al.,2021).
Many NLP tasks require treating a large, often heterogeneous catch-all class that contains all instances that are not of interest to the task, while
the remaining (minority) classes are approximately
same-sized. Examples include the “Outside” label
in IOB sequence tagging, or tweets that mention
products in contexts that are irrelevant to the an-
notated categories (Adel et al.,2017). Such real-
world settings often roughly follow a step imbal-
ance distribution, with the additional difficulty of
the catch-all class.
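As a small invented example, counting IOB tags in a toy corpus shows the typical catch-all pattern: the "O" tag dominates, while the entity classes are roughly equally small.

```python
from collections import Counter

# Invented toy corpus of IOB-tagged sentences: most tokens carry the catch-all "O" tag.
tagged_corpus = [
    [("Bosch", "B-ORG"), ("is", "O"), ("based", "O"), ("in", "O"),
     ("Renningen", "B-LOC"), (".", "O")],
    [("The", "O"), ("survey", "O"), ("was", "O"), ("written", "O"),
     ("at", "O"), ("LMU", "B-ORG"), ("Munich", "I-ORG"), (".", "O")],
]

tag_counts = Counter(tag for sentence in tagged_corpus for _, tag in sentence)
print(tag_counts)  # Counter({'O': 10, 'B-ORG': 2, 'B-LOC': 1, 'I-ORG': 1})
```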
2.3 Evaluation
As accuracy and micro-averages mostly reflect ma-
jority class performance, choosing a good evalu-
ation setting and metric is non-trivial. It is also
highly task-dependent: in many NLP tasks, recog-
nizing one or all minority classes well is at least
equally important as majority class performance.
For instance, non-hateful tweets are much more frequent on Twitter (Waseem and Hovy, 2016), but
recognizing hateful content is the key motivation of
hate speech detection. Which classes matter may
even depend on downstream considerations, i.e.,
the same named entity tagger might be used in one
application where a majority class matters, and an-
other where minority classes matter more. Several evaluation metrics have been designed to account for class-imbalanced settings, but no de facto standard exists. For example, balanced accuracy (Brodersen et al., 2010) corresponds to the average of per-class recall scores. It is often useful to record performance on all classes and to report macro-averages, which treat all classes equally.
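As a minimal illustration of why metric choice matters, the following sketch compares accuracy, balanced accuracy, and macro-F1 on an invented binary test set with 90 majority-class and 10 minority-class instances; it assumes scikit-learn is available.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Invented test set: 90 majority-class (0) and 10 minority-class (1) instances.
y_true = [0] * 90 + [1] * 10
# A classifier that almost always predicts the majority class
# (only the last two minority instances are recognized).
y_pred = [0] * 98 + [1] * 2

print(accuracy_score(y_true, y_pred))             # 0.92: dominated by the majority class
print(balanced_accuracy_score(y_true, y_pred))    # 0.60: mean of per-class recall
print(f1_score(y_true, y_pred, average="macro"))  # ~0.65: both classes weighted equally
```

On this toy set, accuracy paints an overly optimistic picture, while balanced accuracy and macro-F1 expose the weak minority-class recall.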
3 Methods for Addressing Class
Imbalance in NLP
In this section, we survey methods that either
have been explicitly proposed to address class-
imbalance issues in NLP or that have been em-
pirically shown to be applicable for NLP problems.
We provide an overview of which methods are ap-
plicable to a selection of NLP tasks in Appendix A.
3.1 Re-Sampling
To increase the importance of minority instances
in training, the label distribution can be changed
by various sampling strategies. Sampling can ei-
ther be executed once or repeatedly during training
(Pouyanfar et al., 2018). In random oversampling (ROS), randomly chosen minority instances are duplicated, whereas in random undersampling (RUS), randomly chosen majority instances are
removed from the dataset. ROS can lead to overfit-
ting and increases training times. RUS, however,
discards potentially valuable data, but has been
shown to work well in language-modeling objec-
tives (Mikolov et al.,2013).
When applied in DL, ROS outperforms RUS
both in synthetic step and linear imbalance (Buda
et al.,2018) and in binary and multi-class English
and Korean text classification (Juuti et al.,2020;
Akhbardeh et al.,2021;Jang et al.,2021). More
flexible variants, e.g., re-sampling only a tunable
share of classes (Tepper et al.,2020) or interpo-
lating between the (imbalanced) data distribution
and an almost perfectly balanced distribution (Ari-
vazhagan et al.,2019), can also further improve
results. Class-aware sampling (CAS, Shen et al.,
2016), also referred to as class-balanced sampling,
first chooses a class, and then an instance from
this class. Performance-based re-sampling dur-
ing training, following the idea of Pouyanfar et al.
(2018), works well in multi-class text classification
(Akhbardeh et al.,2021).
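As a minimal sketch of how such strategies can be realized for a single-label PyTorch dataset, the following code implements ROS by index duplication and approximates CAS with PyTorch's WeightedRandomSampler, which draws instances with probability inversely proportional to their class frequency (in expectation equivalent to first picking a class and then an instance from it). Function and variable names are illustrative choices.

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def ros_indices(labels):
    """Random oversampling (ROS): duplicate randomly chosen minority
    instances until every class reaches the size of the largest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    indices = []
    for cls in counts:
        cls_idx = [i for i, y in enumerate(labels) if y == cls]
        # Draw additional indices with replacement up to the majority class size.
        extra = torch.randint(len(cls_idx), (largest - len(cls_idx),)).tolist()
        indices += cls_idx + [cls_idx[j] for j in extra]
    return indices  # e.g., pass to torch.utils.data.Subset(dataset, indices)

def class_aware_loader(dataset, labels, batch_size=32):
    """Class-aware / class-balanced sampling (CAS): draw each instance with
    probability inversely proportional to its class frequency."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

Note that ROS as sketched here fixes the duplicated indices once before training, whereas the sampler-based variant re-draws instances in every epoch.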
Issues in multi-label classification.
In multi-
label classification, label dependencies between
majority and minority classes complicate sampling
approaches, as over-sampling an instance with a
minority label may simultaneously amplify the ma-
jority class count (Charte et al.,2015;Huang et al.,
2021). CAS also suffers from this issue, and addi-
tionally introduces within-class imbalance, as in-
stances of one class are selected with different prob-
abilities depending on the co-assigned labels (Wu
et al.,2020). Effective sampling in such settings is
still an open issue. Existing approaches monitor the
class distributions during sampling (Charte et al.,
2015) or assign instance-based sampling probabili-
ties (Gupta et al.,2019b;Wu et al.,2020).
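A small invented example illustrates why over-sampling is less effective in multi-label settings: duplicating every instance that carries the minority label also inflates the co-occurring majority label, so the imbalance ratio shrinks less than intended.

```python
from collections import Counter

# Invented multi-label data: the majority label "topic_common" frequently
# co-occurs with the minority label "topic_rare".
data = [{"topic_common"}] * 95 + [{"topic_common", "topic_rare"}] * 5

def label_counts(instances):
    counts = Counter()
    for labels in instances:
        counts.update(labels)
    return counts

print(label_counts(data))
# Counter({'topic_common': 100, 'topic_rare': 5})

# Naive oversampling of every instance that contains the minority label:
oversampled = data + [{"topic_common", "topic_rare"}] * 45
print(label_counts(oversampled))
# Counter({'topic_common': 145, 'topic_rare': 50})
# The minority label catches up, but the majority label grows as well.
```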
3.2 Data Augmentation
Increasing the amount of minority class data dur-
ing corpus construction, e.g., by writing additional
examples or selecting examples to be labeled using
Active Learning, can mitigate the class imbalance
problem to some extent (Cho et al.,2020;Ein-Dor
et al.,2020). However, this is particularly labo-
rious in naturally imbalanced settings as it may
require finding “the needle in the haystack,” or may
lead to biased minority class examples, e.g., due
to collection via keyword queries. Synthetically
generating additional minority instances thus is a
promising direction. In this section, we survey data
augmentation methods that have been explicitly
proposed to mitigate class imbalance and that have
been evaluated in combination with DL.
Text augmentation
generates new natural lan-
guage instances of minority classes, ranging from
simple string-based manipulations such as syn-
onym replacements to Transformer-based gener-
ation. Easy Data Augmentation (
EDA
,Wei and
Zou,2019), which uses dictionary-based synonym
replacements, random insertion, random swap, and
random deletion, has been shown to work well
in class-imbalanced settings (Jiang et al.,2021;
Jang et al.,2021;Juuti et al.,2020). Juuti et al.
(2020) generate new minority class instances for
English binary text classification using EDA and
embedding-based synonym replacements, and by
adding a random majority class sentence to a mi-
nority class document. They also prompt the pre-
trained language model GPT-2 (Radford et al.,
2019) with a minority class instance to generate
new minority class samples. Tepper et al. (2020)
evaluate generation with GPT-2 on English multi-
class text classification datasets, coupled with a
flexible balancing policy (see Sec. 3.1).
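As a simple illustration of string-based text augmentation in imbalanced settings, the following sketch implements two of the EDA operations (random swap and random deletion; synonym replacement would additionally require a synonym resource such as a dictionary) and applies them only to minority-class instances. The (token_list, label) data format and the function names are illustrative choices rather than the EDA reference implementation.

```python
import random

def random_swap(tokens, n=1):
    """Randomly swap the positions of two tokens, n times (EDA-style)."""
    tokens = tokens.copy()
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Delete each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def augment_minority(dataset, minority_labels, n_new=1):
    """Add n_new perturbed copies of every minority-class instance.

    dataset: list of (token_list, label) pairs -- an illustrative format.
    """
    new_instances = []
    for tokens, label in dataset:
        if label in minority_labels:
            for _ in range(n_new):
                aug = random_deletion(random_swap(tokens, n=1), p=0.1)
                new_instances.append((aug, label))
    return dataset + new_instances
```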
Similarly, Gaspers et al. (2020) combine
machine-translation based text augmentation with
dataset balancing to build a multi-task model. Both
the main and auxiliary tasks are German intent clas-
sification. Only the training data for the latter is
balanced and enriched with synthetic minority in-
stances. In a long-tailed multi-label setting, Zhang
et al. (2022) learn an attention-based text augmen-
tation that augments instances with text segments
that are relevant to tail classes, leading to small im-
provements. In general, transferring methods such
as EDA or backtranslation to multi-label settings
is difficult (Zhang et al.,2022,2020;Tang et al.,
2020).
Hidden space augmentation
generates new in-
stance vectors that are not directly associated
with a particular natural language string, leverag-
ing the representations of real examples. Using
representation-based augmentations to tackle class
imbalance is not tied to DL. SMOTE (Chawla et al.,
2002), which interpolates minority instances with
randomly chosen examples from their K-nearest
neighbours, is popular in traditional machine learn-
ing (Fernández et al.,2018), but leads to mixed
results in DL-based NLP (Ek and Ghanimifard,
2019;Tran and Litman,2021;Wei et al.,2022).
Inspired by CutMix (Yun et al.,2019), which cuts
and pastes a single pixel region in an image, TextCut (Jiang et al., 2021) randomly replaces small
parts of the BERT representation of one instance
with those of the other. In binary and multi-class
text classification experiments, TextCut improves
over non-augmented BERT and EDA.
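The following sketch illustrates the general idea of hidden space augmentation on precomputed instance embeddings (e.g., sentence representations from an encoder): a SMOTE-style interpolation between a minority instance and one of its same-class nearest neighbours, and a TextCut-style operation that replaces a random slice of one representation with the corresponding slice of another. Tensor shapes and parameter names are illustrative, and the code is a simplified sketch rather than the published methods.

```python
import torch

def smote_like(x, neighbors, alpha=None):
    """Interpolate a minority instance with a randomly chosen same-class
    nearest neighbour (SMOTE-style, applied in hidden space).

    x:         (d,) embedding of a minority instance
    neighbors: (k, d) embeddings of its k nearest same-class neighbours
    """
    nn = neighbors[torch.randint(neighbors.size(0), (1,))].squeeze(0)
    lam = torch.rand(1) if alpha is None else torch.tensor(alpha)
    return x + lam * (nn - x)

def textcut_like(x, other, width=0.1):
    """Replace a random contiguous slice of x's hidden dimensions with the
    corresponding slice from another instance's representation
    (a TextCut-style region replacement, applied to vectors here)."""
    d = x.size(0)
    w = max(1, int(width * d))
    start = torch.randint(d - w + 1, (1,)).item()
    out = x.clone()
    out[start:start + w] = other[start:start + w]
    return out
```

Both operations leave the encoder untouched and only create additional training vectors for the classifier head.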
Good-enough example extrapolation (GE3, Wei, 2021) and REPRINT (Wei et al., 2022) also operate in the original representation space. To synthesize a new minority instance, GE3 adds the vec-