Adaptive Ranking-based Sample Selection
for Weakly Supervised Class-imbalanced Text Classification
Linxin Song1, Jieyu Zhang2, Tianxiang Yang1 and Masayuki Goto1
1Waseda University 2University of Washington
songlx.imse.gt@ruri.waseda.jp, jieyuz2@cs.washington.edu, you_tensyou@akane.waseda.jp, masagoto@waseda.jp
Abstract
To obtain a large number of training labels
inexpensively, researchers have recently
adopted the weak supervision (WS) paradigm,
which leverages labeling rules to synthesize
training labels rather than using individual
annotations to achieve competitive results
for natural language processing (NLP) tasks.
However, data imbalance is often overlooked
in applying the WS paradigm, despite being a
common issue in a variety of NLP tasks. To
address this challenge, we propose Adaptive
Ranking-based Sample Selection (ARS2), a
model-agnostic framework to alleviate the
data imbalance issue in the WS paradigm.
Specifically, it calculates a probabilistic margin score based on the output of the current
model to measure and rank the cleanliness of
each data point. Then, the ranked data are
sampled based on both class-wise and rule-aware ranking. In particular, the two sampling strategies correspond to our motivations: (1)
to train the model with balanced data batches
to reduce the data imbalance issue and (2)
to exploit the expertise of each labeling rule
for collecting clean samples. Experiments
on four text classification datasets with four
different imbalance ratios show that ARS2
outperformed the state-of-the-art imbalanced
learning and WS methods, leading to a 2%-57.8% improvement in F1-score. Our implementation can be found at https://github.com/JieyuZ2/wrench/blob/main/wrench/endmodel/ars2.py.
1 Introduction
Deep learning models rely heavily on high-quality, yet expensive, labeled data. Owing to this considerable cost, the weak supervision (WS) paradigm has increasingly been used to reduce human efforts (Ratner et al., 2016a; Zhang et al., 2021). This approach synthesizes training labels with labeling rules to significantly improve the efficiency of creating training sets, and it has achieved competitive results in natural language processing (NLP) (Yu et al., 2020; Ren et al., 2020; Rühling Cachay et al., 2021). However, existing methods leveraging the WS paradigm to perform NLP tasks mostly focus on reducing the noise in training labels brought by labeling rules, while ignoring the common and critical problem of data imbalance. In fact, in a preliminary experiment performed as part of the present work (Fig. 1), we found that the WS paradigm may amplify the imbalance ratio of the dataset because the synthesized training labels tend to have a more imbalanced distribution.

Figure 1: Comparison of class distribution between the ground-truth labels and labels produced by weak supervision (WS) on the TREC dataset. The uncovered piece represents the data not covered by any labeling rule in WS. It may be observed that WS amplified the class imbalance.
To address this issue, we propose ARS2, a general model-agnostic framework based on the WS paradigm. ARS2 consists of two stages: (1) warm-up, in which the noisy data are used to train the model and obtain a noise detector; and (2) continual training with adaptive ranking-based sample selection, in which we use the noise detector trained in the warm-up stage to evaluate the cleanliness of the data and sample the data according to the resulting ranking. We followed previous works (Ratner et al., 2016a; Ren et al., 2020; Zhang et al., 2022b) in using heuristic programmatic rules to annotate the data. In weakly supervised learning, researchers use a label model to aggregate the weak labels annotated by rules and estimate the probabilistic class distribution of each data point. In this work, we use a label model to integrate the weak labels given by the rules as pseudo-labels during the training process to obviate the need for manual labeling.
To select the samples most likely to be clean, we adopt a selection strategy based on small loss, a common criterion that has been verified to be effective in many situations (Jiang et al., 2018; Yu et al., 2019; Yao et al., 2020). Specifically, deep neural networks, which have a strong memorization ability (Wu et al., 2018; Wei et al., 2021), first memorize the labels of clean data and only then those of noisy data, under the assumption that clean data form the majority of a noisy dataset. Data with small loss can thus be regarded as clean examples with high probability. Inspired by this approach, we propose the probabilistic margin score (PMS) as a criterion to judge whether data are clean. Instead of using the confidence given by a model directly, a confidence margin is used for better performance (Ye et al., 2020). We also performed a comparative experiment on the use of the margin versus the direct use of confidence, as described in Sec. 3.3.
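For concreteness, a minimal sketch of one plausible margin score is shown below, computed from the model's softmax outputs; the exact PMS definition used by ARS2 is given in Sec. 2.4, so the particular formula here (probability of the pseudo-label minus the largest competing class probability) and the function name are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

def probabilistic_margin_score(probs: np.ndarray, pseudo_labels: np.ndarray) -> np.ndarray:
    """Illustrative margin score (an assumption, not the exact PMS of Sec. 2.4):
    probability assigned to the pseudo-label minus the largest probability among
    the remaining classes. probs has shape (N, C); pseudo_labels has shape (N,).
    A large positive margin suggests the model confidently agrees with the
    pseudo-label, so the example is more likely to be clean."""
    n = probs.shape[0]
    p_label = probs[np.arange(n), pseudo_labels]
    competitors = probs.copy()
    competitors[np.arange(n), pseudo_labels] = -np.inf  # exclude the pseudo-label class
    return p_label - competitors.max(axis=1)
```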
Sample selection based on weak labels can lead to severe class imbalance. Consequently, models trained on these imbalanced subsets can exhibit superior performance on majority classes but inferior performance on minority classes (Cui et al., 2019). A reweighted loss function can partially address this problem; however, performance remains limited by noisy labels, that is, data with majority-class features may be incorrectly annotated as minority-class data, which misleads the training process. Therefore, we propose a sample selection strategy based on class-wise ranking (CR) to address imbalanced data. Using this strategy, we can select relatively balanced sample batches for training and avoid the strong influence of the majority class.
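As an illustration of the idea (the exact CR procedure appears in Sec. 2.5), the sketch below ranks examples within each pseudo-class by their margin score and keeps at most a fixed budget per class, so the selected batch is approximately class-balanced; the function name and the per-class budget are hypothetical.

```python
import numpy as np

def class_wise_ranking_selection(scores, pseudo_labels, num_classes, per_class_budget):
    """Keep the top-`per_class_budget` highest-scoring examples of every
    pseudo-class, so no single (majority) class dominates the training batch."""
    selected = []
    for c in range(num_classes):
        members = np.where(pseudo_labels == c)[0]
        if members.size == 0:
            continue
        ranked = members[np.argsort(-scores[members])]  # highest margin first
        selected.append(ranked[:per_class_budget])
    return np.concatenate(selected) if selected else np.array([], dtype=int)
```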
To further exploit the expertise of labeling rules, we also propose another sample selection strategy, called rule-aware ranking (RR). In the WS paradigm, we use the aggregated labels as pseudo-labels and discard the weak labels. However, the annotations generated by the rules are likely to contain a considerable amount of valid information; for example, some rules yield a high proportion of correct results. The higher the PMS, the more likely the labeling result of a rule is to be close to the ground truth. Using this strategy, we can select batches of clean data for training and avoid the influence of noise.
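A rough sketch of this idea follows, under the assumption that for every rule we rank only the examples it fired on (did not abstain) and keep its top-scoring ones; the actual RR procedure is described in Sec. 2.6, and the abstain value of -1 follows the convention of Sec. 2.1.

```python
import numpy as np

def rule_aware_ranking_selection(scores, weak_labels, per_rule_budget):
    """For each labeling rule, rank the examples it covers by margin score and
    keep the top ones; examples chosen by several rules are deduplicated.
    weak_labels has shape (N, k), with -1 meaning the rule abstained."""
    selected = set()
    for r in range(weak_labels.shape[1]):
        covered = np.where(weak_labels[:, r] != -1)[0]
        if covered.size == 0:
            continue
        ranked = covered[np.argsort(-scores[covered])]  # highest margin first
        selected.update(ranked[:per_rule_budget].tolist())
    return np.array(sorted(selected), dtype=int)
```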
The primary contributions of this work are summarized as follows. (1) We propose a general, model-agnostic weakly supervised learning framework called ARS2 for imbalanced datasets. (2) We propose two reliable adaptive sampling strategies to address data imbalance issues. (3) We present the results of experiments on four benchmark datasets, demonstrating that ARS2 improves on the performance of existing imbalanced learning and weakly supervised learning methods by 2%-57.8% in terms of F1-score.
2 Weakly Supervised Class-imbalanced Text Classification
2.1 Problem Formulation
In this work, we study class-imbalanced text classification in a setting with weak supervision. Specifically, we consider an unlabeled dataset D consisting of N documents, each of which is denoted by x_i ∈ X. For each document x_i, the corresponding label y_i ∈ Y = {1, 2, ..., C} is unknown to us, whereas the class prior p(y) is given and highly imbalanced. Our goal is to learn a parameterized function f(·; θ) : X → ∆^{C−1}, where ∆^{C−1} is the probability simplex over the C classes, which outputs the class probability p(y|x) and can be used to classify documents during inference.
To address the lack of ground-truth training labels, we adopt the two-stage weak supervision paradigm (Ratner et al., 2016b; Zhang et al., 2021). In particular, we rely on k user-provided heuristic rules {r_i}_{i∈{1,...,k}} to provide weak labels. Each rule r_i is associated with a particular label y_{r_i} ∈ Y, and we denote by l_i the output of the rule r_i. It either assigns the associated label (l_i = y_{r_i}) to a given document or abstains (l_i = −1) on this example. Note that the user-provided rules could be noisy and conflict with one another. For a document x, we concatenate the output weak labels of the k rules l_1, ..., l_k as l_x. Throughout this work, we apply the weak labels output by the heuristic rules to train a text classifier.
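To make the setup concrete, the toy sketch below shows how keyword rules could produce the weak-label vector l_x for each document; the rules, class indices, and example texts are hypothetical and only illustrate the interface (each rule votes for its associated class or returns -1 to abstain), not rules from the benchmark datasets.

```python
ABSTAIN = -1

# Hypothetical keyword rules for a 3-class customer-support task.
RULES = [
    lambda text: 0 if "refund" in text.lower() else ABSTAIN,    # r1 -> class 0
    lambda text: 1 if "shipping" in text.lower() else ABSTAIN,  # r2 -> class 1
    lambda text: 2 if "password" in text.lower() else ABSTAIN,  # r3 -> class 2
]

def apply_rules(documents):
    """Return an N x k matrix of weak labels: entry (i, j) is rule j's vote
    on document i, or -1 if the rule abstains."""
    return [[rule(doc) for rule in RULES] for doc in documents]

docs = ["I want a refund for my order", "When will shipping start?"]
print(apply_rules(docs))  # [[0, -1, -1], [-1, 1, -1]]
```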
2.2 Aggregation of Weak Labels
Label models are used to aggregate weak labels under the weak supervision paradigm, and the aggregated labels are in turn used to train the desired end model in the next stage. Existing label models include Majority Voting (MV) and Probabilistic Graphical Models (PGMs) (Dawid and Skene, 1979; Ratner et al., 2019b; Fu et al., 2020). In this research, we use the PGM implemented by Ratner et al. (2019b) as our label model g(·), which can be described as

g(l_x) = P(y | l_x).   (1)

This treats l_x as a random variable for the label model. After modeling the relationship between the observed variable l_x and the unobserved variable y by Bayesian methods, the label model obtains the posterior distribution of y given l_x through an inference process such as expectation maximization or variational inference. Finally, we take the class with the maximum value of P(y | l_x) as the hard pseudo-label ỹ_x of x for later end-model training.
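As a sketch of this aggregation step, the snippet below uses the open-source Snorkel LabelModel, which implements the probabilistic graphical model of Ratner et al. (2019b); whether ARS2 calls it through this API or through the wrench wrappers is an implementation detail, so treat this as one plausible way to obtain the posteriors P(y | l_x) and the hard pseudo-labels, not as the paper's exact pipeline.

```python
import numpy as np
from snorkel.labeling.model import LabelModel  # PGM-based label model

# L: (N, k) matrix of weak labels, with -1 denoting abstain (Sec. 2.1).
L = np.array([[0, -1, 0],
              [-1, 1, 1],
              [2, 2, -1]])

label_model = LabelModel(cardinality=3, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=0)

posteriors = label_model.predict_proba(L)   # estimated P(y | l_x) per document
pseudo_labels = posteriors.argmax(axis=1)   # hard pseudo-labels for end-model training
```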
2.3 Adaptive Ranking-based Sample Selection
We propose an adaptive ranking-based sample selection approach to simultaneously solve the problems caused by data imbalance and those resulting from the noise generated by the application of procedural rules. First, the classification model f(·; θ) is warmed up with the pseudo-labels ỹ_x, which are used to train the model as a noise indicator that can discriminate noise in the next stage. Then, we continue training the warmed-up model with data sampled by adaptive data sampling strategies, namely class-wise ranking (CR) and rule-aware ranking (RR), both supported by the probabilistic margin score (PMS). The training procedure is summarized in Algorithm 1, and the proposed framework is illustrated in Figure 2.

Figure 2: Overview of ARS2. Our framework has two stages: (1) warm-up, which is used to let the model learn how to distinguish noisy data; (2) continual training with adaptive sampling, which is used to sample clean data. We adopt two different adaptive sampling strategies: class-wise ranking sampling and rule-aware ranking sampling.
Algorithm 1: ARS2
Input: Weakly labeled training data X; pseudo-labels ỹ; classification model f(·; θ).
// Warm up f(·; θ) with the weakly labeled data.
for t = 1, 2, ..., T do
    1. Sample a minibatch B from X.
    2. Update θ by Eq. (2) until early stopping.
// Continue training with sample selection.
for t = 1, 2, ..., T_s do
    1. Calculate the score for all x ∈ X (Sec. 2.4).
    2. Sample Q^(t) from X (Sec. 2.5).
    3. Sample S^(t) from X (Sec. 2.6).
    4. Update θ using U^(t) = Q^(t) ∪ S^(t) by Eq. (3).
Output: Final model f(·; θ).

Warm-up.
Inspired by Zheng et al. (2020), we observe that the prediction of a noisily trained classification model can be a good indicator of whether the label of a training data point is clean. Our method relies on a noise indicator trained during warm-up to determine whether each training data point is clean. However, a model with sufficient capacity (e.g., more parameters than training examples) can "memorize" each example, overfitting the training set and yielding poor generalization to the validation and test sets (Goodfellow et al., 2016). To prevent the model from overfitting the noisy data, we warm up the model f(·; θ) with early stopping (Dodge et al., 2020) and solve the optimization problem

min_θ (1/N) Σ_{x∈X} L(f(x; θ), ỹ_x),   (2)

where L denotes a loss function and ỹ_x is the pseudo-label aggregated by the label model. In this research, we do not restrict the choice of loss function; any loss function suitable for a multi-class classification task can be used.
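A minimal PyTorch sketch of this warm-up stage is given below: it minimizes Eq. (2) with cross-entropy on the pseudo-labels and stops early once a held-out loss stops improving. The use of a validation loader, the patience value, and the helper name are assumptions made for illustration; the paper only specifies that early stopping is applied.

```python
import copy
import torch
import torch.nn.functional as F

def warm_up(model, optimizer, train_loader, val_loader, max_epochs=20, patience=3):
    """Minimize Eq. (2) on the pseudo-labels, stopping early to keep the model
    from memorizing noisy examples."""
    best_val, best_state, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for x, pseudo_y in train_loader:
            loss = F.cross_entropy(model(x), pseudo_y)  # L(f(x; theta), pseudo-label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Early-stopping signal: loss on a held-out set (an assumption here).
        model.eval()
        with torch.no_grad():
            val_loss = sum(F.cross_entropy(model(x), y, reduction="sum").item()
                           for x, y in val_loader)
        if val_loss < best_val:
            best_val, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```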
Continual training with sample selection.
Noisy and imbalanced labels impair the predictive