Adaptive Ranking-based Sample Selection
for Weakly Supervised Class-imbalanced Text Classification
Linxin Song1, Jieyu Zhang2, Tianxiang Yang1 and Masayuki Goto1
1Waseda University 2University of Washington
songlx.imse.gt@ruri.waseda.jp, jieyuz2@cs.washington.edu, you_tensyou@akane.waseda.jp, masagoto@waseda.jp
Abstract
To obtain a large number of training labels
inexpensively, researchers have recently
adopted the weak supervision (WS) paradigm,
which leverages labeling rules to synthesize
training labels rather than using individual
annotations to achieve competitive results
for natural language processing (NLP) tasks.
However, data imbalance is often overlooked
in applying the WS paradigm, despite being a
common issue in a variety of NLP tasks. To
address this challenge, we propose Adaptive
Ranking-based Sample Selection (ARS2), a
model-agnostic framework to alleviate the
data imbalance issue in the WS paradigm.
Specifically, it calculates a probabilistic margin score based on the output of the current
model to measure and rank the cleanliness of
each data point. Then, the ranked data are
sampled based on both class-wise and rule-aware ranking. In particular, the two sampling strategies correspond to our motivations: (1)
to train the model with balanced data batches
to reduce the data imbalance issue and (2)
to exploit the expertise of each labeling rule
for collecting clean samples. Experiments
on four text classification datasets with four
different imbalance ratios show that ARS2
outperformed the state-of-the-art imbalanced
learning and WS methods, leading to a 2%-57.8% improvement in F1-score. Our implementation can be found at https://github.com/JieyuZ2/wrench/blob/main/wrench/endmodel/ars2.py.
1 Introduction
Deep learning models rely heavily on high-quality, yet expensive, labeled data. Owing to this considerable cost, the weak supervision (WS) paradigm has increasingly been used to reduce human efforts (Ratner et al., 2016a; Zhang et al., 2021). This approach synthesizes training labels with labeling rules to significantly improve the efficiency of creating training sets, and it has achieved competitive results in natural language processing (NLP) (Yu et al., 2020; Ren et al., 2020; Rühling Cachay et al., 2021). However, existing methods leveraging the WS paradigm to perform NLP tasks mostly focus on reducing the noise in training labels brought by labeling rules, while ignoring the common and critical problem of data imbalance. In fact, in a preliminary experiment performed as part of the present work (Fig. 1), we found that the WS paradigm may amplify the imbalance ratio of the dataset because the synthesized training labels tend to have a more imbalanced distribution.

Figure 1: Comparison of class distribution between the ground-truth labels and labels produced by weak supervision (WS) on the TREC dataset. The uncovered piece represents the data not covered by any labeling rule in WS. It may be observed that WS amplified the class imbalance.
To address this issue, we propose ARS2, a general model-agnostic framework based on the WS paradigm. ARS2 consists of two stages: (1) warm-up, in which the noisy data are used to train the model and obtain a noise detector; and (2) continual training with adaptive ranking-based sample selection, in which we use the noise detector trained in the warm-up stage to evaluate the cleanliness of the data and sample the data according to the resulting ranking. We followed previous works (Ratner et al., 2016a; Ren et al., 2020; Zhang et al., 2022b) in using heuristic programmatic rules to annotate the data. In weakly supervised learning, researchers use a label model to aggregate the weak labels annotated by rules and estimate the probabilistic class distribution of each data point. In this work, we use a label model to integrate the weak labels given by the rules as pseudo-labels during the training process to obviate the need for manual labeling.
To select the samples most likely to be clean, we adopt a selection strategy based on small loss, a common criterion that has been verified to be effective in many situations (Jiang et al., 2018; Yu et al., 2019; Yao et al., 2020). Specifically, deep neural networks, which have a strong memorization ability (Wu et al., 2018; Wei et al., 2021), first memorize the labels of clean data and only then those of noisy data, under the assumption that clean data form the majority of a noisy dataset. Data with small loss can thus be regarded as clean examples with high probability. Inspired by this approach, we propose the probabilistic margin score (PMS) as a criterion to judge whether data are clean. Instead of using the confidence given by a model directly, a confidence margin is used for better performance (Ye et al., 2020). We also performed a comparative experiment on the use of the margin versus the direct use of confidence, as described in Sec. 3.3.
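For concreteness, a minimal sketch of one plausible margin score is shown below, computed from the model's softmax outputs; the exact PMS definition used by ARS2 is given in Sec. 2.4, so the particular formula here (probability of the pseudo-label minus the largest competing class probability) and the function name are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

def probabilistic_margin_score(probs: np.ndarray, pseudo_labels: np.ndarray) -> np.ndarray:
    """Illustrative margin score (an assumption, not the exact PMS of Sec. 2.4):
    probability assigned to the pseudo-label minus the largest probability among
    the remaining classes. probs has shape (N, C); pseudo_labels has shape (N,).
    A large positive margin suggests the model confidently agrees with the
    pseudo-label, so the example is more likely to be clean."""
    n = probs.shape[0]
    p_label = probs[np.arange(n), pseudo_labels]
    competitors = probs.copy()
    competitors[np.arange(n), pseudo_labels] = -np.inf  # exclude the pseudo-label class
    return p_label - competitors.max(axis=1)
```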
Sample selection based on weak labels can lead to severe class imbalance. Consequently, models trained on these imbalanced subsets can exhibit superior performance on majority classes but inferior performance on minority classes (Cui et al., 2019). A reweighted loss function can partially address this problem; however, performance remains limited by noisy labels, that is, data with majority-class features may be incorrectly annotated as minority-class data, which misleads the training process. Therefore, we propose a sample selection strategy based on class-wise ranking (CR) to address imbalanced data. Using this strategy, we can select relatively balanced sample batches for training and avoid the strong influence of the majority class.
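As an illustration of the idea (the exact CR procedure appears in Sec. 2.5), the sketch below ranks examples within each pseudo-class by their margin score and keeps at most a fixed budget per class, so the selected batch is approximately class-balanced; the function name and the per-class budget are hypothetical.

```python
import numpy as np

def class_wise_ranking_selection(scores, pseudo_labels, num_classes, per_class_budget):
    """Keep the top-`per_class_budget` highest-scoring examples of every
    pseudo-class, so no single (majority) class dominates the training batch."""
    selected = []
    for c in range(num_classes):
        members = np.where(pseudo_labels == c)[0]
        if members.size == 0:
            continue
        ranked = members[np.argsort(-scores[members])]  # highest margin first
        selected.append(ranked[:per_class_budget])
    return np.concatenate(selected) if selected else np.array([], dtype=int)
```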
To further exploit the expertise of labeling rules, we also propose another sample selection strategy, called rule-aware ranking (RR). In the WS paradigm, we use the aggregated labels as pseudo-labels and discard the weak labels. However, the annotations generated by the rules are likely to contain a considerable amount of valid information; for example, some rules yield a high proportion of correct results. The higher the PMS, the more likely the labeling result of a rule is to be close to the ground truth. Using this strategy, we can select batches of clean data for training and avoid the influence of noise.
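A rough sketch of this idea follows, under the assumption that for every rule we rank only the examples it fired on (did not abstain) and keep its top-scoring ones; the actual RR procedure is described in Sec. 2.6, and the abstain value of -1 follows the convention of Sec. 2.1.

```python
import numpy as np

def rule_aware_ranking_selection(scores, weak_labels, per_rule_budget):
    """For each labeling rule, rank the examples it covers by margin score and
    keep the top ones; examples chosen by several rules are deduplicated.
    weak_labels has shape (N, k), with -1 meaning the rule abstained."""
    selected = set()
    for r in range(weak_labels.shape[1]):
        covered = np.where(weak_labels[:, r] != -1)[0]
        if covered.size == 0:
            continue
        ranked = covered[np.argsort(-scores[covered])]  # highest margin first
        selected.update(ranked[:per_rule_budget].tolist())
    return np.array(sorted(selected), dtype=int)
```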
The primary contributions of this work are summarized as follows. (1) We propose a general, model-agnostic weakly supervised learning framework called ARS2 for imbalanced datasets. (2) We propose two reliable adaptive sampling strategies to address data imbalance issues. (3) We present the results of experiments on four benchmark datasets, demonstrating that ARS2 improves on the performance of existing imbalanced learning and weakly supervised learning methods by 2%-57.8% in terms of F1-score.
2 Weakly Supervised Class-imbalanced Text Classification
2.1 Problem Formulation
In this work, we study class-imbalanced text classification in a setting with weak supervision. Specifically, we consider an unlabeled dataset D consisting of N documents, each of which is denoted by x_i ∈ X. For each document x_i, the corresponding label y_i ∈ Y = {1, 2, ..., C} is unknown to us, whereas the class prior p(y) is given and highly imbalanced. Our goal is to learn a parameterized function f(·; θ) : X → ∆^{C−1}, where ∆^{C−1} is the probability simplex over the C classes, which outputs the class probability p(y|x) and can be used to classify documents during inference.
To address the lack of ground-truth training labels, we adopt the two-stage weak supervision paradigm (Ratner et al., 2016b; Zhang et al., 2021). In particular, we rely on k user-provided heuristic rules {r_i}_{i∈{1,...,k}} to provide weak labels. Each rule r_i is associated with a particular label y_{r_i} ∈ Y, and we denote by l_i the output of the rule r_i. It either assigns the associated label (l_i = y_{r_i}) to a given document or abstains (l_i = −1) on this example. Note that the user-provided rules could be noisy and conflict with one another. For a document x, we concatenate the output weak labels of the k rules l_1, ..., l_k as l_x. Throughout this work, we apply the weak labels output by the heuristic rules to train a text classifier.
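To make the setup concrete, the toy sketch below shows how keyword rules could produce the weak-label vector l_x for each document; the rules, class indices, and example texts are hypothetical and only illustrate the interface (each rule votes for its associated class or returns -1 to abstain), not rules from the benchmark datasets.

```python
ABSTAIN = -1

# Hypothetical keyword rules for a 3-class customer-support task.
RULES = [
    lambda text: 0 if "refund" in text.lower() else ABSTAIN,    # r1 -> class 0
    lambda text: 1 if "shipping" in text.lower() else ABSTAIN,  # r2 -> class 1
    lambda text: 2 if "password" in text.lower() else ABSTAIN,  # r3 -> class 2
]

def apply_rules(documents):
    """Return an N x k matrix of weak labels: entry (i, j) is rule j's vote
    on document i, or -1 if the rule abstains."""
    return [[rule(doc) for rule in RULES] for doc in documents]

docs = ["I want a refund for my order", "When will shipping start?"]
print(apply_rules(docs))  # [[0, -1, -1], [-1, 1, -1]]
```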
2.2 Aggregation of Weak Labels
Label models are used to aggregate weak labels under the weak supervision paradigm, and the aggregated labels are in turn used to train the desired end model in the next stage. Existing label models include Majority Voting (MV) and Probabilistic Graphical Models (PGMs) (Dawid and Skene, 1979; Ratner et al., 2019b; Fu et al., 2020). In this research, we use the PGM implemented by Ratner et al. (2019b) as our label model g(·), which can be described as

g(l_x) = P(y | l_x).   (1)

This treats l_x as a random variable for the label model. After modeling the relationship between the observed variable l_x and the unobserved variable y by Bayesian methods, the label model obtains the posterior distribution of y given l_x through an inference process such as expectation maximization or variational inference. Finally, we take the class with the maximum value of P(y | l_x) as the hard pseudo-label ỹ_x of x for later end-model training.
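As a sketch of this aggregation step, the snippet below uses the open-source Snorkel LabelModel, which implements the probabilistic graphical model of Ratner et al. (2019b); whether ARS2 calls it through this API or through the wrench wrappers is an implementation detail, so treat this as one plausible way to obtain the posteriors P(y | l_x) and the hard pseudo-labels, not as the paper's exact pipeline.

```python
import numpy as np
from snorkel.labeling.model import LabelModel  # PGM-based label model

# L: (N, k) matrix of weak labels, with -1 denoting abstain (Sec. 2.1).
L = np.array([[0, -1, 0],
              [-1, 1, 1],
              [2, 2, -1]])

label_model = LabelModel(cardinality=3, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=0)

posteriors = label_model.predict_proba(L)   # estimated P(y | l_x) per document
pseudo_labels = posteriors.argmax(axis=1)   # hard pseudo-labels for end-model training
```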
2.3 Adaptive Ranking-based Sample Selection
We propose an adaptive ranking-based sample selection approach to simultaneously solve the problems caused by data imbalance and those resulting from the noise generated by the application of procedural rules. First, the classification model f(·; θ) is warmed up with the pseudo-labels ỹ_x, which are used to train the model as a noise indicator that can discriminate noise in the next stage. Then, we continue training the warmed-up model with data sampled by adaptive data sampling strategies, namely class-wise ranking (CR) and rule-aware ranking (RR), both supported by the probabilistic margin score (PMS). The training procedure is summarized in Algorithm 1, and the proposed framework is illustrated in Figure 2.

Figure 2: Overview of ARS2. Our framework has two stages: (1) warm-up, which is used to let the model learn how to distinguish noisy data; (2) continual training with adaptive sampling, which is used to sample clean data. We adopt two different adaptive sampling strategies: class-wise ranking sampling and rule-aware ranking sampling.
Algorithm 1: ARS2
Input: Weakly labeled training data X; pseudo-labels ỹ; classification model f(·; θ).
// Warm up f(·; θ) with the weakly labeled data.
for t = 1, 2, ..., T do
    1. Sample a minibatch B from X.
    2. Update θ by Eq. (2) until early stopping.
// Continue training with sample selection.
for t = 1, 2, ..., T_s do
    1. Calculate the score for all x ∈ X (Sec. 2.4).
    2. Sample Q^(t) from X (Sec. 2.5).
    3. Sample S^(t) from X (Sec. 2.6).
    4. Update θ using U^(t) = Q^(t) ∪ S^(t) by Eq. (3).
Output: Final model f(·; θ).

Warm-up.
Inspired by Zheng et al. (2020), we observe that the prediction of a noisily trained classification model can be a good indicator of whether the label of a training data point is clean. Our method relies on a noise indicator trained during warm-up to determine whether each training data point is clean. However, a model with sufficient capacity (e.g., more parameters than training examples) can "memorize" each example, overfitting the training set and yielding poor generalization to the validation and test sets (Goodfellow et al., 2016). To prevent the model from overfitting the noisy data, we warm up the model f(·; θ) with early stopping (Dodge et al., 2020) and solve the optimization problem

min_θ (1/N) Σ_{x∈X} L(f(x; θ), ỹ_x),   (2)

where L denotes a loss function and ỹ_x is the pseudo-label aggregated by the label model. In this research, we do not restrict the choice of loss function; any loss function suitable for a multi-class classification task can be used.
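A minimal PyTorch sketch of this warm-up stage is given below: it minimizes Eq. (2) with cross-entropy on the pseudo-labels and stops early once a held-out loss stops improving. The use of a validation loader, the patience value, and the helper name are assumptions made for illustration; the paper only specifies that early stopping is applied.

```python
import copy
import torch
import torch.nn.functional as F

def warm_up(model, optimizer, train_loader, val_loader, max_epochs=20, patience=3):
    """Minimize Eq. (2) on the pseudo-labels, stopping early to keep the model
    from memorizing noisy examples."""
    best_val, best_state, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for x, pseudo_y in train_loader:
            loss = F.cross_entropy(model(x), pseudo_y)  # L(f(x; theta), pseudo-label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Early-stopping signal: loss on a held-out set (an assumption here).
        model.eval()
        with torch.no_grad():
            val_loss = sum(F.cross_entropy(model(x), y, reduction="sum").item()
                           for x, y in val_loader)
        if val_loss < best_val:
            best_val, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```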
Continual training with sample selection.
Noisy and imbalanced labels impair the predictive