
entity-disjoint cross-validation, and scores a sentence based on the number of ensemble-member
class predictions that deviate from the given label, across all tokens in the sentence. Reiss et al. [17]
propose a similar approach to LED in token classification, which also relies on counting deviations
between token labels and the corresponding class predictions output by a large ensemble of diverse
models trained via cross-validation. Unlike our worst-token method, the methods of Wang et al. [22]
and Reiss et al. [17] are more computationally expensive due to ensembling many models, and do
not account for the confidence of individual predictions (as they are based on hard class predictions
rather than estimated class probabilities). Given that the LED performance of worst-token improves
with a more accurate token classifier, we expect worst-token to benefit from the reliable gains in
predictive accuracy that ensembling provides, in the same way that ensembling benefits the methods
of Wang et al. [22] and Reiss et al. [17]. We also tried entity-disjoint versus standard data splitting,
but did not observe benefits from the former variant (it often removes a significant number of entities).
Klie et al. [9] study various approaches for LED in token classification, but only consider scores for
individual tokens rather than entire sentences. Here we compare against some of the methods that
performed best in their study. The studies of Klie et al. [9], Northcutt et al. [14], and Kuan and
Mueller [10] indicate that label errors can be more effectively detected by considering the confidence
level of classifiers rather than only their hard class predictions. By instead depending on predicted
class probabilities for each token, our worst-token method appropriately accounts for classifier
confidence.
2 Methods
Typical token classification data is composed of many sentences (i.e. training instances), each of
which is split into individual tokens (words or sub-words) where each token is labeled as one of $K$
classes (i.e. entities in entity recognition). Given a sentence $x$, a trained token classification
model $\mathcal{M}(\cdot)$ outputs predicted probabilities $p = \mathcal{M}(x)$, where $p_{ij}$ is the probability that the
$i$th token in sentence $x$ belongs to class $j$. Throughout, we assume these probabilities are out-of-sample,
computed from a copy of the model that did not see $x$ during training (e.g. because $x$ is in the test set, or
cross-validation was utilized).
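One simple way to obtain such out-of-sample probabilities is via $K$-fold cross-validation. The following is a minimal sketch of this procedure; the `TokenClassifier` wrapper and the `sentences`/`token_labels` containers are hypothetical placeholders rather than a prescribed API.

```python
from sklearn.model_selection import KFold

# sentences: list of token sequences; token_labels: matching list of per-token labels.
# TokenClassifier is a hypothetical model wrapper exposing .fit() and .predict_proba(),
# where .predict_proba(sents) returns one (n_tokens x K) probability matrix per sentence.

def out_of_sample_probs(sentences, token_labels, n_splits=5, seed=0):
    """Predicted class probabilities for every token, from models that never saw that sentence."""
    probs = [None] * len(sentences)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, held_out_idx in kf.split(sentences):
        model = TokenClassifier()  # hypothetical wrapper around any token classifier
        model.fit([sentences[i] for i in train_idx],
                  [token_labels[i] for i in train_idx])
        for i in held_out_idx:
            probs[i] = model.predict_proba([sentences[i]])[0]  # (n_i x K) matrix for sentence i
    return probs
```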
Using $p$, we first consider evaluating the individual per-token labels. Here we apply effective
LED methods for standard classification settings [14] by simply treating each token as a separate
independent instance (ignoring which sentence it belongs to). Following Kuan and Mueller [10], we
compute a label quality score $q_i \in [0,1]$ for the $i$th token (assume it is labeled as class $k$) via one of
the following options:
• self-confidence (sc): $q_i = p_{ik}$, i.e. the predicted probability of the given label for this token.
• normalized margin (nm): $q_i = p_{ik} - p_{i\tilde{k}}$ with $\tilde{k} = \operatorname{argmax}_j \{p_{ij}\}$
• confidence-weighted entropy (cwe): $q_i = \frac{p_{ik}}{H(p_i)}$ where $H(p_i) = -\frac{1}{\log K} \sum_{j=1}^{K} p_{ij} \log(p_{ij})$
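As an illustration, here is a short sketch of how these three scores could be computed for one sentence, assuming `p` is its $n \times K$ out-of-sample probability matrix and `labels` its given per-token labels (variable names are ours, not a prescribed API):

```python
import numpy as np

def token_quality_scores(p, labels, method="sc", eps=1e-12):
    """Label quality score q_i per token; higher means the given label is more likely correct."""
    n, K = p.shape
    idx = np.arange(n)
    self_conf = p[idx, labels]                 # p_{ik}: predicted probability of the given label
    if method == "sc":                         # self-confidence
        return self_conf
    if method == "nm":                         # normalized margin
        return self_conf - p.max(axis=1)       # p_{ik} - p_{ik~} with k~ = argmax_j p_{ij}
    if method == "cwe":                        # confidence-weighted entropy
        H = -(p * np.log(p + eps)).sum(axis=1) / np.log(K)
        return self_conf / np.maximum(H, eps)  # guard against zero entropy
    raise ValueError(f"unknown method: {method}")
```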
Higher values of these label quality scores correspond to tokens whose label is more likely to be
correct [10]. We can alternatively evaluate the per-token labels via the Confident Learning algorithm
of Northcutt et al. [14], which classifies each token as correctly labeled or not ($b_i = 1$ if this token is
flagged as likely mislabeled, $b_i = 0$ otherwise) based on adaptive thresholds set with respect to per-class
classifier confidence levels.
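In practice, such flags can be obtained from the open-source cleanlab implementation of Confident Learning by treating all tokens as one flat classification dataset. A sketch, assuming the per-sentence probabilities and labels from above are held in lists (`probs_per_sentence` and `labels_per_sentence` are our placeholder names):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# probs_per_sentence: list of (n_i x K) matrices; labels_per_sentence: list of label arrays.
pred_probs = np.vstack(probs_per_sentence)    # stack all tokens: (total_tokens x K)
labels = np.concatenate(labels_per_sentence)  # (total_tokens,)

# Boolean flag per token: True where Confident Learning deems the label likely wrong.
flags = find_label_issues(labels=labels, pred_probs=pred_probs)
```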
For one sentence with $n$ word-level tokens, we thus have:
• $p$, an $n \times K$ matrix where $p_{ij}$ is the model's predicted probability that the $i$th token belongs to class $j$.
• $l = [l_1, \dots, l_n]$, where $l_i \in \{0, \dots, K-1\}$ is the given class label of the $i$th token.
• $q = [q_1, \dots, q_n]$, where $q_i$ is a label quality score for the $i$th token (one of the above options).
• $b = [b_1, \dots, b_n]$, where $b_i = 1$ if the $i$th token is flagged as potentially mislabeled, otherwise $b_i = 0$.
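To make this notation concrete, here is a toy instance of these quantities for a single 3-token sentence (all numbers are illustrative, with $q$ computed via self-confidence and $b$ a hypothetical Confident Learning output):

```python
import numpy as np

K = 3                                    # number of classes
p = np.array([[0.90, 0.05, 0.05],        # token 1: confident in class 0
              [0.10, 0.20, 0.70],        # token 2: confident in class 2
              [0.40, 0.45, 0.15]])       # token 3: uncertain between classes 0 and 1
l = np.array([0, 2, 0])                  # given per-token labels
q = p[np.arange(3), l]                   # self-confidence scores: [0.90, 0.70, 0.40]
b = np.array([0, 0, 1])                  # suppose only token 3 is flagged as mislabeled
```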
Recall that to properly verify whether a token is really mislabeled, a reviewer must read the full
sentence containing this token to understand the broader context. Thus the most efficient way to
review labels in a dataset is to prioritize inspection of those sentences most likely to contain a