Detecting Label Errors in Token Classification Data
Wei-Chen Wang
wangeric@mit.edu
Cleanlab, MIT
Jonas Mueller
jonas@cleanlab.ai
Cleanlab

arXiv:2210.03920v1 [cs.CL] 8 Oct 2022. Preprint. Under review.
Abstract
Mislabeled examples are a common issue in real-world data, particularly for tasks
like token classification where many labels must be chosen on a fine-grained basis.
Here we consider the task of finding sentences that contain label errors in token
classification datasets. We study 11 different straightforward methods that score
tokens/sentences based on the predicted class probabilities output by any token
classification model (trained via any procedure). In precision-recall evaluations
based on real-world label errors in entity recognition data from CoNLL-2003, we
identify a simple and effective method that consistently detects those sentences
containing label errors when applied with different token classification models.
1 Introduction
It has recently come to light that many supervised learning datasets contain numerous incorrectly labeled examples [13]. To efficiently improve the quality of such data, Label Error Detection (LED) has emerged as a task of interest [10, 14, 11], in which algorithms flag examples whose labels are likely wrong for reviewers to inspect/correct. This may be done by means of a score for each example which reflects its estimated label quality. A useful score ranks mislabeled examples higher than others, allowing reviewers to much more efficiently identify the label errors in a dataset.
This paper considers LED for token classification tasks (such as entity recognition) in which each token in a sentence has been given its own class label. While it is possible to score the labels of individual tokens, reviewing a candidate token flagged as potentially mislabeled requires looking at the entire sentence to understand the broader context. Here we propose worst-token, a method to score sentences based on the likelihood that they contain some mislabeled tokens, such that sentences can be effectively ranked for efficient label review.^1 We evaluate the LED performance of this approach and others on real-world data with naturally occurring label errors, unlike many past LED evaluations based on synthetically-introduced label errors [1, 11, 7], for which conclusions may differ from real-world errors [10, 8, 23].

^1 Code to run our proposed method is available here: https://github.com/cleanlab/cleanlab
Code to reproduce our results: https://github.com/cleanlab/token-label-error-benchmarks
Related Work. Extensive research has been conducted on standard classification with noisy labels [19, 23, 2, 11, 12, 1]. Our work builds on label quality scoring methods for classification data studied by Northcutt et al. [14], Kuan and Mueller [10], which merely depend on predictions from a trained multiclass classification model. These methods are straightforward to implement and broadly applicable, being compatible with any classifier and training procedure, as is our worst-token method.
Only a few prior works have studied label error detection for token classification tasks specifically [22, 17, 9]. Also aiming to score the overall label quality of an entire sentence like us, Wang et al. [22] propose CrossWeigh, which trains a large ensemble of token classification models with entity-disjoint cross-validation and scores a sentence based on the number of ensemble-member class predictions that deviate from the given label, across all tokens in the sentence. Reiss et al. [17] propose a similar approach to LED in token classification, which also relies on counting deviations between token labels and corresponding class predictions output by a large ensemble of diverse models trained via cross-validation. Unlike our worst-token method, the methods of Wang et al. [22], Reiss et al. [17] are more computationally complex due to ensembling many models, and do not account for the confidence of individual predictions (as they are based on hard class predictions rather than estimated class probabilities). Given that the LED performance of worst-token improves with a more accurate token classifier, we expect worst-token to benefit from ensembling's reliable improvement in predictive accuracy in the same way that ensembling benefits the methods of Wang et al. [22], Reiss et al. [17]. We also tried entity-disjoint vs. standard data splitting, but did not observe benefits from the former variant (it often removes a significant number of entities).
Klie et al. [9] study various approaches for LED in token classification, but only consider scores for individual tokens rather than entire sentences. Here we compare against some of the methods that performed best in their study. The studies of Klie et al. [9], Northcutt et al. [14], Kuan and Mueller [10] indicate that label errors can be more effectively detected by considering the confidence level of classifiers rather than only their hard class predictions. Because it instead depends on the predicted class probabilities for each token, our worst-token method appropriately accounts for classifier confidence.
2 Methods
Typical token classification data is composed of many sentences (i.e. training instances), each of which is split into individual tokens (words or sub-words), where each token is labeled as one of $K$ classes (i.e. entities in entity recognition). Given a sentence $x$, a trained token classification model $\mathcal{M}(\cdot)$ outputs predicted probabilities $p = \mathcal{M}(x)$, where $p_{ij}$ is the probability that the $i$th token in sentence $x$ belongs to class $j$. Throughout, we assume these probabilities are out-of-sample, computed from a copy of the model that did not see $x$ during training (e.g. because $x$ is in a test set, or cross-validation was utilized).
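For illustration, here is a minimal scikit-learn sketch of obtaining such out-of-sample predicted probabilities via cross-validation. The feature vectors, labels, and sentence ids are hypothetical stand-ins (randomly generated), and the logistic regression is only a placeholder for an actual token classification model, which in practice would be trained fold by fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

rng = np.random.default_rng(0)

# Hypothetical stand-ins: one feature vector (e.g. a contextual embedding), one
# given label, and one sentence id per token, flattened over the whole dataset.
num_tokens, num_classes = 1000, 5
token_features = rng.normal(size=(num_tokens, 64))
token_labels = rng.integers(0, num_classes, size=num_tokens)
sentence_ids = rng.integers(0, 200, size=num_tokens)

# Out-of-sample predicted probabilities: grouping the folds by sentence ensures the
# model copy that scores a token never saw any token of that sentence during training.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    token_features,
    token_labels,
    groups=sentence_ids,
    cv=GroupKFold(n_splits=5),
    method="predict_proba",
)  # shape (num_tokens, num_classes); the rows of one sentence form its n x K matrix p
```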
Using $p$, we first consider evaluating the individual per-token labels. Here we apply effective LED methods for standard classification settings [14] by simply treating each token as a separate independent instance (ignoring which sentence it belongs to). Following Kuan and Mueller [10], we compute a label quality score $q_i \in [0,1]$ for the $i$th token (assume it is labeled as class $k$) via one of the following options:

self-confidence (sc): $q_i = p_{ik}$, i.e. predicted probability of the given label for this token.

normalized margin (nm): $q_i = p_{ik} - p_{i\tilde{k}}$ with $\tilde{k} = \operatorname{argmax}_j \{p_{ij}\}$.

confidence-weighted entropy (cwe): $q_i = p_{ik} / H(p_i)$ where $H(p_i) = -\frac{1}{\log K} \sum_{j=1}^{K} p_{ij} \log(p_{ij})$.
Higher values of these label quality scores correspond to tokens whose label is more likely to be correct [10]. We can alternatively evaluate the per-token labels via the Confident Learning algorithm of Northcutt et al. [14], which classifies each token as correctly labeled or not ($b_i = 1$ if this token is flagged as likely mislabeled, $b_i = 0$ otherwise) based on adaptive thresholds set with respect to per-class classifier confidence levels.
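As a rough illustration, the following NumPy sketch computes the three token quality scores above for one sentence; the function name and its arguments are hypothetical. The normalized-margin variant follows the formula as stated (some implementations instead take the maximum over the other classes $j \neq k$ and rescale the result to $[0,1]$), and the Confident Learning flags $b_i$ are not reproduced here.

```python
import numpy as np

def token_quality_scores(pred_probs, labels, method="sc"):
    """Label quality score q_i for each token in one sentence.

    pred_probs: (n, K) out-of-sample predicted probabilities p for the sentence.
    labels:     (n,) given class label of each token.
    """
    n, K = pred_probs.shape
    self_conf = pred_probs[np.arange(n), labels]  # p_{ik} for the given label k
    if method == "sc":    # self-confidence
        return self_conf
    if method == "nm":    # normalized margin: p_{ik} - p_{i k~}, k~ = argmax_j p_{ij}
        return self_conf - pred_probs.max(axis=1)
    if method == "cwe":   # confidence-weighted entropy: p_{ik} / H(p_i)
        entropy = -(pred_probs * np.log(pred_probs + 1e-12)).sum(axis=1) / np.log(K)
        return self_conf / np.clip(entropy, 1e-12, None)
    raise ValueError(f"unknown method: {method}")
```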
For one sentence with $n$ word-level tokens, we thus have:
$p$, an $n \times K$ matrix where $p_{ij}$ is the model-predicted probability that the $i$th token belongs to class $j$.
$l = [l_1, \ldots, l_n]$, where $l_i \in \{0, \ldots, K-1\}$ is the given class label of the $i$th token.
$q = [q_1, \ldots, q_n]$, where $q_i$ is a label quality score for the $i$th token (one of the above options).
$b = [b_1, \ldots, b_n]$, where $b_i = 1$ if the $i$th token is flagged as potentially mislabeled, otherwise $b_i = 0$.
Recall that to properly verify whether a token is really mislabeled, a reviewer must read the full
sentence containing this token to understand the broader context. Thus the most efficient way to
review labels in a dataset is to prioritize inspection of those sentences most likely to contain a
mislabeled token. We consider 11 methods to estimate an overall quality score $s(x)$ for the sentence $x$, where higher values correspond to sentences whose labels are more likely all correct.
1. predicted-difference: The number of disagreements between the given and model-predicted class labels over the tokens in the sentence, also utilized in the methods of Wang et al. [22], Reiss et al. [17]. Here we break sentence-score ties in favor of the highest-confidence disagreement. More formally:
$s(x) = -|R| - \max_{i \in R} p_{i\hat{l}_i}$
where $\hat{l}_i = \operatorname{argmax}_j \{p_{ij}\}$ and $R = \{i : \hat{l}_i \neq l_i\}$. If $R = \emptyset$, we let $\max_{i \in R} p_{i\hat{l}_i} = 0$.
2. bad-token-counts: $s(x) = -\sum_i b_i$, the number of Confident Learning flagged tokens. Similarly considered by Klie et al. [9], this approach is a natural token-classification extension of the method of Northcutt et al. [14] for LED in standard classification tasks.
3. bad-token-counts-avg: Again scoring based on the number of tokens flagged as potentially mislabeled, but now breaking ties primarily via the average label quality score of the flagged tokens and secondarily via the average label quality score of the other tokens. More formally:
$s(x) = -\sum_i b_i + \frac{1}{|R|} \sum_{i \in R} q_i + \frac{\epsilon}{|S|} \sum_{i \in S} q_i$
where $R = \{i : b_i = 1\}$, $S = \{i : b_i = 0\}$, and $\epsilon$ is some small constant.
4. bad-token-counts-min: Similar to bad-token-counts-avg, but break ties using minimum token quality rather than average token quality. More formally:
$s(x) = -\sum_i b_i + \min_{i \in R} q_i + \epsilon \cdot \min_{i \in S} q_i$
5. good-fraction: Fraction of tokens not flagged as potential issues, $s(x) = -\frac{1}{n} \sum_{i=1}^{n} b_i$.
6. penalize-bad-tokens: Penalize flagged tokens based on their corresponding label quality scores. More formally:
$s(x) = 1 - \frac{1}{n} \sum_{i=1}^{n} b_i (1 - q_i)$
7. average-quality: Average label quality of tokens in the sentence, $s(x) = \frac{1}{n} \sum_{i=1}^{n} q_i$.
8. product: $s(x) = \sum_i \log(q_i + c)$, where $c$ is a constant hyperparameter. This score places greater emphasis on tokens with low estimated label-quality, while still being influenced by all tokens' quality (like the previous average-quality method). With $q$ based on sc or nm token-scores, the product and average-quality methods are natural sentence extensions of the CU or PM methods considered in Klie et al. [9] for token-level LED.
9. expected-bad: A rough approximation of the expected number of mislabeled tokens in the sentence. More formally:
$s(x) = \sum_{j=1}^{\min(n, J)} j \cdot q_{(j)}$
where $q_{(i)}$ is the $i$th lowest token label quality score in this sentence, and $J$ is a hyperparameter. If using the sc label-quality score, $1 - q_{(i)}$ can be considered a loose proxy for the probability of having at least $i$ label errors in this sentence.
10. expected-alt: Similar to expected-bad, but only considering the likelihood of any label error rather than how many might be in this sentence. More formally:
$s(x) = \sum_{j=1}^{\min(n, J)} q_{(j)}$
11. worst-token: The quality of the worst-labeled token in the sentence determines its overall quality score, $s(x) = \min\{q_1, q_2, \ldots, q_n\}$. This is a reasonable way to rank the sentences most likely to have some label error, i.e. those most worthy of manual review (a code sketch of several of these sentence scores follows below).
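As a concrete illustration, here is a minimal NumPy sketch of a few of these sentence scores (predicted-difference, average-quality, product, and worst-token) for a single sentence. The function name, its arguments, and the value of the hyperparameter $c$ are hypothetical; the per-token quality scores $q$ are assumed to lie in $[0, 1]$ (e.g. self-confidence).

```python
import numpy as np

def sentence_scores(pred_probs, labels, quality, c=0.1):
    """Several of the sentence-level quality scores above for one sentence.

    pred_probs: (n, K) out-of-sample predicted probabilities p.
    labels:     (n,) given token labels l.
    quality:    (n,) per-token label quality scores q in [0, 1] (e.g. self-confidence).
    c:          hypothetical value for the `product` hyperparameter.
    """
    pred_labels = pred_probs.argmax(axis=1)            # \hat{l}_i
    disagree = np.flatnonzero(pred_labels != labels)   # R = {i : \hat{l}_i != l_i}

    # 1. predicted-difference: count disagreements, break ties by confidence.
    worst_conf = pred_probs[disagree, pred_labels[disagree]].max() if len(disagree) else 0.0
    predicted_difference = -len(disagree) - worst_conf

    return {
        "predicted-difference": predicted_difference,
        "average-quality": quality.mean(),      # 7. mean of q_i
        "product": np.log(quality + c).sum(),   # 8. sum of log(q_i + c)
        "worst-token": quality.min(),           # 11. min of q_i
    }
```

Sentences can then be ranked in ascending order of any one of these scores, so that the sentences most likely to contain a label error are reviewed first.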