
entity-disjoint cross-validation, and scores a sentence based on the number of ensemble-member
class predictions that deviate from the given label, across all tokens in the sentence. Reiss et al. [17]
propose a similar approach to LED in token classification, which also relies on counting deviations
between token labels and the corresponding class predictions output by a large ensemble of diverse
models trained via cross-validation. Unlike our worst-token method, the methods of Wang et al. [22]
and Reiss et al. [17] are more computationally expensive due to ensembling many models, and do
not account for the confidence of individual predictions (as they are based on hard class predictions
rather than estimated class probabilities). Given that the LED performance of worst-token improves
with a more accurate token classifier, we expect worst-token to benefit from the reliable gains in
predictive accuracy that ensembling provides, in the same way that ensembling benefits the methods
of Wang et al. [22] and Reiss et al. [17]. We also tried entity-disjoint versus standard data splitting,
but did not observe benefits from the former variant (it often removes a significant number of entities).
Klie et al. [9] study various approaches for LED in token classification, but only consider scores for
individual tokens rather than entire sentences. Here we compare against some of the methods that
performed best in their study. The studies of Klie et al. [9], Northcutt et al. [14], and Kuan and
Mueller [10] indicate that label errors can be more effectively detected by considering the confidence
level of classifiers rather than only their hard class predictions. By instead depending on predicted
class probabilities for each token, our worst-token method appropriately accounts for classifier
confidence.
2 Methods
Typical token classification data is composed of many sentences (i.e. training instances), each of
which is split into individual tokens (words or sub-words) where each token is labeled as one of $K$
classes (i.e. entities in entity recognition). Given a sentence $x$, a trained token classification
model $\mathcal{M}(\cdot)$ outputs predicted probabilities $p = \mathcal{M}(x)$, where $p_{ij}$ is the probability that the
$i$th token in sentence $x$ belongs to class $j$. Throughout, we assume these probabilities are out-of-sample,
computed from a copy of the model that did not see $x$ during training (e.g. because $x$ is in the test set, or
cross-validation was utilized).
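One simple way to obtain such out-of-sample probabilities is via $K$-fold cross-validation. The following is a minimal sketch of this procedure; the `TokenClassifier` wrapper and the `sentences`/`token_labels` containers are hypothetical placeholders rather than a prescribed API.

```python
from sklearn.model_selection import KFold

# sentences: list of token sequences; token_labels: matching list of per-token labels.
# TokenClassifier is a hypothetical model wrapper exposing .fit() and .predict_proba(),
# where .predict_proba(sents) returns one (n_tokens x K) probability matrix per sentence.

def out_of_sample_probs(sentences, token_labels, n_splits=5, seed=0):
    """Predicted class probabilities for every token, from models that never saw that sentence."""
    probs = [None] * len(sentences)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, held_out_idx in kf.split(sentences):
        model = TokenClassifier()  # hypothetical wrapper around any token classifier
        model.fit([sentences[i] for i in train_idx],
                  [token_labels[i] for i in train_idx])
        for i in held_out_idx:
            probs[i] = model.predict_proba([sentences[i]])[0]  # (n_i x K) matrix for sentence i
    return probs
```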
Using $p$, we first consider evaluating the individual per-token labels. Here we apply effective
LED methods for standard classification settings [14] by simply treating each token as a separate
independent instance (ignoring which sentence it belongs to). Following Kuan and Mueller [10], we
compute a label quality score $q_i \in [0,1]$ for the $i$th token (assume it is labeled as class $k$) via one of
the following options:
• self-confidence (sc): $q_i = p_{ik}$, i.e. the predicted probability of the given label for this token.
• normalized margin (nm): $q_i = p_{ik} - p_{i\tilde{k}}$ with $\tilde{k} = \operatorname{argmax}_j \{p_{ij}\}$
• confidence-weighted entropy (cwe): $q_i = \frac{p_{ik}}{H(p_i)}$ where $H(p_i) = -\frac{1}{\log K} \sum_{j=1}^{K} p_{ij} \log(p_{ij})$
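As an illustration, here is a short sketch of how these three scores could be computed for one sentence, assuming `p` is its $n \times K$ out-of-sample probability matrix and `labels` its given per-token labels (variable names are ours, not a prescribed API):

```python
import numpy as np

def token_quality_scores(p, labels, method="sc", eps=1e-12):
    """Label quality score q_i per token; higher means the given label is more likely correct."""
    n, K = p.shape
    idx = np.arange(n)
    self_conf = p[idx, labels]                 # p_{ik}: predicted probability of the given label
    if method == "sc":                         # self-confidence
        return self_conf
    if method == "nm":                         # normalized margin
        return self_conf - p.max(axis=1)       # p_{ik} - p_{ik~} with k~ = argmax_j p_{ij}
    if method == "cwe":                        # confidence-weighted entropy
        H = -(p * np.log(p + eps)).sum(axis=1) / np.log(K)
        return self_conf / np.maximum(H, eps)  # guard against zero entropy
    raise ValueError(f"unknown method: {method}")
```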
Higher values of these label quality scores correspond to tokens whose label is more likely to be
correct [10]. We can alternatively evaluate the per-token labels via the Confident Learning algorithm
of Northcutt et al. [14], which classifies each token as correctly labeled or not ($b_i = 1$ if this token is
flagged as likely mislabeled, $b_i = 0$ otherwise) based on adaptive thresholds set with respect to per-class
classifier confidence levels.
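In practice, such flags can be obtained from the open-source cleanlab implementation of Confident Learning by treating all tokens as one flat classification dataset. A sketch, assuming the per-sentence probabilities and labels from above are held in lists (`probs_per_sentence` and `labels_per_sentence` are our placeholder names):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# probs_per_sentence: list of (n_i x K) matrices; labels_per_sentence: list of label arrays.
pred_probs = np.vstack(probs_per_sentence)    # stack all tokens: (total_tokens x K)
labels = np.concatenate(labels_per_sentence)  # (total_tokens,)

# Boolean flag per token: True where Confident Learning deems the label likely wrong.
flags = find_label_issues(labels=labels, pred_probs=pred_probs)
```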
For one sentence with $n$ word-level tokens, we thus have:
• $p$, an $n \times K$ matrix where $p_{ij}$ is the model's predicted probability that the $i$th token belongs to class $j$.
• $l = [l_1, \dots, l_n]$, where $l_i \in \{0, \dots, K-1\}$ is the given class label of the $i$th token.
• $q = [q_1, \dots, q_n]$, where $q_i$ is a label quality score for the $i$th token (one of the above options).
• $b = [b_1, \dots, b_n]$, where $b_i = 1$ if the $i$th token is flagged as potentially mislabeled, otherwise $b_i = 0$.
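To make this notation concrete, here is a toy instance of these quantities for a single 3-token sentence (all numbers are illustrative, with $q$ computed via self-confidence and $b$ a hypothetical Confident Learning output):

```python
import numpy as np

K = 3                                    # number of classes
p = np.array([[0.90, 0.05, 0.05],        # token 1: confident in class 0
              [0.10, 0.20, 0.70],        # token 2: confident in class 2
              [0.40, 0.45, 0.15]])       # token 3: uncertain between classes 0 and 1
l = np.array([0, 2, 0])                  # given per-token labels
q = p[np.arange(3), l]                   # self-confidence scores: [0.90, 0.70, 0.40]
b = np.array([0, 0, 1])                  # suppose only token 3 is flagged as mislabeled
```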
Recall that to properly verify whether a token is really mislabeled, a reviewer must read the full
sentence containing this token to understand the broader context. Thus the most efficient way to
review labels in a dataset is to prioritize inspection of those sentences most likely to contain a