CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators

Hui Wen Goh¹, Ulyana Tkachenko¹, Jonas Mueller¹

¹Cleanlab. Correspondence to: HWG <huiwen@cleanlab.ai>, UT <ulyana@cleanlab.ai>, JM <jonas@cleanlab.ai>.
Abstract

Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example that aggregates the available annotations; (2) A confidence score for how likely each consensus label is correct; (3) A rating for each annotator quantifying the overall correctness of their labels. Existing algorithms to estimate related quantities in crowdsourcing often rely on sophisticated generative models with iterative inference. CROWDLAB instead uses a straightforward weighted ensemble. Existing algorithms often rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB utilizes any classifier model trained on these features, and can thus better generalize between examples with similar features. On real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than existing algorithms like Dawid-Skene/GLAD.
1. Introduction

Training data for multiclass classification are often labeled by multiple annotators, with some redundancy between annotators to ensure high-quality labels. Such settings have been studied in crowdsourcing research (Monarch, 2021b; Paun et al., 2018). There it is often assumed that many annotators have labeled each example (Carpenter, 2008; Khetan et al., 2018), but this can be prohibitively expensive. This paper considers general settings where each example in the dataset is merely labeled by at least one annotator, and each annotator labels many examples (but still only a subset of the dataset). Each annotation corresponds to the selection of one class $y \in \{1, \ldots, K\}$ which the annotator believes to be most appropriate for this example.
Certain classification models can be trained in a special manner to account for the multiple labels per example (Nguyen et al., 2014; Peterson et al., 2019), but this is rarely done in practical applications. A common approach is to aggregate the labels for each example into a single consensus label, e.g. via majority-vote or statistical crowdsourcing algorithms (Dawid and Skene, 1979). Any classifier can then be trained on these consensus labels via off-the-shelf code.

Here we propose a method¹ that leverages any already-trained classifier to: (1) establish accurate consensus labels, (2) estimate their quality, and (3) estimate the quality of each annotator (Monarch, 2021c). The latter two aims help us determine which data is least trustworthy and should perhaps be verified via additional annotation (Bernhardt et al., 2022).
CROWDLAB (Classifier Refinement Of croWDsourced LABels) is based on a straightforward weighted ensemble of the classifier predictions and individual annotations. Weights are assigned according to the (estimated) trustworthiness of each annotator relative to the trained classifier. CROWDLAB is easy to implement/understand, computationally efficient (non-iterative), and extremely flexible. It works with any classifier and training procedure, as well as any classification dataset (including those containing examples only labeled by one annotator).
Motivations. Illustrating how many real-world multi-annotator datasets look, Figure 1 shows a disparity in annotator quality as well as many examples whose consensus label will be incorrect if we rely on majority vote (nonetheless often done in practice due to its straightforward appeal). Unsurprisingly, consensus labels are more likely to be incorrect for those examples with fewer annotations. An effective method to estimate consensus label quality should properly account for the number of annotations an example has received, as well as the quality of the annotators who selected these labels. Many of the examples whose consensus label is wrong merely have a single annotation, which provides little information. Using a trained classifier can help us better generalize to such examples to estimate their labels' quality (especially if the data contain other examples with similar feature values). When incorporating a classifier, we also wish to account for the accuracy and confidence of its estimates. CROWDLAB offers a straightforward way to appropriately account for all of these factors.

¹Code: https://github.com/cleanlab/cleanlab. Reproduce our results: https://github.com/cleanlab/multiannotator-benchmarks
Figure 1. Statistics of our Hardest dataset, which has images annotated by many actual humans. (a) Distribution over annotators showing the overall accuracy of each annotator's chosen labels. (b) Distribution over examples showing the number of annotations per example, grouped by whether the majority-vote consensus label is correct or not. Accuracy is measured against underlying ground-truth labels.
2. Methods
Consider a dataset sampled from (feature, class label) pairs $(X, Y)$ that is comprised of: $n$ examples, $K$ classes, and $m$ annotators in total. We first establish some notation to formally describe our setting: $|J|$ denotes the cardinality of set $J$, $\mathbb{1}(\cdot)$ is an indicator function emitting 1 if its condition is True and 0 otherwise, and $[n] = \{1, 2, \ldots, n\}$ indexes examples in the dataset. $X_i$ denotes the features of the $i$th example, which belongs to one class $Y_i \in [K]$. This true class is unknown to us. For $j \in [m]$: $A_j$ denotes the $j$th annotator, and $Y_{ij} \in [K]$ is the class this annotator chose for $X_i$. $Y_{ij} = \emptyset$ if $A_j$ did not label this particular example. Each example receives at most $m$ annotations, with many examples receiving fewer. $\hat{Y}_i$ denotes the consensus label for example $i$, representing our best estimate of its true class $Y_i$. $I_j := \{i \in [n] : Y_{ij} \neq \emptyset\}$ denotes the subset of examples labeled by $A_j$. We assume each annotator has labeled multiple examples, i.e. $|I_j| > 1$. $J_i := \{j \in [m] : Y_{ij} \neq \emptyset\}$ denotes the subset of annotators that labeled $X_i$. Some examples may only be labeled by a single annotator.
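For concreteness, here is a minimal sketch (not from the paper) of one way such annotations might be stored: an $n \times m$ array whose $(i,j)$ entry is the class chosen by annotator $j$ for example $i$, with a sentinel value where $Y_{ij} = \emptyset$. All variable names are illustrative assumptions.

```python
import numpy as np

MISSING = -1  # sentinel standing in for "annotator did not label this example"

# Hypothetical toy data: n=5 examples, m=3 annotators, K=3 classes.
# Row i holds the labels Y_ij chosen for example i; MISSING where Y_ij is undefined.
labels = np.array([
    [0, 0, MISSING],
    [1, MISSING, 1],
    [2, 2, 1],
    [MISSING, 0, MISSING],
    [1, 1, 1],
])
n, m = labels.shape

# J_i: annotators that labeled example i;  I_j: examples labeled by annotator j.
J = [np.where(labels[i] != MISSING)[0] for i in range(n)]
I = [np.where(labels[:, j] != MISSING)[0] for j in range(m)]
print([len(Ji) for Ji in J])  # number of annotations per example
```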
We assume some classifier model $M$ has been trained to predict the given labels based on feature values. CROWDLAB can be used with any type of classifier $M$ (and training procedure), as long as it outputs predicted class probabilities $\hat{p}_M(Y \mid X) \in \mathbb{R}^K$ estimating the likelihood that example $X$ belongs to each class $k$. To avoid overfit predictions, we fit $M$ via cross-validation. This provides held-out predictions $\hat{p}_M(Y_i \mid X_i)$ for each example in the dataset (from a copy of $M$ which never saw $X_i$ during training). In our experiments, we simply train $M$ on consensus labels derived via majority vote. But one could train the classifier on any other set of consensus labels, or even on the individual labels from each annotator (simply duplicating multiply-annotated examples in the training set). All methods considered here that use $M$ will benefit from improvements in the classifier's predictive accuracy. However, CROWDLAB is the only method that explicitly accounts for shortcomings of the classifier's predictions (inevitable due to estimation error).
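As a rough illustration of how such held-out predicted probabilities could be obtained (this is not the paper's reference code; any classifier and cross-validation routine could be substituted), one might use scikit-learn's `cross_val_predict`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def holdout_pred_probs(X, consensus_labels, n_folds=5):
    """Out-of-sample class probabilities p_M(Y_i | X_i) for every example,
    produced by copies of the model that never saw that example in training."""
    model = LogisticRegression(max_iter=1000)  # stand-in for any classifier M
    return cross_val_predict(model, X, consensus_labels,
                             cv=n_folds, method="predict_proba")

# Usage: X is an (n, d) feature matrix, consensus_labels an (n,) array of
# majority-vote labels in {0, ..., K-1}; the result is an (n, K) array.
```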
2.1. Scoring Consensus Quality
We first outline methods to estimate our confidence that a given consensus label for each example is correct. These quality estimates $q_i \in [0, 1]$ may be applied to any given label, no matter which method was used to establish consensus. Once we can estimate the quality of any one label for each example, we estimate the best consensus label under each method as the class with the highest consensus quality score. This class can be identified efficiently for CROWDLAB. CROWDLAB combines the complementary strengths of two basic estimators that we discuss first.
Agreement (Monarch, 2021b). The fraction of annotators who agree with the consensus label (does not use a classifier):

$$q_i = \frac{1}{|J_i|} \sum_{j \in J_i} \mathbb{1}(Y_{ij} = \hat{Y}_i) \qquad (1)$$

Final consensus labels can be established via majority vote.
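A minimal sketch of majority-vote consensus and the agreement score (1), continuing the hypothetical `labels`/`MISSING` representation assumed above; this is not the authors' implementation:

```python
import numpy as np

def majority_vote(labels, num_classes):
    """Consensus label per example = most frequently chosen class (ties broken
    arbitrarily toward the lower class index)."""
    counts = np.stack([(labels == k).sum(axis=1) for k in range(num_classes)], axis=1)
    return counts.argmax(axis=1)

def agreement_score(labels, consensus, missing=-1):
    """q_i = fraction of the annotators of example i that chose its consensus label."""
    annotated = labels != missing
    agree = (labels == consensus[:, None]) & annotated
    return agree.sum(axis=1) / annotated.sum(axis=1)
```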
Label Quality Score (Kuan and Mueller, 2022). Instead of relying on the annotators, one can rely on the classifier model via methods used to evaluate labels in standard (singly-labeled) classification datasets. This approach ignores information from individual annotators when computing consensus quality: $q_i = L(\hat{Y}_i, \hat{p}_M(Y_i \mid X_i))$.
Here $L(Y, \hat{p}) \in [0, 1]$ is a label quality score which quantifies our confidence that a particular label $Y \in [K]$ is correct for example $X$, given model-prediction $\hat{p} \in \mathbb{R}^K$ estimating the likelihood that $X$ belongs to each class. Our work uses self-confidence as the label quality score: $L(Y, \hat{p}) = \hat{p}(Y \mid X)$. This simply represents the model-estimated probability that the example belongs to its labeled class. Kuan and Mueller (2022); Northcutt et al. (2021b) found this to be effective for scoring label errors in singly-labeled data based on classifier predictions.
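In code, the self-confidence score is simply the model's predicted probability of the labeled class; a sketch under the same assumed variable names:

```python
import numpy as np

def self_confidence(pred_probs, given_labels):
    """L(Y, p_hat) = p_hat(Y | X): model-estimated probability of each example's label.
    pred_probs: (n, K) class probabilities; given_labels: (n,) class indices."""
    return pred_probs[np.arange(len(given_labels)), given_labels]
```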
CROWDLAB (Classifier Refinement Of croWDsourced LABels). The aforementioned approaches fail to consider both annotators and classifier. Treating these as different predictors of an example's true label, we take inspiration from prediction competitions where weighted ensembling of predictors produces accurate and calibrated predictions. CROWDLAB also employs the same label quality score for each consensus label, but applies it to a different class probability vector which modifies the prediction output by our classifier to account for the individual annotations for an example: $q_i = L(\hat{Y}_i, \hat{p}_{CR}(Y_i \mid X_i, \{Y_{ij}\}))$.

We estimate these class probabilities by means of a weighted ensemble aggregation (Fakoor et al., 2021):

$$\hat{p}_{CR}(Y_i \mid X_i, \{Y_{ij}\}) = \frac{w_M \cdot \hat{p}_M(Y_i \mid X_i) + \sum_{j \in J_i} w_j \cdot \hat{p}_{A_j}(Y_i \mid \{Y_{ij}\})}{w_M + \sum_{j \in J_i} w_j}$$
Here $\hat{p}_M \in \mathbb{R}^K$ is the probability of each class predicted by our classifier, $\hat{p}_{A_j} \in \mathbb{R}^K$ is a similar likelihood vector treating each annotator's label as a probabilistic "prediction", and $w_j, w_M \in \mathbb{R}$ are weights to account for the (estimated) relative trustworthiness of each annotator and our classifier. Our estimation procedure for these weights ensures $w_M$ is smaller if our classifier was poorly trained and $w_j$ is smaller for the annotators who give less accurate labels overall.
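A sketch of this weighted aggregation for a single example, taking as inputs the weights $w_M, w_j$ and the per-annotator likelihood vectors that are defined below (eqs. 3, 7, 8); the function and variable names are illustrative assumptions:

```python
import numpy as np

def crowdlab_probs_for_example(clf_probs_i, annotator_prob_vectors, w_model, w_annotators):
    """p_CR(Y_i | X_i, {Y_ij}): weighted average of the classifier's predicted
    class probabilities and one likelihood vector per annotator of example i.

    clf_probs_i: (K,) classifier probabilities p_M(Y_i | X_i)
    annotator_prob_vectors: (|J_i|, K) rows p_Aj(Y_i | {Y_ij})
    w_model: scalar weight w_M;  w_annotators: (|J_i|,) weights w_j
    """
    numerator = w_model * clf_probs_i + (w_annotators[:, None] * annotator_prob_vectors).sum(axis=0)
    return numerator / (w_model + w_annotators.sum())
```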
To present the remaining details, we first define a likelihood parameter $P$ as the average annotator agreement, across examples with more than one annotation. $P$ estimates the probability that an arbitrary annotator's label will match the majority-vote consensus label for an arbitrary example:

$$P = \frac{1}{|I_{+}|} \sum_{i \in I_{+}} \frac{1}{|J_i|} \sum_{j \in J_i} \mathbb{1}(Y_{ij} = \hat{Y}_i), \quad \text{where } I_{+} := \{i \in [n] : |J_i| > 1\} \qquad (2)$$
We then simply define our per-annotator predicted class probability vector used in (2.1) to be:

$$\hat{p}_{A_j}(Y_i = k \mid \{Y_{ij}\}) = \begin{cases} P & \text{when } Y_{ij} = k \\ \dfrac{1 - P}{K - 1} & \text{when } Y_{ij} \neq k \end{cases} \qquad (3)$$

This likelihood is shared across annotators and only involves a single parameter $P$, easily estimated from limited data. $P$ is a simple estimate of the accuracy of labels from a typical annotator. Note that including singly-annotated examples in (2) would bias $P$. This likelihood facilitates comparing classifier outputs against outputs from the typical annotator.
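A sketch of (2) and (3) under the same assumed representation (a `labels` array with a `missing` sentinel and majority-vote `consensus`); not the reference implementation:

```python
import numpy as np

def annotator_agreement_param(labels, consensus, missing=-1):
    """P: average per-example agreement with majority-vote consensus,
    restricted to examples with more than one annotation (eq. 2)."""
    annotated = labels != missing
    multi = annotated.sum(axis=1) > 1
    agree_frac = ((labels == consensus[:, None]) & annotated).sum(axis=1) / annotated.sum(axis=1)
    return agree_frac[multi].mean()

def annotator_likelihood(label_ij, num_classes, P):
    """p_Aj(Y_i | {Y_ij}) from eq. (3): probability P on the chosen class,
    (1 - P) / (K - 1) spread over the remaining classes."""
    probs = np.full(num_classes, (1.0 - P) / (num_classes - 1))
    probs[label_ij] = P
    return probs
```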
Now we detail how to estimate the trustworthiness weights $w_j, w_M$. Let $s_j$ represent annotator $j$'s agreement with other annotators who labeled the same examples:

$$s_j = \frac{\sum_{i \in I_j} \sum_{\ell \in J_i, \ell \neq j} \mathbb{1}(Y_{ij} = Y_{i\ell})}{\sum_{i \in I_j} (|J_i| - 1)} \qquad (4)$$

Let $A_M$ be the (empirical) accuracy of our classifier with respect to the majority-vote consensus labels over the examples with more than one annotation:

$$A_M = \frac{1}{|I_{+}|} \sum_{i \in I_{+}} \mathbb{1}(Y_{i,M} = \hat{Y}_i) \qquad (5)$$
Here $Y_{i,M} := \arg\max_k \hat{p}_M(Y_i = k \mid X_i) \in [K]$ is the class predicted by our model for $X_i$. $A_M$ and $s_j$ from (4) are analogous accuracy estimates for our classifier and individual annotators. Both are computed with only the multiply-annotated examples, since majority-vote consensus labels are more reliable for this subset.
Before defining the trustworthiness weights, we normalize these accuracy estimates with respect to a baseline that puts them on a meaningful scale. This baseline is based on the estimated accuracy $A_{MLC}$ of always predicting the overall most labeled class across all examples' annotations, $Y_{MLC} := \arg\max_k \sum_{ij} \mathbb{1}(Y_{ij} = k)$, i.e. the class selected the most by the annotators across all examples. This accuracy is also estimated on only the subset of examples that have more than one annotator, $I_{+}$ defined in (2).

$$A_{MLC} = \frac{1}{|I_{+}|} \sum_{i \in I_{+}} \mathbb{1}(Y_{MLC} = \hat{Y}_i) \qquad (6)$$
Adopting this most-labeled-class accuracy as a baseline, we compute normalized versions of our estimates for: each annotator's agreement with other annotators, and the adjusted accuracy of the model.

$$w_j = 1 - \frac{1 - s_j}{1 - A_{MLC}} \qquad (7)$$

$$w_M = \left(1 - \frac{1 - A_M}{1 - A_{MLC}}\right) \cdot \sqrt{\frac{1}{n} \sum_i |J_i|} \qquad (8)$$
CROWDLAB uses $w_j$ and $w_M$ to weight our annotators and classifier model in its weighted ensemble of predictors. Each trustworthiness weight can thus be understood as 1 minus the (estimated) relative error of the corresponding predictor. Such normalized-error based weighting is commonly employed to combine predictors in model averaging.
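Putting equations (4)-(8) together, a rough sketch (again assuming the `labels` array, held-out `pred_probs`, and majority-vote `consensus` from the earlier snippets; not the reference implementation):

```python
import numpy as np

def trustworthiness_weights(labels, pred_probs, consensus, missing=-1):
    n, m = labels.shape
    annotated = labels != missing
    multi = annotated.sum(axis=1) > 1          # I_+ : multiply-annotated examples

    # s_j (eq. 4): agreement of annotator j with the other annotators of the same examples.
    s = np.zeros(m)
    for j in range(m):
        agree, pairs = 0, 0
        for i in np.where(annotated[:, j])[0]:
            others = [l for l in np.where(annotated[i])[0] if l != j]
            agree += sum(labels[i, j] == labels[i, l] for l in others)
            pairs += len(others)
        s[j] = agree / pairs if pairs else 0.0

    # A_M (eq. 5): classifier accuracy vs. majority-vote consensus on I_+.
    model_pred = pred_probs.argmax(axis=1)
    A_M = (model_pred[multi] == consensus[multi]).mean()

    # A_MLC (eq. 6): accuracy of always predicting the overall most-labeled class.
    K = pred_probs.shape[1]
    class_counts = [(labels == k).sum() for k in range(K)]
    y_mlc = int(np.argmax(class_counts))
    A_MLC = (consensus[multi] == y_mlc).mean()

    # Normalized weights (eqs. 7-8); this sketch ignores the degenerate A_MLC == 1 case.
    w_j = 1.0 - (1.0 - s) / (1.0 - A_MLC)
    w_M = (1.0 - (1.0 - A_M) / (1.0 - A_MLC)) * np.sqrt(annotated.sum(axis=1).mean())
    return w_j, w_M
```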
2.2. Scoring Annotator Quality
Beyond estimating consensus labels and their quality, we consider ranking which annotators provide the best/worst labels. Here are methods to get an overall quality score $a_j \in [0, 1]$ summarizing each annotator's accuracy/skill.
Agreement (Monarch, 2021b). A simple score is the empirical accuracy of each annotator's labels with respect to majority-vote consensus labels. Examples with one annotation are not considered in this calculation to reduce bias.

$$a_j = \frac{1}{|I_{j,+}|} \sum_{i \in I_{j,+}} \mathbb{1}(Y_{ij} = \hat{Y}_i), \quad \text{where } I_{j,+} := I_j \cap I_{+} = \{i \in I_j : |J_i| > 1\} \qquad (9)$$
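A minimal sketch of (9) under the same assumed `labels`/`consensus` representation:

```python
import numpy as np

def annotator_agreement_scores(labels, consensus, missing=-1):
    """a_j from eq. (9): each annotator's accuracy against the consensus labels,
    counting only their multiply-annotated examples."""
    annotated = labels != missing
    multi = annotated.sum(axis=1) > 1
    eligible = annotated & multi[:, None]                 # I_{j,+} per annotator
    agree = (labels == consensus[:, None]) & eligible
    return agree.sum(axis=0) / eligible.sum(axis=0)
```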
Label Quality Score (Kuan and Mueller, 2022). Agreement scores rate annotators solely based on labeling statistics. We can also rely on our classifier predictions $\hat{p}_M$ to rate the average quality of all labels provided by one annotator:

$$a_j = \frac{1}{|I_j|} \sum_{i \in I_j} L\big(Y_{ij}, \hat{p}_M(Y_i \mid X_i)\big) \qquad (10)$$
CROWDLAB. Our method takes into account both the label quality score of each annotator's labels (computed based on our classifier), as well as the agreement between each annotator's label and the CROWDLAB consensus label. As in (10), we estimate an average label quality score of labels given by each annotator, but here using estimated class probabilities $\hat{p}_{CR}$ from CROWDLAB in Sec. 2.1:

$$Q_j = \frac{1}{|I_j|} \sum_{i \in I_j} L\big(Y_{ij}, \hat{p}_{CR}(Y_i \mid X_i, \{Y_{ij}\})\big) \qquad (11)$$
Next, we compute each annotator's agreement with consensus among examples with more than one annotation:

$$A_j = \frac{1}{|I_{j,+}|} \sum_{i \in I_{j,+}} \mathbb{1}(Y_{ij} = \hat{Y}_i) \qquad (12)$$

Here $I_{j,+}$ is defined in (9) and the consensus labels $\hat{Y}_i$ are established via the CROWDLAB method from Sec. 2.1.
Since CROWDLAB is an effective method to estimate consensus labels $\hat{Y}_i$, one might wonder why $A_j$ alone from (12) does not produce the best estimate of annotator quality. One reason is that $A_j$ fails to account for our confidence in each consensus label and how individual annotators deviate from consensus. If two annotators exhibit the same overall rate of agreement with the consensus labels, we should favor the annotator whose deviations from consensus are predicted to be likely classes by the classifier and tend to occur for examples with lower consensus quality score.
Therefore we base CROWDLAB's annotator quality score on a weighted average between $A_j$ and $Q_j$. Using the model/annotator weights $w_M, w_j$ computed by CROWDLAB in (7) and (8), we find a single aggregate weight to compare all annotators against the classifier:

$$\bar{w} = \frac{w_M}{w_M + w_0}, \quad \text{where } w_0 = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} w_j \cdot |J_i| \qquad (13)$$

Here $\bar{w}$ is shared across all annotators. It represents the (estimated) relative trustworthiness of our classifier against the average annotator. A quality score for each annotator is finally computed via a weighted average of the label quality score and the annotator agreement with consensus labels:

$$a_j = \bar{w} \, Q_j + (1 - \bar{w}) \, A_j \qquad (14)$$
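A sketch of equations (11)-(14), assuming the CROWDLAB class probabilities and consensus labels from Sec. 2.1 plus the weights from the earlier snippet are already computed; variable names are illustrative:

```python
import numpy as np

def crowdlab_annotator_quality(labels, crowdlab_probs, crowdlab_consensus,
                               w_j, w_M, missing=-1):
    """crowdlab_probs: (n, K) array of p_CR(Y_i | X_i, {Y_ij});
    crowdlab_consensus: (n,) CROWDLAB consensus labels."""
    n, m = labels.shape
    annotated = labels != missing
    multi = annotated.sum(axis=1) > 1

    # Q_j (eq. 11): average self-confidence of annotator j's labels under p_CR.
    Q = np.zeros(m)
    # A_j (eq. 12): agreement with CROWDLAB consensus on multiply-annotated examples.
    A = np.zeros(m)
    for j in range(m):
        idx = np.where(annotated[:, j])[0]
        Q[j] = crowdlab_probs[idx, labels[idx, j]].mean()
        idx_multi = idx[multi[idx]]
        A[j] = (labels[idx_multi, j] == crowdlab_consensus[idx_multi]).mean()

    # Shared weight w_bar (eq. 13) comparing the classifier to the average annotator.
    w_0 = (w_j[None, :] * annotated.sum(axis=1)[:, None]).sum() / (n * m)
    w_bar = w_M / (w_M + w_0)

    # Final annotator quality score (eq. 14).
    return w_bar * Q + (1.0 - w_bar) * A
```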
3. Related Work
Prior work for estimating (1)-(3) from multi-annotator datasets has fallen into two camps. The first camp relies on statistical generative models that only account for the observed annotator statistics (Carpenter, 2008). Like CROWDLAB, the second camp of approaches also models feature-label relationships, but does so via linear models (Jin et al., 2017), autoencoders (Liu et al., 2021a), or classifiers fit to soft labels in an iterative manner (Raykar et al., 2010; Khetan et al., 2018; Rodrigues and Pereira, 2018; Platanios et al., 2020; Liu et al., 2021a). These methods cannot utilize an arbitrary classifier (trained via any procedure), and they are more specialized and complex than CROWDLAB. Due to this complexity, approaches from the former camp remain much more popular in practical applications (Toloka).

The following sections describe existing baseline methods to estimate consensus/annotator quality that our subsequent experiments compare CROWDLAB against. We focus our comparison on approaches which are either: commonly used in practice, or able to utilize any classifier to produce better estimates (rather than approaches that are restricted to a specific type of model or non-standard training procedure).
3.1. Baseline Consensus Quality Scores
Dawid-Skene (Dawid and Skene, 1979). This Bayesian method specifies a generative model of the dataset annotations. It employs iterative expectation-maximization (EM) to estimate each annotator's error rates in a class-specific manner. A key estimate in this approach is $\hat{p}_{DS}(Y_i \mid \{Y_{ij}\})$, the posterior probability vector of the true class $Y_i$ for the $i$th example, given the dataset annotations $\{Y_{ij}\}$.

Define $\pi^{(j)}_{k,\ell}$ as the probability that annotator $j$ labels an example as class $\ell$ when the true label of that example is $k$. This individual class confusion matrix for each annotator