
CROWDLAB: Supervised learning to infer consensus labels
and quality scores for data with multiple annotators
Hui Wen Goh 1   Ulyana Tkachenko 1   Jonas Mueller 1

1Cleanlab. Correspondence to: HWG <huiwen@cleanlab.ai>, UT <ulyana@cleanlab.ai>, JM <jonas@cleanlab.ai>.
Abstract
Real-world data for classification is often labeled
by multiple annotators. For analyzing such data,
we introduce CROWDLAB, a straightforward ap-
proach to utilize any trained classifier to estimate:
(1) A consensus label for each example that aggre-
gates the available annotations; (2) A confidence
score for how likely each consensus label is cor-
rect; (3) A rating for each annotator quantifying
the overall correctness of their labels. Existing
algorithms to estimate related quantities in crowd-
sourcing often rely on sophisticated generative
models with iterative inference. CROWDLAB
instead uses a simple weighted ensemble.
Existing algorithms often rely solely on annota-
tor statistics, ignoring the features of the exam-
ples from which the annotations derive. CROWD-
LAB utilizes any classifier model trained on these
features, and can thus better generalize between
examples with similar features. On real-world
multi-annotator image data, our proposed method
provides better estimates for (1)-(3) than existing
algorithms like Dawid-Skene and GLAD.
1. Introduction
Training data for multiclass classification are often labeled
by multiple annotators, with some redundancy between an-
notators to ensure high-quality labels. Such settings have
been studied in crowdsourcing research (Monarch, 2021b;
Paun et al., 2018). There it is often assumed that many anno-
tators have labeled each example (Carpenter, 2008; Khetan
et al., 2018), but this can be prohibitively expensive. This
paper considers general settings where each example in the
dataset is merely labeled by at least one annotator, and each
annotator labels many examples (but still only a subset of
the dataset). Each annotation corresponds to the selection
of one class $y \in \{1, \dots, K\}$ which the annotator believes to
be most appropriate for this example.
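To make this data format concrete, below is a minimal Python sketch of one common way to store such annotations as an examples-by-annotators table; the column names, class values, and use of NaN for missing annotations are our illustrative choices, not prescribed by the paper.

    import numpy as np
    import pandas as pd

    # Toy data: 5 examples, 3 annotators, K = 3 classes.
    # Each row is one example; NaN marks examples a given annotator did not label.
    # Every example has at least one annotation; every annotator labels only a subset.
    # Classes are 0-indexed here, corresponding to {1, ..., K} in the text.
    labels_multiannotator = pd.DataFrame(
        {
            "annotator_1": [0, 1, np.nan, 2, np.nan],
            "annotator_2": [0, np.nan, 2, 2, 1],
            "annotator_3": [np.nan, 1, 2, np.nan, np.nan],
        }
    )
    print(labels_multiannotator)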
Certain classification models can be trained in a special man-
ner to account for the multiple labels per example (Nguyen
et al., 2014; Peterson et al., 2019), but this is rarely done
in practical applications. A common approach is to aggre-
gate the labels for each example into a single consensus
label, e.g. via majority-vote or statistical crowdsourcing al-
gorithms (Dawid and Skene, 1979). Any classifier can then
be trained on these consensus labels via off-the-shelf code.
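As a hedged sketch of this common baseline pipeline (the -1 convention for missing annotations, the toy data, and the choice of logistic regression are ours for illustration only):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def majority_vote(annotator_labels: np.ndarray) -> np.ndarray:
        """Majority-vote consensus per example; annotator_labels has shape
        (n_examples, n_annotators) with -1 marking missing annotations."""
        consensus = np.empty(annotator_labels.shape[0], dtype=int)
        for i, row in enumerate(annotator_labels):
            votes = row[row >= 0]
            values, counts = np.unique(votes, return_counts=True)
            consensus[i] = values[np.argmax(counts)]  # ties broken toward smallest class index
        return consensus

    # Toy data: 6 examples with 2 features, 3 annotators, K = 3 classes.
    X = np.random.RandomState(0).randn(6, 2)
    A = np.array([[0, 0, -1],
                  [1, -1, 1],
                  [2, 2, 0],
                  [-1, 1, 1],
                  [0, -1, -1],
                  [2, 2, -1]])

    consensus_labels = majority_vote(A)
    clf = LogisticRegression().fit(X, consensus_labels)  # any off-the-shelf classifier
    pred_probs = clf.predict_proba(X)  # out-of-sample probabilities are preferable in practice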
Here we propose a method1 that leverages any already-
trained classifier to: (1) establish accurate consensus labels,
(2) estimate their quality, and (3) estimate the quality of
each annotator (Monarch, 2021c). The latter two aims help
us determine which data is least trustworthy and should per-
haps be verified via additional annotation (Bernhardt et al.,
2022).
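For readers who want to try this directly, the open-source implementation in the footnoted cleanlab repository exposes multi-annotator functionality roughly as sketched below; the function name, arguments, and returned keys reflect the library around the time of writing and may evolve, so treat this as a sketch under that assumption and consult the library documentation.

    from cleanlab.multiannotator import get_label_quality_multiannotator

    # API names per cleanlab at time of writing; check the current docs.
    # labels_multiannotator: (n_examples, n_annotators) DataFrame of class indices,
    #   with NaN wherever an annotator did not label an example.
    # pred_probs: (n_examples, K) predicted class probabilities from any trained
    #   classifier (ideally held-out / cross-validated predictions).
    results = get_label_quality_multiannotator(labels_multiannotator, pred_probs)

    results["label_quality"]    # (1)-(2): consensus label and its quality score per example
    results["annotator_stats"]  # (3): overall quality rating per annotator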
CROWDLAB
(Classifier Refinement Of croWD-
sourced LABels) is based on a straightforward weighted
ensemble of the classifier predictions and individual anno-
tations. Weights are assigned according to the (estimated)
trustworthiness of each annotator relative to the trained
classifier. CROWDLAB is easy to implement/understand,
computationally efficient (non-iterative), and extremely flex-
ible. It works with any classifier and training procedure, as
well as any classification dataset (including those containing
examples only labeled by one annotator).
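To convey the core idea before the formal description, here is a simplified sketch of such a weighted ensemble: the classifier's predicted class probabilities and a one-hot vector for each annotator's label are averaged with trustworthiness weights. How CROWDLAB actually estimates these weights (and refines the likelihoods) is specified later in the paper, so the weights below are treated as given inputs and all names are ours.

    import numpy as np

    def weighted_ensemble_consensus(pred_probs, annotator_labels,
                                    annotator_weights, model_weight):
        """Illustrative weighted ensemble of classifier predictions and annotations.

        pred_probs:        (n_examples, K) predicted class probabilities.
        annotator_labels:  (n_examples, n_annotators) class indices, -1 where missing.
        annotator_weights: (n_annotators,) estimated trustworthiness of each annotator.
        model_weight:      scalar trustworthiness of the classifier.
        Returns a consensus label and a confidence score for each example.
        """
        n_examples, K = pred_probs.shape
        consensus = np.empty(n_examples, dtype=int)
        confidence = np.empty(n_examples)
        for i in range(n_examples):
            combined = model_weight * pred_probs[i]
            total_weight = model_weight
            for j, label in enumerate(annotator_labels[i]):
                if label >= 0:  # annotator j labeled example i
                    one_hot = np.zeros(K)
                    one_hot[int(label)] = 1.0
                    combined = combined + annotator_weights[j] * one_hot
                    total_weight += annotator_weights[j]
            combined = combined / total_weight  # ensemble distribution over the K classes
            consensus[i] = int(combined.argmax())
            confidence[i] = combined[consensus[i]]
        return consensus, confidence

    # Toy usage: 2 examples, 2 annotators, K = 3 classes.
    pred_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
    annotator_labels = np.array([[0, 1], [2, -1]])
    labels, scores = weighted_ensemble_consensus(
        pred_probs, annotator_labels,
        annotator_weights=np.array([0.8, 0.5]), model_weight=1.0)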
Motivations.
Illustrating what many real-world multi-annotator datasets look like, Figure 1 shows a disparity in annotator quality as well as many examples whose consensus label will be incorrect if we rely on majority vote (which is nonetheless often used in practice due to its straightforward appeal).
Unsurprisingly, consensus labels are more likely to be incor-
rect for those examples with fewer annotations. An effective
method to estimate consensus label quality should properly
account for the number of annotations an example has re-
ceived, as well as the quality of the annotators who selected
these labels. Many of the examples whose consensus label
is wrong merely have a single annotation, which provides
little information. Using a trained classifier can help us
better generalize to such examples to estimate their labels’
quality (especially if the data contain other examples with
similar feature values). When incorporating a classifier, we
1Code: https://github.com/cleanlab/cleanlab
Reproduce our results: https://github.com/cleanlab/multiannotator-benchmarks