
CROWDLAB: Supervised learning to infer consensus labels
and quality scores for data with multiple annotators
Hui Wen Goh 1   Ulyana Tkachenko 1   Jonas Mueller 1

1Cleanlab. Correspondence to: HWG <huiwen@cleanlab.ai>, UT <ulyana@cleanlab.ai>, JM <jonas@cleanlab.ai>.
Abstract
Real-world data for classification is often labeled
by multiple annotators. For analyzing such data,
we introduce CROWDLAB, a straightforward ap-
proach to utilize any trained classifier to estimate:
(1) A consensus label for each example that aggre-
gates the available annotations; (2) A confidence
score for how likely each consensus label is cor-
rect; (3) A rating for each annotator quantifying
the overall correctness of their labels. Existing
algorithms to estimate related quantities in crowd-
sourcing often rely on sophisticated generative
models with iterative inference. CROWDLAB
instead uses a simple weighted ensemble.
Existing algorithms often rely solely on annota-
tor statistics, ignoring the features of the exam-
ples from which the annotations derive. CROWD-
LAB utilizes any classifier model trained on these
features, and can thus better generalize between
examples with similar features. On real-world
multi-annotator image data, our proposed method
provides better estimates for (1)-(3) than existing
algorithms like Dawid-Skene and GLAD.
1. Introduction
Training data for multiclass classification are often labeled
by multiple annotators, with some redundancy between an-
notators to ensure high-quality labels. Such settings have
been studied in crowdsourcing research (Monarch, 2021b;
Paun et al., 2018). There it is often assumed that many anno-
tators have labeled each example (Carpenter, 2008; Khetan
et al., 2018), but this can be prohibitively expensive. This
paper considers general settings where each example in the
dataset is merely labeled by at least one annotator, and each
annotator labels many examples (but still only a subset of
the dataset). Each annotation corresponds to the selection
of one class $y \in \{1, \dots, K\}$ which the annotator believes to
be most appropriate for this example.
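To make this data format concrete, below is a minimal Python sketch of one common way to store such annotations as an examples-by-annotators table; the column names, class values, and use of NaN for missing annotations are our illustrative choices, not prescribed by the paper.

    import numpy as np
    import pandas as pd

    # Toy data: 5 examples, 3 annotators, K = 3 classes.
    # Each row is one example; NaN marks examples a given annotator did not label.
    # Every example has at least one annotation; every annotator labels only a subset.
    # Classes are 0-indexed here, corresponding to {1, ..., K} in the text.
    labels_multiannotator = pd.DataFrame(
        {
            "annotator_1": [0, 1, np.nan, 2, np.nan],
            "annotator_2": [0, np.nan, 2, 2, 1],
            "annotator_3": [np.nan, 1, 2, np.nan, np.nan],
        }
    )
    print(labels_multiannotator)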
Certain classification models can be trained in a special man-
ner to account for the multiple labels per example (Nguyen
et al., 2014; Peterson et al., 2019), but this is rarely done
in practical applications. A common approach is to aggre-
gate the labels for each example into a single consensus
label, e.g. via majority-vote or statistical crowdsourcing al-
gorithms (Dawid and Skene, 1979). Any classifier can then
be trained on these consensus labels via off-the-shelf code.
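As a hedged sketch of this common baseline pipeline (the -1 convention for missing annotations, the toy data, and the choice of logistic regression are ours for illustration only):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def majority_vote(annotator_labels: np.ndarray) -> np.ndarray:
        """Majority-vote consensus per example; annotator_labels has shape
        (n_examples, n_annotators) with -1 marking missing annotations."""
        consensus = np.empty(annotator_labels.shape[0], dtype=int)
        for i, row in enumerate(annotator_labels):
            votes = row[row >= 0]
            values, counts = np.unique(votes, return_counts=True)
            consensus[i] = values[np.argmax(counts)]  # ties broken toward smallest class index
        return consensus

    # Toy data: 6 examples with 2 features, 3 annotators, K = 3 classes.
    X = np.random.RandomState(0).randn(6, 2)
    A = np.array([[0, 0, -1],
                  [1, -1, 1],
                  [2, 2, 0],
                  [-1, 1, 1],
                  [0, -1, -1],
                  [2, 2, -1]])

    consensus_labels = majority_vote(A)
    clf = LogisticRegression().fit(X, consensus_labels)  # any off-the-shelf classifier
    pred_probs = clf.predict_proba(X)  # out-of-sample probabilities are preferable in practice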
Here we propose a method1 that leverages any already-
trained classifier to: (1) establish accurate consensus labels,
(2) estimate their quality, and (3) estimate the quality of
each annotator (Monarch, 2021c). The latter two aims help
us determine which data is least trustworthy and should per-
haps be verified via additional annotation (Bernhardt et al.,
2022).
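For readers who want to try this directly, the open-source implementation in the footnoted cleanlab repository exposes multi-annotator functionality roughly as sketched below; the function name, arguments, and returned keys reflect the library around the time of writing and may evolve, so treat this as a sketch under that assumption and consult the library documentation.

    from cleanlab.multiannotator import get_label_quality_multiannotator

    # API names per cleanlab at time of writing; check the current docs.
    # labels_multiannotator: (n_examples, n_annotators) DataFrame of class indices,
    #   with NaN wherever an annotator did not label an example.
    # pred_probs: (n_examples, K) predicted class probabilities from any trained
    #   classifier (ideally held-out / cross-validated predictions).
    results = get_label_quality_multiannotator(labels_multiannotator, pred_probs)

    results["label_quality"]    # (1)-(2): consensus label and its quality score per example
    results["annotator_stats"]  # (3): overall quality rating per annotator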
CROWDLAB
(Classifier Refinement Of croWD-
sourced LABels) is based on a straightforward weighted
ensemble of the classifier predictions and individual anno-
tations. Weights are assigned according to the (estimated)
trustworthiness of each annotator relative to the trained
classifier. CROWDLAB is easy to implement/understand,
computationally efficient (non-iterative), and extremely flex-
ible. It works with any classifier and training procedure, as
well as any classification dataset (including those containing
examples only labeled by one annotator).
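To convey the core idea before the formal description, here is a simplified sketch of such a weighted ensemble: the classifier's predicted class probabilities and a one-hot vector for each annotator's label are averaged with trustworthiness weights. How CROWDLAB actually estimates these weights (and refines the likelihoods) is specified later in the paper, so the weights below are treated as given inputs and all names are ours.

    import numpy as np

    def weighted_ensemble_consensus(pred_probs, annotator_labels,
                                    annotator_weights, model_weight):
        """Illustrative weighted ensemble of classifier predictions and annotations.

        pred_probs:        (n_examples, K) predicted class probabilities.
        annotator_labels:  (n_examples, n_annotators) class indices, -1 where missing.
        annotator_weights: (n_annotators,) estimated trustworthiness of each annotator.
        model_weight:      scalar trustworthiness of the classifier.
        Returns a consensus label and a confidence score for each example.
        """
        n_examples, K = pred_probs.shape
        consensus = np.empty(n_examples, dtype=int)
        confidence = np.empty(n_examples)
        for i in range(n_examples):
            combined = model_weight * pred_probs[i]
            total_weight = model_weight
            for j, label in enumerate(annotator_labels[i]):
                if label >= 0:  # annotator j labeled example i
                    one_hot = np.zeros(K)
                    one_hot[int(label)] = 1.0
                    combined = combined + annotator_weights[j] * one_hot
                    total_weight += annotator_weights[j]
            combined = combined / total_weight  # ensemble distribution over the K classes
            consensus[i] = int(combined.argmax())
            confidence[i] = combined[consensus[i]]
        return consensus, confidence

    # Toy usage: 2 examples, 2 annotators, K = 3 classes.
    pred_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
    annotator_labels = np.array([[0, 1], [2, -1]])
    labels, scores = weighted_ensemble_consensus(
        pred_probs, annotator_labels,
        annotator_weights=np.array([0.8, 0.5]), model_weight=1.0)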
Motivations.
Illustrating what many real-world multi-annotator datasets look like, Figure 1 shows a disparity in annotator quality as well as many examples whose consensus label will be incorrect if we rely on majority vote (which is nonetheless often used in practice due to its straightforward appeal).
Unsurprisingly, consensus labels are more likely to be incor-
rect for those examples with fewer annotations. An effective
method to estimate consensus label quality should properly
account for the number of annotations an example has re-
ceived, as well as the quality of the annotators who selected
these labels. Many of the examples whose consensus label
is wrong merely have a single annotation, which provides
little information. Using a trained classifier can help us
better generalize to such examples to estimate their labels’
quality (especially if the data contain other examples with
similar feature values). When incorporating a classifier, we
1Code: https://github.com/cleanlab/cleanlab
Reproduce our results: https://github.com/cleanlab/multiannotator-benchmarks