often asymmetric between true/false, and (2) the
error rate is as high as 40% for some attribute val-
ues. (See Section 3.)
• We propose a semi-automated workflow to clean existing annotations, and use it to create corrected MSO attribute values for CelebA. As part of this cleaning, we (1) identify a small number of unusable images that we propose be dropped from CelebA, and (2) identify images for which true/false cannot be assigned for a particular attribute, for which we propose introducing an "information not visible" value. (See Sections 4.1 and 4.2.)
• Comparing models learned using the original MSO
values versus our cleaned values, we show that
the models are substantially different, and that our
cleaned values enable a model that achieves state-
of-the-art accuracy on MSO. (See Section 4.3.)
2. Related work
There is a large literature in facial attribute analysis,
and several surveys give a broad coverage of the field
[1,2,26,31]. We cover only a few of the most relevant
works here.
CelebA was introduced by Liu et al. [16] in 2015
specifically to support research in deep learning for fa-
cial attributes. CelebA has 202,599 images grouped into
10,177 identities. Each image has 40 true/false attributes
– pointy nose, oval face, gray hair, wearing hat, etc. –
and five landmark locations. CelebA also has a recom-
mended subject-disjoint split into train, validation and
test. The creation of the attribute values is described
only as – “Each image in CelebA and LFWA is anno-
tated with forty facial attributes and five key points by a
professional labeling company” [16]. No description of
how the attribute values are created, or estimate of their
consistency or accuracy, is given.
Thom and Hand stated [26] that, “CelebA and LFWA
were the first (and only to date) large-scale datasets in-
troduced for the problem of facial attribute recognition
from images. Prior to CelebA and LFWA, no dataset
labeled with attributes was large enough to effectively
train deep neural networks.” In number of identities and
images, CelebA is substantially larger than LFWA, and
is the most-used research dataset in this area. In ad-
dition, Thom and Hand speculate that noise in the at-
tribute values may lead to an apparent plateau in re-
search progress [26] – “There is a recent plateau in facial
attribute recognition performance, which could be due
to poor labeling of data. . . . While crowdsourcing such
tasks can be very useful and result in large quantities of
reasonably labeled data, there are some tasks which may
be consistently labeled incorrectly ...”.
The only paper we are aware of to discuss cleaning
CelebA is [27]. They deal specifically with errors in the
identity groupings, and do not consider errors in the at-
tribute values. Compared to the original 202,599 images
of 10,177 identities, their identity-cleaned version has
197,477 images of 9,996 identities. In a simple manual
check of 100 identities in CelebA, we found a few addi-
tional instances of identity errors in the identity-cleaned
version of [27]. Because our work is focused on attribute
classification and not face matching, we start with the
original CelebA rather than the version of [27].
Terhörst et al. [25] earlier estimated the quality of annotations in CelebA by having three human evaluators manually check, for each attribute, 50 randomly selected positively-annotated and 50 negatively-annotated images. They claimed that, "Similar to LFW, there is a tendency that most of the wrong annotations are within the positives". In this paper, we provide a more statistical and systematic analysis of each attribute, and furthermore propose a possible approach to cleaning the dataset.
Motivated by comments in [8,26] on the importance
of correct labels for machine learning and errors encoun-
tered in CelebA attribute values, we present the first de-
tailed analysis of the accuracy of CelebA attribute val-
ues. We create a cleaned version of the MSO values, and
perform experiments to assess the impact of the original versus cleaned MSO attribute values, using AFFACT [8],
MOON [20], DenseNet [11] and ResNet [9]. We show
that the cleaned attribute values result in learning a more
coherent model that achieves higher accuracy.
3. Accuracy of Attributes In Training Data
We first examine the general consistency of manual annotations of face attributes. We then use duplicate faces in CelebA to examine the consistency of the attribute values distributed with CelebA, and finally we manually audit a sample of CelebA to estimate the accuracy of those values.
3.1. Consistency of Manual Annotations?
To estimate the level of consistency that can be ex-
pected in manual annotations of commonly used face
attributes, two annotators independently assigned val-
ues for each of 40 attributes of 1,000 images. The im-
ages were randomly selected from CelebA and should
be representative of web-scraped, in-the-wild celebrity
images. The annotators viewed the cropped, normalized
face images, with no knowledge of the other annotator’s
results.
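As a concrete illustration of how the per-attribute disagreement counts reported below can be computed, the following minimal Python sketch assumes each annotator's labels are stored in a CSV file with one row per image and one +1/-1 column per attribute; the file names and column layout are illustrative assumptions, not part of the CelebA distribution.

    # Count, for each attribute, how many of the 1,000 images the two
    # annotators disagree on. The CSV file names and layout below are
    # hypothetical: one row per image_id, one +1/-1 column per attribute.
    import pandas as pd

    ann_a = pd.read_csv("annotator_a.csv", index_col="image_id")
    ann_b = pd.read_csv("annotator_b.csv", index_col="image_id")

    # Align rows and columns so the comparison is image-by-image,
    # attribute-by-attribute.
    ann_b = ann_b.loc[ann_a.index, ann_a.columns]

    # Per-attribute disagreement counts, sorted from least to most.
    disagreements = (ann_a != ann_b).sum(axis=0).sort_values()
    print(disagreements)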
Table 1 lists, for each attribute, the number of images
for which the two annotations disagree. The least dis-
agreement was for eyeglasses, at just 3 images. The 3