Consistency and Accuracy of CelebA Attribute Values
Haiyu Wu1, Grace Bezold1, Manuel Günther2, Terrance Boult3,
Michael C. King4, Kevin W. Bowyer1
1University of Notre Dame, 2University of Zurich,
3University of Colorado Colorado Springs, 4Florida Institute of Technology
Abstract
We report the first systematic analysis of the experimental foundations of facial attribute classification. Two annotators independently assigning attribute values shows that only 12 of 40 common attributes are assigned values with 95% consistency, and three (high cheekbones, pointed nose, oval face) have essentially random consistency. Of 5,068 duplicate face appearances in CelebA, individual attributes have contradicting values on between 10 and 860 of the duplicates. A manual audit of a subset of CelebA estimates error rates as high as 40% for (no beard=false), even though the labeling consistency experiment indicates that no beard can be assigned with 95% consistency. Selecting mouth slightly open (MSO) for deeper analysis, we estimate the error rate for (MSO=true) at about 20% and for (MSO=false) at about 2%. A corrected version of the MSO attribute values enables learning a model that achieves higher accuracy than previously reported for MSO. Corrected values for CelebA MSO are available at https://github.com/HaiyuWu/CelebAMSO.
1. Introduction
Facial attributes have potential uses in face matching/recognition [3, 4, 12, 13, 17, 22], face image retrieval [15, 18], re-identification [21, 23, 24], training GANs [5, 6, 10, 14] for generation of synthetic images, analyzing AI biases [29, 30], and other areas. CelebA [16] is the largest and most widely used dataset in this research area. However, recent papers have described cleaning the identity groups in CelebA [27], and have suggested that the facial attribute annotations [8, 26] need some “cleaning”. This paper provides the first analysis of the consistency with which the commonly-used face attributes can be manually marked, and also of the quality of the attribute values distributed with the CelebA images. We also propose an auditing workflow to clean existing annotations, and demonstrate that using a corrected set of attribute values enables learning a substantially different and more accurate model.
Figure 1. Which Set of Attribute Values Enables Learning a Better Model? The lower left quadrant contains images with original (MSO=false) corrected to (MSO=true); the upper right quadrant contains images with original (MSO=true) corrected to (MSO=false).
Contributions of this work include:
• Analysis of independent manual annotation of the 40 commonly used face attributes shows that only 12 are labeled with 95% consistency and 3 have random (50%) consistency. (See Section 3.)
• For the 12 attributes that we determined can be labeled with 95% consistency across annotators, we audit the attribute values provided with the CelebA images and find that (1) the error rate is often asymmetric between true/false, and (2) the error rate is as high as 40% for some attribute values. (See Section 3, and the sketch after this list.)
• We propose a semi-automated workflow to clean existing annotations, and use it to create corrected MSO attribute values for CelebA. As part of this correction/cleaning, we (1) identify a small number of images that are unusable and propose that they be dropped from CelebA, and (2) identify images for which true/false cannot be assigned for a particular attribute, so that an “information not visible” value must be introduced. (See Sections 4.1 and 4.2.)
• Comparing models learned using the original MSO values versus our cleaned values, we show that the models are substantially different, and that our cleaned values enable a model that achieves state-of-the-art accuracy on MSO. (See Section 4.3.)
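To make the true/false asymmetry concrete, the following is a minimal Python sketch of how per-class error rates could be computed from an audited sample; the function and data-structure names are illustrative assumptions, not our released tooling.

```python
# Hedged sketch: given an attribute's original CelebA labels and audited
# ground truth over the same sample, compute separate error rates for
# images originally labeled true vs. false.

def asymmetric_error_rates(original, audited):
    """original, audited: {filename: bool} maps for one attribute."""
    true_imgs = [f for f, v in original.items() if v]
    false_imgs = [f for f, v in original.items() if not v]
    # Fraction of originally-true labels the audit overturned, and vice versa.
    err_true = sum(not audited[f] for f in true_imgs) / len(true_imgs)
    err_false = sum(audited[f] for f in false_imgs) / len(false_imgs)
    return err_true, err_false  # e.g. roughly 0.20 vs. 0.02 for MSO (Abstract)
```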
2. Related Work
There is a large literature on facial attribute analysis, and several surveys give broad coverage of the field [1, 2, 26, 31]. We cover only a few of the most relevant works here.
CelebA was introduced by Liu et al. [16] in 2015 specifically to support research in deep learning for facial attributes. CelebA has 202,599 images grouped into 10,177 identities. Each image has 40 true/false attributes (pointy nose, oval face, gray hair, wearing hat, etc.) and five landmark locations. CelebA also has a recommended subject-disjoint split into train, validation, and test sets. The creation of the attribute values is described only as: “Each image in CelebA and LFWA is annotated with forty facial attributes and five key points by a professional labeling company” [16]. No description of how the attribute values were created, or estimate of their consistency or accuracy, is given.
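For reference, a minimal Python sketch of parsing the attribute file distributed with CelebA follows; it assumes the documented layout of list_attr_celeba.txt in the official release.

```python
# Layout of list_attr_celeba.txt: line 1 is the image count, line 2 the
# 40 attribute names, each remaining line "<filename> <40 values in {-1,1}>".

def load_celeba_attributes(path="list_attr_celeba.txt"):
    with open(path) as f:
        n_images = int(f.readline())        # 202599 for the full dataset
        attr_names = f.readline().split()   # the 40 attribute names
        labels = {}
        for line in f:
            parts = line.split()
            # Map -1/+1 strings to False/True, keyed by attribute name.
            labels[parts[0]] = {a: v == "1" for a, v in zip(attr_names, parts[1:])}
    assert len(labels) == n_images
    return attr_names, labels
```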
Thom and Hand stated [26] that “CelebA and LFWA were the first (and only to date) large-scale datasets introduced for the problem of facial attribute recognition from images. Prior to CelebA and LFWA, no dataset labeled with attributes was large enough to effectively train deep neural networks.” In number of identities and images, CelebA is substantially larger than LFWA, and is the most-used research dataset in this area. In addition, Thom and Hand speculate that noise in the attribute values may lead to an apparent plateau in research progress [26]: “There is a recent plateau in facial attribute recognition performance, which could be due to poor labeling of data. ... While crowdsourcing such tasks can be very useful and result in large quantities of reasonably labeled data, there are some tasks which may be consistently labeled incorrectly ...”.
The only paper we are aware of that discusses cleaning CelebA is [27]. It deals specifically with errors in the identity groupings, and does not consider errors in the attribute values. Compared to the original 202,599 images of 10,177 identities, the identity-cleaned version has 197,477 images of 9,996 identities. In a simple manual check of 100 identities in CelebA, we found a few additional instances of identity errors in the identity-cleaned version of [27]. Because our work is focused on attribute classification and not face matching, we start with the original CelebA rather than the version of [27].
Terhörst et al. [25] were the first to estimate the quality of annotations in CelebA, by having three human evaluators manually evaluate 50 randomly selected positively-annotated and 50 randomly selected negatively-annotated images for each attribute. They claimed that “Similar to LFW, there is a tendency that most of the wrong annotations are within the positives”. In this paper, we provide a more statistically grounded and systematic analysis of each attribute, and we furthermore propose a possible solution to clean the dataset.
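As a point of comparison, a per-attribute audit sample in the style of [25] could be drawn as in the hedged sketch below; the sample sizes and seed are illustrative, and `labels` is the map produced by the parsing sketch above.

```python
import random

# Draw 50 positively- and 50 negatively-annotated images per attribute,
# at random, for manual review.

def sample_for_audit(attr_names, labels, per_class=50, seed=0):
    rng = random.Random(seed)
    audit = {}
    for attr in attr_names:
        pos = [img for img, vals in labels.items() if vals[attr]]
        neg = [img for img, vals in labels.items() if not vals[attr]]
        audit[attr] = {"positive": rng.sample(pos, per_class),
                       "negative": rng.sample(neg, per_class)}
    return audit
```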
Motivated by comments in [8, 26] on the importance of correct labels for machine learning, and by errors encountered in CelebA attribute values, we present the first detailed analysis of the accuracy of CelebA attribute values. We create a cleaned version of the MSO values, and perform experiments to assess the impact of the original versus cleaned MSO attribute values, using AFFACT [8], MOON [20], DenseNet [11], and ResNet [9]. We show that the cleaned attribute values result in learning a more coherent model that achieves higher accuracy.
3. Accuracy of Attributes In Training Data
We first examine the general consistency of manual annotations of face attributes. Then we use duplicate faces in CelebA to examine the consistency of the attribute values distributed with CelebA. Finally, we manually audit a sample of CelebA to estimate the accuracy of its attribute values.
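To illustrate the duplicate-based check, the sketch below counts, per attribute, how many duplicate pairs carry contradicting values. The `duplicate_pairs` input is an assumed list of filename pairs covering the 5,068 duplicate appearances; this is an illustrative sketch, not our exact procedure.

```python
from collections import Counter

# Per attribute, count duplicate image pairs whose CelebA-provided values
# contradict each other. `labels` is the map parsed from the attribute file.

def count_contradictions(attr_names, labels, duplicate_pairs):
    contradictions = Counter()
    for img_a, img_b in duplicate_pairs:
        for attr in attr_names:
            if labels[img_a][attr] != labels[img_b][attr]:
                contradictions[attr] += 1
    return contradictions
```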
3.1. Consistency of Manual Annotations?
To estimate the level of consistency that can be expected in manual annotations of commonly used face attributes, two annotators independently assigned values for each of the 40 attributes on 1,000 images. The images were randomly selected from CelebA and should be representative of web-scraped, in-the-wild celebrity images. The annotators viewed the cropped, normalized face images, with no knowledge of the other annotator's results.
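The disagreement tabulation is straightforward; a minimal sketch follows, assuming each annotator's results are stored as a {filename: {attribute: bool}} map over the same 1,000 images.

```python
# Count, per attribute, the images on which the two annotators disagree;
# this yields the kind of per-attribute counts listed in Table 1.

def disagreement_counts(attr_names, ann_a, ann_b):
    counts = {attr: 0 for attr in attr_names}
    for img in ann_a:
        for attr in attr_names:
            if ann_a[img][attr] != ann_b[img][attr]:
                counts[attr] += 1
    return counts  # e.g. counts["Eyeglasses"] is 3 in our experiment
```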
Table 1 lists, for each attribute, the number of images for which the two annotations disagree. The least disagreement was for eyeglasses, at just 3 images. The 3