often asymmetric between true/false, and (2) the
error rate is as high as 40% for some attribute val-
ues. (See Section 3.)
• We propose a semi-automated workflow to clean existing annotations, and use it to create corrected MSO attribute values for CelebA. As part of this cleaning, we (1) identify a small number of unusable images that we propose be dropped from CelebA, and (2) identify images for which true/false cannot be assigned for a particular attribute, for which we propose introducing an "information not visible" value. (See Sections 4.1 and 4.2.)
• Comparing models learned using the original MSO
values versus our cleaned values, we show that
the models are substantially different, and that our
cleaned values enable a model that achieves state-
of-the-art accuracy on MSO. (See Section 4.3.)
2. Related work
There is a large literature in facial attribute analysis,
and several surveys give a broad coverage of the field
[1,2,26,31]. We cover only a few of the most relevant
works here.
CelebA was introduced by Liu et al. [16] in 2015
specifically to support research in deep learning for fa-
cial attributes. CelebA has 202,599 images grouped into
10,177 identities. Each image has 40 true/false attributes
– pointy nose, oval face, gray hair, wearing hat, etc. –
and five landmark locations. CelebA also has a recom-
mended subject-disjoint split into train, validation and
test. The creation of the attribute values is described
only as – “Each image in CelebA and LFWA is anno-
tated with forty facial attributes and five key points by a
professional labeling company” [16]. No description of
how the attribute values are created, or estimate of their
consistency or accuracy, is given.
Thom and Hand stated [26] that, “CelebA and LFWA
were the first (and only to date) large-scale datasets in-
troduced for the problem of facial attribute recognition
from images. Prior to CelebA and LFWA, no dataset
labeled with attributes was large enough to effectively
train deep neural networks.” In number of identities and
images, CelebA is substantially larger than LFWA, and
is the most-used research dataset in this area. In ad-
dition, Thom and Hand speculate that noise in the at-
tribute values may lead to an apparent plateau in re-
search progress [26] – “There is a recent plateau in facial
attribute recognition performance, which could be due
to poor labeling of data. . . . While crowdsourcing such
tasks can be very useful and result in large quantities of
reasonably labeled data, there are some tasks which may
be consistently labeled incorrectly ...”.
The only paper we are aware of to discuss cleaning
CelebA is [27]. They deal specifically with errors in the
identity groupings, and do not consider errors in the at-
tribute values. Compared to the original 202,599 images
of 10,177 identities, their identity-cleaned version has
197,477 images of 9,996 identities. In a simple manual
check of 100 identities in CelebA, we found a few addi-
tional instances of identity errors in the identity-cleaned
version of [27]. Because our work is focused on attribute
classification and not face matching, we start with the
original CelebA rather than the version of [27].
Terhörst et al. [25] earlier estimated the quality of annotations in CelebA by having three human evaluators manually check, for each attribute, 50 randomly selected positively-annotated and 50 negatively-annotated images. They claimed that, "Similar to LFW, there is a tendency that most of the wrong annotations are within the positives". In this paper, we provide a more statistical and systematic analysis of each attribute, and furthermore propose a possible approach to cleaning the dataset.
Motivated by comments in [8,26] on the importance
of correct labels for machine learning and errors encoun-
tered in CelebA attribute values, we present the first de-
tailed analysis of the accuracy of CelebA attribute val-
ues. We create a cleaned version of the MSO values, and
perform experiments to assess the impact of the original versus cleaned MSO attribute values, using AFFACT [8],
MOON [20], DenseNet [11] and ResNet [9]. We show
that the cleaned attribute values result in learning a more
coherent model that achieves higher accuracy.
3. Accuracy of Attributes In Training Data
We first examine the general consistency of manual annotations of face attributes. We then use duplicate faces in CelebA to examine the consistency of the attribute values distributed with CelebA, and finally we manually audit a sample of CelebA to estimate the accuracy of those values.
3.1. Consistency of Manual Annotations?
To estimate the level of consistency that can be ex-
pected in manual annotations of commonly used face
attributes, two annotators independently assigned val-
ues for each of 40 attributes of 1,000 images. The im-
ages were randomly selected from CelebA and should
be representative of web-scraped, in-the-wild celebrity
images. The annotators viewed the cropped, normalized
face images, with no knowledge of the other annotator’s
results.
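As a concrete illustration of how the per-attribute disagreement counts reported below can be computed, the following minimal Python sketch assumes each annotator's labels are stored in a CSV file with one row per image and one +1/-1 column per attribute; the file names and column layout are illustrative assumptions, not part of the CelebA distribution.

    # Count, for each attribute, how many of the 1,000 images the two
    # annotators disagree on. The CSV file names and layout below are
    # hypothetical: one row per image_id, one +1/-1 column per attribute.
    import pandas as pd

    ann_a = pd.read_csv("annotator_a.csv", index_col="image_id")
    ann_b = pd.read_csv("annotator_b.csv", index_col="image_id")

    # Align rows and columns so the comparison is image-by-image,
    # attribute-by-attribute.
    ann_b = ann_b.loc[ann_a.index, ann_a.columns]

    # Per-attribute disagreement counts, sorted from least to most.
    disagreements = (ann_a != ann_b).sum(axis=0).sort_values()
    print(disagreements)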
Table 1 lists, for each attribute, the number of images
for which the two annotations disagree. The least dis-
agreement was for eyeglasses, at just 3 images. The 3