different mobile devices. WMCA [32] contains 72 subjects and information is captured in RGB, depth, infrared, and thermal. CASIA-SURF [33] consists of 1,000 subjects captured in RGB, depth, and infrared. The first face anti-spoofing database to include explicit ethnic labels was CASIA-SURF CeFA [34], which has 1,607 subjects of three ethnicities, captured in three modalities. In this paper, for bias analysis we use RFW [28], which includes four ethnicities: Caucasian, Asian, Indian, and African. RFW does not specialise in face anti-spoofing, but it is more widely used in the bias analysis literature.
2.3 Bias in machine learning
In [35], several high profile cases of machine learning bias are documented: Google search results appeared to be biased towards women in 2015; Hewlett-Packard’s software for web cameras struggled to recognize dark skin tones; and Nikon’s camera software inaccurately identified Asian people as blinking.
Thus, given also the ethical, legal, and regulatory issues associated with the problem of bias within human populations, there is a considerable amount of research on the subject, especially in face recognition (FR). A recent comprehensive survey can be found in [36], where the significant sources of bias [37, 38] are categorised and discussed, and the negative effect of bias on downstream learning tasks is pointed out. We also note that while the current deep learning based FR algorithms are under intense scrutiny for potential bias [39], this is due to their wider deployment in real life applications, rather than to any evidence that they are more biased than traditional approaches.
In one of the earliest studies of bias in FR, predating deep learning, [40] reported differences in performance on faces of Caucasian and East Asian descent between algorithms developed in Western countries and in East Asia. In [41], several deep learning based FR algorithms are analysed and a small amount of bias is detected in all of them. The authors then show how this bias can be exploited to enhance the power of malicious morphing attacks against FR based security systems.
In [42], the authors compute cluster validation measures on the clusters of the various demographics inside the whole population, aiming at measuring the algorithm’s potential for bias. Their result is negative, and they argue that more sophisticated clustering approaches are needed. We note that, in our paper, an investigation of the potential for bias in the latent space, conducted by measuring the discriminative power of SVMs over the various ethnicities, returned a similarly negative result. In [43], the aim is the detection of bias by analysing the activation ratios at the various layers of the network. Similarly to our work, their target application is the detection of race bias in a binary classification problem, gender classification in their case. Their result is positive, in that they report a correlation between the measured activation ratios and bias in the final outcomes of the classifier. However, it is not clear whether their method can be used to measure and assess the statistical significance of the expected bias.
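For illustration, the following is a minimal Python sketch of one way such a latent-space probe can be implemented; it is not the exact procedure used later in the paper, and the embedding and label arrays are hypothetical placeholders. A cross-validated accuracy close to chance would indicate that the ethnicities are not linearly separable in the latent space.

```python
# Minimal sketch: probing a latent space for ethnic separability with a linear SVM.
# `embeddings` (N x D) and `ethnicity_labels` (length N) are hypothetical arrays assumed
# to have been extracted from the face anti-spoofing network beforehand.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def ethnic_separability(embeddings: np.ndarray, ethnicity_labels: np.ndarray) -> float:
    """Mean cross-validated accuracy of an SVM predicting ethnicity from embeddings."""
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
    return float(cross_val_score(clf, embeddings, ethnicity_labels, cv=5).mean())

if __name__ == "__main__":
    # Toy example with random embeddings: accuracy should stay near chance level
    # (1 / number of classes), i.e. no ethnic structure is detectable.
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(400, 128))
    ethnicity_labels = rng.integers(0, 4, size=400)  # four ethnicities, as in RFW
    print(f"Cross-validated ethnicity accuracy: {ethnic_separability(embeddings, ethnicity_labels):.3f}")
```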
In Cavazos et al. [44], similarly to our approach, most of the analysis assumes a one-sided error cost, in their case the false acceptance rate, and the decision thresholds are treated as user-defined variables. However, the analytical tools they use, mostly visual inspection of ROC curves, do not allow for a deep study of the distributions of the similarity scores, whereas here we give a more detailed analysis of the distribution of the responses, which are the equivalent of the similarity scores. In Pereira and Marcel [45], a fairness metric is proposed which can be optimised over the decision thresholds, but again there is no in-depth statistical analysis of the scores, as we do here for the responses, and thus they offer more limited insight.
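To make the role of user-defined thresholds concrete, the sketch below computes per-demographic false acceptance rates from impostor similarity scores at a single shared threshold; the group names and score distributions are illustrative placeholders rather than results from [44] or from our experiments.

```python
# Minimal sketch: per-demographic false acceptance rate (FAR) at a user-defined threshold.
# `impostor_scores` maps each demographic group to the similarity scores of its impostor
# pairs; the contents below are illustrative placeholders, not real data.
import numpy as np

def far_per_group(impostor_scores: dict[str, np.ndarray], threshold: float) -> dict[str, float]:
    """FAR = fraction of impostor pairs whose similarity score exceeds the threshold."""
    return {group: float(np.mean(scores > threshold))
            for group, scores in impostor_scores.items()}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    impostor_scores = {
        "Caucasian": rng.normal(0.30, 0.10, 5000),
        "Asian":     rng.normal(0.33, 0.10, 5000),
        "Indian":    rng.normal(0.31, 0.10, 5000),
        "African":   rng.normal(0.34, 0.10, 5000),
    }
    # A single, user-defined threshold is applied to every group; differences in the
    # resulting FARs are what a bias analysis of this kind inspects.
    print(far_per_group(impostor_scores, threshold=0.5))
```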
2.3.1 Bias in Presentation Attack Detection
The literature on bias in presentation attack detection is more sparse. Race bias was the key theme of the face anti-spoofing algorithm competition on the CASIA-SURF CeFA database [46]. Bias was assessed by the performance of the algorithms under a cross-ethnicity validation scenario, and standard performance metrics, such as APCER, BPCER and ACER, were reported. In [47], the standard CNN models ResNet-50 and VGG16 were compared for gender bias against the debiasing-VAE proposed in [48], and several performance metrics were reported. A recent white paper by the ID R&D company, which develops face anti-spoofing software, reports the results of a large-scale bias assessment experiment conducted by Bixelab, a NIST-accredited independent laboratory [1]. Similarly to our approach, they focus on the bona fide errors, and their aim is for the BPCER error metric to remain below a prespecified threshold across all demographics.
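A minimal sketch of such an evaluation criterion is given below; it computes APCER, BPCER, and ACER per demographic group and checks whether every group’s BPCER stays below a prespecified bound. The arrays, group names, and the bound are illustrative placeholders, not the values used in [1].

```python
# Minimal sketch: per-demographic APCER / BPCER / ACER and a BPCER bound check.
# `labels` are ground truth (1 = attack, 0 = bona fide), `predictions` are the classifier's
# decisions in the same encoding, and `groups` holds the demographic label of each sample.
# All arrays below are illustrative placeholders.
import numpy as np

def pad_metrics(labels: np.ndarray, predictions: np.ndarray) -> dict[str, float]:
    attacks = labels == 1
    bona_fide = labels == 0
    apcer = float(np.mean(predictions[attacks] == 0))    # attacks accepted as bona fide
    bpcer = float(np.mean(predictions[bona_fide] == 1))  # bona fide rejected as attacks
    return {"APCER": apcer, "BPCER": bpcer, "ACER": (apcer + bpcer) / 2}

def bpcer_within_bound(labels, predictions, groups, bound: float = 0.01) -> bool:
    """True only if every demographic group's BPCER is below the prespecified bound."""
    for group in np.unique(groups):
        mask = groups == group
        metrics = pad_metrics(labels[mask], predictions[mask])
        print(f"{group}: {metrics}")
        if metrics["BPCER"] >= bound:
            return False
    return True

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, 2000)
    predictions = np.where(rng.random(2000) < 0.98, labels, 1 - labels)  # toy 2%-error classifier
    groups = rng.choice(["Caucasian", "Asian", "Indian", "African"], 2000)
    print("All groups within bound:", bpcer_within_bound(labels, predictions, groups, bound=0.05))
```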
Regarding other biometric identification modalities, [49] studied gender bias in iris PAD algorithms. They reported three error metrics, APCER, BPCER, and HTER, finding that female users would be less protected against iris presentation attacks.