influenced by a third unknown variable, leading to confounding bias, or when the observation setting favors one class over the other, leading to selection bias. The sources of this bias may be related to observation tools, methods or external factors, as will be pointed out later.
A third interesting case concerns the bias induced by the model itself, which is often referred to as inductive bias: $\mathcal{L}(\hat{Y} \mid X, Y, A) \neq \mathcal{L}(\hat{Y} \mid X, Y)$. This opposes the world created by the algorithm – i.e. the distribution of the algorithm outputs – to the original data. From a different point of view, bias can also arise when the different categories of the algorithm outputs differ from the categories as originally labeled in the dataset – i.e. $\mathcal{L}(Y \mid \hat{Y}, X, A) \neq \mathcal{L}(Y \mid \hat{Y}, X)$ – a condition that is often referred to as lack of sufficiency.
Finally, the two previous conditions can also be formulated by considering the distribution of the algorithm prediction errors and their variability with respect to the sensitive variable: $\mathcal{L}(\ell(Y, \hat{Y}) \mid X, A) \neq \mathcal{L}(\ell(Y, \hat{Y}) \mid X)$, where $(\hat{Y}, Y) \mapsto \ell(\hat{Y}, Y)$ is the loss function measuring the error incurred by the algorithm by forecasting $\hat{Y}$ in place of $Y$.
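To make these definitions more concrete, the short Python sketch below compares the empirical distribution of the 0-1 loss across the groups defined by a binary sensitive variable $A$, on purely synthetic data. It only looks at marginal error rates, ignoring the conditioning on $X$, so it should be read as a coarse illustration of the last criterion rather than a full test of it.

```python
# Minimal sketch (synthetic data): comparing the empirical distribution of the
# prediction errors across the groups defined by a binary sensitive variable A,
# as a coarse proxy for L(l(Y, Y_hat) | X, A) != L(l(Y, Y_hat) | X).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels, predictions and binary sensitive attribute.
y_true = rng.integers(0, 2, size=1000)
y_pred = np.where(rng.random(1000) < 0.85, y_true, 1 - y_true)  # ~15% errors
a = rng.integers(0, 2, size=1000)                               # sensitive variable

loss = (y_true != y_pred).astype(float)  # 0/1 loss l(Y, Y_hat)

# Per-group mean loss: equal values are a necessary (not sufficient) condition
# for the error distribution to be independent of A.
for group in np.unique(a):
    print(f"group A={group}: mean error = {loss[a == group].mean():.3f}")

# Gap between groups as a simple disparity score.
gap = abs(loss[a == 0].mean() - loss[a == 1].mean())
print(f"error-rate gap across A: {gap:.3f}")
```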
2.2 Potential causes of bias in Computer Vision
In practice, the above described situations may materialize through different causes in image datasets.
2.2.1 Improperly sampled training data
First, the bias may come from the data themselves, in the sense that the distribution of the training data is not the
ideal distribution that would reflect the desired behavior that we want to learn. Compared with tabular data, image
datasets can be difficult to collect, store and manipulate due to their considerable size and storage requirements. Hence, many of them have proven to lack diversity – e.g. because not all regions are studied (geographic
diversity), or not all sub-population samples are uniformly collected (gender or racial diversity). The growing use of
facial recognition algorithms in a wide range of areas affecting our society is currently debated. Indeed, they have been shown to be a source of racial [28, 29] or gender [30] discrimination. Besides, well-known datasets such as CelebA [31], Open Images [32] or ImageNet [1] lack diversity – as shown in [33] or [34] – resulting in imbalanced samples. Thus, state-of-the-art algorithms are unable to yield uniform performance over all sub-populations. A similar lack of diversity appears in the newly created Metaverse, as pointed out in [35], creating racial bias. This encouraged several researchers to design datasets that do not suffer from these drawbacks – i.e. preserving diversity – as illustrated by the Pilot Parliament Benchmark (PPB) dataset [36], in [37], or by the Fairface dataset [38].
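As a purely illustrative sketch (the metadata fields shown are hypothetical and not taken from the datasets cited above), such a diversity audit can start by simply counting how the images are distributed over sub-populations before any model is trained:

```python
# Minimal sketch (hypothetical attribute names): auditing how a labeled image
# dataset is distributed over sub-populations before training.
from collections import Counter

# Each record would typically hold an image path plus demographic metadata;
# the keys "region" and "gender" are illustrative, not from a real dataset.
metadata = [
    {"path": "img_0001.jpg", "region": "Europe", "gender": "F"},
    {"path": "img_0002.jpg", "region": "Europe", "gender": "M"},
    {"path": "img_0003.jpg", "region": "Africa", "gender": "F"},
    # ... rest of the dataset
]

counts = Counter((m["region"], m["gender"]) for m in metadata)
total = sum(counts.values())
for subgroup, n in counts.most_common():
    print(f"{subgroup}: {n} images ({100 * n / total:.1f}%)")
```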
Combining diverse databases to achieve sufficient accuracy in all sub-populations is even more critical for high-stakes systems, like those commonly used in Medicine. The fact that medical cohorts and longitudinal databases suffer from biases has long been acknowledged in medical studies. The situation is even more complex in medical image analysis for specialties such as radiology (National Lung Screening Trial, MIMIC-CXR-JPG [39], CheXpert [40]) or dermatology (melanoma detection for skin cancer, HAM10000 database [41]), where biased datasets are provided for medical applications. Indeed, under-represented populations in some datasets lead to a critical drop in accuracy, for instance in skin cancer detection [42, 43], or in medical research more generally ([44] and references therein).
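A common first response to such under-representation, sketched below on synthetic group labels (the skin-tone categories are illustrative and not those of the cited studies), is to re-sample the training data with inverse-frequency weights so that each sub-population contributes comparably:

```python
# Minimal sketch (synthetic example): over-sampling an under-represented
# subgroup so that each group contributes roughly equally to training.
import numpy as np

rng = np.random.default_rng(0)
groups = np.array(["light"] * 900 + ["dark"] * 100)  # hypothetical skin-tone labels

# Inverse-frequency weights: rare groups get proportionally higher weight.
values, counts = np.unique(groups, return_counts=True)
freq = dict(zip(values, counts))
weights = np.array([1.0 / freq[g] for g in groups])
weights /= weights.sum()

# Resampled training indices now contain roughly balanced groups.
idx = rng.choice(len(groups), size=1000, replace=True, p=weights)
print({str(v): int((groups[idx] == v).sum()) for v in values})
```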
The captioning of images is a relevant example where a lack of diversity hampers the quality of the algorithms' predictions, and may result in biased forecasts, as pointed out in [45] or in [46]. Therefore, it is of utmost importance to include diversity (e.g. geographic, social) when building image datasets that will be used as reference benchmarks to build and test the efficiency of computer vision algorithms.
2.2.2 Spurious correlations and external factors
The context in which the data are collected can also create spurious correlations between groups of images. Different acquisition situations may provide different contextual information that can generate systematic artifacts in specific kinds of images. For instance, confounding variables such as the snowy background in the Wolves versus Huskies example of [20] (see Section 1) may add bias to algorithmic decisions. In this case, different objects in images may share similar features due to the presence of a similar context, such as the background color, which can play an important role in the classification task due to spurious correlations. We refer to [47] for more references. This phenomenon is also well known in biology, where spectroscopy data are highly influenced by the fluorescence methods, as highlighted in [48], which makes machine learning difficult to use without correcting the bias. Different biases related to different measurement instruments are also described for medical data in [49].
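The shortcut mechanism behind such spurious correlations can be reproduced on a toy example: in the synthetic sketch below (which is not the actual experiment of [20]), a rule that only thresholds the mean background intensity separates the two classes perfectly, even though it never looks at the object itself.

```python
# Minimal sketch (synthetic toy data): a "classifier" that only looks at mean
# background intensity separates two classes perfectly when class and
# background are spuriously correlated, illustrating shortcut learning.
import numpy as np

rng = np.random.default_rng(0)

def make_image(bright_background: bool) -> np.ndarray:
    """Return an 8x8 grayscale image: a background level plus a random 'object'."""
    background = 0.9 if bright_background else 0.2
    img = np.full((8, 8), background) + 0.05 * rng.standard_normal((8, 8))
    img[3:5, 3:5] = rng.random()  # the actual object, uninformative here
    return img

# Class 0 is always photographed on a bright background, class 1 on a dark one.
images = [make_image(bright_background=(label == 0)) for label in range(2) for _ in range(50)]
labels = np.array([0] * 50 + [1] * 50)

# Shortcut rule: threshold the mean pixel value (i.e., the background).
preds = np.array([0 if img.mean() > 0.5 else 1 for img in images])
print("accuracy using background only:", (preds == labels).mean())
```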
An external factor can also induce biases and shift the distributions. It is important to note that all images are acquired using sensors and pre-processed afterwards, which may introduce defects into the images. In addition, their storage may require compressing the information they contain in many different ways. All this makes for a type of data with considerable variability depending on the quality of the sensors, the pre-processing pipeline and the compression method.
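As a small illustration of the storage aspect (using Pillow on an arbitrary synthetic image), re-encoding the same image at different JPEG quality levels already shifts its pixel statistics, which can translate into distribution shifts between data sources:

```python
# Minimal sketch: the same image saved at different JPEG quality levels drifts
# in its pixel statistics, one of the acquisition/storage factors that can
# shift distributions between data sources. Requires NumPy and Pillow.
import io
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
original = (rng.random((64, 64, 3)) * 255).astype(np.uint8)

for quality in (95, 50, 10):
    buffer = io.BytesIO()
    Image.fromarray(original).save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    decoded = np.asarray(Image.open(buffer), dtype=np.float64)
    mae = np.abs(decoded - original).mean()
    print(f"quality={quality}: mean absolute pixel difference = {mae:.2f}")
```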