
Estimating the Contamination Factor’s Distribution in Unsupervised Anomaly Detection
performance (Fourure et al., 2021; Emmott et al., 2015)
and reduces the trust in the detection system. If such an
estimate were coupled with a measure of uncertainty, one
could take this uncertainty into account to improve decisions.
Although existing methods propose Bayesian anomaly detectors (Shen & Cooper, 2010; Roberts et al., 2019; Hou et al., 2022; Heard et al., 2010), none of them study how to transform scores into hard predictions.
Therefore, we are the first to study the estimation of the contamination factor from a Bayesian perspective. We propose γGMM, the first algorithm for estimating the contamination factor's (posterior) distribution in unlabeled anomaly detection setups. First, we use a set of unsupervised anomaly detectors to assign anomaly scores to all samples and use these scores as a new representation of the data. Second, we fit a Bayesian Gaussian Mixture model with a Dirichlet Process prior (DPGMM) (Ferguson, 1973; Rasmussen, 1999) in this new space. If we knew which components contain the anomalies, we could derive the contamination factor's posterior distribution as the distribution of the sum of such components' weights. Because we do not know this, as a third step γGMM estimates the probability that the k most extreme components are jointly anomalous, and uses this information to construct the desired posterior. The method is explained in detail in Section 3.
In summary, we make four contributions. First, we adopt a Bayesian perspective and introduce the problem of estimating the contamination factor's posterior distribution. Second, we propose an algorithm that is able to sample from this posterior. Third, we demonstrate experimentally that the implied uncertainty-aware predictions are well calibrated and that taking the posterior mean as a point estimate of γ outperforms several other algorithms on common benchmarks. Finally, we show that using the posterior mean as a threshold improves the actual anomaly detection accuracy.
2. Preliminaries
Let (Ω, F, P) be a probability space, and X: Ω → R^d a random variable, from which a dataset D = {X_1, . . . , X_N} of N random examples is drawn. Assume that X has a distribution of the form P = (1 − γ)·P_1 + γ·P_2, where P_1 and P_2 are the distributions on R^d corresponding to normal examples and anomalies, respectively, and γ ∈ [0, 1] is the contamination factor, i.e. the proportion of anomalies. An (unsupervised) anomaly detector is a measurable function f: R^d → R that assigns real-valued anomaly scores f(X) to the examples. Such anomaly scores follow the rule that the higher the score, the more anomalous the example.
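As a toy illustration of this setup (the specific distributions and detector below are invented for illustration, not taken from the paper), one can sample a dataset from a mixture P = (1 − γ)·P_1 + γ·P_2 and score it with a simple detector f that follows the "higher is more anomalous" convention:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of P = (1 - gamma) * P1 + gamma * P2:
# normals P1 ~ N(0, I), anomalies P2 ~ N(5, I), with gamma = 0.05.
gamma, N, d = 0.05, 1000, 2
is_anomaly = rng.random(N) < gamma
X = np.where(is_anomaly[:, None],
             rng.normal(5.0, 1.0, (N, d)),   # anomalies from P2
             rng.normal(0.0, 1.0, (N, d)))   # normals from P1

# A trivial detector f: R^d -> R ("the higher, the more anomalous"):
# here, distance from the origin, i.e. from the mode of P1.
scores = np.linalg.norm(X, axis=1)

# Anomalies should receive larger scores on average.
print(scores[is_anomaly].mean() > scores[~is_anomaly].mean())  # True
```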
A Gaussian mixture model (GMM) with K components (see e.g. Roberts et al. (1998)) is a generative model defined by a distribution on a space R^M such that p(s) = Σ_{k=1}^{K} π_k N(s | μ_k, Σ_k) for s ∈ R^M, where N(s | μ_k, Σ_k) denotes the Gaussian distribution with mean vector μ_k and covariance matrix Σ_k ∈ R^{M×M}, and π_k are the mixing proportions such that Σ_{k=1}^{K} π_k = 1. For finite mixtures, we typically have a Dirichlet prior over π = [π_1, . . . , π_K], but Dirichlet Process (DP) priors allow treating also the number of components as unknown (Görür & Rasmussen, 2010). For both cases, we need approximate inference to estimate the posterior of the model parameters.
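As a concrete sketch of such approximate inference, scikit-learn's BayesianGaussianMixture implements a truncated stick-breaking (variational) approximation to the DP mixture; the data and hyperparameters below are illustrative only:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters as toy data in the mixture space.
S = np.vstack([rng.normal(0.0, 0.5, (300, 2)),
               rng.normal(4.0, 0.5, (100, 2))])

# Truncated Dirichlet Process prior over the mixing proportions pi:
# n_components is only an upper bound; unneeded components get
# near-zero posterior weight.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=500,
    random_state=0,
).fit(S)

used = dpgmm.weights_ > 0.01   # effectively active components
print(used.sum())              # far fewer than the bound of 10
```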
3. Methodology
We tackle the following problem: given an unlabeled dataset D and a set of M unsupervised anomaly detectors, estimate the (posterior) distribution of the contamination factor γ.
Learning from an unlabeled dataset has three key challenges.
First, the absence of labels forces us to make relatively
strong assumptions. Second, the anomaly detectors rely
on different heuristics that may or may not hold, and their
performance can hence vary significantly across datasets.
Third, we need to be careful in introducing user-specified
hyperparameters, because setting them properly may be as
hard as directly specifying the contamination factor.
In this paper, we propose γGMM, a novel Bayesian approach that estimates the contamination factor's posterior distribution in four steps, which are illustrated in Figure 1:
Step 1. Because anomalies may not follow any particular pattern in covariate space, γGMM maps the covariates X ∈ R^d into an M-dimensional anomaly space, where the dimensions correspond to the anomaly scores assigned by the M unsupervised anomaly detectors. Within each dimension of such a space, the evident pattern is that “the higher, the more anomalous”.
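A minimal sketch of this mapping, using two off-the-shelf scikit-learn detectors as stand-ins for the paper's detector set (scikit-learn scores follow the opposite "higher = more normal" convention, so they are negated):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Toy contaminated dataset: 950 normals, 50 anomalies in R^4.
X = np.vstack([rng.normal(0.0, 1.0, (950, 4)),
               rng.normal(6.0, 1.0, (50, 4))])

# M = 2 illustrative detectors (the paper's choice of detectors may differ).
iforest = IsolationForest(random_state=0).fit(X)
lof = LocalOutlierFactor(n_neighbors=20).fit(X)

# Negate so that every column reads "the higher, the more anomalous".
S = np.column_stack([
    -iforest.score_samples(X),          # detector 1
    -lof.negative_outlier_factor_,      # detector 2
])
print(S.shape)   # (N, M) = (1000, 2): the anomaly-space representation
```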
Step 2. We model the data points in the new space R^M using a Dirichlet Process Gaussian Mixture Model (DPGMM) (Neal, 1992; Rasmussen, 1999). We assume that each of the (potentially many) mixture components contains either only normals or only anomalies. If we knew which components contained anomalies, we could then easily derive γ's posterior as the sum of the mixing proportions π of the anomalous components. However, such information is not available in our setting.
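To make the hypothetical "known anomalous components" case concrete: if the mixing-proportions posterior were summarised by Dirichlet pseudo-counts (all numbers below are invented for illustration), draws of γ would simply be sums of the anomalous components' sampled weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over mixing proportions pi for K = 4 components,
# summarised by Dirichlet pseudo-counts (e.g. points per component).
counts = np.array([620.0, 310.0, 45.0, 25.0])
pi_samples = rng.dirichlet(counts, size=10_000)   # draws of pi | data

# If components 2 and 3 (0-indexed) were known to be anomalous, gamma's
# posterior would be the distribution of their summed weights.
anomalous = [2, 3]
gamma_samples = pi_samples[:, anomalous].sum(axis=1)
print(gamma_samples.mean())   # close to (45 + 25) / 1000 = 0.07
```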
Step 3. Thus, we order the components from most to least anomalous, and we estimate the probability of the k most extreme components being anomalous. This poses three challenges: (a) how to represent each M-dimensional component by a single value so as to sort the components from the most to the least anomalous, (b) how to compute the probability that the k-th component is anomalous given that the (k−1)-th is, and (c) how to derive the target probability that k components are jointly anomalous.
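Challenge (c) can be sketched via the chain rule: given the conditional probabilities from challenge (b) (the values below are invented), the joint probability that the k most extreme components are anomalous is the running product of the conditionals:

```python
import numpy as np

# Hypothetical conditionals for components already sorted from most to
# least anomalous: q[k] stands in for challenge (b), the probability that
# the (k+1)-th component is anomalous given that the k-th one is.
q = np.array([0.95, 0.80, 0.40, 0.05])

# Challenge (c): P(first k components jointly anomalous) by the chain rule.
joint = np.cumprod(q)
for k, p in enumerate(joint, start=1):
    print(k, round(float(p), 4))   # 0.95, 0.76, 0.304, 0.0152
```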
Step 4. γGMM estimates the contamination factor's posterior by exploiting such a joint probability and the posterior over the components' mixing proportions.