Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection

Lorenzo Perini 1, Paul-Christian Bürkner 2, Arto Klami 3
Abstract
Anomaly detection methods identify examples that do not follow the expected behaviour, typically in an unsupervised fashion, by assigning real-valued anomaly scores to the examples based on various heuristics. These scores need to be transformed into actual predictions by thresholding so that the proportion of examples marked as anomalies equals the expected proportion of anomalies, called contamination factor. Unfortunately, there are no good methods for estimating the contamination factor itself. We address this need from a Bayesian perspective, introducing a method for estimating the posterior distribution of the contamination factor for a given unlabeled dataset. We leverage several anomaly detectors to capture the basic notion of anomalousness and estimate the contamination using a specific mixture formulation. Empirically on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the detectors' performance over several alternative methods.
1. Introduction
Anomaly detection aims at automatically identifying samples that do not conform to the normal behaviour, according to some notion of normality (see e.g., Chandola et al. (2009)). Anomalies are often indicative of critical events such as intrusions in web networks (Malaiya et al., 2018), failures in petroleum extraction (Martí et al., 2015), or breakdowns in wind and gas turbines (Zaher et al., 2009; Yan & Yu, 2019). Such events have an associated high cost and detecting them avoids wasting time and resources.
1 DTAI lab & Leuven.AI, Department of Computer Science, KU Leuven, Belgium. 2 Cluster of Excellence SimTech, University of Stuttgart, Germany. 3 Department of Computer Science, University of Helsinki, Finland. Correspondence to: Lorenzo Perini <lorenzo.perini@kuleuven.be>.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Typically, anomaly detection is tackled from an unsupervised perspective (Maxion & Tan, 2000; Goldstein & Uchida, 2016; Zong et al., 2018; Perini et al., 2020b; Han et al., 2022) because labeled samples, especially anomalies, may be expensive and difficult to acquire (e.g., you do not want to voluntarily break the equipment simply to observe anomalous behaviours), or simply rare (e.g., you may need to inspect many samples before finding an anomalous one). Unsupervised anomaly detectors exploit data-driven heuristic assumptions (e.g., anomalies are far away from normals) to assign a real-valued score to each sample denoting how anomalous it is. Using such anomaly scores enables ranking the samples from most to least anomalous.
Converting the anomaly scores into discrete predictions would practically allow the user to flag the anomalies. Commonly, one sets a decision threshold and labels samples with higher scores as anomalous and samples with lower scores as normal. However, setting the threshold is a challenging task as it cannot be tuned (e.g., by maximizing the model performance) due to the absence of labels. One approach is to set the threshold such that the proportion of scores above it matches the dataset's contamination factor γ, i.e. the expected proportion of anomalies. If the ranking is correct (that is, all anomalies are ranked before any normal instance), then thresholding with exactly the correct γ correctly identifies all anomalies. However, in most real-world scenarios the contamination factor is unknown.
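For concreteness, here is a minimal sketch of this thresholding rule in Python; the scores and the value of γ are synthetic stand-ins, since in practice γ is exactly the unknown quantity this paper estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)  # anomaly scores from some detector
gamma = 0.05                    # assumed (usually unknown) contamination factor

# Set the threshold so that a fraction gamma of the scores lies above it,
# then flag the examples above the threshold as anomalies.
threshold = np.quantile(scores, 1.0 - gamma)
predictions = (scores > threshold).astype(int)  # 1 = anomaly, 0 = normal
print(predictions.mean())  # close to gamma by construction
```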
Estimating the contamination factor γ is challenging. Existing works provide an estimate by using either some normal labels (Perini et al., 2020a) or domain knowledge (Perini et al., 2022). Alternatively, one can directly threshold the scores through statistical threshold estimators, and derive γ as the proportion of scores higher than the threshold. For instance, the Modified Thompson Tau test thresholder (MTT) finds the threshold through the modified Thompson Tau test (Rengasamy et al., 2021), while the Inter-Quartile Region thresholder (IQR) uses the third quartile plus 1.5 times the inter-quartile region (Bardet & Dimby, 2017). In Section 4 we provide a comprehensive list of estimators.
Transforming the scores into predictions using an incorrect estimate of the contamination factor (or, equivalently, an incorrect threshold) deteriorates the anomaly detector's performance (Fourure et al., 2021; Emmott et al., 2015) and reduces the trust in the detection system. If such an estimate were coupled with a measure of uncertainty, one could take this uncertainty into account to improve decisions. Although existing methods propose Bayesian anomaly detectors (Shen & Cooper, 2010; Roberts et al., 2019; Hou et al., 2022; Heard et al., 2010), none of them study how to transform scores into hard predictions.
Therefore, we are the first to study the estimation of the contamination factor from a Bayesian perspective. We propose γGMM, the first algorithm for estimating the contamination factor's (posterior) distribution in unlabeled anomaly detection setups. First, we use a set of unsupervised anomaly detectors to assign anomaly scores to all samples and use these scores as a new representation of the data. Second, we fit a Bayesian Gaussian Mixture model with a Dirichlet Process prior (DPGMM) (Ferguson, 1973; Rasmussen, 1999) in this new space. If we knew which components contain the anomalies, we could derive the contamination factor's posterior distribution as the distribution of the sum of such components' weights. Because we do not know this, as a third step γGMM estimates the probability that the k most extreme components are jointly anomalous, and uses this information to construct the desired posterior. The method is explained in detail in Section 3.
In summary, we make four contributions. First, we adopt a Bayesian perspective and introduce the problem of estimating the contamination factor's posterior distribution. Second, we propose an algorithm that is able to sample from this posterior. Third, we demonstrate experimentally that the implied uncertainty-aware predictions are well calibrated and that taking the posterior mean as point estimate of γ outperforms several other algorithms in common benchmarks. Finally, we show that using the posterior mean as a threshold improves the actual anomaly detection accuracy.
2. Preliminaries
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and $X\colon \Omega \to \mathbb{R}^d$ a random variable, from which a dataset $D = \{X_1, \ldots, X_N\}$ of $N$ random examples is drawn. Assume that $X$ has a distribution of the form $P = (1-\gamma)\cdot P_1 + \gamma \cdot P_2$, where $P_1$ and $P_2$ are the distributions on $\mathbb{R}^d$ corresponding to normal examples and anomalies, respectively, and $\gamma \in [0,1]$ is the contamination factor, i.e. the proportion of anomalies. An (unsupervised) anomaly detector is a measurable function $f\colon \mathbb{R}^d \to \mathbb{R}$ that assigns real-valued anomaly scores $f(X)$ to the examples. Such anomaly scores follow the rule that the higher the score, the more anomalous the example.
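As an illustration of this setup, the following sketch draws a toy dataset from such a two-component mixture; the Gaussian choices for $P_1$ and $P_2$ are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, gamma = 1000, 2, 0.05

# Bernoulli(gamma) draws decide which mixture component each example comes from.
is_anomaly = rng.random(N) < gamma

# P1: normal examples near the origin; P2: anomalies far away (illustrative choices).
X = rng.normal(loc=0.0, scale=1.0, size=(N, d))
X[is_anomaly] = rng.normal(loc=6.0, scale=1.0, size=(is_anomaly.sum(), d))
```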
A Gaussian mixture model (GMM) with $K$ components (see e.g. Roberts et al. (1998)) is a generative model defined by a distribution on a space $\mathbb{R}^M$ such that $p(s) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(s \mid \mu_k, \Sigma_k)$ for $s \in \mathbb{R}^M$, where $\mathcal{N}(s \mid \mu_k, \Sigma_k)$ denotes the Gaussian distribution with mean vector $\mu_k$ and covariance matrix $\Sigma_k \in \mathbb{R}^{M \times M}$, and $\pi_k$ are the mixing proportions such that $\sum_{k=1}^{K} \pi_k = 1$. For finite mixtures, we typically have a Dirichlet prior over $\pi = [\pi_1, \ldots, \pi_K]$, but Dirichlet Process (DP) priors allow treating also the number of components as unknown (Görür & Rasmussen, 2010). For both cases, we need approximate inference to estimate the posterior of the model parameters.
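As a sketch of how a DP prior induces mixing proportions, the snippet below samples π via the standard stick-breaking construction; the truncation level and concentration α are illustrative choices, not the paper's settings.

```python
import numpy as np

def stick_breaking(alpha: float, k_max: int, rng) -> np.ndarray:
    """Sample mixing proportions pi from a (truncated) stick-breaking process."""
    betas = rng.beta(1.0, alpha, size=k_max)  # beta_k ~ Beta(1, alpha)
    # Length of the stick remaining before the k-th break: prod_{j<k} (1 - beta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    pi = betas * remaining                    # pi_k = beta_k * prod_{j<k} (1 - beta_j)
    return pi / pi.sum()                      # renormalize the truncated stick

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=1.0, k_max=50, rng=rng)
print(pi[:5], pi.sum())  # a few large weights, many tiny ones; sums to 1
```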
3. Methodology
We tackle the problem: Given an unlabeled dataset $D$ and a set of $M$ unsupervised anomaly detectors; Estimate a (posterior) distribution of the contamination factor γ.
Learning from an unlabeled dataset has three key challenges.
First, the absence of labels forces us to make relatively
strong assumptions. Second, the anomaly detectors rely
on different heuristics that may or may not hold, and their
performance can hence vary significantly across datasets.
Third, we need to be careful in introducing user-specified
hyperparameters, because setting them properly may be as
hard as directly specifying the contamination factor.
In this paper, we propose γGMM, a novel Bayesian approach that estimates the contamination factor's posterior distribution in four steps, which are illustrated in Figure 1:
Step 1. Because anomalies may not follow any particular pattern in covariate space, γGMM maps the covariates $X \in \mathbb{R}^d$ into an $M$-dimensional anomaly space, where the dimensions correspond to the anomaly scores assigned by the $M$ unsupervised anomaly detectors. Within each dimension of such a space, the evident pattern is that "the higher, the more anomalous".
Step 2. We model the data points in the new space $\mathbb{R}^M$ using a Dirichlet Process Gaussian Mixture Model (DPGMM) (Neal, 1992; Rasmussen, 1999). We assume that each of the (potentially many) mixture components contains either only normals or only anomalies. If we knew which components contained anomalies, we could then easily derive γ's posterior as the sum of the mixing proportions π of the anomalous components. However, such information is not available in our setting.
Step 3. Thus, we order the components in decreasing order, and we estimate the probability of the largest $k$ components being anomalous. This poses three challenges: (a) how to represent each $M$-dimensional component by a single value to sort them from the most to the least anomalous, (b) how to compute the probability that the $k$-th component is anomalous given that the $(k-1)$-th is such, and (c) how to derive the target probability that $k$ components are jointly anomalous.
Step 4. γGMM estimates the contamination factor's posterior by exploiting such a joint probability and the components' mixing proportions posterior.
Figure 1. Illustration of γGMM's four steps on a 2D toy dataset (left plot): we 1) map the 2D dataset into an $M = 2$ dimensional anomaly space, 2) fit a DPGMM model on it, 3) compute the components' probability of being anomalous (conditional, in the plot), and 4) derive γ|S's posterior. γGMM's mean is an accurate point estimate for the true value γ.
In the following, we describe these steps in detail.
3.1. Representing Data Using Anomaly Scores
Learning from an unlabeled anomaly detection dataset has two major challenges. First, anomalies are rare and sparse events, which makes it hard to use common unsupervised methods like clustering (Breunig et al., 2000). Second, making assumptions on the unlabeled data is challenging due to the absence of specific patterns in the anomalies, which makes it hard to choose a specific anomaly detector. Therefore, we use a set of $M$ anomaly detectors to map the $d$-dimensional input space into an $M$-dimensional score space $\mathbb{R}^M$, such that a sample $x$ gets a score $s$:

$$\mathbb{R}^d \ni x \mapsto [f_1(x), f_2(x), \ldots, f_M(x)] = s \in \mathbb{R}^M.$$
This has two main effects: (1) it introduces an interpretable
space where the evident pattern is that, within each dimen-
sion, higher scores are more likely to be anomalous, and (2)
it accounts for multiple inductive biases by using multiple
arbitrary anomaly detectors.
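A minimal sketch of this mapping, using two off-the-shelf scikit-learn detectors as stand-ins for the $M$ detectors (the paper does not prescribe a particular detector set):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def score_space(X: np.ndarray) -> np.ndarray:
    """Map X in R^d to S in R^M, one column of anomaly scores per detector."""
    iforest = IsolationForest(random_state=0).fit(X)
    lof = LocalOutlierFactor().fit(X)
    # Both scikit-learn scores increase with normality, so negate them to
    # follow the "higher is more anomalous" convention used in the paper.
    s_iforest = -iforest.score_samples(X)
    s_lof = -lof.negative_outlier_factor_
    return np.column_stack([s_iforest, s_lof])  # shape (N, M) with M = 2
```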
To make the dimensions comparable, we (independently for each dimension) map the scores $s \in S$ to $\log(s - \min(S) + 0.01)$, where the log is used to shorten heavy right tails, and normalize them to have zero mean and unit variance.
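In code, this per-dimension transform could look as follows, as a direct transcription of the formula above:

```python
import numpy as np

def transform_scores(S: np.ndarray) -> np.ndarray:
    """Per dimension: shift, log-compress the right tail, then standardize."""
    S = np.log(S - S.min(axis=0) + 0.01)         # log(s - min(S) + 0.01)
    return (S - S.mean(axis=0)) / S.std(axis=0)  # zero mean, unit variance
```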
3.2. Modeling the Density with DPGMM
We use mixture models as the basis for quantifying the distribution of the contamination factor, relying on their ability to model the proportions of samples using the mixture weights. For flexible modeling, we use the DPGMM

$$s_i \sim \mathcal{N}(\tilde{\mu}_i, \tilde{\Sigma}_i), \quad i = 1, \ldots, N,$$
$$(\tilde{\mu}_i, \tilde{\Sigma}_i) \sim G, \qquad G \sim DP(G_0, \alpha), \qquad G_0 = NIW(M, \lambda, V, u),$$
where $G$ is a random distribution of the mean vectors $\tilde{\mu}_i$ and covariance matrices $\tilde{\Sigma}_i$, drawn from a DP with base distribution $G_0$. We use the explicit representation $G = \sum_{k=1}^{\infty} \pi_k \delta_{(\mu_k, \Sigma_k)}$, where $\delta_{(\mu_k, \Sigma_k)}$ is the delta distribution at $(\mu_k, \Sigma_k)$ and the $\pi_k$ follow the stick-breaking distribution. We set $G_0$ as Normal Inverse Wishart (Nydick, 2012) with parameters $M, \lambda, V, u$ common to all components. We use variational inference (VI; see e.g. Blei et al. (2017) for details) for approximating the posterior, as VI is computationally efficient and sufficiently accurate for our purposes. Alternative methods (e.g., Markov Chain Monte Carlo (Brooks et al., 2011)) could also be used but were not considered worth the additional computational effort here.
Choice of DPGMM. DPGMM has two key properties that justify its use over other flexible density models. First, we choose Gaussian distributions over more robust heavy-tailed distributions because isolated samples are likely candidates for outliers, and encouraging the model to represent them using the heavy tails would be counter-productive. Second, the rich-get-richer property of DPs is desirable because we expect some very large components of normals but want to allow arbitrarily small clusters of anomalies. Moreover, the DP formulation allows us to refrain from specifying the number of components $K$. After fitting the model, we only consider the components with at least one observation assigned to them and propagate all the remaining density uniformly over the active components. Thus, for the following steps we can still proceed as if the model were a finite mixture with π following a Dirichlet distribution.
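One way to realize this step is scikit-learn's variational BayesianGaussianMixture with a Dirichlet-process prior on the weights; the sketch below is our illustration, not the authors' implementation, and the truncation level k_max is an arbitrary choice.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_dpgmm(S: np.ndarray, k_max: int = 20):
    """Fit a truncated DPGMM to the transformed scores S via variational inference."""
    dpgmm = BayesianGaussianMixture(
        n_components=k_max,  # truncation level of the DP
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        max_iter=500,
        random_state=0,
    ).fit(S)
    # Keep only "active" components: those with at least one sample assigned.
    labels = dpgmm.predict(S)
    active = np.unique(labels)
    # Propagate the leftover density uniformly over the active components.
    weights = dpgmm.weights_[active]
    weights = weights + (1.0 - weights.sum()) / len(active)
    return dpgmm, active, weights
```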
3.3. Estimating the Components’ Anomalousness
We assume that each mixture component contains either only anomalous or only normal samples. All unsupervised methods rely on some assumption about nearby samples sharing latent characteristics, and this cluster assumption is a natural and weak one. If we knew which components contain anomalies, we could directly derive the posterior of γ as the sum of their mixing proportions.
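To make Steps 3 and 4 concrete, the sketch below assumes the per-k joint-anomalousness probabilities from Step 3 are already available as a vector q (a hypothetical input here, not the paper's estimator) and combines them with the mixing-proportion posterior by Monte Carlo to draw samples of γ.

```python
import numpy as np

def gamma_posterior_samples(weight_samples: np.ndarray, q: np.ndarray, rng) -> np.ndarray:
    """Monte Carlo draws from gamma's posterior.

    weight_samples: (T, K) draws of the mixing proportions pi, with components
                    assumed pre-sorted from most to least anomalous.
    q:              length-K vector where q[k] is the probability that exactly
                    the k+1 most extreme components are anomalous (hypothetical).
    """
    T, K = weight_samples.shape
    # For each posterior draw of pi, sample how many top components are anomalous
    # and take gamma as the sum of their weights.
    ks = rng.choice(K, size=T, p=q) + 1
    cums = np.cumsum(weight_samples, axis=1)
    return cums[np.arange(T), ks - 1]

rng = np.random.default_rng(0)
# Toy posterior over K = 3 components, columns ordered most-to-least anomalous.
pi_draws = rng.dirichlet(alpha=[2.0, 4.0, 60.0], size=2000)
q = np.array([0.6, 0.3, 0.1])  # hypothetical joint-anomalousness probabilities
gamma_draws = gamma_posterior_samples(pi_draws, q, rng)
print(gamma_draws.mean())      # posterior-mean point estimate of gamma
```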