Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection

Lorenzo Perini 1, Paul-Christian Bürkner 2, Arto Klami 3
Abstract
Anomaly detection methods identify examples that do not follow the expected behaviour, typically in an unsupervised fashion, by assigning real-valued anomaly scores to the examples based on various heuristics. These scores need to be transformed into actual predictions by thresholding so that the proportion of examples marked as anomalies equals the expected proportion of anomalies, called contamination factor. Unfortunately, there are no good methods for estimating the contamination factor itself. We address this need from a Bayesian perspective, introducing a method for estimating the posterior distribution of the contamination factor for a given unlabeled dataset. We leverage several anomaly detectors to capture the basic notion of anomalousness and estimate the contamination using a specific mixture formulation. Empirically on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the detectors' performance over several alternative methods.
1. Introduction
Anomaly detection aims at automatically identifying samples that do not conform to the normal behaviour, according to some notion of normality (see e.g., Chandola et al. (2009)). Anomalies are often indicative of critical events such as intrusions in web networks (Malaiya et al., 2018), failures in petroleum extraction (Martí et al., 2015), or breakdowns in wind and gas turbines (Zaher et al., 2009; Yan & Yu, 2019). Such events have an associated high cost and detecting them avoids wasting time and resources.
1 DTAI lab & Leuven.AI, Department of Computer Science, KU Leuven, Belgium. 2 Cluster of Excellence SimTech, University of Stuttgart, Germany. 3 Department of Computer Science, University of Helsinki, Finland. Correspondence to: Lorenzo Perini <lorenzo.perini@kuleuven.be>.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Typically, anomaly detection is tackled from an unsupervised perspective (Maxion & Tan, 2000; Goldstein & Uchida, 2016; Zong et al., 2018; Perini et al., 2020b; Han et al., 2022) because labeled samples, especially anomalies, may be expensive and difficult to acquire (e.g., you do not want to voluntarily break the equipment simply to observe anomalous behaviours), or simply rare (e.g., you may need to inspect many samples before finding an anomalous one). Unsupervised anomaly detectors exploit data-driven heuristic assumptions (e.g., anomalies are far away from normals) to assign a real-valued score to each sample denoting how anomalous it is. Using such anomaly scores enables ranking the samples from most to least anomalous.
Converting the anomaly scores into discrete predictions would practically allow the user to flag the anomalies. Commonly, one sets a decision threshold and labels samples with higher scores as anomalous and samples with lower scores as normal. However, setting the threshold is a challenging task as it cannot be tuned (e.g., by maximizing the model performance) due to the absence of labels. One approach is to set the threshold such that the proportion of scores above it matches the dataset's contamination factor γ, i.e. the expected proportion of anomalies. If the ranking is correct (that is, all anomalies are ranked before any normal instance), then thresholding with exactly the correct γ correctly identifies all anomalies. However, in most real-world scenarios the contamination factor is unknown.
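For concreteness, here is a minimal sketch of this thresholding rule in Python; the scores and the value of γ are synthetic stand-ins, since in practice γ is exactly the unknown quantity this paper estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)  # anomaly scores from some detector
gamma = 0.05                    # assumed (usually unknown) contamination factor

# Set the threshold so that a fraction gamma of the scores lies above it,
# then flag the examples above the threshold as anomalies.
threshold = np.quantile(scores, 1.0 - gamma)
predictions = (scores > threshold).astype(int)  # 1 = anomaly, 0 = normal
print(predictions.mean())  # close to gamma by construction
```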
Estimating the contamination factor γ is challenging. Existing works provide an estimate by using either some normal labels (Perini et al., 2020a) or domain knowledge (Perini et al., 2022). Alternatively, one can directly threshold the scores through statistical threshold estimators, and derive γ as the proportion of scores higher than the threshold. For instance, the Modified Thompson Tau test thresholder (MTT) finds the threshold through the modified Thompson Tau test (Rengasamy et al., 2021), while the Inter-Quartile Region thresholder (IQR) uses the third quartile plus 1.5 times the inter-quartile region (Bardet & Dimby, 2017). In Section 4 we provide a comprehensive list of estimators.
Transforming the scores into predictions using an incorrect estimate of the contamination factor (or, equivalently, an incorrect threshold) deteriorates the anomaly detector's performance (Fourure et al., 2021; Emmott et al., 2015) and reduces the trust in the detection system. If such an estimate were coupled with a measure of uncertainty, one could take this uncertainty into account to improve decisions. Although existing methods propose Bayesian anomaly detectors (Shen & Cooper, 2010; Roberts et al., 2019; Hou et al., 2022; Heard et al., 2010), none of them study how to transform scores into hard predictions.
Therefore, we are the first to study the estimation of the contamination factor from a Bayesian perspective. We propose γGMM, the first algorithm for estimating the contamination factor's (posterior) distribution in unlabeled anomaly detection setups. First, we use a set of unsupervised anomaly detectors to assign anomaly scores to all samples and use these scores as a new representation of the data. Second, we fit a Bayesian Gaussian Mixture model with a Dirichlet Process prior (DPGMM) (Ferguson, 1973; Rasmussen, 1999) in this new space. If we knew which components contain the anomalies, we could derive the contamination factor's posterior distribution as the distribution of the sum of such components' weights. Because we do not know this, as a third step γGMM estimates the probability that the k most extreme components are jointly anomalous, and uses this information to construct the desired posterior. The method is explained in detail in Section 3.
In summary, we make four contributions. First, we adopt a Bayesian perspective and introduce the problem of estimating the contamination factor's posterior distribution. Second, we propose an algorithm that is able to sample from this posterior. Third, we demonstrate experimentally that the implied uncertainty-aware predictions are well calibrated and that taking the posterior mean as point estimate of γ outperforms several other algorithms in common benchmarks. Finally, we show that using the posterior mean as a threshold improves the actual anomaly detection accuracy.
2. Preliminaries
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and $X\colon \Omega \to \mathbb{R}^d$ a random variable, from which a dataset $D = \{X_1, \ldots, X_N\}$ of $N$ random examples is drawn. Assume that $X$ has a distribution of the form $P = (1-\gamma)\cdot P_1 + \gamma \cdot P_2$, where $P_1$ and $P_2$ are the distributions on $\mathbb{R}^d$ corresponding to normal examples and anomalies, respectively, and $\gamma \in [0,1]$ is the contamination factor, i.e. the proportion of anomalies. An (unsupervised) anomaly detector is a measurable function $f\colon \mathbb{R}^d \to \mathbb{R}$ that assigns real-valued anomaly scores $f(X)$ to the examples. Such anomaly scores follow the rule that the higher the score, the more anomalous the example.
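As an illustration of this setup, the following sketch draws a toy dataset from such a two-component mixture; the Gaussian choices for $P_1$ and $P_2$ are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, gamma = 1000, 2, 0.05

# Bernoulli(gamma) draws decide which mixture component each example comes from.
is_anomaly = rng.random(N) < gamma

# P1: normal examples near the origin; P2: anomalies far away (illustrative choices).
X = rng.normal(loc=0.0, scale=1.0, size=(N, d))
X[is_anomaly] = rng.normal(loc=6.0, scale=1.0, size=(is_anomaly.sum(), d))
```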
A Gaussian mixture model (GMM) with $K$ components (see e.g. Roberts et al. (1998)) is a generative model defined by a distribution on a space $\mathbb{R}^M$ such that $p(s) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(s \mid \mu_k, \Sigma_k)$ for $s \in \mathbb{R}^M$, where $\mathcal{N}(s \mid \mu_k, \Sigma_k)$ denotes the Gaussian distribution with mean vector $\mu_k$ and covariance matrix $\Sigma_k \in \mathbb{R}^{M \times M}$, and $\pi_k$ are the mixing proportions such that $\sum_{k=1}^{K} \pi_k = 1$. For finite mixtures, we typically have a Dirichlet prior over $\pi = [\pi_1, \ldots, \pi_K]$, but Dirichlet Process (DP) priors allow treating also the number of components as unknown (Görür & Rasmussen, 2010). For both cases, we need approximate inference to estimate the posterior of the model parameters.
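As a sketch of how a DP prior induces mixing proportions, the snippet below samples π via the standard stick-breaking construction; the truncation level and concentration α are illustrative choices, not the paper's settings.

```python
import numpy as np

def stick_breaking(alpha: float, k_max: int, rng) -> np.ndarray:
    """Sample mixing proportions pi from a (truncated) stick-breaking process."""
    betas = rng.beta(1.0, alpha, size=k_max)  # beta_k ~ Beta(1, alpha)
    # Length of the stick remaining before the k-th break: prod_{j<k} (1 - beta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    pi = betas * remaining                    # pi_k = beta_k * prod_{j<k} (1 - beta_j)
    return pi / pi.sum()                      # renormalize the truncated stick

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=1.0, k_max=50, rng=rng)
print(pi[:5], pi.sum())  # a few large weights, many tiny ones; sums to 1
```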
3. Methodology
We tackle the problem: Given an unlabeled dataset $D$ and a set of $M$ unsupervised anomaly detectors; Estimate a (posterior) distribution of the contamination factor γ.
Learning from an unlabeled dataset has three key challenges.
First, the absence of labels forces us to make relatively
strong assumptions. Second, the anomaly detectors rely
on different heuristics that may or may not hold, and their
performance can hence vary significantly across datasets.
Third, we need to be careful in introducing user-specified
hyperparameters, because setting them properly may be as
hard as directly specifying the contamination factor.
In this paper, we propose γGMM, a novel Bayesian approach that estimates the contamination factor's posterior distribution in four steps, which are illustrated in Figure 1:
Step 1. Because anomalies may not follow any particular pattern in covariate space, γGMM maps the covariates $X \in \mathbb{R}^d$ into an $M$-dimensional anomaly space, where the dimensions correspond to the anomaly scores assigned by the $M$ unsupervised anomaly detectors. Within each dimension of such a space, the evident pattern is that "the higher, the more anomalous".
Step 2. We model the data points in the new space $\mathbb{R}^M$ using a Dirichlet Process Gaussian Mixture Model (DPGMM) (Neal, 1992; Rasmussen, 1999). We assume that each of the (potentially many) mixture components contains either only normals or only anomalies. If we knew which components contained anomalies, we could then easily derive γ's posterior as the sum of the mixing proportions π of the anomalous components. However, such information is not available in our setting.
Step 3. Thus, we order the components in decreasing order, and we estimate the probability of the largest $k$ components being anomalous. This poses three challenges: (a) how to represent each $M$-dimensional component by a single value to sort them from the most to the least anomalous, (b) how to compute the probability that the $k$-th component is anomalous given that the $(k-1)$-th is such, and (c) how to derive the target probability that $k$ components are jointly anomalous.
Step 4. γGMM estimates the contamination factor's posterior by exploiting such a joint probability and the components' mixing proportions posterior.
Figure 1. Illustration of γGMM's four steps on a 2D toy dataset (left plot): we 1) map the 2D dataset into an $M = 2$ dimensional anomaly space, 2) fit a DPGMM model on it, 3) compute the components' probability of being anomalous (conditional, in the plot), and 4) derive γ|S's posterior. γGMM's mean is an accurate point estimate for the true value γ.
In the following, we describe these steps in detail.
3.1. Representing Data Using Anomaly Scores
Learning from an unlabeled anomaly detection dataset has two major challenges. First, anomalies are rare and sparse events, which makes it hard to use common unsupervised methods like clustering (Breunig et al., 2000). Second, making assumptions on the unlabeled data is challenging due to the absence of specific patterns in the anomalies, which makes it hard to choose a specific anomaly detector. Therefore, we use a set of $M$ anomaly detectors to map the $d$-dimensional input space into an $M$-dimensional score space $\mathbb{R}^M$, such that a sample $x$ gets a score $s$:

$$\mathbb{R}^d \ni x \mapsto [f_1(x), f_2(x), \ldots, f_M(x)] = s \in \mathbb{R}^M.$$
This has two main effects: (1) it introduces an interpretable
space where the evident pattern is that, within each dimen-
sion, higher scores are more likely to be anomalous, and (2)
it accounts for multiple inductive biases by using multiple
arbitrary anomaly detectors.
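A minimal sketch of this mapping, using two off-the-shelf scikit-learn detectors as stand-ins for the $M$ detectors (the paper does not prescribe a particular detector set):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def score_space(X: np.ndarray) -> np.ndarray:
    """Map X in R^d to S in R^M, one column of anomaly scores per detector."""
    iforest = IsolationForest(random_state=0).fit(X)
    lof = LocalOutlierFactor().fit(X)
    # Both scikit-learn scores increase with normality, so negate them to
    # follow the "higher is more anomalous" convention used in the paper.
    s_iforest = -iforest.score_samples(X)
    s_lof = -lof.negative_outlier_factor_
    return np.column_stack([s_iforest, s_lof])  # shape (N, M) with M = 2
```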
To make the dimensions comparable, we (independently for each dimension) map the scores $s \in S$ to $\log(s - \min(S) + 0.01)$, where the log is used to shorten heavy right tails, and normalize them to have zero mean and unit variance.
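In code, this per-dimension transform could look as follows, as a direct transcription of the formula above:

```python
import numpy as np

def transform_scores(S: np.ndarray) -> np.ndarray:
    """Per dimension: shift, log-compress the right tail, then standardize."""
    S = np.log(S - S.min(axis=0) + 0.01)         # log(s - min(S) + 0.01)
    return (S - S.mean(axis=0)) / S.std(axis=0)  # zero mean, unit variance
```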
3.2. Modeling the Density with DPGMM
We use mixture models as the basis for quantifying the distribution of the contamination factor, relying on their ability to model the proportions of samples using the mixture weights. For flexible modeling, we use the DPGMM

$$s_i \sim \mathcal{N}(\tilde{\mu}_i, \tilde{\Sigma}_i), \quad i = 1, \ldots, N,$$
$$(\tilde{\mu}_i, \tilde{\Sigma}_i) \sim G, \qquad G \sim DP(G_0, \alpha), \qquad G_0 = NIW(M, \lambda, V, u),$$
where $G$ is a random distribution of the mean vectors $\tilde{\mu}_i$ and covariance matrices $\tilde{\Sigma}_i$, drawn from a DP with base distribution $G_0$. We use the explicit representation $G = \sum_{k=1}^{\infty} \pi_k \delta_{(\mu_k, \Sigma_k)}$, where $\delta_{(\mu_k, \Sigma_k)}$ is the delta distribution at $(\mu_k, \Sigma_k)$ and the $\pi_k$ follow the stick-breaking distribution. We set $G_0$ as Normal Inverse Wishart (Nydick, 2012) with parameters $M, \lambda, V, u$ common to all components. We use variational inference (VI; see e.g. Blei et al. (2017) for details) for approximating the posterior, as VI is computationally efficient and sufficiently accurate for our purposes. Alternative methods (e.g., Markov Chain Monte Carlo (Brooks et al., 2011)) could also be used but were not considered worth the additional computational effort here.
Choice of DPGMM. DPGMM has two key properties that justify its use over other flexible density models. First, we choose Gaussian distributions over more robust heavy-tailed distributions because isolated samples are likely candidates for outliers, and encouraging the model to represent them using the heavy tails would be counter-productive. Second, the rich-get-richer property of DPs is desirable because we expect some very large components of normals but want to allow arbitrarily small clusters of anomalies. Moreover, the DP formulation allows us to refrain from specifying the number of components $K$. After fitting the model, we only consider the components with at least one observation assigned to them and propagate all the remaining density uniformly over the active components. Thus, for the following steps we can still proceed as if the model were a finite mixture with π following a Dirichlet distribution.
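One way to realize this step is scikit-learn's variational BayesianGaussianMixture with a Dirichlet-process prior on the weights; the sketch below is our illustration, not the authors' implementation, and the truncation level k_max is an arbitrary choice.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_dpgmm(S: np.ndarray, k_max: int = 20):
    """Fit a truncated DPGMM to the transformed scores S via variational inference."""
    dpgmm = BayesianGaussianMixture(
        n_components=k_max,  # truncation level of the DP
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        max_iter=500,
        random_state=0,
    ).fit(S)
    # Keep only "active" components: those with at least one sample assigned.
    labels = dpgmm.predict(S)
    active = np.unique(labels)
    # Propagate the leftover density uniformly over the active components.
    weights = dpgmm.weights_[active]
    weights = weights + (1.0 - weights.sum()) / len(active)
    return dpgmm, active, weights
```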
3.3. Estimating the Components’ Anomalousness
We assume that each mixture component contains either only anomalous or only normal samples. All unsupervised methods rely on some assumption about nearby samples sharing latent characteristics, and this cluster assumption is a natural and weak one. If we knew which components contain anomalies, we could directly derive the posterior of γ as the sum of their mixing proportions.
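To make Steps 3 and 4 concrete, the sketch below assumes the per-k joint-anomalousness probabilities from Step 3 are already available as a vector q (a hypothetical input here, not the paper's estimator) and combines them with the mixing-proportion posterior by Monte Carlo to draw samples of γ.

```python
import numpy as np

def gamma_posterior_samples(weight_samples: np.ndarray, q: np.ndarray, rng) -> np.ndarray:
    """Monte Carlo draws from gamma's posterior.

    weight_samples: (T, K) draws of the mixing proportions pi, with components
                    assumed pre-sorted from most to least anomalous.
    q:              length-K vector where q[k] is the probability that exactly
                    the k+1 most extreme components are anomalous (hypothetical).
    """
    T, K = weight_samples.shape
    # For each posterior draw of pi, sample how many top components are anomalous
    # and take gamma as the sum of their weights.
    ks = rng.choice(K, size=T, p=q) + 1
    cums = np.cumsum(weight_samples, axis=1)
    return cums[np.arange(T), ks - 1]

rng = np.random.default_rng(0)
# Toy posterior over K = 3 components, columns ordered most-to-least anomalous.
pi_draws = rng.dirichlet(alpha=[2.0, 4.0, 60.0], size=2000)
q = np.array([0.6, 0.3, 0.1])  # hypothetical joint-anomalousness probabilities
gamma_draws = gamma_posterior_samples(pi_draws, q, rng)
print(gamma_draws.mean())      # posterior-mean point estimate of gamma
```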