Anomaly detection using data depth:
multivariate case
Pavlo Mozharovskyi
LTCI, Télécom Paris, Institut Polytechnique de Paris
Romain Valla
LTCI, Télécom Paris, Institut Polytechnique de Paris
July 8, 2024
Abstract
Anomaly detection is a branch of data analysis and machine learning which aims
at identifying observations that exhibit abnormal behaviour. Be it measurement
errors, disease development, severe weather, production quality defects (defective items)
or failed equipment, financial frauds or crisis events, their on-time identification,
isolation and explanation constitute an important task in almost any branch of science
and industry. By providing a robust ordering, data depth, a statistical function that
measures the belongingness of any point of the space to a data set, becomes a particularly
useful tool for the detection of anomalies. Already known for its theoretical properties,
data depth has undergone substantial computational developments in the last decade,
and particularly in recent years, which have made it applicable to contemporary-sized
problems of data analysis and machine learning.
In this article, data depth is studied as an efficient anomaly detection tool, assigning
abnormality labels to observations with lower depth values, in a multivariate
setting. Practical questions concerning the necessity and reasonability of invariances,
the shape of the depth function, its robustness and computational complexity, and the
choice of the threshold are discussed. Illustrations include use cases that underline the
advantageous behaviour of data depth in various settings.
Keywords: Data depth, anomaly detection, robustness, affine invariance, computational statistics, projection depth, halfspace depth, visualization, data analysis.
1 Motivation
Being applicable in a large variety of domains, anomaly detection is increasingly gaining
popularity among researchers and practitioners. Although in use for decades, it constitutes
a contemporary domain of rapid development, driven by growing demand in areas such as
industry, economy, and the social sciences. With the large amounts of data recorded in
modern applications and the constantly present probability of abnormal events, anomalies
can no longer be identified by an operator's hand: automatic procedures are necessary.
It is not the goal of the current article to provide a complete overview of anomaly
detection methods; the reader is referred to Chandola et al. (2009), and to the following
Sections 1.1 and 1.2 for intuition. Here, a narrower question is in scope: why and how to
employ data depth for anomaly detection?
1.1 Difference from outlier detection
With the two terms "outlier" and "anomaly" being used by two communities with small overlap,
a discussion of their similarity is important.
From a statistical point of view, both outlier and anomaly detection focus on identifying
atypical observations. Nevertheless, there is a substantial difference in the application of
methods from these two groups. First of all, while the term "outlier detection" is
traditionally used by statisticians, "anomaly detection" has been adopted by the machine
learning community. As a consequence, (more theoretically oriented) statisticians "did
not need", and were often unaware of, (some of the) anomaly detection methods developed
by the machine learning community, while, when searching for practical solutions in
applications, machine learners did not find outlier detection methods sufficiently flexible
(w.r.t. the data space and the shape of the distribution) or scalable (with the number of
observations and variables). Furthermore, rigorous statistical analysis and inference tools,
often at the center of attention for statisticians, frequently do not exist for anomaly
detection methods, which often take the form of heuristics.
Indeed, the difference between "outlier" and "anomaly" is perhaps best described
through application. Given a data set at hand, the task of identifying outliers
consists in searching for observations that do not resemble the majority of the data set.
The "anomaly detection" approach is more operational and rather follows the philosophy
of machine learning. That is, given a training data set, which may or may not itself
contain anomalies, the task is to construct a rule (training phase) which can assign (in
the detection phase) each observation of the space (including the observations of the
training set) either to the category of anomalies or to that of normal observations.
This work-flow imposes certain requirements on anomaly detection methodology, e.g.,
regarding the data set used to learn the anomaly detection rule. Should the rule simply
save the entire training data set (this would be the case when directly applying data depth),
only part of it, or none at all; should the rule be updated, and how often? Continuing the
example with data depth, in the learning phase (again in direct application) the training
data set is simply saved in memory and no computations are performed. When checking
the abnormality of a new observation, its data depth is computed w.r.t. the (saved)
training data set, based on which the decision about the observation's abnormality is
made. To keep the rule scalable (and fitting in limited machine memory), only a subset
of the data can be stored instead. In the case of the Mahalanobis depth, only parameters
(a center vector and a scatter matrix) need to be saved, and no data at all.
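To make this work-flow concrete, the following minimal sketch in Python separates the two phases (the class and method names are ours, purely illustrative, not an API from this article): training only memorizes what is needed, and detection computes the depth of a new observation and compares it with a threshold.

```python
import numpy as np

class DepthDetector:
    """Minimal sketch of the train/detect work-flow described above;
    class and method names are illustrative, not from the paper."""

    def __init__(self, depth_fn, threshold):
        self.depth_fn = depth_fn    # a data depth D(x | X)
        self.threshold = threshold  # depth below this value => anomaly

    def fit(self, X_tr):
        # Training phase (direct application of depth): simply memorize
        # the sample; no computation is performed at this stage.
        self.X_tr = np.asarray(X_tr)
        return self

    def is_anomaly(self, x):
        # Detection phase: depth of x w.r.t. the saved training set.
        # A Mahalanobis-depth variant would instead store only the center
        # vector and scatter matrix in fit() and discard the data.
        return self.depth_fn(x, self.X_tr) < self.threshold
```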
It is important to keep this operational aspect in mind when underlining the suitability
of data depth for anomaly detection in an industrial context in the following Section 1.2;
we focus on this aspect again in Section 5.
1.2 Industrial context
Consider the following example simulating industrial data. Think of a (potential) production
line that manufactures certain items. At several stages of the production process, measurements
are taken on each of the items to ensure the quality of the produced pieces. These
measurements can be numerous if the line is well automated, or scarce if this is not the case.
If, for each item, these measurements can be assembled in a vector (of length $d$), then the
item can be represented as a multivariate observation $x$ in a Euclidean space ($x \in \mathbb{R}^d$),
and the entire manufacturing process as a data set in the same space ($\{x_1, \dots, x_n\} \subset \mathbb{R}^d$).
Consider Figure 1, left. For visualization purposes, let us restrict attention to two
measurements, whence each produced item is represented by an observation with two variables
(= measurements). To construct an anomaly detection rule, a subset of production data is taken
as a training set, which may or may not itself contain anomalies; this corresponds to the 500
black pixels, denoted $X_{\mathrm{tr}} = \{x_1, \dots, x_{500}\} \subset \mathbb{R}^2$. Eight new
observations are now to be labeled either as normal observations or as anomalies, namely four
green dots (corresponding to normal items) and four anomalies (three red pluses and one red
cross). While in this bivariate visual case with $d = 2$ the task is trivially resolved by simple
visual inspection, it becomes much more complicated once $d$ increases.
The simplest, though still frequently applied, approach is to define a validation band for
each measurement, i.e., an upper and a lower bound for each variable: this rule is depicted by
black dashed lines parallel to the variables' axes and, if well calibrated, allows one to identify
three out of four anomalies (the red pluses), while being computationally extremely fast
(computation, as well as the item's further production, can even stop after any of the bounds
is crossed):
$$g_{\mathrm{box}}(x\,|\,X_{\mathrm{tr}}) = \begin{cases} \text{anomaly } (=1), & \text{if } x \notin \bigcap_{j=1,\dots,d} (H_{j,l_j} \cap H_{j,h_j}), \\ \text{normal } (=0), & \text{otherwise,} \end{cases} \quad (1)$$

with $l_1, h_1, \dots, l_d, h_d$ being the lower and upper validation bounds (calibrated using $X_{\mathrm{tr}}$) for each axis, and $H_{j,l_j} = \{y \in \mathbb{R}^d \,|\, y^\top e_j \ge l_j\}$, $H_{j,h_j} = \{y \in \mathbb{R}^d \,|\, y^\top e_j \le h_j\}$, where $e_j$ is the unit vector of the $j$th axis. The fourth anomaly (red cross) remains invisible to rule (1).
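As an illustration, rule (1) amounts to a handful of lines; in the sketch below the bounds $l_j, h_j$ are calibrated as empirical quantiles of $X_{\mathrm{tr}}$, which is only one possible (assumed) calibration, not one prescribed by this article.

```python
import numpy as np

def fit_box(X_tr, alpha=0.0):
    # Calibrate the per-variable validation bounds l_j, h_j from the
    # training set; alpha = 0 gives the sample min/max, alpha > 0 clips
    # the tails (the calibration choice here is our assumption).
    lower = np.quantile(X_tr, alpha, axis=0)
    upper = np.quantile(X_tr, 1.0 - alpha, axis=0)
    return lower, upper

def g_box(x, lower, upper):
    # Rule (1): anomaly (= 1) iff some coordinate of x leaves its band;
    # in practice the check may stop at the first violated bound.
    return int(np.any(x < lower) or np.any(x > upper))
```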
Obviously, this fourth anomaly can be identified using a rule based on the Mahalanobis depth
$D_{\mathrm{Mah}}$, defined later by (7):

$$g_{\mathrm{Mah}}(x\,|\,X_{\mathrm{tr}}) = \begin{cases} \text{anomaly}, & \text{if } D_{\mathrm{Mah}}(x\,|\,X_{\mathrm{tr}}) < t_{\mathrm{Mah},X_{\mathrm{tr}}}, \\ \text{normal}, & \text{otherwise,} \end{cases} \quad (2)$$

where $t_{\mathrm{Mah},X_{\mathrm{tr}}}$ ($= 0.075$ here) is chosen based on $X_{\mathrm{tr}}$ so that the corresponding Mahalanobis depth contour is the largest one not exceeding the variable-wise validation bounds. While rule (2) easily identifies all present anomalies, two aspects shall be taken into account: (i) the training data does not itself contain anomalies, and (ii) it is large (especially when compared to $d$).
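In code, rule (2) is a one-line thresholding of the Mahalanobis depth. The sketch below uses the sample mean and covariance (the article defines $D_{\mathrm{Mah}}$ later by (7); the moment estimators here are our assumption) with the threshold value quoted above.

```python
import numpy as np

def mahalanobis_depth(x, X_tr):
    # D_Mah(x|X) = 1 / (1 + (x - mu)' S^{-1} (x - mu)), with center mu and
    # scatter S estimated here by the sample mean and covariance.
    mu = X_tr.mean(axis=0)
    precision = np.linalg.inv(np.cov(X_tr, rowvar=False))
    diff = x - mu
    return 1.0 / (1.0 + diff @ precision @ diff)

def g_mah(x, X_tr, t=0.075):
    # Rule (2) with the threshold t_{Mah,X_tr} = 0.075 used in the text.
    return "anomaly" if mahalanobis_depth(x, X_tr) < t else "normal"
```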
At the beginning of the production process, the phase where diagnostics are particularly
important, not many observations are available, but anomalies should still be identified
among them; a similar situation occurs when the produced items are time- or resource-consuming
and are not produced very often. To simulate this situation, consider Figure 1, right. Here,
the training data set contains 25 observations: 17 generated from a Gaussian distribution
(gray dots), the 4 former normal observations (green dots), and the same four anomalies
(red pluses and cross). Rule (2) (with the same threshold $t_{\mathrm{Mah},X_{\mathrm{tr}}}$
as before) provides a misleading ellipse (solid black line) that classifies all anomalies as
normal observations. When instead employing a rule based on the projection depth, defined
later by (12) (depicted by the blue dashed line), i.e.,
$$g_{\mathrm{prj}}(x\,|\,X_{\mathrm{tr}}) = \begin{cases} \text{anomaly}, & \text{if } D_{\mathrm{prj}}(x\,|\,X_{\mathrm{tr}}) < t_{\mathrm{prj},X_{\mathrm{tr}}}, \\ \text{normal}, & \text{otherwise,} \end{cases} \quad (3)$$

all four anomalies are identified. Furthermore, when all 500 observations become available
(e.g., after the production line has worked for a longer period of time), the rule from (3)
(with the same threshold $t_{\mathrm{prj},X_{\mathrm{tr}}} = 0.1575$) almost coincides with the
Mahalanobis depth rule (2); see Figure 1 (left) again. We refer the reader to Section 4.2 for
a discussion on the choice of the threshold.
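Since the exact projection depth involves a supremum over all directions of the unit sphere, a practical sketch typically approximates it over random directions. The illustration below (our approximation, not this article's algorithm, which is discussed in Section 6) plugs such an approximation into rule (3) with the threshold quoted above; because sampling directions can only underestimate the outlyingness, the approximation bounds the exact depth from above.

```python
import numpy as np

def projection_depth(x, X_tr, n_dirs=1000, seed=0):
    # Approximate D_prj(x|X) = 1 / (1 + O(x|X)), where O is the worst-case
    # robust outlyingness |u'x - med(u'X)| / MAD(u'X) over directions u.
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n_dirs, X_tr.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # unit directions
    proj = X_tr @ U.T                              # sample projections
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)    # robust scale per direction
    out = np.max(np.abs(U @ x - med) / mad)        # non-degenerate sample assumed
    return 1.0 / (1.0 + out)

def g_prj(x, X_tr, t=0.1575):
    # Rule (3) with the threshold t_{prj,X_tr} = 0.1575 used in the text.
    return "anomaly" if projection_depth(x, X_tr) < t else "normal"
```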
Figure 1: Four normal observations (green dots) and four anomalies (red pluses and cross);
contours of Mahalanobis (black solid) and projection (blue dashed) depths. Left: A sam-
ple of 500 bivariate Gaussian observations (black pixels). Right: 17 bivariate Gaussian
observations (gray dots).
1.3 Outline of conceptual challenges
In the rest of the article, after a brief introduction of data depth (Sections 2 and 3),
challenges connected with its application to anomaly detection are discussed. These can
be roughly split into three parts:

• Which depth notion to choose in the case at hand (Section 4)?

• Why use data depth, i.e., why and in which cases is it advantageous over existing
methods (Section 5)?

• How to deal with computational issues when employing data depth (Section 6)?
We gather remarks and a disclaimer in Section 8.
While intended for practitioners, this article is entirely based on simulated data.
This is mainly due to four reasons. First, enterprises are in general not willing to share
data because of its confidentiality, often for reasons of competition. Fortunately, this
tendency has started to weaken, as witnessed by numerous data challenges, because, depending
on the industrial sector, (a) data get outdated quickly and it is much more important to
(quickly) find clues on how to treat them, or/and (b) the enterprise does not have enough
internal expertise and searches for external ideas, releasing at least part of its data.
Second, industrial cases are normally the result of continuous work (or collaboration),
being augmented and labeled over time, often based on several data sets and using a priori
knowledge of domain experts: a complex situation not necessarily presentable as a simple
example. Third, the purpose of the data illustrations of this article is to pinpoint cases
in which employing depth-based methodology is advantageous, and this is better illustrated
with synthetic data. Fourth, for verification and comparison purposes, feedback is needed.
In any case, there is only a very small probability that a practitioner would encounter
exactly the same real data situation (repeated) in practice. (Though all the examples
presented in the article are based on simulated data, we shall continue calling them
observations in what follows.)
2 What is data depth?
Data depth is a statistical function that, given a data set $X \subset \mathbb{R}^d$, assigns to each element
of the space (where it is defined) a value (usually) between 0 and 1, which characterizes
how deep this element lies in the data set:

$$D: \mathbb{R}^d \times \mathbb{R}^{n \times d} \to [0, 1], \quad (x, X) \mapsto D(x\,|\,X). \quad (4)$$
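As one concrete instance of the mapping in (4), the halfspace (Tukey) depth of $x$ is the smallest fraction of sample points contained in a closed halfspace whose boundary passes through $x$; the random-direction approximation below is a sketch for illustration only (ours, not this article's algorithm), and bounds the exact depth from above.

```python
import numpy as np

def halfspace_depth(x, X, n_dirs=1000, seed=0):
    # Approximate Tukey (halfspace) depth: the minimal fraction of points
    # of X lying in a closed halfspace through x. Values lie in [0, 1]
    # and deep points score higher, matching the mapping in (4).
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # unit directions
    proj = X @ U.T                                 # sample projections
    xs = U @ x                                     # projections of x
    frac_ge = (proj >= xs).mean(axis=0)            # mass on one side
    frac_le = (proj <= xs).mean(axis=0)            # mass on the other side
    return float(min(frac_ge.min(), frac_le.min()))
```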
The element $x$ in (4) can be an observation that belongs to the data set, or any other
arbitrary element of the space, e.g., a future observation. Being a function of the data,
data depth inherits statistical properties of the data set, and thus describes it in one
manner or another and can serve many purposes: