Anomaly detection using data depth:
multivariate case
Pavlo Mozharovskyi
LTCI, Télécom Paris, Institut Polytechnique de Paris
Romain Valla
LTCI, Télécom Paris, Institut Polytechnique de Paris
July 8, 2024
Abstract
Anomaly detection is a branch of data analysis and machine learning which aims
at identifying observations that exhibit abnormal behaviour. Be it measurement
errors, disease development, severe weather, production quality defects (defective items)
or failed equipment, financial frauds or crisis events, their on-time identification,
isolation and explanation constitute an important task in almost any branch of science
and industry. By providing a robust ordering, data depth, a statistical function that
measures the belongingness of any point of the space to a data set, becomes a particularly
useful tool for the detection of anomalies. Already known for its theoretical properties,
data depth has undergone substantial computational developments in the last decade,
and particularly in recent years, which have made it applicable to contemporary-sized
problems of data analysis and machine learning.
In this article, data depth is studied as an efficient anomaly detection tool, assigning
abnormality labels to observations with lower depth values, in a multivariate
setting. Practical questions concerning the necessity and reasonability of invariances,
the shape of the depth function, its robustness and computational complexity, and the
choice of the threshold are discussed. Illustrations include use cases that underline the
advantageous behaviour of data depth in various settings.
Keywords: Data depth, anomaly detection, robustness, affine invariance, computational statistics, projection depth, halfspace depth, visualization, data analysis.
1 Motivation
Being applicable in a large variety of domains, anomaly detection is increasingly gaining
popularity among researchers and practitioners. Although in use for decades, it constitutes
a contemporary domain of rapid development, driven by growing demand in areas such as
industry, economy, and the social sciences. With the large amounts of data recorded in
modern applications and the constantly present probability of abnormal events, anomalies
can no longer be identified by an operator's hand: automatic procedures are necessary.
It is not the goal of the current article to provide a complete overview of anomaly
detection methods; the reader is referred to Chandola et al. (2009), and to the following
Sections 1.1 and 1.2 for intuition. Here, a narrower question is in scope: why and how to
employ data depth for anomaly detection?
1.1 Difference from outlier detection
With the two terms "outlier" and "anomaly" being used by two communities with small overlap,
a discussion of their similarity is important.
From a statistical point of view, both outlier and anomaly detection focus on identifying
atypical observations. Nevertheless, there is a substantial difference in the application of
methods from these two groups. First of all, while the term "outlier detection" is
traditionally used by statisticians, "anomaly detection" has been adopted by the machine
learning community. As a consequence, (more theoretically oriented) statisticians "did
not need", and were often unaware of, (some of the) anomaly detection methods developed
by the machine learning community, while, when searching for practical solutions in
applications, machine learners did not find outlier detection methods sufficiently flexible
(w.r.t. the data space and the shape of the distribution) or scalable (with the number of
observations and variables). Furthermore, rigorous statistical analysis and inference tools,
often at the center of attention for statisticians, frequently do not exist for anomaly
detection methods, which often take the form of heuristics.
Indeed, the difference between "outlier" and "anomaly" is perhaps best described
through application. Given a data set at hand, the task of identifying outliers
consists in searching for observations that do not resemble the majority of the data set.
The "anomaly detection" approach is more operational and rather follows the philosophy
of machine learning. That is, given a training data set, which may or may not itself
contain anomalies, the task is to construct a rule (training phase) which can assign (in
the detection phase) each observation of the space (including the observations of the
training set) either to the category of anomalies or to that of normal observations.
This work-flow imposes certain requirements on anomaly detection methodology, e.g.,
regarding the data set used to learn the anomaly detection rule. Should the rule simply
save the entire training data set (this would be the case when directly applying data depth),
only part of it, or none at all; should the rule be updated, and how often? Continuing the
example with data depth, in the learning phase (again in direct application) the training
data set is simply saved in memory and no computations are performed. When checking
the abnormality of a new observation, its data depth is computed w.r.t. the (saved)
training data set, based on which the decision about the observation's abnormality is
made. To keep the rule scalable (and fitting in limited machine memory), only a subset
of the data can be stored instead. In the case of the Mahalanobis depth, only parameters
(a center vector and a scatter matrix) need to be saved, and no data at all.
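To make this work-flow concrete, the following minimal sketch in Python separates the two phases (the class and method names are ours, purely illustrative, not an API from this article): training only memorizes what is needed, and detection computes the depth of a new observation and compares it with a threshold.

```python
import numpy as np

class DepthDetector:
    """Minimal sketch of the train/detect work-flow described above;
    class and method names are illustrative, not from the paper."""

    def __init__(self, depth_fn, threshold):
        self.depth_fn = depth_fn    # a data depth D(x | X)
        self.threshold = threshold  # depth below this value => anomaly

    def fit(self, X_tr):
        # Training phase (direct application of depth): simply memorize
        # the sample; no computation is performed at this stage.
        self.X_tr = np.asarray(X_tr)
        return self

    def is_anomaly(self, x):
        # Detection phase: depth of x w.r.t. the saved training set.
        # A Mahalanobis-depth variant would instead store only the center
        # vector and scatter matrix in fit() and discard the data.
        return self.depth_fn(x, self.X_tr) < self.threshold
```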
It is important to keep this operational aspect in mind when underlining the suitability
of data depth for anomaly detection in an industrial context in the following Section 1.2;
we focus on this aspect again in Section 5.
1.2 Industrial context
Consider the following example simulating industrial data. Think of a (potential) production
line that manufactures certain items. At several stages of the production process, measurements
are taken on each of the items to ensure the quality of the produced pieces. These
measurements can be numerous if the line is well automated, or scarce if this is not the case.
If, for each item, these measurements can be assembled in a vector (of length $d$), then the
item can be represented as a multivariate observation $x$ in a Euclidean space ($x \in \mathbb{R}^d$),
and the entire manufacturing process as a data set in the same space ($\{x_1, \dots, x_n\} \subset \mathbb{R}^d$).
Consider Figure 1, left. For visualization purposes, let us restrict attention to two
measurements, whence each produced item is represented by an observation with two variables
(= measurements). To construct an anomaly detection rule, a subset of production data is taken
as a training set, which may or may not itself contain anomalies; this corresponds to the 500
black pixels, denoted $X_{\mathrm{tr}} = \{x_1, \dots, x_{500}\} \subset \mathbb{R}^2$. Eight new
observations are now to be labeled either as normal observations or as anomalies, namely four
green dots (corresponding to normal items) and four anomalies (three red pluses and one red
cross). While in this bivariate visual case with $d = 2$ the task is trivially resolved by simple
visual inspection, it becomes much more complicated once $d$ increases.
The simplest, though still frequently applied, approach is to define a validation band for
each measurement, i.e., an upper and a lower bound for each variable: this rule is depicted by
black dashed lines parallel to the variables' axes and, if well calibrated, allows one to identify
three out of four anomalies (the red pluses), while being computationally extremely fast
(computation, as well as the item's further production, can even stop after any of the bounds
is crossed):
$$g_{\mathrm{box}}(x\,|\,X_{\mathrm{tr}}) = \begin{cases} \text{anomaly } (=1), & \text{if } x \notin \bigcap_{j=1,\dots,d} (H_{j,l_j} \cap H_{j,h_j}), \\ \text{normal } (=0), & \text{otherwise,} \end{cases} \quad (1)$$

with $l_1, h_1, \dots, l_d, h_d$ being the lower and upper validation bounds (calibrated using $X_{\mathrm{tr}}$) for each axis, and $H_{j,l_j} = \{y \in \mathbb{R}^d \,|\, y^\top e_j \ge l_j\}$, $H_{j,h_j} = \{y \in \mathbb{R}^d \,|\, y^\top e_j \le h_j\}$, where $e_j$ is the unit vector of the $j$th axis. The fourth anomaly (red cross) remains invisible to rule (1).
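As an illustration, rule (1) amounts to a handful of lines; in the sketch below the bounds $l_j, h_j$ are calibrated as empirical quantiles of $X_{\mathrm{tr}}$, which is only one possible (assumed) calibration, not one prescribed by this article.

```python
import numpy as np

def fit_box(X_tr, alpha=0.0):
    # Calibrate the per-variable validation bounds l_j, h_j from the
    # training set; alpha = 0 gives the sample min/max, alpha > 0 clips
    # the tails (the calibration choice here is our assumption).
    lower = np.quantile(X_tr, alpha, axis=0)
    upper = np.quantile(X_tr, 1.0 - alpha, axis=0)
    return lower, upper

def g_box(x, lower, upper):
    # Rule (1): anomaly (= 1) iff some coordinate of x leaves its band;
    # in practice the check may stop at the first violated bound.
    return int(np.any(x < lower) or np.any(x > upper))
```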
Obviously, this fourth anomaly can be identified using a rule based on the Mahalanobis depth
$D_{\mathrm{Mah}}$, defined later by (7):

$$g_{\mathrm{Mah}}(x\,|\,X_{\mathrm{tr}}) = \begin{cases} \text{anomaly}, & \text{if } D_{\mathrm{Mah}}(x\,|\,X_{\mathrm{tr}}) < t_{\mathrm{Mah},X_{\mathrm{tr}}}, \\ \text{normal}, & \text{otherwise,} \end{cases} \quad (2)$$

where $t_{\mathrm{Mah},X_{\mathrm{tr}}}$ ($= 0.075$ here) is chosen based on $X_{\mathrm{tr}}$ so that the corresponding Mahalanobis depth contour is the largest one not exceeding the variable-wise validation bounds. While rule (2) easily identifies all present anomalies, two aspects shall be taken into account: (i) the training data does not itself contain anomalies, and (ii) it is large (especially when compared to $d$).
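In code, rule (2) is a one-line thresholding of the Mahalanobis depth. The sketch below uses the sample mean and covariance (the article defines $D_{\mathrm{Mah}}$ later by (7); the moment estimators here are our assumption) with the threshold value quoted above.

```python
import numpy as np

def mahalanobis_depth(x, X_tr):
    # D_Mah(x|X) = 1 / (1 + (x - mu)' S^{-1} (x - mu)), with center mu and
    # scatter S estimated here by the sample mean and covariance.
    mu = X_tr.mean(axis=0)
    precision = np.linalg.inv(np.cov(X_tr, rowvar=False))
    diff = x - mu
    return 1.0 / (1.0 + diff @ precision @ diff)

def g_mah(x, X_tr, t=0.075):
    # Rule (2) with the threshold t_{Mah,X_tr} = 0.075 used in the text.
    return "anomaly" if mahalanobis_depth(x, X_tr) < t else "normal"
```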
At the beginning of the production process, the phase where diagnostics are particularly
important, not many observations are available, but anomalies should still be identified
among them; a similar situation occurs when the produced items are time- or resource-consuming
and are not produced very often. To simulate this situation, consider Figure 1, right. Here,
the training data set contains 25 observations: 17 generated from a Gaussian distribution
(gray dots), the 4 former normal observations (green dots), and the same four anomalies
(red pluses and cross). Rule (2) (with the same threshold $t_{\mathrm{Mah},X_{\mathrm{tr}}}$
as before) provides a misleading ellipse (solid black line) that classifies all anomalies as
normal observations. When instead employing a rule based on the projection depth, defined
later by (12) (depicted by the blue dashed line), i.e.,
$$g_{\mathrm{prj}}(x\,|\,X_{\mathrm{tr}}) = \begin{cases} \text{anomaly}, & \text{if } D_{\mathrm{prj}}(x\,|\,X_{\mathrm{tr}}) < t_{\mathrm{prj},X_{\mathrm{tr}}}, \\ \text{normal}, & \text{otherwise,} \end{cases} \quad (3)$$

all four anomalies are identified. Furthermore, when all 500 observations become available
(e.g., after the production line has worked for a longer period of time), the rule from (3)
(with the same threshold $t_{\mathrm{prj},X_{\mathrm{tr}}} = 0.1575$) almost coincides with the
Mahalanobis depth rule (2); see Figure 1 (left) again. We refer the reader to Section 4.2 for
a discussion on the choice of the threshold.
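Since the exact projection depth involves a supremum over all directions of the unit sphere, a practical sketch typically approximates it over random directions. The illustration below (our approximation, not this article's algorithm, which is discussed in Section 6) plugs such an approximation into rule (3) with the threshold quoted above; because sampling directions can only underestimate the outlyingness, the approximation bounds the exact depth from above.

```python
import numpy as np

def projection_depth(x, X_tr, n_dirs=1000, seed=0):
    # Approximate D_prj(x|X) = 1 / (1 + O(x|X)), where O is the worst-case
    # robust outlyingness |u'x - med(u'X)| / MAD(u'X) over directions u.
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n_dirs, X_tr.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # unit directions
    proj = X_tr @ U.T                              # sample projections
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)    # robust scale per direction
    out = np.max(np.abs(U @ x - med) / mad)        # non-degenerate sample assumed
    return 1.0 / (1.0 + out)

def g_prj(x, X_tr, t=0.1575):
    # Rule (3) with the threshold t_{prj,X_tr} = 0.1575 used in the text.
    return "anomaly" if projection_depth(x, X_tr) < t else "normal"
```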
Figure 1: Four normal observations (green dots) and four anomalies (red pluses and cross);
contours of Mahalanobis (black solid) and projection (blue dashed) depths. Left: A sam-
ple of 500 bivariate Gaussian observations (black pixels). Right: 17 bivariate Gaussian
observations (gray dots).
1.3 Outline of conceptual challenges
In the rest of the article, after a brief introduction of data depth (Sections 2 and 3),
challenges connected with its application to anomaly detection are discussed. These can
be roughly split into three parts:

• Which depth notion to choose in the case at hand (Section 4)?

• Why use data depth, i.e., why and in which cases is it advantageous over existing
methods (Section 5)?

• How to deal with computational issues when employing data depth (Section 6)?
We gather remarks and a disclaimer in Section 8.
While intended for practitioners, this article is entirely based on simulated data.
This is mainly due to four reasons. First, enterprises are in general not willing to share
data because of its confidentiality, often for reasons of competition. Fortunately, this
tendency has started to weaken, as witnessed by numerous data challenges, because, depending
on the industrial sector, (a) data get outdated quickly and it is much more important to
(quickly) find clues on how to treat them, or/and (b) the enterprise does not have enough
internal expertise and searches for external ideas, releasing at least part of its data.
Second, industrial cases are normally the result of continuous work (or collaboration),
being augmented and labeled over time, often based on several data sets and using a priori
knowledge of domain experts: a complex situation not necessarily presentable as a simple
example. Third, the purpose of the data illustrations of this article is to pinpoint cases
in which employing depth-based methodology is advantageous, and this is better illustrated
with synthetic data. Fourth, for verification and comparison purposes, feedback is needed.
In any case, there is only a very small probability that a practitioner would encounter
exactly the same real data situation (repeated) in practice. (Though all the examples
presented in the article are based on simulated data, we shall continue calling them
observations in what follows.)
2 What is data depth?
Data depth is a statistical function that, given a data set $X \subset \mathbb{R}^d$, assigns to each element
of the space (where it is defined) a value (usually) between 0 and 1, which characterizes
how deep this element lies in the data set:

$$D: \mathbb{R}^d \times \mathbb{R}^{n \times d} \to [0, 1], \quad (x, X) \mapsto D(x\,|\,X). \quad (4)$$
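As one concrete instance of the mapping in (4), the halfspace (Tukey) depth of $x$ is the smallest fraction of sample points contained in a closed halfspace whose boundary passes through $x$; the random-direction approximation below is a sketch for illustration only (ours, not this article's algorithm), and bounds the exact depth from above.

```python
import numpy as np

def halfspace_depth(x, X, n_dirs=1000, seed=0):
    # Approximate Tukey (halfspace) depth: the minimal fraction of points
    # of X lying in a closed halfspace through x. Values lie in [0, 1]
    # and deep points score higher, matching the mapping in (4).
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # unit directions
    proj = X @ U.T                                 # sample projections
    xs = U @ x                                     # projections of x
    frac_ge = (proj >= xs).mean(axis=0)            # mass on one side
    frac_le = (proj <= xs).mean(axis=0)            # mass on the other side
    return float(min(frac_ge.min(), frac_le.min()))
```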
The element $x$ in (4) can be an observation that belongs to the data set, or any other
arbitrary element of the space, e.g., a future observation. Being a function of the data,
data depth inherits statistical properties of the data set, and thus describes it in one
manner or another and can serve many purposes: