Unsupervised Model Selection for Time-series Anomaly Detection

UNSUPERVISED MODEL SELECTION FOR TIME-SERIES
ANOMALY DETECTION
A PREPRINT
Mononito Goswami, Cristian Challu
School of Computer Science
Carnegie Mellon University
{mgoswami,cchallu}@cs.cmu.edu
Laurent Callot, Lenon Minorics & Andrey Kan
AWS AI Labs
{lcallot,avkan}@amazon.com,
minorics@amazon.de
ABSTRACT
Anomaly detection in time-series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper answers this question, i.e., given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the $F_1$ score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. We then provide theoretical justification behind the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.
Keywords: Model Selection · Time-series Anomaly Detection · Unsupervised Learning
1 Introduction
Anomaly detection in time-series data has gained considerable attention from the academic and industrial research
communities due to the explosion in the amount of data produced and the number of automated systems requiring some
form of monitoring. A large number of anomaly detection methods have been developed to solve this task [Schmidl
et al., 2022, Blázquez-García et al., 2021], ranging from simple algorithms [Keogh et al., 2005, Ramaswamy et al.,
2000] to complex deep-learning models [Xu et al., 2018, Challu et al., 2022]. These models have significant variance in
performance across datasets [Schmidl et al., 2022, Paparrizos et al., 2022a], and evaluating their actual performance on
real-world anomaly detection tasks is non-trivial, even when labeled datasets are available [Wu and Keogh, 2021].
Labels are seldom available for many, if not most, anomaly detection tasks. Labels are indications of which time points
in a time-series are anomalous. The definition of an anomaly varies with the use case, but these definitions have in
common that anomalies are rare events. Hence, accumulating a sizable number of labeled anomalies typically requires
reviewing a large portion of a dataset by a domain expert. This is an expensive, time-consuming, subjective, and therefore error-prone task, which is a considerable hurdle to labeling even a subset of the data. Unsurprisingly, a large number of
time-series anomaly detection methods are unsupervised or semi-supervised – i.e. they do not require any anomaly
labels during training and inference.
There is no single universally best method [Schmidl et al., 2022, Paparrizos et al., 2022a]. Therefore, it is important to select the most accurate method for a given dataset without access to anomaly labels. The problem of unsupervised anomaly detection model selection has been overlooked in the literature, even though it is a key problem in practical applications. Thus, we offer an answer to the question: Given an unlabeled dataset and a set of candidate models, how can we select the most accurate model?
*Work carried out when the first two authors were interns at Amazon AWS AI Labs.
[Figure 1 diagram: trained models M1–M3 applied to the test TS yield surrogate metrics (prediction error, e.g. MSE; synthetic anomaly injection, e.g. scale anomaly; model centrality, e.g. distance to nearest neighbors), each inducing a ranking of the models; robust rank aggregation (RRA) combines the rankings to select the best model.]
Figure 1: The Model Selection Workflow. We identify three classes of surrogate metrics of model quality (Sec. 3), and
propose a novel robust rank aggregation framework to combine multiple rankings from metrics (Sec. 4).
Our approach is based on computing “surrogate” metrics that correlate with model performance yet do not require
anomaly labels, followed by aggregating model ranks induced by these metrics (Fig. 1). Empirical evaluation on 10
real-world datasets spanning diverse domains such as medicine, sports, etc. shows that
our approach can perform
unsupervised model selection as efficiently as selection based on labeling a subset of data.
In summary, our contributions are as follows:
• To the best of our knowledge, we propose one of the first methods for unsupervised selection of anomaly detection models on time-series data. To this end, we identify intuitive and effective unsupervised metrics for model performance. Prior work has used a few of these unsupervised metrics for problems other than time-series anomaly detection model selection.
• We propose a novel robust rank aggregation method for combining multiple surrogate metrics into a single model selection criterion. We show that our approach performs on par with selection based on labeling a subset of data.
• We conduct large-scale experiments on over 275 diverse time-series, spanning a gamut of domains such as medicine, entomology, etc. with 5 popular and widely-used anomaly detection models, each with 1 to 4 hyper-parameter combinations, resulting in over 5,000 trained models. Upon acceptance, we will make our code publicly available.
In the next section, we formalize the model selection problem. We then describe the surrogate metric classes in Sec.
3, and present our rank aggregation method in Sec. 4. Our empirical evaluation is described in Sec. 5. Finally, we
summarize related work in Sec. 6 and conclude in Sec. 7.
2 Preliminaries & The Model Selection Problem
Let $\{x_t, y_t\}_{t=1}^{T}$ denote a multivariate time-series (TS) with observations $(x_1, \dots, x_T)$, $x_t \in \mathbb{R}^d$, and anomaly labels $(y_1, \dots, y_T)$, $y_t \in \{0, 1\}$, where $y_t = 1$ indicates that the observation $x_t$ is an anomaly. The labels are only used for evaluating our selection approaches, but not for the model selection procedure itself.
Next, let $\mathcal{M} = \{A_i\}_{i=1}^{N}$ denote a set of $N$ candidate anomaly detection models. Each model $A_i$ is a tuple (detector, hyper-parameters), i.e., a combination of an anomaly detection method (e.g. LSTM-VAE [Park et al., 2018]) and a fully specified hyper-parameter configuration (e.g. for LSTM-VAE, hidden_layer_size=128, num_layers=2, ...). Here, we only consider models that do not require anomaly labels for training. Some models still require unlabeled time-series as training data. We therefore consider a train/test split $\{x_t\}_{t=1}^{t_{\mathrm{test}}-1}$, $\{x_t\}_{t=t_{\mathrm{test}}}^{T}$, where the former is used if a model requires training (without labels). In our notation, $A_i$ denotes a trained model.
We assume that a trained model $A_i$, when applied to observations $\{x_t\}_{t=t_{\mathrm{test}}}^{T}$, produces anomaly scores $\{s^i_t\}_{t=t_{\mathrm{test}}}^{T}$, $s^i_t \in \mathbb{R}_{\geq 0}$. We assume that a higher anomaly score indicates that the observation is more likely to be an anomaly. However, we do not assume that the scores correspond to likelihoods of any particular statistical model or that the scores are comparable across models.
Model performance or, equivalently, quality of the scores can be measured using a supervised metric $Q(\{s_t\}_{t=1}^{T}, \{y_t\}_{t=1}^{T})$, such as the area under the precision-recall curve or best $F_1$ score, commonly used in the literature. We discuss the choice of the quality metric in the next section.
In general, rather than considering a single time-series, we will be performing model selection for a set of $L$ time-series with observations $\mathcal{X} = \{\{x^j_t\}_{t=1}^{T}\}_{j=1}^{L}$ and labels $\mathcal{Y} = \{\{y^j_t\}_{t=1}^{T}\}_{j=1}^{L}$, where $j$ indexes time-series. Let $\mathcal{X}_{\mathrm{train}}$, $\mathcal{X}_{\mathrm{test}}$, and $\mathcal{Y}_{\mathrm{test}}$ denote the train and test portions of observations, and the test portion of labels, respectively. We are now ready to introduce the following problem.
Problem 1 (Unsupervised Time-series Anomaly Detection Model Selection). Given observations $\mathcal{X}_{\mathrm{test}}$ and a set of models $\mathcal{M} = \{A_i\}_{i=1}^{N}$ trained using $\mathcal{X}_{\mathrm{train}}$, select a model that maximizes the anomaly detection quality metric $Q(A_i(\mathcal{X}_{\mathrm{test}}), \mathcal{Y}_{\mathrm{test}})$. The selection procedure cannot use labels.
2.1 Measuring Anomaly Detection Model Performance
Anomaly detection can be viewed as a binary classification problem where each time point is classified as an anomaly or a normal observation. Hence, the performance of a model $A_i$ can be measured using standard precision and recall metrics. However, these metrics ignore the sequential nature of time-series; thus, time-series anomaly detection is usually evaluated using adjusted versions of precision and recall [Paparrizos et al., 2022b]. We adopt widely used adjusted versions of precision and recall [Xu et al., 2018, Challu et al., 2022, Shen et al., 2020, Su et al., 2019, Carmona et al., 2021]. These metrics treat time points as independent samples, except when an anomaly lasts for several consecutive time points. In this case, detecting any of these points is treated as if all points inside the anomalous segment were detected.
Adjusted precision and recall can be summarized in a metric called adjusted $F_1$ score, which is the harmonic mean of the two. The adjusted $F_1$ score depends on the choice of a decision threshold on anomaly scores. In line with several recent studies, we consider threshold selection as a problem orthogonal to our problem [Schmidl et al., 2022, Laptev et al., 2015, Blázquez-García et al., 2021, Rebjock et al., 2021, Paparrizos et al., 2022b,a] and therefore consider metrics that summarize model performance across all possible thresholds. Common metrics include the area under the precision-recall curve and best $F_1$ (maximum $F_1$ over all possible thresholds). Like Xu et al. [2018], we found a strong correlation between the two (App. A.10). We also found best $F_1$ to have a statistically significant positive correlation with the volume under the surface of the ROC curve, a recently proposed robust evaluation metric (App. A.9). Thus, in the remainder of the paper, we restrict our analysis to identifying models with the highest adjusted best $F_1$ score (i.e. $Q$ is adjusted best $F_1$).
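To make the metric concrete, the following is a minimal sketch of the point-adjustment convention and a best-$F_1$ sweep over thresholds; the function names and the quantile-based threshold grid are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def point_adjust(pred, labels):
    """If any point inside a contiguous anomalous segment is flagged,
    mark the whole segment as detected (point-adjustment convention)."""
    adjusted = pred.copy()
    t = 0
    while t < len(labels):
        if labels[t] == 1:
            end = t
            while end < len(labels) and labels[end] == 1:
                end += 1
            if adjusted[t:end].any():
                adjusted[t:end] = 1
            t = end
        else:
            t += 1
    return adjusted

def adjusted_best_f1(scores, labels, n_thresholds=100):
    """Sweep candidate thresholds over the anomaly scores and return the
    best adjusted F1, so no threshold needs to be fixed in advance."""
    best = 0.0
    for thr in np.quantile(scores, np.linspace(0, 1, n_thresholds)):
        pred = point_adjust((scores >= thr).astype(int), labels)
        tp = ((pred == 1) & (labels == 1)).sum()
        fp = ((pred == 1) & (labels == 0)).sum()
        fn = ((pred == 0) & (labels == 1)).sum()
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        if prec + rec:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best
```

Sweeping thresholds over score quantiles keeps the metric invariant to the absolute scale of a model's scores, which matters because scores are not assumed comparable across models.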
3 Surrogate Metrics of Model Performance
The problem of unsupervised model selection is often viewed as finding an unsupervised metric which correlates well with supervised metrics of interest [Ma et al., 2021, Goix, 2016, Lin et al., 2020, Duan et al., 2019]. Each unsupervised metric serves as a noisy measure of “model goodness” and reduces the problem to picking the best performing model according to the metric. We identified three classes of imperfect metrics that closely align with expert intuition in predicting the performance of time-series anomaly detection models. Our metrics are unsupervised because they do not require anomaly labels. However, some metrics, such as the $F_1$ score on synthetic anomalies, are typically used for supervised evaluation. To avoid confusion, we use the term surrogate for our metrics. Below we elaborate on each class of surrogate metrics, starting with the intuition behind each class.
Prediction Error: If a model can forecast or reconstruct time-series well, it must also be a good anomaly detector. A large number of anomaly detection methods are based on forecasting or reconstruction [Schmidl et al., 2022], and for such models, we can compute forecasting or reconstruction error without anomaly labels. We collectively call these prediction error metrics and consider a common set of statistics, such as mean absolute error, mean squared error, mean absolute percentage error, symmetric mean absolute percentage error, and the likelihood of observations. For multivariate time-series, we average each metric across all the variables. For example, given a series of observations $\{x_t\}_{t=1}^{T}$ and their predictions $\{\hat{x}_t\}_{t=1}^{T}$, the mean squared error is defined as $\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}(x_t - \hat{x}_t)^2$. In the interest of space, we refer the reader to a textbook (e.g., Hyndman and Athanasopoulos [2018]) for definitions of other metrics. Prior work on time-series anomaly detection [Saganowski and Andrysiak, 2020, Laptev et al., 2015, Kuang et al., 2022]
used forecasting error to select between some classes of models. Prediction metrics are not perfect for model selection
because the time-series can contain anomalies. Low prediction errors are desirable for all points except anomalies,
but we do not know where anomalies are without labels. While prediction metrics can only be computed for anomaly
detection methods based on time-series forecasting or reconstruction, the next two classes of metrics apply to any
anomaly detection method.
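As a concrete illustration, the sketch below computes several of the prediction-error statistics listed above from a model's forecasts or reconstructions, averaging over the $d$ variables as described; the function name, array layout, and the small epsilon guard are assumptions on our part.

```python
import numpy as np

def prediction_error_metrics(x, x_hat, eps=1e-8):
    """x, x_hat: arrays of shape (T, d) holding observations and their
    forecasts/reconstructions. Each metric is averaged over the d variables."""
    err = x - x_hat
    mse = np.mean(err ** 2, axis=0)                                   # per-variable MSE
    mae = np.mean(np.abs(err), axis=0)
    mape = np.mean(np.abs(err) / (np.abs(x) + eps), axis=0)
    smape = np.mean(2 * np.abs(err) / (np.abs(x) + np.abs(x_hat) + eps), axis=0)
    return {name: vals.mean() for name, vals in
            {"MSE": mse, "MAE": mae, "MAPE": mape, "sMAPE": smape}.items()}
```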
Figure 2: Different kinds of synthetic anomalies. We inject 9 different types of synthetic anomalies randomly, one at a time, across multiple copies of the original time-series. The dashed line is the original time-series before anomalies are injected, the marked segment is the injected anomaly, and the solid line is the time-series after anomaly injection.
Synthetic Anomaly Injection: A good anomaly detection model will perform well on data with synthetically injected anomalies. Some studies have previously explored using synthetic anomaly injection to train anomaly detection models [Carmona et al., 2021]. Here, we extend this line of research by systematically evaluating model selection capability after injecting different kinds of anomalies. Given an input time-series without anomaly labels, we randomly inject an anomaly of a particular type. The location of the injected anomaly is then treated as a (pseudo-)positive label, while the rest of the time points are treated as (pseudo-)negative labels. We then select a model that achieves the best adjusted $F_1$ score with respect to the pseudo-labels. Instead of relying on complex deep generative models [Wen et al., 2020], we devise simple and efficient yet effective procedures to inject anomalies of previously described types [Schmidl et al., 2022, Wu and Keogh, 2021], as shown in Fig. 2. We defer the details of anomaly injection approaches to Appendix A.5. Model selection based on anomaly injection might not be perfect because (i) real anomalies might be of a different type compared to injected ones, and (ii) there may already be anomalies for which our pseudo-labels will be negative.
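The nine injection procedures are deferred to Appendix A.5; as a hedged illustration of the general recipe, the sketch below injects a single spike anomaly into a univariate series and returns the pseudo-labels against which each candidate model's adjusted best $F_1$ would be computed. The spike magnitude, placement, and function names are hypothetical choices, not the paper's actual procedures.

```python
import numpy as np

def inject_spike(x, rng, scale=3.0):
    """Copy a univariate series, add a spike of `scale` standard deviations
    at a random time point, and return pseudo-labels marking that point."""
    x_aug = x.copy()
    pseudo_labels = np.zeros(len(x), dtype=int)
    t = rng.integers(len(x))
    x_aug[t] += scale * x.std()       # injected (pseudo-)positive
    pseudo_labels[t] = 1              # every other point is (pseudo-)negative
    return x_aug, pseudo_labels

# Usage sketch: score each candidate model on the augmented series and rank
# models by adjusted_best_f1(model_scores, pseudo_labels).
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.standard_normal(500)
x_aug, pseudo_labels = inject_spike(x, rng)
```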
Model Centrality: There is only one ground truth, thus models close to the ground truth are close to each other. Hence, the most “central” model is the best one. Centrality-based metrics have seen recent success in unsupervised model selection for disentangled representation learning [Lin et al., 2020, Duan et al., 2019] and anomaly detection [Ma et al., 2021]. To adapt this approach to time-series, we leverage the scores from the anomaly detection models to define a notion of model proximity. Anomaly scores of different models are not directly comparable. However, since anomaly detection is performed by thresholding the scores, we can consider the ranking of time points by their anomaly scores. Let $\sigma_k(i)$ denote the rank of time point $i$ according to the scores from model $k$. We define the distance between models $A_k$ and $A_l$ as the Kendall's $\tau$ distance, which is the number of pairwise disagreements:

$$d_\tau(\sigma_k, \sigma_l) = \sum_{i<j} \mathbb{I}\{(\sigma_k(i) - \sigma_k(j))(\sigma_l(i) - \sigma_l(j)) < 0\} \qquad (1)$$

Next, we measure model centrality as the average distance to its $K$ nearest neighbors, where $K$ is a parameter. This metric favors models which are close to their nearest neighbors. The centrality-based approach is imperfect, because “bad” models might produce similar results and form a cluster.
Each metric, while imperfect, is predictive of anomaly detection performance but to a varying extent as we show in
Sec. 5. Access to multiple noisy metrics raises a natural question: How can we combine multiple imperfect metrics to
improve model selection performance?
4 Robust Rank Aggregation
Many studies have explored the benefits of combining multiple sources of noisy information to reduce error in different
areas of machine learning such as crowd-sourcing [Dawid and Skene, 1979] and programmatic weak supervision
[Ratner et al., 2016, Goswami et al., 2021, Dey et al., 2022, Gao et al., 2022, Zhang et al., 2022a]. Each of our surrogate
metrics induces a (noisy) model ranking. Thus two natural questions arise in the context of our work: (1) How do we
reliably combine multiple model rankings? (2) Does aggregating multiple rankings help in model selection?
4.1 Approaching model selection as a Rank Aggregation Problem
Let $[N] = \{1, 2, \cdots, N\}$ denote the universe of elements and $S_N$ be the set of permutations on $[N]$. For $\sigma \in S_N$, let $\sigma(i)$ denote the rank of element $i$ and $\sigma^{-1}(j)$ denote the element at rank $j$. Then the rank aggregation problem is as follows.
Problem 2 (Rank Aggregation). Given a collection of $M$ permutations $\sigma_1, \cdots, \sigma_M \in S_N$ of $N$ items, find $\sigma^* \in S_N$ that best summarizes these permutations according to some objective $C(\sigma^*, \sigma_1, \cdots, \sigma_M)$.

In our context, the $N$ items correspond to the $N$ candidate anomaly detection models, $M$ corresponds to the number of surrogate metrics, $\sigma^*$ is the best summary ranking of models, and $\sigma^{*-1}(1)$ is the predicted best model.²
The problem of rank aggregation has been extensively studied in the social choice, bio-informatics, and machine learning literature. Here, we consider Kemeny rank aggregation, which involves choosing a distance on the set of rankings and finding a barycentric or median ranking [Korba et al., 2017]. Specifically, the rank aggregation problem is known as the Kemeny-Young problem if the aggregation objective is defined as $C = \frac{1}{M}\sum_{i=1}^{M} d_\tau(\sigma, \sigma_i)$, where $d_\tau$ is the Kendall's $\tau$ distance (Eq. 1).
Kemeny aggregation has been shown to satisfy many desirable properties, but it is NP-hard even with as few as 4 rankings [Dwork et al., 2001]. To this end, we consider an efficient approximate solution using the Borda method [Borda, 1784], in which each item (model) receives points from each ranking (surrogate metric). For example, if a model is ranked $r$-th by a surrogate metric, it receives $(N - r)$ points from that metric. The models are then ranked in descending order of total points received from all metrics.
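A minimal sketch of the Borda step; the assumption that `rankings[m, i]` holds the rank (1 = best) assigned to model i by surrogate metric m is our own layout for illustration, since the paper does not fix a data format.

```python
import numpy as np

def borda(rankings):
    """rankings: array of shape (M, N); rankings[m, i] is the rank of model i
    under surrogate metric m (1 = best). A model ranked r-th receives N - r
    points; models are ordered by descending total points."""
    rankings = np.asarray(rankings)
    M, N = rankings.shape
    points = (N - rankings).sum(axis=0)      # total Borda points per model
    order = np.argsort(-points)              # model indices, best first
    return order, order[0]

# Example: 3 surrogate metrics ranking 4 models; model 0 wins here.
order, best = borda([[1, 2, 3, 4],
                     [2, 1, 3, 4],
                     [1, 3, 2, 4]])
```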
Suppose that we have several surrogate metrics for ranking anomaly detection models. Why would it be beneficial to
aggregate the rankings induced by these metrics? The following theorem provides an insight into the benefit of Borda
rank aggregation.
Consider two arbitrary models $i$ and $j$ with (unknown) ground truth rankings $\sigma(i)$ and $\sigma(j)$. Without loss of generality, assume $\sigma(i) < \sigma(j)$, i.e., model $i$ is better. Next, let us view each surrogate metric $k = 1, \dots, M$ as a random variable $\Omega_k$ taking values in permutations $S_N$, such that $\Omega_k(i) \in [N]$ denotes the rank of item $i$. We apply Borda rank aggregation on realizations of these random variables. Theorem 1 states that, under certain assumptions, the probability of Borda ranking making a mistake (i.e., placing model $i$ lower than $j$) is upper bounded in a way such that increasing $M$ decreases this bound.
Theorem 1. Assume that the $\Omega_k$'s are pairwise independent, and that $P(\Omega_k(i) < \Omega_k(j)) > 0.5$ for any arbitrary models $i$ and $j$, i.e., each surrogate metric is better than random. Then the probability that Borda aggregation makes a mistake in ranking items $i$ and $j$ is at most $2\exp\!\left(-\frac{M\epsilon^2}{2}\right)$, for some fixed $\epsilon$.
The proof of the theorem is given in Appendix A.8. We do not assume that the $\Omega_k$ are identically distributed, but only that the $\Omega_k$'s are pairwise independent. We emphasize that the notion of independence of the $\Omega_k$ as random variables is different from rank correlation between realizations of these variables (i.e., permutations). For example, two permutations drawn independently from a Mallows distribution can have a high Kendall's $\tau$ coefficient (rank correlation). Therefore, our assumption of independence, commonly adopted in theoretical analysis [Ratner et al., 2016], does not contradict our observation of high rank correlation between surrogate metrics.
The Borda method weighs all surrogate metrics equally. However, our experience and theoretical analysis suggest that focusing only on “good” metrics can improve performance after aggregation (see Sec. 5 and Corollary 1). But how do we identify “good” ranks without access to anomaly labels?
4.2 Empirical Influence and Robust Rank Aggregation
Since we do not have access to anomaly labels, we introduce an intuitive unsupervised way to identify “good” or “reliable” rankings based on the notion of empirical influence. Empirical influence functions have been previously used to detect influential cases in regression [Cook and Weisberg, 1980]. Given a collection of $M$ rankings $\mathcal{S} = \{\sigma_1, \cdots, \sigma_M\}$, let $\mathrm{Borda}(\mathcal{S})$ denote the aggregate ranking from the Borda method. Then for some $\sigma_i \in \mathcal{S}$, the empirical influence $EI(\sigma_i, \mathcal{S})$ measures the impact of $\sigma_i$ on the Borda solution:

$$EI(\sigma_i, \mathcal{S}) = f(\mathcal{S}) - f(\mathcal{S} \setminus \sigma_i), \quad \text{where } f(\mathcal{A}) = \frac{1}{|\mathcal{A}|}\sum_{i=1}^{|\mathcal{A}|} d_\tau(\mathrm{Borda}(\mathcal{A}), \sigma_i) \qquad (2)$$
Under the assumption that a majority of rankings are good, EI is likely to identify bad rankings as ones with high positive EI. This is intuitive since removing a bad ranking results in a larger decrease in the objective, i.e., a smaller $f(\mathcal{S} \setminus \{\sigma_i\})$.
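A self-contained sketch of Eq. 2; representing the Borda aggregate as the rank vector derived from total points and computing $d_\tau$ via SciPy's Kendall $\tau$ are implementation choices of ours, not the paper's code.

```python
import numpy as np
from scipy.stats import kendalltau

def borda_rank_vector(rankings):
    """Aggregate ranking from the Borda method: rank models (1 = best) by
    descending total points, where a model ranked r-th gets N - r points."""
    rankings = np.asarray(rankings)
    points = (rankings.shape[1] - rankings).sum(axis=0)
    return np.argsort(np.argsort(-points)) + 1          # rank vector, 1 = best

def kendall_dist(r1, r2):
    """Kendall tau distance between two untied rank vectors."""
    tau, _ = kendalltau(r1, r2)
    n = len(r1)
    return (1.0 - tau) * n * (n - 1) / 4.0

def f(rankings):
    """Mean Kendall distance of the member rankings to their Borda aggregate."""
    agg = borda_rank_vector(rankings)
    return np.mean([kendall_dist(agg, r) for r in rankings])

def empirical_influence(rankings):
    """EI(sigma_i, S) = f(S) - f(S \\ sigma_i) for each ranking; a large
    positive EI flags a ranking whose removal most improves the consensus."""
    rankings = np.asarray(rankings)
    base = f(rankings)
    return [base - f(np.delete(rankings, i, axis=0)) for i in range(len(rankings))]
```

High positive EI values then mark the candidate rankings to be filtered out by the clustering step described next.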
We cluster all rankings based on their EI using a two-component single-linkage agglomerative clustering [Murtagh and
Contreras, 2012], discard all rankings from the “bad" cluster, and apply Borda aggregation to the remaining ones. To
²We use terms “ranking” and “permutation” as synonyms. The subtle difference is that in a ranking multiple models can receive the same rank. Where necessary, we break ties uniformly at random.