
We assume that a trained model $A_i$, when applied to observations $\{x_t\}_{t=t_{\text{test}}}^{T}$, produces anomaly scores $\{s_t^i\}_{t=t_{\text{test}}}^{T}$, $s_t^i \in \mathbb{R}_{\geq 0}$. We assume that a higher anomaly score indicates that the observation is more likely to be an anomaly. However, we do not assume that the scores correspond to likelihoods of any particular statistical model or that the scores are comparable across models.
Model performance or, equivalently, the quality of the scores can be measured using a supervised metric $Q(\{s_t\}_{t=1}^{T}, \{y_t\}_{t=1}^{T})$, such as the area under the precision-recall curve or the best $F_1$ score, both commonly used in the literature. We discuss the choice of the quality metric in the next section.
In general, rather than considering a single time-series, we perform model selection for a set of $L$ time-series with observations $X = \{\{x_t^j\}_{t=1}^{T}\}_{j=1}^{L}$ and labels $Y = \{\{y_t^j\}_{t=1}^{T}\}_{j=1}^{L}$, where $j$ indexes time-series. Let $X_{\text{train}}$, $X_{\text{test}}$, and $Y_{\text{test}}$ denote the train and test portions of the observations, and the test portion of the labels, respectively. We are now ready to introduce the following problem.
Problem 1. Unsupervised Time-series Anomaly Detection Model Selection.
Given observations $X_{\text{test}}$ and a set of models $M = \{A_i\}_{i=1}^{N}$ trained using $X_{\text{train}}$, select a model that maximizes the anomaly detection quality metric $Q(A_i(X_{\text{test}}), Y_{\text{test}})$. The selection procedure cannot use labels.
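As an illustration only, the following minimal Python sketch formalizes this selection protocol. The `score(X)` interface on the candidate models and the label-free `surrogate` criterion are assumptions made for exposition (concrete surrogate metrics are introduced in Section 3); this is not the paper's implementation.

```python
def select_model(models, X_test, surrogate):
    """Sketch of Problem 1: pick the model that maximizes a label-free
    surrogate criterion. Each element of `models` is assumed
    (hypothetically) to be a trained detector; `surrogate(A, X)` returns
    a real number, higher meaning a presumably better detector. Labels
    are never consulted here; they are only used afterwards to evaluate
    the quality metric Q of the selected model."""
    return max(models, key=lambda A: surrogate(A, X_test))
```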
2.1 Measuring Anomaly Detection Model Performance
Anomaly Detection can be viewed as a binary classification problem where each time point is classified as an anomaly
or a normal observation. Hence, the performance of a model $A_i$ can be measured using standard precision and recall
metrics. However, these metrics ignore the sequential nature of time-series; thus, time-series anomaly detection is
usually evaluated using adjusted versions of precision and recall [Paparrizos et al., 2022b]. We adopt widely used
adjusted versions of precision and recall [Xu et al., 2018, Challu et al., 2022, Shen et al., 2020, Su et al., 2019, Carmona
et al., 2021]. These metrics treat time points as independent samples, except when an anomaly lasts for several
consecutive time points. In this case, detecting any of these points is treated as if all points inside the anomalous
segment were detected.
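For concreteness, a minimal NumPy sketch of this segment-wise adjustment could look as follows; here `pred` and `labels` are binary arrays over time points, and the function names are ours, introduced only for illustration.

```python
import numpy as np

def point_adjust(pred, labels):
    """If any point inside a contiguous anomalous segment is flagged,
    treat the entire segment as detected."""
    pred, labels = pred.astype(bool), labels.astype(bool)
    t, T = 0, len(labels)
    while t < T:
        if labels[t]:
            end = t
            while end < T and labels[end]:   # locate end of the anomalous segment
                end += 1
            if pred[t:end].any():            # one hit inside the segment ...
                pred[t:end] = True           # ... counts as detecting all of it
            t = end
        else:
            t += 1
    return pred

def adjusted_precision_recall(pred, labels):
    """Standard precision and recall computed on the adjusted predictions."""
    labels = labels.astype(bool)
    adj = point_adjust(pred, labels)
    tp = np.sum(adj & labels)
    fp = np.sum(adj & ~labels)
    fn = np.sum(~adj & labels)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall
```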
Adjusted precision and recall can be summarized in a metric called the adjusted $F_1$ score, which is the harmonic mean of the two. The adjusted $F_1$ score depends on the choice of a decision threshold on anomaly scores. In line with several recent studies, we consider threshold selection as a problem orthogonal to ours [Schmidl et al., 2022, Laptev et al., 2015, Blázquez-García et al., 2021, Rebjock et al., 2021, Paparrizos et al., 2022b,a] and therefore consider metrics that summarize model performance across all possible thresholds. Common metrics include the area under the precision-recall curve and the best $F_1$ (maximum $F_1$ over all possible thresholds). Like Xu et al. [2018], we found a strong correlation between the two (App. A.10). We also found the best $F_1$ to have a statistically significant positive correlation with the volume under the surface of the ROC curve, a recently proposed robust evaluation metric (App. A.9). Thus, in the remainder of the paper, we restrict our analysis to identifying models with the highest adjusted best $F_1$ score (i.e., $Q$ is the adjusted best $F_1$).
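For clarity, here is a sketch of the resulting quality metric $Q$, the adjusted best $F_1$, building on the `adjusted_precision_recall` helper from the sketch above; the candidate thresholds are simply the observed scores, whereas a practical implementation might subsample them.

```python
def adjusted_best_f1(scores, labels):
    """Maximum point-adjusted F1 over all candidate decision thresholds."""
    best = 0.0
    for tau in np.unique(scores):            # every observed score as a threshold
        p, r = adjusted_precision_recall(scores >= tau, labels)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```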
3 Surrogate Metrics of Model Performance
The problem of unsupervised model selection is often viewed as finding an unsupervised metric that correlates well with supervised metrics of interest [Ma et al., 2021, Goix, 2016, Lin et al., 2020, Duan et al., 2019]. Each unsupervised metric serves as a noisy measure of “model goodness” and reduces the problem to picking the best-performing model according to that metric. We identified three classes of imperfect metrics that closely align with expert intuition in predicting the performance of time-series anomaly detection models. Our metrics are unsupervised because they do not require anomaly labels. However, some of them, such as the $F_1$ score on synthetic anomalies, are typically used for supervised evaluation. To avoid confusion, we use the term surrogate for our metrics. Below we elaborate on each class of surrogate metrics, starting with the intuition behind each class.
Prediction Error
If a model can forecast or reconstruct a time-series well, it must also be a good anomaly detector. A large number of anomaly detection methods are based on forecasting or reconstruction [Schmidl et al., 2022], and for such models we can compute the forecasting or reconstruction error without anomaly labels. We collectively call these prediction error metrics and consider a common set of statistics, such as the mean absolute error, mean squared error, mean absolute percentage error, symmetric mean absolute percentage error, and the likelihood of observations. For multivariate time-series, we average each metric across all variables. For example, given a series of observations $\{x_t\}_{t=1}^{T}$ and their predictions $\{\hat{x}_t\}_{t=1}^{T}$, the mean squared error is defined as $\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}(x_t - \hat{x}_t)^2$. In the interest of space, we refer the reader to a textbook (e.g., Hyndman and Athanasopoulos [2018]) for definitions of the other metrics.
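As a concrete illustration, a short NumPy sketch of these prediction-error statistics, averaged over time and across variables as described above; the small `eps` guard and the omission of the model-specific likelihood term are our simplifications.

```python
import numpy as np

def prediction_error_metrics(x, x_hat, eps=1e-8):
    """Prediction-error surrogate metrics for observations x and model
    predictions x_hat, both of shape (T,) or (T, d). Each metric is
    averaged over time and, for multivariate series, across variables."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    err = x - x_hat
    return {
        "MSE":   np.mean(err ** 2),
        "MAE":   np.mean(np.abs(err)),
        "MAPE":  np.mean(np.abs(err) / (np.abs(x) + eps)),
        "sMAPE": np.mean(2 * np.abs(err) / (np.abs(x) + np.abs(x_hat) + eps)),
    }
```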
Prior work on time-series anomaly detection [Saganowski and Andrysiak, 2020, Laptev et al., 2015, Kuang et al., 2022]