
We assume that a trained model $A_i$, when applied to observations $\{x_t\}_{t=t_{\text{test}}}^{T}$, produces anomaly scores $\{s_t^i\}_{t=t_{\text{test}}}^{T}$, $s_t^i \in \mathbb{R}_{\geq 0}$. We assume that a higher anomaly score indicates that the observation is more likely to be an anomaly. However, we do not assume that the scores correspond to likelihoods of any particular statistical model or that the scores are comparable across models.
Model performance or, equivalently, the quality of the scores can be measured using a supervised metric $Q(\{s_t\}_{t=1}^{T}, \{y_t\}_{t=1}^{T})$, such as the area under the precision-recall curve or the best $F_1$ score, both commonly used in the literature. We discuss the choice of the quality metric in the next section.
In general, rather than considering a single time-series, we perform model selection for a set of $L$ time-series with observations $X = \{\{x_t^j\}_{t=1}^{T}\}_{j=1}^{L}$ and labels $Y = \{\{y_t^j\}_{t=1}^{T}\}_{j=1}^{L}$, where $j$ indexes time-series. Let $X_{\text{train}}$, $X_{\text{test}}$, and $Y_{\text{test}}$ denote the train and test portions of the observations, and the test portion of the labels, respectively. We are now ready to introduce the following problem.
Problem 1. Unsupervised Time-series Anomaly Detection Model Selection.
Given observations $X_{\text{test}}$ and a set of models $M = \{A_i\}_{i=1}^{N}$ trained using $X_{\text{train}}$, select a model that maximizes the anomaly detection quality metric $Q(A_i(X_{\text{test}}), Y_{\text{test}})$. The selection procedure cannot use labels.
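As an illustration only, the following minimal Python sketch formalizes this selection protocol. The `score(X)` interface on the candidate models and the label-free `surrogate` criterion are assumptions made for exposition (concrete surrogate metrics are introduced in Section 3); this is not the paper's implementation.

```python
def select_model(models, X_test, surrogate):
    """Sketch of Problem 1: pick the model that maximizes a label-free
    surrogate criterion. Each element of `models` is assumed
    (hypothetically) to be a trained detector; `surrogate(A, X)` returns
    a real number, higher meaning a presumably better detector. Labels
    are never consulted here; they are only used afterwards to evaluate
    the quality metric Q of the selected model."""
    return max(models, key=lambda A: surrogate(A, X_test))
```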
2.1 Measuring Anomaly Detection Model Performance
Anomaly Detection can be viewed as a binary classification problem where each time point is classified as an anomaly
or a normal observation. Hence, the performance of a model $A_i$ can be measured using standard precision and recall
metrics. However, these metrics ignore the sequential nature of time-series; thus, time-series anomaly detection is
usually evaluated using adjusted versions of precision and recall [Paparrizos et al., 2022b]. We adopt widely used
adjusted versions of precision and recall [Xu et al., 2018, Challu et al., 2022, Shen et al., 2020, Su et al., 2019, Carmona
et al., 2021]. These metrics treat time points as independent samples, except when an anomaly lasts for several
consecutive time points. In this case, detecting any of these points is treated as if all points inside the anomalous
segment were detected.
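For concreteness, a minimal NumPy sketch of this segment-wise adjustment could look as follows; here `pred` and `labels` are binary arrays over time points, and the function names are ours, introduced only for illustration.

```python
import numpy as np

def point_adjust(pred, labels):
    """If any point inside a contiguous anomalous segment is flagged,
    treat the entire segment as detected."""
    pred, labels = pred.astype(bool), labels.astype(bool)
    t, T = 0, len(labels)
    while t < T:
        if labels[t]:
            end = t
            while end < T and labels[end]:   # locate end of the anomalous segment
                end += 1
            if pred[t:end].any():            # one hit inside the segment ...
                pred[t:end] = True           # ... counts as detecting all of it
            t = end
        else:
            t += 1
    return pred

def adjusted_precision_recall(pred, labels):
    """Standard precision and recall computed on the adjusted predictions."""
    labels = labels.astype(bool)
    adj = point_adjust(pred, labels)
    tp = np.sum(adj & labels)
    fp = np.sum(adj & ~labels)
    fn = np.sum(~adj & labels)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall
```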
Adjusted precision and recall can be summarized in a metric called the adjusted $F_1$ score, which is the harmonic mean of the two. The adjusted $F_1$ score depends on the choice of a decision threshold on anomaly scores. In line with several recent studies, we consider threshold selection as a problem orthogonal to ours [Schmidl et al., 2022, Laptev et al., 2015, Blázquez-García et al., 2021, Rebjock et al., 2021, Paparrizos et al., 2022b,a] and therefore consider metrics that summarize model performance across all possible thresholds. Common metrics include the area under the precision-recall curve and the best $F_1$ (maximum $F_1$ over all possible thresholds). Like Xu et al. [2018], we found a strong correlation between the two (App. A.10). We also found the best $F_1$ to have a statistically significant positive correlation with the volume under the surface of the ROC curve, a recently proposed robust evaluation metric (App. A.9). Thus, in the remainder of the paper, we restrict our analysis to identifying models with the highest adjusted best $F_1$ score (i.e., $Q$ is the adjusted best $F_1$).
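For clarity, here is a sketch of the resulting quality metric $Q$, the adjusted best $F_1$, building on the `adjusted_precision_recall` helper from the sketch above; the candidate thresholds are simply the observed scores, whereas a practical implementation might subsample them.

```python
def adjusted_best_f1(scores, labels):
    """Maximum point-adjusted F1 over all candidate decision thresholds."""
    best = 0.0
    for tau in np.unique(scores):            # every observed score as a threshold
        p, r = adjusted_precision_recall(scores >= tau, labels)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```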
3 Surrogate Metrics of Model Performance
The problem of unsupervised model selection is often viewed as finding an unsupervised metric that correlates well with supervised metrics of interest [Ma et al., 2021, Goix, 2016, Lin et al., 2020, Duan et al., 2019]. Each unsupervised metric serves as a noisy measure of “model goodness” and reduces the problem to picking the best-performing model according to that metric. We identified three classes of imperfect metrics that closely align with expert intuition in predicting the performance of time-series anomaly detection models. Our metrics are unsupervised because they do not require anomaly labels. However, some of them, such as the $F_1$ score on synthetic anomalies, are typically used for supervised evaluation. To avoid confusion, we use the term surrogate for our metrics. Below we elaborate on each class of surrogate metrics, starting with the intuition behind each class.
Prediction Error
If a model can forecast or reconstruct a time-series well, it must also be a good anomaly detector. A large number of anomaly detection methods are based on forecasting or reconstruction [Schmidl et al., 2022], and for such models we can compute the forecasting or reconstruction error without anomaly labels. We collectively call these prediction error metrics and consider a common set of statistics, such as the mean absolute error, mean squared error, mean absolute percentage error, symmetric mean absolute percentage error, and the likelihood of observations. For multivariate time-series, we average each metric across all variables. For example, given a series of observations $\{x_t\}_{t=1}^{T}$ and their predictions $\{\hat{x}_t\}_{t=1}^{T}$, the mean squared error is defined as $\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}(x_t - \hat{x}_t)^2$. In the interest of space, we refer the reader to a textbook (e.g., Hyndman and Athanasopoulos [2018]) for definitions of the other metrics.
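As a concrete illustration, a short NumPy sketch of these prediction-error statistics, averaged over time and across variables as described above; the small `eps` guard and the omission of the model-specific likelihood term are our simplifications.

```python
import numpy as np

def prediction_error_metrics(x, x_hat, eps=1e-8):
    """Prediction-error surrogate metrics for observations x and model
    predictions x_hat, both of shape (T,) or (T, d). Each metric is
    averaged over time and, for multivariate series, across variables."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    err = x - x_hat
    return {
        "MSE":   np.mean(err ** 2),
        "MAE":   np.mean(np.abs(err)),
        "MAPE":  np.mean(np.abs(err) / (np.abs(x) + eps)),
        "sMAPE": np.mean(2 * np.abs(err) / (np.abs(x) + np.abs(x_hat) + eps)),
    }
```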
Prior work on time-series anomaly detection [Saganowski and Andrysiak, 2020, Laptev et al., 2015, Kuang et al., 2022]