OPTIMAL SUB-SAMPLING TO BOOST POWER OF KERNEL SEQUENTIAL CHANGE-POINT DETECTION
Song Wei, Chaofan Huang
School of Industrial and Systems Engineering, Georgia Institute of Technology
ABSTRACT
We present a novel scheme to boost the detection power of kernel maximum mean discrepancy (MMD) based sequential change-point detection procedures. The proposed scheme performs an optimal sub-sampling of the history data before the detection procedure, in order to mitigate the power loss incurred by using a random sub-sample of the enormous history data. We apply the proposed scheme to both the Scan-B and Kernel Cumulative Sum (KCUSUM) procedures, and improved performance is observed in extensive numerical experiments.
Index Terms— Kernel methods; maximum mean discrepancy (MMD); online decision making; optimal sub-sampling; sequential change-point detection.
1. INTRODUCTION
Sequential change-point detection is a classic and fundamental problem in signal processing, information theory, and statistics, and it has recently attracted great interest from the machine learning and data mining communities. The goal is to raise an alarm as soon as possible once a change in distribution occurs in streaming data, while keeping the frequency of false alarms as low as possible. Classic parametric methods typically handle changes in distribution parameters, but their success relies heavily on exact specification of the probability distribution. As a result, distribution-free non-parametric methods are much more favorable nowadays.
Kernel maximum mean discrepancy (MMD) [1–4] is one of the most popular and powerful non-parametric methods for testing whether two samples are drawn from the same distribution. It can be naturally applied to the change-point detection problem by viewing the history observations and the streaming data as the two samples in the test [5]. However, the quadratic complexity of computing the kernel MMD is the key limitation for its practical use in the online detection setting. Existing methods that tackle this challenge include [6], which proposed the Scan-B procedure by leveraging the B-test statistic [7] to achieve linear complexity, and [8], which proposed a Kernel Cumulative Sum (KCUSUM) procedure by replacing the likelihood ratio in the classic CUSUM with the linear-time MMD statistic. Both methods use a random sub-sample instead of the complete history observations to maintain computational and memory efficiency. However, such a random sub-sample leads to a power loss due to its poor representation of the pre-change distribution.
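To make the complexity consideration concrete, the following minimal sketch (an illustrative assumption, not the implementation used in [6] or [8]) computes the standard unbiased quadratic-time MMD^2 estimate with a Gaussian kernel; the function names and the bandwidth parameter are hypothetical choices.

    import numpy as np

    def gaussian_kernel(X, Y, bandwidth=1.0):
        # Pairwise Gaussian (RBF) kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).
        sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq_dists / (2 * bandwidth**2))

    def mmd2_unbiased(X, Y, bandwidth=1.0):
        # Unbiased estimate of MMD^2(P, Q) from samples X ~ P and Y ~ Q; the
        # diagonal terms are excluded, and the overall cost is O((n + m)^2).
        n, m = len(X), len(Y)
        Kxx = gaussian_kernel(X, X, bandwidth)
        Kyy = gaussian_kernel(Y, Y, bandwidth)
        Kxy = gaussian_kernel(X, Y, bandwidth)
        term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
        term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
        return term_xx + term_yy - 2 * Kxy.mean()

The quadratic cost of this estimator is exactly what motivates the linear-time statistics of [6–8] in the online setting.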
Instead, one would like to identify the optimal sub-sample that is most representative of the pre-change distribution, i.e., of the large history data, so as to minimize the information loss. Kernel herding [9] and kernel thinning [10] are two optimal sub-sampling methods based on the kernel MMD statistic: the sub-sample is chosen such that the kernel MMD between the sub-sample and the full data is minimized. With minimal information loss in the sub-sample selection, one should expect an improvement in the detection power over the use of random sub-samples.
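As a rough illustration of this idea (a sketch under our own assumptions, not the kernel thinning algorithm of [10]), greedy kernel herding restricted to the history points as the candidate pool can be written as follows; the function name kernel_herding and its parameters are hypothetical.

    import numpy as np

    def kernel_herding(X, n_select, bandwidth=1.0):
        # X: (M, d) history data; greedily picks n_select points whose empirical
        # kernel mean embedding stays close to that of the full data, i.e.,
        # approximately minimizing the MMD between the sub-sample and X.
        sq = np.sum(X**2, axis=1)
        K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * bandwidth**2))
        mean_embedding = K.mean(axis=1)        # mean_j k(x_i, X_j) for each candidate
        selected, running_sum = [], np.zeros(len(X))
        for t in range(n_select):
            # Prefer points close to the data mean embedding but far from the
            # points already selected (the herding trade-off).
            scores = mean_embedding - running_sum / (t + 1)
            idx = int(np.argmax(scores))
            selected.append(idx)
            running_sum += K[:, idx]
        return X[selected]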
In this work, we propose a scheme in which we perform optimal sub-sampling, namely kernel thinning, on the history observations as a pre-processing step before the detection procedure, in order to tackle the poor representation of random sub-samples. We apply the proposed scheme to both the Scan-B and KCUSUM procedures. Extensive numerical experiments are conducted to demonstrate that the proposed scheme can boost the detection power of kernel MMD-based sequential change-point detection procedures.
2. SETUP
Consider a sufficiently large amount of history data following the pre-change distribution $P$ on a domain $\mathcal{X}$ (typically $\mathcal{X} = \mathbb{R}^d$):
$$X_1, \ldots, X_M \sim P, \quad \text{i.i.d.},$$
and streaming data $Y_1, \ldots, Y_t$ up to the current time $t$. In online change-point detection, the goal is to test
$$H_0: X_1, \ldots, X_M, Y_1, \ldots, Y_t \sim P,$$
$$H_1: \exists\, r < t,\ X_1, \ldots, X_M, Y_1, \ldots, Y_r \sim P, \quad Y_{r+1}, \ldots, Y_t \sim Q,$$
for some unknown distribution $Q \neq P$.
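For concreteness, a minimal sketch of how such a test might be run online is given below (our own illustrative assumption, not the Scan-B or KCUSUM statistic of later sections); it reuses the mmd2_unbiased sketch above, and the window size, threshold, and function names are hypothetical.

    import numpy as np

    def detect_change(history_sub, stream, block_size=20, b=0.05, bandwidth=1.0):
        # history_sub: (n, d) sub-sample of the history data X_1, ..., X_M.
        # stream: iterable of d-dimensional observations Y_1, Y_2, ...
        block = []
        for t, y in enumerate(stream, start=1):
            block.append(y)
            if len(block) > block_size:
                block.pop(0)                  # keep only the most recent block
            if len(block) == block_size:
                stat = mmd2_unbiased(history_sub, np.asarray(block), bandwidth)
                if stat > b:
                    return t                  # raise an alarm at time t
        return None                           # no alarm within the observed stream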
In sequential change-point detection, we aim to detect the change-point as soon as possible subject to a constraint on the false alarm rate (i.e., the type I error). A common surrogate for the false alarm rate is the average run length (ARL). Typically, we raise an alarm once the detection statistic exceeds a pre-selected threshold $b$, which