GENETIC ALGORITHM WITH A BAYESIAN APPROACH FOR THE DETECTION OF MULTIPLE POINTS OF CHANGE OF TIME SERIES OF COUNTING EXCEEDANCES OF SPECIFIC THRESHOLDS

2025-05-08 0 0 926.51KB 34 页 10玖币
侵权投诉
GENETIC ALGORITHM WITH A BAYESIAN APPROACH FOR THE
DETECTION OF MULTIPLE POINTS OF CHANGE OF TIME SERIES OF
COUNTING EXCEEDANCES OF SPECIFIC THRESHOLDS
Biviana Marcela Suárez-Sierra
Computing and analytics area. School of Applied Sciences and Engineering.
EAFIT University
Medellín, Colombia
bmsuarezs@eafit.edu.co
Arrigo Coen
Department of Mathematics, Faculty of Sciences.
National Autonomous University of México
CDMX, México, México
coen@ciencias.unam.mx
Carlos Alberto Taimal
Computing and analytics area. School of Applied Sciences and Engineering.
EAFIT University
Medellín, Colombia
cataimaly@eafit.edu.co
ABSTRACT
Although the applications of Non-Homogeneous Poisson Processes (NHPP) to model and study the
threshold overshoots of interest in different time series of measurements have proven to provide
good results, they needed to be complemented with an efficient and automatic diagnostic technique
to establish the location of the change-points, which, when taken into account, make the estimated
model fit poorly in regards of the information contained in the real one. Because of this, a new
method is proposed to solve the segmentation uncertainty of the time series of measurements, where
the generating distribution of exceedances of a specific threshold is the focus of investigation. One of
the great contributions of the present algorithm is that all the days that trespassed are candidates to be
a change-point, so all the possible configurations of overflow days under the heuristics of a genetic
algorithm are the possible chromosomes, which will unite to produce new solutions. Also, such
methods will be guarantee to non-local and the best possible one solution, reducing wasted machine
time evaluating the least likely chromosomes to be a feasible solution. The analytical evaluation
technique will be by means of the Minimum Description Length (MDL) as the objective function,
which is the joint posterior distribution function of the parameters of the NHPP of each regime and
the change-points that determines them and which account as well for the influence of the presence
of said times.
Thus, one of the practical implications of the present work comes in terms of overcoming the need
of modeling the time series of measurements, where the distributions of exceedances of certain
thresholds, or where the counting of certain events involving abrupt changes, is the main focus with
applications in phenomena such as climate change, information security and epidemiology, to name a
few.
Keywords Multiple Change-point Detection, Genetic Algorithms, Minimum Description Length,
Non-homogeneous Poisson Processes, Maximum A Posteriori Estimation
arXiv:2210.14807v4 [stat.AP] 14 Sep 2023
Genetic algorithm with a Bayesian approach
1 Introduction
The study of exceedances of a specific threshold in the time series of environmental measurements [
1
],[
2
], data
traffic measurements in computer security [
3
] or the measurement of the prevalence of tuberculosis in New York [
4
]
in certain time periods are some examples of the practicality of a model that allows us to establish whether certain
exceedance mitigation contingencies in the measurements have worked correctly in an observed period, that is, if
there are moments in the time series that indicate decrease, growth or steadiness in the rate of exceedance emission.
These moments will be represented as abrupt changes, represented in the model as change-points. One of the most
commonly used methodologies for this purpose is based on Non-Homogeneous Poisson Processes (NHPP), in which
the change-points that determine the different regimes, which make up the entire time series, are estimated. In the same
way, the parameters of the distribution of the excesses that compose each regime are estimated. For this, most of the
time, the prior distribution of the change-points is considered to be the uniform distribution, with hyperparameters at
the extremes of the interval where the researcher approximately observes the change-point. From there and from a
Bayesian mechanism, the joint distribution of the change-points is updated with the rest of the parameters of each
regime. The likelihood function of the observed overshootings and the aforementioned prior distribution are multiplied
to obtain the posterior distribution. Using a Bayesian computational technique such as Markov Chain Monte Carlo
(MCMC) on the posterior distribution, the parameters and change-points of interest are estimated.
It is worth noting that on the one hand we are talking about the time series of the measurements (TS1) and on the
other hand about the time series of the number of times these measurements exceed a threshold of interest, so that
cumulatively these counting measurements generate a new time series (TS2) of the number of exceedances at time
t
.
The regimes and change points detected in TS2 will be the same as those determined in TS1. Since TS2 is a counting
time series, it is possible to fit a Non-Homogeneous Poisson Process (NHPP) process, with a time-dependent overrun
emission intensity function. This last feature gives the advantage to the model that it does not assume equal distribution
between regimes, as conventional time series models do. Furthermore, regardless of the random mechanism that
generates the TS1 measurements, by means of TS2, the number of change-points and the respective location in TS1
can be found. Because of the above, the objective function is organized over the TS2 observations in the likelihood
and the parameters of the joint distribution function will be the same as those of the overshoot emission rate function
which for the purposes of the present proposal, are functions with less than three parameters, as will be presented later
in the expression (34). That is, the complexity of the model will be directly related to the NHPP overshoot emission
rate related to the TS2 observations with respect to time and not to the uncertainty of the TS1 series measurements.
One of the disadvantages that arise in the implementation of the aforementioned models is the model execution time,
because the more change-points are considered, the longer the time it will take to reach an optimum. But on the other
hand, the greater the number of change-points, the greater the adjustment of the estimated to the observed, allowing
less loss of information. The latter can be seen by comparing the Deviance information criterion (DIC) of the models
with zero, one, two or more change-points in [
2
]. So we want a model that continues with the same goodness of fit, but
automates the detection of change-points, in short run times. Therefore, in the present work we make a new proposal
that unifies the principles of construction of the likelihood function from non-homogeneous Poisson processes, as it
is a counting problem, and on the other hand, Bayesian computational methods to estimate the parameters of each
regime and its change-points. After the joint posterior distribution is obtained, it will be combined with information
theory principles to penalize the integer and real parameters present in the segmentation of the time series. All the
above under the heuristics of a genetic algorithm, will convert the penalized posterior distribution into the objective
function. Hence, the problem of solving the segmentation uncertainty with the present proposal is reduced to finding the
penalized Maximum A Posteriori (MAP) estimator, which corresponds to a modification of the Minimum Description
Length principle (MDL) and such we will refer to it as the Bayesian-MDL.
The cumulative mean function, which uniquely determines the NHPP, provides more information about the number of
threshold exceedances up to a given time
t
. This mean function has an associated intensity function that will indicate
the time trend in each of the segments and the speed of emission of the threshold overshoots of interest. The parameters
of these functions will be estimated with the genetic algorithm, to which the starting conditions of the first generation of
change-point configurations or chromosomes can be set. The latter will be responsible for generating the offspring
with the best characteristics. To guarantee this, the mutation distribution, mating dynamics and reordering between
chromosomes will be implemented and tested as in [
5
]. The execution time is also declared at the beginning of the
algorithm and is determined by the number of generations. From each generation will come out the best of the children,
which will be evaluated by the MDL, hence the chromosome with the lowest MDL among all the generations will be
the best of all. In this type of algorithms, the execution time is compensated by the quality of the solution. Another
advantage of this model is that once the cumulative mean function is known, the distribution of the number of overflows
present at time tcan be explicitly established and from there different forecasting exercises can be performed.
2
Genetic algorithm with a Bayesian approach
It should also be established that as per the characterization exposed in [
6
] for the type of method, the here proposed
procedure is of the off-line type as the whole dataset is considered at once in a retrospective manner and the conditions
set for the genetic algorithm are evaluated to determine an optimal solution. At the end of the algorithm execution, it
will return time segments that will be independent and identically distributed (iid) conditioned to the parameters of each
segment. Assuming that the observations in each segment determined by the proposed algorithm are iid may be wrong
in the presence of correlation signals. However, [
3
] comments that in the presence of outliers, the assumption turns out
to be robust, as in the case of [7] and here.
The presentation of this work is structured as follows. In the second section, a background review will be presented
related to what has been developed in the field of NHPP as methods of adjustment of the univariate function of
cumulative mean of the exceedances of a threshold of interest up to time
t
and its respective inferential analysis. In this
same section the scheme of the genetic algorithm will be presented, displaying the start, pairing and stop conditions
to guarantee that the solution is optimal. Section three will show the theoretical construction of the computational
model for four different forms of the intensity function
λ(t)
proposed in the literature. In section four we will present
the results with three sets of simulated data and one set of real data. From the simulated data, the performance of
the algorithm will be shown to determine the number of change-points, their location and the number of generations
necessary to reach the optimal solution. With the real dataset, we will evaluate the ability of the algorithm to assess the
impact of public policy on air pollution data in Bogotá by
P M2.5
between the years 2018 and 2020. Finally, in section
five we will present the conclusions and future considerations.
2 Inferential Analysis of the Univariate Non-Homogeneous Poisson Process
Based on the construction of the likelihood from a non-homogeneous Poisson process discussed in [
8
], we have the
following expression, from which we will start to construct the penalized MAP function. The developments that will
be shown in the following sections will be based on the assumption that the true intensity function of the overshoots
corresponds to some of the function in (34). Then the process that we have correspond to a NHPP with intensities
varying with respect to t,
λ(W)(t|θ) = (α/β)(t/β)α1, α, β > 0
λ(MO)(t|θ) = β
t+α, α, β > 0
λ(GO)(t|θ) = αβ exp(βt), α, β > 0
λ(GGO)(t|θ) = αβγtγ1exp(βtγ), α, β, γ > 0.
(1)
Each letter in the superscript of the left-hand term corresponds to the initial letter of the name of the distribution
used as intensity function as follows, Weibull (
W
) [
9
,
10
], Musa-Okumoto (
MO
) [
11
], Goel-Okumoto (
GO
) and a
generalization of the Goel-Okumoto model (
GGO
) [
12
], for which the mean cumulative function,
m(t|θ)
is defined
respectively by:
m(W)(t|θ) = (t/β)α, α, β > 0
m(MO)(t|θ) = βlog 1 + t
α, α, β > 0
m(GO)(t|θ) = α[1 exp(βt)], α, β > 0
m(GGO)(t|θ) = α[1 exp(βtγ)], α, β, γ > 0.
(2)
In the first three cases, the vector of parameters is θ= (α, β)and for the last one, θ= (α, β, γ).
The way the rate functions of the process are formulated shows that the observations within the same segment are
equally distributed and independent, conditional on the parameters of the emission intensity function of the observations
in that segment. Hence, this information is taken into account in the Bayesian-MDL to correspond to the growth,
decrease or steadiness of excess emissions in each interval of the partition determined by the change-points. For
example as exposed in [
1
], in air pollution models, the Weibull rate function shows a better fit than the GGO, without
and with change-points.
The present model determines the number of change-points and their specific location over the time series of interest.
Since the data suggested to be studied with this type of model are exceedances of a specific threshold, the nature of the
change-points will represent an abrupt variation in the rate of growth or decay of the exceedances in the cumulative
mean function, which in turn will represent variations in the original time series.
3
Genetic algorithm with a Bayesian approach
2.1 Likelihood function with change-points
A change point is defined as an instant where the structural pattern of a time series shifts [
6
,
13
]. We assumed the
presence of
J
change-points,
{τ1, τ2,··· , τJ}
such that there are variations on the model parameters in between
segments
τj1<t<τj, j = 0,1,2, . . . , J + 1, j0= 1, jJ+1 =T
. These changes can be attributed to environmental
policies or legislations in a certain year, the suspension of some network station due to maintenance for a climate case,
macroeconomic policies from an economic standpoint or the presence of a stimulus in a neuroscience context. On
the other side, assuming the overshoots from a given threshold between two regimes determined by the presence of a
change-point can be modeled after a NHPP, then the intensity functions are,
λ(t|θ) = (λ(t|θ1),0t<τ1,
λ(t|θj), τj1t<τj, j = 2,3,··· , J,
λ(t|θJ+1). τJtT,
(3)
where
θj
is the vector of parameters between the change-points
τj1
and
τj
, for
j= 2, ..., J
, and
θ1
and
θJ+1
are the
parameter vectors before and after the first and last change-points, respectively. With
n
observations, the functions for
the means are (see, e.g., [1]),
m(t|θ) =
m(t|θ1),0t < τ1,
m(τ1|θ1) + m(t|θ2)m(τ1|θ2), τ1t<τ2,
m(t|θj+1)m(τj|θj+1)
+Pj
i=2[m(τi|θi)m(τi1|θi)]
+m(τ1|θ1), τjt<τj+1, j = 2,3,··· , J,
(4)
where
τJ+1 =T
. That is, because
m(t|θ1)
represents the average number of exceedances of the standard, before the
first change-point.
m(τ1|θ1) + m(t|θ2)m(τ1|θ2)
is the average number of exceedances of the standard between
the first change-point τ1and the second one τ2, given that the vector of parameters θ2is known, and so on.
Be
D=d1, . . . , dn
, where
dk
(as in the case without change-points), is the time of occurrence of the
kth
event
(the
kth
time the maximum level of the environmental standard is exceeded, for example), with
k= 1,2, ..., n
, the
likelihood function is determined by the expression below where
Nτi
represents the exceedances before the change-point
τi, with i= 1,2, ..., J (see [14, 15]).
L(D|ϕ)
Nτ1
Y
i=1
λ(di|θ1)
em(τ1|θ1)
×
J
Y
j=2
Nτj
Y
i=Nτj1+1
λ(di|θj)
e[m(τj|θj)m(τj1|θj)]
×
n
Y
i=NτJ+1
λ(di|θJ+1)
e[m(T|θJ+1)m(τJ|θJ+1 )],(5)
Using the expression (5), we infer the parameters
ϕ= (θ, τ)
, with
θ= (θ1, ..., θJ)
and
τ= (τ1, ..., τJ)
using a
Bayesian approach. This perspective consists of finding the relationship between the prior distribution of the parameter
θ
, on whose intensity function
λ(t|θ)
is dependent and the posterior distribution of the same, after taking into
consideration the observed information
D
. In [
16
], this method was applied to obtained results very close to the
observed ones, hence the descriptive capacity of the model and the methodology used. In such work, the criteria used to
select the model that best fits the data together with the graphic part was the MDL.
2.2 Detection of multiple change-points using genetic algorithm
2.2.1 MDL framework
Since finding
J
change-points implies finding out
J+ 1
regimes for the time series or fitting
J+ 1
models with different
parameters, statiscal criteria has been used for such purpose in the available literature. Some include the Akaike
4
Genetic algorithm with a Bayesian approach
Information Criterion (AIC), the Bayesian Information Criterion (BIC), Cross-Validation methods, and MDL-based
methods. For problems involving regime shift detection, MDL methods usually provide superior empirical results. This
superiority is probably due to the fact that both AIC and BIC apply the same penalty to all parameters, regardless of the
nature of the parameter. On the other hand, MDL methods can adapt penalties to parameters depending on their nature
be it continuous or discrete, bounded or not. In short, MDL defines the best fitting model as the one that enables the
best compression of the data minimizing a penalized likelihood function. That is,
MDL =log2(Lopt) + P. (6)
Here
log2(Lopt)
is the required amount of information needed to store the fitted model, term taken from information
theory. More details on this can be found in [
17
].
Lopt
is obtained from replacing the maximum likelihood estimator in
its function (5). This will be explained in the next section.
Because of the above, it is possible to make the natural connection between the likelihood and the MDL objective
function by means of the penalty
P
(see [
17
]). The broad penalty methodology is summarized in three principles as
stated by [
5
]. The first one is to penalize the real valued parameters by the number of observations. Say
k
that are
used to estimate it, then, the penalty will be
log2k
2
. For this principle, it is important to take into consideration how the
observations are arranged to calculate the parameter of interest because this arrangement will be reflected in the penalty.
The second principle involves the penalty of how many integer parameters, such as the number of change-points
J
and where they are located represented by
τ1, ..., τJ
should be charged. This charging is calculated based on the value
for each of them. For example, the quantity
J
, which is bounded by the total number of observations
T
is charged an
amount of log2T
2. For each of the τjwith j= 1, ..., J, we have that τj< τj+1, therefore the cost of its penalty will be
log2τj+1
2for j= 2, ..., J.
The last principle, mentioned in [
5
], is the additivity principle. It involves constructing
P
based on the sum of all the
partial penalties mentioned above. The more parameters the model has, the higher
P
will be. However, if despite adding
parameters, the expression
log2(Lopt)
does not grow larger than the penalty
P
of the extra parameters, the simpler
model will be preferred. For the purposes of this paper, the following will be used as the penalty function
Pτ(θ)
for a
fixed change point configuration,
Pτ(θ) = R
J+1
X
j=1
ln(τ1τi1)
2+ln(J) +
J
X
j=2
ln(τj),(7)
where
R= 2,3
depending on whether
θ= (α, β)
or
θ= (α, β, γ)
, i.e. if
θ
has one or two parameters. The first
summand of the right-hand term of expression represents that each of the real-valued parameters (
αj, βj, γj
) will be
penalized by
ln(τ1τj1)
2
of the
j
-th regime to which they belong and since there are
J+ 1
regimes, the sum goes from
1 to
J+ 1
. The second summand of the right-hand term is the penalty of the number of points of change, and the last
one comes from the sum of each of the penalties of each change-point.
2.2.2 Genetic Algorithm Schema
As exposed in [
5
] the total possible cases to evaluate the MDL corresponds to
T
J
, where
T
is the number of observations
in the time series and
J
is the number of change-points. However, this number of parametric configurations is a quantity
that does not make a computationally efficient optimization algorithm that aims to choose the best of the parametric
configurations that minimize the MDL. For this reason, we will use the genetic algorithm that, by natural selection
criteria will establish the best of the parameters configurations that we will call chromosomes. Each chromosome will
be labeled as
(J, τ1, ..., τJ)
, where the first component
J
stands for the number of change-points, located respectively at
times
τ1, ..., τJ
, corresponding to the respective coordinates. The following is to establish how the genetic algorithm
(GA) evaluates each of the chromosomes, while avoiding those probabilistically less optimal.
Let us now see how a complete generation is produced from an initial one with a given size, although the size can also
be a couple. For this purpose, suppose there are
k
individuals or chromosomes in the initial generation set at random.
Each of the
T
observations in the time series is allowed to be a change-point, independent of all other change-points,
with probability, for example as seen in [
5
], of 0.06. The number of change-points for each chromosome in the initial
generation has a binomial distribution Binomial(n=T1, p = 0.06).
Two chromosomes are taken out of the initial generation, one mother and one father chromosome, to make a child of
the next generation. This is done by a probabilistic combination of the parents. The principle of natural selection in this
5
摘要:

GENETICALGORITHMWITHABAYESIANAPPROACHFORTHEDETECTIONOFMULTIPLEPOINTSOFCHANGEOFTIMESERIESOFCOUNTINGEXCEEDANCESOFSPECIFICTHRESHOLDSBivianaMarcelaSuárez-SierraComputingandanalyticsarea.SchoolofAppliedSciencesandEngineering.EAFITUniversityMedellín,Colombiabmsuarezs@eafit.edu.coArrigoCoenDepartmentofMath...

展开>> 收起<<
GENETIC ALGORITHM WITH A BAYESIAN APPROACH FOR THE DETECTION OF MULTIPLE POINTS OF CHANGE OF TIME SERIES OF COUNTING EXCEEDANCES OF SPECIFIC THRESHOLDS.pdf

共34页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:34 页 大小:926.51KB 格式:PDF 时间:2025-05-08

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 34
客服
关注