Genetic algorithm with a Bayesian approach
1 Introduction
The study of exceedances of a given threshold in time series of environmental measurements [1], [2], of data traffic in computer security [3], or of the prevalence of tuberculosis in New York [4] over certain time periods illustrates the practical value of a model that can establish whether contingency measures for mitigating exceedances have worked correctly over an observed period, that is, whether there are moments in the time series that indicate a decrease, growth, or steadiness in the rate at which exceedances are emitted. These moments correspond to abrupt changes, represented in the model as change-points. One of the most
commonly used methodologies for this purpose is based on Non-Homogeneous Poisson Processes (NHPP), in which the change-points that delimit the different regimes composing the whole time series are estimated, together with the parameters of the distribution of the exceedances within each regime. Most of the time, the prior distribution of each change-point is taken to be uniform, with hyperparameters at the extremes of the interval where the researcher approximately locates the change-point. From there, through the Bayesian mechanism, the joint distribution of the change-points is updated together with the remaining parameters of each regime: the likelihood function of the observed exceedances and the aforementioned prior distribution are multiplied to obtain the posterior distribution, and a Bayesian computational technique such as Markov Chain Monte Carlo (MCMC) applied to this posterior yields estimates of the parameters and change-points of interest.
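In generic notation (ours, not necessarily that of the cited references), if exceedances occur at times $t_1 < \dots < t_n$ in $[0, T]$ and $\lambda(\cdot \mid \theta, \tau)$ denotes the NHPP intensity with regime parameters $\theta$ and change-points $\tau$, this construction reads
$$
\pi(\theta, \tau \mid t_1, \dots, t_n) \;\propto\; \left[\prod_{i=1}^{n} \lambda(t_i \mid \theta, \tau)\right] \exp\!\left(-\int_0^T \lambda(s \mid \theta, \tau)\, ds\right) \, \pi(\tau)\, \pi(\theta),
$$
where the bracketed factor times the exponential is the NHPP likelihood and $\pi(\tau)$, $\pi(\theta)$ are the priors.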
It is worth noting that on the one hand we are dealing with the time series of the measurements (TS1) and on the other hand with the time series of the number of times these measurements exceed a threshold of interest, so that, cumulatively, these counts generate a new time series (TS2) of the number of exceedances up to time t.
The regimes and change-points detected in TS2 will be the same as those determined in TS1. Since TS2 is a counting time series, an NHPP can be fitted to it, with a time-dependent exceedance emission intensity function. This last feature gives the model the advantage of not assuming that the regimes are identically distributed, as conventional time series models do. Furthermore, regardless of the random mechanism that generates the TS1 measurements, the number of change-points and their respective locations in TS1 can be found by means of TS2. Consequently, the objective function is built from the TS2 observations through the likelihood, and the parameters of the joint distribution will be those of the exceedance emission rate function, which, for the purposes of the present proposal, are functions with fewer than three parameters, as will be presented later in expression (34). That is, the complexity of the model is directly related to the NHPP exceedance emission rate of the TS2 observations with respect to time, and not to the uncertainty of the TS1 series measurements.
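As a minimal illustration of the TS1/TS2 relationship (our own sketch with hypothetical data and threshold, not the authors' code), TS2 can be built from TS1 as follows:

import numpy as np

# Hypothetical example: TS1 is a series of daily measurements and
# `threshold` is the level of interest; both choices are illustrative.
rng = np.random.default_rng(0)
ts1 = rng.normal(loc=50.0, scale=10.0, size=365)  # TS1: raw measurements
threshold = 70.0

# Times (indices) at which TS1 exceeds the threshold
exceedance_times = np.flatnonzero(ts1 > threshold)

# TS2: cumulative number of exceedances up to each time t
ts2 = np.cumsum(ts1 > threshold)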
One of the disadvantages that arise in the implementation of the aforementioned models is the execution time, because the more change-points are considered, the longer it takes to reach an optimum. On the other hand, the greater the number of change-points, the closer the fit of the estimated to the observed values, with less loss of information. The latter can be seen by comparing the Deviance Information Criterion (DIC) of models with zero, one, two or more change-points in [2]. We therefore want a model that retains the same goodness of fit but automates the detection of change-points, with short run times. To this end, in the present work we make a new proposal
that unifies, on the one hand, the principles of construction of the likelihood function from non-homogeneous Poisson processes, since this is a counting problem, and, on the other hand, Bayesian computational methods to estimate the parameters of each regime and its change-points. Once the joint posterior distribution is obtained, it is combined with information-theoretic principles that penalize the integer and real parameters present in the segmentation of the time series. All of the above, under the heuristics of a genetic algorithm, turns the penalized posterior distribution into the objective function. Hence, with the present proposal, the problem of resolving the segmentation uncertainty is reduced to finding the penalized Maximum A Posteriori (MAP) estimator, which corresponds to a modification of the Minimum Description Length (MDL) principle, and as such we will refer to it as the Bayesian-MDL.
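Schematically, in our own generic notation (the precise penalty is developed later in the paper), the objective to be minimized takes the form
$$
\mathrm{MDL}(\theta, \tau) \;=\; -\log \pi(\theta, \tau \mid \text{data}) \;+\; P(k, \theta, \tau),
$$
where $k$ is the number of change-points and $P(\cdot)$ collects the code-length penalties for the integer parameters (the number and locations of the change-points) and the real parameters of each regime.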
The cumulative mean function, which uniquely determines the NHPP, provides information about the number of threshold exceedances up to a given time t. This mean function has an associated intensity function that indicates the time trend in each of the segments and the speed of emission of the threshold exceedances of interest. The parameters
of these functions will be estimated with the genetic algorithm, for which the starting conditions of the first generation of change-point configurations, or chromosomes, can be set. These will be responsible for generating the offspring with the best characteristics. To guarantee this, the mutation distribution, mating dynamics and reordering between chromosomes will be implemented and tested as in [5]. The execution time is declared at the beginning of the algorithm and is determined by the number of generations. Each generation yields its best child, evaluated by the MDL, so that the chromosome with the lowest MDL across all generations is the best overall. In this type of algorithm, the execution time is traded off against the quality of the solution. Another
advantage of this model is that, once the cumulative mean function is known, the distribution of the number of exceedances present at time t can be established explicitly, and from there different forecasting exercises can be performed.
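Indeed, for an NHPP with intensity $\lambda$ and cumulative mean function $m(t) = \int_0^t \lambda(s)\,ds$, the number of exceedances $N(t)$ observed up to time $t$ satisfies
$$
P(N(t) = k) \;=\; \frac{e^{-m(t)}\, m(t)^k}{k!}, \qquad k = 0, 1, 2, \dots,
$$
which is the basis for the forecasting exercises mentioned above.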
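To fix ideas, the generational search described above can be sketched as follows; this is a minimal illustration with hypothetical helper functions (random_chromosome, crossover, mutate, mdl), not the authors' implementation:

import random

def genetic_search(mdl, random_chromosome, crossover, mutate,
                   n_generations=200, population_size=50):
    """Return the chromosome (change-point configuration) with the lowest MDL.

    `mdl` scores a chromosome (lower is better); the remaining arguments
    are problem-specific operators, assumed to be supplied by the user.
    """
    population = [random_chromosome() for _ in range(population_size)]
    best = min(population, key=mdl)

    for _ in range(n_generations):
        # Mate random pairs of parents and mutate each offspring
        offspring = []
        for _ in range(population_size):
            mother, father = random.sample(population, 2)
            offspring.append(mutate(crossover(mother, father)))

        # Keep the best child of this generation if it improves on the
        # best chromosome seen across all generations so far
        best_child = min(offspring, key=mdl)
        if mdl(best_child) < mdl(best):
            best = best_child
        population = offspring

    return best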