Genetic algorithm with a Bayesian approach
1 Introduction
The study of exceedances of a given threshold in time series of environmental measurements [1], [2], of data traffic in computer security [3], or of the prevalence of tuberculosis in New York [4] over certain time periods illustrates the practical value of a model that can establish whether contingency measures for mitigating exceedances have worked correctly over an observed period, that is, whether there are moments in the time series that indicate a decrease, growth, or steadiness in the rate at which exceedances are emitted. These moments correspond to abrupt changes, represented in the model as change-points. One of the most
commonly used methodologies for this purpose is based on Non-Homogeneous Poisson Processes (NHPP), in which the change-points that delimit the different regimes composing the whole time series are estimated, together with the parameters of the distribution of the exceedances within each regime. Most of the time, the prior distribution of each change-point is taken to be uniform, with hyperparameters at the extremes of the interval where the researcher approximately locates the change-point. From there, through the Bayesian mechanism, the joint distribution of the change-points is updated together with the remaining parameters of each regime: the likelihood function of the observed exceedances and the aforementioned prior distribution are multiplied to obtain the posterior distribution, and a Bayesian computational technique such as Markov Chain Monte Carlo (MCMC) applied to this posterior yields estimates of the parameters and change-points of interest.
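In generic notation (ours, not necessarily that of the cited references), if exceedances occur at times $t_1 < \dots < t_n$ in $[0, T]$ and $\lambda(\cdot \mid \theta, \tau)$ denotes the NHPP intensity with regime parameters $\theta$ and change-points $\tau$, this construction reads
$$
\pi(\theta, \tau \mid t_1, \dots, t_n) \;\propto\; \left[\prod_{i=1}^{n} \lambda(t_i \mid \theta, \tau)\right] \exp\!\left(-\int_0^T \lambda(s \mid \theta, \tau)\, ds\right) \, \pi(\tau)\, \pi(\theta),
$$
where the bracketed factor times the exponential is the NHPP likelihood and $\pi(\tau)$, $\pi(\theta)$ are the priors.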
It is worth noting that on the one hand we are dealing with the time series of the measurements (TS1) and on the other hand with the time series of the number of times these measurements exceed a threshold of interest, so that, cumulatively, these counts generate a new time series (TS2) of the number of exceedances up to time t.
The regimes and change-points detected in TS2 will be the same as those determined in TS1. Since TS2 is a counting time series, an NHPP can be fitted to it, with a time-dependent exceedance emission intensity function. This last feature gives the model the advantage of not assuming that the regimes are identically distributed, as conventional time series models do. Furthermore, regardless of the random mechanism that generates the TS1 measurements, the number of change-points and their respective locations in TS1 can be found by means of TS2. Consequently, the objective function is built from the TS2 observations through the likelihood, and the parameters of the joint distribution will be those of the exceedance emission rate function, which, for the purposes of the present proposal, are functions with fewer than three parameters, as will be presented later in expression (34). That is, the complexity of the model is directly related to the NHPP exceedance emission rate of the TS2 observations with respect to time, and not to the uncertainty of the TS1 series measurements.
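As a minimal illustration of the TS1/TS2 relationship (our own sketch with hypothetical data and threshold, not the authors' code), TS2 can be built from TS1 as follows:

import numpy as np

# Hypothetical example: TS1 is a series of daily measurements and
# `threshold` is the level of interest; both choices are illustrative.
rng = np.random.default_rng(0)
ts1 = rng.normal(loc=50.0, scale=10.0, size=365)  # TS1: raw measurements
threshold = 70.0

# Times (indices) at which TS1 exceeds the threshold
exceedance_times = np.flatnonzero(ts1 > threshold)

# TS2: cumulative number of exceedances up to each time t
ts2 = np.cumsum(ts1 > threshold)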
One of the disadvantages that arise in the implementation of the aforementioned models is the execution time, because the more change-points are considered, the longer it takes to reach an optimum. On the other hand, the greater the number of change-points, the closer the fit of the estimated to the observed values, with less loss of information. The latter can be seen by comparing the Deviance Information Criterion (DIC) of models with zero, one, two or more change-points in [2]. We therefore want a model that retains the same goodness of fit but automates the detection of change-points, with short run times. To this end, in the present work we make a new proposal
that unifies, on the one hand, the principles of construction of the likelihood function from non-homogeneous Poisson processes, since this is a counting problem, and, on the other hand, Bayesian computational methods to estimate the parameters of each regime and its change-points. Once the joint posterior distribution is obtained, it is combined with information-theoretic principles that penalize the integer and real parameters present in the segmentation of the time series. All of the above, under the heuristics of a genetic algorithm, turns the penalized posterior distribution into the objective function. Hence, with the present proposal, the problem of resolving the segmentation uncertainty is reduced to finding the penalized Maximum A Posteriori (MAP) estimator, which corresponds to a modification of the Minimum Description Length (MDL) principle, and as such we will refer to it as the Bayesian-MDL.
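Schematically, in our own generic notation (the precise penalty is developed later in the paper), the objective to be minimized takes the form
$$
\mathrm{MDL}(\theta, \tau) \;=\; -\log \pi(\theta, \tau \mid \text{data}) \;+\; P(k, \theta, \tau),
$$
where $k$ is the number of change-points and $P(\cdot)$ collects the code-length penalties for the integer parameters (the number and locations of the change-points) and the real parameters of each regime.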
The cumulative mean function, which uniquely determines the NHPP, provides information about the number of threshold exceedances up to a given time t. This mean function has an associated intensity function that indicates the time trend in each of the segments and the speed of emission of the threshold exceedances of interest. The parameters
of these functions will be estimated with the genetic algorithm, for which the starting conditions of the first generation of change-point configurations, or chromosomes, can be set. These will be responsible for generating the offspring with the best characteristics. To guarantee this, the mutation distribution, mating dynamics and reordering between chromosomes will be implemented and tested as in [5]. The execution time is declared at the beginning of the algorithm and is determined by the number of generations. Each generation yields its best child, evaluated by the MDL, so that the chromosome with the lowest MDL across all generations is the best overall. In this type of algorithm, the execution time is traded off against the quality of the solution. Another
advantage of this model is that, once the cumulative mean function is known, the distribution of the number of exceedances present at time t can be established explicitly, and from there different forecasting exercises can be performed.
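Indeed, for an NHPP with intensity $\lambda$ and cumulative mean function $m(t) = \int_0^t \lambda(s)\,ds$, the number of exceedances $N(t)$ observed up to time $t$ satisfies
$$
P(N(t) = k) \;=\; \frac{e^{-m(t)}\, m(t)^k}{k!}, \qquad k = 0, 1, 2, \dots,
$$
which is the basis for the forecasting exercises mentioned above.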
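To fix ideas, the generational search described above can be sketched as follows; this is a minimal illustration with hypothetical helper functions (random_chromosome, crossover, mutate, mdl), not the authors' implementation:

import random

def genetic_search(mdl, random_chromosome, crossover, mutate,
                   n_generations=200, population_size=50):
    """Return the chromosome (change-point configuration) with the lowest MDL.

    `mdl` scores a chromosome (lower is better); the remaining arguments
    are problem-specific operators, assumed to be supplied by the user.
    """
    population = [random_chromosome() for _ in range(population_size)]
    best = min(population, key=mdl)

    for _ in range(n_generations):
        # Mate random pairs of parents and mutate each offspring
        offspring = []
        for _ in range(population_size):
            mother, father = random.sample(population, 2)
            offspring.append(mutate(crossover(mother, father)))

        # Keep the best child of this generation if it improves on the
        # best chromosome seen across all generations so far
        best_child = min(offspring, key=mdl)
        if mdl(best_child) < mdl(best):
            best = best_child
        population = offspring

    return best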