
3.1 Time series model examples
In this section, we introduce some (well-known) examples of time series models that satisfy the assumptions of
Theorem 1. The first example is to let
$\{\epsilon_t\}_{t \in \mathbb{Z}}$ and $\{X_t\}_{t \in \mathbb{Z}}$ be two independent processes. This independence assumption implies that the input variables $X_t$ are exogenous, so Assumption 3 is automatically satisfied. Further, assume the $\epsilon_t$ satisfy the moment conditions in Assumption 1 (for example, they are normally distributed) and that $X_t$ is a stationary and geometrically $\alpha$-mixing process. There are many examples of such processes, including certain finite-order auto-regressive processes; see e.g. [23, 21]. The second example is to consider non-linear auto-regressive models, i.e. assume
$$ X_t = g(X_{t-1}, \dots, X_{t-d}) + \epsilon_t. \qquad (8) $$
This is a special case of model (2), obtained by setting $X_t = (X_{t-d}, \dots, X_{t-1})$ and $Y_t = X_t$. Assuming that the $\epsilon_t$'s are i.i.d. random variables with a positive density on the real line and that the function $g$ is bounded, it can be shown that there exists a stationary solution to equation (8), and that this solution is exponentially $\alpha$-mixing as well [24, 21]. In both examples, the assumptions of Theorem 1 are satisfied, so the results of that theorem apply.
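As an illustration of the second example, the following sketch simulates a path from a model of form (8) with a bounded link function; the specific choices of $g$, the lag $d$, and the error scale are hypothetical and used only for demonstration.

```python
import numpy as np

# Hypothetical illustration of model (8): X_t = g(X_{t-1}, ..., X_{t-d}) + eps_t,
# with a bounded g and i.i.d. Gaussian errors; such a process admits a stationary,
# exponentially alpha-mixing solution.
rng = np.random.default_rng(0)

d = 3                      # lag order (illustrative choice)
n = 1_000                  # sample size to keep
burn_in = 500              # discard initial draws to approximate stationarity

def g(lags):
    """A bounded non-linear link function (hypothetical example)."""
    return np.tanh(lags.mean())

x = np.zeros(n + burn_in + d)
eps = rng.normal(scale=0.5, size=n + burn_in + d)
for t in range(d, n + burn_in + d):
    x[t] = g(x[t - d:t]) + eps[t]

x = x[-n:]                 # keep the (approximately) stationary part of the path
```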
Now, we consider a more general time series model. Recall that by the well-known Wold representation, every purely
nondeterministic stationary and zero-mean stochastic process
$X_t$ can be expressed as $X_t = \sum_{i=0}^{\infty} a_i \epsilon_{t-i}$, where $\epsilon_t$ is a mean-zero white noise. Further, if $X_t$ has a non-vanishing spectral density and absolutely summable auto-regressive coefficients, i.e. $\sum_{i=1}^{\infty} |\phi_i| < \infty$, then it has the AR($\infty$) representation $X_t = \sum_{i=1}^{\infty} \phi_i X_{t-i} + \epsilon_t$ (see e.g. [25]). Motivated by this discussion, we consider the general family of time series models satisfying
$$ X_t = \sum_{i=1}^{\infty} \phi_i X_{t-i} + \epsilon_t, \qquad (9) $$
where the $\epsilon_t$'s are i.i.d. errors. Independence among the $\epsilon_t$'s is a strong assumption compared to only assuming that they are uncorrelated, but it is required for our theoretical analysis. The interesting fact about model (9) is that it is a linear model. However, since there are infinitely many covariates in this AR($\infty$) representation, training neural networks on them directly is impossible. The common solution is to truncate the covariates and keep only the first few, i.e. to approximate model (9) by an AR($d$) model for some $d$. This approximation can successfully estimate second-order structures of the original model (i.e. the spectral density or the auto-correlation function) if $d$ is selected carefully and under certain assumptions on the AR coefficients $\phi_i$ (see e.g. [25]). Therefore, we follow this path and fit a neural network to the $d$-dimensional input variables $(X_{t-1}, \dots, X_{t-d})$ with a proper choice of $d$, while keeping in mind that the true regression function is in fact $f_0(X_t) = \sum_{i=1}^{\infty} \phi_i X_{t-i}$. To establish the prediction-error consistency of neural networks on truncated input variables, we need two additional assumptions.
Assumption 4. There exist $\alpha > 0$ and $M > 0$ such that $\sum_{i=1}^{\infty} (1+i)^{\alpha} |\phi_i| \le M < \infty$.
Assumption 5. For some constant $K > 0$, $|X_t| \le K$ for all $t \ge 0$.
Assumption 4 controls the decay rate of the AR($\infty$) coefficients in the true model, and $\alpha$ can be interpreted as the decay rate of the $\phi_i$'s. This assumption is needed to compensate for approximating a general time series of form (9) with a finite-lag AR process. Further, it plays an important role in restricting the first derivative of $f_0$, which corresponds to the $\beta_i$-smoothness assumption on $g_{ij}$ in (4). Note that Assumption 4 is satisfied if the spectral density function is strictly positive and continuous and the auto-covariance function of $X_t$ satisfies a suitable boundedness property [26]. Moreover, since the model is linear (i.e. the regression function is unbounded), Assumption 5 becomes necessary to make $f_0(X_t)$ bounded, a property needed for Theorem 1 as well.
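To see how Assumptions 4 and 5 interact, note that (as a rough sketch, not the formal argument used in the proof) the error from truncating the regression function of model (9) at lag $d$ satisfies
$$ \Big| f_0(X_t) - \sum_{i=1}^{d} \phi_i X_{t-i} \Big| \le K \sum_{i > d} |\phi_i| \le \frac{K}{(1+d)^{\alpha}} \sum_{i > d} (1+i)^{\alpha} |\phi_i| \le \frac{K M}{(1+d)^{\alpha}}, $$
so a faster decay rate $\alpha$ permits a smaller truncation lag $d$ for the same approximation error.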
Theorem 2. Consider model (9) with $f_0(X_t) = \sum_{i=1}^{\infty} \phi_i X_{t-i}$. Let $\hat{f}_n$ be an estimator taking values in the network class $\mathcal{F}(L, (p_i)_{i=0,\dots,L+1}, s, F)$ satisfying (i) $F \ge KM$; (ii) $L \ge 4$; (iii) $s \asymp Ld$; and (iv) $d \lesssim \min_{i=1,\dots,L} p_i$. Assume that $d \asymp n^{\frac{1}{\alpha+1}}$. Under Assumptions 1-5, there exist positive constants $C, C_0$ such that if
$$ \Delta_n(\hat{f}_n, f_0) \le C\, n^{-\frac{\alpha}{\alpha+1}} L \log^5 n, $$
then
$$ R(\hat{f}_n, f_0) \le C_0\, n^{-\frac{\alpha}{\alpha+1}} L \log^5 n, \qquad (10) $$
and if $\Delta_n(\hat{f}_n, f_0) > C\, n^{-\frac{\alpha}{\alpha+1}} L \log^5 n$, then
$$ \frac{1}{C_0} \Delta_n(\hat{f}_n, f_0) \le R(\hat{f}_n, f_0) \le C_0\, \Delta_n(\hat{f}_n, f_0). \qquad (11) $$
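To make the truncation step concrete, here is a minimal sketch of fitting a feed-forward ReLU network to the $d$ lagged inputs with $d$ of order $n^{1/(\alpha+1)}$, as prescribed by Theorem 2. The simulated coefficients, the assumed value of $\alpha$, the architecture, and the use of scikit-learn's MLPRegressor are illustrative assumptions, not the estimator analyzed in the theorem.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative sketch (not the construction from the proof): truncate the AR(infinity)
# covariates at lag d ~ n^{1/(alpha+1)} and fit a ReLU network to (X_{t-1}, ..., X_{t-d}).
rng = np.random.default_rng(1)

n = 2_000
alpha = 1.0                                   # assumed decay rate from Assumption 4
phis = 0.7 * 0.5 ** np.arange(1, 30)          # hypothetical, geometrically decaying AR coefficients

# Simulate an (approximately) stationary series from a long finite AR recursion.
x = np.zeros(n + 500)
eps = rng.normal(scale=0.3, size=n + 500)
for t in range(len(phis), len(x)):
    x[t] = phis @ x[t - len(phis):t][::-1] + eps[t]
x = x[-n:]

d = int(np.ceil(n ** (1.0 / (alpha + 1.0))))  # truncation lag, d ~ n^{1/(alpha+1)}

# Lagged design matrix: row for time t contains (X_{t-1}, ..., X_{t-d}); target is X_t.
X = np.column_stack([x[d - i:n - i] for i in range(1, d + 1)])
y = x[d:]

net = MLPRegressor(hidden_layer_sizes=(64, 64, 64, 64),  # L >= 4 hidden layers, widths >= d
                   activation="relu", max_iter=2000, random_state=0)
net.fit(X, y)
print("in-sample R^2:", net.score(X, y))
```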