arXiv:2210.11530v1 [stat.ML] 20 Oct 2022

THEORETICAL ANALYSIS OF DEEP NEURAL NETWORKS FOR
TEMPORALLY DEPENDENT OBSERVATIONS
Mingliang Ma
Department of Statistics
University of Florida
Gainesville, FL 32611
maminglian@ufl.edu
Abolfazl Safikhani
Department of Statistics
George Mason University
Fairfax, VA 22030
asafikha@gmu.edu
ABSTRACT
Deep neural networks are powerful tools to model observations over time with non-linear patterns.
Despite the widespread use of neural networks in such settings, most theoretical developments of
deep neural networks are under the assumption of independent observations, and theoretical results
for temporally dependent observations are scarce. To bridge this gap, we study theoretical properties
of deep neural networks for modeling non-linear time series data. Specifically, non-asymptotic
bounds on the prediction error of (sparse) feed-forward neural networks with ReLU activation function
are established under mixing-type assumptions. These assumptions are mild in the sense that they include
a wide range of time series models, including auto-regressive models. Compared to the independent-observations
setting, the established convergence rates have additional logarithmic factors that compensate for the
additional complexity due to dependence among data points. The theoretical results are supported via
various numerical simulation settings as well as an application to a macroeconomic data set.
1 Introduction
Neural networks have the ability to model highly complex relationships among data. If the input data are past
observations and future observations serve as the response, neural networks can be utilized to perform time series
forecasting. Examples of applications of neural networks in forecasting include biotechnology [1], finance [2],
health sciences [3], and business [4], to name a selected few. Compared to more traditional time series forecasting
methods such as ARIMA models [5], neural networks have the ability to detect highly non-linear trend and seasonality.
In this work, we analyze the prediction error consistency of (deep) feed-forward neural networks used to fit
stationary (non-linear) time series models.
The properties of single hidden layer neural networks have been well studied over the past few decades. For example,
[6] use a single hidden layer neural network with a transformed cosine activation function to show that a sufficiently
complex single hidden layer feed-forward network can approximate any member of a specific class of functions to any
desired degree of accuracy. Such an approximation property for neural networks with sigmoidal activation functions was
also analyzed in [7, 8]. Further, [9] use monotonic homogeneous activation functions (a general version of the ReLU
activation function) and show that both the input dimension and the number of hidden units affect the convergence rate
when using single layer neural networks.
There are many recent works which shed some light on the reasoning behind the good performance of multi-layer (or
deep) neural networks. The performance is evaluated via the mean squared prediction error, also called the statistical
risk. For example, [10] show that the statistical risk of a multi-layer neural network depends on the number of layers
and the input dimension of each layer. The difficulty of applying deep neural networks in high-dimensional settings is
that a high-dimensional input vector in nonparametric regression leads to a slow convergence rate [11], while the
complexity often scales exponentially with the depth or the number of units per layer [12, 13, 14]. Further, the
convergence rate for the prediction error is also related to the regression function, and fast rates can be obtained
for special classes of regression functions such as additive and/or composition functions [11, 15, 16]. To avoid the
curse of dimensionality and achieve faster rates, [17] work under a hierarchical composition assumption with a sparse
neural network. It is shown that, under the independence assumption on the input vectors, the estimator utilizing a
sparse network achieves nearly optimal convergence rates for the prediction error. Finally, [18] use neural networks as
classifiers for temporally dependent observations based on Markov processes. We refer to [19] for an overview of deep
learning methods.
Most theoretical developments related to the prediction error consistency of neural networks are under the assumption
that either the input variables are independent, or they are independent of the error (noise) term, or both. However,
these assumptions are restrictive and may not hold in time series models. To bridge this gap, the main goal of this
paper is to establish consistency rates for the prediction error of deep feed-forward neural networks for temporally
dependent observations. To that end, we focus on a multivariate nonparametric regression model with bounded and
composite regression functions and apply sparse neural networks with ReLU activation functions for estimation (see
more details in Section 2). The modeling framework is similar to [17] while the independence assumption is relaxed.
Specifically, we show that given temporally dependent observations, under a certain mixing condition, the statistical
risk coincides with the result of the independent-observations setting up to an additional log^4(n) factor, where n is
the sample size (Theorem 1). Moreover, utilizing the Wold decomposition, this result is extended to a general family of
stationary time series models, in which it is shown that the decay rate of the AR(∞) representation coefficients plays
an important role in the consistency rate for the prediction error of neural networks (Theorem 2). These results give
some insight into the effect of temporal dependence on the performance of neural networks by explicitly quantifying the
prediction error in such settings. Finally, the prediction performance of neural networks in time series settings is
investigated empirically via several simulation scenarios and a real data application (Sections 4 and 5).
Notation. For two random variables X and Y, X =_D Y means that X and Y have the same distribution. For a matrix W,
‖W‖_∞ := max_{i,j} |W_{ij}|, ‖W‖_0 is the number of nonzero entries of W, and ‖W‖_1 := ∑_{i,j} |W_{ij}|. For a vector
v, |v|_0, |v|_1 and |v|_∞ are defined in the same way. We write ⌊x⌋ for the largest integer strictly smaller than x.
For two sequences (a_t)_{t≥1} and (b_t)_{t≥1}, we write a_t ≲ b_t if there exists a constant c ≥ 1 such that
a_t ≤ c b_t for all t. If both a_t ≲ b_t and b_t ≲ a_t, we write a_t ≍ b_t. Also, a_t = o(b_t) means a_t/b_t → 0 as
t → ∞. For a multi-dimensional random variable X, X ∼ N(µ, Σ) means that X has a multivariate Gaussian distribution
with mean µ and covariance matrix Σ. For two functions f and g, we use f ∘ g(x) to denote f(g(x)). Also,
‖f‖_∞ := sup_x |f(x)| is the sup-norm of f, and (a)_+ = max(a, 0) for a ∈ R.
2 Setup
In this section, a brief overview of feed-forward neural networks is provided in Section 2.1, followed by a discussion
of the modeling framework under consideration in Section 2.2.
2.1 Background about multi-layer neural networks
A multi-layer neural network is composed of three parts: an input layer, hidden layers, and an output layer. We denote
the depth of a multi-layer neural network by L, which implies that there are L + 1 layers in total, consisting of L − 1
hidden layers, one input layer, and one output layer. We refer to the input layer as the 0-th layer and the output
layer as the L-th layer. The multi-layer neural network f : R^d → R can be written as

    f(x) = W_L σ_{v_L}(W_{L−1} σ_{v_{L−1}}(··· W_1 σ_{v_1}(W_0 x) ···)),    (1)

where W_i is the matrix of weights between the (i − 1)-th and i-th layers of the network (i = 1, . . . , L) and σ_v is
a modified ReLU activation function in each layer. Specifically, for the shift parameter v = (v_1, ···, v_r) ∈ R^r, the
activation function σ_v : R^r → R^r is defined as

    σ_v(x_1, . . . , x_r)^⊤ = ((x_1 − v_1)_+, . . . , (x_r − v_r)_+)^⊤.
Let p_i denote the number of units in the i-th layer (note that p_0 = d and p_L = 1). For a fully connected multi-layer
neural network, the total number of parameters is ∑_{i=0}^{L−1} p_i p_{i+1}, which is defined as the size of the
multi-layer neural network [20]. Similar to [17], the entries of all weight matrices {W_i}_{i=1,···,L} and shift
parameters {v_i}_{i=1,···,L} are assumed to be uniformly bounded. The sparsity level s is defined as the number of
non-zero parameters in {W_i}_{i=1,···,L} and {v_i}_{i=1,···,L}. We also assume that the value of |f| is bounded by some
constant F. In summary, we focus on the collection of s-sparse multi-layer neural networks with bounded parameters,
which is denoted by F(L, p, s, F) and defined as

    F(L, p, s, F) := { f ∈ F(L, p) : ∑_{i=0}^{L} ‖W_i‖_0 + |v_i|_0 ≤ s, ‖f‖_∞ ≤ F },

where

    F(L, p) := { f of the form (1) : max_{i=0,1,···,L} ‖W_i‖_∞ ∨ |v_i|_∞ ≤ 1 }.
This restriction of neural networks to ones with sparse connections and bounded parameters is common in deep learning
(see e.g. [17] and references therein), since neural networks are typically trained using certain penalization methods
and dropout.
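As a concrete illustration (our own sketch, not code from the paper), the following PyTorch snippet builds a
feed-forward network of the form (1), where per-layer shift parameters play the role of the modified ReLU
σ_v(x) = (x − v)_+. The layer widths passed to the constructor are hypothetical choices; sparsity and the uniform bound
on the parameters would have to be enforced separately, e.g. through penalization or pruning during training.

```python
# A minimal sketch of a feed-forward ReLU network of the form (1).
# Widths (p_0, ..., p_L) are hypothetical; p_L = 1 gives a scalar output.
import torch
import torch.nn as nn

class ShiftedReLUNet(nn.Module):
    def __init__(self, widths):
        super().__init__()
        # one weight matrix per layer transition
        self.weights = nn.ModuleList(
            [nn.Linear(widths[i], widths[i + 1], bias=False)
             for i in range(len(widths) - 1)]
        )
        # one shift vector v per hidden layer
        self.shifts = nn.ParameterList(
            [nn.Parameter(torch.zeros(widths[i + 1]))
             for i in range(len(widths) - 2)]
        )

    def forward(self, x):
        for W, v in zip(self.weights[:-1], self.shifts):
            x = torch.relu(W(x) - v)      # sigma_v(Wx) = (Wx - v)_+
        return self.weights[-1](x)        # final linear output layer

net = ShiftedReLUNet([8, 32, 32, 1])      # d = 8 inputs, two hidden layers
y = net(torch.randn(5, 8))                # batch of 5 inputs
```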
2.2 Model
The modeling framework considered here is similar to [17] while allowing for temporal dependence among observations.
Let (ε_t)_{t≥1} be a sequence of independent random variables with E[ε_t] = 0. Let {X_t}_{t≥1} be a p-dimensional
stationary process with X_t := (X_{t,1}, ···, X_{t,p}). We assume Y_t is generated as

    Y_t = f_0(X_t) + ε_t,    (2)

with a measurable function f_0 : R^p → R. We assume that the regression function f_0 is a composition of several
functions, specifically

    f_0 = g_q ∘ g_{q−1} ∘ ··· ∘ g_1 ∘ g_0,    (3)

with g_i : [a_i, b_i]^{d_i} → [a_{i+1}, b_{i+1}]^{d_{i+1}}, where d_0 = p and d_{q+1} = 1. Each g_i has a
d_{i+1}-dimensional vector output g_i = (g_{i,1}, ···, g_{i,d_{i+1}}). We assume that the multivariate function g_{i,j}
depends on at most t_i variables, where t_i is far less than d_i, i.e. t_i ≪ d_i. As mentioned in [17], for a β-smooth
function f_0, the minimax estimation rate for the prediction error is n^{−2β/(2β+d_0)}. Since the dimensionality d_0
can be large in applications, this rate can be slow. To mitigate this issue, the sparse composition structure removes
the effect of the input dimension on the convergence rate and improves the rate.
Let T be a region in R^r, and let β and L be two positive numbers. The Hölder class Σ(β, L) is defined as the set of
α = ⌊β⌋ times differentiable functions f : T → R whose derivative ∂^α f satisfies

    |∂^α f(x) − ∂^α f(y)| / |x − y|^{β−⌊β⌋} ≤ L,

where we use the notation ∂^α = ∂^{α_1} ··· ∂^{α_r} with α = (α_1, ···, α_r) and |α| := |α|_1. Further, we define the
ball of β-Hölder functions with radius K as

    C^β_r(D, K) = { f : D ⊂ R^r → R : ∑_{α : |α|<β} ‖∂^α f‖_∞
        + ∑_{α : |α|=⌊β⌋} sup_{x,y∈D, x≠y} |∂^α f(x) − ∂^α f(y)| / |x − y|^{β−⌊β⌋} ≤ K }.
We assume that all functions g_{ij} for i = 0, ···, q and j = 1, ···, d_{i+1} belong to the β_i-Hölder class
C^{β_i}_{t_i}(D_{ij}, K). From the model (3), we know that D_{ij} = [a_i, b_i]^{t_i}. Hence, the class of regression
functions f_0 we focus on is

    G(q, d, t, β, K) := { f_0 = g_q ∘ ··· ∘ g_0 : g_i = (g_{ij})_j : [a_i, b_i]^{d_i} → [a_{i+1}, b_{i+1}]^{d_{i+1}},
        g_{ij} ∈ C^{β_i}_{t_i}([a_i, b_i]^{t_i}, K), for some |a_i|, |b_i| ≤ K },    (4)

with d := (d_0, ···, d_{q+1}), t := (t_0, ···, t_q) and β := (β_0, ···, β_q).
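To make the setup concrete, the sketch below (our own illustration; the functions g_0, g_1 and the AR(1) input
dynamics are hypothetical) simulates data from model (2)-(3): the input process is a stationary Gaussian AR(1), which
is strictly stationary and exponentially α-mixing, and f_0 is a composition of smooth functions, each depending on only
a few coordinates.

```python
# Hypothetical simulation of model (2)-(3) with temporally dependent inputs.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4

# stationary Gaussian AR(1) input process (componentwise)
X = np.zeros((n, p))
for t in range(1, n):
    X[t] = 0.5 * X[t - 1] + rng.normal(size=p)

def g0(x):
    # R^4 -> R^2; each output coordinate depends on at most t_0 = 2 inputs
    return np.stack([np.sin(x[:, 0] + x[:, 1]), np.tanh(x[:, 2])], axis=1)

def g1(z):
    # R^2 -> R
    return z[:, 0] * z[:, 1]

f0 = lambda x: g1(g0(x))                  # composite regression function (3)
Y = f0(X) + 0.1 * rng.normal(size=n)      # model (2) with Gaussian errors
```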
3 Main result
In this section, we present consistency results for the prediction error of deep neural networks applied to model (2),
followed by time series model examples satisfying the assumptions in Section 3.1. First, we need to introduce some
notation and state the main assumptions under which the theoretical developments are established. For any estimator
f̂_n in the class F(L, p, s, F), we define (similar to [17]) Δ_n(f̂_n, f_0) to measure the difference between the
expected empirical risk of f̂_n and the global minimum over all networks in the class F(L, p, s, F) as

    Δ_n(f̂_n, f_0) := E_{f_0}[ (1/n) ∑_{i=1}^{n} (Y_i − f̂_n(X_i))^2 − inf_{f ∈ F(L,p,s,F)} (1/n) ∑_{i=1}^{n} (Y_i − f(X_i))^2 ].
The quantity Δ_n(f̂_n, f_0) plays a pivotal role in the consistency properties of neural networks. The performance of
f̂_n is evaluated by the prediction error, defined as

    R(f̂_n, f_0) := E_{f_0}[ ( f̂_n(X) − f_0(X) )^2 ],    (5)

with X =_D X_t and X independent of {X_t}_{t≥0}.
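In simulations, Δ_n and R can be approximated numerically. The sketch below is illustrative only; f_hat, f0 and
sample_stationary are hypothetical handles for the fitted network, the true regression function, and a sampler from the
stationary distribution of X_t.

```python
import numpy as np

def empirical_risk(f, X, Y):
    """In-sample mean squared error, the quantity appearing inside Delta_n."""
    return np.mean((Y - f(X)) ** 2)

def prediction_error(f_hat, f0, sample_stationary, n_mc=10_000):
    """Monte Carlo approximation of R(f_hat, f0) on an independent draw of X."""
    X_new = sample_stationary(n_mc)
    return np.mean((f_hat(X_new) - f0(X_new)) ** 2)
```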
Recall from Section 2.2 that the regression function f_0 is in the class G(q, d, t, β, K). To simplify notation, we
define

    β*_i := β_i ∏_{ℓ=i+1}^{q} (β_ℓ ∧ 1),    φ_n := max_{i=0,···,q} n^{−2β*_i/(2β*_i + t_i)}.

The following assumptions are needed to present the first theorem.
Assumption 1. For all i = 1, 2, . . ., E[ε_i] = 0, E[ε_i^2] = σ^2, and there exists some positive constant c such that

    E[|ε_i|^m] ≤ σ^2 m! c^{m−2},    m = 3, 4, ···.
Assumption 2. {X_t} is a strictly stationary and exponentially α-mixing process. Recall that the α-mixing coefficient
of a stationary process {X_t} is defined as

    α(s) = sup{ |P(A ∩ B) − P(A)P(B)| : −∞ < t < ∞, A ∈ σ(X^−_t), B ∈ σ(X^+_{t+s}) },

where X^−_t consists of the entire past of the process up to and including X_t, and X^+_t consists of its entire future
from X_t on. The process {X_t} is said to be exponentially α-mixing if there exists some constant c̃ > 0 such that
log(α(t)) ≤ −c̃ t for all t ≥ 1.
Assumption 3. The error term ε_t is independent of {X_s, s ≤ t}.
Assumption 1 is known as the Bernstein condition and implies that ε_t is a sub-exponential random variable. This
assumption is often used when ε_t cannot be assumed to be bounded. Assumption 2 controls the dependence among input
variables and holds for a wide range of time series models; see e.g. the auto-regressive models in [21]. Assumption 3
controls the dependence between input variables and error terms. A more stringent condition would be to assume that
the whole error process {ε_t}_{t≥0} is independent of {X_s}_{s≥0}. However, this assumption is restrictive since it
excludes auto-regressive models, an important family of time series models. To avoid this, we only assume that the
current error term is independent of current and past input variables. Further, this assumption ensures that
∑_t ε_t f_0(X_t) is a martingale, which helps in verifying certain concentration inequalities needed in the proofs of
the main results. All three assumptions are common in non-linear time series analysis; see e.g. [22]. Now, we are
ready to state the main result.
Theorem 1. Consider the d-variate nonparametric regression model (2) for a composite regression function (3) in the
class G(q, d, t, β, K). Suppose Assumptions 1-3 hold. Let f̂_n be an estimator taking values in the network class
F(L, (p_i)_{i=0,···,L+1}, s, F) satisfying (i) F ≥ max(K, 1); (ii) ∑_{i=0}^{q} log_2(4t_i ∨ 4β_i) log_2 n ≤ L ≲ nφ_n;
(iii) nφ_n ≲ min_{i=1,···,L} p_i; and (iv) s ≍ nφ_n log n. Then, there exist positive constants C, C′ depending only on
q, d, t, β, F, such that if Δ_n(f̂_n, f_0) ≤ C φ_n L log^6 n, then

    R(f̂_n, f_0) ≤ C′ φ_n L log^6 n,    (6)

and if Δ_n(f̂_n, f_0) > C φ_n L log^6 n, then

    (1/C′) Δ_n(f̂_n, f_0) ≤ R(f̂_n, f_0) ≤ C′ Δ_n(f̂_n, f_0).    (7)
Based on Theorem 1, the prediction error defined in (5) is controlled by φ_n L log^6 n. From condition (ii), L is at
least of the order of log_2 n, so the rate in Theorem 1 becomes φ_n log^7 n. The corresponding rate for independent
observations is of order φ_n log^3 n, based on Theorem 1 in [17]. Compared to the latter, our rate has an extra
log^4 n factor, which compensates for the additional complexity in verifying the prediction error consistency in the
presence of temporal dependence among input variables. Further, note that for a fully connected neural network, the
number of parameters is ∑_{i=1}^{L−1} p_i p_{i+1} ≳ n²φ_n² L. This number is greater than the sparsity level s, which
is of order nφ_n log n by condition (iv) in Theorem 1. Thus, condition (iv) restricts the neural network class to
those with a sparse architecture; in other words, at least ∑_{i=1}^{L−1} p_i p_{i+1} − s connections of the neural
network are completely inactive.
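To illustrate how the rate and conditions (ii)-(iv) interact, the sketch below (our own, with hypothetical smoothness
indices β and intrinsic dimensions t, and with the implied constants set to one) computes the effective smoothness
β*_i, the rate φ_n, and the orders of depth, minimum width, and sparsity suggested by the theorem.

```python
# Illustrative only: beta, t and the implied constants are hypothetical.
import numpy as np

def effective_rate(n, beta, t):
    """phi_n = max_i n^{-2 beta*_i/(2 beta*_i + t_i)} with beta*_i = beta_i * prod_{l>i} min(beta_l, 1)."""
    beta, t = np.asarray(beta, float), np.asarray(t, float)
    beta_star = np.array([beta[i] * np.prod(np.minimum(beta[i + 1:], 1.0))
                          for i in range(len(beta))])
    return np.max(n ** (-2.0 * beta_star / (2.0 * beta_star + t)))

def architecture_orders(n, beta, t):
    """Orders of depth L, minimum width, and sparsity s suggested by conditions (ii)-(iv)."""
    phi = effective_rate(n, beta, t)
    beta_arr, t_arr = np.asarray(beta, float), np.asarray(t, float)
    depth = int(np.ceil(np.sum(np.log2(4.0 * np.maximum(t_arr, beta_arr))) * np.log2(n)))
    width = int(np.ceil(n * phi))                 # (iii): min_i p_i of order n*phi_n
    sparsity = int(np.ceil(n * phi * np.log(n)))  # (iv): s of order n*phi_n*log(n)
    return phi, depth, width, sparsity

print(architecture_orders(n=10_000, beta=[2.0, 1.5], t=[3, 1]))
```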
3.1 Time series model examples
In this section, we introduce some (well-known) examples of time series models that satisfy the assumptions of
Theorem 1. The first example is to let {ε_t}_{t∈Z} and {X_t}_{t∈Z} be two independent processes. This independence
assumption implies that the input variables X_t are exogenous, so Assumption 3 is automatically satisfied. Further,
assume that the ε_t satisfy the moment conditions in Assumption 1 (for example, they have a normal distribution) and
that X_t is a stationary and geometrically α-mixing process. There are many examples of such processes, including
certain finite-order auto-regressive processes; see e.g. [23, 21]. The second example is to consider non-linear
auto-regressive models, i.e. assume

    X_t = g(X_{t−1}, ···, X_{t−d}) + ε_t.    (8)

This is a special case of model (2), obtained by taking the input vector to be (X_{t−d}, ···, X_{t−1}) and the response
to be Y_t = X_t. Assuming that the ε_t are i.i.d. random variables with a positive density on the real line and that
the function g is bounded, it can be shown that there exists a stationary solution to equation (8) and that this
solution is exponentially α-mixing as well [24, 21]. In both examples, the assumptions of Theorem 1 are satisfied, so
the results of this theorem are applicable.
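As an illustration of the second example (the specific bounded function g below is our own choice, not one from the
paper), the sketch simulates a non-linear AR(d) process of the form (8) and builds the lagged input/response pairs that
would be fed to the network.

```python
# Hypothetical non-linear AR(3) simulation and lag embedding for model (8).
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 3

def g(lags):
    # bounded non-linear function of the last d values (lags[-1] = X_{t-1})
    return np.tanh(0.6 * lags[-1] - 0.3 * lags[-2] + 0.2 * lags[-3])

x = np.zeros(n)
for t in range(d, n):
    x[t] = g(x[t - d:t]) + 0.2 * rng.normal()

# design matrix: row for time t is (X_{t-d}, ..., X_{t-1}); response is X_t
X = np.stack([x[t - d:t] for t in range(d, n)])
Y = x[d:]
```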
Now, we consider a more general time series model. Recall that, by the well-known Wold representation, every purely
nondeterministic, stationary, zero-mean stochastic process X_t can be expressed as X_t = ∑_{i=0}^{∞} a_i ε_{t−i}, where
ε_t is a mean-zero white noise. Further, if X_t has a non-vanishing spectral density and absolutely summable
auto-regressive coefficients, i.e. ∑_{i=1}^{∞} |φ_i| < ∞, it has the AR(∞) representation
X_t = ∑_{i=1}^{∞} φ_i X_{t−i} + ε_t (see e.g. [25]). Motivated by this discussion, we consider a general family of
time series models satisfying

    X_t = ∑_{i=1}^{∞} φ_i X_{t−i} + ε_t,    (9)

where the ε_t are i.i.d. errors. Independence among the ε_t is a strong assumption compared to only assuming that they
are uncorrelated, but it is required for our theoretical analysis. An interesting feature of model (9) is that it is a
linear model. However, since there are infinitely many covariates in this AR(∞) representation, training neural
networks on them directly is impossible. The common solution is to truncate the covariates and only consider the first
few, i.e. to approximate model (9) by an AR(d) model for some d. This approximation can successfully estimate
second-order structures of the original model (i.e. the spectral density or auto-correlation function) if d is selected
carefully and under certain assumptions on the AR coefficients φ_i (see e.g. [25]). Therefore, we follow this path and
fit a neural network to the d-dimensional input variables (X_{t−1}, . . . , X_{t−d}) with a proper choice of d, while
keeping in mind that the true regression function is in fact f_0(X_t) = ∑_{i=1}^{∞} φ_i X_{t−i}. To establish the
prediction error consistency of neural networks on the truncated input variables, we need two additional assumptions.
Assumption 4. There exist α > 0 and M > 0 such that ∑_{i=1}^{∞} (1 + i)^α |φ_i| ≤ M < ∞.
Assumption 5. For some constant K > 0, |X_t| ≤ K for all t ≥ 0.
Assumption 4 controls the decay rate of the AR(∞) coefficients in the true model, and α can be interpreted as the decay
rate of the φ_i. This assumption is needed to compensate for approximating a general time series of the form (9) with a
finite-lag AR process. Further, it plays an important role in restricting the first derivative of f_0, which
corresponds to the β_i-smoothness assumption on the g_{ij} in (4). Note that Assumption 4 is satisfied if the spectral
density function is strictly positive and continuous and the auto-covariance function of X_t satisfies a suitable
boundedness property [26]. Moreover, since the model is linear (i.e. the regression function is unbounded), Assumption
5 becomes necessary to make f_0(X_t) bounded, a property needed for Theorem 1 as well.
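The truncation step described above can be sketched as follows (our illustration; treating the decay exponent α as
known is an assumption made only for this example): choose the lag order d of order n^{1/(α+1)} and regress X_t on the
truncated input vector (X_{t−1}, . . . , X_{t−d}).

```python
# Hypothetical lag truncation for the AR(infinity) model (9).
import numpy as np

def truncated_design(x, alpha):
    """Build inputs (X_{t-1}, ..., X_{t-d}) and responses X_t with d ~ n^{1/(alpha+1)}."""
    n = len(x)
    d = max(1, int(np.ceil(n ** (1.0 / (alpha + 1.0)))))
    X = np.stack([x[t - d:t][::-1] for t in range(d, n)])   # (X_{t-1}, ..., X_{t-d})
    Y = x[d:]
    return X, Y, d

# usage on a toy series (white noise here, purely for illustration)
X, Y, d = truncated_design(np.random.default_rng(2).standard_normal(1000), alpha=1.0)
```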
Theorem 2. Consider model (9) with f_0(X_t) = ∑_{i=1}^{∞} φ_i X_{t−i}. Let f̂_n be an estimator taking values in the
network class F(L, (p_i)_{i=0,···,L+1}, s, F) satisfying (i) F ≥ KM; (ii) L ≥ 4; (iii) s ≍ Ld; and (iv)
d ≲ min_{i=1,···,L} p_i. Assume that d ≍ n^{1/(α+1)}. Under Assumptions 1-5, there exist positive constants C, C′ such
that if Δ_n(f̂_n, f_0) ≤ C n^{−α/(α+1)} L log^5 n, then

    R(f̂_n, f_0) ≤ C′ n^{−α/(α+1)} L log^5 n,    (10)

and if Δ_n(f̂_n, f_0) > C n^{−α/(α+1)} L log^5 n, then

    (1/C′) Δ_n(f̂_n, f_0) ≤ R(f̂_n, f_0) ≤ C′ Δ_n(f̂_n, f_0).    (11)