arXiv:2210.11530v1 [stat.ML] 20 Oct 2022

THEORETICAL ANALYSIS OF DEEP NEURAL NETWORKS FOR
TEMPORALLY DEPENDENT OBSERVATIONS
Mingliang Ma
Department of Statistics
University of Florida
Gainesville, FL 32611
maminglian@ufl.edu
Abolfazl Safikhani
Department of Statistics
George Mason University
Fairfax, VA 22030
asafikha@gmu.edu
ABSTRACT
Deep neural networks are powerful tools to model observations over time with non-linear patterns.
Despite the widespread use of neural networks in such settings, most theoretical developments of
deep neural networks are under the assumption of independent observations, and theoretical results
for temporally dependent observations are scarce. To bridge this gap, we study theoretical properties
of deep neural networks for modeling non-linear time series data. Specifically, non-asymptotic
bounds on the prediction error of (sparse) feed-forward neural networks with ReLU activation function
are established under mixing-type assumptions. These assumptions are mild in the sense that they include
a wide range of time series models, including auto-regressive models. Compared to the independent-observations
setting, the established convergence rates have additional logarithmic factors that compensate for the
additional complexity due to dependence among data points. The theoretical results are supported via
various numerical simulation settings as well as an application to a macroeconomic data set.
1 Introduction
Neural networks have the ability to model highly complex relationships among data. If the input data are past
observations and future observations serve as the response, neural networks can be utilized to perform time series
forecasting. Examples of applications of neural networks in forecasting include biotechnology [1], finance [2],
health sciences [3], and business [4], to name a selected few. Compared to more traditional time series forecasting
methods such as ARIMA models [5], neural networks have the ability to detect highly non-linear trend and seasonality.
In this work, we analyze the prediction error consistency of (deep) feed-forward neural networks used to fit
stationary (non-linear) time series models.
The properties of single hidden layer neural networks have been well studied over the past few decades. For example,
[6] use a single hidden layer neural network with a transformed cosine activation function to show that a sufficiently
complex single hidden layer feed-forward network can approximate any member of a specific class of functions to any
desired degree of accuracy. Such an approximation property for neural networks with sigmoidal activation functions was
also analyzed in [7, 8]. Further, [9] use monotonic homogeneous activation functions (a general version of the ReLU
activation function) and show that both the input dimension and the number of hidden units affect the convergence rate
when using single layer neural networks.
There are many recent works which shed some light on the reasoning behind the good performance of multi-layer (or
deep) neural networks. The performance is evaluated via the mean squared prediction error, also called the statistical
risk. For example, [10] show that the statistical risk of a multi-layer neural network depends on the number of layers
and the input dimension of each layer. The difficulty of applying deep neural networks in high-dimensional settings is
that a high-dimensional input vector in nonparametric regression leads to a slow convergence rate [11], while the
complexity often scales exponentially with the depth or the number of units per layer [12, 13, 14]. Further, the
convergence rate for the prediction error is also related to the regression function, and fast rates can be obtained
for special classes of regression functions such as additive and/or composition functions [11, 15, 16]. To avoid the
curse of dimensionality and achieve faster rates, [17] work under a hierarchical composition assumption with a sparse
neural network. It is shown that, under the independence assumption on the input vectors, the estimator utilizing a
sparse network achieves nearly optimal convergence rates for the prediction error. Finally, [18] use neural networks as
classifiers for temporally dependent observations based on Markov processes. We refer to [19] for an overview of deep
learning methods.
Most theoretical developments related to the prediction error consistency of neural networks are under the assumption
that either the input variables are independent, or they are independent of the error (noise) term, or both. However,
these assumptions are restrictive and may not hold in time series models. To bridge this gap, the main goal of this
paper is to establish consistency rates for the prediction error of deep feed-forward neural networks for temporally
dependent observations. To that end, we focus on a multivariate nonparametric regression model with bounded and
composite regression functions and apply sparse neural networks with ReLU activation functions for estimation (see
more details in Section 2). The modeling framework is similar to [17] while the independence assumption is relaxed.
Specifically, we show that given temporally dependent observations, under a certain mixing condition, the statistical
risk coincides with the result of the independent-observations setting up to an additional log^4(n) factor, where n is
the sample size (Theorem 1). Moreover, utilizing the Wold decomposition, this result is extended to a general family of
stationary time series models, in which it is shown that the decay rate of the AR(∞) representation coefficients plays
an important role in the consistency rate for the prediction error of neural networks (Theorem 2). These results give
some insight into the effect of temporal dependence on the performance of neural networks by explicitly quantifying the
prediction error in such settings. Finally, the prediction performance of neural networks in time series settings is
investigated empirically via several simulation scenarios and a real data application (Sections 4 and 5).
Notation. For two random variables X and Y, X =_D Y means that X and Y have the same distribution. For a matrix W,
‖W‖_∞ := max_{i,j} |W_{ij}|, ‖W‖_0 is the number of nonzero entries of W, and ‖W‖_1 := ∑_{i,j} |W_{ij}|. For a vector
v, |v|_0, |v|_1 and |v|_∞ are defined in the same way. We write ⌊x⌋ for the largest integer strictly smaller than x.
For two sequences (a_t)_{t≥1} and (b_t)_{t≥1}, we write a_t ≲ b_t if there exists a constant c ≥ 1 such that
a_t ≤ c b_t for all t. If both a_t ≲ b_t and b_t ≲ a_t, we write a_t ≍ b_t. Also, a_t = o(b_t) means a_t/b_t → 0 as
t → ∞. For a multi-dimensional random variable X, X ∼ N(µ, Σ) means that X has a multivariate Gaussian distribution
with mean µ and covariance matrix Σ. For two functions f and g, we use f ∘ g(x) to denote f(g(x)). Also,
‖f‖_∞ := sup_x |f(x)| is the sup-norm of f, and (a)_+ = max(a, 0) for a ∈ R.
2 Setup
In this section, a brief overview of feed-forward neural networks is provided in Section 2.1, followed by a discussion
of the modeling framework under consideration in Section 2.2.
2.1 Background about multi-layer neural networks
A multi-layer neural network is composed of three parts: an input layer, hidden layers, and an output layer. We denote
the depth of a multi-layer neural network by L, which implies that there are L + 1 layers in total, consisting of L − 1
hidden layers, one input layer, and one output layer. We refer to the input layer as the 0-th layer and the output
layer as the L-th layer. The multi-layer neural network f : R^d → R can be written as

    f(x) = W_L σ_{v_L}(W_{L−1} σ_{v_{L−1}}(··· W_1 σ_{v_1}(W_0 x) ···)),    (1)

where W_i is the matrix of weights between the (i − 1)-th and i-th layers of the network (i = 1, . . . , L) and σ_v is
a modified ReLU activation function in each layer. Specifically, for the shift parameter v = (v_1, ···, v_r) ∈ R^r, the
activation function σ_v : R^r → R^r is defined as

    σ_v(x_1, . . . , x_r)^⊤ = ((x_1 − v_1)_+, . . . , (x_r − v_r)_+)^⊤.
Let p_i denote the number of units in the i-th layer (note that p_0 = d and p_L = 1). For a fully connected multi-layer
neural network, the total number of parameters is ∑_{i=0}^{L−1} p_i p_{i+1}, which is defined as the size of the
multi-layer neural network [20]. Similar to [17], the entries of all weight matrices {W_i}_{i=1,···,L} and shift
parameters {v_i}_{i=1,···,L} are assumed to be uniformly bounded. The sparsity level s is defined as the number of
non-zero parameters in {W_i}_{i=1,···,L} and {v_i}_{i=1,···,L}. We also assume that the value of |f| is bounded by some
constant F. In summary, we focus on the collection of s-sparse multi-layer neural networks with bounded parameters,
which is denoted by F(L, p, s, F) and defined as

    F(L, p, s, F) := { f ∈ F(L, p) : ∑_{i=0}^{L} ‖W_i‖_0 + |v_i|_0 ≤ s, ‖f‖_∞ ≤ F },

where

    F(L, p) := { f of the form (1) : max_{i=0,1,···,L} ‖W_i‖_∞ ∨ |v_i|_∞ ≤ 1 }.
This restriction of neural networks to ones with sparse connections and bounded parameters is common in deep learning
(see e.g. [17] and references therein), since neural networks are typically trained using certain penalization methods
and dropout.
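As a concrete illustration (our own sketch, not code from the paper), the following PyTorch snippet builds a
feed-forward network of the form (1), where per-layer shift parameters play the role of the modified ReLU
σ_v(x) = (x − v)_+. The layer widths passed to the constructor are hypothetical choices; sparsity and the uniform bound
on the parameters would have to be enforced separately, e.g. through penalization or pruning during training.

```python
# A minimal sketch of a feed-forward ReLU network of the form (1).
# Widths (p_0, ..., p_L) are hypothetical; p_L = 1 gives a scalar output.
import torch
import torch.nn as nn

class ShiftedReLUNet(nn.Module):
    def __init__(self, widths):
        super().__init__()
        # one weight matrix per layer transition
        self.weights = nn.ModuleList(
            [nn.Linear(widths[i], widths[i + 1], bias=False)
             for i in range(len(widths) - 1)]
        )
        # one shift vector v per hidden layer
        self.shifts = nn.ParameterList(
            [nn.Parameter(torch.zeros(widths[i + 1]))
             for i in range(len(widths) - 2)]
        )

    def forward(self, x):
        for W, v in zip(self.weights[:-1], self.shifts):
            x = torch.relu(W(x) - v)      # sigma_v(Wx) = (Wx - v)_+
        return self.weights[-1](x)        # final linear output layer

net = ShiftedReLUNet([8, 32, 32, 1])      # d = 8 inputs, two hidden layers
y = net(torch.randn(5, 8))                # batch of 5 inputs
```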
2.2 Model
The modeling framework considered here is similar to [17] while allowing for temporal dependence among observations.
Let (ε_t)_{t≥1} be a sequence of independent random variables with E[ε_t] = 0. Let {X_t}_{t≥1} be a p-dimensional
stationary process with X_t := (X_{t,1}, ···, X_{t,p}). We assume Y_t is generated as

    Y_t = f_0(X_t) + ε_t,    (2)

with a measurable function f_0 : R^p → R. We assume that the regression function f_0 is a composition of several
functions, specifically

    f_0 = g_q ∘ g_{q−1} ∘ ··· ∘ g_1 ∘ g_0,    (3)

with g_i : [a_i, b_i]^{d_i} → [a_{i+1}, b_{i+1}]^{d_{i+1}}, where d_0 = p and d_{q+1} = 1. Each g_i has a
d_{i+1}-dimensional vector output g_i = (g_{i,1}, ···, g_{i,d_{i+1}}). We assume that the multivariate function g_{i,j}
depends on at most t_i variables, where t_i is far less than d_i, i.e. t_i ≪ d_i. As mentioned in [17], for a β-smooth
function f_0, the minimax estimation rate for the prediction error is n^{−2β/(2β+d_0)}. Since the dimensionality d_0
can be large in applications, this rate can be slow. To mitigate this issue, the sparse composition structure removes
the effect of the input dimension on the convergence rate and improves the rate.
Let T be a region in R^r, and let β and L be two positive numbers. The Hölder class Σ(β, L) is defined as the set of
α = ⌊β⌋ times differentiable functions f : T → R whose derivative ∂^α f satisfies

    |∂^α f(x) − ∂^α f(y)| / |x − y|^{β−⌊β⌋} ≤ L,

where we use the notation ∂^α = ∂^{α_1} ··· ∂^{α_r} with α = (α_1, ···, α_r) and |α| := |α|_1. Further, we define the
ball of β-Hölder functions with radius K as

    C^β_r(D, K) = { f : D ⊂ R^r → R : ∑_{α : |α|<β} ‖∂^α f‖_∞
        + ∑_{α : |α|=⌊β⌋} sup_{x,y∈D, x≠y} |∂^α f(x) − ∂^α f(y)| / |x − y|^{β−⌊β⌋} ≤ K }.
We assume that all functions g_{ij} for i = 0, ···, q and j = 1, ···, d_{i+1} belong to the β_i-Hölder class
C^{β_i}_{t_i}(D_{ij}, K). From the model (3), we know that D_{ij} = [a_i, b_i]^{t_i}. Hence, the class of regression
functions f_0 we focus on is

    G(q, d, t, β, K) := { f_0 = g_q ∘ ··· ∘ g_0 : g_i = (g_{ij})_j : [a_i, b_i]^{d_i} → [a_{i+1}, b_{i+1}]^{d_{i+1}},
        g_{ij} ∈ C^{β_i}_{t_i}([a_i, b_i]^{t_i}, K), for some |a_i|, |b_i| ≤ K },    (4)

with d := (d_0, ···, d_{q+1}), t := (t_0, ···, t_q) and β := (β_0, ···, β_q).
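To make the setup concrete, the sketch below (our own illustration; the functions g_0, g_1 and the AR(1) input
dynamics are hypothetical) simulates data from model (2)-(3): the input process is a stationary Gaussian AR(1), which
is strictly stationary and exponentially α-mixing, and f_0 is a composition of smooth functions, each depending on only
a few coordinates.

```python
# Hypothetical simulation of model (2)-(3) with temporally dependent inputs.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4

# stationary Gaussian AR(1) input process (componentwise)
X = np.zeros((n, p))
for t in range(1, n):
    X[t] = 0.5 * X[t - 1] + rng.normal(size=p)

def g0(x):
    # R^4 -> R^2; each output coordinate depends on at most t_0 = 2 inputs
    return np.stack([np.sin(x[:, 0] + x[:, 1]), np.tanh(x[:, 2])], axis=1)

def g1(z):
    # R^2 -> R
    return z[:, 0] * z[:, 1]

f0 = lambda x: g1(g0(x))                  # composite regression function (3)
Y = f0(X) + 0.1 * rng.normal(size=n)      # model (2) with Gaussian errors
```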
3 Main result
In this section, we present consistency results for the prediction error of deep neural networks applied to model (2),
followed by time series model examples satisfying the assumptions in Section 3.1. First, we need to introduce some
notation and state the main assumptions under which the theoretical developments are established. For any estimator
f̂_n in the class F(L, p, s, F), we define (similar to [17]) Δ_n(f̂_n, f_0) to measure the difference between the
expected empirical risk of f̂_n and the global minimum over all networks in the class F(L, p, s, F) as

    Δ_n(f̂_n, f_0) := E_{f_0}[ (1/n) ∑_{i=1}^{n} (Y_i − f̂_n(X_i))^2 − inf_{f ∈ F(L,p,s,F)} (1/n) ∑_{i=1}^{n} (Y_i − f(X_i))^2 ].
The quantity Δ_n(f̂_n, f_0) plays a pivotal role in the consistency properties of neural networks. The performance of
f̂_n is evaluated by the prediction error, defined as

    R(f̂_n, f_0) := E_{f_0}[ ( f̂_n(X) − f_0(X) )^2 ],    (5)

with X =_D X_t and X independent of {X_t}_{t≥0}.
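In simulations, Δ_n and R can be approximated numerically. The sketch below is illustrative only; f_hat, f0 and
sample_stationary are hypothetical handles for the fitted network, the true regression function, and a sampler from the
stationary distribution of X_t.

```python
import numpy as np

def empirical_risk(f, X, Y):
    """In-sample mean squared error, the quantity appearing inside Delta_n."""
    return np.mean((Y - f(X)) ** 2)

def prediction_error(f_hat, f0, sample_stationary, n_mc=10_000):
    """Monte Carlo approximation of R(f_hat, f0) on an independent draw of X."""
    X_new = sample_stationary(n_mc)
    return np.mean((f_hat(X_new) - f0(X_new)) ** 2)
```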
Recall from Section 2.2 that the regression function f_0 is in the class G(q, d, t, β, K). To simplify notation, we
define

    β*_i := β_i ∏_{ℓ=i+1}^{q} (β_ℓ ∧ 1),    φ_n := max_{i=0,···,q} n^{−2β*_i/(2β*_i + t_i)}.

The following assumptions are needed to present the first theorem.
Assumption 1. For all i = 1, 2, . . ., E[ε_i] = 0, E[ε_i^2] = σ^2, and there exists some positive constant c such that

    E[|ε_i|^m] ≤ σ^2 m! c^{m−2},    m = 3, 4, ···.
Assumption 2. {X_t} is a strictly stationary and exponentially α-mixing process. Recall that the α-mixing coefficient
of a stationary process {X_t} is defined as

    α(s) = sup{ |P(A ∩ B) − P(A)P(B)| : −∞ < t < ∞, A ∈ σ(X^−_t), B ∈ σ(X^+_{t+s}) },

where X^−_t consists of the entire past of the process up to and including X_t, and X^+_t consists of its entire future
from X_t on. The process {X_t} is said to be exponentially α-mixing if there exists some constant c̃ > 0 such that
log(α(t)) ≤ −c̃ t for all t ≥ 1.
Assumption 3. The error term ε_t is independent of {X_s, s ≤ t}.
Assumption 1 is known as the Bernstein condition and implies that ε_t is a sub-exponential random variable. This
assumption is often used when ε_t cannot be assumed to be bounded. Assumption 2 controls the dependence among input
variables and holds for a wide range of time series models; see e.g. the auto-regressive models in [21]. Assumption 3
controls the dependence between input variables and error terms. A more stringent condition would be to assume that
the whole error process {ε_t}_{t≥0} is independent of {X_s}_{s≥0}. However, this assumption is restrictive since it
excludes auto-regressive models, an important family of time series models. To avoid this, we only assume that the
current error term is independent of current and past input variables. Further, this assumption ensures that
∑_t ε_t f_0(X_t) is a martingale, which helps in verifying certain concentration inequalities needed in the proofs of
the main results. All three assumptions are common in non-linear time series analysis; see e.g. [22]. Now, we are
ready to state the main result.
Theorem 1. Consider the d-variate nonparametric regression model (2) for a composite regression function (3) in the
class G(q, d, t, β, K). Suppose Assumptions 1-3 hold. Let f̂_n be an estimator taking values in the network class
F(L, (p_i)_{i=0,···,L+1}, s, F) satisfying (i) F ≥ max(K, 1); (ii) ∑_{i=0}^{q} log_2(4t_i ∨ 4β_i) log_2 n ≤ L ≲ nφ_n;
(iii) nφ_n ≲ min_{i=1,···,L} p_i; and (iv) s ≍ nφ_n log n. Then, there exist positive constants C, C′ depending only on
q, d, t, β, F, such that if Δ_n(f̂_n, f_0) ≤ C φ_n L log^6 n, then

    R(f̂_n, f_0) ≤ C′ φ_n L log^6 n,    (6)

and if Δ_n(f̂_n, f_0) > C φ_n L log^6 n, then

    (1/C′) Δ_n(f̂_n, f_0) ≤ R(f̂_n, f_0) ≤ C′ Δ_n(f̂_n, f_0).    (7)
Based on Theorem 1, the prediction error defined in (5) is controlled by φ_n L log^6 n. From condition (ii), L is at
least of the order of log_2 n, so the rate in Theorem 1 becomes φ_n log^7 n. The corresponding rate for independent
observations is of order φ_n log^3 n, based on Theorem 1 in [17]. Compared to the latter, our rate has an extra
log^4 n factor, which compensates for the additional complexity in verifying the prediction error consistency in the
presence of temporal dependence among input variables. Further, note that for a fully connected neural network, the
number of parameters is ∑_{i=1}^{L−1} p_i p_{i+1} ≳ n²φ_n² L. This number is greater than the sparsity level s, which
is of order nφ_n log n by condition (iv) in Theorem 1. Thus, condition (iv) restricts the neural network class to
those with a sparse architecture; in other words, at least ∑_{i=1}^{L−1} p_i p_{i+1} − s connections of the neural
network are completely inactive.
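To illustrate how the rate and conditions (ii)-(iv) interact, the sketch below (our own, with hypothetical smoothness
indices β and intrinsic dimensions t, and with the implied constants set to one) computes the effective smoothness
β*_i, the rate φ_n, and the orders of depth, minimum width, and sparsity suggested by the theorem.

```python
# Illustrative only: beta, t and the implied constants are hypothetical.
import numpy as np

def effective_rate(n, beta, t):
    """phi_n = max_i n^{-2 beta*_i/(2 beta*_i + t_i)} with beta*_i = beta_i * prod_{l>i} min(beta_l, 1)."""
    beta, t = np.asarray(beta, float), np.asarray(t, float)
    beta_star = np.array([beta[i] * np.prod(np.minimum(beta[i + 1:], 1.0))
                          for i in range(len(beta))])
    return np.max(n ** (-2.0 * beta_star / (2.0 * beta_star + t)))

def architecture_orders(n, beta, t):
    """Orders of depth L, minimum width, and sparsity s suggested by conditions (ii)-(iv)."""
    phi = effective_rate(n, beta, t)
    beta_arr, t_arr = np.asarray(beta, float), np.asarray(t, float)
    depth = int(np.ceil(np.sum(np.log2(4.0 * np.maximum(t_arr, beta_arr))) * np.log2(n)))
    width = int(np.ceil(n * phi))                 # (iii): min_i p_i of order n*phi_n
    sparsity = int(np.ceil(n * phi * np.log(n)))  # (iv): s of order n*phi_n*log(n)
    return phi, depth, width, sparsity

print(architecture_orders(n=10_000, beta=[2.0, 1.5], t=[3, 1]))
```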
3.1 Time series model examples
In this section, we introduce some (well-known) examples of time series models that satisfy the assumptions of
Theorem 1. The first example is to let {ε_t}_{t∈Z} and {X_t}_{t∈Z} be two independent processes. This independence
assumption implies that the input variables X_t are exogenous, so Assumption 3 is automatically satisfied. Further,
assume that the ε_t satisfy the moment conditions in Assumption 1 (for example, they have a normal distribution) and
that X_t is a stationary and geometrically α-mixing process. There are many examples of such processes, including
certain finite-order auto-regressive processes; see e.g. [23, 21]. The second example is to consider non-linear
auto-regressive models, i.e. assume

    X_t = g(X_{t−1}, ···, X_{t−d}) + ε_t.    (8)

This is a special case of model (2), obtained by taking the input vector to be (X_{t−d}, ···, X_{t−1}) and the response
to be Y_t = X_t. Assuming that the ε_t are i.i.d. random variables with a positive density on the real line and that
the function g is bounded, it can be shown that there exists a stationary solution to equation (8) and that this
solution is exponentially α-mixing as well [24, 21]. In both examples, the assumptions of Theorem 1 are satisfied, so
the results of this theorem are applicable.
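As an illustration of the second example (the specific bounded function g below is our own choice, not one from the
paper), the sketch simulates a non-linear AR(d) process of the form (8) and builds the lagged input/response pairs that
would be fed to the network.

```python
# Hypothetical non-linear AR(3) simulation and lag embedding for model (8).
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 3

def g(lags):
    # bounded non-linear function of the last d values (lags[-1] = X_{t-1})
    return np.tanh(0.6 * lags[-1] - 0.3 * lags[-2] + 0.2 * lags[-3])

x = np.zeros(n)
for t in range(d, n):
    x[t] = g(x[t - d:t]) + 0.2 * rng.normal()

# design matrix: row for time t is (X_{t-d}, ..., X_{t-1}); response is X_t
X = np.stack([x[t - d:t] for t in range(d, n)])
Y = x[d:]
```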
Now, we consider a more general time series model. Recall that, by the well-known Wold representation, every purely
nondeterministic, stationary, zero-mean stochastic process X_t can be expressed as X_t = ∑_{i=0}^{∞} a_i ε_{t−i}, where
ε_t is a mean-zero white noise. Further, if X_t has a non-vanishing spectral density and absolutely summable
auto-regressive coefficients, i.e. ∑_{i=1}^{∞} |φ_i| < ∞, it has the AR(∞) representation
X_t = ∑_{i=1}^{∞} φ_i X_{t−i} + ε_t (see e.g. [25]). Motivated by this discussion, we consider a general family of
time series models satisfying

    X_t = ∑_{i=1}^{∞} φ_i X_{t−i} + ε_t,    (9)

where the ε_t are i.i.d. errors. Independence among the ε_t is a strong assumption compared to only assuming that they
are uncorrelated, but it is required for our theoretical analysis. An interesting feature of model (9) is that it is a
linear model. However, since there are infinitely many covariates in this AR(∞) representation, training neural
networks on them directly is impossible. The common solution is to truncate the covariates and only consider the first
few, i.e. to approximate model (9) by an AR(d) model for some d. This approximation can successfully estimate
second-order structures of the original model (i.e. the spectral density or auto-correlation function) if d is selected
carefully and under certain assumptions on the AR coefficients φ_i (see e.g. [25]). Therefore, we follow this path and
fit a neural network to the d-dimensional input variables (X_{t−1}, . . . , X_{t−d}) with a proper choice of d, while
keeping in mind that the true regression function is in fact f_0(X_t) = ∑_{i=1}^{∞} φ_i X_{t−i}. To establish the
prediction error consistency of neural networks on the truncated input variables, we need two additional assumptions.
Assumption 4. There exist α > 0 and M > 0 such that ∑_{i=1}^{∞} (1 + i)^α |φ_i| ≤ M < ∞.
Assumption 5. For some constant K > 0, |X_t| ≤ K for all t ≥ 0.
Assumption 4 controls the decay rate of the AR(∞) coefficients in the true model, and α can be interpreted as the decay
rate of the φ_i. This assumption is needed to compensate for approximating a general time series of the form (9) with a
finite-lag AR process. Further, it plays an important role in restricting the first derivative of f_0, which
corresponds to the β_i-smoothness assumption on the g_{ij} in (4). Note that Assumption 4 is satisfied if the spectral
density function is strictly positive and continuous and the auto-covariance function of X_t satisfies a suitable
boundedness property [26]. Moreover, since the model is linear (i.e. the regression function is unbounded), Assumption
5 becomes necessary to make f_0(X_t) bounded, a property needed for Theorem 1 as well.
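The truncation step described above can be sketched as follows (our illustration; treating the decay exponent α as
known is an assumption made only for this example): choose the lag order d of order n^{1/(α+1)} and regress X_t on the
truncated input vector (X_{t−1}, . . . , X_{t−d}).

```python
# Hypothetical lag truncation for the AR(infinity) model (9).
import numpy as np

def truncated_design(x, alpha):
    """Build inputs (X_{t-1}, ..., X_{t-d}) and responses X_t with d ~ n^{1/(alpha+1)}."""
    n = len(x)
    d = max(1, int(np.ceil(n ** (1.0 / (alpha + 1.0)))))
    X = np.stack([x[t - d:t][::-1] for t in range(d, n)])   # (X_{t-1}, ..., X_{t-d})
    Y = x[d:]
    return X, Y, d

# usage on a toy series (white noise here, purely for illustration)
X, Y, d = truncated_design(np.random.default_rng(2).standard_normal(1000), alpha=1.0)
```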
Theorem 2. Consider model (9) with f_0(X_t) = ∑_{i=1}^{∞} φ_i X_{t−i}. Let f̂_n be an estimator taking values in the
network class F(L, (p_i)_{i=0,···,L+1}, s, F) satisfying (i) F ≥ KM; (ii) L ≥ 4; (iii) s ≍ Ld; and (iv)
d ≲ min_{i=1,···,L} p_i. Assume that d ≍ n^{1/(α+1)}. Under Assumptions 1-5, there exist positive constants C, C′ such
that if Δ_n(f̂_n, f_0) ≤ C n^{−α/(α+1)} L log^5 n, then

    R(f̂_n, f_0) ≤ C′ n^{−α/(α+1)} L log^5 n,    (10)

and if Δ_n(f̂_n, f_0) > C n^{−α/(α+1)} L log^5 n, then

    (1/C′) Δ_n(f̂_n, f_0) ≤ R(f̂_n, f_0) ≤ C′ Δ_n(f̂_n, f_0).    (11)