Statistical learning for ψ-weakly dependent processes
October 4, 2022
Mamadou Lamine DIOP 1 and William KENGNE 2
THEMA, CY Cergy Paris Université, 33 Boulevard du Port, 95011 Cergy-Pontoise Cedex, France
E-mail: mamadou-lamine.diop@cyu.fr ; william.kengne@cyu.fr
Abstract: We consider the statistical learning question for ψ-weakly dependent processes, a notion that unifies a large class of weak dependence conditions such as mixing, association, ··· The consistency of the empirical risk minimization algorithm is established. We derive generalization bounds and provide the learning rate, which, on some Hölder class of hypotheses, is close to the usual O(n^{-1/2}) obtained in the i.i.d. case. Application to time series prediction is carried out with an example of causal models with exogenous covariates.
Keywords: Statistical learning, ψ-weak dependence, ERM principle, generalization bound, consistency.
1 Introduction
Statistical learning has received considerable attention in the literature in recent decades. This interest is still increasing nowadays, mainly due to the significant successes of the applications of machine learning algorithms and the theoretical properties of these algorithms, which are now well studied in many cases. For instance, see [31], [9], [7], [27] for some results when the training samples are independent and identically distributed (i.i.d.). However, the i.i.d. assumption fails in many real-life applications: market prediction, GDP (gross domestic product) prediction, signal processing, meteorological observations, ··· There is a vast literature on learning with dependent observations; we refer to [29], [36], [28], [32], [23] and the references therein for an overview of this issue.
We consider the supervised learning setting and let D_n = {Z_1 = (X_1, Y_1), ···, Z_n = (X_n, Y_n)} (the training sample) be a trajectory of a stationary and ergodic process {Z_t = (X_t, Y_t), t ∈ Z}, which takes values in Z = X × Y, where X is the input space and Y the output space. Denote by H = {h : X → Y} the set of hypotheses (a family of predictors) and consider a loss function ℓ : Y × Y → [0, ∞). In this context of learning from dependent data, the generalization error (risk) can be defined in various ways; see for instance [23]. We deal with the widely used averaged risk (see for instance [36], [28], [1], [17], [6]), given for any hypothesis h ∈ H by
\[ R(h) = \mathbb{E}\big[\ell(h(X_0), Y_0)\big]. \]
1 Supported by the MME-DII center of excellence (ANR-11-LABEX-0023-01)
2 Developed within the ANR BREAKRISK: ANR-17-CE26-0001-01 and the CY Initiative of Excellence (grant "Investissements d'Avenir" ANR-16-IDEX-0008), Project "EcoDep" PSI-AAP2020-0000000013
The goal is to construct a learner h ∈ H such that, for any t ∈ Z, h(X_t) is on average "close" to Y_t; that is, a learner which achieves a small averaged risk. The empirical risk (with respect to the training sample) of a hypothesis h is given by
\[ \widehat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(X_i), Y_i\big). \]
In the sequel, we set ℓ(h, z) = ℓ(h(x), y) for all z = (x, y) ∈ X × Y and h : X → Y. The setting considered here covers many commonly used situations: regression estimation, classification (pattern recognition when Y is finite), autoregressive models prediction (we can take X_t = (Y_{t-1}, ···, Y_{t-k}) and X = Y^k for some k ∈ N), autoregressive models with exogenous covariates.
Consider a target (with respect to H) function h_H (assumed to exist), given by
\[ h_{\mathcal{H}} = \mathrm{argmin}_{h \in \mathcal{H}} \, R(h); \]
and the empirical target
\[ \widehat{h}_n = \mathrm{argmin}_{h \in \mathcal{H}} \, \widehat{R}_n(h). \tag{1.1} \]
We focus on the empirical risk minimization (ERM) principle and aim to study the relevance of the estimation of h_H by ĥ_n. The capacity of ĥ_n to approximate h_H is known as the generalization capability of the ERM algorithm. This generalization capability is assessed by studying how close R(ĥ_n) is to R(h_H). The deviation between R(ĥ_n) and R(h_H) is the generalization error of the algorithm. When R(ĥ_n) − R(h_H) = o_P(1), the ERM algorithm is said to be consistent within the hypothesis class H.
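To fix ideas, here is a minimal Python sketch (ours, not from the paper) of the ERM principle in (1.1) over a finite family of candidate predictors; the AR(1) data-generating process, the linear hypotheses h_a(x) = a·x and the absolute loss are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dependent data: a stationary AR(1) input and a noisy response.
n = 500
X = np.zeros(n)
for t in range(1, n):
    X[t] = 0.5 * X[t - 1] + rng.normal(scale=0.5)
Y = 2.0 * X + rng.normal(scale=0.1, size=n)

def loss(y_pred, y):
    """Absolute loss (a Lipschitz loss)."""
    return np.abs(y_pred - y)

def empirical_risk(h, X, Y):
    """R_hat_n(h) = (1/n) * sum_i loss(h(X_i), Y_i)."""
    return np.mean(loss(h(X), Y))

# A finite family of Lipschitz predictors h_a(x) = a * x (stand-in for H).
slopes = np.linspace(-3.0, 3.0, 61)
candidates = [lambda x, a=a: a * x for a in slopes]

# ERM: return the hypothesis with the smallest empirical risk, as in (1.1).
risks = [empirical_risk(h, X, Y) for h in candidates]
best = int(np.argmin(risks))
print("ERM slope:", slopes[best], "empirical risk:", risks[best])
```

In this toy example the empirical minimizer concentrates around the slope a = 2 used to generate the data; the results of Section 3 quantify how fast R(ĥ_n) approaches R(h_H) under ψ-weak dependence.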
The study of a learning algorithm includes the calibration of a bound on the generalization error for any fixed n (non-asymptotic property) and the investigation of consistency (asymptotic property). There exist several important contributions in the literature devoted to statistical learning for dependent observations, with various types of dependence structure. See, among other papers, [25], [26], [36], [28], [35], [18], [23] for some developments under mixing conditions and [39], [37], [38], [33] for some results for Markov chains. [1] considered prediction of time series under a θ-weak dependence condition. They established convergence rates using the PAC-Bayesian approach. See also [6], [22], [24] for some Bernstein-type inequalities for τ-mixing processes and some advances on time series forecasting using the statistical learning paradigm. However, most of the above works are developed under a mixing condition, focus on time series prediction, or do not consider a general setting that includes pattern recognition, regression estimation, time series prediction, ···
In this new contribution, we consider a general learning framework where the observations D_n = {Z_1 = (X_1, Y_1), ···, Z_n = (X_n, Y_n)} form a trajectory of a ψ-weakly dependent process {Z_t = (X_t, Y_t), t ∈ Z} with values in a Banach space Z = X × Y. The following issues are addressed.
(i) Consistency of the ERM algorithm. We establish the consistency of the ERM algorithm within any space H of Lipschitz predictors. In comparison with the existing works, let us stress that the ψ-weak dependence structure considered here is a more general concept, and it is well known that many weakly dependent processes do not fulfill the mixing conditions, see for instance [10].
(ii) Generalization bounds and convergence rates. When X ⊆ R^d (with d ∈ N), Y ⊆ R and H is a subset of a Hölder space C^s for some s > 0, generalization bounds are derived and the learning rate is provided. This rate is close to the usual O(n^{-1/2}) when s ≫ d.
(iii) Application to time series prediction. Application to the prediction of affine causal models with exogenous covariates is carried out. We show that these models fulfill the conditions that ensure the consistency of the ERM algorithm and enjoy the generalization bounds established.
The rest of the paper is structured as follows. In Section 2, we set some notations and assumptions. Section
3 provides the main results on consistency, generalization bounds and convergence rates. Application to the
prediction of affine causal models with exogenous covariates is carried out in Section 4, whereas Section 5 is
devoted to the proofs of the main results.
2 Notations and assumptions
In the sequel, we assume that X and Y are subsets of separable Banach spaces equipped with norms ‖·‖_X and ‖·‖_Y respectively, and consider the covering number as the complexity measure of H. The complexity of the hypothesis set H plays a key role in such a study. Recall that, for any ε > 0, the ε-covering number N(H, ε) of H with respect to the ‖·‖_∞ norm is the minimal number of balls of radius ε needed to cover H; that is,
\[ \mathcal{N}(\mathcal{H}, \epsilon) = \inf\Big\{ m \geq 1 \,:\, \exists \, h_1, \cdots, h_m \in \mathcal{H} \ \text{such that} \ \mathcal{H} \subset \bigcup_{i=1}^{m} B(h_i, \epsilon) \Big\}, \]
where B(h_i, ε) = {h : X → Y ; ‖h − h_i‖_∞ = sup_{x∈X} ‖h(x) − h_i(x)‖_Y ≤ ε}. For two Banach spaces E_1 and E_2 equipped with norms ‖·‖_{E_i}, i = 1, 2, and any function h : E_1 → E_2, set
\[ \|h\|_{\infty} = \sup_{x \in E_1} \|h(x)\|_{E_2}, \qquad \mathrm{Lip}_{\alpha}(h) := \sup_{x_1, x_2 \in E_1, \, x_1 \neq x_2} \frac{\|h(x_1) - h(x_2)\|_{E_2}}{\|x_1 - x_2\|_{E_1}^{\alpha}} \quad \text{for any } \alpha \in [0, 1], \]
and for any K > 0, Λ_{α,K}(E_1, E_2) (simply Λ_{α,K}(E_1) when E_2 = R) denotes the set of functions h : E_1^u → E_2 for some u ∈ N, such that ‖h‖_∞ < ∞ and Lip_α(h) ≤ K. When α = 1, we set Lip_1(h) = Lip(h). In the whole sequel, Λ_1(E_1) denotes Λ_{1,1}(E_1, R) and, if E_1, ···, E_k are separable Banach spaces equipped with norms ‖·‖_{E_1}, ···, ‖·‖_{E_k} respectively, then we set ‖x‖ = ‖x_1‖_{E_1} + ··· + ‖x_k‖_{E_k} for any x = (x_1, ···, x_k) ∈ E_1 × ··· × E_k.
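As a purely computational illustration of the covering number (not part of the paper), the Python sketch below upper-bounds N(H, ε) for a finite family of hypotheses represented by their values on a common grid of inputs, using the sup-norm distance and a greedy cover; the grid, the family of linear functions and the tolerance are assumptions of this example.

```python
import numpy as np

def greedy_covering_number(values, eps):
    """Greedy upper bound on the eps-covering number of a finite family of
    hypotheses, each given as a row of `values` (its outputs on a fixed grid),
    with the distance ||h - h'||_inf approximated by the max over the grid."""
    remaining = list(range(len(values)))
    n_balls = 0
    while remaining:
        center = values[remaining[0]]
        n_balls += 1
        # discard every hypothesis within eps of the chosen center
        remaining = [j for j in remaining
                     if np.max(np.abs(values[j] - center)) > eps]
    return n_balls

# Example: h_a(x) = a * x on [0, 1] for slopes a in a fine grid of [-1, 1].
grid = np.linspace(0.0, 1.0, 200)
slopes = np.linspace(-1.0, 1.0, 401)
values = np.array([a * grid for a in slopes])
print(greedy_covering_number(values, eps=0.1))   # about 20 balls here
```

The greedy construction only gives an upper bound on N(H, ε); an upper bound is typically sufficient when a complexity measure of this type is plugged into generalization bounds.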
We set the following assumptions for the sequel.
(A1): There exists K_H > 0 such that H is a subset of Λ_{1,K_H}(X, Y) and sup_{h∈H} ‖h‖_∞ < ∞.
(A2): There exists K_ℓ > 0 such that the loss function ℓ belongs to Λ_{1,K_ℓ}(Y²) and M = sup_{h∈H} sup_{z∈Z} |ℓ(h, z)| < ∞.
Under (A1) and (A2), we have
\[ L := \sup_{h_1, h_2 \in \mathcal{H}, \, h_1 \neq h_2} \ \sup_{z \in \mathcal{Z}} \ \frac{|\ell(h_1, z) - \ell(h_2, z)|}{\|h_1 - h_2\|_{\infty}} < \infty. \tag{2.1} \]
Under the pre-compactness condition in (A1), for any ε > 0, the ε-covering number N(H, ε) is finite. If Y is bounded, then the loss functions
\[ \ell(y, y') = \|y - y'\|_{\mathcal{Y}}, \qquad \ell(y, y') = \|y - y'\|_{\mathcal{Y}}^{2}, \tag{2.2} \]
are examples that fulfill the conditions in assumption (A2), with K_ℓ = 1 and K_ℓ = 2 sup_{y∈Y} ‖y‖_Y respectively. In both cases, one can easily see that sup_{h∈H} sup_{z∈Z} |ℓ(h, z)| < ∞.
Let us now define the notion of ψ-weak dependence; see [13] and [10]. Let E be a separable Banach space.
Definition 2.1 An E-valued process (Z_t)_{t∈Z} is said to be (Λ_1(E), ψ, ε)-weakly dependent if there exists a function ψ : [0,∞)² × N² → [0,∞) and a sequence ε = (ε(r))_{r∈N} decreasing to zero at infinity such that, for any g_1, g_2 ∈ Λ_1(E) with g_1 : E^u → R, g_2 : E^v → R (u, v ∈ N) and for any u-tuple (s_1, ···, s_u) and any v-tuple (t_1, ···, t_v) with s_1 ≤ ··· ≤ s_u ≤ s_u + r ≤ t_1 ≤ ··· ≤ t_v, the following inequality is fulfilled:
\[ \big|\mathrm{Cov}\big(g_1(Z_{s_1}, \cdots, Z_{s_u}), \, g_2(Z_{t_1}, \cdots, Z_{t_v})\big)\big| \leq \psi\big(\mathrm{Lip}(g_1), \mathrm{Lip}(g_2), u, v\big) \, \epsilon(r). \]
For example, we have the following choices of ψ (see also [10]).
• ψ(Lip(g_1), Lip(g_2), u, v) = v Lip(g_2): the θ-weak dependence, then denote ε(r) = θ(r);
• ψ(Lip(g_1), Lip(g_2), u, v) = u Lip(g_1) + v Lip(g_2): the η-weak dependence, then denote ε(r) = η(r);
• ψ(Lip(g_1), Lip(g_2), u, v) = u v Lip(g_1) · Lip(g_2): the κ-weak dependence, then denote ε(r) = κ(r);
• ψ(Lip(g_1), Lip(g_2), u, v) = u Lip(g_1) + v Lip(g_2) + u v Lip(g_1) · Lip(g_2): the λ-weak dependence, then denote ε(r) = λ(r).
One can easily see that η(r) ≤ θ(r) for all r ≥ 0. In the sequel, for each of the four choices of ψ above, we set respectively Ψ(u, v) = 2v, Ψ(u, v) = u + v, Ψ(u, v) = uv and Ψ(u, v) = (u + v + uv)/2.
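For concreteness, the four variants and the associated Ψ can be written as small Python callables; this is merely a transcription of the definitions above, given as a reference table (the dictionary keys are our own labels).

```python
# psi(lip1, lip2, u, v) for the theta-, eta-, kappa- and lambda-weak dependence,
# together with the corresponding Psi(u, v) used in the sequel.
PSI = {
    "theta":  (lambda lip1, lip2, u, v: v * lip2,
               lambda u, v: 2 * v),
    "eta":    (lambda lip1, lip2, u, v: u * lip1 + v * lip2,
               lambda u, v: u + v),
    "kappa":  (lambda lip1, lip2, u, v: u * v * lip1 * lip2,
               lambda u, v: u * v),
    "lambda": (lambda lip1, lip2, u, v: u * lip1 + v * lip2 + u * v * lip1 * lip2,
               lambda u, v: (u + v + u * v) / 2),
}

# Example: the covariance bound of Definition 2.1 in the eta case reads
# |Cov(g1(...), g2(...))| <= PSI["eta"][0](Lip(g1), Lip(g2), u, v) * eps(r).
```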
Now, we set the weak-dependence assumption.
(A3): Let ψ : [0,∞)² × N² → [0,∞) be one of the four choices above. The process {Z_t = (X_t, Y_t), t ∈ Z} is stationary ergodic and (Λ_1(Z), ψ, ε)-weakly dependent such that there exist L_1, L_2, µ ≥ 0 satisfying
\[ \sum_{j \geq 0} (j+1)^{k} \epsilon(j) \leq L_1 L_2^{k} (k!)^{\mu} \quad \text{for all } k \geq 0. \tag{2.3} \]
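As an illustration of when (2.3) can hold (this example is ours, not taken from the paper), geometrically decaying coefficients ε(j) = a^j with 0 < a < 1 satisfy (2.3) with µ = 1 and, for instance, L_1 = 1/(a(1−a)) and L_2 = 1/(1−a), since Σ_{j≥0} (j+1)^k a^j ≤ k!/(1−a)^{k+1}. The short Python check below compares both sides numerically for a range of k.

```python
import math

a = 0.7                          # illustrative geometric decay eps(j) = a**j
L1 = 1.0 / (a * (1.0 - a))       # candidate constants for (2.3), with mu = 1
L2 = 1.0 / (1.0 - a)
mu = 1.0

for k in range(15):
    # left-hand side of (2.3), truncated: the remaining terms are negligible
    lhs = sum((j + 1) ** k * a ** j for j in range(500))
    rhs = L1 * L2 ** k * math.factorial(k) ** mu
    assert lhs <= rhs, (k, lhs, rhs)
print("condition (2.3) holds numerically for k = 0, ..., 14")
```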
3 Main results
3.1 Generalization bound and consistency
The following proposition provides a deviation inequality between the empirical risk and the risk, for any fixed predictor.

Proposition 3.1 Assume that the conditions (A1)-(A3) hold. Let h ∈ H. For all ε > 0, n ∈ N, we have
\[ \mathbb{P}\Big\{ R(h) - \widehat{R}_n(h) > \varepsilon \Big\} \leq \exp\left( - \frac{n^{2}\varepsilon^{2}/2}{A_n + B_n^{1/(\mu+2)} \, (n\varepsilon)^{(2\mu+3)/(\mu+2)}} \right), \]
for any real numbers A_n and B_n satisfying:
\[ A_n \geq \mathbb{E}\Big[ \Big( \sum_{i=1}^{n} \big( \ell(h(X_i), Y_i) - \mathbb{E}[\ell(h(X_0), Y_0)] \big) \Big)^{2} \Big] \quad \text{and} \quad B_n = 2 M L_2 \max\Big( \frac{2^{4+\mu} \, n M^{2} L_1}{A_n}, \, 1 \Big). \]
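To make the shape of this bound concrete, the small Python helper below (ours; the numerical values of M, L_1, L_2, µ and A_n are illustrative assumptions, with A_n taken proportional to n as one would expect for a short-memory process) evaluates the right-hand side of Proposition 3.1 for given n and ε.

```python
import math

def deviation_bound(n, eps, A_n, M, L1, L2, mu):
    """Right-hand side of Proposition 3.1:
    exp(-(n^2 eps^2 / 2) / (A_n + B_n^(1/(mu+2)) * (n*eps)^((2mu+3)/(mu+2)))),
    with B_n = 2 * M * L2 * max(2^(4+mu) * n * M^2 * L1 / A_n, 1)."""
    B_n = 2.0 * M * L2 * max(2.0 ** (4 + mu) * n * M ** 2 * L1 / A_n, 1.0)
    denom = A_n + B_n ** (1.0 / (mu + 2.0)) * (n * eps) ** ((2.0 * mu + 3.0) / (mu + 2.0))
    return math.exp(-(n ** 2 * eps ** 2 / 2.0) / denom)

# Illustrative values only: a bounded loss (M = 1) and the constants of (2.3).
for n in (10**3, 10**4, 10**5):
    print(n, deviation_bound(n, eps=0.1, A_n=5.0 * n, M=1.0, L1=2.0, L2=2.0, mu=1.0))
```

With A_n proportional to n, the exponent grows like (nε)^{1/(µ+2)} for large n, so for this illustrative parameterization the bound decays sub-exponentially in the sample size.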
The next theorem provides a uniform concentration inequality between the risk and its empirical version.