Statistical learning for ψ-weakly dependent processes
October 4, 2022
Mamadou Lamine DIOP 1 and William KENGNE 2
THEMA, CY Cergy Paris Université, 33 Boulevard du Port, 95011 Cergy-Pontoise Cedex, France
E-mail: mamadou-lamine.diop@cyu.fr ; william.kengne@cyu.fr
Abstract: We consider the statistical learning question for ψ-weakly dependent processes, a notion that unifies a large class of weak dependence conditions such as mixing, association, ··· The consistency of the empirical risk minimization algorithm is established. We derive generalization bounds and provide the learning rate, which, on some Hölder class of hypotheses, is close to the usual O(n^{-1/2}) obtained in the i.i.d. case. Application to time series prediction is carried out with an example of causal models with exogenous covariates.
Keywords: Statistical learning, ψ-weak dependence, ERM principle, generalization bound, consistency.
1 Introduction
Statistical learning has received considerable attention in the literature in recent decades. This interest is still increasing nowadays, mainly due to the significant successes of the applications of machine learning algorithms and the theoretical properties of these algorithms, which are now well studied in many cases. For instance, see [31], [9], [7], [27] for some results when the training samples are independent and identically distributed (i.i.d.). However, the i.i.d. assumption fails in many real-life applications: market prediction, GDP (gross domestic product) prediction, signal processing, meteorological observations, ··· There is a vast literature on learning with dependent observations; we refer to [29], [36], [28], [32], [23] and the references therein for an overview of this issue.
We consider the supervised learning setting and let D_n = {Z_1 = (X_1, Y_1), ···, Z_n = (X_n, Y_n)} (the training sample) be a trajectory of a stationary and ergodic process {Z_t = (X_t, Y_t), t ∈ Z}, which takes values in Z = X × Y, where X is the input space and Y the output space. Denote by H = {h : X → Y} the set of hypotheses (a family of predictors) and consider a loss function ℓ : Y × Y → [0, ∞). In this context of learning from dependent data, the generalization error (risk) can be defined in various ways; see for instance [23]. We deal with the widely used averaged risk (see for instance [36], [28], [1], [17], [6]), given for any hypothesis h ∈ H by
\[ R(h) = \mathbb{E}\big[\ell(h(X_0), Y_0)\big]. \]
1 Supported by the MME-DII center of excellence (ANR-11-LABEX-0023-01)
2 Developed within the ANR BREAKRISK: ANR-17-CE26-0001-01 and the CY Initiative of Excellence (grant "Investissements d'Avenir" ANR-16-IDEX-0008), Project "EcoDep" PSI-AAP2020-0000000013
The goal is to construct a learner h ∈ H such that, for any t ∈ Z, h(X_t) is on average "close" to Y_t; that is, a learner which achieves a small averaged risk. The empirical risk (with respect to the training sample) of a hypothesis h is given by
\[ \widehat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(X_i), Y_i\big). \]
In the sequel, we set ℓ(h, z) = ℓ(h(x), y) for all z = (x, y) ∈ X × Y and h : X → Y. The setting considered here covers many commonly used situations: regression estimation, classification (pattern recognition when Y is finite), autoregressive models prediction (we can take X_t = (Y_{t-1}, ···, Y_{t-k}) and X = Y^k for some k ∈ N), autoregressive models with exogenous covariates.
Consider a target (with respect to H) function h_H (assumed to exist), given by
\[ h_{\mathcal{H}} = \mathrm{argmin}_{h \in \mathcal{H}} \, R(h); \]
and the empirical target
\[ \widehat{h}_n = \mathrm{argmin}_{h \in \mathcal{H}} \, \widehat{R}_n(h). \tag{1.1} \]
We focus on the empirical risk minimization (ERM) principle and aim to study the relevance of the estimation of h_H by ĥ_n. The capacity of ĥ_n to approximate h_H is known as the generalization capability of the ERM algorithm. This generalization capability is assessed by studying how close R(ĥ_n) is to R(h_H). The deviation between R(ĥ_n) and R(h_H) is the generalization error of the algorithm. When R(ĥ_n) − R(h_H) = o_P(1), the ERM algorithm is said to be consistent within the hypothesis class H.
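To fix ideas, here is a minimal Python sketch (ours, not from the paper) of the ERM principle in (1.1) over a finite family of candidate predictors; the AR(1) data-generating process, the linear hypotheses h_a(x) = a·x and the absolute loss are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dependent data: a stationary AR(1) input and a noisy response.
n = 500
X = np.zeros(n)
for t in range(1, n):
    X[t] = 0.5 * X[t - 1] + rng.normal(scale=0.5)
Y = 2.0 * X + rng.normal(scale=0.1, size=n)

def loss(y_pred, y):
    """Absolute loss (a Lipschitz loss)."""
    return np.abs(y_pred - y)

def empirical_risk(h, X, Y):
    """R_hat_n(h) = (1/n) * sum_i loss(h(X_i), Y_i)."""
    return np.mean(loss(h(X), Y))

# A finite family of Lipschitz predictors h_a(x) = a * x (stand-in for H).
slopes = np.linspace(-3.0, 3.0, 61)
candidates = [lambda x, a=a: a * x for a in slopes]

# ERM: return the hypothesis with the smallest empirical risk, as in (1.1).
risks = [empirical_risk(h, X, Y) for h in candidates]
best = int(np.argmin(risks))
print("ERM slope:", slopes[best], "empirical risk:", risks[best])
```

In this toy example the empirical minimizer concentrates around the slope a = 2 used to generate the data; the results of Section 3 quantify how fast R(ĥ_n) approaches R(h_H) under ψ-weak dependence.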
The study of a learning algorithm includes the calibration of a bound on the generalization error for any fixed n (non-asymptotic property) and the investigation of consistency (asymptotic property). There exist several important contributions in the literature devoted to statistical learning for dependent observations, with various types of dependence structure. See, among other papers, [25], [26], [36], [28], [35], [18], [23] for some developments under mixing conditions and [39], [37], [38], [33] for some results for Markov chains. [1] considered prediction of time series under a θ-weak dependence condition. They established convergence rates using the PAC-Bayesian approach. See also [6], [22], [24] for some Bernstein-type inequalities for τ-mixing processes and some advances on time series forecasting using the statistical learning paradigm. However, most of the above works are developed under a mixing condition, focus on time series prediction, or do not consider a general setting that includes pattern recognition, regression estimation, time series prediction, ···
In this new contribution, we consider a general learning framework where the observations D_n = {Z_1 = (X_1, Y_1), ···, Z_n = (X_n, Y_n)} form a trajectory of a ψ-weakly dependent process {Z_t = (X_t, Y_t), t ∈ Z} with values in a Banach space Z = X × Y. The following issues are addressed.
(i) Consistency of the ERM algorithm. We establish the consistency of the ERM algorithm within any space H of Lipschitz predictors. In comparison with the existing works, let us stress that the ψ-weak dependence structure considered here is a more general concept, and it is well known that many weakly dependent processes do not fulfill the mixing conditions, see for instance [10].
(ii) Generalization bounds and convergence rates. When X ⊆ R^d (with d ∈ N), Y ⊆ R and H is a subset of a Hölder space C^s for some s > 0, generalization bounds are derived and the learning rate is provided. This rate is close to the usual O(n^{-1/2}) when s ≫ d.
(iii) Application to time series prediction. Application to the prediction of affine causal models with exogenous covariates is carried out. We show that these models fulfill the conditions that ensure the consistency of the ERM algorithm and enjoy the generalization bounds established.
The rest of the paper is structured as follows. In Section 2, we set some notations and assumptions. Section
3 provides the main results on consistency, generalization bounds and convergence rates. Application to the
prediction of affine causal models with exogenous covariates is carried out in Section 4, whereas Section 5 is
devoted to the proofs of the main results.
2 Notations and assumptions
In the sequel, we assume that X and Y are subsets of separable Banach spaces equipped with norms ‖·‖_X and ‖·‖_Y respectively, and consider the covering number as the complexity measure of H. The complexity of the hypothesis set H plays a key role in such a study. Recall that, for any ε > 0, the ε-covering number N(H, ε) of H with respect to the ‖·‖_∞ norm is the minimal number of balls of radius ε needed to cover H; that is,
\[ \mathcal{N}(\mathcal{H}, \epsilon) = \inf\Big\{ m \geq 1 \,:\, \exists \, h_1, \cdots, h_m \in \mathcal{H} \ \text{such that} \ \mathcal{H} \subset \bigcup_{i=1}^{m} B(h_i, \epsilon) \Big\}, \]
where B(h_i, ε) = {h : X → Y ; ‖h − h_i‖_∞ = sup_{x∈X} ‖h(x) − h_i(x)‖_Y ≤ ε}. For two Banach spaces E_1 and E_2 equipped with norms ‖·‖_{E_i}, i = 1, 2, and any function h : E_1 → E_2, set
\[ \|h\|_{\infty} = \sup_{x \in E_1} \|h(x)\|_{E_2}, \qquad \mathrm{Lip}_{\alpha}(h) := \sup_{x_1, x_2 \in E_1, \, x_1 \neq x_2} \frac{\|h(x_1) - h(x_2)\|_{E_2}}{\|x_1 - x_2\|_{E_1}^{\alpha}} \quad \text{for any } \alpha \in [0, 1], \]
and for any K > 0, Λ_{α,K}(E_1, E_2) (simply Λ_{α,K}(E_1) when E_2 = R) denotes the set of functions h : E_1^u → E_2 for some u ∈ N, such that ‖h‖_∞ < ∞ and Lip_α(h) ≤ K. When α = 1, we set Lip_1(h) = Lip(h). In the whole sequel, Λ_1(E_1) denotes Λ_{1,1}(E_1, R) and, if E_1, ···, E_k are separable Banach spaces equipped with norms ‖·‖_{E_1}, ···, ‖·‖_{E_k} respectively, then we set ‖x‖ = ‖x_1‖_{E_1} + ··· + ‖x_k‖_{E_k} for any x = (x_1, ···, x_k) ∈ E_1 × ··· × E_k.
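As a purely computational illustration of the covering number (not part of the paper), the Python sketch below upper-bounds N(H, ε) for a finite family of hypotheses represented by their values on a common grid of inputs, using the sup-norm distance and a greedy cover; the grid, the family of linear functions and the tolerance are assumptions of this example.

```python
import numpy as np

def greedy_covering_number(values, eps):
    """Greedy upper bound on the eps-covering number of a finite family of
    hypotheses, each given as a row of `values` (its outputs on a fixed grid),
    with the distance ||h - h'||_inf approximated by the max over the grid."""
    remaining = list(range(len(values)))
    n_balls = 0
    while remaining:
        center = values[remaining[0]]
        n_balls += 1
        # discard every hypothesis within eps of the chosen center
        remaining = [j for j in remaining
                     if np.max(np.abs(values[j] - center)) > eps]
    return n_balls

# Example: h_a(x) = a * x on [0, 1] for slopes a in a fine grid of [-1, 1].
grid = np.linspace(0.0, 1.0, 200)
slopes = np.linspace(-1.0, 1.0, 401)
values = np.array([a * grid for a in slopes])
print(greedy_covering_number(values, eps=0.1))   # about 20 balls here
```

The greedy construction only gives an upper bound on N(H, ε); an upper bound is typically sufficient when a complexity measure of this type is plugged into generalization bounds.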
We set the following assumptions for the sequel.
(A1): There exists K_H > 0 such that H is a subset of Λ_{1,K_H}(X, Y) and sup_{h∈H} ‖h‖_∞ < ∞.
(A2): There exists K_ℓ > 0 such that the loss function ℓ belongs to Λ_{1,K_ℓ}(Y²) and M = sup_{h∈H} sup_{z∈Z} |ℓ(h, z)| < ∞.
Under (A1) and (A2), we have
\[ L := \sup_{h_1, h_2 \in \mathcal{H}, \, h_1 \neq h_2} \ \sup_{z \in \mathcal{Z}} \ \frac{|\ell(h_1, z) - \ell(h_2, z)|}{\|h_1 - h_2\|_{\infty}} < \infty. \tag{2.1} \]
Under the pre-compactness condition in (A1), for any ε > 0, the ε-covering number N(H, ε) is finite. If Y is bounded, then the loss functions
\[ \ell(y, y') = \|y - y'\|_{\mathcal{Y}}, \qquad \ell(y, y') = \|y - y'\|_{\mathcal{Y}}^{2}, \tag{2.2} \]
are examples that fulfill the conditions in assumption (A2), with K_ℓ = 1 and K_ℓ = 2 sup_{y∈Y} ‖y‖_Y respectively. In both cases, one can easily see that sup_{h∈H} sup_{z∈Z} |ℓ(h, z)| < ∞.
Let us now define the notion of ψ-weak dependence; see [13] and [10]. Let E be a separable Banach space.
Definition 2.1 An E-valued process (Z_t)_{t∈Z} is said to be (Λ_1(E), ψ, ε)-weakly dependent if there exists a function ψ : [0,∞)² × N² → [0,∞) and a sequence ε = (ε(r))_{r∈N} decreasing to zero at infinity such that, for any g_1, g_2 ∈ Λ_1(E) with g_1 : E^u → R, g_2 : E^v → R (u, v ∈ N) and for any u-tuple (s_1, ···, s_u) and any v-tuple (t_1, ···, t_v) with s_1 ≤ ··· ≤ s_u ≤ s_u + r ≤ t_1 ≤ ··· ≤ t_v, the following inequality is fulfilled:
\[ \big|\mathrm{Cov}\big(g_1(Z_{s_1}, \cdots, Z_{s_u}), \, g_2(Z_{t_1}, \cdots, Z_{t_v})\big)\big| \leq \psi\big(\mathrm{Lip}(g_1), \mathrm{Lip}(g_2), u, v\big) \, \epsilon(r). \]
For example, we have the following choices of ψ (see also [10]).
• ψ(Lip(g_1), Lip(g_2), u, v) = v Lip(g_2): the θ-weak dependence, then denote ε(r) = θ(r);
• ψ(Lip(g_1), Lip(g_2), u, v) = u Lip(g_1) + v Lip(g_2): the η-weak dependence, then denote ε(r) = η(r);
• ψ(Lip(g_1), Lip(g_2), u, v) = u v Lip(g_1) · Lip(g_2): the κ-weak dependence, then denote ε(r) = κ(r);
• ψ(Lip(g_1), Lip(g_2), u, v) = u Lip(g_1) + v Lip(g_2) + u v Lip(g_1) · Lip(g_2): the λ-weak dependence, then denote ε(r) = λ(r).
One can easily see that η(r) ≤ θ(r) for all r ≥ 0. In the sequel, for each of the four choices of ψ above, we set respectively Ψ(u, v) = 2v, Ψ(u, v) = u + v, Ψ(u, v) = uv and Ψ(u, v) = (u + v + uv)/2.
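For concreteness, the four variants and the associated Ψ can be written as small Python callables; this is merely a transcription of the definitions above, given as a reference table (the dictionary keys are our own labels).

```python
# psi(lip1, lip2, u, v) for the theta-, eta-, kappa- and lambda-weak dependence,
# together with the corresponding Psi(u, v) used in the sequel.
PSI = {
    "theta":  (lambda lip1, lip2, u, v: v * lip2,
               lambda u, v: 2 * v),
    "eta":    (lambda lip1, lip2, u, v: u * lip1 + v * lip2,
               lambda u, v: u + v),
    "kappa":  (lambda lip1, lip2, u, v: u * v * lip1 * lip2,
               lambda u, v: u * v),
    "lambda": (lambda lip1, lip2, u, v: u * lip1 + v * lip2 + u * v * lip1 * lip2,
               lambda u, v: (u + v + u * v) / 2),
}

# Example: the covariance bound of Definition 2.1 in the eta case reads
# |Cov(g1(...), g2(...))| <= PSI["eta"][0](Lip(g1), Lip(g2), u, v) * eps(r).
```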
Now, we set the weak-dependence assumption.
(A3): Let ψ : [0,∞)² × N² → [0,∞) be one of the four choices above. The process {Z_t = (X_t, Y_t), t ∈ Z} is stationary ergodic and (Λ_1(Z), ψ, ε)-weakly dependent such that there exist L_1, L_2, µ ≥ 0 satisfying
\[ \sum_{j \geq 0} (j+1)^{k} \epsilon(j) \leq L_1 L_2^{k} (k!)^{\mu} \quad \text{for all } k \geq 0. \tag{2.3} \]
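As an illustration of when (2.3) can hold (this example is ours, not taken from the paper), geometrically decaying coefficients ε(j) = a^j with 0 < a < 1 satisfy (2.3) with µ = 1 and, for instance, L_1 = 1/(a(1−a)) and L_2 = 1/(1−a), since Σ_{j≥0} (j+1)^k a^j ≤ k!/(1−a)^{k+1}. The short Python check below compares both sides numerically for a range of k.

```python
import math

a = 0.7                          # illustrative geometric decay eps(j) = a**j
L1 = 1.0 / (a * (1.0 - a))       # candidate constants for (2.3), with mu = 1
L2 = 1.0 / (1.0 - a)
mu = 1.0

for k in range(15):
    # left-hand side of (2.3), truncated: the remaining terms are negligible
    lhs = sum((j + 1) ** k * a ** j for j in range(500))
    rhs = L1 * L2 ** k * math.factorial(k) ** mu
    assert lhs <= rhs, (k, lhs, rhs)
print("condition (2.3) holds numerically for k = 0, ..., 14")
```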
3 Main results
3.1 Generalization bound and consistency
The following proposition provides a deviation inequality between the empirical risk and the risk, for any fixed predictor.

Proposition 3.1 Assume that the conditions (A1)-(A3) hold. Let h ∈ H. For all ε > 0, n ∈ N, we have
\[ \mathbb{P}\Big\{ R(h) - \widehat{R}_n(h) > \varepsilon \Big\} \leq \exp\left( - \frac{n^{2}\varepsilon^{2}/2}{A_n + B_n^{1/(\mu+2)} \, (n\varepsilon)^{(2\mu+3)/(\mu+2)}} \right), \]
for any real numbers A_n and B_n satisfying:
\[ A_n \geq \mathbb{E}\Big[ \Big( \sum_{i=1}^{n} \big( \ell(h(X_i), Y_i) - \mathbb{E}[\ell(h(X_0), Y_0)] \big) \Big)^{2} \Big] \quad \text{and} \quad B_n = 2 M L_2 \max\Big( \frac{2^{4+\mu} \, n M^{2} L_1}{A_n}, \, 1 \Big). \]
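To make the shape of this bound concrete, the small Python helper below (ours; the numerical values of M, L_1, L_2, µ and A_n are illustrative assumptions, with A_n taken proportional to n as one would expect for a short-memory process) evaluates the right-hand side of Proposition 3.1 for given n and ε.

```python
import math

def deviation_bound(n, eps, A_n, M, L1, L2, mu):
    """Right-hand side of Proposition 3.1:
    exp(-(n^2 eps^2 / 2) / (A_n + B_n^(1/(mu+2)) * (n*eps)^((2mu+3)/(mu+2)))),
    with B_n = 2 * M * L2 * max(2^(4+mu) * n * M^2 * L1 / A_n, 1)."""
    B_n = 2.0 * M * L2 * max(2.0 ** (4 + mu) * n * M ** 2 * L1 / A_n, 1.0)
    denom = A_n + B_n ** (1.0 / (mu + 2.0)) * (n * eps) ** ((2.0 * mu + 3.0) / (mu + 2.0))
    return math.exp(-(n ** 2 * eps ** 2 / 2.0) / denom)

# Illustrative values only: a bounded loss (M = 1) and the constants of (2.3).
for n in (10**3, 10**4, 10**5):
    print(n, deviation_bound(n, eps=0.1, A_n=5.0 * n, M=1.0, L1=2.0, L2=2.0, mu=1.0))
```

With A_n proportional to n, the exponent grows like (nε)^{1/(µ+2)} for large n, so for this illustrative parameterization the bound decays sub-exponentially in the sample size.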
The next theorem provides a uniform concentration inequality between the risk and its empirical version.