Renewable Learning for Multiplicative Regression
with Streaming Datasets
Tianzhen Wang1, Haixiang Zhang1 and Liuquan Sun2
1Center for Applied Mathematics, Tianjin University, Tianjin 300072, China
2Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
Abstract
When large amounts of data continuously arrive in streams, online updating is an
effective way to reduce storage and computational burden. The key idea of online
updating is that the previous estimators are sequentially updated only using the current
data and some summary statistics of historical raw data. In this article, we develop a
renewable learning method for a multiplicative regression model with streaming data,
where the parameter estimator based on a least product relative error criterion is
renewed without revisiting any historical raw data. Under some regularity conditions, we
establish the consistency and asymptotic normality of the renewable estimator.
Moreover, the theoretical results confirm that the proposed renewable estimator achieves
the same asymptotic distribution as the least product relative error estimator with the
entire dataset. Numerical studies and two real data examples are provided to evaluate
the performance of our proposed method.
Keywords: Multiplicative regression; Positive responses; Renewable learning;
Streaming data.
Corresponding author: haixiang.zhang@tju.edu.cn (Haixiang Zhang)
arXiv:2210.05149v1 [stat.ME] 11 Oct 2022
1 Introduction
With the rapid development of data collection and storage technologies, the sizes of available
datasets have grown rapidly in recent years. In the era of big data, it is common that
datasets continuously arrive in streams or large chunks. Faced with this kind of large-scale
streaming dataset, many conventional statistical methods become challenging to apply, mainly
because (i) the entire dataset is too large to be held in a general computer’s memory; (ii) the
historical data may no longer be accessible due to the storage burden or privacy restrictions. The
online updating method effectively addresses both challenges, because it needs only
the current data block and some summary statistics of previous data instead of the historical
raw data. More specifically, the primary advantage of the online updating method is that
it does not require access to historical data, while it is still able to provide real-time inference
for decision making. In the literature, much effort has been devoted to developing online
updating methods for streaming datasets. For example, Schifano et al. (2016) proposed
a cumulative estimating equation (CEE) estimator and a cumulatively updated estimating
equation (CUEE) estimator with streaming datasets. Lee et al. (2020) studied an online
updating method to correct the bias due to covariate measurement error in the framework of
linear models. Luo and Song (2020) developed an incremental updating algorithm to analyze
streaming data for generalized linear models. Lin et al. (2020) established a unified framework
of renewable weighted sums for various online updating estimations with streaming datasets.
Xue et al. (2020) proposed an online updating-based test to evaluate the proportional hazards
assumption with streaming survival data. Wu et al. (2021) proposed an online updating
method of survival analysis under the Cox proportional hazards model. Luo and Song
(2021) studied a multivariate online regression analysis with heterogeneous streaming data.
Lin et al. (2021) studied a homogenization strategy for heterogeneous streaming data. Hector
et al. (2021) proposed a new big data learning method by seamlessly integrating parallel data
processing and online streaming paradigms. Luo et al. (2021) proposed an online debiased
lasso method for high-dimensional generalized linear models with streaming data. Shi and
Luo (2021) studied a novel framework for online causal learning. Luo et al. (2022) proposed
an incremental learning algorithm to analyze streaming data with correlated outcomes based
on the quadratic inference function. Wang et al. (2022) proposed a novel online renewable
strategy for quantile regression, among others.
In practice, we often encounter positive data in economic or biomedical studies.
Multiplicative regression plays an important role in modeling this kind of positive data,
such as stock prices or lifetimes. In many applications (e.g., stock price data), the relative
error, rather than the error itself, is the major concern. Multiplicative regression is able
to capture the size of the relative error. There have been several papers on statistical
analysis with multiplicative regression in the literature. For example, Chen et al. (2010) proposed a
least absolute relative error estimation criterion for the multiplicative regression model. Li et al.
(2014) considered an empirical likelihood approach to constructing confidence intervals
for the regression parameters in the multiplicative regression model. Chen et al. (2016) proposed a
least product relative error (LPRE) estimation criterion for the multiplicative regression model.
Xia et al. (2016) studied variable selection for the multiplicative regression model. Faced with
large-scale streaming data with positive responses, we propose a renewable learning method
for the multiplicative regression model. The main features of our approach are as follows: First,
the renewable estimator and its variance are sequentially updated only using the current
data batch and some summary statistics of historical data, instead of the historical raw
data. Therefore, the proposed method can deal with the computation and storage burden
due to massive blocks of data. Second, the renewable estimator is statistically equivalent
to the traditional LPRE estimator based on the entire dataset, which implies that it
achieves the same asymptotic distribution as the traditional LPRE estimator. Third, the
computational speed of the proposed renewable learning method is much faster than the full
data method.
The remainder of this article is organized as follows. In Section 2, we briefly introduce
the notation for the multiplicative regression model with streaming data. In Section 3, we
present a renewable estimation method and review two sequential updating methods. Section
4 investigates the theoretical properties of the proposed renewable estimator. In Section 5,
we conduct some numerical simulations to evaluate the performance of our method. Section
6 presents two illustrative real data examples. In Section 7, we give concluding remarks and
discuss future research topics. All proofs are given in the Appendix.
2 Model and Notations
We consider the following multiplicative regression model (Chen et al., 2010),
\[ Y_i = \exp(\beta^T X_i)\,\epsilon_i, \tag{2.1} \]
where $Y_i$ is a positive response variable, $X_i \in \mathbb{R}^p$ is a vector of covariates with the first
component being 1 (intercept), $\beta = (\beta_1, \ldots, \beta_p)^T$ is a vector of regression parameters, and
$\epsilon_i > 0$ is an error term, $i = 1, \ldots, N$. To estimate the parameters in model (2.1), Chen et al.
(2016) proposed a LPRE criterion
\[ \ell(Y; X, \beta) = \sum_{i=1}^{N} \left\{ Y_i \exp(-\beta^T X_i) + Y_i^{-1} \exp(\beta^T X_i) - 2 \right\}, \]
which is an infinitely differentiable and strictly convex function, where $Y = (Y_1, \ldots, Y_N)^T$
and $X = (X_1, \ldots, X_N)^T$. Accordingly, the score function is given by $S(Y; X, \beta) = -\nabla_\beta \ell(Y; X, \beta)$,
where $\nabla_\beta \ell(Y; X, \beta)$ stands for the derivative of $\ell(Y; X, \beta)$ with respect to $\beta$. Specifically, the score
function has the following explicit expression:
\[ S(Y; X, \beta) = -\sum_{i=1}^{N} \left\{ Y_i^{-1} \exp(\beta^T X_i) - Y_i \exp(-\beta^T X_i) \right\} X_i. \]
Denote the minimizer of $\ell(Y; X, \beta)$ as $\hat{\beta}_N$, satisfying $S(Y; X, \hat{\beta}_N) = 0$. Due to the convexity
of $\ell(Y; X, \beta)$, the Newton-Raphson method is usually adopted to obtain the traditional
LPRE estimator.
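As a rough illustration of how the Newton-Raphson iteration for the LPRE criterion might be coded, the following sketch works directly with the gradient and Hessian of the criterion, so that solving the score equation amounts to zeroing the gradient. The function names, the log-scale least squares starting value, the stopping rule, and the simulated data (log-uniform errors) are all assumptions made for this example and are not taken from the article.

```python
import numpy as np


def lpre_loss(beta, X, y):
    """LPRE criterion: sum_i { y_i exp(-x_i'beta) + exp(x_i'beta) / y_i - 2 }."""
    eta = X @ beta
    return np.sum(y * np.exp(-eta) + np.exp(eta) / y - 2.0)


def lpre_gradient(beta, X, y):
    """Gradient of the LPRE criterion with respect to beta."""
    eta = X @ beta
    return X.T @ (np.exp(eta) / y - y * np.exp(-eta))


def lpre_hessian(beta, X, y):
    """Hessian of the LPRE criterion (positive definite when X has full column rank)."""
    eta = X @ beta
    w = y * np.exp(-eta) + np.exp(eta) / y
    return (X * w[:, None]).T @ X


def fit_lpre(X, y, tol=1e-10, max_iter=100):
    """Newton-Raphson minimization of the LPRE criterion."""
    # Starting value: least squares on log(y); a hypothetical choice for this sketch.
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
    for _ in range(max_iter):
        step = np.linalg.solve(lpre_hessian(beta, X, y), lpre_gradient(beta, X, y))
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated data from model (2.1); the error distribution is an assumption.
rng = np.random.default_rng(0)
N = 2000
beta_true = np.array([0.5, 1.0, -1.0])
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = np.exp(X @ beta_true + rng.uniform(-1.0, 1.0, size=N))

beta_hat = fit_lpre(X, y)
grad_norm = np.linalg.norm(lpre_gradient(beta_hat, X, y))
print(beta_hat, grad_norm)
```

Because the criterion is strictly convex, a plain Newton step without line search is usually adequate when the starting value is reasonable; a damped step could be substituted if the iteration were started far from the minimizer.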
Note that streaming data with positive responses are very common in many fields, such
as bioinformatics (Wei, 1992; Jin et al., 2003) and economic analysis (Teekens and Koerts,
1972). This brings new research opportunities, but also comes with challenges in storing
and analyzing such streaming data. More specifically, the storage burden is heavy due
to large blocks of data. Moreover, it is often computationally infeasible to perform statistical
analysis due to the relatively limited computing resources at hand. Meanwhile, the previous
data may not be accessible due to privacy concerns. Therefore, it is desirable to develop
a renewable learning method for the multiplicative regression model that does not require
storing any historical individual-level data in the streaming data environment. Assume
that $D_1, \ldots, D_b, \ldots$ are independent and identically distributed streaming datasets, where
$D_b = \{(X_{ib}, Y_{ib})\}_{i=1}^{n_b}$ is the $b$th dataset. Let $D_b^{\star} = \{D_1, \ldots, D_b\}$ denote the cumulative data
up to batch $b$ with $N_b = \sum_{k=1}^{b} n_k$. As mentioned by Luo and Song (2020), the key idea of
the renewable estimation method is that a previous estimator is sequentially updated using only
the current data batch $D_b$ and some summary statistics of historical data batches. To deal
with large-scale streaming data with positive responses, we will propose a renewable learning
method for the multiplicative regression model in the next section.
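To make the streaming notation concrete, the sketch below simulates batches arriving one at a time from model (2.1) and tracks only the cumulative sample size, discarding each batch after use. The generator name `stream_batches`, the batch sizes, and the log-uniform error distribution are hypothetical choices for illustration only.

```python
import numpy as np
from typing import Iterator, Tuple


def stream_batches(n_batches: int, batch_size: int, beta: np.ndarray,
                   seed: int = 0) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield data batches (X_b, y_b) one at a time from the multiplicative model."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    for _ in range(n_batches):
        # First covariate is the intercept, as in model (2.1).
        X = np.column_stack([np.ones(batch_size),
                             rng.normal(size=(batch_size, p - 1))])
        # Positive errors; the log-uniform distribution is an assumption.
        eps = np.exp(rng.uniform(-1.0, 1.0, size=batch_size))
        y = np.exp(X @ beta) * eps
        yield X, y


beta_true = np.array([0.5, 1.0, -1.0])
N_b = 0                 # cumulative sample size up to the current batch
cumulative_sizes = []
for b, (X_b, y_b) in enumerate(stream_batches(5, 200, beta_true), start=1):
    N_b += len(y_b)     # N_b = n_1 + ... + n_b
    cumulative_sizes.append(N_b)
    # Only summary quantities are kept; the raw batch is not stored.

print(cumulative_sizes)
```

Here the only quantity carried forward is the running sample size; in a renewable method the carried-forward state would also include the current estimate and an aggregated matrix summary.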
3 Methods
3.1 Renewable Estimation
Let $\hat{\beta}_b$ and $\hat{\beta}_b^{\star}$ be the traditional LPRE estimators obtained from a single batch $D_b$ and the
entire cumulative dataset $D_b^{\star}$, respectively. Denote $\tilde{\beta}_b$ as a renewable estimator obtained
from the current data batch $D_b$ and some summary statistics of historical data batches $D_{b-1}^{\star}$,
where an initial estimator with the first data batch is $\tilde{\beta}_1 = \hat{\beta}_1 = \hat{\beta}_1^{\star}$. For $b = 2, 3, \ldots$, a
previous estimator $\tilde{\beta}_{b-1}$ is sequentially updated to $\tilde{\beta}_b$ using the current data batch $D_b$ and
a summary statistic of previous data batches $D_{b-1}^{\star}$. To illustrate the proposed method, we
denote the score function on data batch $D_b$ as follows:
\[ S_b(D_b, \beta) = -\sum_{i \in D_b} \left\{ Y_i^{-1} \exp(\beta^T X_i) - Y_i \exp(-\beta^T X_i) \right\} X_i, \]
and its negative gradient matrix is
\[ Q_b(D_b, \beta) = \sum_{i \in D_b} \left\{ Y_i \exp(-\beta^T X_i) + Y_i^{-1} \exp(\beta^T X_i) \right\} X_i X_i^T. \]
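The relationship between the batch score and its negative gradient matrix can be checked numerically. The sketch below assumes the sign convention in which $Q_b$ is the positive definite matrix $\sum_i \{Y_i \exp(-\beta^T X_i) + Y_i^{-1}\exp(\beta^T X_i)\} X_i X_i^T$ and equals minus the derivative of $S_b$, so that $S_b = \sum_i \{Y_i \exp(-\beta^T X_i) - Y_i^{-1}\exp(\beta^T X_i)\} X_i$. The function names and the simulated batch are hypothetical.

```python
import numpy as np


def batch_score(beta, X, y):
    """Batch score S_b (minus the gradient of the LPRE criterion on the batch)."""
    eta = X @ beta
    return X.T @ (y * np.exp(-eta) - np.exp(eta) / y)


def batch_Q(beta, X, y):
    """Negative gradient matrix Q_b of the batch score."""
    eta = X @ beta
    w = y * np.exp(-eta) + np.exp(eta) / y
    return (X * w[:, None]).T @ X


def numerical_jacobian(f, beta, h=1e-6):
    """Central-difference Jacobian of a vector-valued function f at beta."""
    p = len(beta)
    J = np.zeros((p, p))
    for j in range(p):
        e = np.zeros(p)
        e[j] = h
        J[:, j] = (f(beta + e) - f(beta - e)) / (2.0 * h)
    return J


# A small simulated batch; the error distribution is an assumption.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([0.2, -0.5, 0.3])
y = np.exp(X @ beta + rng.uniform(-1.0, 1.0, size=n))

Q = batch_Q(beta, X, y)
J = numerical_jacobian(lambda b: batch_score(b, X, y), beta)
max_abs_diff = np.max(np.abs(Q + J))   # Q should equal -J up to numerical error
print(max_abs_diff)
```

Both quantities are sums over the observations in a batch, which is what allows them to be accumulated across batches without retaining the raw data.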
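For simplicity, we first consider two data batches $D_1$ and $D_2$. For the first data batch $D_1$,
the LPRE estimator $\hat{\beta}_1$ is obtained by solving $S_1(D_1, \hat{\beta}_1) = 0$. When the second data batch $D_2$ arrives,
the traditional LPRE estimator $\hat{\beta}_2^{\star}$ satisfies the following aggregated score equation,
\[ S_1(D_1, \hat{\beta}_2^{\star}) + S_2(D_2, \hat{\beta}_2^{\star}) = 0. \tag{3.1} \]
However, solving equation (3.1) requires revisiting the previous data batch $D_1$. To derive a
renewable estimator that does not need to revisit $D_1$, we take the first-order Taylor expansion
of $S_1(D_1, \hat{\beta}_2^{\star})$ at the estimator $\tilde{\beta}_1$,
\[ S_1(D_1, \tilde{\beta}_1) + Q_1(D_1, \tilde{\beta}_1)(\tilde{\beta}_1 - \hat{\beta}_2^{\star}) + O_p\left(\|\hat{\beta}_2^{\star} - \tilde{\beta}_1\|^2\right) + S_2(D_2, \hat{\beta}_2^{\star}) = 0. \]
If $\min\{n_1, n_2\}$ is large enough, both $\hat{\beta}_2^{\star}$ and $\tilde{\beta}_1$ are consistent estimators of the true value
$\beta_t$ (Chen et al., 2016). After ignoring the error term $O_p\left(\|\hat{\beta}_2^{\star} - \tilde{\beta}_1\|^2\right)$, we can derive a
renewable estimator $\tilde{\beta}_2$ that solves the resulting approximate score equation.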
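As a hedged illustration of the two-batch construction, the sketch below assumes that dropping the remainder term and using $S_1(D_1, \tilde{\beta}_1) = 0$ leads to an estimating equation of the form $Q_1(D_1, \tilde{\beta}_1)(\tilde{\beta}_1 - \beta) + S_2(D_2, \beta) = 0$, and solves it by a Newton-type iteration. The sign convention (score equal to minus the gradient of the criterion), the function names, the solver, and the simulated data are assumptions of this example rather than details confirmed by the text.

```python
import numpy as np


def score(beta, X, y):
    """Batch score: minus the gradient of the LPRE criterion (assumed convention)."""
    eta = X @ beta
    return X.T @ (y * np.exp(-eta) - np.exp(eta) / y)


def Q_matrix(beta, X, y):
    """Negative gradient matrix of the batch score."""
    eta = X @ beta
    w = y * np.exp(-eta) + np.exp(eta) / y
    return (X * w[:, None]).T @ X


def fit_lpre(X, y, tol=1e-10, max_iter=100):
    """Newton-Raphson solution of score(beta) = 0 on a single dataset."""
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]  # hypothetical start
    for _ in range(max_iter):
        step = np.linalg.solve(Q_matrix(beta, X, y), score(beta, X, y))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def renew(beta_prev, Q_prev, X_new, y_new, tol=1e-10, max_iter=100):
    """Solve Q_prev (beta_prev - beta) + score(beta; new batch) = 0 for beta."""
    beta = beta_prev.copy()
    for _ in range(max_iter):
        G = Q_prev @ (beta_prev - beta) + score(beta, X_new, y_new)
        # The derivative of G is -(Q_prev + Q_new(beta)), hence the plus sign below.
        step = np.linalg.solve(Q_prev + Q_matrix(beta, X_new, y_new), G)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def simulate(n, beta, rng):
    """Draw one batch from the multiplicative model with log-uniform errors (assumed)."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, len(beta) - 1))])
    y = np.exp(X @ beta + rng.uniform(-1.0, 1.0, size=n))
    return X, y


rng = np.random.default_rng(2022)
beta_true = np.array([0.5, 1.0, -1.0])
X1, y1 = simulate(2000, beta_true, rng)
X2, y2 = simulate(2000, beta_true, rng)

beta_1 = fit_lpre(X1, y1)            # initial estimator from the first batch
Q_1 = Q_matrix(beta_1, X1, y1)       # summary statistic kept from the first batch
beta_2_renew = renew(beta_1, Q_1, X2, y2)   # uses only (beta_1, Q_1) and batch 2

# Full-data benchmark, which needs both raw batches.
beta_2_full = fit_lpre(np.vstack([X1, X2]), np.concatenate([y1, y2]))
gap = np.max(np.abs(beta_2_renew - beta_2_full))
print(beta_2_renew, beta_2_full, gap)
```

In this sketch the renewal step touches only the previous estimate, the stored matrix `Q_1`, and the new batch, so the first batch can be discarded once `Q_1` has been computed; the printed gap gives an informal sense of how close the renewed estimate is to the full-data fit on one simulated dataset.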