Renewable Learning for Multiplicative Regression

with Streaming Datasets

Tianzhen Wang^1, Haixiang Zhang^1,* and Liuquan Sun^2

^1 Center for Applied Mathematics, Tianjin University, Tianjin 300072, China

^2 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

Abstract

When large amounts of data continuously arrive in streams, online updating is an effective way to reduce storage and computational burden. The key idea of online updating is that the previous estimators are sequentially updated only using the current data and some summary statistics of historical raw data. In this article, we develop a renewable learning method for a multiplicative regression model with streaming data, where the parameter estimator based on a least product relative error criterion is renewed without revisiting any historical raw data. Under some regularity conditions, we establish the consistency and asymptotic normality of the renewable estimator. Moreover, the theoretical results confirm that the proposed renewable estimator achieves the same asymptotic distribution as the least product relative error estimator with the entire dataset. Numerical studies and two real data examples are provided to evaluate the performance of our proposed method.

Keywords: Multiplicative regression; Positive responses; Renewable learning; Streaming data.

*Corresponding author: haixiang.zhang@tju.edu.cn (Haixiang Zhang)

arXiv:2210.05149v1 [stat.ME] 11 Oct 2022

1 Introduction

With the rapid development of data collection and storage technologies, the sizes of available datasets have grown rapidly in recent years. In the era of big data, it is common for datasets to arrive continuously in streams or large chunks. Faced with this kind of large-scale streaming dataset, many conventional statistical methods become challenging to apply, mainly because (i) the entire dataset is too large to be held in a general computer's memory; and (ii) the historical data may no longer be accessible due to the storage burden or privacy limits. The online updating method is effective in addressing these two challenges, because it only needs the current data block and some summary statistics of previous data instead of the historical raw data. To be more specific, the primary advantage of the online updating method is that it does not require access to historical data, while it is able to provide real-time inference for making decisions. In the literature, many efforts have been devoted to developing online updating methods for streaming datasets. For example, Schifano et al. (2016) proposed a cumulative estimating equation (CEE) estimator and a cumulatively updated estimating equation (CUEE) estimator with streaming datasets. Lee et al. (2020) studied an online updating method to correct the bias due to covariate measurement error in the framework of linear models. Luo and Song (2020) developed an incremental updating algorithm to analyze streaming data for generalized linear models. Lin et al. (2020) established a unified framework of renewable weighted sums for various online updating estimations with streaming datasets. Xue et al. (2020) proposed an online updating-based test to evaluate the proportional hazards assumption with streaming survival data. Wu et al. (2021) proposed an online updating method of survival analysis under the Cox proportional hazards model. Luo and Song (2021) studied a multivariate online regression analysis with heterogeneous streaming data. Lin et al. (2021) studied a homogenization strategy for heterogeneous streaming data. Hector et al. (2021) proposed a new big data learning method by seamlessly integrating parallel data processing and the online streaming paradigm. Luo et al. (2021) proposed an online debiased lasso method for high-dimensional generalized linear models with streaming data. Shi and Luo (2021) studied a novel framework for online causal learning. Luo et al. (2022) proposed an incremental learning algorithm to analyze streaming data with correlated outcomes based on the quadratic inference function. Wang et al. (2022) proposed a novel online renewable strategy for quantile regression, among others.

In practice, we often encounter positive data in economic or biomedical studies. Multiplicative regression plays an important role in modeling this kind of positive data, such as stock prices or lifetimes. In many applications (e.g., stock price data), the relative error, rather than the error itself, is the major concern. Multiplicative regression is able to capture the size of the relative error. There have been several papers on statistical analysis with multiplicative regression in the literature. For example, Chen et al. (2010) proposed a least absolute relative error estimation criterion for the multiplicative regression model. Li et al. (2014) considered an empirical likelihood approach to constructing confidence intervals for the regression parameters in the multiplicative regression model. Chen et al. (2016) proposed a least product relative error (LPRE) estimation criterion for the multiplicative regression model. Xia et al. (2016) studied variable selection for the multiplicative regression model. Faced with large-scale streaming data with positive responses, we propose a renewable learning method for the multiplicative regression model. The main features of our approach are as follows. First, the renewable estimator and its variance are sequentially updated only using the current data batch and some summary statistics of historical data, instead of the historical raw data. Therefore, the proposed method can deal with the computation and storage burden due to massive blocks of data. Second, the renewable estimator is statistically equivalent to the traditional LPRE estimator that is based on the entire dataset, which implies that it achieves the same asymptotic distribution as the traditional LPRE estimator. Third, the computational speed of the proposed renewable learning method is much faster than that of the full data method.

The remainder of this article is organized as follows. In Section 2, we briefly review some notation for the multiplicative regression model with streaming data. In Section 3, we present a renewable estimation method and review two sequential updating methods. Section 4 investigates the theoretical properties of the proposed renewable estimator. In Section 5, we conduct some numerical simulations to evaluate the performance of our method. Section 6 presents two illustrative real data examples. In Section 7, we give some conclusions and future research topics. All proofs are given in the Appendix.

2 Model and Notations

We consider the following multiplicative regression model (Chen et al., 2010),

$$Y_i = \exp(\beta^T X_i)\,\epsilon_i, \tag{2.1}$$

where $Y_i$ is a positive response variable, $X_i \in \mathbb{R}^p$ is a vector of covariates with the first component being 1 (intercept), $\beta = (\beta_1, \ldots, \beta_p)^T$ is a vector of regression parameters, and $\epsilon_i > 0$ is an error term, $i = 1, \ldots, N$. To estimate the parameters in model (2.1), Chen et al. (2016) proposed an LPRE criterion

$$\ell(Y; X, \beta) = \sum_{i=1}^{N} \left\{ Y_i \exp(-\beta^T X_i) + Y_i^{-1} \exp(\beta^T X_i) - 2 \right\},$$

which is an infinitely differentiable and strictly convex function, where $Y = (Y_1, \ldots, Y_N)^T$ and $X = (X_1, \ldots, X_N)^T$. Accordingly, the score function is given by $S(Y; X, \beta) = \nabla_\beta \ell(Y; X, \beta)$, where $\nabla_\beta$ stands for the derivative of $\ell(Y; X, \beta)$ with respect to $\beta$. Specifically, the score function has the following explicit expression:

$$S(Y; X, \beta) = \sum_{i=1}^{N} \left\{ Y_i^{-1} \exp(\beta^T X_i) - Y_i \exp(-\beta^T X_i) \right\} X_i.$$

Denote the minimizer of $\ell(Y; X, \beta)$ as $\hat{\beta}_N$, satisfying $S(Y; X, \hat{\beta}_N) = 0$. Due to the convexity of $\ell(Y; X, \beta)$, the Newton-Raphson method is usually adopted to obtain the traditional LPRE estimator.
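As a minimal sketch of the Newton-Raphson computation described above, the following Python code evaluates the LPRE criterion, its score and its second derivative, and iterates to a root of the score. The use of NumPy, the function names (`lpre_loss`, `lpre_score`, `lpre_hessian`, `lpre_fit`), the simulated data with log-normal errors, and the least-squares starting value on log(Y) are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def lpre_loss(beta, X, Y):
    """LPRE criterion: sum of Y*exp(-eta) + exp(eta)/Y - 2, with eta = X @ beta."""
    eta = X @ beta
    return np.sum(Y * np.exp(-eta) + np.exp(eta) / Y - 2.0)

def lpre_score(beta, X, Y):
    """Gradient of the LPRE criterion with respect to beta."""
    eta = X @ beta
    return X.T @ (np.exp(eta) / Y - Y * np.exp(-eta))

def lpre_hessian(beta, X, Y):
    """Second derivative of the LPRE criterion (positive definite for full-rank X)."""
    eta = X @ beta
    w = Y * np.exp(-eta) + np.exp(eta) / Y
    return X.T @ (X * w[:, None])

def lpre_fit(X, Y, tol=1e-10, max_iter=100):
    """Newton-Raphson iterations for the root of the score."""
    # Assumed starting value (not from the paper): least squares on log(Y).
    beta = np.linalg.lstsq(X, np.log(Y), rcond=None)[0]
    for _ in range(max_iter):
        step = np.linalg.solve(lpre_hessian(beta, X, Y), lpre_score(beta, X, Y))
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Illustrative simulated data from model (2.1); all settings are hypothetical.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # first column is the intercept
beta_true = np.array([0.5, -0.3, 0.2])
eps = np.exp(rng.normal(scale=0.5, size=n))                 # positive errors
Y = np.exp(X @ beta_true) * eps

beta_hat = lpre_fit(X, Y)
print(beta_hat)
```

The errors here are drawn so that log(ε) is symmetric about zero, which makes ε and 1/ε identically distributed; this is a convenient choice for the simulation, not a requirement stated in this section.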

Note that streaming data with positive responses are very common in many fields such as bioinformatics (Wei, 1992; Jin et al., 2003) and economic analysis (Teekens and Koerts, 1972). This brings new research opportunities, but also comes with challenges in storing and analyzing such streaming data. To be more specific, the storage burden is heavy due to large blocks of data. Moreover, it is often computationally infeasible to perform statistical analysis with the relatively limited computing resources at hand. Meanwhile, the previous data may not be accessible due to privacy concerns. Therefore, it is desirable to develop a renewable learning method for the multiplicative regression model that does not require storing any historical individual-level data in the streaming data environment. Assume that $D_1, \ldots, D_b, \ldots$ are independent and identically distributed streaming datasets, where $D_b = \{(X_{ib}, Y_{ib})\}_{i=1}^{n_b}$ is the $b$th dataset. Let $D_b^* = \{D_1, \ldots, D_b\}$ denote the cumulative data up to batch $b$ with $N_b = \sum_{k=1}^{b} n_k$. As mentioned by Luo and Song (2020), the key idea of the renewable estimation method is that a previous estimator is sequentially updated only using the current data batch $D_b$ and some summary statistics of historical data batches. To deal with large-scale streaming data with positive responses, we will propose a renewable learning method for the multiplicative regression model in the next section.

3 Methods

3.1 Renewable Estimation

Let $\hat{\beta}_b$ and $\hat{\beta}_b^*$ be the traditional LPRE estimators obtained from a single batch $D_b$ and the entire cumulative dataset $D_b^*$, respectively. Denote $\tilde{\beta}_b$ as a renewable estimator obtained from the current data batch $D_b$ and some summary statistics of historical data batches $D_{b-1}^*$, where an initial estimator with the first data batch is $\tilde{\beta}_1 = \hat{\beta}_1^*$. For $b = 2, 3, \ldots$, a previous estimator $\tilde{\beta}_{b-1}$ is sequentially updated to $\tilde{\beta}_b$ using the current data batch $D_b$ and a summary statistic of previous data batches $D_{b-1}^*$. To illustrate the proposed method, we denote the score function on data batch $D_b$ as follows:

$$S_b(D_b, \beta) = \sum_{i \in D_b} \left\{ Y_i^{-1} \exp(\beta^T X_i) - Y_i \exp(-\beta^T X_i) \right\} X_i,$$

and its negative gradient matrix is

$$Q_b(D_b, \beta) = -\sum_{i \in D_b} \left\{ Y_i \exp(-\beta^T X_i) + Y_i^{-1} \exp(\beta^T X_i) \right\} X_i X_i^T.$$

For simplicity, we first consider two data batches $D_1$ and $D_2$. For the first data batch $D_1$, an LPRE estimator $\hat{\beta}_1$ is obtained by solving $S_1(D_1, \hat{\beta}_1) = 0$. When the second data batch $D_2$ arrives, the traditional LPRE estimator $\hat{\beta}_2^*$ satisfies the following aggregated score equation,

$$S_1(D_1, \hat{\beta}_2^*) + S_2(D_2, \hat{\beta}_2^*) = 0. \tag{3.1}$$

However, solving equation (3.1) requires revisiting the previous data batch $D_1$. To derive a renewable estimator that does not need to revisit $D_1$, we take the first-order Taylor expansion of $S_1(D_1, \hat{\beta}_2^*)$ at the estimator $\tilde{\beta}_1$,

$$S_1(D_1, \tilde{\beta}_1) + Q_1(D_1, \tilde{\beta}_1)(\tilde{\beta}_1 - \hat{\beta}_2^*) + O_p\left(\|\hat{\beta}_2^* - \tilde{\beta}_1\|^2\right) + S_2(D_2, \hat{\beta}_2^*) = 0.$$
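The quality of the first-order expansion above can be checked numerically. The sketch below fits the first batch, then compares the exact first-batch score at a nearby parameter value with its linear approximation built only from the fitted estimate and the matrix $Q_1$. It is an illustration under assumed settings: the function names (`score`, `Q`, `fit`, `relative_remainder`), the simulated batch, the perturbation direction and the least-squares starting value are hypothetical and not taken from the paper.

```python
import numpy as np

def score(beta, X, Y):
    """Batch score S_1(D_1, beta)."""
    eta = X @ beta
    return X.T @ (np.exp(eta) / Y - Y * np.exp(-eta))

def Q(beta, X, Y):
    """Negative gradient matrix Q_1(D_1, beta) of the batch score."""
    eta = X @ beta
    w = Y * np.exp(-eta) + np.exp(eta) / Y
    return -(X.T @ (X * w[:, None]))

def fit(X, Y, n_iter=50):
    """Newton-Raphson on the batch score, started from least squares on log(Y)."""
    beta = np.linalg.lstsq(X, np.log(Y), rcond=None)[0]
    for _ in range(n_iter):
        # The derivative of the score is -Q, so the Newton step is +Q^{-1} S.
        beta = beta + np.linalg.solve(Q(beta, X, Y), score(beta, X, Y))
    return beta

# Illustrative first batch D_1; all settings are hypothetical.
rng = np.random.default_rng(2)
n1 = 500
X1 = np.column_stack([np.ones(n1), rng.normal(size=n1)])
Y1 = np.exp(X1 @ np.array([0.5, -0.3])) * np.exp(rng.normal(scale=0.5, size=n1))

beta_tilde1 = fit(X1, Y1)
S1_tilde = score(beta_tilde1, X1, Y1)  # numerically zero at the batch estimator
Q1_tilde = Q(beta_tilde1, X1, Y1)

def relative_remainder(t):
    """Relative size of the Taylor remainder at distance t from beta_tilde1."""
    beta = beta_tilde1 + t * np.array([1.0, 1.0]) / np.sqrt(2.0)
    exact = score(beta, X1, Y1)                          # needs the raw batch
    linear = S1_tilde + Q1_tilde @ (beta_tilde1 - beta)  # needs only summaries
    return np.linalg.norm(exact - linear) / np.linalg.norm(exact)

rel_far = relative_remainder(0.1)
rel_near = relative_remainder(0.01)
print(rel_far, rel_near)
```

In this sketch the linear approximation uses only `beta_tilde1` and `Q1_tilde`, since the score of the first batch vanishes at its own estimator. These are the kind of fixed-size summaries that can be retained once the raw batch is discarded.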

If $\min\{n_1, n_2\}$ is large enough, both $\hat{\beta}_2^*$ and $\tilde{\beta}_1$ are consistent estimators of the true value $\beta_t$ (Chen et al., 2016). After ignoring the error term $O_p\left(\|\hat{\beta}_2^* - \tilde{\beta}_1\|^2\right)$, we can derive a renewable estimator $\tilde{\beta}_2$ satisfying
