Unweighted estimation based on optimal sample under measurement constraints

Jing Wang 1,2,3, HaiYing Wang 3,*, Shifeng Xiong 1

October 11, 2022

1 NCMIS, KLSC, Academy of Mathematics and Systems Science, CAS, Beijing 100190, China
2 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
3 Department of Statistics, University of Connecticut, Storrs, CT 06269, U.S.A.
Abstract
To tackle massive data, subsampling is a practical approach to select the more informative data points. However, when responses are expensive to measure, developing efficient subsampling schemes is challenging, and an optimal sampling approach under measurement constraints was developed to meet this challenge. This method uses the inverses of optimal sampling probabilities to reweight the objective function, which assigns smaller weights to the more important data points; as a result, the estimation efficiency of the resulting estimator can still be improved. In this paper, we propose an unweighted estimating procedure based on optimal subsamples to obtain a more efficient estimator. We obtain the unconditional asymptotic distribution of the estimator via martingale techniques without conditioning on the pilot estimate, an approach that has been less investigated in the existing subsampling literature. Both asymptotic results and numerical results show that the unweighted estimator is more efficient in parameter estimation.
Keywords: Generalized Linear Models; Massive Data; Martingale Central Limit Theorem
MSC2020: Primary 62D05; secondary 62J12
*Author to whom correspondence may be addressed. Email: haiying.wang@uconn.edu
arXiv:2210.04079v1 [stat.CO] 8 Oct 2022
1 INTRODUCTION
Data acquisition is becoming easier nowadays, and massive data bring new challenges to data storage and processing. Conventional statistical methods may not be applicable due to limited computational resources. Facing such problems, subsampling has become a popular approach to reduce computational burdens. The key idea of subsampling is to collect the more informative data points from the full data and perform calculations on a smaller data set; see Drineas et al. (2006); Drineas et al. (2011); Mahoney (2011). In some circumstances, covariates $\{X_i\}$ are available for all the data points, but responses $\{Y_i\}$ can be obtained for only a small portion because they are expensive to measure. For example, the extremely large size of modern galaxy datasets has made visual classification of galaxies impractical.
Most subsampling probabilities developed recently for generalized linear models (GLMs) rely on complete responses in the full data set; see Wang et al. (2018); Wang (2019); Ai et al. (2021). To handle the difficulty when responses are hard to measure, Zhang et al. (2021) proposed a response-free optimal sampling scheme under measurement constraints (OSUMC) for GLMs. However, their method uses the reweighted estimator, which is not the most efficient one, since it assigns smaller weights to the more informative data points in the objective function. The robust sampling probabilities proposed in Nie et al. (2018) do not depend on the responses either, but their investigation focused on linear regression models.
In this paper, we focus on a subsampling method under measurement constraints and propose a more efficient estimator based on the same subsamples taken according to OSUMC for GLMs. We use martingale techniques to derive the unconditional asymptotic distribution of the unweighted estimator and show that its asymptotic covariance matrix is smaller, in the Loewner ordering, than that of the weighted estimator. Before describing the structure of the paper, we first give a short overview of the emerging field of subsampling methods.
Various subsampling methods have been studied in recent years. For linear regression, Drineas et al. (2006) developed a subsampling method based on statistical leveraging scores. Drineas et al. (2011) developed an algorithm using the randomized Hadamard transform. Ma et al. (2015) investigated the statistical perspective of leverage sampling. Wang et al. (2019) developed an information-based procedure to select optimal subdata for linear regression deterministically. Zhang & Wang (2021) proposed a distributed sampling-based approach for linear models. Ma et al. (2020) studied the statistical properties of sampling estimators and proposed several estimators based on asymptotic results related to leveraging scores. Beyond linear models, Fithian & Hastie (2014) proposed a local case-control subsampling method to handle imbalanced data sets for logistic regression. Wang et al. (2018) developed an optimal sampling method under the A-optimality criterion (OSMAC) for
logistic regression. Their estimator can be improved because inverse probability reweighting is applied to the objective function, and Wang (2019) developed a more efficient estimator for logistic regression based on optimal subsamples. They proposed an unweighted estimator with bias correction using an idea similar to Fithian & Hastie (2014). They also introduced a Poisson sampling algorithm to reduce RAM usage when calculating optimal sampling probabilities. Ai et al. (2021) generalized OSMAC to GLMs and obtained optimal subsampling probabilities under the A- and L-optimality criteria for GLMs. These optimal sampling methods require all the responses in order to construct optimal probabilities, which is not possible under measurement constraints. Zhang et al. (2021) developed an optimal sampling method under measurement constraints. Their estimator is also based on the weighted objective function, and thus its performance can be improved. Recently, Cheng et al. (2020) extended an information-based data selection approach for linear models to logistic regression. Yu et al. (2022) derived optimal Poisson subsampling probabilities under the A- and L-optimality criteria for quasi-likelihood estimation, and developed a distributed subsampling framework to deal with data stored in different machines. Wang & Ma (2020) developed an optimal sampling method for quantile regression. Pronzato & Wang (2021) proposed a sequential online subsampling procedure based on optimal bounded design measures.
We focus on GLMs in this paper, which include commonly used models such as linear, logistic and Poisson regression. The rest of the paper is organized as follows. Section 2 presents the model setup and briefly reviews the OSUMC method. The more efficient estimator and its asymptotic properties are presented in Section 3. Section 4 provides numerical simulations. We summarize our paper in Section 5. Proofs and technical details are presented in the Supplementary Material.
2 BACKGROUND AND MODEL SETUP
We start by reviewing GLMs. Consider independent and identically distributed (i.i.d.) data $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ from the distribution of $(X, Y)$, where $X \in \mathbb{R}^p$ is the covariate vector and $Y$ is the response variable. Assume that the conditional density of $Y$ given $X$ satisfies

$$f(y \mid x, \beta_0, \sigma) \propto \exp\left\{ \frac{y\, x^{\mathrm{T}}\beta_0 - b(x^{\mathrm{T}}\beta_0)}{c(\sigma)} \right\},$$

where $\beta_0$ is the unknown parameter we need to estimate from data, $b(\cdot)$ and $c(\cdot)$ are known functions, and $\sigma$ is the dispersion parameter. In this paper, we are only interested in estimating $\beta_0$. Thus, we take $c(\sigma) = 1$ without loss of generality. We also include an intercept in the model, as is almost always the case in practice.
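As a concrete illustration (these are standard GLM facts rather than anything specific to this paper), logistic regression corresponds to

$$b(\theta) = \log(1 + e^{\theta}), \qquad b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = P(Y = 1 \mid X), \qquad b''(\theta) = b'(\theta)\{1 - b'(\theta)\},$$

with $\theta = x^{\mathrm{T}}\beta_0$ and $c(\sigma) = 1$, while Poisson regression has $b(\theta) = e^{\theta}$ and linear regression has $b(\theta) = \theta^2/2$ with $c(\sigma) = \sigma^2$.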
We obtain the maximum likelihood estimator (MLE) of $\beta_0$ by maximizing the log-likelihood function, namely,

$$\hat{\beta}_{\mathrm{MLE}} := \arg\max_{\beta} \frac{1}{n} \sum_{i=1}^{n} \left\{ Y_i X_i^{\mathrm{T}}\beta - b(X_i^{\mathrm{T}}\beta) \right\}, \qquad (1)$$
which is the same as solving the following score equation:

$$\Psi_n(\beta) := \frac{1}{n} \sum_{i=1}^{n} \left\{ b'(X_i^{\mathrm{T}}\beta) - Y_i \right\} X_i = 0,$$
where $b'(\cdot)$ is the derivative of $b(\cdot)$. There is no general closed-form solution for $\hat{\beta}_{\mathrm{MLE}}$, and iterative algorithms such as Newton's method are often used. Therefore, when the data are massive, the computational burden of estimating $\beta_0$ is very heavy.
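To make the iteration concrete, below is a minimal sketch of Newton's method for the score equation above. This is our own illustration rather than the authors' code; the function names and the vectorized `b_prime`, `b_dprime` arguments are our assumptions.

```python
import numpy as np

def glm_mle_newton(X, Y, b_prime, b_dprime, max_iter=25, tol=1e-8):
    """Solve Psi_n(beta) = (1/n) sum_i {b'(X_i^T beta) - Y_i} X_i = 0
    by Newton's method; b_prime and b_dprime must be vectorized."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        score = X.T @ (b_prime(eta) - Y) / n            # Psi_n(beta)
        hess = (X * b_dprime(eta)[:, None]).T @ X / n   # d Psi_n / d beta
        step = np.linalg.solve(hess, score)
        beta -= step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Logistic regression example: b(t) = log(1 + exp(t)).
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
# beta_hat = glm_mle_newton(X, Y, sigmoid, lambda t: sigmoid(t) * (1.0 - sigmoid(t)))
```

Each iteration costs $O(np^2)$ operations on the full data, which is precisely the burden that subsampling aims to reduce.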
To handle this problem, Ai et al. (2021) proposed a subsampling-based approach, which constructs sampling probabilities $\{\pi_i\}_{i=1}^{n}$ that depend on both the covariates $\{X_i\}$ and the responses $\{Y_i\}$. However, it is infeasible to obtain all the responses under measurement constraints. For example, it costs considerable money and time to synthesize superconductors. When we use data-driven methods to predict the critical temperature from the chemical composition of superconductors, it may be more practical to measure only a small number of materials to build a data-driven model. To tackle this type of “many $X$, few $Y$” scenario, Zhang et al. (2021) developed OSUMC subsampling probabilities.
Assume we obtain a subsample of size $r$ by sampling with replacement according to the probabilities $\pi = \{\pi_i\}_{i=1}^{n}$. A reweighted estimator is often used in the subsampling literature, defined as the maximizer of the reweighted target function, namely

$$\hat{\beta}_w := \arg\max_{\beta} \frac{1}{r} \sum_{i=1}^{r} \frac{Y_i^* X_i^{*\mathrm{T}}\beta - b(X_i^{*\mathrm{T}}\beta)}{n \pi_i^*}, \qquad (2)$$
where $(X_i^*, Y_i^*)$ is the data point sampled in the $i$th step, and $\pi_i^*$ denotes the corresponding sampling probability. Equivalently, we can solve the reweighted score equation

$$\Psi_w^*(\beta) := \frac{1}{r} \sum_{i=1}^{r} \frac{b'(X_i^{*\mathrm{T}}\beta) - Y_i^*}{n \pi_i^*}\, X_i^* = 0,$$

to obtain the reweighted estimator.
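For illustration, solving this reweighted score equation only requires a weighted variant of the Newton iteration sketched earlier. Again this is a sketch of ours with hypothetical names, where `pis` holds the probabilities $\pi_i^*$ of the drawn points.

```python
import numpy as np

def weighted_subsample_estimator(Xs, Ys, pis, n, b_prime, b_dprime,
                                 beta_init, max_iter=25, tol=1e-8):
    """Newton iteration for the reweighted score
    (1/r) sum_i {b'(Xs_i^T beta) - Ys_i} Xs_i / (n * pis_i) = 0."""
    beta = beta_init.copy()
    w = 1.0 / (n * pis)          # inverse-probability weights 1/(n*pi_i)
    for _ in range(max_iter):
        eta = Xs @ beta
        # the common 1/r factor cancels in the Newton step and is omitted
        score = Xs.T @ ((b_prime(eta) - Ys) * w)
        hess = (Xs * (b_dprime(eta) * w)[:, None]).T @ Xs
        step = np.linalg.solve(hess, score)
        beta -= step
        if np.linalg.norm(step) < tol:
            break
    return beta
```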
Zhang et al. (2021) proposed a scheme to derive the optimal subsampling probabilities for GLMs under measurement constraints. They first proved that $\hat{\beta}_w$ is asymptotically normal:

$$\left[ V\{\Psi_w^*(\beta_0)\} \right]^{-\frac{1}{2}} \Phi\, (\hat{\beta}_w - \beta_0) \xrightarrow{d} N(0, I),$$
where the notation “$\xrightarrow{d}$” denotes convergence in distribution,

$$V\{\Psi_w^*(\beta_0)\} := E\left[ V\{\Psi_w^*(\beta_0) \mid X_1^n\} \right] = E\left[ \frac{1}{n^2} \sum_{i=1}^{n} b''(X_i^{\mathrm{T}}\beta_0) X_i X_i^{\mathrm{T}} \left( \frac{1}{r\pi_i} - \frac{1}{r} + 1 \right) \right],$$

$X_1^n := (X_1, X_2, \ldots, X_n)$, $b''(\cdot)$ is the second derivative of $b(\cdot)$, and

$$\Phi := E\left( \frac{1}{n} \sum_{i=1}^{n} b''(X_i^{\mathrm{T}}\beta_0) X_i X_i^{\mathrm{T}} \right). \qquad (3)$$
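As a quick sanity check (our own calculation, not part of the original derivation): under uniform sampling $\pi_i = 1/n$, the factor in the variance above reduces to $(n-1)/r + 1$, so

$$V\{\Psi_w^*(\beta_0)\} = \left( \frac{n-1}{r} + 1 \right) \frac{1}{n}\, \Phi \approx \left( \frac{1}{r} + \frac{1}{n} \right) \Phi \quad \text{for large } n,$$

which shows that the $1/r$ term contributed by subsampling dominates the variance when $r \ll n$; optimizing $\{\pi_i\}$ targets exactly this dominant term.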
Since the matrix $\Phi^{-1} V\{\Psi_w^*(\beta_0) \mid X_1^n\} \Phi^{-1}$ converges to the asymptotic variance of $\hat{\beta}_w$, Zhang et al. (2021) minimized its trace, $\mathrm{tr}(\Phi^{-1} V\{\Psi_w^*(\beta_0) \mid X_1^n\} \Phi^{-1})$, to obtain the optimal sampling probabilities, which depend only on the covariate vectors $X_1, \ldots, X_n$:

$$\pi_i^{\mathrm{AOS}}(\beta_0, \Phi) = \frac{\sqrt{b''(X_i^{\mathrm{T}}\beta_0)}\, \|\Phi^{-1} X_i\|}{\sum_{j=1}^{n} \sqrt{b''(X_j^{\mathrm{T}}\beta_0)}\, \|\Phi^{-1} X_j\|}. \qquad (4)$$
To avoid the matrix multiplication in $\|\Phi^{-1} X_i\|$ in (4), we can consider a variant of (4) which omits the inverse matrix $\Phi^{-1}$:

$$\pi_i^{\mathrm{LOS}}(\beta_0) = \frac{\sqrt{b''(X_i^{\mathrm{T}}\beta_0)}\, \|X_i\|}{\sum_{j=1}^{n} \sqrt{b''(X_j^{\mathrm{T}}\beta_0)}\, \|X_j\|}. \qquad (5)$$
Here, $\{\pi_i^{\mathrm{LOS}}\}_{i=1}^{n}$ are another widely used set of optimal probabilities, derived by minimizing the quantity $\mathrm{tr}(L \Phi^{-1} V\{\Psi_w^*(\beta_0) \mid X_1^n\} \Phi^{-1} L^{\mathrm{T}})$ with $L = \Phi$. This is a special case of using the L-optimality criterion to obtain optimal subsampling probabilities (see Wang et al., 2018; Ai et al., 2021).
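Both sets of probabilities are straightforward to compute once a pilot value of $\beta_0$ is available. The following is a minimal sketch under our own naming, with $\Phi$ replaced by its sample analogue from (3) and `b_dprime` a vectorized $b''$; it is an illustration, not the authors' implementation.

```python
import numpy as np

def osumc_probabilities(X, beta_pilot, b_dprime, use_phi_inv=True):
    """Compute pi_i^AOS in (4) or pi_i^LOS in (5): pi_i proportional to
    sqrt(b''(X_i^T beta)) times ||Phi^{-1} X_i|| or ||X_i||."""
    w = b_dprime(X @ beta_pilot)                    # b''(X_i^T beta), length n
    if use_phi_inv:
        phi = (X * w[:, None]).T @ X / X.shape[0]   # sample version of Phi in (3)
        norms = np.linalg.norm(np.linalg.solve(phi, X.T), axis=0)
    else:
        norms = np.linalg.norm(X, axis=1)
    pi = np.sqrt(w) * norms
    return pi / pi.sum()

# pi = osumc_probabilities(X, beta_pilot, b_dprime)
# idx = np.random.choice(len(pi), size=r, replace=True, p=pi)  # subsample indices
```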
The probabilities in (4) and (5) are useful when the responses are not available, as we discussed before. However, as pointed out in Wang (2019), under the logistic model framework, the weighting scheme adopted in (2) does not yield the most efficient estimator. Intuitively, if a data point $(X_i, Y_i)$ has a larger sampling probability, it contains more information about $\beta_0$. However, data points with higher sampling probabilities receive smaller weights in (2), which reduces the efficiency of the estimator. We propose a more efficient estimator based on the unweighted target function.
3 UNWEIGHTED ESTIMATION AND ASYMPTOTIC THEORY
In this section, we present an algorithm with an unweighted estimator and derive its asymptotic properties. As we discussed before, the reweighted estimator reduces the importance of the more informative data points. To overcome this problem, Wang (2019) developed an unweighted estimator with bias correction for logistic regression.
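To fix ideas (this is only our restatement of the unweighted target function named above, not the full procedure developed in this section), dropping the inverse-probability weights $1/(n\pi_i^*)$ from (2) amounts to maximizing

$$\frac{1}{r} \sum_{i=1}^{r} \left\{ Y_i^* X_i^{*\mathrm{T}}\beta - b(X_i^{*\mathrm{T}}\beta) \right\}$$

over $\beta$. Because the OSUMC probabilities in (4) and (5) depend only on the covariates, the conditional distribution of $Y^*$ given $X^*$ on the subsample is the same as in the original model, which is the intuition for why an unweighted fit can remain valid while avoiding the efficiency loss described above.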