Unweighted estimation based on optimal sample under measurement constraints

Jing Wang 1,2,3, HaiYing Wang 3,*, Shifeng Xiong 1

October 11, 2022

1 NCMIS, KLSC, Academy of Mathematics and Systems Science, CAS, Beijing 100190, China
2 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
3 Department of Statistics, University of Connecticut, Storrs, CT 06269, U.S.A.
Abstract
To tackle massive data, subsampling is a practical approach to select the more informative data points. However, when responses are expensive to measure, developing efficient subsampling schemes is challenging, and an optimal sampling approach under measurement constraints was developed to meet this challenge. This method uses the inverses of optimal sampling probabilities to reweight the objective function, which assigns smaller weights to the more important data points; as a result, the estimation efficiency of the resulting estimator can still be improved. In this paper, we propose an unweighted estimating procedure based on optimal subsamples to obtain a more efficient estimator. We obtain the unconditional asymptotic distribution of the estimator via martingale techniques without conditioning on the pilot estimate, an approach that has been less investigated in the existing subsampling literature. Both asymptotic results and numerical results show that the unweighted estimator is more efficient in parameter estimation.
Keywords: Generalized Linear Models; Massive Data; Martingale Central Limit Theorem
MSC2020: Primary 62D05; secondary 62J12
*Author to whom correspondence may be addressed. Email: haiying.wang@uconn.edu
arXiv:2210.04079v1 [stat.CO] 8 Oct 2022
1 INTRODUCTION
Data acquisition is becoming easier nowadays, and massive data bring new challenges to data storage and processing. Conventional statistical methods may not be applicable due to limited computational resources. Facing such problems, subsampling has become a popular approach to reduce computational burdens. The key idea of subsampling is to collect the more informative data points from the full data and perform calculations on a smaller data set; see Drineas et al. (2006); Drineas et al. (2011); Mahoney (2011). In some circumstances, covariates $\{X_i\}$ are available for all the data points, but responses $\{Y_i\}$ can be obtained for only a small portion because they are expensive to measure. For example, the extremely large size of modern galaxy datasets has made visual classification of galaxies impractical.
Most subsampling probabilities developed recently for generalized linear models (GLMs) rely on complete responses in the full data set; see Wang et al. (2018); Wang (2019); Ai et al. (2021). To handle the difficulty when responses are hard to measure, Zhang et al. (2021) proposed a response-free optimal sampling scheme under measurement constraints (OSUMC) for GLMs. However, their method uses the reweighted estimator, which is not the most efficient one, since it assigns smaller weights to the more informative data points in the objective function. The robust sampling probabilities proposed in Nie et al. (2018) do not depend on the responses either, but their investigation focused on linear regression models.
In this paper, we focus on a subsampling method under measurement constraints and propose a more efficient estimator based on the same subsamples taken according to OSUMC for GLMs. We use martingale techniques to derive the unconditional asymptotic distribution of the unweighted estimator and show that its asymptotic covariance matrix is smaller, in the Loewner ordering, than that of the weighted estimator. Before describing the structure of the paper, we first give a short overview of the emerging field of subsampling methods.
Various subsampling methods have been studied in recent years. For linear regression, Drineas et al. (2006) developed a subsampling method based on statistical leveraging scores. Drineas et al. (2011) developed an algorithm using the randomized Hadamard transform. Ma et al. (2015) investigated the statistical perspective of leverage sampling. Wang et al. (2019) developed an information-based procedure to select optimal subdata for linear regression deterministically. Zhang & Wang (2021) proposed a distributed sampling-based approach for linear models. Ma et al. (2020) studied the statistical properties of sampling estimators and proposed several estimators based on asymptotic results related to leveraging scores. Beyond linear models, Fithian & Hastie (2014) proposed a local case-control subsampling method to handle imbalanced data sets for logistic regression. Wang et al. (2018) developed an optimal sampling method under the A-optimality criterion (OSMAC) for
logistic regression. Their estimator can be improved because inverse probability reweighting is applied to the objective function, and Wang (2019) developed a more efficient estimator for logistic regression based on optimal subsamples. They proposed an unweighted estimator with bias correction using an idea similar to Fithian & Hastie (2014). They also introduced a Poisson sampling algorithm to reduce RAM usage when calculating optimal sampling probabilities. Ai et al. (2021) generalized OSMAC to GLMs and obtained optimal subsampling probabilities under the A- and L-optimality criteria for GLMs. These optimal sampling methods require all the responses in order to construct optimal probabilities, which is not possible under measurement constraints. Zhang et al. (2021) developed an optimal sampling method under measurement constraints. Their estimator is also based on the weighted objective function, and thus its performance can be improved. Recently, Cheng et al. (2020) extended an information-based data selection approach for linear models to logistic regression. Yu et al. (2022) derived optimal Poisson subsampling probabilities under the A- and L-optimality criteria for quasi-likelihood estimation, and developed a distributed subsampling framework to deal with data stored in different machines. Wang & Ma (2020) developed an optimal sampling method for quantile regression. Pronzato & Wang (2021) proposed a sequential online subsampling procedure based on optimal bounded design measures.
We focus on GLMs in this paper, which include commonly used models such as linear, logistic and Poisson regression. The rest of the paper is organized as follows. Section 2 presents the model setup and briefly reviews the OSUMC method. The more efficient estimator and its asymptotic properties are presented in Section 3. Section 4 provides numerical simulations. We summarize our paper in Section 5. Proofs and technical details are presented in the Supplementary Material.
2 BACKGROUND AND MODEL SETUP
We start by reviewing GLMs. Consider independent and identically distributed (i.i.d.) data $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ from the distribution of $(X, Y)$, where $X \in \mathbb{R}^p$ is the covariate vector and $Y$ is the response variable. Assume that the conditional density of $Y$ given $X$ satisfies

$$f(y \mid x, \beta_0, \sigma) \propto \exp\left\{ \frac{y\, x^{\mathrm{T}}\beta_0 - b(x^{\mathrm{T}}\beta_0)}{c(\sigma)} \right\},$$

where $\beta_0$ is the unknown parameter we need to estimate from data, $b(\cdot)$ and $c(\cdot)$ are known functions, and $\sigma$ is the dispersion parameter. In this paper, we are only interested in estimating $\beta_0$. Thus, we take $c(\sigma) = 1$ without loss of generality. We also include an intercept in the model, as is almost always the case in practice.
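As a concrete illustration (these are standard GLM facts rather than anything specific to this paper), logistic regression corresponds to

$$b(\theta) = \log(1 + e^{\theta}), \qquad b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = P(Y = 1 \mid X), \qquad b''(\theta) = b'(\theta)\{1 - b'(\theta)\},$$

with $\theta = x^{\mathrm{T}}\beta_0$ and $c(\sigma) = 1$, while Poisson regression has $b(\theta) = e^{\theta}$ and linear regression has $b(\theta) = \theta^2/2$ with $c(\sigma) = \sigma^2$.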
We obtain the maximum likelihood estimator (MLE) of $\beta_0$ by maximizing the log-likelihood function, namely,

$$\hat{\beta}_{\mathrm{MLE}} := \arg\max_{\beta} \frac{1}{n} \sum_{i=1}^{n} \left\{ Y_i X_i^{\mathrm{T}}\beta - b(X_i^{\mathrm{T}}\beta) \right\}, \qquad (1)$$
which is the same as solving the following score equation:

$$\Psi_n(\beta) := \frac{1}{n} \sum_{i=1}^{n} \left\{ b'(X_i^{\mathrm{T}}\beta) - Y_i \right\} X_i = 0,$$
where $b'(\cdot)$ is the derivative of $b(\cdot)$. There is no general closed-form solution for $\hat{\beta}_{\mathrm{MLE}}$, and iterative algorithms such as Newton's method are often used. Therefore, when the data are massive, the computational burden of estimating $\beta_0$ is very heavy.
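To make the iteration concrete, below is a minimal sketch of Newton's method for the score equation above. This is our own illustration rather than the authors' code; the function names and the vectorized `b_prime`, `b_dprime` arguments are our assumptions.

```python
import numpy as np

def glm_mle_newton(X, Y, b_prime, b_dprime, max_iter=25, tol=1e-8):
    """Solve Psi_n(beta) = (1/n) sum_i {b'(X_i^T beta) - Y_i} X_i = 0
    by Newton's method; b_prime and b_dprime must be vectorized."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        score = X.T @ (b_prime(eta) - Y) / n            # Psi_n(beta)
        hess = (X * b_dprime(eta)[:, None]).T @ X / n   # d Psi_n / d beta
        step = np.linalg.solve(hess, score)
        beta -= step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Logistic regression example: b(t) = log(1 + exp(t)).
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
# beta_hat = glm_mle_newton(X, Y, sigmoid, lambda t: sigmoid(t) * (1.0 - sigmoid(t)))
```

Each iteration costs $O(np^2)$ operations on the full data, which is precisely the burden that subsampling aims to reduce.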
To handle this problem, Ai et al. (2021) proposed a subsampling-based approach, which constructs sampling probabilities $\{\pi_i\}_{i=1}^{n}$ that depend on both the covariates $\{X_i\}$ and the responses $\{Y_i\}$. However, it is infeasible to obtain all the responses under measurement constraints. For example, it costs considerable money and time to synthesize superconductors. When we use data-driven methods to predict the critical temperature from the chemical composition of superconductors, it may be more practical to measure only a small number of materials to build a data-driven model. To tackle this type of “many $X$, few $Y$” scenario, Zhang et al. (2021) developed OSUMC subsampling probabilities.
Assume we obtain a subsample of size $r$ by sampling with replacement according to the probabilities $\pi = \{\pi_i\}_{i=1}^{n}$. A reweighted estimator is often used in the subsampling literature, defined as the maximizer of the reweighted target function, namely

$$\hat{\beta}_w := \arg\max_{\beta} \frac{1}{r} \sum_{i=1}^{r} \frac{Y_i^* X_i^{*\mathrm{T}}\beta - b(X_i^{*\mathrm{T}}\beta)}{n \pi_i^*}, \qquad (2)$$
where $(X_i^*, Y_i^*)$ is the data point sampled in the $i$th step, and $\pi_i^*$ denotes the corresponding sampling probability. Equivalently, we can solve the reweighted score equation

$$\Psi_w^*(\beta) := \frac{1}{r} \sum_{i=1}^{r} \frac{b'(X_i^{*\mathrm{T}}\beta) - Y_i^*}{n \pi_i^*}\, X_i^* = 0,$$

to obtain the reweighted estimator.
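For illustration, solving this reweighted score equation only requires a weighted variant of the Newton iteration sketched earlier. Again this is a sketch of ours with hypothetical names, where `pis` holds the probabilities $\pi_i^*$ of the drawn points.

```python
import numpy as np

def weighted_subsample_estimator(Xs, Ys, pis, n, b_prime, b_dprime,
                                 beta_init, max_iter=25, tol=1e-8):
    """Newton iteration for the reweighted score
    (1/r) sum_i {b'(Xs_i^T beta) - Ys_i} Xs_i / (n * pis_i) = 0."""
    beta = beta_init.copy()
    w = 1.0 / (n * pis)          # inverse-probability weights 1/(n*pi_i)
    for _ in range(max_iter):
        eta = Xs @ beta
        # the common 1/r factor cancels in the Newton step and is omitted
        score = Xs.T @ ((b_prime(eta) - Ys) * w)
        hess = (Xs * (b_dprime(eta) * w)[:, None]).T @ Xs
        step = np.linalg.solve(hess, score)
        beta -= step
        if np.linalg.norm(step) < tol:
            break
    return beta
```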
Zhang et al. (2021) proposed a scheme to derive the optimal subsampling probabilities for GLMs under measurement constraints. They first proved that $\hat{\beta}_w$ is asymptotically normal:

$$\left[ V\{\Psi_w^*(\beta_0)\} \right]^{-\frac{1}{2}} \Phi\, (\hat{\beta}_w - \beta_0) \xrightarrow{d} N(0, I),$$
where the notation “$\xrightarrow{d}$” denotes convergence in distribution,

$$V\{\Psi_w^*(\beta_0)\} := E\left[ V\{\Psi_w^*(\beta_0) \mid X_1^n\} \right] = E\left[ \frac{1}{n^2} \sum_{i=1}^{n} b''(X_i^{\mathrm{T}}\beta_0) X_i X_i^{\mathrm{T}} \left( \frac{1}{r\pi_i} - \frac{1}{r} + 1 \right) \right],$$

$X_1^n := (X_1, X_2, \ldots, X_n)$, $b''(\cdot)$ is the second derivative of $b(\cdot)$, and

$$\Phi := E\left( \frac{1}{n} \sum_{i=1}^{n} b''(X_i^{\mathrm{T}}\beta_0) X_i X_i^{\mathrm{T}} \right). \qquad (3)$$
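As a quick sanity check (our own calculation, not part of the original derivation): under uniform sampling $\pi_i = 1/n$, the factor in the variance above reduces to $(n-1)/r + 1$, so

$$V\{\Psi_w^*(\beta_0)\} = \left( \frac{n-1}{r} + 1 \right) \frac{1}{n}\, \Phi \approx \left( \frac{1}{r} + \frac{1}{n} \right) \Phi \quad \text{for large } n,$$

which shows that the $1/r$ term contributed by subsampling dominates the variance when $r \ll n$; optimizing $\{\pi_i\}$ targets exactly this dominant term.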
Since the matrix $\Phi^{-1} V\{\Psi_w^*(\beta_0) \mid X_1^n\} \Phi^{-1}$ converges to the asymptotic variance of $\hat{\beta}_w$, Zhang et al. (2021) minimized its trace, $\mathrm{tr}(\Phi^{-1} V\{\Psi_w^*(\beta_0) \mid X_1^n\} \Phi^{-1})$, to obtain the optimal sampling probabilities, which depend only on the covariate vectors $X_1, \ldots, X_n$:

$$\pi_i^{\mathrm{AOS}}(\beta_0, \Phi) = \frac{\sqrt{b''(X_i^{\mathrm{T}}\beta_0)}\, \|\Phi^{-1} X_i\|}{\sum_{j=1}^{n} \sqrt{b''(X_j^{\mathrm{T}}\beta_0)}\, \|\Phi^{-1} X_j\|}. \qquad (4)$$
To avoid the matrix multiplication in $\|\Phi^{-1} X_i\|$ in (4), we can consider a variant of (4) which omits the inverse matrix $\Phi^{-1}$:

$$\pi_i^{\mathrm{LOS}}(\beta_0) = \frac{\sqrt{b''(X_i^{\mathrm{T}}\beta_0)}\, \|X_i\|}{\sum_{j=1}^{n} \sqrt{b''(X_j^{\mathrm{T}}\beta_0)}\, \|X_j\|}. \qquad (5)$$
Here, $\{\pi_i^{\mathrm{LOS}}\}_{i=1}^{n}$ are another widely used set of optimal probabilities, derived by minimizing the quantity $\mathrm{tr}(L \Phi^{-1} V\{\Psi_w^*(\beta_0) \mid X_1^n\} \Phi^{-1} L^{\mathrm{T}})$ with $L = \Phi$. This is a special case of using the L-optimality criterion to obtain optimal subsampling probabilities (see Wang et al., 2018; Ai et al., 2021).
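Both sets of probabilities are straightforward to compute once a pilot value of $\beta_0$ is available. The following is a minimal sketch under our own naming, with $\Phi$ replaced by its sample analogue from (3) and `b_dprime` a vectorized $b''$; it is an illustration, not the authors' implementation.

```python
import numpy as np

def osumc_probabilities(X, beta_pilot, b_dprime, use_phi_inv=True):
    """Compute pi_i^AOS in (4) or pi_i^LOS in (5): pi_i proportional to
    sqrt(b''(X_i^T beta)) times ||Phi^{-1} X_i|| or ||X_i||."""
    w = b_dprime(X @ beta_pilot)                    # b''(X_i^T beta), length n
    if use_phi_inv:
        phi = (X * w[:, None]).T @ X / X.shape[0]   # sample version of Phi in (3)
        norms = np.linalg.norm(np.linalg.solve(phi, X.T), axis=0)
    else:
        norms = np.linalg.norm(X, axis=1)
    pi = np.sqrt(w) * norms
    return pi / pi.sum()

# pi = osumc_probabilities(X, beta_pilot, b_dprime)
# idx = np.random.choice(len(pi), size=r, replace=True, p=pi)  # subsample indices
```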
The probabilities in (4) and (5) are useful when the responses are not available, as we discussed before. However, as pointed out in Wang (2019), under the logistic model framework, the weighting scheme adopted in (2) does not yield the most efficient estimator. Intuitively, if a data point $(X_i, Y_i)$ has a larger sampling probability, it contains more information about $\beta_0$. However, data points with higher sampling probabilities receive smaller weights in (2), which reduces the efficiency of the estimator. We propose a more efficient estimator based on the unweighted target function.
3 UNWEIGHTED ESTIMATION AND ASYMPTOTIC THEORY
In this section, we present an algorithm with an unweighted estimator and derive its asymptotic properties. As we discussed before, the reweighted estimator reduces the importance of the more informative data points. To overcome this problem, Wang (2019) developed an unweighted estimator with bias correction for logistic regression.
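To fix ideas (this is only our restatement of the unweighted target function named above, not the full procedure developed in this section), dropping the inverse-probability weights $1/(n\pi_i^*)$ from (2) amounts to maximizing

$$\frac{1}{r} \sum_{i=1}^{r} \left\{ Y_i^* X_i^{*\mathrm{T}}\beta - b(X_i^{*\mathrm{T}}\beta) \right\}$$

over $\beta$. Because the OSUMC probabilities in (4) and (5) depend only on the covariates, the conditional distribution of $Y^*$ given $X^*$ on the subsample is the same as in the original model, which is the intuition for why an unweighted fit can remain valid while avoiding the efficiency loss described above.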