
Probability-Weighted Clustered Coefficient Regression
Models in Complex Survey Sampling
Mingjun Gang1, Xin Wang2, Zhonglei Wang3, and Wei Zhong3
1Department of Statistics, National University of Singapore
2Department of Mathematics and Statistics, San Diego State University
3MOE Key Laboratory of Econometrics, Wang Yanan Institute for Studies in
Economics and School of Economics, Xiamen University
Abstract
Regression analysis is commonly conducted in survey sampling. However, existing
methods fail when regression models vary across different clusters of domains.
In this paper, we propose a unified framework to study the cluster-wise covariate
effect under complex survey sampling based on pairwise penalties, and the associated
objective function is minimized by the alternating direction method of multipliers.
Theoretical properties of the proposed method are investigated under regularity
conditions. Numerical experiments demonstrate that the proposed method outperforms
its alternatives in terms of identifying the cluster structure and estimation
efficiency for both linear regression and logistic regression models. The American
Community Survey is used as an example to illustrate the advantages of the proposed
approach.
Key Words: Clustered regression coefficients; Linear regression models; Logistic
regression models; Oracle property; Pairwise penalties; Survey sampling.
1 Introduction
Regression models are commonly used in survey sampling to analyze the relationships
among different variables (Fuller, 2011; Lohr and Liu, 1994; Rao and Molina, 2015).
Correspond to: xwang14@sdsu.edu
arXiv:2210.09339v3 [stat.ME] 23 Sep 2024
Traditional regression models assume common regression coefficients across all domains,
but this assumption fails and leads to biased estimators if regression coefficients vary
across different clusters of domains. In this paper, we propose a probability-weighted
clustered coefficients (PCC) regression model to solve this problem under complex survey
sampling.
A concave fusion approach was proposed to identify a cluster structure and estimate
model parameters with common regression coefficients but cluster-specific intercepts
(Ma and Huang, 2017). An extension to models with clustered regression coefficient
heterogeneity was investigated by Ma et al. (2020). The estimation accuracy of Ma and
Huang (2017) can be further improved by incorporating spatial information for areal
data with repeated measures (Wang et al., 2023b). Various models with cluster-specific
regression coefficients have been explored for different types of data, such as binary
data (Zhu et al., 2021), count data (Chen et al., 2019), survival data (Hu et al., 2021),
functional data (Wang, 2024; Zhang et al., 2022), and Poisson processes (Wang et al.,
2023a); also see Liu et al. (2022) and Zhu and Qu (2018). These works have shown the
advantages of models with cluster-specific regression coefficients when estimating
population parameters. However, none of them considers identifying a cluster structure
of regression coefficients under complex survey sampling. Under complex survey
sampling, an empirical-likelihood-based method was proposed for variable selection in
high-dimensional setups (Zhao et al., 2022), but it cannot be used to identify cluster
structures; also see Dumitrescu et al. (2021) and Wang et al. (2014).
Hierarchical models are widely used in survey sampling and small area estimation
(Rao and Molina, 2015). A multi-level Bayesian model was proposed by Kim et al.
(2018) for small area estimation. A robust data-driven transformation technique was
proposed by Rojas-Perilla et al. (2019), and a two-level linear regression model with
random intercepts was used to obtain a pseudopopulation (Molina and Rao, 2010). Wang
and Zhu (2019) proposed new model-based estimators using a linear mixed model with
cluster-specific regression coefficients; unlike the model of Ma et al. (2020), theirs
includes random effects. Lahiri and Salvati (2023) proposed a model based on linear
mixed models with heterogeneity in regression coefficients and variance components.
Also see Azka Ubaidillah and Mangku (2019); Datta and Lahiri (2000); Esteban et al.
(2022); Innocent Ngaruye and Singull (2017); Jiang and Lahiri (2006); Sun et al. (2022).
Generalized linear models with random effects were also proposed for categorical
responses (Berg and Fuller, 2014; Jiang and Lahiri, 2001; Sun et al., 2024; Zhang and
Chambers, 2004). However, these works did not incorporate sampling weights. A
two-level model with random cluster effects that incorporates sampling weights was
proposed for population mean estimation (Pfeffermann and Sverchkov, 2007); also see
Marhuenda et al. (2017). Under complex survey sampling, a generalized linear
regression model with random effects was considered by Esteban et al. (2020). The
efficiency of existing methods could be further improved by successfully identifying
cluster structures, but none of them has pursued this direction.
Under complex survey sampling, sampling weights are essential to guarantee unbiased
inference (Pfeffermann, 1993), and statistical efficiency can be improved by
incorporating cluster structures. In this work, we consider a PCC regression model with
the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) to identify
the cluster structure from the estimated regression coefficients. We use the alternating
direction method of multipliers (ADMM) (Boyd et al., 2011) to implement the proposed
method. In theory, we show that the oracle estimator, obtained when the cluster
structure is known, is a local minimizer of the objective function with probability
approaching one under regularity conditions. The asymptotic distribution of our
estimator is established accordingly. Theoretical properties are investigated under a
general model setup, so they apply to a wide range of situations, including linear and
logistic regression models.
The article is organized as follows. In Section 2, we propose the PCC regression
model incorporating sampling weights and introduce an ADMM-based algorithm to solve
the objective function. In Section 3, asymptotic properties of the proposed estimators
are investigated under regularity conditions. Two simulation studies are conducted to
illustrate the advantage of the proposed method for both linear and logistic regression
models in Section 4. The proposed method is applied to a real survey dataset in Section
5. Finally, a summary and conclusion are provided in Section 6.
2 Methodology and Algorithm
2.1 Basic Setup
In this paper, we consider a finite population $\mathcal{F} = \mathcal{F}_1 \cup \cdots \cup \mathcal{F}_m$ of size $N$, where
$\{\mathcal{F}_1, \ldots, \mathcal{F}_m\}$ are $m$ mutually exclusive domains,
$\mathcal{F}_i = \{(y_{ih}, \mathbf{x}_{ih}, \mathbf{z}_{ih}) : h = 1, \ldots, N_i\}$ for $i = 1, \ldots, m$,
$y_{ih}$ is the response of interest for the $h$th element in the $i$th domain,
$\mathbf{x}_{ih}$ is a $p$-dimensional covariate vector associated with the domain-specific part of the
regression model, $\mathbf{z}_{ih}$ is a $q$-dimensional covariate vector associated with the
population-level part, $N_i$ is the size of $\mathcal{F}_i$, and $N = \sum_{i=1}^{m} N_i$. In this paper, we
consider the following generalized linear regression model,
$$ g\{E(y_{ih} \mid \mathbf{x}_{ih}, \mathbf{z}_{ih})\} = \mathbf{x}_{ih}^{T} \boldsymbol{\beta}_i^{0} + \mathbf{z}_{ih}^{T} \boldsymbol{\eta}^{0} \quad (i = 1, \ldots, m;\ h = 1, \ldots, N_i), \tag{1} $$
where $g(\cdot)$ is a known link function, $\{\boldsymbol{\beta}_i^{0} : i = 1, \ldots, m\}$ are domain-specific regression
coefficients, and $\boldsymbol{\eta}^{0}$ is common to all domains. For example, $g(x) = x$ corresponds
to a linear regression model, and $g(x) = \log\{x/(1 - x)\}$ to a logistic regression model.
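As a small illustration of model (1), the sketch below evaluates the linear predictor and the implied conditional mean under the two link functions mentioned above. The function names and the numerical values are hypothetical and chosen only for this example; they do not come from the paper.

```python
import math


def linear_predictor(x_ih, beta_i, z_ih, eta):
    """Right-hand side of (1): x_ih^T beta_i + z_ih^T eta."""
    domain_part = sum(a * b for a, b in zip(x_ih, beta_i))
    common_part = sum(a * b for a, b in zip(z_ih, eta))
    return domain_part + common_part


def identity_link(mu):
    """g(x) = x, the link for a linear regression model."""
    return mu


def logit_link(mu):
    """g(x) = log{x / (1 - x)}, the link for a logistic regression model."""
    return math.log(mu / (1.0 - mu))


def inverse_logit(value):
    """Inverse of the logit link, mapping a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


# Hypothetical values: p = 2 domain-specific covariates, q = 1 common covariate.
x_ih, beta_i = [1.0, 0.5], [0.2, -1.0]
z_ih, eta = [2.0], [0.3]

lp = linear_predictor(x_ih, beta_i, z_ih, eta)
mean_linear = lp                   # E(y | x, z) under the identity link
mean_logistic = inverse_logit(lp)  # E(y | x, z) under the logit link
print(lp, mean_linear, mean_logistic)
```

Applying `logit_link` to `mean_logistic` returns the linear predictor, which is exactly the statement of (1) for the logistic case.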
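In (1), $\{\boldsymbol{\beta}_i^{0} : i = 1, \ldots, m\}$ may not be distinct across different domains, and
clustering elements with the same domain-specific regression coefficient can effectively
improve estimation efficiency. Among $\{\boldsymbol{\beta}_i^{0} : i = 1, \ldots, m\}$, assume there are $K$
mutually different cluster-specific regression coefficients $\{\boldsymbol{\alpha}_k : k = 1, \ldots, K\}$, and denote
by $\mathcal{G}_k = \{i : \boldsymbol{\beta}_i = \boldsymbol{\alpha}_k, 1 \le i \le m\}$ the domain index set associated with $\boldsymbol{\alpha}_k$. In practice,
we know neither $\mathcal{G}_k$ nor $K$ in advance, and we are interested in identifying the partition
$\hat{\mathcal{G}}$ and the number of clusters $\hat{K}$, where $\hat{\mathcal{G}} = \{\hat{\mathcal{G}}_1, \ldots, \hat{\mathcal{G}}_{\hat{K}}\}$,
$\hat{\mathcal{G}}_k = \{i : \hat{\boldsymbol{\beta}}_i = \hat{\boldsymbol{\alpha}}_k, 1 \le i \le m\}$,
and $\hat{\boldsymbol{\beta}}_i$ and $\hat{\boldsymbol{\alpha}}_k$ are estimated regression coefficients; see Section 2.2 for details.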
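To make the definition of the index sets concrete, the following sketch groups domain indices whose coefficient vectors coincide and reports the resulting number of clusters. It is only an illustration of the set definition: the helper name, the numerical tolerance `tol`, and the toy coefficients are assumptions of this example, and the estimation procedure that produces the coefficients is the subject of Section 2.2.

```python
def partition_from_coefficients(beta, tol=1e-8):
    """Group 1-based domain indices i whose vectors beta_i agree within tol.

    Returns (alphas, groups): the distinct coefficient vectors alpha_k and
    the matching index sets G_k, in order of first appearance.
    """
    alphas, groups = [], []
    for i, b in enumerate(beta, start=1):
        for k, a in enumerate(alphas):
            if max(abs(u - v) for u, v in zip(a, b)) <= tol:
                groups[k].append(i)  # beta_i equals an alpha_k already seen
                break
        else:
            alphas.append(list(b))   # beta_i defines a new cluster
            groups.append([i])
    return alphas, groups


# Hypothetical coefficients for m = 5 domains with p = 2.
beta_hat = [[1.0, 2.0], [-1.0, 0.5], [1.0, 2.0], [-1.0, 0.5], [1.0, 2.0]]
alpha_hat, G_hat = partition_from_coefficients(beta_hat)
K_hat = len(G_hat)
print(K_hat, G_hat)
```

In this toy input, domains 1, 3 and 5 share one coefficient vector and domains 2 and 4 share another, so two clusters are recovered.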
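If the finite population $\mathcal{F}$ were available, the loss function would be
$$ L_N(\boldsymbol{\eta}, \boldsymbol{\beta}) = \sum_{i=1}^{m} W_i L_i(\boldsymbol{\eta}, \boldsymbol{\beta}), $$
where $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^{T}, \ldots, \boldsymbol{\beta}_m^{T})^{T}$, $W_i = N_i/N$,
$L_i(\boldsymbol{\eta}, \boldsymbol{\beta}) = N_i^{-1} \sum_{h=1}^{N_i} L_{ih}(\boldsymbol{\eta}, \boldsymbol{\beta})$, and $L_{ih}(\boldsymbol{\eta}, \boldsymbol{\beta})$
is a loss function corresponding to the $h$th element in the $i$th domain. For example,
$L_{ih}(\boldsymbol{\eta}, \boldsymbol{\beta}) = \frac{1}{2}(y_{ih} - \mathbf{z}_{ih}^{T}\boldsymbol{\eta} - \mathbf{x}_{ih}^{T}\boldsymbol{\beta}_i)^2$ corresponds to a linear regression model, and
$L_{ih}(\boldsymbol{\eta}, \boldsymbol{\beta}) = -y_{ih}(\mathbf{z}_{ih}^{T}\boldsymbol{\eta} + \mathbf{x}_{ih}^{T}\boldsymbol{\beta}_i) + \log\{\exp(\mathbf{z}_{ih}^{T}\boldsymbol{\eta} + \mathbf{x}_{ih}^{T}\boldsymbol{\beta}_i) + 1\}$
to a logistic regression model.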
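The population-level loss can be sketched in a few lines. The code below implements the two element-wise losses given above and the weighted sum over domains; the data layout (a list of domains, each a list of `(y, x, z)` tuples) and the toy values are assumptions made for this illustration only.

```python
import math


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def linear_loss(y, x, z, beta_i, eta):
    """L_ih for linear regression: half the squared residual."""
    resid = y - dot(z, eta) - dot(x, beta_i)
    return 0.5 * resid * resid


def logistic_loss(y, x, z, beta_i, eta):
    """L_ih for logistic regression: negative Bernoulli log-likelihood."""
    lp = dot(z, eta) + dot(x, beta_i)
    return -y * lp + math.log(math.exp(lp) + 1.0)


def population_loss(domains, beta, eta, loss):
    """L_N = sum_i W_i L_i, with W_i = N_i / N and L_i the domain average."""
    N = sum(len(units) for units in domains)
    total = 0.0
    for units, beta_i in zip(domains, beta):
        N_i = len(units)
        L_i = sum(loss(y, x, z, beta_i, eta) for (y, x, z) in units) / N_i
        total += (N_i / N) * L_i
    return total


# Hypothetical finite population: m = 2 domains with N_1 = 2 and N_2 = 1.
domains = [
    [(1.0, [1.0], [1.0]), (2.0, [2.0], [1.0])],
    [(0.0, [1.0], [1.0])],
]
beta = [[1.0], [-1.0]]
eta = [0.0]
L_N = population_loss(domains, beta, eta, linear_loss)
print(L_N)
```

Because $W_i N_i^{-1} = 1/N$, the weighted sum over domains equals the plain average of the element-wise losses over the whole population, which is a convenient sanity check for an implementation.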
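However, $\mathcal{F}$ is never fully observable in practice due to time and budget constraints.
Instead, we can only observe a probability sample. Let $n_0 = \sum_{i=1}^{m} n_i$ be the sample size,
where $n_i$ is the sample size for the $i$th domain. Denote by $\delta_{ih}$ the sampling indicator,
with $\delta_{ih} = 1$ if the $h$th element in the $i$th domain is observed and $\delta_{ih} = 0$ otherwise,
and let $\pi_{ih}$ be the associated inclusion probability. Then, a probability-weighted loss
function is $L_\omega(\boldsymbol{\eta}, \boldsymbol{\beta}) = \sum_{i=1}^{m} W_i \hat{L}_i(\boldsymbol{\eta}, \boldsymbol{\beta})$, where
$\hat{L}_i(\boldsymbol{\eta}, \boldsymbol{\beta}) = N_i^{-1} \sum_{h=1}^{N_i} \delta_{ih} \pi_{ih}^{-1} L_{ih}(\boldsymbol{\eta}, \boldsymbol{\beta})$.
Because $E(\delta_{ih}) = \pi_{ih}$ under the sampling design, it can be shown that $L_\omega(\boldsymbol{\eta}, \boldsymbol{\beta})$ is
design-unbiased for $L_N(\boldsymbol{\eta}, \boldsymbol{\beta})$ (Horvitz and Thompson, 1952). To estimate the
domain-specific regression coefficients $\{\boldsymbol{\beta}_i : i = 1, \ldots, m\}$ and
identify the cluster structure $\mathcal{G} = \{\mathcal{G}_1, \ldots, \mathcal{G}_K\}$, we consider the following objective
function,
$$ Q_\omega(\boldsymbol{\eta}, \boldsymbol{\beta}) = m L_\omega(\boldsymbol{\eta}, \boldsymbol{\beta}) + \sum_{1 \le i < j \le m} p_\gamma(\|\boldsymbol{\beta}_i - \boldsymbol{\beta}_j\|, \lambda), \tag{2} $$
where $p_\gamma(\cdot, \lambda)$ is a penalty function imposed on all distinct pairs of $\boldsymbol{\beta}_i$ and $\boldsymbol{\beta}_j$ with $i \ne j$,
$\gamma$ is a fixed constant, and $\lambda \ge 0$ is a tuning parameter. In this paper, we use the SCAD
penalty with the following form:
$$ p_\gamma(t, \lambda) = \lambda \int_0^{|t|} \min\{1, (\gamma - x/\lambda)_+ / (\gamma - 1)\}\, dx, \tag{3} $$
and the tuning parameter is determined by a BIC criterion; see Section 2.2 for details.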
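The pieces of the objective function (2) can be put together in a short sketch. The code below evaluates the SCAD penalty (3) in closed form, the probability-weighted loss, and the resulting objective for a fixed value of the tuning parameter. Several choices are assumptions of this illustration rather than details taken from the paper: the sample layout (one list per domain of `(y, x, z, pi)` tuples for the sampled elements), the use of the Euclidean norm for the pairwise difference, the linear-regression loss, and all numerical values.

```python
import math

GAMMA = 3.0  # the text fixes gamma at 3


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def linear_loss(y, x, z, beta_i, eta):
    resid = y - dot(z, eta) - dot(x, beta_i)
    return 0.5 * resid * resid


def scad(t, lam, gamma=GAMMA):
    """Closed form of the integral in (3), evaluated piecewise in |t|."""
    t = abs(t)
    if t <= lam:                      # integrand equals lam on [0, lam]
        return lam * t
    if t <= gamma * lam:              # integrand decays linearly to zero
        return (2.0 * gamma * lam * t - t * t - lam * lam) / (2.0 * (gamma - 1.0))
    return lam * lam * (gamma + 1.0) / 2.0   # constant beyond gamma * lam


def weighted_loss(sample, sizes, beta, eta, loss=linear_loss):
    """L_omega: each sampled element is weighted by 1 / pi_ih."""
    N = sum(sizes)
    total = 0.0
    for units, N_i, beta_i in zip(sample, sizes, beta):
        L_hat_i = sum(loss(y, x, z, beta_i, eta) / pi for (y, x, z, pi) in units) / N_i
        total += (N_i / N) * L_hat_i
    return total


def objective(sample, sizes, beta, eta, lam, loss=linear_loss):
    """Q_omega = m * L_omega + sum over pairs i < j of SCAD(||beta_i - beta_j||)."""
    m = len(beta)
    penalty = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            # Euclidean norm of the difference (an assumption of this sketch).
            dist = math.sqrt(sum((u - v) ** 2 for u, v in zip(beta[i], beta[j])))
            penalty += scad(dist, lam)
    return m * weighted_loss(sample, sizes, beta, eta, loss) + penalty


# Hypothetical sample from m = 2 domains of sizes N_1 = 4 and N_2 = 2.
sample = [
    [(1.0, [1.0], [1.0], 0.5), (3.0, [2.0], [1.0], 0.5)],
    [(2.0, [1.0], [1.0], 0.5)],
]
sizes = [4, 2]
beta = [[1.0], [3.0]]
eta = [0.0]
Q = objective(sample, sizes, beta, eta, lam=0.5)
print(Q)
```

Only sampled elements enter `weighted_loss`, which corresponds to the indicator $\delta_{ih}$ in the definition; minimizing the objective over the coefficients is a separate step handled by the algorithm of Section 2.2.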
Following a practical rule (Ma and Huang, 2017; Ma et al., 2020; Wang et al., 2023b), we
fix $\gamma$ to be 3. There are also other options for the penalty function, such as the minimax
concave penalty (MCP) (Zhang, 2010). Ma and Huang (2017) compared different penalty
functions and concluded that SCAD and MCP perform similarly and that both are better
than an $L_1$ penalty.
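One way to see why the concave penalties behave differently from an $L_1$ penalty is to evaluate them side by side. The sketch below does this for a few pairwise distances. The MCP closed form used here is the standard one from the literature and is not stated in this section, so it should be read as an assumption of the example, as should the chosen values of $\lambda$ and the evaluation grid.

```python
def l1_penalty(t, lam):
    """L1 penalty: grows linearly in |t| without bound."""
    return lam * abs(t)


def scad(t, lam, gamma=3.0):
    """SCAD penalty in closed form (piecewise in |t|)."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= gamma * lam:
        return (2.0 * gamma * lam * t - t * t - lam * lam) / (2.0 * (gamma - 1.0))
    return lam * lam * (gamma + 1.0) / 2.0


def mcp(t, lam, gamma=3.0):
    """MCP in its standard closed form (assumed here, not given in the text)."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return gamma * lam * lam / 2.0


lam = 1.0
table = [(t, l1_penalty(t, lam), scad(t, lam), mcp(t, lam)) for t in (0.5, 2.0, 5.0, 10.0)]
for t, p_l1, p_scad, p_mcp in table:
    print(f"t={t:5.1f}  L1={p_l1:6.3f}  SCAD={p_scad:6.3f}  MCP={p_mcp:6.3f}")
```

Both SCAD and MCP are constant once the distance exceeds $\gamma\lambda$, so coefficient vectors that are far apart are not pulled together any further, whereas the $L_1$ penalty keeps increasing with the distance.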
2.2 Algorithm
In this section, we use an ADMM-based algorithm to minimize the objective
function (2). First, we fix the tuning parameter $\lambda$ and show the algorithm to minimize (2).
Details regarding the selection of $\lambda$ are relegated to the end of this section.
Let $\boldsymbol{\zeta}_{ij} = \boldsymbol{\beta}_i - \boldsymbol{\beta}_j$ be the slack parameter associated with $\boldsymbol{\beta}_i$ and $\boldsymbol{\beta}_j$. Then, the