Probability-Weighted Clustered Coefficient Regression Models in Complex Survey Sampling

Mingjun Gang1, Xin Wang∗2, Zhonglei Wang3, and Wei Zhong3

1Department of Statistics, National University of Singapore

2Department of Mathematics and Statistics, San Diego State University

3MOE Key Laboratory of Econometrics, Wang Yanan Institute for Studies in Economics and School of Economics, Xiamen University

Abstract

Regression analysis is commonly conducted in survey sampling. However, existing methods fail when regression models vary across different clusters of domains. In this paper, we propose a unified framework to study the cluster-wise covariate effect under complex survey sampling based on pairwise penalties, and the associated objective function is minimized by the alternating direction method of multipliers. Theoretical properties of the proposed method are investigated under regularity conditions. Numerical experiments demonstrate that the proposed method outperforms its alternatives in terms of identifying the cluster structure and estimation efficiency for both linear regression and logistic regression models. The American Community Survey is used as an example to illustrate the advantages of the proposed approach.

Key Words: Clustered regression coefficients; Linear regression models; Logistic regression models; Oracle property; Pairwise penalties; Survey sampling.

1 Introduction

Regression models are commonly used in survey sampling to analyze the relationship among different variables (Fuller, 2011; Lohr and Liu, 1994; Rao and Molina, 2015). Traditional regression models assume common regression coefficients across all domains, but this assumption fails and leads to biased estimators if regression coefficients vary across different clusters of domains. In this paper, we propose a probability-weighted clustered coefficients (PCC) regression model to solve this problem under complex survey sampling.

∗Correspondence to: xwang14@sdsu.edu

arXiv:2210.09339v3 [stat.ME] 23 Sep 2024

A concave fusion approach was proposed to identify a cluster structure and estimate model parameters with common regression coefficients but cluster-specific intercepts (Ma and Huang, 2017). An extension to models with clustered regression coefficient heterogeneity was investigated by Ma et al. (2020). The estimation accuracy of Ma and Huang (2017) can be further improved by incorporating spatial information for areal data with repeated measures (Wang et al., 2023b). Various models with cluster-specific regression coefficients were explored for different types of data, such as binary data (Zhu et al., 2021), count data (Chen et al., 2019), survival data (Hu et al., 2021), functional data (Wang, 2024; Zhang et al., 2022), and Poisson processes (Wang et al., 2023a); also see Liu et al. (2022) and Zhu and Qu (2018). These works have shown the advantages of models with cluster-specific regression coefficients when estimating population parameters. However, none of them considers identifying a cluster structure of regression coefficients under complex survey sampling. Under complex survey sampling, an empirical-likelihood-based method was proposed for variable selection under high-dimensional setups (Zhao et al., 2022), but it cannot be used to identify cluster structures; also see Dumitrescu et al. (2021) and Wang et al. (2014).

Hierarchical models are widely used in survey sampling and small area estimation (Rao and Molina, 2015). A multi-level Bayesian model was proposed by Kim et al. (2018) for small area estimation. A robust data-driven transformation technique was proposed by Rojas-Perilla et al. (2019), and a two-level linear regression model with random intercepts was used to obtain a pseudopopulation (Molina and Rao, 2010). Wang and Zhu (2019) proposed new model-based estimators using a linear mixed model with cluster-specific regression coefficients, where random effects were included in the model, in contrast to Ma et al. (2020), who did not include random effects. Lahiri and Salvati (2023) proposed a model based on linear mixed models with heterogeneity in regression coefficients and variance components. Also see Azka Ubaidillah and Mangku (2019); Datta and Lahiri (2000); Esteban et al. (2022); Innocent Ngaruye and Singull (2017); Jiang and Lahiri (2006); Sun et al. (2022). Generalized linear models with random effects were also proposed for categorical responses (Berg and Fuller, 2014; Jiang and Lahiri, 2001; Sun et al., 2024; Zhang and Chambers, 2004). However, these works did not incorporate sampling weights. A two-level model with random cluster effects that incorporates sampling weights was proposed for population mean estimation (Pfeffermann and Sverchkov, 2007); also see Marhuenda et al. (2017). Under complex survey sampling, a generalized linear regression model with random effects was considered by Esteban et al. (2020). The efficiency of existing works can be further improved if cluster structures can be successfully identified, but none has pursued this direction.

Under complex survey sampling, sampling weights are essential to guarantee unbiased inference (Pfeffermann, 1993), and statistical efficiency can be improved by incorporating cluster structures. In this work, we consider a PCC regression model with the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) to identify the cluster structure from the estimated regression coefficients. We use the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) to implement the proposed method. In theory, we show that the oracle estimator, obtained when the cluster structure is known, is a local minimizer of the objective function with probability approaching one under regularity conditions. The asymptotic distribution of our estimator is established accordingly. Theoretical properties are investigated under a general model setup, so they apply to a wide range of situations, including linear and logistic regression models.

The article is organized as follows. In Section 2, we propose the PCC regression model incorporating sampling weights and introduce an ADMM-based algorithm to minimize the objective function. In Section 3, asymptotic properties of the proposed estimators are investigated under regularity conditions. In Section 4, two simulation studies are conducted to illustrate the advantage of the proposed method for both linear and logistic regression models. The proposed method is applied to a real survey dataset in Section 5. Finally, a summary and conclusion are provided in Section 6.

2 Methodology and Algorithm

2.1 Basic Setup

In this paper, we consider a finite population $\mathcal{F} = \mathcal{F}_1 \cup \cdots \cup \mathcal{F}_m$ of size $N$, where $\{\mathcal{F}_1, \ldots, \mathcal{F}_m\}$ are $m$ mutually exclusive domains, $\mathcal{F}_i = \{(y_{ih}, \mathbf{x}_{ih}, \mathbf{z}_{ih}) : h = 1, \ldots, N_i\}$ for $i = 1, \ldots, m$, $y_{ih}$ is the response of interest for the $h$th element in the $i$th domain, $\mathbf{x}_{ih}$ is a $p$-dimensional covariate vector associated with the domain-specific part of the regression model, $\mathbf{z}_{ih}$ is a $q$-dimensional covariate vector associated with the population-level part, $N_i$ is the size of $\mathcal{F}_i$, and $N = \sum_{i=1}^{m} N_i$. We consider the following generalized linear regression model,

$$
g\{E(y_{ih} \mid \mathbf{x}_{ih}, \mathbf{z}_{ih})\} = \mathbf{x}_{ih}^{\mathrm{T}} \boldsymbol{\beta}_i^0 + \mathbf{z}_{ih}^{\mathrm{T}} \boldsymbol{\eta}^0 \quad (i = 1, \ldots, m;\ h = 1, \ldots, N_i), \tag{1}
$$

where $g(\cdot)$ is a known link function, $\{\boldsymbol{\beta}_i^0 : i = 1, \ldots, m\}$ are domain-specific regression coefficients, and $\boldsymbol{\eta}^0$ is common to all domains. For example, $g(x) = x$ corresponds to a linear regression model, and $g(x) = \log\{x/(1 - x)\}$ to a logistic regression model.
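As a small illustration of model (1), the sketch below evaluates the two link functions mentioned above and the linear predictor for a single element. The function names and the numerical values (with $p = 2$ and $q = 1$) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def identity_link(mu):
    """g(x) = x, the link of a linear regression model."""
    return np.asarray(mu, dtype=float)

def logit_link(mu):
    """g(x) = log{x / (1 - x)}, the link of a logistic regression model."""
    mu = np.asarray(mu, dtype=float)
    return np.log(mu / (1.0 - mu))

def inverse_logit(u):
    """Inverse of the logit link, mapping a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(u, dtype=float)))

def linear_predictor(x_ih, z_ih, beta_i, eta):
    """Right-hand side of model (1): x_ih' beta_i + z_ih' eta."""
    return float(np.dot(x_ih, beta_i) + np.dot(z_ih, eta))

# Hypothetical element with p = 2 domain-specific and q = 1 common covariates.
x_ih = np.array([1.0, 0.5])
z_ih = np.array([2.0])
beta_i = np.array([0.2, -0.4])   # domain-specific coefficients
eta = np.array([0.1])            # coefficients shared by all domains

lp = linear_predictor(x_ih, z_ih, beta_i, eta)
mean_linear = lp                   # identity link: E(y) equals the predictor
mean_logistic = inverse_logit(lp)  # logit link: E(y) is a probability
```

Applying `logit_link` to `mean_logistic` returns the linear predictor again, which is a quick way to confirm that the link and its inverse are paired correctly.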

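In (1), $\{\boldsymbol{\beta}_i^0 : i = 1, \ldots, m\}$ may not be distinct across different domains, and clustering the domains that share the same domain-specific regression coefficient can effectively improve estimation efficiency. Among $\{\boldsymbol{\beta}_i^0 : i = 1, \ldots, m\}$, assume there are $K$ mutually different cluster-specific regression coefficients $\{\boldsymbol{\alpha}_k : k = 1, \ldots, K\}$, and denote by $\mathcal{G}_k = \{i : \boldsymbol{\beta}_i = \boldsymbol{\alpha}_k, 1 \le i \le m\}$ the domain index set associated with $\boldsymbol{\alpha}_k$. In practice, we know neither $\mathcal{G}_k$ nor $K$ in advance, and we are interested in identifying the partition $\hat{\mathcal{G}}$ and the number of clusters $\hat{K}$, where $\hat{\mathcal{G}} = \{\hat{\mathcal{G}}_1, \ldots, \hat{\mathcal{G}}_{\hat{K}}\}$, $\hat{\mathcal{G}}_k = \{i : \hat{\boldsymbol{\beta}}_i = \hat{\boldsymbol{\alpha}}_k, 1 \le i \le m\}$, and $\hat{\boldsymbol{\beta}}_i$ and $\hat{\boldsymbol{\alpha}}_k$ are estimated regression coefficients; see Section 2.2 for details.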
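The definition of the estimated partition can be made concrete with a short sketch: given estimated domain-level coefficients, domains whose estimates coincide are placed in the same group. The helper `recover_clusters` and its numerical tolerance `tol` are assumptions introduced here for illustration (floating-point estimates are rarely exactly equal); the paper's own procedure is described in its Section 2.2.

```python
import numpy as np

def recover_clusters(beta_hat, tol=1e-8):
    """Group domain indices whose estimated coefficient vectors coincide.

    beta_hat: array of shape (m, p), one row per domain.
    Returns (groups, centers), where groups[k] lists the 0-based domain
    indices in the k-th estimated cluster and centers[k] is its coefficient.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    centers = []  # one representative coefficient vector per cluster
    groups = []   # estimated index sets
    for i, b in enumerate(beta_hat):
        for k, c in enumerate(centers):
            if np.linalg.norm(b - c) <= tol:
                groups[k].append(i)
                break
        else:
            # No existing cluster matches: open a new one.
            centers.append(b)
            groups.append([i])
    return groups, np.array(centers)

# Hypothetical estimates for m = 5 domains with p = 2.
beta_hat = [[1.0, 2.0], [0.0, -1.0], [1.0, 2.0], [0.0, -1.0], [1.0, 2.0]]
groups, centers = recover_clusters(beta_hat)
K_hat = len(groups)  # estimated number of clusters
```

In this toy input, domains 0, 2, and 4 share one coefficient vector and domains 1 and 3 share another, so two groups are returned.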

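If the finite population $\mathcal{F}$ were available, the loss function would be

$$
L_N(\boldsymbol{\eta}, \boldsymbol{\beta}) = \sum_{i=1}^{m} W_i L_i(\boldsymbol{\eta}, \boldsymbol{\beta}),
$$

where $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^{\mathrm{T}}, \ldots, \boldsymbol{\beta}_m^{\mathrm{T}})^{\mathrm{T}}$, $W_i = N_i/N$, $L_i(\boldsymbol{\eta}, \boldsymbol{\beta}) = N_i^{-1} \sum_{h=1}^{N_i} L_{ih}(\boldsymbol{\eta}, \boldsymbol{\beta})$, and $L_{ih}(\boldsymbol{\eta}, \boldsymbol{\beta})$ is a loss function corresponding to the $h$th element in the $i$th domain. For example, $L_{ih}(\boldsymbol{\eta}, \boldsymbol{\beta}) = \frac{1}{2}(y_{ih} - \mathbf{z}_{ih}^{\mathrm{T}} \boldsymbol{\eta} - \mathbf{x}_{ih}^{\mathrm{T}} \boldsymbol{\beta}_i)^2$ corresponds to a linear regression model, and $L_{ih}(\boldsymbol{\eta}, \boldsymbol{\beta}) = -y_{ih}(\mathbf{z}_{ih}^{\mathrm{T}} \boldsymbol{\eta} + \mathbf{x}_{ih}^{\mathrm{T}} \boldsymbol{\beta}_i) + \log\{\exp(\mathbf{z}_{ih}^{\mathrm{T}} \boldsymbol{\eta} + \mathbf{x}_{ih}^{\mathrm{T}} \boldsymbol{\beta}_i) + 1\}$ to a logistic regression model.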
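The two element-wise losses and the finite-population loss $L_N$ can be sketched as follows. The data layout (a list of per-domain dictionaries with keys `y`, `x`, `z`) and the function names are assumptions made for this example only.

```python
import numpy as np

def linear_loss(y, x, z, beta_i, eta):
    """0.5 * (y - z'eta - x'beta_i)^2, evaluated for every element of a domain."""
    residual = y - z @ eta - x @ beta_i
    return 0.5 * residual ** 2

def logistic_loss(y, x, z, beta_i, eta):
    """-y * u + log{exp(u) + 1} with u = z'eta + x'beta_i."""
    u = z @ eta + x @ beta_i
    # logaddexp(u, 0) equals log(exp(u) + 1) and avoids overflow for large u.
    return -y * u + np.logaddexp(u, 0.0)

def population_loss(domains, beta, eta, loss):
    """L_N = sum_i W_i * L_i with W_i = N_i / N and L_i the domain average."""
    N = sum(len(d["y"]) for d in domains)
    total = 0.0
    for d, beta_i in zip(domains, beta):
        N_i = len(d["y"])
        L_i = np.mean(loss(d["y"], d["x"], d["z"], beta_i, eta))
        total += (N_i / N) * L_i
    return total

# Hypothetical population with m = 2 domains, p = 1 and q = 1.
domains = [
    {"y": np.array([1.0, 2.0]), "x": np.array([[1.0], [2.0]]), "z": np.array([[1.0], [1.0]])},
    {"y": np.array([0.0]), "x": np.array([[1.0]]), "z": np.array([[1.0]])},
]
beta = [np.array([1.0]), np.array([-1.0])]  # one coefficient vector per domain
eta = np.array([0.0])
L_N_linear = population_loss(domains, beta, eta, linear_loss)
```

Because $W_i = N_i/N$ and $L_i$ is a domain average, $L_N$ reduces to the average of $L_{ih}$ over all $N$ population elements, which gives a simple check on any implementation.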

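However, $\mathcal{F}$ is never fully observable in practice due to time and budget constraints. Instead, we can only observe a probability sample. Let $n_0 = \sum_{i=1}^{m} n_i$ be the sample size, where $n_i$ is the sample size in the $i$th domain. Denote by $\delta_{ih}$ the sampling indicator, with $\delta_{ih} = 1$ if the $h$th element in the $i$th domain is observed and $\delta_{ih} = 0$ otherwise, and let $\pi_{ih}$ be the associated inclusion probability. Then, a probability-weighted loss function is $L_\omega(\boldsymbol{\eta}, \boldsymbol{\beta}) = \sum_{i=1}^{m} W_i \hat{L}_i(\boldsymbol{\eta}, \boldsymbol{\beta})$, where $\hat{L}_i(\boldsymbol{\eta}, \boldsymbol{\beta}) = N_i^{-1} \sum_{h=1}^{N_i} \delta_{ih} \pi_{ih}^{-1} L_{ih}(\boldsymbol{\eta}, \boldsymbol{\beta})$. It can be shown that $L_\omega(\boldsymbol{\eta}, \boldsymbol{\beta})$ is design-unbiased for $L_N(\boldsymbol{\eta}, \boldsymbol{\beta})$ (Horvitz and Thompson, 1952). To estimate the domain-specific regression coefficients $\{\boldsymbol{\beta}_i : i = 1, \ldots, m\}$ and identify the cluster structure $\mathcal{G} = \{\mathcal{G}_1, \ldots, \mathcal{G}_K\}$, we consider the following objective function,

$$
Q_\omega(\boldsymbol{\eta}, \boldsymbol{\beta}) = m L_\omega(\boldsymbol{\eta}, \boldsymbol{\beta}) + \sum_{1 \le i < j \le m} p_\gamma(\lVert \boldsymbol{\beta}_i - \boldsymbol{\beta}_j \rVert, \lambda), \tag{2}
$$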
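The design-unbiasedness of the probability-weighted loss can be checked numerically on a toy domain. The sketch below assumes Poisson sampling (each element is included independently with probability $\pi_{ih}$), which is only one possible design and is chosen here because the design expectation can then be computed exactly by enumerating all samples. The loss values and inclusion probabilities are made up for illustration.

```python
import itertools
import numpy as np

def weighted_domain_loss(losses, delta, pi):
    """Probability-weighted domain loss: N_i^{-1} * sum_h delta_ih / pi_ih * L_ih."""
    losses, delta, pi = (np.asarray(a, dtype=float) for a in (losses, delta, pi))
    return np.sum(delta * losses / pi) / len(losses)

# Hypothetical element-wise losses and inclusion probabilities for one domain.
losses = np.array([0.5, 2.0, 1.0, 0.1])
pi = np.array([0.2, 0.5, 0.8, 0.4])
L_i = losses.mean()  # finite-population domain loss

# Exact design expectation under the assumed Poisson sampling design:
# enumerate all 2^4 possible samples and weight each by its probability.
expected = 0.0
for sample in itertools.product([0, 1], repeat=len(losses)):
    delta = np.array(sample)
    prob = np.prod(np.where(delta == 1, pi, 1.0 - pi))
    expected += prob * weighted_domain_loss(losses, delta, pi)
```

Here `expected` agrees with `L_i` up to floating-point error, since $E(\delta_{ih}) = \pi_{ih}$ under the design. The same argument applies domain by domain, so the weighted sum over domains inherits the property.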

where $p_\gamma(\cdot, \lambda)$ is a penalty function imposed on all distinct pairs of $\boldsymbol{\beta}_i$ and $\boldsymbol{\beta}_j$ with $i \neq j$, $\gamma$ is a fixed constant, and $\lambda \ge 0$ is a tuning parameter. In this paper, we use the SCAD penalty with the following form:

$$
p_\gamma(t, \lambda) = \lambda \int_0^{|t|} \min\{1, (\gamma - x/\lambda)_+ / (\gamma - 1)\} \, dx, \tag{3}
$$

and the tuning parameter is determined by a BIC criterion; see Section 2.2 for details. Following a practical rule (Ma and Huang, 2017; Ma et al., 2020; Wang et al., 2023b), we fix $\gamma$ to be 3. There are also other options for the penalty function, such as the minimax concave penalty (MCP) (Zhang, 2010). Ma and Huang (2017) compared different penalty functions and concluded that SCAD and MCP perform similarly and that both are better than an $L_1$ penalty.
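Evaluating the integral in (3) piece by piece gives a closed form: $\lambda|t|$ for $|t| \le \lambda$, $(2\gamma\lambda|t| - t^2 - \lambda^2)/\{2(\gamma - 1)\}$ for $\lambda < |t| \le \gamma\lambda$, and $(\gamma + 1)\lambda^2/2$ for $|t| > \gamma\lambda$. The sketch below implements this closed form, a midpoint-rule evaluation of (3) as a cross-check, and the pairwise penalty term of (2). All function names are hypothetical, and the midpoint check assumes $\lambda > 0$.

```python
import itertools
import numpy as np

def scad_penalty(t, lam, gamma=3.0):
    """Closed form of the SCAD penalty in (3), with gamma fixed at 3 by default."""
    t = np.abs(np.asarray(t, dtype=float))
    small = lam * t                                   # |t| <= lam
    mid = (2.0 * gamma * lam * t - t ** 2 - lam ** 2) / (2.0 * (gamma - 1.0))
    large = 0.5 * (gamma + 1.0) * lam ** 2            # |t| > gamma * lam
    return np.where(t <= lam, small, np.where(t <= gamma * lam, mid, large))

def scad_penalty_numeric(t, lam, gamma=3.0, n_grid=200_000):
    """Midpoint-rule evaluation of the integral in (3); requires lam > 0."""
    upper = abs(float(t))
    width = upper / n_grid
    x = (np.arange(n_grid) + 0.5) * width
    integrand = np.minimum(1.0, np.maximum(gamma - x / lam, 0.0) / (gamma - 1.0))
    return lam * np.sum(integrand) * width

def pairwise_penalty(beta, lam, gamma=3.0):
    """Penalty term of (2): sum over i < j of p_gamma(||beta_i - beta_j||, lam)."""
    beta = np.asarray(beta, dtype=float)
    return sum(
        float(scad_penalty(np.linalg.norm(beta[i] - beta[j]), lam, gamma))
        for i, j in itertools.combinations(range(len(beta)), 2)
    )

# Hypothetical coefficients: domains 0 and 1 coincide, domain 2 is far away.
beta = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
penalty = pairwise_penalty(beta, lam=1.0)
```

The penalty is flat beyond $\gamma\lambda$, so well-separated pairs each contribute the constant $(\gamma + 1)\lambda^2/2$ regardless of how far apart they are, while identical pairs contribute nothing.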

2.2 Algorithm

In this section, we use an ADMM-based algorithm to minimize the objective function (2). First, we fix the tuning parameter $\lambda$ and show the algorithm to minimize (2). Details regarding the selection of $\lambda$ are relegated to the end of this section.

Let $\boldsymbol{\zeta}_{ij} = \boldsymbol{\beta}_i - \boldsymbol{\beta}_j$ be the slack parameter associated with $\boldsymbol{\beta}_i$ and $\boldsymbol{\beta}_j$. Then, the
