Nonparametric Copula Models for
Multivariate, Mixed, and Missing Data
Joseph Feldman and Daniel R. Kowal
Department of Statistics, Rice University
Abstract
Modern datasets commonly feature both substantial missingness and many variables of mixed data types, which present significant challenges for estimation and inference. Complete case analysis, which proceeds using only the observations with fully observed variables, is often severely biased, while model-based imputation of missing values is limited by the ability of the model to capture complex dependencies among (possibly many) variables of mixed data types. To address these challenges, we develop a novel Bayesian mixture copula for joint and nonparametric modeling of multivariate count, continuous, ordinal, and unordered categorical variables, and deploy this model for inference, prediction, and imputation of missing data. Most uniquely, we introduce a new and computationally efficient strategy for marginal distribution estimation that eliminates the need to specify any marginal models yet delivers posterior consistency for each marginal distribution and the copula parameters under missingness-at-random. Extensive simulation studies demonstrate exceptional modeling and imputation capabilities relative to competing methods, especially with mixed data types, complex missingness mechanisms, and nonlinear dependencies. We conclude with a data analysis that highlights how improper treatment of missing data can distort a statistical analysis, and how the proposed approach offers a resolution.
Keywords: Bayesian inference, Factor models, Imputation, Mixture models
Corresponding author email: jrf11@rice.edu. An R package implementing the proposed approach is
available on the author’s GitHub page, found at https://github.com/jfeldman396/GMCImpute
Research was sponsored by the Army Research Office (W911NF-20-1-0184), the National Institute of
Environmental Health Sciences of the National Institutes of Health (R01ES028819), and the National Science
Foundation (SES-2214726). The content, views, and conclusions contained in this document are those of the
authors and should not be interpreted as representing the official policies, either expressed or implied, of the
Army Research Office, the National Institutes of Health, or the U.S. Government. The U.S. Government
is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright
notation herein.
arXiv:2210.14988v2 [stat.ME] 7 Apr 2023
1 Introduction
Missing data are ever-present in modern statistics and data analysis. The sources of missingness are vast and varied: participant non-response in surveys (Rubin, 1976), participant attrition in longitudinal studies (Gustavson et al., 2012), linking multiple data sources (Reiter, 2012), or errors in the data collection process all contribute to missingness. Any statistic meant to be computed on a fully observed sample of data, including frequentist estimators and Bayesian posterior distributions, must be modified carefully in the presence of missing data. At the broadest level, the goal remains to infer an unknown population quantity $Q$, and specifically to provide accurate point estimates and precise uncertainty quantification for $Q$; here, we focus on the additional challenges and implications of abundant missingness among many variables of mixed data types.
When confronted with missing data, there are two options for analysis. The first is to proceed using only observations for which all variables are observed. However, this complete case (CC) analysis, while common in practice, is highly problematic in many settings. CC analysis often substantially decreases the sample size, leading to imprecise and underpowered analysis. More critically, CC analysis can introduce various and significant forms of bias. Consider a sample of correlated bivariate data $\{(Y_{i1}, Y_{i2})\}_{i=1}^{n}$, and suppose that the missingness in $Y_1$ is determined by the value of $Y_2$, which is fully observed (missingness-at-random; see below). Figure 1 shows the potential impacts of a CC analysis: the empirical cumulative distribution function (ECDF) of $Y_1$ is severely biased, which implicates inference on $Q(Y_1)$ as well as popular Bayesian semiparametric copula models discussed subsequently (Hoff, 2007; Murray et al., 2013; Cui et al., 2019; Feldman and Kowal, 2022).
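The bias described above can be reproduced with a small simulation. The following is a minimal sketch, not the authors' code: the correlation of 0.8, the logistic missingness mechanism, and the sample size are all illustrative assumptions chosen so that the probability of $Y_1$ being missing depends only on the fully observed $Y_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
rho = 0.8  # assumed correlation between Y1 and Y2 (illustrative)

# Correlated bivariate normal data (Y1, Y2)
y2 = rng.standard_normal(n)
y1 = rho * y2 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)

# Missingness-at-random: the chance that Y1 is missing depends only on Y2.
# The logistic form and slope of 2 are assumptions for this sketch.
p_missing = 1.0 / (1.0 + np.exp(-2.0 * y2))
missing = rng.random(n) < p_missing
y1_cc = y1[~missing]  # complete cases for Y1


def ecdf(sample, grid):
    """Empirical CDF of `sample` evaluated at each point of `grid`."""
    ordered = np.sort(sample)
    return np.searchsorted(ordered, grid, side="right") / ordered.size


grid = np.linspace(-3.0, 3.0, 13)
ecdf_all = ecdf(y1, grid)    # ECDF using every Y1 (unavailable in practice)
ecdf_cc = ecdf(y1_cc, grid)  # ECDF using complete cases only
max_gap = np.max(np.abs(ecdf_all - ecdf_cc))

print(f"mean of Y1, all data:       {y1.mean():.3f}")
print(f"mean of Y1, complete cases: {y1_cc.mean():.3f}")
print(f"largest ECDF discrepancy:   {max_gap:.3f}")
```

Because large values of $Y_2$ make $Y_1$ more likely to be missing, and the two variables are positively correlated, the complete cases over-represent small values of $Y_1$, so the complete-case ECDF sits above the ECDF of the full sample.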
[Figure 1. Left panel: "Paired data with missingness-at-random", with axes $Y_1$ and $Y_2$, distinguishing observed $Y_1$ from missing $Y_1$. Right panel: "Empirical Cumulative Distribution Functions", $\hat{F}(y_1)$ against $y_1$, for all data and for complete cases.]
Figure 1: Bivariate data $\{(Y_{i1}, Y_{i2})\}_{i=1}^{n}$ with missing-at-random missingness (left) and the corresponding true and empirical cumulative distribution function (ECDF) for $Y_1$ (right). The missing data severely biases the ECDF, which impacts functionals of this term, including traditional statistics as well as Bayesian semiparametric copula models.

The second option, which we pursue here, is imputation of missing values. Informally, a statistical model is fit to the observed data and then used to repeatedly simulate the missing values, thus forming many completed datasets. Then, estimates $\hat{Q}$ are computed on each completed dataset, and combined to produce point estimates and uncertainty quantification for $Q$. If the model adequately captures the features of the data, we can expect the inference based on an imputation procedure to correct the shortcomings of a CC analysis.
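The text does not say which rule is used to combine the per-dataset estimates. One standard choice, assumed here purely for illustration, is Rubin's combining rules: the pooled point estimate is the average of the estimates, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The function name `pool_estimates` and the example numbers are hypothetical.

```python
import numpy as np


def pool_estimates(estimates, variances):
    """Pool M completed-data estimates with Rubin's combining rules.

    estimates: point estimate of Q from each completed dataset
    variances: estimated sampling variance of each point estimate
    Returns (pooled estimate, total variance).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    if m < 2 or u.size != m:
        raise ValueError("need at least two estimates with matching variances")

    q_bar = q.mean()          # pooled point estimate
    within = u.mean()         # average within-imputation variance
    between = q.var(ddof=1)   # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


# Hypothetical estimates and variances from three completed datasets
q_hat, total_var = pool_estimates([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(q_hat, total_var)
```

The between-imputation term is what carries the extra uncertainty due to the missing values: if every completed dataset gave the same estimate, the total variance would reduce to the average within-imputation variance.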
The specification of an imputation model is made precise by considering a joint model for all data $Y = (Y_{ij})$ and binary missingness variables $R = (R_{ij})$, where $R_{ij} = 1$ indicates that $Y_{ij}$ is missing, and $R_{ij} = 0$ means that $Y_{ij}$ is observed. Let $Y^{\mathrm{obs}} = (Y_{ij} : R_{ij} = 0)$ denote the observed data and $Y^{\mathrm{mis}} = (Y_{ij} : R_{ij} = 1)$ the missing values. We assume that this model is indexed by distinct parameters $\theta$ for $Y$ and $\phi$ for $R$, with joint likelihood

$$p(R, Y^{\mathrm{obs}} \mid \theta, \phi) = \int p(Y^{\mathrm{obs}}, Y^{\mathrm{mis}} \mid \theta)\, p(R \mid Y^{\mathrm{obs}}, Y^{\mathrm{mis}}, \phi)\, dY^{\mathrm{mis}}. \tag{1}$$

We focus on missingness-at-random (MAR), which allows the missingness mechanism to depend on the observed (but not missing) data: $p(R \mid Y^{\mathrm{obs}}, Y^{\mathrm{mis}}, \phi) = p(R \mid Y^{\mathrm{obs}}, \phi)$ (Rubin, 1976). In this case the missingness is ignorable, and the model specified on the observed data $p(Y^{\mathrm{obs}} \mid \theta) = \int p(Y^{\mathrm{obs}}, Y^{\mathrm{mis}} \mid \theta)\, dY^{\mathrm{mis}}$ may be used for imputation. A stronger assumption is missing-completely-at-random (MCAR), $p(R \mid Y^{\mathrm{obs}}, Y^{\mathrm{mis}}, \phi) = p(R \mid \phi)$, which is a special case of MAR.
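The step from equation (1) to ignorability can be written out explicitly. The following is a short derivation sketch using only the definitions above: under MAR the missingness term does not involve the missing values, so it factors out of the integral.

```latex
\begin{align*}
p(R, Y^{\mathrm{obs}} \mid \theta, \phi)
  &= \int p(Y^{\mathrm{obs}}, Y^{\mathrm{mis}} \mid \theta)\,
          p(R \mid Y^{\mathrm{obs}}, Y^{\mathrm{mis}}, \phi)\, dY^{\mathrm{mis}} \\
  &= p(R \mid Y^{\mathrm{obs}}, \phi)
     \int p(Y^{\mathrm{obs}}, Y^{\mathrm{mis}} \mid \theta)\, dY^{\mathrm{mis}}
     && \text{(MAR)} \\
  &= p(R \mid Y^{\mathrm{obs}}, \phi)\; p(Y^{\mathrm{obs}} \mid \theta).
\end{align*}
```

Because $\theta$ and $\phi$ are distinct, the first factor is constant in $\theta$, and inference for $\theta$ can proceed from $p(Y^{\mathrm{obs}} \mid \theta)$ alone. Under MCAR the same factorization holds with $p(R \mid \phi)$ in place of $p(R \mid Y^{\mathrm{obs}}, \phi)$.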
There are several important considerations for MAR. First, CC analysis is strongly inadvisable (see Figure 1), and thus imputation is needed in general. Second, MAR is most likely satisfied when $Y^{\mathrm{obs}}$ contains many potentially informative variables (Little, 2021). Thus, MAR demands a model capable of accommodating multiple variables, possibly of mixed types. Finally, the suitability of MAR in practice depends on the adequacy of the assumed model. In aggregate, MAR necessitates a model for multivariate and mixed data that can adapt to complex marginal and joint distributional features.
Our motivating example comes from a collection of variables (see Table 1) in the National Health and Nutrition Examination Survey (NHANES). These variables include count, continuous, ordinal, and unordered categorical variables, with missingness as high as 43% for some variables and missing values for each data type. Notably, these variables include self-reported mental health, which displays complex and discrete marginal distributional features (Figure 2), along with demographic and socioeconomic variables, alcohol and drug use variables, and health-related variables with intricate multivariate relationships. Most importantly, CC analysis is unsatisfactory or misleading for these data (see Section 7). Thus, an imputation model is required, and in particular one capable of accommodating many variables of mixed types with intricate distributional features.
[Figure 2. Panel title: "Marginal Distribution of DMHNG"; probability plotted against DMHNG.]
Figure 2: The marginal distribution of days of self-reported poor mental health (DMHNG) from the NHANES data, which is the response variable of interest in our real data analysis. Discreteness, boundedness, heaping, and zero-inflation combine to make modeling difficult.
Variable                          Values                           % Missing
Response variable:
  DaysMentHlthNotGood (DMHNG)     {0, 1, ..., 30}                  14%
Demographic and socioeconomic variables:
  Gender                          Male, Female                     0
  Age (years)                     {18, ..., 80}                    0
  Race                            White, Black, Hispanic, Other    0
  Education Level                 <HS, =HS, >HS                    5%
  Family Income (FI)              Low, Middle, High                4%
  Uninsured                       Yes, No                          0.2%
Alcohol and drug use variables:
  HeavyDrinker                    Yes, No                          29%
  UseNicotine                     Yes, No                          15%
  UsedMarijuana                   Yes, No                          43%
  UsedHardDrug                    Yes, No                          30%
Health-related variables:
  Body Mass Index (BMI, kg/m^2)   [13.4, 81.2]                     6%
  HasHighBP (BPQ020 at link)      Yes, No                          0.1%
  HasHighChol (BPQ080 at link)    Yes, No                          6%
  HasDiabetes                     Yes, No                          0.08%
Table 1: Variables in the analysis dataset with hyperlinks to the online NHANES descriptions. Annotated variables () include minor modifications (e.g., collapsed categories) from the original NHANES variables.

The literature on imputation models is extensive, yet limited in its ability to address these critical challenges; see Murray (2018) for a thorough review. Broadly, there are two main frameworks for imputation. The first, fully conditional specification (FCS), imputes missing values by (i) specifying a univariate regression model for each variable in the dataset conditional on all other variables and (ii) using each regression model to impute (separately) the missing values for each variable (Van Buuren and Oudshoorn, 1999; Raghunathan et al., 2001). This approach offers several advantages: it is amenable to mixed data types, allows customization of each univariate model to increase flexibility (Burgette and Reiter, 2010; Tang and Ishwaran, 2017), and is implemented in freely available software (Van Buuren and Groothuis-Oudshoorn, 2011). However, FCS does not guarantee a valid joint distribution for the data, which is especially problematic for Bayesian inference, and is difficult to tune in high dimensions, since it requires a separate model fit for each variable. Perhaps most