
Nonparametric Copula Models for Multivariate, Mixed, and Missing Data

Joseph Feldman∗ and Daniel R. Kowal†

Department of Statistics, Rice University

Abstract

Modern datasets commonly feature both substantial missingness and many variables of mixed data types, which present significant challenges for estimation and inference. Complete case analysis, which proceeds using only the observations with fully-observed variables, is often severely biased, while model-based imputation of missing values is limited by the ability of the model to capture complex dependencies among (possibly many) variables of mixed data types. To address these challenges, we develop a novel Bayesian mixture copula for joint and nonparametric modeling of multivariate count, continuous, ordinal, and unordered categorical variables, and deploy this model for inference, prediction, and imputation of missing data. Most uniquely, we introduce a new and computationally efficient strategy for marginal distribution estimation that eliminates the need to specify any marginal models yet delivers posterior consistency for each marginal distribution and the copula parameters under missingness-at-random. Extensive simulation studies demonstrate exceptional modeling and imputation capabilities relative to competing methods, especially with mixed data types, complex missingness mechanisms, and nonlinear dependencies. We conclude with a data analysis that highlights how improper treatment of missing data can distort a statistical analysis, and how the proposed approach offers a resolution.

Keywords: Bayesian inference, Factor models, Imputation, Mixture models

∗ Corresponding author email: jrf11@rice.edu. An R package implementing the proposed approach is available on the author’s GitHub page, found at https://github.com/jfeldman396/GMCImpute

† Research was sponsored by the Army Research Office (W911NF-20-1-0184), the National Institute of Environmental Health Sciences of the National Institutes of Health (R01ES028819), and the National Science Foundation (SES-2214726). The content, views, and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, the National Institutes of Health, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

arXiv:2210.14988v2 [stat.ME] 7 Apr 2023

1 Introduction

Missing data are ever-present in modern statistics and data analysis. The sources of missingness are vast and varied: participant non-response in surveys (Rubin, 1976), participant attrition in longitudinal studies (Gustavson et al., 2012), linking multiple data sources (Reiter, 2012), or errors in the data collection process all contribute to missingness. Any statistic meant to be computed on a fully-observed sample of data (including frequentist estimators and Bayesian posterior distributions) must be modified carefully in the presence of missing data. At the broadest level, the goal remains to infer an unknown population quantity $Q$, and specifically to provide accurate point estimates and precise uncertainty quantification for $Q$; here, we focus on the additional challenges and implications of abundant missingness among many variables of mixed data types.

When confronted with missing data, there are two options for analysis. The first is to proceed using only observations for which all variables are observed. However, this complete case (CC) analysis, while common in practice, is highly problematic in many settings. CC analysis often substantially decreases the sample size, leading to imprecise and underpowered analysis. More critically, CC analysis can introduce various and significant forms of bias. Consider a sample of correlated bivariate data $\{(Y_{i1}, Y_{i2})\}_{i=1}^n$, and suppose that the missingness in $Y_1$ is determined by the value of $Y_2$, which is fully observed (missingness-at-random; see below). Figure 1 shows the potential impacts of a CC analysis: the empirical cumulative distribution function (ECDF) of $Y_1$ is severely biased, which implicates inference on $Q(Y_1)$ as well as popular Bayesian semiparametric copula models discussed subsequently (Hoff, 2007; Murray et al., 2013; Cui et al., 2019; Feldman and Kowal, 2022).
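The bias described above can be reproduced in a few lines. The following is a minimal sketch, not the simulation behind Figure 1: the sample size, the correlation, and the logistic missingness mechanism are all illustrative assumptions. It draws correlated bivariate normal data, deletes $Y_1$ with a probability that depends only on the fully observed $Y_2$, and compares the complete-case ECDF of $Y_1$ with the ECDF that would be available if nothing were missing.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Illustrative settings (assumptions, not taken from the paper).
n, rho = 5000, 0.8

# Correlated bivariate normal data (Y1, Y2).
cov = np.array([[1.0, rho], [rho, 1.0]])
y = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
y1, y2 = y[:, 0], y[:, 1]

# MAR mechanism: Y1 is missing with a probability that depends only on the
# fully observed Y2 (a logistic curve, chosen purely for illustration).
p_missing = 1.0 / (1.0 + np.exp(-2.0 * y2))
r1 = rng.random(n) < p_missing  # r1[i] is True when Y_i1 is missing


def ecdf(sample, grid):
    """Empirical CDF of `sample` evaluated at each point of `grid`."""
    sample = np.sort(sample)
    return np.searchsorted(sample, grid, side="right") / sample.size


grid = np.linspace(-3.0, 3.0, 61)
ecdf_all = ecdf(y1, grid)      # uses every Y1 (unavailable in practice)
ecdf_cc = ecdf(y1[~r1], grid)  # complete cases only

max_gap = np.max(np.abs(ecdf_all - ecdf_cc))
print(f"fraction of Y1 missing: {r1.mean():.2f}")
print(f"mean of Y1, complete cases: {y1[~r1].mean():.2f}; all data: {y1.mean():.2f}")
print(f"largest ECDF gap on the grid: {max_gap:.2f}")
```

Because large values of $Y_2$ make $Y_1$ more likely to be missing and the two variables are positively correlated, the complete cases over-represent small values of $Y_1$, so the complete-case ECDF sits above the all-data ECDF.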

[Figure 1. Left panel: "Paired data with missingness-at-random" (legend: Observed $Y_1$, Missing $Y_1$). Right panel: "Empirical Cumulative Distribution Functions" (legend: ECDF: all data, ECDF: complete cases).]

Figure 1: Bivariate data $\{(Y_{i1}, Y_{i2})\}_{i=1}^n$ with missing-at-random missingness (left) and the corresponding true and empirical cumulative distribution function (ECDF) for $Y_1$ (right). The missing data severely biases the ECDF, which impacts functionals of this term, including traditional statistics as well as Bayesian semiparametric copula models.

The second option, which we pursue here, is imputation of missing values. Informally, a statistical model is fit to the observed data and then used to repeatedly simulate the missing values, thus forming many completed datasets. Then, estimates $\hat{Q}$ are computed on each completed dataset, and combined to produce point estimates and uncertainty quantification for $Q$. If the model adequately captures the features of the data, we can expect the inference based on an imputation procedure to correct the shortcomings of a CC analysis.
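The passage above does not say how the per-dataset estimates are combined. One standard choice is Rubin's combining rules, sketched below as an assumption rather than as the procedure used in this paper. With $M$ completed datasets, the pooled estimate is the average of the $\hat{Q}_m$, and the total variance is the average within-imputation variance plus $(1 + 1/M)$ times the between-imputation variance. The function name `combine_estimates` and the input values are hypothetical.

```python
import numpy as np


def combine_estimates(q_hat, u_hat):
    """Pool M completed-data estimates with Rubin's combining rules.

    q_hat: length-M sequence of point estimates, one per completed dataset.
    u_hat: length-M sequence of their estimated variances.
    Returns (pooled estimate, total variance).
    """
    q_hat = np.asarray(q_hat, dtype=float)
    u_hat = np.asarray(u_hat, dtype=float)
    m = q_hat.size
    if m < 2 or u_hat.size != m:
        raise ValueError("need M >= 2 estimates with matching variances")
    q_bar = q_hat.mean()       # pooled point estimate
    u_bar = u_hat.mean()       # within-imputation variance
    b = q_hat.var(ddof=1)      # between-imputation variance
    total_var = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, total_var


# Hypothetical estimates of Q and their variances from M = 4 completed datasets.
q_bar, total_var = combine_estimates([1.0, 1.2, 0.8, 1.0], [0.04, 0.05, 0.04, 0.03])
print(f"pooled estimate: {q_bar:.3f}, total variance: {total_var:.4f}")
```

The between-imputation term is what carries the extra uncertainty due to the missing values: if every completed dataset gave the same estimate, the total variance would reduce to the usual complete-data variance.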

The specification of an imputation model is made precise by considering a joint model for all data $Y = (Y_{ij})$ and binary missingness variables $R = (R_{ij})$, where $R_{ij} = 1$ indicates that $Y_{ij}$ is missing, and $R_{ij} = 0$ means that $Y_{ij}$ is observed. Let $Y^{obs} = (Y_{ij} : R_{ij} = 0)$ denote the observed data and $Y^{mis} = (Y_{ij} : R_{ij} = 1)$ the missing values. We assume that this model is indexed by distinct parameters $\theta$ for $Y$ and $\phi$ for $R$, with joint likelihood

$$p(R, Y^{obs} \mid \theta, \phi) = \int p(Y^{obs}, Y^{mis} \mid \theta)\, p(R \mid Y^{obs}, Y^{mis}, \phi)\, dY^{mis}. \qquad (1)$$

We focus on missingness-at-random (MAR), which allows the missingness mechanism to depend on the observed (but not missing) data: $p(R \mid Y^{obs}, Y^{mis}, \phi) = p(R \mid Y^{obs}, \phi)$ (Rubin, 1976). In this case the missingness is ignorable, and the model specified on the observed data $p(Y^{obs} \mid \theta) = \int p(Y^{obs}, Y^{mis} \mid \theta)\, dY^{mis}$ may be used for imputation. A stronger assumption is missing-completely-at-random (MCAR), $p(R \mid Y^{obs}, Y^{mis}, \phi) = p(R \mid \phi)$, which is a special case of MAR.
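The step from the joint likelihood (1) to ignorability can be written out explicitly. The short derivation below uses only the MAR condition and the distinctness of $\theta$ and $\phi$ stated above: under MAR the missingness term does not depend on $Y^{mis}$, so it can be pulled outside the integral.

```latex
\begin{align*}
p(R, Y^{obs} \mid \theta, \phi)
  &= \int p(Y^{obs}, Y^{mis} \mid \theta)\,
          p(R \mid Y^{obs}, Y^{mis}, \phi)\, dY^{mis} \\
  &= p(R \mid Y^{obs}, \phi)
     \int p(Y^{obs}, Y^{mis} \mid \theta)\, dY^{mis}
     && \text{(by MAR)} \\
  &= p(R \mid Y^{obs}, \phi)\; p(Y^{obs} \mid \theta).
\end{align*}
```

The likelihood therefore factors into a term involving only $\phi$ and a term involving only $\theta$, so inference on $\theta$ can proceed from $p(Y^{obs} \mid \theta)$ without modeling the missingness mechanism.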

There are several important considerations for MAR. First, CC analysis is strongly inadvisable (see Figure 1), and thus imputation is needed in general. Second, MAR is most likely satisfied when $Y^{obs}$ contains many potentially informative variables (Little, 2021). Thus, MAR demands a model capable of accommodating multiple variables, possibly of mixed types. Finally, the suitability of MAR in practice depends on the adequacy of the assumed model. In aggregate, MAR necessitates a model for multivariate and mixed data that can adapt to complex marginal and joint distributional features.

Our motivating example comes from a collection of variables (see Table 1) in the National Health and Nutrition Examination Survey (NHANES). These variables include count, continuous, ordinal, and unordered categorical variables, with missingness as high as 43% for some variables and missing values for each data type. Notably, these variables include self-reported mental health, which displays complex and discrete marginal distributional features (Figure 2), along with demographic and socioeconomic variables, alcohol and drug use variables, and health-related variables with intricate multivariate relationships. Most importantly, CC analysis is unsatisfactory or misleading for these data (see Section 7). Thus, an imputation model is required, and in particular one capable of accommodating many variables of mixed types with intricate distributional features.

[Figure 2. Plot titled "Marginal Distribution of DMHNG"; x-axis: DMHNG, y-axis: Probability.]

Figure 2: The marginal distribution of days of self-reported poor mental health (DMHNG) from the NHANES data, which is the response variable of interest in our real data analysis. Discreteness, boundedness, heaping, and zero-inflation combine to make modeling difficult.

| Variable | Values | % Missing |
|---|---|---|
| Response variable: | | |
| DaysMentHlthNotGood (DMHNG) | {0, 1, ..., 30} | 14% |
| Demographic and socioeconomic variables: | | |
| Gender | Male, Female | 0% |
| Age (years) | {18, ..., 80} | 0% |
| Race∗ | White, Black, Hispanic, Other | 0% |
| Education Level∗ | <HS, =HS, >HS | 5% |
| Family Income∗ (FI) | Low, Middle, High | 4% |
| Uninsured∗ | Yes, No | 0.2% |
| Alcohol and drug use variables: | | |
| HeavyDrinker | Yes, No | 29% |
| UseNicotine | Yes, No | 15% |
| UsedMarijuana | Yes, No | 43% |
| UsedHardDrug | Yes, No | 30% |
| Health-related variables: | | |
| Body Mass Index (BMI, kg/m^2) | [13.4, 81.2] | 6% |
| HasHighBP (BPQ020 at link) | Yes, No | 0.1% |
| HasHighChol (BPQ080 at link) | Yes, No | 6% |
| HasDiabetes∗ | Yes, No | 0.08% |

Table 1: Variables in the analysis dataset with hyperlinks to the online NHANES descriptions. Annotated variables (∗) include minor modifications (e.g., collapsed categories) from the original NHANES variables.

The literature on imputation models is extensive, yet limited in its ability to address these critical challenges; see Murray (2018) for a thorough review. Broadly, there are two main frameworks for imputation. The first, fully conditional specification (FCS), imputes missing values by (i) specifying a univariate regression model for each variable in the dataset conditional on all other variables and (ii) using each regression model to impute (separately) the missing values for each variable (Van Buuren and Oudshoorn, 1999; Raghunathan et al., 2001). This approach offers several advantages: it is amenable to mixed data types, allows customization of each univariate model to increase flexibility (Burgette and Reiter, 2010; Tang and Ishwaran, 2017), and is implemented in freely available software (Van Buuren and Groothuis-Oudshoorn, 2011). However, FCS does not guarantee a valid joint distribution for the data, which is especially problematic for Bayesian inference, and is difficult to tune in high dimensions, since it requires a separate model fit for each variable. Perhaps most
