Probabilistic Missing Value Imputation
for Mixed Categorical and Ordered Data
Yuxuan Zhao
Cornell University
yz2295@cornell.edu
Alex Townsend
Cornell University
townsend@cornell.edu
Madeleine Udell
Stanford University
udell@stanford.edu
Abstract
Many real-world datasets contain missing entries and mixed data types including
categorical and ordered (e.g. continuous and ordinal) variables. Imputing the
missing entries is necessary, since many data analysis pipelines require complete
data, but this is challenging especially for mixed data. This paper proposes a proba-
bilistic imputation method using an extended Gaussian copula model that supports
both single and multiple imputation. This method models mixed categorical and
ordered data using a latent Gaussian distribution. The unordered nature of
categorical variables is explicitly modeled using the argmax operator. This model
makes no assumptions on the data marginals, nor does it require any hyperparameters.
Experimental results on synthetic and real datasets show that imputation with
the extended Gaussian copula outperforms the current state-of-the-art for both
categorical and ordered variables in mixed data.
1 Introduction
Modern datasets from healthcare, the sciences, and the social sciences often contain missing entries
and mixed data types such as continuous, ordinal, and categorical. Social survey datasets, for
example, are typically mixed because they include variables like age (continuous), demographic
group (categorical), and Likert scales (ordinal) measuring how strongly a respondent agrees with
certain stated opinions. Continuous variables are encoded as real numbers and sometimes called
numeric. We refer to variables that admit a total order (e.g. continuous and ordinal) as ordered
variables. In contrast, a categorical variable, also called nominal, can take one of a fixed number of
unordered values such as "A", "B", "AB", or "O" for blood type.
Most data analysis techniques require a complete dataset, so missing data imputation is an essential
preprocessing step. It is also often of interest to propagate imputation uncertainty into subsequent
analyses through multiple imputation, which generates several potentially different imputed datasets
[25]. An imputation method should ideally use all the collected data—regardless of data type—
to impute any missing data entries. However, most imputation approaches, whether explicitly or
implicitly, assume that each variable admits a total order and, as a result, cannot impute categorical
variables without proper preprocessing.
There is no satisfying, successful, and widely adopted method for imputing categorical variables,
especially in mixed datasets. It is tempting to reduce categorical imputation to ordinal imputation
using an integer encoding in which each category is assigned a number; however, this encoding
requires choosing an arbitrary ordering for the categories that may affect the results of downstream
(e.g., predictive) models. (This problem is most severe for linear and neural network models, whereas
tree-based models are less sensitive [31].) Instead, it is more common to use one-hot encoding to
represent a categorical variable using a binary vector with one entry for each category. With this
encoding, imputation methods developed for continuous or binary data, such as the popular low rank
matrix completion methods (LRMC) [24, 17, 29], can be jerry-rigged for categorical imputation.
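To make the encoding concrete, here is a minimal sketch of one-hot encoding with missing entries; the `one_hot` helper and its NaN convention for missing values are our illustration, not part of any cited LRMC method.

```python
import numpy as np

def one_hot(column, categories):
    """One-hot encode a categorical column: one binary indicator per
    category. A missing entry (None) becomes a row of NaNs, which a
    downstream imputer treats as missing."""
    out = np.full((len(column), len(categories)), np.nan)
    for i, value in enumerate(column):
        if value is not None:
            out[i] = [float(value == c) for c in categories]
    return out

# Blood-type example from the text, with one missing observation.
enc = one_hot(["A", "B", None, "O"], ["A", "B", "AB", "O"])
```

The encoded matrix has one column per category, so a p-column categorical dataset with K categories each expands to Kp columns before imputation.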
Preprint. Under review.
arXiv:2210.06673v1 [stat.ME] 13 Oct 2022
However, LRMC is only appropriate when the encoded matrix (containing both the ordered and the encoded categorical variables) is approximately low rank. It has been observed that LRMC performs poorly on long, skinny (many samples n and few features p) datasets because the low rank assumption fails [9, 35].
Iterative imputation methods including MICE [4] and missForest [27] can directly operate on the
unordered categories. MICE learns the conditional distribution of each variable (whether ordered
or unordered) using all other variables via linear and logistic regression. However, the learned
conditional models may be incompatible in the sense that no joint distribution consistent with all of them exists [4].
MissForest uses random forest to predict each variable and yields more accurate single imputations
[27], but cannot provide multiple imputation. Both methods converge slowly (or sometimes diverge)
for large datasets because they train many models on (possibly) ill-conditioned data.
This paper expands on the idea of modeling categorical data and multivariate data interaction using
a latent continuous space [11, 6]. This approach can explicitly model the categorical distribution,
which is critical for providing multiple imputation. This paper models the latent space as a Gaussian
distribution and models each categorical variable as the argmax of a latent Gaussian vector. This
choice is inspired by the Gaussian copula model [14, 8, 10], which yields high quality imputations
for ordered data [35]. See [36] for a concise review and [32] for comprehensive methodology about
Gaussian copula imputation. However, the Gaussian copula cannot be used when the data contains categorical variables.
This paper proposes the extended Gaussian copula to overcome this limitation by explicitly modeling
each data type. In contrast, other imputation methods typically require bespoke preprocessing for the
input data to perform well. For mixed categorical and ordered data, the extended Gaussian copula
model generates each categorical variable using a latent Gaussian vector and each ordered variable
using a latent Gaussian scalar. The latent Gaussian correlations capture the dependence structure.
Contributions. This paper makes three main contributions to the literature:
1. A probabilistic model, the extended Gaussian copula, for mixed data including continuous, ordinal, and categorical variables. The model is free of hyperparameters and makes no assumptions on the marginal distribution of each data type.
2. A single imputation method that empirically provides state-of-the-art accuracy for mixed data and a multiple imputation method that can quantify imputation uncertainty by measures such as the category probability for categorical variables and confidence intervals for continuous variables.
3. A robust and efficient parameter fitting algorithm for the extended Gaussian copula that supports further acceleration through parallelization, mini-batch training, and low rank structure.
Related work. Modeling a categorical variable as the argmax of a latent continuous vector is a
classical technique used in the multinomial probit model [3, 22] in the context of supervised learning.
Our model is closer to that of [6], which proposed one way to accommodate categorical
variables in the Gaussian copula. However, the model in [6] has many redundant parameters and is
therefore unidentifiable, while the extended Gaussian copula is identifiable and can match any categorical
marginal distribution.
2 Methodology
Notation. For f : R^K → R, we define the (possibly set-valued) pre-image of f as f^{-1}(y) := {x : f(x) = y}. We define argmax(z) of a vector z = (z_1, ..., z_p) as the index of the maximum entry, so that z_{argmax(z)} = max_{i=1,...,p} z_i. We use N(µ, Σ) to denote the multivariate normal distribution with mean µ and covariance Σ. We use 0 to denote the all-zero vector and 1 for the all-one vector, where the context determines the length of the vector.
We introduce our model in Section 2.1, then derive the model-based imputation methods in Section 2.2
and finally present model estimation algorithms in Section 2.3. Throughout the paper, we assume
the missing completely at random (MCAR) mechanism: missingness is uniform and independent
of the data. Nevertheless, we show our method performs reasonably well under missing at random
(MAR) and missing not at random (MNAR) assumptions through experiments. We also discuss how
violating the MCAR assumption may affect the assumptions of the proposed model in Section 4.
2.1 Extended Gaussian copula with categorical variables
We first show how to model a categorical variable by transforming a latent Gaussian vector. Then we
extend this model to generate categorical vectors and mixed categorical and ordered vectors. To ease
the notation, we assume that all categorical variables have K categories encoded as {"1", ..., "K"}. It is straightforward to allow categorical variables with different numbers of categories.
2.1.1 Univariate categorical variable
We model a univariate categorical variable x with K categories as the argmax of a K-dimensional latent Gaussian vector z = (z_1, ..., z_K) with some mean µ = (µ_1, ..., µ_K) and identity covariance. That is,

x := argmax(z + µ),  z ~ N(0, I_K).  (1)

We call the distribution in Eq. (1) the Gaussian-Max distribution. Without loss of generality, we assume µ_1 = 0, as the argmax is invariant under translations µ → µ + α1 for α ∈ R. While a dense covariance matrix is sometimes used for z [6], we prefer the model of Eq. (1) as it is identifiable. Theorem 1 states that any categorical distribution corresponds to a unique choice of µ under Eq. (1). All proofs are in the supplement. In Section 2.3.1 we describe an algorithm to estimate µ for a given categorical distribution.
Theorem 1 (Existence and Uniqueness). For any categorical distribution P[x = "k"] = p_k > 0 for k = 1, ..., K such that sum_{k=1}^K p_k = 1, there is a unique µ ∈ R^K with µ_1 = 0 such that

P_{z ~ N(0, I_K)}[argmax(z + µ) = k] = p_k,  k = 1, ..., K.  (2)
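A quick Monte Carlo sketch of the Gaussian-Max distribution in Eq. (1): with µ = 0, symmetry implies a uniform categorical marginal, which we can verify empirically. The function name and the check are ours, not from the paper.

```python
import numpy as np

def sample_gaussian_max(mu, n, rng=None):
    """Draw n samples from the Gaussian-Max distribution of Eq. (1):
    x = argmax(z + mu) with z ~ N(0, I_K)."""
    rng = np.random.default_rng(rng)
    K = len(mu)
    z = rng.standard_normal((n, K))   # z ~ N(0, I_K), one row per draw
    return np.argmax(z + mu, axis=1)  # category index in 0..K-1

# With mu = 0 every category is equally likely by symmetry.
x = sample_gaussian_max(np.zeros(3), 100_000, rng=0)
probs = np.bincount(x, minlength=3) / len(x)
```

Raising one entry of µ increases the probability of the corresponding category; Theorem 1 says the map from µ (with µ_1 = 0) to the categorical marginal is a bijection.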
2.1.2 Multivariate categorical vector
We model a categorical vector x = (x_1, ..., x_p) by supposing each of its entries x_j follows the Gaussian-Max distribution. That is, x_j = argmax(z^(j) + µ^(j)) for some µ^(j) and isotropic Gaussian z^(j). Additionally, the latent Gaussian vectors z^(j) corresponding to different categorical variables may be correlated. We call this model the categorical latent Gaussian (CLG) model (see Definition 1). For z := (z^(1), ..., z^(p)), we use [j] to denote the indices of z^(j) in z, so z_[j] = z^(j) for j = 1, ..., p.
Definition 1 (Categorical latent Gaussian). For a categorical vector x = (x_1, ..., x_p), we say x follows the categorical latent Gaussian model, x ~ CLG(Σ, µ), if there exists a correlation matrix Σ and a mean µ such that (1) for z ~ N(0, Σ), x_j = argmax(z_[j] + µ_[j]); and (2) Σ_[j],[j] = I_K for every j = 1, ..., p.
In the CLG model, the value of µ_[j] suffices to determine the marginal distribution of each categorical x_j, as the marginal distribution of z_[j] is N(0, I_K) and independent of Σ. The correlation matrix Σ introduces dependencies between the different categorical variables in x.
Consider two categorical variables x_{j1} and x_{j2} whose joint distribution is described by P ∈ R^{K×K}, where p_{kl} = P[x_{j1} = k, x_{j2} = l]. The CLG model captures the dependence between x_{j1} and x_{j2} by the correlation submatrix Σ_[j1],[j2]. Note that P has only (K-1)^2 free parameters, as the rows and columns must sum to the associated marginal probabilities. This means that Σ_[j1],[j2] in the CLG model is not identifiable. We develop an identifiable variant of the CLG model in the supplement that uses only (K-1)^2 parameters in Σ_[j1],[j2]. Both models share similar imputation performance, but the identifiable model is not invariant under permutation of the category labels, while the model in Definition 1 is. For the rest of this paper, we use the permutation-invariant CLG model.
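The generative story of Definition 1 can be sketched directly: sample one latent Gaussian vector per variable, with identity diagonal blocks of Σ and correlated cross-blocks. The helper, the two-category example, and the cross-block correlation value are our own illustration.

```python
import numpy as np

def sample_clg(Sigma, mus, n, rng=None):
    """Sample n draws of a categorical vector from the CLG model of
    Definition 1: z ~ N(0, Sigma), x_j = argmax(z_[j] + mu_[j])."""
    rng = np.random.default_rng(rng)
    z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n)
    cols, start = [], 0
    for mu in mus:  # mus[j] is the K_j-vector mu_[j]
        K = len(mu)
        cols.append(np.argmax(z[:, start:start + K] + mu, axis=1))
        start += K
    return np.column_stack(cols)

# Two categorical variables with K = 2: identity diagonal blocks,
# and a positive cross-block correlation rho that couples them.
rho = 0.8
Sigma = np.eye(4)
Sigma[0, 2] = Sigma[2, 0] = rho
Sigma[1, 3] = Sigma[3, 1] = rho
X = sample_clg(Sigma, [np.zeros(2), np.zeros(2)], 50_000, rng=0)
agree = np.mean(X[:, 0] == X[:, 1])
```

Because the latent blocks are positively correlated, the two categorical variables tend to take the same label, even though each marginal is uniform over its two categories.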
2.1.3 Mixed categorical and ordered vector
We model mixed data that contains both categorical and ordered variables by combining the CLG
model for categorical variables with the Gaussian copula model for ordered variables [10, 35].
To model x with ordered variables, the Gaussian copula assumes x ~ GC(Σ, f) is generated as an elementwise transformed Gaussian, i.e., x = f(z) = (f_1(z_1), ..., f_p(z_p)), where each f_j is a monotone increasing function and z ~ N(0, Σ). Denoting the cumulative distribution function (CDF) of x_j as F_j, the transformation f_j is uniquely determined as f_j = F_j^{-1} ∘ Φ, where Φ denotes the standard Gaussian CDF and F_j^{-1}(y) := inf{x ∈ R : F_j(x) ≥ y}. Specifically, ordinals result from thresholding the latent Gaussian variable: for an ordinal x_j, the transformation f_j(z) has the form sum_{s ∈ S} 1(z > s), where 1(z > s) is 1 if z > s and 0 otherwise, and the set S of thresholds is determined by F_j. The CLG model has the same form as x = f(z) by writing

x_j = f_j(z_[j]; µ_[j]) := argmax(z_[j] + µ_[j]).  (3)
Hence, we propose the extended Gaussian copula (EGC) model.
Definition 2 (Extended Gaussian copula). Write a mixed data vector as x = (x_cat, x_ord), where x_cat collects all categorical variables and x_ord collects all ordered variables. We say x follows the extended Gaussian copula, x ~ EGC(Σ, f_ord, µ), if there exists a correlation matrix Σ, an elementwise monotone f_ord, and a µ such that x_cat ~ CLG(Σ_cat,cat, µ) and x_ord ~ GC(Σ_ord,ord, f_ord), where

Σ = [ Σ_cat,cat   Σ_cat,ord
      Σ_ord,cat   Σ_ord,ord ].
We can also write x = f(z) = (f_cat(z_cat; µ), f_ord(z_ord)) for z = (z_cat, z_ord) ~ N(0, Σ), where f is defined by all transformation parameters, i.e., f_ord and µ. Unlike in the original Gaussian copula, the dimension of the latent z does not match that of the data x; instead, it is larger, as subvectors of z correspond to categorical entries of x. The correlation matrix Σ must also obey the constraints on the submatrix Σ_cat,cat given in Definition 1.
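A sketch of the marginal transformations f_j = F_j^{-1} ∘ Φ for the ordered variables: for a continuous variable, F_j can be taken as the empirical CDF of the observed values (so no marginal assumption is made), and an ordinal results from thresholding the latent Gaussian. The helper names and sample data are ours.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Phi, the standard Gaussian CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def f_continuous(z, observed):
    """f_j = F_j^{-1} o Phi, with F_j replaced by the empirical
    quantile function of the observed values."""
    return np.quantile(observed, std_normal_cdf(z))

def f_ordinal(z, thresholds):
    """Ordinal marginal: f_j(z) = sum_{s in S} 1(z > s)."""
    return sum(z > s for s in thresholds)

obs = np.array([2.0, 3.5, 5.0, 7.0, 11.0])
x_med = f_continuous(0.0, obs)        # Phi(0) = 0.5, the empirical median
level = f_ordinal(0.3, [-1.0, 0.0, 1.0])
```

Monotonicity of f_j is what lets imputation invert the observation x_j into a latent interval f_j^{-1}(x_j), which Section 2.2 relies on.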
2.2 Missing data imputation
Now we show how to impute missing values given an EGC model with known parameters. For model
parameter estimation, see Section 2.3.
Note that each categorical x_j corresponds to K consecutive entries in z: the subvector z_[j] that generates x_j. We use [I] to denote the set of indices in z that corresponds to a set of indices I from x; for example, [j] follows this notation with I = {j}. Two sets of indices are particularly important: the observed entries O and the missing entries M of a given mixed data vector x. Thus [O] and [M] are the latent dimensions in z that generate x_O and x_M, respectively.
Our imputation strategy follows [35]. First, the algorithm identifies the region of the latent space containing z_[O] that is consistent with the observed entries. Then, it imputes the latent dimensions by exploiting the Gaussian distribution of z_[M] conditional on z_[O]. Finally, the algorithm uses the marginal transformation f to produce imputations in the data space x_M. Pictorially, we can visualize this process as

x_O  --(f_O^{-1})-->  z_[O]  --(Σ)-->  z_[M]  --(f_M)-->  x_M.
Multiple imputation. Multiple imputation creates several imputed datasets by sampling from the distribution of the missing entries conditional on the observations. For the EGC model, Algorithm 1 samples from the distribution of x_M using two facts: (1) given x_O, the variable z_[O] follows a Gaussian distribution truncated to f_O^{-1}(x_O) = prod_{j ∈ O} f_j^{-1}(x_j); and (2) the random variable z_[M] | z_[O] is Gaussian. Section 2.3.3 shows how to sample from the truncated Gaussian random variable z_[O] | x_O. Using the empirical distribution of the drawn samples, we can estimate the category probability for a missing categorical variable and build confidence intervals for a missing continuous variable.
Algorithm 1 Multiple imputation via the extended Gaussian copula
1: Input: number of imputations m, data vector x observed at O, model parameters f and Σ.
2: for s = 1, 2, ..., m do
3:   Sample ẑ^(s)_[O] ~ z_[O] | x_O: N(0, Σ_[O],[O]) truncated to f_O^{-1}(x_O)
4:   Sample ẑ^(s)_[M] ~ z_[M] | z_[O]: N(Σ_[M],[O] Σ_[O],[O]^{-1} ẑ^(s)_[O], Σ_[M],[M] − Σ_[M],[O] Σ_[O],[O]^{-1} Σ_[O],[M])
5:   Compute x̂^(s)_M = f_M(ẑ^(s)_[M])
6: end for
7: Output: {x̂^(s)_M | s = 1, ..., m}.