Probabilistic Missing Value Imputation
for Mixed Categorical and Ordered Data
Yuxuan Zhao
Cornell University
yz2295@cornell.edu
Alex Townsend
Cornell University
townsend@cornell.edu
Madeleine Udell
Stanford University
udell@stanford.edu
Abstract
Many real-world datasets contain missing entries and mixed data types including
categorical and ordered (e.g. continuous and ordinal) variables. Imputing the
missing entries is necessary, since many data analysis pipelines require complete
data, but this is challenging especially for mixed data. This paper proposes a proba-
bilistic imputation method using an extended Gaussian copula model that supports
both single and multiple imputation. This method models mixed categorical and
ordered data using a latent Gaussian distribution. The unordered nature of
categorical variables is explicitly modeled using the argmax operator. This model
makes no assumptions on the data marginals, nor does it require any hyperparameters.
Experimental results on synthetic and real datasets show that imputation with
the extended Gaussian copula outperforms the current state-of-the-art for both
categorical and ordered variables in mixed data.
1 Introduction
Modern datasets from healthcare, the sciences, and the social sciences often contain missing entries
and mixed data types such as continuous, ordinal, and categorical. Social survey datasets, for
example, are typically mixed because they include variables like age (continuous), demographic
group (categorical), and Likert scales (ordinal) measuring how strongly a respondent agrees with
certain stated opinions. Continuous variables are encoded as real numbers and sometimes called
numeric. We refer to variables that admit a total order (e.g. continuous and ordinal) as ordered
variables. In contrast, a categorical variable, also called nominal, can take one of a fixed number of
unordered values such as "A", "B", "AB", or "O" for blood type.
Most data analysis techniques require a complete dataset, so missing data imputation is an essential
preprocessing step. It is also often of interest to propagate imputation uncertainty into subsequent
analyses through multiple imputation, which generates several potentially different imputed datasets
[25]. An imputation method should ideally use all the collected data—regardless of data type—
to impute any missing data entries. However, most imputation approaches, whether explicitly or
implicitly, assume that each variable admits a total order and, as a result, cannot impute categorical
variables without proper preprocessing.
There is no satisfying, successful, and widely adopted method for imputing categorical variables,
especially in mixed datasets. It is tempting to reduce categorical imputation to ordinal imputation
using an integer encoding in which each category is assigned a number; however, this encoding
requires choosing an arbitrary ordering for the categories that may affect the results of downstream
(e.g., predictive) models. (This problem is most severe for linear and neural network models, whereas
tree-based models are less sensitive [31].) Instead, it is more common to use one-hot encoding to
represent a categorical variable using a binary vector with one entry for each category. With this
encoding, imputation methods developed for continuous or binary data, such as the popular low rank
matrix completion methods (LRMC) [24, 17, 29], can be jerry-rigged for categorical imputation.
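To make the encoding concrete, here is a minimal sketch of one-hot encoding with missing entries; the `one_hot` helper and its NaN convention for missing values are our illustration, not part of any cited LRMC method.

```python
import numpy as np

def one_hot(column, categories):
    """One-hot encode a categorical column: one binary indicator per
    category. A missing entry (None) becomes a row of NaNs, which a
    downstream imputer treats as missing."""
    out = np.full((len(column), len(categories)), np.nan)
    for i, value in enumerate(column):
        if value is not None:
            out[i] = [float(value == c) for c in categories]
    return out

# Blood-type example from the text, with one missing observation.
enc = one_hot(["A", "B", None, "O"], ["A", "B", "AB", "O"])
```

The encoded matrix has one column per category, so a p-column categorical dataset with K categories each expands to Kp columns before imputation.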
Preprint. Under review.
arXiv:2210.06673v1 [stat.ME] 13 Oct 2022
However, LRMC is only appropriate when the encoded matrix (containing both the ordered and the encoded categorical variables) is approximately low rank. It has been observed that LRMC performs poorly on long, skinny (many samples n and few features p) datasets because the low rank assumption fails [9, 35].
Iterative imputation methods including MICE [4] and missForest [27] can directly operate on the
unordered categories. MICE learns the conditional distribution of each variable (whether ordered
or unordered) using all other variables via linear and logistic regression. However, the learned
conditional models may be incompatible in the sense that no joint distribution consistent with all of them exists [4].
MissForest uses random forest to predict each variable and yields more accurate single imputations
[27], but cannot provide multiple imputation. Both methods converge slowly (or sometimes diverge)
for large datasets because they train many models on (possibly) ill-conditioned data.
This paper expands on the idea of modeling categorical data and multivariate data interaction using
a latent continuous space [11, 6]. This approach can explicitly model the categorical distribution,
which is critical for providing multiple imputation. This paper models the latent space as a Gaussian
distribution and models each categorical variable as the argmax of a latent Gaussian vector. This
choice is inspired by the Gaussian copula model [14, 8, 10], which yields high quality imputations
for ordered data [35]. See [36] for a concise review and [32] for comprehensive methodology about
Gaussian copula imputation. However, the Gaussian copula cannot be used when the data contains categorical variables.
This paper proposes the extended Gaussian copula to overcome this limitation by explicitly modeling
each data type. In contrast, other imputation methods typically require bespoke preprocessing for the
input data to perform well. For mixed categorical and ordered data, the extended Gaussian copula
model generates each categorical variable using a latent Gaussian vector and each ordered variable
using a latent Gaussian scalar. The latent Gaussian correlations capture the dependence structure.
Contributions. This paper makes three main contributions to the literature:
1. A probabilistic model, the extended Gaussian copula, for mixed data including continuous, ordinal, and categorical variables. The model is free of hyperparameters and makes no assumptions on the marginal distribution of each data type.
2. A single imputation method that empirically provides state-of-the-art accuracy for mixed data and a multiple imputation method that can quantify imputation uncertainty by measures such as the category probability for categorical variables and confidence intervals for continuous variables.
3. A robust and efficient parameter fitting algorithm for the extended Gaussian copula that supports further acceleration through parallelization, mini-batch training, and low rank structure.
Related work. Modeling a categorical variable as the argmax of a latent continuous vector is a
classical technique used in the multinomial probit model [3, 22] in the context of supervised learning.
Our model is closer to that of [6], which proposed one way to accommodate categorical
variables in the Gaussian copula. However, the model in [6] has many redundant parameters and is
therefore unidentifiable, while the extended Gaussian copula is identifiable and can match any categorical
marginal distribution.
2 Methodology
Notation. For f : R^K → R, we define the (possibly set-valued) pre-image of f as f^{-1}(y) := {x : f(x) = y}. We define argmax(z) of a vector z = (z_1, ..., z_p) as the index of the maximum entry, so that z_{argmax(z)} = max_{i=1,...,p} z_i. We use N(µ, Σ) to denote the multivariate normal distribution with mean µ and covariance Σ. We use 0 to denote the all-zero vector and 1 for the all-one vector, where the context determines the length of the vector.
We introduce our model in Section 2.1, then derive the model-based imputation methods in Section 2.2
and finally present model estimation algorithms in Section 2.3. Throughout the paper, we assume
the missing completely at random (MCAR) mechanism: missingness is uniform and independent
of the data. Nevertheless, we show our method performs reasonably well under missing at random
(MAR) and missing not at random (MNAR) assumptions through experiments. We also discuss how
violating the MCAR assumption may affect the assumptions of the proposed model in Section 4.
2.1 Extended Gaussian copula with categorical variables
We first show how to model a categorical variable by transforming a latent Gaussian vector. Then we
extend this model to generate categorical vectors and mixed categorical and ordered vectors. To ease
the notation, we assume that all categorical variables have K categories encoded as {"1", ..., "K"}. It is straightforward to allow categorical variables with different numbers of categories.
2.1.1 Univariate categorical variable
We model a univariate categorical variable x with K categories as the argmax of a K-dimensional latent Gaussian vector z = (z_1, ..., z_K) with some mean µ = (µ_1, ..., µ_K) and identity covariance. That is,

x := argmax(z + µ),  z ~ N(0, I_K).  (1)

We call the distribution in Eq. (1) the Gaussian-Max distribution. Without loss of generality, we assume µ_1 = 0, as the argmax is invariant under translations µ → µ + α1 for α ∈ R. While a dense covariance matrix is sometimes used for z [6], we prefer the model of Eq. (1) as it is identifiable. Theorem 1 states that any categorical distribution corresponds to a unique choice of µ under Eq. (1). All proofs are in the supplement. In Section 2.3.1 we describe an algorithm to estimate µ for a given categorical distribution.
Theorem 1 (Existence and Uniqueness). For any categorical distribution P[x = "k"] = p_k > 0 for k = 1, ..., K such that sum_{k=1}^K p_k = 1, there is a unique µ ∈ R^K with µ_1 = 0 such that

P_{z ~ N(0, I_K)}[argmax(z + µ) = k] = p_k,  k = 1, ..., K.  (2)
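A quick Monte Carlo sketch of the Gaussian-Max distribution in Eq. (1): with µ = 0, symmetry implies a uniform categorical marginal, which we can verify empirically. The function name and the check are ours, not from the paper.

```python
import numpy as np

def sample_gaussian_max(mu, n, rng=None):
    """Draw n samples from the Gaussian-Max distribution of Eq. (1):
    x = argmax(z + mu) with z ~ N(0, I_K)."""
    rng = np.random.default_rng(rng)
    K = len(mu)
    z = rng.standard_normal((n, K))   # z ~ N(0, I_K), one row per draw
    return np.argmax(z + mu, axis=1)  # category index in 0..K-1

# With mu = 0 every category is equally likely by symmetry.
x = sample_gaussian_max(np.zeros(3), 100_000, rng=0)
probs = np.bincount(x, minlength=3) / len(x)
```

Raising one entry of µ increases the probability of the corresponding category; Theorem 1 says the map from µ (with µ_1 = 0) to the categorical marginal is a bijection.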
2.1.2 Multivariate categorical vector
We model a categorical vector x = (x_1, ..., x_p) by supposing each of its entries x_j follows the Gaussian-Max distribution. That is, x_j = argmax(z^(j) + µ^(j)) for some µ^(j) and isotropic Gaussian z^(j). Additionally, the latent Gaussian vectors z^(j) corresponding to different categorical variables may be correlated. We call this model the categorical latent Gaussian (CLG) model (see Definition 1). For z := (z^(1), ..., z^(p)), we use [j] to denote the indices of z^(j) in z, so z_[j] = z^(j) for j = 1, ..., p.
Definition 1 (Categorical latent Gaussian). For a categorical vector x = (x_1, ..., x_p), we say x follows the categorical latent Gaussian model, x ~ CLG(Σ, µ), if there exists a correlation matrix Σ and a mean µ such that (1) for z ~ N(0, Σ), x_j = argmax(z_[j] + µ_[j]); and (2) Σ_[j],[j] = I_K for every j = 1, ..., p.
In the CLG model, the value of µ_[j] suffices to determine the marginal distribution of each categorical x_j, as the marginal distribution of z_[j] is N(0, I_K) and independent of Σ. The correlation matrix Σ introduces dependencies between the different categorical variables in x.
Consider two categorical variables x_{j1} and x_{j2} whose joint distribution is described by P ∈ R^{K×K}, where p_{kl} = P[x_{j1} = k, x_{j2} = l]. The CLG model captures the dependence between x_{j1} and x_{j2} by the correlation submatrix Σ_[j1],[j2]. Note that P has only (K-1)^2 free parameters, as the rows and columns must sum to the associated marginal probabilities. This means that Σ_[j1],[j2] in the CLG model is not identifiable. We develop an identifiable variant of the CLG model in the supplement that uses only (K-1)^2 parameters in Σ_[j1],[j2]. Both models share similar imputation performance, but the identifiable model is not invariant under permutation of the category labels, while the model in Definition 1 is. For the rest of this paper, we use the permutation-invariant CLG model.
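The generative story of Definition 1 can be sketched directly: sample one latent Gaussian vector per variable, with identity diagonal blocks of Σ and correlated cross-blocks. The helper, the two-category example, and the cross-block correlation value are our own illustration.

```python
import numpy as np

def sample_clg(Sigma, mus, n, rng=None):
    """Sample n draws of a categorical vector from the CLG model of
    Definition 1: z ~ N(0, Sigma), x_j = argmax(z_[j] + mu_[j])."""
    rng = np.random.default_rng(rng)
    z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n)
    cols, start = [], 0
    for mu in mus:  # mus[j] is the K_j-vector mu_[j]
        K = len(mu)
        cols.append(np.argmax(z[:, start:start + K] + mu, axis=1))
        start += K
    return np.column_stack(cols)

# Two categorical variables with K = 2: identity diagonal blocks,
# and a positive cross-block correlation rho that couples them.
rho = 0.8
Sigma = np.eye(4)
Sigma[0, 2] = Sigma[2, 0] = rho
Sigma[1, 3] = Sigma[3, 1] = rho
X = sample_clg(Sigma, [np.zeros(2), np.zeros(2)], 50_000, rng=0)
agree = np.mean(X[:, 0] == X[:, 1])
```

Because the latent blocks are positively correlated, the two categorical variables tend to take the same label, even though each marginal is uniform over its two categories.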
2.1.3 Mixed categorical and ordered vector
We model mixed data that contains both categorical and ordered variables by combining the CLG
model for categorical variables with the Gaussian copula model for ordered variables [10, 35].
To model x with ordered variables, the Gaussian copula assumes x ~ GC(Σ, f) is generated as an elementwise transformed Gaussian, i.e., x = f(z) = (f_1(z_1), ..., f_p(z_p)), where each f_j is a monotone increasing function and z ~ N(0, Σ). Denoting the cumulative distribution function (CDF) of x_j as F_j, the transformation f_j is uniquely determined as f_j = F_j^{-1} ∘ Φ, where Φ denotes the standard Gaussian CDF and F_j^{-1}(y) := inf{x ∈ R : F_j(x) ≥ y}. Specifically, ordinals result from thresholding the latent Gaussian variable: for an ordinal x_j, the transformation f_j(z) has the form sum_{s ∈ S} 1(z > s), where 1(z > s) is 1 if z > s and 0 otherwise, and the set S of thresholds is determined by F_j. The CLG model has the same form as x = f(z) by writing

x_j = f_j(z_[j]; µ_[j]) := argmax(z_[j] + µ_[j]).  (3)
Hence, we propose the extended Gaussian copula (EGC) model.
Definition 2 (Extended Gaussian copula). Write a mixed data vector as x = (x_cat, x_ord), where x_cat collects all categorical variables and x_ord collects all ordered variables. We say x follows the extended Gaussian copula, x ~ EGC(Σ, f_ord, µ), if there exists a correlation matrix Σ, an elementwise monotone f_ord, and a µ such that x_cat ~ CLG(Σ_cat,cat, µ) and x_ord ~ GC(Σ_ord,ord, f_ord), where

Σ = [ Σ_cat,cat   Σ_cat,ord
      Σ_ord,cat   Σ_ord,ord ].
We can also write x = f(z) = (f_cat(z_cat; µ), f_ord(z_ord)) for z = (z_cat, z_ord) ~ N(0, Σ), where f is defined by all transformation parameters, i.e., f_ord and µ. Unlike in the original Gaussian copula, the dimension of the latent z does not match that of the data x; instead, it is larger, as subvectors of z correspond to categorical entries of x. The correlation matrix Σ must also obey the constraints on the submatrix Σ_cat,cat given in Definition 1.
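A sketch of the marginal transformations f_j = F_j^{-1} ∘ Φ for the ordered variables: for a continuous variable, F_j can be taken as the empirical CDF of the observed values (so no marginal assumption is made), and an ordinal results from thresholding the latent Gaussian. The helper names and sample data are ours.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Phi, the standard Gaussian CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def f_continuous(z, observed):
    """f_j = F_j^{-1} o Phi, with F_j replaced by the empirical
    quantile function of the observed values."""
    return np.quantile(observed, std_normal_cdf(z))

def f_ordinal(z, thresholds):
    """Ordinal marginal: f_j(z) = sum_{s in S} 1(z > s)."""
    return sum(z > s for s in thresholds)

obs = np.array([2.0, 3.5, 5.0, 7.0, 11.0])
x_med = f_continuous(0.0, obs)        # Phi(0) = 0.5, the empirical median
level = f_ordinal(0.3, [-1.0, 0.0, 1.0])
```

Monotonicity of f_j is what lets imputation invert the observation x_j into a latent interval f_j^{-1}(x_j), which Section 2.2 relies on.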
2.2 Missing data imputation
Now we show how to impute missing values given an EGC model with known parameters. For model
parameter estimation, see Section 2.3.
Note that each categorical x_j corresponds to K consecutive entries in z: the subvector z_[j] that generates x_j. We use [I] to denote the set of indices in z that corresponds to a set of indices I from x; for example, [j] follows this notation with I = {j}. Two sets of indices are particularly important: the observed entries O and the missing entries M of a given mixed data vector x. Thus [O] and [M] are the latent dimensions in z that generate x_O and x_M, respectively.
Our imputation strategy follows [35]. First, the algorithm identifies the region of the latent space containing z_[O] that is consistent with the observed entries. Then, it imputes the latent dimensions by exploiting the Gaussian distribution of z_[M] conditional on z_[O]. Finally, the algorithm uses the marginal transformation f to produce imputations in the data space x_M. Pictorially, we can visualize this process as

x_O  --(f_O^{-1})-->  z_[O]  --(Σ)-->  z_[M]  --(f_M)-->  x_M.
Multiple imputation. Multiple imputation creates several imputed datasets by sampling from the distribution of the missing entries conditional on the observations. For the EGC model, Algorithm 1 samples from the distribution of x_M using two facts: (1) given x_O, the variable z_[O] follows a Gaussian distribution truncated to f_O^{-1}(x_O) = prod_{j ∈ O} f_j^{-1}(x_j); and (2) the random variable z_[M] | z_[O] is Gaussian. Section 2.3.3 shows how to sample from the truncated Gaussian random variable z_[O] | x_O. Using the empirical distribution of the drawn samples, we can estimate the category probability for a missing categorical variable and build confidence intervals for a missing continuous variable.
Algorithm 1 Multiple imputation via the extended Gaussian copula
1: Input: number of imputations m, data vector x observed at O, model parameters f and Σ.
2: for s = 1, 2, ..., m do
3:   Sample ẑ^(s)_[O] ~ z_[O] | x_O: N(0, Σ_[O],[O]) truncated to f_O^{-1}(x_O)
4:   Sample ẑ^(s)_[M] ~ z_[M] | z_[O]: N(Σ_[M],[O] Σ_[O],[O]^{-1} ẑ^(s)_[O], Σ_[M],[M] − Σ_[M],[O] Σ_[O],[O]^{-1} Σ_[O],[M])
5:   Compute x̂^(s)_M = f_M(ẑ^(s)_[M])
6: end for
7: Output: {x̂^(s)_M | s = 1, ..., m}.