2.1 Extended Gaussian copula with categorical variables
We first show how to model a categorical variable by transforming a latent Gaussian vector. Then we extend this model to generate categorical vectors and mixed categorical and ordered vectors. To ease the notation, we assume that all categorical variables have $K$ categories encoded as $\{\text{“1”}, \ldots, \text{“K”}\}$. It is straightforward to allow categorical variables with different numbers of categories.
2.1.1 Univariate categorical variable
We model a univariate categorical variable $x$ with $K$ categories as the argmax of a $K$-dimensional latent Gaussian vector with mean $\mu = (\mu_1, \ldots, \mu_K)$ and identity covariance. That is, with $z = (z_1, \ldots, z_K)$,

$$x := \operatorname{argmax}(z + \mu), \qquad z \sim \mathcal{N}(0, I_K). \tag{1}$$

We call the distribution in Eq. (1) the Gaussian-Max distribution. Without loss of generality, we assume $\mu_1 = 0$, as the argmax is invariant under translations of the mean, i.e., $\mu \leftarrow \mu + \alpha \mathbf{1}$ for $\alpha \in \mathbb{R}$. While a dense covariance matrix is sometimes used for $z$ [6], we prefer the model of Eq. (1) as it is identifiable. Theorem 1 states that any categorical distribution corresponds to a unique choice of $\mu$ under Eq. (1). All proofs are in the supplement. In Section 2.3.1 we describe an algorithm to estimate $\mu$ for a given categorical distribution.
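The Gaussian-Max distribution of Eq. (1) can be explored numerically. The following is a minimal Monte Carlo sketch, not the estimation algorithm of Section 2.3.1; the function names `sample_gaussian_max` and `empirical_probs`, the example mean vector, and the sample size are all illustrative assumptions. It draws samples of the argmax and checks that adding a constant to every entry of the mean leaves the category frequencies unchanged up to sampling noise.

```python
import numpy as np

def sample_gaussian_max(mu, n, rng):
    """Draw n samples of x = argmax(z + mu), z ~ N(0, I_K); labels are 1..K."""
    mu = np.asarray(mu, dtype=float)
    z = rng.standard_normal((n, mu.size))   # one row per latent vector z
    return np.argmax(z + mu, axis=1) + 1    # shift 0-based index to labels 1..K

def empirical_probs(x, K):
    """Empirical frequency of each label 1..K in the sample x."""
    return np.bincount(x, minlength=K + 1)[1:] / x.size

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.5, -1.0])             # hypothetical mean with mu_1 = 0
n = 200_000

p_hat = empirical_probs(sample_gaussian_max(mu, n, rng), mu.size)

# Translating every entry of mu by the same constant should not change
# the distribution of the argmax (only the sampling noise differs).
p_shift = empirical_probs(sample_gaussian_max(mu + 3.0, n, rng), mu.size)

print("estimated probabilities:      ", p_hat)
print("after shifting mu by 3.0 * 1: ", p_shift)
```

Larger entries of the mean correspond to more probable categories, which is why fixing one entry (here the first) removes the only redundancy in the parameterization.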
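Theorem 1 (Existence and Uniqueness). For any categorical distribution $\mathbb{P}[x = \text{“k”}] = p_k > 0$ for $k = 1, \ldots, K$ such that $\sum_{k=1}^{K} p_k = 1$, there is a unique $\mu \in \mathbb{R}^K$ with $\mu_1 = 0$ such that

$$\mathbb{P}_{z \sim \mathcal{N}(0, I_K)}[\operatorname{argmax}(z + \mu) = k] = p_k, \qquad k = 1, \ldots, K. \tag{2}$$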
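As an illustration of Theorem 1 (a worked special case that is not part of the original text), take $K = 2$ with $\mu = (0, \mu_2)$. Category 2 is selected exactly when $z_2 + \mu_2 > z_1$, and $z_1 - z_2 \sim \mathcal{N}(0, 2)$, so

$$p_2 = \mathbb{P}[z_1 - z_2 < \mu_2] = \Phi\!\left(\frac{\mu_2}{\sqrt{2}}\right), \qquad \text{hence} \qquad \mu_2 = \sqrt{2}\,\Phi^{-1}(p_2),$$

where $\Phi$ is the standard Gaussian CDF. Because $\Phi$ is strictly increasing and maps $\mathbb{R}$ onto $(0, 1)$, every $p_2 \in (0, 1)$ corresponds to exactly one $\mu_2$, which is the existence and uniqueness claim in this simplest case. For example, $p_2 = 1/2$ gives $\mu_2 = 0$.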
2.1.2 Multivariate categorical vector
We model a categorical vector $x = (x_1, \ldots, x_p)$ by supposing each of its entries $x_j$ follows the Gaussian-Max distribution. That is, $x_j = \operatorname{argmax}(z^{(j)} + \mu^{(j)})$ for some $\mu^{(j)}$ and isotropic Gaussian $z^{(j)}$. Additionally, the latent Gaussian variables $z^{(j)}$ corresponding to different categorical variables may be correlated. We call this model the categorical latent Gaussian (CLG) model (see Definition 1). For $z := (z^{(1)}, \ldots, z^{(p)})$, we use $[j]$ to denote the indices of $z^{(j)}$ in $z$, so $z_{[j]} = z^{(j)}$ for $j = 1, \ldots, p$.
Definition 1 (Categorical latent Gaussian). For a categorical vector $x = (x_1, \ldots, x_p)$, we say $x$ follows the categorical latent Gaussian model, $x \sim \mathrm{CLG}(\Sigma, \mu)$, if there exist a correlation matrix $\Sigma$ and a vector $\mu$ such that, for every $j = 1, \ldots, p$: (1) $x_j = \operatorname{argmax}(z_{[j]} + \mu_{[j]})$ for $z \sim \mathcal{N}(0, \Sigma)$; (2) $\Sigma_{[j],[j]} = I_K$.
In the CLG model, the value of $\mu_{[j]}$ suffices to determine the marginal distribution of each categorical $x_j$, as the marginal distribution of $z_{[j]}$ is $\mathcal{N}(0, I_K)$ and independent of $\Sigma$. The correlation matrix $\Sigma$ introduces dependencies between different categorical variables in $x$.
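To make the roles of the mean and the correlation matrix concrete, the sketch below samples from a small CLG model with $p = 2$ variables and $K = 3$ categories. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_clg`, the choice of off-diagonal block `C = 0.5 * I`, the zero mean, and the sample size are all hypothetical. The diagonal blocks of the correlation matrix are identity matrices, as Definition 1 requires.

```python
import numpy as np

def sample_clg(Sigma, mu, K, n, rng):
    """Sample n categorical vectors with x_j = argmax(z_[j] + mu_[j]), z ~ N(0, Sigma).

    Sigma is (p*K, p*K) with identity diagonal blocks; mu has length p*K.
    Returns an (n, p) integer array with labels in 1..K.
    """
    p = mu.size // K
    z = rng.multivariate_normal(np.zeros(p * K), Sigma, size=n)
    blocks = (z + mu).reshape(n, p, K)      # blocks[:, j, :] holds z_[j] + mu_[j]
    return np.argmax(blocks, axis=2) + 1

K, p = 3, 2
C = 0.5 * np.eye(K)                         # hypothetical cross-correlation block
Sigma = np.block([[np.eye(K), C],
                  [C.T, np.eye(K)]])        # identity diagonal blocks
mu = np.zeros(p * K)                        # zero mean: uniform marginals

rng = np.random.default_rng(0)
x = sample_clg(Sigma, mu, K, 200_000, rng)

# Marginal of x_1 depends only on mu_[1]; dependence comes from the block C.
marginal_1 = np.bincount(x[:, 0], minlength=K + 1)[1:] / x.shape[0]
agreement = np.mean(x[:, 0] == x[:, 1])

print("marginal of x_1:", marginal_1)
print("P[x_1 = x_2]:   ", agreement, "(1/3 under independence)")
```

With a zero off-diagonal block the two variables would be independent and agree about one third of the time; the positive cross-correlation raises that agreement rate while leaving each marginal untouched.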
Consider two categorical variables $x_{j_1}$ and $x_{j_2}$ whose joint distribution is described by $P \in \mathbb{R}^{K \times K}$, where $p_{kl} = \mathbb{P}[x_{j_1} = k, x_{j_2} = l]$. The CLG model captures the dependence between $x_{j_1}$ and $x_{j_2}$ by the correlation submatrix $\Sigma_{[j_1],[j_2]}$. Note that $P$ has only $(K-1)^2$ free parameters, as the rows and columns must sum to the associated marginal probabilities, whereas $\Sigma_{[j_1],[j_2]}$ has $K^2$ entries. This means that $\Sigma_{[j_1],[j_2]}$ in the CLG model is not identifiable. We develop an identifiable variant of the CLG model in the supplement that uses only $(K-1)^2$ parameters in $\Sigma_{[j_1],[j_2]}$. Both models share similar imputation performance, but the identifiable model is not invariant under permutation of the categorical labels, while the model in Definition 1 is. For the rest of this paper, we use the permutation-invariant CLG model.
2.1.3 Mixed categorical and ordered vector
We model mixed data that contains both categorical and ordered variables by combining the CLG
model for categorical variables with the Gaussian copula model for ordered variables [10, 35].
To model $x$ with ordered variables, the Gaussian copula assumes $x \sim \mathrm{GC}(\Sigma, f)$ is generated as an elementwise transformed Gaussian, i.e., $x = f(z) = (f_1(z_1), \ldots, f_p(z_p))$, where each $f_j$ is a monotonically increasing function and $z \sim \mathcal{N}(0, \Sigma)$. Denoting the cumulative distribution function (CDF) of $x_j$ as $F_j$, the transformation $f_j$ is uniquely determined as $f_j = F_j^{-1} \circ \Phi$, where $\Phi$ denotes the standard Gaussian CDF and $F_j^{-1}(y) := \inf\{x \in \mathbb{R} : F_j(x) \geq y\}$. Specifically, ordinals result from thresholding the latent Gaussian variable. For an ordinal $x_j$, the transformation $f_j(z)$ has the