Copula Graphical Models for Heterogeneous Mixed Data
Sjoerd Hermes1,2, Joost van Heerwaarden1,2 and Pariya Behrouzi*1
1Mathematical and Statistical Methods, Wageningen University
2Plant Production Systems, Wageningen University
Abstract
This article proposes a graphical model that handles mixed-type, multi-group data. The motivation for
such a model originates from real-world observational data, which often contain groups of samples obtained
under heterogeneous conditions in space and time, potentially resulting in differences in network structure
among groups. Therefore, the i.i.d. assumption is unrealistic, and fitting a single graphical model on all
data results in a network that does not accurately represent the between-group differences. In addition,
real-world observational data are typically of mixed discrete-and-continuous type, violating the Gaussian
assumption that is typical of graphical models, which leads to the model being unable to adequately recover
the underlying graph structure. The proposed model takes into account these properties of data, by treating
observed data as transformed latent Gaussian data, by means of the Gaussian copula, and thereby allowing
for the attractive properties of the Gaussian distribution such as estimating the optimal number of model
parameters using the inverse covariance matrix. The multi-group setting is addressed by jointly fitting a
graphical model for each group, and applying the fused group penalty to fuse similar graphs together. In an
extensive simulation study, the proposed model is evaluated against alternative models, where the proposed
model is better able to recover the true underlying graph structure for different groups. Finally, the proposed
model is applied to real production-ecological data pertaining to on-farm maize yield in order to showcase
the added value of the proposed method in generating new hypotheses for production ecologists.
1 Introduction
Gaussian graphical models are statistical learning techniques used to make inference on conditional independence
relationships within a set of variables arising from a multivariate normal distribution (Lauritzen, 1996). These
techniques have been successfully applied in a variety of fields, such as finance (Giudici and Spelta, 2016), biology
(Krumsiek et al., 2011), healthcare (Gunathilake et al., 2020) and others. Despite their wide applicability, the
assumption of multivariate normality is often untenable. Therefore, a variety of alternative models have been
proposed, in, for example, the case of Poisson or exponential data (Yang et al., 2015), ordinal data (Guo et
al., 2015) and the Ising model in the case of binary data. More generally, although approaches exist
that do not impose specific distributions on the data, they are limited by their inability to allow for
non-binary discrete data (Liu et al., 2012; Fan et al., 2017) or contain a substantial number of parameters (Lee &
Hastie, 2015). Dobra and Lenkoski (2011) developed a type of Gaussian graphical model that allows for
mixed-type data, by combining the theory of copulas, Gaussian graphical models and the rank likelihood method
(Hoff, 2007). Whereas this model consisted of a Bayesian framework, Abegaz and Wit (2015) proposed a
frequentist alternative, reasoning that the choice of priors for the inverse covariance matrix is nontrivial. Both
the Bayesian and frequentist approaches have seen further development and application to real problems in the
medical (Mohammadi et al., 2017) and biomedical sciences (Behrouzi & Wit, 2019).
Notwithstanding distributional assumptions, all aforementioned methods assume that the data are i.i.d.
(independent and identically distributed). However, real-world observational data often contain groups of samples
obtained under heterogeneous conditions in space and time, potentially resulting in differences in network
structure among groups. Therefore, the i.i.d. assumption is unrealistic, and fitting a single graphical model on all
data results in a network that does not accurately represent the between-group differences. Conversely, fitting
each graph separately for each group fails to take advantage of underlying similarities that may exist between
the groups, thereby possibly resulting in highly variable parameter estimates, especially if the sample size for
each group is small (Guo et al., 2011). For these reasons, during the last decade, several researchers have
developed graphical models for so-called heterogeneous data, that is, data consisting of various groups (Guo et
al., 2011; Danaher et al., 2014; Xie et al., 2016). Akin to graphical models for homogeneous data, research on
heterogeneous graphical models has mainly pertained to the Gaussian setting, despite mixed-type heterogeneous
data occurring in a wide variety of situations, such as multi-nation survey data, meteorological data measured
at different locations, or medical data of different diseases. Consequently, the aim of this article is to fill the
methodological gap that is graphical models for heterogeneous mixed data.
*Corresponding author.
arXiv:2210.13140v5 [stat.ME] 29 Dec 2022
Even though Jia and Liang (2020) aimed to close this methodological gap using their joint mixed learning
model, the effectiveness of said model has only been shown in the case where the data follow Gaussian or
binomial distributions. This is not always the case in real-world applications. In addition, the model is unable
to handle missing data, which tend to be the norm, rather than the exception in real-world data (Nakagawa
& Freckleton, 2008). Although Jia and Liang also provide an R package with their method, it is currently
deprecated and not usable for graph estimation. Motivated by an application of networks on disease status,
Park and Won (2022) recently proposed the fused mixed graphical model (FMGM): a method to infer graph structures of
mixed-type (numerical and categorical) data for multiple groups. This approach is based on the mixed graphical
model by Lee and Hastie (2013), but extended to the multi-group setting. The proposed model assumes that the
categorical variables given all other variables follow a multinomial distribution and all numeric variables follow
a Gaussian distribution given all other variables, which is not realistic in the case of Poisson, or non-Gaussian
continuous variables. Moreover, the imposed penalty function consists of 6 different penalty parameters to be
estimated for 2 groups, a number that grows further as the number of groups increases, making the FMGM
prohibitively computationally expensive. Furthermore, the FMGM is compared only against separate network
estimation and not against existing methods, giving no indication of comparative performance on different types
of data. Finally, the FMGM is not accompanied by an R package that allows for such comparative analyses.
There is a need for a method that can handle more general mixed-type data consisting of any combination of
continuous and ordered discrete variables in a heterogeneous setting, which to the best of our knowledge does
not exist at present. Borrowing from recent developments in copula graphical models, the proposed method can
handle Gaussian, non-Gaussian continuous, Poisson, ordinal and binomial variables, thereby letting researchers
model a wider variety of problems. All code used in this article can be found at
https://github.com/sjoher/cgmhmd-analysis, whilst the R package can be found at https://github.com/sjoher/cgmhmd.
1.1 Application to production ecological data
Interest in relationships between multiple variables based on samples obtained over different locations and
time-points is particularly common in production-ecology, a science that aims to understand and predict the
productivity of agricultural systems (e.g. yield) as a function of their genetic biological components (G), the
production environment (E) and human management (M). Production-ecological data typically consist of
observations from different crops, seasons, environments, or management conditions and research is likely to benefit
from the use of graphical models. Moreover, production-ecological data tend to be of mixed type, consisting of
(commonly) Gaussian, non-Gaussian continuous and Poisson environmental data, but also ordinal and binomial
management data.
A typical challenge for production-ecological research lies in explaining variability in observed yields as a
function of a wide set of potential enabling and constraining variables. This is typically done by employing
linear models or basic machine learning methods such as random forest that model yield as a function of a set
of covariates (Ronner et al., 2016; Bielders & Gérard, 2015; Palmas & Chamberlin, 2020). However, advanced
statistical models such as graphical models have not yet been introduced to this field. As graphical models are
used to represent the conditional dependencies underlying a set of variables, we expect that these models can
greatly aid researchers' understanding of G×E×M interactions by way of exposing new, fundamental
relationships that affect plant production, which have not been captured by methods that are commonly used in the
field of production ecology. Therefore, we use this field as a way to illustrate our proposed method and thereby
introduce graphical models in general to production ecologists.
This article extends the Gaussian copula graphical model to allow for heterogeneous, mixed data, where we
showcase the effectiveness of the novel approach on production-ecological data. To this end, in Section 2, the
proposed methodology behind the Gaussian copula graphical model for heterogeneous data is presented. Section
3 presents an elaborate simulation study, where the performance of the newly proposed method compared to
other types of graphical models is evaluated. An application of the new method on production-ecological data
consisting of multiple seasons is given in Section 4. Finally, the conclusion can be found in Section 5.
2 Methodology
A Gaussian graphical model corresponds to a graph $G = (V, E)$ that represents the full conditional dependence structure between variables represented by a set of vertices $V = \{1, 2, \ldots, p\}$ through the use of a set of undirected edges $E \subseteq V \times V$, and depends on an $n \times p$ data matrix $X = (X_1, X_2, \ldots, X_p)$, $X_j = (X_{1j}, X_{2j}, \ldots, X_{nj})^T$, $j = 1, \ldots, p$, where $X \sim N_p(0, \Sigma)$, with $\Sigma = \Theta^{-1}$. $\Theta$ is known as the precision matrix containing the scaled partial correlations: $\rho_{ij} = -\Theta_{ij} / \sqrt{\Theta_{ii}\Theta_{jj}}$. Thus, the partial correlation $\rho_{ij}$ represents the dependence between $X_i$ and $X_j$ conditional on $X_{V \setminus ij}$, and is zero when they are conditionally independent. Therefore, $(i, j) \notin E \Leftrightarrow \Theta_{ij} = 0$.
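The relation between the precision matrix, the partial correlations and the edge set can be illustrated with a minimal sketch. It assumes the standard definition $\rho_{ij} = -\Theta_{ij}/\sqrt{\Theta_{ii}\Theta_{jj}}$; the function names and the 3-variable precision matrix are hypothetical and serve only as an illustration.

```python
import numpy as np

def partial_correlations(theta):
    """Scale a precision matrix into partial correlations
    rho_ij = -theta_ij / sqrt(theta_ii * theta_jj)."""
    theta = np.asarray(theta, dtype=float)
    scale = np.sqrt(np.diag(theta))
    rho = -theta / np.outer(scale, scale)
    np.fill_diagonal(rho, 1.0)  # convention chosen here for the diagonal
    return rho

def edge_set(theta, tol=1e-10):
    """Undirected edges (i, j), i < j, for which theta_ij is non-zero (up to tol)."""
    theta = np.asarray(theta, dtype=float)
    p = theta.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(theta[i, j]) > tol}

# Hypothetical chain graph 0 - 1 - 2: variables 0 and 2 are
# conditionally independent given variable 1, so theta[0, 2] = 0.
theta = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])
rho = partial_correlations(theta)
edges = edge_set(theta)
print(rho)
print(edges)
```

In this toy example the zero in position (0, 2) of the precision matrix translates into a missing edge between the first and third variable, while the two non-zero off-diagonal entries yield the edges of the chain.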
2.1 Copula graphical models for heterogeneous data
Let $X^{(k)} = (X_1^{(k)}, X_2^{(k)}, \ldots, X_p^{(k)})$, where $k = 1, 2, \ldots, K$ represents the group index, indicating differential genotypic, environmental or management situations, and $X_j^{(k)}$ is a column of length $n_k$, where $n_k$ is not necessarily equal to $n_{k'}$ for $k \neq k'$. The data are of mixed type, i.e. non-Gaussian, counts, ordinal or binomial data, as obtained from measurements on different genotypic, environmental, management and production variables. Moreover, the data across the different groups are not i.i.d. For group $k$, a general form of the joint cumulative distribution function is given by
$$F(x_1^{(k)}, \ldots, x_p^{(k)}) = P(X_1^{(k)} \le x_1^{(k)}, \ldots, X_p^{(k)} \le x_p^{(k)}). \tag{1}$$
As the Gaussian assumption is violated for the $X^{(k)}$, maximum likelihood estimation of $\Theta^{(k)}$ based on a Gaussian model will not suffice. For joint densities consisting of different marginals, as in (1), copulas can be applied to model the joint dependency structure between the variables (Nelsen, 2007). In the copula graphical model literature, each observed variable $X_j$ is assumed to have arisen by some perturbation of a latent variable $Z_j$, where $Z \sim N_p(0, \Sigma)$, with correlation matrix $\Sigma$. The choice for a Gaussian latent variable is motivated by the familiar closed form of the density and the fact that the Gaussian copula correlation matrix enforces the same conditional (in)dependence relations as the precision matrix of graphical models (Dobra & Lenkoski, 2011; Behrouzi & Wit, 2019; Abegaz & Wit, 2015). This article also assumes a Gaussian distribution for the latent variables such that
$$Z^{(k)} \sim N_p(0, \Sigma^{(k)}),$$
where $\Sigma^{(k)} \in \mathbb{R}^{p \times p}$ represents the correlation matrix for group $k$. The latent variables are linked to the observed data as
$$X_j^{(k)} = {F_j^{(k)}}^{-1}(\Phi(Z_j^{(k)})),$$
where the $F_j^{(k)}(\cdot)$ are non-decreasing marginal distribution functions, the ${F_j^{(k)}}^{-1}(\cdot)$ are quantile functions, $\Phi(\cdot)$ is the standard normal cdf and the $X_j^{(k)}$, $j = 1, 2, \ldots, p$, are observed continuous and ordered discrete variables taking values (in the discrete case) in $\{0, 1, \ldots, d_j^{(k)} - 1\}$, $d_j^{(k)} \ge 2$, with $d_j^{(k)}$ being the number of categories of variable $j$ in group $k$. A visualization of the relationship between the latent and observed variables is given in Figure 1.
[Figure 1: Relationship between the latent and observed values for ordinal variable $X_j^{(k)}$. Panels: the latent density φ(z) against z, and the probability P(x) against the observed categories x = 1, ..., 6.]
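The link between latent and observed values, $X_j = F_j^{-1}(\Phi(Z_j))$, can be sketched for a single ordinal variable as follows. This is a minimal illustration only: the function name `observe_ordinal` and the category probabilities are hypothetical, and the quantile function is taken to be the usual generalised inverse of a discrete cdf on the categories $0, \ldots, d - 1$.

```python
from statistics import NormalDist

import numpy as np

def observe_ordinal(z, probs):
    """Map latent standard normal values z to ordinal categories 0..d-1
    via x = F^{-1}(Phi(z)), where F is the cdf implied by probs."""
    cum = np.cumsum(probs)                       # F(0), F(1), ..., F(d-1)
    u = np.array([NormalDist().cdf(float(v)) for v in z])  # Phi(z) in (0, 1)
    # Generalised inverse: smallest category c with F(c) >= u.
    x = np.searchsorted(cum, u, side="left")
    # Guard against round-off making the last cumulative sum slightly below 1.
    return np.minimum(x, len(probs) - 1)

# Hypothetical marginal with three ordered categories.
probs = [0.2, 0.5, 0.3]
z = np.array([-2.0, 0.0, 2.0])
x = observe_ordinal(z, probs)
print(x)
```

Because both $\Phi$ and the quantile function are non-decreasing, larger latent values can never map to smaller categories, which is the property the rank-based argument in the next paragraphs relies on.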
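The copula function joining the marginal distributions is denoted as
$$P(X_1^{(k)} \le x_1^{(k)}, \ldots, X_p^{(k)} \le x_p^{(k)}) = C(F_1^{(k)}(x_1^{(k)}), \ldots, F_p^{(k)}(x_p^{(k)})),$$
where $F_j^{(k)}(x_j^{(k)})$ is standard uniform (Casella & Berger, 2001) and, due to the Gaussian assumption of the $Z_j^{(k)}$, the copula can be written as
$$\Phi_{\Sigma^{(k)}}(\Phi^{-1}(F_1^{(k)}(x_1^{(k)})), \ldots, \Phi^{-1}(F_p^{(k)}(x_p^{(k)}))) = \Phi_{\Sigma^{(k)}}(\Phi^{-1}(u_1^{(k)}), \ldots, \Phi^{-1}(u_p^{(k)})),$$
where $u_j^{(k)} = F_j^{(k)}(x_j^{(k)})$ and $\Phi_{\Sigma^{(k)}}(\cdot)$ is the cdf of a multivariate normal distribution with correlation matrix $\Sigma^{(k)} \in \mathbb{R}^{p \times p}$. As $\Phi(\cdot)$ is always nondecreasing and the ${F_j^{(k)}}^{-1}(t)$ are nondecreasing due to the ordered nature of the data, we have that $x_{ij}^{(k)} < x_{i'j}^{(k)}$ implies $z_{ij}^{(k)} < z_{i'j}^{(k)}$ and $z_{ij}^{(k)} < z_{i'j}^{(k)}$ implies $x_{ij}^{(k)} \le x_{i'j}^{(k)}$, $1 \le i \neq i' \le n_k$; see Hoff (2007). Thus, we have that
$$z_j^{(k)} \in D(x_j^{(k)}) = \{ z_j^{(k)} \in \mathbb{R}^{n_k} : L_{ij}(x_{ij}^{(k)}) < z_{ij}^{(k)} < U_{ij}(x_{ij}^{(k)}) \},$$
where $L_{ij}(x_{ij}^{(k)}) = \max\{ z_{i'j}^{(k)} : x_{i'j}^{(k)} < x_{ij}^{(k)} \}$ and $U_{ij}(x_{ij}^{(k)}) = \min\{ z_{i'j}^{(k)} : x_{ij}^{(k)} < x_{i'j}^{(k)} \}$. From here on out, we refer to the set of intervals containing the latent data, $D(x) = \{ z \in \mathbb{R}^{\sum_{k=1}^{K} n_k \times p} : z_j^{(k)} \in D(x_j^{(k)}) \}$, as $D$.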
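In order to facilitate the joint estimation of the different $\Theta^{(k)}$, the probability density function over all $K$ groups is given as
$$f(x_1^{(1)}, \ldots, x_p^{(1)}, \ldots\ldots, x_1^{(K)}, \ldots, x_p^{(K)}) = \prod_{k=1}^{K} c(F_1^{(k)}(x_1^{(k)}), \ldots, F_p^{(k)}(x_p^{(k)})) \prod_{j=1}^{p} f_j^{(k)}(x_j^{(k)}), \tag{2}$$
where $c(F_1^{(k)}(x_1^{(k)}), \ldots, F_p^{(k)}(x_p^{(k)}))$ is the copula density function and $f_j^{(k)}$ is the marginal density function for the $j$-th variable and the $k$-th group. This copula density is obtained by taking the derivative of the cdf with respect to the marginals. As the Gaussian copula is used, the copula density function can be rewritten as:
$$\begin{aligned}
c(F_1^{(k)}(x_1^{(k)}), \ldots, F_p^{(k)}(x_p^{(k)}))
&= \frac{\partial^p C}{\partial F_1^{(k)} \cdots \partial F_p^{(k)}} \\
&= \frac{\phi_{\Sigma^{(k)}}(\Phi^{-1}(u_1^{(k)}), \ldots, \Phi^{-1}(u_p^{(k)}))}{\prod_{i=1}^{p} \phi(\Phi^{-1}(u_i^{(k)}))} \\
&= \frac{(2\pi)^{-\frac{p}{2}} \det(\Sigma^{(k)})^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (\Phi^{-1}(u_1^{(k)}), \ldots, \Phi^{-1}(u_p^{(k)}))^T {\Sigma^{(k)}}^{-1} (\Phi^{-1}(u_1^{(k)}), \ldots, \Phi^{-1}(u_p^{(k)}))\right)}{\prod_{i=1}^{p} (2\pi)^{-\frac{1}{2}} \exp\left(-\frac{1}{2} \Phi^{-1}(u_i^{(k)}) \Phi^{-1}(u_i^{(k)})\right)} \\
&= \frac{(2\pi)^{-\frac{p}{2}} \det(\Sigma^{(k)})^{-\frac{1}{2}} \exp\left(-\frac{1}{2} Z^{(k)T} {\Sigma^{(k)}}^{-1} Z^{(k)}\right)}{(2\pi)^{-\frac{p}{2}} \exp\left(-\frac{1}{2} Z^{(k)T} Z^{(k)}\right)} \\
&= \det(\Theta^{(k)})^{\frac{1}{2}} \exp\left(-\frac{1}{2} Z^{(k)T} (\Theta^{(k)} - I) Z^{(k)}\right),
\end{aligned}$$
where $\phi_{\Sigma^{(k)}}(\cdot)$ and $\phi(\cdot)$ denote the multivariate normal density with correlation matrix $\Sigma^{(k)}$ and the standard normal density respectively, $Z^{(k)}$ is used to shorten the $n_k \times p$ latent matrix $(\Phi^{-1}(u_1^{(k)}), \ldots, \Phi^{-1}(u_p^{(k)}))$ and $I$ is a $p \times p$ identity matrix. The full log-likelihood over $K$ groups is then given by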
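$$\begin{aligned}
\ell(\{\Theta^{(k)}\}_{k=1}^{K} \mid X)
&= \log \prod_{k=1}^{K} \prod_{i=1}^{n_k} c(F_1^{(k)}(x_{i1}^{(k)}), \ldots, F_p^{(k)}(x_{ip}^{(k)})) \prod_{j=1}^{p} f_j^{(k)}(x_{ij}^{(k)}) \\
&= \sum_{k=1}^{K} \sum_{i=1}^{n_k} \log\left(c(F_1^{(k)}(x_{i1}^{(k)}), \ldots, F_p^{(k)}(x_{ip}^{(k)}))\right) + \sum_{k=1}^{K} \sum_{i=1}^{n_k} \sum_{j=1}^{p} \log\left(f_j^{(k)}(x_{ij}^{(k)})\right) \\
&= \sum_{k=1}^{K} \sum_{i=1}^{n_k} \log\left[\det(\Theta^{(k)})^{\frac{1}{2}} \exp\left(-\frac{1}{2} Z_i^{(k)T} (\Theta^{(k)} - I) Z_i^{(k)}\right)\right] + \sum_{k=1}^{K} \sum_{i=1}^{n_k} \sum_{j=1}^{p} \log\left[\frac{1}{\sigma_j \sqrt{2\pi}} \exp\left(-\frac{1}{2} \frac{{x_{ij}^{(k)}}^2}{\sigma_j^2}\right)\right] \\
&= \frac{1}{2} \sum_{k=1}^{K} n_k \log(\det(\Theta^{(k)})) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{n_k} Z_i^{(k)T} (\Theta^{(k)} - I) Z_i^{(k)} - \frac{1}{2} \sum_{k=1}^{K} n_k p \log(2\pi) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \sum_{j=1}^{p} {x_{ij}^{(k)}}^2 \\
&\propto \frac{1}{2} \sum_{k=1}^{K} n_k \log(\det(\Theta^{(k)})) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{n_k} Z_i^{(k)T} (\Theta^{(k)} - I) Z_i^{(k)},
\end{aligned} \tag{3}$$
where $X = (X^{(1)}, \ldots, X^{(K)})^T$. We denote $\{\Theta^{(k)}\}_{k=1}^{K}$ as $\Theta$ for the purpose of simplicity. The two rightmost terms in the penultimate line of (3) were omitted, as they are constant with respect to $\Theta^{(k)}$ because of the standard normal marginals.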
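As a numerical sanity check of the derivation, the closed form in the final line of (3) can be compared with a direct summation of log copula densities, each computed as the log multivariate normal density minus the sum of the log standard normal marginal densities. The sketch below is not the authors' implementation: the function names, the two hypothetical correlation matrices and the simulated values are assumptions, and the latent values are treated as if they were known, whereas in the model they are unobserved.

```python
import numpy as np

def log_copula_density(z, sigma):
    """Log Gaussian copula density for one observation z = Phi^{-1}(u):
    log multivariate normal density minus the sum of log N(0, 1) densities."""
    p = sigma.shape[0]
    _, logdet_sigma = np.linalg.slogdet(sigma)
    log_joint = -0.5 * (p * np.log(2 * np.pi) + logdet_sigma
                        + z @ np.linalg.solve(sigma, z))
    log_marginals = np.sum(-0.5 * (np.log(2 * np.pi) + z ** 2))
    return log_joint - log_marginals

def copula_loglik(thetas, zs):
    """Final line of (3): 1/2 sum_k n_k log det(Theta_k)
    - 1/2 sum_k sum_i z_i^T (Theta_k - I) z_i."""
    total = 0.0
    for theta, z in zip(thetas, zs):
        n_k, p = z.shape
        sign, logdet = np.linalg.slogdet(theta)
        if sign <= 0:
            raise ValueError("each precision matrix needs a positive determinant")
        # sum_i z_i^T A z_i equals trace(A Z^T Z)
        quad = np.trace((theta - np.eye(p)) @ (z.T @ z))
        total += 0.5 * n_k * logdet - 0.5 * quad
    return total

# Two hypothetical groups with different correlation matrices and sample sizes.
rng = np.random.default_rng(0)
sigmas = [np.array([[1.0, 0.6], [0.6, 1.0]]),
          np.array([[1.0, -0.3], [-0.3, 1.0]])]
thetas = [np.linalg.inv(s) for s in sigmas]
zs = [rng.standard_normal((5, 2)), rng.standard_normal((8, 2))]

direct = sum(log_copula_density(z_i, s) for s, z in zip(sigmas, zs) for z_i in z)
closed_form = copula_loglik(thetas, zs)
print(direct, closed_form)
```

The two quantities agree up to floating-point error, and the closed form evaluates to zero when every $\Theta^{(k)}$ is the identity, i.e. when the variables are independent.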
2.2 Model estimation
When estimating the marginals, a nonparametric approach is adhered to, as is common in the copula literature. This is due to the computational costs involved in estimating them and because we only care about the dependencies encoded in the $\Theta^{(k)}$. They are estimated as $\hat{F}_j^{(k)}(x) = \frac{1}{n_k + 1} \sum_{i=1}^{n_k} I(X_{ij}^{(k)} \le x)$. Whilst (3) allows for the joint estimation of the graphical models pertaining to the different groups, these models are not sparse and cannot enforce relations to be the same across groups. Sparsity is a common assumption in biological networks and production ecology is not an exception. Consider for example the solubilization of fertiliser, which is independent of root activity (de Wit, 1953), the independence between nitrogen and yield for certain crops (Raun et al., 2011), or more generally the independence between weather and various management techniques. Moreover, if certain groups are highly similar, for example different locations with similar climates, enforcing relations between those groups to be the same is both realistic and parsimonious. Therefore, a fused-type penalty is imposed upon the precision matrix, such that the penalised log-likelihood function has the following form
$$\ell(\Theta \mid X) = \frac{1}{2} \sum_{k=1}^{K} n_k \log(\det(\Theta^{(k)})) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{n_k} Z_i^{(k)T} (\Theta^{(k)} - I) Z_i^{(k)} - \lambda_1 \sum_{k=1}^{K} \sum_{j \neq j'} |\theta_{jj'}^{(k)}| - \lambda_2 \sum_{k < k'} \sum_{j, j'} |\theta_{jj'}^{(k)} - \theta_{jj'}^{(k')}| \tag{4}$$
for $1 \le k \neq k' \le K$ and $1 \le j \neq j' \le p$. Here, $\lambda_1$ controls the sparsity of the $K$ different graphs and $\lambda_2$ controls the edge-similarity between the $K$ different graphs. Higher values for $\lambda_1$ and $\lambda_2$ correspond to respectively more sparse and more similar graphs, where similarity is not limited to similar sparsity patterns in the different $\Theta^{(k)}$, but also extends to attaining the exact same coefficients across different $\Theta^{(k)}$. The fused-type penalty for heterogeneous data graphical models was originally proposed by Danaher et al. (2014). Whenever groups pertaining to seasons or environments share similar characteristics, production ecological research has hinted at similar edge values between groups (Hajjarpoor et al., 2021; Zhang et al., 1999; Richards & Townley-Smith, 1987). Consider the case where groups represent different locations. If two groups have very similar environments, in terms of both weather patterns and soil properties, many conditional independence relations are expected to be similar between the groups, as the underlying production ecological relations are assumed to be invariant across (near) identical situations (Connor et al., 2011). Conversely, if the groups share few characteristics, the edge values are expected to differ between groups, which is reflected in a low value for $\lambda_2$, as obtained from a penalty parameter selection method. Moreover, this fused-type penalty has been shown to outperform other types of penalties (Danaher et al., 2014), and, if the data contain only 2 groups, this type of penalty has a very light computational burden, due to the existence of a closed-form solution for (4) once the conditional expectations of the latent variables have been computed.
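Two ingredients of this subsection lend themselves to a short sketch: the rescaled empirical cdf used for the marginals, with denominator $n_k + 1$, and the penalty term of (4), consisting of a sparsity part weighted by $\lambda_1$ and a fusion part weighted by $\lambda_2$. The function names and numerical values below are hypothetical, and the code only evaluates these quantities; it does not perform the optimisation.

```python
import numpy as np

def rescaled_ecdf(column, x):
    """F_hat(x) = (1 / (n + 1)) * #{i : X_i <= x}.
    Dividing by n + 1 keeps the estimate strictly below 1."""
    column = np.asarray(column, dtype=float)
    return np.sum(column <= x) / (column.size + 1)

def fused_penalty(thetas, lam1, lam2):
    """lam1 * sum_k sum_{j != j'} |theta_jj'^(k)|
    + lam2 * sum_{k < k'} sum_{j, j'} |theta_jj'^(k) - theta_jj'^(k')|."""
    sparsity = 0.0
    for theta in thetas:
        off_diagonal = theta - np.diag(np.diag(theta))
        sparsity += np.abs(off_diagonal).sum()
    fusion = 0.0
    for k in range(len(thetas)):
        for k2 in range(k + 1, len(thetas)):
            fusion += np.abs(thetas[k] - thetas[k2]).sum()
    return lam1 * sparsity + lam2 * fusion

# Hypothetical precision matrices for two groups that differ in one edge weight.
theta_a = np.array([[1.0, 0.5], [0.5, 1.0]])
theta_b = np.array([[1.0, 0.2], [0.2, 1.0]])
penalty = fused_penalty([theta_a, theta_b], lam1=0.1, lam2=0.2)
ecdf_value = rescaled_ecdf([1, 2, 3, 4], 2)
print(penalty, ecdf_value)
```

When the two precision matrices coincide, the fusion part vanishes and only the sparsity part remains, which is why larger values of $\lambda_2$ push the estimated graphs towards identical edge values.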
As direct maximization of $\ell(\Theta \mid X)$ is not feasible due to the nonexistence of an analytic solution to (4), an iterative method is needed to estimate the value of $\Theta_{\lambda_1, \lambda_2}$. A common algorithm used in the presence of latent variables is the EM algorithm (McLachlan & Krishnan, 2007). A benefit of this algorithm is that it can handle missing data, which are not uncommon in production ecology as plants can die mid-season due to external stresses such as droughts or pests. The EM algorithm alternates between an E-step and an M-step, where during the E-step the expectation of the (unpenalised) complete-data (both $X$ and $Z$) log-likelihood conditional on the event $D$ and the estimate $\hat{\Theta}^{(m)}$ obtained during the previous M-step is computed