

Copula Graphical Models for Heterogeneous Mixed Data

Sjoerd Hermes1,2, Joost van Heerwaarden1,2 and Pariya Behrouzi∗1

1Mathematical and Statistical Methods, Wageningen University

2Plant Production Systems, Wageningen University

Abstract

This article proposes a graphical model that handles mixed-type, multi-group data. The motivation for such a model originates from real-world observational data, which often contain groups of samples obtained under heterogeneous conditions in space and time, potentially resulting in differences in network structure among groups. Therefore, the i.i.d. assumption is unrealistic, and fitting a single graphical model on all data results in a network that does not accurately represent the between-group differences. In addition, real-world observational data are typically of mixed discrete-and-continuous type, violating the Gaussian assumption that is typical of graphical models, which leaves the model unable to adequately recover the underlying graph structure. The proposed model takes these properties of the data into account by treating observed data as transformed latent Gaussian data, by means of the Gaussian copula, thereby retaining the attractive properties of the Gaussian distribution, such as estimating the optimal number of model parameters using the inverse covariance matrix. The multi-group setting is addressed by jointly fitting a graphical model for each group and applying the fused group penalty to fuse similar graphs together. In an extensive simulation study, the proposed model is evaluated against alternative models, where the proposed model is better able to recover the true underlying graph structure for different groups. Finally, the proposed model is applied to real production-ecological data pertaining to on-farm maize yield in order to showcase the added value of the proposed method in generating new hypotheses for production ecologists.

1 Introduction

Gaussian graphical models are statistical learning techniques used to make inference on conditional independence relationships within a set of variables arising from a multivariate normal distribution (Lauritzen, 1996). These techniques have been successfully applied in a variety of fields, such as finance (Giudici and Spelta, 2016), biology (Krumsiek et al., 2011), healthcare (Gunathilake et al., 2020) and others. Despite their wide applicability, the assumption of multivariate normality is often untenable. Therefore, a variety of alternative models have been proposed, for example for Poisson or exponential data (Yang et al., 2015), for ordinal data (Guo et al., 2015), and the Ising model for binary data. More generally, approaches that do not impose specific distributions on the data are available, but they either cannot accommodate non-binary discrete data (Liu et al., 2012; Fan et al., 2017) or contain a substantial number of parameters (Lee & Hastie, 2015). Dobra and Lenkoski (2011) developed a type of Gaussian graphical model that allows for mixed-type data by combining the theory of copulas, Gaussian graphical models and the rank likelihood method (Hoff, 2007). Whereas this model was formulated in a Bayesian framework, Abegaz and Wit (2015) proposed a frequentist alternative, reasoning that the choice of priors for the inverse covariance matrix is nontrivial. Both the Bayesian and frequentist approaches have seen further development and application to real problems in the medical (Mohammadi et al., 2017) and biomedical sciences (Behrouzi & Wit, 2019).

Notwithstanding distributional assumptions, all aforementioned methods assume that the data are i.i.d. (independent and identically distributed). However, real-world observational data often contain groups of samples obtained under heterogeneous conditions in space and time, potentially resulting in differences in network structure among groups. Therefore, the i.i.d. assumption is unrealistic, and fitting a single graphical model on all data results in a network that does not accurately represent the between-group differences. Conversely, fitting a separate graph for each group fails to take advantage of underlying similarities that may exist between the groups, thereby possibly resulting in highly variable parameter estimates, especially if the sample size for each group is small (Guo et al., 2011). For these reasons, during the last decade, several researchers have developed graphical models for so-called heterogeneous data, that is, data consisting of various groups (Guo et al., 2011; Danaher et al., 2014; Xie et al., 2016). Akin to graphical models for homogeneous data, research on heterogeneous graphical models has mainly pertained to the Gaussian setting, despite mixed-type heterogeneous data occurring in a wide variety of situations, such as multi-nation survey data, meteorological data measured at different locations, or medical data of different diseases. Consequently, the aim of this article is to fill the methodological gap that is graphical models for heterogeneous mixed data.

∗Corresponding author.

arXiv:2210.13140v5 [stat.ME] 29 Dec 2022

Even though Jia and Liang (2020) aimed to close this methodological gap using their joint mixed learning model, the effectiveness of said model has only been shown in the case where the data follow Gaussian or binomial distributions. This is not always the case in real-world applications. In addition, the model is unable to handle missing data, which tend to be the norm rather than the exception in real-world data (Nakagawa & Freckleton, 2008). Although Jia and Liang also provided an R package with their method, it is currently deprecated and not usable for graph estimation. Motivated by an application of networks to disease status, Park and Won (2022) recently proposed the fused mixed graphical model (FMGM): a method to infer graph structures of mixed-type (numerical and categorical) data for multiple groups. This approach is based on the mixed graphical model by Lee and Hastie (2013), extended to the multi-group setting. The FMGM assumes that the categorical variables, given all other variables, follow a multinomial distribution and that all numeric variables follow a Gaussian distribution given all other variables, which is not realistic in the case of Poisson or non-Gaussian continuous variables. Moreover, the imposed penalty function consists of 6 different penalty parameters to be estimated for 2 groups, a number that grows further as the number of groups increases, making the FMGM prohibitively expensive computationally. Furthermore, the FMGM is compared only to separate network estimation rather than to existing methods, giving no indication of comparative performance on different types of data. Finally, the FMGM is not accompanied by an R package that allows for such comparative analyses.

There is a need for a method that can handle more general mixed-type data consisting of any combination of continuous and ordered discrete variables in a heterogeneous setting, which to the best of our knowledge does not exist at present. Borrowing from recent developments in copula graphical models, the proposed method can handle Gaussian, non-Gaussian continuous, Poisson, ordinal and binomial variables, thereby letting researchers model a wider variety of problems. All code used in this article can be found at https://github.com/sjoher/cgmhmd-analysis, whilst the R package can be found at https://github.com/sjoher/cgmhmd.

1.1 Application to production ecological data

Interest in relationships between multiple variables based on samples obtained over different locations and time-points is particularly common in production ecology, a science that aims to understand and predict the productivity of agricultural systems (e.g. yield) as a function of their genetic biological components (G), the production environment (E) and human management (M). Production-ecological data typically consist of observations from different crops, seasons, environments, or management conditions, and research on such data is likely to benefit from the use of graphical models. Moreover, production-ecological data tend to be of mixed type, consisting of (commonly) Gaussian, non-Gaussian continuous and Poisson environmental data, but also ordinal and binomial management data.

A typical challenge for production-ecological research lies in explaining variability in observed yields as a function of a wide set of potential enabling and constraining variables. This is typically done by employing linear models or basic machine learning methods such as random forests that model yield as a function of a set of covariates (Ronner et al., 2016; Bielders & Gérard, 2015; Palmas & Chamberlin, 2020). However, advanced statistical models such as graphical models have not yet been introduced to this field. As graphical models are used to represent the conditional dependencies underlying a set of variables, we expect that these models can greatly aid researchers' understanding of G×E×M interactions by exposing new, fundamental relationships that affect plant production and that have not been captured by the methods commonly used in the field of production ecology. Therefore, we use this field to illustrate our proposed method and thereby introduce graphical models in general to production ecologists.

This article extends the Gaussian copula graphical model to allow for heterogeneous, mixed data, and showcases the effectiveness of the novel approach on production-ecological data. To this end, Section 2 presents the methodology behind the Gaussian copula graphical model for heterogeneous data. Section 3 presents an elaborate simulation study, in which the performance of the newly proposed method is evaluated against other types of graphical models. An application of the new method to production-ecological data consisting of multiple seasons is given in Section 4. Finally, the conclusion can be found in Section 5.

2 Methodology

A Gaussian graphical model corresponds to a graph $G = (V, E)$ that represents the full conditional dependence structure between variables represented by a set of vertices $V = \{1, 2, \ldots, p\}$ through the use of a set of undirected edges $E \subset V \times V$, and depends on an $n \times p$ data matrix $X = (X_1, X_2, \ldots, X_p)$, $X_j = (X_{1j}, X_{2j}, \ldots, X_{nj})^T$, $j = 1, \ldots, p$, where $X \sim N_p(0, \Sigma)$, with $\Sigma = \Theta^{-1}$. $\Theta$ is known as the precision matrix containing the scaled partial correlations: $\rho_{ij} = -\Theta_{ij} / \sqrt{\Theta_{ii}\Theta_{jj}}$. Thus, the partial correlation $\rho_{ij}$ measures the dependence between $X_i$ and $X_j$ conditional on $X_{V \setminus ij}$, and is zero exactly when $X_i$ and $X_j$ are conditionally independent. Therefore, $(i, j) \notin E \Leftrightarrow \Theta_{ij} = 0$.
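The relation between the precision matrix, partial correlations and the edge set can be sketched numerically. The following is a minimal illustration, not the authors' implementation; the function names `partial_correlations` and `edge_set`, the tolerance `tol` and the example matrix are assumptions introduced here.

```python
import numpy as np

def partial_correlations(theta):
    """Return rho with rho_ij = -theta_ij / sqrt(theta_ii * theta_jj) off the diagonal."""
    d = np.sqrt(np.diag(theta))
    rho = -theta / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)  # convention chosen here: unit diagonal
    return rho

def edge_set(theta, tol=1e-10):
    """Edges (i, j), i < j, whose precision entry is non-zero (up to tol)."""
    p = theta.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(theta[i, j]) > tol}

# Hypothetical precision matrix of a chain graph 0 - 1 - 2.
theta = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])

rho = partial_correlations(theta)
edges = edge_set(theta)
sigma = np.linalg.inv(theta)  # covariance implied by theta
```

In this example variables 0 and 2 are marginally correlated (the corresponding entry of `sigma` is non-zero) although their precision entry is zero, which is why the graph is read off $\Theta$ rather than $\Sigma$.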

2.1 Copula graphical models for heterogeneous data

Let $X^{(k)} = (X_1^{(k)}, X_2^{(k)}, \ldots, X_p^{(k)})$, where $k = 1, 2, \ldots, K$ represents the group index, indicating differential genotypic, environmental or management situations, and $X_j^{(k)}$ is a column of length $n_k$, where $n_k$ is not necessarily equal to $n_{k'}$ for $k \neq k'$. The data are of mixed type, i.e. non-Gaussian, counts, ordinal or binomial data, as obtained from measurements on different genotypic, environmental, management and production variables. Moreover, the data across the different groups are not i.i.d. For group $k$, a general form of the joint cumulative distribution function is given by

$$
F(x_1^{(k)}, \ldots, x_p^{(k)}) = P(X_1^{(k)} \leq x_1^{(k)}, \ldots, X_p^{(k)} \leq x_p^{(k)}). \tag{1}
$$

As the Gaussian assumption is violated for the $X^{(k)}$, maximum likelihood estimation of $\Theta^{(k)}$ based on a Gaussian model will not suffice. For joint distributions consisting of different marginals, as in (1), copulas can be applied to model the joint dependency structure between the variables (Nelsen, 2007). In the copula graphical model literature, each observed variable $X_j$ is assumed to have arisen by some perturbation of a latent variable $Z_j$, where $Z \sim N_p(0, \Sigma)$, with correlation matrix $\Sigma$. The choice for a Gaussian latent variable is motivated by the familiar closed form of the density and the fact that the Gaussian copula correlation matrix enforces the same conditional (in)dependence relations as the precision matrix of graphical models (Dobra & Lenkoski, 2011; Behrouzi & Wit, 2019; Abegaz & Wit, 2015). This article also assumes a Gaussian distribution for the latent variables such that

$$
Z^{(k)} \sim N_p(0, \Sigma^{(k)}),
$$

where $\Sigma^{(k)} \in \mathbb{R}^{p \times p}$ represents the correlation matrix for group $k$. The latent variables are linked to the observed data as

$$
X_j^{(k)} = F_j^{(k)-1}(\Phi(Z_j^{(k)})),
$$

where the $F_j^{(k)}()$ are non-decreasing marginal distribution functions, the $F_j^{(k)-1}()$ are quantile functions, $\Phi()$ is the standard normal cdf and the $X_j^{(k)}$, $j = 1, 2, \ldots, p$, are observed continuous and ordered discrete variables taking values (in the discrete case) in $\{0, 1, \ldots, d_j^{(k)} - 1\}$, $d_j^{(k)} \geq 2$, with $d_j^{(k)}$ being the number of categories of variable $j$ in group $k$. A visualization of the relationship between the latent and observed variables is given in Figure 1.

Figure 1: Relationship between the latent and observed values for ordinal variable $X^{(k)}$. (Panels: latent density $\phi(z)$; observed probabilities $P(x)$ over categories 1 to 6.)
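The link $X_j^{(k)} = F_j^{(k)-1}(\Phi(Z_j^{(k)}))$ can be illustrated by simulating one group of mixed-type data from latent Gaussian draws. This is a hedged sketch: the correlation matrix, the sample size, the chosen marginals (exponential, Poisson, four-category ordinal) and all variable names are assumptions made for illustration and do not come from the article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical latent correlation matrix for a single group (p = 3).
sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
n = 500

# Latent Gaussian draws Z ~ N_p(0, Sigma), mapped to uniforms u = Phi(z).
z = rng.multivariate_normal(np.zeros(3), sigma, size=n)
u = stats.norm.cdf(z)

# Apply a different quantile function F_j^{-1} to each column.
x_cont = stats.expon.ppf(u[:, 0], scale=2.0)   # non-Gaussian continuous
x_count = stats.poisson.ppf(u[:, 1], mu=3.0)   # Poisson counts
cuts = np.array([0.2, 0.5, 0.9])               # cumulative category probabilities
x_ord = np.searchsorted(cuts, u[:, 2])         # ordinal values in {0, 1, 2, 3}

x = np.column_stack([x_cont, x_count, x_ord])
```

Because every quantile function is non-decreasing, sorting the observations of a variable by their latent values also sorts the observed values (with ties for discrete variables), which is the property the rank-based construction relies on.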

The copula function joining the marginal distributions is denoted as

$$
P(X_1^{(k)} \leq x_1^{(k)}, \ldots, X_p^{(k)} \leq x_p^{(k)}) = C(F_1^{(k)}(x_1^{(k)}), \ldots, F_p^{(k)}(x_p^{(k)})),
$$

where $F_j^{(k)}(x_j^{(k)})$ is standard uniform (Casella & Berger, 2001) and, due to the Gaussian assumption of the $Z^{(k)}$, can be written as

$$
\Phi_{\Sigma^{(k)}}(\Phi^{-1}(F_1^{(k)}(x_1^{(k)})), \ldots, \Phi^{-1}(F_p^{(k)}(x_p^{(k)}))) = \Phi_{\Sigma^{(k)}}(\Phi^{-1}(u_1^{(k)}), \ldots, \Phi^{-1}(u_p^{(k)})),
$$

where $\Phi_{\Sigma^{(k)}}()$ is the cdf of a multivariate normal distribution with correlation matrix $\Sigma^{(k)} \in \mathbb{R}^{p \times p}$. As $\Phi()$ is always nondecreasing and the $F_j^{(k)-1}(t)$ are nondecreasing due to the ordered nature of the data, we have that $x_{ij}^{(k)} < x_{i'j}^{(k)}$ implies $z_{ij}^{(k)} < z_{i'j}^{(k)}$ and $z_{ij}^{(k)} < z_{i'j}^{(k)}$ implies $x_{ij}^{(k)} \leq x_{i'j}^{(k)}$, $1 \leq i \neq i' \leq n_k$; see Hoff (2007). Thus, we have that

$$
z_j^{(k)} \in D(x_j^{(k)}) = \{z_j^{(k)} \in \mathbb{R}^{n_k} : L_{ij}(x_{ij}^{(k)}) < z_{ij}^{(k)} < U_{ij}(x_{ij}^{(k)})\},
$$

where $L_{ij}(x_{ij}^{(k)}) = \max\{z_{i'j}^{(k)} : x_{i'j}^{(k)} < x_{ij}^{(k)}\}$ and $U_{ij}(x_{ij}^{(k)}) = \min\{z_{i'j}^{(k)} : x_{ij}^{(k)} < x_{i'j}^{(k)}\}$. From here on out, we refer to the set of intervals containing the latent data, $D(x) = \{z \in \mathbb{R}^{\sum_k n_k \times p} : z_j^{(k)} \in D(x_j^{(k)})\}$, as $D$.
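The interval bounds $L_{ij}$ and $U_{ij}$ can be computed directly from their definitions. The sketch below handles one variable within one group; the function name `latent_bounds`, the toy data, and the convention of using $-\infty$ and $+\infty$ when no smaller or larger observed value exists are assumptions made here for illustration.

```python
import numpy as np

def latent_bounds(x, z):
    """Bounds L_i = max{z_i' : x_i' < x_i} and U_i = min{z_i' : x_i < x_i'}.

    x: observed values of one variable in one group (length n_k).
    z: current latent values for the same observations.
    Empty sets give -inf / +inf (a convention assumed in this sketch).
    """
    x = np.asarray(x)
    z = np.asarray(z, dtype=float)
    lower = np.full(len(x), -np.inf)
    upper = np.full(len(x), np.inf)
    for i in range(len(x)):
        below = z[x < x[i]]   # latent values of strictly smaller observations
        above = z[x > x[i]]   # latent values of strictly larger observations
        if below.size:
            lower[i] = below.max()
        if above.size:
            upper[i] = above.min()
    return lower, upper

# Toy ordinal variable with three categories and order-consistent latent values.
x = np.array([0, 0, 1, 2, 2])
z = np.array([-1.3, -0.4, 0.1, 0.9, 1.7])
lower, upper = latent_bounds(x, z)
```

Observations that share an observed value receive the same bounds, so the latent values within a category are only constrained relative to the neighbouring categories.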

In order to facilitate the joint estimation of the different $\Theta^{(k)}$, the probability density function over all $K$ groups is given as

$$
f(x_1^{(1)}, \ldots, x_p^{(1)}, \ldots\ldots, x_1^{(K)}, \ldots, x_p^{(K)}) = \prod_{k=1}^{K} \left[ c\big(F_1^{(k)}(x_1^{(k)}), \ldots, F_p^{(k)}(x_p^{(k)})\big) \prod_{j=1}^{p} f_j^{(k)}(x_j^{(k)}) \right], \tag{2}
$$

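where $c(F_1^{(k)}(x_1^{(k)}), \ldots, F_p^{(k)}(x_p^{(k)}))$ is the copula density function and $f_j^{(k)}$ is the marginal density function for the $j$-th variable and the $k$-th group. This copula density is obtained by taking the derivative of the cdf with respect to the marginals. As the Gaussian copula is used, the copula density function can be rewritten as:

$$
\begin{aligned}
c(F_1^{(k)}(x_1^{(k)}), \ldots, F_p^{(k)}(x_p^{(k)}))
&= \frac{\partial^p C}{\partial F_1^{(k)}, \ldots, \partial F_p^{(k)}} \\
&= \frac{\phi_{\Sigma^{(k)}}(\Phi^{-1}(u_1^{(k)}), \ldots, \Phi^{-1}(u_p^{(k)}))}{\prod_{i=1}^{p} \phi(\Phi^{-1}(u_i^{(k)}))} \\
&= \frac{(2\pi)^{-\frac{p}{2}} \det(\Sigma^{(k)})^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (\Phi^{-1}(u_1^{(k)}), \ldots, \Phi^{-1}(u_p^{(k)}))^T \Sigma^{(k)-1} (\Phi^{-1}(u_1^{(k)}), \ldots, \Phi^{-1}(u_p^{(k)}))\right)}{\prod_{i=1}^{p} (2\pi)^{-\frac{1}{2}} \exp\left(-\frac{1}{2} \Phi^{-1}(u_i^{(k)}) \Phi^{-1}(u_i^{(k)})\right)} \\
&= \frac{(2\pi)^{-\frac{p}{2}} \det(\Sigma^{(k)})^{-\frac{1}{2}} \exp\left(-\frac{1}{2} Z^{(k)T} \Sigma^{(k)-1} Z^{(k)}\right)}{(2\pi)^{-\frac{p}{2}} \exp\left(-\frac{1}{2} Z^{(k)T} Z^{(k)}\right)} \\
&= \det(\Theta^{(k)})^{\frac{1}{2}} \exp\left(-\frac{1}{2} Z^{(k)T} (\Theta^{(k)} - I) Z^{(k)}\right),
\end{aligned}
$$

where $\phi_{\Sigma^{(k)}}()$ and $\phi()$ denote the multivariate normal density with correlation matrix $\Sigma^{(k)}$ and the standard normal density respectively, $Z^{(k)}$ is used to shorten the $n_k \times p$ latent matrix $(\Phi^{-1}(u_1^{(k)}), \ldots, \Phi^{-1}(u_p^{(k)}))$ and $I$ is a $p \times p$ identity matrix. The full log-likelihood over $K$ groups is then given by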

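$$
\begin{aligned}
\ell(\{\Theta^{(k)}\}_{k=1}^{K} \mid X)
&= \log\left( \prod_{k=1}^{K} \prod_{i=1}^{n_k} \left[ c\big(F_1^{(k)}(x_{i1}^{(k)}), \ldots, F_p^{(k)}(x_{ip}^{(k)})\big) \prod_{j=1}^{p} f_j^{(k)}(x_{ij}^{(k)}) \right] \right) \\
&= \sum_{k=1}^{K} \sum_{i=1}^{n_k} \log\big(c(F_1^{(k)}(x_{i1}^{(k)}), \ldots, F_p^{(k)}(x_{ip}^{(k)}))\big) + \sum_{k=1}^{K} \sum_{i=1}^{n_k} \sum_{j=1}^{p} \log\big(f_j^{(k)}(x_{ij}^{(k)})\big) \\
&= \sum_{k=1}^{K} \sum_{i=1}^{n_k} \log\left[ \det(\Theta^{(k)})^{\frac{1}{2}} \exp\left(-\frac{1}{2} Z_i^{(k)T} (\Theta^{(k)} - I) Z_i^{(k)}\right) \right] + \sum_{k=1}^{K} \sum_{i=1}^{n_k} \sum_{j=1}^{p} \log\left[ \frac{1}{\sigma_j \sqrt{2\pi}} \exp\left(-\frac{1}{2} \frac{x_{ij}^{(k)2}}{\sigma_j^2}\right) \right] \\
&= \frac{1}{2} \sum_{k=1}^{K} n_k \log(\det(\Theta^{(k)})) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{n_k} Z_i^{(k)T} (\Theta^{(k)} - I) Z_i^{(k)} - \frac{1}{2} \sum_{k=1}^{K} n_k p \log(2\pi) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \sum_{j=1}^{p} x_{ij}^{(k)2} \\
&\propto \frac{1}{2} \sum_{k=1}^{K} n_k \log(\det(\Theta^{(k)})) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{n_k} Z_i^{(k)T} (\Theta^{(k)} - I) Z_i^{(k)},
\end{aligned} \tag{3}
$$

where $X = (X^{(1)}, \ldots, X^{(K)})^T$. We denote $\{\Theta^{(k)}\}_{k=1}^{K}$ as $\Theta$ for the purpose of simplicity. The two rightmost terms in the penultimate line of (3) were omitted, as they are constant with respect to $\Theta^{(k)}$ because of the standard normal marginals.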
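The final line of (3) can be evaluated in a few lines of code and checked against the ratio of a multivariate normal density to a product of standard normal densities, which is how the Gaussian copula density was derived. This is a sketch under stated assumptions: the function name `copula_loglik`, the two example correlation matrices and the randomly drawn latent matrices are hypothetical, and the latent values are treated as known here although they are not observed in practice.

```python
import numpy as np
from scipy import stats

def copula_loglik(thetas, zs):
    """0.5 * sum_k n_k log det(Theta_k) - 0.5 * sum_k sum_i z_i^T (Theta_k - I) z_i."""
    total = 0.0
    for theta, z in zip(thetas, zs):
        n_k, p = z.shape
        sign, logdet = np.linalg.slogdet(theta)
        if sign <= 0:
            raise ValueError("each precision matrix must be positive definite")
        # Sum over observations of the quadratic form z_i^T (Theta - I) z_i.
        quad = np.einsum("ij,jk,ik->", z, theta - np.eye(p), z)
        total += 0.5 * n_k * logdet - 0.5 * quad
    return total

rng = np.random.default_rng(1)
sigmas = [np.array([[1.0, 0.5], [0.5, 1.0]]),
          np.array([[1.0, -0.2], [-0.2, 1.0]])]
zs = [rng.standard_normal((5, 2)), rng.standard_normal((8, 2))]  # groups of unequal size
thetas = [np.linalg.inv(s) for s in sigmas]

value = copula_loglik(thetas, zs)

# Direct evaluation: log of (joint normal density / product of standard normal densities).
direct = sum(
    stats.multivariate_normal(mean=np.zeros(2), cov=s).logpdf(z).sum()
    - stats.norm.logpdf(z).sum()
    for s, z in zip(sigmas, zs)
)
```

The two quantities agree up to floating-point error, which confirms that the $(2\pi)$ factors cancel and only the $\det(\Theta^{(k)})$ and $(\Theta^{(k)} - I)$ terms remain.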

2.2 Model estimation

When estimating the marginals, a nonparametric approach is adhered to, as is common in the copula literature. This is due to the computational costs involved in estimating them and because we only care about the dependencies encoded in the $\Theta^{(k)}$. They are estimated as $\hat{F}_j^{(k)}(x) = \frac{1}{n_k + 1} \sum_{i=1}^{n_k} I(X_{ij}^{(k)} \leq x)$. Whilst (3) allows for the joint estimation of the graphical models pertaining to the different groups, the resulting estimates are not sparse and relations cannot be enforced to be the same across groups. Sparsity is a common assumption in biological networks and production ecology is not an exception. Consider for example the solubilization of fertiliser, which is independent of root activity (de Wit, 1953), the independence between nitrogen and yield for certain crops (Raun et al., 2011), or more generally the independence between weather and various management techniques. Moreover, if certain groups are highly similar, for example different locations with similar climates, enforcing relations between those groups to be the same is both realistic and parsimonious. Therefore, a fused-type penalty is imposed upon the precision matrix, such that the penalised log-likelihood function has the following form

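$$
\ell(\Theta \mid X) = \frac{1}{2} \sum_{k=1}^{K} n_k \log(\det(\Theta^{(k)})) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{n_k} Z_i^{(k)T} (\Theta^{(k)} - I) Z_i^{(k)} - \lambda_1 \sum_{k=1}^{K} \sum_{j \neq j'} |\theta_{jj'}^{(k)}| - \lambda_2 \sum_{k < k'} \sum_{j, j'} |\theta_{jj'}^{(k)} - \theta_{jj'}^{(k')}| \tag{4}
$$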

for $1 \leq k \neq k' \leq K$ and $1 \leq j \neq j' \leq p$. Here, $\lambda_1$ controls the sparsity of the $K$ different graphs and $\lambda_2$ controls the edge-similarity between the $K$ different graphs. Higher values for $\lambda_1$ and $\lambda_2$ correspond to respectively more sparse and more similar graphs, where similarity is not limited to similar sparsity patterns in the different $\Theta^{(k)}$, but also extends to attaining the exact same coefficients across different $\Theta^{(k)}$. The fused-type penalty for heterogeneous data graphical models was originally proposed by Danaher et al. (2014). Whenever groups pertaining to seasons or environments share similar characteristics, production ecological research has hinted at similar edge values between groups (Hajjarpoor et al., 2021; Zhang et al., 1999; Richards & Townley-Smith, 1987). Consider the case where groups represent different locations. If two groups have very similar environments, in terms of both weather patterns and soil properties, many conditional independence relations are expected to be similar between the groups, as the underlying production ecological relations are assumed to be invariant across (near) identical situations (Connor et al., 2011). Conversely, if the number of shared characteristics is limited between the groups, the edge values between groups are expected to be different, which is reflected in a low value for $\lambda_2$, as obtained from a penalty parameter selection method. Moreover, this fused-type penalty has been shown to outperform other types of penalties (Danaher et al., 2014), and, if the data contain only 2 groups, this type of penalty has a very light computational burden, due to the existence of a closed-form solution for (4) once the conditional expectations of the latent variables have been computed.
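To make the roles of $\lambda_1$ and $\lambda_2$ concrete, the sketch below evaluates the rescaled empirical cdf used for the marginals and the penalised objective (4) for given precision matrices. Several simplifications are assumptions of this illustration and not part of the method: the latent values are replaced by plug-in normal scores $\Phi^{-1}(\hat{F}_j^{(k)}(x))$ instead of the conditional expectations used within the EM algorithm, the data are simulated Poisson counts, and the names `rescaled_ecdf` and `penalised_loglik` are hypothetical.

```python
import numpy as np
from scipy import stats

def rescaled_ecdf(column):
    """F_hat(x_i) = #{i' : X_i' <= x_i} / (n + 1), evaluated at the observed points."""
    column = np.asarray(column)
    counts = (column[None, :] <= column[:, None]).sum(axis=1)
    return counts / (len(column) + 1)

def penalised_loglik(thetas, zs, lam1, lam2):
    """Objective (4): likelihood part minus lasso and fused penalties."""
    fit = 0.0
    for theta, z in zip(thetas, zs):
        n_k, p = z.shape
        fit += 0.5 * n_k * np.linalg.slogdet(theta)[1]
        fit -= 0.5 * np.einsum("ij,jk,ik->", z, theta - np.eye(p), z)
    off_diag = ~np.eye(thetas[0].shape[0], dtype=bool)
    # Lasso term: off-diagonal entries only (j != j').
    sparsity = sum(np.abs(t[off_diag]).sum() for t in thetas)
    # Fused term: all entries, summed over group pairs k < k'.
    fusion = sum(np.abs(thetas[a] - thetas[b]).sum()
                 for a in range(len(thetas)) for b in range(a + 1, len(thetas)))
    return fit - lam1 * sparsity - lam2 * fusion

rng = np.random.default_rng(2)
x_groups = [rng.poisson(3.0, size=(30, 2)), rng.poisson(3.0, size=(40, 2))]
# Plug-in normal scores per group and variable (a simplification, see lead-in).
zs = [np.column_stack([stats.norm.ppf(rescaled_ecdf(x[:, j]))
                       for j in range(x.shape[1])]) for x in x_groups]

thetas = [np.array([[1.2, 0.4], [0.4, 1.2]]),
          np.array([[1.2, 0.1], [0.1, 1.2]])]
unpenalised = penalised_loglik(thetas, zs, 0.0, 0.0)
penalised = penalised_loglik(thetas, zs, 1.0, 2.0)
```

Dividing by $n_k + 1$ rather than $n_k$ keeps every estimated cdf value strictly below one, so the normal scores stay finite. The difference between `unpenalised` and `penalised` is exactly the weighted sum of the two penalty terms, which shows how increasing $\lambda_1$ pushes off-diagonal entries towards zero and increasing $\lambda_2$ pushes the groups' precision matrices towards each other.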

As direct maximization of $\ell(\Theta \mid X)$ is not feasible due to the nonexistence of an analytic expression of (4), an iterative method is needed to estimate the value of $\Theta_{\lambda_1, \lambda_2}$. A common algorithm used in the presence of latent variables is the EM algorithm (McLachlan & Krishnan, 2007). A benefit of this algorithm is that it can handle missing data, which is not uncommon in production ecology as plants can die mid-season due to external stresses such as droughts or pests. The EM algorithm alternates between an E-step and an M-step, where during the E-step the expectation of the (unpenalised) complete-data (both $X$ and $Z$) log-likelihood conditional on the event $D$ and the estimate $\hat{\Theta}^{(m)}$ obtained during the previous M-step is computed
