
at different locations, or medical data of different diseases. Consequently, the aim of this article is to fill the
methodological gap in graphical models for heterogeneous mixed data.
Even though Jia and Liang (2020) aimed to close this methodological gap using their joint mixed learning
model, the effectiveness of said model has only been shown in the case where the data follow Gaussian or
binomial distributions. This is not always the case in real-world applications. In addition, the model is unable
to handle missing data, which tend to be the norm, rather than the exception in real-world data (Nakagawa
& Freckleton, 2008). Although Jia and Liang also provide an R package with their method, it is currently
deprecated and not usable for graph estimation. Motivated by an application of networks on disease status,
Park and Won (2022) recently proposed the fused mixed graphical model (FMGM): a method to infer graph structures of
mixed-type (numerical and categorical) data for multiple groups. This approach is based on the mixed graphical
model by Lee and Hastie (2013), but extended to the multi-group setting. Their model assumes that the
categorical variables given all other variables follow a multinomial distribution and all numeric variables follow
a Gaussian distribution given all other variables, which is not realistic in the case of Poisson or non-Gaussian
continuous variables. Moreover, the imposed penalty function involves 6 different penalty parameters to be
estimated for 2 groups, a number that grows as the number of groups increases, making the FMGM
prohibitively computationally expensive. Furthermore, the FMGM is compared only with separate network
estimation and not with existing methods, giving no indication of its comparative performance on different types
of data. Finally, the FMGM is not accompanied by an R package that allows for such comparative analyses.
There is a need for a method that can handle more general mixed-type data consisting of any combination of
continuous and ordered discrete variables in a heterogeneous setting, which to the best of our knowledge does
not exist at present. Borrowing from recent developments in copula graphical models, the proposed method can
handle Gaussian, non-Gaussian continuous, Poisson, ordinal and binomial variables, thereby letting researchers
model a wider variety of problems. All code used in this article can be found at
https://github.com/sjoher/cgmhmd-analysis, whilst the R package can be found at https://github.com/sjoher/cgmhmd.
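As an informal illustration of the copula idea referred to above, the sketch below maps a single latent standard normal value to Poisson, ordinal and binary observations through their quantile functions. It shows the general Gaussian copula construction only, not the estimation procedure of this article. The function names, the Poisson rate, the category probabilities and the success probability are hypothetical choices made for this example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def poisson_quantile(u, rate):
    """Smallest count k whose Poisson(rate) CDF is at least u."""
    k = 0
    pmf = math.exp(-rate)  # P(X = 0)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= rate / k  # P(X = k) obtained from P(X = k - 1)
        cdf += pmf
    return k


def ordinal_level(u, cum_probs):
    """Index of the first category whose cumulative probability reaches u."""
    for level, cum in enumerate(cum_probs):
        if u <= cum:
            return level
    return len(cum_probs) - 1


def latent_to_observed(z, rate=3.0, cum_probs=(0.2, 0.7, 1.0), p_success=0.4):
    """Map one latent N(0, 1) value to Poisson, ordinal and binary values.

    All marginal parameters are illustrative assumptions.
    """
    # Uniform score on (0, 1); clamped so the quantile search always terminates.
    u = min(STD_NORMAL.cdf(z), 1.0 - 1e-12)
    return {
        "poisson": poisson_quantile(u, rate),
        "ordinal": ordinal_level(u, cum_probs),
        "binary": int(u > 1.0 - p_success),  # equals 1 with probability p_success
    }


if __name__ == "__main__":
    for z in (-2.0, 0.0, 2.0):
        print(z, latent_to_observed(z))
```

Because every mapping is non-decreasing in z, dependence among latent Gaussian variables carries over to the observed variables regardless of their marginal distributions.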
1.1 Application to production ecological data
Interest in relationships between multiple variables based on samples obtained over different locations and
time points is particularly common in production ecology, a science that aims to understand and predict the
productivity of agricultural systems (e.g. yield) as a function of their genetic biological components (G), the
production environment (E) and human management (M). Production-ecological data typically consist of observations
from different crops, seasons, environments, or management conditions, and research in this field is likely to benefit
from the use of graphical models. Moreover, production-ecological data tend to be of mixed type, consisting of
(commonly) Gaussian, non-Gaussian continuous and Poisson environmental data, but also ordinal and binomial
management data.
A typical challenge for production-ecological research lies in explaining variability in observed yields as a
function of a wide set of potential enabling and constraining variables. This is typically done by employing
linear models or basic machine learning methods such as random forest that model yield as a function of a set
of covariates (Ronner et al., 2016; Bielders & Gérard, 2015; Palmas & Chamberlin, 2020). However, advanced
statistical models such as graphical models have not yet been introduced to this field. As graphical models are
used to represent the conditional dependencies underlying a set of variables, we expect that these models can
greatly aid researchers’ understanding of G×E×M interactions by exposing new, fundamental relationships
that affect plant production and that have not been captured by the methods commonly used in the
field of production ecology. Therefore, we use this field as a way to illustrate our proposed method and thereby
introduce graphical models in general to production ecologists.
This article extends the Gaussian copula graphical model to allow for heterogeneous, mixed data, and
showcases the effectiveness of the novel approach on production-ecological data. To this end, Section 2
presents the methodology behind the Gaussian copula graphical model for heterogeneous data. Section
3 presents an extensive simulation study in which the performance of the proposed method is evaluated against
other types of graphical models. An application of the new method to production-ecological data
consisting of multiple seasons is given in Section 4. Finally, Section 5 concludes.
2 Methodology
A Gaussian graphical model corresponds to a graph $G = (V, E)$ that represents the full conditional dependence
structure between variables represented by a set of vertices $V = \{1, 2, \ldots, p\}$ through the use of a
set of undirected edges $E \subset V \times V$, and depends on an $n \times p$ data matrix $X = (X_1, X_2, \ldots, X_p)$,
$X_j = (X_{1j}, X_{2j}, \ldots, X_{nj})^T$, $j = 1, \ldots, p$, where $X \sim N_p(0, \Sigma)$, with $\Sigma = \Theta^{-1}$.
$\Theta$ is known as the precision matrix containing the scaled partial correlations:
$\rho_{ij} = -\dfrac{\Theta_{ij}}{\sqrt{\Theta_{ii}\Theta_{jj}}}$. Thus, the partial correlation $\rho_{ij}$ represents the