Comparison of REML methods for the study of phenome-wide genetic variation Damian Pavlyshyn12 Iain M. Johnstone1 and Jacqueline L. Sztepanacz3

2025-04-29 0 0 501.26KB 23 页 10玖币
侵权投诉
Comparison of REML methods for the study of phenome-wide
genetic variation
Damian Pavlyshyn1,2, Iain M. Johnstone1, and Jacqueline L. Sztepanacz3
1Department of Statistics, Stanford University
2Disease Elimination Program, Burnet Institute
3Department of Ecology and Evolutionary Biology, University of Toronto
October 2022
Abstract
It is now well documented that genetic covariance between functionally related traits
leads to an uneven distribution of genetic variation across multivariate trait combina-
tions, and possibly a large part of phenotype-space that is inaccessible to evolution.
How the size of this nearly-null genetic space translates to the broader phenome level is
unknown. High dimensional phenotype data to address these questions are now within
reach, however, incorporating these data into genetic analyses remains a challenge.
Multi-trait genetic analyses, of more than a handful of traits, are slow and often fail
to converge when fit with REML. This makes it challenging to estimate the genetic co-
variance (G) underlying thousands of traits, let alone study its properties. We present
a previously proposed REML algorithm that is feasible for high dimensional genetic
studies in the specific setting of a balanced nested half-sib design, common of quanti-
tative genetics. We show that it substantially outperforms other common approaches
when the number of traits is large, and we use it to investigate the bias in estimated
eigenvalues of Gand the size of the nearly-null genetic subspace. We show that the
high-dimensional biases observed are qualitatively similar to those substantiated by
asymptotic approximation in a simpler setting of a sample covariance matrix based on
i.i.d. vector observation, and that interpreting the estimated size of the nearly-null
genetic subspace requires considerable caution in high-dimensional studies of genetic
variation. Our results provide the foundation for future research characterizing the
asymptotic approximation of estimated genetic eigenvalues, and a statistical null dis-
tribution for phenome-wide studies of genetic variation.
1 Introduction
To answer the most compelling questions in evolutionary biology we must uncover the
causal connection between genotypes, phenotypes, and the environment (selection). Un-
like the finite genomes that underlie them, organisms are comprised of essentially infinite
phenotypes that may genetically vary and covary. While not all of those phenotypes will
be meaningful for fitness, there are at least many thousands to consider, if we are to begin
to understand the genotype-phenotype map. Efforts to increase the scope and through-
put of phenotyping have been cited as an urgent priority in evolutionary biology since at
least 2010 [Houle, 2010, Houle et al., 2010]. High-throughput phenomic technologies such
as such as RNA sequencing [Schrag et al., 2018], drone based plant imaging [Furbank and
1
arXiv:2210.11709v1 [stat.AP] 21 Oct 2022
Tester, 2011], wearable sensors [Haleem et al., 2021, Sharma et al., 2021, Neethirajan, 2017],
metabolomics for physiological measurements [Jin et al., 2020, Freimer and Sabatti, 2003],
and computer vision for morphometrics [L¨urig et al., 2021] are now within reach. How-
ever, incorporating these high-dimensional phenotype data into genetic analyses remains a
challenge.
Multivariate studies of genetic variation focusing on relatively small sets of traits p20
(small p) have transformed our understanding of how genetic variation is distributed across
phenotypes, and how this affects evolutionary outcomes. While almost all individual traits
studied have been shown to have genetic variation [Lynch et al., 1998], and responses to
artificial selection are often rapid and of large magnitude [Hill and Kirkpatrick, 2010], mul-
tivariate studies of the genetic variance-covariance matrix (G) show that this genetic varia-
tion is distributed unevenly across multivariate phenotypes [Kirkpatrick, 2009, Sztepanacz
and Houle, 2019]. The concentration of genetic variance onto fewer multivariate trait com-
binations than the number of phenotypes measured is biologically caused by the pleiotropic
effects of alleles on multiple traits which leads to their genetic covariance [Lande, 1980].
Multivariate trait combinations with high genetic variation form a genetic subspace where
traits are predicted to have high evolvability. Evolution is predicted to occur more quickly
along these multivariate trait combinations than in any individual trait [Agrawal and Stinch-
combe, 2009], contributing to divergence among populations (eg. [McGlothlin et al., 2022,
Schluter, 1996]), species (eg. [Innocenti and Chenoweth, 2013, B´egin and Roff, 2004]), and
sexes (eg. Gosden and Chenoweth [2014]). The set of orthogonal multivariate trait com-
binations with low genetic variation, form a complementary subspace which is termed the
nearly-null genetic subspace [Gomulkiewicz and Houle, 2009, Gaydos et al., 2013]. This
subspace putatively represents important evolutionary constraints in natural populations,
where phenotypic evolution is expected to be constrained [Gomulkiewicz and Houle, 2009],
occur slowly [Kirkpatrick, 2009] or stochastically [Hine et al., 2014].
Estimating the size of the nearly-null genetic subspace is an avenue to quantify both
genetic constraints that may lead to evolutionary limits, and the extent of pleiotropy under-
lying organisms, which determines their genetic dimensionality. Past studies have typically
found that genetic variance is restricted to less than half of the phenotype space with
most multivariate trait combinations having no detectable genetic variation [Blows and
McGuigan, 2015] . One notable exception to this pattern is Drosophila wing shape, which
as been shown to have genetic variation in all multivariate wing shape traits (ie. a full rank
G) [Mezey and Houle, 2005, Sztepanacz and Blows, 2015, Sztepanacz and Houle, 2019,
Houle and Meyer, 2015]. These multivariate studies have typically dealt with small sets
of functionally related traits such as wing [Mezey and Houle, 2005], skeletal [Garcia et al.,
2014] or phyto [Walsh and Lynch, 2018] morphology, or traits that have a strong physio-
logical [Caruso et al., 2005], or biochemical relationships [Sztepanacz and Rundle, 2012].
Therefore, genetic covariance among these traits may be relatively high compared to the
range of traits that are possible to study with phenomics. Whether inferences for the size
of the nearly null subspaces found in these small pstudies can be extrapolated to phenomes
is an open biological question.
Confounded with the biological phenomena that nearly-null genetic subspaces represent,
are the statistical phenomena that arise from their estimation. Even in low pstudies, esti-
mating Gand its eigenvalues is a challenge. The standard approach is to fit a multivariate
mixed model in REML to estimate Gas an unstructured covariance matrix. These mod-
els commonly fail to converge with only a small number of traits, and run-times are long
because of the quadratic growth p2/2 in the number of parameters to estimate with the
2
number of traits. As shown in the one-way MANOVA context, small pmodels that do con-
verge, often produce estimated matrices that are not positive definite [Hill and Thompson,
1978] and that have overdispersed eigenvalues [Hayes and Hill, 1981], properties which are
exacerbated in the two-way hierarchical model, such as the nested half-sib design common
in quantitative genetic studies. In particular, Hayes and Hill [1981] show that the magni-
tude of overdisperion of the genetic eigenvalues is determined by p/n with smaller sample
sizes or more traits leading to larger dispersion. The overdispersion of estimated eigenval-
ues is a pattern remarkably similar to that we interpret biologically as genetic subspaces of
high evolvability and those that are nearly null. In the phenomic context, where p/n 1 or
possibly even greater than 1, it could even obscure any biological signal. Disentangling how
much of the dispersion of eigenvalues of Gis due to biological covariance versus statistical
estimation is therefore important, if we want to predict evolutionary constraints.
In simpler settings, recent work has provided a quantitative description of the bias and
error related to the dispersion of estimated eigenvalues when pis of order comparable to n,
summarized for example in [Johnstone and Paul, 2018]. The bulk eigenvalue distribution
of sample (i.i.d) covariance matrices, such as those describing among-individual covari-
ance structure such as phenotypic covariance matrices P, or genomic relatedness matrices
(GRMs), is known to follow the Marchenko Pastur distribution [Marˇcenko and Pastur, 1967,
Yao et al., 2015]. For a large class of mixed models, including full-sib and half-sib designs,
and for the particular case of MANOVA estimators, overdisperson of sample eigenvalues in
the bulk eigenvalue distribution is described by a generalization of the Marchenko-Pastur
distribution [Fan and Johnstone, 2019].
The individual eigenvalues of sample covariance matrices also follow particular distribu-
tions. The leading eigenvalue conforms to the Tracy-Widom distribution, both for sample
(i.i.d.) covariance matrices [Johnstone, 2001] and for MANOVA estimates from mixed mod-
els[Fan and Johnstone, 2022]. The trailing (smallest) eigenvalue converges to a reflected
Tracy-Widom distribution [Fan and Johnstone, 2022]. Again for MANOVA estimates in
mixed models, the leading estimated eigenvalues and eigenvectors can be influenced by
an alignment between directions of sufficiently elevated variance in the different covariance
components [Fan et al., 2018]. For example, an alignment between the major axis of genetic
variance gmax and environmental variation could lead to an estimated gmax that is biased
toward the major axis of environmental variation and with an eigenvalue that is much larger
than it should be. These quantitative descriptions provide appropriate null distributions
for MANOVA eigenvalues, and enable bias-correction for estimated eigenvalues in these
contexts.
Simple among line (one-way) genetic simulation studies estimating Gfor p=5 using
REML, combined with an empirical centering and scaling approach [Saccenti et al., 2011],
suggest that REML eigenvalues may have similar properties. The bulk distribution of
eigenvalues appears to follow the Marchenko Pastur distribution [Blows and McGuigan,
2015], and the leading eigenvalues of estimated Gare consistent with the Tracy Widom
distribution [Sztepanacz and Blows, 2017]. Notably, however, the leading eigenvalues of G
estimated using factor analysis or MCMCglmm do not follow the Tracy Widom distribution
[Sztepanacz and Blows, 2017], showing that the method used to estimate Ginfluences the
sampling distribution of its eigenvalues. These studies laid the groundwork for determining
the properties of REML eigenvalues in the one-way design, however, they are limited to
small pand np. A major hurdle in extrapolating these studies to the phenomic-scale,
is the difficulty in employing REML for high p.
Alternative approaches to REML have been developed in recent years and employed to
3
estimate Gand its eigenvalues for high pphenomic studies. The Bayesian sparse factor
approach of Runcie and Mukherjee [2013] enables the estimation of genetic covariance for
thousands of traits by placing a prior on the latent factors underlying Gand Ethat few
traits contribute to them (ie. they are sparse). Studies using this method have qualita-
tively found the same uneven distribution of standing genetic variation [Garcia et al., 2014],
mutational variation [Hine et al., 2018], and a large nearly-null genetic subspace, as seen
in small pstudies. While biologically motivated, the assumption that sparse factors un-
derlie both Gand Emay upwardly bias estimates of the nearly null subspace. MegaLMM
[Runcie et al., 2021] uses strong Bayesian priors on the sparsity and number of important
latent factors underlying Gand Eto fit linear mixed models for >20,000 traits and rel-
atively small n. The primary goal of this approach, however, is for genomic prediction,
not estimation. Consequently, the properties that make the method feasible for including
phenome-level data, and which are beneficial for genomic prediction, may have undesirable
properties with respect to estimation. Blows et al. [2015] circumvented convergence prob-
lems in REML by constructing large Gfrom a set of overlapping principal sub-matrices,
estimated for a small numbers of traits using REML. Since the constructed matrix is not
guaranteed to be positive definite, a subsequent bending [Hayes and Hill, 1981, Meyer,
2019] of Gwas required. This approach enabled a coarse description of the distribution of
genetic variation across G, finding that the vast majority of genetic variation was found in
few dimensions. However, a quantitative description of the size of the nearly null subspace
was not possible. While each of these approaches have value in certain circumstances, they
all make some undesirable assumptions that lead to challenges in interpreting the size of
the nearly null space and performing statistical hypothesis testing.
In this paper we demonstrate that REML is feasible for high pphenomics in the spe-
cific setting of a balanced nested half-sib design, without having to make any additional
assumptions about the structure of Gor perform any bending, and we begin to study the
properties of estimated eigenvalues from these high pmodels. In particular, we perform
simulation study investigating the spectral characteristics of the REML estimate of Gfor
a family of balanced half-sib breeding designs in the moderate-p(p50) setting.
As algorithms that are usually used to fit mixed models often fail to converge even for p
as small as 5, we use an algorithm designed specifically for fitting balanced nested designs
introduced in Calvin and Dykstra [1991] that will allow us to find REML estimates for
larger values of p.
In section 3, we compare this Calvin-Dijkstra algorithm with various other procedures
used to compute or approximate the REML estimate, showing that it does at least as well as
its common alternatives. We then demonstrate that the algorithm is a practical method for
REML estimation for the balanced designs even for p50, in which regard it substantially
outperforms the alternatives presented.
Sections 4 and 5 investigate the illustrate the biases inherent in using functions of eigen-
values of the REML estimate ˆ
Gto estimate the corresponding functions of the eigenvalues
of G. We show that, as pgets large, the large eigenvalues of ˆ
Ghave a substantial upward
bias when used to estimate the large eigenvalues of G. We also investigate the bias in esti-
mators of the nearly-null dimension of Gbased on counting small eigenvalues of ˆ
G. These
biases are more difficult to describe, so we demonstrate the behavior of such estimators
on a family of qualitatively different between-sire and between-dam covariance matrices.
Section 6 has some summary conclusions and discussion.
4
2 Description of methods compared
Our simulations use a balanced half-sib (or nested two-way) design
Yijk =µ+αi+βij +εijk (1)
Here αi, βij and εijk are the random effects due to sire, dam and individual respectively,
and µis a fixed intercept.
Each Yijk is a vector of ptraits; there are I, J and Kpossible values of the indices i, j
and k. Each random effect vector is assumed to follow independent normal distributions
αi N (0,ΣA), βij N (0,ΣB), ijk N (0,ΣE).(2)
We are particularly interested in estimation of the genetic covariance matrix G= ΣA
and its eigenvalues, but will also look at the concomitant estimates of ΣBand ΣE. The
estimates we will compare are
MANOVA, which are easy to compute but do not necessarily produce a positive-
definite estimate for ΣA,
REML-lme. fitted using a generic mixed-effects model solver (we use Rfunction
lme()), which should converge to the true REML estimates but are prohibitively
slow to compute for high (or even moderate) numbers of traits,
pseudo-REML, as described by Amemiya [1985]. These are easy to compute, and in
the full-sub (one-way) design are exactly the REML estimates. In the half-sib case,
they are only approximations.
stepwise REML, as described by Calvin and Dykstra [1991] This is an iterative algo-
rithm that converges to the true REML estimates. Although we know of no results
decribing the rate of convergence, in practice it is very fast, allowing for easy REML
estimation even for 100 ×100 matrices.
pairwise REML, in which the estimates of the individual entries Σlm are computed
using a bivariate analysis with traits land m. This estimate is sometimes proposed
as a computational fall-back when the dimensionality of Σ is too high for REML to
converge; it need not be non-negative definite, as will be seen.
3 Initial examples
3.1 A small pexample with zero eigenvalues
As a warm-up, we first illustrate the methods in a setting with a small number of traits,
p= 4.1Specifically, we take
p= 4, I = 100, J = 3, K = 5,(3)
and diagonal structures for the variance component matrices:
ΣA=σ2
Adiag(1,1,0,0), σ2
A= 25
ΣB=σ2
Bdiag(1,0,0,1), σ2
B= 9,(4)
ΣE=σ2
EId4, σ2
E= 1.
1indeed, this was the largest number of traits for which lme() runs relatively quickly.
5
摘要:

ComparisonofREMLmethodsforthestudyofphenome-widegeneticvariationDamianPavlyshyn1,2,IainM.Johnstone1,andJacquelineL.Sztepanacz31DepartmentofStatistics,StanfordUniversity2DiseaseEliminationProgram,BurnetInstitute3DepartmentofEcologyandEvolutionaryBiology,UniversityofTorontoOctober2022AbstractItisnowwe...

展开>> 收起<<
Comparison of REML methods for the study of phenome-wide genetic variation Damian Pavlyshyn12 Iain M. Johnstone1 and Jacqueline L. Sztepanacz3.pdf

共23页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:23 页 大小:501.26KB 格式:PDF 时间:2025-04-29

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 23
客服
关注