Comparison of REML methods for the study of phenome-wide genetic variation Damian Pavlyshyn12 Iain M. Johnstone1 and Jacqueline L. Sztepanacz3

2025-04-29 0 0 501.26KB 23 页 10玖币

侵权投诉

Comparison of REML methods for the study of phenome-wide

genetic variation

Damian Pavlyshyn1,2, Iain M. Johnstone1, and Jacqueline L. Sztepanacz3

1Department of Statistics, Stanford University

2Disease Elimination Program, Burnet Institute

3Department of Ecology and Evolutionary Biology, University of Toronto

October 2022

Abstract

It is now well documented that genetic covariance between functionally related traits

leads to an uneven distribution of genetic variation across multivariate trait combina-

tions, and possibly a large part of phenotype-space that is inaccessible to evolution.

How the size of this nearly-null genetic space translates to the broader phenome level is

unknown. High dimensional phenotype data to address these questions are now within

reach, however, incorporating these data into genetic analyses remains a challenge.

Multi-trait genetic analyses, of more than a handful of traits, are slow and often fail

to converge when ﬁt with REML. This makes it challenging to estimate the genetic co-

variance (G) underlying thousands of traits, let alone study its properties. We present

a previously proposed REML algorithm that is feasible for high dimensional genetic

studies in the speciﬁc setting of a balanced nested half-sib design, common of quanti-

tative genetics. We show that it substantially outperforms other common approaches

when the number of traits is large, and we use it to investigate the bias in estimated

eigenvalues of Gand the size of the nearly-null genetic subspace. We show that the

high-dimensional biases observed are qualitatively similar to those substantiated by

asymptotic approximation in a simpler setting of a sample covariance matrix based on

i.i.d. vector observation, and that interpreting the estimated size of the nearly-null

genetic subspace requires considerable caution in high-dimensional studies of genetic

variation. Our results provide the foundation for future research characterizing the

asymptotic approximation of estimated genetic eigenvalues, and a statistical null dis-

tribution for phenome-wide studies of genetic variation.

1 Introduction

To answer the most compelling questions in evolutionary biology we must uncover the

causal connection between genotypes, phenotypes, and the environment (selection). Un-

like the ﬁnite genomes that underlie them, organisms are comprised of essentially inﬁnite

phenotypes that may genetically vary and covary. While not all of those phenotypes will

be meaningful for ﬁtness, there are at least many thousands to consider, if we are to begin

to understand the genotype-phenotype map. Eﬀorts to increase the scope and through-

put of phenotyping have been cited as an urgent priority in evolutionary biology since at

least 2010 [Houle, 2010, Houle et al., 2010]. High-throughput phenomic technologies such

as such as RNA sequencing [Schrag et al., 2018], drone based plant imaging [Furbank and

arXiv:2210.11709v1 [stat.AP] 21 Oct 2022

Tester, 2011], wearable sensors [Haleem et al., 2021, Sharma et al., 2021, Neethirajan, 2017],

metabolomics for physiological measurements [Jin et al., 2020, Freimer and Sabatti, 2003],

and computer vision for morphometrics [L¨urig et al., 2021] are now within reach. How-

ever, incorporating these high-dimensional phenotype data into genetic analyses remains a

challenge.

Multivariate studies of genetic variation focusing on relatively small sets of traits p20

(small p) have transformed our understanding of how genetic variation is distributed across

phenotypes, and how this aﬀects evolutionary outcomes. While almost all individual traits

studied have been shown to have genetic variation [Lynch et al., 1998], and responses to

artiﬁcial selection are often rapid and of large magnitude [Hill and Kirkpatrick, 2010], mul-

tivariate studies of the genetic variance-covariance matrix (G) show that this genetic varia-

tion is distributed unevenly across multivariate phenotypes [Kirkpatrick, 2009, Sztepanacz

and Houle, 2019]. The concentration of genetic variance onto fewer multivariate trait com-

binations than the number of phenotypes measured is biologically caused by the pleiotropic

eﬀects of alleles on multiple traits which leads to their genetic covariance [Lande, 1980].

Multivariate trait combinations with high genetic variation form a genetic subspace where

traits are predicted to have high evolvability. Evolution is predicted to occur more quickly

along these multivariate trait combinations than in any individual trait [Agrawal and Stinch-

combe, 2009], contributing to divergence among populations (eg. [McGlothlin et al., 2022,

Schluter, 1996]), species (eg. [Innocenti and Chenoweth, 2013, B´egin and Roﬀ, 2004]), and

sexes (eg. Gosden and Chenoweth [2014]). The set of orthogonal multivariate trait com-

binations with low genetic variation, form a complementary subspace which is termed the

nearly-null genetic subspace [Gomulkiewicz and Houle, 2009, Gaydos et al., 2013]. This

subspace putatively represents important evolutionary constraints in natural populations,

where phenotypic evolution is expected to be constrained [Gomulkiewicz and Houle, 2009],

occur slowly [Kirkpatrick, 2009] or stochastically [Hine et al., 2014].

Estimating the size of the nearly-null genetic subspace is an avenue to quantify both

genetic constraints that may lead to evolutionary limits, and the extent of pleiotropy under-

lying organisms, which determines their genetic dimensionality. Past studies have typically

found that genetic variance is restricted to less than half of the phenotype space with

most multivariate trait combinations having no detectable genetic variation [Blows and

McGuigan, 2015] . One notable exception to this pattern is Drosophila wing shape, which

as been shown to have genetic variation in all multivariate wing shape traits (ie. a full rank

G) [Mezey and Houle, 2005, Sztepanacz and Blows, 2015, Sztepanacz and Houle, 2019,

Houle and Meyer, 2015]. These multivariate studies have typically dealt with small sets

of functionally related traits such as wing [Mezey and Houle, 2005], skeletal [Garcia et al.,

2014] or phyto [Walsh and Lynch, 2018] morphology, or traits that have a strong physio-

logical [Caruso et al., 2005], or biochemical relationships [Sztepanacz and Rundle, 2012].

Therefore, genetic covariance among these traits may be relatively high compared to the

range of traits that are possible to study with phenomics. Whether inferences for the size

of the nearly null subspaces found in these small pstudies can be extrapolated to phenomes

is an open biological question.

Confounded with the biological phenomena that nearly-null genetic subspaces represent,

are the statistical phenomena that arise from their estimation. Even in low pstudies, esti-

mating Gand its eigenvalues is a challenge. The standard approach is to ﬁt a multivariate

mixed model in REML to estimate Gas an unstructured covariance matrix. These mod-

els commonly fail to converge with only a small number of traits, and run-times are long

because of the quadratic growth ∼p2/2 in the number of parameters to estimate with the

number of traits. As shown in the one-way MANOVA context, small pmodels that do con-

verge, often produce estimated matrices that are not positive deﬁnite [Hill and Thompson,

1978] and that have overdispersed eigenvalues [Hayes and Hill, 1981], properties which are

exacerbated in the two-way hierarchical model, such as the nested half-sib design common

in quantitative genetic studies. In particular, Hayes and Hill [1981] show that the magni-

tude of overdisperion of the genetic eigenvalues is determined by p/n with smaller sample

sizes or more traits leading to larger dispersion. The overdispersion of estimated eigenval-

ues is a pattern remarkably similar to that we interpret biologically as genetic subspaces of

high evolvability and those that are nearly null. In the phenomic context, where p/n ∼1 or

possibly even greater than 1, it could even obscure any biological signal. Disentangling how

much of the dispersion of eigenvalues of Gis due to biological covariance versus statistical

estimation is therefore important, if we want to predict evolutionary constraints.

In simpler settings, recent work has provided a quantitative description of the bias and

error related to the dispersion of estimated eigenvalues when pis of order comparable to n,

summarized for example in [Johnstone and Paul, 2018]. The bulk eigenvalue distribution

of sample (i.i.d) covariance matrices, such as those describing among-individual covari-

ance structure such as phenotypic covariance matrices P, or genomic relatedness matrices

(GRMs), is known to follow the Marchenko Pastur distribution [Marˇcenko and Pastur, 1967,

Yao et al., 2015]. For a large class of mixed models, including full-sib and half-sib designs,

and for the particular case of MANOVA estimators, overdisperson of sample eigenvalues in

the bulk eigenvalue distribution is described by a generalization of the Marchenko-Pastur

distribution [Fan and Johnstone, 2019].

The individual eigenvalues of sample covariance matrices also follow particular distribu-

tions. The leading eigenvalue conforms to the Tracy-Widom distribution, both for sample

(i.i.d.) covariance matrices [Johnstone, 2001] and for MANOVA estimates from mixed mod-

els[Fan and Johnstone, 2022]. The trailing (smallest) eigenvalue converges to a reﬂected

Tracy-Widom distribution [Fan and Johnstone, 2022]. Again for MANOVA estimates in

mixed models, the leading estimated eigenvalues and eigenvectors can be inﬂuenced by

an alignment between directions of suﬃciently elevated variance in the diﬀerent covariance

components [Fan et al., 2018]. For example, an alignment between the major axis of genetic

variance gmax and environmental variation could lead to an estimated gmax that is biased

toward the major axis of environmental variation and with an eigenvalue that is much larger

than it should be. These quantitative descriptions provide appropriate null distributions

for MANOVA eigenvalues, and enable bias-correction for estimated eigenvalues in these

contexts.

Simple among line (one-way) genetic simulation studies estimating Gfor p=5 using

REML, combined with an empirical centering and scaling approach [Saccenti et al., 2011],

suggest that REML eigenvalues may have similar properties. The bulk distribution of

eigenvalues appears to follow the Marchenko Pastur distribution [Blows and McGuigan,

2015], and the leading eigenvalues of estimated Gare consistent with the Tracy Widom

distribution [Sztepanacz and Blows, 2017]. Notably, however, the leading eigenvalues of G

estimated using factor analysis or MCMCglmm do not follow the Tracy Widom distribution

[Sztepanacz and Blows, 2017], showing that the method used to estimate Ginﬂuences the

sampling distribution of its eigenvalues. These studies laid the groundwork for determining

the properties of REML eigenvalues in the one-way design, however, they are limited to

small pand np. A major hurdle in extrapolating these studies to the phenomic-scale,

is the diﬃculty in employing REML for high p.

Alternative approaches to REML have been developed in recent years and employed to

estimate Gand its eigenvalues for high pphenomic studies. The Bayesian sparse factor

approach of Runcie and Mukherjee [2013] enables the estimation of genetic covariance for

thousands of traits by placing a prior on the latent factors underlying Gand Ethat few

traits contribute to them (ie. they are sparse). Studies using this method have qualita-

tively found the same uneven distribution of standing genetic variation [Garcia et al., 2014],

mutational variation [Hine et al., 2018], and a large nearly-null genetic subspace, as seen

in small pstudies. While biologically motivated, the assumption that sparse factors un-

derlie both Gand Emay upwardly bias estimates of the nearly null subspace. MegaLMM

[Runcie et al., 2021] uses strong Bayesian priors on the sparsity and number of important

latent factors underlying Gand Eto ﬁt linear mixed models for >20,000 traits and rel-

atively small n. The primary goal of this approach, however, is for genomic prediction,

not estimation. Consequently, the properties that make the method feasible for including

phenome-level data, and which are beneﬁcial for genomic prediction, may have undesirable

properties with respect to estimation. Blows et al. [2015] circumvented convergence prob-

lems in REML by constructing large Gfrom a set of overlapping principal sub-matrices,

estimated for a small numbers of traits using REML. Since the constructed matrix is not

guaranteed to be positive deﬁnite, a subsequent bending [Hayes and Hill, 1981, Meyer,

2019] of Gwas required. This approach enabled a coarse description of the distribution of

genetic variation across G, ﬁnding that the vast majority of genetic variation was found in

few dimensions. However, a quantitative description of the size of the nearly null subspace

was not possible. While each of these approaches have value in certain circumstances, they

all make some undesirable assumptions that lead to challenges in interpreting the size of

the nearly null space and performing statistical hypothesis testing.

In this paper we demonstrate that REML is feasible for high pphenomics in the spe-

ciﬁc setting of a balanced nested half-sib design, without having to make any additional

assumptions about the structure of Gor perform any bending, and we begin to study the

properties of estimated eigenvalues from these high pmodels. In particular, we perform

simulation study investigating the spectral characteristics of the REML estimate of Gfor

a family of balanced half-sib breeding designs in the moderate-p(p≈50) setting.

As algorithms that are usually used to ﬁt mixed models often fail to converge even for p

as small as 5, we use an algorithm designed speciﬁcally for ﬁtting balanced nested designs

introduced in Calvin and Dykstra [1991] that will allow us to ﬁnd REML estimates for

larger values of p.

In section 3, we compare this Calvin-Dijkstra algorithm with various other procedures

used to compute or approximate the REML estimate, showing that it does at least as well as

its common alternatives. We then demonstrate that the algorithm is a practical method for

REML estimation for the balanced designs even for p≈50, in which regard it substantially

outperforms the alternatives presented.

Sections 4 and 5 investigate the illustrate the biases inherent in using functions of eigen-

values of the REML estimate ˆ

Gto estimate the corresponding functions of the eigenvalues

of G. We show that, as pgets large, the large eigenvalues of ˆ

Ghave a substantial upward

bias when used to estimate the large eigenvalues of G. We also investigate the bias in esti-

mators of the nearly-null dimension of Gbased on counting small eigenvalues of ˆ

G. These

biases are more diﬃcult to describe, so we demonstrate the behavior of such estimators

on a family of qualitatively diﬀerent between-sire and between-dam covariance matrices.

Section 6 has some summary conclusions and discussion.

2 Description of methods compared

Our simulations use a balanced half-sib (or nested two-way) design

Yijk =µ+αi+βij +εijk (1)

Here αi, βij and εijk are the random eﬀects due to sire, dam and individual respectively,

and µis a ﬁxed intercept.

Each Yijk is a vector of ptraits; there are I, J and Kpossible values of the indices i, j

and k. Each random eﬀect vector is assumed to follow independent normal distributions

αi∼ N (0,ΣA), βij ∼ N (0,ΣB), ijk ∼ N (0,ΣE).(2)

We are particularly interested in estimation of the genetic covariance matrix G= ΣA

and its eigenvalues, but will also look at the concomitant estimates of ΣBand ΣE. The

estimates we will compare are

•MANOVA, which are easy to compute but do not necessarily produce a positive-

deﬁnite estimate for ΣA,

•REML-lme. ﬁtted using a generic mixed-eﬀects model solver (we use Rfunction

lme()), which should converge to the true REML estimates but are prohibitively

slow to compute for high (or even moderate) numbers of traits,

•pseudo-REML, as described by Amemiya [1985]. These are easy to compute, and in

the full-sub (one-way) design are exactly the REML estimates. In the half-sib case,

they are only approximations.

•stepwise REML, as described by Calvin and Dykstra [1991] This is an iterative algo-

rithm that converges to the true REML estimates. Although we know of no results

decribing the rate of convergence, in practice it is very fast, allowing for easy REML

estimation even for 100 ×100 matrices.

•pairwise REML, in which the estimates of the individual entries Σlm are computed

using a bivariate analysis with traits land m. This estimate is sometimes proposed

as a computational fall-back when the dimensionality of Σ is too high for REML to

converge; it need not be non-negative deﬁnite, as will be seen.

3 Initial examples

3.1 A small pexample with zero eigenvalues

As a warm-up, we ﬁrst illustrate the methods in a setting with a small number of traits,

p= 4.1Speciﬁcally, we take

p= 4, I = 100, J = 3, K = 5,(3)

and diagonal structures for the variance component matrices:

ΣA=σ2

Adiag(1,1,0,0), σ2

A= 25

ΣB=σ2

Bdiag(1,0,0,1), σ2

B= 9,(4)

ΣE=σ2

EId4, σ2

E= 1.

1indeed, this was the largest number of traits for which lme() runs relatively quickly.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

ComparisonofREMLmethodsforthestudyofphenome-widegeneticvariationDamianPavlyshyn1,2,IainM.Johnstone1,andJacquelineL.Sztepanacz31DepartmentofStatistics,StanfordUniversity2DiseaseEliminationProgram,BurnetInstitute3DepartmentofEcologyandEvolutionaryBiology,UniversityofTorontoOctober2022AbstractItisnowwe...

展开>> 收起<<

Comparison of REML methods for the study of phenome-wide genetic variation Damian Pavlyshyn12 Iain M. Johnstone1 and Jacqueline L. Sztepanacz3.pdf

共23页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Comparison of REML methods for the study of phenome-wide genetic variation Damian Pavlyshyn12 Iain M. Johnstone1 and Jacqueline L. Sztepanacz3

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: