Visualizing Multispecies Coalescent Trees Drawing Gene Trees Inside Species Trees Jonathan Klawitter12

2025-04-24 0 0 1.11MB 28 页 10玖币
侵权投诉
Visualizing Multispecies Coalescent Trees:
Drawing Gene Trees Inside Species Trees
Jonathan Klawitter1,2, Felix Klesen1, Moritz Niederer3, and
Alexander Wolff1
1Universität Würzburg, Germany
2University of Auckland, New Zealand
3HTW Saar, Germany
Abstract. We consider the problem of drawing multiple gene trees in-
side a single species tree in order to visualize multispecies coalescent
trees. Specifically, the drawing of the species tree fills a rectangle in which
each of its edges is represented by a smaller rectangle, and the gene trees
are drawn as rectangular cladograms (that is, orthogonally and down-
ward, with one bend per edge) inside the drawing of the species tree. As
an alternative, we also consider a style where the widths of the edges of
the species tree are proportional to given effective population sizes.
In order to obtain readable visualizations, our aim is to minimize the
number of crossings between edges of the gene trees in such drawings.
We show that planar instances can be recognized in linear time and that
the general problem is NP-hard. Therefore, we introduce two heuristics
and give an integer linear programming (ILP) formulation that provides
us with exact solutions in exponential time. We use the ILP to measure
the quality of the heuristics on real-world instances. The heuristics yield
surprisingly good solutions, and the ILP runs surprisingly fast.
1 Introduction
Visualizations of trees to present information have been used for centuries [24]
and the study of producing readable, compact representations of trees has a long
tradition [31,33]. Trees and their drawings are also an ubiquitous and fundamen-
tal tool in the field of phylogenetics. In particular, a phylogenetic tree is used
1
2
3
4
5
6
7
Fig. 1: A rectangular cladogram
drawing of a rooted binary phylo-
genetic tree on seven taxa.
to model the evolutionary history and rela-
tionships of a set Xof taxa such as species,
genes, or languages [34]. There exist many
different models, but most commonly a phy-
logenetic tree on Xis a tree Twhose leaves
are bijectively labeled with X; see Fig. 1. In a
rooted phylogenetic tree, each internal vertex
represents a branching event (such as species
divergence); time (or genetic distance) is rep-
resented by the edge lengths from the root
towards the leaves. In most applications, the
arXiv:2210.06744v2 [cs.DM] 29 Oct 2022
2 J. Klawitter et al.
tree is binary, that is, each internal vertex has indegree one, outdegree two, and
thus represents a bifurcation event. An unrooted phylogenetic tree, on the other
hand, models only the relatedness of the taxa. A phylogenetic tree where the
taxa are species is called a species tree. If the taxa are biological sequences, such
as particular genes or protein sequences, the tree is called a gene tree.
Multispecies coalescent models. One of the main tasks in phylogenetics is the
inference of a phylogenetic tree for some given data and model. When inferring
a species tree based on sequencing data, one might be inclined to set the species
tree as that of an inferred gene tree. However, gene trees can differ from the
species tree in the presence of so-called incomplete lineage sorting4or when di-
vergence times are small5, which can lead to inaccurate edge lengths or even to
an incorrectly inferred species tree [1,25, 28, 32]. To address these issues, multi-
species coalescent (MSC) models have been developed. An MSC model provides
a framework for inferring species trees while accounting for conflicts between
gene trees and species trees [15, 18, 30]. Roughly speaking, by using multiple
samples (genes) per species, the model coestimates multiple gene trees that are
constrained within their shared species tree. In doing so, the model can infer
not only divergence times for inner vertices but also the effective population size
for each edge (branch) in the species tree. There exist several models for pop-
ulation sizes [38], two of which we define here. In the continuous linear model,
for each branch, the population size between the top and the bottom is lin-
early interpolated, and for a branch not incident to a leaf, the population size
at the bottom equals the sum of the population sizes at the top of its two child
branches; see Fig. 2a. In the piecewise constant model, the population size of
each branch is constant from the top to the bottom of the branch and there are
no restrictions between adjacent branches [12]; see Fig. 2b.
For a phylogenetic tree T, let V(T)be the vertex set of T, let E(T)be
the edge set of T, and let L(T)be the leaf set of T. We define an MSC tree
as a triple hS, T, ϕiconsisting of a species tree S, a gene tree T, and a map-
ping ϕ:L(T)L(S)with the following properties. Both Sand Tare rooted
binary phylogenetic trees where all vertices have an associated height hthat are
strictly decreasing from root to leaf. We consider only the case where h(`)is
zero for each leaf `in L(S)and L(T). For gene trees, we use the terms leaf,ver-
tex, and edge; whereas we use the terms species,node, and branch if we want to
stress that we talk about species trees. Each branch in E(S)is associated with
an upper and a lower population size. The mapping ϕdescribes which leaves
of Tbelong to which species in S. Next, consider two leaves `and `0of Twith
ϕ(`)6=ϕ(`0). Let vbe the lowest common ancestor of `and `0. In the MSC
model we have that a divergence event of voccurred before the divergence event
4We speak of incomplete lineage sorting if (i) in a population of an ancestral species
two (or more) variants of a gene were present, say red and blue, and (ii) when the
species diverged, this did not result in one child species having the red variant and the
other having the blue variant, but, e.g., one child species having both variants [32].
5A small divergence time corresponds to a short edge in the phylogenetic tree, which
can be hard to infer correctly.
Visualizing Multispecies Coalescent Trees 3
A
B
C
D
effective population size
(a) Continuous linear model.
A
B
C
D
effective population size
(b) Piecewise constant model.
Fig. 2: Multispecies coalescent of a species tree on four species A, B, C, D and a gene
tree on eleven taxa under two different models.
at a node sof Sthat ultimately split ϕ(`)and ϕ(`0). Hence, h(v)> h(s)and we
can extrapolate ϕto a mapping of each inner vertex vof Tto a branch of S.
Lastly, we assume that the input consists of a single gene tree; otherwise we
merge multiple given gene trees by connecting all their roots to a super root.
Visualizing MSC trees. Visualizations of an MSC tree usually show the species
and gene tree together. This allows the user to detect any discordance between
them such as whether they have different topologies and where incomplete lin-
eage sorting occurs. It is also interesting to see where these events occur with
respect to the inferred population sizes. Furthermore, these drawings are used to
diagnose whether the parameters of the model are set up well. E.g., if all inner
vertices of the gene tree occur directly above nodes of the species tree or if all
occur near the root of the species tree, parameters may have been chosen poorly.
Wilson et al. [38] suggested a tree-in-tree style for an MSC tree hS, T, ϕiunder
continuous models similar to the one shown in Fig. 2a. There, the species tree S
is drawn in a space-filling fashion such that the branch widths of Sreflect the
associated population sizes and the gene tree Tis then drawn into Sas a classic
node-link diagram. Without these constraints on the branch widths, Tcould
be drawn as a classic rectangular cladogram as in Fig. 1. As noted above, the
MSC model ensures that Tcan be drawn inside Swithout edges of Tcrossing
edges of Ssince, for each edge uv of T, we have that, if uand vlie inside the
branches euand evof S, respectively, then either euprecedes evor eu=ev.
Douglas [12] developed the tool UglyTrees that generates tree-in-tree draw-
ings for MSC trees under the piecewise constant model; Fig. 2b resembles such a
drawing. He points out that the results are in many cases visually unpleasing
(as reflected in his choice for the tool’s name), in particular if the difference in
width between parent and child is large. This is amplified in practice by the
inverse relationship between the number of gene tree vertices and population
sizes, which results in clusters of vertices in the narrowest branches [12].
4 J. Klawitter et al.
Fig. 3: Representation of co-phylogenetic trees with the host tree as background shape
and the parasite tree drawn with a node-link diagram (after Calamoneri et al. [9]).
Related work. There exist several applications where multiple phylogenetic trees
are displayed together. In a tanglegram, two phylogenetic trees on the same set of
taxa are drawn opposite each other and the corresponding leaves are connected
by line segments for easy comparison [8, 14]. The tool DensiTree [7] allows the
user to compare many trees simultaneously by drawing them on top of each
other. A co-phylogenetic tree consists of two rooted phylogenetic trees, namely,
ahost tree Hand a parasite tree P, together with a mapping (reconciliation)
of the vertices of Pto vertices of H. Other than in an MSC tree, the vertices
of Pcommonly do not have heights but are mapped to nodes of H, the host
branches do not have associated population sizes, and the edges of Pcan go from
one subtree of Hto another, representing so-called host switches. Several tools
visualize the reconciliation of co-phylogenetic trees [10, 11, 26, 35]. Commonly,
the branches of Hare drawn with thick lines such that Pcan be embedded
into H; see Fig. 3. Recently, Calamoneri et al. [9] suggested a tree-in-tree style
for reconciliation similar to the one for MSC trees above. They draw Hin a
space-filling way and embed Pinto Has an orthogonal node-link diagram.
More generally, visualizations have been studied for various models in phy-
logenetics, such as rooted phylogenetic trees [2,6,31], in conjunction with a geo-
graphic map [27,29], unrooted phylogenetic trees, and split networks [13,23,36].
In recent years, research has been extended from drawings of trees to drawings
of phylogenetic networks [19–22, 37], which are more general.
All these applications share the main combinatorial objective of finding draw-
ings where the number of crossings between edges is minimized. To this end, good
embeddings of the trees (or networks) have to be found, which are mostly fully
defined by the order of the leaves. For example, Calamoneri et al. [9] investigated
the problem of minimizing the number of crossings of the parasite tree in their
drawings. They showed that this problem is in general NP-hard, though planar
instances can be identified efficiently, and they suggested two heuristics.
Contribution. Motivated by the drawing styles of Wilson et al. [38] and by the
recently proposed space-filling drawing style for reconciliation [9] and phyloge-
netic networks [37], we formally define tree-in-tree drawing styles for MSC trees
(Section 2). In our base, rectangular style, we draw the species tree such that it
completely fills a rectangle; the branch widths are based on the number of leaves
Visualizing Multispecies Coalescent Trees 5
in the respective gene subtree. Additionally, population sizes can be represented,
e.g., by a background color gradient. This avoids visual overload and can be used
for any population size model. Nonetheless, based on this, we also define a style
where the branch widths are proportional to the associated population sizes.
We then study the problem of minimizing the number of crossings between
edges of the gene tree both for the case when the embedding of the species tree is
already fixed and when it is left variable. We show that the crossing minimization
problem is NP-hard in both cases (Section 3). On the positive side, we show
that crossing-free instances can be identified in linear time (Section 4) and we
introduce two heuristics and an integer linear program (ILP) formulation for the
non-planar cases (Section 5). We measure the performance of the heuristics on
real-world instances by comparing them to optimal solutions obtained via the
ILP, which we have tuned to solve medium-size instances in reasonable time.
Complete proofs to some of our claims and detailed descriptions of our algo-
rithms can be found in Section 5. Implementations of our algorithm are shared
upon request.
2 Drawing Style
In this section, we define styles for tree-in-tree drawings of an MSC tree hS, T, ϕi.
A drawing is defined for particular leaf orders π(S)and π(T)of Sand T, re-
spectively, and we assume that they satisfy the following requirements. (i) At
least one leaf is mapped to each species. (ii) The leaf order π(T)is consistent
with ϕand π(S), that is, the sets of leaves of Tmapped by ϕto a species sare
consecutive in π(T)and succeed all leaves mapped to the species that precede
sin π(S). (iii) If all the leaves of a subtree T0of Tare mapped to the same
species s, then these leaves must be consecutive in π(T), and T0must admit
a plane drawing above the leaves. We first describe the rectangular style where
branch widths are proportional to the number of leaves in subtrees, and then the
proportional style where branch widths are proportional to the population sizes.
Finally, we define the crossing minimization problem for tree-in-tree drawings.
Rectangular style. Our drawing area is an axis-aligned rectangle R. The width
of Ris twice the number of leaves of T. We assume that the roots of Sand T
have out-degree 1. We scale hsuch that the heights of the roots equal the height
of R. The given heights of vertices and nodes thus correspond to heights in R.
The species tree Sis drawn as follows; see Fig. 4. For a species sL(S),
we define n(s) = |ϕ1(s)|, that is, the number of leaves of Tmapped to sby ϕ.
The branches of Sare represented by internally disjoint rectangles whose union
covers R. Of each such rectangle we only draw the left and the right border –
the delimiters. Their y-coordinates are defined by the heights of their start and
target nodes. The x-coordinates are defined recursively: A branch incident to a
species shas width 2n(s)and an internal branch has width equal to the width
of its two child branches; see Fig. 4b. Note that the branch incident to the root
has a width equal to the width of R.
摘要:

VisualizingMultispeciesCoalescentTrees:DrawingGeneTreesInsideSpeciesTreesJonathanKlawitter1;2,FelixKlesen1,MoritzNiederer3,andAlexanderWol11UniversitätWürzburg,Germany2UniversityofAuckland,NewZealand3HTWSaar,GermanyAbstract.Weconsidertheproblemofdrawingmultiplegenetreesin-sideasinglespeciestreeinor...

展开>> 收起<<
Visualizing Multispecies Coalescent Trees Drawing Gene Trees Inside Species Trees Jonathan Klawitter12.pdf

共28页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:28 页 大小:1.11MB 格式:PDF 时间:2025-04-24

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 28
客服
关注