Graph Coloring via Neural Networks for Haplotype Assembly and Viral Quasispecies Reconstruction Hansheng Xue1Vaibhav Rajan2and Yu Lin1

2025-05-06 0 0 4.74MB 16 页 10玖币
侵权投诉
Graph Coloring via Neural Networks for Haplotype
Assembly and Viral Quasispecies Reconstruction
Hansheng Xue,1Vaibhav Rajan,2and Yu Lin1
1School of Computing, Australian National University, Canberra, Australia
2School of Computing, National University of Singapore, Singapore
{hansheng.xue,yu.lin}@anu.edu.au,vaibhav.rajan@nus.edu.sg
Abstract
Understanding genetic variation, e.g., through mutations, in organisms is crucial to
unravel their effects on the environment and human health. A fundamental char-
acterization can be obtained by solving the haplotype assembly problem, which
yields the variation across multiple copies of chromosomes. Variations among fast
evolving viruses that lead to different strains (called quasispecies) are also deci-
phered with similar approaches. In both these cases, high-throughput sequencing
technologies that provide oversampled mixtures of large noisy fragments (reads) of
genomes, are used to infer constituent components (haplotypes or quasispecies).
The problem is harder for polyploid species where there are more than two copies
of chromosomes. State-of-the-art neural approaches to solve this NP-hard problem
do not adequately model relations among the reads that are important for decon-
volving the input signal. We address this problem by developing a new method,
called
NeurHap
, that combines graph representation learning with combinatorial
optimization. Our experiments demonstrate substantially better performance of
NeurHap in real and synthetic datasets compared to competing approaches.
1 Introduction
Our genetic material is organized as sequences of DNA or RNA molecules (nucleotides) which form
three-dimensional structures (chromosomes) within our cells. Most organisms have multiple highly
similar copies of chromosomes in their cells (e.g., humans have 2). Variations in genetic sequences
lead to the emergence of new species during evolution and are also known to be associated with many
diseases (e.g., cancer). There are many possible ways in which such variations can occur; the simplest
among them is a mutation or a change in the nucleotide at a specific location in the DNA or RNA
sequence. A Single Nucleotide Polymorphism (SNP) refers to a mutation in at least one of the copies
which renders the copies nonidentical at that point. An ordered list of SNPs on a single chromosome
is called a haplotype [Schwartz, 2010]. Haplotypes provide a signature of genetic variability and thus
inform us about disease susceptibilities and evolutionary patterns (e.g., of viruses). These studies in
turn pave the way for personalized medicine and effective drug development against viruses.
The problem of inferring haplotypes from high-throughput sequencing data is called haplotype
assembly or phasing, and is done in multiple stages (see Figure 1). Sequencing data yields multiple
copies of short fragments of the entire genomic sequence (called reads, Figure 1b). These reads are
noisy due to sequencing errors and their short lengths may span across limited number of SNPs.
This makes the problem of haplotype phasing challenging. The reads are first aligned to a reference
genome. This step indicates positions that are different across reads and thus infers the potential
locations of SNPs. All other positions are discarded to obtain the SNP matrix (Figure 1c). This
Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.12158v1 [q-bio.GN] 21 Oct 2022
Figure 1: The pipeline of reference-based polyploid haplotypes reconstruction and
NeuralHap
.
Haplotype phasing is formulated as a graph coloring problem by constructing the Read-overlap graph.
NeurHap
consists of
NeurHap
-search, a graph neural network to learn vertex representations and
color assignments, and NeurHap-refine, a local refinement strategy to further adjust colors.
matrix may be viewed as an oversampled mixture of noisy reads (restricted to SNPs). Each mixture
component represents a single haplotype and thus should have SNPs at the same locations.
In diploid species, containing two copies of chromosomes, there are two haplotypes to be inferred.
This problem has been studied extensively [Browning and Browning, 2011]. In polyploid species,
containing more than two copies of chromosomes (and thus more than two haplotypes), the problem
is more challenging due to dramatic increase in search space [Van de Peer et al., 2017, Abou Saada
et al., 2022, Jablonski and Beerenwinkel, 2021]. In reconstruction of virus strains, called viral
quasispecies, from viral populations, similar challenges arise. Moreover, unknown population sizes
and imbalanced abundances pose additional difficulties [Jablonski and Beerenwinkel, 2021].
Existing approaches for haplotype phasing of polyploid species and viral quasispecies often group
reads in the SNP matrix into clusters that correspond to different haplotypes, respectively. In an ideal
case, all reads from the same cluster should be consistent with respect to SNPs as they all belong to
the same haplotype. In reality inconsistencies occur due to sequencing errors in reads. Therefore, a
minimum error correction (MEC) score [Lippert et al., 2002b] is used to measure the discrepancy
between the consensus haplotypes and their associated reads within each cluster (see Figure 1e). It is
NP-hard to optimize the MEC score [Zhang et al., 2006], and a number of combinatorial optimization
heuristics have been proposed to approximate the optimal MEC score [Zhang et al., 2020].
More recently, the first neural network-based learning framework, named GAEseq [Ke and Vikalo,
2020b], was proposed to phase haplotypes for polyploid species and viral quasispecies. CAECseq
was later developed using a convolutional auto-encoder which captures spatial relationships between
SNPs and enables clustering reads obtained from highly similar genomic regions [Ke and Vikalo,
2020a]. Both GAEseq and CAECseq showed improved results compared to previous approaches. A
major limitation of both CAECseq and GAEseq is that they cannot capture implicit relations among
different reads. These methods have two independent steps (embedding and clustering) which makes
the haplotype phasing results unstable. Besides, sparsity of the SNP matrix makes haplotype phasing
for polyploids more challenging for these methods as well.
In this paper, we propose an approach based on graph representation learning for haplotype phasing
of polyploid species and viral quasispecies. We formulate the haplotype phasing problem as a graph
coloring problem, where the colors indicate haplotypes. The graph is constructed from the SNP
matrix where vertices are reads and two edge types are defined based on pairwise consistency and
conflicts with respect to SNPs in the reads. Message passing-based neural networks are trained to
minimize a loss designed to obtain a color assignment that maximizes consistent edges and minimizes
conflicting edges. The network learns vertex representations and through them, an initial color
assignment. A local refinement strategy is then applied to adjust node colors in order to minimize
MEC scores. Thus, in contrast to previous neural approaches that first learn representations and then
cluster, our approach models the problem requirements in all steps. As a result, our model achieves
better MEC scores, is more stable and also performs well on the challenging cases of polyploids and
viral quasispecies. In summary, our contributions are:
2
We provide a unique formulation of the haplotype phasing problem as a graph coloring problem,
and develop an algorithm based on graph representation learning and combinatorial optimization.
Our approach consists of
NeurHap
-search, a graph neural network to learn vertex representations
and color assignments followed by
NeurHap
-refine, a local refinement strategy to adjust colors and
optimize MEC scores.
Extensive experiments on synthetic and real datasets demonstrate that our new method
NeuralHap
significantly outperforms state-of-the-art phasing methods for both polyploid species and viral
quasispecies.
2 Related Work
Haplotype Phasing.
The aim of haplotype phasing of polyploid species and viral quasispecies is to
group reads into homogeneous clusters that corresponds to different haplotypes, respectively. The
minimum error correction (MEC) score [Lippert et al., 2002b] is introduced to measure the total
discrepancy of reads in all clusters but is NP-hard to be optimised [Zhang et al., 2006]. Haplotype
phasing for diploid species (i.e., reconstructing two haplotypes) has been extensively studied in
the last two decades and a number of combinatorial optimization heuristics have been proposed
to approximate the optimal MEC score, such as BNB [Wang et al., 2005], HapCUT [Bansal and
Bafna, 2008], HASH [Bansal et al., 2008], RefHap [Duitama et al., 2012] ProbHap [Kuleshov, 2014],
HapCUT2 [Edge et al., 2017] and others, and refer to [Zhang et al., 2020] for a recent review on
phasing diploid species.
Haplotype phasing for polyploid species (i.e., reconstructing more than two haplotypes) becomes
more computationally challenging as it requires a much larger search space compared to phasing
two haplotypes for diploid species. A limited number of phasing methods work for polyploid
species, e.g., HapCompass [Aguiar and Istrail, 2012], SDhaP [Das and Vikalo, 2015], H-PoP [Xie
et al., 2016], AltHap [Hashemi et al., 2018], refer to [Abou Saada et al., 2022] for a recent review.
Haplotype phasing for viral quasispecies is very similar to the problem of phasing polyploid species.
While haplotypes in polyploid species typically have uniform abundances, the different haplotypes
(strains) in viral quasispecies may have varying abundances. Quite a few tools have also been
proposed for haplotype phasing of viral quasispecies, such as ViSpA [Astrovskaya et al., 2011],
ShoRAH [Zagordi et al., 2011], QuRe [Prosperi and Salemi, 2012], QuasiRecomb[Töpfer et al.,
2013], PredictHaplo [Prabhakaran et al., 2014], aBayesQR [Ahn and Vikalo, 2018], TenSQR [Ahn
et al., 2018], refer to [Jablonski and Beerenwinkel, 2021] for a recent review.
More recently, deep learning models have been introduced into haplotype phasing for polyploid
species and viral quasispecies. GAEseq [Ke and Vikalo, 2020b] uses a graph auto-encoder model
on the constructed reads-SNPs bipartite network to model the relations between reads and SNPs.
CAECseq [Ke and Vikalo, 2020a] uses a convolutional auto-encoder model to represent reads as
low-dimensional features and then employs a clustering algorithm to group these reads. Note that
GAEseq and CAECseq can be directly used to phase haplotypes for both polyploid species and viral
quasispecies. Experimental results on both simulated and real datasets showed the superior results of
GAEseq and CAECseq (in terms of MEC scores) compared to previous approaches for haplotype
phasing for both polyploid species and viral quasispecies [Ke and Vikalo, 2020b,a]. However, implicit
relations among different reads have not been fully captured by GAEseq and CAECseq, especially
when embedding and clustering are modelled separately and the SNP matrix is sparse.
Neural Networks on Graphs.
Most existing graph neural networks can be explained as a message-
passing based graph learning model which recursively combines learned features/messages from their
neighbors [Cui et al., 2019, Cai et al., 2018, Gilmer et al., 2017]. Popular methods include GCN [Kipf
and Welling, 2017], GraphSAGE [Hamilton et al., 2017], GAT [Veliˇ
ckovi´
c et al., 2018] and GIN [Xu
et al., 2019]. All these methods make the homophily assumption that similar nodes in the graph should
be embedded close together. However, graph coloring aims to assign pairwise nodes with distinct
colors for each edge of the graph, which is opposite to the homophily assumption. An intuitive way to
integrate GNN models into the graph coloring challenge is to adjust the loss function, such as GNN-
GCP [Lemos et al., 2019], RUN-CSP [Toenshoff et al., 2019], and PI-GNN [Schuetz et al., 2022].
However, existing GNN-based graph coloring models cannot be implemented to the read-overlap
graph directly because they cannot handle conflicting and consistent edges simultaneously.
3
3 Methodology
Overview.
In this paper, we propose the model of neural networks for graph coloring optimization to
solve the haplotype reconstruction problem, called
NeurHap
.
NeurHap
mainly contains three steps,
i) constructing the read-overlap graph; ii) global coloring searching via iterative neural networks
model, NeurHap-search; iii) local refinement to fine tune final coloring, NeurHap-refine.
Figure 2: A toy example of constructing
read-overlap graph with conflict edges
(in grey) and consistent edge (in blue).
Notations.
Let
k
be the number of haplotypes in a cell of
polyploid species (aka. ploidy) or the number of strains in
viral quasispecies. For example, human and other mam-
mals contain two haplotypes (diploid, k=2); plants have
more than 2 haplotypes (e.g., California redwood has six
copies of each chromosome, hexaploid, k=6). Viral qua-
sispecies mixing 5 distinct HIV strains will have k= 5.
Single nucleotide polymorphisms (SNPs) refer to positions
where not all haplotypes have the same alleles. Given
the alignment of reads to a reference genome, the SNP
columns can be identified by removing the columns with
identical alleles. The remaining alignment is referred to
as a
m×n
SNP matrix
R
where
m
denotes the number
of reads and
n
is the number of SNPs. The haplotype
reconstruction aims to group
m
reads into
k
clusters ,
{C1, C2, . . . , Ck}
, that correspond to
k
haplotypes,
{H1,H2,...,Hk}
, respectively. Once reads
are grouped into clusters, the haplotype
Hi
can be reconstructed from reads in
Ci
using a simple
consensus voting. In the ideal case, all the reads from the cluster
Ci
will be all consistent with
Hi
. In
reality, this is not the case and thus a minimum error correction (MEC) score [Lippert et al., 2002b]
is introduced to measure the discrepancy between the reads in the cluster
Ci
and the consensus
haplotypes
Hi
in all clusters. Given the grouping of reads into
k
clusters
{C1, C2, . . . , Ck}
, the
corresponding MEC score can be computed as
MEC(C1, C2, . . . , Ck) =
k
X
i=1 X
RjCi
HD(Hi, Rj)(1)
where
HD(·)
is the Hamming distance function. Note that
HD(Hi, Rj)
for
RjCi
can only
be computed when we know all the reads in the cluster
Ci
and use them to derive the consensus
haplotype
Hi
of
Ci
. The main challenge in haplotype phasing is to find the grouping of reads into
{C1, C2, . . . , Ck}such that the MEC score is minimized.
Two reads are called overlapping if they span over common SNP positions otherwise non-overlapping.
Given any two reads, the relationship between them belongs to one of three cases, consistent,conflict,
or ambiguous. While the relationship between two non-overlapping reads is always ambiguous, we
further introduce two parameters
p
and
q
to define the relationship between two overlapping reads to
account for sequencing errors and alignment ambiguity. Two overlapping reads are consistent if they
overlap at least
p
positions and have the same alleles over all overlapping positions; are in conflict if
they differ on at least
q
overlapping positions; and are ambiguous otherwise. The term ‘ambiguous
means that there is not enough evidence to support that these two reads should belong to the same
haplotype (‘consistent’) or should belong to the different haplotypes (‘conflict’). For example, in
Figure 2,
R4
and
R7
are consistent,
R1
and
R2
are in conflict, and
R1
and
R4
are ambiguous. In
an ideal case, all the overlapping reads in the same cluster must be consistent, i.e., if two reads are
in conflict, they must belong to different clusters. This observation naturally motivates us to build
a read-overlap graph to model all reads as vertices and the important pairwise relationships (i.e.,
consistent and conflict) between overlapping reads as edges. Moreover, if we use
k
colors to represent
the
k
clusters of reads, the problem of haplotype phasing is reduced to a graph coloring problem on
the read-overlap graph. For example, in Figure 2, the minimum MEC is achieved by grouping nine
reads into three clusters,
C1={R1, R4, R7}
,
C2 = {R2, R5, R8}
and
C3={R3, R6, R9}
, which
correspond to three distinct colors on corresponding vertices in the read-overlap graph, respectively.
4
摘要:

GraphColoringviaNeuralNetworksforHaplotypeAssemblyandViralQuasispeciesReconstructionHanshengXue,1VaibhavRajan,2andYuLin11SchoolofComputing,AustralianNationalUniversity,Canberra,Australia2SchoolofComputing,NationalUniversityofSingapore,Singapore{hansheng.xue,yu.lin}@anu.edu.au,vaibhav.rajan@nus.edu....

展开>> 收起<<
Graph Coloring via Neural Networks for Haplotype Assembly and Viral Quasispecies Reconstruction Hansheng Xue1Vaibhav Rajan2and Yu Lin1.pdf

共16页,预览4页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:16 页 大小:4.74MB 格式:PDF 时间:2025-05-06

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 16
客服
关注