Graph Coloring via Neural Networks for Haplotype Assembly and Viral Quasispecies Reconstruction Hansheng Xue1Vaibhav Rajan2and Yu Lin1

2025-05-06 0 0 4.74MB 16 页 10玖币

侵权投诉

Graph Coloring via Neural Networks for Haplotype

Assembly and Viral Quasispecies Reconstruction

Hansheng Xue,1Vaibhav Rajan,2and Yu Lin1∗

1School of Computing, Australian National University, Canberra, Australia

2School of Computing, National University of Singapore, Singapore

{hansheng.xue,yu.lin}@anu.edu.au,vaibhav.rajan@nus.edu.sg

Abstract

Understanding genetic variation, e.g., through mutations, in organisms is crucial to

unravel their effects on the environment and human health. A fundamental char-

acterization can be obtained by solving the haplotype assembly problem, which

yields the variation across multiple copies of chromosomes. Variations among fast

evolving viruses that lead to different strains (called quasispecies) are also deci-

phered with similar approaches. In both these cases, high-throughput sequencing

technologies that provide oversampled mixtures of large noisy fragments (reads) of

genomes, are used to infer constituent components (haplotypes or quasispecies).

The problem is harder for polyploid species where there are more than two copies

of chromosomes. State-of-the-art neural approaches to solve this NP-hard problem

do not adequately model relations among the reads that are important for decon-

volving the input signal. We address this problem by developing a new method,

called

NeurHap

, that combines graph representation learning with combinatorial

optimization. Our experiments demonstrate substantially better performance of

NeurHap in real and synthetic datasets compared to competing approaches.

1 Introduction

Our genetic material is organized as sequences of DNA or RNA molecules (nucleotides) which form

three-dimensional structures (chromosomes) within our cells. Most organisms have multiple highly

similar copies of chromosomes in their cells (e.g., humans have 2). Variations in genetic sequences

lead to the emergence of new species during evolution and are also known to be associated with many

diseases (e.g., cancer). There are many possible ways in which such variations can occur; the simplest

among them is a mutation or a change in the nucleotide at a speciﬁc location in the DNA or RNA

sequence. A Single Nucleotide Polymorphism (SNP) refers to a mutation in at least one of the copies

which renders the copies nonidentical at that point. An ordered list of SNPs on a single chromosome

is called a haplotype [Schwartz, 2010]. Haplotypes provide a signature of genetic variability and thus

inform us about disease susceptibilities and evolutionary patterns (e.g., of viruses). These studies in

turn pave the way for personalized medicine and effective drug development against viruses.

The problem of inferring haplotypes from high-throughput sequencing data is called haplotype

assembly or phasing, and is done in multiple stages (see Figure 1). Sequencing data yields multiple

copies of short fragments of the entire genomic sequence (called reads, Figure 1b). These reads are

noisy due to sequencing errors and their short lengths may span across limited number of SNPs.

This makes the problem of haplotype phasing challenging. The reads are ﬁrst aligned to a reference

genome. This step indicates positions that are different across reads and thus infers the potential

locations of SNPs. All other positions are discarded to obtain the SNP matrix (Figure 1c). This

∗Corresponding author.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2210.12158v1 [q-bio.GN] 21 Oct 2022

Figure 1: The pipeline of reference-based polyploid haplotypes reconstruction and

NeuralHap

Haplotype phasing is formulated as a graph coloring problem by constructing the Read-overlap graph.

NeurHap

consists of

NeurHap

-search, a graph neural network to learn vertex representations and

color assignments, and NeurHap-reﬁne, a local reﬁnement strategy to further adjust colors.

matrix may be viewed as an oversampled mixture of noisy reads (restricted to SNPs). Each mixture

component represents a single haplotype and thus should have SNPs at the same locations.

In diploid species, containing two copies of chromosomes, there are two haplotypes to be inferred.

This problem has been studied extensively [Browning and Browning, 2011]. In polyploid species,

containing more than two copies of chromosomes (and thus more than two haplotypes), the problem

is more challenging due to dramatic increase in search space [Van de Peer et al., 2017, Abou Saada

et al., 2022, Jablonski and Beerenwinkel, 2021]. In reconstruction of virus strains, called viral

quasispecies, from viral populations, similar challenges arise. Moreover, unknown population sizes

and imbalanced abundances pose additional difﬁculties [Jablonski and Beerenwinkel, 2021].

Existing approaches for haplotype phasing of polyploid species and viral quasispecies often group

reads in the SNP matrix into clusters that correspond to different haplotypes, respectively. In an ideal

case, all reads from the same cluster should be consistent with respect to SNPs as they all belong to

the same haplotype. In reality inconsistencies occur due to sequencing errors in reads. Therefore, a

minimum error correction (MEC) score [Lippert et al., 2002b] is used to measure the discrepancy

between the consensus haplotypes and their associated reads within each cluster (see Figure 1e). It is

NP-hard to optimize the MEC score [Zhang et al., 2006], and a number of combinatorial optimization

heuristics have been proposed to approximate the optimal MEC score [Zhang et al., 2020].

More recently, the ﬁrst neural network-based learning framework, named GAEseq [Ke and Vikalo,

2020b], was proposed to phase haplotypes for polyploid species and viral quasispecies. CAECseq

was later developed using a convolutional auto-encoder which captures spatial relationships between

SNPs and enables clustering reads obtained from highly similar genomic regions [Ke and Vikalo,

2020a]. Both GAEseq and CAECseq showed improved results compared to previous approaches. A

major limitation of both CAECseq and GAEseq is that they cannot capture implicit relations among

different reads. These methods have two independent steps (embedding and clustering) which makes

the haplotype phasing results unstable. Besides, sparsity of the SNP matrix makes haplotype phasing

for polyploids more challenging for these methods as well.

In this paper, we propose an approach based on graph representation learning for haplotype phasing

of polyploid species and viral quasispecies. We formulate the haplotype phasing problem as a graph

coloring problem, where the colors indicate haplotypes. The graph is constructed from the SNP

matrix where vertices are reads and two edge types are deﬁned based on pairwise consistency and

conﬂicts with respect to SNPs in the reads. Message passing-based neural networks are trained to

minimize a loss designed to obtain a color assignment that maximizes consistent edges and minimizes

conﬂicting edges. The network learns vertex representations and through them, an initial color

assignment. A local reﬁnement strategy is then applied to adjust node colors in order to minimize

MEC scores. Thus, in contrast to previous neural approaches that ﬁrst learn representations and then

cluster, our approach models the problem requirements in all steps. As a result, our model achieves

better MEC scores, is more stable and also performs well on the challenging cases of polyploids and

viral quasispecies. In summary, our contributions are:

•

We provide a unique formulation of the haplotype phasing problem as a graph coloring problem,

and develop an algorithm based on graph representation learning and combinatorial optimization.

•

Our approach consists of

NeurHap

-search, a graph neural network to learn vertex representations

and color assignments followed by

NeurHap

-reﬁne, a local reﬁnement strategy to adjust colors and

optimize MEC scores.

•

Extensive experiments on synthetic and real datasets demonstrate that our new method

NeuralHap

signiﬁcantly outperforms state-of-the-art phasing methods for both polyploid species and viral

quasispecies.

2 Related Work

Haplotype Phasing.

The aim of haplotype phasing of polyploid species and viral quasispecies is to

group reads into homogeneous clusters that corresponds to different haplotypes, respectively. The

minimum error correction (MEC) score [Lippert et al., 2002b] is introduced to measure the total

discrepancy of reads in all clusters but is NP-hard to be optimised [Zhang et al., 2006]. Haplotype

phasing for diploid species (i.e., reconstructing two haplotypes) has been extensively studied in

the last two decades and a number of combinatorial optimization heuristics have been proposed

to approximate the optimal MEC score, such as BNB [Wang et al., 2005], HapCUT [Bansal and

Bafna, 2008], HASH [Bansal et al., 2008], RefHap [Duitama et al., 2012] ProbHap [Kuleshov, 2014],

HapCUT2 [Edge et al., 2017] and others, and refer to [Zhang et al., 2020] for a recent review on

phasing diploid species.

Haplotype phasing for polyploid species (i.e., reconstructing more than two haplotypes) becomes

more computationally challenging as it requires a much larger search space compared to phasing

two haplotypes for diploid species. A limited number of phasing methods work for polyploid

species, e.g., HapCompass [Aguiar and Istrail, 2012], SDhaP [Das and Vikalo, 2015], H-PoP [Xie

et al., 2016], AltHap [Hashemi et al., 2018], refer to [Abou Saada et al., 2022] for a recent review.

Haplotype phasing for viral quasispecies is very similar to the problem of phasing polyploid species.

While haplotypes in polyploid species typically have uniform abundances, the different haplotypes

(strains) in viral quasispecies may have varying abundances. Quite a few tools have also been

proposed for haplotype phasing of viral quasispecies, such as ViSpA [Astrovskaya et al., 2011],

ShoRAH [Zagordi et al., 2011], QuRe [Prosperi and Salemi, 2012], QuasiRecomb[Töpfer et al.,

2013], PredictHaplo [Prabhakaran et al., 2014], aBayesQR [Ahn and Vikalo, 2018], TenSQR [Ahn

et al., 2018], refer to [Jablonski and Beerenwinkel, 2021] for a recent review.

More recently, deep learning models have been introduced into haplotype phasing for polyploid

species and viral quasispecies. GAEseq [Ke and Vikalo, 2020b] uses a graph auto-encoder model

on the constructed reads-SNPs bipartite network to model the relations between reads and SNPs.

CAECseq [Ke and Vikalo, 2020a] uses a convolutional auto-encoder model to represent reads as

low-dimensional features and then employs a clustering algorithm to group these reads. Note that

GAEseq and CAECseq can be directly used to phase haplotypes for both polyploid species and viral

quasispecies. Experimental results on both simulated and real datasets showed the superior results of

GAEseq and CAECseq (in terms of MEC scores) compared to previous approaches for haplotype

phasing for both polyploid species and viral quasispecies [Ke and Vikalo, 2020b,a]. However, implicit

relations among different reads have not been fully captured by GAEseq and CAECseq, especially

when embedding and clustering are modelled separately and the SNP matrix is sparse.

Neural Networks on Graphs.

Most existing graph neural networks can be explained as a message-

passing based graph learning model which recursively combines learned features/messages from their

neighbors [Cui et al., 2019, Cai et al., 2018, Gilmer et al., 2017]. Popular methods include GCN [Kipf

and Welling, 2017], GraphSAGE [Hamilton et al., 2017], GAT [Veliˇ

ckovi´

c et al., 2018] and GIN [Xu

et al., 2019]. All these methods make the homophily assumption that similar nodes in the graph should

be embedded close together. However, graph coloring aims to assign pairwise nodes with distinct

colors for each edge of the graph, which is opposite to the homophily assumption. An intuitive way to

integrate GNN models into the graph coloring challenge is to adjust the loss function, such as GNN-

GCP [Lemos et al., 2019], RUN-CSP [Toenshoff et al., 2019], and PI-GNN [Schuetz et al., 2022].

However, existing GNN-based graph coloring models cannot be implemented to the read-overlap

graph directly because they cannot handle conﬂicting and consistent edges simultaneously.

3 Methodology

Overview.

In this paper, we propose the model of neural networks for graph coloring optimization to

solve the haplotype reconstruction problem, called

NeurHap

mainly contains three steps,

i) constructing the read-overlap graph; ii) global coloring searching via iterative neural networks

model, NeurHap-search; iii) local reﬁnement to ﬁne tune ﬁnal coloring, NeurHap-reﬁne.

Figure 2: A toy example of constructing

read-overlap graph with conﬂict edges

(in grey) and consistent edge (in blue).

Notations.

Let

be the number of haplotypes in a cell of

polyploid species (aka. ploidy) or the number of strains in

viral quasispecies. For example, human and other mam-

mals contain two haplotypes (diploid, k=2); plants have

more than 2 haplotypes (e.g., California redwood has six

copies of each chromosome, hexaploid, k=6). Viral qua-

sispecies mixing 5 distinct HIV strains will have k= 5.

Single nucleotide polymorphisms (SNPs) refer to positions

where not all haplotypes have the same alleles. Given

the alignment of reads to a reference genome, the SNP

columns can be identiﬁed by removing the columns with

identical alleles. The remaining alignment is referred to

as a

m×n

SNP matrix

where

denotes the number

of reads and

is the number of SNPs. The haplotype

reconstruction aims to group

reads into

clusters ,

{C1, C2, . . . , Ck}

, that correspond to

haplotypes,

{H1,H2,...,Hk}

, respectively. Once reads

are grouped into clusters, the haplotype

can be reconstructed from reads in

using a simple

consensus voting. In the ideal case, all the reads from the cluster

will be all consistent with

. In

reality, this is not the case and thus a minimum error correction (MEC) score [Lippert et al., 2002b]

is introduced to measure the discrepancy between the reads in the cluster

and the consensus

haplotypes

in all clusters. Given the grouping of reads into

clusters

{C1, C2, . . . , Ck}

, the

corresponding MEC score can be computed as

MEC(C1, C2, . . . , Ck) =

i=1 X

Rj∈Ci

HD(Hi, Rj)(1)

where

HD(·)

is the Hamming distance function. Note that

HD(Hi, Rj)

for

Rj∈Ci

can only

be computed when we know all the reads in the cluster

and use them to derive the consensus

haplotype

. The main challenge in haplotype phasing is to ﬁnd the grouping of reads into

{C1, C2, . . . , Ck}such that the MEC score is minimized.

Two reads are called overlapping if they span over common SNP positions otherwise non-overlapping.

Given any two reads, the relationship between them belongs to one of three cases, consistent,conﬂict,

or ambiguous. While the relationship between two non-overlapping reads is always ambiguous, we

further introduce two parameters

and

to deﬁne the relationship between two overlapping reads to

account for sequencing errors and alignment ambiguity. Two overlapping reads are consistent if they

overlap at least

positions and have the same alleles over all overlapping positions; are in conﬂict if

they differ on at least

overlapping positions; and are ambiguous otherwise. The term ‘ambiguous’

means that there is not enough evidence to support that these two reads should belong to the same

haplotype (‘consistent’) or should belong to the different haplotypes (‘conﬂict’). For example, in

Figure 2,

and

are consistent,

and

are in conﬂict, and

and

are ambiguous. In

an ideal case, all the overlapping reads in the same cluster must be consistent, i.e., if two reads are

in conﬂict, they must belong to different clusters. This observation naturally motivates us to build

a read-overlap graph to model all reads as vertices and the important pairwise relationships (i.e.,

consistent and conﬂict) between overlapping reads as edges. Moreover, if we use

colors to represent

the

clusters of reads, the problem of haplotype phasing is reduced to a graph coloring problem on

the read-overlap graph. For example, in Figure 2, the minimum MEC is achieved by grouping nine

reads into three clusters,

C1={R1, R4, R7}

C2 = {R2, R5, R8}

and

C3={R3, R6, R9}

, which

correspond to three distinct colors on corresponding vertices in the read-overlap graph, respectively.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

GraphColoringviaNeuralNetworksforHaplotypeAssemblyandViralQuasispeciesReconstructionHanshengXue,1VaibhavRajan,2andYuLin11SchoolofComputing,AustralianNationalUniversity,Canberra,Australia2SchoolofComputing,NationalUniversityofSingapore,Singapore{hansheng.xue,yu.lin}@anu.edu.au,vaibhav.rajan@nus.edu....

展开>> 收起<<

Graph Coloring via Neural Networks for Haplotype Assembly and Viral Quasispecies Reconstruction Hansheng Xue1Vaibhav Rajan2and Yu Lin1.pdf

共16页,预览4页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Graph Coloring via Neural Networks for Haplotype Assembly and Viral Quasispecies Reconstruction Hansheng Xue1Vaibhav Rajan2and Yu Lin1

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: