•
We provide a unique formulation of the haplotype phasing problem as a graph coloring problem,
and develop an algorithm based on graph representation learning and combinatorial optimization.
•
Our approach consists of
NeurHap
-search, a graph neural network to learn vertex representations
and color assignments followed by
NeurHap
-refine, a local refinement strategy to adjust colors and
optimize MEC scores.
•
Extensive experiments on synthetic and real datasets demonstrate that our new method
NeuralHap
significantly outperforms state-of-the-art phasing methods for both polyploid species and viral
quasispecies.
2 Related Work
Haplotype Phasing.
The aim of haplotype phasing of polyploid species and viral quasispecies is to
group reads into homogeneous clusters that corresponds to different haplotypes, respectively. The
minimum error correction (MEC) score [Lippert et al., 2002b] is introduced to measure the total
discrepancy of reads in all clusters but is NP-hard to be optimised [Zhang et al., 2006]. Haplotype
phasing for diploid species (i.e., reconstructing two haplotypes) has been extensively studied in
the last two decades and a number of combinatorial optimization heuristics have been proposed
to approximate the optimal MEC score, such as BNB [Wang et al., 2005], HapCUT [Bansal and
Bafna, 2008], HASH [Bansal et al., 2008], RefHap [Duitama et al., 2012] ProbHap [Kuleshov, 2014],
HapCUT2 [Edge et al., 2017] and others, and refer to [Zhang et al., 2020] for a recent review on
phasing diploid species.
Haplotype phasing for polyploid species (i.e., reconstructing more than two haplotypes) becomes
more computationally challenging as it requires a much larger search space compared to phasing
two haplotypes for diploid species. A limited number of phasing methods work for polyploid
species, e.g., HapCompass [Aguiar and Istrail, 2012], SDhaP [Das and Vikalo, 2015], H-PoP [Xie
et al., 2016], AltHap [Hashemi et al., 2018], refer to [Abou Saada et al., 2022] for a recent review.
Haplotype phasing for viral quasispecies is very similar to the problem of phasing polyploid species.
While haplotypes in polyploid species typically have uniform abundances, the different haplotypes
(strains) in viral quasispecies may have varying abundances. Quite a few tools have also been
proposed for haplotype phasing of viral quasispecies, such as ViSpA [Astrovskaya et al., 2011],
ShoRAH [Zagordi et al., 2011], QuRe [Prosperi and Salemi, 2012], QuasiRecomb[Töpfer et al.,
2013], PredictHaplo [Prabhakaran et al., 2014], aBayesQR [Ahn and Vikalo, 2018], TenSQR [Ahn
et al., 2018], refer to [Jablonski and Beerenwinkel, 2021] for a recent review.
More recently, deep learning models have been introduced into haplotype phasing for polyploid
species and viral quasispecies. GAEseq [Ke and Vikalo, 2020b] uses a graph auto-encoder model
on the constructed reads-SNPs bipartite network to model the relations between reads and SNPs.
CAECseq [Ke and Vikalo, 2020a] uses a convolutional auto-encoder model to represent reads as
low-dimensional features and then employs a clustering algorithm to group these reads. Note that
GAEseq and CAECseq can be directly used to phase haplotypes for both polyploid species and viral
quasispecies. Experimental results on both simulated and real datasets showed the superior results of
GAEseq and CAECseq (in terms of MEC scores) compared to previous approaches for haplotype
phasing for both polyploid species and viral quasispecies [Ke and Vikalo, 2020b,a]. However, implicit
relations among different reads have not been fully captured by GAEseq and CAECseq, especially
when embedding and clustering are modelled separately and the SNP matrix is sparse.
Neural Networks on Graphs.
Most existing graph neural networks can be explained as a message-
passing based graph learning model which recursively combines learned features/messages from their
neighbors [Cui et al., 2019, Cai et al., 2018, Gilmer et al., 2017]. Popular methods include GCN [Kipf
and Welling, 2017], GraphSAGE [Hamilton et al., 2017], GAT [Veliˇ
ckovi´
c et al., 2018] and GIN [Xu
et al., 2019]. All these methods make the homophily assumption that similar nodes in the graph should
be embedded close together. However, graph coloring aims to assign pairwise nodes with distinct
colors for each edge of the graph, which is opposite to the homophily assumption. An intuitive way to
integrate GNN models into the graph coloring challenge is to adjust the loss function, such as GNN-
GCP [Lemos et al., 2019], RUN-CSP [Toenshoff et al., 2019], and PI-GNN [Schuetz et al., 2022].
However, existing GNN-based graph coloring models cannot be implemented to the read-overlap
graph directly because they cannot handle conflicting and consistent edges simultaneously.
3