Phase transition in the computational complexity of the
shortest common superstring and genome assembly
L. A. Fernandez,1, 2 V. Martin-Mayor,1, 2 and D. Yllanes3, 2
1Departamento de Física Teórica, Universidad Complutense, 28040 Madrid, Spain
2Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50018 Zaragoza, Spain
3Chan Zuckerberg Biohub, 499 Illinois Street, San Francisco, CA 94158, USA
Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging
short fragments, or reads, is known to be NP-hard, either as a version of the shortest common
superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed
to grow exponentially with the problem size in the worst case. Despite this fact, high-throughput
technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions
of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating
the existence of a phase transition in the computational complexity of the problem and showing
that practical instances always fall in the ‘easy’ phase (solvable by polynomial-time algorithms). In
addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic
algorithms in the hard regime.
I. INTRODUCTION
Sequence assembly is one of the fundamental prob-
lems in bioinformatics. Since an organism’s whole genetic
material cannot be read in one go, current technologies
build on strategies where the genome (or a portion of it,
such as a chromosome) is randomly fragmented into shorter
reads, which then have to be ordered and merged to re-
construct the original sequence. In a naive formulation,
one would look for the shortest sequence that contains all
of the individual reads, or Shortest Common Superstring
(SCS). This is, in principle, a formidable task, since the
SCS belongs to the so-called NP-complete class of prob-
lems [1, 2], for which no efficient algorithms are known
(or believed) to exist. More precisely, NP denotes a large
family of problems that are verifiable in polynomial time,
meaning that potential solutions can be checked in a time
that grows at most as a power of the size of the input. A
problem is then termed NP-complete if it is in NP and
at least as hard as any problem in NP. This is a relevant
notion because it is believed, but not proven, that not
all NP-complete tasks can be solved in polynomial time.
See, e.g., [3–5] for more precise definitions and examples.
As hinted above, the formulation of genome assem-
bly as an SCS problem is not quite correct. This is be-
cause our assumption of parsimony is not true: most
genomes contain repeats, multiple identical stretches of
DNA, which the SCS would collapse. A formulation of
the assembly problem that takes this issue into account
can be made using de Bruijn graphs [6], but the task can
still be proven to be NP-hard by reduction from SCS [7].
Alternative approaches to assembly have been proposed,
for instance the string-graph representation [8], but this
model has also been shown to be NP-hard, by reduction
from the Hamiltonian-cycle problem [7].
In short, no polynomial-time algorithms are known (or
even believed) to exist to solve the sequence-assembly
problem in its general formulation. Despite this fact,
with current high-throughput sequencing techniques and
assembly algorithms, datasets of billions of reads are reg-
ularly assembled (at least at the contig level) [9–14]. This
achievement is in stark contrast with the state of the art
for the travelling salesman, a closely related NP-complete
problem, for which managing as few as 10^4 'cities' is ex-
ceedingly difficult and the largest instance solved to date
featured 120 000 locations [15, 16].
The way out of this apparent contradiction is the gen-
eral notion that, while in the worst case an NP-complete
problem takes exponential time to solve, typical instances
might be much easier. This observation could explain
the, at least apparent, success of heuristic methods but
it needs to be formalised: what is a ‘typical’ instance and
how likely is the worst-case scenario? As a path towards
answering those questions, it has been observed that in
several problems in computational biology, small ranges
of parameters cover all the interesting cases. The ques-
tion is, then, whether we can identify the right variables
and whether the problem can be solved in polynomial
time for fixed values of these parameters [17]. This para-
metric complexity paradigm has been applied to genome
assembly, either using statistical analyses or analytical
methods, suggesting that, in some relevant limits, the
problem can indeed be solved correctly with polynomial
algorithms [18–20].
In this work we present an alternative approach to this
question, based on the methods of statistical mechan-
ics. The applicability of the models of statistical physics
to the issue of NP-completeness has been conjectured
since the 1980s [21, 22] and is now well understood [23].
More to the point, phase diagrams for complexity can
be defined for some problems. For instance, in a semi-
nal work [24] it was shown that all the constraints in the
problem can be satisfied only when a certain parameter is
smaller than its critical value, and that the computation-
ally hard problems arise only in the neighbourhood of the
phase boundary.
We show that a similar result can be obtained for the
SCS. Yet, we depart from the paradigm of Ref. [24] in
the sense that computationally hard problems become
the rule, rather than the exception, in one of our two
phases, not just at the critical point.
It is worth emphasizing that, unlike assembly, the SCS
problem always has a well-defined solution, which might
not be unique (unfortunately, in some regions of parame-
ter space this solution is very hard to find). Instead, the
assembly problem is well posed only if the whole genome
is covered. It is a classic result [25] that, if L is the
length of our (portion of) genome and ℓ_frag the length
of the reads, in order to ensure that the whole genome
is covered with probability 1 − ϵ, the number N_frag of
reads must satisfy N_frag ≳ (L/ℓ_frag) log N_frag. There-
fore, in practical applications one is interested in over-
sampling the genome (the oversampling ratio is termed
coverage). Our main result in this respect is that the
regime of full coverage corresponds precisely to the easily
solvable phase of the SCS problem. Therefore, whenever
the assembly problem is well posed, the corresponding
solution can be found in polynomial time.
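This coverage condition is easy to check numerically. The sketch below is a purely illustrative Python example (not the authors' code; the helper names are ours): it drops N_frag reads of length ℓ_frag at uniformly random positions on a circular genome of length L and estimates the probability that every base is covered. Full coverage only becomes likely once the coverage N_frag ℓ_frag / L reaches several units, i.e. well into the oversampled regime.

```python
import random

def covers_genome(L, n_frag, ell_frag, rng):
    """Place n_frag reads of length ell_frag at uniformly random start
    positions on a circular genome of length L; return True if every
    base is covered by at least one read."""
    covered = [False] * L
    for _ in range(n_frag):
        start = rng.randrange(L)
        for k in range(ell_frag):
            covered[(start + k) % L] = True
    return all(covered)

def coverage_probability(L, n_frag, ell_frag, trials=100, seed=0):
    """Monte Carlo estimate of the probability of full coverage."""
    rng = random.Random(seed)
    hits = sum(covers_genome(L, n_frag, ell_frag, rng) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    L, ell_frag = 5_000, 100
    # The classic condition N_frag >~ (L / ell_frag) log N_frag translates
    # into a coverage N_frag * ell_frag / L of several units.
    for n_frag in (50, 150, 300, 600):
        coverage = n_frag * ell_frag / L
        p = coverage_probability(L, n_frag, ell_frag)
        print(f"N_frag={n_frag:4d}  coverage={coverage:5.1f}  P(full)~{p:.2f}")
```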
A final note of warning is in order: we shall always
use the language of assembly (genomes and reads, rather than
superstrings and strings) in order to unify the nomencla-
ture.
The remaining part of this paper is organized as fol-
lows. In Sect. II we define the two problems considered
here, the SCS and the assembly. In Sect. III we discuss
our data collection and computational approaches. The
analysis of the phase transition is presented in Sect. IV.
An alternative algorithm for the hard phase of the SCS
is presented in Sect. V. Our conclusions are given in
Sect. VI. We provide additional details on the employed
algorithms in the appendices.
II. THE SCS AND SEQUENCE ASSEMBLY
As explained in the introduction, there are two differ-
ent, yet related problems:
Shortest Common Superstring (SCS). Given N_frag
sequences of ℓ_frag letters taken from a common al-
phabet, find the shortest sequence of letters that
contains every one of the N_frag fragments.
Ex novo genome reconstruction. Read N_frag frag-
ments randomly chopped from a piece of genome.
For simplicity, we shall assume that each fragment
contains the same number of letters, ℓ_frag. Our
problem is reconstructing the original genome from
these reads.
Under favourable circumstances on the ensemble of reads,
the solution of the SCS problem is also the solution of the
assembly problem. Our main emphasis will be on the
combinatorial optimisation problem, namely the SCS.
Reading errors are a real complication in assembly, but
effective methods are known to handle them. Since errors
do not add to the exponential (in genome size) hardness
of the problem, we shall ignore them.
We study the SCS in its formulation as an asymmetric
travelling-salesman problem, where one tries to find the
permutation of fragments that has the maximum overlap
between consecutive segments and, therefore, the mini-
mal total length of the resulting superstring once over-
lapping segments are collapsed. For instance, the SCS of
the strings TTGAA, AGTTG is AGTTGAA. Our reads
are taken from a circular genome of length L base pairs
(we use the natural four-letter alphabet: A, C, G, T).
For our main study we choose every base of the genome
at random (independent picks from the alphabet with
uniform probability),
but we have also checked that our main results extend to
a natural genome, namely that of the swinepox virus.
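To make the travelling-salesman picture concrete, the following sketch (ours, for illustration; it assumes no read is entirely contained in another) computes maximal suffix-prefix overlaps and the superstring obtained from a given permutation of the reads. An exhaustive search over permutations, feasible only for a handful of reads, recovers AGTTGAA for the two strings of the example above.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def superstring(order, reads):
    """Concatenate the reads in the given order, collapsing the maximal
    overlap between consecutive reads (assumes no read contains another)."""
    s = reads[order[0]]
    for i, j in zip(order, order[1:]):
        s += reads[j][overlap(reads[i], reads[j]):]
    return s

def brute_force_scs(reads):
    """Exhaustive search over all permutations: the exponential worst case."""
    candidates = (superstring(p, reads) for p in permutations(range(len(reads))))
    return min(candidates, key=len)

print(brute_force_scs(["TTGAA", "AGTTG"]))   # prints AGTTGAA
```

Finding the best permutation is an asymmetric TSP in which the 'distance' from read i to read j is ℓ_frag minus their overlap, which is why the brute-force search above scales factorially and practical assemblers must rely on heuristics or graph-based formulations.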
A naive approach to an assembly problem emphasises
the covering fraction:

    W = N_frag ℓ_frag / L .                        (1)

W < 1 implies that the SCS is shorter than the genome
(obviously, successful assembly is impossible under these
circumstances). In typical instances of assembly W ≫ 1
(W ∼ 1000 is not uncommon with high-throughput
techniques).
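As a concrete illustration of the setup, the sketch below (again ours, with hypothetical helper names) chops N_frag reads of length ℓ_frag from a random circular genome, letting reads wrap around the origin, and evaluates the covering fraction of Eq. (1).

```python
import random

ALPHABET = "ACGT"

def random_circular_genome(L, rng):
    """Random genome of length L: independent, uniform picks from ACGT."""
    return "".join(rng.choice(ALPHABET) for _ in range(L))

def chop_reads(genome, n_frag, ell_frag, rng):
    """Extract n_frag reads of length ell_frag starting at random positions
    of a circular genome; reads may wrap around the origin."""
    L = len(genome)
    starts = [rng.randrange(L) for _ in range(n_frag)]
    doubled = genome + genome            # simple way to handle the wrap-around
    reads = [doubled[s:s + ell_frag] for s in starts]
    return reads, starts

rng = random.Random(1)
L, n_frag, ell_frag = 5_000, 200, 100
genome = random_circular_genome(L, rng)
reads, starts = chop_reads(genome, n_frag, ell_frag, rng)
W = n_frag * ell_frag / L
print(f"covering fraction W = {W:.1f}")   # here W = 4.0; W < 1 would rule out assembly
```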
Given a set of reads, obtaining the SCS is NP-hard.
Since we are taking our fragments from a known long
string, however, we always have a candidate solution, as
we now explain. Since the genome is known to us before-
hand, we can exactly locate the position in the genome
of every fragment. If we order these starting points in
increasing order, we obtain by merging overlaps a candi-
date solution for the SCS. We name ℓ_ordered the length
of the candidate solution, which often turns out to be
the exact solution (ℓ_ordered = L) when W ≫ 1. This is
the common situation in applications. Yet, when W < 1,
ℓ_ordered is guaranteed to be smaller than L. Furthermore,
the ordered sequence may actually be a very bad solu-
tion for the SCS problem when W ≪ 1. Indeed, in the
limit W → 0, nothing distinguishes the ordered solution
from a random ordering. In the intermediate regime,
W ∼ 1, the ordered sequence is a good guess for the SCS
(and for assembly).
For a given algorithm, a run whose resulting super-
string length ℓ equals ℓ_ordered will be considered success-
ful. In practice, in the W ≫ 1 region, one never finds
ℓ < ℓ_ordered = L and the original genome coincides with
the SCS.
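In code, the 'ordered' candidate can be sketched as follows (our illustration; we assume, consistently with the circular genomes and the ATSP formulation above, that the candidate superstring is treated cyclically, so the wrap-around overlap between the last and the first read is collapsed as well).

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def tour_length(reads, order):
    """Length of the circular superstring obtained by visiting the reads in
    the given cyclic order and collapsing consecutive overlaps (ATSP view)."""
    n = len(order)
    return sum(len(reads[order[i]])
               - overlap(reads[order[i]], reads[order[(i + 1) % n]])
               for i in range(n))

def ell_ordered(reads, starts):
    """Candidate solution: visit the reads in order of their (known)
    starting positions on the genome."""
    order = sorted(range(len(reads)), key=lambda i: starts[i])
    return tour_length(reads, order)
```

With full coverage (W ≫ 1) every pair of consecutive reads overlaps genuinely, and the cyclic sum of the non-overlapping parts typically adds up to exactly L, reproducing ℓ_ordered = L; for W ≪ 1 most overlaps vanish and ℓ_ordered approaches N_frag ℓ_frag.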
III. CREATING AND SAMPLING THE DATA
SET
The main classifying feature for our simulations is the
number of fragments, N_frag. For every N_frag we create a
set of N_chro synthetic circular chromosomes. As discussed
above, we have considered both random chromosomes (in
which each letter is picked independently with uniform
probability from the four-letter alphabet) or extracted