Phase transition in the computational complexity of the shortest common superstring and genome assembly
L. A. Fernandez,¹,² V. Martin-Mayor,¹,² and D. Yllanes³,²
1Departamento de Física Teórica, Universidad Complutense, 28040 Madrid, Spain
2Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50018 Zaragoza, Spain
3Chan Zuckerberg Biohub, 499 Illinois Street, San Francisco, CA 94158, USA
Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the ‘easy’ phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.
I. INTRODUCTION
Sequence assembly is one of the fundamental problems in bioinformatics. Since an organism’s whole genetic material cannot be read in one go, current technologies build on strategies where the genome (or a portion of it, such as a chromosome) is randomly fragmented into shorter reads, which then have to be ordered and merged to reconstruct the original sequence. In a naive formulation, one would look for the shortest sequence that contains all of the individual reads, or Shortest Common Superstring (SCS). This is, in principle, a formidable task, since the SCS belongs to the so-called NP-complete class of problems [1, 2], for which no efficient algorithms are known (or believed) to exist. More precisely, NP denotes a large family of problems that are verifiable in polynomial time, meaning that potential solutions can be checked in a time that grows at most as a power of the size of the input. A problem is then termed NP-complete if it is in NP and at least as hard as any problem in NP. This is a relevant notion because it is believed, but not proven, that not all NP-complete tasks can be solved in polynomial time. See, e.g., [3–5] for more precise definitions and examples.
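To make the SCS problem concrete, the sketch below implements the classic greedy merge heuristic (an illustration of the problem, not the method proposed in this paper): repeatedly merge the two reads with the longest suffix-prefix overlap. It runs in polynomial time, but it only approximates the true SCS and, as discussed next, it over-collapses repeated sequence.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(reads):
    """Greedy approximation to the shortest common superstring:
    repeatedly merge the pair of reads with the largest overlap.
    Polynomial time, but only guaranteed to be within a constant
    factor of the optimum; assumes no read is contained in another."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    k = overlap(reads[i], reads[j])
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        # Replace the two merged reads with the result
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads[0]

# Toy example: three overlapping reads from a short sequence
print(greedy_scs(["ATTAGACC", "GACCTGCC", "TGCCAAAT"]))
# -> ATTAGACCTGCCAAAT
```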
As hinted above, the formulation of genome assembly as an SCS problem is not quite correct. This is because the underlying assumption of parsimony does not hold: most genomes contain repeats, multiple identical stretches of DNA, which the SCS would collapse. A formulation of the assembly problem that takes this issue into account can be made using de Bruijn graphs [6], but the task can still be proven to be NP-hard by reduction from SCS [7]. Alternative approaches to assembly have been proposed, for instance the string-graph representation [8], but this model has also been shown to be NP-hard, by reduction from the Hamiltonian-cycle problem [7].
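As an illustration of the de Bruijn-graph formulation (a minimal sketch, not code from this work), the snippet below decomposes each read into k-mers and adds a directed edge from the (k−1)-mer prefix of each k-mer to its (k−1)-mer suffix; an assembly then corresponds to an Eulerian walk through the resulting graph, and genomic repeats show up as nodes with more than one distinct outgoing edge.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from a set of reads.
    Nodes are (k-1)-mers; each k-mer in a read contributes a
    directed edge from its (k-1)-mer prefix to its (k-1)-mer
    suffix. An assembly corresponds to an Eulerian walk."""
    graph = defaultdict(list)  # node -> list of successor nodes
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Toy example with k = 4; overlapping reads contribute
# repeated edges between the same pair of nodes.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
for node, successors in sorted(de_bruijn_graph(reads, 4).items()):
    print(node, "->", ", ".join(successors))
```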
In short, no polynomial-time algorithms are known (or even believed) to exist to solve the sequence-assembly problem in its general formulation. Despite this fact, with current high-throughput sequencing techniques and assembly algorithms, datasets of billions of reads are regularly assembled (at least at the contig level) [9–14]. This achievement is in stark contrast with the state of the art for the travelling salesman, a closely related NP-complete problem, for which managing as few as 10^4 ‘cities’ is exceedingly difficult and the largest instance solved to date featured 120 000 locations [15, 16].
The way out of this apparent contradiction is the general notion that, while in the worst case an NP-complete problem takes exponential time to solve, typical instances might be much easier. This observation could explain the (at least apparent) success of heuristic methods, but it needs to be formalised: what is a ‘typical’ instance and how likely is the worst-case scenario? As a path towards answering those questions, it has been observed that, in several problems in computational biology, small ranges of parameters cover all the interesting cases. The question is, then, whether we can identify the right variables and whether the problem can be solved in polynomial time for fixed values of these parameters [17]. This parametric-complexity paradigm has been applied to genome assembly, either using statistical analyses or analytical methods, suggesting that, in some relevant limits, the problem can indeed be solved correctly with polynomial algorithms [18–20].
In this work we present an alternative approach to this question, based on the methods of statistical mechanics. The applicability of the models of statistical physics to the issue of NP-completeness has been conjectured since the 1980s [21, 22] and is now well understood [23]. More to the point, phase diagrams for complexity can be defined for some problems. For instance, a seminal work [24] showed that all the constraints in the problem can be satisfied only when a certain parameter is smaller than its critical value, and that computationally hard instances arise only in the neighbourhood of the phase boundary. We show that a similar result can be obtained for the SCS. Yet, we depart from the paradigm of Ref. [24] in the sense that computationally hard problems become