Phase transition in the computational complexity of the
shortest common superstring and genome assembly
L. A. Fernandez,1, 2 V. Martin-Mayor,1, 2 and D. Yllanes3, 2
1Departamento de Física Teórica, Universidad Complutense, 28040 Madrid, Spain
2Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50018 Zaragoza, Spain
3Chan Zuckerberg Biohub, 499 Illinois Street, San Francisco, CA 94158, USA
Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging
short fragments, or reads, is known to be NP-hard, either as a version of the shortest common
superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed
to grow exponentially with the problem size in the worst case. Despite this fact, high-throughput
technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions
of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating
the existence of a phase transition in the computational complexity of the problem and showing
that practical instances always fall in the ‘easy’ phase (solvable by polynomial-time algorithms). In
addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic
algorithms in the hard regime.
I. INTRODUCTION
Sequence assembly is one of the fundamental prob-
lems in bioinformatics. Since an organism’s whole genetic
material cannot be read in one go, current technologies
build on strategies where the genome (or a portion of it,
such as a chromosome) is randomly fragmented into shorter
reads, which then have to be ordered and merged to re-
construct the original sequence. In a naive formulation,
one would look for the shortest sequence that contains all
of the individual reads, or Shortest Common Superstring
(SCS). This is, in principle, a formidable task, since the
SCS belongs to the so-called NP-complete class of prob-
lems [1, 2], for which no efficient algorithms are known
(or believed) to exist. More precisely, NP denotes a large
family of problems that are verifiable in polynomial time,
meaning that potential solutions can be checked in a time
that grows at most as a power of the size of the input. A
problem is then termed NP-complete if it is in NP and
at least as hard as any problem in NP. This is a relevant
notion because it is believed, but not proven, that not
all NP-complete tasks can be solved in polynomial time.
See, e.g., [3–5] for more precise definitions and examples.
As hinted above, the formulation of genome assem-
bly as an SCS problem is not quite correct. This is be-
cause our assumption of parsimony is not true: most
genomes contain repeats, multiple identical stretches of
DNA, which the SCS would collapse. A formulation of
the assembly problem that takes this issue into account
can be made using de Bruijn graphs [6], but the task can
still be proven to be NP-hard by reduction from SCS [7].
Alternative approaches to assembly have been proposed,
for instance the string-graph representation [8], but this
model has also been shown to be NP-hard, by reduction
from the Hamiltonian-cycle problem [7].
In short, no polynomial-time algorithms are known (or
even believed) to exist to solve the sequence-assembly
problem in its general formulation. Despite this fact,
with current high-throughput sequencing techniques and
assembly algorithms, datasets of billions of reads are reg-
ularly assembled (at least at the contig level) [9–14]. This
achievement is in stark contrast with the state of the art
for the travelling salesman, a closely related NP-complete
problem, for which managing as few as 10^4 'cities' is ex-
ceedingly difficult and the largest instance solved to date
featured 120 000 locations [15, 16].
The way out of this apparent contradiction is the gen-
eral notion that, while in the worst case an NP-complete
problem takes exponential time to solve, typical instances
might be much easier. This observation could explain
the, at least apparent, success of heuristic methods but
it needs to be formalised: what is a ‘typical’ instance and
how likely is the worst-case scenario? As a path towards
answering those questions, it has been observed that in
several problems in computational biology, small ranges
of parameters cover all the interesting cases. The ques-
tion is, then, whether we can identify the right variables
and whether the problem can be solved in polynomial
time for fixed values of these parameters [17]. This para-
metric complexity paradigm has been applied to genome
assembly, either using statistical analyses or analytical
methods, suggesting that, in some relevant limits, the
problem can indeed be solved correctly with polynomial
algorithms [18–20].
In this work we present an alternative approach to this
question, based on the methods of statistical mechan-
ics. The applicability of the models of statistical physics
to the issue of NP-completeness has been conjectured
since the 1980s [21, 22] and is now well understood [23].
More to the point, phase diagrams for complexity can
be defined for some problems. For instance, in a semi-
nal work [24] it was shown that all the constraints in the
problem can be satisfied only when a certain parameter is
smaller than its critical value, and that the computation-
ally hard problems arise only in the neighbourhood of the
phase boundary.
We show that a similar result can be obtained for the
SCS. Yet, we depart from the paradigm of Ref. [24] in
the sense that computationally hard problems become
the rule, rather than the exception, in one of our two
phases, not just at the critical point.
It is worth emphasizing that, unlike assembly, the SCS
problem always has a well-defined solution, which might
not be unique (unfortunately, in some regions of parame-
ter space this solution is very hard to find). Instead, the
assembly problem is well posed only if the whole genome
is covered. It is a classic result [25] that, if L is the
length of our (portion of) genome and ℓ_frag the length
of the reads, in order to ensure that the whole genome
is covered with probability 1 − ϵ, the number N_frag of
reads must satisfy N_frag ≳ (L/ℓ_frag) log N_frag. There-
fore, in practical applications one is interested in over-
sampling the genome (the oversampling ratio is termed
coverage). Our main result in this respect is that the
regime of full coverage corresponds precisely to the easily
solvable phase of the SCS problem. Therefore, whenever
the assembly problem is well posed, the corresponding
solution can be found in polynomial time.
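This coverage condition is easy to check numerically. The sketch below is a purely illustrative Python example (not the authors' code; the helper names are ours): it drops N_frag reads of length ℓ_frag at uniformly random positions on a circular genome of length L and estimates the probability that every base is covered. Full coverage only becomes likely once the coverage N_frag ℓ_frag / L reaches several units, i.e. well into the oversampled regime.

```python
import random

def covers_genome(L, n_frag, ell_frag, rng):
    """Place n_frag reads of length ell_frag at uniformly random start
    positions on a circular genome of length L; return True if every
    base is covered by at least one read."""
    covered = [False] * L
    for _ in range(n_frag):
        start = rng.randrange(L)
        for k in range(ell_frag):
            covered[(start + k) % L] = True
    return all(covered)

def coverage_probability(L, n_frag, ell_frag, trials=100, seed=0):
    """Monte Carlo estimate of the probability of full coverage."""
    rng = random.Random(seed)
    hits = sum(covers_genome(L, n_frag, ell_frag, rng) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    L, ell_frag = 5_000, 100
    # The classic condition N_frag >~ (L / ell_frag) log N_frag translates
    # into a coverage N_frag * ell_frag / L of several units.
    for n_frag in (50, 150, 300, 600):
        coverage = n_frag * ell_frag / L
        p = coverage_probability(L, n_frag, ell_frag)
        print(f"N_frag={n_frag:4d}  coverage={coverage:5.1f}  P(full)~{p:.2f}")
```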
A final note of warning is in order: we shall always
use the language of assembly (genomes and reads, rather than
superstrings and strings) in order to unify the nomencla-
ture.
The remaining part of this paper is organized as fol-
lows. In Sect. II we define the two problems considered
here, the SCS and the assembly. In Sect. III we discuss
our data collection and computational approaches. The
analysis of the phase transition is presented in Sect. IV.
An alternative algorithm for the hard phase of the SCS
is presented in Sect. V. Our conclusions are given in
Sect. VI. We provide additional details on the employed
algorithms in the appendices.
II. THE SCS AND SEQUENCE ASSEMBLY
As explained in the introduction, there are two differ-
ent, yet related problems:
Shortest Common Superstring (SCS). Given N_frag
sequences of ℓ_frag letters taken from a common al-
phabet, find the shortest sequence of letters that
contains every one of the N_frag fragments.
Ex novo genome reconstruction. Read N_frag frag-
ments randomly chopped from a piece of genome.
For simplicity, we shall assume that each fragment
contains the same number of letters, ℓ_frag. Our
problem is reconstructing the original genome from
these reads.
Under favourable circumstances on the ensemble of reads,
the solution of the SCS problem is also the solution of the
assembly problem. Our main emphasis will be on the
combinatorial optimisation problem, namely the SCS.
Reading errors are a real complication in assembly, but
effective methods are known to handle them. Since errors
do not add to the exponential (in genome size) hardness
of the problem, we shall ignore them.
We study the SCS in its formulation as an asymmetric
travelling-salesman problem, where one tries to find the
permutation of fragments that has the maximum overlap
between consecutive segments and, therefore, the mini-
mal total length of the resulting superstring once over-
lapping segments are collapsed. For instance, the SCS of
the strings TTGAA, AGTTG is AGTTGAA. Our reads
are taken from a circular genome of length L base pairs
(we use the natural four-letter alphabet: A, C, G, T).
For our main study we choose every base of the genome
at random (independent picks from the alphabet with
uniform probability),
but we have also checked that our main results extend to
a natural genome, namely that of the swinepox virus.
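To make the travelling-salesman picture concrete, the following sketch (ours, for illustration; it assumes no read is entirely contained in another) computes maximal suffix-prefix overlaps and the superstring obtained from a given permutation of the reads. An exhaustive search over permutations, feasible only for a handful of reads, recovers AGTTGAA for the two strings of the example above.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def superstring(order, reads):
    """Concatenate the reads in the given order, collapsing the maximal
    overlap between consecutive reads (assumes no read contains another)."""
    s = reads[order[0]]
    for i, j in zip(order, order[1:]):
        s += reads[j][overlap(reads[i], reads[j]):]
    return s

def brute_force_scs(reads):
    """Exhaustive search over all permutations: the exponential worst case."""
    candidates = (superstring(p, reads) for p in permutations(range(len(reads))))
    return min(candidates, key=len)

print(brute_force_scs(["TTGAA", "AGTTG"]))   # prints AGTTGAA
```

Finding the best permutation is an asymmetric TSP in which the 'distance' from read i to read j is ℓ_frag minus their overlap, which is why the brute-force search above scales factorially and practical assemblers must rely on heuristics or graph-based formulations.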
A naive approach to an assembly problem emphasises
the covering fraction:

    W = N_frag ℓ_frag / L .                        (1)

W < 1 implies that the SCS is shorter than the genome
(obviously, successful assembly is impossible under these
circumstances). In typical instances of assembly W ≫ 1
(W ∼ 1000 is not uncommon with high-throughput
techniques).
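As a concrete illustration of the setup, the sketch below (again ours, with hypothetical helper names) chops N_frag reads of length ℓ_frag from a random circular genome, letting reads wrap around the origin, and evaluates the covering fraction of Eq. (1).

```python
import random

ALPHABET = "ACGT"

def random_circular_genome(L, rng):
    """Random genome of length L: independent, uniform picks from ACGT."""
    return "".join(rng.choice(ALPHABET) for _ in range(L))

def chop_reads(genome, n_frag, ell_frag, rng):
    """Extract n_frag reads of length ell_frag starting at random positions
    of a circular genome; reads may wrap around the origin."""
    L = len(genome)
    starts = [rng.randrange(L) for _ in range(n_frag)]
    doubled = genome + genome            # simple way to handle the wrap-around
    reads = [doubled[s:s + ell_frag] for s in starts]
    return reads, starts

rng = random.Random(1)
L, n_frag, ell_frag = 5_000, 200, 100
genome = random_circular_genome(L, rng)
reads, starts = chop_reads(genome, n_frag, ell_frag, rng)
W = n_frag * ell_frag / L
print(f"covering fraction W = {W:.1f}")   # here W = 4.0; W < 1 would rule out assembly
```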
Given a set of reads, obtaining the SCS is NP-hard.
Since we are taking our fragments from a known long
string, however, we always have a candidate solution, as
we now explain. Since the genome is known to us before-
hand, we can exactly locate the position in the genome
of every fragment. If we order these starting points in
increasing order, we obtain by merging overlaps a candi-
date solution for the SCS. We name ℓ_ordered the length
of the candidate solution, which often turns out to be
the exact solution (ℓ_ordered = L) when W ≫ 1. This is
the common situation in applications. Yet, when W < 1,
ℓ_ordered is guaranteed to be smaller than L. Furthermore,
the ordered sequence may actually be a very bad solu-
tion for the SCS problem when W ≪ 1. Indeed, in the
limit W → 0, nothing distinguishes the ordered solution
from a random ordering. In the intermediate regime,
W ∼ 1, the ordered sequence is a good guess for the SCS
(and for assembly).
For a given algorithm, a run whose resulting super-
string length ℓ equals ℓ_ordered will be considered success-
ful. In practice, in the W ≫ 1 region, one never finds
ℓ < ℓ_ordered = L and the original genome coincides with
the SCS.
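In code, the 'ordered' candidate can be sketched as follows (our illustration; we assume, consistently with the circular genomes and the ATSP formulation above, that the candidate superstring is treated cyclically, so the wrap-around overlap between the last and the first read is collapsed as well).

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def tour_length(reads, order):
    """Length of the circular superstring obtained by visiting the reads in
    the given cyclic order and collapsing consecutive overlaps (ATSP view)."""
    n = len(order)
    return sum(len(reads[order[i]])
               - overlap(reads[order[i]], reads[order[(i + 1) % n]])
               for i in range(n))

def ell_ordered(reads, starts):
    """Candidate solution: visit the reads in order of their (known)
    starting positions on the genome."""
    order = sorted(range(len(reads)), key=lambda i: starts[i])
    return tour_length(reads, order)
```

With full coverage (W ≫ 1) every pair of consecutive reads overlaps genuinely, and the cyclic sum of the non-overlapping parts typically adds up to exactly L, reproducing ℓ_ordered = L; for W ≪ 1 most overlaps vanish and ℓ_ordered approaches N_frag ℓ_frag.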
III. CREATING AND SAMPLING THE DATA
SET
The main classifying feature for our simulations is the
number of fragments, N_frag. For every N_frag we create a
set of N_chro synthetic circular chromosomes. As discussed
above, we have considered both random chromosomes (in
which each letter is picked independently with uniform
probability from the four-letter alphabet) or extracted