on base pairing distance.
We compare the efficiency and scalability of our algorithms
to Vienna RNAcofold. and confirm that the runtime and mem-
ory usage of LinearCoFold and LinearCoPartition scale linearly
against combined sequence, while RNAcofold scales cubically
in runtime and quadratically in memory usage. LinearCo-
Fold and LinearCoPartition are orders of magnitude faster
than RNAcofold. On the longest data point in the benchmark
dataset that RNAcofold can run (26,190
nt
), LinearCoFold
is
86.8×
faster than RNAcofold MFE mode, and LinearCo-
Partition is
642.3×
faster than RNAcofold partition function
mode. Notably, RNAcofold cannot finish any sequences longer
than 32,767
nt
, but our LinearCoFold and LinearCoPartition
have no limitation of sequence length internally, and can scale
up to sequences of length 100,000
nt
in 2.2 and 6.9 minutes,
respectively. With respect to accuracy, LinearCoFold and Lin-
earCoPartition’s predictions are more accurate with respect to
Sensitivity (the fraction of known pairs correctly predicted) and
Positive Predictive Value (PPV; the fraction of predicted pairs
that are in the accepted structure). Compared with RNAcofold
MFE, the overall PPV and Sensitivity of LinearCoFold increase
+4.0% and +11.6%, respectively; compared with RNAcofold
MEA, LinearCoPartition MEA gains improvement of +2.9%
on PPV and +5.7% on sensitivity; compared with RNAcofold
TheshKnot, LinearCoPartition TheshKnot increases +2.4%
and +5.5% on PPV and sensitivity, respectively. Furthermore,
we demonstrate that our predicted interaction correlates better
to the wet lab results of the RNA-RNA interaction between
SARS-CoV-2 gRNA and human U4 snRNA, showing that our
algorithms can be used as a fast and reliable computational
tool in the genome studies.
2. Algorithms
A. Extend Single-strand Folding to Double-strand Fold-
ing by concatenation.
Both LinearCoFold and LinearCo-
Partition take two RNA sequences as input, and simplify the
two-strand cofolding to the single-strand folding via concate-
nating two input RNAs. Formally, we denote the two RNA
sequences as
xa=xa
1xa
2...xa
n
and
xb=xb
1xb
2...xb
m
, where
n
and
m
are the lengths of
xa
and
xb
, respectively. Thus, the
new concatenated sequence of length
n+m
can be denoted
as
x=x1x2...xnxn+1xn+2...xn+m
, where the nick point is
between nucleotides xnand xn+1.
After this transformation, the classical dynamic program-
ming algorithm for single-strand folding
33, 34
can be applied
to the concatenated sequence. One thermodynamic change
needs to be considered for this extension is that a structure that
contains intermolecular base pairs incurs a stability penalty for
intermolecular initiation.
35
Formally, in the Nussinov system,
we denote the free energy change of the first intermolecular
base pair
(i, j)
as
ζ(x, i, j)
, which differentiates it from that
of the normal base pair
(p, q)
,
ξ(x, p, q)
. Note that
(i, j)
is the
innermost base pair that contains the nick point, while other
intermolecular base pairs do not incur an addition stability
cost. Besides, the free energy change of the unpaired base
k
is
denoted as
δ(x, k)
. Thus, the free energy change
∆G◦(x,y)
of the concatenated sequence
x
and its structure
y
(
(i, j)∈y
)
can be decomposed as:
∆G◦(x,y) = X
k∈unpaired(y)
δ(x, k) + ζ(x, i, j) + X
(p,q)∈pairs(y)
(p,q)6=(i,j)
ξ(x, p, q)
[1]
Note that if there is no base pair closing the nick point, i.e., the
two strands do not interact with each other, two-strand cofold-
ing is simply single-strand folding of two strands separately.
Next, we consider the Zuker system based on the Turner
energy model.
36, 37, 38
More sophisticated than the Nussinov
model, the Zuker and Turner’s scoring system is based on four
types of loops: exterior loops, hairpin loops, interior loops
(where a bulge loop with unpaired nucleotides only on one side
is considered a type of interior loop) and multiloops. In Fig. 2,
we illustrate the relative positions of the nick point in these
four types of loops. For the external loop, the nick point can
be either covered by a base pair or not (Fig. 2A and B). If an
intermolecular base pair
(i, j)
closing the nick point, the span
[i, j]
can be further decomposed into nicked hairpin, nicked
interior loop and nicked multiloop (Fig. 2C) based on the type
of loops it enclosed. Specifically, the nicked hairpin loop only
requires
i≤n < j
, while the nicked interior loop has an inner
loop from position
p
to
q
, and requires either
i≤n<p
or
q≤n < j
; see the first row of Fig. 2C for an illustration.
The nicked multiloop is more complicated (the second row of
Fig. 2C):
•
the nick point is at the leftmost unpaired region, i.e., it is
between
i
and
p
where
p
is the 5’ end of the first multi-
branch;
•
the nick point is at the rightmost unpaired region, i.e.,
it is between
q
and
j
where
q
is the 3’ end of the last
multi-branch;
•
the nick point is in the middle, i.e., it is between
k
and
l
which are the 3’ end and the 5’ end of two consecutive
multi-branches, respectively.
Such nicked loops are considered to be exterior loops when
calculating their free energy change. Note that the nick point
only affects the innermost loop that directly covers it; the loops
are still normal interior loops and multiloops in the case that
the nick point is covered by another base pair
(p, q)
where
i<p<q<j
, shown in the third row of Fig. 2C. In addition,
we add the intermolecular initiation free energy cost for dimers.
B. LinearCoFold Algorithm.
LinearCoFold aims to predict
the minimum free energy (MFE) structure of double-strand
RNAs in linear runtime without imposing a limit on base pair
length. Formally, LinearCoFold finds the MFE structure
ˆ
y
among all possible structures
Y(x)
under the given energy
model w:
ˆ
y=argmin
y∈Y(x)
∆G◦
w(x,y)
Zhang et al. 3