LinearCoFold and LinearCoPartition Linear-Time Algorithms for Secondary Structure Prediction of Interacting RNA molecules

2025-05-03 0 0 2.57MB 12 页 10玖币
侵权投诉
LinearCoFold and LinearCoPartition: Linear-Time
Algorithms for Secondary Structure Prediction of
Interacting RNA molecules
He Zhangb,a,, Sizhen Lia,, Liang Zhanga, David H. Mathewsc,d,e, and Liang Huanga,
aSchool of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR 97330, USA; bBaidu Research USA, Sunnyvale, CA 94089, USA; cDept. of
Biochemistry & Biophysics; dCenter for RNA Biology; eDept. of Biostatistics & Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA
Many ncRNAs function through RNA-RNA interactions. Fast and re-
liable RNA structure prediction with consideration of RNA-RNA in-
teraction is useful. Some existing tools are less accurate due to
omitting the competing of intermolecular and intramolecular base
pairs, or focus more on predicting the binding region rather than pre-
dicting the complete secondary structure of two interacting strands.
Vienna RNAcofold, which reduces the problem into the classical sin-
gle sequence folding by concatenating two strands, scales in cubic
time against the combined sequence length, and is slow for long se-
quences. To address these issues, we present LinearCoFold, which
predicts the complete minimum free energy structure of two strands
in linear runtime, and LinearCoPartition, which calculates the cofold-
ing partition function and base pairing probabilities in linear run-
time. LinearCoFold and LinearCoPartition follows the concatenation
strategy of RNAcofold, but are orders of magnitude faster than RNA-
cofold. For example, on a sequence pair with combined length of
26,190 nt, LinearCoFold is 86.8×faster than RNAcofold MFE mode
(0.6 minutes vs. 52.1 minutes), and LinearCoPartition is 642.3×faster
than RNAcofold partition function mode (1.8 minutes vs. 1156.2 min-
utes). Different from the local algorithms, LinearCoFold and Lin-
earCoPartition are global cofolding algorithms without restriction on
base pair length. Surprisingly, LinearCoFold and LinearCoPartition’s
predictions have higher PPV and sensitivity of intermolecular base
pairs. Furthermore, we apply LinearCoFold to predict the RNA-RNA
interaction between SARS-CoV-2 gRNA and human U4 snRNA, which
has been experimentally studied, and observe that LinearCoFold’s
prediction correlates better to the wet lab results.
1. Introduction
RNA strands can interact via inter-molecular base pairing and
form RNA-RNA complexes. In nature, many non-coding
RNAs (ncRNAs) function through these RNA-RNA interac-
tions (Fig. 1). For instance, it is well-known that microRNA
(miRNA) binds with messenger RNA (mRNA) to mediate
mRNA destabilization
1
and cleavage.
2
Some longer ncRNAs,
such as small RNA (sRNA), small nuclear RNA (snRNA) and
small nucleolar RNA (snoRNA), involve in RNA-RNA inter-
actions for splicing regulation
3, 4
and chemical modifications.
5
A small clade of tmRNAs have a two-piece form (i.e., split
tmRNA) and form complexes via inter-molecular base pairs
(see Fig. 1A and B). On the other hand, human designed RNAs
that bind specifically to the target RNAs are used for diag-
nostics and treatments. Therapeutic small interfering RNA
(siRNA) triggers RNA interference (RNAi) through siRNA-
mRNA interaction;
6, 7, 8
antisense oligonucleotide (ASO) binds
to target RNA to suppress unwanted gene expression or to
A
U
G
G
U
U
G
U
A
A
U
U
C
C
G
G
C
A
A
U
C
U
G
G
A
G
G
C
G
U
U
C
U
G
G
A
C
A
G
G
G
G
U
U
C
G
A
U
UCCCCUCACCUCCACCA
G
G
G
G
G
U
G
U
A
C
U
G
G
U
C
U
C
G
A
C
A
G
G
G
C
G
G
A
C
A
A
A
G
G
U
G
C
G
C
A
G
G
C
A
A
C
U
1
10
20
30
40
50
60
70
80
90
100
110
A
U
G
G
U
U
G
U
A
AUU
C C
G
G C
A
A U C U
G
GAG
G
C G U U C U G
GACA
G
G
G
G
U
U
CGA
U
U
C
C
C
C
U
CACCUCC
ACC
A
GGGGGUG
U
A
C
U
G
G
U
C
U
C
G
A
CAGGGCG
G
A
C
A
A
AGGU
G
C
GC
A
GG
C
A
A
C
U
1
10 20 30
40
50
60
70
80
90
100
110
5’
3’
3’
5’
intra-molecular
base pairs
inter-molecular
base pairs
inter-molecular base pairs
inter-molecular
base pairs
intra-molecular
base pairs inter-molecular
base pairs
Strand A
Strand B
A B
CRNA-RNA interaction function
siRNA-mRNA mRNA degradation
miRNA-mRNA mRNA cleavage, destabilization
and down-regulation
sRNA-mRNA mRNA silencing
gRNA-mRNA mRNA editing
snRNA-mRNA RNA splicing and regulation
snoRNA-rRNA rRNA modification
split tmRNA rescue of stalled ribosomes;
degradation of defective mRNA
Fig. 1.
Two RNA strands can form RNA-RNA complexes through inter-
molecular base pairs. These interacting RNA molecules are widely dis-
tributed in nature, and are involved in multiple biological processes.
A
: The
secondary structure of the split tmRNA from D. aromatica; two strands are
in green and orange, respectively. The intra-molecular base pairs are in
red, and inter-molecular ones are in blue.
B
: The corresponding circular
plot of structure in
A
.
C
: Some known RNA-RNA interactions and their
functions.
regulate splicing;
9, 10, 11
CRISPR/Cas-13 guide RNA (gRNA)
induces specific RNA editing by initially binding to the target
region.
12, 13, 14
Fast and reliable secondary structure prediction
of interacting RNA molecules is desired to further understand
these biological processes and better design diagnostic and
therapeutic RNA drugs.
Some existing algorithms and systems are used for predict-
ing RNA-RNA interaction (see Tab. 1). The stochastic sam-
pling algorithms
24
and tools, such as Vienna RNAsubopt,
15
can
be used to calculate the accessibilities by counting how many of
the structures have the region of interest completely unpaired,
where accessibility is an indicator represents if the correspond-
ing region is open for binding. The tool OligoWalk calculates
Author contributions: L.H. conceived the idea and directed the project. L.H. and
H.Z. designed algorithms. H.Z. implemented the code. D.H.M. guided the evaluation
that S.L., H.Z. and L.Z. carried out. H.Z. and S.L. wrote the manuscript; L.H. and
D.H.M. revised it.
The authors declare no conflict of interest.
Equal contribution; corresponding author: liang.huang.sh@gmail.com.
Zhang et al. 1
arXiv:2210.14982v1 [q-bio.BM] 26 Oct 2022
system input output MFE or base pair runtime memory
strand(s) partition type usage
RNAsubopt15
one sampled structures partition intramolecular O(n3)O(n2)
RNAplfold16 accessibility
OligoWalk17 two binding affinity & structure both intermolecular O((n+m)2)O((n+m)2)
RNAhybrid18
two binding structure MFE intermolecular O(nm)O(nm)
RNAplex19
RNAup
20 one accessibility partition intramolecular O(n3w)O(n2)
two binding affinity & structure both O(n3w) + O(nw5)O(n2) + O(nw3)
PairFold
21 one
full structure MFE
intramolecular O(n3)O(n2)
two both O((n+m)3)O((n+m)2)
multiple both O((Pini)3)O((Pini)2)
bifold17
two full structure MFE both O((n+m)3)O((n+m)2)
RNAcofold22 both
DuplexFold23 two binding structure MFE intermolecular O(n+m)O((n+m)2)
LinearCoFold two full structure MFE both O(n+m)O(blogb(n+m))
LinearCoPartition partition O(b2(n+m))
Table 1. An overview of existing RNA-RNA interaction prediction tools and our algorithms. In the runtime and memory usage columns,
we denote
n
and
m
as the lengths of two sequences,
w
as the binding window size, and
b
as the beam size in our LinearCoFold and
LinearCoPartition. Note that
w
and
b
are constants; by default,
w
is 25 in RNAup, and
b
is 100 in our algorithms. PairFold is a tool that
can do multiple sequence folding, so we denote
ni
as the length of the
i
th sequence for its multifolding mode. Our LinearCoFold and
LinearCoPartition are the only ones that achieve linear runtime with considering both inter- and intramolecular base pairs.
the accessibility for binding of complementary oligonuleotides
considering either lowest free energy structures or the full
folding ensemble.
17, 25
Instead of obtaining accessibility from
samples, Bernhart et al.
16
introduced a cubic runtime algo-
rithm to precisely compute accessibility. Widely used as they
are, however, these methods are designed for analyzing the
accessibility property of the target sequence, but are not able
to predict the binding structure given a specific oligo.
RNAhybrid
18
and RNAplex
19
are another group of algo-
rithms for predicting the hybridization sites in a target RNA that
interact with small oligos, especially for microRNAs, by scan-
ning along the target RNA and calculating the intermolecular
hybridization. Though being fast, they are less informative and
less accurate due to omitting the competing intermolecular and
intramolecular base pairs.
26, 27
To address this, accessibility-
based method is proposed. As an example, RNAup
20
firstly
calculates the accessibility of windows of interest, then com-
putes the binding energy reward of each window for a given
oligo, and finally combines the target region’s accessibility and
binding reward together to obtain binding affinity. The draw-
back of RNAup (as well as other accessibility-based tools)
is the slowness: its first step, accessibility computation for
multiple windows, employs a
O(n3w)
algorithm, where
n
is
the target sequence length and
w
is the window size, result-
ing in a substantially slow down compared to RNAhybrid and
RNAplex.
Aiming to compute the binding affinity and predict the bind-
ing region, RNAhybrid, RNAplex and RNAup are not able to
predict the complete binding conformation of two sequences.
However, the joint structure consisting of both the intramolec-
ular base pairs and intermolecular base pairs is desired in many
cases. Fig. 1A and B illustrate the secondary structure in the
region of interaction of the split tmRNA from D. aromatica,
28
showing that both intramolecular and intermolecular base pairs
exist in the binding region. To predict the joint structure, sev-
eral tools, such as bifold,
17
PairFold,
21
Vienna RNAcofold
29
and NUPACK,
30
were developed. The basic framework of
these tools are to concatenate two input sequences as a single
sequence, and predict the whole secondary structure of the con-
catenated sequence based on the classical dynamic program-
ming algorithms. With some differences in implementation,
the runtime of these algorithms are all
O((n+m)3)
, where
n
and
m
are the lengths of the two strands, preventing them to
be applied to long sequences, for instance, long mRNAs and
some full-length viral genomes.
To accelerate and scale up the prediction of the joint struc-
ture we propose LinearCoFold and LinearCoPartition, which
follow the “concatenation” strategy to simplify two-strand co-
folding into classical single-strand folding, and predict both
intramolecular and intermolecular interactions. Different from
previous cubic runtime algorithms, LinearCoFold and Lin-
earCoPartition adopt a left-to-right dynamic programming and
further apply beam pruning heuristics to reduce its runtime to
linear-time. Specifically, LinearCoFold predicts the minimum
free energy structure of two strands, while LinearCoPartition
computes partition function and base pairing probabilities, and
can output assembled structures with downstream algorithms
such as MEA
31
and ThreshKnot.
32
Unlike other local cofold-
ing algorithms, LinearCoFold and LinearCoPartition are global
linear-time algorithms, i.e., they do not impose any limitations
2Zhang et al.
on base pairing distance.
We compare the efficiency and scalability of our algorithms
to Vienna RNAcofold. and confirm that the runtime and mem-
ory usage of LinearCoFold and LinearCoPartition scale linearly
against combined sequence, while RNAcofold scales cubically
in runtime and quadratically in memory usage. LinearCo-
Fold and LinearCoPartition are orders of magnitude faster
than RNAcofold. On the longest data point in the benchmark
dataset that RNAcofold can run (26,190
nt
), LinearCoFold
is
86.8×
faster than RNAcofold MFE mode, and LinearCo-
Partition is
642.3×
faster than RNAcofold partition function
mode. Notably, RNAcofold cannot finish any sequences longer
than 32,767
nt
, but our LinearCoFold and LinearCoPartition
have no limitation of sequence length internally, and can scale
up to sequences of length 100,000
nt
in 2.2 and 6.9 minutes,
respectively. With respect to accuracy, LinearCoFold and Lin-
earCoPartition’s predictions are more accurate with respect to
Sensitivity (the fraction of known pairs correctly predicted) and
Positive Predictive Value (PPV; the fraction of predicted pairs
that are in the accepted structure). Compared with RNAcofold
MFE, the overall PPV and Sensitivity of LinearCoFold increase
+4.0% and +11.6%, respectively; compared with RNAcofold
MEA, LinearCoPartition MEA gains improvement of +2.9%
on PPV and +5.7% on sensitivity; compared with RNAcofold
TheshKnot, LinearCoPartition TheshKnot increases +2.4%
and +5.5% on PPV and sensitivity, respectively. Furthermore,
we demonstrate that our predicted interaction correlates better
to the wet lab results of the RNA-RNA interaction between
SARS-CoV-2 gRNA and human U4 snRNA, showing that our
algorithms can be used as a fast and reliable computational
tool in the genome studies.
2. Algorithms
A. Extend Single-strand Folding to Double-strand Fold-
ing by concatenation.
Both LinearCoFold and LinearCo-
Partition take two RNA sequences as input, and simplify the
two-strand cofolding to the single-strand folding via concate-
nating two input RNAs. Formally, we denote the two RNA
sequences as
xa=xa
1xa
2...xa
n
and
xb=xb
1xb
2...xb
m
, where
n
and
m
are the lengths of
xa
and
xb
, respectively. Thus, the
new concatenated sequence of length
n+m
can be denoted
as
x=x1x2...xnxn+1xn+2...xn+m
, where the nick point is
between nucleotides xnand xn+1.
After this transformation, the classical dynamic program-
ming algorithm for single-strand folding
33, 34
can be applied
to the concatenated sequence. One thermodynamic change
needs to be considered for this extension is that a structure that
contains intermolecular base pairs incurs a stability penalty for
intermolecular initiation.
35
Formally, in the Nussinov system,
we denote the free energy change of the first intermolecular
base pair
(i, j)
as
ζ(x, i, j)
, which differentiates it from that
of the normal base pair
(p, q)
,
ξ(x, p, q)
. Note that
(i, j)
is the
innermost base pair that contains the nick point, while other
intermolecular base pairs do not incur an addition stability
cost. Besides, the free energy change of the unpaired base
k
is
denoted as
δ(x, k)
. Thus, the free energy change
G(x,y)
of the concatenated sequence
x
and its structure
y
(
(i, j)y
)
can be decomposed as:
G(x,y) = X
kunpaired(y)
δ(x, k) + ζ(x, i, j) + X
(p,q)pairs(y)
(p,q)6=(i,j)
ξ(x, p, q)
[1]
Note that if there is no base pair closing the nick point, i.e., the
two strands do not interact with each other, two-strand cofold-
ing is simply single-strand folding of two strands separately.
Next, we consider the Zuker system based on the Turner
energy model.
36, 37, 38
More sophisticated than the Nussinov
model, the Zuker and Turner’s scoring system is based on four
types of loops: exterior loops, hairpin loops, interior loops
(where a bulge loop with unpaired nucleotides only on one side
is considered a type of interior loop) and multiloops. In Fig. 2,
we illustrate the relative positions of the nick point in these
four types of loops. For the external loop, the nick point can
be either covered by a base pair or not (Fig. 2A and B). If an
intermolecular base pair
(i, j)
closing the nick point, the span
[i, j]
can be further decomposed into nicked hairpin, nicked
interior loop and nicked multiloop (Fig. 2C) based on the type
of loops it enclosed. Specifically, the nicked hairpin loop only
requires
in < j
, while the nicked interior loop has an inner
loop from position
p
to
q
, and requires either
in<p
or
qn < j
; see the first row of Fig. 2C for an illustration.
The nicked multiloop is more complicated (the second row of
Fig. 2C):
the nick point is at the leftmost unpaired region, i.e., it is
between
i
and
p
where
p
is the 5’ end of the first multi-
branch;
the nick point is at the rightmost unpaired region, i.e.,
it is between
q
and
j
where
q
is the 3’ end of the last
multi-branch;
the nick point is in the middle, i.e., it is between
k
and
l
which are the 3’ end and the 5’ end of two consecutive
multi-branches, respectively.
Such nicked loops are considered to be exterior loops when
calculating their free energy change. Note that the nick point
only affects the innermost loop that directly covers it; the loops
are still normal interior loops and multiloops in the case that
the nick point is covered by another base pair
(p, q)
where
i<p<q<j
, shown in the third row of Fig. 2C. In addition,
we add the intermolecular initiation free energy cost for dimers.
B. LinearCoFold Algorithm.
LinearCoFold aims to predict
the minimum free energy (MFE) structure of double-strand
RNAs in linear runtime without imposing a limit on base pair
length. Formally, LinearCoFold finds the MFE structure
ˆ
y
among all possible structures
Y(x)
under the given energy
model w:
ˆ
y=argmin
y∈Y(x)
G
w(x,y)
Zhang et al. 3
摘要:

LinearCoFoldandLinearCoPartition:Linear-TimeAlgorithmsforSecondaryStructurePredictionofInteractingRNAmoleculesHeZhangb,a,y,SizhenLia,y,LiangZhanga,DavidH.Mathewsc,d,e,andLiangHuanga,|aSchoolofElectricalEngineering&ComputerScience,OregonStateUniversity,Corvallis,OR97330,USA;bBaiduResearchUSA,Sunnyval...

展开>> 收起<<
LinearCoFold and LinearCoPartition Linear-Time Algorithms for Secondary Structure Prediction of Interacting RNA molecules.pdf

共12页,预览3页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:12 页 大小:2.57MB 格式:PDF 时间:2025-05-03

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 12
客服
关注