LinearCoFold and LinearCoPartition Linear-Time Algorithms for Secondary Structure Prediction of Interacting RNA molecules

2025-05-03 0 0 2.57MB 12 页 10玖币

LinearCoFold and LinearCoPartition: Linear-Time

Algorithms for Secondary Structure Prediction of

Interacting RNA molecules

He Zhangb,a,†, Sizhen Lia,†, Liang Zhanga, David H. Mathewsc,d,e, and Liang Huanga,♣

aSchool of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR 97330, USA; bBaidu Research USA, Sunnyvale, CA 94089, USA; cDept. of

Biochemistry & Biophysics; dCenter for RNA Biology; eDept. of Biostatistics & Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA

Many ncRNAs function through RNA-RNA interactions. Fast and re-

liable RNA structure prediction with consideration of RNA-RNA in-

teraction is useful. Some existing tools are less accurate due to

omitting the competing of intermolecular and intramolecular base

pairs, or focus more on predicting the binding region rather than pre-

dicting the complete secondary structure of two interacting strands.

Vienna RNAcofold, which reduces the problem into the classical sin-

gle sequence folding by concatenating two strands, scales in cubic

time against the combined sequence length, and is slow for long se-

quences. To address these issues, we present LinearCoFold, which

predicts the complete minimum free energy structure of two strands

in linear runtime, and LinearCoPartition, which calculates the cofold-

ing partition function and base pairing probabilities in linear run-

time. LinearCoFold and LinearCoPartition follows the concatenation

strategy of RNAcofold, but are orders of magnitude faster than RNA-

cofold. For example, on a sequence pair with combined length of

26,190 nt, LinearCoFold is 86.8×faster than RNAcofold MFE mode

(0.6 minutes vs. 52.1 minutes), and LinearCoPartition is 642.3×faster

than RNAcofold partition function mode (1.8 minutes vs. 1156.2 min-

utes). Different from the local algorithms, LinearCoFold and Lin-

earCoPartition are global cofolding algorithms without restriction on

base pair length. Surprisingly, LinearCoFold and LinearCoPartition’s

predictions have higher PPV and sensitivity of intermolecular base

pairs. Furthermore, we apply LinearCoFold to predict the RNA-RNA

interaction between SARS-CoV-2 gRNA and human U4 snRNA, which

has been experimentally studied, and observe that LinearCoFold’s

prediction correlates better to the wet lab results.

1. Introduction

RNA strands can interact via inter-molecular base pairing and

form RNA-RNA complexes. In nature, many non-coding

RNAs (ncRNAs) function through these RNA-RNA interac-

tions (Fig. 1). For instance, it is well-known that microRNA

(miRNA) binds with messenger RNA (mRNA) to mediate

mRNA destabilization

and cleavage.

Some longer ncRNAs,

such as small RNA (sRNA), small nuclear RNA (snRNA) and

small nucleolar RNA (snoRNA), involve in RNA-RNA inter-

actions for splicing regulation

3, 4

and chemical modiﬁcations.

A small clade of tmRNAs have a two-piece form (i.e., split

tmRNA) and form complexes via inter-molecular base pairs

(see Fig. 1A and B). On the other hand, human designed RNAs

that bind speciﬁcally to the target RNAs are used for diag-

nostics and treatments. Therapeutic small interfering RNA

(siRNA) triggers RNA interference (RNAi) through siRNA-

mRNA interaction;

6, 7, 8

antisense oligonucleotide (ASO) binds

to target RNA to suppress unwanted gene expression or to

UCCCCUCACCUCCACCA

100

110

AUU

C C

G C

A U C U

GAG

C G U U C U G

GACA

CGA

CACCUCC

ACC

GGGGGUG

CAGGGCG

AGGU

10 20 30

100

110

5’

3’

5’

intra-molecular

base pairs

inter-molecular

base pairs

inter-molecular base pairs

inter-molecular

base pairs

intra-molecular

base pairs inter-molecular

base pairs

Strand A

Strand B

A B

CRNA-RNA interaction function

siRNA-mRNA mRNA degradation

miRNA-mRNA mRNA cleavage, destabilization

and down-regulation

sRNA-mRNA mRNA silencing

gRNA-mRNA mRNA editing

snRNA-mRNA RNA splicing and regulation

snoRNA-rRNA rRNA modiﬁcation

split tmRNA rescue of stalled ribosomes;

degradation of defective mRNA

Fig. 1.

Two RNA strands can form RNA-RNA complexes through inter-

molecular base pairs. These interacting RNA molecules are widely dis-

tributed in nature, and are involved in multiple biological processes.

: The

secondary structure of the split tmRNA from D. aromatica; two strands are

in green and orange, respectively. The intra-molecular base pairs are in

red, and inter-molecular ones are in blue.

: The corresponding circular

plot of structure in

: Some known RNA-RNA interactions and their

functions.

regulate splicing;

9, 10, 11

CRISPR/Cas-13 guide RNA (gRNA)

induces speciﬁc RNA editing by initially binding to the target

region.

12, 13, 14

Fast and reliable secondary structure prediction

of interacting RNA molecules is desired to further understand

these biological processes and better design diagnostic and

therapeutic RNA drugs.

Some existing algorithms and systems are used for predict-

ing RNA-RNA interaction (see Tab. 1). The stochastic sam-

pling algorithms

and tools, such as Vienna RNAsubopt,

can

be used to calculate the accessibilities by counting how many of

the structures have the region of interest completely unpaired,

where accessibility is an indicator represents if the correspond-

ing region is open for binding. The tool OligoWalk calculates

Author contributions: L.H. conceived the idea and directed the project. L.H. and

H.Z. designed algorithms. H.Z. implemented the code. D.H.M. guided the evaluation

that S.L., H.Z. and L.Z. carried out. H.Z. and S.L. wrote the manuscript; L.H. and

D.H.M. revised it.

The authors declare no conﬂict of interest.

†Equal contribution; ♣corresponding author: liang.huang.sh@gmail.com.

Zhang et al. 1

arXiv:2210.14982v1 [q-bio.BM] 26 Oct 2022

system input output MFE or base pair runtime memory

strand(s) partition type usage

RNAsubopt15

one sampled structures partition intramolecular O(n3)O(n2)

RNAplfold16 accessibility

OligoWalk17 two binding afﬁnity & structure both intermolecular O((n+m)2)O((n+m)2)

RNAhybrid18

two binding structure MFE intermolecular O(nm)O(nm)

RNAplex19

RNAup

20 one accessibility partition intramolecular O(n3w)O(n2)

two binding afﬁnity & structure both O(n3w) + O(nw5)O(n2) + O(nw3)

PairFold

21 one

full structure MFE

intramolecular O(n3)O(n2)

two both O((n+m)3)O((n+m)2)

multiple both O((Pini)3)O((Pini)2)

bifold17

two full structure MFE both O((n+m)3)O((n+m)2)

RNAcofold22 both

DuplexFold23 two binding structure MFE intermolecular O(n+m)O((n+m)2)

LinearCoFold two full structure MFE both O(n+m)O(blogb(n+m))

LinearCoPartition partition O(b2(n+m))

Table 1. An overview of existing RNA-RNA interaction prediction tools and our algorithms. In the runtime and memory usage columns,

we denote

and

as the lengths of two sequences,

as the binding window size, and

as the beam size in our LinearCoFold and

LinearCoPartition. Note that

and

are constants; by default,

is 25 in RNAup, and

is 100 in our algorithms. PairFold is a tool that

can do multiple sequence folding, so we denote

as the length of the

th sequence for its multifolding mode. Our LinearCoFold and

LinearCoPartition are the only ones that achieve linear runtime with considering both inter- and intramolecular base pairs.

the accessibility for binding of complementary oligonuleotides

considering either lowest free energy structures or the full

folding ensemble.

17, 25

Instead of obtaining accessibility from

samples, Bernhart et al.

introduced a cubic runtime algo-

rithm to precisely compute accessibility. Widely used as they

are, however, these methods are designed for analyzing the

accessibility property of the target sequence, but are not able

to predict the binding structure given a speciﬁc oligo.

RNAhybrid

and RNAplex

are another group of algo-

rithms for predicting the hybridization sites in a target RNA that

interact with small oligos, especially for microRNAs, by scan-

ning along the target RNA and calculating the intermolecular

hybridization. Though being fast, they are less informative and

less accurate due to omitting the competing intermolecular and

intramolecular base pairs.

26, 27

To address this, accessibility-

based method is proposed. As an example, RNAup

ﬁrstly

calculates the accessibility of windows of interest, then com-

putes the binding energy reward of each window for a given

oligo, and ﬁnally combines the target region’s accessibility and

binding reward together to obtain binding afﬁnity. The draw-

back of RNAup (as well as other accessibility-based tools)

is the slowness: its ﬁrst step, accessibility computation for

multiple windows, employs a

O(n3w)

algorithm, where

the target sequence length and

is the window size, result-

ing in a substantially slow down compared to RNAhybrid and

RNAplex.

Aiming to compute the binding afﬁnity and predict the bind-

ing region, RNAhybrid, RNAplex and RNAup are not able to

predict the complete binding conformation of two sequences.

However, the joint structure consisting of both the intramolec-

ular base pairs and intermolecular base pairs is desired in many

cases. Fig. 1A and B illustrate the secondary structure in the

region of interaction of the split tmRNA from D. aromatica,

showing that both intramolecular and intermolecular base pairs

exist in the binding region. To predict the joint structure, sev-

eral tools, such as bifold,

PairFold,

Vienna RNAcofold

and NUPACK,

were developed. The basic framework of

these tools are to concatenate two input sequences as a single

sequence, and predict the whole secondary structure of the con-

catenated sequence based on the classical dynamic program-

ming algorithms. With some differences in implementation,

the runtime of these algorithms are all

O((n+m)3)

, where

and

are the lengths of the two strands, preventing them to

be applied to long sequences, for instance, long mRNAs and

some full-length viral genomes.

To accelerate and scale up the prediction of the joint struc-

ture we propose LinearCoFold and LinearCoPartition, which

follow the “concatenation” strategy to simplify two-strand co-

folding into classical single-strand folding, and predict both

intramolecular and intermolecular interactions. Different from

previous cubic runtime algorithms, LinearCoFold and Lin-

earCoPartition adopt a left-to-right dynamic programming and

further apply beam pruning heuristics to reduce its runtime to

linear-time. Speciﬁcally, LinearCoFold predicts the minimum

free energy structure of two strands, while LinearCoPartition

computes partition function and base pairing probabilities, and

can output assembled structures with downstream algorithms

such as MEA

and ThreshKnot.

Unlike other local cofold-

ing algorithms, LinearCoFold and LinearCoPartition are global

linear-time algorithms, i.e., they do not impose any limitations

2Zhang et al.

on base pairing distance.

We compare the efﬁciency and scalability of our algorithms

to Vienna RNAcofold. and conﬁrm that the runtime and mem-

ory usage of LinearCoFold and LinearCoPartition scale linearly

against combined sequence, while RNAcofold scales cubically

in runtime and quadratically in memory usage. LinearCo-

Fold and LinearCoPartition are orders of magnitude faster

than RNAcofold. On the longest data point in the benchmark

dataset that RNAcofold can run (26,190

), LinearCoFold

86.8×

faster than RNAcofold MFE mode, and LinearCo-

Partition is

642.3×

faster than RNAcofold partition function

mode. Notably, RNAcofold cannot ﬁnish any sequences longer

than 32,767

, but our LinearCoFold and LinearCoPartition

have no limitation of sequence length internally, and can scale

up to sequences of length 100,000

in 2.2 and 6.9 minutes,

respectively. With respect to accuracy, LinearCoFold and Lin-

earCoPartition’s predictions are more accurate with respect to

Sensitivity (the fraction of known pairs correctly predicted) and

Positive Predictive Value (PPV; the fraction of predicted pairs

that are in the accepted structure). Compared with RNAcofold

MFE, the overall PPV and Sensitivity of LinearCoFold increase

+4.0% and +11.6%, respectively; compared with RNAcofold

MEA, LinearCoPartition MEA gains improvement of +2.9%

on PPV and +5.7% on sensitivity; compared with RNAcofold

TheshKnot, LinearCoPartition TheshKnot increases +2.4%

and +5.5% on PPV and sensitivity, respectively. Furthermore,

we demonstrate that our predicted interaction correlates better

to the wet lab results of the RNA-RNA interaction between

SARS-CoV-2 gRNA and human U4 snRNA, showing that our

algorithms can be used as a fast and reliable computational

tool in the genome studies.

2. Algorithms

A. Extend Single-strand Folding to Double-strand Fold-

ing by concatenation.

Both LinearCoFold and LinearCo-

Partition take two RNA sequences as input, and simplify the

two-strand cofolding to the single-strand folding via concate-

nating two input RNAs. Formally, we denote the two RNA

sequences as

xa=xa

1xa

2...xa

and

xb=xb

1xb

2...xb

, where

and

are the lengths of

and

, respectively. Thus, the

new concatenated sequence of length

n+m

can be denoted

x=x1x2...xnxn+1xn+2...xn+m

, where the nick point is

between nucleotides xnand xn+1.

After this transformation, the classical dynamic program-

ming algorithm for single-strand folding

33, 34

can be applied

to the concatenated sequence. One thermodynamic change

needs to be considered for this extension is that a structure that

contains intermolecular base pairs incurs a stability penalty for

intermolecular initiation.

Formally, in the Nussinov system,

we denote the free energy change of the ﬁrst intermolecular

base pair

(i, j)

ζ(x, i, j)

, which differentiates it from that

of the normal base pair

(p, q)

ξ(x, p, q)

. Note that

(i, j)

is the

innermost base pair that contains the nick point, while other

intermolecular base pairs do not incur an addition stability

cost. Besides, the free energy change of the unpaired base

denoted as

δ(x, k)

. Thus, the free energy change

∆G◦(x,y)

of the concatenated sequence

and its structure

(

(i, j)∈y

)

can be decomposed as:

∆G◦(x,y) = X

k∈unpaired(y)

δ(x, k) + ζ(x, i, j) + X

(p,q)∈pairs(y)

(p,q)6=(i,j)

ξ(x, p, q)

[1]

Note that if there is no base pair closing the nick point, i.e., the

two strands do not interact with each other, two-strand cofold-

ing is simply single-strand folding of two strands separately.

Next, we consider the Zuker system based on the Turner

energy model.

36, 37, 38

More sophisticated than the Nussinov

model, the Zuker and Turner’s scoring system is based on four

types of loops: exterior loops, hairpin loops, interior loops

(where a bulge loop with unpaired nucleotides only on one side

is considered a type of interior loop) and multiloops. In Fig. 2,

we illustrate the relative positions of the nick point in these

four types of loops. For the external loop, the nick point can

be either covered by a base pair or not (Fig. 2A and B). If an

intermolecular base pair

(i, j)

closing the nick point, the span

[i, j]

can be further decomposed into nicked hairpin, nicked

interior loop and nicked multiloop (Fig. 2C) based on the type

of loops it enclosed. Speciﬁcally, the nicked hairpin loop only

requires

i≤n < j

, while the nicked interior loop has an inner

loop from position

, and requires either

i≤n<p

q≤n < j

; see the ﬁrst row of Fig. 2C for an illustration.

The nicked multiloop is more complicated (the second row of

Fig. 2C):

•

the nick point is at the leftmost unpaired region, i.e., it is

between

and

where

is the 5’ end of the ﬁrst multi-

branch;

•

the nick point is at the rightmost unpaired region, i.e.,

it is between

and

where

is the 3’ end of the last

multi-branch;

•

the nick point is in the middle, i.e., it is between

and

which are the 3’ end and the 5’ end of two consecutive

multi-branches, respectively.

Such nicked loops are considered to be exterior loops when

calculating their free energy change. Note that the nick point

only affects the innermost loop that directly covers it; the loops

are still normal interior loops and multiloops in the case that

the nick point is covered by another base pair

(p, q)

where

i<p<q<j

, shown in the third row of Fig. 2C. In addition,

we add the intermolecular initiation free energy cost for dimers.

B. LinearCoFold Algorithm.

LinearCoFold aims to predict

the minimum free energy (MFE) structure of double-strand

RNAs in linear runtime without imposing a limit on base pair

length. Formally, LinearCoFold ﬁnds the MFE structure

among all possible structures

Y(x)

under the given energy

model w:

y=argmin

y∈Y(x)

∆G◦

w(x,y)

Zhang et al. 3

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

LinearCoFoldandLinearCoPartition:Linear-TimeAlgorithmsforSecondaryStructurePredictionofInteractingRNAmoleculesHeZhangb,a,y,SizhenLia,y,LiangZhanga,DavidH.Mathewsc,d,e,andLiangHuanga,|aSchoolofElectricalEngineering&ComputerScience,OregonStateUniversity,Corvallis,OR97330,USA;bBaiduResearchUSA,Sunnyval...

展开>> 收起<<

LinearCoFold and LinearCoPartition Linear-Time Algorithms for Secondary Structure Prediction of Interacting RNA molecules.pdf

共12页,预览3页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

LinearCoFold and LinearCoPartition Linear-Time Algorithms for Secondary Structure Prediction of Interacting RNA molecules

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: