Small Coupling Expansion for Multiple Sequence Alignment Louise Budzynski1 2and Andrea Pagnani1 2 3 1DISAT Politecnico di Torino Corso Duca degli Abruzzi 24 I-10129 Torino Italy

2025-05-03 0 0 1.42MB 22 页 10玖币

侵权投诉

Small Coupling Expansion for Multiple Sequence Alignment

Louise Budzynski1, 2 and Andrea Pagnani1, 2, 3

1DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, 24, I-10129, Torino, Italy

2Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo (TO), Italy

3INFN, Sezione di Torino, Torino, Via Pietro Giuria, 1 10125 Torino Italy

The alignment of biological sequences such as DNA, RNA, and proteins, is one of the basic

tools that allow to detect evolutionary patterns, as well as functional/structural characterizations

between homologous sequences in diﬀerent organisms. Typically, state-of-the-art bioinformatics

tools are based on proﬁle models that assume the statistical independence of the diﬀerent sites of

the sequences. Over the last years, it has become increasingly clear that homologous sequences

show complex patterns of long-range correlations over the primary sequence as a consequence of

the natural evolution process that selects genetic variants under the constraint of preserving the

functional/structural determinants of the sequence. Here, we present an alignment algorithm based

on message passing techniques that overcomes the limitations of proﬁle models. Our method is based

on a perturbative small-coupling expansion of the free energy of the model that assumes a linear

chain approximation as the 0th-order of the expansion. We test the potentiality of the algorithm

against standard competing strategies on several biological sequences.

PACS numbers:

I. INTRODUCTION

The evolution of biological molecules such as proteins is an ongoing highly nontrivial dynamical process spanning

over billions of years, constrained by the maintenance of relevant structural, and functional determinants. One of the

most striking features of natural evolution is how diﬀerent evolutionary pathways produced ensemble of molecules

characterized by an extremely heterogeneous amino acid sequence – often with a sequence identity lower than 30% –

but with virtually identical three-dimensional native structures. Thanks to a shrewd use of this structural similarity, it

is nowadays possible to classify the entire set of known protein sequences into disjoint classes of sequences originating

from a common ancestral sequence. Sequences belonging to the same class are called homologous.

Homologous sequences are best compared using sequence alignments [1]. Depending on the number of sequences to

align, there are three possible options: (i) Pairwise Alignments aims at casting two sequences into the same framework.

The available algorithms are typically based on some versions of dynamic programming, and scale linearly with the

length of the sequences [2, 3]. (ii) Multiple Sequence Alignments (MSA) maximize the global similarity of more than

two sequences [4]. Dynamic programming techniques can be generalized to more than two sequences, but with a

computational cost that scales exponentially with the number of sequences to be aligned. Producing MSAs of more

than 103sequences remains an open computational challenge. (iii) To align larger number of homologous sequences,

one ﬁrst selects a representative subset called seed for which the use of MSA is computationally feasible. Every single

homolog eventually is aligned to the seed MSA. In this way one can easily align up to 106sequences [5–7].

Standard alignment methods are based on the independent site evolution assumption [1], i.e. the probability of

observing a sequence is factorized among the diﬀerent sites. From a statistical mechanics perspective, such approx-

imation corresponds to a non-interacting 21 colors (20 amino acids + 1 gap symbol) Potts model. Proﬁle hidden

Markov models [6], for instance, are of that type. The computational complexity of proﬁle models is polynomial.

However, proﬁle models neglect long-range correlations, although they are an important statistical feature of ho-

mologous proteins. This well-known phenomenon is at the basis of what biologists call epistasis (i.e. how genetic

variation depend on the genetic context of the sequence). Recently, epistasis has received renewed attention from

the statistical mechanics’ community [8]. Given an MSA of a speciﬁc protein family, one could ask what is the best

statistical description of such an ensemble of sequences. Summary statistics such as one-site frequency count fi(a)

(i.e. the empirically observed frequency of observing amino acid aat position iin the MSA), two-site frequency

count fij (a, b) (i.e. the frequency of observing the amino acid realization a, b at position iand jrespectively), and in

principle higher-order correlations, could be used to inverse statistical modeling of the whole MSA. One can assume

that each sequence in the MSA is independently drawn from a multivariate distribution P(a1, . . . , aL) constrained

to reproduce the multibody empirical frequency counts of the MSA. The use of the maximum-entropy principle is

equivalent to assume a Boltzmann-Gibbs probability measure for P. The related Hamiltonian is a 21-colors general-

ized Potts model characterized by two sets of parameters: local ﬁelds Hi(a), and epistatic two-site interaction terms

Ji,j (a, b). Such parameters can be learned more or less eﬃciently, using the so-called Direct Coupling Analysis (DCA)

[9]. This method has found many interesting applications ranging from the prediction of protein structures [10, 11],

protein-protein interaction [12–14], prediction of mutational eﬀects [15–18], etc. Inherent to this strategy, there is

arXiv:2210.03463v2 [q-bio.QM] 27 Apr 2023

the counter intuitive step of constructing an MSA based on a statistical independence of sites assumption, which is

used, in turn, to predict long-range correlations. To solve this loophole, we propose a mean ﬁeld message-passing

strategy to align sequences to a reference Potts model. To do so, we considered a ﬁrst-order perturbative expansion

a la Plefka [19], setting as 0th order of the expansion the linear chain approximation. Recently, other strategies have

been proposed which take into account long range correlations: search for remote homology [20], a simpliﬁed version

of the message-passing strategy presented here [21], alignment of two Potts models [22], and a more machine learning

inspired method based on tranformers [23].

II. SET-UP OF THE PROBLEM

Although here we will focus on proteins, the method can be extended to other biological sequences, such as RNA

and DNA. Let A= (A1, . . . , AN) be an unaligned amino acid sequence of length N, containing a protein domain

S= (S1, . . . , SL) of a known protein family. While Acontains only amino acids (represented as upper-case letters

from the amino acid alphabet), Smight also contain gaps that are used to indicate the deletion of an amino acid in

the sequence A. We assume that the protein family is described by a Potts Hamiltonian:

HDCA(S) = −

i=1

Hi(Si)−X

i<j

Jij (Si, Sj).(1)

The couplings Jij and external ﬁelds Hihave been learned from the seed MSA in a preprocessing step, using DCA,

and the sub-sequence Sis assumed to have the same length Las the seed. The energy HDCA is considered as a score

for the sub-sequence Sto belong to the protein family. In this setting, our problem consists in ﬁnding a sub-sequence

Swith the lowest energy (i.e. with the highest score). Contrarily to proﬁle models, the Hamiltonian HDCA also

includes pairwise interactions related to residue co-evolution, hopefully leading to more accurate alignments in cases

where conservation of single residues is not suﬃcient to describe the protein family. The Hamiltonian in Eq. (1)

does not model the insertions statistics, because the parameters Jij and Hiare learned from the seed MSA, in which

all columns containing inserts have been removed. Therefore, as in [21], we added the insertion cost Hins , which

has been learned from the insertion statistics contained in the full seed alignment. Similarly to [21], we also added

an additional gap cost Hgap to correct the gap statistics learned in HDCA (that deeply depends on how the seed is

constructed). In this setting, the alignment problem corresponds to ﬁnding a sub-sequence S= (S1, . . . , SL) of the

original sequence A= (A1, . . . , AN), such that:

1. Sis an ordered list of amino acids in A(called match states), with the possibility of adding gaps states denoted

“-” between two consecutive positions, and of skipping some amino acids of A(i.e. interpreting them as

insertions).

2. the sub-sequence Sminimizes the total energy H=HDCA +Hins +Hgap.

An example of a sequence Aand its alignment Sis illustrated in Fig. (1). In order to formulate this problem as a

MADVGNSSKSVVLSSAKQIY

D-VGNSSKLSSA

S1=A3

S3=A4

S12=A16

S2='-'

FIG. 1: Example of alignment. Top: original sequence Aof length N= 20, bottom: aligned sequence Sof length L= 12.

Match states are enlightened in (dark gray) blue. There is one gap at position 2 in the sub-sequence S. Three amino acids are

skipped in the original sequence (in (light gray) orange): they are interpreted as insertions.

statistical physics model, we introduce for each position i= 1,...L a pair of variables yi= (xi, ni), where xi∈ {0,1}

is a binary variable, and ni∈ {0,1, . . . , N, N + 1}is a pointer. The variable xiindicates whether position iis a gap

“−” (xi= 0) or a match state (xi= 1). When iis a match, the pointer niindicates the position of the match state

in the full-length sequence A. When iis a gap, the pointer keeps track of the last match state before position i. Note

that we added pointer values n= 0 and n=N+ 1. These value are used for gap states at the beginning and at

the end of the aligned sequence: if matched symbols start to appear only from a position i > 1, we ﬁll the previous

positions j < i with gaps having pointer nj= 0. Similarly, if the last matched state appears at position i<L, we

ﬁll the next positions j > i with gaps having pointers nj=N+ 1. The Potts Hamiltonian re-written in terms of the

variables y= (y1, . . . , yL) is:

HDCA(y) = −

i=1

Hi(Axi.ni)−X

i<j

Jij (Axi.ni, Axj.nj),

where A0=−is the gap state. We will use short-hand notations Hi(yi)≡Hi(Axi.ni) and Jij (yi, yj)≡

Jij (Axi.ni, Axj.nj) in the rest of the paper. The insertion cost Hins and the gap cost Hgap take the form introduced

in[21]. In particular for the insertion cost we have:

Hins(y) =

i=2

ϕi(ni−ni−1−1) ,

with ϕi(∆n) = (1 −δ∆n,0)[λi

o+λi

e(∆n−1)], and ∆ni=ni−ni−1−1 the number of skipped amino acids between

position i−1 and i. The parameters {λi

o, λi

e}have been inferred from the insertion statistics (see [21] section IV.B.).

And for the gap cost we have:

Hgap(y) =

i=1

µ(xi, ni),

with µ(1, n) = 0 for match states, µ(0,0) = µ(0, N + 1) = µext for external gaps, and µ(0, n) = µint for internal gaps

(with 0 < n < N + 1). The values of µint, and µext have been chosen according to the procedure described in [21],

section IV.C: one re-align sequences of the seed MSA using several values of µint, µext, and pick the ones minimizing

the Hamming distance between the re-aligned seed and the original seed.

We ﬁnally introduce the Boltzmann probability law over the set of possible alignments:

P(y) = χin(y1)QL

i=2 χsr(yi−1, yi)χend(yL)

Z(β)e−βH(y),(2)

where χin,χsr and χend are Boolean functions ensuring that the ordering constraints are satisﬁed. The constraint for

Sto be an ordered list of amino acids is Acan indeed be encoded with the function χsr(xi−1, ni−1, xi, ni) between

two consecutive positions:

χsr(0, ni−1,0, ni) = I[ni−1=ni]

χsr(1, ni−1,0, ni) = I[ni−1=ni∨ni=N+ 1]

χsr(0, ni−1,1, ni) = I[0 ≤ni−1< ni< N + 1]

χsr(1, ni−1,1, ni) = I[0 < ni−1< ni< N + 1] ,

and with additional constraints imposed in the ﬁrst and last position:

χin(x1, n1) = δx1,0δn1,0+δx1,1I[0 < n1< N + 1]

χend(xL, nL) = δxL,0δnL,N +1 +δxL,1I[0 < nL< N + 1] .

Conﬁgurations yviolating the ordering constraints have zero-probability. The parameter βplays the role of an inverse-

temperature: by increasing β, the distribution concentrates on the allowed conﬁgurations achieving the smallest energy,

i.e. on the best alignments.

III. SMALL COUPLING EXPANSION

An eﬃcient strategy for approaching this constrained optimization problem is to use Belief-Propagation (BP). BP

is a message-passing method to approximate probability distributions of the form of Eq. (2). In particular it allows to

compute marginal probabilities on any small subset of variables, as well as the partition function Z(β). BP is exact

when the factor graph representing interactions between variables is a tree, and is used as an heuristic for sparse

graphs. In our case however, the set of couplings Jij is deﬁned for all pairs (i, j), resulting in a fully-connected factor

FIG. 2: Left panel: Fully-connected factor graph associated to the probability Eq. (2) with L= 5. Variables yiare represented

by white dots, external ﬁelds Hiby white squares, and couplings Jij by black squares. Right panel: Factor graph obtained

after the perturbative expansion. External ﬁelds H2,...,HL−1and short-range couplings Ji,i+1,i∈ {1,...,L−1}are modiﬁed

according to Eq. (3) (illustrated by red (light gray) stars).

graph, as shown in the left panel of Fig. 2. This makes the problem diﬃcult for BP. However, although the interactions

are very dense (all couplings are non-zero), they are typically weak for distant sites. Conversely, interactions between

two neighbor sites are typically stronger as they encode the one-dimensional structure of the amino acid sequence.

Therefore, in this work we develop an approximation method where long-range couplings are treated perturbatively.

More precisely, we perform a small-coupling expansion of the free-energy F=−1

βlog Z(β) associated with the

Boltzmann distribution in Eq. (2), where the zero-th order corresponds to the model deﬁned on the one-dimensional

chain, i.e. with long-range couplings set to zero: Jij = 0 for |i−j|>1. Higher orders take into account the

contribution of long-range couplings in a perturbative way. We performed the expansion up to the ﬁrst-order term,

and let the computation of higher orders for future work. This pertubative expansion is similar to a Plefka expansion

to obtain the TAP equations [19, 24, 25]. The main diﬀerence is that in the Plefka expansion, the 0th order is the

mean ﬁeld model (i.e. including only external ﬁelds Hi) and all couplings Jij are treated perturbatively, while in

our approach the 0th order includes also the short-range couplings Ji,i+1. We then study the stationary points of

the perturbed free-energy with respect to single-sites and nearest-neighbors sites marginal probabilities Pi(yi) and

Pi,i+1(yi, yi+1), to obtain a set of approximate BP equations. The technical details of this small-coupling expansion

are given in appendices B. and C. In the rest of the paper we refer to these approximate BP equations as the Small

Coupling Expansion (SCE) equations.

This set of SCE equations can be seen as BP equations whose associated factor graph is a linear chain, as represented

in the right panel of Fig. 2, or equivalently to the equations obtained with the transfer matrix method (or dynamic

programming/forward-backward algorithm [1]). The contribution of the long-range couplings Jij ,|i−j|>1 results

into a modiﬁcation of the external ﬁelds Hiand short-range couplings Ji,i+1:

Hi=Hi+fifor i∈ {2, . . . , L −1}

Ji,i+1 =Ji,i+1 +gifor i∈ {1, . . . , L −1}.(3)

Single-site ﬁelds fiand nearest-neighbors pairwise ﬁelds giare computed explicitly from the set of conditional prob-

abilities P(yi|yj) for any i, j with |i−j|>1:

fl(yl) = −

l−1

i=1

j=l+1 X

yi,yj

Jij (yi, yj)Pi(yi|yl)Pj(yj|yl) (4)

and:

gl(yl, yl+1) =

i=1

j=ζl

yi,yj

Jij (yi, yj)Pi(yi|yl)Pj(yj|yl+1) (5)

with ζl

i= max(l+1, i+2). The SCE equations are recursive equations for a set of forward messages Fi(yi),b

Fi(yi), and

backward messages Bi(yi),b

Bi(yi), deﬁned on the edges of the one-dimensional chain, as shown in Fig. 3. We give here

the exact form of the approximate BP equations, their derivation is given in appendix C. For the forward messages

F2FL-1

12L-1 L

BL-1

FL-1

FIG. 3: BP messages deﬁned on the one-dimensional chain. In blue (top arrows): the set of forward messages Fi,

Fi,

and in red (bottom arrows) the set of backward messages Bi,

Bi.

we have:

F1(y1) = 1

z1→e1

eβH1(y1)

Fi(yi) = 1

zi→ei

eβe

Hi(yi)b

Fi(yi),for i≥2

Fi+1(yi) = 1

bzei→i+1 X

eβe

Jei(yi,yi+1)Fi(yi),

(6)

where Fiis deﬁned for i∈ {1, . . . , L −1}and b

Fifor i∈ {2, . . . , L −1}, and zi→ei,bzei→i+1 are normalization factors

ensuring that the BP messages are normalized to 1. And for the backward messages we have:

BL(yL) = 1

zL→eL−1

eβHL(yL)

Bi(yi) = 1

zi→ei−1

eβe

Hi(yi)b

Bi(yi),for i≤L

Bi(yi) = 1

bzei→iX

yi+1

eβe

Jei(yi,yi+1)Bi+1(yi+1),

(7)

where Biis deﬁned for i∈ {2, . . . , L}and b

Bifor i∈ {1, . . . , L −2}, and zi→ei−1,bzei→iare normalization constants.

Single-site and nearest-neighbors marginal probabilities Pi(yi) and Pi,i+1(yi, yi+1) can be expressed in terms of the

BP messages:

P1(y1) = 1

eβH1(y1)b

B1(y1)

Pi(yi) = 1

eβe

Hi(yi)b

Fi(yi)b

Bi(yi),2≤i≤L−1

PL(yL) = 1

eβHL(yL)b

FL(yL),

(8)

and for i∈ {1, . . . , L −1}:

Pi,i+1(yi, yi+1) = eβe

Ji,i+1(yi,yi+1 )

zi,i+1

Fi(yi)Bi+1(yi+1).(9)

From the set of marginal probabilities, one ﬁnally computes the conditional probabilities Pi(yi|yj), for all i, j with

|i−j|>1, from the chain rule, which is valid when long-range couplings are neglected:

Pi(yi|yl) = X

yi−1

Pi−1(yi−1|yl)Pi(yi|yi−1) if i>l+ 1 ,(10)

with a similar expression similarly when i<l−1.

A solution the SCE equations can be found iteratively (see appendix C 3. for a complete description of the al-

gorithm). From a random initialization of the BP messages, the algorithm ﬁrst computes the marginals Pi,Pi,i+1

from Eq. (8, 9), then updates the set of conditional probabilities Pi(yi|yj) from Eq. (10), and ﬁnally computes the

long-range ﬁelds fi, giusing Eqs. (4, 5). BP messages are then updated using the new value of fi, gi, and these steps

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

SmallCouplingExpansionforMultipleSequenceAlignmentLouiseBudzynski1,2andAndreaPagnani1,2,31DISAT,PolitecnicodiTorino,CorsoDucadegliAbruzzi,24,I-10129,Torino,Italy2ItalianInstituteforGenomicMedicine,IRCCSCandiolo,SP-142,I-10060,Candiolo(TO),Italy3INFN,SezionediTorino,Torino,ViaPietroGiuria,110125Torin...

展开>> 收起<<

Small Coupling Expansion for Multiple Sequence Alignment Louise Budzynski1 2and Andrea Pagnani1 2 3 1DISAT Politecnico di Torino Corso Duca degli Abruzzi 24 I-10129 Torino Italy.pdf

共22页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Small Coupling Expansion for Multiple Sequence Alignment Louise Budzynski1 2and Andrea Pagnani1 2 3 1DISAT Politecnico di Torino Corso Duca degli Abruzzi 24 I-10129 Torino Italy

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: