Small Coupling Expansion for Multiple Sequence Alignment Louise Budzynski1 2and Andrea Pagnani1 2 3 1DISAT Politecnico di Torino Corso Duca degli Abruzzi 24 I-10129 Torino Italy

2025-05-03 0 0 1.42MB 22 页 10玖币
侵权投诉
Small Coupling Expansion for Multiple Sequence Alignment
Louise Budzynski1, 2 and Andrea Pagnani1, 2, 3
1DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, 24, I-10129, Torino, Italy
2Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo (TO), Italy
3INFN, Sezione di Torino, Torino, Via Pietro Giuria, 1 10125 Torino Italy
The alignment of biological sequences such as DNA, RNA, and proteins, is one of the basic
tools that allow to detect evolutionary patterns, as well as functional/structural characterizations
between homologous sequences in different organisms. Typically, state-of-the-art bioinformatics
tools are based on profile models that assume the statistical independence of the different sites of
the sequences. Over the last years, it has become increasingly clear that homologous sequences
show complex patterns of long-range correlations over the primary sequence as a consequence of
the natural evolution process that selects genetic variants under the constraint of preserving the
functional/structural determinants of the sequence. Here, we present an alignment algorithm based
on message passing techniques that overcomes the limitations of profile models. Our method is based
on a perturbative small-coupling expansion of the free energy of the model that assumes a linear
chain approximation as the 0th-order of the expansion. We test the potentiality of the algorithm
against standard competing strategies on several biological sequences.
PACS numbers:
I. INTRODUCTION
The evolution of biological molecules such as proteins is an ongoing highly nontrivial dynamical process spanning
over billions of years, constrained by the maintenance of relevant structural, and functional determinants. One of the
most striking features of natural evolution is how different evolutionary pathways produced ensemble of molecules
characterized by an extremely heterogeneous amino acid sequence – often with a sequence identity lower than 30% –
but with virtually identical three-dimensional native structures. Thanks to a shrewd use of this structural similarity, it
is nowadays possible to classify the entire set of known protein sequences into disjoint classes of sequences originating
from a common ancestral sequence. Sequences belonging to the same class are called homologous.
Homologous sequences are best compared using sequence alignments [1]. Depending on the number of sequences to
align, there are three possible options: (i) Pairwise Alignments aims at casting two sequences into the same framework.
The available algorithms are typically based on some versions of dynamic programming, and scale linearly with the
length of the sequences [2, 3]. (ii) Multiple Sequence Alignments (MSA) maximize the global similarity of more than
two sequences [4]. Dynamic programming techniques can be generalized to more than two sequences, but with a
computational cost that scales exponentially with the number of sequences to be aligned. Producing MSAs of more
than 103sequences remains an open computational challenge. (iii) To align larger number of homologous sequences,
one first selects a representative subset called seed for which the use of MSA is computationally feasible. Every single
homolog eventually is aligned to the seed MSA. In this way one can easily align up to 106sequences [5–7].
Standard alignment methods are based on the independent site evolution assumption [1], i.e. the probability of
observing a sequence is factorized among the different sites. From a statistical mechanics perspective, such approx-
imation corresponds to a non-interacting 21 colors (20 amino acids + 1 gap symbol) Potts model. Profile hidden
Markov models [6], for instance, are of that type. The computational complexity of profile models is polynomial.
However, profile models neglect long-range correlations, although they are an important statistical feature of ho-
mologous proteins. This well-known phenomenon is at the basis of what biologists call epistasis (i.e. how genetic
variation depend on the genetic context of the sequence). Recently, epistasis has received renewed attention from
the statistical mechanics’ community [8]. Given an MSA of a specific protein family, one could ask what is the best
statistical description of such an ensemble of sequences. Summary statistics such as one-site frequency count fi(a)
(i.e. the empirically observed frequency of observing amino acid aat position iin the MSA), two-site frequency
count fij (a, b) (i.e. the frequency of observing the amino acid realization a, b at position iand jrespectively), and in
principle higher-order correlations, could be used to inverse statistical modeling of the whole MSA. One can assume
that each sequence in the MSA is independently drawn from a multivariate distribution P(a1, . . . , aL) constrained
to reproduce the multibody empirical frequency counts of the MSA. The use of the maximum-entropy principle is
equivalent to assume a Boltzmann-Gibbs probability measure for P. The related Hamiltonian is a 21-colors general-
ized Potts model characterized by two sets of parameters: local fields Hi(a), and epistatic two-site interaction terms
Ji,j (a, b). Such parameters can be learned more or less efficiently, using the so-called Direct Coupling Analysis (DCA)
[9]. This method has found many interesting applications ranging from the prediction of protein structures [10, 11],
protein-protein interaction [12–14], prediction of mutational effects [15–18], etc. Inherent to this strategy, there is
arXiv:2210.03463v2 [q-bio.QM] 27 Apr 2023
2
the counter intuitive step of constructing an MSA based on a statistical independence of sites assumption, which is
used, in turn, to predict long-range correlations. To solve this loophole, we propose a mean field message-passing
strategy to align sequences to a reference Potts model. To do so, we considered a first-order perturbative expansion
a la Plefka [19], setting as 0th order of the expansion the linear chain approximation. Recently, other strategies have
been proposed which take into account long range correlations: search for remote homology [20], a simplified version
of the message-passing strategy presented here [21], alignment of two Potts models [22], and a more machine learning
inspired method based on tranformers [23].
II. SET-UP OF THE PROBLEM
Although here we will focus on proteins, the method can be extended to other biological sequences, such as RNA
and DNA. Let A= (A1, . . . , AN) be an unaligned amino acid sequence of length N, containing a protein domain
S= (S1, . . . , SL) of a known protein family. While Acontains only amino acids (represented as upper-case letters
from the amino acid alphabet), Smight also contain gaps that are used to indicate the deletion of an amino acid in
the sequence A. We assume that the protein family is described by a Potts Hamiltonian:
HDCA(S) =
L
X
i=1
Hi(Si)X
i<j
Jij (Si, Sj).(1)
The couplings Jij and external fields Hihave been learned from the seed MSA in a preprocessing step, using DCA,
and the sub-sequence Sis assumed to have the same length Las the seed. The energy HDCA is considered as a score
for the sub-sequence Sto belong to the protein family. In this setting, our problem consists in finding a sub-sequence
Swith the lowest energy (i.e. with the highest score). Contrarily to profile models, the Hamiltonian HDCA also
includes pairwise interactions related to residue co-evolution, hopefully leading to more accurate alignments in cases
where conservation of single residues is not sufficient to describe the protein family. The Hamiltonian in Eq. (1)
does not model the insertions statistics, because the parameters Jij and Hiare learned from the seed MSA, in which
all columns containing inserts have been removed. Therefore, as in [21], we added the insertion cost Hins , which
has been learned from the insertion statistics contained in the full seed alignment. Similarly to [21], we also added
an additional gap cost Hgap to correct the gap statistics learned in HDCA (that deeply depends on how the seed is
constructed). In this setting, the alignment problem corresponds to finding a sub-sequence S= (S1, . . . , SL) of the
original sequence A= (A1, . . . , AN), such that:
1. Sis an ordered list of amino acids in A(called match states), with the possibility of adding gaps states denoted
“-” between two consecutive positions, and of skipping some amino acids of A(i.e. interpreting them as
insertions).
2. the sub-sequence Sminimizes the total energy H=HDCA +Hins +Hgap.
An example of a sequence Aand its alignment Sis illustrated in Fig. (1). In order to formulate this problem as a
MADVGNSSKSVVLSSAKQIY
D-VGNSSKLSSA
S1=A3
S3=A4
S12=A16
S2='-'
FIG. 1: Example of alignment. Top: original sequence Aof length N= 20, bottom: aligned sequence Sof length L= 12.
Match states are enlightened in (dark gray) blue. There is one gap at position 2 in the sub-sequence S. Three amino acids are
skipped in the original sequence (in (light gray) orange): they are interpreted as insertions.
statistical physics model, we introduce for each position i= 1,...L a pair of variables yi= (xi, ni), where xi∈ {0,1}
is a binary variable, and ni∈ {0,1, . . . , N, N + 1}is a pointer. The variable xiindicates whether position iis a gap
” (xi= 0) or a match state (xi= 1). When iis a match, the pointer niindicates the position of the match state
in the full-length sequence A. When iis a gap, the pointer keeps track of the last match state before position i. Note
that we added pointer values n= 0 and n=N+ 1. These value are used for gap states at the beginning and at
3
the end of the aligned sequence: if matched symbols start to appear only from a position i > 1, we fill the previous
positions j < i with gaps having pointer nj= 0. Similarly, if the last matched state appears at position i<L, we
fill the next positions j > i with gaps having pointers nj=N+ 1. The Potts Hamiltonian re-written in terms of the
variables y= (y1, . . . , yL) is:
HDCA(y) =
L
X
i=1
Hi(Axi.ni)X
i<j
Jij (Axi.ni, Axj.nj),
where A0=is the gap state. We will use short-hand notations Hi(yi)Hi(Axi.ni) and Jij (yi, yj)
Jij (Axi.ni, Axj.nj) in the rest of the paper. The insertion cost Hins and the gap cost Hgap take the form introduced
in[21]. In particular for the insertion cost we have:
Hins(y) =
L
X
i=2
ϕi(nini11) ,
with ϕi(∆n) = (1 δn,0)[λi
o+λi
e(∆n1)], and ∆ni=nini11 the number of skipped amino acids between
position i1 and i. The parameters {λi
o, λi
e}have been inferred from the insertion statistics (see [21] section IV.B.).
And for the gap cost we have:
Hgap(y) =
L
X
i=1
µ(xi, ni),
with µ(1, n) = 0 for match states, µ(0,0) = µ(0, N + 1) = µext for external gaps, and µ(0, n) = µint for internal gaps
(with 0 < n < N + 1). The values of µint, and µext have been chosen according to the procedure described in [21],
section IV.C: one re-align sequences of the seed MSA using several values of µint, µext, and pick the ones minimizing
the Hamming distance between the re-aligned seed and the original seed.
We finally introduce the Boltzmann probability law over the set of possible alignments:
P(y) = χin(y1)QL
i=2 χsr(yi1, yi)χend(yL)
Z(β)eβH(y),(2)
where χin,χsr and χend are Boolean functions ensuring that the ordering constraints are satisfied. The constraint for
Sto be an ordered list of amino acids is Acan indeed be encoded with the function χsr(xi1, ni1, xi, ni) between
two consecutive positions:
χsr(0, ni1,0, ni) = I[ni1=ni]
χsr(1, ni1,0, ni) = I[ni1=nini=N+ 1]
χsr(0, ni1,1, ni) = I[0 ni1< ni< N + 1]
χsr(1, ni1,1, ni) = I[0 < ni1< ni< N + 1] ,
and with additional constraints imposed in the first and last position:
χin(x1, n1) = δx1,0δn1,0+δx1,1I[0 < n1< N + 1]
χend(xL, nL) = δxL,0δnL,N +1 +δxL,1I[0 < nL< N + 1] .
Configurations yviolating the ordering constraints have zero-probability. The parameter βplays the role of an inverse-
temperature: by increasing β, the distribution concentrates on the allowed configurations achieving the smallest energy,
i.e. on the best alignments.
III. SMALL COUPLING EXPANSION
An efficient strategy for approaching this constrained optimization problem is to use Belief-Propagation (BP). BP
is a message-passing method to approximate probability distributions of the form of Eq. (2). In particular it allows to
compute marginal probabilities on any small subset of variables, as well as the partition function Z(β). BP is exact
when the factor graph representing interactions between variables is a tree, and is used as an heuristic for sparse
graphs. In our case however, the set of couplings Jij is defined for all pairs (i, j), resulting in a fully-connected factor
4
FIG. 2: Left panel: Fully-connected factor graph associated to the probability Eq. (2) with L= 5. Variables yiare represented
by white dots, external fields Hiby white squares, and couplings Jij by black squares. Right panel: Factor graph obtained
after the perturbative expansion. External fields H2,...,HL1and short-range couplings Ji,i+1,i∈ {1,...,L1}are modified
according to Eq. (3) (illustrated by red (light gray) stars).
graph, as shown in the left panel of Fig. 2. This makes the problem difficult for BP. However, although the interactions
are very dense (all couplings are non-zero), they are typically weak for distant sites. Conversely, interactions between
two neighbor sites are typically stronger as they encode the one-dimensional structure of the amino acid sequence.
Therefore, in this work we develop an approximation method where long-range couplings are treated perturbatively.
More precisely, we perform a small-coupling expansion of the free-energy F=1
βlog Z(β) associated with the
Boltzmann distribution in Eq. (2), where the zero-th order corresponds to the model defined on the one-dimensional
chain, i.e. with long-range couplings set to zero: Jij = 0 for |ij|>1. Higher orders take into account the
contribution of long-range couplings in a perturbative way. We performed the expansion up to the first-order term,
and let the computation of higher orders for future work. This pertubative expansion is similar to a Plefka expansion
to obtain the TAP equations [19, 24, 25]. The main difference is that in the Plefka expansion, the 0th order is the
mean field model (i.e. including only external fields Hi) and all couplings Jij are treated perturbatively, while in
our approach the 0th order includes also the short-range couplings Ji,i+1. We then study the stationary points of
the perturbed free-energy with respect to single-sites and nearest-neighbors sites marginal probabilities Pi(yi) and
Pi,i+1(yi, yi+1), to obtain a set of approximate BP equations. The technical details of this small-coupling expansion
are given in appendices B. and C. In the rest of the paper we refer to these approximate BP equations as the Small
Coupling Expansion (SCE) equations.
This set of SCE equations can be seen as BP equations whose associated factor graph is a linear chain, as represented
in the right panel of Fig. 2, or equivalently to the equations obtained with the transfer matrix method (or dynamic
programming/forward-backward algorithm [1]). The contribution of the long-range couplings Jij ,|ij|>1 results
into a modification of the external fields Hiand short-range couplings Ji,i+1:
e
Hi=Hi+fifor i∈ {2, . . . , L 1}
e
Ji,i+1 =Ji,i+1 +gifor i∈ {1, . . . , L 1}.(3)
Single-site fields fiand nearest-neighbors pairwise fields giare computed explicitly from the set of conditional prob-
abilities P(yi|yj) for any i, j with |ij|>1:
fl(yl) =
l1
X
i=1
L
X
j=l+1 X
yi,yj
Jij (yi, yj)Pi(yi|yl)Pj(yj|yl) (4)
and:
gl(yl, yl+1) =
l
X
i=1
L
X
j=ζl
iX
yi,yj
Jij (yi, yj)Pi(yi|yl)Pj(yj|yl+1) (5)
with ζl
i= max(l+1, i+2). The SCE equations are recursive equations for a set of forward messages Fi(yi),b
Fi(yi), and
backward messages Bi(yi),b
Bi(yi), defined on the edges of the one-dimensional chain, as shown in Fig. 3. We give here
the exact form of the approximate BP equations, their derivation is given in appendix C. For the forward messages
5
F1
B2
F2FL-1
BL
12L-1 L
BL-1
F2
^
FL
^
B1
^
BL-1
^
FL-1
^
B2
^
FIG. 3: BP messages defined on the one-dimensional chain. In blue (top arrows): the set of forward messages Fi,
b
Fi,
and in red (bottom arrows) the set of backward messages Bi,
b
Bi.
we have:
F1(y1) = 1
z1e1
eβH1(y1)
Fi(yi) = 1
ziei
eβe
Hi(yi)b
Fi(yi),for i2
b
Fi+1(yi) = 1
bzeii+1 X
yi
eβe
Jei(yi,yi+1)Fi(yi),
(6)
where Fiis defined for i∈ {1, . . . , L 1}and b
Fifor i∈ {2, . . . , L 1}, and ziei,bzeii+1 are normalization factors
ensuring that the BP messages are normalized to 1. And for the backward messages we have:
BL(yL) = 1
zLeL1
eβHL(yL)
Bi(yi) = 1
ziei1
eβe
Hi(yi)b
Bi(yi),for iL
b
Bi(yi) = 1
bzeiiX
yi+1
eβe
Jei(yi,yi+1)Bi+1(yi+1),
(7)
where Biis defined for i∈ {2, . . . , L}and b
Bifor i∈ {1, . . . , L 2}, and ziei1,bzeiiare normalization constants.
Single-site and nearest-neighbors marginal probabilities Pi(yi) and Pi,i+1(yi, yi+1) can be expressed in terms of the
BP messages:
P1(y1) = 1
z1
eβH1(y1)b
B1(y1)
Pi(yi) = 1
zi
eβe
Hi(yi)b
Fi(yi)b
Bi(yi),2iL1
PL(yL) = 1
zL
eβHL(yL)b
FL(yL),
(8)
and for i∈ {1, . . . , L 1}:
Pi,i+1(yi, yi+1) = eβe
Ji,i+1(yi,yi+1 )
zi,i+1
Fi(yi)Bi+1(yi+1).(9)
From the set of marginal probabilities, one finally computes the conditional probabilities Pi(yi|yj), for all i, j with
|ij|>1, from the chain rule, which is valid when long-range couplings are neglected:
Pi(yi|yl) = X
yi1
Pi1(yi1|yl)Pi(yi|yi1) if i>l+ 1 ,(10)
with a similar expression similarly when i<l1.
A solution the SCE equations can be found iteratively (see appendix C 3. for a complete description of the al-
gorithm). From a random initialization of the BP messages, the algorithm first computes the marginals Pi,Pi,i+1
from Eq. (8, 9), then updates the set of conditional probabilities Pi(yi|yj) from Eq. (10), and finally computes the
long-range fields fi, giusing Eqs. (4, 5). BP messages are then updated using the new value of fi, gi, and these steps
摘要:

SmallCouplingExpansionforMultipleSequenceAlignmentLouiseBudzynski1,2andAndreaPagnani1,2,31DISAT,PolitecnicodiTorino,CorsoDucadegliAbruzzi,24,I-10129,Torino,Italy2ItalianInstituteforGenomicMedicine,IRCCSCandiolo,SP-142,I-10060,Candiolo(TO),Italy3INFN,SezionediTorino,Torino,ViaPietroGiuria,110125Torin...

展开>> 收起<<
Small Coupling Expansion for Multiple Sequence Alignment Louise Budzynski1 2and Andrea Pagnani1 2 3 1DISAT Politecnico di Torino Corso Duca degli Abruzzi 24 I-10129 Torino Italy.pdf

共22页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:22 页 大小:1.42MB 格式:PDF 时间:2025-05-03

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 22
客服
关注