Small Coupling Expansion for Multiple Sequence Alignment
Louise Budzynski1, 2 and Andrea Pagnani1, 2, 3
1DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, 24, I-10129, Torino, Italy
2Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo (TO), Italy
3INFN, Sezione di Torino, Torino, Via Pietro Giuria, 1 10125 Torino Italy
The alignment of biological sequences such as DNA, RNA, and proteins, is one of the basic
tools that allow to detect evolutionary patterns, as well as functional/structural characterizations
between homologous sequences in different organisms. Typically, state-of-the-art bioinformatics
tools are based on profile models that assume the statistical independence of the different sites of
the sequences. Over the last years, it has become increasingly clear that homologous sequences
show complex patterns of long-range correlations over the primary sequence as a consequence of
the natural evolution process that selects genetic variants under the constraint of preserving the
functional/structural determinants of the sequence. Here, we present an alignment algorithm based
on message passing techniques that overcomes the limitations of profile models. Our method is based
on a perturbative small-coupling expansion of the free energy of the model that assumes a linear
chain approximation as the 0th-order of the expansion. We test the potentiality of the algorithm
against standard competing strategies on several biological sequences.
PACS numbers:
I. INTRODUCTION
The evolution of biological molecules such as proteins is an ongoing highly nontrivial dynamical process spanning
over billions of years, constrained by the maintenance of relevant structural, and functional determinants. One of the
most striking features of natural evolution is how different evolutionary pathways produced ensemble of molecules
characterized by an extremely heterogeneous amino acid sequence – often with a sequence identity lower than 30% –
but with virtually identical three-dimensional native structures. Thanks to a shrewd use of this structural similarity, it
is nowadays possible to classify the entire set of known protein sequences into disjoint classes of sequences originating
from a common ancestral sequence. Sequences belonging to the same class are called homologous.
Homologous sequences are best compared using sequence alignments [1]. Depending on the number of sequences to
align, there are three possible options: (i) Pairwise Alignments aims at casting two sequences into the same framework.
The available algorithms are typically based on some versions of dynamic programming, and scale linearly with the
length of the sequences [2, 3]. (ii) Multiple Sequence Alignments (MSA) maximize the global similarity of more than
two sequences [4]. Dynamic programming techniques can be generalized to more than two sequences, but with a
computational cost that scales exponentially with the number of sequences to be aligned. Producing MSAs of more
than 103sequences remains an open computational challenge. (iii) To align larger number of homologous sequences,
one first selects a representative subset called seed for which the use of MSA is computationally feasible. Every single
homolog eventually is aligned to the seed MSA. In this way one can easily align up to 106sequences [5–7].
Standard alignment methods are based on the independent site evolution assumption [1], i.e. the probability of
observing a sequence is factorized among the different sites. From a statistical mechanics perspective, such approx-
imation corresponds to a non-interacting 21 colors (20 amino acids + 1 gap symbol) Potts model. Profile hidden
Markov models [6], for instance, are of that type. The computational complexity of profile models is polynomial.
However, profile models neglect long-range correlations, although they are an important statistical feature of ho-
mologous proteins. This well-known phenomenon is at the basis of what biologists call epistasis (i.e. how genetic
variation depend on the genetic context of the sequence). Recently, epistasis has received renewed attention from
the statistical mechanics’ community [8]. Given an MSA of a specific protein family, one could ask what is the best
statistical description of such an ensemble of sequences. Summary statistics such as one-site frequency count fi(a)
(i.e. the empirically observed frequency of observing amino acid aat position iin the MSA), two-site frequency
count fij (a, b) (i.e. the frequency of observing the amino acid realization a, b at position iand jrespectively), and in
principle higher-order correlations, could be used to inverse statistical modeling of the whole MSA. One can assume
that each sequence in the MSA is independently drawn from a multivariate distribution P(a1, . . . , aL) constrained
to reproduce the multibody empirical frequency counts of the MSA. The use of the maximum-entropy principle is
equivalent to assume a Boltzmann-Gibbs probability measure for P. The related Hamiltonian is a 21-colors general-
ized Potts model characterized by two sets of parameters: local fields Hi(a), and epistatic two-site interaction terms
Ji,j (a, b). Such parameters can be learned more or less efficiently, using the so-called Direct Coupling Analysis (DCA)
[9]. This method has found many interesting applications ranging from the prediction of protein structures [10, 11],
protein-protein interaction [12–14], prediction of mutational effects [15–18], etc. Inherent to this strategy, there is
arXiv:2210.03463v2 [q-bio.QM] 27 Apr 2023