
[Figure 2 diagram: input sequence → genetic and structure database searches → embedding of msa_feat, target_feat, residue_index, extra_msa_feat, template_pair_feat, and template_angle_feat into the MSA representation (b, s, r, cm), extra MSA representation (b, se, r, ce), and pair representation (b, r, r, cz); Template Embedding (4 templates) with TemplatePairStack (2 blocks); Extra MSA Stack (4 blocks); Evoformer Stack (48 blocks); Structure Module (8 blocks); recycling 1~3 times over the preprocess, embedding, encoder, decoder, and recycle stages.]
Figure 2: Overall framework of AlphaFold2. Dimension names: b: mini-batches, s: clustered MSA sequences, se: extra MSA sequences, r: residues, c: channels. The Extra MSA Stack is composed of Evoformer blocks, so AlphaFold2 has 52 Evoformer blocks in total.
end-to-end on 128 TPUv3 cores from scratch, limiting its widespread use. The structure of AlphaFold2 is complex, as shown in Figure 2, which leads to high training overhead. There are three main reasons for this. First, AlphaFold2 is relatively deep, and the two computing branches of the Evoformer block cannot be computed in parallel. Second, the total batch size in the official open-source implementation is limited to 128, with a batch size of 1 per device, so training cannot be extended to more devices to accelerate training through data parallelism. Third, although AlphaFold2 has only 93M parameters, they are spread over 4,630 parameter tensors, and the time overhead of accessing these small tensors in the different training stages of each iteration is not negligible.
To this end, this paper proposes two optimization techniques addressing two of the three problems above, achieving efficient AlphaFold2 training while fully aligning the hyperparameters (the network configuration and a total batch size of 128 with 1 protein sample per device). First, inspired by AlphaFold-Multimer (Evans et al. 2021), we restructure the two serial computing branches of the Evoformer block into a parallel computing structure, named Parallel Evoformer, as shown in Figure 1. Second, we propose a novel Branch Parallelism (BP) for Parallel Evoformer, which removes the barrier that prevents scaling to more devices through data parallelism when the batch size on each device is 1.
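To make the restructuring concrete, the sketch below contrasts the original serial block with the parallel variant. It is a minimal PyTorch-style sketch assuming illustrative submodule names (msa_branch, pair_branch, outer_product_mean) rather than the official AlphaFold2 or AlphaFold-Multimer code; the key point is that once the cheap MSA-to-pair communication is applied to the block's input MSA, the two expensive branches no longer depend on each other and can be computed concurrently.

```python
import torch.nn as nn

class ParallelEvoformerBlock(nn.Module):
    """Illustrative sketch only; submodule internals are placeholders."""

    def __init__(self, msa_branch: nn.Module, pair_branch: nn.Module,
                 outer_product_mean: nn.Module):
        super().__init__()
        self.msa_branch = msa_branch                  # MSA row/column attention + transition
        self.pair_branch = pair_branch                # triangle updates/attention + transition
        self.outer_product_mean = outer_product_mean  # MSA -> pair communication

    def forward(self, msa, pair):
        # Serial Evoformer (original): the pair branch consumes the MSA updated
        # earlier in the same block, so the two branches must run one after the
        # other. Parallel Evoformer: apply the MSA -> pair communication first,
        # using the block's input MSA; afterwards both branches read only the
        # block inputs and are independent of each other.
        pair = pair + self.outer_product_mean(msa)
        msa_update = self.msa_branch(msa, pair)   # branch 1
        pair_update = self.pair_branch(pair)      # branch 2, independent of branch 1
        return msa + msa_update, pair + pair_update
```

This independence is what Branch Parallelism exploits: the two branch computations can be placed on different devices within one block.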
The method proposed in this paper for efficient AlphaFold2 training is general: it is not tied to a particular deep learning framework or to a specific AlphaFold2 re-implementation. We perform extensive experimental verification on UniFold, implemented in PyTorch, and HelixFold, implemented in PaddlePaddle. The results show that Branch Parallelism achieves similar training performance improvements on both UniFold and HelixFold, of 38.67% and 36.93%, respectively. We also demonstrate that the accuracy of Parallel Evoformer is on par with AlphaFold2 on the CASP14 and CAMEO datasets.
The main contributions of this paper can be summarized
as follows:
• We improve the Evoformer in AlphaFold2 into Parallel Evoformer, which breaks the computational dependency between the MSA and pair representations; experiments show that this does not affect accuracy.
• We propose Branch Parallelism for Parallel Evoformer, which splits the different computing branches across more devices to speed up training (see the sketch after this list). This breaks the data-parallelism limitation of the official AlphaFold2 implementation.
• We reduce the end-to-end training time of AlphaFold2 to 4.18 days on UniFold and 4.88 days on HelixFold, improving training performance by 38.67% and 36.93%, respectively. This enables efficient AlphaFold2 training and reduces the R&D cost of biocomputing research.
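The following forward-only sketch illustrates the communication pattern behind Branch Parallelism on a group of two devices, assuming a Parallel Evoformer block shaped like the earlier sketch and an already initialized torch.distributed world of size 2. The function name and the use of broadcasts are illustrative assumptions, not the authors' implementation; gradient communication and scheduling details are omitted.

```python
import torch.distributed as dist

def branch_parallel_forward(block, msa, pair):
    """Forward-only Branch Parallelism sketch for a world of 2 ranks.
    Rank 0 computes the MSA branch, rank 1 computes the pair branch, and the
    updated representations are then broadcast so that both ranks hold the
    full block output. Illustrative pattern only, not the authors' code."""
    rank = dist.get_rank()

    # The cheap MSA -> pair communication is replicated on both ranks.
    pair = pair + block.outer_product_mean(msa)

    if rank == 0:
        msa = msa + block.msa_branch(msa, pair)   # expensive branch 1
    else:
        pair = pair + block.pair_branch(pair)     # expensive branch 2

    # Exchange results: rank 0 owns the fresh MSA, rank 1 owns the fresh pair.
    dist.broadcast(msa, src=0)
    dist.broadcast(pair, src=1)
    return msa, pair
```

Because each device still processes 1 protein sample, this adds a second axis of parallelism on top of data parallelism without changing the total batch size of 128.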
2 Background
2.1 Overview of AlphaFold2
Compared to traditional protein structure prediction models, which usually consist of multiple separate steps, AlphaFold2 processes the input protein sequence and predicts the 3D protein structure in an end-to-end procedure. In general, AlphaFold2 takes the amino acid sequence as input and then searches protein databases to obtain MSAs and similar templates. Using the MSA information, the model can detect correlations between the parts of similar sequences that are more likely to mutate. The templates retrieved for the input sequence, on the other hand, provide structural information that helps the model predict the final structure.
The overall framework of AlphaFold2 can be divided into five parts: Preprocess, Embedding, Encoder, Decoder, and Recycle, as shown in Figure 2. The Preprocess part parses the raw input sequence and generates MSA-related and template-related features via genetic database search and structure database search. These features are then embedded into the MSA representation, pair representation, and extra MSA representation in the Embedding part. These representations contain sufficient co-evolutionary information among similar sequences and geometric information of