
Preprint
ALPHAFOLD DISTILLATION FOR PROTEIN DESIGN
Igor Melnyk∗, Aurelie Lozano, Payel Das, Vijil Chenthamarakshan
IBM Research,
Yorktown Heights, NY 10598
ABSTRACT
Inverse protein folding, the process of designing sequences that fold into a specific
3D structure, is crucial in bio-engineering and drug discovery. Traditional methods
rely on experimentally resolved structures, but these cover only a small fraction
of protein sequences. Forward folding models like AlphaFold offer a potential
solution by accurately predicting structures from sequences. However, these models
are too slow for integration into the optimization loop of inverse folding models
during training. To address this, we propose using knowledge distillation on
folding model confidence metrics, such as pTM or pLDDT scores, to create a
faster and end-to-end differentiable distilled model. This model can then be used
as a structure consistency regularizer in training the inverse folding model. Our
technique is versatile and can be applied to other design tasks, such as sequence-
based protein infilling. Experimental results show that our method outperforms
non-regularized baselines, yielding up to 3% improvement in sequence recovery
and up to 45% improvement in protein diversity while maintaining structural
consistency in generated sequences. Code is available at https://github.com/IBM/AFDistill.
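To make the proposed regularization scheme concrete, the following sketch (PyTorch-style illustration, not the released implementation; the callables inverse_folding_model and af_distill and the weight sc_weight are hypothetical placeholders) shows how a distilled, differentiable confidence predictor could augment the standard reconstruction loss with a structure consistency term:

import torch.nn.functional as F

def training_step(inverse_folding_model, af_distill, structure, target_seq, sc_weight=0.5):
    # Per-residue amino-acid logits predicted from the input backbone structure.
    logits = inverse_folding_model(structure)          # shape (L, 20)

    # Standard sequence reconstruction (cross-entropy) loss.
    recon_loss = F.cross_entropy(logits, target_seq)   # target_seq: (L,) residue indices

    # A "soft" sequence keeps the pipeline end-to-end differentiable.
    soft_seq = F.softmax(logits, dim=-1)

    # Distilled surrogate of AlphaFold confidence (e.g., a pTM/pLDDT-like score in [0, 1]).
    sc_score = af_distill(soft_seq)

    # Penalize sequences predicted to be structurally inconsistent.
    return recon_loss + sc_weight * (1.0 - sc_score)

Under these assumptions, the structure consistency term is cheap to evaluate compared with running the full folding model at every training step, which is the motivation for distillation developed below.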
1 INTRODUCTION
Eight of the top ten best-selling drugs are engineered proteins, making inverse protein folding a
crucial challenge in bio-engineering and drug discovery (Arnum, 2022). Inverse protein folding
involves designing amino acid sequences that fold into a specific 3D structure. Computationally, this
task is known as computational protein design and has been traditionally addressed by optimizing
amino acid sequences against a physics-based scoring function (Kuhlman et al., 2003). Recently, deep
generative models have been introduced to learn the mapping from protein structure to sequences
(Jing et al., 2020; Cao et al., 2021; Wu et al., 2021; Karimi et al., 2020; Hsu et al., 2022; Fu & Sun,
2022). While these models are typically evaluated by high amino acid recovery, high TM-score, and low perplexity, such criteria overlook the primary goal of designing novel and diverse sequences that fold into the desired structure and exhibit novel functions.
In parallel, recent advancements have also greatly enhanced protein representation learning (Rives
et al., 2021; Zhang et al., 2022), structure prediction from sequences (Jumper et al., 2021; Baek
et al., 2021b), and conditional protein sequence generation (Das et al., 2021; Anishchenko et al.,
2021). While inverse protein folding has traditionally focused on sequences with resolved structures,
which represent less than 0.1% of known protein sequences, a recent study improved performance by
training on millions of AlphaFold-predicted structures (Hsu et al., 2022). Despite this success, large-
scale training is computationally expensive. A more efficient method could be to use a pre-trained
forward folding model to guide the training of the inverse folding model.
In this work, we construct a framework in which the inverse folding model is trained with a loss
objective consisting of the regular sequence reconstruction loss, augmented with an additional structure
consistency (SC) loss (see Fig. 1 for a system overview). One way to implement this would be
to use a folding model, e.g., AlphaFold, to estimate the structure of each generated sequence, compare it
with the ground truth, and compute a TM-score to regularize the training. However, a challenge in using
AlphaFold (or a similar model) directly is the computational cost of inference (see Fig. 2), and the need for ground
∗Corresponding author: igor.melnyk@ibm.com