Preprint
ALPHAFOLD DISTILLATION FOR PROTEIN DESIGN
Igor Melnyk*, Aurelie Lozano, Payel Das, Vijil Chenthamarakshan
IBM Research, Yorktown Heights, NY 10598
ABSTRACT
Inverse protein folding, the process of designing sequences that fold into a specific
3D structure, is crucial in bio-engineering and drug discovery. Traditional methods
rely on experimentally resolved structures, but these cover only a small fraction
of protein sequences. Forward folding models like AlphaFold offer a potential
solution by accurately predicting structures from sequences. However, these models
are too slow for integration into the optimization loop of inverse folding models
during training. To address this, we propose using knowledge distillation on
folding model confidence metrics, such as pTM or pLDDT scores, to create a
faster and end-to-end differentiable distilled model. This model can then be used
as a structure consistency regularizer in training the inverse folding model. Our
technique is versatile and can be applied to other design tasks, such as sequence-
based protein infilling. Experimental results show that our method outperforms
non-regularized baselines, yielding up to 3% improvement in sequence recovery
and up to 45% improvement in protein diversity while maintaining structural
consistency in generated sequences. Code is available at
https://github.com/IBM/AFDistill.
1 INTRODUCTION
Eight of the top ten best-selling drugs are engineered proteins, making inverse protein folding a
crucial challenge in bio-engineering and drug discovery (Arnum, 2022). Inverse protein folding
involves designing amino acid sequences that fold into a specific 3D structure. Computationally, this
task is known as computational protein design and has been traditionally addressed by optimizing
amino acid sequences against a physics-based scoring function (Kuhlman et al., 2003). Recently, deep
generative models have been introduced to learn the mapping from protein structure to sequences
(Jing et al., 2020; Cao et al., 2021; Wu et al., 2021; Karimi et al., 2020; Hsu et al., 2022; Fu & Sun,
2022). While these models often use high amino acid recovery, TM score, and low perplexity as
success criteria, they overlook the primary goal of designing novel and diverse sequences that fold
into the desired structure and exhibit novel functions.
In parallel, recent advancements have also greatly enhanced protein representation learning (Rives
et al., 2021; Zhang et al., 2022), structure prediction from sequences (Jumper et al., 2021; Baek
et al., 2021b), and conditional protein sequence generation (Das et al., 2021; Anishchenko et al.,
2021). While inverse protein folding has traditionally focused on sequences with resolved structures,
which represent less than 0.1% of known protein sequences, a recent study improved performance by
training on millions of AlphaFold-predicted structures (Hsu et al., 2022). Despite this success, large-
scale training is computationally expensive. A more efficient method could be to use a pre-trained
forward folding model to guide the training of the inverse folding model.
In this work we construct a framework where the inverse folding model is trained using a loss
objective that consists of the regular sequence reconstruction loss augmented with an additional
structure consistency (SC) loss (see Fig. 1 for a system overview). One way of implementing this
would be to use a folding model, e.g., AlphaFold, to estimate the structure of each generated
sequence, compare it with the ground truth, and compute the TM score to regularize the training.
However, a challenge in using AlphaFold (or similar models) directly is the computational cost of
inference (see Fig. 2) and the need for a ground-truth reference structure.
*Corresponding author: igor.melnyk@ibm.com
arXiv:2210.03488v2 [q-bio.BM] 22 Nov 2023
[Figure 1 schematic: an input 3D structure is passed to the inverse folding model, which predicts a protein sequence; three structure consistency feedback paths are compared — AlphaFold-predicted structure scored with TM/LDDT (higher fidelity, slow), AlphaFold confidence scores pTM/pLDDT (lower fidelity, slow), and the proposed AFDistill predicting pTM/pLDDT directly (lower fidelity, fast); the SC score is combined with the CE loss into the final training loss, and inference of the improved inverse folding model is shown on the right.]
Figure 1: Overview of the proposed AFDistill system. AFDistill contrasts with traditional methods
(red line) that use models like AlphaFold to predict protein structure, which is then compared to the
actual structure. This method is slow due to model inference times (refer to Fig. 2). An alternative (blue
line) uses internal metrics from the folding model without structure prediction but remains slow and less
precise. Our solution distills AlphaFold's confidence metrics into a faster, differentiable model that
offers accuracy akin to AlphaFold, allowing seamless integration into the training process (green
line). The improved inverse folding model’s inference is shown on the right.
Internal confidence metrics from the folding model can be used instead; however, that approach is
still too slow for in-the-loop optimization. To address this, we:
(i) Carry out knowledge distillation on AlphaFold and incorporate the resulting model, AFDistill
(kept fixed), into the regularized training of the inverse folding model through what we refer to as a
structure consistency (SC) loss (a sketch of the combined objective appears at the end of this section).
The major novelty here is that AFDistill enables direct prediction of the TM or LDDT score of a given
protein sequence, bypassing structure estimation and the need for access to a ground-truth structure.
The primary practical benefits of our model are that it is fast, precise, and end-to-end differentiable.
Employing the SC loss during training for downstream tasks can be seen as integrating AlphaFold's
domain expertise into the model, thereby offering an additional boost in performance.
(ii) Perform extensive evaluations, demonstrating that our proposed system surpasses existing bench-
marks in structure-guided protein sequence design by achieving lower perplexity, higher amino
acid recovery, and maintaining proximity to the original protein structure. Additionally, our system
enhances sequence diversity, a key objective in protein design. Due to a trade-off between sequence
and structure recovery, our regularized model offers better sequence diversity while maintaining struc-
tural integrity. Importantly, our regularization technique is versatile, as evidenced by its successful
application in sequence-based protein infilling, where we also observe performance improvement.
(iii) Lastly, our SC metric can be used either as a regularizer for inverse folding, infilling, and other
protein optimization tasks (e.g., Moffat et al., 2021) that would benefit from structural consistency
estimation of the designed protein sequence, or as an affordable AlphaFold alternative that provides
a score for a given protein, reflecting its structural content.
2 RELATED WORK
Forward Protein Folding. Recent computational methods for predicting protein structure from
its sequence include AlphaFold (Jumper et al., 2021), which uses multiple sequence alignments
(MSAs) and pairwise features. Similarly, RoseTTAFold (Baek et al., 2021a) integrates sequence, 2D
distance map, and 3D coordinate information. OpenFold (Ahdritz et al., 2022) replicates AlphaFold
in PyTorch. However, due to the unavailability of MSAs for certain proteins and their inference
time overhead, MSA-free methods like OmegaFold (Wu et al., 2022), HelixFold (Fang et al., 2022),
ESMFold (Lin et al., 2022), and Roney & Ovchinnikov (2022) emerged. These leverage pretrained
language models, offering accuracy on par with or exceeding AlphaFold and RoseTTAFold, depending on
the input type.
Inverse Protein Folding. Recent algorithms address the inverse protein folding problem of finding
amino acid sequences for a specified structure. (Norn et al., 2020) used a deep learning method
Figure 2: Inference times for protein sequences using our AFDistill model compared to alternatives
are displayed on the left. AFDistill maintains fast inference for longer sequences: 0.028s for 1024-
length and 0.035s for 2048-length. Timings for AlphaFold and OpenFold (Ahdritz et al., 2022) do not
include MSA search times, which can range from minutes to hours. Values for HelixFold (Fang et al.,
2022), OmegaFold (Wu et al., 2022), and ESMFold (Lin et al., 2022) are from their publications. The
center plot shows kernel density of true vs. AFDistill-predicted TM scores (Pearson’s correlation:
0.77), while the right displays a similar plot for pLDDT values (Pearson’s correlation: 0.76). Refer to
Section 3 for details.
optimizing via the trRosetta structure prediction network (Yang et al., 2020). (Anand et al., 2022) de-
signed a deep neural network that models side-chain conformers structurally. In contrast, (Jendrusch
et al., 2021) employed AlphaFold (Jumper et al., 2021) in an optimization loop for sequence genera-
tion, though its use is resource-intensive due to MSA search. MSA-free methods like OmegaFold,
HelixFold, and ESMFold are quicker but still too slow for optimization loops.
In this work, we propose knowledge distillation from the forward folding algorithm AlphaFold, and
build a student model that is small, practical, and sufficiently accurate. We show that the distilled model
can be used efficiently within the inverse folding model's optimization loop and improves the quality of
designed protein sequences.
3 ALPHAFOLD DISTILL
Knowledge distillation (Hinton et al., 2015) transfers knowledge from a large, complex model, in
our case AlphaFold, to a smaller one, here the AFDistill model (see Fig. 3). Traditionally, distillation
uses soft labels, i.e., the probabilities produced by the teacher model, together with hard labels, the
ground truth classes. In our case the probabilities are harder to collect or unavailable, so we instead
use the model's predictions (pTM/pLDDT) as well as the hard labels, the TM/LDDT scores, computed
from AlphaFold's predicted 3D structures.
Scores to Distill. TM-score (Zhang & Skolnick, 2004) measures the mean distance between struc-
turally aligned Cα atoms, scaled by a length-dependent distance parameter, while LDDT (Mariani
et al., 2013) calculates the average of four fractions using distances between atom pairs based on
four tolerance thresholds within a 15 Å inclusion radius. Both metrics range from 0 to 1, with higher
values indicating more similar structures. pTM and pLDDT are AlphaFold-predicted metrics for a
given protein sequence, representing the model’s confidence in the estimated structure. pLDDT is a
local per-residue score, while pTM is a global confidence metric for overall chain reconstruction. In
this work, we interpret these metrics as indicators of sequence quality or validity for downstream
applications (see Section 4).
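For reference, the TM-score being distilled has the standard form (this is the usual definition from Zhang & Skolnick (2004), not a quantity introduced in this work):

\[
\mathrm{TM\text{-}score} = \max\left[\frac{1}{L_{\mathrm{target}}}\sum_{i=1}^{L_{\mathrm{aligned}}}\frac{1}{1+\big(d_i/d_0(L_{\mathrm{target}})\big)^{2}}\right],
\qquad
d_0(L_{\mathrm{target}}) = 1.24\,\sqrt[3]{L_{\mathrm{target}}-15} - 1.8,
\]

where \(d_i\) is the distance between the \(i\)-th pair of aligned residues and the maximum is taken over structural superpositions.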
3.1 DATA
Using Release 3 (January 2022) of the AlphaFold Protein Structure Database (Varadi et al., 2021),
we collected a set of 907,578 predicted structures. Each predicted structure contains the 3D
coordinates of all residue atoms as well as the per-residue pLDDT confidence scores.
[Figure 3 diagram: during training (top), an input protein sequence is fed to the proposed AFDistill model, whose logits are trained with a CE loss against discretized targets — either AlphaFold's confidence scores (pTM, pLDDT) or TM/LDDT computed from the ground truth and AlphaFold-predicted 3D structures; during inference (bottom), AFDistill maps an input protein sequence directly to a structure consistency (SC) score (pTM, pLDDT).]
Figure 3: Distillation overview. Top diagram shows the training of AFDistill. The scores from
AlphaFold’s confidence estimation are denoted as pTM and pLDDT, while the scores which are
computed using ground truth and the AlphaFold’s predicted 3D structures are denoted as TM and
LDDT. These values are then discretized and treated as class labels during cross-entropy (CE)
training. Note that training based on TM/LDDT is limited since the number of known ground
truth structures is small. The bottom diagram shows the inference stage of AFDistill, where for each
protein sequence it estimates pTM and pLDDT scores.
Table 1: Statistics from the January 2022 (left side) and July 2022 (right side) releases of the AlphaFold
database. For the earlier release, we created multiple datasets for pTM and pLDDT estimation, while
for the later, larger release we curated datasets only for pLDDT estimation.

Release 3 (January 2022)              Release 4 (July 2022)
Name                    Size          Name                    Size
Original                907,578       Original                214,687,406
TM 42K                  42,605        pLDDT balanced 1M       1,000,000
TM augmented 86K        86,811        pLDDT balanced 10M      10,000,000
pTM synthetic 1M        1,244,788     pLDDT balanced 60M      66,736,124
LDDT 42K                42,605
pLDDT 1M                905,850
To avoid data leakage into the downstream applications, we first filtered out structures that have
40% or more sequence similarity to the validation and test splits of the CATH 4.2 dataset (discussed in
Section 4). Then, using the remaining structures, we created our pLDDT 1M dataset (see Table 1),
where each protein sequence is paired with its sequence of per-residue pLDDT scores. Additionally, to
reduce the computational complexity of AFDistill training, we limited the maximum protein length
to 500 by randomly cropping a subsequence.
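As an illustration of this length-capping step, the sketch below shows one way to randomly crop a sequence and its per-residue pLDDT labels to at most 500 residues (illustrative code, not the authors' preprocessing script; the function name is hypothetical):

```python
import random

MAX_LEN = 500  # maximum protein length kept for AFDistill training

def random_crop(seq, plddt, max_len=MAX_LEN):
    """Randomly crop a protein sequence and its per-residue pLDDT labels
    to at most max_len residues, keeping the two aligned."""
    if len(seq) <= max_len:
        return seq, plddt
    start = random.randint(0, len(seq) - max_len)
    return seq[start:start + max_len], plddt[start:start + max_len]
```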
Figure 4: Distribution of the (p)TM/(p)LDDT scores in various datasets used in AFDistill training.
We created datasets based on true TM and LDDT values using predicted AlphaFold structures.
Specifically, using the PDB-to-UniProt mapping, we selected a subset of samples with matching
ground truth PDB sequences and 3D structures, resulting in 42,605 structures. We denote these
datasets as TM 42K and LDDT 42K (see Table 1). Fig. 4 shows their score density distribution, which
Table 2: Validation loss of AFDistill on datasets from Table 1 (for more details, see Tables 5 and 6).

Training data          Val CE loss       Training data          Val CE loss
TM 42K                 1.10              LDDT 42K               3.39
TM augmented 86K       2.12              pLDDT 1M               3.24
pTM synthetic 1M       2.55              pLDDT balanced 1M      2.63
                                         pLDDT balanced 10M     2.43
                                         pLDDT balanced 60M     2.40
is skewed towards higher values. To address this data imbalance, we curated two additional TM-based
datasets. TM augmented 86K was obtained by augmenting TM 42K with a set of perturbed original
protein sequences (with parts of each sequence permuted or replaced), estimating their structures with
AlphaFold, computing the corresponding TM-scores, and keeping samples in the low and medium TM
range. pTM synthetic 1M was obtained by generating random synthetic protein sequences and feeding
them to AFDistill (pre-trained on the TM 42K data) to generate additional data samples and collect
lower-range pTM values. The distribution of the scores for these additional datasets is also shown in
Fig. 4; both TM augmented 86K and pTM synthetic 1M are less skewed, covering lower (p)TM values
better.
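The perturbations used to build TM augmented 86K are described only at a high level (permuting or replacing parts of a sequence); the sketch below is one hypothetical realization of such corruption, after which AlphaFold would be run on the perturbed sequence and the TM-score to the original structure recorded:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def perturb_sequence(seq, frac=0.3):
    """Shuffle or randomly replace a contiguous chunk covering ~frac of the sequence."""
    n = max(1, int(len(seq) * frac))
    start = random.randint(0, len(seq) - n)
    chunk = list(seq[start:start + n])
    if random.random() < 0.5:
        random.shuffle(chunk)                                # permute a region
    else:
        chunk = [random.choice(AMINO_ACIDS) for _ in chunk]  # replace a region
    return seq[:start] + "".join(chunk) + seq[start + n:]
```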
Using Release 4 (July 2022), with over 214M predicted structures, we observed a similarly high
skewness in pLDDT values. To mitigate this, we filtered out samples with upper-range mean pLDDT
values, resulting in a 60M-sequence dataset, with additional 10M and 1M versions also created. Their
density is shown in Fig. 4.
In summary, AFDistill is trained to predict both the actual structural measures (TM, LDDT, computed
using the true and AlphaFold-predicted structures) and AlphaFold's estimated scores (pTM and
pLDDT). In either case, the estimated structure consistency (SC) score is well correlated with its
target (refer to Fig. 2) and can be used as an indicator of protein sequence quality or validity.
3.2 MODEL
The AFDistill model is based on ProtBert (Elnaggar et al., 2020), a Transformer BERT model (420M
parameters) pretrained on a large corpus of protein sequences using masked language modeling.
For our task we modify the ProtBert head by setting the vocabulary size to 50 bins, corresponding to
a discretization of pTM/pLDDT values in the range (0, 1). For pTM (a scalar) the prediction is taken
from the first (CLS) token of the output sequence, while for pLDDT (a sequence) predictions are made
at each residue position.
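The following minimal sketch is our own illustration, not the released AFDistill code; the Hugging Face checkpoint name and all class and variable names are assumptions. It shows how a ProtBert encoder with a shared 50-bin classification head can produce a global pTM prediction from the CLS position and per-residue pLDDT predictions, with a differentiable expected score recovered from the bin probabilities.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

NUM_BINS = 50  # discretization of scores in (0, 1) into 50 bins

class AFDistillSketch(nn.Module):
    def __init__(self, backbone="Rostlab/prot_bert"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(backbone)
        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_BINS)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        logits = self.head(hidden)        # (batch, length, 50)
        ptm_logits = logits[:, 0, :]      # CLS position -> global pTM bin
        plddt_logits = logits[:, 1:, :]   # remaining positions -> per-residue
                                          # pLDDT bins (special tokens ignored
                                          # here for simplicity)
        return ptm_logits, plddt_logits

# ProtBert expects residues separated by spaces, e.g. "M K T A Y".
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
batch = tokenizer(["M K T A Y"], return_tensors="pt")
model = AFDistillSketch()
ptm_logits, _ = model(batch["input_ids"], batch["attention_mask"])

# Expected score = sum over bins of (probability * bin center), giving a
# differentiable scalar in (0, 1).
bin_centers = (torch.arange(NUM_BINS) + 0.5) / NUM_BINS
ptm = (ptm_logits.softmax(-1) * bin_centers).sum(-1)
```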
Figure 5: Examples of 3D protein structures from the dataset, corresponding to high, medium, and
low actual TM scores (top row in legend), as well as AFDistill predictions, trained on TM 42K
(middle row) and TM augmented 86K (bottom row).
3.3 DISTILLATION EXPERIMENTAL RESULTS
In this section, we discuss the model evaluation results after training on the presented datasets. To
address data imbalance, we used weighted sampling during minibatch generation and Focal loss (Lin
et al., 2017) instead of traditional cross-entropy loss. Table 2 shows results for (p)TM-based datasets.
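For reference, the focal loss mentioned above down-weights easy, well-classified bins so that rare score ranges contribute more to the gradient. The sketch below is the standard multi-class form from Lin et al. (2017); the γ value is illustrative and this is not necessarily the exact variant used in AFDistill training.

```python
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t).
    logits: (N, num_bins), targets: (N,) integer bin labels."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return ((1.0 - pt) ** gamma * (-log_pt)).mean()
```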