Preprint
ALPHAFOLD DISTILLATION FOR PROTEIN DESIGN
Igor Melnyk*, Aurelie Lozano, Payel Das, Vijil Chenthamarakshan
IBM Research, Yorktown Heights, NY 10598
ABSTRACT
Inverse protein folding, the process of designing sequences that fold into a specific
3D structure, is crucial in bio-engineering and drug discovery. Traditional methods
rely on experimentally resolved structures, but these cover only a small fraction
of protein sequences. Forward folding models like AlphaFold offer a potential
solution by accurately predicting structures from sequences. However, these models
are too slow for integration into the optimization loop of inverse folding models
during training. To address this, we propose using knowledge distillation on
folding model confidence metrics, such as pTM or pLDDT scores, to create a
faster and end-to-end differentiable distilled model. This model can then be used
as a structure consistency regularizer in training the inverse folding model. Our
technique is versatile and can be applied to other design tasks, such as sequence-
based protein infilling. Experimental results show that our method outperforms
non-regularized baselines, yielding up to 3% improvement in sequence recovery
and up to 45% improvement in protein diversity while maintaining structural
consistency in generated sequences. Code is available at
https://github.com/IBM/AFDistill.
1 INTRODUCTION
Eight of the top ten best-selling drugs are engineered proteins, making inverse protein folding a
crucial challenge in bio-engineering and drug discovery (Arnum, 2022). Inverse protein folding
involves designing amino acid sequences that fold into a specific 3D structure. Computationally, this
task is known as computational protein design and has been traditionally addressed by optimizing
amino acid sequences against a physics-based scoring function (Kuhlman et al., 2003). Recently, deep
generative models have been introduced to learn the mapping from protein structure to sequences
(Jing et al., 2020; Cao et al., 2021; Wu et al., 2021; Karimi et al., 2020; Hsu et al., 2022; Fu & Sun,
2022). While these models often use high amino acid recovery, TM score, and low perplexity as
success criteria, they overlook the primary goal of designing novel and diverse sequences that fold
into the desired structure and exhibit novel functions.
In parallel, recent advancements have also greatly enhanced protein representation learning (Rives
et al., 2021; Zhang et al., 2022), structure prediction from sequences (Jumper et al., 2021; Baek
et al., 2021b), and conditional protein sequence generation (Das et al., 2021; Anishchenko et al.,
2021). While inverse protein folding has traditionally focused on sequences with resolved structures,
which represent less than 0.1% of known protein sequences, a recent study improved performance by
training on millions of AlphaFold-predicted structures (Hsu et al., 2022). Despite this success, large-
scale training is computationally expensive. A more efficient method could be to use a pre-trained
forward folding model to guide the training of the inverse folding model.
In this work we construct a framework where the inverse folding model is trained using a loss
objective that consists of the regular sequence reconstruction loss augmented with an additional
structure consistency (SC) loss (see Fig. 1 for a system overview). One way of implementing this
would be to use a folding model, e.g., AlphaFold, to estimate the structure of each generated
sequence, compare it with the ground truth, and compute the TM score to regularize the training.
However, a challenge in using AlphaFold (or similar models) directly is the computational cost of
inference (see Fig. 2) and the need for a ground-truth reference structure.
*Corresponding author: igor.melnyk@ibm.com
arXiv:2210.03488v2 [q-bio.BM] 22 Nov 2023
[Figure 1 schematic: an input 3D structure is passed to the inverse folding model, which predicts a protein sequence; three structure consistency feedback paths are compared — AlphaFold-predicted structure scored with TM/LDDT (higher fidelity, slow), AlphaFold confidence scores pTM/pLDDT (lower fidelity, slow), and the proposed AFDistill predicting pTM/pLDDT directly (lower fidelity, fast); the SC score is combined with the CE loss into the final training loss, and inference of the improved inverse folding model is shown on the right.]
Figure 1: Overview of the proposed AFDistill system. AFDistill contrasts with traditional methods
(red line) that use models like AlphaFold to predict protein structure, which is then compared to the
actual structure. This method is slow due to model inference times (refer to Fig. 2). An alternative (blue
line) uses internal metrics from the folding model without structure prediction but remains slow and less
precise. Our solution distills AlphaFold's confidence metrics into a faster, differentiable model that
offers accuracy akin to AlphaFold, allowing seamless integration into the training process (green
line). The improved inverse folding model’s inference is shown on the right.
Internal confidence metrics from the folding model can be used instead; however, that approach is
still too slow for in-the-loop optimization. To address this, we:
(i) Carry out knowledge distillation on AlphaFold and incorporate the resulting model, AFDistill
(kept fixed), into the regularized training of the inverse folding model through what we refer to as a
structure consistency (SC) loss (a sketch of the combined objective appears at the end of this section).
The major novelty here is that AFDistill enables direct prediction of the TM or LDDT score of a given
protein sequence, bypassing structure estimation and the need for access to a ground-truth structure.
The primary practical benefits of our model are that it is fast, precise, and end-to-end differentiable.
Employing the SC loss during training for downstream tasks can be seen as integrating AlphaFold's
domain expertise into the model, thereby offering an additional boost in performance.
(ii) Perform extensive evaluations, demonstrating that our proposed system surpasses existing bench-
marks in structure-guided protein sequence design by achieving lower perplexity, higher amino
acid recovery, and maintaining proximity to the original protein structure. Additionally, our system
enhances sequence diversity, a key objective in protein design. Due to a trade-off between sequence
and structure recovery, our regularized model offers better sequence diversity while maintaining struc-
tural integrity. Importantly, our regularization technique is versatile, as evidenced by its successful
application in sequence-based protein infilling, where we also observe performance improvement.
(iii) Lastly, our SC metric can be used either as a regularizer for inverse folding, infilling, and other
protein optimization tasks (e.g., Moffat et al., 2021) that would benefit from structural consistency
estimation of the designed protein sequence, or as an affordable AlphaFold alternative that provides
a score for a given protein, reflecting its structural content.
2 RELATED WORK
Forward Protein Folding. Recent computational methods for predicting protein structure from
its sequence include AlphaFold (Jumper et al., 2021), which uses multiple sequence alignments
(MSAs) and pairwise features. Similarly, RoseTTAFold (Baek et al., 2021a) integrates sequence, 2D
distance map, and 3D coordinate information. OpenFold (Ahdritz et al., 2022) replicates AlphaFold
in PyTorch. However, due to the unavailability of MSAs for certain proteins and their inference
time overhead, MSA-free methods like OmegaFold (Wu et al., 2022), HelixFold (Fang et al., 2022),
ESMFold (Lin et al., 2022), and Roney & Ovchinnikov (2022) emerged. These leverage pretrained
language models, offering accuracy on par with or exceeding AlphaFold and RoseTTAFold, depending on
the input type.
Inverse Protein Folding. Recent algorithms address the inverse protein folding problem of finding
amino acid sequences for a specified structure. (Norn et al., 2020) used a deep learning method
Figure 2: Inference times for protein sequences using our AFDistill model compared to alternatives
are displayed on the left. AFDistill maintains fast inference for longer sequences: 0.028s for 1024-
length and 0.035s for 2048-length. Timings for AlphaFold and OpenFold (Ahdritz et al., 2022) do not
include MSA search times, which can range from minutes to hours. Values for HelixFold (Fang et al.,
2022), OmegaFold (Wu et al., 2022), and ESMFold (Lin et al., 2022) are from their publications. The
center plot shows kernel density of true vs. AFDistill-predicted TM scores (Pearson’s correlation:
0.77), while the right displays a similar plot for pLDDT values (Pearson’s correlation: 0.76). Refer to
Section 3 for details.
optimizing via the trRosetta structure prediction network (Yang et al., 2020). (Anand et al., 2022) de-
signed a deep neural network that models side-chain conformers structurally. In contrast, (Jendrusch
et al., 2021) employed AlphaFold (Jumper et al., 2021) in an optimization loop for sequence genera-
tion, though its use is resource-intensive due to MSA search. MSA-free methods like OmegaFold,
HelixFold, and ESMFold are quicker but still too slow for optimization loops.
In this work, we propose knowledge distillation from the forward folding algorithm AlphaFold, and
build a student model that is small, practical, and sufficiently accurate. We show that the distilled model
can be used efficiently within the inverse folding model's optimization loop and improves the quality of
designed protein sequences.
3 ALPHAFOLD DISTILL
Knowledge distillation (Hinton et al., 2015) transfers knowledge from a large, complex model, in
our case AlphaFold, to a smaller one, here the AFDistill model (see Fig. 3). Traditionally, distillation
uses soft labels, i.e., the probabilities produced by the teacher model, together with hard labels, the
ground truth classes. In our case the probabilities are harder to collect or unavailable, so we instead
use the model's predictions (pTM/pLDDT) as well as the hard labels, the TM/LDDT scores, computed
from AlphaFold's predicted 3D structures.
Scores to Distill. TM-score (Zhang & Skolnick, 2004) measures the mean distance between struc-
turally aligned Cα atoms, scaled by a length-dependent distance parameter, while LDDT (Mariani
et al., 2013) calculates the average of four fractions using distances between atom pairs based on
four tolerance thresholds within a 15 Å inclusion radius. Both metrics range from 0 to 1, with higher
values indicating more similar structures. pTM and pLDDT are AlphaFold-predicted metrics for a
given protein sequence, representing the model’s confidence in the estimated structure. pLDDT is a
local per-residue score, while pTM is a global confidence metric for overall chain reconstruction. In
this work, we interpret these metrics as indicators of sequence quality or validity for downstream
applications (see Section 4).
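For reference, the TM-score being distilled has the standard form (this is the usual definition from Zhang & Skolnick (2004), not a quantity introduced in this work):

\[
\mathrm{TM\text{-}score} = \max\left[\frac{1}{L_{\mathrm{target}}}\sum_{i=1}^{L_{\mathrm{aligned}}}\frac{1}{1+\big(d_i/d_0(L_{\mathrm{target}})\big)^{2}}\right],
\qquad
d_0(L_{\mathrm{target}}) = 1.24\,\sqrt[3]{L_{\mathrm{target}}-15} - 1.8,
\]

where \(d_i\) is the distance between the \(i\)-th pair of aligned residues and the maximum is taken over structural superpositions.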
3.1 DATA
Using Release 3 (January 2022) of the AlphaFold Protein Structure Database (Varadi et al., 2021),
we collected a set of 907,578 predicted structures. Each predicted structure contains the 3D
coordinates of all residue atoms as well as the per-residue pLDDT confidence scores.
[Figure 3 diagram: during training (top), an input protein sequence is fed to the proposed AFDistill model, whose logits are trained with a CE loss against discretized targets — either AlphaFold's confidence scores (pTM, pLDDT) or TM/LDDT computed from the ground truth and AlphaFold-predicted 3D structures; during inference (bottom), AFDistill maps an input protein sequence directly to a structure consistency (SC) score (pTM, pLDDT).]
Figure 3: Distillation overview. Top diagram shows the training of AFDistill. The scores from
AlphaFold’s confidence estimation are denoted as pTM and pLDDT, while the scores which are
computed using ground truth and the AlphaFold’s predicted 3D structures are denoted as TM and
LDDT. These values are then discretized and treated as class labels during cross-entropy (CE)
training. Note that training based on TM/LDDT is limited since the number of known ground
truth structures is small. The bottom diagram shows the inference stage of AFDistill, where for each
protein sequence it estimates pTM and pLDDT scores.
Table 1: Statistics from the January 2022 (left side) and July 2022 (right side) releases of the AlphaFold
database. For the earlier release, we created multiple datasets for pTM and pLDDT estimation, while
for the later, larger release we curated datasets only for pLDDT estimation.

Release 3 (January 2022)              Release 4 (July 2022)
Name                    Size          Name                    Size
Original                907,578       Original                214,687,406
TM 42K                  42,605        pLDDT balanced 1M       1,000,000
TM augmented 86K        86,811        pLDDT balanced 10M      10,000,000
pTM synthetic 1M        1,244,788     pLDDT balanced 60M      66,736,124
LDDT 42K                42,605
pLDDT 1M                905,850
To avoid data leakage into the downstream applications, we first filtered out structures that have
40% or more sequence similarity to the validation and test splits of the CATH 4.2 dataset (discussed in
Section 4). Then, using the remaining structures, we created our pLDDT 1M dataset (see Table 1),
where each protein sequence is paired with its sequence of per-residue pLDDT scores. Additionally, to
reduce the computational complexity of AFDistill training, we limited the maximum protein length
to 500 by randomly cropping a subsequence.
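As an illustration of this length-capping step, the sketch below shows one way to randomly crop a sequence and its per-residue pLDDT labels to at most 500 residues (illustrative code, not the authors' preprocessing script; the function name is hypothetical):

```python
import random

MAX_LEN = 500  # maximum protein length kept for AFDistill training

def random_crop(seq, plddt, max_len=MAX_LEN):
    """Randomly crop a protein sequence and its per-residue pLDDT labels
    to at most max_len residues, keeping the two aligned."""
    if len(seq) <= max_len:
        return seq, plddt
    start = random.randint(0, len(seq) - max_len)
    return seq[start:start + max_len], plddt[start:start + max_len]
```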
Figure 4: Distribution of the (p)TM/(p)LDDT scores in various datasets used in AFDistill training.
We created datasets based on true TM and LDDT values using predicted AlphaFold structures.
Specifically, using the PDB-to-UniProt mapping, we selected a subset of samples with matching
ground truth PDB sequences and 3D structures, resulting in 42,605 structures. We denote these
datasets as TM 42K and LDDT 42K (see Table 1). Fig. 4 shows their score density distribution, which
Table 2: Validation loss of AFDistill on datasets from Table 1 (for more details, see Tables 5 and 6).

Training data          Val CE loss       Training data          Val CE loss
TM 42K                 1.10              LDDT 42K               3.39
TM augmented 86K       2.12              pLDDT 1M               3.24
pTM synthetic 1M       2.55              pLDDT balanced 1M      2.63
                                         pLDDT balanced 10M     2.43
                                         pLDDT balanced 60M     2.40
is skewed towards higher values. To address this data imbalance, we curated two additional TM-based
datasets. TM augmented 86K was obtained by augmenting TM 42K with a set of perturbed original
protein sequences (with parts of each sequence permuted or replaced), estimating their structures with
AlphaFold, computing the corresponding TM-scores, and keeping samples in the low and medium TM
range. pTM synthetic 1M was obtained by generating random synthetic protein sequences and feeding
them to AFDistill (pre-trained on the TM 42K data) to generate additional data samples and collect
lower-range pTM values. The distribution of the scores for these additional datasets is also shown in
Fig. 4; both TM augmented 86K and pTM synthetic 1M are less skewed, covering lower (p)TM values
better.
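The perturbations used to build TM augmented 86K are described only at a high level (permuting or replacing parts of a sequence); the sketch below is one hypothetical realization of such corruption, after which AlphaFold would be run on the perturbed sequence and the TM-score to the original structure recorded:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def perturb_sequence(seq, frac=0.3):
    """Shuffle or randomly replace a contiguous chunk covering ~frac of the sequence."""
    n = max(1, int(len(seq) * frac))
    start = random.randint(0, len(seq) - n)
    chunk = list(seq[start:start + n])
    if random.random() < 0.5:
        random.shuffle(chunk)                                # permute a region
    else:
        chunk = [random.choice(AMINO_ACIDS) for _ in chunk]  # replace a region
    return seq[:start] + "".join(chunk) + seq[start + n:]
```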
Using Release 4 (July 2022), with over 214M predicted structures, we observed a similarly high
skewness in pLDDT values. To mitigate this, we filtered out samples with upper-range mean pLDDT
values, resulting in a 60M-sequence dataset, with additional 10M and 1M versions also created. Their
density is shown in Fig. 4.
In summary, AFDistill is trained to predict both the actual structural measures (TM, LDDT, computed
using the true and AlphaFold-predicted structures) and AlphaFold's estimated scores (pTM and
pLDDT). In either case, the estimated structure consistency (SC) score is well correlated with its
target (refer to Fig. 2) and can be used as an indicator of protein sequence quality or validity.
3.2 MODEL
The AFDistill model is based on ProtBert (Elnaggar et al., 2020), a Transformer BERT model (420M
parameters) pretrained on a large corpus of protein sequences using masked language modeling.
For our task we modify the ProtBert head by setting the vocabulary size to 50 bins, corresponding to
a discretization of pTM/pLDDT values in the range (0, 1). For pTM (a scalar) the prediction is taken
from the first (CLS) token of the output sequence, while for pLDDT (a sequence) predictions are made
at each residue position.
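The following minimal sketch is our own illustration, not the released AFDistill code; the Hugging Face checkpoint name and all class and variable names are assumptions. It shows how a ProtBert encoder with a shared 50-bin classification head can produce a global pTM prediction from the CLS position and per-residue pLDDT predictions, with a differentiable expected score recovered from the bin probabilities.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

NUM_BINS = 50  # discretization of scores in (0, 1) into 50 bins

class AFDistillSketch(nn.Module):
    def __init__(self, backbone="Rostlab/prot_bert"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(backbone)
        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_BINS)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        logits = self.head(hidden)        # (batch, length, 50)
        ptm_logits = logits[:, 0, :]      # CLS position -> global pTM bin
        plddt_logits = logits[:, 1:, :]   # remaining positions -> per-residue
                                          # pLDDT bins (special tokens ignored
                                          # here for simplicity)
        return ptm_logits, plddt_logits

# ProtBert expects residues separated by spaces, e.g. "M K T A Y".
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
batch = tokenizer(["M K T A Y"], return_tensors="pt")
model = AFDistillSketch()
ptm_logits, _ = model(batch["input_ids"], batch["attention_mask"])

# Expected score = sum over bins of (probability * bin center), giving a
# differentiable scalar in (0, 1).
bin_centers = (torch.arange(NUM_BINS) + 0.5) / NUM_BINS
ptm = (ptm_logits.softmax(-1) * bin_centers).sum(-1)
```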
Figure 5: Examples of 3D protein structures from the dataset, corresponding to high, medium, and
low actual TM scores (top row in legend), as well as AFDistill predictions, trained on TM 42K
(middle row) and TM augmented 86K (bottom row).
3.3 DISTILLATION EXPERIMENTAL RESULTS
In this section, we discuss the model evaluation results after training on the presented datasets. To
address data imbalance, we used weighted sampling during minibatch generation and Focal loss (Lin
et al., 2017) instead of traditional cross-entropy loss. Table 2 shows results for (p)TM-based datasets.
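For reference, the focal loss mentioned above down-weights easy, well-classified bins so that rare score ranges contribute more to the gradient. The sketch below is the standard multi-class form from Lin et al. (2017); the γ value is illustrative and this is not necessarily the exact variant used in AFDistill training.

```python
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t).
    logits: (N, num_bins), targets: (N,) integer bin labels."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return ((1.0 - pt) ** gamma * (-log_pt)).mean()
```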