Antibody Representation Learning for Drug Discovery


Lin Li 1, Esther Gupta 1, John Spaeth 1, Leslie Shing 1, Tristan Bepler 2,3, Rajmonda Sulo Caceres 1

1 MIT Lincoln Laboratory, Lexington, MA 02421
2 Research Laboratory of Electronics, MIT, Cambridge, 02139
3 Simons Machine Learning Center, NYSBC, New York, 10027

{lin.li, esther.wolf, john.spaeth, leslie.shing}@ll.mit.edu, tbepler@mit.edu, rajmonda.caceres@ll.mit.edu

Abstract

Therapeutic antibody development has become an increasingly popular approach for drug development. To date, antibody therapeutics are largely developed using large-scale experimental screens of antibody libraries containing hundreds of millions of antibody sequences. The high cost and difficulty of developing therapeutic antibodies create a pressing need for computational methods to predict antibody properties and create bespoke designs. However, the relationship between antibody sequence and activity is a complex physical process, and traditional iterative design approaches rely on large-scale assays and random mutagenesis. Deep learning methods have emerged as a promising way to learn antibody property predictors, but predicting antibody properties and target-specific activities depends critically on the choice of antibody representations, and data linking sequences to properties is often limited. Recently, methods for learning biological sequence representations from large, general-purpose protein datasets have demonstrated an impressive ability to capture structural and functional properties of proteins. However, existing works have not yet investigated the value, limitations and opportunities of these methods in application to antibody-based drug discovery. In this paper, we present results on a novel SARS-CoV-2 antibody binding dataset and an additional benchmark dataset. We compare three classes of models: conventional statistical sequence models, supervised learning on each dataset independently, and fine-tuning an antibody-specific pre-trained embedding model. The pre-trained models were trained on tens of millions of natural healthy human antibody sequences and protein sequences, respectively. Experimental results suggest that self-supervised pretraining of feature representation consistently offers significant improvement over previous approaches. We also investigate the impact of data size on the model performance, and discuss challenges and opportunities that the machine learning community can address to advance in silico engineering and design of therapeutic antibodies.

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering. Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

Introduction

Drug development is a complex, time-consuming and costly process. The current drug design paradigm is primarily an assay-driven process: millions or billions of candidate molecules are screened for activity to identify a small number of candidates for further development and optimization. Often, this involves intensive laboratory experiments and expert-guided analysis to design and test molecule variants. This approach poses high costs in terms of the equipment, experimental environment, and the expert knowledge required. For antibody drug development, candidate antibodies are typically identified by affinity maturation of enormous naive antibody libraries in vitro with phage or yeast display or in vivo by animal immunization (Lu et al. 2020). These methods require target molecules (epitopes) to be stably producible and rely on sheer scale to identify antibody sequences that interact with the target. This makes these processes expensive and unpredictable, because there is no guarantee that a suitable antibody exists within the starting library, and candidates can fail later in the drug development process due to safety or manufacturability concerns. As a result, the ability to identify many candidate molecules and optimize them quickly is critical for predictable, fast, and inexpensive drug development.

With several technological advances in experimental antibody development, including the ability to obtain pure antibodies in large numbers for research and clinical use, as well as the successful translation of antibodies to the clinic (Lu et al. 2020), antibody design, a predominant therapeutic modality for various diseases (Lu et al. 2020), has become an increasingly popular approach for drug development. While these advancements have shortened timelines throughout the development pipeline, in silico engineering of antibody candidates that enables cheaper and faster drug development against rapidly evolving antigens still remains a challenge. Computational methods for antibody development, to date, have largely focused on physical simulation-based or simple statistical model-based approaches to optimize the complementarity-determining regions (CDRs) of antibodies (Tiller and Tessier 2015). Physics-based methods approach antibody optimization from first principles by simulating interactions between antibodies and targets at the atomic level. Although these methods are conceptually appealing and offer the promise of generalization to any antibody or target sequence, in practice, they require solved structures for antibody-target complexes and are impractically slow to compute (Yamashita 2018), limiting the space of antibody variants that can be computationally explored.

arXiv:2210.02881v1 [q-bio.QM] 5 Oct 2022

Machine learning (ML) has demonstrated potential in accelerating drug discovery with good performance and lower computational cost (Yang et al. 2019; Bepler and Berger 2019; Liu et al. 2020; Stokes et al. 2020). A key challenge in ML for drug development is learning effective feature representations that capture structural and functional properties of candidate drug designs. Traditionally, ML models used in bioinformatics research rely on expert-designed descriptors to predict drug-like properties. These descriptors are often limited to known chemical or biochemical classes and are inherently difficult to apply to unexplored systems. Recent advances in neural networks provide an alternative way to learn representations automatically from data. Yang et al. (2019) leveraged graph neural networks (GNNs) to learn representations of molecular data, which were found to offer significant improvements over models currently used in industrial workflows. Using a similar approach, Stokes et al. (2020) identified Halicin, a new antibiotic molecule that is effective against a broad class of disease-causing bacteria. Several works (Bepler and Berger 2019; Rao et al. 2019; Alley et al. 2019; Rives et al. 2021; Bepler and Berger 2021) have explored learning protein sequence representations in a self-supervised and semi-supervised fashion to predict structural properties and other downstream tasks. They have highlighted the value of various neural network architectures and pre-training in substantially improving our ability to predict structural and functional properties from sequences alone. Liu et al. (2020) have shown promising results in applying CNN architectures to both antibody enrichment prediction and antibody design. While the above works suggest the promise of deep representation learning for biological systems, many questions remain open regarding antibody-specific feature representation for drug discovery.

In this work, we develop the first antibody-specific pre-trained protein embedding models and incorporate the learned features into a broader ML system for antibody binding affinity prediction. Using this system, we compare the descriptive and predictive power of conventional antibody features and features learned by deep language models. Next, we explore the role of pre-training in supporting knowledge transfer and improving antibody binding prediction performance, and we evaluate the importance of pre-training datasets by comparing models trained only on antibody sequences to models trained on general protein sequences. Finally, we investigate training data sufficiency requirements that support performance robustness and knowledge transfer. We analyze the sensitivity of feature learning as a function of training data size and characterize data size ranges that are most suitable for deep learning of antibody features. We make the following contributions:

• We present two large antibody sequence datasets with binding affinity measurements (Walsh et al. 2021; Liu et al. 2020) for analyzing antibody-specific representation learning. The LL-SARS-CoV-2 dataset, in particular, is the first of its kind in terms of sequence diversity, consisting of SARS-CoV-2 antibody variants generated from mutations across the full antibody sequence.

• We develop the first antibody sequence-specific pre-trained language model. We find that pre-training on general protein sequence datasets supports better feature refinement and learning for antibody binding prediction. This result highlights the important role of training over a diverse set of protein sequences.

• We demonstrate that language models consistently learn much more effective features for antibody binding prediction than conventional antibody sequence features or features learned by a CNN model. This result is consistent with recent results that have investigated the power of language models in supporting downstream tasks in the natural language domain (Saunshi, Malladi, and Arora 2021).

• Our training data size sensitivity analysis reveals a range of data sizes (about 3200-6400 samples) where we observe the steepest improvement in binding prediction performance. Fewer data samples are insufficient to capture the mapping from antibody sequence to binding, while more data samples offer diminishing returns.
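
The data-size sensitivity analysis in the last contribution can be pictured as a sweep over doubling training-set sizes. The sketch below is a minimal illustration under stated assumptions, not the authors' code: `doubling_sizes`, `learning_curve`, `steepest_interval`, and `toy_score` are hypothetical names, and `toy_score` is a synthetic logistic curve standing in for "train on n samples, then score on held-out data". Its midpoint was deliberately placed so that the toy curve rises fastest in the 3200-6400 range quoted above; it carries no measured results.

```python
import math
from typing import Callable, List, Tuple


def doubling_sizes(start: int, maximum: int) -> List[int]:
    """Training-set sizes obtained by repeated doubling, up to `maximum`."""
    sizes = []
    n = start
    while n <= maximum:
        sizes.append(n)
        n *= 2
    return sizes


def learning_curve(
    sizes: List[int], train_and_score: Callable[[int], float]
) -> List[Tuple[int, float]]:
    """Evaluate the scoring callback at each size; returns (size, score) pairs."""
    return [(n, train_and_score(n)) for n in sizes]


def steepest_interval(curve: List[Tuple[int, float]]) -> Tuple[int, int]:
    """Return the pair of consecutive sizes with the largest score gain."""
    lo, hi = max(zip(curve, curve[1:]), key=lambda pair: pair[1][1] - pair[0][1])
    return lo[0], hi[0]


def toy_score(n: int) -> float:
    # SYNTHETIC placeholder, not data from the paper: a logistic curve in
    # log2(n) whose midpoint (2**12 = 4096 samples) is an arbitrary choice.
    return 1.0 / (1.0 + math.exp(-(math.log2(n) - 12.0)))


sizes = doubling_sizes(100, 25600)
curve = learning_curve(sizes, toy_score)
print(steepest_interval(curve))
```

In a real experiment, `toy_score` would be replaced by a function that subsamples the training set to `n` examples, fits the binding predictor, and returns its held-out metric, ideally averaged over several random subsamples.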

Background

Antibodies are Y-shaped proteins produced by the immune system to tag or neutralize foreign substances, called antigens. As shown in Figure 1, antibodies consist of two identical light chains and two identical heavy chains. Each chain has a variable region and a constant region. The tip of the variable region forms the antibody’s binding surface, also known as a paratope. This region recognizes and binds to a specific antigen’s binding surface called an epitope. The variable regions of each chain contain three hypervariable regions known as the complementarity-determining regions (CDRs), denoted as CDR-L1, CDR-L2, CDR-L3 and CDR-H1, CDR-H2, CDR-H3, for the light and heavy chains respectively (highlighted in yellow in Figure 1).

The amino acid sequence of the CDRs determines the antigens to which an antibody will bind (O’Connor, Adams, and Fairman 2010). The variable domain of the light chain can consist of 94-139 amino acids, while that of the heavy chain is slightly longer, consisting of 92-160 amino acids (Abhinandan and Martin 2008). Each amino acid is encoded as a character from a 25-character alphabet, with 20 characters for the standard amino acids and 5 characters for non-standard, ambiguous or unknown amino acids.
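
As a rough illustration of the encoding described above, the sketch below maps a sequence to integer indices over a 25-character alphabet. The 20 standard one-letter codes are fixed by convention; the choice of `B`, `Z`, `O`, `U`, and `X` as the five additional symbols is an assumption (the text does not list them), as are the names `ALPHABET`, `encode`, and `one_hot`.

```python
from typing import List

STANDARD = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, one-letter codes
EXTRA = "BZOUX"                    # ASSUMED symbols for non-standard, ambiguous, unknown
ALPHABET = STANDARD + EXTRA        # 25 characters in total
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
UNKNOWN = INDEX["X"]


def encode(sequence: str) -> List[int]:
    """Map each residue to its alphabet index; unrecognized characters map to X."""
    return [INDEX.get(aa, UNKNOWN) for aa in sequence.upper()]


def one_hot(sequence: str) -> List[List[int]]:
    """Expand the integer encoding into one 25-dimensional indicator vector per residue."""
    width = len(ALPHABET)
    return [[1 if j == i else 0 for j in range(width)] for i in encode(sequence)]


tokens = encode("EVQLVESX")
print(tokens)
```

Integer indices of this kind are the usual input to an embedding layer, while the one-hot form suits simpler statistical or convolutional baselines.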

Many natural properties of antibodies are considered when designing optimized antibody candidates against a specific target, including binding affinity, binding specificity, stability, solubility, and effector functions. Antibody binding affinity, the property we focus on in this paper, is the binding strength between the antibody and antigen, and is inversely related to the equilibrium dissociation constant KD, a ratio
