Antibody Representation Learning for Drug Discovery

Lin Li 1, Esther Gupta 1, John Spaeth 1, Leslie Shing 1, Tristan Bepler 2, 3, Rajmonda Sulo Caceres 1
1 MIT Lincoln Laboratory, Lexington, MA 02421
2 Research Laboratory of Electronics, MIT, Cambridge, 02139
3 Simons Machine Learning Center, NYSBC, New York, 10027
{lin.li, esther.wolf, john.spaeth, leslie.shing}@ll.mit.edu, tbepler@mit.edu, rajmonda.caceres@ll.mit.edu
Abstract
Therapeutic antibody development has become an increas-
ingly popular approach for drug development. To date, anti-
body therapeutics are largely developed using large scale ex-
perimental screens of antibody libraries containing hundreds
of millions of antibody sequences. The high cost and diffi-
culty of developing therapeutic antibodies create a pressing
need for computational methods to predict antibody proper-
ties and create bespoke designs. However, the relationship
between antibody sequence and activity is a complex phys-
ical process and traditional iterative design approaches rely
on large scale assays and random mutagenesis. Deep learning
methods have emerged as a promising way to learn antibody
property predictors, but predicting antibody properties and
target-specific activities depends critically on the choice of
antibody representations and data linking sequences to prop-
erties is often limited. Recently, methods for learning bio-
logical sequence representations from large, general purpose
protein datasets have demonstrated impressive ability to cap-
ture structural and functional properties of proteins. How-
ever, existing works have not yet investigated the value, lim-
itations and opportunities of these methods in application to
antibody-based drug discovery. In this paper, we present re-
sults on a novel SARS-CoV-2 antibody binding dataset and
an additional benchmark dataset. We compare three classes of
models: conventional statistical sequence models, supervised
learning on each dataset independently, and fine-tuning an an-
tibody specific pre-trained embedding model. The pre-trained
models were trained on tens of millions of natural healthy hu-
man antibody sequences and protein sequences, respectively.
Experimental results suggest that self-supervised pretraining
of feature representation consistently offers significant
improvement over previous approaches. We also investigate
the impact of data size on the model performance, and discuss
challenges and opportunities that the machine learning
community can address to advance in silico engineering and
design of therapeutic antibodies.
DISTRIBUTION STATEMENT A. Approved for public release.
Distribution is unlimited. This material is based upon work
supported by the Under Secretary of Defense for Research and
Engineering under Air Force Contract No. FA8702-15-D-0001. Any
opinions, findings, conclusions or recommendations expressed in
this material are those of the author(s) and do not necessarily
reflect the views of the Under Secretary of Defense for Research and
Engineering. © 2021 Massachusetts Institute of Technology.
Delivered to the U.S. Government with Unlimited Rights, as defined
in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding
any copyright notice, U.S. Government rights in this work are
defined by DFARS 252.227-7013 or DFARS 252.227-7014 as
detailed above. Use of this work other than as specifically authorized
by the U.S. Government may violate any copyrights that exist in
this work.
Drug development is a complex, time-consuming and costly
process. The current drug design paradigm is primarily
an assay driven process: millions or billions of candidate
molecules are screened for activity to identify a small num-
ber of candidates for further development and optimiza-
tion. Often, this involves intensive laboratory experiments
and expert-guided analysis to design and test molecule vari-
ants. This approach poses high costs in terms of the equip-
ment, experimental environment, and the expert knowledge
required. For antibody drug development, candidate anti-
bodies are typically identified by affinity maturation of enor-
mous naive antibody libraries in vitro with phage or yeast
display or in vivo by animal immunization (Lu et al. 2020).
These methods require target molecules (epitopes) to be sta-
bly producible and rely on sheer scale to identify antibody
sequences that interact with the target. This makes these
processes expensive and unpredictable, because there is no
guarantee that a suitable antibody exists within the starting
library and candidates can fail later in the drug development
process due to safety or manufacturability concerns. As a
result, the ability to identify many candidate molecules and
optimize these quickly is critical for predictable, fast, and
inexpensive drug development.
With several technological advances in experimental antibody
development, including the ability to obtain pure antibodies
in large numbers for research and clinical use, as well as
the successful translation of antibodies to the clinic
(Lu et al. 2020), antibody design, a predominant therapeutic
modality for various diseases (Lu et al. 2020), has become an
increasingly popular approach for drug development. While
these advancements have shortened timelines throughout the
development pipeline, in silico engineering of antibody
candidates that enables cheaper and faster drug development
against rapidly evolving antigens still remains
a challenge. Computational methods for antibody development,
to date, have largely focused on physical simulation-based
or simple statistical model-based approaches to optimize the
complementarity-determining regions (CDRs) of antibodies
(Tiller and Tessier 2015). Physics-based methods approach
antibody optimization from first principles by simulating
interactions between antibodies and targets at the atomic
level. Although these methods are conceptually appealing and
offer the promise of generalization to any antibody or target
sequence, in practice, they require solved structures for
antibody-target complexes and are impractically slow to
compute (Yamashita 2018), limiting the space of antibody
variants that can be computationally explored.
arXiv:2210.02881v1 [q-bio.QM] 5 Oct 2022
Machine learning (ML) has demonstrated potential in ac-
celerating drug discovery with good performance and lower
computational cost (Yang et al. 2019; Bepler and Berger
2019; Liu et al. 2020; Stokes et al. 2020). A key challenge
in ML for drug development is learning effective feature
representations that capture structural and functional prop-
erties of candidate drug designs. Traditionally, ML mod-
els used in bioinformatics research rely on expert-designed
descriptors to predict drug-like properties. These descrip-
tors are often limited to known chemical or biochemical
classes and are inherently difficult to apply to unexplored
systems. Recent advances in neural networks provide an
alternative way to learn representations automatically from data.
The work in (Yang et al. 2019) has leveraged graph neu-
ral networks (GNNs) to learn representations of molecular
data, which are found to offer significant improvements over
models currently used in industrial workflows. Using a sim-
ilar approach, authors in (Stokes et al. 2020) have identified
Halicin, a new antibiotic molecule that is effective against a
broad class of disease-causing bacteria. Authors in (Bepler
and Berger 2019; Rao et al. 2019; Alley et al. 2019; Rives
et al. 2021; Bepler and Berger 2021) have explored learn-
ing protein sequence representations in a self-supervised and
semi-supervised fashion to predict structural properties and
other down-stream tasks. They have highlighted the value
of various neural network architectures and pre-training in
substantially improving our ability to predict structural and
functional properties from sequences alone. Authors in (Liu
et al. 2020) have shown promising results in applying CNN
architectures to both antibody enrichment prediction and
antibody design. While the above works suggest the promise
of deep representation learning for biological systems,
many questions remain open regarding antibody-specific
feature representation for drug discovery.
In this work, we develop the first antibody-specific pre-
trained protein embedding models and incorporate the
learned features into a broader ML system for antibody
binding affinity prediction. Using this system, we compare
the descriptive and predictive power of conventional anti-
body features and features learned by deep language mod-
els. Next, we explore the role of pre-training in supporting
knowledge transfer and improving antibody binding predic-
tion performance, and we evaluate the importance of pre-
training datasets by comparing models trained only on an-
tibody sequences to models trained on general protein se-
quences. Finally, we investigate training data sufficiency re-
quirements that support performance robustness and knowl-
edge transfer. We analyze the sensitivity of feature learning
as a function of training data size and characterize data size
ranges that are most suitable for deep learning of antibody
features. We make the following contributions:
• We present two large antibody sequence datasets with
  binding affinity measurements (Walsh et al. 2021; Liu
  et al. 2020) for analyzing antibody-specific representation
  learning. The LL-SARS-CoV-2 dataset, in particular,
  is the first of its kind in terms of sequence diversity,
  consisting of SARS-CoV-2 antibody variants generated from
  mutations across the full antibody sequence.
• We develop the first antibody sequence-specific pre-trained
  language model. We find that pre-training on general
  protein sequence datasets supports better feature
  refinement and learning for antibody binding prediction.
  This result highlights the important role of training over
  a diverse set of protein sequences.
• We demonstrate that language models consistently learn
  much more effective features for antibody binding
  prediction than conventional antibody sequence features or
  features learned by a CNN model. This result is consistent
  with recent results that have investigated the power
  of language models in supporting downstream tasks in
  the natural language domain (Saunshi, Malladi, and
  Arora 2021).
• Our training data size sensitivity analysis reveals a range
  of data sizes (about 3200-6400 samples) where we observe
  the steepest improvement in binding prediction
  performance. Fewer data samples are insufficient to capture
  the mapping from antibody sequence to binding, while
  more data samples offer diminishing returns.
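A data-size sensitivity analysis of the kind described in the last contribution can be sketched as a sweep over nested training subsets. The sketch below is illustrative only: the doubling schedule, the nested-subset sampling, and the names `doubling_sizes`, `learning_curve`, `fit`, and `evaluate` are assumptions made here, not the protocol used in the paper.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical record type: (antibody sequence, measured binding score).
Example = Tuple[str, float]


def doubling_sizes(start: int, stop: int) -> List[int]:
    """Return training-set sizes start, 2*start, 4*start, ... up to stop."""
    sizes = []
    n = start
    while n <= stop:
        sizes.append(n)
        n *= 2
    return sizes


def learning_curve(
    train: Sequence[Example],
    test: Sequence[Example],
    sizes: Sequence[int],
    fit: Callable[[Sequence[Example]], Any],
    evaluate: Callable[[Any, Sequence[Example]], float],
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Fit on nested subsets of the shuffled training set and score each on a fixed test set."""
    rng = random.Random(seed)
    shuffled = list(train)
    rng.shuffle(shuffled)
    curve = []
    for n in sizes:
        if n > len(shuffled):
            break  # not enough data for this size; stop the sweep
        model = fit(shuffled[:n])  # each subset extends the previous one
        curve.append((n, evaluate(model, test)))
    return curve


if __name__ == "__main__":
    # Toy stand-ins: a mean predictor scored by mean squared error on synthetic labels.
    data = [("A" * (i % 7 + 1), float(i % 10)) for i in range(1000)]
    fit = lambda rows: sum(y for _, y in rows) / len(rows)
    mse = lambda m, rows: sum((y - m) ** 2 for _, y in rows) / len(rows)
    for n, score in learning_curve(data[:800], data[800:], doubling_sizes(100, 6400), fit, mse):
        print(n, round(score, 3))
```

Keeping the test set fixed and the subsets nested means that differences between consecutive points on the curve reflect added data rather than a different random draw. In practice `fit` would wrap whichever binding predictor is being studied.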
Background
Antibodies are Y-shaped proteins produced by the immune
system to tag or neutralize foreign substances, called anti-
gens. As shown in Figure 1, antibodies consist of two iden-
tical light chains and two identical heavy chains. Each chain
has a variable region and a constant region. The tip of the
variable region forms the antibody’s binding surface, also
known as a paratope. This region recognizes and binds to
a specific antigen’s binding surface called an epitope. The
variable regions of each chain contain three hypervariable
regions known as the complementarity-determining regions
(CDRs), denoted as CDR-L1, CDR-L2, CDR-L3 and CDR-
H1, CDR-H2, CDR-H3, for the light and heavy chains re-
spectively (highlighted in yellow in Figure 1).
The amino acid sequence of the CDRs determines the
antigens to which an antibody will bind (O’Connor, Adams,
and Fairman 2010). The variable domain of the light chain
can consist of 94-139 amino acids, while that of the
heavy chain is slightly longer, consisting of 92-160 amino
acids (Abhinandan and Martin 2008). Each amino acid is
encoded into a 25-character alphabet with 20 characters for
the standard amino acids and 5 characters for non-standard,
ambiguous or unknown amino acids.
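As a rough illustration of this encoding, the sketch below maps a sequence to integer token ids and one-hot vectors over a 25-character alphabet. The text does not list the five extra symbols, so the choice of B, Z, U, O, and X (IUPAC-style codes) is an assumption, as are the names `encode` and `one_hot` and the example fragment.

```python
from typing import List

# Assumption: the text does not list the 5 extra symbols. B, Z, U, O and X
# (IUPAC-style codes) are used here purely for illustration.
STANDARD = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
EXTRA = "BZUOX"                    # non-standard, ambiguous, or unknown
ALPHABET = STANDARD + EXTRA        # 25 characters in total
TOKEN_ID = {aa: i for i, aa in enumerate(ALPHABET)}
UNKNOWN_ID = TOKEN_ID["X"]


def encode(sequence: str) -> List[int]:
    """Map an amino acid string to integer ids; unrecognized characters become X."""
    return [TOKEN_ID.get(aa, UNKNOWN_ID) for aa in sequence.upper()]


def one_hot(sequence: str) -> List[List[int]]:
    """Expand each token id into a 25-dimensional indicator vector."""
    width = len(ALPHABET)
    return [[1 if j == i else 0 for j in range(width)] for i in encode(sequence)]


if __name__ == "__main__":
    fragment = "EVQLVESGG"  # hypothetical heavy-chain fragment, for illustration only
    print(encode(fragment))
    print(len(one_hot(fragment)), len(one_hot(fragment)[0]))
```

Integer ids are the usual input to an embedding layer in a language model, while one-hot vectors are a common input for CNNs and simple statistical baselines.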
Many natural properties of antibodies are considered
when designing optimized antibody candidates against a
specific target including binding affinity, binding specificity,
stability, solubility, and effector functions. Antibody binding
affinity, the property we focus on in this paper, is the binding
strength between the antibody and antigen, and is inversely
related to the equilibrium dissociation constant KD, a ratio