ods approach antibody optimization from first principles by
simulating interactions between antibodies and targets at the
atomic level. Although these methods are conceptually ap-
pealing and offer the promise of generalization to any an-
tibody or target sequence, in practice, they require solved
structures for antibody-target complexes and are impracti-
cally slow to compute (Yamashita 2018), limiting the space
of antibody variants that can be computationally explored.
Machine learning (ML) has demonstrated potential in ac-
celerating drug discovery with good performance and lower
computational cost (Yang et al. 2019; Bepler and Berger
2019; Liu et al. 2020; Stokes et al. 2020). A key challenge
in ML for drug development is learning effective feature
representations that capture structural and functional prop-
erties of candidate drug designs. Traditionally, ML mod-
els used in bioinformatics research rely on expert-designed
descriptors to predict drug-like properties. These descriptors
are often limited to known chemical or biochemical classes
and are inherently difficult to apply to unexplored systems.
Recent advances in neural networks provide an alternative
way to learn representations automatically from data.
The work in (Yang et al. 2019) leveraged graph neural
networks (GNNs) to learn representations of molecular
data, which were found to offer significant improvements over
models currently used in industrial workflows. Using a similar
approach, the authors in (Stokes et al. 2020) identified
Halicin, a new antibiotic molecule that is effective against a
broad class of disease-causing bacteria. Authors in (Bepler
and Berger 2019; Rao et al. 2019; Alley et al. 2019; Rives
et al. 2021; Bepler and Berger 2021) have explored learning
protein sequence representations in a self-supervised and
semi-supervised fashion to predict structural properties and
support other downstream tasks. They have highlighted the value
of various neural network architectures and pre-training in
substantially improving our ability to predict structural and
functional properties from sequences alone. Authors in (Liu
et al. 2020) have shown promising results in applying CNN
architectures to both antibody enrichment prediction and
antibody design. While the above works suggest the promise
of deep representation learning for biological systems,
many questions remain open regarding antibody-specific
feature representations for drug discovery.
In this work, we develop the first antibody-specific pre-
trained protein embedding models and incorporate the
learned features into a broader ML system for antibody
binding affinity prediction. Using this system, we compare
the descriptive and predictive power of conventional anti-
body features and features learned by deep language mod-
els. Next, we explore the role of pre-training in supporting
knowledge transfer and improving antibody binding predic-
tion performance, and we evaluate the importance of pre-
training datasets by comparing models trained only on an-
tibody sequences to models trained on general protein se-
quences. Finally, we investigate training data sufficiency re-
quirements that support performance robustness and knowl-
edge transfer. We analyze the sensitivity of feature learning
as a function of training data size and characterize data size
ranges that are most suitable for deep learning of antibody
features. We make the following contributions:
• We present two large antibody sequence datasets with
binding affinity measurements (Walsh et al. 2021; Liu
et al. 2020) for analyzing antibody-specific representa-
tion learning. The LL-SARS-CoV-2 dataset, in particular,
is the first of its kind in terms of sequence diversity,
consisting of SARS-CoV-2 antibody variants generated from
mutations across the full antibody sequence.
• We develop the first antibody sequence-specific pre-
trained language model. We find that pre-training on gen-
eral protein sequence datasets supports better feature re-
finement and learning for antibody binding prediction.
This result highlights the important role of training over
a diverse set of protein sequences.
• We demonstrate that language models consistently learn
much more effective features for antibody binding pre-
diction than conventional antibody sequence features or
features learned by a CNN model. This result is consistent
with recent work investigating the power of language models
in supporting downstream tasks in the natural language
domain (Saunshi, Malladi, and Arora 2021).
• Our training data size sensitivity analysis reveals a range
of data sizes (about 3200-6400 samples) where we observe
the steepest improvement in binding prediction performance.
Fewer samples are insufficient to capture the antibody
sequence-to-binding mapping, while more samples offer
diminishing returns.
Background
Antibodies are Y-shaped proteins produced by the immune
system to tag or neutralize foreign substances, called anti-
gens. As shown in Figure 1, antibodies consist of two iden-
tical light chains and two identical heavy chains. Each chain
has a variable region and a constant region. The tip of the
variable region forms the antibody’s binding surface, also
known as a paratope. This region recognizes and binds to
a specific antigen’s binding surface called an epitope. The
variable regions of each chain contain three hypervariable
regions known as the complementarity-determining regions
(CDRs), denoted as CDR-L1, CDR-L2, CDR-L3 and CDR-
H1, CDR-H2, CDR-H3, for the light and heavy chains re-
spectively (highlighted in yellow in Figure 1).
The amino acid sequence of the CDRs determines the
antigens to which an antibody will bind (O’Connor, Adams,
and Fairman 2010). The variable domain of the light chain
can consist of 94-139 amino acids, while that of the
heavy chain is slightly longer, consisting of 92-160 amino
acids (Abhinandan and Martin 2008). Each amino acid is
encoded as one character from a 25-character alphabet, with
20 characters for the standard amino acids and 5 characters
for non-standard, ambiguous, or unknown amino acids.
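As a minimal illustration of this encoding, the following Python sketch maps an antibody sequence to integer token indices. The text does not list the five extra characters, so the choice of B, Z, U, O, and X below, the index ordering, and the names ALPHABET, TOKEN_TO_ID, and encode are all assumptions made for illustration, not the tokenizer used in this work.

```python
# 20 standard amino acids, one-letter codes.
STANDARD = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: the five extra symbols are not specified in the text.
# B and Z are ambiguous codes, U and O are non-standard residues,
# and X stands for an unknown residue.
EXTRA = "BZUOX"

ALPHABET = STANDARD + EXTRA  # 25 characters in total
TOKEN_TO_ID = {aa: i for i, aa in enumerate(ALPHABET)}


def encode(sequence):
    """Map each residue to its integer index in ALPHABET.

    Characters outside the alphabet fall back to the index of 'X'.
    """
    unknown = TOKEN_TO_ID["X"]
    return [TOKEN_TO_ID.get(ch, unknown) for ch in sequence.upper()]


# Example: the first residues of a hypothetical heavy-chain variable region.
tokens = encode("EVQLVESGG")
```

Mapping unrecognized characters to the unknown symbol keeps the encoder total over arbitrary input strings; a stricter pipeline might instead raise an error on unexpected characters.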
Many natural properties of antibodies are considered
when designing optimized antibody candidates against a
specific target, including binding affinity, binding specificity,
stability, solubility, and effector functions. Antibody binding
affinity, the property we focus on in this paper, is the binding
strength between the antibody and antigen, and is inversely
related to the equilibrium dissociation constant KD, a ratio