
measurements returned by the lab in each iteration is appended to our training set and we update our
models using the augmented dataset for the next iteration.
The wet lab’s measurement process can be viewed as a black-box function that is expensive to
evaluate. In the context of identifying designs maximizing this function, Bayesian optimization (BO)
emerges as a promising, sample-efficient framework that trades off exploration (evaluating highly
uncertain designs) and exploitation (evaluating designs believed to carry the best properties) in a
principled manner [4]. It relies on a probabilistic surrogate model that infers the posterior distribution
over the objectives and an acquisition function that assigns an expected utility value to each candidate.
BO has been successfully applied to a variety of protein engineering applications [5–7].
In particular, we cast our problem as multi-objective BO, where multiple objectives are jointly
optimized. Our objectives originate from the molecular properties evaluated during in vitro validation.
This validation process involves producing the design, confirming its pharmacology, and evaluating
whether it is active against a given drug target of interest. If found to be potent, the design is
then assayed for developability attributes—physicochemical properties that characterize its safety,
delivery, and manufacturability [8].
The experimental process of in vitro validation implies a hierarchy among the objectives. Consider
the property “expression” in the context of antibody design, for instance. A designed antibody
candidate must first be expressed in live cell culture. If the level of expression does not meet a
fixed threshold, the lab cannot produce it and it cannot be assayed for potency and developability
downstream. Supposing now that a design did express in viable amounts, if it does not bind to a target
antigen with sufficient “affinity” (and is thus not potent), then the design fails as an antibody and there
is little practical incentive in assaying it for developability (such as specificity and thermostability)—
even if it is possible to do so. The dependency between properties, whether experimental or biological
in origin, motivates us to prioritize some objectives before others when selecting the subset of designs
to submit to the wet lab. Our primary goal is to identify “joint positive” designs: designs that meet the
chosen thresholds in all the parent objectives (expressing binders) according to the specified partial
ordering and also perform well in the leaf-level objectives (high specificity, thermostability).
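This hierarchy can be viewed as a small DAG over objectives, with a design counting as a joint positive only if it clears every thresholded parent. A minimal sketch in Python (the property names and threshold values are hypothetical, not taken from the paper):

```python
# Hypothetical objective DAG for antibody design: each objective lists its
# parents; a child is only assayed once all its parents pass their thresholds.
PARENTS = {
    "expression": [],
    "affinity": ["expression"],
    "specificity": ["affinity"],
    "thermostability": ["affinity"],
}
# Illustrative thresholds on the parent (gating) objectives only.
THRESHOLDS = {"expression": 0.5, "affinity": 0.7}

def is_joint_positive(measured: dict) -> bool:
    """A design is a 'joint positive' if it meets the threshold in every
    parent objective; the leaf objectives are then the ones to maximize."""
    return all(measured[obj] >= thr for obj, thr in THRESHOLDS.items())

# An expressing, potent binder qualifies; a non-expressing design does not.
print(is_joint_positive({"expression": 0.9, "affinity": 0.8}))  # True
print(is_joint_positive({"expression": 0.1, "affinity": 0.8}))  # False
```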
To this end, we propose PropertyDAG, a simple framework that operates on top of the traditional
multi-objective BO to impose a desired partial ordering on the objectives, e.g. expression → affinity
→ {specificity, thermostability}. Our framework modifies the posterior inference procedure within
standard BO in two ways. First, we treat the objectives as mixed-variable—in particular, each
objective is modeled as a mixture of zeros and a wide dispersion of real-valued, non-zero values.
The surrogate model consists of a binary classifier and a regressor, which infer the zero mode and
the non-zero mode, respectively. We show that this modeling choice is well-suited for biological
properties, which tend to carry excess zero, or null, values and fat tails [9]. Second, before samples
from the posterior distribution inferred by the surrogate model enter the acquisition function, we
transform the samples such that they conform to the specified partial ordering of properties. We
run multi-objective BO with PropertyDAG over multiple simulated active-learning iterations on a
penicillin production task, a toy numerical problem, and a real-world antibody design task. In all
three tasks, PropertyDAG-BO identifies significantly more joint positives than standard BO.
After the final iteration, the surrogate models trained under PropertyDAG-BO also output more accurate
predictions on the joint positives in a held-out test set than do their standard-BO counterparts.
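The two modifications can be sketched concretely. Below, a zero-inflated posterior draw combines the classifier's non-zero probability with the regressor's real-valued draw, and a transform zeroes a child objective's draws wherever a parent's draw is zero, so the acquisition function assigns no utility to designs that fail upstream. This is an illustrative sketch (assumed shapes and property names), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed(p_nonzero, reg_draws):
    """Zero-inflated posterior draw: a Bernoulli non-zero indicator from
    the binary classifier times the regressor's real-valued draw."""
    return rng.binomial(1, p_nonzero) * reg_draws

def transform_samples(samples, parents):
    """Force posterior draws to respect the partial ordering: wherever any
    parent objective's draw is zero, zero out the child's draw as well.
    `parents` maps objective -> list of parents, in topological order."""
    out = {name: draws.copy() for name, draws in samples.items()}
    for child, pars in parents.items():
        for p in pars:
            out[child][out[p] == 0.0] = 0.0
    return out

# Three candidate designs; design 0 fails expression, design 1 fails affinity.
parents = {"expression": [], "affinity": ["expression"],
           "specificity": ["affinity"], "thermostability": ["affinity"]}
samples = {"expression":      np.array([0.0, 1.2, 0.9]),
           "affinity":        np.array([0.8, 0.0, 1.1]),
           "specificity":     np.array([0.5, 0.6, 0.7]),
           "thermostability": np.array([0.4, 0.3, 0.2])}
out = transform_samples(samples, parents)
print(out["specificity"])  # only design 2 retains a non-zero draw
```

Because zeroed parents propagate to their children, a design that fails expression contributes nothing to the utility of any downstream objective.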
2 Background
Bayesian optimization (BO) is a popular technique for sample-efficient black-box optimization [see
10, 11 for a review]. Suppose our objective f : X → R is a black-box function of the design space X
that is expensive to evaluate. Our goal is to efficiently identify a design x⋆ ∈ X maximizing f.¹
BO leverages two tools, a probabilistic surrogate model and a utility function, to trade off exploration
(evaluating highly uncertain designs) and exploitation (evaluating designs believed to maximize f) in
a principled manner.
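As a toy illustration of this loop, the sketch below pairs a crude kernel-smoothing surrogate (standing in for the usual Gaussian process) with an upper-confidence-bound utility on a one-dimensional problem; all function names and constants are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: -(x - 0.3) ** 2        # toy black-box objective (assumed hidden)
X = rng.uniform(0.0, 1.0, 4)         # initial evaluated designs
Y = f(X)                             # noisy-free observations for simplicity

def posterior(xq, X, Y, length=0.1):
    """Crude kernel-smoothing surrogate standing in for a GP: a weighted
    mean as the posterior mean, distance to the nearest evaluated design
    as an uncertainty proxy."""
    w = np.exp(-((xq[:, None] - X[None, :]) ** 2) / (2 * length ** 2))
    mu = (w * Y).sum(axis=1) / (w.sum(axis=1) + 1e-12)
    sigma = np.abs(xq[:, None] - X[None, :]).min(axis=1)
    return mu, sigma

for t in range(10):                  # ten BO iterations
    cand = rng.uniform(0.0, 1.0, 256)        # candidate designs
    mu, sigma = posterior(cand, X, Y)
    ucb = mu + 2.0 * sigma           # utility: upper confidence bound
    x_new = cand[np.argmax(ucb)]     # exploration/exploitation trade-off
    X, Y = np.append(X, x_new), np.append(Y, f(x_new))

print(X[np.argmax(Y)])               # best design found so far
```

Early iterations are dominated by the uncertainty term (exploration); as the gaps between evaluated designs shrink, the mean term takes over and the loop concentrates evaluations near the optimum (exploitation).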
For each iteration t ∈ N, we have a dataset Dt = {(x(1), y(1)), · · · , (x(Nt), y(Nt))}, where
each y(n) is a noisy observation of f. First, the probabilistic model f̂ : X → R infers the posterior
¹For simplicity, we define the task as maximization in this paper without loss of generality. For minimizing
f, we can negate f, for instance.