PropertyDAG: Multi-objective Bayesian optimization
of partially ordered, mixed-variable properties for
biological sequence design
Ji Won Park1, Samuel Stanton1, Saeed Saremi1, Andrew Watkins1, Henri Dwyer1, Vladimir Gligorijević1, Richard Bonneau1, Stephen Ra1, and Kyunghyun Cho1,2,3,4
1Prescient Design, Genentech
2Department of Computer Science, Courant Institute of Mathematical Sciences, New York University
3Center for Data Science, New York University
4CIFAR Fellow
park.ji_won@gene.com
Abstract
Bayesian optimization offers a sample-efficient framework for navigating the exploration-exploitation trade-off in the vast design space of biological sequences. Whereas it is possible to optimize the various properties of interest jointly using a multi-objective acquisition function, such as the expected hypervolume improvement (EHVI), this approach does not account for objectives with a hierarchical dependency structure. We consider a common use case where some regions of the Pareto frontier are prioritized over others according to a specified partial ordering in the objectives. For instance, when designing antibodies, we maximize the binding affinity to a target antigen only if it can be expressed in live cell culture—modeling the experimental dependency in which affinity can only be measured for antibodies that can be expressed and thus produced in viable quantities. In general, we may want to confer a partial ordering to the properties such that each is optimized conditioned on its parent properties satisfying some feasibility condition. To this end, we present PropertyDAG, a framework that operates on top of traditional multi-objective BO to impose this desired ordering on the objectives, e.g. expression → affinity. We demonstrate its performance over multiple simulated active learning iterations on a penicillin production task, a toy numerical problem, and a real-world antibody design task.
1 Introduction
Designing biological sequences entails searching over vast combinatorial design spaces. Recently,
deep sequence generation models trained on large datasets of known, functional sequences have
shown promise in generating physically and chemically plausible designs [e.g. 1–3]. Whereas these
models accelerate the design process, limited resources place a cap on how many designs we can
characterize in vitro for assessing their suitability. Only once a design is validated in vitro and
undergoes multiple rounds of optimization can it proceed down the drug development pipeline to
preclinical development and clinical trials, where its performance is tested in vivo.
Because the wet lab cannot provide feedback on all of the candidate designs, we take an iterative, data-driven approach to select the most informative subset to submit to the wet lab. Many drug design applications call for such an active learning approach, as the initial datasets available to train predictive models on our desired properties of interest tend to be small or nonexistent. The measurements returned by the lab in each iteration are appended to our training set and we update our models using the augmented dataset for the next iteration.

Preprint. Under review. arXiv:2210.04096v1 [cs.LG] 8 Oct 2022
The wet lab’s measurement process can be viewed as a black-box function that is expensive to
evaluate. In the context of identifying designs maximizing this function, Bayesian optimization (BO)
emerges as a promising, sample-efficient framework that trades off exploration (evaluating highly
uncertain designs) and exploitation (evaluating designs believed to carry the best properties) in a
principled manner [4]. It relies on a probabilistic surrogate model that infers the posterior distribution
over the objectives and an acquisition function that assigns an expected utility value to each candidate.
BO has been successfully applied to a variety of protein engineering applications [5–7].
In particular, we cast our problem as multi-objective BO, where multiple objectives are jointly
optimized. Our objectives originate from the molecular properties evaluated during in vitro validation.
This validation process involves producing the design, confirming its pharmacology, and evaluating
whether it is active against a given drug target of interest. If found to be potent, the design is
then assayed for developability attributes—physicochemical properties that characterize the safety,
delivery, and manufacturability [8].
The experimental process of in vitro validation signifies a hierarchy among the objectives. Consider
the property “expression” in the context of antibody design, for instance. A designed antibody
candidate must first be expressed in live cell culture. If the level of expression does not meet a
fixed threshold, the lab cannot produce it and it cannot be assayed for potency and developability
downstream. Supposing now that a design did express in viable amounts, if it does not bind to a target
antigen with sufficient “affinity” (and is thus not potent), then the design fails as an antibody and there
is little practical incentive in assaying it for developability (such as specificity and thermostability)—
even if it is possible to do so. The dependency between properties, whether experimental or biological in origin, motivates us to prioritize some objectives before others when selecting the subset of designs to submit to the wet lab. Our primary goal is to identify "joint positive" designs, designs that meet the chosen thresholds in all the parent objectives (expressing binders) according to the specified partial ordering and also perform well in the leaf-level objectives (high specificity, thermostability).
To this end, we propose PropertyDAG, a simple framework that operates on top of the traditional multi-objective BO to impose a desired partial ordering on the objectives, e.g. expression → affinity → {specificity, thermostability}. Our framework modifies the posterior inference procedure within standard BO in two ways. First, we treat the objectives as mixed-variable—in particular, each objective is modeled as a mixture of zeros and a wide dispersion of real-valued, non-zero values. The surrogate model consists of a binary classifier and a regressor, which infer the zero mode and the non-zero mode, respectively. We show that this modeling choice is well-suited for biological properties, which tend to carry excess zero, or null, values and fat tails [9]. Second, before samples from the posterior distribution inferred by the surrogate model enter the acquisition function, we transform the samples such that they conform to the specified partial ordering of properties. We run multi-objective BO with PropertyDAG over multiple simulated active learning iterations on a penicillin production task, a toy numerical problem, and a real-world antibody design task. In all three tasks, PropertyDAG-BO identifies significantly more joint positives compared to standard BO. After the final iteration, the surrogate models trained under PropertyDAG-BO also output more accurate predictions on the joint positives in a held-out test set than do the standard BO equivalents.
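The second modification, transforming posterior samples so they respect the partial ordering, can be sketched in a few lines. This is a minimal illustration of the idea rather than the paper's implementation: the DAG encoding (a parent-index dictionary), the threshold values, and the function names are all assumptions made here for concreteness.

```python
import numpy as np

def enforce_partial_ordering(samples, parents, thresholds):
    """Zero out posterior samples of each objective whenever any parent
    objective's sample fails its feasibility threshold.

    samples    : (n_samples, K) array of posterior draws, one column per objective
    parents    : dict mapping objective index -> list of parent objective indices
    thresholds : dict mapping objective index -> feasibility threshold
    """
    out = samples.copy()
    # Visit objectives in topological order so a parent column is already
    # masked before its children are checked (assumes `parents` encodes a DAG).
    for k in topological_order(parents):
        for p in parents.get(k, []):
            infeasible = out[:, p] < thresholds[p]
            out[infeasible, k] = 0.0  # a child contributes nothing if a parent fails
    return out

def topological_order(parents):
    # Kahn's algorithm over the objective DAG.
    keys = set(parents) | {p for ps in parents.values() for p in ps}
    remaining = {k: set(parents.get(k, [])) for k in keys}
    order = []
    while remaining:
        ready = [k for k, ps in remaining.items() if not ps]
        for k in ready:
            order.append(k)
            del remaining[k]
        for ps in remaining.values():
            ps.difference_update(ready)
    return order
```

For the running antibody example, objectives 0–3 could stand for expression → affinity → {specificity, thermostability}, encoded as `parents = {1: [0], 2: [1], 3: [1]}`: any sample with infeasible expression has its affinity zeroed, which in turn zeroes its specificity and thermostability.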
2 Background
Bayesian optimization (BO) is a popular technique for sample-efficient black-box optimization [see 10, 11 for a review]. Suppose our objective $f: \mathcal{X} \to \mathbb{R}$ is a black-box function of the design space $\mathcal{X}$ that is expensive to evaluate. Our goal is to efficiently identify a design $x^\star \in \mathcal{X}$ maximizing $f$.¹ BO leverages two tools, a probabilistic surrogate model and a utility function, to trade off exploration (evaluating highly uncertain designs) and exploitation (evaluating designs believed to maximize $f$) in a principled manner.
For each iteration $t \in \mathbb{N}$, we have a dataset $D_t = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N_t)}, y^{(N_t)})\} \in \mathbb{D}_t$, where each $y^{(n)}$ is a noisy observation of $f$. First, the probabilistic model $\hat{f}: \mathcal{X} \to \mathbb{R}$ infers the posterior distribution $p(\hat{f} \mid D_t)$, quantifying the plausibility of surrogate objectives $\hat{f} \in \mathcal{F}$. Next, we introduce a utility function $u: \mathcal{X} \times \mathcal{F} \times \mathbb{D}_t \to \mathbb{R}$. The acquisition function $a(x)$ is simply the expected utility of $x$ w.r.t. our beliefs about $f$,

$$a(x) = \int u(x, \hat{f}, D_t)\, p(\hat{f} \mid D_t)\, d\hat{f}. \qquad (1)$$

¹For simplicity, we define the task as maximization in this paper without loss of generality. For minimizing $f$, we can negate $f$, for instance.
For example, we obtain the expected improvement (EI) acquisition function if we take $u_{\mathrm{EI}}(x, \hat{f}, D) = [\hat{f}(x) - \max_{(x', y') \in D} y']_+$, where $[\cdot]_+ = \max(\cdot, 0)$ [12, 4]. Generally the integral is approximated by Monte Carlo (MC) with posterior samples $\hat{f}^{(j)} \sim p(\hat{f} \mid D_t)$. We select a maximizer of $a$ as the new design, measure its properties, and append the observation to the dataset. The surrogate is then retrained on the augmented dataset and the procedure repeats.
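The MC approximation of the EI integral can be illustrated with a short sketch. The Gaussian posterior below stands in for whatever surrogate is actually fit; the function name and signature are illustrative, not drawn from the paper.

```python
import numpy as np

def mc_expected_improvement(mu, sigma, best_y, n_samples=10_000, seed=0):
    """Monte Carlo estimate of EI for candidates with Gaussian posteriors.

    mu, sigma : (n_candidates,) posterior mean and std of f-hat at each candidate
    best_y    : best observed objective value so far
    """
    rng = np.random.default_rng(seed)
    # Draw posterior samples f-hat^(j) ~ p(f-hat | D_t) at every candidate
    draws = rng.normal(mu, sigma, size=(n_samples, len(mu)))
    # Average the utility u_EI = [f-hat(x) - best_y]_+ over the samples
    return np.maximum(draws - best_y, 0.0).mean(axis=0)
```

A candidate whose posterior mean sits well above `best_y` receives EI close to its mean improvement, while one far below receives EI near zero unless its uncertainty is large, which is exactly the exploration-exploitation trade-off described above.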
Multi-objective optimization
When there are multiple objectives of interest, a single best design may not exist. Suppose there are $K$ objectives, $f: \mathcal{X} \to \mathbb{R}^K$. The goal of multi-objective optimization (MOO) is to identify the set of Pareto-optimal solutions such that improving one objective within the set leads to worsening another. We say that $x$ dominates $x'$, or $f(x) \succ f(x')$, if $f_k(x) \geq f_k(x')$ for all $k \in \{1, \ldots, K\}$ and $f_k(x) > f_k(x')$ for some $k$. The set of non-dominated solutions $\mathcal{X}^\star$ is defined in terms of the Pareto frontier (PF) $\mathcal{P}^\star$,

$$\mathcal{X}^\star = \{x : f(x) \in \mathcal{P}^\star\}, \quad \text{where } \mathcal{P}^\star = \{f(x) : x \in \mathcal{X}, \nexists\, x' \in \mathcal{X} \text{ s.t. } f(x') \succ f(x)\}. \qquad (2)$$
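For a finite set of objective vectors, the non-dominated set in Eq. (2) can be extracted by a direct pairwise check. The brute-force sketch below (illustrative names, maximization convention) is $O(n^2)$ but mirrors the dominance definition term by term.

```python
import numpy as np

def pareto_front(Y):
    """Return a boolean mask of the non-dominated rows of Y (maximization).

    Row y is dominated if some other row y' satisfies y' >= y elementwise
    with strict inequality in at least one coordinate.
    """
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(Y[j] >= Y[i]) and np.any(Y[j] > Y[i]):
                mask[i] = False  # Y[i] is dominated by Y[j]
                break
    return mask
```

For instance, among the vectors (1, 2), (2, 1), (0, 0), and (2, 2), only (2, 2) survives, whereas (1, 2) and (2, 1) alone would both be Pareto-optimal since neither dominates the other.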
MOO algorithms typically aim to identify a finite approximation to $\mathcal{X}^\star$, which may be infinite, within a reasonable number of iterations. One way to measure the quality of an approximate PF $\mathcal{P}$ is to compute the hypervolume $\mathrm{HV}(\mathcal{P} \mid r_{\mathrm{ref}})$ of the polytope bounded by $\mathcal{P} \cup \{r_{\mathrm{ref}}\}$, where $r_{\mathrm{ref}} \in \mathbb{R}^K$ is a user-specified reference point. We obtain the expected hypervolume improvement (EHVI) acquisition function if we take

$$u_{\mathrm{EHVI}}(x, \hat{f}, D) = \mathrm{HVI}(\mathcal{P}', \mathcal{P} \mid r_{\mathrm{ref}}) = [\mathrm{HV}(\mathcal{P}' \mid r_{\mathrm{ref}}) - \mathrm{HV}(\mathcal{P} \mid r_{\mathrm{ref}})]_+, \qquad (3)$$

where $\mathcal{P}' = \mathcal{P} \cup \{\hat{f}(x)\}$ [13, 14].
Noisy observations
In the noiseless setting, the observed baseline PF is the true baseline PF, i.e. $\mathcal{P}_t = \{y : y \in \mathcal{Y}_t, \nexists\, y' \in \mathcal{Y}_t \text{ s.t. } y' \succ y\}$, where $\mathcal{Y}_t := \{y^{(n)}\}_{n=1}^{N_t}$. This does not, however, hold in many practical applications, where measurements carry noise. For instance, given a zero-mean Gaussian measurement process with noise covariance $\Sigma$, the feedback for a design $x$ is $y \sim \mathcal{N}(f(x), \Sigma)$, not $f(x)$ itself. The noisy expected hypervolume improvement (NEHVI) acquisition function marginalizes over the surrogate posterior at the previously observed points $X_t = \{x^{(n)}\}_{n=1}^{N_t}$,

$$u_{\mathrm{NEHVI}}(x, \hat{f}, D) = \mathrm{HVI}(\hat{\mathcal{P}}'_t, \hat{\mathcal{P}}_t \mid r_{\mathrm{ref}}), \qquad (4)$$

where $\hat{\mathcal{P}}_t = \{\hat{f}(x) : x \in X_t, \nexists\, x' \in X_t \text{ s.t. } \hat{f}(x') \succ \hat{f}(x)\}$ and $\hat{\mathcal{P}}'_t = \hat{\mathcal{P}}_t \cup \{\hat{f}(x)\}$ [15].
Batched (parallel) optimization
Sequential optimization, or querying $f$ for one design per iteration, is impractical for many applications due to the latency in feedback. In protein engineering, for example, it may be necessary to select a batch of designs in a given iteration and wait several months to receive measurements [16, 17]. Jointly selecting a batch of $q$ designs from a large pool of $q' \gg q$ candidates requires combinatorial evaluations of the acquisition function.
3 Related Work
Existing work on multi-objective BO does not account for objectives with a hierarchical dependency structure [18–21, 15]. We refer to [22] for a formulation of single-objective BO with a hierarchy in how the objective is computed. A body of work focuses on constrained optimization, which optimizes a black-box function subject to a set of black-box constraints being satisfied [23–28]. For dealing with mixed-variable objectives, [29] propose reparameterizing the discrete random variables in terms of continuous parameters. Our approach here is to model them explicitly using the zero-inflated formalism.