
measurements returned by the lab in each iteration is appended to our training set and we update our
models using the augmented dataset for the next iteration.
The wet lab’s measurement process can be viewed as a black-box function that is expensive to
evaluate. In the context of identifying designs maximizing this function, Bayesian optimization (BO)
emerges as a promising, sample-efficient framework that trades off exploration (evaluating highly
uncertain designs) and exploitation (evaluating designs believed to carry the best properties) in a
principled manner [4]. It relies on a probabilistic surrogate model that infers the posterior distribution
over the objectives and an acquisition function that assigns an expected utility value to each candidate.
BO has been successfully applied to a variety of protein engineering applications [5–7].
In particular, we cast our problem as multi-objective BO, where multiple objectives are jointly
optimized. Our objectives originate from the molecular properties evaluated during in vitro validation.
This validation process involves producing the design, confirming its pharmacology, and evaluating
whether it is active against a given drug target of interest. If found to be potent, the design is
then assayed for developability attributes—physicochemical properties that characterize its safety,
delivery, and manufacturability [8].
The experimental process of in vitro validation implies a hierarchy among the objectives. Consider
the property “expression” in the context of antibody design, for instance. A designed antibody
candidate must first be expressed in live cell culture. If the level of expression does not meet a
fixed threshold, the lab cannot produce it and it cannot be assayed for potency and developability
downstream. Supposing now that a design did express in viable amounts, if it does not bind to a target
antigen with sufficient “affinity” (and is thus not potent), then the design fails as an antibody and there
is little practical incentive in assaying it for developability (such as specificity and thermostability)—
even if it is possible to do so. The dependency between properties, whether experimental or biological
in origin, motivates us to prioritize some objectives before others when selecting the subset of designs
to submit to the wet lab. Our primary goal is to identify “joint positive” designs: designs that meet the
chosen thresholds in all the parent objectives (expressing binders) according to the specified partial
ordering and also perform well in the leaf-level objectives (high specificity, thermostability).
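This hierarchy can be viewed as a small DAG over objectives, with a design counting as a joint positive only if it clears every thresholded parent. A minimal sketch in Python (the property names and threshold values are hypothetical, not taken from the paper):

```python
# Hypothetical objective DAG for antibody design: each objective lists its
# parents; a child is only assayed once all its parents pass their thresholds.
PARENTS = {
    "expression": [],
    "affinity": ["expression"],
    "specificity": ["affinity"],
    "thermostability": ["affinity"],
}
# Illustrative thresholds on the parent (gating) objectives only.
THRESHOLDS = {"expression": 0.5, "affinity": 0.7}

def is_joint_positive(measured: dict) -> bool:
    """A design is a 'joint positive' if it meets the threshold in every
    parent objective; the leaf objectives are then the ones to maximize."""
    return all(measured[obj] >= thr for obj, thr in THRESHOLDS.items())

# An expressing, potent binder qualifies; a non-expressing design does not.
print(is_joint_positive({"expression": 0.9, "affinity": 0.8}))  # True
print(is_joint_positive({"expression": 0.1, "affinity": 0.8}))  # False
```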
To this end, we propose PropertyDAG, a simple framework that operates on top of the traditional
multi-objective BO to impose a desired partial ordering on the objectives, e.g. expression → affinity
→ {specificity, thermostability}. Our framework modifies the posterior inference procedure within
standard BO in two ways. First, we treat the objectives as mixed-variable—in particular, each
objective is modeled as a mixture of zeros and a wide dispersion of real-valued, non-zero values.
The surrogate model consists of a binary classifier and a regressor, which infer the zero mode and
the non-zero mode, respectively. We show that this modeling choice is well-suited for biological
properties, which tend to carry excess zero, or null, values and fat tails [9]. Second, before samples
from the posterior distribution inferred by the surrogate model enter the acquisition function, we
transform the samples such that they conform to the specified partial ordering of properties. We
run multi-objective BO with PropertyDAG over multiple simulated active-learning iterations on a
penicillin production task, a toy numerical problem, and a real-world antibody design task. In all
three tasks, PropertyDAG-BO identifies significantly more joint positives than standard BO.
After the final iteration, the surrogate models trained under PropertyDAG-BO also output more accurate
predictions on the joint positives in a held-out test set than do their standard-BO counterparts.
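The two modifications can be sketched concretely. Below, a zero-inflated posterior draw combines the classifier's non-zero probability with the regressor's real-valued draw, and a transform zeroes a child objective's draws wherever a parent's draw is zero, so the acquisition function assigns no utility to designs that fail upstream. This is an illustrative sketch (assumed shapes and property names), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed(p_nonzero, reg_draws):
    """Zero-inflated posterior draw: a Bernoulli non-zero indicator from
    the binary classifier times the regressor's real-valued draw."""
    return rng.binomial(1, p_nonzero) * reg_draws

def transform_samples(samples, parents):
    """Force posterior draws to respect the partial ordering: wherever any
    parent objective's draw is zero, zero out the child's draw as well.
    `parents` maps objective -> list of parents, in topological order."""
    out = {name: draws.copy() for name, draws in samples.items()}
    for child, pars in parents.items():
        for p in pars:
            out[child][out[p] == 0.0] = 0.0
    return out

# Three candidate designs; design 0 fails expression, design 1 fails affinity.
parents = {"expression": [], "affinity": ["expression"],
           "specificity": ["affinity"], "thermostability": ["affinity"]}
samples = {"expression":      np.array([0.0, 1.2, 0.9]),
           "affinity":        np.array([0.8, 0.0, 1.1]),
           "specificity":     np.array([0.5, 0.6, 0.7]),
           "thermostability": np.array([0.4, 0.3, 0.2])}
out = transform_samples(samples, parents)
print(out["specificity"])  # only design 2 retains a non-zero draw
```

Because zeroed parents propagate to their children, a design that fails expression contributes nothing to the utility of any downstream objective.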
2 Background
Bayesian optimization (BO) is a popular technique for sample-efficient black-box optimization [see
10, 11 for a review]. Suppose our objective f : X → R is a black-box function of the design space X
that is expensive to evaluate. Our goal is to efficiently identify a design x⋆ ∈ X maximizing f.¹
BO leverages two tools, a probabilistic surrogate model and a utility function, to trade off exploration
(evaluating highly uncertain designs) and exploitation (evaluating designs believed to maximize f) in
a principled manner.
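As a toy illustration of this loop, the sketch below pairs a crude kernel-smoothing surrogate (standing in for the usual Gaussian process) with an upper-confidence-bound utility on a one-dimensional problem; all function names and constants are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: -(x - 0.3) ** 2        # toy black-box objective (assumed hidden)
X = rng.uniform(0.0, 1.0, 4)         # initial evaluated designs
Y = f(X)                             # noisy-free observations for simplicity

def posterior(xq, X, Y, length=0.1):
    """Crude kernel-smoothing surrogate standing in for a GP: a weighted
    mean as the posterior mean, distance to the nearest evaluated design
    as an uncertainty proxy."""
    w = np.exp(-((xq[:, None] - X[None, :]) ** 2) / (2 * length ** 2))
    mu = (w * Y).sum(axis=1) / (w.sum(axis=1) + 1e-12)
    sigma = np.abs(xq[:, None] - X[None, :]).min(axis=1)
    return mu, sigma

for t in range(10):                  # ten BO iterations
    cand = rng.uniform(0.0, 1.0, 256)        # candidate designs
    mu, sigma = posterior(cand, X, Y)
    ucb = mu + 2.0 * sigma           # utility: upper confidence bound
    x_new = cand[np.argmax(ucb)]     # exploration/exploitation trade-off
    X, Y = np.append(X, x_new), np.append(Y, f(x_new))

print(X[np.argmax(Y)])               # best design found so far
```

Early iterations are dominated by the uncertainty term (exploration); as the gaps between evaluated designs shrink, the mean term takes over and the loop concentrates evaluations near the optimum (exploitation).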
For each iteration t ∈ N, we have a dataset Dt = {(x(1), y(1)), · · · , (x(Nt), y(Nt))}, where
each y(n) is a noisy observation of f. First, the probabilistic model f̂ : X → R infers the posterior
¹For simplicity, we define the task as maximization in this paper without loss of generality. For minimizing
f, we can negate f, for instance.