A Pareto-optimal compositional energy-based model for sampling and optimization of protein sequences

Nataša Tagasovska1, Nathan C. Frey1, Andreas Loukas1, Isidro Hötzel2, Julien Lafrance-Vanasse2, Ryan Lewis Kelly2, Yan Wu2, Arvind Rajpal2, Richard Bonneau1, Kyunghyun Cho1,3,4,5, Stephen Ra1, and Vladimir Gligorijević1

1Prescient Design, Genentech

2Antibody Engineering, Genentech

3Department of Computer Science, Courant Institute of Mathematical Sciences, New York University

4Center for Data Science, New York University

5CIFAR Fellow

natasa.tagasovska@roche.com

Abstract

Deep generative models have emerged as a popular machine learning-based approach for inverse design problems in the life sciences. However, these problems often require sampling new designs that satisfy multiple properties of interest in addition to learning the data distribution. This multi-objective optimization becomes more challenging when properties are independent or orthogonal to each other. In this work, we propose a Pareto-compositional energy-based model (pcEBM), a framework that uses multiple gradient descent for sampling new designs that adhere to various constraints in optimizing distinct properties. We demonstrate its ability to learn non-convex Pareto fronts and generate sequences that simultaneously satisfy multiple desired properties across a series of real-world antibody design tasks.

1 Introduction

Generative models have shown promise across various applications in the life sciences for generating chemically and physically plausible designs and for accelerating the process of scientific discovery. Part of this adoption can be attributed to the convincing examples created by generative adversarial networks (GANs), variational autoencoders (VAEs), energy-based models (EBMs) and, more recently, diffusion models. However, there are far fewer success stories in real-world industry applications. Some reasons include an overrepresentation of image datasets; a lack of evaluation protocols and metrics for synthetic data; challenges around controllable generation and training, for example with GANs; and challenges in generating samples that differ from those seen during training. Taken together, these serve to limit the applicability of generative modeling for real-world use cases.

Our work is motivated by this and, in particular, by the need to accelerate the development and discovery of new molecules, namely therapeutic antibodies. Though several generative models have already been proposed for these purposes, generating new samples without guidance or control does not guarantee downstream success of the proposed molecules. In practice, each molecule has to satisfy multiple properties. For therapeutic antibodies, these could include binding affinity, polyreactivity, and viscosity. Failure to account for these and other properties can lead to serious complications later on during scale-up, manufacturing, and clinical trials, and the optimization of straightforward objectives does not necessarily translate into progress in the laboratory [28].

Preprint. Under review.

arXiv:2210.10838v1 [cs.LG] 19 Oct 2022

Figure 1: Example output of the proposed Pareto-optimal energy-based sampler (pcEBM) compared to naive multi-objective sampling (cEBM). The green marker denotes the starting sequence; each point in the plot corresponds to a modified design of that starting sequence, aimed at improving affinity and nonspecificity (BV-ELISA or BV score). pcEBM produces samples along a Pareto front, with candidates minimizing both objectives simultaneously, while candidates generated by the cEBM minimize only one objective.

Motivated by this challenge, we propose a new EBM for antibody design that simultaneously takes into account multiple properties that an antibody has to satisfy. Adherence to multiple properties is a challenge in itself, as it often involves optimizing multiple conflicting objectives. In antibodies, for instance, optimization of binding affinity alone may come at the expense of developability properties, the parameters that govern the likelihood of success of a molecule during manufacturing and quality control.

In general, when optimizing multiple conflicting properties, it is often impossible to find a single sample that satisfies all of the objectives simultaneously. We argue that, instead, one should aim to propose a set of diverse data points from the Pareto front that correspond to different trade-offs among the objective functions. Doing so gives a global perspective on the optimal trade-offs between objectives, so that one can select a molecule according to one's preference. To achieve Pareto optimality, we rely on recent advances in multi-objective optimization (MOO) as well as on compositional sampling with EBMs to build a Pareto-optimal compositional EBM (pcEBM). Figure 1 exemplifies both why we need pcEBM and what we achieve with it when optimizing the design of an initial antibody sequence.
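To make the notion of a Pareto front concrete, the following is a minimal sketch of filtering a set of candidate designs down to its non-dominated subset. It is not the pcEBM sampler itself; the function names `dominates` and `pareto_front` and the candidate scores are hypothetical, and both objectives are assumed to be minimized, as in Figure 1.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical (objective_1, objective_2) scores for five candidate designs.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (5.0, 5.0)]
front = pareto_front(candidates)
print(front)  # indices of the non-dominated candidates
```

In this toy set, the third candidate is dominated by the second and the fifth by the first, so three candidates remain, each representing a different trade-off between the two objectives.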

We first present the necessary background on multi-objective optimization in subsection 2.1 and related work on compositional sampling with EBMs in subsection 2.2, leading to pcEBM, described in subsection 2.3. In section 3 we include an empirical evaluation and discussion, before concluding in section 4.

2 Background and method

Problem setup. Antibodies are composed of two chains of amino acids (AA), which can be represented as sequences of characters. Each AA comes from an alphabet of 20 characters, and the two chains typically have a combined length of $L \sim 250$. We will denote the sequences as $x = (x_1, \dots, x_L)$, where $x_l \in \{1, \dots, 20\}$ corresponds to the AA type at position $l$. For each of those sequences, we measure $m$ properties, such that $f_i : \mathbb{R}^L \to \mathbb{R}$ for all $i \in \{1, \dots, m\}$.
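The sequence representation above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the alphabetical ordering of the 20 one-letter amino acid codes, the helper names `encode` and `decode`, the example fragment, and the toy property `count_charged` (a stand-in for a measured property $f_i$) are all hypothetical.

```python
from typing import List

# Assumption: indices 1..20 follow the alphabetical order of the
# standard one-letter amino acid codes; the paper does not fix an order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
INDEX_TO_AA = {i: aa for aa, i in AA_TO_INDEX.items()}


def encode(sequence: str) -> List[int]:
    """Map a string of one-letter codes to x = (x_1, ..., x_L)."""
    return [AA_TO_INDEX[aa] for aa in sequence]


def decode(indices: List[int]) -> str:
    """Inverse of `encode`."""
    return "".join(INDEX_TO_AA[i] for i in indices)


def count_charged(x: List[int]) -> float:
    """Toy stand-in for a property f_i: number of D, E, K, or R residues."""
    charged = {AA_TO_INDEX[aa] for aa in "DEKR"}
    return float(sum(1 for x_l in x if x_l in charged))


# Hypothetical short fragment; real antibody sequences have L of about 250.
x = encode("EVQLVESGG")
print(x, decode(x), count_charged(x))
```

A real property such as affinity or nonspecificity would be measured experimentally or predicted by a learned model rather than computed by a simple count.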

Our goal is to generate new sequences $x^*$ with preferred values for each of the $m$ properties. Since we cannot fully satisfy all of the properties/objectives simultaneously (they may be conflicting with
