A Pareto-optimal compositional energy-based model
for sampling and optimization of protein sequences
Nataša Tagasovska1, Nathan C. Frey1, Andreas Loukas1, Isidro Hötzel2, Julien
Lafrance-Vanasse2, Ryan Lewis Kelly2, Yan Wu2, Arvind Rajpal2, Richard Bonneau1,
Kyunghyun Cho1,3,4,5, Stephen Ra1, and Vladimir Gligorijević1
1Prescient Design, Genentech
2Antibody Engineering, Genentech
3Department of Computer Science, Courant Institute of Mathematical Sciences, New York University
4Center for Data Science, New York University
5CIFAR Fellow
natasa.tagasovska@roche.com
Abstract
Deep generative models have emerged as a popular machine learning-based approach
for inverse design problems in the life sciences. However, these problems often
require sampling new designs that satisfy multiple properties of interest in
addition to learning the data distribution. This multi-objective optimization becomes
more challenging when properties are independent or orthogonal to each other.
In this work, we propose a Pareto-compositional energy-based model (pcEBM),
a framework that uses multiple gradient descent for sampling new designs that
adhere to various constraints in optimizing distinct properties. We demonstrate
its ability to learn non-convex Pareto fronts and generate sequences that
simultaneously satisfy multiple desired properties across a series of real-world antibody
design tasks.
1 Introduction
Generative models have shown promise across various applications in the life sciences for generating
chemically and physically plausible designs and in accelerating the process of scientific discovery.
Part of this adoption can be attributed to the convincing examples created by generative adversarial
networks (GANs) [13], variational autoencoders (VAEs) [23], energy-based models (EBMs) [8] and,
more recently, diffusion models [35; 32; 5]. However, there are far fewer success stories in real-world
industry applications [6; 15]. Some reasons include an overrepresentation of image datasets; a lack of
evaluation protocols and metrics for synthetic data [37; 33]; challenges around controllable generation
and training, for example with GANs; and challenges in generating samples that differ from those
seen during training [22; 2; 38]. Taken together, these limit the applicability of
generative modeling for real-world use cases.
Our work is motivated by these limitations and, in particular, by the need to accelerate the development
and discovery of new molecules, namely therapeutic antibodies. Though several generative models have
already been proposed for these purposes [10; 12; 31], generating new samples without guidance or control
does not guarantee downstream success of the proposed molecules. In practice, each molecule has to
satisfy multiple properties. For therapeutic antibodies, these could include binding
affinity, polyreactivity, and viscosity [1; 20; 39]. Failure to account for these and other properties
can lead to serious complications later on during scale-up, manufacturing, and clinical trials, and
the optimization of straightforward objectives does not necessarily translate into progress in the
laboratory [28].
Preprint. Under review.
arXiv:2210.10838v1 [cs.LG] 19 Oct 2022
Figure 1: Example output of the proposed Pareto-optimal energy-based sampler (pcEBM) compared
to a naive multi-objective sampling (cEBM). The green marker denotes the starting sequence; each
point in the plot corresponds to a modified design of that starting point, aiming at improving affinity
and nonspecificity (BV-ELISA or BV score). pcEBM introduces samples along a Pareto front, with
candidates minimizing both objectives simultaneously, while candidates generated by the cEBM
minimize only one objective.
Motivated by this challenge, we propose a new EBM for antibody design that simultaneously takes
into account multiple properties that an antibody has to satisfy. The adherence to multiple properties is
a challenge in itself, as it often involves optimizing multiple conflicting objectives. In antibodies, for
instance, optimization of binding affinity alone may come at the expense of developability properties,
parameters that govern the likelihood of success of a molecule during manufacturing and quality
control.
In general, when optimizing multiple conflicting properties, it is often impossible to find a single sample
that satisfies all of the objectives simultaneously [21]. We argue that, instead, one should aim to
propose a set of diverse data points from the Pareto front [3] that correspond to different trade-offs
between the objective functions. Doing so gives a practitioner a global perspective on the optimal
trade-offs between objectives, so that they can select a molecule according to their preference. To
achieve Pareto optimality, we rely on recent advances in multi-objective optimization (MOO) [34; 27] as
well as on compositional sampling with EBMs [8] to build a Pareto-optimal compositional EBM (pcEBM).
Figure 1 illustrates both why we need pcEBM and what we achieve with it when optimizing the design of
an initial antibody sequence.
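To make the notion of a Pareto front concrete, the following is a minimal sketch of selecting the non-dominated candidates from a set of scored designs. It is an illustration only and not the pcEBM sampler: the function name `pareto_mask`, the toy score values, and the convention that both objectives are minimized are assumptions made for this example.

```python
import numpy as np


def pareto_mask(scores: np.ndarray) -> np.ndarray:
    """Return a boolean mask of non-dominated rows (all objectives minimized).

    Row a dominates row b when a <= b in every objective and a < b in at
    least one objective.
    """
    n = scores.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Rows that are no worse than row i everywhere and strictly better somewhere.
        no_worse = np.all(scores <= scores[i], axis=1)
        strictly_better = np.any(scores < scores[i], axis=1)
        if np.any(no_worse & strictly_better):
            mask[i] = False
    return mask


# Hypothetical two-objective scores for five candidate designs
# (toy numbers, lower is better for both objectives).
candidates = np.array([
    [1.0, 5.0],
    [2.0, 3.0],
    [3.0, 4.0],  # dominated by [2.0, 3.0]
    [4.0, 1.0],
    [5.0, 5.0],  # dominated by [1.0, 5.0], among others
])
front = candidates[pareto_mask(candidates)]
```

In this toy set, three of the five candidates are non-dominated; each of them represents a different trade-off, and none can be improved in one objective without worsening the other.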
We first present the necessary background on multi-objective optimization in subsection 2.1 and
related work on compositional sampling with EBMs in subsection 2.2, leading to pcEBM, described in
subsection 2.3. In section 3 we include an empirical evaluation and discussion, before concluding in
section 4.
2 Background and method
Problem setup. Antibodies are composed of two chains of amino acids (AA), which can be represented
as sequences of characters. Each AA comes from an alphabet of 20 characters, and the two chains
typically have a combined length of $L \approx 250$. We will denote the sequences as
$x = (x_1, \dots, x_L)$, where $x_l \in \{1, \dots, 20\}$ corresponds to the AA type at position $l$.
For each of those sequences, we measure $m$ properties, such that $f_i : \mathbb{R}^L \to \mathbb{R}$
for all $i \in \{1, \dots, m\}$.
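As a small illustration of this notation, the sketch below encodes a sequence as a vector of integer AA types and evaluates one property on it. The alphabet ordering, the helper names `encode` and `toy_property`, the example fragment, and the property itself (fraction of residues from a chosen hydrophobic set) are all assumptions introduced for illustration; they are not the properties measured in this work.

```python
import numpy as np

# The 20 standard amino acids in one-letter code; the ordering is an arbitrary choice.
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# 1-based indices, matching x_l in {1, ..., 20}.
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AA_ALPHABET)}


def encode(sequence: str) -> np.ndarray:
    """Map a string of one-letter AA codes to x = (x_1, ..., x_L)."""
    return np.array([AA_TO_INDEX[aa] for aa in sequence], dtype=np.int64)


def toy_property(x: np.ndarray) -> float:
    """Hypothetical f_i: fraction of residues in an illustrative hydrophobic set."""
    hydrophobic = {AA_TO_INDEX[aa] for aa in "AILMFVW"}
    return float(np.mean([int(v) in hydrophobic for v in x]))


# Short hypothetical fragment, far shorter than a full antibody sequence.
x = encode("EVQLVESGG")
value = toy_property(x)
```

With $m$ such functions, each candidate sequence maps to a point in $\mathbb{R}^m$, which is the space in which trade-offs between properties are compared.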
Our goal is to generate new sequences $x$ with preferred values for each of the $m$ properties. Since
we cannot fully satisfy all of the properties/objectives simultaneously (they may be conflicting with