Conditional Neural Processes for Molecules
Miguel Garcia-Ortegon
DPMMS
University of Cambridge
Wilberforce Rd, Cambridge, UK
mg770@cam.ac.uk
Andreas Bender
Dept. of Chemistry
University of Cambridge
Lensfield Rd, Cambridge, UK
ab454@cam.ac.uk
Sergio Bacallado
DPMMS
University of Cambridge
Wilberforce Rd, Cambridge, UK
sb2116@cam.ac.uk
Abstract
Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian
Processes (GPs). They are adept at modelling data consisting of few observations of many related
functions on the same input space and are trained by minimizing a variational objective, which is
computationally much less expensive than the Bayesian updating required by GPs. So far, most
studies of NPs have focused on low-dimensional datasets which are not representative of realistic
transfer learning tasks. Drug discovery is one application area that is characterized by datasets
consisting of many chemical properties or functions which are sparsely observed, yet depend on
shared features or representations of the molecular inputs. This paper applies the conditional neural
process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs
show competitive performance in few-shot learning tasks relative to supervised learning baselines
common in chemoinformatics, as well as to an alternative model for transfer learning based on pre-
training and refining neural network regressors. We present a Bayesian optimization experiment which
showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty
quantification.
1 Introduction
1.1 Learning from sparse chemical datasets
Recent years have seen an explosion of novel machine learning (ML) methods for molecular tasks, often relying on
large neural networks that require vast amounts of labelled data. The development of these models has been fueled
by an expectation that ML will greatly accelerate drug discovery [1, 2]. Unfortunately, their real-world applicability
is hindered by the sparsity of chemical datasets, which comprise many molecular functions with a few observations
each. It is estimated that in-house datasets in the pharmaceutical industry are less than 1% complete, whereas ChEMBL
is only 0.05% complete [3]. In order to take advantage of real-world chemical datasets, we require models that are able
to transfer information across separate functions, even if annotated on non-overlapping molecules, and can make
predictions on new functions with very few observed labels. This setting, known as meta-learning, could be used to
frame and make an impact on problems in many areas of computer-aided drug design, including virtual screening, data
imputation, quantitative structure-activity relationships (QSAR), Bayesian optimization or bioactivity fingerprinting,
among others.
Neural processes are a novel family of models that show promise in meta-learning but have so far only been tested on
toy low-dimensional datasets. In this paper, we evaluate the performance of the CNP in several molecular tasks using
high-dimensional molecular representations.
1.2 Conditional Neural Processes (CNPs)
Consider a dataset consisting of observations of real-valued functions $f_1, \ldots, f_n$ on the same input space $\mathcal{X}$. Each function $f_i$ is observed at a set of $o_i$ input points $x^i_O \in \mathcal{X}^{o_i}$; we define $y^i_O = (f_i(x^i_{O,1}), \ldots, f_i(x^i_{O,o_i}))$. Let $f$ be a test function, $(x_C, y_C)$ be a vector of $c$ context points and the values of $f$ on these inputs, and $(x_T, y_T)$ be a vector of $t$ target points and the values of $f$ on these inputs. A neural process (NP) [4, 5] aims to describe the predictive
distribution $p(y_T \mid x_T, x_C, y_C)$. This is done by mapping $(x_C, y_C, x_T)$ through a parametric function, which is trained on the data $(x^i_O, y^i_O)_{i=1,\ldots,n}$. In particular, we model the predictive distribution with a product measure:

$$q_\theta(y_T \mid x_T, x_C, y_C) = \prod_{j=1}^{t} \mathcal{N}\!\left(y_{T,j};\, \mu_\theta(x_{T,j}),\, \sigma^2_\theta(x_{T,j})\right).$$
The mean and variance, $\mu_\theta(x)$ and $\sigma^2_\theta(x)$, of the predictive distribution at target input $x$ are obtained through the following mapping:

$$r_j = h_\theta(x_{C,j}, y_{C,j}) \quad \text{for } j = 1, \ldots, c \qquad \text{(encoding)}$$
$$r = r_1 \oplus \cdots \oplus r_c \qquad \text{(aggregation)}$$
$$(\mu_\theta(x), \sigma^2_\theta(x)) = g_\theta(x, r) \quad \text{for all } x \in \mathcal{X} \qquad \text{(decoding)}$$

where $h_\theta$ and $g_\theta$ are neural networks, $\oplus$ is a commutative operation and $r$ is a global representation of the entire context data. This architecture ensures that the predictive distribution is invariant to permutations of the context and target points, respectively.
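To make the mapping concrete, a minimal sketch of this architecture in PyTorch follows; mean aggregation, ReLU activations, the layer widths and the softplus variance parameterization are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class CNP(nn.Module):
    """Minimal conditional neural process: encoder h, commutative aggregation, decoder g."""

    def __init__(self, x_dim: int, y_dim: int = 1, r_dim: int = 128, hidden: int = 128):
        super().__init__()
        # Encoder h_theta: maps each context pair (x_{C,j}, y_{C,j}) to a representation r_j.
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, r_dim),
        )
        # Decoder g_theta: maps (x, r) to the predictive mean and variance at x.
        self.decoder = nn.Sequential(
            nn.Linear(x_dim + r_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * y_dim),
        )

    def forward(self, x_ctx, y_ctx, x_tgt):
        # Encoding: one representation per context point, shape (c, r_dim).
        r_j = self.encoder(torch.cat([x_ctx, y_ctx], dim=-1))
        # Aggregation: the mean is commutative, so r is invariant to
        # permutations of the context set; broadcast it to every target.
        r = r_j.mean(dim=0, keepdim=True).expand(x_tgt.shape[0], -1)
        # Decoding: predictive mean and (positive) variance at each target input.
        mu, raw_var = self.decoder(torch.cat([x_tgt, r], dim=-1)).chunk(2, dim=-1)
        sigma2 = nn.functional.softplus(raw_var) + 1e-6
        return mu, sigma2
```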
The parameters $\theta$ of the encoder and decoder are trained by backpropagation using the data $(x^i_O, y^i_O)_{i=1,\ldots,n}$. Conditional NPs (CNPs) [4] minimise a particularly simple objective, the expected negative log-likelihood of the targets:

$$\mathbb{E}\left[ -\frac{1}{n} \sum_{i=1}^{n} \log q_\theta(y^i_T \mid x^i_T, x^i_C, y^i_C) \right],$$
where the expectation is taken with respect to a random partition of the observations $(x^i_O, y^i_O)$ for function $f_i$ into a set of context points $(x^i_C, y^i_C)$ and target points $(x^i_T, y^i_T)$.
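A sketch of one training step under this objective, assuming a model with the interface of the CNP sketch above; drawing the context-set size uniformly at random is an illustrative choice, not necessarily the paper's sampling scheme:

```python
import torch


def cnp_training_step(model, optimizer, x_obs, y_obs):
    """One gradient step on the observations (x_obs, y_obs) of a single function f_i."""
    n_obs = x_obs.shape[0]  # assumes at least two observed points
    # Random partition of the observed points into context and target sets.
    perm = torch.randperm(n_obs)
    n_ctx = torch.randint(1, n_obs, (1,)).item()
    ctx, tgt = perm[:n_ctx], perm[n_ctx:]
    # Negative log-likelihood of the targets under the factorized Gaussian q_theta.
    mu, sigma2 = model(x_obs[ctx], y_obs[ctx], x_obs[tgt])
    loss = -torch.distributions.Normal(mu, sigma2.sqrt()).log_prob(y_obs[tgt]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```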
This objective function does not explicitly regularize the predictive distribution $q_\theta$, so when the model is overparametrized, the objective can diverge and the variance parameters $\sigma^2(\cdot, r)$ can underestimate uncertainty. Latent NPs (LNPs) [5] avoid this problem by maximizing an approximate Evidence Lower Bound, derived through a more conventional variational inference approach.
So far, NP models have been evaluated in low-dimensional settings, where they excel at few-shot learning. Here, we
will analyze their performance on molecules represented by high-dimensional chemical fingerprints.
1.3 The DOCKSTRING dataset
The DOCKSTRING dataset [6] is a molecular dataset for benchmarking ML models. It comprises more than 15
million docking scores for 58 protein targets and 260k molecules. Targets were chosen to be medically relevant and
represent a variety of protein families, and molecules were curated from PubChem and ChEMBL to be representative
of chemical series in drug discovery projects.
Each molecule in the DOCKSTRING dataset is annotated with docking scores for all 58 protein targets, which makes it especially suitable for designing transfer learning benchmark tasks. In this paper, we sample a small subset of it to evaluate
regression and transfer learning by CNPs in the low-data regime.
2 Methods
2.1 Dataset and split
NPs are able to learn across different datapoints within the same function, and across different functions within the
same input space. Therefore, the dataset was split across both the datapoint dimension and the function dimension. We
refer to these splits as dtrain, dtest and ftrain, ftest, respectively (Figure 1).
To emulate learning in a low-data regime, we took a small sample of the train and test sets defined in the DOCKSTRING
package. The dtrain set consisted of 2500 molecules from the original train set, and the dtest set consisted of 2500
molecules from the original test set. The original DOCKSTRING sets were split by chemical clusters, which prevents data leakage between dtrain and dtest through close chemical analogues. Our function split was derived from the DOCKSTRING regression task,
using the 5 task targets (ESR2, KIT, PARP1, PGR, F2) as ftest and the other 53 targets as ftrain.
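A minimal sketch of this two-way split with pandas. The file name and column names are hypothetical, assuming a local table with one docking-score column per target and a column marking the package's original train/test membership:

```python
import pandas as pd

# Hypothetical local copy of the DOCKSTRING score table: one row per molecule,
# a SMILES column, one docking-score column per protein target, and a column
# recording the package's cluster-based train/test membership.
scores = pd.read_csv("dockstring_scores.tsv", sep="\t")

# Function split: the 5 regression-task targets form ftest; the other 53 form ftrain.
ftest_targets = ["ESR2", "KIT", "PARP1", "PGR", "F2"]
ftrain_targets = [c for c in scores.columns if c not in ftest_targets + ["smiles", "split"]]

# Datapoint split: 2500 molecules sampled from each of the original cluster-based
# sets, so that chemical analogues do not leak between dtrain and dtest.
dtrain = scores[scores["split"] == "train"].sample(n=2500, random_state=0)
dtest = scores[scores["split"] == "test"].sample(n=2500, random_state=0)
```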
2.2 CNP and benchmark models
A simple CNP was implemented with 3 linear layers in the encoder network, a mean aggregator function and 3
linear layers in the decoder network. We include four benchmarks commonly used in ML for chemoinformatics: a
feed-forward neural network with the same number of layers as the CNP (NN), k-nearest neighbours with $k = 5$ (KNN) and $k = 1$ (FSS, known as fingerprint similarity search in chemoinformatics [7]), and a random forest regressor with
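For reference, a sketch of how these four baselines might be instantiated with scikit-learn; every hyperparameter beyond the stated values of k is an assumption:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

baselines = {
    # Feed-forward network with a depth comparable to the CNP (widths assumed).
    "NN": MLPRegressor(hidden_layer_sizes=(128, 128, 128)),
    # k-nearest neighbours with k = 5; Jaccard distance on binary fingerprints
    # equals one minus Tanimoto similarity, a common chemoinformatics choice
    # assumed here.
    "KNN": KNeighborsRegressor(n_neighbors=5, metric="jaccard"),
    # k = 1 nearest neighbour, i.e. fingerprint similarity search (FSS).
    "FSS": KNeighborsRegressor(n_neighbors=1, metric="jaccard"),
    # Random forest with default hyperparameters (an assumption).
    "RF": RandomForestRegressor(),
}
```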