Conditional Neural Processes for Molecules
Miguel Garcia-Ortegon
DPMMS
University of Cambridge
Wilberforce Rd, Cambridge, UK
mg770@cam.ac.uk
Andreas Bender
Dept. of Chemistry
University of Cambridge
Lensfield Rd, Cambridge, UK
ab454@cam.ac.uk
Sergio Bacallado
DPMMS
University of Cambridge
Wilberforce Rd, Cambridge, UK
sb2116@cam.ac.uk
Abstract
Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian
Processes (GPs). They are adept at modelling data consisting of few observations of many related
functions on the same input space and are trained by minimizing a variational objective, which is
computationally much less expensive than the Bayesian updating required by GPs. So far, most
studies of NPs have focused on low-dimensional datasets which are not representative of realistic
transfer learning tasks. Drug discovery is one application area that is characterized by datasets
consisting of many chemical properties or functions which are sparsely observed, yet depend on
shared features or representations of the molecular inputs. This paper applies the conditional neural
process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs
show competitive performance in few-shot learning tasks relative to supervised learning baselines
common in chemoinformatics, as well as to an alternative model for transfer learning based on pre-
training and refining neural network regressors. We present a Bayesian optimization experiment which
showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty
quantification.
1 Introduction
1.1 Learning from sparse chemical datasets
Recent years have seen an explosion of novel machine learning (ML) methods for molecular tasks, often relying on
large neural networks that require vast amounts of labelled data. The development of these models has been fueled
by an expectation that ML will greatly accelerate drug discovery [1, 2]. Unfortunately, their real-world applicability
is hindered by the sparsity of chemical datasets, which comprise many molecular functions with a few observations
each. It is estimated that in-house datasets in the pharmaceutical industry are less than 1% complete, whereas ChEMBL
is only 0.05% complete [3]. In order to take advantage of real-world chemical datasets, we require models that are able
to transfer information across separate functions, even if annotated on non-overlapping molecules, and can make
predictions on new functions with very few observed labels. This setting, known as meta-learning, could be used to
frame and make an impact on problems in many areas of computer-aided drug design, including virtual screening, data
imputation, quantitative structure-activity relationships (QSAR), Bayesian optimization or bioactivity fingerprinting,
among others.
Neural processes are a novel family of models that show promise in meta-learning but have so far only been tested on
toy low-dimensional datasets. In this paper, we evaluate the performance of the CNP in several molecular tasks using
high-dimensional molecular representations.
1.2 Conditional Neural Processes (CNPs)
Consider a dataset consisting of observations of real-valued functions $f_1, \ldots, f_n$ on the same input space $\mathcal{X}$. Each function $f_i$ is observed at a set of $o_i$ input points $x^i_O \in \mathcal{X}^{o_i}$; we define $y^i_O = (f_i(x^i_{O,1}), \ldots, f_i(x^i_{O,o_i}))$. Let $f$ be a test function, $(x_C, y_C)$ be a vector of $c$ context points and the values of $f$ on these inputs, and $(x_T, y_T)$ be a vector of $t$ target points and the values of $f$ on these inputs. A neural process (NP) [4, 5] aims to describe the predictive
distribution $p(y_T \mid x_T, x_C, y_C)$. This is done by mapping $(x_C, y_C, x_T)$ through a parametric function, which is trained on the data $(x^i_O, y^i_O)_{i=1,\ldots,n}$. In particular, we model the predictive distribution with a product measure:

$$q_\theta(y_T \mid x_T, x_C, y_C) = \prod_{j=1}^{t} \mathcal{N}\!\left(y_{T,j};\, \mu_\theta(x_{T,j}),\, \sigma^2_\theta(x_{T,j})\right).$$
The mean and variance, $\mu_\theta(x)$ and $\sigma^2_\theta(x)$, of the predictive distribution at target input $x$ are obtained through the following mapping:

$$r_j = h_\theta(x_{C,j}, y_{C,j}) \quad \text{for } j = 1, \ldots, c \qquad \text{(encoding)}$$
$$r = r_1 \oplus \cdots \oplus r_c \qquad \text{(aggregation)}$$
$$(\mu_\theta(x), \sigma^2_\theta(x)) = g_\theta(x, r) \quad \text{for all } x \in \mathcal{X} \qquad \text{(decoding)}$$

where $h_\theta$ and $g_\theta$ are neural networks, $\oplus$ is a commutative operation and $r$ is a global representation of the entire context data. This architecture ensures that the predictive distribution is invariant to permutations of the context and target points, respectively.
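To make the mapping concrete, a minimal sketch of this architecture in PyTorch follows; mean aggregation, ReLU activations, the layer widths and the softplus variance parameterization are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class CNP(nn.Module):
    """Minimal conditional neural process: encoder h, commutative aggregation, decoder g."""

    def __init__(self, x_dim: int, y_dim: int = 1, r_dim: int = 128, hidden: int = 128):
        super().__init__()
        # Encoder h_theta: maps each context pair (x_{C,j}, y_{C,j}) to a representation r_j.
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, r_dim),
        )
        # Decoder g_theta: maps (x, r) to the predictive mean and variance at x.
        self.decoder = nn.Sequential(
            nn.Linear(x_dim + r_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * y_dim),
        )

    def forward(self, x_ctx, y_ctx, x_tgt):
        # Encoding: one representation per context point, shape (c, r_dim).
        r_j = self.encoder(torch.cat([x_ctx, y_ctx], dim=-1))
        # Aggregation: the mean is commutative, so r is invariant to
        # permutations of the context set; broadcast it to every target.
        r = r_j.mean(dim=0, keepdim=True).expand(x_tgt.shape[0], -1)
        # Decoding: predictive mean and (positive) variance at each target input.
        mu, raw_var = self.decoder(torch.cat([x_tgt, r], dim=-1)).chunk(2, dim=-1)
        sigma2 = nn.functional.softplus(raw_var) + 1e-6
        return mu, sigma2
```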
The parameters $\theta$ of the encoder and decoder are trained by backpropagation using the data $(x^i_O, y^i_O)_{i=1,\ldots,n}$. Conditional NPs (CNPs) [4] minimise a particularly simple objective, the expected negative log-likelihood of the targets:

$$\mathbb{E}\left[ -\frac{1}{n} \sum_{i=1}^{n} \log q_\theta(y^i_T \mid x^i_T, x^i_C, y^i_C) \right],$$
where the expectation is taken with respect to a random partition of the observations $(x^i_O, y^i_O)$ for function $f_i$ into a set of context points $(x^i_C, y^i_C)$ and target points $(x^i_T, y^i_T)$.
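A sketch of one training step under this objective, assuming a model with the interface of the CNP sketch above; drawing the context-set size uniformly at random is an illustrative choice, not necessarily the paper's sampling scheme:

```python
import torch


def cnp_training_step(model, optimizer, x_obs, y_obs):
    """One gradient step on the observations (x_obs, y_obs) of a single function f_i."""
    n_obs = x_obs.shape[0]  # assumes at least two observed points
    # Random partition of the observed points into context and target sets.
    perm = torch.randperm(n_obs)
    n_ctx = torch.randint(1, n_obs, (1,)).item()
    ctx, tgt = perm[:n_ctx], perm[n_ctx:]
    # Negative log-likelihood of the targets under the factorized Gaussian q_theta.
    mu, sigma2 = model(x_obs[ctx], y_obs[ctx], x_obs[tgt])
    loss = -torch.distributions.Normal(mu, sigma2.sqrt()).log_prob(y_obs[tgt]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```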
This objective function does not explicitly regularize the predictive distribution $q_\theta$, so when the model is overparametrized, the objective can diverge and the variance parameters $\sigma^2(\cdot, r)$ can underestimate uncertainty. Latent NPs (LNPs) [5] avoid this problem by maximizing an approximate Evidence Lower Bound, derived through a more conventional variational inference approach.
So far, NP models have been evaluated in low-dimensional settings, where they excel at few-shot learning. Here, we
will analyze their performance on molecules represented by high-dimensional chemical fingerprints.
1.3 The DOCKSTRING dataset
The DOCKSTRING dataset [6] is a molecular dataset for benchmarking ML models. It comprises more than 15
million docking scores for 58 protein targets and 260k molecules. Targets were chosen to be medically relevant and
represent a variety of protein families, and molecules were curated from PubChem and ChEMBL to be representative
of chemical series in drug discovery projects.
Each molecule in the DOCKSTRING dataset is annotated with docking scores for all 58 protein targets, which makes it especially suitable for designing transfer learning benchmark tasks. In this paper, we sample a small subset of it to evaluate
regression and transfer learning by CNPs in the low-data regime.
2 Methods
2.1 Dataset and split
NPs are able to learn across different datapoints within the same function, and across different functions within the
same input space. Therefore, the dataset was split across both the datapoint dimension and the function dimension. We
refer to these splits as dtrain, dtest and ftrain, ftest, respectively (Figure 1).
To emulate learning in a low-data regime, we took a small sample of the train and test sets defined in the DOCKSTRING
package. The dtrain set consisted of 2500 molecules from the original train set, and the dtest set consisted of 2500
molecules from the original test set. The original DOCKSTRING sets were split by chemical clusters, which prevents data leakage between dtrain and dtest through close chemical analogues. Our function split was derived from the DOCKSTRING regression task,
using the 5 task targets (ESR2, KIT, PARP1, PGR, F2) as ftest and the other 53 targets as ftrain.
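A minimal sketch of this two-way split with pandas. The file name and column names are hypothetical, assuming a local table with one docking-score column per target and a column marking the package's original train/test membership:

```python
import pandas as pd

# Hypothetical local copy of the DOCKSTRING score table: one row per molecule,
# a SMILES column, one docking-score column per protein target, and a column
# recording the package's cluster-based train/test membership.
scores = pd.read_csv("dockstring_scores.tsv", sep="\t")

# Function split: the 5 regression-task targets form ftest; the other 53 form ftrain.
ftest_targets = ["ESR2", "KIT", "PARP1", "PGR", "F2"]
ftrain_targets = [c for c in scores.columns if c not in ftest_targets + ["smiles", "split"]]

# Datapoint split: 2500 molecules sampled from each of the original cluster-based
# sets, so that chemical analogues do not leak between dtrain and dtest.
dtrain = scores[scores["split"] == "train"].sample(n=2500, random_state=0)
dtest = scores[scores["split"] == "test"].sample(n=2500, random_state=0)
```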
2.2 CNP and benchmark models
A simple CNP was implemented with 3 linear layers in the encoder network, a mean aggregator function and 3
linear layers in the decoder network. We include four benchmarks commonly used in ML for chemoinformatics: a
feed-forward neural network with the same number of layers as the CNP (NN), k-nearest neighbours with $k = 5$ (KNN) and $k = 1$ (FSS, known as fingerprint similarity search in chemoinformatics [7]), and a random forest regressor with
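For reference, a sketch of how these four baselines might be instantiated with scikit-learn; every hyperparameter beyond the stated values of k is an assumption:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

baselines = {
    # Feed-forward network with a depth comparable to the CNP (widths assumed).
    "NN": MLPRegressor(hidden_layer_sizes=(128, 128, 128)),
    # k-nearest neighbours with k = 5; Jaccard distance on binary fingerprints
    # equals one minus Tanimoto similarity, a common chemoinformatics choice
    # assumed here.
    "KNN": KNeighborsRegressor(n_neighbors=5, metric="jaccard"),
    # k = 1 nearest neighbour, i.e. fingerprint similarity search (FSS).
    "FSS": KNeighborsRegressor(n_neighbors=1, metric="jaccard"),
    # Random forest with default hyperparameters (an assumption).
    "RF": RandomForestRegressor(),
}
```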