Estimating Boltzmann Averages for Protein Structural Quantities Using Sequential Monte Carlo

Zhaoran Hou and Samuel W.K. Wong
University of Waterloo
October 18, 2022
Abstract
Sequential Monte Carlo (SMC) methods are widely used to draw samples from
intractable target distributions. Particle degeneracy can hinder the use of SMC when
the target distribution is highly constrained or multimodal. As a motivating
application, we consider the problem of sampling protein structures from the Boltzmann
distribution. This paper proposes a general SMC method that propagates multiple
descendants for each particle, followed by resampling to maintain the desired number
of particles. Simulation studies demonstrate the efficacy of the method for tackling the
protein sampling problem. As a real data example, we use our method to estimate the
number of atomic contacts for a key segment of the SARS-CoV-2 viral spike protein.
Key words and phrases: Monte Carlo methods, particle filter, protein structure analysis,
SARS-CoV-2.
1 Introduction
Sequential Monte Carlo (SMC) methods, also known as particle filters, are simulation-based
Monte Carlo algorithms for sampling from a target distribution. SMC originated from
online inference problems in dynamical systems, where observations arrive sequentially and
interest lies in the posterior distribution of hidden state variables (Liu and Chen, 1998);
since then, SMC methods have been used to solve a wide range of practical problems.
This paper proposes an SMC method that generalizes the propagation and resampling strategy
of Fearnhead and Clifford (2003), by constructing a sequence of upsampling and
downsampling steps as we shall subsequently define. Our method is motivated by the problem of
sampling 3-D structures of proteins from the Boltzmann distribution, which is a challenging
task due to atomic interactions and constraints (see Section 2 for scientific background).
Author for correspondence: samuel.wong@uwaterloo.ca
arXiv:2210.14216v1 [stat.ME] 25 Oct 2022
These constraints can exacerbate particle degeneracy when using SMC to obtain samples,
for which our method provides a solution.
We begin with a review of the relevant SMC concepts following Doucet et al. (2001). Assume
we have random variables $(x_0, \ldots, x_T)$ (denoted by $x_{0:T}$) with continuous support $\mathcal{X}^{(T+1)}$,
and we wish to draw samples from the target distribution $p(x_{0:T})$. Let $f_T : \mathcal{X}^{(T+1)} \to \mathbb{R}^{n_{f_T}}$
denote a square integrable function of interest, then its expectation with respect to $p(x_{0:T})$
is given by
$$E_p[f_T(x_{0:T})] = \int f_T(x_{0:T})\, p(x_{0:T})\, dx_{0:T}. \tag{1.1}$$
Since this integration is usually analytically intractable, the goal is to produce a set of
particles $\{x_{0:T}^{(n)}\}_{n=1}^N$ with weights $\{w(x_{0:T}^{(n)})\}_{n=1}^N$ such that an estimate to the integral is given
by
$$E_p[f_T(x_{0:T})] \approx \frac{\sum_{n=1}^N f_T(x_{0:T}^{(n)})\, w(x_{0:T}^{(n)})}{\sum_{n=1}^N w(x_{0:T}^{(n)})}, \tag{1.2}$$
which requires the weights to be proper with respect to $p(x_{0:T})$ (Liu and Chen, 1998; Liu,
2001), i.e.,
$$E\left[f_T(x_{0:T}^{(n)})\, w(x_{0:T}^{(n)})\right] \propto E_p[f_T(x_{0:T})].$$
Often, $p(x_{0:T})$ does not adopt a form from which we can directly sample (or use importance
sampling efficiently) in practice (e.g., Jacquier et al. (2002); Carvalho et al. (2010)). In
this case, a set of auxiliary distributions $\{p_t(x_{0:t})\}_{t=0}^T$ can be introduced, with $p_T(x_{0:T}) = p(x_{0:T})$,
to facilitate sequential sampling (Liu, 2001); note that $p_t(x_{0:t})$ does not need to
equal $\int p(x_{0:T})\, dx_{t+1:T}$ when $t < T$. To then construct $\{x_{0:T}^{(n)}\}_{n=1}^N$, SMC generates particles
according to the auxiliary distributions via a sequence of propagation and resampling steps.
Assume a set of weighted particles $\{x_{0:t-1}^{(n)}\}_{n=1}^N$ has been sampled from $p_{t-1}(x_{0:t-1})$ for $0 < t \leq T$;
for each existing $x_{0:t-1}^{(n)}$, the propagation step samples $x_t$ and appends it to the existing
particle to form $(x_{0:t-1}^{(n)}, x_t)$ as a weighted sample from $p_t(x_{0:t})$. Afterwards, a resampling step
may be done to preserve a set of more evenly-weighted particles. Distinct particles with more
evenly-distributed weights are desired to better represent the target distribution and thus
reduce the Monte Carlo variance of the estimates in Equation (1.2).
Sequential importance sampling with resampling (SISR) is a common framework to
implement propagation with the help of importance distributions $\eta(x_0), \eta(x_1 \mid x_0), \ldots, \eta(x_T \mid x_{0:T-1})$
and resampling (Liu and Chen, 1995,9), as summarized in Algorithm 1. To briefly
note some key features of SISR, Step 1 samples one descendant for each particle and Step 3
resamples from the propagated particles. If Step 3 is omitted, the SISR framework reduces
to sequential importance sampling (SIS). The necessity of Step 3 depends on the importance
weights: if the importance weights are all constant, resampling only reduces the distinction
of the particles and thus increases Monte Carlo variance. However, the importance weights
are usually uneven in practice; then without Step 3, some of the importance weights
evolving in Step 2 may decay to zero along with the propagation, which is known as particle
degeneracy.
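The three SISR steps described above can be sketched in Python on a toy problem. Everything specific here is an assumption made for illustration: the target is a Gaussian random walk ($x_0 \sim N(0,1)$, $x_t \mid x_{t-1} \sim N(x_{t-1}, 1)$) with $p_t$ taken as the joint density of $x_{0:t}$, the importance distributions are wider normals with standard deviation `proposal_sd`, and Step 3 uses multinomial resampling. The function name `sisr` is hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def sisr(n_particles, n_steps, rng, proposal_sd=1.5):
    """SISR for a toy Gaussian random walk target (assumed example)."""
    # Initialization: x_0 ~ eta(x_0) = N(0, proposal_sd^2), weight p_0 / eta.
    paths = [[rng.gauss(0.0, proposal_sd)] for _ in range(n_particles)]
    weights = [normal_pdf(p[0], 0.0, 1.0) / normal_pdf(p[0], 0.0, proposal_sd)
               for p in paths]
    for _ in range(n_steps):
        new_paths, new_weights = [], []
        for path, w in zip(paths, weights):
            # Step 1: sample one descendant from eta(x_t | x_{0:t-1}).
            x_new = rng.gauss(path[-1], proposal_sd)
            # Step 2: multiply by p_t / (p_{t-1} * eta), which for this
            # target reduces to a ratio of two normal densities.
            ratio = (normal_pdf(x_new, path[-1], 1.0)
                     / normal_pdf(x_new, path[-1], proposal_sd))
            new_paths.append(path + [x_new])
            new_weights.append(w * ratio)
        # Step 3: multinomial resampling, after which weights are equal.
        paths = rng.choices(new_paths, weights=new_weights, k=n_particles)
        weights = [1.0] * n_particles
    return paths, weights


paths, weights = sisr(n_particles=2000, n_steps=5, rng=random.Random(1))
# Under the toy target, x_5 is a sum of six N(0, 1) terms, so E[x_5^2] = 6.
second_moment = sum(w * p[-1] ** 2 for p, w in zip(paths, weights)) / sum(weights)
print(second_moment)
```

In this easy toy problem the weights stay well behaved; the degeneracy discussed in the text arises when many proposed descendants receive weights at or near zero.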
Algorithm 1: Sequential importance sampling with resampling
Require: particle size $N$;
Initialization: Sample $\{x_0^{(n)}\}_{n=1}^N$ from $\eta(x_0)$, and set the weight $w(x_0^{(n)}) \propto p_0(x_0^{(n)}) / \eta(x_0^{(n)})$;
for $t = 1, \ldots, T$ do
    Step 1: Sample $\tilde{x}_t$ from $\eta(x_t \mid x_{0:t-1}^{(n)})$ and set $\tilde{x}_{0:t}^{(n)} = (x_{0:t-1}^{(n)}, \tilde{x}_t)$ for each $n$;
    Step 2: Evaluate the weight $w(\tilde{x}_{0:t}^{(n)}) \propto w(x_{0:t-1}^{(n)})\, p_t(\tilde{x}_{0:t}^{(n)}) / (p_{t-1}(x_{0:t-1}^{(n)})\, \eta(\tilde{x}_t \mid x_{0:t-1}^{(n)}))$ for each $n$;
    Step 3: Resample $N$ particles $\{x_{0:t}^{(n)}\}_{n=1}^N$ from $\{\tilde{x}_{0:t}^{(n)}\}_{n=1}^N$ based on $\{w(\tilde{x}_{0:t}^{(n)})\}_{n=1}^N$
    and update the weights $\{w(x_{0:t}^{(n)})\}_{n=1}^N$;
end
Many resampling schemes for Step 3 have been proposed to tackle particle degeneracy and
maintain more even weights. Gordon et al. (1993) adopted multinomial sampling with
i.i.d. draws; Kitagawa (1996) proposed stratified resampling which lines up the importance
weights, divides the interval into equal parts and uniformly samples from each subinterval;
Liu and Chen (1998) proposed residual resampling, which implements random sampling
after retaining copies of current particles based on the weights. Li et al. (2022) summarized
some previous resampling schemes, including the three aforementioned ones, and showed the
equivalence of optimal transport resampling and stratified resampling along with their
optimality in one-dimensional cases. The resampling schemes update the sample weights while
preserving the proper weighting condition, but differ in the resulting Monte Carlo variance.
Thus, schemes that minimize the resampling variance are preferred. Further, we note that
these resampling schemes do not thoroughly solve the particle degeneracy of SISR for all
situations. For example, as seen later in the main application of this paper, the particle
degeneracy encountered when sampling protein backbone segments makes SISR inapplicable:
the particle weights can decay to zero so rapidly that there could be no particles with positive
weights after just a few propagation and resampling steps. To deal with this issue naively,
we would need to exponentially increase $N$ with the length of the protein segment to
guarantee that SISR can successfully complete, which comes with an enormous computational
burden.
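The three resampling schemes named above can be sketched as follows. This is a minimal Python illustration based on the standard textbook descriptions of multinomial, stratified, and residual resampling; the function names are hypothetical, and each function returns the indices of the selected particles (all of which would then carry equal weight).

```python
import bisect
import itertools
import math
import random


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def multinomial_resample(weights, rng):
    """N i.i.d. draws with probabilities proportional to the weights."""
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)


def stratified_resample(weights, rng):
    """One uniform draw from each of N equal subintervals of [0, 1)."""
    n = len(weights)
    cum = list(itertools.accumulate(normalize(weights)))
    cum[-1] = 1.0  # guard against floating-point round-off
    return [bisect.bisect_right(cum, (j + rng.random()) / n) for j in range(n)]


def residual_resample(weights, rng):
    """Keep floor(N * w_i) copies, then draw the remainder at random."""
    n = len(weights)
    scaled = [n * w for w in normalize(weights)]
    counts = [math.floor(s) for s in scaled]
    indices = [i for i, c in enumerate(counts) for _ in range(c)]
    n_left = n - len(indices)
    if n_left > 0:
        residuals = [s - c for s, c in zip(scaled, counts)]
        indices += rng.choices(range(n), weights=residuals, k=n_left)
    return indices


rng = random.Random(0)
example_weights = [0.1, 0.2, 0.3, 0.4]
for scheme in (multinomial_resample, stratified_resample, residual_resample):
    print(scheme.__name__, scheme(example_weights, rng))
```

None of these schemes can help once every propagated particle has zero weight, since `normalize` then has nothing to work with; that is the failure mode described in the paragraph above.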
Fearnhead and Clifford (2003) proposed an SMC algorithm for hidden Markov models
restricted to a finite space $\mathcal{X}$ (i.e., $|\mathcal{X}| = M < \infty$), that circumvents some of the limitations of
SISR. Their method explores every value of $\mathcal{X}$ and produces $M$ descendants during
propagation for each of the $N$ particles. It then resamples $N$ distinct particles from the $MN$ particles
such that the particle weights minimize the expected squared error loss function, and thus
is an optimal resampling scheme. This SMC algorithm has also been adopted in subsequent
research due to its useful features. For example, Zhang et al. (2007) investigated its
application in sampling protein structures with a simplified discrete-state representation for the
positions of the amino acids in a protein; in this context, the weight indicates the
plausibility of a position under the given energy function. Lin et al. (2008) examined the effects of
retaining the unfilled spaces of different shapes and sizes enclosed in the interior of proteins
by adopting optimal resampling when considering two-dimensional lattice models of
protein structures. Fearnhead and Liu (2007) presented a variation on the optimal resampling
by ordering the particles before resampling and illustrated the optimality of the extended
resampling method in terms of minimizing the mean-square error for changepoint models.
Wong et al. (2018) adopted the propagation idea with fixed $M = 100$ (rather than exploring
all possible values) to identify protein structures with low potential energy; their method
omitted resampling and was primarily designed for structure prediction, so the produced
samples are not properly weighted and cannot be used for Monte Carlo integration.
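To make the idea of resampling $N$ distinct particles from a larger weighted set more concrete, here is a hedged Python sketch of a threshold-based scheme in the spirit of the optimal resampling described above. The details are assumptions based on how this scheme is commonly described, not on the text of this paper: a constant $c$ is found such that $\sum_i \min(c\,q_i, 1) = N$ for normalized weights $q_i$, particles with $c\,q_i > 1$ are kept with their weights unchanged, and the rest are thinned by a systematic draw and assigned weight $1/c$. The function name `optimal_resample` is hypothetical.

```python
import random


def optimal_resample(weights, n_keep, rng):
    """Pick n_keep distinct particles; return (indices, new_weights)."""
    total = sum(weights)
    q = [w / total for w in weights]
    if n_keep >= len(q):
        return list(range(len(q))), q
    order = sorted(range(len(q)), key=lambda i: q[i], reverse=True)

    # Count how many of the largest weights exceed the threshold 1 / c.
    # If the k largest are kept, then c = (n_keep - k) / tail, where tail
    # is the total weight of the remaining particles.
    k, tail = 0, 1.0
    while k < n_keep and (n_keep - k) * q[order[k]] > tail:
        tail -= q[order[k]]
        k += 1
    kept, rest, n_draw = order[:k], order[k:], n_keep - k
    if n_draw == 0 or tail <= 0.0:
        return kept, [q[i] for i in kept]
    c = n_draw / tail

    # Systematic draw: particle i is selected with probability c * q_i <= 1,
    # so no particle can be selected twice.
    chosen, cum, target = [], 0.0, rng.random()
    for i in rest:
        cum += c * q[i]
        if cum > target and len(chosen) < n_draw:
            chosen.append(i)
            target += 1.0
    return kept + chosen, [q[i] for i in kept] + [1.0 / c] * len(chosen)


indices, new_weights = optimal_resample(
    [0.5, 0.2, 0.1, 0.1, 0.05, 0.05], 3, random.Random(3))
print(indices, new_weights)
```

In this example the particle with weight 0.5 is always retained with its weight unchanged, while two of the five lighter particles survive with weight 0.25 each, so the total weight is preserved.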
We shall define upsampling to be a propagation step that samples $MN$ descendants from
$N$ particles ($M \geq 1$), and downsampling to be a resampling step which resamples $N$
particles from the $MN$ descendants. An upsampling-downsampling framework combines these
upsampling and downsampling features. The SMC method proposed in this paper, while
motivated by the sampling problem in protein structures, is a generally applicable strategy
for sampling from multivariate continuous distributions to compute Monte Carlo integrals.
It may be especially effective when particle degeneracy cannot be solved by existing
resampling schemes, e.g., for target distributions that are highly constrained or have many sharp
local modes.
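The upsampling and downsampling definitions above can be illustrated with a generic Python sketch of a single step. This is not the construction of Section 3, which is not reproduced here; it is a minimal sketch under stated assumptions: the same toy Gaussian random walk target as a stand-in for $p_t$, a normal importance distribution with standard deviation `proposal_sd`, and a plain multinomial draw as a placeholder downsampling rule. The function name `upsample_downsample_step` is hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def upsample_downsample_step(paths, weights, m, rng, proposal_sd=1.5):
    """One step: N particles -> M*N descendants -> N resampled particles."""
    n = len(paths)
    descendants, desc_weights = [], []
    for path, w in zip(paths, weights):
        for _ in range(m):  # upsampling: M descendants per particle
            x_new = rng.gauss(path[-1], proposal_sd)
            ratio = (normal_pdf(x_new, path[-1], 1.0)
                     / normal_pdf(x_new, path[-1], proposal_sd))
            descendants.append(path + [x_new])
            desc_weights.append(w * ratio)
    # Downsampling: placeholder multinomial draw of N out of M*N descendants.
    picked = rng.choices(range(len(descendants)), weights=desc_weights, k=n)
    return [descendants[i] for i in picked], [1.0] * n


rng = random.Random(2)
n_particles, m, n_steps = 2000, 4, 5
paths = [[rng.gauss(0.0, 1.0)] for _ in range(n_particles)]  # x_0 drawn from p_0
weights = [1.0] * n_particles
for _ in range(n_steps):
    paths, weights = upsample_downsample_step(paths, weights, m, rng)

# Under the toy target, E[x_5^2] = 6.
second_moment = sum(w * p[-1] ** 2 for p, w in zip(paths, weights)) / sum(weights)
print(second_moment)
```

With $M = 1$ the step reduces to Steps 1 to 3 of SISR; larger $M$ gives each particle more chances to produce a descendant with non-negligible weight before the population is cut back to $N$.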
The remainder of the paper is laid out as follows. In Section 2, we introduce the scientific
background of proteins and our goal of estimating Boltzmann averages. In Section 3, we
describe the construction of our SMC method and how to use it for sampling protein
backbone segments. Section 4 presents two simulation studies on the performance of the SMC
method: Section 4.1 investigates the role of the upsample size $M$; Section 4.2 illustrates
the numerical convergence of our SMC estimates. These simulations also demonstrate the
inefficacy of naive importance sampling and SISR for the protein sampling problem. In
Section 5, we apply the proposed SMC method to estimate the number of atomic contacts for
a key segment of the SARS-CoV-2 viral spike protein. In Section 6, we briefly summarize
the paper and its contributions, and discuss some potential future directions. Proofs of the
theorems are provided in the Supplementary Materials.
2 Motivating Application: Estimating Protein Structural Quantities
2.1 Overview of protein structure
Proteins have a crucial role in carrying out biological processes and their functions are
dependent on their 3-D structures. A protein consists of a sequence of amino acids, where
successive amino acids are connected by peptide bonds. Four atoms including N, Cα, C and
O are common to each of the 20 different amino acid types and compose the backbone of
the protein that can be visualized as
· · · - Cα_t(R_t) - C_t(O_t) - N_{t+1} - Cα_{t+1}(R_{t+1}) - · · ·
(atoms or groups in parentheses branch off the preceding backbone atom)
for the amino acids with indices $t$ and $t+1$ in the amino acid sequence. Side chain groups
extend from the Cα atoms (e.g., $R_t$ and $R_{t+1}$ in the visualized backbone structure) and
distinguish 20 different types of amino acids. Proteins generally adopt stable 3-D structures
that are essentially determined by the sequence of the amino acids (Anfinsen, 1973). At the