Estimating Boltzmann Averages for Protein Structural Quantities Using Sequential Monte Carlo


Zhaoran Hou and Samuel W.K. Wong∗

University of Waterloo

October 18, 2022

Abstract

Sequential Monte Carlo (SMC) methods are widely used to draw samples from intractable target distributions. Particle degeneracy can hinder the use of SMC when the target distribution is highly constrained or multimodal. As a motivating application, we consider the problem of sampling protein structures from the Boltzmann distribution. This paper proposes a general SMC method that propagates multiple descendants for each particle, followed by resampling to maintain the desired number of particles. Simulation studies demonstrate the efficacy of the method for tackling the protein sampling problem. As a real data example, we use our method to estimate the number of atomic contacts for a key segment of the SARS-CoV-2 viral spike protein.

Key words and phrases: Monte Carlo methods, particle filter, protein structure analysis, SARS-CoV-2.

1 Introduction

Sequential Monte Carlo (SMC) methods, also known as particle filters, are simulation-based Monte Carlo algorithms for sampling from a target distribution. SMC originated from online inference problems in dynamical systems, where observations arrive sequentially and interest lies in the posterior distribution of hidden state variables (Liu and Chen, 1998); since then, SMC methods have been used to solve a wide range of practical problems.

This paper proposes an SMC method that generalizes the propagation and resampling strategy of Fearnhead and Clifford (2003), by constructing a sequence of upsampling and downsampling steps as we shall subsequently define. Our method is motivated by the problem of sampling 3-D structures of proteins from the Boltzmann distribution, which is a challenging task due to atomic interactions and constraints (see Section 2 for scientific background). These constraints can exacerbate particle degeneracy when using SMC to obtain samples, for which our method provides a solution.

∗Author for correspondence: samuel.wong@uwaterloo.ca

arXiv:2210.14216v1 [stat.ME] 25 Oct 2022

We begin with a review of the relevant SMC concepts following Doucet et al. (2001). Assume we have random variables $(x_0, \ldots, x_T)$ (denoted by $x_{0:T}$) with continuous support $\mathcal{X}^{(T+1)}$, and we wish to draw samples from the target distribution $p(x_{0:T})$. Let $f_T : \mathcal{X}^{(T+1)} \to \mathbb{R}^{n_{f_T}}$ denote a square integrable function of interest; then its expectation with respect to $p(x_{0:T})$ is given by

$$E_p[f_T(x_{0:T})] = \int f_T(x_{0:T}) \, p(x_{0:T}) \, dx_{0:T}. \tag{1.1}$$

Since this integration is usually analytically intractable, the goal is to produce a set of particles $\{x_{0:T}^{(n)}\}_{n=1}^N$ with weights $\{w(x_{0:T}^{(n)})\}_{n=1}^N$ such that an estimate of the integral is given by

$$E_p[f_T(x_{0:T})] \approx \frac{\sum_{n=1}^N f_T(x_{0:T}^{(n)}) \, w(x_{0:T}^{(n)})}{\sum_{n=1}^N w(x_{0:T}^{(n)})}, \tag{1.2}$$

which requires the weights to be proper with respect to $p(x_{0:T})$ (Liu and Chen, 1998; Liu, 2001), i.e.,

$$E\left[f_T(x_{0:T}^{(n)}) \, w(x_{0:T}^{(n)})\right] \propto E_p[f_T(x_{0:T})].$$
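To make Equation (1.2) concrete, the following Python sketch computes the self-normalized weighted estimate on a toy one-dimensional problem. The target (a standard normal known only up to a constant), the wider normal proposal, and the function names `weighted_estimate` and `toy_importance_sample` are assumptions made for illustration and do not come from the paper.

```python
import math
import random


def weighted_estimate(f, particles, weights):
    """Self-normalized estimate: sum_n f(x_n) w_n / sum_n w_n, as in Eq. (1.2)."""
    total = sum(weights)
    return sum(f(x) * w for x, w in zip(particles, weights)) / total


def toy_importance_sample(n, rng):
    """Draw n particles from a N(0, 2^2) proposal, weighted for an N(0, 1) target."""
    particles, weights = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)
        log_target = -0.5 * x * x             # unnormalized log density of N(0, 1)
        log_proposal = -0.5 * (x / 2.0) ** 2  # unnormalized log density of N(0, 4)
        particles.append(x)
        weights.append(math.exp(log_target - log_proposal))
    return particles, weights


rng = random.Random(0)
xs, ws = toy_importance_sample(20000, rng)
# Estimate E[x^2] under the target; the exact value for N(0, 1) is 1.
second_moment = weighted_estimate(lambda x: x * x, xs, ws)
```

Because the estimate divides by the sum of the weights, the weights only need to be known up to a common constant, which is why the proper weighting condition is stated as a proportionality.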

Often, $p(x_{0:T})$ does not adopt a form from which we can directly sample (or use importance sampling efficiently) in practice (e.g., Jacquier et al. (2002); Carvalho et al. (2010)). In this case, a set of auxiliary distributions $\{p_t(x_{0:t})\}_{t=0}^T$ can be introduced, with $p_T(x_{0:T}) = p(x_{0:T})$, to facilitate sequential sampling (Liu, 2001); note that $p_t(x_{0:t})$ does not need to equal $\int p(x_{0:T}) \, dx_{t+1:T}$ when $t < T$. To then construct $\{x_{0:T}^{(n)}\}_{n=1}^N$, SMC generates particles according to the auxiliary distributions via a sequence of propagation and resampling steps. Assume a set of weighted particles $\{x_{0:t-1}^{(n)}\}_{n=1}^N$ has been sampled from $p_{t-1}(x_{0:t-1})$ for $0 < t \le T$; for each existing $x_{0:t-1}^{(n)}$, the propagation step samples $x_t$ and appends it to the existing particle to form $(x_{0:t-1}^{(n)}, x_t)$ as a weighted sample from $p_t(x_{0:t})$. Afterwards, a resampling step may be done to preserve a set of more evenly-weighted particles. Distinct particles with more evenly-distributed weights are desired to better represent the target distribution and thus reduce the Monte Carlo variance of the estimates in Equation (1.2).

Sequential importance sampling with resampling (SISR) is a common framework to implement propagation with the help of importance distributions $\eta(x_0), \eta(x_1 \mid x_0), \ldots, \eta(x_T \mid x_{0:T-1})$ and resampling (Liu and Chen, 1995, 1998), as summarized in Algorithm 1. To briefly note some key features of SISR, Step 1 samples one descendant for each particle and Step 3 resamples from the propagated particles. If Step 3 is omitted, the SISR framework reduces to sequential importance sampling (SIS). The necessity of Step 3 depends on the importance weights: if the importance weights are all equal, resampling only reduces the number of distinct particles and thus increases Monte Carlo variance. However, the importance weights are usually uneven in practice; then without Step 3, some of the importance weights evolving in Step 2 may decay to zero along with the propagation, which is known as particle degeneracy.
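The three steps of Algorithm 1 can be sketched in Python on a toy problem. Everything specific here is an assumption made for illustration: the auxiliary targets are taken to be an unnormalized Gaussian chain, $p_t(x_{0:t}) \propto \prod_{s=0}^{t} e^{-x_s^2/2} \prod_{s=1}^{t} e^{-(x_s - x_{s-1})^2/2}$, the proposals are $\eta(x_0) = N(0, 1)$ and $\eta(x_t \mid x_{0:t-1}) = N(x_{t-1}, 1)$, Step 3 uses multinomial resampling, and the effective sample size is a commonly used diagnostic of weight unevenness rather than a quantity defined in the paper.

```python
import math
import random


def multinomial_resample(particles, weights, rng):
    """Draw len(particles) i.i.d. indices with probability proportional to weight."""
    n = len(particles)
    chosen = rng.choices(range(n), weights=weights, k=n)
    # Copy each path so duplicated particles can later be extended independently.
    return [list(particles[i]) for i in chosen], [1.0 / n] * n


def sisr(n_particles, horizon, rng, resample=True):
    """Toy SISR run; returns the final particle paths and normalized weights."""
    # Initialization: eta(x_0) matches p_0 here, so the initial weights are equal.
    particles = [[rng.gauss(0.0, 1.0)] for _ in range(n_particles)]
    weights = [1.0 / n_particles] * n_particles
    for _ in range(1, horizon + 1):
        for n in range(n_particles):
            # Step 1: propagate one descendant from eta(x_t | x_{t-1}) = N(x_{t-1}, 1).
            x_new = rng.gauss(particles[n][-1], 1.0)
            particles[n].append(x_new)
            # Step 2: for this toy target, p_t / (p_{t-1} * eta) is proportional
            # to exp(-x_t^2 / 2).
            weights[n] *= math.exp(-0.5 * x_new * x_new)
        total = sum(weights)
        weights = [w / total for w in weights]
        if resample:
            # Step 3: resample to even out the weights.
            particles, weights = multinomial_resample(particles, weights, rng)
    return particles, weights


def effective_sample_size(weights):
    """1 / sum of squared normalized weights; equals N when all weights are equal."""
    return 1.0 / sum(w * w for w in weights)


rng = random.Random(1)
paths_sisr, w_sisr = sisr(200, 20, rng, resample=True)
paths_sis, w_sis = sisr(200, 20, rng, resample=False)  # SIS: Step 3 omitted
```

Comparing `effective_sample_size(w_sis)` with `effective_sample_size(w_sisr)` shows how the weights concentrate on a few particles when Step 3 is skipped, which is the degeneracy described above.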

Algorithm 1: Sequential importance sampling with resampling

Require: particle size $N$;

Initialization: Sample $\{x_0^{(n)}\}_{n=1}^N$ from $\eta(x_0)$, and set the weight $w(x_0^{(n)}) \propto p_0(x_0^{(n)}) / \eta(x_0^{(n)})$;

for $t = 1, \ldots, T$ do

    Step 1: Sample $\tilde{x}_t$ from $\eta(x_t \mid x_{0:t-1}^{(n)})$ and set $\tilde{x}_{0:t}^{(n)} = (x_{0:t-1}^{(n)}, \tilde{x}_t)$ for each $n$;

    Step 2: Evaluate the weight $w(\tilde{x}_{0:t}^{(n)}) \propto w(x_{0:t-1}^{(n)}) \, p_t(\tilde{x}_{0:t}^{(n)}) / \big(p_{t-1}(x_{0:t-1}^{(n)}) \, \eta(\tilde{x}_t \mid x_{0:t-1}^{(n)})\big)$ for each $n$;

    Step 3: Resample $N$ particles $\{x_{0:t}^{(n)}\}_{n=1}^N$ from $\{\tilde{x}_{0:t}^{(n)}\}_{n=1}^N$ based on $\{w(\tilde{x}_{0:t}^{(n)})\}_{n=1}^N$ and update the weights $\{w(x_{0:t}^{(n)})\}_{n=1}^N$;

end

Many resampling schemes for Step 3 have been proposed to tackle particle degeneracy and maintain more even weights. Gordon et al. (1993) adopted multinomial sampling with i.i.d. draws; Kitagawa (1996) proposed stratified resampling which lines up the importance weights, divides the interval into equal parts and uniformly samples from each subinterval; Liu and Chen (1998) proposed residual resampling, which implements random sampling after retaining copies of current particles based on the weights. Li et al. (2022) summarized some previous resampling schemes, including the three aforementioned ones, and showed the equivalency of optimal transport resampling and stratified resampling along with their optimality in one dimensional cases. The resampling schemes update the sample weights while preserving the proper weighting condition, but differ in the resulting Monte Carlo variance. Thus, schemes that minimize the resampling variance are preferred. Further, we note that these resampling schemes do not thoroughly solve the particle degeneracy of SISR for all situations. For example, as seen later in the main application of this paper, the particle degeneracy encountered when sampling protein backbone segments makes SISR inapplicable: the particle weights can decay to zero so rapidly that there could be no particles with positive weights after just a few propagation and resampling steps. To deal with this issue naively, we would need to exponentially increase $N$ with the length of the protein segment to guarantee that SISR can successfully complete, which comes with an enormous computational burden.
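The three classical schemes mentioned above can be sketched as index-selection routines in Python. This is a minimal illustration following the one-line descriptions in the text, not code from the cited papers; the function names and the convention of returning particle indices (with every selected particle then carrying weight $1/n$) are assumptions.

```python
import random


def multinomial_resample(weights, n, rng):
    """n i.i.d. draws with probability proportional to the weights."""
    return rng.choices(range(len(weights)), weights=weights, k=n)


def stratified_resample(weights, n, rng):
    """Split [0, 1) into n equal subintervals and draw one uniform point in each."""
    total = sum(weights)
    indices, cumulative, i = [], weights[0] / total, 0
    for k in range(n):
        u = (k + rng.random()) / n
        # Advance to the particle whose cumulative-weight interval contains u.
        while u > cumulative and i < len(weights) - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices


def residual_resample(weights, n, rng):
    """Keep floor(n * w_i) copies of each particle, then draw the remainder at random."""
    total = sum(weights)
    expected = [n * w / total for w in weights]
    indices = []
    for i, e in enumerate(expected):
        indices.extend([i] * int(e))
    residuals = [e - int(e) for e in expected]
    remaining = n - len(indices)
    if remaining > 0:
        # The leftover slots are filled in proportion to the fractional parts.
        indices.extend(rng.choices(range(len(weights)), weights=residuals, k=remaining))
    return indices


rng = random.Random(0)
w = [4.0, 3.0, 2.0, 1.0]
idx_multinomial = multinomial_resample(w, 10, rng)
idx_stratified = stratified_resample(w, 10, rng)
idx_residual = residual_resample(w, 10, rng)
```

With these weights and $n = 10$, the expected copy counts are whole numbers, so the stratified and residual schemes return each particle exactly in proportion to its weight, while the multinomial counts fluctuate from run to run. This is one way to see the differences in resampling variance.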

Fearnhead and Clifford (2003) proposed an SMC algorithm for hidden Markov models restricted to a finite space $\mathcal{X}$ (i.e., $|\mathcal{X}| = M < \infty$), that circumvents some of the limitations of SISR. Their method explores every value of $\mathcal{X}$ and produces $M$ descendants during propagation for each of the $N$ particles. It then resamples $N$ distinct particles from the $MN$ particles such that the particle weights minimize the expected squared error loss function, and thus is an optimal resampling scheme. This SMC algorithm has also been adopted in subsequent research due to its useful features. For example, Zhang et al. (2007) investigated its application in sampling protein structures with a simplified discrete-state representation for the positions of the amino acids in a protein; in this context, the weight indicates the plausibility of a position under the given energy function. Lin et al. (2008) examined the effects of retaining the unfilled spaces of different shapes and sizes enclosed in the interior of proteins by adopting optimal resampling when considering two-dimensional lattice models of protein structures. Fearnhead and Liu (2007) presented a variation on the optimal resampling by ordering the particles before resampling and illustrated the optimality of the extended resampling method in terms of minimizing the mean-square error for changepoint models. Wong et al. (2018) adopted the propagation idea with fixed $M = 100$ (rather than exploring all possible values) to identify protein structures with low potential energy; their method omitted resampling and was primarily designed for structure prediction, so the produced samples are not properly weighted and cannot be used for Monte Carlo integration.

We shall define upsampling to be a propagation step that samples $MN$ descendants from $N$ particles ($M \ge 1$), and downsampling to be a resampling step which resamples $N$ particles from the $MN$ descendants. An upsampling-downsampling framework combines these upsampling and downsampling features. The SMC method proposed in this paper, while motivated by the sampling problem in protein structures, is a generally applicable strategy for sampling from multivariate continuous distributions to compute Monte Carlo integrals. It may be especially effective when particle degeneracy cannot be solved by existing resampling schemes, e.g., for target distributions that are highly constrained or have many sharp local modes.
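The upsampling and downsampling definitions above can be sketched generically in Python. This is only an illustration of the two definitions and not the method constructed in Section 3: the toy Gaussian-chain target, the random-walk proposal, the equal split of a parent's weight among its $M$ descendants, and the use of plain multinomial draws as the downsampling rule are all assumptions, and the function names are hypothetical.

```python
import math
import random


def upsample(particles, weights, m, rng):
    """Propagate m descendants per particle, giving m * N weighted candidates."""
    candidates, cand_weights = [], []
    for path, w in zip(particles, weights):
        for _ in range(m):
            x_new = rng.gauss(path[-1], 1.0)  # assumed proposal N(x_{t-1}, 1)
            candidates.append(path + [x_new])
            # The parent weight is split equally among its m descendants; the toy
            # incremental weight is proportional to exp(-x_t^2 / 2).
            cand_weights.append(w / m * math.exp(-0.5 * x_new * x_new))
    return candidates, cand_weights


def downsample(candidates, cand_weights, n, rng):
    """Reduce the m * N candidates back to N particles (multinomial stand-in)."""
    chosen = rng.choices(range(len(candidates)), weights=cand_weights, k=n)
    return [candidates[i] for i in chosen], [1.0 / n] * n


def run(n, m, horizon, rng):
    """Alternate upsampling and downsampling for a fixed number of steps."""
    particles = [[rng.gauss(0.0, 1.0)] for _ in range(n)]
    weights = [1.0 / n] * n
    for _ in range(horizon):
        candidates, cand_weights = upsample(particles, weights, m, rng)
        particles, weights = downsample(candidates, cand_weights, n, rng)
    return particles, weights


particles, weights = run(50, 4, 5, random.Random(2))
```

Setting `m = 1` recovers the one-descendant propagation of SISR, so `m` controls how many candidate extensions compete for each of the `n` retained slots.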

The remainder of the paper is laid out as follows. In Section 2, we introduce the scientific background of proteins and our goal of estimating Boltzmann averages. In Section 3, we describe the construction of our SMC method and how to use it for sampling protein backbone segments. Section 4 presents two simulation studies on the performance of the SMC method: Section 4.1 investigates the role of the upsample size $M$; Section 4.2 illustrates the numerical convergence of our SMC estimates. These simulations also demonstrate the inefficacy of naive importance sampling and SISR for the protein sampling problem. In Section 5, we apply the proposed SMC method to estimate the number of atomic contacts for a key segment of the SARS-CoV-2 viral spike protein. In Section 6, we briefly summarize the paper and its contributions, and discuss some potential future directions. Proofs of the theorems are provided in the Supplementary Materials.

2 Motivating Application: Estimating Protein Structural Quantities

2.1 Overview of protein structure

Proteins have a crucial role in carrying out biological processes and their functions are dependent on their 3-D structures. A protein consists of a sequence of amino acids, where successive amino acids are connected by peptide bonds. Four atoms including N, Cα, C and O are common to each of the 20 different amino acid types and compose the backbone of the protein that can be visualized as

[Diagram: backbone chain passing through the Cα atom of amino acid $t$, then the N and Cα atoms of amino acid $t+1$, with side chains $R_t$ and $R_{t+1}$ attached to the Cα atoms]

for the amino acids with indices $t$ and $t+1$ in the amino acid sequence. Side chain groups extend from the Cα atoms (e.g., $R_t$ and $R_{t+1}$ in the visualized backbone structure) and distinguish 20 different types of amino acids. Proteins generally adopt stable 3-D structures that are essentially determined by the sequence of the amino acids (Anfinsen, 1973). At the
