2. Modern challenges for Bayesian computation: Massive Data
Consider data $y$ collected on $N$ independent items, so that $y = \{y_1, \ldots, y_N\} \in \mathcal{X}^N$, and denote by $f(y \mid \theta)$ the sampling distribution, which depends on the parameter $\theta \in \Theta \subset \mathbb{R}^d$. At each iteration of the MH sampler, one needs to compute $f(y \mid \omega_t) = \prod_{k=1}^{N} f(y_k \mid \omega_t)$, where $\omega_t$ is the proposal in (4). Modern applications often rely on data sets large enough that the repeated calculation of $f(y \mid \omega_t)$ is impractical, if not impossible. It is also not unusual for the data to be too large to store on a single machine, in which case computing the likelihood also requires repeated communication between machines, adding significantly to the computational burden.
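To fix ideas, the following minimal sketch of a random-walk MH sampler (in Python, with a toy $N(\theta,1)$ model and hypothetical helpers log_lik and log_prior standing in for a generic $f$ and $p$) makes the bottleneck explicit: every accept/reject decision requires a full $O(N)$ pass over the data.

```python
import numpy as np

def log_lik(theta, y):
    # Toy N(theta, 1) model standing in for a generic f(y | theta):
    # the log-likelihood is a sum over all N observations.
    return -0.5 * np.sum((y - theta) ** 2)

def log_prior(theta):
    # Illustrative standard normal prior p(theta).
    return -0.5 * theta ** 2

def random_walk_mh(y, n_iter=10_000, step=0.5, theta0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_lik(theta0, y) + log_prior(theta0)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        omega = theta + step * rng.standard_normal()      # proposal, cf. (4)
        lp_omega = log_lik(omega, y) + log_prior(omega)   # O(N) work at every iteration
        if np.log(rng.uniform()) < lp_omega - lp:         # MH accept/reject
            theta, lp = omega, lp_omega
        draws[t] = theta
    return draws
```

With $N$ in the tens or hundreds of millions, the likelihood evaluation inside the loop dominates the cost of every iteration, and it grows further if the data must be fetched from several machines.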
Prompted by the obstacle of large data, computational Bayesians have designed a number of approaches to alleviate the problem. Two general ideas currently stand out in terms of popularity and usage: divide-and-conquer (DAC) strategies and subsampling with minimal loss of information.
2.1. Divide and Conquer
The DAC approach is based on partitioning the sample into a number of sub-samples, called batches, that are analyzed separately on a number of workers (CPUs, GPUs, servers, etc.). After the batch-
specific estimates about the parameter of interest are obtained, the results are combined so that the
analyst recovers a large part of, ideally all, the information that would have been available if the whole
sample were analyzed in the usual way, on a single machine. While this idea seems applicable in a wide
range of scenarios, there are a couple of constraints that restrict its generality. First, the procedure is
computationally effective if it is designed to minimize, preferably eliminate, communication between
the workers before combining the batch-specific results. Second, it is often difficult to produce an accurate assessment of the resulting loss of information at the combining stage.
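In outline, and before turning to specific recombination rules, the workflow can be summarized by the following minimal Python sketch; run_sampler and combine are hypothetical placeholders for the batch-level MCMC and the recombination step discussed next, and the random equal-sized partition is just one convenient choice.

```python
import numpy as np
from multiprocessing import Pool

def split_into_batches(y, J, seed=0):
    # Randomly partition the data into J roughly equal-sized batches.
    idx = np.random.default_rng(seed).permutation(len(y))
    return [y[part] for part in np.array_split(idx, J)]

def run_dac(y, J, run_sampler, combine):
    # Generic DAC skeleton: each batch is sampled independently, with no
    # communication between workers, and the J sets of draws are only
    # brought together at the final combining stage.
    batches = split_into_batches(y, J)
    with Pool(processes=J) as pool:
        sub_draws = pool.map(run_sampler, batches)   # embarrassingly parallel
    return combine(sub_draws)
```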
Some of the first proponents of DAC for MCMC sampling are Neiswanger et al. (2013), Scott et al. (2016), and Wang & Dunson (2013). In their approach, the subposterior distribution corresponding to the $j$th batch is defined as
$$\pi^{(j)}(\theta \mid y^{(j)}) \propto f(y^{(j)} \mid \theta)\,[p(\theta)]^{1/J} \qquad (5)$$
where $f$ and $p$ are as in (1), $y^{(j)}$ is the data assigned to batch $j$, $1 \le j \le J$, and $J$ is the total number of batches. With this choice, one immediately gets that $\prod_{j=1}^{J} \pi^{(j)} \propto \pi(\theta \mid y)$.
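Spelled out, this identity follows from the independence of the items and the fact that the batches partition $y$:
$$\prod_{j=1}^{J} \pi^{(j)}(\theta \mid y^{(j)}) \;\propto\; \prod_{j=1}^{J} [p(\theta)]^{1/J} f(y^{(j)} \mid \theta) \;=\; p(\theta) \prod_{j=1}^{J} f(y^{(j)} \mid \theta) \;=\; p(\theta)\, f(y \mid \theta) \;\propto\; \pi(\theta \mid y).$$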
Both Neiswanger et al. (2013) and Scott et al. (2016) consider ways to combine samples from the subposteriors $\pi^{(j)}(\theta)$, $1 \le j \le J$, in situations in which all posteriors, batch-specific and full-data, are Gaussian or can be approximated by mixtures of Gaussians. In this case, one can show that a suitably weighted average of draws from the $\pi^{(j)}$'s has density $\pi$.
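A minimal sketch of such a weighted-average recombination, in the spirit of the consensus Monte Carlo rule of Scott et al. (2016), is given below; the function name and the assumption that every worker returns the same number of draws are ours.

```python
import numpy as np

def consensus_combine(sub_draws):
    # sub_draws: list of J arrays, each of shape (T, d), holding T draws of a
    # d-dimensional parameter from one subposterior.  Draw t of the combined
    # chain is (sum_j W_j)^{-1} sum_j W_j theta_t^{(j)}, where W_j is the
    # inverse of the j-th subposterior's sample covariance; the weighting is
    # exact when every subposterior is Gaussian.
    weights = [np.linalg.inv(np.atleast_2d(np.cov(draws, rowvar=False)))
               for draws in sub_draws]
    scale = np.linalg.inv(sum(weights))              # (sum_j W_j)^{-1}
    T, d = sub_draws[0].shape
    combined = np.empty((T, d))
    for t in range(T):
        combined[t] = scale @ sum(w @ draws[t] for w, draws in zip(weights, sub_draws))
    return combined
```

When the subposteriors are exactly Gaussian this average is distributed according to the full posterior; otherwise it is only an approximation, which motivates the extensions discussed next.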
The use of the Weierstrass transform for each subposterior density, proposed in Wang & Dunson (2013), extends the range of theoretical validity beyond Gaussian distributions. The authors also establish error bounds between the approximation and the true posterior.
and the true posterior. Nemeth & Sherlock (2018) use a Gaussian process (GP) approximation of each
subposterior. Once again, the Gaussian nature of the approximation makes recombination possible
and relatively straightforward. Limitations of the method are strongly linked with those of GP-based
estimation. For instance, when the subposterior samplers are slow to mix, large MCMC samples may be needed, which in turn makes the calculation of the GP-based approximation very expensive.
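As a schematic illustration of the GP route (not the authors' implementation; scikit-learn's GP regressor is used here as a stand-in, and sub_draws and sub_logdens are hypothetical inputs holding each batch's MCMC draws and the corresponding log-subposterior values):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_subposterior_gps(sub_draws, sub_logdens):
    # Fit one GP per batch to the (MCMC draw, log-subposterior value) pairs.
    gps = []
    for theta, logp in zip(sub_draws, sub_logdens):
        gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(),
                                      normalize_y=True)
        gp.fit(theta, logp)                  # theta: (T, d), logp: (T,)
        gps.append(gp)
    return gps

def approx_log_posterior(gps, theta_new):
    # Since log pi(theta | y) equals sum_j log pi^(j)(theta) up to a constant,
    # the sum of the GP predictive means approximates the full log-posterior.
    return sum(gp.predict(np.atleast_2d(theta_new)) for gp in gps)
```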
The idea of using the values of the subposterior at each MCMC sample is also adopted by Changye & Robert (2019), who propose to define the subposteriors using $\pi^{(j)} \propto \{[p(\theta)]^{1/J} f(y^{(j)} \mid \theta)\}^{\lambda_j}$, where the scale factor $\lambda_j$ is used to control the uncertainty in the subposterior. Alternative ways to define the subposteriors are proposed by Entezari et al. (2018), who use $\pi^{(j)} \propto p(\theta)\,[f(y^{(j)} \mid \theta)]^{J}$. The intuitive idea is to "match" the size of the original sample and the batch-specific one. Their approach has been applied successfully to BART models (Chipman et al. 2010, Pratola 2016).
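To unpack the matching heuristic (a back-of-the-envelope argument, assuming $J$ equal-sized batches of independent observations): $[f(y^{(j)} \mid \theta)]^{J}$ is a product of $J \times (N/J) = N$ likelihood terms, so each subposterior $\pi^{(j)} \propto p(\theta)\,[f(y^{(j)} \mid \theta)]^{J}$ concentrates at roughly the same rate as the full-data posterior $\pi(\theta \mid y)$.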