Importance Sampling Methods for Bayesian
Inference with Partitioned Data
Marc Box (marc.box@protonmail.com)
November 1, 2022
Abstract
This article presents new methodology for sample-based Bayesian in-
ference when data are partitioned and communication between the parts is
expensive, as arises by necessity in the context of “big data” or by choice
in order to take advantage of computational parallelism. The method,
which we call the Laplace enriched multiple importance estimator, uses
new multiple importance sampling techniques to approximate posterior
expectations using samples drawn independently from the local poste-
rior distributions (those conditioned on isolated parts of the data). We
construct Laplace approximations from which additional samples can be
drawn relatively quickly and improve the methods in high-dimensional
estimation. The methods are “embarrassingly parallel”, make no restriction
on the sampling algorithm used (including MCMC) or on the choice of
prior distribution, and do not rely on any assumptions about the posterior
such as normality. The performance of the methods is demonstrated
and compared against some alternatives in experiments with simulated
data.
Keywords: big data; parallel computing; Bayesian inference; Markov chain
Monte Carlo; embarrassingly parallel; federated inference; multiple importance
sampling
1 Introduction
Bayesian sample-based computation is a common approach to Bayesian infer-
ence in non-trivial models where it is infeasible to compute the normalising
constant of the posterior distribution. The focus of this article is on performing
Bayesian computation when data are partitioned and communication between
the parts is expensive or impossible. We present new methodology for Bayesian
inference in this context that does not require combining the parts and that
communicates between the parts only at the end of sampling. Our methods achieve this
with competitive performance and with fewer constraints or assumptions than
some other methods.
Bayesian computation with partitioned data is challenging because the algo-
rithms for generating samples from the posterior typically require many calcula-
tions involving the whole data set; specifically, likelihood evaluations. For exam-
ple, in Markov chain Monte Carlo (MCMC) algorithms such as the Metropolis-
Hastings algorithm, a (typically large) number of dependent samples are gen-
erated from a Markov chain, each constrained by the previous sample and the
data (see e.g. Robert and Casella (2004)). When used to target a posterior distribution π(θ | x) given data x, the probability of accepting proposal θ′ given x and previous sample θ is

$$\min\left\{ \frac{\pi(\theta' \mid x)}{\pi(\theta \mid x)} \, \frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)},\; 1 \right\}, \tag{1}$$

where q is the density of the proposal distribution. The ratio π(θ′ | x)/π(θ | x) (which may be of unnormalised p.d.f.s because the normalising constant cancels) must be evaluated for every proposal θ′, i.e. in every iteration. Even if observations
are conditionally independent given θ, so that likelihoods evaluated using parts
of the data can be multiplied to give the full data likelihood, this can still be a
problem if the necessary data transfer is expensive or impossible.
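To make the cost concrete, here is a minimal sketch (hypothetical model functions `log_prior` and `part_loglik`, not code from the paper) of a single Metropolis-Hastings step with a symmetric random-walk proposal, in which case the q ratio in (1) cancels. The acceptance decision still requires the log-likelihood of every data part at every iteration, so in a distributed setting each iteration would touch every node holding data.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_prior(theta):
    # Hypothetical prior: independent standard normals on each coordinate.
    return -0.5 * np.sum(theta ** 2)

def part_loglik(theta, x_part):
    # Hypothetical per-part log-likelihood (normal location model).
    return -0.5 * np.sum((x_part - theta) ** 2)

def mh_step(theta, data_parts, step=0.1):
    # One random-walk Metropolis step targeting pi(theta | x).  With a
    # symmetric proposal the q ratio in (1) cancels, but the unnormalised
    # posterior still needs the log-likelihood of *every* part at *every*
    # iteration -- the communication cost described in the text.
    proposal = theta + step * rng.standard_normal(theta.shape)
    log_post_current = log_prior(theta) + sum(
        part_loglik(theta, xp) for xp in data_parts)
    log_post_proposal = log_prior(proposal) + sum(
        part_loglik(proposal, xp) for xp in data_parts)
    accept = np.log(rng.uniform()) < log_post_proposal - log_post_current
    return proposal if accept else theta
```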
There are three important data analysis situations where data are parti-
tioned:
1. When a data set is too large to work with in the memory of one computer.
2. When there are several sources or owners of data which are unable or
unwilling to share their data.
3. When there is the possibility of speeding up sampling by running multiple
instances of the sampling algorithm in parallel on separate parts of the
data.
The first situation arises in the context of “big data”: data that must be stored
in a distributed manner with no shared memory for computing. Big data has
become very important in modern science and business because of the possi-
bility of finding patterns not observable on a smaller scale and which may lead
to deeper understanding or provide a competitive edge (Bryant et al. (2008);
Sagiroglu and Sinanc (2013)). The open source Apache Hadoop framework
is widely used for storing and computing with big data on a cluster (Apache
Hadoop (2018); Borthakur (2007)). In the Hadoop file system (HDFS), data are
partitioned into “blocks” and stored across the cluster, in duplicate for resilience
to errors, then processed using parallel computation models such as MapReduce
(Dean and Ghemawat (2008)) and Spark (Zaharia et al. (2010)) which operate
on data using memory local to each block. A major source of inefficiency in
these computations arises when data must be communicated between cluster
nodes (Kalavri and Vlassov (2013); Sarkar et al. (2015)).
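As an illustration of computing on data in place (assumed setup, not code from the paper), the following PySpark sketch evaluates a log-likelihood block by block so that only one small per-block summary, rather than the raw observations, crosses the network:

```python
# Illustrative PySpark sketch: each block's log-likelihood is computed on
# the node holding that block, and only the per-block totals are collected
# back to the driver.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="per-block-loglik")

def block_loglik(iterator, theta=0.0):
    # Hypothetical normal location model evaluated on one block.
    x = np.fromiter(iterator, dtype=float)
    yield -0.5 * np.sum((x - theta) ** 2)

data = sc.parallelize(np.random.normal(size=100_000).tolist(), numSlices=8)
total_loglik = sum(data.mapPartitions(block_loglik).collect())
```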
One possible approach to Bayesian computation in this context is to down-
sample the data to a size that will fit in the local memory of one computer,
but down-sampling a large data set seems to defeat the purpose of collecting it
in the first place. As pointed out in Scott et al. (2016), some large, complex
models genuinely require a large amount of data for robust estimation.
The second situation arises due to data privacy concerns or in meta-analyses.
Inference in this situation is sometimes known as “federated inference” (and the
distributed data, “federated data”) (Xiong et al. (2021); Ma et al. (2021)). Cur-
rent approaches to preserving privacy often rely on the masking of data or the
addition of noise (Dwork (2008); Torra and Navarro-Arribas (2016)), both of which
imply the loss of information. Meta-analyses use statistical procedures such as
mixed effects models to pool the results of primary studies using aggregated
data when there is no access to raw observational data (DerSimonian and Laird
(1986)). Performing inference in global models for pooled data without any
participants having to share their data may open up new possibilities in these
situations.
In the third situation, we may assume there is ample memory for the entire
data set, but computational parallelism is available such as through multiple
CPU cores, with a GPU or array of GPUs (Lee et al. (2010)), or on a cluster,
and the time complexity of the sampling algorithm depends on the number of
data points. In this situation there is an opportunity to generate more samples
in a given time by running the sampler in parallel on subsets of the data. This
may result in estimators with lower bias and variance than would otherwise be
possible.
If the data can be contained in the memory of a single node, another way
of taking advantage of computational parallelism is to run multiple MCMC
chains in parallel. Besides the large number of samples that can be generated
(e.g. Lao et al. (2020)), there is potential for improved convergence and new
adaptive algorithms (Green et al. (2015)). This is a different mode of parallelism
and not the concern of this paper.
Our approach is for each worker node (the cluster node or agent managing
each data part) to run the same sampling algorithm independently on its local
data, resulting in sets of samples from posterior distributions different from
the full data posterior distribution. We regard these local posteriors as impor-
tance proposal distributions or components of a mixture proposal distribution
targeting the posterior (Robert and Casella (2004); Owen (2013)). By correctly
weighting the samples we can construct Monte Carlo estimators of a posterior
expectation that are asymptotically unbiased in the sense of approaching zero
bias in the limit of infinite samples. There are two observations that suggest
importance sampling-based estimation in this context may be fruitful. Firstly,
the local posteriors should be similar to the posterior (so long as the parts of
data are similar in distribution); secondly, the tails of the local posteriors should
be fatter than those of the posterior because they are conditioned on less data
(see e.g. MacKay et al. (2003)). We make use of three importance weighting
strategies which fit into the class of multiple importance sampling (Veach and
Guibas (1995); Hesterberg (1995); Owen (2013); Elvira and Martino (2021)).
Two of these strategies we devised by extending the methods of Veach and
Guibas (1995) to the case where both the target density and the proposal den-
sities are only known up to a constant of proportionality. The third strategy we
believe is novel.
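To fix ideas, the following sketch (hypothetical function names, and a single-proposal simplification rather than the paper's multiple importance sampling estimators) shows how samples drawn from one local posterior can be reweighted towards the full-data posterior: under conditional independence between parts, the prior and the local likelihood cancel in the weight, and the unknown normalising constants drop out after self-normalisation.

```python
import numpy as np

def snis_weights_from_worker(samples_k, k, part_logliks):
    # samples_k: draws from the local posterior of part k.
    # part_logliks[j](theta) returns log p(x_j | theta); evaluating these
    # needs only the sampled theta values, not the raw data, so it can be
    # done by each worker after sampling has finished.
    log_w = np.array([
        sum(ll(theta) for j, ll in enumerate(part_logliks) if j != k)
        for theta in samples_k
    ])
    log_w -= log_w.max()          # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()            # self-normalised importance weights

# A posterior expectation E[h(theta) | x] is then approximated by
#   sum_i weights[i] * h(samples_k[i]).
```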
Importance sampling is known to suffer the curse of dimensionality, leading
to poor performance in high-dimensional models (MacKay et al. (2003)). To
address this we include samples from Laplace approximations to the posterior
to complement the samples received from the workers. These additional samples
can be generated easily without additional iterations of the posterior sampling
algorithm, which are often relatively costly. The approximations are simple
constructions from the pooled samples that provide additional importance pro-
posals which, it is hoped, cover regions of parameter space not covered by the
local posteriors. We consider three ways of doing this, but find that only one of
them is particularly useful in the majority of examples.
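As a rough sketch of what such an enrichment step might look like (a moment-matched normal fitted to the pooled samples, standing in for the Laplace-type approximations described above; names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def normal_enrichment(pooled_samples, n_extra, seed=None):
    # Fit a normal approximation to the pooled local-posterior samples and
    # draw extra proposal samples from it.  Only the existing samples are
    # needed; no further iterations of the posterior sampler are run.
    rng = np.random.default_rng(seed)
    mu = pooled_samples.mean(axis=0)
    cov = np.cov(pooled_samples, rowvar=False)
    extra = rng.multivariate_normal(mu, cov, size=n_extra)
    # The approximation's log-density is needed later as an importance
    # proposal density for the added samples.
    log_q = multivariate_normal(mean=mu, cov=cov).logpdf
    return extra, log_q
```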
The advantages of our methods can be summarised as follows. They ap-
pear to perform relatively well (comparing with some alternative approaches)
in terms of approximating posterior expectations across a range of models, and
in particular for non-normal posteriors. We will provide evidence of this from
experiments with synthetic data. Our methods have no preference as to the
algorithm used by the workers for sampling from their local data, so long as it is
approximately unbiased, and no communication between workers is required until the very end
of the sampling. This means we can perform sampling in an “embarrassingly
parallel” fashion (Herlihy et al. (2020)). In fact, no data (i.e. observations) need
be transmitted between nodes at all (after any initial partitioning of data). This
is an essential requirement in the use case of collaboration between parties who
are unable to share data, and is an advantage to parallel computation on a
cluster where data transfer between nodes is a performance bottleneck. Our
methods can be used without any constraint on the choice of prior distribution.
This is in contrast to some other methods, in which it is necessary that the prior
be amenable to a certain transformation for the methods to be unbiased. This
constrains the choices available for the prior in those methods, which can have
unwanted implications for the analysis, particularly in analyses of small data
sets. There are also no hyperparameters that need to be set or tuned in our
methods.
We make no distributional assumptions about the posterior. The only as-
sumptions we need beyond those implied by the model or the sampling algorithm
are that observations which are held in separate parts are conditionally inde-
pendent of each other given model parameters, that the likelihood function is
computable and the mild assumptions required by importance sampling. Condi-
tional independence of all observations is sufficient but stronger than necessary.
However, whilst we do not explicitly require the data to be partitioned at
random, random partitioning would likely be beneficial for methods
based on importance sampling because it makes the local posteriors more likely
to resemble the posterior. We do not require the size of the data parts to be
equal or the number of samples drawn by each worker to be equal.
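Writing x = (x₁, …, x_J) for the J parts (notation chosen here for illustration), the conditional-independence assumption between parts is what links the local posteriors to the full-data posterior:

$$\pi(\theta \mid x) \;\propto\; p(\theta) \prod_{j=1}^{J} p(x_j \mid \theta), \qquad \pi(\theta \mid x_j) \;\propto\; p(\theta)\, p(x_j \mid \theta), \quad j = 1, \dots, J,$$

so each worker only ever needs to evaluate its own factor p(x_j | θ).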
There are two useful performance diagnostics we propose to use with our
methods. First, we derive an effective sample size in the manner of Kong (1992).
This is a measure of the sampling efficiency lost due to the use of an approx-
imation. Second, we look at the k̂ diagnostic arising in the Pareto smoothed
importance sampling estimator of Vehtari et al. (2015). This is an estimate of the shape
parameter in a generalised Pareto model for the tail of the importance weight
distribution, for which Vehtari et al. (2015) identified a threshold which seems
to be a valuable indicator of poor performance. We find that, together, these
indicate situations where our estimators perform poorly.
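As a rough, simplified illustration of these two diagnostics (not the exact estimators used in the paper), one can compute a Kong-style effective sample size from the self-normalised weights and fit a generalised Pareto distribution to the upper tail of the weights; a fitted shape much above about 0.7 is the commonly cited warning level from Vehtari et al. (2015).

```python
import numpy as np
from scipy.stats import genpareto

def effective_sample_size(weights):
    # Kong (1992)-style ESS of self-normalised importance weights:
    # (sum w)^2 / sum w^2, equal to n when the weights are uniform.
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def tail_shape_estimate(weights, tail_frac=0.2):
    # Crude stand-in for the PSIS k-hat diagnostic: fit a generalised
    # Pareto distribution to the largest importance weights.  The fixed
    # tail fraction here is a simplification of Vehtari et al. (2015).
    w = np.sort(np.asarray(weights, dtype=float))
    tail = w[int(np.ceil((1 - tail_frac) * len(w))):]
    k_hat, _, _ = genpareto.fit(tail - tail[0], floc=0.0)
    return k_hat
```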
As of writing, the problem of Bayesian inference with partitioned data re-
mains an open challenge (Green et al. (2015); Bardenet et al. (2017)), although
there are some notable contributions. Some hierarchical models have a structure
that is particularly amenable to distributed processing, such as the hierarchical
Dirichlet process topic model, for which Newman et al. (2009) devise a dis-
tributed Gibbs sampling algorithm. This approach does not generalise to other
models, however.
A number of methods start from the observation that the posterior density is
proportional to the product of local posterior densities, the product distribution,
under a conditional independence assumption for the data. Scott et al. (2016)
make normal distribution assumptions for the local posteriors, justified by the
Bernstein-von Mises theorem (Van der Vaart (1998)), and pool samples using a
weighted linear combination. Neiswanger et al. (2013) propose three methods,
one of which is similar to Scott et al. (2016) and two of which use kernel density estimators
constructed from the product distribution and sampled from using an indepen-
dent Metropolis-within-Gibbs algorithm. We will describe these methods in
more detail in Section 2.5 because we use them for performance comparisons
with our methods. Huang and Gelman (2005) use a similar approach to Scott
et al. (2016) but provide specific approaches to normal models, linear models
and hierarchical models. Luengo et al. (2015) consider a similar estimator to
Scott et al. (2016) but pool the local posterior estimators rather than individual
samples. This also relies on the Bernstein-von Mises theorem, but they propose
a bias correction which helps when the posterior does not follow a normal dis-
tribution or with small data sets. Luengo et al. (2018) propose more refined
estimators along these lines. An approach related to Neiswanger et al. (2013)
is Wang and Dunson (2013), who use the Weierstrass transform of the local
posterior densities and sample from the product of these using a Gibbs sam-
pling algorithm. This method may perform better when the posterior deviates
from normality, but requires some communication between workers during sam-
pling and involves some hyperparameters. Nemeth and Sherlock (2018) use a
Gaussian process prior on the log of each of the local posterior densities, the
sum of which is a Gaussian process approximation to the log posterior density,
from which they sample using Hamiltonian Monte Carlo. Importance weight-
ing is then used to improve the approximation represented by these samples
(this is a different use of importance sampling from our methods, although the
computation of the unnormalised posterior density is the same).
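For concreteness, a minimal sketch of the consensus-style pooling of Scott et al. (2016) described above (assuming draws are combined index by index, with the commonly used inverse-sample-covariance weights; names are illustrative):

```python
import numpy as np

def consensus_combine(worker_samples):
    # worker_samples: list of (S, d) arrays of draws, one per worker.
    # Combine the s-th draw from each worker by a weighted linear
    # combination, weighting each worker by the inverse of its local
    # posterior sample covariance (a common choice motivated by the
    # normal/Bernstein-von Mises argument in the text).
    weights = [np.linalg.inv(np.cov(s, rowvar=False)) for s in worker_samples]
    w_total_inv = np.linalg.inv(sum(weights))
    n_draws = min(s.shape[0] for s in worker_samples)
    d = worker_samples[0].shape[1]
    combined = np.empty((n_draws, d))
    for s in range(n_draws):
        combined[s] = w_total_inv @ sum(
            w @ draws[s] for w, draws in zip(weights, worker_samples))
    return combined
```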
There are also approaches that do not start from the product distribution.
Xu et al. (2014) use expectation propagation message passing to enforce agree-