Importance Sampling Methods for Bayesian
Inference with Partitioned Data
Marc Box (marc.box@protonmail.com)
November 1, 2022
Abstract
This article presents new methodology for sample-based Bayesian in-
ference when data are partitioned and communication between the parts is
expensive, as arises by necessity in the context of “big data” or by choice
in order to take advantage of computational parallelism. The method,
which we call the Laplace enriched multiple importance estimator, uses
new multiple importance sampling techniques to approximate posterior
expectations using samples drawn independently from the local poste-
rior distributions (those conditioned on isolated parts of the data). We
construct Laplace approximations from which additional samples can be
drawn relatively quickly and improve the methods in high-dimensional
estimation. The methods are “embarrassingly parallel”, make no restriction
on the sampling algorithm used (including MCMC) or on the choice of
prior distribution, and do not rely on any assumptions about the posterior
such as normality. The performance of the methods is demonstrated
and compared against some alternatives in experiments with simulated
data.
Keywords: big data; parallel computing; Bayesian inference; Markov chain
Monte Carlo; embarrassingly parallel; federated inference; multiple importance
sampling
1 Introduction
Bayesian sample-based computation is a common approach to Bayesian infer-
ence in non-trivial models where it is infeasible to compute the normalising
constant of the posterior distribution. The focus of this article is on performing
Bayesian computation when data are partitioned and communication between
the parts is expensive or impossible. We present new methodology for Bayesian
inference in this context that does not require combining the parts and that
communicates between the parts only at the end of sampling. Our methods achieve this
with competitive performance and with fewer constraints or assumptions than
some other methods.
Bayesian computation with partitioned data is challenging because the algo-
rithms for generating samples from the posterior typically require many calcula-
tions involving the whole data set; specifically, likelihood evaluations. For exam-
ple, in Markov chain Monte Carlo (MCMC) algorithms such as the Metropolis-
Hastings algorithm, a (typically large) number of dependent samples are gen-
erated from a Markov chain, each constrained by the previous sample and the
data (see e.g. Robert and Casella (2004)). When used to target a posterior distribution π(θ | x) given data x, the probability of accepting proposal θ′ given x and previous sample θ is

$$\min\left\{ \frac{\pi(\theta' \mid x)}{\pi(\theta \mid x)} \, \frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)},\; 1 \right\}, \tag{1}$$

where q is the density of the proposal distribution. The ratio π(θ′ | x)/π(θ | x) (which may be of unnormalised p.d.f.s because the normalising constant cancels) must be evaluated for every proposal θ′, i.e. in every iteration. Even if observations
are conditionally independent given θ, so that likelihoods evaluated using parts
of the data can be multiplied to give the full data likelihood, this can still be a
problem if the necessary data transfer is expensive or impossible.
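To make the cost concrete, here is a minimal sketch (hypothetical model functions `log_prior` and `part_loglik`, not code from the paper) of a single Metropolis-Hastings step with a symmetric random-walk proposal, in which case the q ratio in (1) cancels. The acceptance decision still requires the log-likelihood of every data part at every iteration, so in a distributed setting each iteration would touch every node holding data.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_prior(theta):
    # Hypothetical prior: independent standard normals on each coordinate.
    return -0.5 * np.sum(theta ** 2)

def part_loglik(theta, x_part):
    # Hypothetical per-part log-likelihood (normal location model).
    return -0.5 * np.sum((x_part - theta) ** 2)

def mh_step(theta, data_parts, step=0.1):
    # One random-walk Metropolis step targeting pi(theta | x).  With a
    # symmetric proposal the q ratio in (1) cancels, but the unnormalised
    # posterior still needs the log-likelihood of *every* part at *every*
    # iteration -- the communication cost described in the text.
    proposal = theta + step * rng.standard_normal(theta.shape)
    log_post_current = log_prior(theta) + sum(
        part_loglik(theta, xp) for xp in data_parts)
    log_post_proposal = log_prior(proposal) + sum(
        part_loglik(proposal, xp) for xp in data_parts)
    accept = np.log(rng.uniform()) < log_post_proposal - log_post_current
    return proposal if accept else theta
```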
There are three important data analysis situations where data are parti-
tioned:
1. When a data set is too large to work with in the memory of one computer.
2. When there are several sources or owners of data which are unable or
unwilling to share their data.
3. When there is the possibility of speeding up sampling by running multiple
instances of the sampling algorithm in parallel on separate parts of the
data.
The first situation arises in the context of “big data”: data that must be stored
in a distributed manner with no shared memory for computing. Big data has
become very important in modern science and business because of the possi-
bility of finding patterns not observable on a smaller scale and which may lead
to deeper understanding or provide a competitive edge (Bryant et al. (2008);
Sagiroglu and Sinanc (2013)). The open source Apache Hadoop framework
is widely used for storing and computing with big data on a cluster (Apache
Hadoop (2018); Borthakur (2007)). In the Hadoop file system (HDFS), data are
partitioned into “blocks” and stored across the cluster, in duplicate for resilience
to errors, then processed using parallel computation models such as MapReduce
(Dean and Ghemawat (2008)) and Spark (Zaharia et al. (2010)) which operate
on data using memory local to each block. A major source of inefficiency in
these computations arises when data must be communicated between cluster
nodes (Kalavri and Vlassov (2013); Sarkar et al. (2015)).
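As an illustration of computing on data in place (assumed setup, not code from the paper), the following PySpark sketch evaluates a log-likelihood block by block so that only one small per-block summary, rather than the raw observations, crosses the network:

```python
# Illustrative PySpark sketch: each block's log-likelihood is computed on
# the node holding that block, and only the per-block totals are collected
# back to the driver.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="per-block-loglik")

def block_loglik(iterator, theta=0.0):
    # Hypothetical normal location model evaluated on one block.
    x = np.fromiter(iterator, dtype=float)
    yield -0.5 * np.sum((x - theta) ** 2)

data = sc.parallelize(np.random.normal(size=100_000).tolist(), numSlices=8)
total_loglik = sum(data.mapPartitions(block_loglik).collect())
```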
One possible approach to Bayesian computation in this context is to down-
sample the data to a size that will fit in the local memory of one computer,
but down-sampling a large data set seems to defeat the purpose of collecting it
in the first place. As pointed out in Scott et al. (2016), some large, complex
models genuinely require a large amount of data for robust estimation.
The second situation arises due to data privacy concerns or in meta-analyses.
Inference in this situation is sometimes known as “federated inference” (and the
distributed data, “federated data”) (Xiong et al. (2021); Ma et al. (2021)). Cur-
rent approaches to preserving privacy often rely on the masking of data or the
addition of noise (Dwork (2008); Torra and Navarro-Arribas (2016)), both of which
imply the loss of information. Meta-analyses use statistical procedures such as
mixed effects models to pool the results of primary studies using aggregated
data when there is no access to raw observational data (DerSimonian and Laird
(1986)). Performing inference in global models for pooled data without any
participants having to share their data may open up new possibilities in these
situations.
In the third situation, we may assume there is ample memory for the entire
data set, but computational parallelism is available such as through multiple
CPU cores, with a GPU or array of GPUs (Lee et al. (2010)), or on a cluster,
and the time complexity of the sampling algorithm depends on the number of
data points. In this situation there is an opportunity to generate more samples
in a given time by running the sampler in parallel on subsets of the data. This
may result in estimators with lower bias and variance than would otherwise be
possible.
If the data can be contained in the memory of a single node, another way
of taking advantage of computational parallelism is to run multiple MCMC
chains in parallel. Besides the large number of samples that can be generated
(e.g. Lao et al. (2020)), there is potential for improved convergence and new
adaptive algorithms (Green et al. (2015)). This is a different mode of parallelism
and not the concern of this paper.
Our approach is for each worker node (the cluster node or agent managing
each data part) to run the same sampling algorithm independently on its local
data, resulting in sets of samples from posterior distributions different from
the full data posterior distribution. We regard these local posteriors as impor-
tance proposal distributions or components of a mixture proposal distribution
targeting the posterior (Robert and Casella (2004); Owen (2013)). By correctly
weighting the samples we can construct Monte Carlo estimators of a posterior
expectation that are asymptotically unbiased in the sense of approaching zero
bias in the limit of infinite samples. There are two observations that suggest
importance sampling-based estimation in this context may be fruitful. Firstly,
the local posteriors should be similar to the posterior (so long as the parts of
data are similar in distribution); secondly, the tails of the local posteriors should
be fatter than those of the posterior because they are conditioned on less data
(see e.g. MacKay et al. (2003)). We make use of three importance weighting
strategies which fit into the class of multiple importance sampling (Veach and
Guibas (1995); Hesterberg (1995); Owen (2013); Elvira and Martino (2021)).
Two of these strategies we devised by extending the methods of Veach and
Guibas (1995) to the case where both the target density and the proposal den-
sities are only known up to a constant of proportionality. The third strategy we
believe is novel.
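To fix ideas, the following sketch (hypothetical function names, and a single-proposal simplification rather than the paper's multiple importance sampling estimators) shows how samples drawn from one local posterior can be reweighted towards the full-data posterior: under conditional independence between parts, the prior and the local likelihood cancel in the weight, and the unknown normalising constants drop out after self-normalisation.

```python
import numpy as np

def snis_weights_from_worker(samples_k, k, part_logliks):
    # samples_k: draws from the local posterior of part k.
    # part_logliks[j](theta) returns log p(x_j | theta); evaluating these
    # needs only the sampled theta values, not the raw data, so it can be
    # done by each worker after sampling has finished.
    log_w = np.array([
        sum(ll(theta) for j, ll in enumerate(part_logliks) if j != k)
        for theta in samples_k
    ])
    log_w -= log_w.max()          # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()            # self-normalised importance weights

# A posterior expectation E[h(theta) | x] is then approximated by
#   sum_i weights[i] * h(samples_k[i]).
```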
Importance sampling is known to suffer the curse of dimensionality, leading
to poor performance in high-dimensional models (MacKay et al. (2003)). To
address this we include samples from Laplace approximations to the posterior
to complement the samples received from the workers. These additional samples
can be generated easily without additional iterations of the posterior sampling
algorithm, which are often relatively costly. The approximations are simple
constructions from the pooled samples that provide additional importance pro-
posals which, it is hoped, cover regions of parameter space not covered by the
local posteriors. We consider three ways of doing this, but find that only one of
them is particularly useful in the majority of examples.
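As a rough sketch of what such an enrichment step might look like (a moment-matched normal fitted to the pooled samples, standing in for the Laplace-type approximations described above; names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def normal_enrichment(pooled_samples, n_extra, seed=None):
    # Fit a normal approximation to the pooled local-posterior samples and
    # draw extra proposal samples from it.  Only the existing samples are
    # needed; no further iterations of the posterior sampler are run.
    rng = np.random.default_rng(seed)
    mu = pooled_samples.mean(axis=0)
    cov = np.cov(pooled_samples, rowvar=False)
    extra = rng.multivariate_normal(mu, cov, size=n_extra)
    # The approximation's log-density is needed later as an importance
    # proposal density for the added samples.
    log_q = multivariate_normal(mean=mu, cov=cov).logpdf
    return extra, log_q
```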
The advantages of our methods can be summarised as follows. They ap-
pear to perform relatively well (comparing with some alternative approaches)
in terms of approximating posterior expectations across a range of models, and
in particular for non-normal posteriors. We will provide evidence of this from
experiments with synthetic data. Our methods have no preference as to the
algorithm used by the workers for sampling from their local data, so long as it is
approximately unbiased, and no communication between workers is required until the very end
of the sampling. This means we can perform sampling in an “embarrassingly
parallel” fashion (Herlihy et al. (2020)). In fact, no data (i.e. observations) need
be transmitted between nodes at all (after any initial partitioning of data). This
is an essential requirement in the use case of collaboration between parties who
are unable to share data, and is an advantage to parallel computation on a
cluster where data transfer between nodes is a performance bottleneck. Our
methods can be used without any constraint on the choice of prior distribution.
This is in contrast to some other methods, in which it is necessary that the prior
be amenable to a certain transformation for the methods to be unbiased. This
constrains the choices available for the prior in those methods, which can have
unwanted implications for the analysis, particularly in analyses of small data
sets. There are also no hyperparameters that need to be set or tuned in our
methods.
We make no distributional assumptions about the posterior. The only as-
sumptions we need beyond those implied by the model or the sampling algorithm
are that observations which are held in separate parts are conditionally inde-
pendent of each other given model parameters, that the likelihood function is
computable and the mild assumptions required by importance sampling. Condi-
tional independence of all observations is sufficient but stronger than necessary.
However, whilst we do not explicitly require the data to be partitioned at
random, random partitioning would likely be beneficial for methods
based on importance sampling because it makes the local posteriors more likely
to resemble the posterior. We do not require the size of the data parts to be
equal or the number of samples drawn by each worker to be equal.
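Writing x = (x₁, …, x_J) for the J parts (notation chosen here for illustration), the conditional-independence assumption between parts is what links the local posteriors to the full-data posterior:

$$\pi(\theta \mid x) \;\propto\; p(\theta) \prod_{j=1}^{J} p(x_j \mid \theta), \qquad \pi(\theta \mid x_j) \;\propto\; p(\theta)\, p(x_j \mid \theta), \quad j = 1, \dots, J,$$

so each worker only ever needs to evaluate its own factor p(x_j | θ).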
There are two useful performance diagnostics we propose to use with our
methods. First, we derive an effective sample size in the manner of Kong (1992).
This is a measure of the sampling efficiency lost due to the use of an approx-
imation. Second, we look at the k̂ diagnostic arising in the Pareto smoothed
importance sampling estimator of Vehtari et al. (2015). This is an estimate of the shape
parameter in a generalised Pareto model for the tail of the importance weight
distribution, for which Vehtari et al. (2015) identified a threshold which seems
to be a valuable indicator of poor performance. We find that, together, these
indicate situations where our estimators perform poorly.
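As a rough, simplified illustration of these two diagnostics (not the exact estimators used in the paper), one can compute a Kong-style effective sample size from the self-normalised weights and fit a generalised Pareto distribution to the upper tail of the weights; a fitted shape much above about 0.7 is the commonly cited warning level from Vehtari et al. (2015).

```python
import numpy as np
from scipy.stats import genpareto

def effective_sample_size(weights):
    # Kong (1992)-style ESS of self-normalised importance weights:
    # (sum w)^2 / sum w^2, equal to n when the weights are uniform.
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def tail_shape_estimate(weights, tail_frac=0.2):
    # Crude stand-in for the PSIS k-hat diagnostic: fit a generalised
    # Pareto distribution to the largest importance weights.  The fixed
    # tail fraction here is a simplification of Vehtari et al. (2015).
    w = np.sort(np.asarray(weights, dtype=float))
    tail = w[int(np.ceil((1 - tail_frac) * len(w))):]
    k_hat, _, _ = genpareto.fit(tail - tail[0], floc=0.0)
    return k_hat
```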
As of writing, the problem of Bayesian inference with partitioned data re-
mains an open challenge (Green et al. (2015); Bardenet et al. (2017)), although
there are some notable contributions. Some hierarchical models have a structure
that is particularly amenable to distributed processing, such as the hierarchical
Dirichlet process topic model, for which Newman et al. (2009) devise a dis-
tributed Gibbs sampling algorithm. This approach does not generalise to other
models, however.
A number of methods start from the observation that the posterior density is
proportional to the product of local posterior densities, the product distribution,
under a conditional independence assumption for the data. Scott et al. (2016)
make normal distribution assumptions for the local posteriors, justified by the
Bernstein-von Mises theorem (Van der Vaart (1998)), and pool samples using a
weighted linear combination. Neiswanger et al. (2013) propose three methods,
one of which is similar to Scott et al. (2016) and two of which use kernel density estimators
constructed from the product distribution and sampled from using an indepen-
dent Metropolis-within-Gibbs algorithm. We will describe these methods in
more detail in Section 2.5 because we use them for performance comparisons
with our methods. Huang and Gelman (2005) use a similar approach to Scott
et al. (2016) but provide specific approaches to normal models, linear models
and hierarchical models. Luengo et al. (2015) consider a similar estimator to
Scott et al. (2016) but pool the local posterior estimators rather than individual
samples. This also relies on the Bernstein-von Mises theorem, but they propose
a bias correction which helps when the posterior does not follow a normal dis-
tribution or with small data sets. Luengo et al. (2018) propose more refined
estimators along these lines. An approach related to Neiswanger et al. (2013)
is Wang and Dunson (2013), who use the Weierstrass transform of the local
posterior densities and sample from the product of these using a Gibbs sam-
pling algorithm. This method may perform better when the posterior deviates
from normality, but requires some communication between workers during sam-
pling and involves some hyperparameters. Nemeth and Sherlock (2018) use a
Gaussian process prior on the log of each of the local posterior densities, the
sum of which is a Gaussian process approximation to the log posterior density,
from which they sample using Hamiltonian Monte Carlo. Importance weight-
ing is then used to improve the approximation represented by these samples
(this is a different use of importance sampling from our methods, although the
computation of the unnormalised posterior density is the same).
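For concreteness, a minimal sketch of the consensus-style pooling of Scott et al. (2016) described above (assuming draws are combined index by index, with the commonly used inverse-sample-covariance weights; names are illustrative):

```python
import numpy as np

def consensus_combine(worker_samples):
    # worker_samples: list of (S, d) arrays of draws, one per worker.
    # Combine the s-th draw from each worker by a weighted linear
    # combination, weighting each worker by the inverse of its local
    # posterior sample covariance (a common choice motivated by the
    # normal/Bernstein-von Mises argument in the text).
    weights = [np.linalg.inv(np.cov(s, rowvar=False)) for s in worker_samples]
    w_total_inv = np.linalg.inv(sum(weights))
    n_draws = min(s.shape[0] for s in worker_samples)
    d = worker_samples[0].shape[1]
    combined = np.empty((n_draws, d))
    for s in range(n_draws):
        combined[s] = w_total_inv @ sum(
            w @ draws[s] for w, draws in zip(weights, worker_samples))
    return combined
```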
There are also approaches that do not start from the product distribution.
Xu et al. (2014) use expectation propagation message passing to enforce agree-