2. Modern challenges for Bayesian computation: Massive Data
Consider data $y$ collected on $N$ independent items, so that $y = \{y_1, \ldots, y_N\} \in \mathcal{X}^N$, and denote by $f(y \mid \theta)$ the sampling distribution, which depends on the parameter $\theta \in \Theta \subset \mathbb{R}^d$. At each iteration of the MH sampler, one needs to compute $f(y \mid \omega_t) = \prod_{k=1}^{N} f(y_k \mid \omega_t)$, where $\omega_t$ is the proposal in (4). Modern applications often rely on data sets large enough that the repeated calculation of $f(y \mid \omega_t)$ is impractical, if not impossible. It is also not unusual for the data to be too large to store on a single machine, in which case computing the likelihood also requires repeated communication between machines, adding significantly to the computational burden.
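To fix ideas, the following minimal sketch of a random-walk MH sampler (in Python, with a toy $N(\theta,1)$ model and hypothetical helpers log_lik and log_prior standing in for a generic $f$ and $p$) makes the bottleneck explicit: every accept/reject decision requires a full $O(N)$ pass over the data.

```python
import numpy as np

def log_lik(theta, y):
    # Toy N(theta, 1) model standing in for a generic f(y | theta):
    # the log-likelihood is a sum over all N observations.
    return -0.5 * np.sum((y - theta) ** 2)

def log_prior(theta):
    # Illustrative standard normal prior p(theta).
    return -0.5 * theta ** 2

def random_walk_mh(y, n_iter=10_000, step=0.5, theta0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_lik(theta0, y) + log_prior(theta0)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        omega = theta + step * rng.standard_normal()      # proposal, cf. (4)
        lp_omega = log_lik(omega, y) + log_prior(omega)   # O(N) work at every iteration
        if np.log(rng.uniform()) < lp_omega - lp:         # MH accept/reject
            theta, lp = omega, lp_omega
        draws[t] = theta
    return draws
```

With $N$ in the tens or hundreds of millions, the likelihood evaluation inside the loop dominates the cost of every iteration, and it grows further if the data must be fetched from several machines.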
Prompted by the obstacle of large data, computational Bayesians have designed a number of approaches to alleviate the problem. Two general ideas currently stand out in terms of popularity and usage: divide-and-conquer (DAC) strategies and subsampling with minimal loss of information.
2.1. Divide and Conquer
The DAC approach is based on partitioning the sample into a number of sub-samples, called batches, that are analyzed separately on a number of workers (CPUs, GPUs, servers, etc.). After the batch-
specific estimates about the parameter of interest are obtained, the results are combined so that the
analyst recovers a large part of, ideally all, the information that would have been available if the whole
sample were analyzed in the usual way, on a single machine. While this idea seems applicable in a wide
range of scenarios, there are a couple of constraints that restrict its generality. First, the procedure is
computationally effective if it is designed to minimize, preferably eliminate, communication between
the workers before combining the batch-specific results. Second, it is often difficult to produce an accurate assessment of the resulting loss of information at the combining stage.
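In outline, and before turning to specific recombination rules, the workflow can be summarized by the following minimal Python sketch; run_sampler and combine are hypothetical placeholders for the batch-level MCMC and the recombination step discussed next, and the random equal-sized partition is just one convenient choice.

```python
import numpy as np
from multiprocessing import Pool

def split_into_batches(y, J, seed=0):
    # Randomly partition the data into J roughly equal-sized batches.
    idx = np.random.default_rng(seed).permutation(len(y))
    return [y[part] for part in np.array_split(idx, J)]

def run_dac(y, J, run_sampler, combine):
    # Generic DAC skeleton: each batch is sampled independently, with no
    # communication between workers, and the J sets of draws are only
    # brought together at the final combining stage.
    batches = split_into_batches(y, J)
    with Pool(processes=J) as pool:
        sub_draws = pool.map(run_sampler, batches)   # embarrassingly parallel
    return combine(sub_draws)
```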
Some of the first proponents of DAC for MCMC sampling are Neiswanger et al. (2013), Scott et al. (2016), and Wang & Dunson (2013). In their approach, the subposterior distribution corresponding to the $j$th batch is defined as
$$\pi^{(j)}(\theta \mid y^{(j)}) \propto f(y^{(j)} \mid \theta)\,[p(\theta)]^{1/J} \qquad (5)$$
where $f$ and $p$ are as in (1), $y^{(j)}$ is the data assigned to batch $j$, $1 \le j \le J$, and $J$ is the total number of batches. With this choice, one immediately gets that $\prod_{j=1}^{J} \pi^{(j)} \propto \pi(\theta \mid y)$.
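Spelled out, this identity follows from the independence of the items and the fact that the batches partition $y$:
$$\prod_{j=1}^{J} \pi^{(j)}(\theta \mid y^{(j)}) \;\propto\; \prod_{j=1}^{J} [p(\theta)]^{1/J} f(y^{(j)} \mid \theta) \;=\; p(\theta) \prod_{j=1}^{J} f(y^{(j)} \mid \theta) \;=\; p(\theta)\, f(y \mid \theta) \;\propto\; \pi(\theta \mid y).$$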
Both Neiswanger et al. (2013) and Scott et al. (2016) consider ways to combine samples from the subposteriors $\pi^{(j)}(\theta)$, $1 \le j \le J$, in situations in which all posteriors, batch-specific and full-data, are Gaussian or can be approximated by mixtures of Gaussians. In this case, one can show that a suitably weighted average of draws from the $\pi^{(j)}$'s has density $\pi$.
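A minimal sketch of such a weighted-average recombination, in the spirit of the consensus Monte Carlo rule of Scott et al. (2016), is given below; the function name and the assumption that every worker returns the same number of draws are ours.

```python
import numpy as np

def consensus_combine(sub_draws):
    # sub_draws: list of J arrays, each of shape (T, d), holding T draws of a
    # d-dimensional parameter from one subposterior.  Draw t of the combined
    # chain is (sum_j W_j)^{-1} sum_j W_j theta_t^{(j)}, where W_j is the
    # inverse of the j-th subposterior's sample covariance; the weighting is
    # exact when every subposterior is Gaussian.
    weights = [np.linalg.inv(np.atleast_2d(np.cov(draws, rowvar=False)))
               for draws in sub_draws]
    scale = np.linalg.inv(sum(weights))              # (sum_j W_j)^{-1}
    T, d = sub_draws[0].shape
    combined = np.empty((T, d))
    for t in range(T):
        combined[t] = scale @ sum(w @ draws[t] for w, draws in zip(weights, sub_draws))
    return combined
```

When the subposteriors are exactly Gaussian this average is distributed according to the full posterior; otherwise it is only an approximation, which motivates the extensions discussed next.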
The use of the Weierstrass transform for each subposterior density, proposed in Wang & Dunson (2013), extends the range of theoretical validity beyond Gaussian distributions. The authors also establish error bounds between the approximation and the true posterior.
and the true posterior. Nemeth & Sherlock (2018) use a Gaussian process (GP) approximation of each
subposterior. Once again, the Gaussian nature of the approximation makes recombination possible
and relatively straightforward. Limitations of the method are strongly linked with those of GP-based
estimation. For instance, when the subposterior samplers are slow to mix, large MCMC samples may be needed, which in turn makes the calculation of the GP-based approximation very expensive.
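As a schematic illustration of the GP route (not the authors' implementation; scikit-learn's GP regressor is used here as a stand-in, and sub_draws and sub_logdens are hypothetical inputs holding each batch's MCMC draws and the corresponding log-subposterior values):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_subposterior_gps(sub_draws, sub_logdens):
    # Fit one GP per batch to the (MCMC draw, log-subposterior value) pairs.
    gps = []
    for theta, logp in zip(sub_draws, sub_logdens):
        gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(),
                                      normalize_y=True)
        gp.fit(theta, logp)                  # theta: (T, d), logp: (T,)
        gps.append(gp)
    return gps

def approx_log_posterior(gps, theta_new):
    # Since log pi(theta | y) equals sum_j log pi^(j)(theta) up to a constant,
    # the sum of the GP predictive means approximates the full log-posterior.
    return sum(gp.predict(np.atleast_2d(theta_new)) for gp in gps)
```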
The idea of using the values of the subposterior at each MCMC sample is also adopted by Changye & Robert (2019), who propose to define the subposteriors using $\pi^{(j)} \propto \{[p(\theta)]^{1/J} f(y^{(j)} \mid \theta)\}^{\lambda_j}$, where the scale factor $\lambda_j$ is used to control the uncertainty in the subposterior. Alternative ways to define the subposteriors are proposed by Entezari et al. (2018), who use $\pi^{(j)} \propto p(\theta)\,[f(y^{(j)} \mid \theta)]^{J}$. The intuitive idea is to "match" the size of the original sample and the batch-specific one. Their approach has been applied successfully to BART models (Chipman et al. 2010, Pratola 2016).
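To unpack the matching heuristic (a back-of-the-envelope argument, assuming $J$ equal-sized batches of independent observations): $[f(y^{(j)} \mid \theta)]^{J}$ is a product of $J \times (N/J) = N$ likelihood terms, so each subposterior $\pi^{(j)} \propto p(\theta)\,[f(y^{(j)} \mid \theta)]^{J}$ concentrates at roughly the same rate as the full-data posterior $\pi(\theta \mid y)$.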