Federated Bayesian Computation via Piecewise Deterministic Markov Processes
Joris Bierkens*, Andrew B. Duncan†
October 26, 2022
Abstract
When performing Bayesian computations in practice, one is often faced with the challenge
that the constituent model components and/or the data are only available in a distributed
fashion, e.g. due to privacy concerns or sheer volume. While various methods have been
proposed for performing posterior inference in such federated settings, these either make
very strong assumptions on the data and/or model or otherwise introduce significant bias
when the local posteriors are combined to form an approximation of the target posterior.
By leveraging recently developed methods for Markov Chain Monte Carlo (MCMC) based
on Piecewise Deterministic Markov Processes (PDMPs), we develop a computation- and
communication-efficient family of posterior inference algorithms (Fed-PDMC) which
provides asymptotically exact approximations of the full posterior over a large class of Bayesian
models, allowing heterogeneous model and data contributions from each client. We show that
communication between clients and the server preserves the privacy of the individual data
sources by establishing differential privacy guarantees. We quantify the performance of Fed-
PDMC over a class of illustrative analytical case-studies and demonstrate its efficacy on a
number of synthetic examples along with realistic Bayesian computation benchmarks.
1 Introduction
In the problem of Federated Bayesian learning we are faced with the unique challenge that,
either due to privacy or scalability, the model and its data are distributed across a federation
of workers. In this setting, the model and/or data owned by an individual worker must not
be disclosed to the other workers, nor to any coordinator. While this problem has
been studied in previous works, much of the proposed methodology either sacrifices the
asymptotic exactness which is characteristic of MCMC-based sampling algorithms, in order to
facilitate the distribution and federation of the data, or is focused on a very narrow class
of models.
The problem of federated learning has largely been studied in the optimisation setting, where
a global loss function which can be decomposed into local worker contributions must be optimised.
Classical strategies including Fed-SGD and Fed-Avg [MMR+17], FedAc [YM20], and subsequent
extensions, build on the general idea that each worker seeks to locally move towards the optimiser
of the model based on their own local data contribution, computing a local gradient, and a central
server aggregates these local gradient candidates in an appropriate fashion to obtain an estimate
of the global minimiser.
*Delft University of Technology
†Imperial College London, Alan Turing Institute
arXiv:2210.13816v1 [stat.CO] 25 Oct 2022
The validity of these approaches typically hinges on strong assumptions. Firstly, it is assumed
that the data across workers is independent and identically distributed, though a number of more
recent federated learning methods seek to weaken these requirements, e.g. through knowledge
distillation [ZHZ21] or Bayesian non-parametric modelling, e.g. [YAG+19]. Secondly, it is as-
sumed that workers use the same local model, though recent work on model personalization has
suggested some strategies to address this [MMRS20].
In this work, we address the problem of federated learning in a Bayesian setting, i.e. we seek
to generate samples from a global posterior probability distribution obtained as a multiplicative
composition of local posteriors distributed across the workers, and without sharing of model
and/or data. This is an inherently more challenging problem due to the fact that far more
information pertaining to the local posterior distribution must be somehow communicated with
other workers while ensuring privacy of data/model.
Previous works have sought to lift methodology from the federated learning setting to the
Bayesian setting, employing Stochastic Gradient Langevin Dynamics (SGLD)-based generalisations
of Federated Learning counterparts, e.g. [ZLZ+19, EMMBK21, VPD+22]. Similarly, the Langevin-type
algorithm proposed in [SSR22] combines distributed MCMC with compression techniques to re-
duce the burden of communicating large gradients.
Related approaches seek to employ approximations of the local posterior contributions which
are used to communicate information to the central server. Such approaches include distributed
variational inference [ZBKM18, HWL+17], using Gaussian approximations [ASGXR20] and en-
semble approaches [LAD+21]. In [BGR22], local predictive posterior contributions are distilled
and stored into a neural network which is communicated with the central server.
Some works seek to reformulate the federated learning problem through the lens of Bayesian
model averaging, where the local model contributions are combined into an accurate global ap-
proximation as a model ensemble [CC20, TG20], building on other Bayesian uncertainty quan-
tification methods used in deep learning such as [MIG+19]. Related to this are approaches which
adopt a Bayesian hierarchical modelling view of Federated Learning, introducing hierarchical
priors and fixed and random effects to share global information across the different federated
workers, [KVMD22].
All of the approaches discussed above either employ local posterior approximations to enable
effective communication, and/or are contingent on very strict assumptions on the structure
of the model. To our knowledge, there is no approach which can perform full, (asymptotically)
exact Bayesian inference in this context for a general Bayesian model. In this paper we provide a
federated (or distributed) approach to Markov Chain Monte Carlo with the following properties:
(i) the correct posterior distribution is retained; (ii) the observational data may be distributed
among workers with no requirement to exchange information other than the algorithmic output;
(iii) the observational data amongst different workers does not have to be identically distributed,
nor do the local prior distributions have to be the same; (iv) the efficiency of the federated
approach compares favourably to the non-federated approach in the sense that the algorithmic
slowing down is compensated by the fact that computation is distributed among workers; (v) the
amount of information that is communicated between the workers and the server respects the
privacy requirements of the worker, which can be quantified from a differential privacy viewpoint.
We will base our approach on the framework of Piecewise Deterministic Monte Carlo [BVD17,
BFR19], which we will introduce in Section 2. As discussed in Section 3 this framework can be
easily extended to allow for a federated (or distributed) approach while retaining the correct
stationary distribution. The computational efficiency of our method is discussed in Section 4.
We will also consider our approach from the viewpoint of differential privacy in Section 5. We
provide numerical experiments for several examples to establish proof of concept and investigate
efficiency properties in Section 6.
2 Piecewise Deterministic Monte Carlo
Here we concisely describe the framework of Piecewise Deterministic Monte Carlo (PDMC) in
some generality. Essentially, a PDMC sampler is based on a piecewise deterministic Markov
process. This is a continuous time Markov process which moves along continuous deterministic
trajectories, until at random times, a jump within the state space is made. In PDMC the state
space consists of a position process $X(t) \in \mathbb{R}^d$ and a velocity process $V(t)$ taking values in a set
$\mathcal{V} \subseteq \mathbb{R}^d$. The jumps (or events) will only affect the velocity. The process will be designed to
have a particular stationary probability distribution $\mu(dx, dv)$ with marginal position distribution
$\pi(dx) = \int_{v \in \mathcal{V}} \mu(dx, dv)$. Here $\pi$ may be considered to be a Bayesian posterior distribution of
interest.
2.1 Deterministic dynamics
In the general setting, the deterministic dynamics are described as the solution of an ordinary
differential equation
$$\frac{dx(t)}{dt} = v(t), \qquad \frac{dv(t)}{dt} = \psi(x(t)), \tag{1}$$
where $\psi : \mathbb{R}^d \to \mathbb{R}^d$ is a sufficiently regular function so that solutions to (1) are defined uniquely,
e.g., $\psi$ may be assumed to be globally Lipschitz. The deterministic dynamics are assumed
to preserve a 'reference' stationary measure $\mu_0(dx, dv) = \pi_0(dx) \otimes \nu(dv)$, where $\pi_0(dx) = \exp(-U_0(x))\,dx$ for a suitable function $U_0 : \mathbb{R}^d \to \mathbb{R}$ and $\nu$ is a probability measure on $\mathcal{V}$. This
means that for a solution $\varphi(t; x_0, v_0) := (x(t; x_0, v_0), v(t; x_0, v_0))$ to (1) with initial condition
$(x_0, v_0)$, we have for all integrable $f : \mathbb{R}^d \times \mathcal{V} \to \mathbb{R}$ that
$$\int_{x_0 \in \mathbb{R}^d,\, v_0 \in \mathcal{V}} f(\varphi(t; x_0, v_0))\, \mu_0(dx_0, dv_0) = \int_{x \in \mathbb{R}^d,\, v \in \mathcal{V}} f(x, v)\, \mu_0(dx, dv).$$
An interesting special case is when $\pi_0(dx)$ is chosen to be the prior distribution in a Bayesian
inference problem, but this is not necessary.
Example 2.1 (Zig-Zag Sampler). For the Zig-Zag Sampler (ZZS, [BFR19]), we take $\psi(x) = 0$,
$U_0(x) = 0$, $\mathcal{V} = \{-1, +1\}^d$, and the stationary velocity distribution is taken to be $\nu = \mathrm{Uniform}(\mathcal{V})$.
We see that the velocities assume only discrete values which do not change under the deterministic
dynamics.
Example 2.2 (Bouncy Particle Sampler and Boomerang Sampler). Let $\mathcal{V} = \mathbb{R}^d$ equipped with a
Gaussian measure $\nu = N(0, \Sigma)$, where $\Sigma$ is a positive definite matrix. The dynamics (1) preserve
$\mu_0$ by taking $\psi(x) = -\Sigma \nabla U_0(x)$. In particular, for the Bouncy Particle Sampler (BPS, [BVD17])
we take $U_0(x) = 0$ and thus $\psi(x) = 0$, and usually $\Sigma = I_d$ so that $\mu_0(dx, dv) = \mathrm{Leb}(\mathbb{R}^d)(dx) \otimes N(0, I_d)(dv)$.$^1$ For the Boomerang Sampler [BGKR20], we take $U_0(x) = \frac{1}{2} x^T \Sigma^{-1} x$, so $\psi(x) = -x$,
and have $\mu_0(dx, dv) = N(0, \Sigma)(dx) \otimes N(0, \Sigma)(dv)$. In contrast to the ZZS, the BPS has a
continuous space of possible velocities, but as for the ZZS, the velocities do not change under the
deterministic dynamics. For the Boomerang Sampler the deterministic dynamics correspond to
a (skewed) harmonic oscillator.
$^1$ $\mathrm{Leb}(\mathbb{R}^d)$ denotes Lebesgue measure on $\mathbb{R}^d$.
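To make the Boomerang dynamics concrete: with $\psi(x) = -x$ the flow of (1) is a rotation in the $(x, v)$ plane, $x(t) = x_0 \cos t + v_0 \sin t$, $v(t) = -x_0 \sin t + v_0 \cos t$, which conserves the quadratic $x^T A x + v^T A v$ for any symmetric $A$. A minimal sketch (an illustration, not from the paper):

```python
import numpy as np

def boomerang_flow(t, x0, v0):
    """Closed-form solution of dx/dt = v, dv/dt = -x, i.e. the Boomerang
    Sampler's deterministic dynamics in the case psi(x) = -x."""
    x = x0 * np.cos(t) + v0 * np.sin(t)
    v = -x0 * np.sin(t) + v0 * np.cos(t)
    return x, v

# The flow conserves |x|^2 + |v|^2, consistent with preservation of the
# reference measure N(0, I) (x) N(0, I) when Sigma = I.
x0, v0 = np.array([1.0, 0.5]), np.array([-0.3, 2.0])
x1, v1 = boomerang_flow(0.7, x0, v0)
```

For general $\Sigma$ the same flow applies; the conserved quadratic becomes $x^T \Sigma^{-1} x + v^T \Sigma^{-1} v$, matching the preservation of $N(0, \Sigma) \otimes N(0, \Sigma)$.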
2.2 Jumps
Next we specify the jumping mechanism which changes the velocity at random times. This is
governed by a jump intensity $\lambda(x, v)$ and a Markov jump kernel $Q(x, v, dv') \colon \mathbb{R}^d \times \mathcal{V} \times \mathcal{B}(\mathcal{V}) \to [0, 1]$. More generally we may have multiple types of jumps, with rates $(\lambda_i)_{i=1}^k$
and jump distributions $(Q_i)_{i=1}^k$ competing for which event occurs first. Suppose we start from
time 0 at position $(x_0, v_0)$ and recall that we have deterministic dynamics $t \mapsto \varphi(t; x_0, v_0)$. The
distributions of the inter-event times $\tau_i$ are given by
$$\mathbb{P}(\tau_i \geq t) = \exp\left(-\int_0^t \lambda_i(\varphi(s; x_0, v_0))\, ds\right).$$
The event that actually takes place is specified by setting $i_0 = \arg\min_i \tau_i$. At time $\tau_{i_0}$ we make
a transition according to the selected jump kernel $Q_{i_0}$, so that the distribution of the velocity
after the jump is given by $Q_{i_0}(\varphi(\tau_{i_0}; x_0, v_0), \cdot)$.
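When the rate along the trajectory is tractable, the inter-event time above can be sampled exactly by inversion. As a hypothetical illustration (not from the paper), consider a rate of the affine form $\lambda(t) = \max(a + bt, 0)$ with $b > 0$, which is the form that arises, e.g., for Zig-Zag with a Gaussian target: solving $\int_0^\tau \lambda(s)\,ds = -\log U$ for $U \sim \mathrm{Uniform}(0,1)$ gives a closed-form sampler.

```python
import math, random

def sample_affine_event_time(a, b, rng=random):
    """Sample tau with P(tau >= t) = exp(-int_0^t max(a + b s, 0) ds), b > 0,
    by inversion: solve Lambda(tau) = -log(U) for U ~ Uniform(0,1)."""
    u = rng.random()
    target = -math.log(u)  # an Exponential(1) variate
    if a >= 0.0:
        # Lambda(t) = a t + b t^2 / 2; solve the quadratic for tau
        return (-a + math.sqrt(a * a + 2.0 * b * target)) / b
    # Rate is zero until t* = -a/b; afterwards Lambda(t) = b (t - t*)^2 / 2
    t_star = -a / b
    return t_star + math.sqrt(2.0 * target / b)
```

The two branches reflect that the rate stays at zero until the linear part becomes positive; general rates without tractable inversion fall back on Poisson thinning, discussed below.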
Remark 2.3. It is always possible to write a combination of jump mechanisms $(\lambda_i, Q_i)_{i=1}^k$ as a
single jump mechanism $(\lambda, Q)$ by defining
$$\lambda(x, v) = \sum_{i=1}^k \lambda_i(x, v), \qquad Q(x, v, dv') = \sum_{i=1}^k \frac{\lambda_i(x, v)}{\lambda(x, v)} \mathbb{1}_{\{\lambda(x,v) > 0\}}\, Q_i(x, v, dv').$$
This provides a convenient notational simplification which we will use whenever this does not
cause confusion.
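In simulation the competing-clocks view and the combined-mechanism view of Remark 2.3 coincide: one can draw a candidate time from each mechanism and let the earliest fire. A schematic sketch (with hypothetical per-mechanism time samplers):

```python
import random

def first_event(time_samplers):
    """Given callables time_samplers[i]() that each draw tau_i with survival
    P(tau_i >= t) = exp(-int_0^t lambda_i ds), return (i0, tau_i0) where
    i0 = argmin_i tau_i, i.e. the mechanism that fires first."""
    times = [s() for s in time_samplers]
    i0 = min(range(len(times)), key=times.__getitem__)
    return i0, times[i0]

# For constant rates lambda_i, tau_i is Exponential(lambda_i); the minimum
# is then Exponential(sum_i lambda_i), matching the combined rate lambda.
samplers = [lambda r=r: random.expovariate(r) for r in (0.5, 1.0, 2.0)]
i0, tau = first_event(samplers)
```

The superposition property checked in the comment is exactly why the single-mechanism rewriting leaves the process unchanged.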
Using the notation of the previous remark, Algorithm 1 describes a general PDMC sampler.
In practice it may be challenging to simulate $\tau$ satisfying (2). We discuss the usual approach
of Poisson thinning in the Appendix.
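The idea of Poisson thinning is to simulate candidate events from a tractable bounding rate $\bar{\lambda}(t) \geq \lambda(t)$ and accept each candidate with probability $\lambda(t)/\bar{\lambda}(t)$. A sketch with a constant bound, assuming a known global bound `lam_bar` along the current trajectory (an illustration, not the paper's appendix code):

```python
import random

def thinned_event_time(rate, lam_bar, rng=random):
    """Sample tau with survival exp(-int_0^t rate(s) ds) by Poisson thinning,
    assuming rate(s) <= lam_bar for all s along the current trajectory."""
    t = 0.0
    while True:
        t += rng.expovariate(lam_bar)          # candidate from bounding rate
        if rng.random() * lam_bar <= rate(t):  # accept w.p. rate(t)/lam_bar
            return t
```

A loose bound keeps the sampler exact but wastes candidates, so in practice one seeks the tightest bound that is still cheap to simulate from.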
2.3 Stationary distribution
It is possible to formulate conditions on $\lambda_i$ and $Q_i$ in order to have a prespecified stationary
distribution. For a function $f : \mathbb{R}^d \times \mathcal{V} \to \mathbb{R}$ we write
$$Q f(x, v) = \int_{v' \in \mathcal{V}} Q(x, v, dv')\, f(x, v').$$
Suppose we wish to have the distribution $\exp(-U(x))\, \pi_0(dx)$ as (marginal) stationary distribution. In order to achieve this we impose the following conditions (understood to hold for all
bounded measurable $f : \mathbb{R}^d \times \mathcal{V} \to \mathbb{R}$):
(i) Invariance of $\nu$ under the jump kernels: for all $x \in \mathbb{R}^d$,
$$\int_{v \in \mathcal{V}} Q_i f(x, v)\, \nu(dv) = \int_{v \in \mathcal{V}} f(x, v)\, \nu(dv) \tag{3}$$
$^2$ $\mathcal{B}(\mathcal{V})$ denotes the $\sigma$-field of Borel subsets of $\mathcal{V}$.
Algorithm 1 Piecewise Deterministic Monte Carlo
Input: Initial condition $(x, v) \in \mathbb{R}^d \times \mathcal{V}$.
Output: The sequence of skeleton points $(T_k, X_k, V_k)_{k=0}^\infty$.
1: Set $(T_0, X_0, V_0) = (0, x, v)$.
2: for $k = 0, 1, 2, \ldots$ (until a stopping criterion is met) do
3:   Simulate $\tau$ such that
$$\mathbb{P}(\tau \geq t) = \exp\left(-\int_0^t \lambda(\varphi(s; X_k, V_k))\, ds\right) \tag{2}$$
4:   Set $T_{k+1} = T_k + \tau$ and $(X_{k+1}, \widetilde{V}_{k+1}) = \varphi(\tau; X_k, V_k)$.
5:   Simulate $V_{k+1} \sim Q(X_{k+1}, \widetilde{V}_{k+1}, \cdot)$.
6: end for
(ii) Effective sign reversal under jumps: for all $i = 1, \ldots, k$ and $x \in \mathbb{R}^d$,
$$\int_{v \in \mathcal{V}} \lambda_i(x, v)\, Q_i f(x, v)\, \nu(dv) = \int_{v \in \mathcal{V}} \lambda_i(x, -v)\, f(x, v)\, \nu(dv), \tag{4}$$
and
(iii) Event intensity condition: for all $(x, v) \in \mathbb{R}^d \times \mathcal{V}$,
$$\sum_{i=1}^k \left[\lambda_i(x, v) - \lambda_i(x, -v)\right] = \langle v, \nabla U(x) \rangle. \tag{5}$$
Under the stated conditions it follows that the process with deterministic dynamics (1), and
jumps according to $(\lambda_i, Q_i)_{i=1}^k$, has stationary distribution $\mu(dx, dv) \propto \exp(-U(x))\, \pi_0(dx) \otimes \nu(dv)$. The proof of this result depends on the notion of the Markov process generator and is
beyond the scope of this work; see e.g. [BFR19, BVD17, BGKR20].
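Algorithm 1 can be transcribed almost line by line into code. The sketch below is an illustration (not the paper's implementation): it takes the deterministic flow $\varphi$, an event-time sampler implementing (2), and the jump kernel $Q$ as callables, and returns the skeleton points.

```python
import random

def pdmc_skeleton(x, v, flow, event_time, jump, n_events, rng=random):
    """Generic PDMC sampler (Algorithm 1): alternate deterministic flow with
    velocity jumps at random event times; returns skeleton (T_k, X_k, V_k)."""
    T, X, V = 0.0, x, v
    skeleton = [(T, X, V)]
    for _ in range(n_events):
        tau = event_time(X, V, rng)    # tau with P(tau >= t) = exp(-int lambda)
        T = T + tau
        X, V_tilde = flow(tau, X, V)   # deterministic dynamics (1)
        V = jump(X, V_tilde, rng)      # V_{k+1} ~ Q(X_{k+1}, V_tilde, .)
        skeleton.append((T, X, V))
    return skeleton

# Toy 1d instantiation: straight-line flow, constant unit rate (so event
# times are Exponential(1)), and a deterministic velocity flip as the jump.
flow = lambda t, x, v: (x + t * v, v)
event_time = lambda x, v, rng: rng.expovariate(1.0)
jump = lambda x, v, rng: -v
skel = pdmc_skeleton(0.0, 1.0, flow, event_time, jump, n_events=5)
```

Note that between skeleton points the continuous-time trajectory is recovered by applying `flow`, so time averages along the path (not just at the skeleton points) estimate posterior expectations.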
Example 2.4 (Zig-Zag Sampler). For Zig-Zag, we consider, for $i = 1, \ldots, d$, $\lambda_i(x, v) = \max(v_i \partial_i U(x), 0) + \gamma_i(x)$ and $Q_i f(x, v) = f(x, F_i v)$, where $\gamma_i$ is a non-negative function (called the excess switching
rate or refreshment rate) and
$$(F_i v)_j = \begin{cases} v_j & j \neq i, \\ -v_j & j = i. \end{cases}$$
This corresponds to flipping the $i$th direction of the velocity at a rate which depends on the $i$th
partial derivative of $U$ as indicated.
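For these Zig-Zag rates with $\gamma_i = 0$, condition (5) can be checked directly: $\max(v_i \partial_i U, 0) - \max(-v_i \partial_i U, 0) = v_i \partial_i U$, and summing over $i$ gives $\langle v, \nabla U(x) \rangle$. A quick numerical sanity check, using a standard Gaussian target ($U(x) = |x|^2/2$, so $\partial_i U = x_i$; an illustrative choice, not fixed by the paper):

```python
import itertools
import numpy as np

def zigzag_rates(x, v, grad_U):
    """Zig-Zag switching rates lambda_i(x,v) = max(v_i dU/dx_i, 0), gamma_i = 0."""
    return np.maximum(v * grad_U(x), 0.0)

grad_U = lambda x: x  # standard Gaussian target: U(x) = |x|^2 / 2
x = np.array([0.3, -1.2, 2.0])
# Condition (5): sum_i [lambda_i(x, v) - lambda_i(x, -v)] = <v, grad U(x)>
for v_tuple in itertools.product([-1.0, 1.0], repeat=3):
    v = np.array(v_tuple)
    lhs = zigzag_rates(x, v, grad_U).sum() - zigzag_rates(x, -v, grad_U).sum()
    assert abs(lhs - v @ grad_U(x)) < 1e-12
```

The same check applies with any non-negative refreshment $\gamma_i$, since $\gamma_i(x)$ cancels in the difference $\lambda_i(x, v) - \lambda_i(x, -v)$.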
Example 2.5 (Bouncy Particle Sampler and Boomerang Sampler). For BPS and Boomerang,
recall that $\nu(dv) \propto N(0, \Sigma)$ for a positive definite matrix $\Sigma$. We take
$$\lambda(x, v) = \max(\langle v, \nabla U(x) \rangle, 0)$$