Adaptive Tuning for Metropolis Adjusted Langevin Trajectories
Lionel Riou-Durand Pavel Sountsov Jure Vogrinc
University of Warwick Google Research University of Warwick
Charles C. Margossian Sam Power
Flatiron Institute University of Bristol
Abstract
Hamiltonian Monte Carlo (HMC) is a widely used sampler for continuous probability distributions. In many cases, the underlying Hamiltonian dynamics exhibit a phenomenon of resonance which decreases the efficiency of the algorithm and makes it very sensitive to hyperparameter values. This issue can be tackled efficiently, either via the use of trajectory length randomization (RHMC) or via partial momentum refreshment. The second approach is connected to the kinetic Langevin diffusion, and has been mostly investigated through the use of Generalized HMC (GHMC). However, GHMC induces momentum flips upon rejections, causing the sampler to backtrack and waste computational resources. In this work we focus on a recent algorithm bypassing this issue, named Metropolis Adjusted Langevin Trajectories (MALT). We build upon recent strategies for tuning the hyperparameters of RHMC which target a bound on the Effective Sample Size (ESS) and adapt it to MALT, thereby enabling the first user-friendly deployment of this algorithm. We construct a method to optimize a sharper bound on the ESS and reduce the estimator variance. Easily compatible with parallel implementation, the resultant Adaptive MALT algorithm is competitive in terms of ESS rate and hits useful tradeoffs in memory usage when compared to GHMC, RHMC and NUTS.
Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023, TBD, USA. PMLR: Volume XXX. Copyright 2023 by the author(s).
1 INTRODUCTION
Hamiltonian Monte Carlo (HMC; Duane et al. (1987); Neal (2011)) is a widely-known MCMC sampler for distributions with differentiable densities. The use of multiple gradient-informed steps per proposal suppresses random walk behavior and makes HMC particularly powerful for high-dimensional numerical integration. Its implementation in probabilistic-programming languages (e.g. Carpenter et al., 2017; Bingham et al., 2018; Salvatier et al., 2016; Lao et al., 2020) leverages automatic differentiation and adaptive tuning with the no-U-turn sampler (NUTS; Hoffman and Gelman (2014)), which require minimal input from the practitioner.
Robust implementations of HMC rely on randomizing the length of Hamiltonian trajectories (Bou-Rabee and Sanz-Serna, 2017, RHMC) to avoid resonances. NUTS is well established and remains wildly popular, but recent studies suggest that optimally tuned RHMC samplers can perform better. Furthermore, NUTS' recursive, control-flow-heavy construction complicates its implementation when running many parallel chains (Radul et al., 2020; Lao and Dillon, 2019; Phan and Pradhan, 2019). To take better advantage of modern computing hardware such as GPUs, several schemes for tuning RHMC have been proposed (Hoffman et al., 2021; Sountsov and Hoffman, 2021), showing significant gains of efficiency compared to NUTS.
A second approach to mitigate resonances is based on the kinetic Langevin diffusion. Built upon its discretization, Generalized HMC (Horowitz, 1991, GHMC) alternates single adjusted steps with partial momentum refreshments. However, the original GHMC algorithm had not been proven competitive, its efficiency being restrained by the momentum flips induced upon rejections. To reduce the backtracking caused by these flips, Neal (2020) proposed a slice-sampling approach to cluster the event times of the rejections of a proposal. The efficiency of GHMC was significantly improved by this method, which was later turned into a tuning-free algorithm (Hoffman and Sountsov, 2022).
In this paper, we focus on a new numerical scheme for approximating the kinetic Langevin diffusion, recently introduced as Metropolis Adjusted Langevin Trajectories (Riou-Durand and Vogrinc, 2022, MALT). By construction, MALT fully erases the momentum flips and shares several desirable properties with RHMC. Compared to GHMC, some qualitative arguments in favour of MALT were made, but no quantitative comparison had been conducted. One objective of this work is to fill this gap.
Our general goal is to develop a tuning-free version of the MALT algorithm, compatible with running many parallel chains. In Section 3 we develop tuning heuristics and algorithmic solutions for each parameter. Built upon these principles, we present in Section 4 an adaptive sampler, applicable to multiple-chain as well as single-chain settings. In Section 5 we compare its efficiency numerically to several adaptive samplers (GHMC, RHMC, NUTS) on a set of benchmark models.
To summarize our contributions:
• We develop tuning heuristics for each of MALT's hyperparameters, building upon recent tuning strategies for RHMC (Hoffman et al., 2021; Sountsov and Hoffman, 2021) which target a bound on the Effective Sample Size (ESS).
• We propose a method to optimize a sharper bound on the ESS, for which we derive a gradient estimate with reduced variance. This variance reduction method is applicable to both MALT and RHMC and allows for speeding up the adaptation of the trajectory length.
• We construct an Adaptive MALT algorithm, easily compatible with, but not reliant on, running multiple chains in parallel.
• We show that Adaptive MALT is competitive in terms of ESS rate and hits useful tradeoffs in memory usage when compared to previous adaptive samplers (GHMC, RHMC, NUTS).
2 BACKGROUND
This section reviews the use of Hamiltonian dynamics to construct MCMC samplers targeting a differentiable density $\Pi(x) \propto \exp(-\Phi(x))$, for $x \in \mathbb{R}^d$. We introduce the notion of partial momentum refreshment and present its connection to the kinetic Langevin diffusion, thereby motivating GHMC. Unfortunately, the utility of GHMC is hindered by momentum flips, required to preserve detailed balance but statistically inefficient; we explain how MALT bypasses this issue. Finally, we review existing tuning strategies for HMC and GHMC samplers.
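For concreteness, the illustrative sketches in this section share a minimal Python setup: a standard Gaussian target with identity mass matrix, a hypothetical test case. All names (phi, grad_phi, M_chol, M_inv) are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Hypothetical running example: a standard Gaussian target
# Pi(x) ∝ exp(-Phi(x)) with potential Phi(x) = |x|^2 / 2 in d = 2.
d = 2
phi = lambda x: 0.5 * float(x @ x)   # potential Phi
grad_phi = lambda x: x               # gradient of Phi
M_chol = np.eye(d)                   # Cholesky factor of the mass matrix M = I
M_inv = np.eye(d)                    # inverse mass matrix M^{-1}
rng = np.random.default_rng(0)
```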
2.1 Hamiltonian Dynamics
To describe the motion of a particle $x \in \mathbb{R}^d$ and its velocity $v \in \mathbb{R}^d$ (a.k.a. momentum), Hamiltonian dynamics are defined for a positive definite mass matrix $M \in \mathbb{R}^{d \times d}$ as
$$\mathrm{d}x_t = M^{-1} v_t \,\mathrm{d}t, \qquad \mathrm{d}v_t = -\nabla\Phi(x_t)\,\mathrm{d}t. \tag{1}$$
By construction, Hamiltonian dynamics preserve the density $\Pi(x, v) \propto \exp(-\Phi(x) - v^{\top} M^{-1} v/2)$. In general, the ODE (1) does not have a closed form solution. Instead, the leapfrog method is commonly used to integrate this ODE numerically. It is defined for a time-step $h > 0$ as the following update $(x_1, v_1) = \mathrm{LEAPFROG}(x_0, v_0; M, h)$ such that
$$v_{1/2} = v_0 - (h/2)\nabla\Phi(x_0), \qquad x_1 = x_0 + h M^{-1} v_{1/2}, \qquad v_1 = v_{1/2} - (h/2)\nabla\Phi(x_1). \tag{2}$$
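Update (2) translates directly into code; a minimal sketch, reusing the Gaussian setup above:

```python
def leapfrog(x0, v0, grad_phi, M_inv, h):
    """One leapfrog step, i.e. update (2): half kick, drift, half kick."""
    v_half = v0 - 0.5 * h * grad_phi(x0)   # half step on the velocity
    x1 = x0 + h * (M_inv @ v_half)         # full step on the position
    v1 = v_half - 0.5 * h * grad_phi(x1)   # half step on the velocity
    return x1, v1
```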
The leapfrog integrator is time-reversible (Neal, 2011), i.e. $(x_0, -v_0) = \mathrm{LEAPFROG}(x_1, -v_1; M, h)$. This means that going backwards in time is equivalent to flipping the momenta. HMC is then built upon the following recursion. From a position $x_0 \in \mathbb{R}^d$ and a velocity $v_0 \sim \mathcal{N}_d(0_d, M)$, a Hamiltonian trajectory of length $\tau > 0$ is proposed by composing the leapfrog update for $L = \lceil \tau/h \rceil$ steps. The resultant proposal $(x_L, v_L)$ is faced with an accept-reject test (a.k.a. Metropolis correction) to compensate for the numerical error when solving (1). We note $|v|^2_{M^{-1}} = v^{\top} M^{-1} v$. With probability $1 - \exp(-\Delta^+)$, where $\Delta^+ = \max(0, \Delta)$ and
$$\Delta = \Phi(x_L) - \Phi(x_0) + \left(|v_L|^2_{M^{-1}} - |v_0|^2_{M^{-1}}\right)/2,$$
the proposal is rejected and its momentum is flipped, i.e. the output is $(x_0, -v_0)$. Flipping the momentum upon rejection is a technical requirement to ensure that the Metropolis correction preserves $\Pi$. Overall, the momentum flips are erased since $v_0$ is fully refreshed at each iteration.
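Putting the pieces together, one HMC iteration might be sketched as follows (an illustration under the setup above, not the authors' implementation). The rejection draw uses the fact that $P(Z < \Delta) = 1 - e^{-\Delta^+}$ for $Z \sim \mathrm{Exp}(1)$:

```python
def hmc_step(X, phi, grad_phi, M_chol, M_inv, h, L, rng):
    """One HMC iteration with Metropolis correction (illustrative sketch)."""
    v0 = M_chol @ rng.standard_normal(X.shape[0])  # v_0 ~ N_d(0_d, M)
    x, v = X, v0
    for _ in range(L):                             # compose L leapfrog steps
        x, v = leapfrog(x, v, grad_phi, M_inv, h)
    delta = phi(x) - phi(X) + 0.5 * (v @ M_inv @ v - v0 @ M_inv @ v0)
    if rng.exponential() < delta:                  # reject w.p. 1 - exp(-Delta^+)
        return X, -v0                              # momentum flip upon rejection
    return x, v                                    # accept the proposal (x_L, v_L)
```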
HMC prevents random walk behaviour by proposing trajectories with persistent momentum, thereby exploring the sampling space more efficiently. However, HMC exhibits a phenomenon of resonance when the target $\Pi$ has heterogeneous scales. Typically, fast mixing for one component can result in arbitrarily slow mixing for other components (Riou-Durand and Vogrinc, 2022). This issue makes the efficiency of HMC highly sensitive to the choice of its tuning parameters.
One solution consists in drawing $\tau$ at random at each iteration (RHMC), thereby smoothing the auto-correlations between successive Hamiltonian trajectories (Bou-Rabee and Sanz-Serna, 2017). The choice of the sampling distribution for $\tau$ is further discussed in Section 5. In this work, we consider another approach based on partial momentum refreshment.
2.2 Partial Momentum Refreshment
The resonances of Hamiltonian trajectories can be reduced by partially refreshing the velocity along their path. Kinetic Langevin dynamics leverage this property by augmenting (1) with a continuous momentum refreshment induced by a Brownian motion $(B_t)_{t \ge 0}$ on $\mathbb{R}^d$. These are defined for a damping parameter $\gamma \ge 0$ (a.k.a. friction) as
$$\mathrm{d}x_t = M^{-1} v_t \,\mathrm{d}t, \qquad \mathrm{d}v_t = -\left(\nabla\Phi(x_t) + \gamma v_t\right)\mathrm{d}t + \sqrt{2\gamma}\, M^{1/2}\,\mathrm{d}B_t. \tag{3}$$
The damping controls the refreshment speed. As $\gamma \to 0$, the SDE (3) reduces to the ODE (1) and exhibits resonances. These resonances vanish as $\gamma$ gets larger. Increasing $\gamma$ too much, however, decreases the persistence of the momentum and encourages random walk behavior. This tradeoff can be solved for a critical choice of damping. For log-concave distributions, quantitative mixing rates have been established by Eberle et al. (2019); Dalalyan and Riou-Durand (2020); Cao et al. (2019). Choosing a mass $M$ in (3) is equivalent to preconditioning the augmented space with the linear map $(x, v) \mapsto (M^{1/2}x, M^{-1/2}v)$.
Langevin trajectories can be approximated numerically by alternating between the leapfrog update (2) and a partial momentum refreshment, defined for $\eta = e^{-\gamma h}$ as
$$v' = \eta\, v + \sqrt{1 - \eta^2}\,\xi, \qquad \xi \sim \mathcal{N}_d(0_d, M).$$
This partial momentum refreshment leaves $\Pi$ invariant.
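A sketch of this refreshment step, under the same assumed setup:

```python
def partial_refresh(v, eta, M_chol, rng):
    """Partial momentum refreshment with eta = exp(-gamma * h)."""
    xi = M_chol @ rng.standard_normal(v.shape[0])  # xi ~ N_d(0_d, M)
    return eta * v + np.sqrt(1.0 - eta**2) * xi    # preserves N_d(0_d, M)
```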
To compensate for the numerical error, one strategy consists in applying the Metropolis correction to each leapfrog step. GHMC is built upon this principle (Horowitz, 1991), thereby inducing a momentum flip at every rejected leapfrog step; see Section 2.1. Unlike HMC, momentum flips are not erased with GHMC, which causes the sampler to backtrack and limits its efficiency (Kennedy and Pendleton, 2001). This also exacerbates the problem of tuning $h$, which must be small enough to prevent momentum flips from becoming overwhelming (Bou-Rabee and Vanden-Eijnden, 2010).
Several solutions have been proposed to improve GHMC's efficiency. Delayed rejection methods have been used to reduce the number of momentum flips at a higher computational cost (Mira, 2001; Campos and Sanz-Serna, 2015; Park and Atchadé, 2020). We highlight a recent approach based on slice sampling (Neal, 2020), which correlates the accept-reject coin tosses over time and encourages larger periods without momentum flips.
2.3 Metropolis Adjusted Langevin Trajectories
In this work, we focus on a sampler recently proposed by Riou-Durand and Vogrinc (2022) to bypass the problem caused by momentum flips. This method consists in applying the Metropolis correction to an entire numerical Langevin trajectory, rather than applying it to each of its $L = \lceil \tau/h \rceil$ leapfrog steps. The resultant sampler, named Metropolis Adjusted Langevin Trajectories (MALT), is defined by iterating Algorithm 1.
Algorithm 1 Metropolis Adjusted Langevin Trajectory
1: function MALT($X$; $M, \gamma, h, \tau$)
2: Set $L \leftarrow \lceil \tau/h \rceil$ and $\eta \leftarrow e^{-\gamma h}$
3: Draw independent $(\xi_i)_{i=0,\dots,L} \sim \mathcal{N}_d(0_d, M)$
4: Set $(x_0, v_0) \leftarrow (X, \xi_0)$ and $\Delta \leftarrow 0$
5: for $i = 1$ to $L$ do
6:   Refresh $v'_{i-1} \leftarrow \eta\, v_{i-1} + \sqrt{1 - \eta^2}\,\xi_i$
7:   Set $(x_i, v_i) \leftarrow \mathrm{LEAPFROG}(x_{i-1}, v'_{i-1}; M, h)$
8:   Update $\Delta \leftarrow \Delta + \left(|v_i|^2_{M^{-1}} - |v'_{i-1}|^2_{M^{-1}}\right)/2$
9: end for
10: Set $X' \leftarrow x_L$ and $\Delta \leftarrow \Delta + \Phi(x_L) - \Phi(x_0)$
11: Draw $Z \sim \mathrm{Exp}(1)$
12: if $Z < \Delta$ then set $X' \leftarrow X$
13: end if
14: return $X'$, $v_L$, $x_0$, $v'_0$, $\Delta$
15: end function
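A minimal Python sketch of Algorithm 1, reusing the leapfrog and partial_refresh helpers above and returning only the new state for brevity (an illustration, not the authors' implementation):

```python
def malt(X, phi, grad_phi, M_chol, M_inv, gamma, h, tau, rng):
    """One MALT iteration, following Algorithm 1 (illustrative sketch)."""
    L = int(np.ceil(tau / h))
    eta = np.exp(-gamma * h)
    sq = lambda v: float(v @ M_inv @ v)                 # |v|^2_{M^{-1}}

    x, v = X, M_chol @ rng.standard_normal(X.shape[0])  # full refresh: v_0 ~ N_d(0_d, M)
    delta = 0.0
    for _ in range(L):
        v_ref = partial_refresh(v, eta, M_chol, rng)    # partial momentum refreshment
        x, v = leapfrog(x, v_ref, grad_phi, M_inv, h)   # one leapfrog step
        delta += 0.5 * (sq(v) - sq(v_ref))              # kinetic part of the energy error
    delta += phi(x) - phi(X)                            # potential part of the energy error
    if rng.exponential() < delta:                       # reject w.p. 1 - exp(-Delta^+)
        return X
    return x                                            # accept: X' = x_L
```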
In this algorithm, the overall numerical error unfolds as $\Delta = \sum_{i=1}^{L} \delta_i$, where
$$\delta_i = \Phi(x_i) - \Phi(x_{i-1}) + \left(|v_i|^2_{M^{-1}} - |v'_{i-1}|^2_{M^{-1}}\right)/2$$
is the energy difference incurred by the $i$th leapfrog step. As a consequence, the probability of accepting an entire trajectory for MALT is higher than the probability of accepting $L$ consecutive leapfrog steps for GHMC. Indeed:
$$P^{\mathrm{accept}}_{\mathrm{GHMC}} = e^{-\sum_{i=1}^{L} \delta_i^+} \le e^{-\left(\sum_{i=1}^{L} \delta_i\right)^+} = P^{\mathrm{accept}}_{\mathrm{MALT}}.$$
Detailed balance and ergodicity of Algorithm 1 are proven in Riou-Durand and Vogrinc (2022). Notably, the momentum flips get erased by fully refreshing $v_0 \sim \mathcal{N}_d(0_d, M)$ at the start of each trajectory. This ensures reversibility of the sampler with respect to $\Pi$, and makes HMC a special case of MALT, obtained for $\gamma = 0$. As $\gamma$ gets larger, Langevin trajectories become ergodic, thereby preventing U-turns and resonances. Compared to GHMC, Algorithm 1 also exhibits an advantage for tuning the step-size, which is supported by optimal scaling results derived in Riou-Durand and Vogrinc (2022). The tuning of MALT's parameters is further discussed in Section 3.
2.4 Adaptive Tuning Strategies
To measure the efficiency of an MCMC sampler, we consider the problem of estimating $\mathbb{E}[\varphi(x)]$ as $x \sim \Pi$ for a square integrable function $\varphi : \mathbb{R}^d \to \mathbb{R}$. The Monte Carlo estimator, $\hat{\varphi}_N = N^{-1}\sum_{n=1}^{N} \varphi(X_n)$, estimates the desired expectation value, with $N$ the number of approximate samples. A scale-free measurement of $\hat{\varphi}_N$'s precision is given by the Effective Sample Size (ESS).