Inferring Smooth Control: Monte Carlo
Posterior Policy Iteration with Gaussian Processes
Joe Watson1, Jan Peters1,2,3,4
1Department of Computer Science, Technical University of Darmstadt
2Centre for Cognitive Science, Technical University of Darmstadt
3German Research Center for AI 4Hessian.AI
{joe,jan}@robot-learning.de
Abstract: Monte Carlo methods have become increasingly relevant for control of non-differentiable systems, approximate dynamics models and learning from data. These methods scale to high-dimensional spaces and are effective at the non-convex optimizations often seen in robot learning. We look at sample-based methods from the perspective of inference-based control, specifically posterior policy iteration. From this perspective, we highlight how Gaussian noise priors produce rough control actions that are unsuitable for physical robot deployment. Considering smoother Gaussian process priors, as used in episodic reinforcement learning and motion planning, we demonstrate how smoother model predictive control can be achieved using online sequential inference. This inference is realized through an efficient factorization of the action distribution and a novel means of optimizing the likelihood temperature to improve importance sampling accuracy. We evaluate this approach on several high-dimensional robot control tasks, matching the sample efficiency of prior heuristic methods while also ensuring smoothness. Simulation results can be seen at monte-carlo-ppi.github.io.
Keywords: approximate inference, policy search, model predictive control
[Figure 1 plots: rewards r over T and actions a over the horizon H (range −1 to 1), comparing the baseline (iCEM, coloured noise) with its action samples against ours (LBPS, SE kernel) with its action prior samples.]
Figure 1: High-dimensional, contact-rich tasks such as manipulation (left) can be performed effectively using sample-based model predictive control. While prior work uses correlated actuator noise to improve sample efficiency and exploration, these methods do not preserve the smoothness in the downstream actuation a, resulting in aggressive control (center). We use smooth Gaussian process priors to infer posterior actions (right), which preserves smoothness while maintaining performance and sample efficiency, as both are using only 32 samples. Rewards r show quartiles over 25 seeds.
1 Introduction
Learning robot control requires optimization to be performed on sampled transitions of the environment [1]. Monte Carlo methods [2] provide a principled means to approach such algorithms, bridging black-box optimization and approximate inference techniques. These methods have been adopted extensively by the community for their impressive simulated [3, 4, 5, 6] and real-world [7, 8, 9, 10, 11, 12, 13, 14, 15] robot learning results. Their appeal includes requiring only function evaluations of the dynamics and objective, so they can be applied to complex environments with minimal overhead (Figure 1). Moreover, their stochastic nature also avoids issues with local minima
that occur with gradient-based solvers [16, 17]. Finally, while Monte Carlo sampling is expensive, shooting methods can be effectively parallelized across processes, and the advent of simulations on GPUs also provides a means of acceleration [18, 13]. However, some aspects of black-box optimization are open to criticism. Sample-based solvers such as the cross-entropy method (CEM) [19] appear ‘wasteful’, ignoring computation by throwing away the majority of samples, while others enforce high-entropy search distributions to avoid premature convergence [18]. Moreover, many design decisions and hyperparameters are heuristic in nature, which is undesirable from both the user's and the researcher's perspective when interpreting, tuning or advancing these methods.
In this work, we consider Monte Carlo optimal control through the broader perspective of inference-based control [20, 21, 22, 23, 24, 25, 26, 27], where optimization is achieved through importance sampling [28]. This approach covers settings such as policy search [29], motion planning [8, 30] and model predictive control (MPC) [18]. From this viewpoint, we highlight two key design decisions: the likelihood temperature and the distribution over action sequences. An adaptive temperature scheme is crucial for controlling the optimization behavior across objectives and distributions, but in many methods this aspect is ignored or opaque. Moreover, correlated action sequences are equally crucial for performing effective exploration and control in practical settings. Smoothness, arising from such correlations, is an aspect of human motion [31]. Smooth priors have taken many forms across domains, such as movement primitives [32], smoothed [11, 12] or coloured noise [4]. We use Gaussian processes [33] as action priors and show how they can be scaled to high-dimensional action spaces through factorization of the covariance. Evaluating on simulated robotic systems, we reproduce prior results on policy search while transferring these ideas to MPC, matching prior performance with respect to sample efficiency while ensuring smooth actuation.
Contribution. First, we present a perspective of episodic inference-based control based on Gibbs posteriors. Second, using this view, we present novel Monte Carlo variants that incorporate the approximate inference error due to importance sampling, simplifying the hyperparameters while providing regularization. Third, we demonstrate how richer Gaussian process priors can be combined with these regularized Gibbs posteriors for Monte Carlo MPC using online sequential inference, which achieves greater smoothness and sample efficiency than standard white noise priors. We highlight connections between this approach to MPC and effective prior approaches to episodic policy search.
2 Monte Carlo Methods for Optimal Control
This section outlines the problem setting and introduces variational optimization and posterior policy iteration methods. We consider the standard (stochastic) optimal control setting in discrete time, with states $s \in \mathbb{R}^{d_s}$ and actions $a \in \mathbb{R}^{d_a}$. Optimization is framed as maximizing a reward $r: \mathbb{R}^{d_s} \times \mathbb{R}^{d_a} \rightarrow \mathbb{R}$ under dynamics $p(s_{t+1} \mid s_t, a_t)$ and initial state distribution $p(s_1)$,

$$\max_{a_1,\dots,a_T} \; \mathbb{E}\Big[\textstyle\sum_{t=1}^{T} r(s_t, a_t)\Big] \quad \text{s.t.} \quad s_{t+1} \sim p(\cdot \mid s_t, a_t), \quad s_1 \sim p(s_1). \tag{1}$$

This work focuses on the episodic setting, where optimization is performed after evaluating the current solution over a finite-time horizon $T$. We frequently use the episodic return $R$, where $R(S, A) = \sum_{t=1}^{T} r(s_t, a_t)$, using upper case to denote sequences, e.g. $A := \{a_1, \dots, a_T\}$.
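To make the episodic setting concrete, the following is a minimal sketch (not from the paper) of evaluating the return $R(S, A)$ of an open-loop action sequence by rolling out sampled dynamics; `env_step` and `env_reset` are assumed black-box interfaces.

```python
# Minimal sketch: evaluating the episodic return R(S, A) of an open-loop
# action sequence on a black-box sampled environment (assumed interfaces).
import numpy as np

def episodic_return(env_step, env_reset, actions, rng):
    """Roll out an open-loop action sequence and sum the rewards.

    env_reset(rng) -> s1 samples the initial state, and
    env_step(s, a, rng) -> (s_next, r) samples the (stochastic) dynamics
    and evaluates the reward; both are assumed black-box.
    """
    s = env_reset(rng)
    total = 0.0
    for a in actions:          # actions has shape (T, d_a)
        s, r = env_step(s, a, rng)
        total += r
    return total               # R(S, A) = sum_t r(s_t, a_t)
```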
2.1 Variational Optimization with Gibbs Posteriors
The optimization outlined above is amenable to gradient-based solvers such as stochastic differential dynamic programming [34]. However, to aid optimization through exploration and regularization, we can consider optimizing a parametric belief over action sequences $q \in \mathcal{Q}$. The variational formulation (Equation 3) generalizes Bayes' rule beyond optimizing likelihoods and resembles many learning algorithms [35, 36]. This work concerns optimizing an open-loop action sequence to maximize an episodic return. Bayesian inference of an action sequence from data, known as input estimation, can be performed using message passing of the appropriate probabilistic graphical model, capturing the sequential structure of the problem and necessary priors [27]. If the measurement log-likelihood is replaced with the control objective, this inference computation can be shown to have precise dualities with dynamic programming-based optimal control [37]. While this switch in objective provides a powerful suite of inference tools for efficient computation, it requires treating the control objective as a Markovian log-likelihood, which is not the case for episodic objectives. The Gibbs likelihood is a general treatment of the objective-as-likelihood (Definition 1) [38, 39].
Definition 1. (Gibbs likelihoods and posteriors) For a loss $f$ and prior $p(x)$, the Gibbs posterior $q_\alpha$ for parameter $x$ is derived by constructing the Gibbs likelihood $\exp(-\alpha f(x))$ from the loss,

$$q_\alpha(x) = \frac{1}{Z_\alpha} \exp(-\alpha f(x))\, p(x), \qquad Z_\alpha = \int \exp(-\alpha f(x))\, p(x)\, \mathrm{d}x, \qquad \alpha \geq 0. \tag{2}$$

This posterior minimizes the following objective

$$q_\alpha = \arg\min_{q \in \mathcal{Q}} \; \mathbb{E}_{x \sim q(\cdot)}[f(x)] + \tfrac{1}{\alpha} D_{\mathrm{KL}}[q(x) \,\|\, p(x)]. \tag{3}$$

This objective appears in PAC-Bayes methods [38], mirror descent methods [40] and Bayesian inference as the evidence lower bound objective when $f(x)$ is a negative log-likelihood [39].
Augmenting the variational optimization objective with prior regularization (Equation 3), we obtain an expression for the optimal belief over the action sequence (Equation 2). The parameter $\alpha$ has a range of meanings, depending on context. In PAC-Bayes it is the dataset size, in mirror descent it is an update step size, and in risk-sensitive control it is the sensitivity [41, 42]. Example 1 in Appendix A examines a tractable linear-quadratic-Gaussian example of this update, demonstrating its relation to Newton-like optimization and highlighting the effect $\alpha$ has on the regularized update.
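As a toy illustration of Definition 1 (a sketch of ours, not the paper's implementation, assuming a quadratic loss and Gaussian prior), the Monte Carlo Gibbs-posterior weights below show how $\alpha$ interpolates between the prior ($\alpha \rightarrow 0$) and concentration on low-loss samples ($\alpha$ large).

```python
# Toy sketch of Definition 1: Monte Carlo Gibbs posterior weights for an
# assumed quadratic loss f(x) = x^2 under a Gaussian prior. Illustration only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=1000)   # samples from the prior p(x)
f = x ** 2                                       # loss evaluations

for alpha in [0.1, 1.0, 10.0]:
    logw = -alpha * f
    w = np.exp(logw - logw.max())                # self-normalized Gibbs weights
    w /= w.sum()
    post_mean = np.sum(w * x)                    # E_{q_alpha}[x] via importance sampling
    ess = 1.0 / np.sum(w ** 2)                   # effective sample size of the weights
    print(f"alpha={alpha:5.1f}  posterior mean={post_mean:+.3f}  ESS={ess:7.1f}")
```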
2.2 Posterior Policy Iteration
The optimal control problem (Equation 1) is ambiguous regarding whether the action sequence or state-action trajectory is the optimization variable. Applying the Gibbs posterior to the optimal control setting recovers Rawlik et al.'s posterior policy iteration [41], which can be implemented using the joint distribution or policy. We consider the following joint state-action distribution, which factorizes in the following Markovian fashion: $p(S, A) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, p(a_t \mid s_t)$. Posterior policy iteration updates the state-action distribution through the policy, constructing a Gibbs likelihood from the reward, as the dynamics and initial state distribution are constant.

Definition 2. (Posterior policy iteration (PPI) [43]) As the initial distribution and dynamics are shared by the prior and posterior joint state-action distribution, the joint Gibbs posterior $q_\alpha(S, A) \propto \exp(\alpha R(S, A))\, p(S, A)$ can be alternatively expressed using the policy posterior update $q_\alpha(A \mid S) \propto \exp(\alpha R(S, A))\, p(A \mid S)$.

Using this update, the key decisions are choosing $p(A \mid S)$, $\alpha$ and the inference approximation. If $p$ and $q_\alpha$ are Gaussian, then PPI involves iterative refinement of the distribution. In the Monte Carlo setting, $q_\alpha$ takes the form of an importance-weighted empirical distribution. To apply it iteratively, $p$ is updated using the M-projection, following the objective (Equation 3), i.e. a weighted maximum likelihood fit of the policy parameters [29]. This approach is a stochastic approximate expectation maximization (SAEM) method [44] and is described fully in Algorithm 1 in the Appendix. We argue a key aspect of PPI methods is how to specify the inverse temperature $\alpha$ during optimization (Section 3), as it has a strong influence on the posterior, which is important when fitting rich distributions such as Gaussian processes (Section 4) from samples. Gaussian process action priors can be applied to several control settings, such as policy search and model predictive control (Section 6).
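The following sketch outlines one Monte Carlo PPI step for an open-loop Gaussian action-sequence distribution. It assumes returns can be evaluated by rollout and that $\alpha$ is given (Section 3 discusses its selection); it illustrates the weighted M-projection described above and is not the paper's Algorithm 1.

```python
# Sketch of one Monte Carlo posterior policy iteration step for an open-loop
# Gaussian action-sequence distribution (illustration only).
import numpy as np

def ppi_step(mu, cov, episodic_return, alpha, n_samples, rng):
    """mu: (T * d_a,) flattened mean action sequence, cov: its covariance.
    episodic_return: assumed callable mapping a flattened action sequence to
    its sampled return R (e.g. a wrapper around the rollout sketch above)."""
    A = rng.multivariate_normal(mu, cov, size=n_samples)   # sample action sequences
    R = np.array([episodic_return(a) for a in A])          # evaluate returns
    logw = alpha * R
    w = np.exp(logw - logw.max())
    w /= w.sum()                                            # SNIS weights of the Gibbs posterior
    # M-projection: weighted maximum-likelihood fit of the Gaussian policy.
    mu_new = w @ A
    centred = A - mu_new
    cov_new = centred.T @ (w[:, None] * centred) + 1e-6 * np.eye(len(mu))
    return mu_new, cov_new
```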
3 Posterior Policy Constraints for Monte Carlo Optimization
The Gibbs posterior in Definition 2 has been adopted widely in control, albeit from a range of different perspectives, such as Bayesian smoothing [23], solutions to the Feynman-Kac equation [45], maximum entropy [26], mirror descent [46] and entropy-regularized reinforcement learning [47]. An open question is how best to set $\alpha$ for Monte Carlo optimization. Relative entropy policy search (Definition 3) provides a principled and effective means of deriving $\alpha$ for stochastic optimization, using the constrained optimization view of entropy-regularized optimal control.
Definition 3. (Episodic relative entropy policy search (eREPS) [29]) Maximize the expected return, subject to a hard KL bound on the policy update,

$$\max_\theta \; \mathbb{E}_{s_{t+1}\sim p(\cdot\mid s_t,a_t),\, a_t\sim q_\theta(\cdot\mid s_t),\, s_1\sim p(\cdot)}[R(S, A)] \quad \text{s.t.} \quad D_{\mathrm{KL}}[q_\theta(A \mid S) \,\|\, p(A \mid S)] \leq \epsilon.$$

The posterior policy takes the form $q_\theta(A \mid S) \propto \exp(\alpha R)\, p(A \mid S)$, where $\alpha$ is derived from the Lagrange multiplier calculated by minimizing the empirical dual $G(\cdot)$ using $N$ samples,

$$\min_\alpha G(\alpha) = \frac{\epsilon}{\alpha} + \frac{1}{\alpha} \log \int p(S, A) \exp(\alpha R(S, A))\, \mathrm{d}S\, \mathrm{d}A \;\approx\; \frac{\epsilon}{\alpha} + \frac{1}{\alpha} \log \frac{1}{N} \sum_{n=1}^{N} \exp(\alpha R_n).$$
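Under the reconstruction above, a minimal sketch of minimizing this empirical dual for $\alpha$ might look as follows; the optimizer choice, bounds and the return-shift for numerical stability are illustrative assumptions rather than the paper's prescription.

```python
# Sketch of solving the empirical eREPS dual for the inverse temperature alpha.
import numpy as np
from scipy.optimize import minimize_scalar

def ereps_temperature(returns, epsilon):
    R = returns - returns.max()          # shifting by a constant does not change the argmin
    def dual(alpha):
        return epsilon / alpha + np.log(np.mean(np.exp(alpha * R))) / alpha
    sol = minimize_scalar(dual, bounds=(1e-6, 1e3), method="bounded")
    return sol.x

# Usage example with dummy returns:
rng = np.random.default_rng(0)
alpha = ereps_temperature(rng.normal(size=64), epsilon=0.1)
```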
While REPS is a principled approach to stochastic optimization, we posit two weaknesses: The hard KL constraint is difficult to specify, as it depends on the optimization problem, distribution family and dimensionality. Secondly, the Monte Carlo approximation of the dual has no regularization and may poorly adhere to the KL constraint without sufficient samples. Therefore, we desire an alternative approach that resolves these two issues, capturing the Monte Carlo approximation error with a simpler hyperparameter. To tackle this problem, we interpret the REPS update as a pseudo-posterior, where the temperature is calculated using the KL constraint. We make this interpretation concrete by reversing the objective and constraint, switching to an equality constraint for the expectation,

$$\min_\theta \; D_{\mathrm{KL}}[q_\theta(A \mid S) \,\|\, p(A \mid S)] \quad \text{s.t.} \quad \mathbb{E}_{s_{t+1}\sim p(\cdot\mid s_t,a_t),\, a_t\sim q_\theta(\cdot\mid s_t),\, s_1\sim p(\cdot)}\big[\textstyle\sum_t r(s_t, a_t)\big] = R^*.$$

This objective is a minimum relative entropy problem [48], which yields the same Gibbs posterior as eREPS (Lemma 1, Appendix A). With exact inference, a suitable prior and oracle knowledge of the maximum return, this program computes the optimal policy in a single step by setting $R^*$ to the optimal value. However, in this work, the expectation constraint requires self-normalized importance sampling (SNIS) on sampled returns $R^{(n)}$ using samples from the current policy prior,

$$\mathbb{E}_{s_{t+1}\sim p(\cdot\mid s_t,a_t),\, a_t\sim q_\theta(\cdot\mid s_t),\, s_1\sim p(\cdot)}\big[\textstyle\sum_t r(s_t, a_t)\big] \;\approx\; \sum_n w^{(n)}_{q/p} R^{(n)} = \frac{\sum_n R^{(n)} \exp(\alpha R^{(n)})}{\sum_n \exp(\alpha R^{(n)})} = R^*.$$

Rather than specifying $R^*$ here, we identify that this estimator is fundamentally limited by inference accuracy. We capture this error by applying an IS-derived concentration inequality to this estimate (Theorem 1) [49]. This lower bound can be used as an objective for optimizing $\alpha$, balancing policy improvement with approximate inference accuracy.
Theorem 1. (Importance sampling estimator concentration inequality (Theorem 2, [49])) Let $q$ and $p$ be two probability densities such that $q \ll p$ and $d_2[q \,\|\, p] < +\infty$. Let $x_1, x_2, \dots, x_N$ be i.i.d. random variables sampled from $p$ and $f: \mathcal{X} \rightarrow \mathbb{R}$ be a bounded function ($\|f\|_\infty < +\infty$). Then, for any $0 < \delta \leq 1$ and $N > 0$, with probability at least $1 - \delta$:

$$\mathbb{E}_{x\sim q(\cdot)}[f(x)] \;\geq\; \frac{1}{N}\sum_{i=1}^{N} w_{q/p}(x_i) f(x_i) \;-\; \|f\|_\infty \sqrt{\frac{(1-\delta)\, d_2[q(x) \,\|\, p(x)]}{\delta N}}. \tag{4}$$

The divergence term $d_2[q \,\|\, p]$ is the exponentiated Rényi-2 divergence, $\exp D_2[q \,\|\, p]$. While this is tractable for the multivariate Gaussian, it is otherwise not available in closed form. Fortunately, we can use the effective sample size (ESS) [50] as an approximation, as $\hat{N}_\alpha \approx N / d_2[q_\alpha \,\|\, p]$ [49, 51] (Lemma 2, see Section A of the Appendix). Combining Equation 4 with our constraint, instead of setting $R^*$, we maximize the IS lower bound $R_{\mathrm{LB}}$ to form an objective for the inverse temperature $\alpha$ which incorporates the inference accuracy due to the sampling, given inequality probability $1 - \delta$,

$$\max_\alpha \; R_{\mathrm{LB}}(\alpha, \delta) = \mathbb{E}_{q_\alpha/p}[R] - \mathcal{E}_R(\delta, \hat{N}_\alpha), \qquad \mathcal{E}_R(\delta, \hat{N}_\alpha) = \|R\|_\infty \sqrt{\frac{1-\delta}{\delta}}\, \frac{1}{\sqrt{\hat{N}_\alpha}}. \tag{5}$$
We refer to this approach as lower-bound policy search (LBPS). This objective combines the expected performance of $q_\alpha$, based on the IS estimate $\mathbb{E}_{q_\alpha/p}[\cdot]$, with regularization $\mathcal{E}_R$ based on the return and inference accuracy. Treating $p$, $N$ and $\|R\|_\infty$ as task-specific hyperparameters, the only algorithm hyperparameter $\delta \in [0, 1)$ defines the probability of the bound. In practice, self-normalized importance sampling is used for PPI, as the normalizing constants of the Gibbs likelihoods are not available. While Metelli et al. also derive an SNIS lower bound [49], we found, as they did, that the IS lower bound with SNIS estimates works better in practice due to the conservatism of the SNIS bound. An interpretation of this approach is that the Rényi-2 regularization constrains the Gibbs posterior to be one that can be estimated from the finite samples, as the divergence is used in evaluating IS sample complexity [52, 53]. Moreover, the role of the ESS for regularization is similar to the ‘elite’ samples in CEM. Connecting these two mechanisms as robust maximum estimators (Section A), we also propose effective sample size policy search (ESSPS), which optimizes $\alpha$ to achieve a desired ESS $N^*$, i.e. a Rényi-2 divergence bound, using the objective $\min_\alpha |\hat{N}_\alpha - N^*|$. More details regarding PPI (Section A) and temperature selection methods (Table 1) are in the Appendix.
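A minimal sketch of the two temperature-selection objectives is given below, assuming SNIS weights computed from sampled returns; the function names, bounds and the use of a bounded scalar optimizer are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the LBPS and ESSPS temperature objectives (a minimal reading of Eq. 5).
import numpy as np
from scipy.optimize import minimize_scalar

def snis_weights(returns, alpha):
    logw = alpha * (returns - returns.max())
    w = np.exp(logw)
    return w / w.sum()

def ess(w):
    return 1.0 / np.sum(w ** 2)                       # N_hat ~ N / d_2[q_alpha || p]

def lbps_alpha(returns, delta, r_max):
    # Maximize the importance-sampling lower bound R_LB(alpha, delta).
    def neg_lower_bound(alpha):
        w = snis_weights(returns, alpha)
        expected_return = np.dot(w, returns)           # SNIS estimate of E_{q_alpha}[R]
        penalty = r_max * np.sqrt((1.0 - delta) / delta) / np.sqrt(ess(w))
        return -(expected_return - penalty)
    return minimize_scalar(neg_lower_bound, bounds=(1e-6, 1e3), method="bounded").x

def essps_alpha(returns, target_ess):
    # Choose alpha so the effective sample size matches a desired value N*.
    # (The gap is not guaranteed unimodal; a bounded search is used here for brevity.)
    def gap(alpha):
        return abs(ess(snis_weights(returns, alpha)) - target_ess)
    return minimize_scalar(gap, bounds=(1e-6, 1e3), method="bounded").x
```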
This section introduces two methods, LBPS and ESSPS, for constraining the Gibbs posteriors for
Monte Carlo optimization. These methods provide statistical regularization through soft and hard
constraints involving the effective sample size, which avoids the pitfall of fitting high-dimensional
distributions to a few effective samples. A popular setting for these methods is MPC, which performs
episodic optimization over short planning horizons while adapting each time step to the current state.
Moreover, for optimal control, we also need to specify a suitable prior over action sequences. To
apply PPI to the MPC setting, we must implement online optimization given this prior over actions.
[Figure 2 panels: action a versus time t under Gaussian noise, smooth noise (β=0.25), smooth actions (β=0.8), coloured noise (β=2.5), an SE kernel (l=0.2) and a periodic kernel (l=0.2, T=0.5).]
Figure 2: A practical aspect of Monte Carlo control methods for robotics is optimizing smooth action sequences. This example shows a non-smooth optimal sequence, which may be undesirable, though optimal, to fit exactly. Prior methods struggle to provide both effective smooth solutions in the mean and action samples, as they ultimately fit the action distribution in an independent fashion. Using a kernel-derived covariance function provides both. The line denotes the optimization horizon, beyond which are exploratory actions derived from both the posterior and prior.
4 Online Posterior Policy Iteration & Prior Design
In this section, we derive the online realization of posterior policy iteration that uses and maintains correlated action priors, computing the finite-horizon $H$ future actions given a likelihood on a subset of actions from the past. $R$ represents the return-based Gibbs likelihood term (Definition 1),

$$q_\alpha(a_{t:t+H} \mid R_{1:\tau}) = \int q_\alpha(a_{1:t+H} \mid R_{1:\tau})\, \mathrm{d}a_{1:t-1} \;\propto\; \int p(R_{1:\tau} \mid a_{1:\tau})\, p(a_{1:t+H})\, \mathrm{d}a_{1:t-1}, \tag{6}$$

where $\tau \leq t + H$. As an analogy, this is equivalent to combining forecasting with state estimation, i.e. $p(x_{t:t+H} \mid y_{1:t})$ for states $x$ and measurements $y$. For correlated priors on the action space, this computation is tractable if working with Gaussian processes. In fact, a recurring aspect across many posterior policy iteration-like approaches is the use of Gaussian process policies,

$$p(A \mid S) = \begin{cases} \prod_t \mathcal{N}(\mu_t, \Sigma_t), & \text{(independent Gaussian noise, e.g. [18])}, \\ \mathcal{N}(\mu_w^\top \phi(t),\, \phi(t)^\top \Sigma_w \phi(t)), & \text{(Bayesian linear regression, e.g. ProMP [32])}, \\ \prod_t \mathcal{N}(k_t + K_t s_t,\, \Sigma_t), & \text{(time-varying linear Gaussian, e.g. [41, 54, 42])}, \\ \mathcal{GP}(\mu(s), \Sigma(s)), & \text{(non-parametric Gaussian process [55])}. \end{cases}$$
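As a sketch of such a smooth action prior (the SE kernel and its hyperparameters below are illustrative assumptions, not the paper's configuration), the temporal covariance can be built once over the planning horizon and shared across action dimensions, so the full covariance factorizes across dimensions.

```python
# Sketch: a smooth action-sequence prior from a squared-exponential (SE) kernel
# over the planning horizon, independent across action dimensions.
import numpy as np

def se_kernel(ts, lengthscale=0.2, variance=1.0, jitter=1e-6):
    d = ts[:, None] - ts[None, :]
    K = variance * np.exp(-0.5 * (d / lengthscale) ** 2)
    return K + jitter * np.eye(len(ts))          # jitter keeps the Cholesky stable

H, d_a, dt = 30, 7, 0.02                          # horizon, action dimension, control period
ts = dt * np.arange(H)
K = se_kernel(ts)                                  # (H, H) temporal covariance, shared by all dims
L = np.linalg.cholesky(K)

rng = np.random.default_rng(0)
n = 32
# Sample n smooth action sequences: correlated over time, independent across dimensions.
samples = np.einsum("ij,njd->nid", L, rng.normal(size=(n, H, d_a)))
```

Sampling with the Cholesky factor of the kernel matrix yields temporally smooth sequences of the kind shown in the SE-kernel panel of Figure 2.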
Despite the simplicity of Gaussian action noise, for robotics, more sophisticated noise is often desired for safety and effective exploration [56, 29]. Prior work has proposed first-order smoothing [11, 12]. Using $v_t^{(n)} \sim \mathcal{N}(0, I)$, $\beta \in [0, 1]$ and $\Sigma_t = L_t L_t^\top$, actions are sampled using

$$a_t^{(n)} = \mu_t + L_t n_t^{(n)}, \qquad n_t^{(n)} = \beta v_t^{(n)} + (1-\beta)\, n_{t-1}^{(n)}, \quad \text{or} \quad n_t^{(n)} = \beta v_t^{(n)} + \sqrt{1-\beta^2}\, n_{t-1}^{(n)}.$$

However, in practice it is also implemented as $a_t^{(n)} = \beta(\mu_t + L_t v_t^{(n)}) + (1-\beta)\, a_{t-1}^{(n)}$.¹ While this approach directly smooths the actuation, it also introduces a lag, which may deteriorate performance.
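A minimal sketch of these two smoothing variants is given below to make the lag explicit; $\beta$, the per-step covariance factor and the initialization of the smoothed action are assumptions for illustration.

```python
# Sketch of first-order smoothed exploration noise: the additive-noise variant
# and the practical action-smoothing variant described above.
import numpy as np

def smoothed_noise_actions(mu, L, beta, rng):
    """mu: (T, d_a) mean actions, L: Cholesky factor of the per-step covariance."""
    T, d_a = mu.shape
    a_noise, a_smooth = np.zeros((T, d_a)), np.zeros((T, d_a))
    n = np.zeros(d_a)
    a_prev = mu[0]                                   # assumed initialization of the smoothed action
    for t in range(T):
        v = rng.normal(size=d_a)
        n = beta * v + (1.0 - beta) * n              # smoothed noise n_t (the sqrt(1-beta^2) variant is analogous)
        a_noise[t] = mu[t] + L @ n                   # adds correlated noise around the mean
        a_prev = beta * (mu[t] + L @ v) + (1.0 - beta) * a_prev
        a_smooth[t] = a_prev                         # exponential smoothing of the action itself (lagged)
    return a_noise, a_smooth
```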
Other approaches have used coloured noise for sampling the noise $n$ [4]. Contrast these approaches with Gibbs sampling a multivariate Gaussian joint distribution with 1-step cross-correlations [58], which is

$$a_{t\mid t-1}^{(n)} = \mu_{t\mid t-1}^{(n)} + L_{t\mid t-1} v_t^{(n)}, \quad \text{where} \quad \mu_{t\mid t-1}^{(n)} = \mu_t + \Sigma_{t,t-1}\Sigma_{t-1}^{-1}\big(a_{t-1}^{(n)} - \mu_{t-1}\big), \quad \Sigma_{t\mid t-1} = \Sigma_t - \Sigma_{t,t-1}\Sigma_{t-1}^{-1}\Sigma_{t,t-1}^\top.$$

The differences are subtle, but important. The initial proposed sampling scheme essentially adds correlated noise to the mean for exploration, but does not consider the smoothness of the mean itself. The practical implementation incorporates the previous action, but through exponential smoothing, which introduces a fixed lag that potentially degrades the quality of the mean action sequence. Correct sampling of the joint distribution has neither of these issues and naturally extends to correlations over several time steps. We do this in a general fashion by considering the (continuous-time) Gaussian process (see Section G, Appendix), so $p(a_{\boldsymbol{t}}) = \mathcal{N}(\mu_{t_i:t_j}, \Sigma_{t_i:t_j}) = \mathcal{GP}(\mu(t), \Sigma(t))$ for a discrete-time sequence $\boldsymbol{t} = [t_i, \dots, t_j]$. Proposition 1 shows how the time shift for MPC can be implemented in a general fashion when using GPs.
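Before the general result, a sketch of the conditional sampling scheme above (illustrative, single action dimension): it is exact when the prior's cross-correlations are limited to one step, i.e. a Markovian prior; for richer kernels such as the SE prior, sampling with the full Cholesky factor as in the earlier sketch generalizes it.

```python
# Sketch: sampling a correlated Gaussian action sequence by conditioning each
# step on the previous one (1-step cross-correlations), per the equations above.
import numpy as np

def sample_joint_conditionally(mu, Sigma, rng):
    """mu: (T,) mean, Sigma: (T, T) joint covariance (one action dimension for clarity)."""
    T = len(mu)
    a = np.zeros(T)
    a[0] = mu[0] + np.sqrt(Sigma[0, 0]) * rng.normal()
    for t in range(1, T):
        gain = Sigma[t, t - 1] / Sigma[t - 1, t - 1]
        cond_mean = mu[t] + gain * (a[t - 1] - mu[t - 1])     # conditions on the previous action
        cond_var = Sigma[t, t] - gain * Sigma[t, t - 1]
        a[t] = cond_mean + np.sqrt(cond_var) * rng.normal()
    return a
```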
Proposition 1. Given a Gaussian process prior $\mathcal{GP}(\mu(t), \Sigma(t))$ and multivariate normal posterior $q_\alpha(a_{t_1:t_2}) = \mathcal{N}(\mu_{t_1:t_2\mid R}, \Sigma_{t_1:t_2\mid R})$ for $t_1$ to $t_2$, the posterior for $t_3$ to $t_4$ is expressed as

$$\mu_{t_3:t_4\mid R} = \mu_{t_3:t_4} + \Sigma_{t_3:t_4,\,t_1:t_2}\, \nu_{t_1:t_2}, \qquad \Sigma_{t_3:t_4\mid R} = \Sigma_{t_3:t_4} - \Sigma_{t_3:t_4,\,t_1:t_2}\, \Lambda_{t_1:t_2}\, \Sigma_{t_3:t_4,\,t_1:t_2}^\top, \tag{7}$$

where $\nu_{t_1:t_2} = \Sigma_{t_1:t_2}^{-1}(\mu_{t_1:t_2\mid R} - \mu_{t_1:t_2})$ and $\Lambda_{t_1:t_2} = \Sigma_{t_1:t_2}^{-1}(\Sigma_{t_1:t_2} - \Sigma_{t_1:t_2\mid R})\Sigma_{t_1:t_2}^{-1}$.

This update combines the new sequence prior from $t_3$ to $t_4$ and the previous likelihood used in the update for $t_1$ to $t_2$, obtained from the posterior and prior. Note, the cross-covariance $\Sigma_{t_3:t_4,\,t_1:t_2}$ is computed using the covariance function of the prior GP. The proof is in Appendix A.
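As a rough illustration of Proposition 1 (a sketch, not the released implementation), the shift can be computed from the prior mean and covariance functions together with the previous posterior moments; in a receding-horizon loop, `t_old` and `t_new` would be successive planning windows.

```python
# Sketch of the GP posterior time shift of Equation 7.
import numpy as np

def shift_gp_posterior(mean_fn, cov_fn, t_old, t_new, mu_post_old, cov_post_old):
    """mean_fn(t) and cov_fn(t, t') evaluate the prior GP on time grids (assumed callables)."""
    mu_old, mu_new = mean_fn(t_old), mean_fn(t_new)
    S_old = cov_fn(t_old, t_old)
    S_new = cov_fn(t_new, t_new)
    S_cross = cov_fn(t_new, t_old)                    # prior cross-covariance between windows
    S_old_inv = np.linalg.inv(S_old)                  # a Cholesky solve would be preferable numerically
    nu = S_old_inv @ (mu_post_old - mu_old)           # pseudo-likelihood term from the old posterior
    Lam = S_old_inv @ (S_old - cov_post_old) @ S_old_inv
    mu_post_new = mu_new + S_cross @ nu
    cov_post_new = S_new - S_cross @ Lam @ S_cross.T
    return mu_post_new, cov_post_new
```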
1See the source code for Nagabandi et al. [11] and MBRL-lib [57].