
While REPS is a principled approach to stochastic optimization, we posit two weaknesses. First, the hard KL constraint is difficult to specify, as it depends on the optimization problem, distribution family, and dimensionality. Second, the Monte Carlo approximation of the dual has no regularization and may adhere poorly to the KL constraint without sufficient samples. We therefore desire an alternative approach that resolves these two issues, capturing the Monte Carlo approximation error with a simpler hyperparameter. To tackle this problem, we interpret the REPS update as a pseudo-posterior, where the temperature is calculated using the KL constraint. We make this interpretation concrete by reversing the objective and constraint, switching to an equality constraint for the expectation,
$$\min_\theta\; D_{\mathrm{KL}}[\,q_\theta(A \,|\, S) \,\|\, p(A \,|\, S)\,] \quad \text{s.t.} \quad \mathbb{E}_{s_{t+1}\sim p(\cdot\,|\,s_t,a_t),\, a_t\sim q_\theta(\cdot\,|\,s_t),\, s_1\sim p(\cdot)}\Big[\textstyle\sum_t r(s_t,a_t)\Big] = R^*.$$
This objective is a minimum relative entropy problem [48], which yields the same Gibbs posterior
as eREPS (Lemma 1, Appendix A). With exact inference, a suitable prior and oracle knowledge
of the maximum return, this program computes the optimal policy in a single step by setting R∗
to the optimal value. However, in this work, the expectation constraint requires self-normalized
importance sampling (SNIS) on sampled returns $R^{(n)}$ using samples from the current policy prior,
$$\mathbb{E}_{s_{t+1}\sim p(\cdot\,|\,s_t,a_t),\, a_t\sim q_\theta(\cdot\,|\,s_t),\, s_1\sim p(\cdot)}\Big[\textstyle\sum_t r(s_t,a_t)\Big] \;\approx\; \sum_n w^{(n)}_{q/p}\, R^{(n)} \;=\; \frac{\sum_n R^{(n)}\exp(\alpha R^{(n)})}{\sum_n \exp(\alpha R^{(n)})} \;=\; R^*.$$
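To make the estimator concrete, here is a minimal sketch (hypothetical names; assuming the returns $R^{(n)}$ were collected by rolling out the current policy prior $p$) of the self-normalized Gibbs-weighted return for a given inverse temperature $\alpha$:

```python
import numpy as np

def snis_return_estimate(returns: np.ndarray, alpha: float) -> float:
    """Self-normalized IS estimate of the expected return under the Gibbs
    posterior q_alpha, using returns R^(n) sampled under the prior p."""
    logits = alpha * returns
    weights = np.exp(logits - logits.max())  # shift for numerical stability;
    weights /= weights.sum()                 # the shift cancels after normalization
    return float(np.dot(weights, returns))   # sum_n w^(n) R^(n)
```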
Rather than specifying $R^*$ here, we identify that this estimator is fundamentally limited by inference
accuracy. We capture this error by applying an IS-derived concentration inequality to this estimate
(Theorem 1) [49]. This lower bound can be used as an objective for optimizing α, balancing policy
improvement with approximate inference accuracy.
Theorem 1. (Importance sampling estimator concentration inequality (Theorem 2, [49])) Let $q$ and $p$ be two probability densities such that $q \ll p$ and $d_2[q\,\|\,p] < +\infty$. Let $x_1, x_2, \dots, x_N$ be i.i.d. random variables sampled from $p$, and let $f\colon \mathcal{X} \to \mathbb{R}$ be a bounded function ($\|f\|_\infty < +\infty$). Then, for any $0 < \delta \leq 1$ and $N > 0$, with probability at least $1-\delta$:
$$\mathbb{E}_{x\sim q(\cdot)}[f(x)] \;\geq\; \frac{1}{N}\sum_{i=1}^{N} w_{q/p}(x_i)\, f(x_i) \;-\; \|f\|_\infty \sqrt{\frac{(1-\delta)\, d_2[q(x)\,\|\,p(x)]}{\delta N}}. \tag{4}$$
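For illustration, the right-hand side of Equation 4 can be evaluated directly when the divergence term is supplied; a hypothetical sketch:

```python
import numpy as np

def is_lower_bound(weights: np.ndarray, f_vals: np.ndarray,
                   f_inf: float, d2: float, delta: float) -> float:
    """High-probability lower bound on E_q[f] from Equation 4.

    weights: importance weights w_{q/p}(x_i) for samples x_i ~ p
    f_vals:  f(x_i) evaluated at the same samples
    f_inf:   bound on ||f||_inf
    d2:      exponentiated Renyi-2 divergence d2[q || p]
    delta:   the bound holds with probability at least 1 - delta
    """
    n = len(f_vals)
    estimate = np.mean(weights * f_vals)
    penalty = f_inf * np.sqrt((1.0 - delta) * d2 / (delta * n))
    return float(estimate - penalty)
```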
The divergence term $d_2[q\,\|\,p]$ is the exponentiated Rényi-2 divergence, $\exp D_2[q\,\|\,p]$. While this is tractable for the multivariate Gaussian, it is otherwise not available in closed form. Fortunately, we can use the effective sample size (ESS) [50] as an approximation, as $\hat{N}_\alpha \approx N / d_2[q_\alpha\,\|\,p]$ [49, 51]
(Lemma 2, see Section A of the Appendix). Combining Equation 4 with our constraint, instead of setting $R^*$ we maximize the IS lower bound $R^*_{\mathrm{LB}}$ to form an objective for the inverse temperature $\alpha$, which incorporates the inference accuracy due to finite sampling, given that the inequality holds with probability $1-\delta$,
$$\max_\alpha\; R^*_{\mathrm{LB}}(\alpha, \delta) = \mathbb{E}_{q_\alpha/p}[R] - \mathcal{E}_R(\delta, \hat{N}_\alpha), \qquad \mathcal{E}_R(\delta, \hat{N}_\alpha) = \|R\|_\infty \sqrt{\frac{1-\delta}{\delta}}\, \frac{1}{\sqrt{\hat{N}_\alpha}}. \tag{5}$$
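As a sketch (hypothetical names; the maximum over $\alpha$ is taken here by a simple grid search), this objective can be evaluated from sampled returns by replacing $d_2[q_\alpha\,\|\,p]$ with $N/\hat{N}_\alpha$:

```python
import numpy as np

def lbps_objective(returns: np.ndarray, alpha: float, delta: float, r_inf: float) -> float:
    """R*_LB(alpha, delta) of Equation 5, using SNIS weights and the ESS approximation."""
    logits = alpha * returns
    w = np.exp(logits - logits.max())
    w /= w.sum()                                  # self-normalized Gibbs weights
    ess = 1.0 / np.sum(w ** 2)                    # N_hat_alpha ~ N / d2[q_alpha || p]
    expected_return = float(np.dot(w, returns))   # E_{q_alpha / p}[R]
    error = r_inf * np.sqrt((1.0 - delta) / delta) / np.sqrt(ess)
    return expected_return - error

# Select alpha by maximizing the lower bound over a coarse grid (illustrative values).
returns = np.random.randn(64)                     # placeholder rollout returns
alphas = np.linspace(0.0, 10.0, 101)
r_inf = np.abs(returns).max()                     # stand-in for ||R||_inf
alpha_star = max(alphas, key=lambda a: lbps_objective(returns, a, delta=0.1, r_inf=r_inf))
```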
We refer to this approach as lower-bound policy search (LBPS). This objective combines the expected performance of $q_\alpha$, based on the IS estimate $\mathbb{E}_{q_\alpha/p}[\cdot]$, with regularization $\mathcal{E}_R$ based on the return and the inference accuracy. Treating $p$, $N$, and $\|R\|_\infty$ as task-specific hyperparameters, the only algorithm hyperparameter $\delta \in (0, 1]$ defines the probability with which the bound holds. In practice, self-normalized importance sampling is used for PPI, as the normalizing constants of the Gibbs likelihoods are not available. While Metelli et al. also derive an SNIS lower bound [49], we found, as they did, that the IS lower bound with SNIS estimates works better in practice due to the conservatism of the SNIS bound. An interpretation of this approach is that the Rényi-2 regularization constrains the Gibbs posterior to be one that can be estimated from the finite samples, as the divergence is used in evaluating IS sample complexity [52, 53]. Moreover, the role of the ESS for regularization is similar to that of the 'elite' samples in CEM. Connecting these two mechanisms as robust maximum estimators (Section A), we also propose effective sample size policy search (ESSPS), which optimizes $\alpha$ to achieve a desired ESS $N^*$, i.e. a Rényi-2 divergence bound, using the objective $\min_\alpha |\hat{N}_\alpha - N^*|$, as sketched below. More details regarding PPI (Section A) and temperature selection methods (Table 1) are in the Appendix.
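A minimal sketch of ESSPS temperature selection (hypothetical names; a simple bisection, assuming the ESS decreases monotonically in $\alpha$, which holds in typical cases but is not guaranteed):

```python
import numpy as np

def ess(returns: np.ndarray, alpha: float) -> float:
    """Effective sample size N_hat_alpha of the self-normalized Gibbs weights exp(alpha * R)."""
    logits = alpha * returns
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

def essps_alpha(returns: np.ndarray, target_ess: float,
                alpha_max: float = 1e3, iters: int = 50) -> float:
    """Approximately solve min_alpha |N_hat_alpha - N*| by bisection on [0, alpha_max]."""
    lo, hi = 0.0, alpha_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ess(returns, mid) > target_ess:
            lo = mid   # ESS above the target: sharpen the posterior (increase alpha)
        else:
            hi = mid   # ESS below the target: soften the posterior (decrease alpha)
    return 0.5 * (lo + hi)
```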
This section introduces two methods, LBPS and ESSPS, for constraining the Gibbs posteriors used in Monte Carlo optimization. These methods provide statistical regularization through soft and hard constraints involving the effective sample size, which avoids the pitfall of fitting high-dimensional distributions to a few effective samples. A popular setting for these methods is MPC, which performs episodic optimization over short planning horizons while adapting to the current state at each time step.
Moreover, for optimal control, we also need to specify a suitable prior over action sequences. To
apply PPI to the MPC setting, we must implement online optimization given this prior over actions.