
While REPS is a principled approach to stochastic optimization, we posit two weaknesses. First, the hard KL constraint is difficult to specify, as it depends on the optimization problem, distribution family, and dimensionality. Second, the Monte Carlo approximation of the dual has no regularization and may adhere poorly to the KL constraint without sufficient samples. We therefore desire an alternative approach that resolves these two issues, capturing the Monte Carlo approximation error with a simpler hyperparameter. To tackle this problem, we interpret the REPS update as a pseudo-posterior, where the temperature is calculated using the KL constraint. We make this interpretation concrete by reversing the objective and constraint, switching to an equality constraint for the expectation,
$$\min_\theta\; D_{\mathrm{KL}}[\,q_\theta(A \,|\, S) \,\|\, p(A \,|\, S)\,] \quad \text{s.t.} \quad \mathbb{E}_{s_{t+1}\sim p(\cdot\,|\,s_t,a_t),\, a_t\sim q_\theta(\cdot\,|\,s_t),\, s_1\sim p(\cdot)}\Big[\textstyle\sum_t r(s_t,a_t)\Big] = R^*.$$
This objective is a minimum relative entropy problem [48], which yields the same Gibbs posterior
as eREPS (Lemma 1, Appendix A). With exact inference, a suitable prior and oracle knowledge
of the maximum return, this program computes the optimal policy in a single step by setting R∗
to the optimal value. However, in this work, the expectation constraint requires self-normalized
importance sampling (SNIS) on sampled returns $R^{(n)}$ using samples from the current policy prior,
$$\mathbb{E}_{s_{t+1}\sim p(\cdot\,|\,s_t,a_t),\, a_t\sim q_\theta(\cdot\,|\,s_t),\, s_1\sim p(\cdot)}\Big[\textstyle\sum_t r(s_t,a_t)\Big] \;\approx\; \sum_n w^{(n)}_{q/p}\, R^{(n)} \;=\; \frac{\sum_n R^{(n)}\exp(\alpha R^{(n)})}{\sum_n \exp(\alpha R^{(n)})} \;=\; R^*.$$
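To make the estimator concrete, here is a minimal sketch (hypothetical names; assuming the returns $R^{(n)}$ were collected by rolling out the current policy prior $p$) of the self-normalized Gibbs-weighted return for a given inverse temperature $\alpha$:

```python
import numpy as np

def snis_return_estimate(returns: np.ndarray, alpha: float) -> float:
    """Self-normalized IS estimate of the expected return under the Gibbs
    posterior q_alpha, using returns R^(n) sampled under the prior p."""
    logits = alpha * returns
    weights = np.exp(logits - logits.max())  # shift for numerical stability;
    weights /= weights.sum()                 # the shift cancels after normalization
    return float(np.dot(weights, returns))   # sum_n w^(n) R^(n)
```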
Rather than specifying $R^*$ here, we identify that this estimator is fundamentally limited by inference
accuracy. We capture this error by applying an IS-derived concentration inequality to this estimate
(Theorem 1) [49]. This lower bound can be used as an objective for optimizing α, balancing policy
improvement with approximate inference accuracy.
Theorem 1. (Importance sampling estimator concentration inequality (Theorem 2, [49])) Let $q$ and $p$ be two probability densities such that $q \ll p$ and $d_2[q\,\|\,p] < +\infty$. Let $x_1, x_2, \dots, x_N$ be i.i.d. random variables sampled from $p$, and let $f\colon \mathcal{X} \to \mathbb{R}$ be a bounded function ($\|f\|_\infty < +\infty$). Then, for any $0 < \delta \leq 1$ and $N > 0$, with probability at least $1-\delta$:
$$\mathbb{E}_{x\sim q(\cdot)}[f(x)] \;\geq\; \frac{1}{N}\sum_{i=1}^{N} w_{q/p}(x_i)\, f(x_i) \;-\; \|f\|_\infty \sqrt{\frac{(1-\delta)\, d_2[q(x)\,\|\,p(x)]}{\delta N}}. \tag{4}$$
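For illustration, the right-hand side of Equation 4 can be evaluated directly when the divergence term is supplied; a hypothetical sketch:

```python
import numpy as np

def is_lower_bound(weights: np.ndarray, f_vals: np.ndarray,
                   f_inf: float, d2: float, delta: float) -> float:
    """High-probability lower bound on E_q[f] from Equation 4.

    weights: importance weights w_{q/p}(x_i) for samples x_i ~ p
    f_vals:  f(x_i) evaluated at the same samples
    f_inf:   bound on ||f||_inf
    d2:      exponentiated Renyi-2 divergence d2[q || p]
    delta:   the bound holds with probability at least 1 - delta
    """
    n = len(f_vals)
    estimate = np.mean(weights * f_vals)
    penalty = f_inf * np.sqrt((1.0 - delta) * d2 / (delta * n))
    return float(estimate - penalty)
```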
The divergence term $d_2[q\,\|\,p]$ is the exponentiated Rényi-2 divergence, $\exp D_2[q\,\|\,p]$. While this is tractable for the multivariate Gaussian, it is otherwise not available in closed form. Fortunately, we can use the effective sample size (ESS) [50] as an approximation, as $\hat{N}_\alpha \approx N / d_2[q_\alpha\,\|\,p]$ [49, 51]
(Lemma 2, see Section A of the Appendix). Combining Equation 4 with our constraint, instead of setting $R^*$ we maximize the IS lower bound $R^*_{\mathrm{LB}}$ to form an objective for the inverse temperature $\alpha$, which incorporates the inference accuracy due to finite sampling, given that the inequality holds with probability $1-\delta$,
$$\max_\alpha\; R^*_{\mathrm{LB}}(\alpha, \delta) = \mathbb{E}_{q_\alpha/p}[R] - \mathcal{E}_R(\delta, \hat{N}_\alpha), \qquad \mathcal{E}_R(\delta, \hat{N}_\alpha) = \|R\|_\infty \sqrt{\frac{1-\delta}{\delta}}\, \frac{1}{\sqrt{\hat{N}_\alpha}}. \tag{5}$$
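As a sketch (hypothetical names; the maximum over $\alpha$ is taken here by a simple grid search), this objective can be evaluated from sampled returns by replacing $d_2[q_\alpha\,\|\,p]$ with $N/\hat{N}_\alpha$:

```python
import numpy as np

def lbps_objective(returns: np.ndarray, alpha: float, delta: float, r_inf: float) -> float:
    """R*_LB(alpha, delta) of Equation 5, using SNIS weights and the ESS approximation."""
    logits = alpha * returns
    w = np.exp(logits - logits.max())
    w /= w.sum()                                  # self-normalized Gibbs weights
    ess = 1.0 / np.sum(w ** 2)                    # N_hat_alpha ~ N / d2[q_alpha || p]
    expected_return = float(np.dot(w, returns))   # E_{q_alpha / p}[R]
    error = r_inf * np.sqrt((1.0 - delta) / delta) / np.sqrt(ess)
    return expected_return - error

# Select alpha by maximizing the lower bound over a coarse grid (illustrative values).
returns = np.random.randn(64)                     # placeholder rollout returns
alphas = np.linspace(0.0, 10.0, 101)
r_inf = np.abs(returns).max()                     # stand-in for ||R||_inf
alpha_star = max(alphas, key=lambda a: lbps_objective(returns, a, delta=0.1, r_inf=r_inf))
```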
We refer to this approach as lower-bound policy search (LBPS). This objective combines the expected performance of $q_\alpha$, based on the IS estimate $\mathbb{E}_{q_\alpha/p}[\cdot]$, with regularization $\mathcal{E}_R$ based on the return and the inference accuracy. Treating $p$, $N$, and $\|R\|_\infty$ as task-specific hyperparameters, the only algorithm hyperparameter $\delta \in (0, 1]$ defines the probability with which the bound holds. In practice, self-normalized importance sampling is used for PPI, as the normalizing constants of the Gibbs likelihoods are not available. While Metelli et al. also derive an SNIS lower bound [49], we found, as they did, that the IS lower bound with SNIS estimates works better in practice due to the conservatism of the SNIS bound. An interpretation of this approach is that the Rényi-2 regularization constrains the Gibbs posterior to be one that can be estimated from the finite samples, as the divergence is used in evaluating IS sample complexity [52, 53]. Moreover, the role of the ESS for regularization is similar to that of the 'elite' samples in CEM. Connecting these two mechanisms as robust maximum estimators (Section A), we also propose effective sample size policy search (ESSPS), which optimizes $\alpha$ to achieve a desired ESS $N^*$, i.e. a Rényi-2 divergence bound, using the objective $\min_\alpha |\hat{N}_\alpha - N^*|$, as sketched below. More details regarding PPI (Section A) and temperature selection methods (Table 1) are in the Appendix.
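A minimal sketch of ESSPS temperature selection (hypothetical names; a simple bisection, assuming the ESS decreases monotonically in $\alpha$, which holds in typical cases but is not guaranteed):

```python
import numpy as np

def ess(returns: np.ndarray, alpha: float) -> float:
    """Effective sample size N_hat_alpha of the self-normalized Gibbs weights exp(alpha * R)."""
    logits = alpha * returns
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

def essps_alpha(returns: np.ndarray, target_ess: float,
                alpha_max: float = 1e3, iters: int = 50) -> float:
    """Approximately solve min_alpha |N_hat_alpha - N*| by bisection on [0, alpha_max]."""
    lo, hi = 0.0, alpha_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ess(returns, mid) > target_ess:
            lo = mid   # ESS above the target: sharpen the posterior (increase alpha)
        else:
            hi = mid   # ESS below the target: soften the posterior (decrease alpha)
    return 0.5 * (lo + hi)
```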
This section introduces two methods, LBPS and ESSPS, for constraining the Gibbs posteriors used in Monte Carlo optimization. These methods provide statistical regularization through soft and hard constraints involving the effective sample size, which avoids the pitfall of fitting high-dimensional distributions to a few effective samples. A popular setting for these methods is MPC, which performs episodic optimization over short planning horizons while adapting to the current state at each time step.
Moreover, for optimal control, we also need to specify a suitable prior over action sequences. To
apply PPI to the MPC setting, we must implement online optimization given this prior over actions.