Tabular RL is often solved by choosing actions based on upper confidence bounds on the value function [14, 36], but explicitly computing and optimizing these bounds in the continuous setting is substantially more challenging. Recent work [16] approximates this method by computing one-step confidence bounds on the dynamics and training a 'hallucinated' policy which chooses perturbations within these bounds to maximize expected policy performance. Another recent work [5] uses anti-concentration inequalities to approximate upper confidence bounds in MDPs with discrete actions.
Thompson sampling (TS) [55], which samples a realization of the MDP from the posterior and acts optimally as if that realization were the true model, can be applied for exploration in a model-free manner as in [45] or in a model-based manner as in [63]. As the posterior over MDP dynamics or value functions can be high-dimensional and difficult to represent, the performance of TS can be hindered by approximation errors when using both Gaussian processes and ensembles of neural networks. Curi et al. [16] recently investigated this and found that it was potentially due to an insufficiently expressive posterior over entire transition functions, implying that it may be quite difficult to solve tasks using sampled models. Similarly, the posterior over action-value functions in Osband et al. [45] is only roughly approximated by training a bootstrapped ensemble of neural networks.
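To make this mechanism concrete, the minimal sketch below illustrates posterior-sampling exploration with a bootstrapped ensemble standing in for the posterior; the chain environment, tabular Q-functions, ensemble size, and learning rate are illustrative assumptions and not the constructions used in [45] or [63].

```python
# Minimal sketch of Thompson-sampling exploration with an ensemble "posterior"
# over Q-functions (hypothetical toy setup; bootstrap masks omitted).
import numpy as np

n_states, n_actions, horizon, n_ensemble = 10, 2, 15, 5
rng = np.random.default_rng(0)

# Ensemble of tabular Q-functions approximates a posterior over value functions.
Q_ensemble = [rng.normal(scale=0.1, size=(n_states, n_actions))
              for _ in range(n_ensemble)]

def step(s, a):
    """Hypothetical chain environment: action 1 moves right, reward at the end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == n_states - 1)

for episode in range(200):
    Q = Q_ensemble[rng.integers(n_ensemble)]   # sample one posterior realization
    s = 0
    for t in range(horizon):
        a = int(np.argmax(Q[s]))               # act greedily under the sample
        s_next, r = step(s, a)
        # Update every ensemble member on the shared data.
        for Qk in Q_ensemble:
            target = r + np.max(Qk[s_next])
            Qk[s, a] += 0.5 * (target - Qk[s, a])
        s = s_next
```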
There is also a rich literature on Bayesian methods for exploration, which are typically computationally expensive and hard to use despite their attractive theoretical properties. These methods build upon the fundamental idea of the Bayes-adaptive MDP [53], which we detail in Section E.1 alongside a discussion of this literature.
Additionally, a broad set of methods explore to learn about the environment without addressing a specified task. This line of work is exemplified by Pathak et al. [47], who synthesize a task-agnostic reward function from model errors. Other techniques include MAX [61], which optimizes the information gain about the environment dynamics; Random Network Distillation [11], which forces the agent to learn about a random neural network across the state space; and Plan2Explore [60], which prospectively plans to find areas of novelty where the dynamics are uncertain.
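As an illustration of the second of these, the sketch below shows an RND-style intrinsic reward; the linear predictor, feature dimensions, and learning rate are hypothetical simplifications rather than the architecture of [11].

```python
# Minimal sketch of a Random-Network-Distillation-style intrinsic reward
# (hypothetical linear networks; not the cited implementation).
import numpy as np

state_dim, feat_dim = 4, 16
rng = np.random.default_rng(1)

W_target = rng.normal(size=(state_dim, feat_dim))   # fixed random "target" network
W_pred = np.zeros((state_dim, feat_dim))            # trained predictor network

def intrinsic_reward(s, lr=1e-2):
    """Prediction error against the frozen random target: novel states are
    poorly predicted and therefore receive a high exploration bonus."""
    global W_pred
    target, pred = s @ W_target, s @ W_pred
    err = pred - target
    W_pred -= lr * np.outer(s, err)                  # gradient step on squared error
    return float(np.mean(err ** 2))

# Usage: add intrinsic_reward(state) to the task reward (or use it alone)
# when training a task-agnostic exploration policy.
```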
Bayesian Experimental Design: BOED, BO, BAX, and BARL
There is a large literature on Bayesian optimal experiment design (BOED) [12] which focuses on efficiently querying a process or function to get maximal information about some quantity of interest. When the quantity of interest is the location of a function optimum, related strategies have been proposed as the entropy search family of Bayesian optimization (BO) algorithms [29, 30]. Recently, a flexible framework known as Bayesian algorithm execution (BAX) [43] has been proposed to efficiently estimate properties of expensive black-box functions; this framework gives a general procedure for sampling points which are informative about the future execution of a given algorithm that computes the property of interest, thereby allowing the function property to be estimated with far less data.
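Schematically, and in generic notation rather than the exact formulation of [43], such acquisition functions score a candidate query $x$ by the expected information gain about the execution $e_{\mathcal{A}}$ of the algorithm $\mathcal{A}$ on $f$ given the data $\mathcal{D}_t$ collected so far:
\[
\mathrm{EIG}_t(x) \;=\; \mathbb{H}\!\left[\, y_x \mid \mathcal{D}_t \,\right] \;-\; \mathbb{E}_{e_{\mathcal{A}} \sim p(e_{\mathcal{A}} \mid \mathcal{D}_t)}\!\left[\, \mathbb{H}\!\left[\, y_x \mid \mathcal{D}_t, e_{\mathcal{A}} \,\right] \,\right],
\]
where $y_x$ denotes the (noisy) observation of $f$ at $x$ and $\mathbb{H}$ denotes entropy; in practice the expectation is typically approximated by running $\mathcal{A}$ on posterior function samples.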
A subsequent related work [40], known as Bayesian Active Reinforcement Learning (BARL), uses ideas from BOED and BAX to sample points that are maximally informative about the optimal trajectory in an MDP. However, BARL relies on a setting the authors call Transition Query Reinforcement Learning (TQRL), which assumes that the environment dynamics can be iteratively queried at an arbitrary sequence of state-action pairs chosen by the agent. TQRL is thus a highly restrictive setting which is not suitable when data can only be accessed via a trajectory (rollout) of the environment dynamics; it typically relies on an accurate environment simulator of sufficient expense to warrant its use. Even then, there will likely be differences between simulators and ground-truth dynamics for complex systems. Thus, one would ideally like to collect data in real environments. However, this often requires leaving the TQRL setting and instead collecting data via trajectories only.
In this paper, we aim to apply the information-theoretic ideas from BARL but generalize them to the standard MDP setting, as well as to learn open-loop model-based controllers. The typical method for learning to solve open-loop control problems was demonstrated successfully in Tesch et al. [65], where a value function mapping action sequences to task success was learned. Our method takes a model-based approach to this problem, using exploration strategies similar to those of Bayesian optimization but benefiting from the more substantial supervision that is typical in dynamics model learning.
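For contrast with the model-based approach we take, the sketch below shows the model-free alternative: Bayesian optimization directly over fixed-length action sequences. The toy rollout objective, horizon, and UCB acquisition are illustrative assumptions, not the procedure of Tesch et al. [65] or the method proposed in this paper.

```python
# Minimal sketch of open-loop control as Bayesian optimization over
# action sequences (hypothetical black-box rollout objective).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

horizon, action_dim = 5, 1
rng = np.random.default_rng(2)

def episode_return(action_seq):
    """Hypothetical rollout: returns a scalar task score for an action sequence."""
    return -float(np.sum((action_seq - 0.3) ** 2)) + 0.01 * rng.normal()

X = rng.uniform(-1, 1, size=(3, horizon * action_dim))       # initial sequences
y = np.array([episode_return(x) for x in X])

for _ in range(20):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)  # surrogate "value function"
    candidates = rng.uniform(-1, 1, size=(256, horizon * action_dim))
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(mu + 2.0 * sigma)]           # UCB acquisition
    X = np.vstack([X, x_next])
    y = np.append(y, episode_return(x_next))

best_sequence = X[np.argmax(y)].reshape(horizon, action_dim)
```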
3 Problem Setting
In this work we deal with finite-horizon discrete-time Markov decision processes (MDPs), which consist of a sextuple $\langle \mathcal{S}, \mathcal{A}, T, r, p_0, H \rangle$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $T : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ is the transition function (using the convention that $\mathcal{P}(X)$ is the set of probability measures over $X$), $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is a reward function, $p_0(s)$ is a distribution over $\mathcal{S}$ of start