Exploration via Planning for Information about the
Optimal Trajectory
Viraj Mehta1, Ian Char2, Joseph Abbate4, Rory Conlin4, Mark D. Boyer3, Stefano Ermon5,
Jeff Schneider1, Willie Neiswanger5
1Robotics Institute, 2Machine Learning Department, Carnegie Mellon University
3Princeton Plasma Physics Laboratory, 4Princeton University
5Computer Science Department, Stanford University
{virajm,ichar,schneide}@cs.cmu.edu, {jabbate,wconlin}@princeton.edu,
mboyer@pppl.gov, {ermon,neiswanger}@cs.stanford.edu
Abstract
Many potential applications of reinforcement learning (RL) are stymied by the large
numbers of samples required to learn an effective policy. This is especially true
when applying RL to real-world control tasks, e.g. in the sciences or robotics, where
executing a policy in the environment is costly. In popular RL algorithms, agents
typically explore either by adding stochasticity to a reward-maximizing policy or
by attempting to gather maximal information about environment dynamics without
taking the given task into account. In this work, we develop a method that allows us
to plan for exploration while taking both the task and the current knowledge about
the dynamics into account. The key insight to our approach is to plan an action
sequence that maximizes the expected information gain about the optimal trajectory
for the task at hand. We demonstrate that our method learns strong policies with
2x fewer samples than strong exploration baselines and 200x fewer samples than
model-free methods on a diverse set of low-to-medium dimensional control tasks
in both the open-loop and closed-loop control settings.1
1 Introduction
The potential of reinforcement learning (RL) as a general-purpose method of learning solutions to
sequential decision making problems is difficult to overstate. Ideally, RL could allow for agents
that learn to accomplish all manner of tasks solely through a given reward function and the agent’s
experience; however, RL has so far broadly fallen short of this. Chief among the difficulties in
realizing this potential is the fact that typical RL methods in continuous problems require very large
numbers of samples to achieve a near-optimal policy. For many interesting applications of RL this
problem is exacerbated by the fact that collecting samples from the true environment can incur huge
costs. For example, expensive scientific experiments are needed for tokamak control [31, 13, 20] and design of molecules [70], and collecting experience on the road is both costly and runs the risk of car accidents for autonomous vehicles [66].
Though model-based methods like Deisenroth and Rasmussen [21], Chua et al. [15], and Curi et al. [16] are much more efficient than typical model-free methods, they do not explicitly reason about the prospective task-relevant information content of future observations. Moreover, these methods typically require thousands of timesteps of environment dynamics data to solve reasonably simple Markov decision processes (MDPs). Though progress has been made in using Thompson sampling [46], entropy bonuses [27], and upper confidence bounds (UCB) [16, 5] to more intelligently explore
1 Code is available at: https://github.com/fusion-ml/trajectory-information-rl
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04642v1 [cs.LG] 6 Oct 2022
[Figure 1: schematic with panels (a)–(e) — posterior samples of the optimal trajectory, their pooled trajectory (joint) information gain, and the resulting trajectory information planning step.]
Figure 1: A schematic depiction of Trajectory Information Planning (TIP). Suppose the agent in (a) aims to determine where to explore next from its current state. To do so, in (b) the agent samples dynamics models $T' \sim P(T \mid D)$ from its current posterior and finds approximately optimal trajectories $\tau^* \sim P(\tau \mid T')$ for each sample. Then in (c) it pools these samples of posterior optimal trajectories $\tau^*$. In (d) it constructs a function that gives the joint expected information gain about the optimal trajectory $\tau^*$ given a planned exploration trajectory (i.e. $\mathrm{EIG}_{\tau^*}$ over the set of points visited). Finally, in (e) the agent can plan an action sequence which maximizes this joint expected information gain.
the state-action space of an MDP, these methods still do not explicitly reason about how information
that the agent gathers will affect the estimated MDP solution. Furthermore, in continuous state-
action settings these methods must make coarse approximations to be computationally tractable (e.g.
bootstrapped Q networks to approximate a posterior or one-step perturbations for approximate UCB).
Many methods in the vein of Pathak et al. [47] and Shyam et al. [61] explore directly to gather new information based on current uncertainty about environment dynamics, but they do not in general specialize the information that they aim to acquire for a particular task specified by an initial state distribution and reward function. We believe that an ideal exploration strategy for sample efficiency should take into account this task, as well as uncertainty about the environment dynamics.
In this work, we start by showing how many methods can be cast as a Bayesian planning problem
over a specific cost function. This framework helps illuminate the importance of the cost function
and what impact it has on exploration. Viewed in this light, it becomes clear that many previous
state-of-the-art methods rely on cost functions that either result in behavior that is too greedy—i.e. the
policy tries to maximize returns during exploration—or too exploratory, i.e. the policy is incentivized
to explore the environment dynamics and does not consider the task at hand. We therefore present a
cost function that balances out these two extremes, by generalizing an information-gain acquisition function introduced in Mehta et al. [40] to apply to a set of future hypothetically acquired data. In
particular, our cost function captures the amount of information that would be gained about the
optimal trajectory, if the agent were to explore by following a particular planned trajectory. As
depicted in Figure 1, this involves the agent sampling what it would hypothetically do given different
realizations of the dynamics, and then planning actions that are informative about those possibilities.
In summary, the contributions of this work are as follows: we develop a novel cost function for
exploration that explicitly accounts for both the specific task and uncertainty about environment
dynamics, a method for planning which applies the cost function to explore in MDPs with continuous
states and actions, and a thorough empirical evaluation of our method across 5 closed-loop and 3
open-loop environments (with a focus on expensive RL tasks in plasma physics) compared against 14
baselines. We find that our proposed method is able to learn policies that perform as well as an agent
with access to the ground truth dynamics using half or fewer samples than comparison methods.
2 Related Work
Exploration in Reinforcement Learning
The most common strategy for exploration in RL is to execute a greedy policy with some form of added stochasticity. The simplest approach, $\epsilon$-greedy exploration as used in Mnih et al. [41], takes the current action thought to be best with probability $1 - \epsilon$ and a random action with probability $\epsilon$. Other methods add Ornstein-Uhlenbeck action noise [37] to the greedy policy, or entropy bonuses [27] to the policy or value function objectives, adding noise to a policy which is otherwise optimizing the RL objective.
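For concreteness, a minimal sketch of this simplest strategy (a generic illustration, not code from any of the cited works): choose a uniformly random action with probability $\epsilon$ and the current greedy action otherwise.

```python
import numpy as np

def epsilon_greedy_action(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Pick a uniform random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit the current best estimate
```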
Tabular RL is often solved by choosing actions based on upper confidence bounds on the value function [14, 36], but explicitly computing and optimizing these bounds in the continuous setting is substantially more challenging. Recent work [16] approximates this method by computing one-step confidence bounds on the dynamics and training a ‘hallucinated’ policy which chooses perturbations within these bounds to maximize expected policy performance. Another recent work [5] uses anti-concentration inequalities to approximate upper confidence bounds in MDPs with discrete actions.
Thompson sampling (TS) [55], which samples a realization of the MDP from the posterior and acts optimally as if the realization were the true model, can be applied for exploration in a model-free manner as in [45] or in a model-based manner as in [63]. As the posterior over MDP dynamics or value functions can be high-dimensional and difficult to represent, the performance of TS can be hindered by approximation errors using both Gaussian processes and ensembles of neural networks. Curi et al. [16] recently investigated this and found that it was potentially due to an insufficiently expressive posterior over entire transition functions, implying that it may be quite difficult to solve tasks using sampled models. Similarly, the posterior over action-value functions in Osband et al. [45] is only roughly approximated by training a bootstrapped ensemble of neural networks.
There is also a rich literature of Bayesian methods for exploration, which are typically computationally expensive and hard to use, though they have attractive theoretical properties. These methods build upon the fundamental idea of the Bayes-adaptive MDP [53], which we detail in Section E.1 alongside a discussion of this literature.
Additionally, a broad set of methods explore to learn about the environment without addressing a specified task. This line of work is characterized by Pathak et al. [47], which synthesizes a task-agnostic reward function from model errors. Other techniques include MAX [61], which optimizes the information gain about the environment dynamics; Random Network Distillation [11], which forces the agent to learn about a random neural network across the state space; and Plan2Explore [60], which prospectively plans to find areas of novelty where the dynamics are uncertain.
Bayesian Experimental Design: BOED, BO, BAX, and BARL
There is a large literature on Bayesian optimal experiment design (BOED) [12] which focuses on efficiently querying a process or function to get maximal information about some quantity of interest. When the quantity of interest is the location of a function optimum, related strategies have been proposed as the entropy search family of Bayesian optimization (BO) algorithms [29, 30]. Recently, a flexible framework known as Bayesian algorithm execution (BAX) [43] has been proposed to efficiently estimate properties of expensive black-box functions; this framework gives a general procedure for sampling points which are informative about the future execution of a given algorithm that computes the property of interest, thereby allowing the function property to be estimated with far less data.
A subsequent related work [40], known as Bayesian Active Reinforcement Learning (BARL), uses ideas from BOED and BAX to sample points that are maximally informative about the optimal trajectory in an MDP. However, BARL relies on a setting the authors call Transition Query Reinforcement Learning (TQRL), which assumes that the environment dynamics can be iteratively queried at an arbitrary sequence of state-action pairs chosen by the agent. TQRL is thus a highly restrictive setting which is not suitable when data can only be accessed via a trajectory (rollout) of environment dynamics; it typically relies on an accurate environment simulator of sufficient expense to warrant its use. Even then, there will likely be differences between simulators and ground truth dynamics for complex systems. Thus, one would ideally like to collect data in real environments. However, this often requires leaving the TQRL setting and instead collecting data via trajectories only.
In this paper, we aim to apply the information-theoretic ideas from BARL but generalize them to the general MDP setting, as well as to learn open-loop model-based controllers. The typical method for learning to solve open-loop control problems was demonstrated successfully in Tesch et al. [65], where a value function was learned mapping action sequences to task success. Our method takes a model-based approach to this problem, using exploration strategies similar to those of Bayesian optimization but benefiting from the more substantial supervision that is typical in dynamics model learning.
3 Problem Setting
In this work we deal with finite-horizon discrete-time Markov decision processes (MDPs) which consist of a sextuple $\langle S, A, T, r, p_0, H \rangle$ where $S$ is the state space, $A$ is the action space, $T$ is the transition function $T: S \times A \to P(S)$ (using the convention that $P(X)$ is the set of probability measures over $X$), $r: S \times A \times S \to \mathbb{R}$ is a reward function, $p_0(s)$ is a distribution over $S$ of start states, and $H \in \mathbb{N}$ is the horizon (i.e. the length of an episode). We always assume $S$, $A$, $p_0$, and $H$ are known. We also assume the reward $r$ is known, though our development of the method can easily be generalized to the case where $r$ is unknown. Our primary object of interest is the transition function $T$, which we learn from data. We address both open- and closed-loop control settings. In the more common closed-loop setting, our aim is to find a policy $\pi: S \to A$ that maximizes Objective (1) below. We will denote trajectories as $\tau \sim p(\tau \mid \pi, T)$ where $\tau = [(s_0, a_0), \ldots, (s_{H-1}, a_{H-1}), s_H]$ is generated by $s_0 \sim p_0$, $a_i = \pi(s_i)$, and $s_{i+1} \sim T(s_i, a_i)$. We can write the return of a trajectory as $R(\tau) = \sum_{i=0}^{H-1} r(s_i, a_i, s_{i+1})$ for the states and actions $s_i, a_i$ that make up $\tau$. The MDP objective can then be written as

$$J_T(\pi) = \mathbb{E}_{\tau \sim p(\tau \mid \pi, T)}[R(\tau)]. \quad (1)$$

We aim to maximize this objective while minimizing the number of samples from the ground truth transition function $T$ that are required to reach good performance. We denote the optimal policy as $\pi^* = \arg\max_\pi J_T(\pi)$, which we can assume to be deterministic [64] but not necessarily unique. We use $\tau^*$ to denote optimal trajectories, i.e. $\tau^* \sim p(\tau \mid \pi^*, T)$.
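To make the notation concrete, the following sketch samples a trajectory and computes its return $R(\tau)$; here `policy`, `transition_fn`, and `reward_fn` are placeholder callables standing in for $\pi$, $T$, and $r$, not part of the paper's code.

```python
def rollout_return(policy, transition_fn, reward_fn, s0, horizon):
    """Sample tau = [(s_0, a_0), ..., (s_{H-1}, a_{H-1}), s_H] and compute R(tau)."""
    s, ret = s0, 0.0
    for _ in range(horizon):
        a = policy(s)                   # a_i = pi(s_i)
        s_next = transition_fn(s, a)    # s_{i+1} ~ T(s_i, a_i)
        ret += reward_fn(s, a, s_next)  # accumulate r(s_i, a_i, s_{i+1})
        s = s_next
    return ret
```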
Similarly, for the open-loop setting, we assume a fixed start state $s_0$ and aim to find an action sequence $a_0, \ldots, a_{H-1}$ that maximizes the sum of rewards in an episode. We will slightly abuse notation and write $\tau \sim p(\tau \mid a_{0:H-1})$ and $J_T(a_{0:H-1})$ with these actions fixed in place of a reactive policy, and again use $\tau^*$ to refer to the trajectories generated by an optimal action sequence.
We assume in this work that applying planning algorithms like [51] to a dynamics function $T$ will result in a trajectory that approximates $\tau^*$. We will primarily focus on a Gaussian process (GP) model of the transition dynamics in order to take advantage of its expressive representation of uncertainty and grounded methods for sampling, conditioning, and joint inference. There is substantial prior work using GPs in RL—see Section E.2 for a discussion of this literature. Under this modeling choice, we assume that the dynamics are drawn from a GP prior $P(T)$ (see Section A.3 for further details on our GP model) and use $P(T \mid D)$ for the posterior over transition functions given a dynamics dataset of triples $D = \{(s_i, a_i, s'_i)\}$. In this work, unions $D \cup \tau$ or $D \cup S'$ between the dataset $D$ and trajectories $\tau$ or next-state predictions $S'$ coerce $\tau$ and $S'$ into triples of dynamics data prior to the union with the dataset.
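As an illustration of this modeling choice (not the paper's implementation), one could fit an independent GP per next-state dimension on $(s, a)$ inputs with scikit-learn and draw posterior samples of next states; the kernel and hyperparameters below are arbitrary assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_dynamics_gps(states, actions, next_states):
    """Fit an approximate posterior P(T | D) from triples D = {(s_i, a_i, s'_i)}:
    one independent GP per next-state dimension, with inputs x_i = (s_i, a_i)."""
    X = np.concatenate([states, actions], axis=1)
    gps = []
    for d in range(next_states.shape[1]):
        kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(X, next_states[:, d])
        gps.append(gp)
    return gps

def sample_next_state(gps, state, action, seed=0):
    """Draw one posterior sample of T(s, a) at a single (s, a) query.
    (A full posterior *function* sample requires joint conditioning across queries,
    which the GP machinery described above supports.)"""
    x = np.concatenate([state, action])[None, :]
    return np.array([gp.sample_y(x, n_samples=1, random_state=seed)[0, 0] for gp in gps])
```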
4 Trajectory Information Planning
Our method consists of a generic framework for Bayesian (or approximately Bayesian) model-
predictive control and a novel cost function for planning that allows us to explicitly plan to find
the maximal amount of new information relevant to our task. In Section 4.1, we describe the MPC
framework and highlight that many prior methods approximate this framework while using a greedy
cost function that corresponds to the future negative expected rewards or a pure exploration cost
function that corresponds to future information about the dynamics. Afterwards, in Section 4.2, we
derive our new cost function and describe how it is computed. The overall method we introduce
simply applies this planning framework with our new cost function.
4.1 Model-Predictive Control in Bayesian Model-Based RL
In this section, we give a formulation of Bayesian planning for control that generalizes ideas from methods such as PILCO [21] and PETS [15]. This formulation highlights these methods' inherently greedy nature and hints at a possible solution. The objective of Bayesian planning is to find the $h$-step action sequence that maximizes the expected future returns under model uncertainty. That is,

$$\operatorname*{arg\,min}_{a_0, \ldots, a_{h-1}} \; \mathbb{E}_{T' \sim P(T \mid D),\; \tilde{\tau} \sim P(\tau \mid s_0 = s,\, a_{0:h-1},\, T')}\left[C(\tilde{\tau})\right] \quad (2)$$

for some cost function $C$ over trajectories and some start state $s$. If operating in the open-loop control setting, the agent executes the sequence of actions found without replanning. This procedure can also be extended to closed-loop control via model-predictive control (MPC), which involves re-planning (2) at every state the agent visits and playing the first action from the optimal sequence. Concretely, the MPC policy for our Bayesian setting is as follows:

$$\pi_{\mathrm{MPC}}(s) = \operatorname*{arg\,min}_{a_0} \; \min_{a_1, \ldots, a_{h-1}} \; \mathbb{E}_{T' \sim P(T \mid D),\; \tilde{\tau} \sim P(\tau \mid s_0 = s,\, a_{0:h-1},\, T')}\left[C(\tilde{\tau})\right] \quad (3)$$
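A minimal sketch of the inner quantity in (2) and (3): score a candidate action sequence by rolling it out under several sampled posterior dynamics models and averaging the trajectory cost. The names here (entries of `posterior_dynamics_samples` mapping (s, a) to a next state, and `cost_fn`) are illustrative stand-ins rather than the paper's API.

```python
import numpy as np

def expected_cost(action_seq, start_state, posterior_dynamics_samples, cost_fn):
    """Monte Carlo estimate of E_{T' ~ P(T|D)} E_{tau ~ P(tau | s_0, a_{0:h-1}, T')}[C(tau)]."""
    costs = []
    for dyn in posterior_dynamics_samples:   # each dyn maps (s, a) -> next state
        s, traj = start_state, []
        for a in action_seq:
            s_next = dyn(s, a)
            traj.append((s, a, s_next))
            s = s_next
        costs.append(cost_fn(traj))
    return float(np.mean(costs))
```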
Whether we do open-loop control or closed-loop control via MPC, the cost function $C$ is integral to how the agent will behave. Prior work has predominantly focused on two types of cost function:

$$\underbrace{C_g(\tau) = -R(\tau)}_{\text{Greedy Exploration}} \qquad\qquad \underbrace{C_e(\tau) = -\sum_{i=0}^{h} \mathrm{H}\left[T(s_i, a_i) \mid D\right]}_{\text{Task-Agnostic Exploration}} \quad (4)$$
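In code, for a trajectory represented as a list of (s, a, s') triples as above, the two costs in (4) could look as follows; `reward_fn` and `predictive_entropy` (returning the model's entropy $\mathrm{H}[T(s, a) \mid D]$ at a state-action pair) are assumed helpers, not the paper's interface.

```python
def greedy_cost(traj, reward_fn):
    """C_g(tau) = -R(tau): minimize negative return, i.e. exploit predicted reward."""
    return -sum(reward_fn(s, a, s_next) for (s, a, s_next) in traj)

def task_agnostic_cost(traj, predictive_entropy):
    """C_e(tau) = -sum_i H[T(s_i, a_i) | D]: seek transitions where the model is most uncertain."""
    return -sum(predictive_entropy(s, a) for (s, a, s_next) in traj)
```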
Previous works such as Kamthe and Deisenroth [34] and PETS [15] use the greedy exploration cost function, $C_g$. This cost function incentivizes trajectories that achieve high rewards over the next $h$ transitions on average. In works that focus on task-agnostic exploration such as Sekar et al. [60] and Shyam et al. [61], the cost function $C_e$ (or similar) is used to encourage the agent to find areas of the state space in which the model is maximally uncertain. Note that we use $\pi_g$ to refer to the greedy policy given by using (3) with $C_g$.
The optimization problem in (3) is typically approximately solved in one of three ways: Deisenroth and Rasmussen [21] and Curi et al. [16] directly backpropagate through the estimated dynamics and reward functions to find policy parameters that would generate good actions; Janner et al. [32] use an actor-critic method trained via rollouts in the model alongside the data collected to find a policy; and Chua et al. [15] and Mehta et al. [40] use the cross-entropy method [17] to find action sequences which directly maximize the reward over the estimated dynamics. In this work, we use a version of the last method given in Pinneri et al. [51], denoted iCEM, to directly find action sequences that optimize the cost function being used. We approximate the expectation by playing the actions on multiple samples from the posterior $P(T \mid D)$. Algorithm 1 gives a formal description of the method and Section A.5 provides further details.
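For reference, here is a bare-bones cross-entropy-method planner in the spirit of, though much simpler than, the iCEM optimizer of Pinneri et al. [51] (which additionally uses colored-noise sampling, elite reuse, and decaying population sizes); Algorithm 1 below assumes such an optimizer. All hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def cem_plan(objective, horizon, action_dim, action_low, action_high,
             iterations=5, population=200, elite_frac=0.1, rng=None):
    """Approximately minimize objective(action_seq) over sequences of shape (horizon, action_dim)."""
    rng = rng or np.random.default_rng()
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    n_elite = max(1, int(elite_frac * population))
    for _ in range(iterations):
        samples = rng.normal(mean, std, size=(population, horizon, action_dim))
        samples = np.clip(samples, action_low, action_high)
        scores = np.array([objective(seq) for seq in samples])
        elites = samples[np.argsort(scores)[:n_elite]]     # lowest cost = best
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean                                            # refined mean action sequence
```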
Algorithm 1 Bayesian Model-Predictive Control with Cost Function $C$
Inputs: transition function episode query budget $b$, number of posterior function samples $k$, planning horizon $h$.
Initialize $D \leftarrow \emptyset$.
for $i \in [1, \ldots, b]$ do
    Sample start state $s_0 \sim p_0$.
    for $t \in [0, \ldots, H - 1]$ do
        Sample posterior functions $\{T'_\ell\}_{\ell=1}^{k} \sim P(T' \mid D)$.
        Approximately find $\arg\min_{a_0, \ldots, a_{h-1}} \sum_{\ell=1}^{k} \mathbb{E}_{\tau_\ell \sim p(\tau \mid T'_\ell, a_0, \ldots, a_{h-1})}[C(\tau_\ell)]$ via iCEM.
        Execute action $a_0$ by sampling $s_{t+1} \sim T(s_t, a_0)$.
        Update dataset $D \leftarrow D \cup \{(s_t, a_0, s_{t+1})\}$.
    end for
end for
return $\pi_g$ for the posterior $P(T' \mid D)$.
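A Python rendering of the loop in Algorithm 1, written against hypothetical helpers (`fit_model`, `sample_posterior_fns`, a `planner` such as the CEM sketch above, `cost_fn`, and an `env` interface whose `step` returns only the next state); it is illustrative pseudocode rather than the released implementation, and in practice the dataset would typically be seeded with a few initial transitions before the first model fit.

```python
def rollout(dyn, state, action_seq):
    """Simulate a trajectory under a sampled dynamics function dyn: (s, a) -> s'."""
    traj = []
    for a in action_seq:
        next_state = dyn(state, a)
        traj.append((state, a, next_state))
        state = next_state
    return traj

def bayesian_mpc(env, fit_model, sample_posterior_fns, planner, cost_fn,
                 episode_budget, n_posterior_samples, plan_horizon):
    """Algorithm 1 sketch: replan at every step against k posterior dynamics samples, then act."""
    dataset = []                                              # D, a list of (s, a, s') triples
    for _ in range(episode_budget):                           # b episodes of real-environment data
        state = env.reset()                                   # s_0 ~ p_0
        for _ in range(env.horizon):                          # t = 0, ..., H-1
            model = fit_model(dataset)                        # posterior P(T | D)
            dyn_samples = sample_posterior_fns(model, n_posterior_samples)
            objective = lambda a_seq: sum(
                cost_fn(rollout(dyn, state, a_seq)) for dyn in dyn_samples
            ) / len(dyn_samples)
            action_seq = planner(objective, plan_horizon)     # approximate argmin via (i)CEM
            action = action_seq[0]                            # execute only the first action
            next_state = env.step(action)                     # s_{t+1} ~ T(s_t, a_0)
            dataset.append((state, action, next_state))       # D <- D u {(s_t, a_0, s_{t+1})}
            state = next_state
    return model    # final posterior; pi_g then plans greedily against it
```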
4.2 A Task-Specific Cost Function based on Trajectory Information
In this work, we aim to explore by choosing actions that maximize the conditional expected information gain (EIG) about the optimal trajectory $\tau^*$. This is the same overall goal as that of Mehta et al. [40], where the $\mathrm{EIG}_{\tau^*}$ acquisition function was introduced for this purpose. However, in this paper we generalize this acquisition function in order to allow for sequential information collection, and account for the redundant information that could be collected between timesteps. As discussed at length in Osband et al. [46], it is essential to reason about how an action taken at the current timestep will affect the possibility of learning something useful in future timesteps. In other words, exploration must be deep and not greedy. Explicit examples are given in Osband et al. [46] where the time to find an $\epsilon$-optimal policy in a tabular MDP is exponential in the state size unless exploration can be coordinated over large numbers of timesteps rather than being conducted independently at each action. As the $\mathrm{EIG}_{\tau^*}$ acquisition function is only defined over a single state-action pair and mutual information is submodular, we cannot naively use the acquisition function as is (or sum it over many datapoints) to choose actions that lead to good long-term exploration. This is clear in e.g. navigation tasks, where the nearby points visited over trajectories will provide redundant information about the local environment.
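To illustrate the kind of joint quantity this motivates (a sketch under our own framing, not the paper's exact definition, which is given in the cost function introduced in the following paragraph and continues beyond this excerpt): the expected information gain about $\tau^*$ from jointly observing a set of query points can be estimated with Gaussian entropies, conditioning the GP on posterior samples of the optimal trajectory. Here `joint_predictive_cov` is an assumed helper returning the model's joint predictive covariance at a set of (s, a) inputs, optionally conditioned on extra data.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of N(mu, cov): 0.5 * log det(2 * pi * e * cov)."""
    dim = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (dim * np.log(2.0 * np.pi * np.e) + logdet)

def joint_eig_about_optimal_trajectory(query_points, model, optimal_traj_samples,
                                       joint_predictive_cov):
    """EIG(X) = H[T(X) | D] - E_{tau* ~ P(tau* | D)} H[T(X) | D u tau*], estimated by Monte Carlo."""
    prior_entropy = gaussian_entropy(joint_predictive_cov(model, query_points))
    conditioned = [
        gaussian_entropy(joint_predictive_cov(model, query_points, extra_data=tau_star))
        for tau_star in optimal_traj_samples    # posterior samples of the optimal trajectory
    ]
    return prior_entropy - float(np.mean(conditioned))
```

Since the planner in (3) minimizes a cost, the corresponding exploration cost would be the negative of such a quantity, so that minimizing cost maximizes the joint information gained about $\tau^*$.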
We therefore give a cost function that generalizes $\mathrm{EIG}_{\tau^*}$ by taking a set of points to query and computing the joint expected information gain from observing the set. Our cost function is non-