Tabular RL is often solved by choosing actions based on upper confidence bounds on the value function [14, 36], but explicitly computing and optimizing these bounds in the continuous setting is substantially more challenging. Recent work [16] approximates this method by computing one-step confidence bounds on the dynamics and training a 'hallucinated' policy which chooses perturbations within these bounds to maximize expected policy performance. Another recent work [5] uses anti-concentration inequalities to approximate upper confidence bounds in MDPs with discrete actions.
Thompson sampling (TS) [55], which samples a realization of the MDP from the posterior and acts optimally as if that realization were the true model, can be applied for exploration in a model-free manner as in [45] or in a model-based manner as in [63]. As the posterior over MDP dynamics or value functions can be high-dimensional and difficult to represent, the performance of TS can be hindered by approximation errors when using both Gaussian processes and ensembles of neural networks. Curi et al. [16] recently investigated this and found that it was potentially due to an insufficiently expressive posterior over entire transition functions, implying that it may be quite difficult to solve tasks using sampled models. Similarly, the posterior over action-value functions in Osband et al. [45] is only roughly approximated by training a bootstrapped ensemble of neural networks.
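To make this mechanism concrete, the minimal sketch below illustrates posterior-sampling exploration with a bootstrapped ensemble standing in for the posterior; the chain environment, tabular Q-functions, ensemble size, and learning rate are illustrative assumptions and not the constructions used in [45] or [63].

```python
# Minimal sketch of Thompson-sampling exploration with an ensemble "posterior"
# over Q-functions (hypothetical toy setup; bootstrap masks omitted).
import numpy as np

n_states, n_actions, horizon, n_ensemble = 10, 2, 15, 5
rng = np.random.default_rng(0)

# Ensemble of tabular Q-functions approximates a posterior over value functions.
Q_ensemble = [rng.normal(scale=0.1, size=(n_states, n_actions))
              for _ in range(n_ensemble)]

def step(s, a):
    """Hypothetical chain environment: action 1 moves right, reward at the end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == n_states - 1)

for episode in range(200):
    Q = Q_ensemble[rng.integers(n_ensemble)]   # sample one posterior realization
    s = 0
    for t in range(horizon):
        a = int(np.argmax(Q[s]))               # act greedily under the sample
        s_next, r = step(s, a)
        # Update every ensemble member on the shared data.
        for Qk in Q_ensemble:
            target = r + np.max(Qk[s_next])
            Qk[s, a] += 0.5 * (target - Qk[s, a])
        s = s_next
```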
There is also a rich literature on Bayesian methods for exploration, which are typically computationally expensive and hard to use despite their attractive theoretical properties. These methods build upon the fundamental idea of the Bayes-adaptive MDP [53], which we detail in Section E.1 alongside a discussion of this literature.
Additionally, a broad set of methods explore to learn about the environment without addressing a specified task. This line of work is exemplified by Pathak et al. [47], who synthesize a task-agnostic reward function from model errors. Other techniques include MAX [61], which optimizes the information gain about the environment dynamics; Random Network Distillation [11], which forces the agent to learn about a random neural network across the state space; and Plan2Explore [60], which prospectively plans to find areas of novelty where the dynamics are uncertain.
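As an illustration of the second of these, the sketch below shows an RND-style intrinsic reward; the linear predictor, feature dimensions, and learning rate are hypothetical simplifications rather than the architecture of [11].

```python
# Minimal sketch of a Random-Network-Distillation-style intrinsic reward
# (hypothetical linear networks; not the cited implementation).
import numpy as np

state_dim, feat_dim = 4, 16
rng = np.random.default_rng(1)

W_target = rng.normal(size=(state_dim, feat_dim))   # fixed random "target" network
W_pred = np.zeros((state_dim, feat_dim))            # trained predictor network

def intrinsic_reward(s, lr=1e-2):
    """Prediction error against the frozen random target: novel states are
    poorly predicted and therefore receive a high exploration bonus."""
    global W_pred
    target, pred = s @ W_target, s @ W_pred
    err = pred - target
    W_pred -= lr * np.outer(s, err)                  # gradient step on squared error
    return float(np.mean(err ** 2))

# Usage: add intrinsic_reward(state) to the task reward (or use it alone)
# when training a task-agnostic exploration policy.
```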
Bayesian Experimental Design: BOED, BO, BAX, and BARL
There is a large literature on Bayesian optimal experiment design (BOED) [12] which focuses on efficiently querying a process or function to get maximal information about some quantity of interest. When the quantity of interest is the location of a function optimum, related strategies have been proposed as the entropy search family of Bayesian optimization (BO) algorithms [29, 30]. Recently, a flexible framework known as Bayesian algorithm execution (BAX) [43] has been proposed to efficiently estimate properties of expensive black-box functions; this framework gives a general procedure for sampling points which are informative about the future execution of a given algorithm that computes the property of interest, thereby allowing the function property to be estimated with far less data.
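Schematically, and in generic notation rather than the exact formulation of [43], such acquisition functions score a candidate query $x$ by the expected information gain about the execution $e_{\mathcal{A}}$ of the algorithm $\mathcal{A}$ on $f$ given the data $\mathcal{D}_t$ collected so far:
\[
\mathrm{EIG}_t(x) \;=\; \mathbb{H}\!\left[\, y_x \mid \mathcal{D}_t \,\right] \;-\; \mathbb{E}_{e_{\mathcal{A}} \sim p(e_{\mathcal{A}} \mid \mathcal{D}_t)}\!\left[\, \mathbb{H}\!\left[\, y_x \mid \mathcal{D}_t, e_{\mathcal{A}} \,\right] \,\right],
\]
where $y_x$ denotes the (noisy) observation of $f$ at $x$ and $\mathbb{H}$ denotes entropy; in practice the expectation is typically approximated by running $\mathcal{A}$ on posterior function samples.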
A subsequent related work [40], known as Bayesian Active Reinforcement Learning (BARL), uses ideas from BOED and BAX to sample points that are maximally informative about the optimal trajectory in an MDP. However, BARL relies on a setting the authors call Transition Query Reinforcement Learning (TQRL), which assumes that the environment dynamics can be iteratively queried at an arbitrary sequence of state-action pairs chosen by the agent. TQRL is thus a highly restrictive setting which is not suitable when data can only be accessed via a trajectory (rollout) of the environment dynamics; it typically relies on an accurate environment simulator of sufficient expense to warrant its use. Even then, there will likely be differences between simulators and ground-truth dynamics for complex systems. Thus, one would ideally like to collect data in real environments. However, this often requires leaving the TQRL setting and instead collecting data via trajectories only.
In this paper, we aim to apply the information-theoretic ideas from BARL but generalize them to the standard MDP setting, as well as to learn open-loop model-based controllers. The typical method for learning to solve open-loop control problems was demonstrated successfully in Tesch et al. [65], where a value function mapping action sequences to task success was learned. Our method takes a model-based approach to this problem, using exploration strategies similar to those of Bayesian optimization but benefiting from the more substantial supervision that is typical in dynamics model learning.
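For contrast with the model-based approach we take, the sketch below shows the model-free alternative: Bayesian optimization directly over fixed-length action sequences. The toy rollout objective, horizon, and UCB acquisition are illustrative assumptions, not the procedure of Tesch et al. [65] or the method proposed in this paper.

```python
# Minimal sketch of open-loop control as Bayesian optimization over
# action sequences (hypothetical black-box rollout objective).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

horizon, action_dim = 5, 1
rng = np.random.default_rng(2)

def episode_return(action_seq):
    """Hypothetical rollout: returns a scalar task score for an action sequence."""
    return -float(np.sum((action_seq - 0.3) ** 2)) + 0.01 * rng.normal()

X = rng.uniform(-1, 1, size=(3, horizon * action_dim))       # initial sequences
y = np.array([episode_return(x) for x in X])

for _ in range(20):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)  # surrogate "value function"
    candidates = rng.uniform(-1, 1, size=(256, horizon * action_dim))
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(mu + 2.0 * sigma)]           # UCB acquisition
    X = np.vstack([X, x_next])
    y = np.append(y, episode_return(x_next))

best_sequence = X[np.argmax(y)].reshape(horizon, action_dim)
```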
3 Problem Setting
In this work we deal with finite-horizon discrete-time Markov decision processes (MDPs), which consist of a sextuple $\langle \mathcal{S}, \mathcal{A}, T, r, p_0, H \rangle$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $T : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ is the transition function (using the convention that $\mathcal{P}(X)$ is the set of probability measures over $X$), $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is a reward function, $p_0(s)$ is a distribution over $\mathcal{S}$ of start