D-Shape: Demonstration-Shaped Reinforcement Learning via
Goal Conditioning
Caroline Wang
The University of Texas at Austin
Austin, Texas, United States
caroline.l.wang@utexas.edu
Garrett Warnell
Army Research Laboratory and
The University of Texas at Austin
Austin, Texas, United States
garrett.a.warnell.civ@army.mil
Peter Stone
The University of Texas at Austin and
Sony AI
Austin, Texas, United States
pstone@cs.utexas.edu
ABSTRACT
While combining imitation learning (IL) and reinforcement learning (RL) is a promising way to address poor sample efficiency in autonomous behavior acquisition, methods that do so typically assume that the requisite behavior demonstrations are provided by an expert that behaves optimally with respect to a task reward. If, however, suboptimal demonstrations are provided, a fundamental challenge appears in that the demonstration-matching objective of IL conflicts with the return-maximization objective of RL. This paper introduces D-Shape, a new method for combining IL and RL that uses ideas from reward shaping and goal-conditioned RL to resolve the above conflict. D-Shape allows learning from suboptimal demonstrations while retaining the ability to find the optimal policy with respect to the task reward. We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy in the presence of suboptimal demonstrations.
KEYWORDS
reinforcement learning; goal-conditioned reinforcement learning;
imitation from observation; suboptimal demonstrations
ACM Reference Format:
Caroline Wang, Garrett Warnell, and Peter Stone. 2023. D-Shape: Demonstration-
Shaped Reinforcement Learning via Goal Conditioning. In Proc. of the 22nd
International Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2023), London, United Kingdom, May 29 – June 2, 2023, IFAAMAS,
13 pages.
1 INTRODUCTION AND BACKGROUND
A longstanding goal of artificial intelligence is enabling machines to learn new behaviors. Towards this goal, the research community has proposed both imitation learning (IL) and reinforcement learning (RL). In IL, the agent is given access to a set of state-action expert demonstrations, and its goal is either to mimic the expert's behavior, or to infer the expert's reward function and maximize the inferred reward. In RL, the agent is provided with a reward signal, and its goal is to maximize the long-term discounted reward. While RL algorithms can potentially learn optimal behavior with respect to the provided reward signal, in practice, they often suffer from high sample complexity in large state-action spaces, or spaces
with sparse reward signals. On the other hand, IL methods are typically more sample-efficient than RL methods, but require expert demonstration data.

It seems natural to consider using techniques from imitation learning and demonstration data to speed up reinforcement learning. However, many IL algorithms implicitly perform divergence minimization with respect to the provided demonstrations, with no notion of an extrinsic task reward [11]. When we have access to both demonstration data and an extrinsic task reward, we have the opportunity to combine IL and RL techniques, but must carefully consider whether the demonstrated behavior conflicts with the extrinsic task reward, especially when the demonstrated behavior is suboptimal. Moreover, standard IL algorithms are only valid in situations when demonstrations contain both state and action information, which is not always the case.

The community has recently made progress in the area of imitation from observation (IfO), which extends IL approaches to situations where demonstrator action information is unavailable, difficult to induce, or not appropriate for the task at hand. This last situation may occur if the demonstrator's transition dynamics differ from the learner's, for instance, when the expert is a human and the agent is a robot [20, 35]. While there has been some work on performing IL or IfO with demonstrations that are suboptimal with respect to an implicit task that the demonstrator seeks to accomplish [4, 5, 7, 38], to date, relatively little work has considered the problem of combining IfO and RL, where the learner's true task is explicitly specified by a reward function [32].
This paper introduces the D-Shape algorithm, which combines IfO and RL in situations with suboptimal demonstrations. D-Shape requires only a single, suboptimal, state-only expert demonstration, and treats demonstration states as goals. To ensure that the optimal policy with respect to the task reward is not altered, D-Shape uses potential-based reward shaping to define a goal-reaching reward. We show theoretically that D-Shape preserves optimal policies, and show empirically that D-Shape improves sample efficiency over related approaches with both optimal and suboptimal demonstrations.
2 PRELIMINARIES
This section introduces our notation and the technical concepts that are key to this work: reinforcement learning with state-only demonstrations, goal-conditioned reinforcement learning, and potential-based reward shaping.
Figure 1: D-Shape's interaction with the environment. The state $s_t$ and the task reward $r^{task}_t$ come from the environment. $s_t$ is concatenated with the demonstration state $s^e_{t+1}$ as the goal, and $r^{task}$ is augmented with the potential-based goal-reaching function, $F^{goal}_t$.
2.1 Reinforcement Learning with State-Only
Demonstrations
Let $M = (S, A, P, r^{task}, \gamma)$ be a finite-horizon Markov Decision Process (MDP) with horizon $H$, where $S$ and $A$ are the state and action spaces, $P(s' \mid s, a)$ is the transition dynamics of the environment, $r^{task}(s, a, s')$ is a deterministic extrinsic task reward, and $\gamma \in (0, 1)$ is the discount factor. The objective of reinforcement learning is to discover a policy $\pi(\cdot \mid s)$ that maximizes the expected reward induced by $\pi$, $\mathbb{E}_\pi\big[\sum_{t=0}^{H-1} \gamma^t r^{task}(s_t, a_t, s_{t+1})\big]$. In this work, we seek to maximize the same objective, but to do so more efficiently by incorporating additional information in the form of a single, state-only demonstration $D^e = \{s^e_t\}_{t=1}^{H}$, that may be suboptimal with respect to the task reward. The extent to which the demonstration can improve learning efficiency may depend on its degree of suboptimality. However, incorporating the demonstration into the reinforcement learning procedure should not alter the optimal policy with respect to $r^{task}$, no matter how suboptimal the demonstration is. Prior literature has referred to this desideratum as policy invariance [22].
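As a concrete reference for the objective above, the following minimal sketch computes the finite-horizon discounted return that the learner seeks to maximize; the `env.reset()`/`env.step()` interface, the `policy` callable, and the example demonstration are all illustrative assumptions, not part of the paper.

```python
import numpy as np

def discounted_return(env, policy, horizon, gamma):
    """Roll out `policy` for one episode and return the discounted task return.

    Assumed minimal interface: env.reset() -> state;
    env.step(a) -> (next_state, task_reward, done).
    """
    s = env.reset()
    total, discount = 0.0, 1.0
    for t in range(horizon):
        a = policy(s)                       # a_t ~ pi(. | s_t)
        s_next, r_task, done = env.step(a)
        total += discount * r_task          # gamma^t * r_task(s_t, a_t, s_{t+1})
        discount *= gamma
        s = s_next
        if done:
            break
    return total

# A single, state-only demonstration D^e = {s^e_t}, t = 1..H, is just a
# sequence of states; no demonstrator actions are available.
demo_states = [np.array([0, 0]), np.array([0, 1]), np.array([1, 1])]
```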
2.2 Goal-Conditioned Reinforcement Learning
Goal-conditioned reinforcement learning (GCRL) further considers a set of goals $G$ [18, 28]. While standard RL aims to find policies that can be used to execute a single task, the objective of GCRL is to learn a goal-conditioned policy $\pi(\cdot \mid [s, g])$, where the task is to reach any goal $g \in G$. Typically, $G$ is a predefined set of desirable states, and the reward function depends on the goal. A common choice of reward function is the sparse indicator function for when a goal has been reached, $r^g_t = \mathbb{1}_{s_t = g}$.

Since it is challenging for RL algorithms to learn under sparse rewards, Andrychowicz et al. [2] introduced hindsight experience replay (HER). In the setting considered by Andrychowicz et al. [2], the goal is set at the beginning of an episode and remains fixed throughout. HER relies on the insight that even if a trajectory fails to reach the given goal $g$, transitions in the trajectory are successful examples of reaching future states in the trajectory. More formally, given a transition with goal $g$, $([s_t, g], a_t, r^g_t, [s_{t+1}, g])$, HER samples a set of goals from future states in the episode, $\mathcal{G}$. For all goals $g' \in \mathcal{G}$, HER relabels the original transition to $([s_t, g'], a_t, r^{g'}_t, [s_{t+1}, g'])$ and stores the relabelled transition in a replay buffer. An off-policy RL algorithm is used to learn from the replay buffer.

In this work, we use demonstration states as goals, allowing the goal to change dynamically throughout the episode, and employ the relabelling technique from HER.
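For concreteness, the sketch below illustrates standard HER relabelling over one stored episode, assuming flat array states, the sparse indicator reward described above (computed here on the achieved next state), and illustrative helper names. It is not D-Shape's relabelling, which is described in Section 3.

```python
import random
import numpy as np

def her_relabel(episode, k=4):
    """Relabel transitions with goals sampled from future achieved states.

    `episode` is a list of (s, g, a, s_next) tuples sharing a fixed episode
    goal g. Returns original plus relabelled transitions, each of the form
    ([s, goal], a, reward, [s_next, goal]).
    """
    buffer = []
    for t, (s, g, a, s_next) in enumerate(episode):
        r = float(np.array_equal(s_next, g))    # sparse goal-reaching reward
        buffer.append((np.concatenate([s, g]), a, r, np.concatenate([s_next, g])))
        # "future" strategy: sample k goals from states achieved later on
        future = [tr[3] for tr in episode[t:]]
        for g_new in random.sample(future, min(k, len(future))):
            r_new = float(np.array_equal(s_next, g_new))
            buffer.append((np.concatenate([s, g_new]), a, r_new,
                           np.concatenate([s_next, g_new])))
    return buffer
```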
2.3 Potential-Based Reward Shaping
Dene a potential function
𝜙
:
𝑆↦→ R
. Let
𝐹(𝑠, 𝑠)B𝛾𝜙 (𝑠) 𝜙(𝑠)
;
𝐹
is called a potential-based shaping function. Consider the MDP
𝑀=(𝑆, 𝐴, 𝑃, 𝑅B𝑟𝑡𝑎𝑠𝑘 +𝐹,𝛾)
, where
𝑟𝑡𝑎𝑠𝑘
is an extrinsic task
reward. We say
𝑅
is a potential-based reward function. Ng et al
.
[22]
showed that
𝐹
being a potential-based shaping function is
both a necessary and sucient condition to guarantee that (near)
optimal policies learned in
𝑀
are also (near) optimal in
𝑀
— that
is, policy invariance holds. This work leverages potential-based
reward functions as goal-reaching rewards, to bias the learned
policy towards the demonstration trajectory.
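The shaping construction is easy to state in code. The sketch below shows the shaping term $F(s, s') = \gamma\phi(s') - \phi(s)$ and the shaped reward $R = r^{task} + F$; the particular potential used here is an arbitrary illustrative choice, and nothing in this sketch is specific to D-Shape yet.

```python
import numpy as np

def potential(s):
    # Any function phi: S -> R is admissible; this choice is purely illustrative.
    return -np.linalg.norm(np.asarray(s, dtype=float))

def shaping(s, s_next, gamma):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * potential(s_next) - potential(s)

def shaped_reward(r_task, s, s_next, gamma):
    """Shaped reward R = r_task + F; by Ng et al. [22], adding F preserves
    (near-)optimal policies of the original MDP."""
    return r_task + shaping(s, s_next, gamma)
```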
3 METHOD
We now introduce D-Shape, our approach to improving sample efficiency while leaving the optimal policy according to the task reward unchanged. The training procedure is summarized in Algorithm 1. D-Shape requires only a single, possibly suboptimal, state-only demonstration. We are inspired by model-based IfO methods that rely on inverse dynamics models (IDMs): models that, given a current state and a target state, return the action that induces the transition. We observe that an IDM can be viewed as a single-step goal-reaching policy. Although D-Shape does not assume access to an IDM, we hypothesize that providing expert demonstration states as goals to the reinforcement learner might be a useful inductive bias. HER is used to form an implicit curriculum to learn to reach demonstration states. As such, there are three components of D-Shape: state augmentation, reward shaping, and a goal-relabelling procedure. Figure 1 depicts the state augmentation and reward shaping process of a D-Shape learner as it interacts with the environment. Each component is described below.
3.0.1 State augmentation. During training, the policy gathers data with demonstration states as behavior goals: $a_t \sim \pi_\theta(a_t \mid [s_t, s^e_{t+1}])$. The agent observes the next state $s_{t+1}$ and the task reward $r^{task}_t$. The next state $s_{t+1}$ is augmented with the demonstration state $s^e_{t+2}$ as the goal. Note that our method employs dynamic goals, as the goal changes from time step $t$ to time step $t+1$.
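A minimal sketch of this augmentation step, assuming states and demonstration states are flat NumPy arrays: the policy input at time $t$ is the concatenation $[s_t, s^e_{t+1}]$, and the next input advances both the state and the demonstration goal. How the goal index is handled once it runs past the end of the demonstration is an implementation assumption of this sketch, not something specified in the text.

```python
import numpy as np

def augment(s_t, demo_states, t):
    """Concatenate the current state with the demonstration state used as goal.

    demo_states[t] plays the role of s^e_{t+1}; the last demonstration state
    is reused if the index runs past the end of the demonstration (assumption).
    """
    goal = demo_states[min(t, len(demo_states) - 1)]
    return np.concatenate([s_t, goal])

# At time t the policy sees [s_t, s^e_{t+1}]; at time t+1 it sees
# [s_{t+1}, s^e_{t+2}], so the goal changes dynamically with the time step.
```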
3.0.2 Reward shaping. The task reward $r^{task}_t$ is summed with a potential-based shaping function to form a potential-based goal-reaching reward. Define the potential function $\phi([s_t, g_t]) := -d(s_t, g_t)$, where $d$ is a distance function, $s_t$ is the state observed by the agent at time $t$, and $g_t$ is the provided goal. Because the goal is defined as part of the state, $\phi(\cdot)$ only depends on the state, as required by the formulation of potential-based reward shaping considered by Ng et al. [22]. The potential-based shaping function is then

$$F^{goal}([s_t, g_t], [s_{t+1}, g_{t+1}]) := \gamma\phi([s_{t+1}, g_{t+1}]) - \phi([s_t, g_t]). \quad (1)$$
Algorithm 1 D-Shape
Require: Single, state-only demonstration $D^e := \{s^e_t\}_{t=1}^{H}$
 1: Initialize $\theta$ at random
 2: while $\theta$ is not converged do
 3:   for $t = 0 : H-1$ do
 4:     Execute $a_t \sim \pi_\theta(\cdot \mid [s_t, s^e_{t+1}])$, observe $r^{task}_t$, $s_{t+1}$
 5:     Compute $r^{goal}_t = r^{task}_t + F^{goal}_t$ using Equation (2)
 6:     Store transition
 7:       $([s_t, s^e_{t+1}], a_t, r^{goal}_t, [s_{t+1}, s^e_{t+2}])$ in buffer
 8:   end for
 9:   for $t = 1 : H-1$ do
10:     Sample set of consecutive goal states $\mathcal{G}$ uniformly from episode
11:     for $(g, g') \in \mathcal{G}$ do
12:       Recompute $F^{goal}_t$ component of $r^{goal}_t$ using $(g, g')$
13:       Relabel transition to
14:         $([s_t, g], a_t, r^{goal}_t, [s_{t+1}, g'])$
15:       Store relabelled transition in replay buffer
16:     end for
17:   end for
18:   Update $\theta$ using off-policy RL algorithm
19: end while
20: return $\theta$
The potential-based reward function is

$$r^{goal}_t := r^{task}_t + F^{goal}_t. \quad (2)$$

Note that $r^{goal}_t$ is a goal-reaching reward that can be recomputed with new goals, suitable for goal relabelling. The above procedure results in "original" transitions of the form $([s_t, s^e_{t+1}], a_t, r^{goal}_t, [s_{t+1}, s^e_{t+2}])$, which are stored in a replay buffer.
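A sketch of Equations (1) and (2), assuming the goal-conditioned potential $\phi([s_t, g_t]) := -d(s_t, g_t)$ with Euclidean distance as an illustrative choice of $d$; the function names are hypothetical.

```python
import numpy as np

def phi(s, g):
    """Goal-conditioned potential phi([s, g]) := -d(s, g) (Euclidean d here)."""
    return -np.linalg.norm(np.asarray(s, dtype=float) - np.asarray(g, dtype=float))

def f_goal(s_t, g_t, s_next, g_next, gamma):
    """Equation (1): F_goal = gamma * phi([s_{t+1}, g_{t+1}]) - phi([s_t, g_t])."""
    return gamma * phi(s_next, g_next) - phi(s_t, g_t)

def r_goal(r_task, s_t, g_t, s_next, g_next, gamma):
    """Equation (2): r_goal = r_task + F_goal; recomputable under new goals."""
    return r_task + f_goal(s_t, g_t, s_next, g_next, gamma)
```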
Goal Relabelling. To encourage the policy to reach provided goals, we perform the following goal-relabelling procedure on original transitions, using previously achieved states as goals. We adopt a similar technique to HER, with a slight modification to the goal sampling strategy. D-Shape's goal sampling strategy consists of sampling consecutive pairs of achieved states $(g, g')$ from the current episode. The original transitions are then relabelled to $([s_t, g], a_t, r^{goal}, [s_{t+1}, g'])$, where the reward is recomputed as $r^{goal} = r^{task}_t + F^{goal}_t([s_t, g], a_t, [s_{t+1}, g'])$. As the goals are imaginary, even if the goal states change, the task reward $r^{task}_t$ remains unchanged.
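The sketch below illustrates this relabelling over one episode's original transitions, reusing the hypothetical `f_goal` helper from the sketch above. Exactly which achieved states are eligible as goals, and how many pairs are sampled per transition, are assumptions of this sketch; note that the task-reward component of each relabelled transition is left untouched.

```python
import random
import numpy as np

def dshape_relabel(episode, gamma, num_pairs=4):
    """Relabel transitions with consecutive achieved-state pairs (g, g').

    `episode` is a list of (s_t, a_t, r_task_t, s_next) tuples; the achieved
    next states of the episode serve as imaginary goals.
    """
    achieved = [tr[3] for tr in episode]          # achieved states of the episode
    relabelled = []
    for s_t, a_t, r_task_t, s_next in episode:
        # consecutive pairs (g, g') drawn uniformly from the episode
        idxs = random.sample(range(len(achieved) - 1),
                             min(num_pairs, max(len(achieved) - 1, 0)))
        for i in idxs:
            g, g_next = achieved[i], achieved[i + 1]
            r = r_task_t + f_goal(s_t, g, s_next, g_next, gamma)  # recompute Eq. (2)
            relabelled.append((np.concatenate([s_t, g]), a_t, r,
                               np.concatenate([s_next, g_next])))
    return relabelled
```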
The policy $\pi_\theta$ is updated by applying an off-policy RL algorithm to the original data combined with the relabelled data. In our experiments, we use Q-learning [36]. At inference time, the policy once more acts with demonstration states as goals, i.e., $a_t \sim \pi_\theta(a_t \mid [s_t, s^e_{t+1}])$.
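As a concrete instance of the off-policy update, a single tabular Q-learning step over a goal-augmented transition might look like the following minimal sketch. The dictionary-backed Q-table, the hashing of augmented states to tuples, and the omission of terminal-state handling are simplifying assumptions, not the paper's implementation.

```python
from collections import defaultdict

def q_update(Q, transition, actions, gamma, alpha=0.1):
    """One tabular Q-learning update on a goal-augmented transition.

    `transition` is ([s, g], a, r_goal, [s_next, g_next]); augmented states are
    converted to tuples to serve as dictionary keys. Terminal-state handling
    is omitted for brevity.
    """
    sg, a, r, sg_next = transition
    key, key_next = tuple(sg), tuple(sg_next)
    td_target = r + gamma * max(Q[(key_next, b)] for b in actions)
    Q[(key, a)] += alpha * (td_target - Q[(key, a)])

Q = defaultdict(float)   # Q-values over (augmented state, action) pairs
```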
4 CONSISTENCY WITH THE TASK REWARD
In this section, we prove the claim that D-Shape preserves the optimal policy with respect to the task reward by showing that we can obtain optimal policies on $M$ from an optimal policy learned by D-Shape. The analysis treats D-Shape as the composition of goal relabelling and potential-based reward shaping, and formalizes this composition as the MDP transformation $M \to M^\dagger \to M^\ddagger$ (Figure 2).

Figure 2: Our theoretical analysis considers D-Shape as the composition of goal relabelling ($M \to M^\dagger$) and potential-based reward shaping ($M^\dagger \to M^\ddagger$). D-Shape learns in $M^\ddagger$. That optimal policies are preserved from $M^\dagger$ to $M^\ddagger$ follows directly from the policy invariance results of Ng et al. [22]. The theory provided considers a policy $\pi$ in $M$ as a policy in $M^\dagger$ via the natural extension $f(\pi)$, and considers a policy $\pi^\dagger$ in $M^\dagger$ as operating in $M$ via $\Gamma(s)$.
Let $M = (S, A, P, r^{task}, \gamma)$ be an MDP with horizon $H$, where $S$ and $A$ are the state and action spaces, $P(s' \mid s, a)$ is the transition dynamics of the environment, $r^{task}(s, a, s')$ is the deterministic extrinsic task reward, and $\gamma \in (0, 1)$ is the discount factor. Modify $M$ to $M^\dagger$ as follows: $M^\dagger = (S \times G, A, P^\dagger, \gamma, r^\dagger)$, where $G$ is a discrete set of goals, and

$$P^\dagger([s', g'] \mid [s, g], a) = P(s' \mid s, a)\, P(g' \mid [s, g], a),$$
$$r^\dagger([s, g], a, [s', g']) = r^{task}(s, a, s').$$
We make two independence assumptions in our definition of $P^\dagger$. First, that the random variable $(s' \mid [s, g], a)$ is independent of $(g' \mid [s, g], a)$, allowing us to factorize $P^\dagger([s', g'] \mid [s, g], a) = P(s' \mid [s, g], a)\, P(g' \mid [s, g], a)$. Second, that $(s' \mid s, g, a)$ is independent of $(g \mid s, a)$, allowing us to rewrite $P(s' \mid [s, g], a) = P(s' \mid s, a)$. In the context of D-Shape, the above assumptions simply mean that a goal in the replay buffer must be independent of all states, goals, and actions other than the previous state, goal, and action. We also require that $r^\dagger$ is independent of goals. We justify that our implementation of D-Shape approximately satisfies these assumptions in the Supplemental Material.
Now dene
𝑀=(𝑆×𝐺, 𝐴, 𝑃, 𝛾, 𝑟 +𝐹𝑔𝑜𝑎𝑙 )
, where
𝐹𝑔𝑜𝑎𝑙
is
dened as in Equation 1.
𝑀
is identical to
𝑀
, except for the
addition of the potential-based shaping function, 𝐹𝑔𝑜𝑎𝑙 .
D-Shape learns a goal-conditioned policy $\pi(\cdot \mid [s, g])$ in $M^\ddagger$. To perform inference with the goal-conditioned policy in $M$, we must specify a state-goal mapping $\Gamma : S \mapsto G$. Then $\pi(\cdot \mid [s, \Gamma(s)])$ can be executed in $M$. Suppose that $\pi^{\ddagger*}(\cdot \mid [s, g])$ is an optimal policy in $M^\ddagger$. Is there a $\Gamma$ such that $\pi^{\ddagger*}(\cdot \mid [s, \Gamma(s)])$ is optimal in $M$? We show next that the answer is positive, and that an arbitrary $\Gamma$ suffices.
That $\pi^{\ddagger*}$ is optimal in $M^\dagger$ follows from the policy invariance results proven by Ng et al. [22]. By their result, as long as the shaping term $F^{goal}$ is a potential-based shaping function, (near-)optimal policies in $M^\ddagger$ are also (near-)optimal in $M^\dagger$.