D-Shape: Demonstration-Shaped Reinforcement Learning via
Goal Conditioning
Caroline Wang
The University of Texas at Austin
Austin, Texas, United States
caroline.l.wang@utexas.edu
Garrett Warnell
Army Research Laboratory and
The University of Texas at Austin
Austin, Texas, United States
garrett.a.warnell.civ@army.mil
Peter Stone
The University of Texas at Austin and
Sony AI
Austin, Texas, United States
pstone@cs.utexas.edu
ABSTRACT
While combining imitation learning (IL) and reinforcement learning (RL) is a promising way to address poor sample efficiency in autonomous behavior acquisition, methods that do so typically assume that the requisite behavior demonstrations are provided by an expert that behaves optimally with respect to a task reward. If, however, suboptimal demonstrations are provided, a fundamental challenge appears in that the demonstration-matching objective of IL conflicts with the return-maximization objective of RL. This paper introduces D-Shape, a new method for combining IL and RL that uses ideas from reward shaping and goal-conditioned RL to resolve the above conflict. D-Shape allows learning from suboptimal demonstrations while retaining the ability to find the optimal policy with respect to the task reward. We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy in the presence of suboptimal demonstrations.
KEYWORDS
reinforcement learning; goal-conditioned reinforcement learning;
imitation from observation; suboptimal demonstrations
ACM Reference Format:
Caroline Wang, Garrett Warnell, and Peter Stone. 2023. D-Shape: Demonstration-
Shaped Reinforcement Learning via Goal Conditioning. In Proc. of the 22nd
International Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2023), London, United Kingdom, May 29 – June 2, 2023, IFAAMAS,
13 pages.
1 INTRODUCTION AND BACKGROUND
A longstanding goal of artificial intelligence is enabling machines to learn new behaviors. Towards this goal, the research community has proposed both imitation learning (IL) and reinforcement learning (RL). In IL, the agent is given access to a set of state-action expert demonstrations, and its goal is either to mimic the expert's behavior, or to infer the expert's reward function and maximize the inferred reward. In RL, the agent is provided with a reward signal, and its goal is to maximize the long-term discounted reward. While RL algorithms can potentially learn optimal behavior with respect to the provided reward signal, in practice, they often suffer from high sample complexity in large state-action spaces, or spaces
with sparse reward signals. On the other hand, IL methods are typically more sample-efficient than RL methods, but require expert demonstration data.

It seems natural to consider using techniques from imitation learning and demonstration data to speed up reinforcement learning. However, many IL algorithms implicitly perform divergence minimization with respect to the provided demonstrations, with no notion of an extrinsic task reward [11]. When we have access to both demonstration data and an extrinsic task reward, we have the opportunity to combine IL and RL techniques, but must carefully consider whether the demonstrated behavior conflicts with the extrinsic task reward, especially when the demonstrated behavior is suboptimal. Moreover, standard IL algorithms are only valid in situations when demonstrations contain both state and action information, which is not always the case.

The community has recently made progress in the area of imitation from observation (IfO), which extends IL approaches to situations where demonstrator action information is unavailable, difficult to induce, or not appropriate for the task at hand. This last situation may occur if the demonstrator's transition dynamics differ from the learner's, for instance, when the expert is a human and the agent is a robot [20, 35]. While there has been some work on performing IL or IfO with demonstrations that are suboptimal with respect to an implicit task that the demonstrator seeks to accomplish [4, 5, 7, 38], to date, relatively little work has considered the problem of combining IfO and RL, where the learner's true task is explicitly specified by a reward function [32].
This paper introduces the D-Shape algorithm, which combines IfO and RL in situations with suboptimal demonstrations. D-Shape requires only a single, suboptimal, state-only expert demonstration, and treats demonstration states as goals. To ensure that the optimal policy with respect to the task reward is not altered, D-Shape uses potential-based reward shaping to define a goal-reaching reward. We show theoretically that D-Shape preserves optimal policies, and show empirically that D-Shape improves sample efficiency over related approaches with both optimal and suboptimal demonstrations.
2 PRELIMINARIES
This section introduces our notation and the technical concepts that are key to this work: reinforcement learning with state-only demonstrations, goal-conditioned reinforcement learning, and potential-based reward shaping.
Figure 1: D-Shape's interaction with the environment. The state $s_t$ and the task reward $r^{task}_t$ come from the environment. $s_t$ is concatenated with the demonstration state $s^e_{t+1}$ as the goal, and $r^{task}$ is augmented with the potential-based goal-reaching function, $F^{goal}_t$.
2.1 Reinforcement Learning with State-Only
Demonstrations
Let $M = (S, A, P, r^{task}, \gamma)$ be a finite-horizon Markov Decision Process (MDP) with horizon $H$, where $S$ and $A$ are the state and action spaces, $P(s' \mid s, a)$ is the transition dynamics of the environment, $r^{task}(s, a, s')$ is a deterministic extrinsic task reward, and $\gamma \in (0, 1)$ is the discount factor. The objective of reinforcement learning is to discover a policy $\pi(\cdot \mid s)$ that maximizes the expected reward induced by $\pi$, $\mathbb{E}_\pi\big[\sum_{t=0}^{H-1} \gamma^t r^{task}(s_t, a_t, s_{t+1})\big]$. In this work, we seek to maximize the same objective, but to do so more efficiently by incorporating additional information in the form of a single, state-only demonstration $D^e = \{s^e_t\}_{t=1}^{H}$, that may be suboptimal with respect to the task reward. The extent to which the demonstration can improve learning efficiency may depend on its degree of suboptimality. However, incorporating the demonstration into the reinforcement learning procedure should not alter the optimal policy with respect to $r^{task}$, no matter how suboptimal the demonstration is. Prior literature has referred to this desideratum as policy invariance [22].
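As a concrete reference for the objective above, the following minimal sketch computes the finite-horizon discounted return that the learner seeks to maximize; the `env.reset()`/`env.step()` interface, the `policy` callable, and the example demonstration are all illustrative assumptions, not part of the paper.

```python
import numpy as np

def discounted_return(env, policy, horizon, gamma):
    """Roll out `policy` for one episode and return the discounted task return.

    Assumed minimal interface: env.reset() -> state;
    env.step(a) -> (next_state, task_reward, done).
    """
    s = env.reset()
    total, discount = 0.0, 1.0
    for t in range(horizon):
        a = policy(s)                       # a_t ~ pi(. | s_t)
        s_next, r_task, done = env.step(a)
        total += discount * r_task          # gamma^t * r_task(s_t, a_t, s_{t+1})
        discount *= gamma
        s = s_next
        if done:
            break
    return total

# A single, state-only demonstration D^e = {s^e_t}, t = 1..H, is just a
# sequence of states; no demonstrator actions are available.
demo_states = [np.array([0, 0]), np.array([0, 1]), np.array([1, 1])]
```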
2.2 Goal-Conditioned Reinforcement Learning
Goal-conditioned reinforcement learning (GCRL) further considers a set of goals $G$ [18, 28]. While standard RL aims to find policies that can be used to execute a single task, the objective of GCRL is to learn a goal-conditioned policy $\pi(\cdot \mid [s, g])$, where the task is to reach any goal $g \in G$. Typically, $G$ is a predefined set of desirable states, and the reward function depends on the goal. A common choice of reward function is the sparse indicator function for when a goal has been reached, $r^g_t = \mathbb{1}_{s_t = g}$.

Since it is challenging for RL algorithms to learn under sparse rewards, Andrychowicz et al. [2] introduced hindsight experience replay (HER). In the setting considered by Andrychowicz et al. [2], the goal is set at the beginning of an episode and remains fixed throughout. HER relies on the insight that even if a trajectory fails to reach the given goal $g$, transitions in the trajectory are successful examples of reaching future states in the trajectory. More formally, given a transition with goal $g$, $([s_t, g], a_t, r^g_t, [s_{t+1}, g])$, HER samples a set of goals from future states in the episode, $\mathcal{G}$. For all goals $g' \in \mathcal{G}$, HER relabels the original transition to $([s_t, g'], a_t, r^{g'}_t, [s_{t+1}, g'])$ and stores the relabelled transition in a replay buffer. An off-policy RL algorithm is used to learn from the replay buffer.

In this work, we use demonstration states as goals, allowing the goal to change dynamically throughout the episode, and employ the relabelling technique from HER.
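For concreteness, the sketch below illustrates standard HER relabelling over one stored episode, assuming flat array states, the sparse indicator reward described above (computed here on the achieved next state), and illustrative helper names. It is not D-Shape's relabelling, which is described in Section 3.

```python
import random
import numpy as np

def her_relabel(episode, k=4):
    """Relabel transitions with goals sampled from future achieved states.

    `episode` is a list of (s, g, a, s_next) tuples sharing a fixed episode
    goal g. Returns original plus relabelled transitions, each of the form
    ([s, goal], a, reward, [s_next, goal]).
    """
    buffer = []
    for t, (s, g, a, s_next) in enumerate(episode):
        r = float(np.array_equal(s_next, g))    # sparse goal-reaching reward
        buffer.append((np.concatenate([s, g]), a, r, np.concatenate([s_next, g])))
        # "future" strategy: sample k goals from states achieved later on
        future = [tr[3] for tr in episode[t:]]
        for g_new in random.sample(future, min(k, len(future))):
            r_new = float(np.array_equal(s_next, g_new))
            buffer.append((np.concatenate([s, g_new]), a, r_new,
                           np.concatenate([s_next, g_new])))
    return buffer
```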
2.3 Potential-Based Reward Shaping
Dene a potential function
𝜙
:
𝑆↦→ R
. Let
𝐹(𝑠, 𝑠)B𝛾𝜙 (𝑠) 𝜙(𝑠)
;
𝐹
is called a potential-based shaping function. Consider the MDP
𝑀=(𝑆, 𝐴, 𝑃, 𝑅B𝑟𝑡𝑎𝑠𝑘 +𝐹,𝛾)
, where
𝑟𝑡𝑎𝑠𝑘
is an extrinsic task
reward. We say
𝑅
is a potential-based reward function. Ng et al
.
[22]
showed that
𝐹
being a potential-based shaping function is
both a necessary and sucient condition to guarantee that (near)
optimal policies learned in
𝑀
are also (near) optimal in
𝑀
— that
is, policy invariance holds. This work leverages potential-based
reward functions as goal-reaching rewards, to bias the learned
policy towards the demonstration trajectory.
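The shaping construction is easy to state in code. The sketch below shows the shaping term $F(s, s') = \gamma\phi(s') - \phi(s)$ and the shaped reward $R = r^{task} + F$; the particular potential used here is an arbitrary illustrative choice, and nothing in this sketch is specific to D-Shape yet.

```python
import numpy as np

def potential(s):
    # Any function phi: S -> R is admissible; this choice is purely illustrative.
    return -np.linalg.norm(np.asarray(s, dtype=float))

def shaping(s, s_next, gamma):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * potential(s_next) - potential(s)

def shaped_reward(r_task, s, s_next, gamma):
    """Shaped reward R = r_task + F; by Ng et al. [22], adding F preserves
    (near-)optimal policies of the original MDP."""
    return r_task + shaping(s, s_next, gamma)
```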
3 METHOD
We now introduce D-Shape, our approach to improving sample efficiency while leaving the optimal policy according to the task reward unchanged. The training procedure is summarized in Algorithm 1. D-Shape requires only a single, possibly suboptimal, state-only demonstration. We are inspired by model-based IfO methods that rely on inverse dynamics models (IDMs): models that, given a current state and a target state, return the action that induces the transition. We observe that an IDM can be viewed as a single-step goal-reaching policy. Although D-Shape does not assume access to an IDM, we hypothesize that providing expert demonstration states as goals to the reinforcement learner might be a useful inductive bias. HER is used to form an implicit curriculum to learn to reach demonstration states. As such, there are three components of D-Shape: state augmentation, reward shaping, and a goal-relabelling procedure. Figure 1 depicts the state augmentation and reward shaping process of a D-Shape learner as it interacts with the environment. Each component is described below.
3.0.1 State augmentation. During training, the policy gathers data with demonstration states as behavior goals: $a_t \sim \pi_\theta(a_t \mid [s_t, s^e_{t+1}])$. The agent observes the next state $s_{t+1}$ and the task reward $r^{task}_t$. The next state $s_{t+1}$ is augmented with the demonstration state $s^e_{t+2}$ as the goal. Note that our method employs dynamic goals, as the goal changes from time step $t$ to time step $t+1$.
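A minimal sketch of this augmentation step, assuming states and demonstration states are flat NumPy arrays: the policy input at time $t$ is the concatenation $[s_t, s^e_{t+1}]$, and the next input advances both the state and the demonstration goal. How the goal index is handled once it runs past the end of the demonstration is an implementation assumption of this sketch, not something specified in the text.

```python
import numpy as np

def augment(s_t, demo_states, t):
    """Concatenate the current state with the demonstration state used as goal.

    demo_states[t] plays the role of s^e_{t+1}; the last demonstration state
    is reused if the index runs past the end of the demonstration (assumption).
    """
    goal = demo_states[min(t, len(demo_states) - 1)]
    return np.concatenate([s_t, goal])

# At time t the policy sees [s_t, s^e_{t+1}]; at time t+1 it sees
# [s_{t+1}, s^e_{t+2}], so the goal changes dynamically with the time step.
```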
3.0.2 Reward shaping. The task reward $r^{task}_t$ is summed with a potential-based shaping function to form a potential-based goal-reaching reward. Define the potential function $\phi([s_t, g_t]) := -d(s_t, g_t)$, where $d$ is a distance function, $s_t$ is the state observed by the agent at time $t$, and $g_t$ is the provided goal. Because the goal is defined as part of the state, $\phi(\cdot)$ only depends on the state, as required by the formulation of potential-based reward shaping considered by Ng et al. [22]. The potential-based shaping function is then

$$F^{goal}([s_t, g_t], [s_{t+1}, g_{t+1}]) := \gamma\phi([s_{t+1}, g_{t+1}]) - \phi([s_t, g_t]). \quad (1)$$
Algorithm 1 D-Shape
Require: Single, state-only demonstration $D^e := \{s^e_t\}_{t=1}^{H}$
 1: Initialize $\theta$ at random
 2: while $\theta$ is not converged do
 3:   for $t = 0 : H-1$ do
 4:     Execute $a_t \sim \pi_\theta(\cdot \mid [s_t, s^e_{t+1}])$, observe $r^{task}_t$, $s_{t+1}$
 5:     Compute $r^{goal}_t = r^{task}_t + F^{goal}_t$ using Equation (2)
 6:     Store transition
 7:       $([s_t, s^e_{t+1}], a_t, r^{goal}_t, [s_{t+1}, s^e_{t+2}])$ in buffer
 8:   end for
 9:   for $t = 1 : H-1$ do
10:     Sample set of consecutive goal states $\mathcal{G}$ uniformly from episode
11:     for $(g, g') \in \mathcal{G}$ do
12:       Recompute $F^{goal}_t$ component of $r^{goal}_t$ using $(g, g')$
13:       Relabel transition to
14:         $([s_t, g], a_t, r^{goal}_t, [s_{t+1}, g'])$
15:       Store relabelled transition in replay buffer
16:     end for
17:   end for
18:   Update $\theta$ using off-policy RL algorithm
19: end while
20: return $\theta$
The potential-based reward function is

$$r^{goal}_t := r^{task}_t + F^{goal}_t. \quad (2)$$

Note that $r^{goal}_t$ is a goal-reaching reward that can be recomputed with new goals, suitable for goal relabelling. The above procedure results in "original" transitions of the form $([s_t, s^e_{t+1}], a_t, r^{goal}_t, [s_{t+1}, s^e_{t+2}])$, which are stored in a replay buffer.
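A sketch of Equations (1) and (2), assuming the goal-conditioned potential $\phi([s_t, g_t]) := -d(s_t, g_t)$ with Euclidean distance as an illustrative choice of $d$; the function names are hypothetical.

```python
import numpy as np

def phi(s, g):
    """Goal-conditioned potential phi([s, g]) := -d(s, g) (Euclidean d here)."""
    return -np.linalg.norm(np.asarray(s, dtype=float) - np.asarray(g, dtype=float))

def f_goal(s_t, g_t, s_next, g_next, gamma):
    """Equation (1): F_goal = gamma * phi([s_{t+1}, g_{t+1}]) - phi([s_t, g_t])."""
    return gamma * phi(s_next, g_next) - phi(s_t, g_t)

def r_goal(r_task, s_t, g_t, s_next, g_next, gamma):
    """Equation (2): r_goal = r_task + F_goal; recomputable under new goals."""
    return r_task + f_goal(s_t, g_t, s_next, g_next, gamma)
```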
Goal Relabelling. To encourage the policy to reach provided goals, we perform the following goal-relabelling procedure on original transitions, using previously achieved states as goals. We adopt a similar technique to HER, with a slight modification to the goal sampling strategy. D-Shape's goal sampling strategy consists of sampling consecutive pairs of achieved states $(g, g')$ from the current episode. The original transitions are then relabelled to $([s_t, g], a_t, r^{goal}, [s_{t+1}, g'])$, where the reward is recomputed as $r^{goal} = r^{task}_t + F^{goal}_t([s_t, g], a_t, [s_{t+1}, g'])$. As the goals are imaginary, even if the goal states change, the task reward $r^{task}_t$ remains unchanged.
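The sketch below illustrates this relabelling over one episode's original transitions, reusing the hypothetical `f_goal` helper from the sketch above. Exactly which achieved states are eligible as goals, and how many pairs are sampled per transition, are assumptions of this sketch; note that the task-reward component of each relabelled transition is left untouched.

```python
import random
import numpy as np

def dshape_relabel(episode, gamma, num_pairs=4):
    """Relabel transitions with consecutive achieved-state pairs (g, g').

    `episode` is a list of (s_t, a_t, r_task_t, s_next) tuples; the achieved
    next states of the episode serve as imaginary goals.
    """
    achieved = [tr[3] for tr in episode]          # achieved states of the episode
    relabelled = []
    for s_t, a_t, r_task_t, s_next in episode:
        # consecutive pairs (g, g') drawn uniformly from the episode
        idxs = random.sample(range(len(achieved) - 1),
                             min(num_pairs, max(len(achieved) - 1, 0)))
        for i in idxs:
            g, g_next = achieved[i], achieved[i + 1]
            r = r_task_t + f_goal(s_t, g, s_next, g_next, gamma)  # recompute Eq. (2)
            relabelled.append((np.concatenate([s_t, g]), a_t, r,
                               np.concatenate([s_next, g_next])))
    return relabelled
```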
The policy $\pi_\theta$ is updated by applying an off-policy RL algorithm to the original data combined with the relabelled data. In our experiments, we use Q-learning [36]. At inference time, the policy once more acts with demonstration states as goals, i.e., $a_t \sim \pi_\theta(a_t \mid [s_t, s^e_{t+1}])$.
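As a concrete instance of the off-policy update, a single tabular Q-learning step over a goal-augmented transition might look like the following minimal sketch. The dictionary-backed Q-table, the hashing of augmented states to tuples, and the omission of terminal-state handling are simplifying assumptions, not the paper's implementation.

```python
from collections import defaultdict

def q_update(Q, transition, actions, gamma, alpha=0.1):
    """One tabular Q-learning update on a goal-augmented transition.

    `transition` is ([s, g], a, r_goal, [s_next, g_next]); augmented states are
    converted to tuples to serve as dictionary keys. Terminal-state handling
    is omitted for brevity.
    """
    sg, a, r, sg_next = transition
    key, key_next = tuple(sg), tuple(sg_next)
    td_target = r + gamma * max(Q[(key_next, b)] for b in actions)
    Q[(key, a)] += alpha * (td_target - Q[(key, a)])

Q = defaultdict(float)   # Q-values over (augmented state, action) pairs
```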
4 CONSISTENCY WITH THE TASK REWARD
In this section, we prove the claim that D-Shape preserves the optimal policy with respect to the task reward by showing that we can obtain optimal policies on $M$ from an optimal policy learned by D-Shape. The analysis treats D-Shape as the composition of goal relabelling and potential-based reward shaping, and formalizes this composition as the MDP transformation $M \to M^\dagger \to M^\ddagger$ (Figure 2).

Figure 2: Our theoretical analysis considers D-Shape as the composition of goal relabelling ($M \to M^\dagger$) and potential-based reward shaping ($M^\dagger \to M^\ddagger$). D-Shape learns in $M^\ddagger$. That optimal policies are preserved from $M^\dagger$ to $M^\ddagger$ follows directly from the policy invariance results of Ng et al. [22]. The theory provided considers a policy $\pi$ in $M$ as a policy in $M^\dagger$ via the natural extension $f(\pi)$, and considers a policy $\pi^\dagger$ in $M^\dagger$ as operating in $M$ via $\Gamma(s)$.
Let $M = (S, A, P, r^{task}, \gamma)$ be an MDP with horizon $H$, where $S$ and $A$ are the state and action spaces, $P(s' \mid s, a)$ is the transition dynamics of the environment, $r^{task}(s, a, s')$ is the deterministic extrinsic task reward, and $\gamma \in (0, 1)$ is the discount factor. Modify $M$ to $M^\dagger$ as follows: $M^\dagger = (S \times G, A, P^\dagger, \gamma, r^\dagger)$, where $G$ is a discrete set of goals, and

$$P^\dagger([s', g'] \mid [s, g], a) = P(s' \mid s, a)\, P(g' \mid [s, g], a),$$
$$r^\dagger([s, g], a, [s', g']) = r^{task}(s, a, s').$$
We make two independence assumptions in our definition of $P^\dagger$. First, that the random variable $(s' \mid [s, g], a)$ is independent of $(g' \mid [s, g], a)$, allowing us to factorize $P^\dagger([s', g'] \mid [s, g], a) = P(s' \mid [s, g], a)\, P(g' \mid [s, g], a)$. Second, that $(s' \mid s, g, a)$ is independent of $(g \mid s, a)$, allowing us to rewrite $P(s' \mid [s, g], a) = P(s' \mid s, a)$. In the context of D-Shape, the above assumptions simply mean that a goal in the replay buffer must be independent of all states, goals, and actions other than the previous state, goal, and action. We also require that $r^\dagger$ is independent of goals. We justify that our implementation of D-Shape approximately satisfies these assumptions in the Supplemental Material.
Now dene
𝑀=(𝑆×𝐺, 𝐴, 𝑃, 𝛾, 𝑟 +𝐹𝑔𝑜𝑎𝑙 )
, where
𝐹𝑔𝑜𝑎𝑙
is
dened as in Equation 1.
𝑀
is identical to
𝑀
, except for the
addition of the potential-based shaping function, 𝐹𝑔𝑜𝑎𝑙 .
D-Shape learns a goal-conditioned policy $\pi(\cdot \mid [s, g])$ in $M^\ddagger$. To perform inference with the goal-conditioned policy in $M$, we must specify a state-goal mapping $\Gamma : S \mapsto G$. Then $\pi(\cdot \mid [s, \Gamma(s)])$ can be executed in $M$. Suppose that $\pi^{\ddagger*}(\cdot \mid [s, g])$ is an optimal policy in $M^\ddagger$. Is there a $\Gamma$ such that $\pi^{\ddagger*}(\cdot \mid [s, \Gamma(s)])$ is optimal in $M$? We show next that the answer is positive, and that an arbitrary $\Gamma$ suffices.
That $\pi^{\ddagger*}$ is optimal in $M^\dagger$ follows from the policy invariance results proven by Ng et al. [22]. By their result, as long as the shaping term $F^{goal}$ is a potential-based shaping function, (near-)optimal policies in $M^\ddagger$ are also (near-)optimal in $M^\dagger$.