
D-Shape: Demonstration-Shaped Reinforcement Learning via
Goal Conditioning
Caroline Wang
The University of Texas at Austin
Austin, Texas, United States
caroline.l.wang@utexas.edu
Garrett Warnell
Army Research Laboratory and
The University of Texas at Austin
Austin, Texas, United States
garrett.a.warnell.civ@army.mil
Peter Stone
The University of Texas at Austin and
Sony AI
Austin, Texas, United States
pstone@cs.utexas.edu
ABSTRACT
While combining imitation learning (IL) and reinforcement learning (RL) is a promising way to address poor sample efficiency in autonomous behavior acquisition, methods that do so typically assume that the requisite behavior demonstrations are provided by an expert that behaves optimally with respect to a task reward. If, however, suboptimal demonstrations are provided, a fundamental challenge appears in that the demonstration-matching objective of IL conflicts with the return-maximization objective of RL. This paper introduces D-Shape, a new method for combining IL and RL that uses ideas from reward shaping and goal-conditioned RL to resolve the above conflict. D-Shape allows learning from suboptimal demonstrations while retaining the ability to find the optimal policy with respect to the task reward. We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy in the presence of suboptimal demonstrations.
KEYWORDS
reinforcement learning; goal-conditioned reinforcement learning;
imitation from observation; suboptimal demonstrations
ACM Reference Format:
Caroline Wang, Garrett Warnell, and Peter Stone. 2023. D-Shape: Demonstration-
Shaped Reinforcement Learning via Goal Conditioning. In Proc. of the 22nd
International Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2023), London, United Kingdom, May 29 – June 2, 2023, IFAAMAS,
13 pages.
1 INTRODUCTION AND BACKGROUND
A longstanding goal of artificial intelligence is enabling machines to learn new behaviors. Towards this goal, the research community has proposed both imitation learning (IL) and reinforcement learning (RL). In IL, the agent is given access to a set of state-action expert demonstrations, and its goal is either to mimic the expert's behavior or to infer the expert's reward function and maximize the inferred reward. In RL, the agent is provided with a reward signal, and its goal is to maximize the long-term discounted reward. While RL algorithms can potentially learn optimal behavior with respect to the provided reward signal, in practice they often suffer from high sample complexity in large state-action spaces, or spaces
with sparse reward signals. On the other hand, IL methods are typically more sample-efficient than RL methods, but require expert demonstration data.
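For concreteness, the return-maximization objective referred to above is typically the expected discounted sum of task rewards. Using $\pi$ for the agent's policy, $\gamma \in [0, 1)$ for the discount factor, and $r(s_t, a_t)$ for the task reward (symbols chosen here for illustration; the paper's own notation is fixed in Section 2), a sketch of this objective is
\[
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right],
\]
whereas the IL objective makes no reference to $r$ and instead measures agreement with the demonstrations.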
It seems natural to consider using techniques from imitation learning and demonstration data to speed up reinforcement learning. However, many IL algorithms implicitly perform divergence minimization with respect to the provided demonstrations, with no notion of an extrinsic task reward [11]. When we have access to both demonstration data and an extrinsic task reward, we have the opportunity to combine IL and RL techniques, but must carefully consider whether the demonstrated behavior conflicts with the extrinsic task reward, especially when the demonstrated behavior is suboptimal. Moreover, standard IL algorithms are only valid in situations where demonstrations contain both state and action information, which is not always the case.
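To make the divergence-minimization view above concrete (the notation here is illustrative rather than taken from [11]): writing $\rho_{\pi}$ and $\rho_{E}$ for the occupancy measures induced by the learner's policy and the demonstrator, respectively, and $D$ for a divergence such as the KL divergence, many IL algorithms can be viewed as solving
\[
\min_{\pi} \; D\!\left(\rho_{\pi} \,\|\, \rho_{E}\right),
\]
an objective defined entirely by the demonstrations, with no term for the extrinsic task reward.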
The community has recently made progress in the area of imitation from observation (IfO), which extends IL approaches to situations where demonstrator action information is unavailable, difficult to induce, or not appropriate for the task at hand. This last situation may occur if the demonstrator's transition dynamics differ from the learner's, for instance, when the expert is a human and the agent is a robot [20, 35]. While there has been some work on performing IL or IfO with demonstrations that are suboptimal with respect to an implicit task that the demonstrator seeks to accomplish [4, 5, 7, 38], to date, relatively little work has considered the problem of combining IfO and RL, where the learner's true task is explicitly specified by a reward function [32].
This paper introduces the D-Shape algorithm, which combines IfO and RL in situations with suboptimal demonstrations. D-Shape requires only a single, suboptimal, state-only expert demonstration, and treats demonstration states as goals. To ensure that the optimal policy with respect to the task reward is not altered, D-Shape uses potential-based reward shaping to define a goal-reaching reward. We show theoretically that D-Shape preserves optimal policies, and show empirically that D-Shape improves sample efficiency over related approaches with both optimal and suboptimal demonstrations.
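As context for the shaping mechanism mentioned above (the symbols here are illustrative; D-Shape's precise construction is given later in the paper), classical potential-based reward shaping augments the task reward with a difference of potentials,
\[
r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + \gamma\,\Phi(s_{t+1}) - \Phi(s_t),
\]
where $\Phi$ is a potential function over states; shaping of this form is known to leave the set of optimal policies for the task reward unchanged. D-Shape builds its goal-reaching reward from this mechanism, with goals drawn from the demonstration states.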
2 PRELIMINARIES
This section introduces our notation and the technical concepts that are key to this work: reinforcement learning with state-only demonstrations, goal-conditioned reinforcement learning, and potential-based reward shaping.