Handling Sparse Rewards in Reinforcement Learning
Using Model Predictive Control
Murad Dawood Nils Dengler Jorge de Heuvel Maren Bennewitz
Abstract— Reinforcement learning (RL) has recently achieved great success in various domains. Yet, the design of the reward function requires detailed domain expertise and tedious fine-tuning to ensure that agents are able to learn the desired behaviour. Using a sparse reward conveniently mitigates these challenges. However, a sparse reward poses a challenge of its own, often resulting in unsuccessful training of the agent. In this paper, we therefore address the sparse reward problem in RL. Our goal is to find an effective alternative to reward shaping that does not rely on costly human demonstrations and is applicable to a wide range of domains. Hence, we propose to use model predictive control (MPC) as an experience source for training RL agents in sparse reward environments. Without the need for reward shaping, we successfully apply our approach to mobile robot navigation, both in simulation and in real-world experiments with a Kobuki TurtleBot 2. We furthermore demonstrate a substantial improvement over pure RL algorithms in terms of success rate as well as the number of collisions and timeouts. Our experiments show that MPC as an experience source improves the agent's learning process for a given task in the case of sparse rewards.
I. INTRODUCTION
Reinforcement learning (RL) as well as model predictive control (MPC) have lately been applied to various fields and shown impressive results. However, major challenges still need to be addressed in both approaches.
One major challenge in RL is the design of the reward
function. Shaping the reward function to achieve the desired results requires numerous trials until the trained policy exhibits the expected behaviour. This is mainly because, during training, the agent exploits any opportunity offered by the reward function. An obvious solution to this issue
would be to use sparse rewards, i.e., rewarding the agent only
for achieving the goal and giving zero rewards otherwise.
While this approach encourages the agent to complete a
certain task, it is more difficult for the agent to identify
promising behaviour. Since the agent receives no feedback on how well it is performing before reaching the goal, it may fail to find the optimal policy. Handling sparse
rewards has been an active topic in the field of reinforcement
learning [1]–[5]. However, it remains an open question how an RL agent can be successfully trained in a sparse reward
setting using an approach that is applicable to a variety of
domains.
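To make the difference concrete, the following minimal sketch contrasts a hand-shaped navigation reward with the purely sparse reward discussed above; the goal radius, weights, and function names are illustrative assumptions, not values from our implementation.

```python
GOAL_RADIUS = 0.3  # [m], illustrative threshold for "goal reached"

def shaped_reward(dist_to_goal, prev_dist_to_goal, collided):
    """Dense (shaped) reward: needs hand-tuned terms the agent may exploit."""
    if collided:
        return -10.0
    if dist_to_goal < GOAL_RADIUS:
        return 100.0
    # progress bonus plus a small step penalty; both weights require tedious tuning
    return 5.0 * (prev_dist_to_goal - dist_to_goal) - 0.01

def sparse_reward(dist_to_goal):
    """Sparse reward: the agent is rewarded only for reaching the goal."""
    return 1.0 if dist_to_goal < GOAL_RADIUS else 0.0
```

The sparse variant removes the tuning burden, but it provides no learning signal until the goal is reached for the first time, which is exactly the difficulty addressed in this paper.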
All authors are with the Humanoid Robots Lab, University of Bonn,
Germany. Murad Dawood and Maren Bennewitz are additionally with
the Lamarr Institute for Machine Learning and Artificial Intelligence,
Germany. This work has partially been funded by the Deutsche Forschungs-
gemeinschaft (DFG, German Research Foundation) under the grant number
BE 4420/2-2 (FOR 2535 Anticipating Human Behavior).
Fig. 1: Handling sparse rewards using our approach. The model
predictive controller (MPC) provides demonstrations to the rein-
forcement learning (RL) agent during training, while exploration
still takes place. The MPC demonstrations in combination with RL
guide the agent to find better policies to reach its goal.
One possibility is to use demonstrations that provide the agent with a course of actions solving the task at hand. While demonstrations have been shown to improve the training process in the case of sparse rewards [6]–[8], and human demonstrations in particular are commonly used in the literature, providing them can be quite costly. In addition, collecting human demonstrations typically requires dedicated hardware or virtual reality setups [6], [9]–[11]. In this work, we therefore propose to use MPC as an experience source for RL in the case of sparse rewards. MPC has recently become very popular in robotics and industry [12]–[17] since it handles constraints on both states and control signals, copes with multiple-input multiple-output as well as nonlinear systems, and its cost function can be constructed in a straightforward way by penalizing the deviation of the current states from the reference states. The aim of our
work is therefore to show that MPC can be used to provide
demonstrations for an RL agent in sparse reward settings.
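For readers unfamiliar with MPC, a generic receding-horizon form of such a cost is sketched below; the quadratic objective, the horizon length N, the weights Q and R, and the differential-drive prediction model are standard textbook choices stated here only as an example, not as the exact formulation of our controller.

```latex
\begin{aligned}
\min_{u_0,\dots,u_{N-1}} \quad
  & \sum_{k=0}^{N-1} \left[ (x_k - x_k^{\mathrm{ref}})^\top Q\,(x_k - x_k^{\mathrm{ref}})
    + u_k^\top R\, u_k \right] \\
\text{s.t.} \quad
  & x_{k+1} = f(x_k, u_k), \qquad x_k \in \mathcal{X}, \quad u_k \in \mathcal{U},
\end{aligned}
```

where, for a differential-drive robot, the state (x, y, θ)ᵀ and the input (v, ω)ᵀ could evolve as f(x, u) = x + Δt (v cos θ, v sin θ, ω)ᵀ, and 𝒳, 𝒰 capture the state and input constraints. Only the first optimized input is applied before the problem is solved again at the next time step.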
The motivation for using RL with MPC demonstrations is as follows: First, MPC is computationally demanding since it solves an optimal control problem at each time step. For highly nonlinear models with numerous states, this may not be feasible in real time in real-world applications [16]. In contrast, querying a trained policy online for actions is far less demanding, even for systems with large state spaces. Second, while MPC can be tuned to achieve satisfactory performance in a certain scenario, its performance will typically not be as satisfactory when the same controller is deployed in another scenario. This becomes obvious in trajectory tracking, where the weight matrices have to be further tuned for different
trajectories [18]. Third, unlike MPC, which requires the full state of the robot's dynamic model, RL agents only need partial observations of the environment, which can be provided by onboard sensors [19].

Fig. 2: Illustration of our approach. Left: Both the model predictive control (MPC) and reinforcement learning (RL) interact with the simulated environment during training. Based on the MPC rate ϵ, the policy selector chooses between MPC demonstrations and RL. The gathered tuples (s, a, r, s') are stored in the replay buffer. Top middle: The robot's kinematics are used for nonlinear MPC. Right: The observation space includes a laser scan (red lines), the closest obstacle (thick red line), and the relative distance and heading to the goal.
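The training loop implied by Fig. 2 can be summarized by the following minimal sketch, assuming an off-policy RL agent with a replay buffer; the interfaces (env, mpc, rl_agent, replay_buffer) and the fixed MPC rate are illustrative assumptions rather than our actual implementation.

```python
import random

def collect_episode(env, rl_agent, mpc, replay_buffer, mpc_rate):
    """Run one training episode, mixing MPC demonstrations with RL actions.

    With probability `mpc_rate` (the rate ϵ in Fig. 2) the policy selector
    queries the MPC for a demonstration action; otherwise the RL policy acts
    and explores on its own. All transitions are stored in the same replay
    buffer, so the off-policy agent learns from both experience sources.
    """
    state = env.reset()
    done = False
    while not done:
        if random.random() < mpc_rate:
            action = mpc.solve(state)        # demonstration from the optimal controller
        else:
            action = rl_agent.act(state)     # RL action, including its own exploration
        next_state, reward, done = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)
        rl_agent.update(replay_buffer)       # standard off-policy gradient step(s)
        state = next_state
```

Varying mpc_rate corresponds to the balance between MPC demonstrations and RL exploration referred to in claim (iii) below.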
We investigate our approach in the field of robot navigation
for the following reasons: 1) For MPC, the kinematic predic-
tion model of mobile robots is straightforward. 2) The state and action spaces of mobile robots are small in comparison to, e.g., humanoids, so that tuning the MPC is not time-consuming. 3) The learned policy can easily be tested in different scenarios and can usually be transferred successfully to a real mobile robot.
To summarize, our main contribution is to demonstrate that
MPC as an experience source improves the training process
of RL agents in sparse reward settings. We showcase our
approach in a mobile robot navigation scenario with static
and dynamic obstacles, both in simulations and on a real
robot. We also perform an ablation study to analyze the effect
of varying the number of MPC demonstrations during the
training. We make the following key claims: (i) MPC guides
the RL agent to learn tasks in a pure sparse reward setting.
(ii) The learned behavior policy leads to higher success rates
than pure RL. (iii) The balance between MPC demonstrations
and RL exploration influences the convergence rate of the
training. (iv) Our approach can successfully be applied to
the task of mobile robot navigation.
II. RELATED WORK
We will first discuss the use of human demonstrations in
the context of RL, followed by non-human demonstrations,
and finally previous approaches combining MPC with RL.
Human Demonstrations in Reinforcement Learning:
Several approaches have been presented that use human
expert demonstrations to boost the training process of RL
agents by showing examples of how to perform a certain task.
The agents subsequently learn faster compared to exploring randomly. For example, [20] used human demonstrations
to boost the training of deep Q-networks and showed great
improvement over different RL approaches in Atari games. A
similar approach [2] used human demonstrations to improve
the training of deep deterministic policy gradient (DDPG) in
robotics tasks. Recently, [6] and [21] proposed combining
supervised learning with RL and providing expert samples
to play a video game and control a self-driving vehicle,
respectively. Additionally, [10] also combined supervised
learning with RL and human demonstrations to perform robot
arm tasks and showed that using demonstrations outperforms
the Hindsight Experience Replay (HER) approach [1].
Non-Human Demonstrations: To overcome the need
for costly human demonstrations, several approaches for
providing non-human demonstrations have been proposed.
[3] applied a hand-crafted policy with a low success rate to improve the training of an unmanned aerial vehicle (UAV) in a sparse reward setting. As stated in that work, such hand-crafted policies cannot be applied to diverse scenarios since they are only able to perform fixed maneuvers. [22] used a partially trained RL agent with shaped rewards to provide demonstrations for another RL agent in sparse reward settings, both in MuJoCo [23] simulations and for mobile robot navigation. [24], [25] proposed using proportional controllers to provide demonstrations for RL agents for mobile robot navigation and robotic arm manipulation, respectively.
[5] used demonstrations generated by a global planner to
train a network using imitation learning along with RL for
mobile robot navigation. Unlike the discussed approaches,
we use a model predictive controller as an experience source
since MPC can be applied to a variety of applications and
does not involve reward shaping.
Combining MPC with RL: Several approaches using
MPC and differential dynamic programming (DDP) along
with RL have been presented. In [19] and [26], the authors implemented the guided policy search (GPS) approach, transforming the RL problem into a supervised learning problem using demonstrations from MPC and DDP to train a UAV and agents in MuJoCo environments, respectively. [27] used MPC as an experience source and trained their network using supervised learning for the navigation of a simulated car model. However, their approach keeps the MPC running as a fail-safe policy in case the RL agent fails to find a better action than the MPC. In [28], the authors proposed to apply meta reinforcement learning along with MPC demonstrations to train a mobile robot to navigate through randomly