Handling Sparse Rewards in Reinforcement Learning
Using Model Predictive Control
Murad Dawood Nils Dengler Jorge de Heuvel Maren Bennewitz
Abstract— Reinforcement learning (RL) has recently achieved great success in various domains. Yet, the design of the reward function requires detailed domain expertise and tedious fine-tuning to ensure that agents are able to learn the desired behaviour. Using a sparse reward conveniently mitigates these challenges. However, a sparse reward poses a challenge of its own, often resulting in unsuccessful training of the agent. In this paper, we therefore address the sparse reward problem in RL. Our goal is to find an effective alternative to reward shaping that does not rely on costly human demonstrations and is applicable to a wide range of domains. Hence, we propose to use model predictive control (MPC) as an experience source for training RL agents in sparse reward environments. Without the need for reward shaping, we successfully apply our approach to mobile robot navigation, both in simulation and in real-world experiments with a Kobuki TurtleBot 2. We furthermore demonstrate a substantial improvement over pure RL algorithms in terms of success rate as well as the number of collisions and timeouts. Our experiments show that MPC as an experience source improves the agent's learning process for a given task in the case of sparse rewards.
I. INTRODUCTION
Reinforcement learning (RL) as well as model predictive control (MPC) have lately been applied to various fields and shown impressive results. However, major challenges still need to be addressed in both approaches.
One major challenge in RL is the design of the reward
function. Shaping the reward function to achieve the desired results requires numerous trials until the trained policy exhibits the expected behaviour. This is mainly because, during training, the agent exploits any opportunity offered by the reward function. An obvious solution to this issue
would be to use sparse rewards, i.e., rewarding the agent only
for achieving the goal and giving zero rewards otherwise.
While this approach encourages the agent to complete a
certain task, it is more difficult for the agent to identify
promising behaviour. Since the agent receives no feedback on how well it is performing before reaching the goal, it may fail to find the optimal policy. Handling sparse
rewards has been an active topic in the field of reinforcement
learning [1]–[5]. However, it remains an open question how an RL agent can be successfully trained in a sparse reward
setting using an approach that is applicable to a variety of
domains.
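To make the difference concrete, the following minimal sketch contrasts a hand-shaped navigation reward with the purely sparse reward discussed above; the goal radius, weights, and function names are illustrative assumptions, not values from our implementation.

```python
GOAL_RADIUS = 0.3  # [m], illustrative threshold for "goal reached"

def shaped_reward(dist_to_goal, prev_dist_to_goal, collided):
    """Dense (shaped) reward: needs hand-tuned terms the agent may exploit."""
    if collided:
        return -10.0
    if dist_to_goal < GOAL_RADIUS:
        return 100.0
    # progress bonus plus a small step penalty; both weights require tedious tuning
    return 5.0 * (prev_dist_to_goal - dist_to_goal) - 0.01

def sparse_reward(dist_to_goal):
    """Sparse reward: the agent is rewarded only for reaching the goal."""
    return 1.0 if dist_to_goal < GOAL_RADIUS else 0.0
```

The sparse variant removes the tuning burden, but it provides no learning signal until the goal is reached for the first time, which is exactly the difficulty addressed in this paper.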
All authors are with the Humanoid Robots Lab, University of Bonn,
Germany. Murad Dawood and Maren Bennewitz are additionally with
the Lamarr Institute for Machine Learning and Artificial Intelligence,
Germany. This work has partially been funded by the Deutsche Forschungs-
gemeinschaft (DFG, German Research Foundation) under the grant number
BE 4420/2-2 (FOR 2535 Anticipating Human Behavior).
Fig. 1: Handling sparse rewards using our approach. The model
predictive controller (MPC) provides demonstrations to the rein-
forcement learning (RL) agent during training, while exploration
still takes place. The MPC demonstrations in combination with RL
guide the agent to find better policies to reach its goal.
One possibility is to use demonstrations that provide the agent with a course of actions solving the task at hand. While demonstrations have been shown to improve the training process in the case of sparse rewards [6]–[8], and human demonstrations in particular are commonly used in the literature, providing them can be quite costly. In addition, collecting human demonstrations typically requires dedicated hardware or virtual reality setups [6], [9]–[11]. In this work, we therefore propose to use MPC as an experience source for RL in the case of sparse rewards. MPC has recently become very popular in robotics and industry [12]–[17] since it handles constraints on both states and control signals, copes with multiple-input multiple-output as well as nonlinear systems, and its cost function can be constructed in a straightforward way by penalizing the deviation of the current states from the reference states. The aim of our
work is therefore to show that MPC can be used to provide
demonstrations for an RL agent in sparse reward settings.
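For readers unfamiliar with MPC, a generic receding-horizon form of such a cost is sketched below; the quadratic objective, the horizon length N, the weights Q and R, and the differential-drive prediction model are standard textbook choices stated here only as an example, not as the exact formulation of our controller.

```latex
\begin{aligned}
\min_{u_0,\dots,u_{N-1}} \quad
  & \sum_{k=0}^{N-1} \left[ (x_k - x_k^{\mathrm{ref}})^\top Q\,(x_k - x_k^{\mathrm{ref}})
    + u_k^\top R\, u_k \right] \\
\text{s.t.} \quad
  & x_{k+1} = f(x_k, u_k), \qquad x_k \in \mathcal{X}, \quad u_k \in \mathcal{U},
\end{aligned}
```

where, for a differential-drive robot, the state (x, y, θ)ᵀ and the input (v, ω)ᵀ could evolve as f(x, u) = x + Δt (v cos θ, v sin θ, ω)ᵀ, and 𝒳, 𝒰 capture the state and input constraints. Only the first optimized input is applied before the problem is solved again at the next time step.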
The motivation for using RL with MPC demonstrations is as follows: First, MPC is computationally demanding since it solves an optimal control problem at each time step. For highly nonlinear models with numerous states, this may not be feasible in real time in real-world applications [16]. In contrast, querying a trained policy online for actions is far less demanding, even for systems with large state spaces. Second, while MPC can be tuned to achieve satisfactory performance in a certain scenario, its performance will typically not be as satisfactory when the same controller is deployed in another scenario. This becomes obvious in trajectory tracking, where the weight matrices have to be further tuned for different
trajectories [18]. Third, unlike MPC, which requires the full state of the robot's dynamic model, RL agents only need partial observations of the environment, which can be provided by onboard sensors [19].

Fig. 2: Illustration of our approach. Left: Both the model predictive control (MPC) and reinforcement learning (RL) interact with the simulated environment during training. Based on the MPC rate ϵ, the policy selector chooses between MPC demonstrations and RL. The gathered tuples (s, a, r, s') are stored in the replay buffer. Top middle: The robot's kinematics are used for nonlinear MPC. Right: The observation space includes a laser scan (red lines), the closest obstacle (thick red line), and the relative distance and heading to the goal.
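The training loop implied by Fig. 2 can be summarized by the following minimal sketch, assuming an off-policy RL agent with a replay buffer; the interfaces (env, mpc, rl_agent, replay_buffer) and the fixed MPC rate are illustrative assumptions rather than our actual implementation.

```python
import random

def collect_episode(env, rl_agent, mpc, replay_buffer, mpc_rate):
    """Run one training episode, mixing MPC demonstrations with RL actions.

    With probability `mpc_rate` (the rate ϵ in Fig. 2) the policy selector
    queries the MPC for a demonstration action; otherwise the RL policy acts
    and explores on its own. All transitions are stored in the same replay
    buffer, so the off-policy agent learns from both experience sources.
    """
    state = env.reset()
    done = False
    while not done:
        if random.random() < mpc_rate:
            action = mpc.solve(state)        # demonstration from the optimal controller
        else:
            action = rl_agent.act(state)     # RL action, including its own exploration
        next_state, reward, done = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)
        rl_agent.update(replay_buffer)       # standard off-policy gradient step(s)
        state = next_state
```

Varying mpc_rate corresponds to the balance between MPC demonstrations and RL exploration referred to in claim (iii) below.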
We investigate our approach in the field of robot navigation
for the following reasons: 1) For MPC, the kinematic predic-
tion model of mobile robots is straightforward. 2) The state and action spaces of mobile robots are small in comparison to, e.g., humanoids, so that tuning the MPC is not time-consuming. 3) The learned policy can easily be tested in different scenarios and can usually be transferred successfully to a real mobile robot.
To summarize, our main contribution is to demonstrate that
MPC as an experience source improves the training process
of RL agents in sparse reward settings. We showcase our
approach in a mobile robot navigation scenario with static
and dynamic obstacles, both in simulations and on a real
robot. We also perform an ablation study to analyze the effect
of varying the number of MPC demonstrations during the
training. We make the following key claims: (i) MPC guides
the RL agent to learn tasks in a pure sparse reward setting.
(ii) The learned behavior policy leads to higher success rates
than pure RL. (iii) The balance between MPC demonstrations
and RL exploration influences the convergence rate of the
training. (iv) Our approach can successfully be applied to
the task of mobile robot navigation.
II. RELATED WORK
We will first discuss the use of human demonstrations in
the context of RL, followed by non-human demonstrations,
and finally previous approaches combining MPC with RL.
Human Demonstrations in Reinforcement Learning:
Several approaches have been presented that use human
expert demonstrations to boost the training process of RL
agents by showing examples of how to perform a certain task.
The agents subsequently learn faster compared to exploring randomly. For example, [20] used human demonstrations
to boost the training of deep Q-networks and showed great
improvement over different RL approaches in Atari games. A
similar approach [2] used human demonstrations to improve
the training of deep deterministic policy gradient (DDPG) in
robotics tasks. Recently, [6] and [21] proposed combining
supervised learning with RL and providing expert samples
to play a video game and control a self-driving vehicle,
respectively. Additionally, [10] also combined supervised
learning with RL and human demonstrations to perform robot
arm tasks and showed that using demonstrations outperforms
the Hindsight Experience Replay (HER) approach [1].
Non-Human Demonstrations: To overcome the need
for costly human demonstrations, several approaches for
providing non-human demonstrations have been proposed.
[3] applied a hand-crafted policy with a low success rate to improve the training of an unmanned aerial vehicle (UAV) in a sparse reward setting. As stated in that work, such hand-crafted policies cannot be applied to diverse scenarios since they are only able to perform fixed maneuvers. [22] used a partially trained RL agent with shaped rewards to provide demonstrations for another RL agent in sparse reward settings, both in MuJoCo [23] simulations and for mobile robot navigation. [24], [25] proposed using proportional controllers to provide demonstrations for RL agents for mobile robot navigation and robotic arm manipulation, respectively.
[5] used demonstrations generated by a global planner to
train a network using imitation learning along with RL for
mobile robot navigation. Unlike the discussed approaches,
we use a model predictive controller as an experience source
since MPC can be applied to a variety of applications and
does not involve reward shaping.
Combining MPC with RL: Several approaches using
MPC and differential dynamic programming (DDP) along
with RL have been presented. In [19] and [26], the authors implemented the guided policy search (GPS) approach, transforming the RL problem into a supervised learning problem using demonstrations from MPC and DDP to train a UAV and agents in MuJoCo environments, respectively. [27] used MPC as an experience source and trained their network using supervised learning for the navigation of a simulated car model. However, their approach keeps the MPC running as a fail-safe policy in case the RL agent fails to find a better action than the MPC. In [28], the authors proposed to apply meta reinforcement learning along with MPC demonstrations to train a mobile robot to navigate through randomly