
Handling Sparse Rewards in Reinforcement Learning
Using Model Predictive Control
Murad Dawood Nils Dengler Jorge de Heuvel Maren Bennewitz
Abstract— Reinforcement learning (RL) has recently achieved
great success in various domains. Yet, the design of the reward
function requires detailed domain expertise and tedious fine-
tuning to ensure that agents are able to learn the desired
behaviour. Using a sparse reward conveniently mitigates these
challenges. However, the sparse reward represents a challenge
of its own, often resulting in unsuccessful training of the
agent. In this paper, we therefore address the sparse reward
problem in RL. Our goal is to find an effective alternative to
reward shaping, without using costly human demonstrations,
that would also be applicable to a wide range of domains.
Hence, we propose to use model predictive control (MPC)
as an experience source for training RL agents in sparse
reward environments. Without the need for reward shaping,
we successfully apply our approach in the field of mobile
robot navigation, both in simulation and in real-world experiments
with a Kobuki Turtlebot 2. We furthermore demonstrate a substantial
improvement over pure RL algorithms in terms of success rate
as well as the number of collisions and timeouts. Our experiments
show that MPC as an experience source improves the agent’s
learning process for a given task in the case of sparse rewards.
I. INTRODUCTION
Reinforcement learning (RL) as well as model predictive
control (MPC) have lately been applied to various fields
and shown impressive results. However, both approaches
still face considerable challenges.
One major challenge in RL is the design of the reward
function. Shaping the reward function to achieve the desired
results requires numerous trials to obtain the expected behaviour
of the trained policy. This is mainly due to the fact that,
during training, the agent exploits any opportunity offered
by the reward function. An obvious solution to this issue
would be to use sparse rewards, i.e., rewarding the agent only
for achieving the goal and giving zero rewards otherwise.
While this approach encourages the agent to complete a
certain task, it makes it more difficult for the agent to identify
promising behaviour. Since the agent receives no feedback on how well
it is performing before reaching the
goal, it may fail to find the optimal policy. Handling sparse
rewards has been an active topic in the field of reinforcement
learning [1]–[5]. However, it still remains an open question
how an RL agent can be successfully trained in a sparse reward
setting using an approach that is applicable to a variety of
domains.
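As a concrete illustration, a sparse goal-reaching reward of the kind described above can be written as

  r(s_t) =
  \begin{cases}
    r_{\text{goal}}, & \text{if } s_t \in \mathcal{G}, \\
    0, & \text{otherwise},
  \end{cases}

where \mathcal{G} denotes the set of goal states; the positive terminal reward r_{\text{goal}} and the zero reward elsewhere are meant as an illustrative example, and the agent receives no shaping terms that could guide it before the goal is reached.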
All authors are with the Humanoid Robots Lab, University of Bonn,
Germany. Murad Dawood and Maren Bennewitz are additionally with
the Lamarr Institute for Machine Learning and Artificial Intelligence,
Germany. This work has partially been funded by the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation) under the grant number
BE 4420/2-2 (FOR 2535 Anticipating Human Behavior).
Fig. 1: Handling sparse rewards using our approach. The model
predictive controller (MPC) provides demonstrations to the
reinforcement learning (RL) agent during training, while exploration
still takes place. The MPC demonstrations in combination with RL
guide the agent to find better policies to reach its goal.
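To make the interplay sketched in Fig. 1 concrete, the following minimal Python sketch shows how MPC-generated demonstration episodes could be mixed into an off-policy agent's replay buffer alongside its own exploratory rollouts; the env, agent, and mpc interfaces as well as the demonstration ratio are illustrative assumptions, not the implementation used in this work.

# Minimal sketch (illustrative interfaces only): MPC demonstration episodes
# are stored in the same replay buffer as the agent's own exploratory rollouts.
import random

def collect_episode(env, act_fn, buffer):
    # Roll out one episode with the given action function and store
    # all transitions (s, a, r, s', done) in the shared buffer.
    state, done = env.reset(), False
    while not done:
        action = act_fn(state)
        next_state, reward, done, _ = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = next_state

def train(env, agent, mpc, buffer, num_episodes, demo_ratio=0.3):
    for _ in range(num_episodes):
        if random.random() < demo_ratio:
            # Demonstration episode: the MPC acts as the experience source.
            collect_episode(env, mpc.solve, buffer)
        else:
            # Exploration still takes place with the agent's own policy.
            collect_episode(env, agent.explore, buffer)
        # Off-policy update from the mixed demonstration/exploration data.
        agent.update(buffer)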
One possibility is to use demonstrations to provide the
agent with a course of actions that solves the task at hand.
While demonstrations have been shown to improve the
training process in case of sparse rewards [6]–[8], and
human demonstrations specifically are commonly used in
the literature, providing these demonstrations can be quite
costly. In addition, recording human demonstrations typically
requires dedicated hardware or virtual reality setups
[6], [9]–[11]. In this work, we therefore
propose to use MPC as an experience source for RL in the
case of sparse rewards. MPC has recently become very popular
in robotics and industry [12]–[17], as it can handle
constraints on both states and control signals, copes with
multiple-input multiple-output as well as nonlinear
systems, and its cost function can be constructed in a
straightforward way, namely by penalizing the deviation of
the predicted states from the reference states. The aim of our
work is therefore to show that MPC can be used to provide
demonstrations for an RL agent in sparse reward settings.
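For illustration, a standard MPC tracking formulation over a horizon of N steps, which is not necessarily the exact controller used later in this work, reads

  \min_{u_0,\dots,u_{N-1}} \sum_{k=0}^{N-1} \left( (x_k - x_k^{\text{ref}})^\top Q \,(x_k - x_k^{\text{ref}}) + u_k^\top R \, u_k \right)
  \quad \text{s.t.} \quad x_{k+1} = f(x_k, u_k), \quad x_k \in \mathcal{X}, \quad u_k \in \mathcal{U},

where f is the (possibly nonlinear) system model, Q and R are weight matrices penalizing state deviation and control effort, and \mathcal{X} and \mathcal{U} encode the constraints on states and control signals.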
The motivation for using RL with MPC demonstrations is
as follows: First, MPC is computationally demanding since it
solves an optimal control problem at each time step. For
highly nonlinear models with numerous states, this may not
be feasible in real time in real-world applications [16].
In contrast, querying a trained policy online for actions is
less demanding, even for systems with large state spaces.
Second, while MPC can be tuned to achieve satisfactory performance
in a certain scenario, the performance will typically not be as
satisfactory when the same controller is deployed in another
scenario. This becomes obvious in trajectory tracking, where
the weight matrices have to be further tuned for different