
Coaching with PID Controllers: A Novel Approach
for Merging Control with Reinforcement Learning
Liping Bai
Abstract—We propose a Proportional-Integral-Derivative (PID) controller-based coaching scheme to expedite reinforcement learning (RL). Previous attempts to fuse classical control and RL are variations on imitation learning or human-in-the-loop learning. In those approaches, the training acceleration comes with an implicit cap on what the RL agent can attain, so it is vital to have high-quality controllers. We ask whether it is possible to accelerate RL with even a primitive, hand-tuned PID controller, and we draw inspiration from the relationship between athletes and their coaches. At the top level of the athletic world, a coach's job is not to serve as a template to be imitated, but rather to provide the conditions for the athlete to collect critical experiences. We seek to construct a comparable coaching relationship between a PID controller and an RL agent, in which the controller helps the agent experience the states most pertinent to the task. We conduct experiments in MuJoCo locomotion simulations, but the setup can readily be transferred to real-world settings. We conclude from the data that when the coaching structure between a PID controller and its respective RL agent is set at its Goldilocks spot, the agent's training can be accelerated by up to 37% while yielding uncompromised training results. This is an important proof of concept that controller-based coaching can be a novel and effective paradigm for merging classical control with learning, and it warrants further investigation in this direction. All the code and data can be found at github/BaiLiping/Coaching.
Index Terms—Reinforcement Learning, Control, Learning for
Dynamic Control, L4DC
I. INTRODUCTION
Learning for Dynamic Control is an emerging field of research located at the intersection of classical control and reinforcement learning (RL). Although the RL community routinely generates jaw-dropping results that seem out of reach to the control community [1][2][3], the theories that undergird RL remain as bleak as when it was first introduced [4]. Today, those deficiencies are easily papered over by the advent of Deep Neural Networks (DNNs) and ever-faster computation. For RL to reach its full potential, existing control theories and strategies have to be part of that new combined formulation.
There are three ways in which classical control finds its way into RL. First, theorists who are well versed in optimization techniques and mathematical formalism can provide systematic perspectives on RL and invent the much-needed analytical tools [5][6][7][8][9]. Second, system identification researchers are exploring all possible configurations for combining existing system models with DNNs and their variations [10][11][12][13][14]. Third, proven controllers can provide data on successful control trajectories to be used in imitation learning, inverse reinforcement learning, and human-in-the-loop learning [15][16][17][18][19].

Nanjing University of Posts and Telecommunications, College of Automation & College of Artificial Intelligence, Nanjing, Jiangsu 210000, China. Email: zqpang@njupt.edu.cn
Our approach is an extension of this third way of combining classical control with RL. Previous research [20][21][22] focuses on making an already functioning controller work better. To begin with, those approaches require high-quality controllers, and the improvements brought about by the RL agents are merely icing on the cake. In addition, the controllers inadvertently impose limits on what the RL agents can achieve. If, unfortunately, a bad controller is chosen, the RL training process is hindered rather than expedited. We ask the question: can we speed up RL training with hand-tuned PID controllers, which are primitive but still capture some of our understanding of the system? This inquiry leads us to the relationship between athletes and their coaches.
Professional athletes do not get where they are via trial and error. Their skill sets are forged through painstakingly designed coaching techniques. At the top level, a coach's objective is not to be a template for the athletes to imitate, but rather to facilitate data collection on critical states. Top athletes are not necessarily good coaches, and vice versa.
In our approach, the "coaches" are PID controllers that we deliberately tuned to be barely functioning, as shown in Table I. Yet even with such poor controllers, when the coaching is appropriately structured, training acceleration is still observed in our experiments, as shown in Table II. The critical idea of coaching is for the PID controller to take over when the RL agent deviates from the essential states. Our approach differs from previous research in one significant way: the controller's interventions and the rewards associated with them are hidden from the RL agent; they are not part of the training data. We also refrain from reward engineering and leave everything as it is, other than the coaching intervention. This way, we can be confident that the observed acceleration does not stem from other alterations. The implementation is detailed in the subsequent sections.
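To make the coaching structure concrete, the following is a minimal sketch of such a training loop, reusing the PIDController sketched above and assuming the classic OpenAI Gym reset/step interface. The environment id, the observation index used for the pole angle, the intervention threshold, the controller gains, and the AgentStub placeholder are illustrative assumptions, not the exact settings of our experiments; the point of the sketch is that transitions generated during an intervention never reach the agent's training data.

```python
import gym
import numpy as np


class AgentStub:
    """Placeholder for any off-policy RL agent; it only stands in for the learner in this sketch."""

    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, obs):
        return self.action_space.sample()  # a real agent would query its policy

    def store(self, *transition):
        pass                               # a real agent would append the transition to its replay buffer

    def update(self):
        pass                               # a real agent would take a gradient step


env = gym.make("InvertedPendulum-v2")
agent = AgentStub(env.action_space)
coach = PIDController(kp=15.0, ki=0.0, kd=1.5, setpoint=0.0, dt=0.02)  # illustrative gains
ANGLE_LIMIT = 0.1  # intervention threshold on the pole angle (assumed to be observation index 1)

for episode in range(100):
    obs = env.reset()
    done = False
    while not done:
        if abs(obs[1]) > ANGLE_LIMIT:
            # Coaching intervention: the PID controller steers the system back toward
            # the essential states. This transition and its reward never reach the agent.
            action = np.clip([coach.act(obs[1])], env.action_space.low, env.action_space.high)
            obs, _, done, _ = env.step(action)
        else:
            # Ordinary RL interaction: the agent acts, and the transition is stored for training.
            action = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            agent.store(obs, action, reward, next_obs, done)
            agent.update()
            obs = next_obs
env.close()
```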
Environment          PID Controller Score   RL Agent Score   PID/RL
Inverted Pendulum    240                    1000             24.0%
Double Pendulum      1107                   9319             11.9%
Hopper               581                    989              58.7%
Walker               528                    1005             52.5%
TABLE I: Performance comparison between each PID controller and its respective RL agent. We interfaced with the MuJoCo simulator through OpenAI Gym, and every simulated environment comes with a predetermined maximum number of episode steps. The scores achieved by the RL agents would likely be higher if not for this cap.