Coaching with PID Controllers: A Novel Approach
for Merging Control with Reinforcement Learning
Liping Bai
Nanjing University of Posts and Telecommunications, College of Automation & College of Artificial Intelligence, Nanjing, Jiangsu 210000, China
email: zqpang@njupt.edu.cn
Abstract—We propose a Proportional Integral Derivative (PID)
controller-based coaching scheme to expedite reinforcement learn-
ing (RL). Previous attempts to fuse classical control and RL are
variations on imitation learning or human-in-the-loop learning.
Also, in their approaches, the training acceleration comes with
an implicit cap on what is attainable by the RL agent; therefore,
it is vital to have high-quality controllers. We ask if it is possible to
accelerate RL with even a primitive hand-tuned PID controller,
and we draw inspiration from the relationship between athletes
and their coaches. At the top level of the athletic world, a coach’s
job is not to function as a template to be imitated, but rather
to provide conditions for the athletes to collect critical experiences.
We seek to construct a coaching relationship between the PID
controller and the RL agent, where the controller helps the agent
experience the most pertinent states for the task. We conduct
experiments in Mujoco locomotion simulations, but the setup can
be readily adapted to real-world settings. We conclude
from the data that when the coaching structure between the PID
controller and its respective RL agent is set at its goldilocks spot,
the agent’s training can be accelerated by up to 37% while
yielding uncompromised training results. This is an
important proof of concept that controller-based coaching can
be a novel and effective paradigm for merging classical control
with learning and warrants further investigation in this direction.
All the code and data can be found at github/BaiLiping/Coaching
Index Terms—Reinforcement Learning, Control, Learning for
Dynamic Control, L4DC
I. INTRODUCTION
Learning for Dynamic Control is an emerging field of
research at the intersection of classical control
and reinforcement learning (RL). Although the RL community
routinely generates jaw-dropping results that seem out of reach
to the control community [1][2][3], the theories that undergird
RL remain as bleak as when it was first introduced [4]. Today, those
deficiencies can be easily papered over by the advent of
Deep Neural Networks (DNN) and ever faster computational
capacities. For RL to reach its full potential, existing control
theories and strategies have to be part of that new combined
formulation.
There are three ways that classical control finds its way
into RL. First, theorists who are well versed in optimiza-
tion techniques and mathematical formalism can provide
systematic perspectives to RL and invent the much needed
analytical tools[5][6][7][8][9]. Second, system identification
researchers are exploring all possible configurations to com-
bine existing system models with DNNs and their varia-
tions [10][11][12][13][14]. Third, proven controllers can provide
data on successful control trajectories to be used in imitation
learning, inverse reinforcement learning, and human-in-the-
loop learning [15][16][17][18][19].
Our approach is an extension of the third way of combining
classical control with RL. Previous research [20][21][22]
focuses on making an already functioning controller work better.
To begin with, these methods require high-quality controllers, and the
improvements brought about by the RL agents are merely
icing on the cake. In addition, the controllers inadvertently
impose limits on what can be achieved by the RL agents. If,
unfortunately, a bad controller is chosen, then the RL training
process would be hindered rather than expedited. We ask the
question: can we speed up RL training with hand-tuned PID
controllers, which are primitive but still capture some of our
understanding of the system? This inquiry leads us to the
relationship between athletes and their coaches.
Professional athletes don’t get where they are via trial-and-
error. Their skillsets are forged through painstakingly designed
coaching techniques. At the top level, a coach’s objective is
not to be a template for the athletes to imitate, but rather
to facilitate data collection on critical states. Top athletes
are not necessarily good coaches and vice versa.
In our approach, the ’coaches’ are PID controllers that
we deliberately tuned to be barely functioning, as shown by
Table I. Yet, even with such bad controllers, when appropriately
structured, training acceleration is still observed in our experi-
ments, as shown by Table II. The critical idea of coaching is for
the PID controllers to take over when the RL agents deviate
from the essential states. Our approach differs from previous
research in one significant way: controllers’ interventions
and the rewards associated with such interventions are hidden
from the RL agents. They are not part of the training data. We
also refrain from reward engineering and leave everything as
it is, other than the coaching intervention. This way, we can
be confident that the observed acceleration does not stem from
other alterations. The implementation is detailed in the
subsequent sections.
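For concreteness, a hand-tuned PID coach of the kind we have in mind can be written in a few lines. The sketch below is illustrative only; the class name and the gains are placeholders, not the controllers actually used in our experiments.

```python
class PIDController:
    """Minimal discrete-time PID controller acting on a scalar error signal."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def act(self, error):
        """Return a control output that drives the error toward zero."""
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Placeholder gains for an inverted-pendulum coach: deliberately rough,
# in the spirit of the barely functioning controllers described above.
pendulum_coach = PIDController(kp=10.0, ki=0.1, kd=1.0, dt=0.02)
```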
Environment          PID Controller   RL Agent   PID/RL
Inverted Pendulum    240              1000       24.0%
Double Pendulum      1107             9319       11.9%
Hopper               581              989        58.7%
Walker               528              1005       52.5%

TABLE I: Performance comparison between each PID controller and its respective RL
agent. We interfaced with the Mujoco simulation through OpenAI Gym, and every simulated
environment comes with predetermined maximum episode steps. The scores achieved
by the RL agents would probably be higher if not for this cap.
Environment         Target Score   Measure      With PID Coaching   Without Coaching   Reduction
Inverted Pendulum   800            Win Streak   100                 160                37.5%
                                   Average      104                 159                34.6%
Double Pendulum     5500           5 Wins       908                 1335               31.9%
                                   Average      935                 1370               29.9%
Hopper              800            5 Wins       2073                2851               27.3%
                                   Average      2155                2911               25.9%
Walker              800            5 Wins       4784                5170               7.5%
                                   Average      5659                7135               20.7%

TABLE II: Comparison between agents trained with and without PID controller coaching.
Even though the PID controllers are less capable than the eventual RL agents,
they are still useful and can accelerate the RL agent training. There are two measures we
used to gauge training acceleration: the first is five consecutive wins, and the second is
the scoring average. A "win" is defined by a predetermined benchmark. The last column
reports the percentage reduction in training needed when coaching is used.
In section II, we present the idea of controller-based coaching.
In section III, we present the results of our experiments and
their detailed implementations. We summarize what we have
learned and lay out directions for further research in section IV.
II. CONTROLLER-BASED COACHING
Reinforcement Learning is the process of cycling between
interaction with an environment and refinement of the under-
standing of that environment. RL agents methodically extract
information from experiences, gradually bounding system
models, policy distributions, or cost-to-go approximations
to maximize the expected rewards along a trajectory, as
shown by Figure 1, which is an adaptation of Benjamin Recht's
presentation [23].
Fig. 1: From Optimization to Learning. Model-based or model-free learning refers
to whether or not learning is used to approximate the system dynamics function. If
there is an explicit action policy, it is called on-policy learning. Otherwise, the optimal
action would be implicitly captured by the Q value function, and that would be called
off-policy learning instead. Importance sampling allows a "limited off-policy learning"
capacity, which enables data reuse within a trust region. Online learning means interleaving
data collection and iterative network parameter updates. Offline learning means the
data is collected in bulk first, and then the network parameters are set with a regression
computation. Batch learning, as the name suggests, is in between online and offline
learning. An agent would first generate data to fill its batch memory and then sample
from the batch memory for iterative parameter updates. New data would be generated
with the updated parameters to replace older data in the memory. This taxonomy is
somewhat outdated now. When Richard Sutton wrote his book, the algorithms he had in
mind fell nicely into various categories. Today, however, popular algorithms
combine more than one route to derive superior performance and cannot be pigeonholed.
A fundamental concept for RL is convergence through
bootstrapping. Instead of asymptotically approaching a known target
function (Figure 2a), bootstrap methods approach an assumed target first
and then update the target assumption based on collected data (Figure 2b).
When evaluation functions are estimated against beliefs rather
than the real values, things could just run around in circles and
never converge. Without any guidance, the RL agent would
have just explored all the possible states, potentially resulting
in this unstable behavior.

Fig. 2: (a) With a known evaluation function. (b) Bootstrap.
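As a concrete reminder of what bootstrapping means here (a textbook formulation, not something specific to this paper), the one-step temporal-difference update moves the value estimate toward a target that itself contains the current estimate:

$$
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]
$$

Because the target $r_{t+1} + \gamma V(s_{t+1})$ is built from the current approximation $V$ rather than a known evaluation function, unguided exploration can end up chasing a moving target and fail to converge.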
One method to produce more efficient exploration and avoid
instability is to give more weight to critical states. Not all
observational states are created equal. Some are vital, while
others have nothing to do with the eventual objective. For
instance, in the inverted pendulum task, any states outside of
the Lyapunov stability bounds should be ignored since they
can’t be properly controlled anyway.
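To make this concrete, a coaching setup only needs a predicate that marks which observations still count as critical. The sketch below is our illustration for the Gym inverted-pendulum observation layout; the thresholds are hypothetical stand-ins for the Lyapunov-style bound discussed above, not values taken from the experiments.

```python
import numpy as np

# Illustrative thresholds (hypothetical), standing in for a recoverability bound.
ANGLE_LIMIT = 0.2     # rad: beyond this the pole is treated as unrecoverable
POSITION_LIMIT = 0.9  # m: keep the cart away from the edges of the track


def is_critical(observation: np.ndarray) -> bool:
    """Return True while the inverted-pendulum state is still worth exploring."""
    cart_position, pole_angle = observation[0], observation[1]
    return abs(pole_angle) < ANGLE_LIMIT and abs(cart_position) < POSITION_LIMIT
```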
There are statistical techniques to distinguish critical states
from the non-essential ones, and imitation learning works
by marking crucial states with demonstrations. However, the
former approach is hard to implement, and the latter one
requires high-quality controllers. Our proposed controller-based
coaching method is easy to implement and does not have
stringent requirements on the controllers it uses.
Controller-based coaching works by adjusting the trajectory
of the RL agent to avoid wasting valuable data collection
cycles on states that merely lead to eventual termination.
When the agent is about to deviate from essential states, the
controller will apply a force to nudge the agent back to where
it should be, much like a human coach course-corrects on
behalf of the athlete. Crucially, the agent is oblivious to this
intervention step, and it is not part of the agent's training
data. Even if the controller fails to adjust the agent back to where
it should be, no harm is done, since the agent is unaware of the
intervention. On the other hand, if the
controller successfully adjusts the trajectory, the RL agent's
next data collection cycle will be spent in a critical state. We
test our approach on four Mujoco locomotion environments as a
proof of concept, and in all four experiments, the hypothesized
acceleration of RL training is observed.
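A minimal sketch of how such a coaching step can be wired in is shown below. This is our illustration of the scheme rather than the repository's code: the class name, the `pid_controller` and `is_critical` arguments, and the step cap are hypothetical, and the older 4-tuple Gym step API is assumed.

```python
import gym


class PIDCoachWrapper(gym.Wrapper):
    """Hypothetical coaching wrapper: when the agent drifts out of the critical
    region, a PID controller steers the environment back over a few hidden
    steps, and those steps and their rewards never reach the RL agent."""

    def __init__(self, env, pid_controller, is_critical, max_coach_steps=5):
        super().__init__(env)
        self.pid = pid_controller        # assumed to map an observation to a corrective action
        self.is_critical = is_critical   # predicate: is this observation still worth exploring?
        self.max_coach_steps = max_coach_steps

    def step(self, action):
        # The agent's own step is recorded normally.
        obs, reward, done, info = self.env.step(action)

        # Coaching intervention: invisible to the agent and absent from its data.
        hidden_steps = 0
        while not done and not self.is_critical(obs) and hidden_steps < self.max_coach_steps:
            coach_action = self.pid.act(obs)
            obs, _, done, info = self.env.step(coach_action)  # hidden reward is discarded
            hidden_steps += 1

        # Only the final observation is returned, so the agent's next decision
        # starts from a critical state whenever the coach succeeds.
        return obs, reward, done, info
```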
III. EXPERIMENT SETUP
The Mujoco physics engine [24] is one of many physics simulation
tools. We interface with it through a Python wrapper provided
by the OpenAI Gym [25] team. We choose four environments
for our experiments: inverted pendulum, double inverted
pendulum, hopper, and walker. Every environment comes with
a set of predetermined rewards and maximum episode steps.
We did not tinker with those parameters. The only change we
made to each environment is a controller-based coach ready
to intervene when the agent steps out of the predetermined
critical states.
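For illustration, the environment handles can be created as follows. The Gym environment IDs reflect the Gym/Mujoco releases of that era and may carry different version suffixes today; the coaching wrapper mentioned in the comment is the hypothetical sketch from Section II.

```python
import gym

# The four locomotion tasks used in our experiments, by their Gym IDs.
ENV_IDS = [
    "InvertedPendulum-v2",
    "InvertedDoublePendulum-v2",
    "Hopper-v3",
    "Walker2d-v3",
]

envs = {env_id: gym.make(env_id) for env_id in ENV_IDS}
# Default rewards and maximum episode steps are left untouched; the only
# modification is wrapping each environment with the controller-based coach,
# e.g. PIDCoachWrapper(env, pid_controller, is_critical).
```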
We use Tensorforce's [26] implementation of RL agents, specifically
the Proximal Policy Optimization (PPO) agent, because the learning
curves generated by the PPO agent are smoother,