Coaching with PID Controllers: A Novel Approach
for Merging Control with Reinforcement Learning
Liping Bai
Nanjing University of Posts and Telecommunications, College of Automation & College of Artificial Intelligence, Nanjing, Jiangsu 210000, China
email: zqpang@njupt.edu.cn
Abstract—We propose a Proportional Integral Derivative (PID)
controller-based coaching scheme to expedite reinforcement learn-
ing (RL). Previous attempts to fuse classical control and RL are
variations on imitation learning or human-in-the-loop learning.
Also, in their approaches, the training acceleration comes with
an implicit cap on what is attainable by the RL agent; therefore,
it is vital to have high-quality controllers. We ask if it is possible to
accelerate RL with even a primitive hand-tuned PID controller,
and we draw inspiration from the relationship between athletes
and their coaches. At the top level of the athletic world, a coach’s
job is not to function as a template to be imitated, but rather
to provide conditions for the athletes to collect critical experiences.
We seek to construct a coaching relationship between the PID
controller and the RL agent, where the controller helps the agent
experience the most pertinent states for the task. We conduct
experiments in Mujoco locomotion simulations, but the setup can
be readily adapted to real-world settings. We conclude
from the data that when the coaching structure between the PID
controller and its respective RL agent is set at its goldilocks spot,
the agent’s training can be accelerated by up to 37% while
yielding uncompromised training results. This is an
important proof of concept that controller-based coaching can
be a novel and effective paradigm for merging classical control
with learning and warrants further investigation in this direction.
All the code and data can be found at github/BaiLiping/Coaching
Index Terms—Reinforcement Learning, Control, Learning for
Dynamic Control, L4DC
I. INTRODUCTION
Learning for Dynamic Control is an emerging field of
research at the intersection of classical control
and reinforcement learning (RL). Although the RL community
routinely generates jaw-dropping results that seem out of reach
to the control community [1][2][3], the theories that undergird
RL remain as bleak as when it was first introduced [4]. Today, those
deficiencies can be easily papered over by the advent of
Deep Neural Networks (DNN) and ever faster computational
capacities. For RL to reach its full potential, existing control
theories and strategies have to be part of that new combined
formulation.
There are three ways that classical control finds its way
into RL. First, theorists who are well versed in optimiza-
tion techniques and mathematical formalism can provide
systematic perspectives to RL and invent the much needed
analytical tools[5][6][7][8][9]. Second, system identification
researchers are exploring all possible configurations to com-
bine existing system models with DNNs and their varia-
tions [10][11][12][13][14]. Third, proven controllers can provide
data on successful control trajectories to be used in imitation
learning, inverse reinforcement learning, and human-in-the-
loop learning [15][16][17][18][19].
Our approach is an extension of the third way of combining
classical control with RL. Previous research [20][21][22]
focuses on making an already functioning controller work better.
To begin with, these methods require high-quality controllers, and the
improvements brought about by the RL agents are merely
icing on the cake. In addition, the controllers inadvertently
impose limits on what can be achieved by the RL agents. If,
unfortunately, a bad controller is chosen, then the RL training
process would be hindered rather than expedited. We ask the
question: can we speed up RL training with hand-tuned PID
controllers, which are primitive but still capture some of our
understanding of the system? This inquiry leads us to the
relationship between athletes and their coaches.
Professional athletes don’t get where they are via trial-and-
error. Their skillsets are forged through painstakingly designed
coaching techniques. At the top level, a coach’s objective is
not to be a template for the athletes to imitate, but rather
to facilitate data collection on critical states. Top athletes
are not necessarily good coaches and vice versa.
In our approach, the ’coaches’ are PID controllers that
we deliberately tuned to be barely functioning, as shown by
Table I. Yet, even with such bad controllers, when appropriately
structured, training acceleration is still observed in our experi-
ments, as shown by Table II. The critical idea of coaching is for
the PID controllers to take over when the RL agents deviate
from the essential states. Our approach differs from previous
research in one significant way: controllers’ interventions
and the rewards associated with such interventions are hidden
from the RL agents. They are not part of the training data. We
also refrain from reward engineering and leave everything as
it is, other than the coaching intervention. This way, we can
be confident that the observed acceleration does not stem from
other alterations. The implementation is detailed in the
subsequent sections.
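For concreteness, a hand-tuned PID coach of the kind we have in mind can be written in a few lines. The sketch below is illustrative only; the class name and the gains are placeholders, not the controllers actually used in our experiments.

```python
class PIDController:
    """Minimal discrete-time PID controller acting on a scalar error signal."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def act(self, error):
        """Return a control output that drives the error toward zero."""
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Placeholder gains for an inverted-pendulum coach: deliberately rough,
# in the spirit of the barely functioning controllers described above.
pendulum_coach = PIDController(kp=10.0, ki=0.1, kd=1.0, dt=0.02)
```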
Environment          PID Controller   RL Agent   PID/RL
Inverted Pendulum    240              1000       24.0%
Double Pendulum      1107             9319       11.9%
Hopper               581              989        58.7%
Walker               528              1005       52.5%

TABLE I: Performance comparison between each PID controller and its respective RL
agent. We interfaced with the Mujoco simulation through OpenAI Gym, and every simulated
environment comes with predetermined maximum episode steps. The scores achieved
by the RL agents would probably be higher if not for this cap.
Environment         Target Score   Measure      With PID Coaching   Without Coaching   Reduction
Inverted Pendulum   800            Win Streak   100                 160                37.5%
                                   Average      104                 159                34.6%
Double Pendulum     5500           5 Wins       908                 1335               31.9%
                                   Average      935                 1370               29.9%
Hopper              800            5 Wins       2073                2851               27.3%
                                   Average      2155                2911               25.9%
Walker              800            5 Wins       4784                5170               7.5%
                                   Average      5659                7135               20.7%

TABLE II: Comparison between agents trained with and without PID controller coaching.
Even though the PID controllers are less capable than the eventual RL agents,
they are still useful and can accelerate the RL agent training. There are two measures we
used to gauge training acceleration: the first is five consecutive wins, and the second is
the scoring average. A "win" is defined by a predetermined benchmark. The last column
reports the percentage reduction in training needed when coaching is used.
In section II, we present the idea of controller-based coaching.
In section III, we present the results of our experiments and
their detailed implementations. We summarize what we have
learned and lay out directions for further research in section IV.
II. CONTROLLER-BASED COACHING
Reinforcement Learning is the process of cycling between
interaction with an environment and refinement of the under-
standing of that environment. RL agents methodically extract
information from experiences, gradually bounding system
models, policy distributions, or cost-to-go approximations
to maximize the expected rewards along a trajectory, as
shown by Figure 1, which is an adaptation of Benjamin Recht's
presentation [23].
Fig. 1: From Optimization to Learning. Model-based or model-free learning refers
to whether or not learning is used to approximate the system dynamics function. If
there is an explicit action policy, it is called on-policy learning. Otherwise, the optimal
action would be implicitly captured by the Q value function, and that would be called
off-policy learning instead. Importance sampling allows a "limited off-policy learning"
capacity, which enables data reuse within a trust region. Online learning means interleaving
data collection and iterative network parameter updates. Offline learning means the
data is collected in bulk first, and then the network parameters are set with a regression
computation. Batch learning, as the name suggests, is in between online and offline
learning. An agent would first generate data to fill its batch memory and then sample
from the batch memory for iterative parameter updates. New data would be generated
with the updated parameters to replace older data in the memory. This taxonomy is
somewhat outdated now. When Richard Sutton wrote his book, the algorithms he had in
mind fell nicely into various categories. Today, however, popular algorithms
combine more than one route to derive superior performance and cannot be pigeonholed.
A fundamental concept for RL is convergence through
bootstrapping. Instead of asymptotically approaching a known target
function (Figure 2a), bootstrap methods approach an assumed target first
and then update the target assumption based on collected data (Figure 2b).
When evaluation functions are estimated against beliefs rather
than the real values, things could just run around in circles and
never converge. Without any guidance, the RL agent would
have just explored all the possible states, potentially resulting
in this unstable behavior.

Fig. 2: (a) With a known evaluation function. (b) Bootstrap.
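As a concrete reminder of what bootstrapping means here (a textbook formulation, not something specific to this paper), the one-step temporal-difference update moves the value estimate toward a target that itself contains the current estimate:

$$
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]
$$

Because the target $r_{t+1} + \gamma V(s_{t+1})$ is built from the current approximation $V$ rather than a known evaluation function, unguided exploration can end up chasing a moving target and fail to converge.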
One method to produce more efficient exploration and avoid
instability is to give more weight to critical states. Not all
observational states are created equal. Some are vital, while
others have nothing to do with the eventual objective. For
instance, in the inverted pendulum task, any states outside of
the Lyapunov stability bounds should be ignored since they
can’t be properly controlled anyway.
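To make this concrete, a coaching setup only needs a predicate that marks which observations still count as critical. The sketch below is our illustration for the Gym inverted-pendulum observation layout; the thresholds are hypothetical stand-ins for the Lyapunov-style bound discussed above, not values taken from the experiments.

```python
import numpy as np

# Illustrative thresholds (hypothetical), standing in for a recoverability bound.
ANGLE_LIMIT = 0.2     # rad: beyond this the pole is treated as unrecoverable
POSITION_LIMIT = 0.9  # m: keep the cart away from the edges of the track


def is_critical(observation: np.ndarray) -> bool:
    """Return True while the inverted-pendulum state is still worth exploring."""
    cart_position, pole_angle = observation[0], observation[1]
    return abs(pole_angle) < ANGLE_LIMIT and abs(cart_position) < POSITION_LIMIT
```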
There are statistical techniques to distinguish critical states
from the non-essential ones, and imitation learning works
by marking crucial states with demonstrations. However, the
former approach is hard to implement, and the latter one
requires high-quality controllers. Our proposed controller-based
coaching method is easy to implement and does not have
stringent requirements on the controllers it uses.
Controller-based coaching works by adjusting the trajectory
of the RL agent to avoid wasting valuable data collection
cycles on states that merely lead to eventual termination.
When the agent is about to deviate from essential states, the
controller will apply a force to nudge the agent back to where
it should be, much like a human coach course-corrects on
behalf of the athlete. Crucially, the agent is oblivious to this
intervention step, and it is not part of the agent's training
data. Even if the controller fails to adjust the agent back to where
it should be, no harm is done, since the agent is unaware of the
intervention. On the other hand, if the
controller successfully adjusts the trajectory, the RL agent's
next data collection cycle will be spent in a critical state. We
test our approach on four Mujoco locomotion environments as a
proof of concept, and in all four experiments, the hypothesized
acceleration of RL training is observed.
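A minimal sketch of how such a coaching step can be wired in is shown below. This is our illustration of the scheme rather than the repository's code: the class name, the `pid_controller` and `is_critical` arguments, and the step cap are hypothetical, and the older 4-tuple Gym step API is assumed.

```python
import gym


class PIDCoachWrapper(gym.Wrapper):
    """Hypothetical coaching wrapper: when the agent drifts out of the critical
    region, a PID controller steers the environment back over a few hidden
    steps, and those steps and their rewards never reach the RL agent."""

    def __init__(self, env, pid_controller, is_critical, max_coach_steps=5):
        super().__init__(env)
        self.pid = pid_controller        # assumed to map an observation to a corrective action
        self.is_critical = is_critical   # predicate: is this observation still worth exploring?
        self.max_coach_steps = max_coach_steps

    def step(self, action):
        # The agent's own step is recorded normally.
        obs, reward, done, info = self.env.step(action)

        # Coaching intervention: invisible to the agent and absent from its data.
        hidden_steps = 0
        while not done and not self.is_critical(obs) and hidden_steps < self.max_coach_steps:
            coach_action = self.pid.act(obs)
            obs, _, done, info = self.env.step(coach_action)  # hidden reward is discarded
            hidden_steps += 1

        # Only the final observation is returned, so the agent's next decision
        # starts from a critical state whenever the coach succeeds.
        return obs, reward, done, info
```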
III. EXPERIMENT SETUP
The Mujoco physics engine [24] is one of many physics simulation
tools. We interface with it through a Python wrapper provided
by the OpenAI Gym [25] team. We choose four environments
for our experiments: inverted pendulum, double inverted
pendulum, hopper, and walker. Every environment comes with
a set of predetermined rewards and maximum episode steps.
We did not tinker with those parameters. The only change we
made to each environment is a controller-based coach ready
to intervene when the agent steps out of the predetermined
critical states.
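For illustration, the environment handles can be created as follows. The Gym environment IDs reflect the Gym/Mujoco releases of that era and may carry different version suffixes today; the coaching wrapper mentioned in the comment is the hypothetical sketch from Section II.

```python
import gym

# The four locomotion tasks used in our experiments, by their Gym IDs.
ENV_IDS = [
    "InvertedPendulum-v2",
    "InvertedDoublePendulum-v2",
    "Hopper-v3",
    "Walker2d-v3",
]

envs = {env_id: gym.make(env_id) for env_id in ENV_IDS}
# Default rewards and maximum episode steps are left untouched; the only
# modification is wrapping each environment with the controller-based coach,
# e.g. PIDCoachWrapper(env, pid_controller, is_critical).
```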
We use Tensorforce's [26] implementation of RL agents, specifically
the Proximal Policy Optimization (PPO) agent, because the learning
curves generated by the PPO agent are smoother,