A Novel Entropy-Maximizing TD3-based Reinforcement Learning for
Automatic PID Tuning
Myisha A. Chowdhury1 and Qiugang Lu1,†
Abstract— Proportional-integral-derivative (PID) controllers
have been widely used in the process industry. However, the
satisfactory control performance of a PID controller depends
strongly on the tuning parameters. Conventional PID tuning
methods require extensive knowledge of the system model, which is not always available, especially for complex dynamical systems. In contrast, reinforcement learning-based
PID tuning has gained popularity since it can treat PID tuning
as a black-box problem and deliver the optimal PID parameters
without requiring explicit process models. In this paper, we
present a novel entropy-maximizing twin-delayed deep deter-
ministic policy gradient (EMTD3) method for automated PID tuning. In the proposed method, an entropy-maximizing
stochastic actor is employed at the beginning to encourage exploration of the action space. Then, a deterministic actor is
deployed to focus on local exploitation and discover the optimal
solution. The incorporation of the entropy-maximizing term can
significantly improve the sample efficiency and assist in fast
convergence to the global solution. Our proposed method is
applied to the PID tuning of a second-order system to verify its
effectiveness in improving the sample efficiency and discovering
the optimal PID parameters compared to traditional TD3.
I. INTRODUCTION
The modern process industry is characterized by high complexity due to strong coupling among different units, and thus efficient control is critical for maintaining high-level
operations. Among various process control strategies, PID
control has received widespread attention owing to its simplicity in design (with only three tunable parameters) and its effectiveness in delivering high control performance
for numerous real-world applications [1]. However, these
PID parameters require careful tuning for yielding superior
control performance. One class of naive PID tuning methods
is based on the trial-and-error approach, where different PID
parameters are tested and the one giving the best control
performance is deployed to the system. However, this trial-
and-error approach is time-consuming and may result in sub-
optimal PID parameters [2]. Another category of PID tuning
techniques is rule-based methods, where models of the
process are often required [3]. However, such models may
not always be available, especially for complex processes,
which restricts the applicability of these methods. To address
this issue, data-driven techniques based on, e.g., evolutionary optimization [4], genetic algorithms [5], and neural networks
(NNs) [6], have been developed for PID tuning without
requiring a process model. However, these methods need a large quantity of labeled data for training due to their model-free nature, making them sample inefficient [7].

*This work was supported by Texas Tech University.
1M.A. Chowdhury and Q. Lu are with the Department of Chemical Engineering, Texas Tech University, Lubbock, TX 79405, USA. Email: myisha.chowdhury@ttu.edu; jay.lu@ttu.edu
†Corresponding author: Q. Lu
Recently, reinforcement learning (RL) has seen a surge
in popularity in the control of dynamical systems [8]. RL
is essentially a sequential decision-making process for opti-
mizing a black-box objective function [9]. Unlike supervised
learning, model-free RL can learn the best policy to optimize
the objective function from direct interactions with the en-
vironment and does not require prior knowledge about the
environment. On the other hand, PID tuning can be seen
as a black-box optimization problem where the relation be-
tween PID parameters and the resultant control performance
is unknown. In this respect, RL has great potential to be
an effective technique for solving the PID tuning problem
through the sequential decision-making process [10].
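For illustration only (a minimal sketch under assumed settings, not the formulation developed in this paper), the Python snippet below evaluates a candidate set of PID gains on an assumed unity-gain second-order plant and returns the negative integrated absolute error (IAE) as the scalar an RL agent would seek to maximize; the plant parameters, the IAE-based reward, and all function names are hypothetical.

import numpy as np

# Illustrative sketch only: PID tuning viewed as a black-box objective.
# The plant (a unity-gain second-order system), the IAE-based reward, and
# all names below are assumptions, not this paper's setup.
def closed_loop_iae(gains, wn=1.0, zeta=0.5, dt=0.01, t_final=20.0):
    kp, ki, kd = gains
    y, ydot = 0.0, 0.0            # plant output and its derivative
    integral, prev_err = 0.0, 1.0
    iae = 0.0
    for _ in range(int(t_final / dt)):
        err = 1.0 - y             # unit-step setpoint
        integral += err * dt
        deriv = (err - prev_err) / dt
        u = kp * err + ki * integral + kd * deriv      # PID control law
        # Second-order plant: y'' = -2*zeta*wn*y' - wn^2*y + wn^2*u
        yddot = -2.0 * zeta * wn * ydot - wn**2 * y + wn**2 * u
        ydot += yddot * dt
        y += ydot * dt
        prev_err = err
        iae += abs(err) * dt
    return iae

def reward(gains):
    # The RL agent only observes this scalar; the mapping from gains to
    # closed-loop performance is treated as a black box.
    return -closed_loop_iae(gains)

print(reward(np.array([2.0, 1.0, 0.5])))   # evaluate one candidate gain set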
In light of these observations, research studies have been
reported on using RL-based algorithms for PID tuning [1],
[2], [7], [8], [10]–[13]. For example, Q-learning, a popular
RL algorithm, has been used to design PID controllers that
are adaptive to system operating condition changes [11].
However, the Q-learning algorithm works only on discrete
state-action spaces and cannot handle continuous spaces
[12]. One solution to this problem is to discretize the
states and actions into bins before applying Q-learning
[13]. Nonetheless, such discretizations may lead to errors
due to the finite resolution, and eventually result in poor
control performance [12], [14]. To this end, actor-critic-
based algorithms have been utilized where the policy and
Q-values are approximated by parameterized functions [1],
[2], [9]. In particular, the deep deterministic policy gradient
(DDPG) algorithm proposed by DeepMind [15], which is
sample-efficient and effective in handling continuous state-
action space, has been adapted to the PID tuning problem
[10]. An updated version of DDPG with lower variance,
known as twin delayed DDPG (TD3), has also been applied
to PID tuning for nonlinear systems [12]. Although the
TD3 algorithm has demonstrated improved sample efficiency
compared to other existing methods such as DDPG [16], it often performs poorly in high-dimensional state-action spaces due to its inherent lack of exploration [17], which
limits its efficacy in PID tuning.
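As standard background (not specific to the PID-tuning setting), TD3 mitigates the overestimation bias of DDPG by maintaining two critics and forming the Bellman target from the smaller of the two target-critic estimates, with target-policy smoothing noise:

y = r + \gamma \min_{i=1,2} Q_{\theta_i'}\big(s', \pi_{\phi'}(s') + \epsilon\big), \quad \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0, \tilde{\sigma}), -c, c\big),

where Q_{\theta_1'}, Q_{\theta_2'} are the target critics and \pi_{\phi'} is the target actor; delayed actor updates further reduce the variance of the learning signal.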
In this article, we present a novel algorithm for facilitating PID tuning, termed entropy-maximizing TD3 (EMTD3), to overcome the issue of insufficient exploration associated with the traditional TD3 algorithm. Specifically, the developed EMTD3 algorithm deploys an entropy-based stochastic actor to ensure sufficient exploration at the beginning, followed by a deterministic actor based on the TD3 algorithm to focus on local exploitation and discover the optimal solution.
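As background for the entropy-maximizing ingredient, the maximum-entropy RL formulation (used, e.g., by the soft actor-critic algorithm) augments the expected return with a policy-entropy bonus,

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big],

where \alpha is a temperature parameter that trades off reward against exploration; the exact objective adopted by the EMTD3 stochastic actor may differ from this standard form.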