A Novel Entropy-Maximizing TD3-based Reinforcement Learning for
Automatic PID Tuning
Myisha A. Chowdhury1 and Qiugang Lu1,†
Abstract— Proportional-integral-derivative (PID) controllers
have been widely used in the process industry. However, the
satisfactory control performance of a PID controller depends
strongly on the tuning parameters. Conventional PID tuning
methods require extensive knowledge of the system model, which is not always available, especially for complex dynamical systems. In contrast, reinforcement learning-based
PID tuning has gained popularity since it can treat PID tuning
as a black-box problem and deliver the optimal PID parameters
without requiring explicit process models. In this paper, we
present a novel entropy-maximizing twin-delayed deep deter-
ministic policy gradient (EMTD3) method for automated PID tuning. In the proposed method, an entropy-maximizing
stochastic actor is employed at the beginning to encourage exploration of the action space. Then, a deterministic actor is
deployed to focus on local exploitation and discover the optimal
solution. The incorporation of the entropy-maximizing term can
significantly improve the sample efficiency and assist in fast
convergence to the global solution. Our proposed method is
applied to the PID tuning of a second-order system to verify its
effectiveness in improving the sample efficiency and discovering
the optimal PID parameters compared to traditional TD3.
I. INTRODUCTION
The modern process industry is characterized by high complexity due to strong coupling among different units, and thus efficient control is critical for maintaining high-level
operations. Among various process control strategies, PID
control has received widespread attention owing to its simplicity in design (with only three tunable parameters) and its effectiveness in delivering high control performance
for numerous real-world applications [1]. However, these
PID parameters require careful tuning for yielding superior
control performance. One class of naive PID tuning methods
is based on the trial-and-error approach, where different PID
parameters are tested and the one giving the best control
performance is deployed to the system. However, this trial-
and-error approach is time-consuming and may result in sub-
optimal PID parameters [2]. Another category of PID tuning
techniques is rule-based methods, where models of the
process are often required [3]. However, such models may
not always be available, especially for complex processes,
which restricts the applicability of these methods. To address
this issue, data-driven techniques based on, e.g., evolutionary optimization [4], genetic algorithms [5], and neural networks
(NNs) [6], have been developed for PID tuning without
requiring a process model. However, these methods need a large quantity of labeled data for training due to their model-free nature, making them sample inefficient [7].

*This work was supported by Texas Tech University.
1M.A. Chowdhury and Q. Lu are with the Department of Chemical Engineering, Texas Tech University, Lubbock, TX 79405, USA. Email: myisha.chowdhury@ttu.edu; jay.lu@ttu.edu
†Corresponding author: Q. Lu
Recently, reinforcement learning (RL) has seen a surge
in popularity in the control of dynamical systems [8]. RL
is essentially a sequential decision-making process for opti-
mizing a black-box objective function [9]. Unlike supervised
learning, model-free RL can learn the best policy to optimize
the objective function from direct interactions with the en-
vironment and does not require prior knowledge about the
environment. On the other hand, PID tuning can be seen
as a black-box optimization problem where the relation be-
tween PID parameters and the resultant control performance
is unknown. In this respect, RL has great potential to be
an effective technique for solving the PID tuning problem
through the sequential decision-making process [10].
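For illustration only (a minimal sketch under assumed settings, not the formulation developed in this paper), the Python snippet below evaluates a candidate set of PID gains on an assumed unity-gain second-order plant and returns the negative integrated absolute error (IAE) as the scalar an RL agent would seek to maximize; the plant parameters, the IAE-based reward, and all function names are hypothetical.

import numpy as np

# Illustrative sketch only: PID tuning viewed as a black-box objective.
# The plant (a unity-gain second-order system), the IAE-based reward, and
# all names below are assumptions, not this paper's setup.
def closed_loop_iae(gains, wn=1.0, zeta=0.5, dt=0.01, t_final=20.0):
    kp, ki, kd = gains
    y, ydot = 0.0, 0.0            # plant output and its derivative
    integral, prev_err = 0.0, 1.0
    iae = 0.0
    for _ in range(int(t_final / dt)):
        err = 1.0 - y             # unit-step setpoint
        integral += err * dt
        deriv = (err - prev_err) / dt
        u = kp * err + ki * integral + kd * deriv      # PID control law
        # Second-order plant: y'' = -2*zeta*wn*y' - wn^2*y + wn^2*u
        yddot = -2.0 * zeta * wn * ydot - wn**2 * y + wn**2 * u
        ydot += yddot * dt
        y += ydot * dt
        prev_err = err
        iae += abs(err) * dt
    return iae

def reward(gains):
    # The RL agent only observes this scalar; the mapping from gains to
    # closed-loop performance is treated as a black box.
    return -closed_loop_iae(gains)

print(reward(np.array([2.0, 1.0, 0.5])))   # evaluate one candidate gain set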
In light of these observations, research studies have been
reported on using RL-based algorithms for PID tuning [1],
[2], [7], [8], [10]–[13]. For example, Q-learning, a popular
RL algorithm, has been used to design PID controllers that
are adaptive to system operating condition changes [11].
However, the Q-learning algorithm works only on discrete
state-action spaces and cannot handle continuous spaces
[12]. One solution to this problem is to discretize the
states and actions into bins before applying Q-learning
[13]. Nonetheless, such discretizations may lead to errors
due to the finite resolution, and eventually result in poor
control performance [12], [14]. To this end, actor-critic-
based algorithms have been utilized where the policy and
Q-values are approximated by parameterized functions [1],
[2], [9]. In particular, the deep deterministic policy gradient
(DDPG) algorithm proposed by DeepMind [15], which is
sample-efficient and effective in handling continuous state-
action space, has been adapted to the PID tuning problem
[10]. An updated version of DDPG with lower variance,
known as twin delayed DDPG (TD3), has also been applied
to PID tuning for nonlinear systems [12]. Although the
TD3 algorithm has demonstrated improved sample efficiency
compared to other existing methods such as DDPG [16], it often performs poorly in high-dimensional state-action spaces due to its inherent lack of exploration [17], which
limits its efficacy in PID tuning.
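As standard background (not specific to the PID-tuning setting), TD3 mitigates the overestimation bias of DDPG by maintaining two critics and forming the Bellman target from the smaller of the two target-critic estimates, with target-policy smoothing noise:

y = r + \gamma \min_{i=1,2} Q_{\theta_i'}\big(s', \pi_{\phi'}(s') + \epsilon\big), \quad \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0, \tilde{\sigma}), -c, c\big),

where Q_{\theta_1'}, Q_{\theta_2'} are the target critics and \pi_{\phi'} is the target actor; delayed actor updates further reduce the variance of the learning signal.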
In this article, we present a novel algorithm for facilitating PID tuning, termed entropy-maximizing TD3 (EMTD3), to overcome the issue of insufficient exploration associated with the traditional TD3 algorithm. Specifically, the developed EMTD3 algorithm deploys an entropy-based stochastic actor to ensure sufficient exploration at the beginning, followed by a deterministic actor based on the TD3 algorithm to focus on local exploitation and discover the optimal solution.
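As background for the entropy-maximizing ingredient, the maximum-entropy RL formulation (used, e.g., by the soft actor-critic algorithm) augments the expected return with a policy-entropy bonus,

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big],

where \alpha is a temperature parameter that trades off reward against exploration; the exact objective adopted by the EMTD3 stochastic actor may differ from this standard form.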