
Adaptive dynamic programming-based algorithm for infinite-horizon
linear quadratic stochastic optimal control problems
Heng Zhang
School of Control Science and Engineering, Shandong University, Jinan, 250061, China.
E-mail: zhangh2828@163.com
Abstract: This paper investigates an infinite-horizon linear quadratic stochastic (LQS) optimal control problem for a class of continuous-time stochastic systems. By employing the technique of adaptive dynamic programming (ADP), we propose a novel model-free policy iteration (PI) algorithm. Without requiring knowledge of the system coefficient matrices, the proposed PI algorithm iterates using input and system state data collected on a fixed time interval. Finally, a numerical example is presented to demonstrate the feasibility of the proposed algorithm.
Key Words: Linear quadratic stochastic optimal control, Policy iteration, Adaptive dynamic programming
1 INTRODUCTION
The linear quadratic stochastic (LQS) optimal control
problem, initiated by Wonham [15], has been widely applied
in many fields such as engineering. It is well known that the
infinite-horizon continuous-time LQS problem is closely
related to the stochastic algebraic Riccati equation (SARE),
which is difficult to solve due to its nonlinear structure.
With the in-depth study of the LQS optimal control problem,
researchers have developed approximation methods to
obtain the solution of the SARE. For instance, Ni and Fang
[18] proposed a PI algorithm to solve the SARE iteratively.
With the help of positive operators, a Newton's method
was proposed by Damm and Hinrichsen [12] to solve the
SARE. However, these methods require full knowledge
of the system, i.e., all parameters of the system have to be
known beforehand. In practice, the system matrices are difficult
to obtain directly in applications such as engineering and
finance, and the methods mentioned above become invalid
when the system coefficient matrices are unknown. Thus, it
is of great importance to develop a model-free strategy that
solves LQS optimal control problems without using the
information of the system matrices.
For the past decade, adaptive dynamic programming (ADP)
(Werbos [7]) and reinforcement learning (RL) (Sutton and
Barto [9]) theories have been broadly used to solve optimal
control problems with partially model-free or model-free
system dynamics. For developments in the deterministic
case, see, e.g., Shi and Wang [20], Pang et al. [2],
Kiumarsi et al. [1], Vamvoudakis et al. [4], Bian and Jiang
[11], Palanisamy et al. [6], Vrabie et al. [3], Wei et al. [8],
Jiang and Jiang [16], Mukherjee et al. [10] and the references
therein.

The author acknowledges the financial support from the NSFC under
Grant Nos. 11831010, 61925306 and 61821004, and the NSF of Shandong
Province under Grant Nos. ZR2019ZD42 and ZR2020ZD24.
Regarding stochastic optimal control problems, Ge et al.
[19] proposed a model-free methodology to obtain the optimal
policy for a class of mean-field discrete-time stochastic
systems via Q-learning. By the technique of ADP,
Wang et al. [13] solved a class of discrete-time LQS
optimal control problems, and Wang et al. [14] developed a
model-free Q-learning algorithm to obtain the optimal control
for discrete-time LQS problems. By applying RL techniques,
Jiang and Jiang [17] developed an ADP strategy to solve
continuous-time optimal control problems where the systems
are subject to control-dependent noise.
However, to the author's best knowledge, there are no
model-free results for continuous-time LQS optimal control
problems in which both the drift and diffusion terms contain
the control and state variables. The main contribution of this
paper is a model-free algorithm that solves this class of
continuous-time LQS problems.
To be specific, we propose a novel data-driven model-free
PI algorithm that obtains the maximal solution to the SARE
using input and state data collected on some time interval.
A convergence proof of our model-free strategy is also
provided.
The rest of the paper is organized as follows. In Section 2,
the formulation of our problem and some preliminaries are
presented. Section 3 develops our data-driven model-free PI
algorithm. In Section 4, we provide a simulation example
to illustrate the applicability of the proposed algorithm. In
Section 5, some conclusions are presented.
Notation. We denote the collections of non-negative integers, positive integers and real numbers by $\mathbb{Z}$, $\mathbb{Z}^+$ and $\mathbb{R}$, respectively. $\mathbb{R}^{n \times m}$ represents the collection of all $n \times m$ real matrices. $\mathbb{R}^n$ is the $n$-dimensional Euclidean space and $|\cdot|$ denotes its Euclidean norm for vectors or matrices of proper size. The zero matrix (or vector) with appropriate dimensions is denoted by $O$. We use $\mathrm{diag}\{v\}$ to denote a square diagonal matrix whose main diagonal consists of the elements of the vector $v$. The sets of all symmetric matrices, positive definite matrices and positive semidefinite matrices in $\mathbb{R}^{n \times n}$ are denoted by $\mathbb{S}^n$, $\mathbb{S}^n_{++}$ and $\mathbb{S}^n_{+}$, respectively. $w(\cdot)$ is a one-dimensional standard Brownian motion defined on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, \mathbb{P})$ satisfying the usual conditions. Moreover, we use $\otimes$ to denote the Kronecker product, and for any matrix $B \in \mathbb{R}^{m \times n}$, $\mathrm{vec}(B)$ denotes the vectorization map that stacks the columns of $B$ on top of one another into a column vector of proper size, that is, $\mathrm{vec}(B) = [b_1^T, b_2^T, \cdots, b_n^T]^T$, where $b_j \in \mathbb{R}^m$, $j = 1, 2, \cdots, n$, are the columns of $B$. For any $\xi \in \mathbb{R}^n$ and $F \in \mathbb{S}^n$, we define two operators as follows:
$$\mathrm{vecs}: \xi \in \mathbb{R}^n \mapsto \mathrm{vecs}(\xi) \in \mathbb{R}^{\frac{n(n+1)}{2}}, \qquad \mathrm{vech}: F \in \mathbb{S}^n \mapsto \mathrm{vech}(F) \in \mathbb{R}^{\frac{n(n+1)}{2}},$$
where
$$\mathrm{vecs}(\xi) = [\xi_1^2, \xi_1\xi_2, \cdots, \xi_1\xi_n, \xi_2^2, \xi_2\xi_3, \cdots, \xi_{n-1}\xi_n, \xi_n^2]^T,$$
$$\mathrm{vech}(F) = [f_{11}, 2f_{12}, \cdots, 2f_{1n}, f_{22}, 2f_{23}, \cdots, 2f_{n-1,n}, f_{nn}]^T,$$
and $\xi_j$, $j = 1, 2, \cdots, n$, is the $j$th element of $\xi$ and $f_{ji}$, $j, i = 1, 2, \cdots, n$, is the $(j,i)$th element of the matrix $F$. For simplicity, we denote $\mathrm{vecs}(\xi)$ by $\bar{\xi}$ in this paper.
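These two maps satisfy the identity $\xi^T F \xi = \mathrm{vecs}(\xi)^T \mathrm{vech}(F)$ for any $F \in \mathbb{S}^n$, which is what makes them useful for writing quadratic forms as linear functions of the unknown matrix entries in least-squares steps of ADP algorithms. As an illustration only (our minimal Python/numpy sketch, not part of the paper):

```python
import numpy as np

def vec(B):
    """Stack the columns of B on top of one another (column-major order)."""
    return B.reshape(-1, order="F")

def vecs(xi):
    """vecs(xi) = [xi_1^2, xi_1 xi_2, ..., xi_1 xi_n, xi_2^2, ..., xi_n^2]^T."""
    n = len(xi)
    return np.array([xi[j] * xi[i] for j in range(n) for i in range(j, n)])

def vech(F):
    """vech(F) keeps each diagonal entry of the symmetric matrix F once and
    doubles each off-diagonal entry, so xi^T F xi = vecs(xi) @ vech(F)."""
    n = F.shape[0]
    return np.array([(1.0 if i == j else 2.0) * F[j, i]
                     for j in range(n) for i in range(j, n)])
```

A quick numerical check: for a random `xi` and symmetric `F`, `vecs(xi) @ vech(F)` agrees with `xi @ F @ xi`.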
2 PROBLEM FORMULATION
This section presents the formulation of our LQS optimal
control problems.
Consider a continuous-time time-invariant stochastic linear
system as follows
$$dx(s) = [Ax(s) + Bu(s)]\,ds + [Cx(s) + Du(s)]\,dw(s), \quad x(0) = x_0, \tag{1}$$
where $x_0 \in \mathbb{R}^n$ is the initial state. The cost functional is defined as
$$J(u(\cdot)) = \mathbb{E}\int_0^{\infty} \left[x(s)^T Q x(s) + u(s)^T R u(s)\right] ds, \tag{2}$$
where $R > 0$, $Q \ge 0$ and $[A, C\,|\,Q]$ is exactly detectable.
Now we give the definition of mean-square stabilizability.
Definition 1. System (1) is called mean-square stabilizable for any initial state $x_0$ if there exists a matrix $K \in \mathbb{R}^{m \times n}$ such that the solution of
$$dx(s) = (A + BK)x(s)\,ds + (C + DK)x(s)\,dw(s), \quad x(0) = x_0, \tag{3}$$
satisfies $\lim_{s \to \infty} \mathbb{E}[x(s)^T x(s)] = 0$. In this case, the feedback control $u(\cdot) = Kx(\cdot)$ is called stabilizing and the constant matrix $K$ is called a stabilizer of system (1).
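Definition 1 can be tested numerically without simulating (3): applying Itô's formula to $x(s)x(s)^T$ and taking expectations shows that $X(s) = \mathbb{E}[x(s)x(s)^T]$ obeys the linear ODE $\dot{X} = MX + XM^T + NXN^T$ with $M = A + BK$, $N = C + DK$, so mean-square stability holds iff the vectorized generator is Hurwitz. A short sketch of ours (not from the paper):

```python
import numpy as np

def is_stabilizer(K, A, B, C, D):
    """Check mean-square stability of the closed loop (3),
    dx = (A+BK)x ds + (C+DK)x dw. The second moment
    X(s) = E[x(s)x(s)^T] satisfies dX/ds = M X + X M^T + N X N^T,
    so X -> 0 for every x0 iff the vectorized generator
    I kron M + M kron I + N kron N is Hurwitz."""
    M, N = A + B @ K, C + D @ K
    I = np.eye(A.shape[0])
    gen = np.kron(I, M) + np.kron(M, I) + np.kron(N, N)
    return np.max(np.linalg.eigvals(gen).real) < 0
```

This also gives a practical way to certify a candidate initial gain before running the policy iteration of Lemma 1 below.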
Assumption 1. System (1) is mean-square stabilizable.
Under Assumption 1, we define the set of admissible controls as
$$\mathcal{U}_{ad} = \left\{ u(\cdot) \in L^2_{\mathcal{F}}(\mathbb{R}^m) \mid u(\cdot) \text{ is stabilizing} \right\}. \tag{4}$$
Our continuous-time LQS optimal control problem is stated as follows.

Problem (LQS). For any initial state $x_0 \in \mathbb{R}^n$, find an optimal control $u^*(\cdot) \in \mathcal{U}_{ad}$ such that
$$J(u^*(\cdot)) = \inf_{u(\cdot) \in \mathcal{U}_{ad}} J(u(\cdot)). \tag{5}$$
Ni and Fang [18] showed that the optimal control of Problem (LQS) can be obtained by solving the following stochastic algebraic Riccati equation (SARE):
$$PA + A^T P + C^T P C + Q - (PB + C^T P D)(R + D^T P D)^{-1}(B^T P + D^T P C) = 0. \tag{6}$$
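For later reference, the left-hand side of (6) is easy to evaluate for a candidate $P$; a small helper of ours (hypothetical, for checking how close an iterate is to solving the SARE):

```python
import numpy as np

def sare_residual(P, A, B, C, D, Q, R):
    """Left-hand side of SARE (6); the zero matrix at an exact solution."""
    G = R + D.T @ P @ D  # assumed invertible since R > 0 and P >= 0
    return (P @ A + A.T @ P + C.T @ P @ C + Q
            - (P @ B + C.T @ P @ D) @ np.linalg.solve(G, B.T @ P + D.T @ P @ C))
```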
Due to the nonlinear structure of SARE (6), its analytical
solution is difficult to obtain. To the best of our knowledge,
there exist iterative algorithms for computing an approximate
solution of (6), one of which is the PI method developed
in Ni and Fang [18]. We summarize this method in the
following lemma.
Lemma 1. Assume $[A, C\,|\,Q]$ is exactly detectable. For a given stabilizer $K_0$, let $P_i \in \mathbb{S}^n_+$ be the solution of
$$P_i(A + BK_i) + (A + BK_i)^T P_i + Q + (C + DK_i)^T P_i (C + DK_i) + K_i^T R K_i = 0, \tag{7}$$
where $K_i$ is updated by
$$K_{i+1} = -(R + D^T P_i D)^{-1}(B^T P_i + D^T P_i C). \tag{8}$$
Then $P_i$ and $K_i$, $i = 0, 1, 2, 3, \cdots$, can be uniquely determined at each iteration step, and the following conclusions hold:
(i) $K_i$, $i = 0, 1, 2, \cdots$, are stabilizers.
(ii) $\lim_{i \to \infty} P_i = P^*$, $\lim_{i \to \infty} K_i = K^*$, where $P^*$ is a nonnegative definite solution to SARE (6) and $K^* = -(R + D^T P^* D)^{-1}(B^T P^* + D^T P^* C)$ is the corresponding feedback gain.
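When the coefficient matrices are known, the iteration (7)-(8) is straightforward to implement, because (7) is linear in $P_i$ (a generalized Lyapunov equation) and can be solved by Kronecker vectorization. The following is a minimal model-based sketch of ours for reference; the model-free algorithm of Section 3 replaces the exact solve of (7) with least squares over collected input/state data:

```python
import numpy as np

def pi_sare(A, B, C, D, Q, R, K0, max_iters=50, tol=1e-10):
    """Model-based policy iteration (7)-(8) for SARE (6).
    K0 must be a stabilizer of system (1)."""
    n = A.shape[0]
    I_n = np.eye(n)
    K, P_prev = K0, None
    for _ in range(max_iters):
        M, N = A + B @ K, C + D @ K
        S = Q + K.T @ R @ K
        # Policy evaluation (7): P M + M^T P + N^T P N + S = 0 is linear in P.
        # Vectorizing with vec(PM) = (M^T kron I) vec(P),
        # vec(M^T P) = (I kron M^T) vec(P), vec(N^T P N) = (N^T kron N^T) vec(P):
        G = np.kron(M.T, I_n) + np.kron(I_n, M.T) + np.kron(N.T, N.T)
        P = np.linalg.solve(G, -S.reshape(-1, order="F")).reshape(n, n, order="F")
        P = (P + P.T) / 2  # symmetrize against round-off
        # Policy improvement (8):
        K = -np.linalg.solve(R + D.T @ P @ D, B.T @ P + D.T @ P @ C)
        if P_prev is not None and np.max(np.abs(P - P_prev)) < tol:
            break
        P_prev = P
    return P, K
```

Per conclusion (i), each iterate `K` should pass the `is_stabilizer` check from Section 2, and `sare_residual(P, ...)` should tend to the zero matrix as the iteration converges (conclusion (ii)).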