
Adaptive dynamic programming-based algorithm for infinite-horizon
linear quadratic stochastic optimal control problems
Heng Zhang
School of Control Science and Engineering, Shandong University, Jinan, 250061, China.
E-mail: zhangh2828@163.com
Abstract: This paper investigates an infinite-horizon linear quadratic stochastic (LQS) optimal control problem for a class of continuous-time stochastic systems. By employing the technique of adaptive dynamic programming (ADP), we propose a novel model-free policy iteration (PI) algorithm. Without requiring knowledge of the system coefficient matrices, the proposed PI algorithm iterates using input and system state data collected on a fixed time interval. Finally, a numerical example is presented to demonstrate the feasibility of the proposed algorithm.
Key Words: Linear quadratic stochastic optimal control, Policy iteration, Adaptive dynamic programming
1 INTRODUCTION
The linear quadratic stochastic (LQS) optimal control
problem, initiated by Wonham [15], has been widely applied
in many fields such as engineering. It is well known that the
infinite-horizon continuous-time LQS problem is closely
related to the stochastic algebraic Riccati equation (SARE),
which is difficult to solve due to its nonlinear structure.
With the in-depth study of the LQS optimal control problem,
researchers have developed approximation methods to
obtain the solution of the SARE. For instance, Ni and Fang
[18] proposed a PI algorithm to solve the SARE iteratively.
With the help of positive operators, a Newton's method
was proposed by Damm and Hinrichsen [12] to solve the
SARE. However, these methods require full knowledge
of the system, i.e., all parameters of the system have to be
known beforehand. In practice, the system matrices are difficult
to obtain directly in applications such as engineering and
finance, and the methods mentioned above become invalid
when the system coefficient matrices are unknown. Thus, it
is of great importance to develop a model-free strategy that
solves LQS optimal control problems without using the
information of the system matrices.
For the past decade, adaptive dynamic programming (ADP)
(Werbos [7]) and reinforcement learning (RL) (Sutton and
Barto [9]) theories have been broadly used to solve optimal
control problems with partially model-free or model-free
system dynamics. For developments in the deterministic
case, see, e.g., Shi and Wang [20], Pang et al. [2],
Kiumarsi et al. [1], Vamvoudakis et al. [4], Bian and Jiang
[11], Palanisamy et al. [6], Vrabie et al. [3], Wei et al. [8],
Jiang and Jiang [16], Mukherjee et al. [10] and the references
therein.

The author acknowledges the financial support from the NSFC under
Grant Nos. 11831010, 61925306 and 61821004, and the NSF of Shandong
Province under Grant Nos. ZR2019ZD42 and ZR2020ZD24.
Regarding stochastic optimal control problems, Ge et al.
[19] proposed a model-free methodology to obtain the optimal
policy for a class of mean-field discrete-time stochastic
systems via Q-learning. By the technique of ADP,
Wang et al. [13] solved a class of discrete-time LQS
optimal control problems, and Wang et al. [14] developed a
model-free Q-learning algorithm to obtain the optimal control
for discrete-time LQS problems. By applying RL techniques,
Jiang and Jiang [17] developed an ADP strategy to solve
continuous-time optimal control problems where the systems
are subject to control-dependent noise.
However, to the author's best knowledge, there are no
model-free results for continuous-time LQS optimal control
problems in which both the drift and diffusion terms contain
the control and state variables. The main contribution of this
paper is a model-free algorithm that solves this class of
continuous-time LQS problems.
To be specific, we propose a novel data-driven model-free
PI algorithm that obtains the maximal solution to the SARE
using input and state data collected on some time interval.
A convergence proof of our model-free strategy is also
provided.
The rest of the paper is organized as follows. In Section 2,
the formulation of our problem and some preliminaries are
presented. Section 3 develops our data-driven model-free PI
algorithm. In Section 4, we provide a simulation example
to illustrate the applicability of the proposed algorithm. In
Section 5, some conclusions are presented.
Notation. We denote the collections of non-negative integers, positive integers and real numbers by $\mathbb{Z}$, $\mathbb{Z}^+$ and $\mathbb{R}$, respectively. $\mathbb{R}^{n \times m}$ represents the collection of all $n \times m$ real matrices. $\mathbb{R}^n$ is the $n$-dimensional Euclidean space and $|\cdot|$ denotes its Euclidean norm for vectors or matrices of proper size. The zero matrix (or vector) with appropriate dimensions is denoted by $O$. We use $\mathrm{diag}\{v\}$ to denote a square diagonal matrix whose main diagonal consists of the elements of the vector $v$. The sets of all symmetric matrices, positive definite matrices and positive semidefinite matrices in $\mathbb{R}^{n \times n}$ are denoted by $\mathbb{S}^n$, $\mathbb{S}^n_{++}$ and $\mathbb{S}^n_{+}$, respectively. $w(\cdot)$ is a one-dimensional standard Brownian motion defined on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, \mathbb{P})$ satisfying the usual conditions. Moreover, we use $\otimes$ to denote the Kronecker product, and for any matrix $B \in \mathbb{R}^{m \times n}$, $\mathrm{vec}(B)$ denotes the vectorization map that stacks the columns of $B$ on top of one another into a column vector of proper size, that is, $\mathrm{vec}(B) = [b_1^T, b_2^T, \cdots, b_n^T]^T$, where $b_j \in \mathbb{R}^m$, $j = 1, 2, \cdots, n$, are the columns of $B$. For any $\xi \in \mathbb{R}^n$ and $F \in \mathbb{S}^n$, we define two operators as follows:
$$\mathrm{vecs}: \xi \in \mathbb{R}^n \mapsto \mathrm{vecs}(\xi) \in \mathbb{R}^{\frac{n(n+1)}{2}}, \qquad \mathrm{vech}: F \in \mathbb{S}^n \mapsto \mathrm{vech}(F) \in \mathbb{R}^{\frac{n(n+1)}{2}},$$
where
$$\mathrm{vecs}(\xi) = [\xi_1^2, \xi_1\xi_2, \cdots, \xi_1\xi_n, \xi_2^2, \xi_2\xi_3, \cdots, \xi_{n-1}\xi_n, \xi_n^2]^T,$$
$$\mathrm{vech}(F) = [f_{11}, 2f_{12}, \cdots, 2f_{1n}, f_{22}, 2f_{23}, \cdots, 2f_{n-1,n}, f_{nn}]^T,$$
and $\xi_j$, $j = 1, 2, \cdots, n$, is the $j$th element of $\xi$ and $f_{ji}$, $j, i = 1, 2, \cdots, n$, is the $(j,i)$th element of the matrix $F$. For simplicity, we denote $\mathrm{vecs}(\xi)$ by $\bar{\xi}$ in this paper.
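These two maps satisfy the identity $\xi^T F \xi = \mathrm{vecs}(\xi)^T \mathrm{vech}(F)$ for any $F \in \mathbb{S}^n$, which is what makes them useful for writing quadratic forms as linear functions of the unknown matrix entries in least-squares steps of ADP algorithms. As an illustration only (our minimal Python/numpy sketch, not part of the paper):

```python
import numpy as np

def vec(B):
    """Stack the columns of B on top of one another (column-major order)."""
    return B.reshape(-1, order="F")

def vecs(xi):
    """vecs(xi) = [xi_1^2, xi_1 xi_2, ..., xi_1 xi_n, xi_2^2, ..., xi_n^2]^T."""
    n = len(xi)
    return np.array([xi[j] * xi[i] for j in range(n) for i in range(j, n)])

def vech(F):
    """vech(F) keeps each diagonal entry of the symmetric matrix F once and
    doubles each off-diagonal entry, so xi^T F xi = vecs(xi) @ vech(F)."""
    n = F.shape[0]
    return np.array([(1.0 if i == j else 2.0) * F[j, i]
                     for j in range(n) for i in range(j, n)])
```

A quick numerical check: for a random `xi` and symmetric `F`, `vecs(xi) @ vech(F)` agrees with `xi @ F @ xi`.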
2 PROBLEM FORMULATION
This section presents the formulation of our LQS optimal
control problems.
Consider a continuous-time time-invariant stochastic linear
system as follows
$$dx(s) = [Ax(s) + Bu(s)]\,ds + [Cx(s) + Du(s)]\,dw(s), \quad x(0) = x_0, \tag{1}$$
where $x_0 \in \mathbb{R}^n$ is the initial state. The cost functional is defined as
$$J(u(\cdot)) = \mathbb{E}\int_0^{\infty} \left[x(s)^T Q x(s) + u(s)^T R u(s)\right] ds, \tag{2}$$
where $R > 0$, $Q \ge 0$ and $[A, C\,|\,Q]$ is exactly detectable.
Now we give the definition of mean-square stabilizability.
Definition 1. System (1) is called mean-square stabilizable for any initial state $x_0$ if there exists a matrix $K \in \mathbb{R}^{m \times n}$ such that the solution of
$$dx(s) = (A + BK)x(s)\,ds + (C + DK)x(s)\,dw(s), \quad x(0) = x_0, \tag{3}$$
satisfies $\lim_{s \to \infty} \mathbb{E}[x(s)^T x(s)] = 0$. In this case, the feedback control $u(\cdot) = Kx(\cdot)$ is called stabilizing and the constant matrix $K$ is called a stabilizer of system (1).
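Definition 1 can be tested numerically without simulating (3): applying Itô's formula to $x(s)x(s)^T$ and taking expectations shows that $X(s) = \mathbb{E}[x(s)x(s)^T]$ obeys the linear ODE $\dot{X} = MX + XM^T + NXN^T$ with $M = A + BK$, $N = C + DK$, so mean-square stability holds iff the vectorized generator is Hurwitz. A short sketch of ours (not from the paper):

```python
import numpy as np

def is_stabilizer(K, A, B, C, D):
    """Check mean-square stability of the closed loop (3),
    dx = (A+BK)x ds + (C+DK)x dw. The second moment
    X(s) = E[x(s)x(s)^T] satisfies dX/ds = M X + X M^T + N X N^T,
    so X -> 0 for every x0 iff the vectorized generator
    I kron M + M kron I + N kron N is Hurwitz."""
    M, N = A + B @ K, C + D @ K
    I = np.eye(A.shape[0])
    gen = np.kron(I, M) + np.kron(M, I) + np.kron(N, N)
    return np.max(np.linalg.eigvals(gen).real) < 0
```

This also gives a practical way to certify a candidate initial gain before running the policy iteration of Lemma 1 below.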
Assumption 1. System (1) is mean-square stabilizable.
Under Assumption 1, we define the set of admissible controls as
$$\mathcal{U}_{ad} = \left\{ u(\cdot) \in L^2_{\mathcal{F}}(\mathbb{R}^m) \mid u(\cdot) \text{ is stabilizing} \right\}. \tag{4}$$
Our continuous-time LQS optimal control problem is stated as follows.

Problem (LQS). For any initial state $x_0 \in \mathbb{R}^n$, find an optimal control $u^*(\cdot) \in \mathcal{U}_{ad}$ such that
$$J(u^*(\cdot)) = \inf_{u(\cdot) \in \mathcal{U}_{ad}} J(u(\cdot)). \tag{5}$$
Ni and Fang [18] showed that the optimal control of Problem (LQS) can be obtained by solving the following stochastic algebraic Riccati equation (SARE):
$$PA + A^T P + C^T P C + Q - (PB + C^T P D)(R + D^T P D)^{-1}(B^T P + D^T P C) = 0. \tag{6}$$
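For later reference, the left-hand side of (6) is easy to evaluate for a candidate $P$; a small helper of ours (hypothetical, for checking how close an iterate is to solving the SARE):

```python
import numpy as np

def sare_residual(P, A, B, C, D, Q, R):
    """Left-hand side of SARE (6); the zero matrix at an exact solution."""
    G = R + D.T @ P @ D  # assumed invertible since R > 0 and P >= 0
    return (P @ A + A.T @ P + C.T @ P @ C + Q
            - (P @ B + C.T @ P @ D) @ np.linalg.solve(G, B.T @ P + D.T @ P @ C))
```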
Due to the nonlinear structure of SARE (6), its analytical
solution is difficult to obtain. To the best of our knowledge,
there exist iterative algorithms for computing an approximate
solution of (6), one of which is the PI method developed
in Ni and Fang [18]. We summarize this method in the
following lemma.
Lemma 1. Assume $[A, C\,|\,Q]$ is exactly detectable. For a given stabilizer $K_0$, let $P_i \in \mathbb{S}^n_+$ be the solution of
$$P_i(A + BK_i) + (A + BK_i)^T P_i + Q + (C + DK_i)^T P_i (C + DK_i) + K_i^T R K_i = 0, \tag{7}$$
where $K_i$ is updated by
$$K_{i+1} = -(R + D^T P_i D)^{-1}(B^T P_i + D^T P_i C). \tag{8}$$
Then $P_i$ and $K_i$, $i = 0, 1, 2, 3, \cdots$, can be uniquely determined at each iteration step, and the following conclusions hold:
(i) $K_i$, $i = 0, 1, 2, \cdots$, are stabilizers.
(ii) $\lim_{i \to \infty} P_i = P^*$, $\lim_{i \to \infty} K_i = K^*$, where $P^*$ is a nonnegative definite solution to SARE (6) and $K^* = -(R + D^T P^* D)^{-1}(B^T P^* + D^T P^* C)$ is the corresponding feedback gain.
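When the coefficient matrices are known, the iteration (7)-(8) is straightforward to implement, because (7) is linear in $P_i$ (a generalized Lyapunov equation) and can be solved by Kronecker vectorization. The following is a minimal model-based sketch of ours for reference; the model-free algorithm of Section 3 replaces the exact solve of (7) with least squares over collected input/state data:

```python
import numpy as np

def pi_sare(A, B, C, D, Q, R, K0, max_iters=50, tol=1e-10):
    """Model-based policy iteration (7)-(8) for SARE (6).
    K0 must be a stabilizer of system (1)."""
    n = A.shape[0]
    I_n = np.eye(n)
    K, P_prev = K0, None
    for _ in range(max_iters):
        M, N = A + B @ K, C + D @ K
        S = Q + K.T @ R @ K
        # Policy evaluation (7): P M + M^T P + N^T P N + S = 0 is linear in P.
        # Vectorizing with vec(PM) = (M^T kron I) vec(P),
        # vec(M^T P) = (I kron M^T) vec(P), vec(N^T P N) = (N^T kron N^T) vec(P):
        G = np.kron(M.T, I_n) + np.kron(I_n, M.T) + np.kron(N.T, N.T)
        P = np.linalg.solve(G, -S.reshape(-1, order="F")).reshape(n, n, order="F")
        P = (P + P.T) / 2  # symmetrize against round-off
        # Policy improvement (8):
        K = -np.linalg.solve(R + D.T @ P @ D, B.T @ P + D.T @ P @ C)
        if P_prev is not None and np.max(np.abs(P - P_prev)) < tol:
            break
        P_prev = P
    return P, K
```

Per conclusion (i), each iterate `K` should pass the `is_stabilizer` check from Section 2, and `sare_residual(P, ...)` should tend to the zero matrix as the iteration converges (conclusion (ii)).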