$\pi_{\text{DAC}}$ is given by
$$u_t^{\pi_{\text{DAC}}} = \sum_{k=1}^{h} M_t^{[k]} w_{t-k}. \qquad (3)$$
Here, $M_t = \big(M_t^{[1]}, \ldots, M_t^{[h]}\big)$ are the feedback gains or the disturbance gains and are the (time-varying) parameters of $\pi_{\text{DAC}}$. We note that $\pi_{\text{DAC}}$ can be dynamic, i.e., its parameters can vary with time. Therefore, the regret defined in Eq. (2) is a notion of dynamic regret. We note
that the policy is implementable with disturbance feedback.
Extension to the case without the disturbance feedback can be
made by using estimates of the disturbances instead. We defer
the treatment without any disturbance feedback to future work.
Our objective here is to optimize the parameters $M_t$ online so that the regret with respect to the best DAC policy in hindsight is sub-linear.
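To make Eq. (3) concrete, the following is a minimal sketch of how a DAC controller forms the control input from the $h$ most recent disturbances; the dimensions, gain values, and disturbance values below are hypothetical placeholders used only for illustration.

import numpy as np

def dac_control(M, w_hist):
    # Compute u_t = sum_{k=1}^{h} M^{[k]} w_{t-k}, as in Eq. (3).
    # M      : list of h gain matrices; M[k-1] plays the role of M_t^{[k]}
    # w_hist : list [w_{t-1}, w_{t-2}, ..., w_{t-h}] of past disturbances
    u = np.zeros(M[0].shape[0])
    for k in range(1, len(M) + 1):
        u += M[k - 1] @ w_hist[k - 1]
    return u

# Hypothetical example: h = 3, disturbance dimension 2, control dimension 1.
rng = np.random.default_rng(0)
M = [0.1 * rng.standard_normal((1, 2)) for _ in range(3)]
w_hist = [rng.standard_normal(2) for _ in range(3)]
print(dac_control(M, w_hist))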
The DAC policy is typically used in online control for
regulating systems with disturbances; see [4]. The important
feature of the DAC policy is that the optimization problem to
find the optimal fixed disturbance gain for a given sequence
of cost functions is a convex problem and is thus amenable to
online optimization and online performance analysis. Another appealing feature of DAC is that, for time-invariant systems, the optimal disturbance action controller for a given sequence of cost functions achieves performance very close to that of the optimal linear state feedback controller; see [4]. Thus, for time-invariant systems, by optimizing the DAC policy online, it is possible to achieve sub-linear regret with respect to the best linear state feedback controller, whose direct computation is a non-convex optimization problem.
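To sketch the reason for this convexity (writing the dynamics of Eq. (1), for illustration, as $x_{t+1} = A_t x_t + B_t u_t + B_{t,w} w_t$), note that with a fixed gain $M$ the DAC input is linear in $M$, so
$$x_{t+1} = A_t x_t + \sum_{k=1}^{h} B_t M^{[k]} w_{t-k} + B_{t,w} w_t,$$
and, by induction, every state reached under the DAC policy is an affine function of $(M^{[1]}, \ldots, M^{[h]})$ with coefficients that depend only on the system matrices and the disturbances. Each cost $c_t(x_t, u_t)$ is therefore a convex function composed with an affine map of $M$, and hence convex in $M$.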
For time-varying dynamical systems, as pointed out in [1,
Theorem 2.1], there exist problem instances where the DAC
class (with disturbance feedback) incurs a much better cost
than other types of classes such as linear state or output
feedback policies and vice versa. Therefore, the DAC class
is not a weaker class to compete against compared to these
standard classes. Moreover, as the impossibility result [1, Theorem 3.1] shows, it is an equally hard class to compete against in terms of regret. In this work, we focus our study
on the regret minimization problem with respect to the DAC
class (with disturbance feedback) and defer the treatment of
other control structures to future work.
Even with disturbance feedback, the challenge of estimating the unknown system parameters does not diminish. This is because of the presence of measurement noise and of the system variations themselves. In the time-invariant case, following an analysis similar to [5], it can be shown that, even with disturbance feedback, only a regret of $T^{2/3}$ can be achieved with the state-of-the-art methods, which is no better than the regret that can be achieved without disturbance feedback (see [5]). The same holds for the time-varying case: it can be shown that the regret [1] achieves for the system in Eq. (1) cannot be improved even with disturbance feedback. Therefore, the conclusions we draw later by comparing the bounds we derive with the regret upper bound of [1] remain valid. We state our other assumptions below.
Assumption 1 (System). (i) The system is stable, i.e., $\|C_{t+k+1} A_{t+k} \cdots A_{t+1} B_t\|_2 \le \kappa_a \kappa_b (1-\gamma)^k$, $\forall k \ge 0$, $\forall t$, where $\kappa_a > 0$, $\kappa_b > 0$, and $0 < \gamma < 1$ are constants. $B_t$ is bounded, i.e., $\|B_t\| \le \kappa_b$. (ii) The disturbance $w_t$ and the noise $e_t$ are bounded. Specifically, $\|w_t\| \le \kappa_w$, where $\kappa_w > 0$ is a constant, and $\|e_t\| \le \kappa_e$, where $\kappa_e > 0$ is a constant.
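For a time-invariant triple $(A, B, C)$, the product in Assumption 1.(i) reduces to $C A^k B$, and the decay condition can be checked numerically. The snippet below is our own illustration with hypothetical matrices and candidate constants; it is not part of the assumption itself.

import numpy as np

# Hypothetical stable system and candidate constants for Assumption 1.(i).
A = np.array([[0.8, 0.1],
              [0.0, 0.7]])
B = np.array([[1.0],
              [0.5]])
C = np.eye(2)
kappa_b = np.linalg.norm(B, 2)   # also serves as the bound on ||B_t||
kappa_a, gamma = 2.0, 0.2        # candidate constants; the spectral radius of A is 0.8 = 1 - gamma

holds = all(
    np.linalg.norm(C @ np.linalg.matrix_power(A, k) @ B, 2)
    <= kappa_a * kappa_b * (1 - gamma) ** k + 1e-12
    for k in range(50)
)
print("decay condition holds for k < 50:", holds)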
Assumption 2 (Cost Functions). (i) The cost function $c_t$ is convex $\forall t$. (ii) $|c_t(x, u) - c_t(x', u')| \le L R \|z - z'\|$ for a given $z^\top := [x^\top, u^\top]$, $(z')^\top := [(x')^\top, (u')^\top]$, where $R := \max\{\|z\|, \|z'\|, 1\}$. (iii) For any $d > 0$, when $\|x\| \le d$ and $\|u\| \le d$, $\|\nabla_x c_t(x, u)\| \le G d$ and $\|\nabla_u c_t(x, u)\| \le G d$.
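As a simple illustration (our own example, not taken from the cited works), the quadratic cost $c_t(x, u) = \|x\|^2 + \|u\|^2 = \|z\|^2$ satisfies Assumption 2: it is convex, and
$$\big|\|z\|^2 - \|z'\|^2\big| = (\|z\| + \|z'\|)\,\big|\|z\| - \|z'\|\big| \le 2 \max\{\|z\|, \|z'\|, 1\}\,\|z - z'\|,$$
so (ii) holds with $L = 2$; moreover, $\|\nabla_x c_t(x, u)\| = 2\|x\| \le 2d$ whenever $\|x\| \le d$ (and similarly for $u$), so (iii) holds with $G = 2$.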
Remark 1 (System Assumptions). Assumption 1.(i) is the equivalent of the stability assumption used for time-invariant systems. Such an assumption is typically used in online control when the system is unknown; see, e.g., [1], [5]. Assumption 1.(ii), that the disturbance and noise are bounded, is necessary, especially in the non-stochastic setting [4], [5]. The assumptions on the cost functions are also standard [4].
Definition 1. (i) $\mathcal{M} := \{M = (M^{[1]}, \ldots, M^{[h]}) : \|M^{[k]}\| \le \kappa_M\}$ (Disturbance Action Policy Class). (ii) $\mathcal{G} = \{G^{[1:h]} : \|G^{[k]}\|_2 \le \kappa_a \kappa_b (1-\gamma)^{k-1}\}$. (iii) Setting (S-1): matched disturbance system with convex cost functions: $B_t = B_{t,w}$, $C = I$, $e_t = 0$. Setting (S-2): general system with linear cost functions: $B_{t,w} = I$, and there exists a coefficient $\alpha_t \in \mathbb{R}^{p+m}$ such that $c_t(y, u) = \alpha_t^\top z$ and $\|\alpha_t\| \le G$.
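The online optimization described next keeps the gains inside the class $\mathcal{M}$ of Definition 1, which requires a projection step onto that set. Below is a minimal sketch of such a projection, assuming, as an illustration (the paper does not fix this choice here), that $\|M^{[k]}\| \le \kappa_M$ is a spectral norm bound, in which case the Euclidean projection clips singular values.

import numpy as np

def project_spectral_ball(M, kappa):
    # Frobenius-norm projection of a matrix onto {X : ||X||_2 <= kappa},
    # obtained by clipping its singular values at kappa.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.minimum(s, kappa)) @ Vt

def project_dac(M_list, kappa_M):
    # Project each component gain M^[k] onto the ball of radius kappa_M.
    return [project_spectral_ball(Mk, kappa_M) for Mk in M_list]

# Hypothetical usage: project three random 2x3 gains onto the ball of radius 1.
rng = np.random.default_rng(0)
M_proj = project_dac([rng.standard_normal((2, 3)) for _ in range(3)], 1.0)
print([np.linalg.norm(Mk, 2) for Mk in M_proj])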
III. ONLINE LEARNING CONTROL ALGORITHM
Typically, online learning control algorithms for time-
invariant dynamical systems explore first for a period of time,
and then exploit, i.e., adapt or optimize the control policy.
While, in the time-invariant case, this strategy results in sub-
linear regret, in the time-varying case, it can be less effective.
For instance, consider the case where the system remains
unchanged for the duration of the exploration phase and
then changes around the instant when the exploration ends.
Clearly, in this case, the estimate obtained at the end of the exploration phase will be far from the system parameters realized after the exploration phase, and the strategy therefore will not result in sub-linear regret.
We propose an online algorithm that continuously learns an estimate of the time-varying system parameters and simultaneously optimizes the control policy online. Our estimation algorithm combines (i) a change point detection algorithm that detects changes in the underlying system and (ii) a regular estimation algorithm. The online algorithm runs an online optimization in parallel with the estimation to optimize the parameters of the control policy, which in our case is a DAC policy; a schematic sketch of this interplay is given below.
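At a high level, the interplay between continual estimation with change point detection and the online update of the DAC gains can be pictured with the following toy sketch. It is our own scalar illustration under many simplifying assumptions (known input gain, directly observed state, quadratic instantaneous proxy cost, residual-threshold change detection), not the algorithm analyzed in this paper.

import numpy as np

rng = np.random.default_rng(1)
T, h, eta, kappa_M = 400, 3, 0.05, 5.0

# Toy scalar system x_{t+1} = a_t x_t + b u_t + w_t whose parameter a_t changes at t = 200.
b = 1.0
a_true = np.where(np.arange(T) < 200, 0.9, 0.3)
w = 0.5 * rng.standard_normal(T)

M = np.zeros(h)        # scalar DAC gains M^[1], ..., M^[h]
x, data = 0.0, []      # data holds (x_t, u_t, x_{t+1}) triples for estimation

for t in range(T):
    w_hist = w[max(t - h, 0):t][::-1]              # w_{t-1}, ..., w_{t-h} (shorter at the start)
    u = float(np.dot(M[:len(w_hist)], w_hist))     # DAC control, Eq. (3)
    x_next = a_true[t] * x + b * u + w[t]
    data.append((x, u, x_next))

    # Regular estimation: least squares for a_t over the current data window.
    X = np.array([d[0] for d in data])
    U = np.array([d[1] for d in data])
    Y = np.array([d[2] for d in data])
    a_hat = np.dot(X, Y - b * U) / (np.dot(X, X) + 1e-8)

    # Change point detection: restart the window if the latest residual is too large.
    if len(data) > 10 and abs(x_next - (a_hat * x + b * u)) > 1.5:
        data = [data[-1]]

    # Online gradient step on the proxy cost (x_{t+1})^2 with respect to M, then project.
    grad = 2.0 * x_next * b * np.concatenate([w_hist, np.zeros(h - len(w_hist))])
    M = np.clip(M - eta * grad, -kappa_M, kappa_M)
    x = x_next

print("final estimate a_hat:", round(float(a_hat), 3), "final DAC gains:", np.round(M, 3))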
Online Optimization: Since the cost functions and the disturbances are unknown a priori, the optimal parameters $M$ of the DAC policy cannot be computed in advance. Rather, the parameters have to be adapted online continuously with
the information gathered along the way to achieve the best
performance. Given the convexity of the cost functions and
the linearity of the system dynamics, we can apply the Online