Change Point Detection Approach for Online Control of Unknown Time Varying Dynamical Systems

Deepan Muthirayan, Ruijie Du, Yanning Shen, and Pramod P. Khargonekar
Abstract—We propose a novel change point detection approach
for online learning control with full information feedback (state,
disturbance, and cost feedback) for unknown time-varying dy-
namical systems. We show that our algorithm can achieve a sub-
linear regret with respect to the class of Disturbance Action
Control (DAC) policies, which are a widely studied class of
policies for online control of dynamical systems, for any sub-linear number of changes and a very general class of systems: (i) a matched disturbance system with general convex cost functions, (ii) a general system with linear cost functions. Specifically, a (dynamic) regret of $\Gamma_T^{1/5} T^{4/5}$ can be achieved for this class of systems, where $\Gamma_T$ is the number of changes of the underlying system and $T$ is the duration of the control episode. That is,
the change point detection approach achieves a sub-linear regret
for any sub-linear number of changes, which other previous
algorithms such as in [1] cannot. Numerically, we demonstrate
that the change point detection approach is superior to a standard
restart approach [1] and to standard online learning approaches
for time-invariant dynamical systems. Our work presents the first
regret guarantee for unknown time-varying dynamical systems
in terms of a stronger notion of variability like the number of
changes in the underlying system. The extension of our work to
state and output feedback controllers is a subject of future work.
I. INTRODUCTION
In recent years, there has been significant interest in the
finite-time performance of learning-based control algorithms
for uncertain dynamical systems. Such a control setting is
broadly termed as online control, borrowing the notion from
online learning, where a learner’s performance is assessed
by their ability to learn from a finite number of samples.
The performance in online control is typically measured in
terms of regret, which is the loss of performance using the
proposed algorithm as compared with the best possible policy.
Predominantly, the goal is to design algorithms that adapt to
uncertainties arising from disturbances and adversarial cost
function so that the regret scales sub-linearly in $T$, i.e., as $T^{\alpha}$ with $\alpha < 1$, where $T$ is the duration of the control episode.
Significant progress has been made in online control. For ex-
ample, algorithms have been developed for control of unknown
systems, with adversarial cost functions and disturbances [2]–
[5], algorithms for known systems with some predictability
of future disturbances [6], [7], and for unknown systems with predictability [8].

This work is supported in part by the National Science Foundation under Grants ECCS-1839429 and ECCS-2207457. Deepan Muthirayan, Ruijie Du, Yanning Shen, and Pramod P. Khargonekar are with the Department of Electrical Engineering and Computer Sciences, University of California, Irvine, Irvine, CA (emails: deepan.m@uci.edu, ruijied@uci.edu, yannings@uci.edu, pramod.khargonekar@uci.edu).
Control of uncertain systems is an extensively researched
theme in control theory. Stochastic control, robust control and
adaptive control are large subfields with voluminous literature
that address the analysis and synthesis of control for different
types of uncertainties. In particular, adaptive control comes
closest to “online control” described above. While the primary
focus in adaptive control is on closed-loop stability and asymp-
totic performance, there have been some papers on transient
performance. Adaptive control has been studied for systems of
all types such as linear, non-linear, and stochastic. There are
many variants of adaptive control such as adaptive model pre-
dictive control, adaptive learning control, stochastic adaptive
control, and robust adaptive control. These variations address
the design of adaptive controllers for different variations of the
basic adaptive control setting. Thus, adaptive control is a very
rich and extensively studied topic. The key differences in the
“online control” setting from the classical adaptive control are
(a) the consideration of regret as the measure of performance
and (b) in some cases the more general nature of the costs,
which could be adversarial and/or unknown. Consequently,
the classical adaptive control approaches can be inadequate
to analyze online control problems. From a techniques point
of view, progress in online control is achieved by merging
tools from statistical learning, online optimization, and control
theory.
A typical assumption in online control is that the system
is time-invariant. In many circumstances, however, the un-
derlying system or environment can be time-varying. While
some works have studied time-varying dynamical systems [9],
[10], they have been limited to quadratic cost functions. Very
recently, authors of [1] explored the problem of online control
of unknown time-varying linear dynamical systems for generic
convex cost functions. Their work presents some impossibility
results and a regret guarantee of $\tilde{O}\big(|I|\,\sigma_I + T^{2/3}\big)$ for any interval $I$, where $|I|$ denotes the length of the interval and $\sigma_I$ is the square root of the average squared deviation of the system parameters in the interval $I$. Clearly, in their
case [1], the achievability of sub-linear regret is limited to scenarios where the number of changes of the underlying system is within $o(T^{1/3})$. Motivated by this observation, we investigate whether sub-linear regret is achievable for any number of changes over the duration $T$, and under what system, information, and cost structure assumptions sub-linear guarantees can be achieved.
Contribution: Distinct from most prior works in online control, which study the control of time-invariant dynamical systems, the present paper studies the problem of controlling a time-varying dynamical system over a finite time horizon for
generic convex cost functions. Specifically, a linear dynamical
system with arbitrary disturbances, whose system matrices can be time-varying, is considered. For such systems, we address
the question of how to learn online and optimize when the
system matrices are unknown, in addition to the cost functions
and disturbances being arbitrary and unknown a priori. The
goal is to design algorithms with regret guarantees in terms
of stronger notions of variability (compared to σ), such as the
number of changes. Towards this end, we consider the full
information feedback structure, where in addition to the cost
and state feedback at the end of a time step, the controller also receives the disturbance as feedback. We specifically consider
the regret with respect to the class of Disturbance Action
Control (DAC) policies [1], which are a widely used class
of policies for online control of dynamical systems.
We propose a novel change point detection-based online
control algorithm for unknown time-varying dynamical sys-
tems. We present guarantees for a very general class of systems: (i) a matched disturbance system with general convex cost functions, (ii) a general system with linear cost functions.
We show that, in both these settings, a (dynamic) regret of $\tilde{O}\big(\Gamma_T^{1/5} T^{4/5}\big)$ is achievable with high probability, where $\Gamma_T$ is the number of times the system changes in $T$ time steps and $T$ is the duration of the control episode. Through
numerical simulations, we demonstrate that the change point
detection approach is superior to a standard restart approach, to the adaptive algorithm of [1], and to a standard online learning approach for time-invariant dynamical systems such as [5]. Our result guarantees sub-linear regret for any sub-linear number of changes, an improvement over [1], which cannot guarantee sub-linear regret for an arbitrary number of changes.
Our work presents the first regret guarantee in terms of a
stronger notion of variability like the number of changes in the
underlying system. The extension of our work to the setting
without disturbance feedback is a subject of future work.
Notation: We denote the spectral radius of a matrix $A$ by $\rho(A)$, the discrete time interval from $m_1$ to $m_2$ by $[m_1, m_2]$, and the sequence $(x_{m_1}, x_{m_1+1}, \ldots, x_{m_2})$ compactly by $x_{m_1:m_2}$. Unless otherwise specified, $\|\cdot\|$ is the 2-norm of a vector and the Frobenius norm of a matrix. We use $O(\cdot)$ for the standard order notation, and $\tilde{O}(\cdot)$ denotes the order neglecting the poly-log terms in $T$. We denote the inner product of two vectors $x$ and $y$ by $\langle x, y\rangle$.
II. PROBLEM FORMULATION
We consider the online control of a general linear time-varying dynamical system. Let $t$ denote the time index, $x_t$ the state of the system, $y_t$ the output of the system that is to be controlled, $u_t$ the control input, $w_t$ and $e_t$ the disturbance and measurement noise, and $\theta_t = [A_t, B_t]$ the time-varying system matrices. Then, the equation governing the dynamical system is given by
$$x_{t+1} = A_t x_t + B_t u_t + B_{t,w} w_t, \qquad y_t = C_t x_t + e_t. \tag{1}$$
Let $w_t \in \mathbb{R}^q$, $e_t \in \mathbb{R}^p$, $x_t \in \mathbb{R}^n$, $y_t \in \mathbb{R}^p$, and $u_t \in \mathbb{R}^m$.
We assume that the sequence of system parameters $\theta_{1:T}$ is unknown to the controller. The disturbance $w_t$ could arise from unmodeled dynamics and thus need not be stochastic. For generality, we assume that the disturbances and measurement noise are bounded and arbitrary. We denote the total duration of the control episode by $T$.
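To make the setting concrete, the following is a minimal simulation sketch of the dynamics in Eq. (1); the dimensions, the piecewise-constant system matrices, and the disturbance and noise ranges are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Minimal sketch of simulating the time-varying system in Eq. (1).
rng = np.random.default_rng(0)
n, m, p, q, T = 4, 2, 4, 4, 100

def system_matrices(t):
    # Hypothetical theta_t = [A_t, B_t] that changes once; the controller
    # in the paper does not know this sequence.
    A = 0.9 * np.eye(n) if t < T // 2 else 0.7 * np.eye(n)
    B = 0.1 * np.ones((n, m))
    B_w = np.eye(n, q)
    C = np.eye(p, n)
    return A, B, B_w, C

x = np.zeros(n)
for t in range(T):
    A, B, B_w, C = system_matrices(t)
    u = np.zeros(m)                        # input chosen by the control policy
    w = rng.uniform(-1.0, 1.0, size=q)     # bounded, arbitrary disturbance
    e = rng.uniform(-0.01, 0.01, size=p)   # bounded measurement noise
    y = C @ x + e                          # y_t = C_t x_t + e_t
    x = A @ x + B @ u + B_w @ w            # x_{t+1} = A_t x_t + B_t u_t + B_{t,w} w_t
```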
As in any control problem, at any time $t$ the controller incurs a cost $c_t(y_t, u_t)$, which is a function of the output and the control input. In addition to the system parameters being unknown, the sequence of cost functions $c_{1:T}$ and the disturbances $w_{1:T}$ for the duration $T$ are arbitrary and unknown a priori. We assume that the full cost function $c_t(\cdot,\cdot)$ and the disturbance $w_t$ are revealed to the controller after its action at $t$. Such feedback is typical in online control and optimization and is termed full information feedback. The difference here compared to a standard online control formulation is the feedback of the disturbance $w_t$. Thus, a control policy has the following information by any time $t$: (i) the cost functions and the disturbances up to $t-1$, $c_{1:t-1}$ and $w_{1:t-1}$, (ii) the control inputs up to $t-1$, $u_{1:t-1}$, and (iii) the observations up to $t$, $y_{1:t}$. Let $\Pi_I$ denote the set of policies that satisfy this information setting.
We denote a control policy by $\pi$. The state, output, and control input under the policy are denoted by $x^\pi_t$, $y^\pi_t$, and $u^\pi_t$ respectively. Given that the cost functions and disturbances are only revealed incrementally, one step at a time, the control policy has to be adapted online as and when the controller gathers information in order to achieve the best performance over a period of time. As in a standard online control problem, we characterize the performance of a control policy over a finite time by its regret. We denote the regret of a policy $\pi$ over a duration $T$ with respect to a policy class $\Pi_M \subseteq \Pi_I$ by $R_T(\pi)$:
$$R_T(\pi) = \underbrace{\sum_{t=1}^{T} c_t(y^\pi_t, u^\pi_t)}_{\text{policy cost}} - \underbrace{\min_{\kappa \in \Pi_M} \sum_{t=1}^{T} c_t(y^\kappa_t, u^\kappa_t)}_{\text{comparator cost}}. \tag{2}$$
The primary goal is to design a control policy that minimizes
the regret for the stated control problem. Since the regret
minimization problem is typically hard, a common goal is to design a policy that achieves sub-linear regret, i.e., a regret that scales as $T^{\alpha}$ with $T$, with an $\alpha < 1$ that is as small as possible. Such
a regret scaling implies that the realized costs converge to that
of the best policy from the comparator class asymptotically.
Our objective is to design an adaptive policy that can track
time variations and achieve sub-linear regret. We note that
the regret defined above is static regret. Later, we present the
extension to dynamic regret, which is a notion that is more
suitable for time-varying dynamical systems.
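As a small illustration of Eq. (2), the sketch below computes the learner's cumulative cost minus the best cumulative comparator cost in hindsight; the cost arrays are hypothetical inputs obtained by re-simulating each comparator policy on the same disturbance and cost sequence.

```python
import numpy as np

# Minimal sketch of the regret in Eq. (2): the learner's cumulative cost minus
# the cumulative cost of the best comparator policy in hindsight.
def regret(policy_costs, comparator_costs):
    """policy_costs: length-T array of c_t(y_t^pi, u_t^pi);
    comparator_costs: (num_comparators, T) array of comparator costs."""
    policy_total = np.sum(policy_costs)
    best_comparator_total = np.min(np.sum(comparator_costs, axis=1))
    return policy_total - best_comparator_total
```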
The comparator class we consider is the class of Disturbance
Action Control (DAC) policies (see [1]). A Disturbance Action
Control (DAC) policy is defined as the linear feedback of the
disturbances up to a certain history $h$. Let us denote a DAC policy by $\pi_{DAC}$. Then, the control input $u^{\pi_{DAC}}_t$ under policy $\pi_{DAC}$ is given by
$$u^{\pi_{DAC}}_t = \sum_{k=1}^{h} M^{[k]}_t w_{t-k}. \tag{3}$$
Here, $M_t = \big[M^{[1]}_t, \ldots, M^{[h]}_t\big]$ are the feedback gains, or disturbance gains, and are the (time-varying) parameters of $\pi_{DAC}$. Note that $\pi_{DAC}$ can be dynamic, i.e., its parameters can vary with time. Therefore, the regret defined in Eq. (2) is the notion of dynamic regret. We note that the policy is implementable with disturbance feedback.
Extension to the case without the disturbance feedback can be
made by using estimates of the disturbances instead. We defer
the treatment without any disturbance feedback to future work.
Our objective here is to optimize the parameter $M$ online so that the regret with respect to the best DAC policy in hindsight
is sub-linear.
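For concreteness, a minimal sketch of the DAC control law in Eq. (3) is given below; the gain matrices, the disturbance history, and the dimensions are placeholders for illustration.

```python
import numpy as np

# Minimal sketch of the DAC control law in Eq. (3): the input is a linear
# combination of the last h disturbances through the gains M^{[k]}_t.
def dac_input(M_blocks, w_history):
    """M_blocks: list of h gain matrices of shape (m, q);
    w_history: [w_{t-1}, ..., w_{t-h}], each of shape (q,)."""
    u = np.zeros(M_blocks[0].shape[0])
    for M_k, w_k in zip(M_blocks, w_history):
        u += M_k @ w_k          # M^{[k]}_t w_{t-k}
    return u
```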
The DAC policy is typically used in online control for
regulating systems with disturbances; see [4]. The important
feature of the DAC policy is that the optimization problem to
find the optimal fixed disturbance gain for a given sequence
of cost functions is a convex problem and is thus amenable to
online optimization and online performance analysis. A very
appealing feature of DAC is that, for time-invariant systems,
the optimal disturbance action control for a given sequence of
cost functions is very close in terms of the performance to the
optimal linear feedback controller of the state; see [4]. Thus,
for time-invariant systems, by optimizing the DAC online, it is
possible to achieve a sub-linear regret with respect to the best
linear feedback controller of the state, whose computation is
a non-convex optimization problem.
For time-varying dynamical systems, as pointed out in [1, Theorem 2.1], there exist problem instances where the DAC
class (with disturbance feedback) incurs a much better cost
than other types of classes such as linear state or output
feedback policies and vice versa. Therefore, the DAC class
is not a weaker class to compete against compared to these
standard classes. Moreover, as pointed out by the impossibility result [1, Theorem 3.1], it is an equally hard class to compete against in terms of regret. In this work, we focus our study
on the regret minimization problem with respect to the DAC
class (with disturbance feedback) and defer the treatment of
other control structures to future work.
Even with the disturbance feedback, the challenge of es-
timating the unknown system parameters does not diminish.
This is because of the presence of measurement noise and
the variations themselves. In the time-invariant case, following an analysis similar to [5], it can be shown that, even with disturbance feedback, only a regret of $T^{2/3}$ can be achieved with state-of-the-art methods, which is no better than the regret that can be achieved without disturbance feedback (see [5]). The same holds for the time-varying case. It can be shown that what [1] can achieve for the system in Eq. (1) cannot be improved even with disturbance feedback. Therefore, the
conclusions we draw later on comparing the bounds we derive
and the regret upper bound of [1] are valid. We state our other
assumptions below.
Assumption 1 (System). (i) The system is stable, i.e., $\|C_{t+k+1} A_{t+k} \cdots A_{t+1} B_t\| \le 2\kappa_a \kappa_b (1-\gamma)^k$, $\forall k \ge 0$, $\forall t$, where $\kappa_a > 0$, $\kappa_b > 0$, and $\gamma$ is such that $0 < \gamma < 1$, and where $\kappa_a$, $\kappa_b$, and $\gamma$ are constants. $B_t$ is bounded, i.e., $\|B_t\| \le \kappa_b$. (ii) The disturbance and noise $w_t$ and $e_t$ are bounded. Specifically, $\|w_t\| \le \kappa_w$, where $\kappa_w > 0$ is a constant, and $\|e_t\| \le \kappa_e$, where $\kappa_e > 0$ is a constant.
Assumption 2 (Cost Functions). (i) The cost function $c_t$ is convex $\forall t$. (ii) $\|c_t(x, u) - c_t(x', u')\| \le L R \|z - z'\|$ for a given $z^\top := [x^\top, u^\top]$, $(z')^\top := [(x')^\top, (u')^\top]$, where $R := \max\{\|z\|, \|z'\|, 1\}$. (iii) For any $d > 0$, when $\|x\| \le d$ and $\|u\| \le d$, $\|\nabla_x c(x, u)\| \le G d$ and $\|\nabla_u c(x, u)\| \le G d$.
Remark 1 (System Assumptions). Assumption 1.(i) is the equivalent of the stability assumption used for time-invariant systems. Such an assumption is typically used in online control when the system is unknown; see, e.g., [1], [5]. Assumption 1.(ii), that the noise is bounded, is necessary, especially in the non-stochastic setting [4], [5]. The assumption on cost functions is also standard [4].
Definition 1. (i) $\mathcal{M} := \{M = (M^{[1]}, \ldots, M^{[h]}) : \|M^{[k]}\| \le \kappa_M\}$ (Disturbance Action Policy Class). (ii) $\mathcal{G} = \{G^{[1:h]} : \|G^{[k]}\| \le 2\kappa_a \kappa_b (1-\gamma)^{k-1}\}$. (iii) Setting (S-1): matched disturbance system with convex cost functions: $B_t = B_{t,w}$, $C = I$, $e_t = 0$. Setting (S-2): general system with linear cost functions: $B_{t,w} = I$, and there exists a coefficient $\alpha_t \in \mathbb{R}^{p+m}$ such that $c_t(y, u) = \alpha_t^\top z$, $\|\alpha_t\| \le G$.
III. ONLINE LEARNING CONTROL ALGORITHM
Typically, online learning control algorithms for time-
invariant dynamical systems explore first for a period of time,
and then exploit, i.e., adapt or optimize the control policy.
While this strategy results in sub-linear regret in the time-invariant case, it can be less effective in the time-varying case.
For instance, consider the case where the system remains
unchanged for the duration of the exploration phase and
then changes around the instant when the exploration ends.
Clearly, in this case, the estimate made at the end of the
exploration phase will be very distant from the underlying
system parameters realized after the exploration phase, and the approach will therefore not result in sub-linear regret.
We propose an online algorithm that continuously learns
to compute an estimate of the time-varying system parameters and simultaneously optimizes the control policy
online. Our estimation algorithm combines (i) a change point
detection algorithm to detect the changes in the underlying
system and (ii) a regular estimation algorithm. The online
algorithm runs an online optimization parallel to the estimation
to optimize the parameters of the control policy, which in our
case is a DAC policy.
Online Optimization: Since the cost functions and the
disturbances are unknown a priori, the optimal parameter $M$ of the DAC policy cannot be computed a priori. Rather,
the parameters have to be adapted online continuously with
the information gathered along the way to achieve the best
performance. Given the convexity of the cost functions and
the linearity of the system dynamics, we can apply the Online
Convex Optimization (OCO) framework to optimize the policy
parameters online.
We call a policy that learns the DAC policy parameters online an online DAC policy. We formally denote such a policy by $\pi_{DACO}$. Let the parameters estimated by $\pi_{DACO}$ be denoted by $M_t = \big[M^{[1]}_t, \ldots, M^{[h]}_t\big]$. Given that the parameter $M_t$ is continuously updated, the control input $u^{\pi_{DACO}}_t$ can be computed by
$$u^{\pi_{DACO}}_t = \sum_{k=1}^{h} M^{[k]}_t w_{t-k}. \tag{4}$$
Given that the realized cost is dependent on the past control
inputs, we will have to employ an extension of the OCO
framework called Online Convex Optimization with Memory
(OCO-M) to optimize the parameters of the DAC policy.
For the benefit of the readers, we briefly review the online
convex optimization (OCO) setting (see [11]). OCO is a game
played between a player who is learning to minimize its
overall cost and an adversary who is attempting to maximize
the cost incurred by the player. At any time $t$, the player chooses a decision $M_t$ from some convex subset $\mathcal{M}$ given by $\max_{M \in \mathcal{M}} \|M\| \le \kappa_M$, and the adversary chooses a convex cost function $f_t(\cdot)$. As a result, the player incurs a cost $f_t(M_t)$ for its decision $M_t$. The goal of the player is to minimize the regret over a duration $T$, given by
$$R_T = \sum_{t=1}^{T} f_t(M_t) - \min_{M \in \mathcal{M}} \sum_{t=1}^{T} f_t(M).$$
The challenge is that the player does not know the cost
function that the adversary will pick. Once the adversary picks
a cost function, the player observes the realized cost and
in some cases can also observe the full cost function. The
objective of the learner is to achieve the minimal regret or at
the least a sub-linear regret. We direct the readers to [11]
for a more detailed exposition and the various algorithmic
approaches for this problem.
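For illustration, the following is a minimal online projected gradient descent sketch for the OCO setting just described; the loss functions, the step size, and the feasible-set radius are assumptions made only for this example.

```python
import numpy as np

# Minimal OCO sketch (online projected gradient descent) over a norm ball of
# radius kappa_M; only the play/observe/update pattern is shown.
def project(M, kappa_M):
    norm = np.linalg.norm(M)
    return M if norm <= kappa_M else M * (kappa_M / norm)

def ogd(grad_fns, dim, kappa_M=1.0, eta=0.1):
    """grad_fns: list of callables, grad_fns[t](M) = gradient of f_t at M."""
    M = np.zeros(dim)
    decisions = []
    for grad_t in grad_fns:
        decisions.append(M.copy())                  # play M_t
        M = project(M - eta * grad_t(M), kappa_M)   # observe f_t, update
    return decisions

# Example with quadratic losses f_t(M) = ||M - target_t||^2 / 2.
targets = [np.full(3, 0.5), np.full(3, -0.2), np.full(3, 0.1)]
played = ogd([lambda M, c=c: M - c for c in targets], dim=3)
```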
The difference in the OCO-M setting is that the cost
functions can be dependent on the history of past decisions
up to a certain time. Let the length of the history dependence be denoted by $h$. The regret in the OCO-M problem is then given by
$$R_T = \sum_{t=1}^{T} f_t(M_{t-h:t}) - \min_{M \in \mathcal{M}} \sum_{t=1}^{T} f_t(M).$$
One limitation of the OCO-M framework is that it can only be applied when the length $h$ is fixed or bounded above. In a control setting, though, the cost is typically a function of the state or the output, which depends on the full history of decisions $M_{1:t}$, the length of which grows unbounded with the duration of the control episode. Let
$$G_t = \big[G^{[1]}_t, G^{[2]}_t, \ldots, G^{[h]}_t\big], \qquad \tilde{G}_t = \big[\tilde{G}^{[1]}_t, \tilde{G}^{[2]}_t, \ldots, \tilde{G}^{[t-1]}_t\big],$$
$$\tilde{G}^{[k]}_t = C_t A_{t-1} \cdots A_{t-k+2} A_{t-k+1}, \ \forall k \ge 2, \qquad \tilde{G}^{[1]}_t = C_t,$$
$$G^{[k]}_t = C_t A_{t-1} \cdots A_{t-k+2} A_{t-k+1} B_{t-k}, \ \forall k \ge 2,$$
and $G^{[1]}_t = C_t B_{t-1}$. Thus, the history of dependence
increases with $t$ and is not fixed. In order to apply the OCO-M framework, typically, a truncated output $\tilde{y}_t$ is constructed, whose dependence on the history of control inputs is limited to $h$ time steps:
$$\tilde{y}^{\pi_{DACO}}_t[M_{t:t-h} \,|\, G_t, s_{1:t}] = s_t + \sum_{k=1}^{h} G^{[k]}_t u^{\pi_{DACO}}_{t-k}, \quad \text{where } s_t = y_t - \sum_{k=1}^{t-1} G^{[k]}_t u^{\pi_{DACO}}_{t-k}.$$
Using the truncated output, a truncated cost function $\tilde{c}_t$ is constructed as
$$\tilde{c}_t(M_{t:t-h} \,|\, G_t, s_{1:t}) = c_t\big(\tilde{y}^{\pi_{DACO}}_t[M_{t:t-h} \,|\, G_t, s_{1:t}], \, u^{\pi_{DACO}}_t\big).$$
We denote the function $\tilde{c}_t(M_{t:t-h} \,|\, G_t, s_{1:t})$ succinctly by $\tilde{c}_t(M \,|\, G_t, s_{1:t})$ when each $M_k$ in $M_{t:t-h}$ is equal to $M$. This denotes the (truncated) cost that would have been incurred had the policy parameter been fixed to $M$ at all of the past $h$ time steps.
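The construction above can be summarized with the following sketch of the truncated output and truncated cost; the shapes, the cost function, and the DAC-induced input history are illustrative assumptions and are passed in directly.

```python
import numpy as np

# Minimal sketch of the truncated output and truncated cost used for OCO-M:
# s_t removes the effect of all past inputs from y_t, and the truncated output
# adds back only the last h inputs through the Markov parameters G_t.
def truncated_output(s_t, G_blocks, u_history):
    """G_blocks: [G^{[1]}_t, ..., G^{[h]}_t]; u_history: [u_{t-1}, ..., u_{t-h}]."""
    y_trunc = s_t.copy()
    for G_k, u_k in zip(G_blocks, u_history):
        y_trunc += G_k @ u_k
    return y_trunc

def truncated_cost(cost_fn, s_t, G_blocks, u_history, u_t):
    # c~_t(M_{t:t-h} | G_t, s_{1:t}) = c_t(y~_t, u_t), where u_history is the
    # input sequence induced by the DAC parameters being evaluated.
    return cost_fn(truncated_output(s_t, G_blocks, u_history), u_t)
```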
A standard gradient algorithm for the OCO-M framework updates the decision $M_t$ by the gradient of the function $f_t(M_{t:t-h})$ with all $M_k$ in $M_{t:t-h}$ fixed to $M_t$. Using the same compact notation as above, this gradient is equal to $\nabla f_t(M_t)$. An interpretation of this gradient is that it is the gradient of the cost that would have been incurred had the policy parameter been fixed at $M_t$ for the past $h$ time steps. We employ the same idea to update the policy parameters of the DAC policy online. The online optimization algorithm we propose updates the policy parameter $M_t$ by the gradient of the cost function $\tilde{c}_t(M_t \,|\, G_t, s_{1:t})$ where each $M_k$ in $M_{t:t-h}$ is fixed to $M_t$, i.e., as
$$M_{t+1} = \mathrm{Proj}_{\mathcal{M}}\left(M_t - \eta \, \frac{\partial \tilde{c}_t(M_t \,|\, G_t, s_{1:t})}{\partial M_t}\right), \tag{5}$$
where $\mathcal{M}$ is a convex set of policy parameters.
Definition 2 (Disturbance Action Policy Class). $\mathcal{M} := \{M = (M^{[1]}, \ldots, M^{[h]}) : \|M^{[k]}\| \le \kappa_M\}$.
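A minimal sketch of the projected gradient update in Eq. (5) follows; projecting each block $M^{[k]}$ onto the norm ball of radius $\kappa_M$ matches the set $\mathcal{M}$ of Definition 2, and the gradient of the truncated cost is assumed to be supplied by the caller.

```python
import numpy as np

# Minimal sketch of the update in Eq. (5): a gradient step on the truncated
# cost followed by projection onto M (block-wise norm clipping).
def project_dac(M_blocks, kappa_M):
    projected = []
    for M_k in M_blocks:
        norm = np.linalg.norm(M_k)
        projected.append(M_k if norm <= kappa_M else M_k * (kappa_M / norm))
    return projected

def dac_gradient_step(M_blocks, grad_blocks, eta, kappa_M):
    stepped = [M_k - eta * g_k for M_k, g_k in zip(M_blocks, grad_blocks)]
    return project_dac(stepped, kappa_M)
```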
Optimization for Dynamic Regret: The online optimization procedure described above can only guarantee a sub-linear static regret. To achieve a sub-linear dynamic regret, multiple online optimizers of the form in Eq. (5) are required to be run in parallel, as in [12]. Let us index the parallel learners by $i$ and let the parameters corresponding to learner $i$ be $M_{t,i}$. Just as in [12], the final parameter $M_t$ is computed by $M_t = \sum_{i=1}^{H} p_{t,i} M_{t,i}$, where the $p_{t,i}$ are a set of weights such that $\sum_{i=1}^{H} p_{t,i} = 1$, and the $p_{t,i}$ are also updated online along with the $M_{t,i}$'s. Specifically, $p_{t,i}$ is updated by $p_{t+1,i} \propto p_{t,i} e^{-l_{t,i}(M_{t,i})}$, where $l_{t,i}(M) = \zeta \|M_{t,i} - M_{t-1,i}\| + \langle M_{t,i}, \partial \tilde{c}_t(M_t \,|\, G_t, s_{1:t})\rangle$. The $M_{t,i}$'s are updated by
$$M_{t+1,i} = \mathrm{Proj}_{\mathcal{M}}\left(M_{t,i} - \eta_i \, \frac{\partial \tilde{c}_t(M_t \,|\, G_t, s_{1:t})}{\partial M_t}\right). \tag{6}$$
The complete online optimization algorithm is given in Algo-
rithm 1.
Algorithm 1 Online Learning Control with Full Knowledge (OLC-FK) Algorithm [12, scream.control]
Input: $\zeta$, $H$, step sizes $\eta_i$, parameters $\theta_{1:T}$.
Initialize $M_{1,i} \in \mathcal{M}$ arbitrarily for all $i \in [1, H]$.
Initialize $p_{1,i} \leftarrow 1/(i^2 + i)$ for all $i \in [1, H]$.
for $t = 1, \ldots, T$ do
    Apply $u^{\pi_{DACO}}_t = \sum_{k=1}^{h} M^{[k]}_t w_{t-k}$.
    Observe $c_t$, $w_t$ and incur cost $c_t(y^{\pi_{DACO}}_t, u^{\pi_{DACO}}_t)$.
    Compute $l_{t,i} = \zeta \|M_{t,i} - M_{t-1,i}\| + \langle M_{t,i}, \partial \tilde{c}_t(M_t \,|\, G_t, s_{1:t})\rangle$ for all $i \in [1, H]$.
    Update $p_{t+1,i} \propto p_{t,i} e^{-l_{t,i}}$ for all $i \in [1, H]$.
    Update $M_{t+1,i} = \mathrm{Proj}_{\mathcal{M}}\big(M_{t,i} - \eta_i \, \partial \tilde{c}_t(M_t \,|\, G_t, s_{1:t})/\partial M_t\big)$.
end
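The sketch below illustrates one time step of the expert-aggregation structure of Algorithm 1: a meta-learner re-weights $H$ base learners with exponential weights while each base learner takes its own projected gradient step. The surrogate losses, the gradient, the step sizes, and the radius are placeholders, and the parameters are flattened into single arrays for brevity.

```python
import numpy as np

# Minimal sketch of one step of the expert-aggregation structure in Algorithm 1.
def project(M, kappa_M):
    norm = np.linalg.norm(M)
    return M if norm <= kappa_M else M * (kappa_M / norm)

def olc_fk_step(M_experts, p, grad, surrogate_losses, etas, kappa_M=10.0):
    """M_experts: list of H arrays M_{t,i}; p: length-H weight vector."""
    # Meta update: p_{t+1,i} proportional to p_{t,i} * exp(-l_{t,i})
    p_new = np.asarray(p) * np.exp(-np.asarray(surrogate_losses))
    p_new /= p_new.sum()
    # Base update, Eq. (6): each expert steps along the same gradient of the
    # truncated cost with its own step size, then projects back onto M.
    M_new = [project(M_i - eta_i * grad, kappa_M)
             for M_i, eta_i in zip(M_experts, etas)]
    # Combined parameter M_t = sum_i p_{t,i} M_{t,i}
    M_combined = sum(p_i * M_i for p_i, M_i in zip(p_new, M_new))
    return M_new, p_new, M_combined
```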
Main Result: We state the performance of the algorithm OLC-FK formally below.
Theorem 1 (Full System Knowledge). Suppose the setting is the general setting S-2, and the cost functions are general convex functions. Then, under Algorithm 1 [12, scream.control], with $h = \frac{\log T}{\log(1/(1-\gamma))}$, $H = O(\log(T))$, $\zeta = O(h^2)$, and $\eta_i = O(2^{i-1}/(\zeta T))$, the regret with respect to any DAC policy $M^{\star}_{1:T}$ satisfies
$$R_T \le O\left(\sqrt{T(1 + P_T)}\right), \tag{7}$$
where $P_T$ is the path length of the sequence $M^{\star}_{1:T}$.
The proof follows from a standard proof for online opti-
mization. Please see Appendix VI for the full proof.
A. Disturbance Action Control without System Knowledge
In the previous case, where the system parameters are known, the control policy parameters are optimized online through the truncated cost $\tilde{c}_t(\cdot)$, whose construction explicitly utilizes knowledge of the underlying system parameters $G^{[k]}_t$. In this case, since the underlying system parameters are not available, we construct an estimate of the truncated state and the truncated cost by estimating the underlying system parameters $G^{[k]}_t$'s. With this approach, the control policy has to solve an online estimation problem to compute an estimate of the system parameters. Since the parameters are time-varying, the online estimation has to be run throughout the control episode, along with the policy optimization, unlike other online estimation approaches [5], [8]. Below, we describe in detail
how our algorithm simultaneously performs estimation and
optimizes the control policy.
Online Estimation and Optimization: The Online Learn-
ing Control with Zero Knowledge (OLC-ZK) of the system
parameters has two components: (i) a control policy and
(ii) an online estimator that runs in parallel to the control
policy and throughout the control episode. The control policy
and online optimization algorithm are similar to Algorithm 1, except that the control policy parameters are
updated through an estimate of the truncated cost function. The
online estimation algorithm employs a change point detection procedure to identify changes in the underlying system, and a standard estimation algorithm, restarted after every detected change, to estimate the underlying system. We discuss the details
of our algorithm below.
A. Online Control Policy: We use the same notation for the control policy and the control input, i.e., $\pi_{DACO}$ and $u^{\pi_{DACO}}_t$ respectively. The estimation algorithm constructs an estimate $\hat{G}^{[k]}_t$ of the parameters $G^{[k]}_t$ of the system in Eq. (1) for $k \in [1, h]$. Thus, the estimation algorithm estimates the $G^{[k]}_t$'s only for a truncated time horizon (looking backwards), i.e., for $k \in [1, h]$. We describe the estimation algorithm later.
The policy $\pi_{DACO}$ computes the control input $u^{\pi_{DACO}}_t$ (zero knowledge case) by combining two terms: (i) a disturbance action control term just as in the full knowledge case, and (ii) a perturbation for exploration. In this case, we require an additional perturbation, just as in [2], so as to be able to run the estimation in parallel to the online DAC, the control for regulating the cost. Let $\tilde{u}^{\pi_{DACO}}_t[M_t \,|\, w_{1:t}] = \sum_{k=1}^{h} M^{[k]}_t w_{t-k}$. Therefore, the total control input under $\pi_{DACO}$ is given by
$$u^{\pi_{DACO}}_t = \underbrace{\tilde{u}^{\pi_{DACO}}_t[M_t \,|\, w_{1:t}]}_{\text{DAC}} + \underbrace{\delta u^{\pi_{DACO}}_t}_{\text{perturbation}}. \tag{8}$$
As in [2], we apply a Gaussian random variable as the perturbation, i.e.,
$$\delta u^{\pi_{DACO}}_t \sim \mathcal{N}(0, \sigma^2 I), \tag{9}$$
where $\sigma$ denotes the standard deviation and is a constant to be specified later.
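The following is a minimal sketch of the perturbed input in Eqs. (8)-(9); the dimensions and the value of $\sigma$ are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the perturbed input in Eqs. (8)-(9): the DAC term plus an
# i.i.d. Gaussian exploration perturbation with standard deviation sigma.
rng = np.random.default_rng(0)

def dac_with_exploration(M_blocks, w_history, sigma):
    m = M_blocks[0].shape[0]
    u_dac = sum(M_k @ w_k for M_k, w_k in zip(M_blocks, w_history))  # Eq. (3)
    delta_u = rng.normal(0.0, sigma, size=m)                         # Eq. (9)
    return u_dac + delta_u                                           # Eq. (8)
```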
In this case, the policy parameters are optimized by applying OCO-M to an estimate of the truncated cost. To construct this estimate, we construct an estimate of $s_t$ and of the truncated state $\tilde{x}^{\pi_{DACO}}_t(\cdot)$. Given that $s_t$ is the state response when the control inputs are zero, we estimate $s_t$ by subtracting the contribution of the control inputs from the observed state:
$$\hat{s}_t = \sum_{k=1}^{h} \hat{G}^{[k]}_t w_{t-k} \ \ \text{(S-1)}, \qquad \hat{s}_t = y^{\pi_{DACO}}_t - \sum_{k=1}^{h} \hat{G}^{[k]}_t u^{\pi_{DACO}}_{t-k} \ \ \text{(S-2)}. \tag{10}$$
Then the estimate of the truncated output follows by substituting $\hat{s}_t$ in place of $s_t$ and using the estimate $\hat{G}_t$ in place of $G_t$:
$$\tilde{\tilde{y}}^{\pi_{DACO}}_t[M_{t:t-h} \,|\, \hat{G}_t, \hat{s}_{1:t}] = \hat{s}_t + \sum_{k=1}^{h} \hat{G}^{[k]}_t \tilde{u}^{\pi_{DACO}}_{t-k}. \tag{11}$$
Then, the estimate of the truncated cost is calculated as
$$\tilde{c}_t(M_{t:t-h} \,|\, \hat{G}_t, \hat{s}_{1:t}) = c_t\big(\tilde{\tilde{y}}^{\pi_{DACO}}_t[M_{t:t-h} \,|\, \hat{G}_t, \hat{s}_{1:t}], \, \tilde{u}^{\pi_{DACO}}_t\big).$$
The online update to the policy parameters is just as in Algorithm 1, i.e., by the gradient of the estimate of the truncated cost:
$$M_{t+1,i} = \mathrm{Proj}_{\mathcal{M}}\left(M_{t,i} - \eta_i \, \frac{\partial \tilde{c}_t(M_t \,|\, \hat{G}_t, \hat{s}_{1:t})}{\partial M_t}\right). \tag{12}$$
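A minimal sketch of the zero-knowledge pipeline in Eqs. (10)-(12) for setting S-2 follows; the estimated Markov parameters $\hat{G}^{[k]}_t$, the cost function, and the gradient of the estimated truncated cost are assumed to be supplied by estimation and differentiation machinery not shown here.

```python
import numpy as np

# Minimal sketch of Eqs. (10)-(12), setting S-2: form s^_t from the estimated
# Markov parameters, reconstruct the estimated truncated output, evaluate the
# estimated truncated cost, and take a projected gradient step.
def s_hat_S2(y_t, G_hat, u_history):
    # Eq. (10), S-2: subtract the estimated effect of the last h inputs.
    return y_t - sum(G_k @ u_k for G_k, u_k in zip(G_hat, u_history))

def estimated_truncated_cost(cost_fn, s_hat, G_hat, u_dac_history, u_dac_t):
    # Eq. (11): estimated truncated output, then the cost c_t evaluated on it.
    y_trunc = s_hat + sum(G_k @ u_k for G_k, u_k in zip(G_hat, u_dac_history))
    return cost_fn(y_trunc, u_dac_t)

def project(M, kappa_M):
    norm = np.linalg.norm(M)
    return M if norm <= kappa_M else M * (kappa_M / norm)

def zero_knowledge_step(M_i, grad_est, eta_i, kappa_M=10.0):
    # Eq. (12): gradient step on the estimated truncated cost, then project.
    return project(M_i - eta_i * grad_est, kappa_M)
```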
B. Online Estimation: The online estimation algorithm is a combination of a change point detection algorithm and