Duality-Based Stochastic Policy Optimization for
Estimation with Unknown Noise Covariances
Shahriar Talebi, Amirhossein Taghvaei, Mehran Mesbahi
Abstract—Duality of control and estimation allows mapping
recent advances in data-guided control to the estimation setup.
This paper formalizes and utilizes such a mapping to consider
learning the optimal (steady-state) Kalman gain when process
and measurement noise statistics are unknown. Specifically,
building on the duality between synthesizing optimal control
and estimation gains, the filter design problem is formalized
as direct policy learning. In this direction, the duality is
used to extend existing theoretical guarantees of direct policy
updates for Linear Quadratic Regulator (LQR) to establish
global convergence of the Gradient Descent (GD) algorithm
for the estimation problem–while addressing subtle differences
between the two synthesis problems. Subsequently, a Stochastic
Gradient Descent (SGD) approach is adopted to learn the opti-
mal Kalman gain without the knowledge of noise covariances.
The results are illustrated via several numerical examples.
I. INTRODUCTION
Duality of control and estimation provides an important
relationship between two distinct synthesis problems in
system theory [1]–[3]. In fact, duality has served as an
effective bridge: theoretical and computational techniques
developed in one domain are then “dualized” for use in
the other. For instance, the stability proof of the Kalman
filter relies on the stabilizing feature of the optimal feedback
gain for the dual LQR optimal control problem [4, Ch. 9].
The aim of this paper is to build on this dualization for the
purpose of learning the optimal estimation policy via recent
advances in data-driven algorithms for optimal control.
The setup that we consider is the estimation problem for a
system with known linear dynamics and observation model,
but unknown process and measurement noise covariances.
The problem is to learn the optimal steady-state Kalman
gain using training data that consist of independent
realizations of the observation signal. This problem has a
long history in system theory, often examined in the context
of adaptive Kalman filtering [5]–[10]. The classical reference
[6] includes a comprehensive summary of four solution
approaches to this problem: Bayesian inference [11]–[13],
Maximum likelihood [14], [15], covariance matching [9],
and innovation correlation methods [5], [7]. The Bayesian
and maximum likelihood setups are known to be computationally
costly, and covariance matching admits undesirable biases in
practice. For these reasons, the innovation correlation based
approaches are more popular and have been the subject of more
recent research [16]–[18]. The article [19] includes an excellent
survey on this topic. Though relying strongly on statistical
assumptions on the model, these approaches do not provide
non-asymptotic guarantees.
The authors are with the William E. Boeing Department of Aeronautics
and Astronautics, University of Washington, Seattle, WA, USA. S. Talebi
is also with the Department of Mathematics at the University of Washington.
The research of the first and last authors has been supported by
AFOSR grant FA9550-20-1-0053 and NSF grant ECCS-2149470. Emails:
shahriar@uw.edu, amirtag@uw.edu, and mesbahi@uw.edu.
On the optimal control side, there have been a number
of recent advances in data-driven synthesis methods. For
example, first order methods have been adopted for state-
feedback LQR problems [20], [21]. This direct policy op-
timization perspective has been particularly effective as it
has been shown that the LQR cost is gradient dominant
[22], allowing the adoption and global convergence of first
order methods for optimal feedback synthesis despite the
non-convexity of the cost, when represented directly in terms
of this policy. Since then, Policy Optimization (PO) using
first order methods has been investigated for variants of the LQR
problem, such as Output-feedback Linear Quadratic Regula-
tors (OLQR) [23], model-free setup [24], risk-constrained
setup [25], Linear Quadratic Gaussian (LQG) [26], and
recently, Riemannian constrained LQR [27].
This paper aims to bring new insights to the classical
estimation problem through the lens of control-estimation
duality, utilizing recent advances in data-driven optimal
control. In particular, we first argue that the optimal mean-
squared error estimation problem is “equivalent” to an LQR
problem. This, in turn, allows representing the problem of
finding the optimal Kalman gain as that of optimal policy
synthesis for the LQR problem—under conditions distinct
from what has been examined in the literature. In particular,
in this equivalent LQR formulation, the cost parameters
(relating to the noise covariances) are unknown and the
covariance of the initial state is not necessarily positive
definite. By addressing these technical issues, we show how
exploring this relationship leads to computational algorithms
for learning the optimal Kalman gain with non-asymptotic
error guarantees.
The rest of the paper is organized as follows. The es-
timation problem is formulated in §II, followed by the
estimation-control duality relationship in §III. The theoretical
analysis of policy optimization for the Kalman gain
appears in §IV, while the proofs are deferred to [28]. We
propose an SGD algorithm in §V with several numerical
examples, followed by concluding remarks in §VI.
II. BACKGROUND AND PROBLEM FORMULATION
Consider the stochastic difference equation,
$$x(t+1) = A\,x(t) + \xi(t), \tag{1a}$$
$$y(t) = H\,x(t) + \omega(t), \tag{1b}$$
where $x(t) \in \mathbb{R}^n$ is the state of the system, $y(t) \in \mathbb{R}^m$ is
the observation, and $\{\xi(t)\}_{t\in\mathbb{Z}}$ and $\{\omega(t)\}_{t\in\mathbb{Z}}$ are the
uncorrelated zero-mean process and measurement noise vectors,
respectively, with the following covariances,
$$\mathbb{E}[\xi(t)\xi^\top(t)] = Q \in \mathbb{R}^{n\times n}, \qquad \mathbb{E}[\omega(t)\omega^\top(t)] = R \in \mathbb{R}^{m\times m},$$
for some (possibly time-varying) positive (semi-)definite
matrices $Q, R \succeq 0$. Let $m_0$ and $P_0 \succeq 0$ denote the mean
and covariance of the initial condition $x_0$.
Now, let us fix a time horizon $T > 0$ and define an
estimation policy, denoted by $\mathcal{P}$, as a map that takes a history
of the observation signal $\mathcal{Y}_T = \{y(0), y(1), \dots, y(T-1)\}$ as
an input and outputs an estimate of the state $x(T)$, denoted
by $\hat{x}_{\mathcal{P}}(T)$. The filtering problem of interest is finding the
estimation policy $\mathcal{P}$ that minimizes the mean-squared error
$$\mathbb{E}\,\|x(T) - \hat{x}_{\mathcal{P}}(T)\|^2. \tag{2}$$
We make the following assumptions in our problem setup:
1) The matrices $A$ and $H$ are known, but the process and
the measurement noise covariance matrices, $Q$ and $R$, are
not available. 2) We have access to a training data-set that
consists of independent realizations of the observation signal
$\{y(t)\}_{t=0}^{T}$. However, ground-truth measurements of $x(T)$
are not available.¹
It is not possible to directly minimize (2) as the ground-
truth measurement $x(T)$ is not available. Instead, we propose
to minimize the mean-squared error in predicting the
observation $y(T)$ as a surrogate objective function. In particular,
let us first define $\hat{y}_{\mathcal{P}}(T) = H\hat{x}_{\mathcal{P}}(T)$ as the prediction
for the observation $y(T)$. This is indeed a prediction since
the estimate $\hat{x}_{\mathcal{P}}(T)$ depends only on the observations up
to time $T-1$. The optimization problem is now finding
the estimation policy $\mathcal{P}$ that minimizes the mean-squared
prediction error
$$J^{\mathrm{est}}_T(\mathcal{P}) := \mathbb{E}\,\|y(T) - \hat{y}_{\mathcal{P}}(T)\|^2. \tag{3}$$
1) Kalman filter: Indeed, when $Q$ and $R$ are known, the
solution is given by the celebrated Kalman filter algorithm [2].
The algorithm involves an iterative procedure to update the
estimate $\hat{x}(t)$ according to
$$\hat{x}(t+1) = A\hat{x}(t) + L(t)\big(y(t) - H\hat{x}(t)\big), \qquad \hat{x}(0) = m_0, \tag{4}$$
where $L(t) := A P(t) H^\top (H P(t) H^\top + R)^{-1}$ is the Kalman
gain, and $P(t) := \mathbb{E}[(x(t)-\hat{x}(t))(x(t)-\hat{x}(t))^\top]$ is the error
covariance matrix that satisfies the Riccati equation
$$P(t+1) = (A - L(t)H)\, P(t)\, A^\top + Q, \qquad P(0) = P_0.$$
Note that the update law presented here combines the
information and dynamic update steps of the Kalman filter.
It is known that $P(t)$ converges to a steady-state value $P_\infty$
when the pair $(A, H)$ is observable and the pair $(A, Q^{1/2})$
is controllable [29], [30]. In such a case, the gain converges
to $L_\infty := A P_\infty H^\top (H P_\infty H^\top + R)^{-1}$, the so-called
steady-state Kalman gain. It is common practice to evaluate the
steady-state Kalman gain $L_\infty$ offline and use it, instead of
$L(t)$, to update the estimate in real time.
¹This setting arises in various applications, such as aircraft wing
dynamics, when approximate or reduced-order models are employed, and the
effect of unmodelled dynamics and disturbances is captured by the process
noise.
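As a concrete known-covariance baseline, the steady-state gain can be computed offline by iterating the Riccati update above until convergence. The following is a minimal Python sketch; the helper name and the toy matrices are ours, chosen only for illustration, and $R$ is assumed invertible.

```python
import numpy as np

def steady_state_kalman_gain(A, H, Q, R, tol=1e-10, max_iter=10_000):
    """Iterate P <- (A - L H) P A^T + Q with L = A P H^T (H P H^T + R)^{-1}
    until convergence, then return the steady-state gain and covariance."""
    n = A.shape[0]
    P = np.zeros((n, n))
    for _ in range(max_iter):
        L = A @ P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        P_next = (A - L @ H) @ P @ A.T + Q
        if np.linalg.norm(P_next - P, ord="fro") < tol:
            P = P_next
            break
        P = P_next
    L_inf = A @ P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    return L_inf, P

# toy example (arbitrary illustrative values)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
H = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2)
R = np.array([[0.5]])
L_inf, P_inf = steady_state_kalman_gain(A, H, Q, R)
```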
2) Learning the optimal Kalman gain: Inspired by the
structure of the Kalman filter, we consider restricting the
estimation policies $\mathcal{P}$ to those realized with a constant gain.
In particular, we define the estimate $\hat{x}_L(T)$ as the one given
by the Kalman filter at time $T$ realized with the constant gain
$L$. Rolling out the update law (4) for $t=0$ to $t=T-1$, and
replacing $L(t)$ with $L$, leads to the following expression for
the estimate $\hat{x}_L(T)$ as a function of $L$:
$$\hat{x}_L(T) = A_L^T m_0 + \sum_{t=0}^{T-1} A_L^{T-t-1} L\, y(t), \tag{5}$$
where $A_L := A - LH$. Note that this estimate does not
require knowledge of the matrices $Q$ or $R$. By considering
$\hat{y}_L(T) := H\hat{x}_L(T)$, the problem is now that of finding the
optimal gain $L$ that minimizes the mean-squared prediction error
$$J^{\mathrm{est}}_T(L) := \mathbb{E}\,\|y(T) - \hat{y}_L(T)\|^2. \tag{6}$$
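To make (5)–(6) concrete, the following sketch rolls out the constant-gain filter on sampled trajectories and estimates the prediction error by Monte Carlo. The simulator, helper names, and numerical values below are ours and serve only as an illustration; in the learning problem itself, only the observation realizations would be available.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trajectory(A, H, Q, R, m0, P0, T, rng):
    """Sample one realization {y(0), ..., y(T)} of the model (1)."""
    x = rng.multivariate_normal(m0, P0)
    ys = []
    for _ in range(T + 1):
        ys.append(H @ x + rng.multivariate_normal(np.zeros(H.shape[0]), R))
        x = A @ x + rng.multivariate_normal(np.zeros(A.shape[0]), Q)
    return np.array(ys)

def xhat_constant_gain(A, H, L, m0, ys):
    """Estimate (5): roll out xhat(t+1) = A xhat(t) + L (y(t) - H xhat(t))."""
    xhat = m0.copy()
    for y in ys[:-1]:          # uses y(0), ..., y(T-1) only
        xhat = A @ xhat + L @ (y - H @ xhat)
    return xhat

def prediction_error(A, H, L, m0, ys_batch):
    """Monte Carlo estimate of J^est_T(L) = E ||y(T) - H xhat_L(T)||^2 in (6)."""
    errs = [np.sum((ys[-1] - H @ xhat_constant_gain(A, H, L, m0, ys)) ** 2)
            for ys in ys_batch]
    return np.mean(errs)

# toy usage (arbitrary illustrative values)
A = np.array([[0.9, 0.2], [0.0, 0.8]]); H = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2); R = np.array([[0.5]])
m0 = np.zeros(2); P0 = np.eye(2); T = 20
batch = [simulate_trajectory(A, H, Q, R, m0, P0, T, rng) for _ in range(200)]
print(prediction_error(A, H, np.array([[0.3], [0.1]]), m0, batch))
```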
Numerically, this problem falls into the realm of stochas-
tic optimization and can be solved by algorithms such
as Stochastic Gradient Descent (SGD). Such an algorithm
would require accessing independent realizations of the
observation signal. An algorithm that utilizes such realiza-
tions is presented in §V. Theoretically, however, it is not
yet clear if this optimization problem is well-posed and
admits a unique minimizer. This is the subject of §IV,
where certain properties of the objective function, such as
its gradient dominance and smoothness, are established.
These theoretical results are then used to analyze first-order
optimization algorithms and provide stability guarantees of
the estimation policy iterates. The results are based on the
duality relationship between estimation and control that is
presented next.
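Before turning to the duality relationship, here is a toy illustration of the stochastic-optimization viewpoint just described: a plain SGD loop over the gain entries, using a finite-difference gradient of the single-sample prediction error. The finite-difference surrogate, step size, and all numerical values are our simplifications for illustration only; the actual algorithm and its guarantees are the subject of §IV and §V.

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[0.9, 0.2], [0.0, 0.8]]); H = np.array([[1.0, 0.0]])
Q_true = 0.1 * np.eye(2); R_true = np.array([[0.5]])   # unknown to the learner
m0 = np.zeros(2); T = 20

def sample_observations(T):
    """One independent realization {y(0), ..., y(T)} of (1); only y is revealed."""
    x = m0.copy()
    ys = []
    for _ in range(T + 1):
        ys.append(H @ x + rng.multivariate_normal([0.0], R_true))
        x = A @ x + rng.multivariate_normal([0.0, 0.0], Q_true)
    return np.array(ys)

def sample_loss(L, ys):
    """Single-sample prediction error ||y(T) - H xhat_L(T)||^2, cf. (5)-(6)."""
    xhat = m0.copy()
    for y in ys[:-1]:
        xhat = A @ xhat + L @ (y - H @ xhat)
    return np.sum((ys[-1] - H @ xhat) ** 2)

L = np.zeros((2, 1))            # initial gain; A itself is Schur stable here
step, eps = 0.05, 1e-4          # step size chosen arbitrarily for illustration
for k in range(2000):
    ys = sample_observations(T)
    grad = np.zeros_like(L)
    for i in range(L.shape[0]):              # finite-difference gradient
        for j in range(L.shape[1]):
            E = np.zeros_like(L); E[i, j] = eps
            grad[i, j] = (sample_loss(L + E, ys) - sample_loss(L - E, ys)) / (2 * eps)
    L -= step * grad
```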
III. ESTIMATION-CONTROL DUALITY RELATIONSHIP
We use the duality framework, as described in [31,
Ch. 7.5], to relate the problem of learning the optimal
estimation policy to that of learning the optimal control policy
for an LQR problem. In order to do so, we introduce the
adjoint system:
$$z(t) = A^\top z(t+1) - H^\top u(t+1), \tag{7}$$
where $z(t) \in \mathbb{R}^n$ is the adjoint state and $\mathcal{U}_T :=
\{u(1), \dots, u(T)\} \in \mathbb{R}^{mT}$ are the control variables (dual to
the observation signal $\mathcal{Y}_T$). The adjoint state is initialized
at $z(T) = a \in \mathbb{R}^n$ and simulated backward in time starting
with $t = T-1$. We now formalize a relationship between
estimation policies for the system (1) and control policies
for the adjoint system (7). Consider estimation policies that
are linear functions of the observation history $\mathcal{Y}_T \in \mathbb{R}^{mT}$
and the initial mean vector $m_0 \in \mathbb{R}^n$. We characterize
such policies with a linear map $\mathcal{L}: \mathbb{R}^{mT+n} \to \mathbb{R}^n$ and
let the estimate $\hat{x}_{\mathcal{L}}(T) := \mathcal{L}(m_0, \mathcal{Y}_T)$. The adjoint of this
linear map, denoted by $\mathcal{L}^\dagger: \mathbb{R}^n \to \mathbb{R}^{mT+n}$, is used
to define a control policy for the adjoint system (7). In
particular, the adjoint map takes $a \in \mathbb{R}^n$ as input and outputs
$\mathcal{L}^\dagger(a) = \{b, u(1), \dots, u(T)\} \in \mathbb{R}^{mT+n}$. This relationship
can be depicted as
$$\{m_0, y(0), \dots, y(T-1)\} \;\xrightarrow{\;\mathcal{L}\;}\; \hat{x}_{\mathcal{L}}(T), \qquad \{b, u(1), \dots, u(T)\} \;\xleftarrow{\;\mathcal{L}^\dagger\;}\; a.$$
Note that $\langle a, \mathcal{L}(m_0, \mathcal{Y}_T)\rangle_{\mathbb{R}^n} = \langle \mathcal{L}^\dagger(a), (m_0, \mathcal{Y}_T)\rangle_{\mathbb{R}^{mT+n}}$, so
$$b^\top m_0 + \textstyle\sum_{t=0}^{T-1} u(t+1)^\top y(t) = a^\top \hat{x}_{\mathcal{L}}(T). \tag{8}$$
The following proposition relates the mean-squared error
of a linear estimation policy to the following LQR cost:
$$J^{\mathrm{LQR}}_T(a, \{b, \mathcal{U}_T\}) := \big[z^\top(0)m_0 - b^\top m_0\big]^2 + z^\top(0) P_0\, z(0) + \sum_{t=1}^{T} \big[z^\top(t) Q\, z(t) + u^\top(t) R\, u(t)\big]. \tag{9}$$
Proposition 1. Consider the estimation problem for the
system (1) and the LQR problem (9) subject to the adjoint
dynamics (7). For each estimation policy $\hat{x}_{\mathcal{L}}(T) = \mathcal{L}(m_0, \mathcal{Y}_T)$,
with a linear map $\mathcal{L}$, and for any $a \in \mathbb{R}^n$, we have the identity
$$\mathbb{E}\,\big|a^\top x(T) - a^\top \hat{x}_{\mathcal{L}}(T)\big|^2 = J^{\mathrm{LQR}}_T\big(a, \mathcal{L}^\dagger(a)\big).$$
Furthermore, the prediction error as in (6) satisfies
$$J^{\mathrm{est}}_T(\mathcal{L}) = \sum_{i=1}^{m} J^{\mathrm{LQR}}_T\big(H_i, \mathcal{L}^\dagger(H_i)\big) + \mathrm{tr}[R],$$
where $\hat{y}_{\mathcal{L}}(T) := H\hat{x}_{\mathcal{L}}(T)$ and $H_i^\top \in \mathbb{R}^n$ is the $i$-th row of
the $m \times n$ matrix $H$, for $i = 1, \dots, m$.
Remark 1. The duality also holds in the continuous-time
setting, where the estimation problem is related to a
continuous-time LQR problem. Recent extensions to the
nonlinear setting appear in [32], with a comprehensive study
in [33]. This duality is different from the maximum likelihood
approach, which involves an optimal control problem
over the original dynamics instead of the adjoint system.
1) Duality in the constant control gain regime: In this
section, we use the aforementioned duality relationship to
show that the estimation policy with constant gain is dual to
the control policy with constant feedback gain. This result
is then used to obtain an explicit formula for the objective
function (6).
Consider the adjoint system (7) with the linear feedback
law $u(t) = L^\top z(t)$. Then,
$$z(t) = (A_L^\top)^{T-t} a, \qquad \text{for } t = 0, 1, \dots, T. \tag{10}$$
Therefore, as a function of $a$, $u(t) = L^\top (A_L^\top)^{T-t} a$. Moreover,
for this choice of control, the optimal $b = z(0) = (A_L^\top)^T a$.
These relationships are used to identify the control
policy $\mathcal{L}^\dagger(a) = \big((A_L^\top)^T a,\; L^\top (A_L^\top)^{T-1} a,\; \dots,\; L^\top a\big)$. This
control policy corresponds to an estimation policy by the
adjoint relationship (8):
$$a^\top \hat{x}_L(T) = a^\top A_L^T m_0 + \textstyle\sum_{t=0}^{T-1} a^\top A_L^{T-t-1} L\, y(t), \qquad \forall a \in \mathbb{R}^n.$$
As this relationship holds for all $a \in \mathbb{R}^n$, we have
$$\hat{x}_L(T) = A_L^T m_0 + \textstyle\sum_{t=0}^{T-1} A_L^{T-t-1} L\, y(t),$$
which coincides with the Kalman filter estimate with constant
gain $L$ given by the formula (5). Therefore, the adjoint
relationship (8) relates the control policy with constant gain
$L^\top$ to the Kalman filter with the constant gain $L$.
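As a quick numerical sanity check of this correspondence, one can build the dual control policy from (10) for arbitrary matrices and verify that the adjoint identity (8) reproduces the constant-gain estimate (5). The matrices below are arbitrary test values of our choosing.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T = 3, 2, 6
A = 0.3 * rng.normal(size=(n, n))          # arbitrary test matrices
H = rng.normal(size=(m, n))
L = 0.1 * rng.normal(size=(n, m))
A_L = A - L @ H

m0 = rng.normal(size=n)
ys = [rng.normal(size=m) for _ in range(T)]          # y(0), ..., y(T-1)
a = rng.normal(size=n)

# left-hand side of (8): b^T m0 + sum_t u(t+1)^T y(t), with the policy from (10)
b = np.linalg.matrix_power(A_L.T, T) @ a
u = [L.T @ np.linalg.matrix_power(A_L.T, T - t) @ a for t in range(1, T + 1)]
lhs = b @ m0 + sum(u[t] @ ys[t] for t in range(T))   # u[t] stands for u(t+1)

# right-hand side of (8): a^T xhat_L(T) with xhat_L(T) from (5)
xhat = np.linalg.matrix_power(A_L, T) @ m0 + sum(
    np.linalg.matrix_power(A_L, T - t - 1) @ L @ ys[t] for t in range(T))
rhs = a @ xhat

assert np.isclose(lhs, rhs)
```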
Next, we use this relationship to evaluate the mean-
squared prediction error (6). Denote by $J^{\mathrm{LQR}}_T(a, L^\top)$ the
LQR cost (9) associated with the control policy with constant
gain $L^\top$ and $b = z(0)$. Then, from the explicit formulas for
$z(t)$ and $u(t)$ above, we have
$$J^{\mathrm{LQR}}_T(a, L^\top) = a^\top X_T(L)\, a,$$
where
$$X_T(L) := A_L^T P_0 (A_L^\top)^T + \sum_{t=1}^{T} A_L^{T-t} \big(Q + L R L^\top\big) (A_L^\top)^{T-t}.$$
Therefore, by the second claim in Proposition 1, the mean-
squared prediction error (6) becomes
$$J^{\mathrm{est}}_T(L) - \mathrm{tr}[R] = \sum_{i=1}^{m} J^{\mathrm{LQR}}_T(H_i, L^\top) = \mathrm{tr}\big[X_T(L) H^\top H\big],$$
where we have used the cyclic permutation property of the
trace and the identity $H^\top H = \sum_{i=1}^{m} H_i H_i^\top$.
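This identity can also be checked numerically: the sketch below compares the closed-form value $\mathrm{tr}[X_T(L) H^\top H] + \mathrm{tr}[R]$ with a Monte Carlo estimate of (6) obtained by rolling out the constant-gain filter. All system matrices are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# illustrative system (arbitrary values), cf. (1)
A = np.array([[0.9, 0.2], [0.0, 0.8]]); H = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2); R = np.array([[0.5]])
m0 = np.zeros(2); P0 = np.eye(2); T = 15
L = np.array([[0.3], [0.1]]); A_L = A - L @ H

# closed-form value: J^est_T(L) = tr[X_T(L) H^T H] + tr[R]
X_T = np.linalg.matrix_power(A_L, T) @ P0 @ np.linalg.matrix_power(A_L.T, T)
for t in range(1, T + 1):
    M = np.linalg.matrix_power(A_L, T - t)
    X_T += M @ (Q + L @ R @ L.T) @ M.T
closed_form = np.trace(X_T @ H.T @ H) + np.trace(R)

# Monte Carlo estimate of E||y(T) - H xhat_L(T)||^2 using the rollout (5)
def sample_error():
    x = rng.multivariate_normal(m0, P0)
    xhat = m0.copy()
    for _ in range(T):
        y = H @ x + rng.multivariate_normal([0.0], R)
        xhat = A @ xhat + L @ (y - H @ xhat)
        x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
    yT = H @ x + rng.multivariate_normal([0.0], R)
    return np.sum((yT - H @ xhat) ** 2)

mc = np.mean([sample_error() for _ in range(20000)])
print(closed_form, mc)   # the two values should agree up to sampling error
```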
2) Duality in the steady-state regime: Define the set of Schur
stabilizing gains
$$\mathcal{S} := \{L \in \mathbb{R}^{n \times m} : \rho(A - LH) < 1\}.$$
For any $L \in \mathcal{S}$, in the steady-state limit as $T \to \infty$,
$$X_T(L) \to X(L) := \sum_{t=0}^{\infty} (A_L)^t \big(Q + L R L^\top\big) (A_L^\top)^t.$$
The limit coincides with the unique solution $X$ of the
discrete Lyapunov equation $X = A_L X A_L^\top + Q + L R L^\top$,
which exists as $\rho(A_L) < 1$. Therefore, the steady-state limit
of the mean-squared prediction error assumes the form
$$J(L) := \lim_{T \to \infty} J^{\mathrm{est}}_T(L) = \mathrm{tr}\big[X(L) H^\top H\big] + \mathrm{tr}[R].$$
Given the steady-state limit, we formally analyze the
following constrained optimization problem:
$$\min_{L \in \mathcal{S}} \; J(L) = \mathrm{tr}\big[X(L) H^\top H\big] + \mathrm{tr}[R], \tag{11}$$
$$\text{s.t.} \quad X(L) = A_L X(L) A_L^\top + Q + L R L^\top.$$
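For a given stabilizing gain, the constraint in (11) is a discrete Lyapunov equation, so the objective can be evaluated directly whenever $Q$ and $R$ are available (e.g., for benchmarking a learned gain). A minimal sketch using SciPy's discrete Lyapunov solver, with our arbitrary test matrices:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def steady_state_cost(A, H, Q, R, L):
    """Evaluate J(L) = tr[X(L) H^T H] + tr[R], where X(L) solves
    X = A_L X A_L^T + Q + L R L^T with A_L = A - L H (requires rho(A_L) < 1)."""
    A_L = A - L @ H
    if np.max(np.abs(np.linalg.eigvals(A_L))) >= 1:
        return np.inf                      # L is outside the stabilizing set S
    X = solve_discrete_lyapunov(A_L, Q + L @ R @ L.T)
    return np.trace(X @ H.T @ H) + np.trace(R)

# illustrative values, comparing two candidate gains
A = np.array([[0.9, 0.2], [0.0, 0.8]]); H = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2); R = np.array([[0.5]])
print(steady_state_cost(A, H, Q, R, np.array([[0.3], [0.1]])))
print(steady_state_cost(A, H, Q, R, np.array([[0.6], [0.2]])))
```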
Remark 2. Note that the latter problem is technically the
dual of the optimal LQR problem as formulated in [20], by
relating $A \leftrightarrow A^\top$, $H \leftrightarrow B^\top$, $L \leftrightarrow K^\top$, and $H^\top H \leftrightarrow \Sigma$.
However, one main difference here is that the matrices $Q$
and $R$ are unknown, and $H^\top H$ may not be positive
definite, for example, due to rank deficiency in $H$, especially
whenever $m < n$. Thus, in general, the cost function $J(L)$
is not necessarily coercive in $L$, which can drastically affect
the optimization landscape. For the same reason, in contrast
to the LQR case [20], [22], the gradient dominance property
of $J(L)$ is not clear in the filtering setup. In the next section,
we show that such issues can be avoided as long as the pair
$(A, H)$ is observable.