Duality-Based Stochastic Policy Optimization for
Estimation with Unknown Noise Covariances
Shahriar Talebi, Amirhossein Taghvaei, Mehran Mesbahi
Abstract—Duality of control and estimation allows mapping
recent advances in data-guided control to the estimation setup.
This paper formalizes and utilizes such a mapping to consider
learning the optimal (steady-state) Kalman gain when process
and measurement noise statistics are unknown. Specifically,
building on the duality between synthesizing optimal control
and estimation gains, the filter design problem is formalized
as direct policy learning. In this direction, the duality is
used to extend existing theoretical guarantees of direct policy
updates for Linear Quadratic Regulator (LQR) to establish
global convergence of the Gradient Descent (GD) algorithm
for the estimation problem–while addressing subtle differences
between the two synthesis problems. Subsequently, a Stochastic
Gradient Descent (SGD) approach is adopted to learn the opti-
mal Kalman gain without the knowledge of noise covariances.
The results are illustrated via several numerical examples.
I. INTRODUCTION
Duality of control and estimation provides an important
relationship between two distinct synthesis problems in
system theory [1]–[3]. In fact, duality has served as an
effective bridge: theoretical and computational techniques
developed in one domain are then “dualized” for use in
the other. For instance, the stability proof of the Kalman
filter relies on the stabilizing feature of the optimal feedback
gain for the dual LQR optimal control problem [4, Ch. 9].
The aim of this paper is to build on this dualization for the
purpose of learning the optimal estimation policy via recent
advances in data-driven algorithms for optimal control.
The setup that we consider is the estimation problem for a
system with known linear dynamics and observation model,
but unknown process and measurement noise covariances.
The problem is to learn the optimal steady-state Kalman
gain using training data that consist of independent
realizations of the observation signal. This problem has a
long history in system theory, often examined in the context
of adaptive Kalman filtering [5]–[10]. The classical reference
[6] includes a comprehensive summary of four solution
approaches to this problem: Bayesian inference [11]–[13],
Maximum likelihood [14], [15], covariance matching [9],
and innovation correlation methods [5], [7]. The Bayesian
and maximum likelihood setups are known to be computationally
costly, and covariance matching admits undesirable biases in
practice. For these reasons, the innovation correlation based
approaches are more popular and have been the subject of more
recent research [16]–[18]. The article [19] includes an excellent
survey on this topic. Though relying strongly on statistical
assumptions on the model, these approaches do not provide
non-asymptotic guarantees.
The authors are with the William E. Boeing Department of Aeronautics
and Astronautics, University of Washington, Seattle, WA, USA. S. Talebi
is also with the Department of Mathematics at the University of Washington.
The research of the first and last authors has been supported by
AFOSR grant FA9550-20-1-0053 and NSF grant ECCS-2149470. Emails:
shahriar@uw.edu, amirtag@uw.edu, and mesbahi@uw.edu.
On the optimal control side, there have been a number
of recent advances in data-driven synthesis methods. For
example, first order methods have been adopted for state-
feedback LQR problems [20], [21]. This direct policy op-
timization perspective has been particularly effective as it
has been shown that the LQR cost is gradient dominant
[22], allowing the adoption and global convergence of first
order methods for optimal feedback synthesis despite the
non-convexity of the cost, when represented directly in terms
of this policy. Since then, Policy Optimization (PO) using
first order methods has been investigated for variants of the LQR
problem, such as Output-feedback Linear Quadratic Regula-
tors (OLQR) [23], model-free setup [24], risk-constrained
setup [25], Linear Quadratic Gaussian (LQG) [26], and
recently, Riemannian constrained LQR [27].
This paper aims to bring new insights to the classical
estimation problem through the lens of control-estimation
duality, utilizing recent advances in data-driven optimal
control. In particular, we first argue that the optimal mean-
squared error estimation problem is “equivalent” to an LQR
problem. This, in turn, allows representing the problem of
finding the optimal Kalman gain as that of optimal policy
synthesis for the LQR problem—under conditions distinct
from what has been examined in the literature. In particular,
in this equivalent LQR formulation, the cost parameters
(relating to the noise covariances) are unknown and the
covariance of the initial state is not necessarily positive
definite. By addressing these technical issues, we show how
exploring this relationship leads to computational algorithms
for learning the optimal Kalman gain with non-asymptotic
error guarantees.
The rest of the paper is organized as follows. The es-
timation problem is formulated in §II, followed by the
estimation-control duality relationship in §III. The theoretical
analysis of policy optimization for the Kalman gain
appears in §IV, while the proofs are deferred to [28]. We
propose an SGD algorithm in §V with several numerical
examples, followed by concluding remarks in §VI.
II. BACKGROUND AND PROBLEM FORMULATION
Consider the stochastic difference equation,
$$x(t+1) = A\,x(t) + \xi(t), \tag{1a}$$
$$y(t) = H\,x(t) + \omega(t), \tag{1b}$$
where $x(t) \in \mathbb{R}^n$ is the state of the system, $y(t) \in \mathbb{R}^m$ is
the observation, and $\{\xi(t)\}_{t\in\mathbb{Z}}$ and $\{\omega(t)\}_{t\in\mathbb{Z}}$ are the
uncorrelated zero-mean process and measurement noise vectors,
respectively, with the following covariances,
$$\mathbb{E}[\xi(t)\xi^\top(t)] = Q \in \mathbb{R}^{n\times n}, \qquad \mathbb{E}[\omega(t)\omega^\top(t)] = R \in \mathbb{R}^{m\times m},$$
for some (possibly time-varying) positive (semi-)definite
matrices $Q, R \succeq 0$. Let $m_0$ and $P_0 \succeq 0$ denote the mean
and covariance of the initial condition $x_0$.
Now, let us fix a time horizon $T > 0$ and define an
estimation policy, denoted by $\mathcal{P}$, as a map that takes a history
of the observation signal $\mathcal{Y}_T = \{y(0), y(1), \dots, y(T-1)\}$ as
an input and outputs an estimate of the state $x(T)$, denoted
by $\hat{x}_{\mathcal{P}}(T)$. The filtering problem of interest is finding the
estimation policy $\mathcal{P}$ that minimizes the mean-squared error
$$\mathbb{E}\,\|x(T) - \hat{x}_{\mathcal{P}}(T)\|^2. \tag{2}$$
We make the following assumptions in our problem setup:
1) The matrices $A$ and $H$ are known, but the process and
the measurement noise covariance matrices, $Q$ and $R$, are
not available. 2) We have access to a training data-set that
consists of independent realizations of the observation signal
$\{y(t)\}_{t=0}^{T}$. However, ground-truth measurements of $x(T)$
are not available.¹
It is not possible to directly minimize (2) as the ground-
truth measurement $x(T)$ is not available. Instead, we propose
to minimize the mean-squared error in predicting the
observation $y(T)$ as a surrogate objective function. In particular,
let us first define $\hat{y}_{\mathcal{P}}(T) = H\hat{x}_{\mathcal{P}}(T)$ as the prediction
for the observation $y(T)$. This is indeed a prediction since
the estimate $\hat{x}_{\mathcal{P}}(T)$ depends only on the observations up
to time $T-1$. The optimization problem is now finding
the estimation policy $\mathcal{P}$ that minimizes the mean-squared
prediction error
$$J^{\mathrm{est}}_T(\mathcal{P}) := \mathbb{E}\,\|y(T) - \hat{y}_{\mathcal{P}}(T)\|^2. \tag{3}$$
1) Kalman filter: Indeed, when $Q$ and $R$ are known, the
solution is given by the celebrated Kalman filter algorithm [2].
The algorithm involves an iterative procedure to update the
estimate $\hat{x}(t)$ according to
$$\hat{x}(t+1) = A\hat{x}(t) + L(t)\big(y(t) - H\hat{x}(t)\big), \qquad \hat{x}(0) = m_0, \tag{4}$$
where $L(t) := A P(t) H^\top (H P(t) H^\top + R)^{-1}$ is the Kalman
gain, and $P(t) := \mathbb{E}[(x(t)-\hat{x}(t))(x(t)-\hat{x}(t))^\top]$ is the error
covariance matrix that satisfies the Riccati equation
$$P(t+1) = (A - L(t)H)\, P(t)\, A^\top + Q, \qquad P(0) = P_0.$$
Note that the update law presented here combines the
information and dynamic update steps of the Kalman filter.
It is known that $P(t)$ converges to a steady-state value $P_\infty$
when the pair $(A, H)$ is observable and the pair $(A, Q^{1/2})$
is controllable [29], [30]. In such a case, the gain converges
to $L_\infty := A P_\infty H^\top (H P_\infty H^\top + R)^{-1}$, the so-called
steady-state Kalman gain. It is common practice to evaluate the
steady-state Kalman gain $L_\infty$ offline and use it, instead of
$L(t)$, to update the estimate in real time.
¹This setting arises in various applications, such as aircraft wing
dynamics, when approximate or reduced-order models are employed, and the
effect of unmodelled dynamics and disturbances is captured by the process
noise.
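As a concrete known-covariance baseline, the steady-state gain can be computed offline by iterating the Riccati update above until convergence. The following is a minimal Python sketch; the helper name and the toy matrices are ours, chosen only for illustration, and $R$ is assumed invertible.

```python
import numpy as np

def steady_state_kalman_gain(A, H, Q, R, tol=1e-10, max_iter=10_000):
    """Iterate P <- (A - L H) P A^T + Q with L = A P H^T (H P H^T + R)^{-1}
    until convergence, then return the steady-state gain and covariance."""
    n = A.shape[0]
    P = np.zeros((n, n))
    for _ in range(max_iter):
        L = A @ P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        P_next = (A - L @ H) @ P @ A.T + Q
        if np.linalg.norm(P_next - P, ord="fro") < tol:
            P = P_next
            break
        P = P_next
    L_inf = A @ P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    return L_inf, P

# toy example (arbitrary illustrative values)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
H = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2)
R = np.array([[0.5]])
L_inf, P_inf = steady_state_kalman_gain(A, H, Q, R)
```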
2) Learning the optimal Kalman gain: Inspired by the
structure of the Kalman filter, we consider restricting the
estimation policies $\mathcal{P}$ to those realized with a constant gain.
In particular, we define the estimate $\hat{x}_L(T)$ as the one given
by the Kalman filter at time $T$ realized with the constant gain
$L$. Rolling out the update law (4) for $t=0$ to $t=T-1$, and
replacing $L(t)$ with $L$, leads to the following expression for
the estimate $\hat{x}_L(T)$ as a function of $L$:
$$\hat{x}_L(T) = A_L^T m_0 + \sum_{t=0}^{T-1} A_L^{T-t-1} L\, y(t), \tag{5}$$
where $A_L := A - LH$. Note that this estimate does not
require knowledge of the matrices $Q$ or $R$. By considering
$\hat{y}_L(T) := H\hat{x}_L(T)$, the problem is now that of finding the
optimal gain $L$ that minimizes the mean-squared prediction error
$$J^{\mathrm{est}}_T(L) := \mathbb{E}\,\|y(T) - \hat{y}_L(T)\|^2. \tag{6}$$
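To make (5)–(6) concrete, the following sketch rolls out the constant-gain filter on sampled trajectories and estimates the prediction error by Monte Carlo. The simulator, helper names, and numerical values below are ours and serve only as an illustration; in the learning problem itself, only the observation realizations would be available.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trajectory(A, H, Q, R, m0, P0, T, rng):
    """Sample one realization {y(0), ..., y(T)} of the model (1)."""
    x = rng.multivariate_normal(m0, P0)
    ys = []
    for _ in range(T + 1):
        ys.append(H @ x + rng.multivariate_normal(np.zeros(H.shape[0]), R))
        x = A @ x + rng.multivariate_normal(np.zeros(A.shape[0]), Q)
    return np.array(ys)

def xhat_constant_gain(A, H, L, m0, ys):
    """Estimate (5): roll out xhat(t+1) = A xhat(t) + L (y(t) - H xhat(t))."""
    xhat = m0.copy()
    for y in ys[:-1]:          # uses y(0), ..., y(T-1) only
        xhat = A @ xhat + L @ (y - H @ xhat)
    return xhat

def prediction_error(A, H, L, m0, ys_batch):
    """Monte Carlo estimate of J^est_T(L) = E ||y(T) - H xhat_L(T)||^2 in (6)."""
    errs = [np.sum((ys[-1] - H @ xhat_constant_gain(A, H, L, m0, ys)) ** 2)
            for ys in ys_batch]
    return np.mean(errs)

# toy usage (arbitrary illustrative values)
A = np.array([[0.9, 0.2], [0.0, 0.8]]); H = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2); R = np.array([[0.5]])
m0 = np.zeros(2); P0 = np.eye(2); T = 20
batch = [simulate_trajectory(A, H, Q, R, m0, P0, T, rng) for _ in range(200)]
print(prediction_error(A, H, np.array([[0.3], [0.1]]), m0, batch))
```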
Numerically, this problem falls into the realm of stochas-
tic optimization and can be solved by algorithms such
as Stochastic Gradient Descent (SGD). Such an algorithm
would require accessing independent realizations of the
observation signal. An algorithm that utilizes such realiza-
tions is presented in §V. Theoretically, however, it is not
yet clear if this optimization problem is well-posed and
admits a unique minimizer. This is the subject of §IV,
where certain properties of the objective function, such as
its gradient dominance and smoothness, are established.
These theoretical results are then used to analyze first-order
optimization algorithms and provide stability guarantees of
the estimation policy iterates. The results are based on the
duality relationship between estimation and control that is
presented next.
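Before turning to the duality relationship, here is a toy illustration of the stochastic-optimization viewpoint just described: a plain SGD loop over the gain entries, using a finite-difference gradient of the single-sample prediction error. The finite-difference surrogate, step size, and all numerical values are our simplifications for illustration only; the actual algorithm and its guarantees are the subject of §IV and §V.

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[0.9, 0.2], [0.0, 0.8]]); H = np.array([[1.0, 0.0]])
Q_true = 0.1 * np.eye(2); R_true = np.array([[0.5]])   # unknown to the learner
m0 = np.zeros(2); T = 20

def sample_observations(T):
    """One independent realization {y(0), ..., y(T)} of (1); only y is revealed."""
    x = m0.copy()
    ys = []
    for _ in range(T + 1):
        ys.append(H @ x + rng.multivariate_normal([0.0], R_true))
        x = A @ x + rng.multivariate_normal([0.0, 0.0], Q_true)
    return np.array(ys)

def sample_loss(L, ys):
    """Single-sample prediction error ||y(T) - H xhat_L(T)||^2, cf. (5)-(6)."""
    xhat = m0.copy()
    for y in ys[:-1]:
        xhat = A @ xhat + L @ (y - H @ xhat)
    return np.sum((ys[-1] - H @ xhat) ** 2)

L = np.zeros((2, 1))            # initial gain; A itself is Schur stable here
step, eps = 0.05, 1e-4          # step size chosen arbitrarily for illustration
for k in range(2000):
    ys = sample_observations(T)
    grad = np.zeros_like(L)
    for i in range(L.shape[0]):              # finite-difference gradient
        for j in range(L.shape[1]):
            E = np.zeros_like(L); E[i, j] = eps
            grad[i, j] = (sample_loss(L + E, ys) - sample_loss(L - E, ys)) / (2 * eps)
    L -= step * grad
```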
III. ESTIMATION-CONTROL DUALITY RELATIONSHIP
We use the duality framework, as described in [31,
Ch. 7.5], to relate the problem of learning the optimal
estimation policy to that of learning the optimal control policy
for an LQR problem. In order to do so, we introduce the
adjoint system:
$$z(t) = A^\top z(t+1) - H^\top u(t+1), \tag{7}$$
where $z(t) \in \mathbb{R}^n$ is the adjoint state and $\mathcal{U}_T :=
\{u(1), \dots, u(T)\} \in \mathbb{R}^{mT}$ are the control variables (dual to
the observation signal $\mathcal{Y}_T$). The adjoint state is initialized
at $z(T) = a \in \mathbb{R}^n$ and simulated backward in time starting
with $t = T-1$. We now formalize a relationship between
estimation policies for the system (1) and control policies
for the adjoint system (7). Consider estimation policies that
are linear functions of the observation history $\mathcal{Y}_T \in \mathbb{R}^{mT}$
and the initial mean vector $m_0 \in \mathbb{R}^n$. We characterize
such policies with a linear map $\mathcal{L}: \mathbb{R}^{mT+n} \to \mathbb{R}^n$ and
let the estimate $\hat{x}_{\mathcal{L}}(T) := \mathcal{L}(m_0, \mathcal{Y}_T)$. The adjoint of this
linear map, denoted by $\mathcal{L}^\dagger: \mathbb{R}^n \to \mathbb{R}^{mT+n}$, is used
to define a control policy for the adjoint system (7). In
particular, the adjoint map takes $a \in \mathbb{R}^n$ as input and outputs
$\mathcal{L}^\dagger(a) = \{b, u(1), \dots, u(T)\} \in \mathbb{R}^{mT+n}$. This relationship
can be depicted as
$$\{m_0, y(0), \dots, y(T-1)\} \;\xrightarrow{\;\mathcal{L}\;}\; \hat{x}_{\mathcal{L}}(T), \qquad \{b, u(1), \dots, u(T)\} \;\xleftarrow{\;\mathcal{L}^\dagger\;}\; a.$$
Note that $\langle a, \mathcal{L}(m_0, \mathcal{Y}_T)\rangle_{\mathbb{R}^n} = \langle \mathcal{L}^\dagger(a), (m_0, \mathcal{Y}_T)\rangle_{\mathbb{R}^{mT+n}}$, so
$$b^\top m_0 + \textstyle\sum_{t=0}^{T-1} u(t+1)^\top y(t) = a^\top \hat{x}_{\mathcal{L}}(T). \tag{8}$$
The following proposition relates the mean-squared error
of a linear estimation policy to the following LQR cost:
$$J^{\mathrm{LQR}}_T(a, \{b, \mathcal{U}_T\}) := \big[z^\top(0)m_0 - b^\top m_0\big]^2 + z^\top(0) P_0\, z(0) + \sum_{t=1}^{T} \big[z^\top(t) Q\, z(t) + u^\top(t) R\, u(t)\big]. \tag{9}$$
Proposition 1. Consider the estimation problem for the
system (1) and the LQR problem (9) subject to the adjoint
dynamics (7). For each estimation policy $\hat{x}_{\mathcal{L}}(T) = \mathcal{L}(m_0, \mathcal{Y}_T)$,
with a linear map $\mathcal{L}$, and for any $a \in \mathbb{R}^n$, we have the identity
$$\mathbb{E}\,\big|a^\top x(T) - a^\top \hat{x}_{\mathcal{L}}(T)\big|^2 = J^{\mathrm{LQR}}_T\big(a, \mathcal{L}^\dagger(a)\big).$$
Furthermore, the prediction error as in (6) satisfies
$$J^{\mathrm{est}}_T(\mathcal{L}) = \sum_{i=1}^{m} J^{\mathrm{LQR}}_T\big(H_i, \mathcal{L}^\dagger(H_i)\big) + \mathrm{tr}[R],$$
where $\hat{y}_{\mathcal{L}}(T) := H\hat{x}_{\mathcal{L}}(T)$ and $H_i^\top \in \mathbb{R}^n$ is the $i$-th row of
the $m \times n$ matrix $H$, for $i = 1, \dots, m$.
Remark 1. The duality also holds in the continuous-time
setting, where the estimation problem is related to a
continuous-time LQR problem. Recent extensions to the
nonlinear setting appear in [32], with a comprehensive study
in [33]. This duality is different from the maximum likelihood
approach, which involves an optimal control problem
over the original dynamics instead of the adjoint system.
1) Duality in the constant control gain regime: In this
section, we use the aforementioned duality relationship to
show that the estimation policy with constant gain is dual to
the control policy with constant feedback gain. This result
is then used to obtain an explicit formula for the objective
function (6).
Consider the adjoint system (7) with the linear feedback
law $u(t) = L^\top z(t)$. Then,
$$z(t) = (A_L^\top)^{T-t} a, \qquad \text{for } t = 0, 1, \dots, T. \tag{10}$$
Therefore, as a function of $a$, $u(t) = L^\top (A_L^\top)^{T-t} a$. Moreover,
for this choice of control, the optimal $b = z(0) = (A_L^\top)^T a$.
These relationships are used to identify the control
policy $\mathcal{L}^\dagger(a) = \big((A_L^\top)^T a,\; L^\top (A_L^\top)^{T-1} a,\; \dots,\; L^\top a\big)$. This
control policy corresponds to an estimation policy by the
adjoint relationship (8):
$$a^\top \hat{x}_L(T) = a^\top A_L^T m_0 + \textstyle\sum_{t=0}^{T-1} a^\top A_L^{T-t-1} L\, y(t), \qquad \forall a \in \mathbb{R}^n.$$
As this relationship holds for all $a \in \mathbb{R}^n$, we have
$$\hat{x}_L(T) = A_L^T m_0 + \textstyle\sum_{t=0}^{T-1} A_L^{T-t-1} L\, y(t),$$
which coincides with the Kalman filter estimate with constant
gain $L$ given by the formula (5). Therefore, the adjoint
relationship (8) relates the control policy with constant gain
$L^\top$ to the Kalman filter with the constant gain $L$.
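As a quick numerical sanity check of this correspondence, one can build the dual control policy from (10) for arbitrary matrices and verify that the adjoint identity (8) reproduces the constant-gain estimate (5). The matrices below are arbitrary test values of our choosing.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T = 3, 2, 6
A = 0.3 * rng.normal(size=(n, n))          # arbitrary test matrices
H = rng.normal(size=(m, n))
L = 0.1 * rng.normal(size=(n, m))
A_L = A - L @ H

m0 = rng.normal(size=n)
ys = [rng.normal(size=m) for _ in range(T)]          # y(0), ..., y(T-1)
a = rng.normal(size=n)

# left-hand side of (8): b^T m0 + sum_t u(t+1)^T y(t), with the policy from (10)
b = np.linalg.matrix_power(A_L.T, T) @ a
u = [L.T @ np.linalg.matrix_power(A_L.T, T - t) @ a for t in range(1, T + 1)]
lhs = b @ m0 + sum(u[t] @ ys[t] for t in range(T))   # u[t] stands for u(t+1)

# right-hand side of (8): a^T xhat_L(T) with xhat_L(T) from (5)
xhat = np.linalg.matrix_power(A_L, T) @ m0 + sum(
    np.linalg.matrix_power(A_L, T - t - 1) @ L @ ys[t] for t in range(T))
rhs = a @ xhat

assert np.isclose(lhs, rhs)
```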
Next, we use this relationship to evaluate the mean-
squared prediction error (6). Denote by $J^{\mathrm{LQR}}_T(a, L^\top)$ the
LQR cost (9) associated with the control policy with constant
gain $L^\top$ and $b = z(0)$. Then, from the explicit formulas for
$z(t)$ and $u(t)$ above, we have
$$J^{\mathrm{LQR}}_T(a, L^\top) = a^\top X_T(L)\, a,$$
where
$$X_T(L) := A_L^T P_0 (A_L^\top)^T + \sum_{t=1}^{T} A_L^{T-t} \big(Q + L R L^\top\big) (A_L^\top)^{T-t}.$$
Therefore, by the second claim in Proposition 1, the mean-
squared prediction error (6) becomes
$$J^{\mathrm{est}}_T(L) - \mathrm{tr}[R] = \sum_{i=1}^{m} J^{\mathrm{LQR}}_T(H_i, L^\top) = \mathrm{tr}\big[X_T(L) H^\top H\big],$$
where we have used the cyclic permutation property of the
trace and the identity $H^\top H = \sum_{i=1}^{m} H_i H_i^\top$.
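This identity can also be checked numerically: the sketch below compares the closed-form value $\mathrm{tr}[X_T(L) H^\top H] + \mathrm{tr}[R]$ with a Monte Carlo estimate of (6) obtained by rolling out the constant-gain filter. All system matrices are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# illustrative system (arbitrary values), cf. (1)
A = np.array([[0.9, 0.2], [0.0, 0.8]]); H = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2); R = np.array([[0.5]])
m0 = np.zeros(2); P0 = np.eye(2); T = 15
L = np.array([[0.3], [0.1]]); A_L = A - L @ H

# closed-form value: J^est_T(L) = tr[X_T(L) H^T H] + tr[R]
X_T = np.linalg.matrix_power(A_L, T) @ P0 @ np.linalg.matrix_power(A_L.T, T)
for t in range(1, T + 1):
    M = np.linalg.matrix_power(A_L, T - t)
    X_T += M @ (Q + L @ R @ L.T) @ M.T
closed_form = np.trace(X_T @ H.T @ H) + np.trace(R)

# Monte Carlo estimate of E||y(T) - H xhat_L(T)||^2 using the rollout (5)
def sample_error():
    x = rng.multivariate_normal(m0, P0)
    xhat = m0.copy()
    for _ in range(T):
        y = H @ x + rng.multivariate_normal([0.0], R)
        xhat = A @ xhat + L @ (y - H @ xhat)
        x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
    yT = H @ x + rng.multivariate_normal([0.0], R)
    return np.sum((yT - H @ xhat) ** 2)

mc = np.mean([sample_error() for _ in range(20000)])
print(closed_form, mc)   # the two values should agree up to sampling error
```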
2) Duality in the steady-state regime: Define the set of Schur
stabilizing gains
$$\mathcal{S} := \{L \in \mathbb{R}^{n \times m} : \rho(A - LH) < 1\}.$$
For any $L \in \mathcal{S}$, in the steady-state limit as $T \to \infty$,
$$X_T(L) \to X(L) := \sum_{t=0}^{\infty} (A_L)^t \big(Q + L R L^\top\big) (A_L^\top)^t.$$
The limit coincides with the unique solution $X$ of the
discrete Lyapunov equation $X = A_L X A_L^\top + Q + L R L^\top$,
which exists as $\rho(A_L) < 1$. Therefore, the steady-state limit
of the mean-squared prediction error assumes the form
$$J(L) := \lim_{T \to \infty} J^{\mathrm{est}}_T(L) = \mathrm{tr}\big[X(L) H^\top H\big] + \mathrm{tr}[R].$$
Given the steady-state limit, we formally analyze the
following constrained optimization problem:
$$\min_{L \in \mathcal{S}} \; J(L) = \mathrm{tr}\big[X(L) H^\top H\big] + \mathrm{tr}[R], \tag{11}$$
$$\text{s.t.} \quad X(L) = A_L X(L) A_L^\top + Q + L R L^\top.$$
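For a given stabilizing gain, the constraint in (11) is a discrete Lyapunov equation, so the objective can be evaluated directly whenever $Q$ and $R$ are available (e.g., for benchmarking a learned gain). A minimal sketch using SciPy's discrete Lyapunov solver, with our arbitrary test matrices:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def steady_state_cost(A, H, Q, R, L):
    """Evaluate J(L) = tr[X(L) H^T H] + tr[R], where X(L) solves
    X = A_L X A_L^T + Q + L R L^T with A_L = A - L H (requires rho(A_L) < 1)."""
    A_L = A - L @ H
    if np.max(np.abs(np.linalg.eigvals(A_L))) >= 1:
        return np.inf                      # L is outside the stabilizing set S
    X = solve_discrete_lyapunov(A_L, Q + L @ R @ L.T)
    return np.trace(X @ H.T @ H) + np.trace(R)

# illustrative values, comparing two candidate gains
A = np.array([[0.9, 0.2], [0.0, 0.8]]); H = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2); R = np.array([[0.5]])
print(steady_state_cost(A, H, Q, R, np.array([[0.3], [0.1]])))
print(steady_state_cost(A, H, Q, R, np.array([[0.6], [0.2]])))
```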
Remark 2. Note that the latter problem is technically the
dual of the optimal LQR problem as formulated in [20], by
relating $A \leftrightarrow A^\top$, $H \leftrightarrow B^\top$, $L \leftrightarrow K^\top$, and $H^\top H \leftrightarrow \Sigma$.
However, one main difference here is that the matrices $Q$
and $R$ are unknown, and $H^\top H$ may not be positive
definite, for example, due to rank deficiency in $H$, especially
whenever $m < n$. Thus, in general, the cost function $J(L)$
is not necessarily coercive in $L$, which can drastically affect
the optimization landscape. For the same reason, in contrast
to the LQR case [20], [22], the gradient dominance property
of $J(L)$ is not clear in the filtering setup. In the next section,
we show that such issues can be avoided as long as the pair
$(A, H)$ is observable.