
where $x(t) \in \mathbb{R}^n$ is the state of the system, $y(t) \in \mathbb{R}^m$ is
the observation, and $\{\xi(t)\}_{t \in \mathbb{Z}}$ and $\{\omega(t)\}_{t \in \mathbb{Z}}$ are the uncorrelated
zero-mean process and measurement noise vectors,
respectively, with the following covariances,
$$\mathbb{E}[\xi(t)\xi^\top(t)] = Q \in \mathbb{R}^{n \times n}, \qquad \mathbb{E}[\omega(t)\omega^\top(t)] = R \in \mathbb{R}^{m \times m},$$
for some (possibly time-varying) positive (semi-)definite
matrices $Q, R \succeq 0$. Let $m_0$ and $P_0 \succeq 0$ denote the mean
and covariance of the initial condition $x_0$.
Now, let us fix a time horizon $T > 0$ and define an
estimation policy, denoted by $\mathcal{P}$, as a map that takes a history
of the observation signal $\mathcal{Y}_T = \{y(0), y(1), \ldots, y(T-1)\}$ as
an input and outputs an estimate of the state $x(T)$, denoted
by $\hat{x}_{\mathcal{P}}(T)$. The filtering problem of interest is finding the
estimation policy $\mathcal{P}$ that minimizes the mean-squared error,
$$\mathbb{E}\,\|x(T) - \hat{x}_{\mathcal{P}}(T)\|^2. \tag{2}$$
We make the following assumptions in our problem setup:
1) The matrices $A$ and $H$ are known, but the process and
measurement noise covariance matrices, $Q$ and $R$, are
not available. 2) We have access to a training data-set that
consists of independent realizations of the observation signal
$\{y(t)\}_{t=0}^{T}$. However, ground-truth measurements of $x(T)$ are
not available.\footnote{This setting arises in various applications, such as aircraft wing dynamics, when approximate or reduced-order models are employed, and the effect of unmodelled dynamics and disturbances is captured by the process noise.}
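For illustration only, the following sketch shows how such a data-set of independent observation realizations could be generated from the model (1) when $Q$, $R$, $m_0$, and $P_0$ are available to a simulator (e.g., for synthetic experiments). The Gaussian noise assumption and all names here are our own and not part of the problem setup, in which the learner only sees the resulting observations.

```python
import numpy as np

def sample_observation_trajectory(A, H, Q, R, m0, P0, T, rng=None):
    """Draw one realization {y(0), ..., y(T)} of the observation signal from
    the linear model (1), using Gaussian noise purely for illustration."""
    rng = rng if rng is not None else np.random.default_rng()
    n, m = A.shape[0], H.shape[0]
    x = rng.multivariate_normal(m0, P0)                               # x(0) ~ (m0, P0)
    ys = []
    for _ in range(T + 1):
        ys.append(H @ x + rng.multivariate_normal(np.zeros(m), R))    # y(t) = H x(t) + w(t)
        x = A @ x + rng.multivariate_normal(np.zeros(n), Q)           # x(t+1) = A x(t) + xi(t)
    return ys

# A training data-set is then a collection of N independent realizations:
# dataset = [sample_observation_trajectory(A, H, Q, R, m0, P0, T) for _ in range(N)]
```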
It is not possible to directly minimize (2) as the ground-truth
measurement $x(T)$ is not available. Instead, we propose
to minimize the mean-squared error in predicting the observation
$y(T)$ as a surrogate objective function. In particular,
let us first define $\hat{y}_{\mathcal{P}}(T) = H\hat{x}_{\mathcal{P}}(T)$ as the prediction
for the observation $y(T)$. This is indeed a prediction since
the estimate $\hat{x}_{\mathcal{P}}(T)$ depends only on the observations up
to time $T-1$. The optimization problem is now finding
the estimation policy $\mathcal{P}$ that minimizes the mean-squared
prediction error,
$$J^{\text{est}}_T(\mathcal{P}) := \mathbb{E}\,\|y(T) - \hat{y}_{\mathcal{P}}(T)\|^2. \tag{3}$$
1) Kalman filter: Indeed, when $Q$ and $R$ are known,
the solution is given by the celebrated Kalman filter algorithm [2].
The algorithm involves an iterative procedure to
update the estimate $\hat{x}(t)$ according to
$$\hat{x}(t+1) = A\hat{x}(t) + L(t)\bigl(y(t) - H\hat{x}(t)\bigr), \quad \hat{x}(0) = m_0, \tag{4}$$
where $L(t) := AP(t)H^\top(HP(t)H^\top + R)^{-1}$ is the Kalman
gain, and $P(t) := \mathbb{E}[(x(t)-\hat{x}(t))(x(t)-\hat{x}(t))^\top]$ is the error
covariance matrix that satisfies the Riccati equation,
$$P(t+1) = (A - L(t)H)P(t)A^\top + Q, \quad P(0) = P_0.$$
Note that the update law presented here combines the
information and dynamic update steps of the Kalman filter.
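To make the combined recursion concrete, here is a minimal numerical sketch of the update (4) together with the Riccati recursion; the matrices `A`, `H`, `Q`, `R`, the initial statistics `m0`, `P0`, and the measurement list `ys` are placeholders supplied by the user, and this is only an illustrative implementation of the standard recursion.

```python
import numpy as np

def kalman_filter(A, H, Q, R, m0, P0, ys):
    """Run the combined (dynamic + information) update (4) for t = 0, ..., T-1.

    ys is the list [y(0), ..., y(T-1)]; returns the estimate x_hat(T) and P(T)."""
    x_hat, P = m0.copy(), P0.copy()
    for y in ys:
        # Kalman gain L(t) = A P(t) H' (H P(t) H' + R)^{-1}
        L = A @ P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        # Estimate update: x_hat(t+1) = A x_hat(t) + L(t) (y(t) - H x_hat(t))
        x_hat = A @ x_hat + L @ (y - H @ x_hat)
        # Riccati recursion: P(t+1) = (A - L(t) H) P(t) A' + Q
        P = (A - L @ H) @ P @ A.T + Q
    return x_hat, P
```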
It is known that $P(t)$ converges to a steady-state value
$P_\infty$ when the pair $(A, H)$ is observable and the pair $(A, Q^{1/2})$
is controllable [29], [30]. In such a case, the gain converges
to $L_\infty := AP_\infty H^\top(HP_\infty H^\top + R)^{-1}$, the so-called steady-state
Kalman gain. It is common practice to evaluate the
steady-state Kalman gain $L_\infty$ offline and use it, instead of
$L(t)$, to update the estimate in real time.
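As an illustration of this offline step, the steady-state gain can be approximated by iterating the Riccati recursion until the covariance stops changing. This is a minimal sketch assuming $Q$ and $R$ are known; it is not a statement about how $L_\infty$ is computed in the present setting, where those matrices are unavailable.

```python
import numpy as np

def steady_state_gain(A, H, Q, R, P0, tol=1e-9, max_iter=10_000):
    """Iterate P(t+1) = (A - L(t)H) P(t) A' + Q until convergence and return
    the steady-state Kalman gain L_inf = A P_inf H' (H P_inf H' + R)^{-1}."""
    P = P0.copy()
    for _ in range(max_iter):
        L = A @ P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        P_next = (A - L @ H) @ P @ A.T + Q
        if np.linalg.norm(P_next - P) < tol:
            P = P_next
            break
        P = P_next
    return A @ P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
```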
2) Learning the optimal Kalman gain: Inspired by the
structure of the Kalman filter, we consider restriction of the
estimation policies $\mathcal{P}$ to those realized with a constant gain.
In particular, we define the estimate $\hat{x}_L(T)$ as one given by
the Kalman filter at time $T$ realized by the constant gain $L$.
Rolling out the update law (4) for $t = 0$ to $t = T-1$, and
replacing $L(t)$ with $L$, leads to the following expression for
the estimate $\hat{x}_L(T)$ as a function of $L$,
$$\hat{x}_L(T) = A_L^T m_0 + \sum_{t=0}^{T-1} A_L^{T-t-1} L\, y(t), \tag{5}$$
where $A_L := A - LH$. Note that this estimate does not
require knowledge of the matrices $Q$ or $R$. By considering
$\hat{y}_L(T) := H\hat{x}_L(T)$, the problem is now finding the optimal
gain $L$ that minimizes the mean-squared prediction error
$$J^{\text{est}}_T(L) := \mathbb{E}\,\|y(T) - \hat{y}_L(T)\|^2. \tag{6}$$
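For concreteness, the rollout (5) and an empirical version of the objective (6) can be written as follows; the constant gain `L_gain` and the sampled trajectories `trajectories` (each a list $y(0), \ldots, y(T)$) are hypothetical inputs, and the expectation in (6) is replaced by a sample average over the training data.

```python
import numpy as np

def rollout_estimate(A, H, L_gain, m0, ys):
    """Constant-gain rollout (5): x_hat_L(T) = A_L^T m0 + sum_t A_L^{T-t-1} L y(t)."""
    x_hat = m0.copy()
    for y in ys:  # ys = [y(0), ..., y(T-1)]
        x_hat = A @ x_hat + L_gain @ (y - H @ x_hat)
    return x_hat

def empirical_prediction_error(A, H, L_gain, m0, trajectories):
    """Sample-average surrogate for (6): mean of ||y(T) - H x_hat_L(T)||^2."""
    errors = []
    for traj in trajectories:  # traj = [y(0), ..., y(T)]
        x_hat_T = rollout_estimate(A, H, L_gain, m0, traj[:-1])
        errors.append(np.sum((traj[-1] - H @ x_hat_T) ** 2))
    return float(np.mean(errors))
```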
Numerically, this problem falls into the realm of stochas-
tic optimization and can be solved by algorithms such
as Stochastic Gradient Descent (SGD). Such an algorithm
would require accessing independent realizations of the
observation signal. An algorithm that utilizes such realiza-
tions is presented in §V. Theoretically, however, it is not
yet clear whether this optimization problem is well-posed and
admits a unique minimizer. This is the subject of §IV,
where certain properties of the objective function, such as
its gradient dominance and smoothness, are established.
These theoretical results are then used to analyze first-order
optimization algorithms and provide stability guarantees for
the estimation policy iterates. The results are based on the
duality relationship between estimation and control that is
presented next.
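As a rough illustration of how such a stochastic optimization could proceed (the actual algorithm of this paper is the one presented in §V), the sketch below takes one sampled observation trajectory per iteration and updates the constant gain by a gradient step on the squared prediction error. For simplicity it uses a finite-difference (zeroth-order) gradient estimate rather than an analytic gradient; `sample_traj`, `L_init`, and the step size are hypothetical placeholders.

```python
import numpy as np

def sgd_on_gain(A, H, m0, sample_traj, L_init, step_size=1e-3, iters=1000, eps=1e-6):
    """Zeroth-order SGD sketch on the constant gain L: at each iteration, draw one
    observation trajectory and step along a finite-difference estimate of the
    gradient of ||y(T) - H x_hat_L(T)||^2 with respect to L."""
    L = L_init.copy()
    for _ in range(iters):
        traj = sample_traj()                   # one realization [y(0), ..., y(T)]
        ys, y_T = traj[:-1], traj[-1]

        def loss(L_mat):
            x_hat = m0.copy()
            for y in ys:                       # constant-gain rollout (5)
                x_hat = A @ x_hat + L_mat @ (y - H @ x_hat)
            return np.sum((y_T - H @ x_hat) ** 2)

        grad = np.zeros_like(L)
        base = loss(L)
        for i in range(L.shape[0]):            # finite-difference gradient estimate
            for j in range(L.shape[1]):
                L_pert = L.copy()
                L_pert[i, j] += eps
                grad[i, j] = (loss(L_pert) - base) / eps
        L -= step_size * grad
    return L
```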
III. ESTIMATION-CONTROL DUALITY RELATIONSHIP
We use the duality framework, as described in [31,
Ch.7.5], to relate the problem of learning the optimal es-
timation policy to that of learning the optimal control policy
for an LQR problem. In order to do so, we introduce the
adjoint system:
$$z(t) = A^\top z(t+1) - H^\top u(t+1), \tag{7}$$
where $z(t) \in \mathbb{R}^n$ is the adjoint state and $\mathcal{U}_T :=
\{u(1), \ldots, u(T)\} \in \mathbb{R}^{mT}$ are the control variables (dual to
the observation signal $\mathcal{Y}_T$). The adjoint state is initialized
at $z(T) = a \in \mathbb{R}^n$ and simulated backward in time starting
with $t = T-1$. We now formalize a relationship between
estimation policies for the system (1) and control policies
for the adjoint system (7). Consider estimation policies that
are linear functions of the observation history $\mathcal{Y}_T \in \mathbb{R}^{mT}$
and the initial mean vector $m_0 \in \mathbb{R}^n$. We characterize
such policies with a linear map $\mathcal{L} : \mathbb{R}^{mT+n} \to \mathbb{R}^n$ and
let the estimate $\hat{x}_{\mathcal{L}}(T) := \mathcal{L}(m_0, \mathcal{Y}_T)$. The adjoint of this
linear map, denoted by $\mathcal{L}^\dagger : \mathbb{R}^n \to \mathbb{R}^{mT+n}$, is used