ONLINE AND LIGHTWEIGHT KERNEL-BASED APPROXIMATE POLICY ITERATION
FOR DYNAMIC P-NORM LINEAR ADAPTIVE FILTERING
Yuki Akiyama, Minh Vu, Konstantinos Slavakis
Tokyo Institute of Technology, Japan
Department of Information and Communications Engineering
Emails: {akiyama.y.am, vu.d.aa, slavakis.k.aa}@m.titech.ac.jp
ABSTRACT
This paper introduces a solution to the problem of selecting dynamically (online) the “optimal” p-norm to combat outliers in linear adaptive filtering without any knowledge of the probability density function of the outliers. The proposed online and data-driven framework is built on kernel-based reinforcement learning (KBRL). To this end, novel Bellman mappings on reproducing kernel Hilbert spaces (RKHSs) are introduced. These mappings require no knowledge of the transition probabilities of Markov decision processes, and are nonexpansive with respect to the underlying Hilbertian norm. The fixed-point sets of the proposed Bellman mappings are utilized to build an approximate policy-iteration (API) framework for the problem at hand. To address the “curse of dimensionality” in RKHSs, random Fourier features are utilized to bound the computational complexity of the API. Numerical tests on synthetic data for several outlier scenarios demonstrate the superior performance of the proposed API framework over several non-RL and KBRL schemes.
1. INTRODUCTION
The least-squares (LS) error/loss (between an observed value and its predicted one) plays a pivotal role in signal processing, e.g., adaptive filtering [1], and machine learning [2]. For example, the least-mean-squares (LMS) and recursive (R)LS [1] are two celebrated algorithms in adaptive filtering and stochastic approximation based on the LS-error criterion. Notwithstanding, LS methods are notoriously sensitive to the presence of outliers within data [3], where outliers are defined as (sparsely) contaminating data that do not adhere to a nominal data-generation model, and are often modeled as random variables (RVs) with non-Gaussian heavy-tailed distributions, e.g., α-stable ones [4]. To combat outliers, several non-LS criteria, such as least mean p-power (LMP) [5–10] and maximum correntropy (MC) [11], have been studied. This work focuses on the LMP criterion, owing to the well-documented robustness of LMP against outliers [12], while results on MC will be reported elsewhere.
This study is built on the classical data-generation model $y_n = \boldsymbol{\theta}_*^{\intercal} \mathbf{x}_n + o_n$, where $n \in \mathbb{N}$ denotes discrete time ($\mathbb{N}$ is the set of all non-negative integers), $\boldsymbol{\theta}_*$ is the $L \times 1$ vector whose entries are the system parameters that need to be identified, $o_n$ is the RV which models outliers/noise, $(\mathbf{x}_n, y_n)$ stands for the input-output pair of available data, where $\mathbf{x}_n$ is an $L \times 1$ vector and $y_n$ is real-valued, and $\intercal$ denotes vector/matrix transposition. For an arbitrarily fixed $\boldsymbol{\theta}_0$, the LMP algorithm [5] generates estimates $(\boldsymbol{\theta}_n)_{n \in \mathbb{N}}$ of $\boldsymbol{\theta}_*$ according to the following recursion:
\[
\boldsymbol{\theta}_{n+1} := \boldsymbol{\theta}_n + \rho\, p\, |e_n|^{p-2} e_n \mathbf{x}_n, \tag{1}
\]
where $e_n := y_n - \mathbf{x}_n^{\intercal} \boldsymbol{\theta}_n$, $\rho$ is the learning rate (step size), and $p$ is a fixed user-defined real-valued number within the interval $[1, 2]$ to ensure that the $p$-norm loss $|y_n - \mathbf{x}_n^{\intercal} \boldsymbol{\theta}|^p$ is a convex function of $\boldsymbol{\theta}$ [5]. Notice that if $p = 1$ and $2$, then (1) boils down to the classical sign-LMS and LMS, respectively [1].
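For concreteness, here is a minimal NumPy sketch of recursion (1). The function name `lmp_update`, the default values of $p$ and $\rho$, and the heavy-tailed noise used in the short run below are illustrative choices, not taken from the paper.

```python
import numpy as np

def lmp_update(theta, x, y, p=1.5, rho=0.01):
    """One LMP iteration of recursion (1):
    theta_{n+1} = theta_n + rho * p * |e_n|^{p-2} * e_n * x_n,
    with e_n = y_n - x_n^T theta_n.  p = 2 recovers LMS, p = 1 recovers sign-LMS."""
    e = y - x @ theta  # prediction error e_n
    # |e|^{p-2} * e is rewritten as |e|^{p-1} * sign(e) to avoid 0 raised to a
    # negative power when p < 2 and e = 0.
    return theta + rho * p * np.abs(e) ** (p - 1) * np.sign(e) * x

# Illustrative run on y_n = theta_*^T x_n + o_n with heavy-tailed (outlier-prone) noise.
rng = np.random.default_rng(0)
L = 5
theta_star = rng.standard_normal(L)
theta = np.zeros(L)
for n in range(2000):
    x_n = rng.standard_normal(L)
    y_n = theta_star @ x_n + 0.1 * rng.standard_t(df=1.5)  # noise/outlier RV o_n
    theta = lmp_update(theta, x_n, y_n, p=1.3, rho=0.05)
```

The rewrite of $|e_n|^{p-2} e_n$ as $|e_n|^{p-1}\,\mathrm{sign}(e_n)$ is algebraically equivalent for $e_n \neq 0$ and simply sidesteps the singularity at $e_n = 0$.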
Intuition suggests that the choice of $p$ should be based on the probability density function (PDF) of the RV $o_n$. For example, if $o_n$ obeys a Gaussian PDF, then $p = 2$ should be chosen (recall the maximum-likelihood criterion). To enhance robustness against outliers, combinations of adaptive filters with different forgetting factors, but with the same fixed $p$-norm, have also been introduced [8]. Nevertheless, it appears that an online and data-driven solution to the problem of dynamically selecting $p$, without any prior knowledge of the PDF of $o_n$, is yet to be found.
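As a brief illustrative aside (a standard maximum-likelihood argument, not taken from the original text): if $o_n$ is assumed to follow a generalized Gaussian PDF with shape parameter $p$ and scale $\sigma$,
\[
f(o) \propto \exp\!\bigl(-|o|^{p}/(p\sigma^{p})\bigr),
\quad\text{then}\quad
\arg\max_{\boldsymbol{\theta}} f\bigl(y_n - \mathbf{x}_n^{\intercal}\boldsymbol{\theta}\bigr)
= \arg\min_{\boldsymbol{\theta}} \bigl|y_n - \mathbf{x}_n^{\intercal}\boldsymbol{\theta}\bigr|^{p},
\]
so the Gaussian case ($p = 2$) recovers the LS criterion behind LMS, the Laplacian case ($p = 1$) recovers the absolute-error criterion behind sign-LMS, and values of $p$ between 1 and 2 interpolate between the two.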
This work offers a solution to the aforementioned open problem via reinforcement learning (RL) [13]; a machine-learning paradigm where an “agent” interacts with the surrounding environment to identify iteratively the policy which minimizes the cost of its “actions.” More specifically, the well-known policy-iteration (PI) framework [13] of RL is adopted, because of its well-documented merits (e.g., [14–16]) over the alternative RL frameworks of temporal-difference (TD) and Q-learning [13], especially for continuous and high-dimensional state spaces. PI comprises two stages at every iteration $n$: policy evaluation and policy improvement. At policy evaluation, the current policy is evaluated by a Q-function [13], which represents, loosely speaking, the long-term cost that the agent would suffer had the current policy been chosen to determine the next state, whereas at the policy-improvement stage, the agent uses the Q-function value to update the policy. The underlying state space is considered to be continuous, due to the nature of $(\mathbf{x}_n, y_n)$, while the action space is considered to be discrete: an action is a value of $p$ taken from a finite grid of the interval $[1, 2]$. Deep neural networks offer approximating spaces for Q-functions, e.g., [17], but they may require processing of batch data (even retraining) during online-mode operation, since they may face test data generated by PDFs different from those of the training ones (dynamic environments). Such batch processing inflicts large computational times and complexity, discouraging the application of deep neural networks to online modes of operation where a small computational footprint is desired. To meet such computational-complexity requirements, this study builds an approximate (A)PI framework for online RL along the lines of kernel-based (KB)RL [14–16, 18–25].
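To illustrate the two PI stages over a finite action grid of p-values, the following is a generic policy-iteration skeleton in Python. The names (`P_GRID`, `greedy_policy`, `policy_iteration`, `dummy_evaluator`) and the scalar-state toy setting are hypothetical simplifications for exposition only; they do not reproduce the paper's RKHS-based Q-function construction.

```python
import numpy as np

P_GRID = np.linspace(1.0, 2.0, 5)  # discrete action set: candidate p-values in [1, 2]

def greedy_policy(q, state):
    """Policy improvement: pick the action (p-value) with the smallest long-term cost."""
    return P_GRID[int(np.argmin([q(state, a) for a in P_GRID]))]

def policy_iteration(evaluate_policy, n_iterations=10, init_policy=lambda s: 2.0):
    """Generic PI loop: alternate policy evaluation and greedy policy improvement.
    `evaluate_policy(policy)` must return a Q-function q(state, action) estimating the
    long-term cost of taking `action` at `state` and following `policy` afterwards."""
    policy = init_policy
    for _ in range(n_iterations):
        q = evaluate_policy(policy)                  # policy evaluation
        policy = lambda s, q=q: greedy_policy(q, s)  # policy improvement
    return policy

# Toy usage with a dummy evaluator (purely illustrative): the "state" is a scalar proxy
# for recent error statistics; the cost prefers small p for an outlier-heavy state and
# p close to 2 otherwise.
def dummy_evaluator(policy):
    return lambda state, action: (action - (2.0 - min(abs(state), 1.0))) ** 2

pi = policy_iteration(dummy_evaluator)
print(pi(0.2), pi(5.0))  # p near 2 for a "clean" state, p = 1 for an "outlier" state
```

In the paper's setting, the placeholder `evaluate_policy` step is where the proposed RKHS-based Bellman mappings enter, replacing the dummy cost with a data-driven Q-function estimate.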
Central to the proposed API is the construction of novel Bellman mappings [13, 26]. The proposed Bellman mappings are defined on a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ [27, 28], which serves as the approximating space for the Q-functions. Unlike the classical Bellman operators, where information on transition probabilities in a Markov decision process is needed [13], the proposed Bellman mappings make no use of such information, and need neither training/offline data nor past policies, but sample and average the sample space on the fly, at each iteration $n$, to perform exploration of the surrounding environment. This suits the current adaptive-filtering setting, where the presence of outliers, with a possibly time-varying