Robust Q-learning Algorithm for Markov Decision Processes
under Wasserstein Uncertainty
Ariel Neufeld a, Julian Sester b
a NTU Singapore, Division of Mathematical Sciences, 21 Nanyang Link, Singapore 637371.
b National University of Singapore, Department of Mathematics, 21 Lower Kent Ridge Road, Singapore 119077.
Corresponding author: A. Neufeld. Email addresses: ariel.neufeld@ntu.edu.sg (Ariel Neufeld), jul ses@nus.edu.sg (Julian Sester).
Abstract
We present a novel Q-learning algorithm tailored to solve distributionally robust Markov decision problems where the corre-
sponding ambiguity set of transition probabilities for the underlying Markov decision process is a Wasserstein ball around a
(possibly estimated) reference measure. We prove convergence of the presented algorithm and provide several examples also
using real data to illustrate both the tractability of our algorithm and the benefits of considering distributional robustness
when solving stochastic optimal control problems, in particular when the estimated distributions turn out to be misspecified
in practice.
Key words: Q-learning; Markov Decision Process; Wasserstein Uncertainty; Distributionally Robust Optimization;
Reinforcement Learning.
1 Introduction
The Q-learning algorithm, which is popular among practitioners and widely applied, provides a tractable reinforcement learning methodology to solve Markov decision problems (MDP). The Q-learning algorithm learns an optimal policy online by observing at each time the current state of the underlying process as well as the reward depending on the current (and possibly next) state when acting according to a (not necessarily optimal) policy, and by assuming that one acts optimally from the next state onwards. The observed rewards determine a function Q depending on a state-action pair that describes the quality of the chosen action when being in the observed state. After a sufficient number of observations the function Q then allows one to decide, in each state, which actions possess the highest quality. In this way the Q-learning algorithm determines an optimal policy.
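For readers less familiar with the classical algorithm, the following minimal Python sketch shows the tabular Q-learning update described above; the function name, the constant learning rate and the table layout are illustrative assumptions rather than notation from the paper.

```python
import numpy as np

def q_update(Q, x, a, reward, x_next, discount=0.95, lr=0.1):
    """Classical (non-robust) tabular Q-learning step:
    Q(x, a) <- (1 - lr) * Q(x, a) + lr * (reward + discount * max_a' Q(x', a'))."""
    target = reward + discount * np.max(Q[x_next])
    Q[x, a] = (1.0 - lr) * Q[x, a] + lr * target
    return Q

# Q is a table with one row per state and one column per action,
# e.g. Q = np.zeros((n_states, n_actions)), updated after every observed transition.
```

The robust algorithm presented later modifies this recursion to account for the ambiguity of the transition probabilities; see Section 3.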
The Q-learning algorithm was initially proposed in
Watkins’ PhD thesis ([57]). [27] and [58] then provided
a rigorous mathematical proof of the convergence of
the Q-learning algorithm to the optimal Q-value func-
tion using results from stochastic approximation theory
(see e.g. [16] and [41]). The design of the Q-learning
algorithm as well as the proof of its convergence to the
optimal Q-value both rely on the dynamic programming
principle of the corresponding Markov decision problem,
which allows one to find an optimal policy for the involved
infinite horizon stochastic optimal control problem by
solving a single time-step optimization problem. We refer
to [1], [2], [3], [11], [12], [24], [25], [28], [29], [35], [38], and
[55] for various successful applications of the Q-learning
algorithm.
Recently, much of the literature has started from the viewpoint that one might have an estimate of the correct transition probability of the underlying Markov decision process, for example through the empirical measure derived from past observed data, but one faces the risk of misspecifying the correct distribution and hence would like to consider a distributionally robust Markov decision process (compare [5], [6], [13], [17], [23], [30], [31], [32], [37], [39], [47], [48], [52], [56], [59], [61], [62], [64], and [66]), also called a Markov decision process under model uncertainty, where one maximizes over the worst-case scenario among all probability measures of an ambiguity set of transition probabilities. We also refer, e.g., to [13], [14], [22], [52], [53], [60], and [63] for related distributionally robust stochastic control problems beyond the MDP setting. Indeed, as discussed in [31], there is a common risk in practice that
one cannot fully capture the probabilities of the real-
world environment due to its complexity and hence the
corresponding reinforcement learning algorithm will be
trained based on misspecified probabilities. In addition,
there is the risk that the environment shifts between the
training period and the testing period. This situation
can often be observed in practice as the future evolution
of random processes rarely behaves exactly according to,
for example, the observed historical evolution. One may think of financial markets as a prime example, where several financial crises have repeatedly revealed that the models in use were strongly misspecified. We refer to [31] for further examples, e.g. in robotics, and a more general discussion of the need to consider distributionally robust Markov decision processes and corresponding reinforcement learning based algorithms.
While there have been many contributions in the literature on distributionally robust Markov decision problems, only very recently, to the best of our knowledge, has a first Q-learning algorithm been developed in [31] to solve distributionally robust Markov decision problems. More precisely, in [31] the authors introduced a Q-learning algorithm tailored to distributionally robust Markov decision problems where the corresponding ambiguity set of transition probabilities consists of all probability measures which are ε-close to a reference measure with respect to the Kullback-Leibler (KL) divergence, and proved its convergence to the optimal robust Q-value function.
The goal of this paper is to provide a Q-learning algo-
rithm which can solve distributionally robust Markov
decision problems where the corresponding ambiguity
set of transition probabilities for the underlying Markov
decision process is a Wasserstein ball around a (possi-
bly estimated) reference measure. We obtain theoretical
guarantees of convergence of our Q-learning algorithm
to the corresponding optimal robust Q-value function
(see also (12)). The design of our Q-learning algorithm
combines the dynamic programming principle of the cor-
responding Markov decision process under model uncer-
tainty (see, e.g., [37]) and a convex duality result for
worst-case expectations with respect to a Wasserstein
ball (see [4], [9], [19], [34], and [65]).
From an application point of view, considering the
Wasserstein distance has the crucial advantage that
a corresponding Wasserstein-ball consists of probabil-
ity measures which do not necessarily share the same
support as the reference measure, in contrast to the KL-divergence, where by definition probability measures
within a certain fixed distance to the reference measure
all need to have a corresponding support included in
the support of the reference measure. We highlight that
from a structural point of view, our Q-learning algo-
rithm is different from the one in [31], which roughly
speaking comes from the fact that the dual optimization
problem with respect to the Wasserstein distance has
a different structure than the corresponding one with
respect to the KL-divergence.
We demonstrate in several examples also using real data
that our robust Q-learning algorithm determines robust
policies that outperform non-robust policies, determined
by the classical Q-learning algorithm, when the
probabilities for the underlying Markov decision process
turn out to be misspecified.
The remainder of the paper is organized as follows. In Section 2
we introduce the underlying setting of the correspond-
ing Markov decision process under model uncertainty.
In Section 3 we present our new Q-learning algorithm
and provide our main result: the convergence of this al-
gorithm to the optimal robust Q-value function. Numer-
ical examples demonstrating the applicability as well as
the benefits of our Q-learning algorithm compared to
the classical Q-learning algorithm are provided in Sec-
tion 4. All proofs and auxiliary results are provided in
Appendices A.1 and A.2, respectively.
2 Setting and Preliminaries
In this section we provide the setting and introduce the quantities necessary to define our Q-learning algorithm for distributionally robust stochastic optimization problems under Wasserstein uncertainty.
2.1 Setting
Optimal control problems are defined on a state space containing all the states an underlying stochastic process can attain. We model this state space as a finite subset $\mathcal{X} \subseteq \mathbb{R}^d$, where $d \in \mathbb{N}$ refers to the dimension of the state space. We consider the robust control problem over an infinite time horizon, hence the space of all attainable states in this horizon is given by the infinite Cartesian product $\Omega := \mathcal{X}^{\mathbb{N}_0} = \mathcal{X} \times \mathcal{X} \times \cdots$, with the corresponding $\sigma$-algebra $\mathcal{F} := 2^{\mathcal{X}} \otimes 2^{\mathcal{X}} \otimes \cdots$. On $\Omega$ we consider a stochastic process that describes the states that are attained over time. To this end, we let $(X_t)_{t \in \mathbb{N}_0}$ be the canonical process on $\Omega$, defined by $X_t(x_0, x_1, \ldots, x_t, \ldots) := x_t$ for each $(x_0, x_1, \ldots, x_t, \ldots) \in \Omega$, $t \in \mathbb{N}_0$.
Given a realization $X_t$ of the underlying stochastic process at some time $t \in \mathbb{N}_0$, the outcome of the next state $X_{t+1}$ can be influenced through actions that are executed in dependence of the current state $X_t$. At any time the set of possible actions is given by a finite set $A \subseteq \mathbb{R}^m$, where $m \in \mathbb{N}$ is the dimension of the action space (also referred to as control space). The set of admissible policies $\mathcal{A}$ over the entire time horizon contains all sequences of actions that depend at any time only on the current
observation of the state process $(X_t)_{t \in \mathbb{N}_0}$, formalized by
$$
\mathcal{A} := \Big\{ \mathbf{a} = (a_t)_{t \in \mathbb{N}_0} ~\Big|~ (a_t)_{t \in \mathbb{N}_0} \colon \Omega \to A; ~ a_t \text{ is } \sigma(X_t)\text{-measurable for all } t \in \mathbb{N}_0 \Big\}
= \Big\{ \big(a_t(X_t)\big)_{t \in \mathbb{N}_0} ~\Big|~ a_t \colon \mathcal{X} \to A \text{ Borel measurable for all } t \in \mathbb{N}_0 \Big\}.
$$
The current state and the chosen action influence the outcome of the next state by influencing the probability distribution with which the subsequent state is realized. As we take into account model uncertainty we assume that the correct probability kernel is unknown and hence, for each given state $x$ and action $a$, we consider an ambiguity set of probability distributions representing the set of possible probability laws for the next state. We denote by $\mathcal{M}_1(\Omega)$ and $\mathcal{M}_1(\mathcal{X})$ the set of probability measures on $(\Omega, \mathcal{F})$ and $(\mathcal{X}, 2^{\mathcal{X}})$ respectively, and we assume that an ambiguity set of probability measures is modelled by a set-valued map
$$
\mathcal{X} \times A \ni (x,a) \to \mathcal{P}(x,a) \subseteq \mathcal{M}_1(\mathcal{X}). \tag{1}
$$
Hence, if at time $t \in \mathbb{N}_0$ the process $X_t$ attains the value $x \in \mathcal{X}$, and the agent decides to execute action $a \in A$, then $\mathcal{P}(x,a)$ describes the set of possible probability distributions with which the next state $X_{t+1}$ is realized. If $\mathcal{P}(x,a)$ is single-valued, then the state-action pair $(x,a)$ determines unambiguously the transition probability, and the setting coincides with the usual setting used for classical (i.e., non-robust) Markov decision processes, compare e.g. [7].
The ambiguity set of admissible probability distributions on $\Omega$ therefore depends on the initial state $x \in \mathcal{X}$ and the chosen policy $\mathbf{a} \in \mathcal{A}$. We define for every initial state $x \in \mathcal{X}$ and every policy $\mathbf{a} \in \mathcal{A}$ the set of admissible underlying probability distributions of $(X_t)_{t \in \mathbb{N}_0}$ by
$$
\mathcal{P}_{x,\mathbf{a}} := \Big\{ \delta_x \otimes P_0 \otimes P_1 \otimes \cdots ~\Big|~ \text{for all } t \in \mathbb{N}_0: ~ P_t \colon \mathcal{X} \to \mathcal{M}_1(\mathcal{X}) \text{ Borel-measurable, and } P_t(x_t) \in \mathcal{P}\big(x_t, a_t(x_t)\big) \text{ for all } x_t \in \mathcal{X} \Big\},
$$
where the notation $\mathbb{P} = \delta_x \otimes P_0 \otimes P_1 \otimes \cdots \in \mathcal{P}_{x,\mathbf{a}}$ abbreviates
$$
\mathbb{P}(B) := \sum_{x_0 \in \mathcal{X}} \cdots \sum_{x_t \in \mathcal{X}} \cdots \mathbb{1}_B\big((x_t)_{t \in \mathbb{N}_0}\big) \cdots P_{t-1}(x_{t-1}; \{x_t\}) \cdots P_0(x_0; \{x_1\}) \, \delta_x(\{x_0\}), \qquad B \in \mathcal{F}.
$$
Remark 1 In the literature on robust Markov decision processes one refers to $\mathcal{P}_{x,\mathbf{a}}$ as being $(s,a)$-rectangular, see, e.g., [26], [45], [59]. This is a common assumption which turns out to be crucial to obtain a dynamic programming principle (see, e.g., [37, Theorem 2.7] and [43]) and therefore to enable efficient and tractable computations. Indeed, if one weakens this assumption the problem becomes computationally more expensive (see, e.g., [8, Section 2]), or can be provably intractable (compare [30]) and therefore cannot be solved by dynamic programming methods. Several approaches to solve robust MDPs w.r.t. non-rectangular ambiguity sets using methods other than dynamic programming have, however, recently been proposed, and are described in [21], [30], and [50].
To determine optimal policies we reward actions in dependence of the current state-action pair and the subsequent realized state. To this end, let $r \colon \mathcal{X} \times A \times \mathcal{X} \to \mathbb{R}$ be some reward function, and let $\alpha \in \mathbb{R}$ be a discount factor fulfilling
$$
0 < \alpha < 1. \tag{2}
$$
Then, our robust optimization problem consists, for every initial value $x \in \mathcal{X}$, in maximizing the expected value of $\sum_{t=0}^{\infty} \alpha^t r(X_t, a_t, X_{t+1})$ under the worst-case measure from $\mathcal{P}_{x,\mathbf{a}}$ over all possible policies $\mathbf{a} \in \mathcal{A}$. More precisely, we aim for every $x \in \mathcal{X}$ to maximize $\inf_{\mathbb{P} \in \mathcal{P}_{x,\mathbf{a}}} \mathbb{E}_{\mathbb{P}}\big[\sum_{t=0}^{\infty} \alpha^t r(X_t, a_t, X_{t+1})\big]$ among all policies $\mathbf{a} \in \mathcal{A}$. The value function given by
$$
\mathcal{X} \ni x \mapsto V(x) := \sup_{\mathbf{a} \in \mathcal{A}} \inf_{\mathbb{P} \in \mathcal{P}_{x,\mathbf{a}}} \mathbb{E}_{\mathbb{P}}\left[\sum_{t=0}^{\infty} \alpha^t r(X_t, a_t, X_{t+1})\right] \tag{3}
$$
then describes the expectation of $\sum_{t=0}^{\infty} \alpha^t r(X_t, a_t, X_{t+1})$ under the worst-case measure from $\mathcal{P}_{x,\mathbf{a}}$ and under the optimal policy from $\mathcal{A}$, in dependence of the initial value.
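To make the objective in (3) concrete, the following Python sketch estimates the inner expectation $\mathbb{E}_{\mathbb{P}}\big[\sum_t \alpha^t r(X_t,a_t,X_{t+1})\big]$ for one fixed transition kernel and one fixed stationary policy by truncated simulation. The array layout and parameter names are our own assumptions and not part of the paper's setup; the robust value itself would additionally require the infimum over the ambiguity set and the supremum over all policies.

```python
import numpy as np

def discounted_reward_mc(P, r, policy, x0, alpha=0.95, horizon=200, n_paths=5000, seed=0):
    """Monte Carlo estimate of E_P[ sum_{t=0}^{horizon-1} alpha^t r(X_t, a_t, X_{t+1}) ]
    for ONE fixed transition kernel P and ONE stationary policy.  Truncating the
    infinite sum at `horizon` gives an error of at most alpha**horizon * max|r| / (1 - alpha).

    Hypothetical shapes: P[x, a] is a probability vector over next states,
    r[x, a, y] the reward of the transition, policy[x] the action taken in state x."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    estimates = np.empty(n_paths)
    for i in range(n_paths):
        x, acc, disc = x0, 0.0, 1.0
        for _ in range(horizon):
            a = policy[x]
            y = rng.choice(n_states, p=P[x, a])   # sample the next state under the kernel P
            acc += disc * r[x, a, y]
            disc *= alpha
            x = y
        estimates[i] = acc
    return estimates.mean()
```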
2.2 Specification of the Ambiguity Sets
To specify the ambiguity set $\mathcal{P}(x,a)$ for each $(x,a) \in \mathcal{X} \times A$, we first consider for each $(x,a) \in \mathcal{X} \times A$ a reference probability measure. In applications, this reference measure may be derived from observed data. Considering an ambiguity set related to this reference measure then allows one to account for future deviations from the historical behavior and therefore leads to a more robust optimal control problem that takes adverse scenarios into account, compare also [37]. To that end, let
$$
\mathcal{X} \times A \ni (x,a) \mapsto \widehat{P}(x,a) \in \mathcal{M}_1(\mathcal{X}) \tag{4}
$$
be a probability kernel, where $\widehat{P}(x,a)$ acts as reference probability measure for each $(x,a) \in \mathcal{X} \times A$.
Then, for every $(x_0, \mathbf{a}) \in \mathcal{X} \times \mathcal{A}$ we denote by
$$
\widehat{\mathbb{P}}_{x_0,\mathbf{a}} := \delta_{x_0} \otimes \widehat{P}(\cdot, a_0(\cdot)) \otimes \widehat{P}(\cdot, a_1(\cdot)) \otimes \cdots \in \mathcal{M}_1(\Omega) \tag{5}
$$
the corresponding probability measure on $\Omega$ that determines the distribution of $(X_t)_{t \in \mathbb{N}_0}$ in dependence of the initial value $x_0 \in \mathcal{X}$ and the policy $\mathbf{a} \in \mathcal{A}$, i.e., we have for any $B \in \mathcal{F}$ that
$$
\widehat{\mathbb{P}}_{x_0,\mathbf{a}}(B) := \sum_{x_0 \in \mathcal{X}} \cdots \sum_{x_t \in \mathcal{X}} \cdots \mathbb{1}_B\big((x_t)_{t \in \mathbb{N}_0}\big) \cdots \widehat{P}\big(x_{t-1}, a_{t-1}(x_{t-1}); \{x_t\}\big) \cdots \widehat{P}\big(x_0, a_0(x_0); \{x_1\}\big) \, \delta_{x_0}(\{x_0\}).
$$
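In applications the reference kernel (4) is often the empirical distribution estimated from observed transitions, as mentioned in the introduction. A minimal Python sketch of such an estimator is given below; the data format and the uniform placeholder for unvisited state-action pairs are our own assumptions and purely illustrative.

```python
import numpy as np

def empirical_kernel(transitions, n_states, n_actions):
    """Estimate a reference kernel P_hat(x, a) as in (4) from observed transitions,
    given as an iterable of (state, action, next_state) index triples: P_hat(x, a) is
    the empirical distribution of the next state for each visited state-action pair."""
    counts = np.zeros((n_states, n_actions, n_states))
    for x, a, y in transitions:
        counts[x, a, y] += 1.0
    totals = counts.sum(axis=-1, keepdims=True)
    # state-action pairs that were never visited get a uniform placeholder distribution
    return np.divide(counts, totals, out=np.full_like(counts, 1.0 / n_states), where=totals > 0)
```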
We provide two specifications of ambiguity sets of probability measures $\mathcal{P}(x,a)$, $(x,a) \in \mathcal{X} \times A$, as defined in (1). Both ambiguity sets rely on the assumption that for each given $(x,a) \in \mathcal{X} \times A$ the uncertainty with respect to the underlying probability distribution is modelled through a Wasserstein ball around the reference probability measure $\widehat{P}(x,a)$ on $\mathcal{X}$.

To that end, for any $q \in \mathbb{N}$ and any $\mathbb{P}_1, \mathbb{P}_2 \in \mathcal{M}_1(\mathcal{X})$, consider the $q$-Wasserstein distance
$$
W_q(\mathbb{P}_1, \mathbb{P}_2) := \inf_{\pi \in \Pi(\mathbb{P}_1, \mathbb{P}_2)} \left( \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|^q \, \mathrm{d}\pi(x,y) \right)^{1/q},
$$
where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$ and where $\Pi(\mathbb{P}_1, \mathbb{P}_2) \subset \mathcal{M}_1(\mathcal{X} \times \mathcal{X})$ denotes the set of joint distributions of $\mathbb{P}_1$ and $\mathbb{P}_2$. Since we consider probability measures on a finite space we have a representation of the form
$$
\mathbb{P}_i = \sum_{x \in \mathcal{X}} a_{i,x} \, \delta_x, \qquad \text{with } \sum_{x \in \mathcal{X}} a_{i,x} = 1, ~ a_{i,x} \ge 0 \text{ for all } x \in \mathcal{X},
$$
for $i = 1, 2$, where $\delta_x$ denotes the Dirac measure at point $x \in \mathcal{X}$. Hence, the $q$-Wasserstein distance can also be written as
$$
W_q(\mathbb{P}_1, \mathbb{P}_2) := \min_{(\pi_{x,y}) \in \widetilde{\Pi}(\mathbb{P}_1, \mathbb{P}_2)} \left( \sum_{x,y \in \mathcal{X}} \|x - y\|^q \cdot \pi_{x,y} \right)^{1/q},
$$
where
$$
\widetilde{\Pi}(\mathbb{P}_1, \mathbb{P}_2) := \bigg\{ (\pi_{x,y})_{x,y \in \mathcal{X}} \subseteq [0,1] ~\bigg|~ \sum_{x \in \mathcal{X}} \pi_{x,y} = a_{2,y}, ~ \sum_{y \in \mathcal{X}} \pi_{x,y} = a_{1,x} \text{ for all } x, y \in \mathcal{X} \bigg\}.
$$
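On a finite state space the minimisation over couplings above is a finite-dimensional linear programme, so $W_q$ can be computed with an off-the-shelf LP solver. The following Python sketch does this with scipy.optimize.linprog; the function name and input conventions are our own and purely illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_q(points, p1, p2, q=1):
    """q-Wasserstein distance between probability vectors p1 and p2 supported on the
    rows of `points`, computed as the linear programme over couplings given above."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    cost = np.linalg.norm(diff, axis=-1) ** q             # ||x - y||^q for every pair
    A_eq = np.zeros((2 * n, n * n))
    for x in range(n):
        A_eq[x, x * n:(x + 1) * n] = 1.0                  # sum_y pi[x, y] = p1[x]
    for y in range(n):
        A_eq[n + y, y::n] = 1.0                           # sum_x pi[x, y] = p2[y]
    b_eq = np.concatenate([p1, p2])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun ** (1.0 / q)
```

For instance, with support points 0, 1, 2 on the real line (passed as a (3, 1) array), p1 = (1, 0, 0) and p2 = (0, 0, 1), the function returns 2.0 for every q, since all mass has to be transported from 0 to 2.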
Relying on the Wasserstein distance introduced above, we define two ambiguity sets of probability measures.
Setting 1.) The ambiguity set $\mathcal{P}^{(q,\varepsilon)}_1$

We consider for any fixed $\varepsilon > 0$ and $q \in \mathbb{N}$ the ambiguity set
$$
\mathcal{X} \times A \ni (x,a) \to \mathcal{P}^{(q,\varepsilon)}_1(x,a) := \Big\{ \mathbb{P} \in \mathcal{M}_1(\mathcal{X}) \text{ s.t. } W_q\big(\mathbb{P}, \widehat{P}(x,a)\big) \le \varepsilon \Big\}, \tag{6}
$$
being the $q$-Wasserstein ball with radius $\varepsilon$ around the reference measure $\widehat{P}(x,a)$ defined in (4). For each $(x,a) \in \mathcal{X} \times A$ the ambiguity set $\mathcal{P}^{(q,\varepsilon)}_1(x,a)$ contains all probability measures that are close to $\widehat{P}(x,a)$ with respect to the $q$-Wasserstein distance. In particular, $\mathcal{P}^{(q,\varepsilon)}_1(x,a)$ also contains measures that are not necessarily dominated by the reference measure $\widehat{P}(x,a)$.
Setting 2.) The ambiguity set $\mathcal{P}^{(q,\varepsilon)}_2$

We next define an ambiguity set that can in particular be applied when autocorrelated time series are considered. In this case we assume that the past $h \in \mathbb{N} \cap [2,\infty)$ values of a time series $(Y_t)_{t = -h+1, -h+2, \ldots}$ may have an influence on the subsequent value of the state process. Then, at time $t \in \mathbb{N}_0$ the state vector is given by
$$
X_t = (Y_{t-h+1}, \ldots, Y_t) \in \mathcal{X} := \mathcal{Y}^h \subseteq \mathbb{R}^{D \cdot h}, \tag{7}
$$
with $\mathcal{Y} \subseteq \mathbb{R}^D$ finite, where $D \in \mathbb{N}$ describes the dimension of each value $Y_t \in \mathcal{Y} \subseteq \mathbb{R}^D$.

An example is given by time series of financial assets, where not only the current state, but also past realizations may influence the subsequent evolution of the assets and can therefore be modelled to be a part of the state vector, compare also the presentation in [37, Section 4.3].
Note that at each time $t \in \mathbb{N}_0$ the part $(Y_{t-h+2}, \ldots, Y_t) \in \mathbb{R}^{D \cdot (h-1)}$ of the state vector $X_{t+1}$ that relates to past information can be derived once the current state $X_t = (Y_{t-h+1}, Y_{t-h+2}, \ldots, Y_t)$ is known. Only the realization of $Y_{t+1}$ is subject to uncertainty. Conditionally on $X_t$, the distribution of $X_{t+1}$ should therefore be of the form $\delta_{(Y_{t-h+2}, \ldots, Y_t)} \otimes \widetilde{P} \in \mathcal{M}_1(\mathcal{X})$ for some probability measure $\widetilde{P} \in \mathcal{M}_1(\mathcal{Y})$.
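The deterministic-shift structure just described is easy to mirror in code. The following small Python sketch builds the lagged state vectors of (7) from an observed series and performs the transition to the next state once a new observation arrives; the function names and array shapes are our own illustrative choices.

```python
import numpy as np

def lagged_states(Y, h):
    """Build the states X_t = (Y_{t-h+1}, ..., Y_t) of (7) from a series Y of
    shape (T, D); returns an array of shape (T - h + 1, h, D)."""
    T = len(Y)
    return np.stack([Y[t - h + 1:t + 1] for t in range(h - 1, T)])

def next_state(x, y_next):
    """Given the current state x = (y_1, ..., y_h) and a new observation y_next,
    the next state is (y_2, ..., y_h, y_next): only y_next is random, while the
    remaining components are the deterministic shift of the current state."""
    return np.concatenate([x[1:], y_next[None, :]], axis=0)
```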
We write, given some $x = (x_1, \ldots, x_h) \in \mathcal{X}$,
$$
\pi(x) := (x_2, \ldots, x_h) \in \mathcal{Y}^{h-1} \tag{8}
$$
such that $x = (x_1, \pi(x)) \in \mathcal{X}$ and such that $\pi(X_t) = (Y_{t-h+2}, \ldots, Y_t)$. The vector $\pi(x)$ denotes the projection