one cannot fully capture the probabilities of the real-world environment due to its complexity, and hence the corresponding reinforcement learning algorithm will be trained on misspecified probabilities. In addition, there is the risk that the environment shifts between the training period and the testing period. This situation is frequently observed in practice, as the future evolution of random processes rarely behaves exactly according to, for example, the observed historical evolution. Financial markets are a prime example: several financial crises have repeatedly revealed that the models in use were strongly misspecified. We refer to [31] for further examples, e.g., in robotics, and for a general discussion of the need to consider distributionally robust Markov decision processes and corresponding reinforcement learning based algorithms.
While there have been many contributions in the literature on distributionally robust Markov decision problems, only very recently, to the best of our knowledge, has a first Q-learning algorithm been developed in [31] to solve such problems. More precisely, the authors of [31] introduced a Q-learning algorithm tailored to distributionally robust Markov decision problems in which the ambiguity set of transition probabilities consists of all probability measures that are ε-close to a reference measure with respect to the Kullback-Leibler (KL) divergence, and proved its convergence to the optimal robust Q-value function.
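Written out in our notation, such a KL ambiguity set around a reference measure P̂ takes, roughly, the form
\[
  \Bigl\{ Q \ll \hat P \;:\; D_{\mathrm{KL}}\bigl(Q \,\|\, \hat P\bigr)
  = \sum_{x} Q(x) \log \frac{Q(x)}{\hat P(x)} \le \varepsilon \Bigr\};
\]
the precise ordering of the arguments in the divergence and further technical details may differ in [31], so this display is only meant for orientation.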
The goal of this paper is to provide a Q-learning algorithm that solves distributionally robust Markov decision problems in which the ambiguity set of transition probabilities for the underlying Markov decision process is a Wasserstein ball around a (possibly estimated) reference measure. We obtain theoretical guarantees for the convergence of our Q-learning algorithm to the corresponding optimal robust Q-value function (see also (12)). The design of our Q-learning algorithm combines the dynamic programming principle of the corresponding Markov decision process under model uncertainty (see, e.g., [37]) with a convex duality result for worst-case expectations with respect to a Wasserstein ball (see [4], [9], [19], [34], and [65]).
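For orientation, a schematic version of this duality reads as follows, where regularity conditions are suppressed, f denotes a generic reward-type function, P̂ the reference measure, Ẑ ∼ P̂, ε the radius of the ball, and c the transport cost inducing the Wasserstein-type distance (the notation in this display is ours and only meant as a sketch):
\[
  \sup_{Q:\, \mathcal{W}_c(Q,\hat P)\le \varepsilon} \mathbb{E}_{Q}\bigl[f(Z)\bigr]
  \;=\;
  \inf_{\lambda \ge 0} \Bigl\{ \lambda \varepsilon
  + \mathbb{E}_{\hat P}\Bigl[\, \sup_{z}\bigl(f(z) - \lambda\, c(z,\hat Z)\bigr) \Bigr] \Bigr\}.
\]
Since the state space considered below is finite, the inner supremum reduces to a maximum over finitely many points, which indicates why such a dual reformulation can be embedded into an iterative Q-learning update.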
From an application point of view, the Wasserstein distance has the crucial advantage that a corresponding Wasserstein ball contains probability measures which do not necessarily share the support of the reference measure; in contrast, under the KL divergence, every probability measure within a fixed distance of the reference measure must, by definition, have its support included in the support of the reference measure. We highlight that, from a structural point of view, our Q-learning algorithm differs from the one in [31]; roughly speaking, this is because the dual optimization problem with respect to the Wasserstein distance has a different structure than the corresponding one with respect to the KL divergence.
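The following small Python snippet illustrates this support issue on a toy example of our own (the distributions and numbers are made up and not taken from the paper): a measure Q that puts mass outside the support of the reference measure P̂ has infinite KL divergence from P̂, while its 1-Wasserstein distance to P̂ remains finite.

import numpy as np
from scipy.stats import wasserstein_distance

# Toy example (not from the paper): reference measure P_hat on {0, 1}
# and an alternative measure Q that also puts mass on the point 2,
# which lies outside the support of P_hat.
support_P, weights_P = np.array([0.0, 1.0]), np.array([0.5, 0.5])
support_Q, weights_Q = np.array([0.0, 1.0, 2.0]), np.array([0.4, 0.4, 0.2])

# The 1-Wasserstein distance stays finite although supp(Q) is not
# contained in supp(P_hat).
print(wasserstein_distance(support_Q, support_P, weights_Q, weights_P))

# The KL divergence D_KL(Q || P_hat) is infinite, since Q(2) > 0
# while P_hat(2) = 0.
def kl_divergence(q, p):
    return sum(qi * np.log(qi / pi) if pi > 0 else np.inf
               for qi, pi in zip(q, p) if qi > 0)

print(kl_divergence(weights_Q, np.append(weights_P, 0.0)))  # -> inf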
We demonstrate in several examples, also using real data, that our robust Q-learning algorithm determines robust policies that outperform the non-robust policies obtained by the classical Q-learning algorithm whenever the probabilities of the underlying Markov decision process turn out to be misspecified.
The remainder of the paper is organized as follows. In Section 2
we introduce the underlying setting of the correspond-
ing Markov decision process under model uncertainty.
In Section 3 we present our new Q-learning algorithm
and provide our main result: the convergence of this al-
gorithm to the optimal robust Q-value function. Numer-
ical examples demonstrating the applicability as well as
the benefits of our Q-learning algorithm compared to
the classical Q-learning algorithm are provided in Sec-
tion 4. All proofs and auxiliary results are provided in
Appendices A.1 and A.2, respectively.
2 Setting and Preliminaries
In this section we present the setting and introduce the quantities needed to define our Q-learning algorithm for distributionally robust stochastic optimization problems under Wasserstein uncertainty.
2.1 Setting
Optimal control problems are defined on a state space containing all the states an underlying stochastic process can attain. We model this state space as a finite subset X ⊂ R^d, where d ∈ N refers to the dimension of the state space. We consider the robust control problem over an infinite time horizon; hence the space of all attainable states over this horizon is given by the infinite Cartesian product Ω := X^{N_0} = X × X × · · ·, with the corresponding σ-algebra F := 2^X ⊗ 2^X ⊗ · · ·. On Ω we consider a stochastic process that describes the states that are attained over time. To this end, we let (X_t)_{t ∈ N_0} be the canonical process on Ω, defined by X_t(x_0, x_1, . . . , x_t, . . . ) := x_t for each (x_0, x_1, . . . , x_t, . . . ) ∈ Ω and t ∈ N_0.
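As a purely illustrative sketch (the names and numerical values below are our own and not part of the paper), a finite state space and the canonical process can be represented as follows.

import numpy as np

# Illustrative only: a finite state space X ⊂ R^d with d = 1,
# here a grid of five states.
X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])  # shape (|X|, d)

# A trajectory omega = (x_0, x_1, ...) in Omega = X^{N_0} can in practice
# only be stored up to a finite horizon; the canonical process X_t is the
# coordinate projection omega -> x_t.
def canonical_process(omega, t):
    """Return X_t(omega) = x_t, the state at time t of the trajectory."""
    return omega[t]

# Example: a truncated trajectory of states drawn uniformly from X.
rng = np.random.default_rng(0)
omega = X[rng.integers(0, len(X), size=10)]
print(canonical_process(omega, 3))  # state visited at time t = 3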
Given a realization X_t of the underlying stochastic process at some time t ∈ N_0, the outcome of the next state X_{t+1} can be influenced through actions that are executed depending on the current state X_t. At any time, the set of possible actions is given by a finite set A ⊆ R^m, where m ∈ N is the dimension of the action space (also referred to as the control space). The set of admissible policies A over the entire time horizon contains all sequences of actions that depend at any time only on the current