one cannot fully capture the probabilities of the real-world environment due to its complexity, and hence the corresponding reinforcement learning algorithm will be trained on misspecified probabilities. In addition, there is the risk that the environment shifts between the training period and the testing period. This situation is frequently observed in practice, as the future evolution of random processes rarely behaves exactly according to, for example, the observed historical evolution. Financial markets are a prime example: several financial crises have repeatedly revealed that the models in use were strongly misspecified. We refer to [31] for further examples, e.g., in robotics, and for a general discussion of the need to consider distributionally robust Markov decision processes and corresponding reinforcement learning based algorithms.
While there have been many contributions in the literature on distributionally robust Markov decision problems, only very recently, to the best of our knowledge, has a first Q-learning algorithm been developed in [31] to solve such problems. More precisely, the authors of [31] introduced a Q-learning algorithm tailored to distributionally robust Markov decision problems in which the ambiguity set of transition probabilities consists of all probability measures that are ε-close to a reference measure with respect to the Kullback-Leibler (KL) divergence, and proved its convergence to the optimal robust Q-value function.
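Written out in our notation, such a KL ambiguity set around a reference measure P̂ takes, roughly, the form
\[
  \Bigl\{ Q \ll \hat P \;:\; D_{\mathrm{KL}}\bigl(Q \,\|\, \hat P\bigr)
  = \sum_{x} Q(x) \log \frac{Q(x)}{\hat P(x)} \le \varepsilon \Bigr\};
\]
the precise ordering of the arguments in the divergence and further technical details may differ in [31], so this display is only meant for orientation.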
The goal of this paper is to provide a Q-learning algorithm that solves distributionally robust Markov decision problems in which the ambiguity set of transition probabilities for the underlying Markov decision process is a Wasserstein ball around a (possibly estimated) reference measure. We obtain theoretical guarantees for the convergence of our Q-learning algorithm to the corresponding optimal robust Q-value function (see also (12)). The design of our Q-learning algorithm combines the dynamic programming principle of the corresponding Markov decision process under model uncertainty (see, e.g., [37]) with a convex duality result for worst-case expectations with respect to a Wasserstein ball (see [4], [9], [19], [34], and [65]).
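For orientation, a schematic version of this duality reads as follows, where regularity conditions are suppressed, f denotes a generic reward-type function, P̂ the reference measure, Ẑ ∼ P̂, ε the radius of the ball, and c the transport cost inducing the Wasserstein-type distance (the notation in this display is ours and only meant as a sketch):
\[
  \sup_{Q:\, \mathcal{W}_c(Q,\hat P)\le \varepsilon} \mathbb{E}_{Q}\bigl[f(Z)\bigr]
  \;=\;
  \inf_{\lambda \ge 0} \Bigl\{ \lambda \varepsilon
  + \mathbb{E}_{\hat P}\Bigl[\, \sup_{z}\bigl(f(z) - \lambda\, c(z,\hat Z)\bigr) \Bigr] \Bigr\}.
\]
Since the state space considered below is finite, the inner supremum reduces to a maximum over finitely many points, which indicates why such a dual reformulation can be embedded into an iterative Q-learning update.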
From an application point of view, the Wasserstein distance has the crucial advantage that a corresponding Wasserstein ball contains probability measures which do not necessarily share the support of the reference measure; in contrast, under the KL divergence, every probability measure within a fixed distance of the reference measure must, by definition, have its support included in the support of the reference measure. We highlight that, from a structural point of view, our Q-learning algorithm differs from the one in [31]; roughly speaking, this is because the dual optimization problem with respect to the Wasserstein distance has a different structure than the corresponding one with respect to the KL divergence.
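The following small Python snippet illustrates this support issue on a toy example of our own (the distributions and numbers are made up and not taken from the paper): a measure Q that puts mass outside the support of the reference measure P̂ has infinite KL divergence from P̂, while its 1-Wasserstein distance to P̂ remains finite.

import numpy as np
from scipy.stats import wasserstein_distance

# Toy example (not from the paper): reference measure P_hat on {0, 1}
# and an alternative measure Q that also puts mass on the point 2,
# which lies outside the support of P_hat.
support_P, weights_P = np.array([0.0, 1.0]), np.array([0.5, 0.5])
support_Q, weights_Q = np.array([0.0, 1.0, 2.0]), np.array([0.4, 0.4, 0.2])

# The 1-Wasserstein distance stays finite although supp(Q) is not
# contained in supp(P_hat).
print(wasserstein_distance(support_Q, support_P, weights_Q, weights_P))

# The KL divergence D_KL(Q || P_hat) is infinite, since Q(2) > 0
# while P_hat(2) = 0.
def kl_divergence(q, p):
    return sum(qi * np.log(qi / pi) if pi > 0 else np.inf
               for qi, pi in zip(q, p) if qi > 0)

print(kl_divergence(weights_Q, np.append(weights_P, 0.0)))  # -> inf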
We demonstrate in several examples, also using real data, that our robust Q-learning algorithm determines robust policies that outperform the non-robust policies obtained by the classical Q-learning algorithm whenever the probabilities of the underlying Markov decision process turn out to be misspecified.
The remainder of the paper is organized as follows. In Section 2
we introduce the underlying setting of the correspond-
ing Markov decision process under model uncertainty.
In Section 3 we present our new Q-learning algorithm
and provide our main result: the convergence of this al-
gorithm to the optimal robust Q-value function. Numer-
ical examples demonstrating the applicability as well as
the benefits of our Q-learning algorithm compared to
the classical Q-learning algorithm are provided in Sec-
tion 4. All proofs and auxiliary results are provided in
Appendices A.1 and A.2, respectively.
2 Setting and Preliminaries
In this section we present the setting and introduce the quantities needed to define our Q-learning algorithm for distributionally robust stochastic optimization problems under Wasserstein uncertainty.
2.1 Setting
Optimal control problems are defined on a state space containing all the states an underlying stochastic process can attain. We model this state space as a finite subset X ⊂ R^d, where d ∈ N refers to the dimension of the state space. We consider the robust control problem over an infinite time horizon; hence the space of all attainable states over this horizon is given by the infinite Cartesian product Ω := X^{N_0} = X × X × · · ·, with the corresponding σ-algebra F := 2^X ⊗ 2^X ⊗ · · ·. On Ω we consider a stochastic process that describes the states that are attained over time. To this end, we let (X_t)_{t ∈ N_0} be the canonical process on Ω, defined by X_t(x_0, x_1, . . . , x_t, . . . ) := x_t for each (x_0, x_1, . . . , x_t, . . . ) ∈ Ω and t ∈ N_0.
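As a purely illustrative sketch (the names and numerical values below are our own and not part of the paper), a finite state space and the canonical process can be represented as follows.

import numpy as np

# Illustrative only: a finite state space X ⊂ R^d with d = 1,
# here a grid of five states.
X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])  # shape (|X|, d)

# A trajectory omega = (x_0, x_1, ...) in Omega = X^{N_0} can in practice
# only be stored up to a finite horizon; the canonical process X_t is the
# coordinate projection omega -> x_t.
def canonical_process(omega, t):
    """Return X_t(omega) = x_t, the state at time t of the trajectory."""
    return omega[t]

# Example: a truncated trajectory of states drawn uniformly from X.
rng = np.random.default_rng(0)
omega = X[rng.integers(0, len(X), size=10)]
print(canonical_process(omega, 3))  # state visited at time t = 3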
Given a realization X_t of the underlying stochastic process at some time t ∈ N_0, the outcome of the next state X_{t+1} can be influenced through actions that are executed depending on the current state X_t. At any time, the set of possible actions is given by a finite set A ⊆ R^m, where m ∈ N is the dimension of the action space (also referred to as the control space). The set of admissible policies A over the entire time horizon contains all sequences of actions that depend at any time only on the current