IRL, and maximum likelihood IRL [35–37] for modeling the expert distribution with parameterized functions. Notably, MaxEnt IRL [7, 38] is one of the representative classical IRL algorithms, built on an information-theoretic perspective toward IRL solutions. Also, discriminators of AIL are trained by logistic regression; thus, the logit score of the discriminator defines an energy function that approximates the true data density of the expert distribution [39]. Other statistical entropies have also been applied to AIL, such as the Tsallis entropy [40]. On the one hand, our approach is closely related to RAIRL [18], which defined its AIL objective using the Bregman divergence. On the other hand, this work further employs the Bregman divergence to derive iterative MD updates for reward functions, resulting in theoretically pleasing properties while retaining the scalability of AIL.
Learning theory.
There have been considerable achievements in dealing with temporal costs $\{F_t\}_{t=1}^{\infty}$, often referred to as online learning [41]. The most common approach is stochastic gradient descent (SGD): $w_{t+1} = w_t - \eta_t \nabla F_t(w_t)$. In particular, SGD is a desirable algorithm when the parameter $w_t$ resides in a Euclidean space, since it ensures unbiased minimization of the expected cost. However, policies live in the geometry of probability distributions; thus, an incurred gradient may not point in the direction of steepest descent due to geometric constraints [27, 34]. An online form of MD in Eq. (1) is analogous to SGD for non-Euclidean spaces, where each local metric is specified by a Bregman divergence [30]. Our theoretical findings and proofs follow results on online mirror descent (OMD) that appeared in previous literature in general settings [15, 16, 42, 28, 17, 30]. Our analyses extend existing theoretical results to IRL; at the same time, they are general enough to cover various online imitation learning problems that require making decisions sequentially.
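To make the contrast concrete, here is a minimal sketch (not the paper's algorithm) of an online MD step on the probability simplex with the negative Shannon entropy as the mirror map, which reduces to a multiplicative-weights update, next to a plain SGD step that needs an ad-hoc projection back onto the simplex. The loss gradients and step sizes below are hypothetical placeholders.

```python
import numpy as np

def omd_step_negative_entropy(w, grad, eta):
    """One online mirror descent step on the probability simplex.

    With the mirror map Omega(w) = sum_i w_i * log(w_i) (negative Shannon
    entropy), the Bregman projection yields the multiplicative-weights
    update: w_{t+1} is proportional to w_t * exp(-eta * grad).
    """
    logits = np.log(w) - eta * grad
    logits -= logits.max()               # numerical stability
    w_new = np.exp(logits)
    return w_new / w_new.sum()

def sgd_step_euclidean(w, grad, eta):
    """Plain SGD step; the iterate may leave the simplex, so we clip and
    renormalize (a crude surrogate for a Euclidean projection)."""
    w_new = np.clip(w - eta * grad, 1e-12, None)
    return w_new / w_new.sum()

rng = np.random.default_rng(0)
w_omd = w_sgd = np.full(4, 0.25)         # start from the uniform distribution
for t in range(1, 51):
    grad = rng.normal(size=4)            # hypothetical gradient of F_t at w_t
    eta = 0.5 / np.sqrt(t)               # standard O(1/sqrt(t)) step size
    w_omd = omd_step_negative_entropy(w_omd, grad, eta)
    w_sgd = sgd_step_euclidean(w_sgd, grad, eta)
print("OMD iterate:", np.round(w_omd, 3), "SGD iterate:", np.round(w_sgd, 3))
```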
3 Background
For sets $X$ and $Y$, let $Y^X$ be the set of functions from $X$ to $Y$, and let $\Delta_X$ ($\Delta_X^Y$) be the set of (conditional) probability distributions over $X$ (conditioned on $Y$). We consider an MDP defined as a tuple $(S, A, P, r, \gamma)$ with the state space $S$, the action space $A$, the Markovian transition kernel $P \in \Delta_S^{S \times A}$, the reward function $r \in \mathbb{R}^{S \times A}$, and the discount factor $\gamma \in [0, 1)$. Let a function $\Omega : \Delta_A \to \mathbb{R}$ be strongly convex. Using $\Omega$, the Bregman divergence is defined as
$$D_\Omega(\pi_s \,\|\, \hat{\pi}_s) := \Omega(\pi_s) - \Omega(\hat{\pi}_s) - \langle \nabla\Omega(\hat{\pi}_s), \pi_s - \hat{\pi}_s \rangle_A,$$
where $\pi_s$ and $\hat{\pi}_s$ denote arbitrary policies for a given state $s$. As a representative divergence, one can consider the popular Kullback-Leibler (KL) divergence: the KL divergence is the Bregman divergence obtained when $\Omega$ is specified as the negative Shannon entropy, $\Omega(\pi_s) = \sum_a \pi_s(a) \ln \pi_s(a)$.
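For instance, one can verify numerically that the Bregman divergence generated by the negative Shannon entropy coincides with the KL divergence. The sketch below assumes a finite action space and uses hypothetical distributions.

```python
import numpy as np

def neg_shannon_entropy(p):
    """Omega(p) = sum_a p(a) * ln p(a), the negative Shannon entropy."""
    return np.sum(p * np.log(p))

def grad_neg_shannon_entropy(p):
    """Gradient of Omega: dOmega/dp(a) = ln p(a) + 1."""
    return np.log(p) + 1.0

def bregman(p, q):
    """D_Omega(p || q) = Omega(p) - Omega(q) - <grad Omega(q), p - q>."""
    return neg_shannon_entropy(p) - neg_shannon_entropy(q) \
        - np.dot(grad_neg_shannon_entropy(q), p - q)

def kl(p, q):
    return np.sum(p * np.log(p / q))

pi_s = np.array([0.6, 0.3, 0.1])          # hypothetical policy at state s
pi_hat_s = np.array([0.2, 0.5, 0.3])      # another policy at the same state
assert np.isclose(bregman(pi_s, pi_hat_s), kl(pi_s, pi_hat_s))
print(bregman(pi_s, pi_hat_s))            # ~0.396, equals KL(pi_s || pi_hat_s)
```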
Regularization of the policy distribution with respect to the convex $\Omega$ brings distinct properties to the learning agent [43, 44]. The objective of regularized RL is to find $\pi \in \Pi$ that maximizes the expected value of discounted cumulative returns along with a causal convex regularizer $\Omega$, i.e.,
$$J_\Omega(\pi, r) := \mathbb{E}_\pi\left[\sum_{i=0}^{\infty} \gamma^i \left\{ r(s_i, a_i) - \Omega\big(\pi(\cdot \mid s_i)\big) \right\}\right], \qquad (2)$$
where the subscript $\pi$ on the expectation indicates that each action is sampled from $\pi(\cdot \mid s_i)$ for the given MDP. In this setup, a regularized RL algorithm finds a unique solution in a subset of the conditional probability space denoted as $\Pi := [\Pi_s]_{s \in S} \subset \Delta_A^S$, constrained by the parameterization of a policy.
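To make Eq. (2) concrete, the sketch below estimates the regularized return of a policy by Monte-Carlo rollouts in a small tabular MDP. All quantities (transition kernel, reward, policy) are hypothetical placeholders rather than anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9

# Hypothetical tabular MDP: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # policy pi(a|s)

def omega(p):
    """Negative Shannon entropy, used as the regularizer in Eq. (2)."""
    return np.sum(p * np.log(p))

def estimate_J(pi, n_rollouts=500, horizon=100, s0=0):
    """Monte-Carlo estimate of J_Omega(pi, r) from Eq. (2), truncated at `horizon`."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(n_actions, p=pi[s])
            ret += discount * (r[s, a] - omega(pi[s]))
            discount *= gamma
            s = rng.choice(n_states, p=P[s, a])
        total += ret
    return total / n_rollouts

print("estimated J_Omega(pi, r):", estimate_J(pi))
```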
The objective of IRL is to find a function $r^E$ that rationalizes the behavior of an expert policy $\pi^E$. For an inner product $\langle \cdot, \cdot \rangle_A$, consider $\Omega^*$, the Legendre-Fenchel transform (convex conjugate) of $\Omega$:
$$\forall q_s \in \mathbb{R}^A, \quad \Omega^*(q_s) = \max_{\pi_s \in \Delta_A} \langle \pi_s, q_s \rangle_A - \Omega(\pi_s), \qquad (3)$$
where $q_s$ and $\pi_s$ are shorthand for $q(s, \cdot)$ and $\pi(\cdot \mid s)$. Differentiating both sides with respect to $q_s$, the gradient of the conjugate $\nabla\Omega^*$ maps $q_s$ to a policy distribution.
to a policy distribution. One fundamental
property in regularized IRL [
43
] is that
π
E
is the maximizing argument of
Ω∗
for
qE
, where
qE
is
the regularized state-action value function
qE(s, a) = Eπ
E[P∞
i=0 γi{r
E(s, a)−Ω(πs
E)}|s0=s, a0=a]
.
Note that the problem is ill-posed, and every
ˆr
E
that makes its value function
ˆqE
satisfy
πs
E=
∇Ω∗(ˆqs
E)∀s∈ S
is a valid solution. Addressing this issue, Jeon et al.
[18]
proposed a reward
operator Ψ
Ω:∆S
A→RS×A, providing a unique IRL solution by Ψ
Ω(π
E).
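As a sanity check on Eq. (3), under the Shannon-entropy choice of $\Omega$ (a sketch of that special case, not a general statement): the conjugate $\Omega^*$ is the log-sum-exp function and its gradient $\nabla\Omega^*$ is the softmax, so the policy recovered from a value vector $q_s$ is $\mathrm{softmax}(q_s)$. The finite-difference check below is purely illustrative, with a hypothetical value vector.

```python
import numpy as np

def omega_conjugate(q):
    """Omega*(q) = max_{pi in simplex} <pi, q> - Omega(pi).

    For Omega(pi) = sum_a pi(a) ln pi(a), the maximizer is softmax(q)
    and the maximum value is the log-sum-exp of q.
    """
    m = q.max()
    return m + np.log(np.sum(np.exp(q - m)))

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

q_s = np.array([1.0, -0.5, 2.0, 0.3])     # hypothetical value vector q(s, .)

# The finite-difference gradient of Omega* should match softmax(q_s),
# i.e., grad Omega* maps q_s to a policy distribution.
eps = 1e-6
fd_grad = np.array([
    (omega_conjugate(q_s + eps * e) - omega_conjugate(q_s - eps * e)) / (2 * eps)
    for e in np.eye(len(q_s))
])
assert np.allclose(fd_grad, softmax(q_s), atol=1e-6)
print(softmax(q_s))
```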
Definition 1 (Regularized reward operators). Define the regularized reward operator $\Psi_\Omega$ as
$$\psi_\pi(s, a) := \Omega'(s, a; \pi) - \langle \pi_s, \nabla\Omega(\pi_s) \rangle_A + \Omega(\pi_s), \quad \text{for } \Omega'(s, \cdot\,; \pi) := \nabla\Omega(\pi_s) = \nabla_p \Omega(p)\big|_{p = \pi(\cdot \mid s)}.$$
The reward function $\psi^E := \Psi_\Omega(\pi^E)$ replaces its state-action value function, since the sum of composite Bregman divergences derived from Eq. (2) allows reward learning in a greedy manner [18].
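To illustrate Definition 1 in the Shannon-entropy special case (a sketch with a hypothetical policy): with $\Omega(\pi_s) = \sum_a \pi_s(a) \ln \pi_s(a)$, one has $\Omega'(s, a; \pi) = \ln \pi(a \mid s) + 1$ and $\langle \pi_s, \nabla\Omega(\pi_s) \rangle_A = 1 + \Omega(\pi_s)$, so the operator reduces to $\psi_\pi(s, a) = \ln \pi(a \mid s)$. The check below verifies this identity numerically.

```python
import numpy as np

def reward_operator_shannon(pi_s):
    """Psi_Omega evaluated at one state, for Omega(pi_s) = sum_a pi_s(a) ln pi_s(a).

    psi_pi(s, a) = Omega'(s, a; pi) - <pi_s, grad Omega(pi_s)> + Omega(pi_s),
    where Omega'(s, .; pi) = grad Omega(pi_s) = ln pi_s + 1.
    """
    grad = np.log(pi_s) + 1.0                    # Omega'(s, .; pi)
    omega = np.sum(pi_s * np.log(pi_s))          # Omega(pi_s)
    return grad - np.dot(pi_s, grad) + omega

pi_s = np.array([0.7, 0.2, 0.1])                 # hypothetical expert policy at s
psi = reward_operator_shannon(pi_s)
assert np.allclose(psi, np.log(pi_s))            # reduces to ln pi(a|s)
print(psi)
```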