Robust Imitation via
Mirror Descent Inverse Reinforcement Learning
Dong-Sig Han, Hyunseo Kim, Hyundo Lee, Je-Hwan Ryu, Byoung-Tak Zhang
Artificial Intelligence Institute, Seoul National University
{dshan, hskim, hdlee, jhryu, btzhang}@bi.snu.ac.kr
Abstract
Recently, adversarial imitation learning has provided a scalable approach to reward acquisition
for inverse reinforcement learning (IRL) problems. However, estimated
reward signals often become uncertain and fail to train a reliable statistical model
since the existing methods tend to solve hard optimization problems directly.
Inspired by a first-order optimization method called mirror descent, this paper
proposes to predict a sequence of reward functions, which are iterative solutions
for a constrained convex problem. IRL solutions derived by mirror descent are
tolerant to the uncertainty incurred by target density estimation since the amount
of reward learning is regulated with respect to local geometric constraints. We
prove that the proposed mirror descent update rule ensures robust minimization of a Bregman divergence in terms of a rigorous regret bound of $\mathcal{O}(1/T)$ for step sizes $\{\eta_t\}_{t=1}^{T}$. Our IRL method was applied on top of an adversarial framework, and it outperformed existing adversarial methods in an extensive suite of benchmarks.
1 Introduction
One crucial requirement of practical imitation learning methods is robustness, often described
as learning expert behavior from a finite number of demonstrations while overcoming various realistic
challenges [1]. In real-world problems such as motor control tasks, the demonstration size can be insufficient to create a precise model of an expert [2], and even in some cases, demonstrations can be noisy or suboptimal to solve the problem [3]. For such challenging scenarios, imitation
learning algorithms inevitably struggle with unreliable statistical models; thus, how the uncertainty of estimated cost functions is handled dramatically affects imitation performance. Therefore, a thorough analysis of these issues is required to construct a robust algorithm.
Inverse reinforcement learning (IRL) is an algorithm for learning ground-truth rewards from expert
demonstrations where the expert acts optimally with respect to an unknown reward function [4, 5].
Traditional IRL studies solve the imitation problem based on iterative algorithms [6, 7], alternating between the reward estimation process and a reinforcement learning (RL) [8] algorithm. In contrast,
newer studies of adversarial imitation learning (AIL) [9, 10] rather suggest learning reward functions of a certain form “directly,” by using adversarial learning objectives [11] and nonlinear discriminative neural networks [10]. Compared to classical approaches, the AIL methods have shown great success
on control benchmarks in terms of scalability for challenging control tasks [12].
Technically, it is well known that AIL formulates a divergence minimization problem with its
discriminative signals, which incorporates fine-tuned estimations of the target densities [13]. Through the lens of differential geometries, the limitation of AIL naturally comes from the implication that minimizing the divergence does not guarantee unbiased progression due to constraints of the underlying space [14]. In order to ensure further stability, we argue that an IRL algorithm's progress
needs to be regulated, yielding gradual updates with respect to local geometries of policy distributions.
We claim that there are two issues leading to unconstrained policy updates: (1) a statistical divergence often cannot be accurately obtained for challenging problems, and (2) an immediate divergence between agent and expert densities does not guarantee unbiased learning directions. Our approach is connected to a collection of optimization processes called mirror descent (MD) [15]. For a sequence of parameters $\{w_t\}_{t=1}^{T}$ and a convex function $\Omega$, an MD update for a cost function $F_t$ is derived as
$$\nabla\Omega(w_{t+1}) = \nabla\Omega(w_t) - \eta_t \nabla F_t(w_t). \qquad (1)$$
In the equation, the gradient $\nabla\Omega(\cdot)$ creates a transformation that links a parametric space to its dual
space. Theoretically, MD is a first-order method for solving constrained problems, which enjoys
rigorous regret bounds for various geometries [16, 17] including probability spaces. Thus, applying
MD to the reward estimation process can be efficient in terms of the number of learning phases.
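To make Eq. (1) concrete, the following sketch (our illustration, not code from the paper) runs mirror descent on the probability simplex with the negative Shannon entropy as $\Omega$; in that case $\nabla\Omega$ is the elementwise logarithm up to a constant, and the update reduces to the familiar exponentiated-gradient rule. The target distribution `p_target` and the cost (a KL divergence to it) are arbitrary choices for the demonstration.

```python
import numpy as np

def md_step(w, grad_F, eta):
    """One MD step of Eq. (1) on the simplex with Omega = negative Shannon entropy.

    grad Omega(w) = log(w) + 1; the constant is absorbed by the normalization,
    so the update is the exponentiated-gradient rule."""
    z = np.log(w) - eta * grad_F(w)      # update in the dual (mirror) space
    w_next = np.exp(z - z.max())         # map back through grad Omega^* ...
    return w_next / w_next.sum()         # ... i.e., normalize onto the simplex

# Toy cost (an arbitrary choice for illustration): the KL divergence to a fixed
# target distribution p_target; its gradient w.r.t. w is log(w / p_target) + 1.
p_target = np.array([0.7, 0.2, 0.1])
grad_F = lambda w: np.log(w / p_target) + 1.0

w = np.full(3, 1.0 / 3.0)                # start from the uniform distribution
for t in range(1, 51):
    w = md_step(w, grad_F, eta=1.0 / t)  # decaying step sizes eta_t
print(np.round(w, 3))                    # approaches p_target
```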
In this paper, we derive an MD update rule in IRL upon a postulate of nonstationary estimations
of the expert density, resulting in convergent reward acquisition even for challenging problems.
Compared to MD algorithms in optimization studies, our methodology draws a sequence of functions
on an alternative space induced by a reward operator $\Psi_\Omega$ (Definition 1). To this end, we propose an AIL algorithm called mirror descent adversarial inverse reinforcement learning (MD-AIRL). Our empirical evidence showed that MD-AIRL outperforms the regularized adversarial IRL (RAIRL) [18] methods. For example, MD-AIRL showed higher performance in 30 distinct cases among 32 different configurations in challenging MuJoCo [19] benchmarks, and it also clearly showed higher
tolerance to suboptimal data. All of these results are strongly aligned with our theoretical analyses.
Table 1: A technical overview. Traditional IRL methods lack scalability, and RAIRL does not
guarantee convergence of its solution for realistic cases. MD-AIRL combines desirable properties.
Method            Reference   Scalability   Rewards   Bregman divergence   Iterative solutions   Convergence analyses
BC (1991)         [20]        ✓             ✗         ✗                    ✗                     ✓
MM-IRL (2004)     [6]         ✗             ✓         ✗                    ✓                     ✓
GAIL (2016)       [9]         ✓             ✓         ✗                    ✗                     ✗
RAIRL (2021)      [18]        ✓             ✓         ✓                    ✗                     ✗
MD-AIRL (ours)    —           ✓             ✓         ✓                    ✓                     ✓
Our contributions.
Our work is complementary to previous IRL studies; the theoretical and
technical contributions are built upon a novel perspective of considering iterative RL and IRL
algorithms as a combined optimization process with dual aspects. Comparing MD-AIRL and RAIRL,
both are highly generalized algorithms in terms of a variety of choices of divergence functions. Tab. 1
shows that MD-AIRL brings beneficial results in realistic situations of limited time and data, since our
approach is more aligned with earlier theoretical IRL studies providing formalized reward learning
schemes and convergence guarantees. In summary, we list our main contributions below:
• Instead of a monolithic estimation process of a global solution in AIL, we derive a sequence of reward functions that provides iterative local objectives (Section 4).
• We formally prove that rewards derived by an MD update rule guarantee the robust performance of divergence minimization along with a rigorous regret bound (Section 5).
• We propose a novel adversarial algorithm that is motivated by mirror descent, which is tolerant of unreliable discriminative signals of the AIL framework (Sections 6 and 7).
2 Related Works
Mirror descent.
We are interested in a family of statistical divergences called the Bregman
divergence [21]. The divergence generalizes constrained optimization problems such as least squares [22, 23], and it also has been applied in various subfields of machine learning [24, 25]. In differential geometries, the Bregman divergence is a first-order approximation for a metric tensor and satisfies metric-like properties [14, 26]. MD is also closely related to optimization methods regarding non-Euclidean geometries with a discretization of steps such as natural gradients [27, 28]. In the primal space, training with the infinitesimal limit of MD steps corresponds to a Riemannian gradient flow [29, 30]. In the RL domain, MD has been recently studied for policy optimization [31–33]. In this
paper, we focus on learning with suboptimal representations of policy, and our distinct goal is to draw
a robust reward learning scheme based on MD for the IRL problem.
Imitation learning.
As a statistical model for the information geometry [34], energy-based policies (i.e., Boltzmann distributions) appeared in early IRL studies, such as Bayesian IRL, natural gradient IRL, and maximum likelihood IRL [35–37], for modeling the expert distribution with parameterized functions. Notably, MaxEnt IRL [7, 38] is one of the representative classical IRL algorithms based on an information-theoretic perspective toward IRL solutions. Also, discriminators of AIL are trained by logistic regression; thus, the logit score of the discriminator defines an energy function that approximates the true data density of the expert distribution [39]. Other statistical entropies have also been applied to AIL, such as the Tsallis entropy [40]. On the one hand, our approach is closely related to RAIRL [18], which defined its AIL objective using the Bregman divergence. On the other
hand, this work further employs the Bregman divergence to derive iterative MD updates for reward
functions, resulting in theoretically pleasing properties while retaining the scalability of AIL.
Learning theory.
There have been considerable achievements in dealing with temporal costs $\{F_t\}_{t=1}^{\infty}$, often referred to as online learning [41]. The most ordinary approach is stochastic gradient descent (SGD): $w_{t+1} = w_t - \eta_t \nabla F_t(w_t)$. In particular, SGD is a desirable algorithm when the parameter $w_t$ resides in the Euclidean space since it ensures unbiased minimization of the expected cost. Apparently, policies appear in geometries of probabilities; thus, an incurred gradient may not be the direction of the steepest descent due to geometric constraints [27, 34]. An online form of MD in Eq. (1) is analogous to SGD for non-Euclidean spaces, where each local metric is specified by a Bregman divergence [30]. Our theoretical findings and proofs follow the results of online mirror descent (OMD) that appeared in previous literature for general aspects [15, 16, 42, 28, 17, 30]. Our
analyses extend existing theoretical results to IRL; at the same time, they are also highly general to
cover various online imitation learning problems which require making decisions sequentially.
3 Background
For sets $X$ and $Y$, let $Y^X$ be a set of functions from $X$ to $Y$, and let $\Delta_X$ ($\Delta_X^Y$) be a set of (conditional) probabilities over $X$ (conditioned on $Y$). We consider an MDP defined as a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ with the state space $\mathcal{S}$, the action space $\mathcal{A}$, the Markovian transition kernel $P \in \Delta_{\mathcal{S}}^{\mathcal{S} \times \mathcal{A}}$, the reward function $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, and the discount factor $\gamma \in [0, 1)$. Let a function $\Omega: \Delta_{\mathcal{A}} \to \mathbb{R}$ be strongly convex. Using $\Omega$, the Bregman divergence is defined as
$$D_\Omega(\pi^s \,\|\, \hat\pi^s) := \Omega(\pi^s) - \Omega(\hat\pi^s) - \big\langle \nabla\Omega(\hat\pi^s),\, \pi^s - \hat\pi^s \big\rangle_{\mathcal{A}},$$
where $\pi^s$ and $\hat\pi^s$ denote arbitrary policies for a given state $s$. For a representative divergence, one can consider the popular Kullback-Leibler (KL) divergence. The KL divergence is a Bregman divergence when $\Omega$ is specified as the negative Shannon entropy: $\Omega(\pi^s) = \sum_a \pi^s(a) \ln \pi^s(a)$.
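As a quick numerical sanity check (ours, not the paper's code), the snippet below evaluates $D_\Omega$ for the negative Shannon entropy on two categorical distributions and confirms that it coincides with the KL divergence.

```python
import numpy as np

def omega(p):
    """Negative Shannon entropy, Omega(p) = sum_a p(a) ln p(a)."""
    return np.sum(p * np.log(p))

def grad_omega(p):
    """Gradient of the negative Shannon entropy: log p + 1 (elementwise)."""
    return np.log(p) + 1.0

def bregman(p, q):
    """D_Omega(p || q) = Omega(p) - Omega(q) - <grad Omega(q), p - q>."""
    return omega(p) - omega(q) - np.dot(grad_omega(q), p - q)

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(np.isclose(bregman(p, q), kl(p, q)))   # True: the two divergences coincide
```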
Regularization of the policy distribution with respect to a convex $\Omega$ brings distinct properties to the learning agent [43, 44]. The objective of regularized RL is to find $\pi \in \Pi$ that maximizes the expected value of discounted cumulative returns along with a causal convex regularizer $\Omega$, i.e.,
$$J(\pi, r) := \mathbb{E}_\pi\Big[\textstyle\sum_{i=0}^{\infty} \gamma^i \big\{ r(s_i, a_i) - \Omega(\pi(\cdot \mid s_i)) \big\}\Big], \qquad (2)$$
where the subscript $\pi$ on the expectation indicates that each action is sampled by $\pi(\cdot \mid s_i)$ for the given MDP. In this setup, a regularized RL algorithm finds a unique solution in a subset of the conditional probability space denoted as $\Pi := \{[\pi^s]_{s \in \mathcal{S}}\} \subseteq \Delta_{\mathcal{A}}^{\mathcal{S}}$ constrained by the parameterization of a policy.
The objective of IRL is to find a function $r_E$ that rationalizes the behavior of an expert policy $\pi_E$. For an inner product $\langle \cdot, \cdot \rangle_{\mathcal{A}}$, consider $\Omega^*$, the Legendre-Fenchel transform (convex conjugate) of $\Omega$:
$$\forall q^s \in \mathbb{R}^{\mathcal{A}}, \quad \Omega^*(q^s) = \max_{\pi^s \in \Delta_{\mathcal{A}}} \big\langle \pi^s, q^s \big\rangle_{\mathcal{A}} - \Omega(\pi^s), \qquad (3)$$
where $q^s$ and $\pi^s$ denote the shorthand notation of $q(s, \cdot)$ and $\pi(\cdot \mid s)$. Differentiating both sides with respect to $q^s$, the gradient of the conjugate $\nabla\Omega^*$ maps $q^s$ to a policy distribution. One fundamental property in regularized IRL [43] is that $\pi_E$ is the maximizing argument of $\Omega^*$ for $q_E$, where $q_E$ is the regularized state-action value function $q_E(s, a) = \mathbb{E}_{\pi_E}\big[\sum_{i=0}^{\infty} \gamma^i \{ r_E(s_i, a_i) - \Omega(\pi_E^{s_i}) \} \mid s_0 = s, a_0 = a \big]$.
Note that the problem is ill-posed, and every $\hat{r}_E$ that makes its value function $\hat{q}_E$ satisfy $\pi_E^s = \nabla\Omega^*(\hat{q}_E^s)\ \forall s \in \mathcal{S}$ is a valid solution. Addressing this issue, Jeon et al. [18] proposed a reward operator $\Psi_\Omega: \Delta_{\mathcal{A}}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, providing a unique IRL solution by $\Psi_\Omega(\pi_E)$.
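For intuition, with the negative Shannon entropy the conjugate in Eq. (3) has the closed form $\Omega^*(q^s) = \ln \sum_a \exp q^s(a)$ and $\nabla\Omega^*$ is the softmax map, so a policy is recovered from state-action values by a softmax. The sketch below (our own, for this special case only) verifies the conjugacy relation numerically.

```python
import numpy as np

def conjugate(q):
    """Legendre-Fenchel transform of the negative Shannon entropy: log-sum-exp."""
    return np.log(np.sum(np.exp(q)))

def grad_conjugate(q):
    """grad Omega^*(q) is the softmax map, i.e., the maximizer of Eq. (3)."""
    e = np.exp(q - q.max())
    return e / e.sum()

def omega(p):
    """Negative Shannon entropy."""
    return np.sum(p * np.log(p))

q_s = np.array([1.0, 0.5, -0.2])       # some state-action values q(s, .)
pi_s = grad_conjugate(q_s)             # policy recovered from the values

# Eq. (3): Omega^*(q_s) equals <pi_s, q_s> - Omega(pi_s) at the maximizing policy.
print(np.isclose(conjugate(q_s), np.dot(pi_s, q_s) - omega(pi_s)))   # True
```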
Definition 1 (Regularized reward operators). Define the regularized reward operator $\Psi_\Omega$ as
$$\psi_\pi(s, a) := \Omega'(s, a; \pi) - \big\langle \pi^s, \nabla\Omega(\pi^s) \big\rangle_{\mathcal{A}} + \Omega(\pi^s), \quad \text{for } \Omega'(s, \cdot; \pi) := \nabla\Omega(\pi^s) = \nabla_p \Omega(p)\big|_{p = \pi(\cdot \mid s)}.$$
The reward function $\psi_E := \Psi_\Omega(\pi_E)$ replaces its state-action value function, since the sum of composite Bregman divergences derived from Eq. (2) allows reward learning in a greedy manner [18].
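The reward operator of Definition 1 can be evaluated directly for a tabular policy. The following sketch (ours, assuming the negative Shannon entropy as $\Omega$) computes $\psi_\pi(s, \cdot)$ and recovers the special case $\psi_\pi(s, a) = \ln \pi(a \mid s)$, i.e., the regularized reward is the log-policy.

```python
import numpy as np

def omega(p):
    """Negative Shannon entropy, Omega(p) = sum_a p(a) ln p(a)."""
    return np.sum(p * np.log(p))

def grad_omega(p):
    """Gradient of the negative Shannon entropy: log p + 1 (elementwise)."""
    return np.log(p) + 1.0

def psi(pi_s):
    """Definition 1 applied state-wise:
    psi_pi(s, .) = grad Omega(pi_s) - <pi_s, grad Omega(pi_s)> + Omega(pi_s)."""
    g = grad_omega(pi_s)
    return g - np.dot(pi_s, g) + omega(pi_s)

pi_s = np.array([0.6, 0.3, 0.1])                 # a categorical policy at some state s
print(np.allclose(psi(pi_s), np.log(pi_s)))      # True: psi equals ln pi(.|s)
```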
4 RL-IRL as a Proximal Method
Figure 1: A schematic illustration of the policy space $\Pi$ and the reward space $\mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. MD is locally constrained by a divergence (gray area), i.e., $D_\Omega(\cdot \,\|\, \pi_t)$. An MD update is performed for the reward function $\psi_t$ in an associated reward space of $\Omega$ defined by $\Psi_\Omega$, and $\pi_{t+1}$ is achieved in the desired space of $\Pi$ by applying $\nabla\Omega^*$ for the function $\psi_{t+1}$. The gray dashed lines provide another interpretation of MD with $\tilde\pi_{t+1}$ and the projection operator $P_\Omega$.
Associated reward functions.
We consider the RL-IRL processes as a sequential algorithm with local constraints and define sequences $\{\pi_t\}_{t=1}^{\infty}$ and $\{\psi_t\}_{t=1}^{\infty}$ that denote policies and associated reward functions, respectively. The associated reward functions are in a space $\Psi_\Omega(\Pi)$, which is an alternative space of the dual space, defined by the regularized reward operator $\Psi_\Omega$. Formally, we provide Lemma 1, which shows a bijective relation between the operators $\nabla\Omega^*$ and $\Psi_\Omega$ in the set $\Pi$. The proof is in Appendix A.
Lemma 1 (Natural isomorphism). Let $\psi \in \Psi_\Omega(\Pi)$ for $\Psi_\Omega(X) := \{\psi \mid \psi(s, a) = \psi_\pi(s, a),\ \forall s \in \mathcal{S}, a \in \mathcal{A}, \pi \in X\}$. Then, $\nabla\Omega^*(\psi)$ is unique, and for every $\pi = \nabla\Omega^*(\psi)$, $\pi \in \Pi$.
Fig. 1 illustrates that there is a unique $\psi_t$ for $\pi_t$ in every time step. Note that $\Psi_\Omega(\pi_t)$ is different from $\nabla\Omega(\pi_t)$; it is shifted by a vector $\mathbf{1}c$ with a constant $c = \Omega(\pi^s_t) - \langle \pi^s_t, \nabla\Omega(\pi^s_t) \rangle_{\mathcal{A}}$. Since the underlying space is a probability simplex, the operator $\nabla\Omega^*$ reconstructs the original point for both $\Psi_\Omega$ and $\nabla\Omega$, as the distributivity [43] $\Omega^*(y + \mathbf{1}c) = \Omega^*(y) + c$ holds (so $\nabla\Omega^*(y + \mathbf{1}c) = \nabla\Omega^*(y)$).
An alternative interpretation is of considering a projection (gray dashed line in Fig. 1). Suppose that a policy $\pi_t$ is updated to $\tilde\pi_{t+1} \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. The Bregman projection operator $P_\Omega$ is applied that locates the subsequent update $\pi_{t+1}$ to the “feasible” region, i.e., $P_\Omega(\tilde\pi_{t+1}) := \operatorname{argmin}_{\pi \in \Pi} \big[ D_\Omega(\pi^s \,\|\, \tilde\pi^s_{t+1}) \big]_{s \in \mathcal{S}}$.

Consequently, one can consider an updated reward function $\psi_{t+1}$ as a projected target of MD associated with an alternative parameterization of $\Pi$. For instance, the parameters of $\psi_t$ can construct a softmax policy for a discrete space, or a Gaussian policy for a continuous space. Using the reward function $\psi_{t+1}$, an arbitrary regularized RL process maximizing Eq. (2) at the $t$-th step [18]
$$J(\pi, \psi_{t+1}) = -\mathbb{E}_\pi\Big[\textstyle\sum_{i=0}^{\infty} \gamma^i D_\Omega\big(\pi(\cdot \mid s_i) \,\big\|\, \pi_{t+1}(\cdot \mid s_i)\big)\Big] \qquad (4)$$
becomes finding the next iteration $\pi_{t+1} = \nabla\Omega^*(\psi_{t+1})$ by maximizing the expected cumulative return. The equation shows that a regularized RL algorithm with the regularizer $\Omega$ forms a cumulative sum of Bregman divergences; thus, the policy $\pi_{t+1}$ is uniquely achieved by the property of divergence.
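The state-wise identity behind Eq. (4) is easy to verify numerically: under the Shannon regularizer, $\mathbb{E}_{a \sim \pi^s}[\psi_{t+1}(s, a)] - \Omega(\pi^s) = -D_\Omega(\pi^s \,\|\, \pi^s_{t+1})$, where $\psi_{t+1}(s, \cdot) = \ln \pi_{t+1}(\cdot \mid s)$. A small check of ours with tabular categorical policies:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

pi_s = np.array([0.2, 0.5, 0.3])       # candidate policy pi(.|s)
pi_next = np.array([0.6, 0.3, 0.1])    # target policy pi_{t+1}(.|s)

psi_next = np.log(pi_next)             # Psi_Omega(pi_{t+1})(s, .) under the Shannon regularizer
lhs = np.dot(pi_s, psi_next) - np.sum(pi_s * np.log(pi_s))   # E_a[psi] - Omega(pi_s)
print(np.isclose(lhs, -kl(pi_s, pi_next)))                   # True
```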
Online imitation learning.
Our setup starts from the apparent yet vital premise that an imitation learning algorithm does not retain the global target $\pi_E$ during training. That is, it is fundamentally uncertain to model global objectives (such as $J(\pi, \psi_E)$), which are not attainable for both RL and IRL. Instead, we hypothesize on the existence of a random process $\{\bar\pi_{E,t}\}_{t=1}^{\infty}$ where each estimation $\bar\pi_{E,t}$ resides in a closed, convex neighborhood of $\pi_E$, generated by an arbitrary estimation algorithm. Substituting $\psi_E$ with $\psi_{\bar\pi_{E,t}} = \Psi_\Omega(\bar\pi_{E,t})$, the nonstationary objective $J(\pi, \psi_{\bar\pi_{E,t}})$ forms a temporal cost:
$$F_t(\pi) = \mathbb{E}_\pi\Big[\textstyle\sum_{i=0}^{\infty} \gamma^i D_\Omega\big(\pi(\cdot \mid s_i) \,\big\|\, \bar\pi_{E,t}(\cdot \mid s_i)\big)\Big]. \qquad (5)$$
For the sake of better understanding, we considered an actual experiment depicted in Fig. 2. Suppose that the policies of the learning agent and the expert follow multivariate Gaussian distributions $\mathcal{N}([0,0]^\top, I)$ and $\mathcal{N}([5,3]^\top, \Sigma_E)$ with $|\Sigma_E| < 1$. Let a (suboptimal) reference policy $\bar\pi_{E,t}$ be independently fitted with a maximum likelihood estimator with a relatively high learning rate, starting from $\bar\pi_{E,1} = \pi_1$. The policy $\pi_t$ was trained by a cost function $D_\Omega(\cdot \,\|\, \bar\pi_{E,t})$ using the MD update rule in Eq. (1). In Fig. 2, we first observed that choosing a high step size constant $\eta$ accelerated the training speed mainly in the early phase. The results also showed that the performance of MD ($D_\Omega(\pi_t \,\|\, \pi_E)$) outperformed that of the referenced maximum likelihood estimation ($D_\Omega(\bar\pi_{E,t} \,\|\, \pi_E)$) by choosing an effectively low step size. This empirical evidence suggests that there are clear advantages in formalizing the training steps and scheduling the step sizes, especially for an unreliable statistical model $\bar\pi_{E,t}$.

Figure 2: (a) A policy $\pi_t$ learns from MD updates for temporal costs $D_\Omega(\cdot \,\|\, \bar\pi_{E,t})$. (b) The updates of $\pi_t$ vary by $\eta$, and the distance between $\pi_t$ and $\pi_E$ can be closer than the distance between $\bar\pi_{E,t}$ and $\pi_E$ when $t$ is sufficiently large and $\eta$ is effectively low. (c) Two plots show $D_\Omega(\pi_t \,\|\, \pi_E)$ associated with entropic regularizers (Shannon and Tsallis) for four different $\eta$ (10 trials), with the red baselines $D_\Omega(\bar\pi_{E,t} \,\|\, \pi_E)$.
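A simplified reproduction of the spirit of this experiment is sketched below (ours, not the paper's exact setup: only the Gaussian means are tracked, covariances are fixed to the identity, $\bar\pi_{E,t}$ is the sample mean of a small expert batch at every step, and the step size is a constant $\eta$). Under these assumptions the MD step of Eq. (1) reduces to an interpolation of means, and the iterate averages out the noise of the per-step estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_E = np.array([5.0, 3.0])            # expert mean; covariances are fixed to I here
mu_t = np.zeros(2)                     # agent starts at N([0, 0], I)
eta = 0.2                              # an effectively low, constant step size

def kl_isotropic(mu_a, mu_b):
    """KL divergence between N(mu_a, I) and N(mu_b, I)."""
    return 0.5 * np.sum((mu_a - mu_b) ** 2)

for t in range(1, 101):
    batch = rng.normal(mu_E, 1.0, size=(8, 2))   # small, noisy expert batch
    mu_bar = batch.mean(axis=0)                  # per-step MLE of the expert mean
    # For Gaussian means with identity covariance, the MD step of Eq. (1)
    # is the interpolation mu_t <- (1 - eta) * mu_t + eta * mu_bar.
    mu_t = (1.0 - eta) * mu_t + eta * mu_bar

print(kl_isotropic(mu_t, mu_E))        # divergence of the MD iterate to the expert
print(kl_isotropic(mu_bar, mu_E))      # divergence of the last noisy estimate, for comparison
```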
MD update rules.
As a result of these findings, we formulate subsequent MD steps with a regularized
reward function. Let $w_t$ be a parameter on a set $\mathcal{W}$ and $F_t: \mathcal{W} \to \mathbb{R}$ be a convex cost function from a class of functions $\mathcal{F}$ at the $t$-th step. Replacing the L2 proximity term of proximal gradient descent with a Bregman divergence, the proximal form of the MD update for Eq. (1) is written as [45]
$$\underset{w \in \mathcal{W}}{\text{minimize}} \;\; \big\langle \nabla F_t(w_t),\, w - w_t \big\rangle_{\mathcal{W}} + \alpha_t D_\Omega(w \,\|\, w_t), \qquad (6)$$
where $\alpha_t := 1/\eta_t$ denotes an inverse of the current step size $\eta_t$ [46]. Plugging each divergence of the cumulative cost $F_t$ to Eq. (6), the MD-IRL update for the subsequent reward function $\psi_{t+1} = \Psi_\Omega(\pi_{t+1})$ is derived by solving a problem
$$\begin{aligned}
&\underset{\pi^s \in \Pi^s}{\text{minimize}} \;\; \Big\langle \underbrace{\nabla D_\Omega(\pi^s_t \,\|\, \bar\pi^s_{E,t})}_{\nabla\Omega(\pi^s_t) - \nabla\Omega(\bar\pi^s_{E,t})},\, \pi^s - \pi^s_t \Big\rangle_{\mathcal{A}} + \alpha_t D_\Omega(\pi^s \,\|\, \pi^s_t) \\
\iff\; &\underset{\pi^s \in \Pi^s}{\text{minimize}} \;\; D_\Omega(\pi^s \,\|\, \bar\pi^s_{E,t}) - D_\Omega(\pi^s \,\|\, \pi^s_t) + \alpha_t D_\Omega(\pi^s \,\|\, \pi^s_t) \\
\iff\; &\underset{\pi^s \in \Pi^s}{\text{minimize}} \;\; \eta_t \underbrace{D_\Omega(\pi^s \,\|\, \bar\pi^s_{E,t})}_{\text{estimated expert}} + (1 - \eta_t) \underbrace{D_\Omega(\pi^s \,\|\, \pi^s_t)}_{\text{learning agent}} \qquad \forall s \in \mathcal{S}, \qquad (7)
\end{aligned}$$
where the gradient of $D_\Omega$ is taken with respect to its first argument $\pi^s_t$. Note that solving the optimization in Eq. (7) requires interaction between $\pi_t$ and the dynamics of the given environment in order to minimize $F_t$; thus, the corresponding RL process plays an essential role in sequential learning through the induced value measures. At a glance, the objective is analogous to finding an interpolation at each iteration, where the point is controlled by $\eta_t$. Fig. 3 shows that the uncertainty of $\pi_t$ (blue region) becomes minimal regardless of the persisting uncertainty of $\bar\pi_{E,t}$ (red region).
Figure 3: Illustrations of MD at the (a) $t$-th iteration and (b) $(t+1)$-th iteration where $\eta_t > \eta_{t+1}$. $\{\bar\pi_{E,t}\}_{t=1}^{\infty}$ is an arbitrary estimation process attained from a neighborhood of $\pi_E$ with respect to a norm. The MD update is taken inside the interval of $\pi_t$ and $\bar\pi_{E,t}$ using Eq. (7).
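When the policy class is unconstrained on the simplex and $\Omega$ is the negative Shannon entropy, the minimizer of the last line of Eq. (7) has a closed form: its dual coordinates interpolate as $\nabla\Omega(\pi^s_{t+1}) = \eta_t \nabla\Omega(\bar\pi^s_{E,t}) + (1 - \eta_t) \nabla\Omega(\pi^s_t)$, i.e., the update is a normalized geometric mixture. The sketch below (our illustration under these assumptions; the paper's algorithm instead parameterizes $\Pi$ and relies on an RL step) checks this against a brute-force search over the simplex.

```python
import numpy as np

def kl(p, q):
    """Bregman divergence under the Shannon regularizer (the KL divergence)."""
    return np.sum(p * np.log(p / q))

def md_objective(p, pi_t, pi_bar, eta):
    """Last line of Eq. (7): eta * D(p || pi_bar) + (1 - eta) * D(p || pi_t)."""
    return eta * kl(p, pi_bar) + (1.0 - eta) * kl(p, pi_t)

pi_t = np.array([0.5, 0.4, 0.1])       # current policy at some state s
pi_bar = np.array([0.1, 0.2, 0.7])     # current (noisy) estimate of the expert
eta = 0.3

# Closed-form minimizer: normalized geometric mixture (dual-space interpolation).
pi_next = pi_bar ** eta * pi_t ** (1.0 - eta)
pi_next /= pi_next.sum()

# Brute-force comparison: no random point on the simplex does better.
rng = np.random.default_rng(1)
candidates = rng.dirichlet(np.ones(3), size=1000)
best_random = min(md_objective(c, pi_t, pi_bar, eta) for c in candidates)
print(md_objective(pi_next, pi_t, pi_bar, eta) <= best_random)   # True
```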
5 Convergence Analyses
In this section, we present our theoretical results. The main goals of the following arguments are to
address (1) the convergence of MD updates for various cases and (2) the necessity of scheduling the amount of learning. Suppose that state instances of $s^{(t)}_i \in \tau_t$ cover the entire $\mathcal{S}$ by executing the policy $\pi_t$ in an infinite horizon. From this assumption, we define a temporal cost function at the time step $t$:
$$f(\pi_t, \tau_t) := \textstyle\sum_{i=0}^{\infty} \gamma^i D_\Omega\big(\pi_t(\cdot \mid s^{(t)}_i) \,\big\|\, \bar\pi_{E,t}(\cdot \mid s^{(t)}_i)\big), \qquad (8)$$
that involves $\pi_t$, and additionally a trajectory $\tau_t$, as inputs. We refer to the global objective as finding a unique fixed point $\pi^* \in \Pi$ that minimizes a total cost $F(\pi) := \mathbb{E}[f(\pi, \tau_t)]$, where the expectation is taken over trajectories of entire steps, i.e., $\lim_{t \to \infty} \mathbb{E}_{\tau_{1:t}}[f(\pi, \tau_t)]$. Taking the (stepwise) gradient for each $\pi(\cdot \mid s)$, an optimal policy $\pi^*$ is found by $\mathbb{E}[\nabla\Omega(\pi^*(\cdot \mid s)) - \nabla\Omega(\bar\pi_{E,t}(\cdot \mid s))] = 0$ when $t \to \infty$; hence, $\nabla\Omega(\pi^*(\cdot \mid s)) = \lim_{t \to \infty} \mathbb{E}[\nabla\Omega(\bar\pi_{E,t}(\cdot \mid s))]$. Introducing the optimal policy $\pi^*$ allows not only the specific situation when (1) $\pi_E = \pi^* \in \Pi$ and the estimation algorithm of $\bar\pi_{E,t}$ is actually convergent with $t \to \infty$, but also more general situations where (2) $\pi_E \notin \Pi$ or the estimated expert policy $\bar\pi_{E,t}$ does not converge; the algorithm finds convergence to a fixed point by scheduling updates.
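To see how scheduling alone yields a convergent iterate even when $\bar\pi_{E,t}$ keeps fluctuating (case (2) above), consider a single-state simplification in which each MD step is the closed-form dual-space interpolation from Eq. (7) under the Shannon regularizer. With $\eta_t = 1/t$, the dual coordinates of the iterate become the running average of $\nabla\Omega(\bar\pi_{E,t})$, matching the fixed-point condition $\nabla\Omega(\pi^*) = \lim_{t \to \infty} \mathbb{E}[\nabla\Omega(\bar\pi_{E,t})]$. The sketch below (ours, with an artificial noisy estimation process) demonstrates this.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(p):
    return p / p.sum()

pi_E = np.array([0.6, 0.3, 0.1])       # true expert policy at a single state
pi_t = np.full(3, 1.0 / 3.0)           # agent starts from the uniform policy

dual_log = []
for t in range(1, 2001):
    # A non-convergent estimation process: a fresh noisy perturbation of pi_E.
    pi_bar = normalize(pi_E * np.exp(0.3 * rng.normal(size=3)))
    eta = 1.0 / t                      # scheduled step sizes
    # Closed-form MD step under the Shannon regularizer (dual-space interpolation).
    pi_t = normalize(pi_bar ** eta * pi_t ** (1.0 - eta))
    dual_log.append(np.log(pi_bar))

# The iterate equals the normalized geometric mean of the estimates, which
# approximates the fixed point grad Omega(pi*) = lim E[grad Omega(pi_bar_E,t)].
fixed_point = normalize(np.exp(np.mean(dual_log, axis=0)))
print(np.round(pi_t, 3), np.round(fixed_point, 3))
```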