Robust Imitation via
Mirror Descent Inverse Reinforcement Learning
Dong-Sig Han, Hyunseo Kim, Hyundo Lee, Je-Hwan Ryu, Byoung-Tak Zhang
Artificial Intelligence Institute, Seoul National University
{dshan, hskim, hdlee, jhryu, btzhang}@bi.snu.ac.kr
Abstract
Recently, adversarial imitation learning has provided a scalable approach to reward acquisition
for inverse reinforcement learning (IRL) problems. However, estimated
reward signals often become uncertain and fail to train a reliable statistical model
since the existing methods tend to solve hard optimization problems directly.
Inspired by a first-order optimization method called mirror descent, this paper
proposes to predict a sequence of reward functions, which are iterative solutions
for a constrained convex problem. IRL solutions derived by mirror descent are
tolerant to the uncertainty incurred by target density estimation since the amount
of reward learning is regulated with respect to local geometric constraints. We
prove that the proposed mirror descent update rule ensures robust minimization of a Bregman divergence in terms of a rigorous regret bound of $\mathcal{O}(1/T)$ for step sizes $\{\eta_t\}_{t=1}^{T}$. Our IRL method was applied on top of an adversarial framework, and it outperformed existing adversarial methods in an extensive suite of benchmarks.
1 Introduction
One crucial requirement of practical imitation learning methods is robustness, often described
as learning expert behavior from a finite number of demonstrations while overcoming various realistic
challenges [1]. In real-world problems such as motor control tasks, the demonstration size can be insufficient to create a precise model of an expert [2], and even in some cases, demonstrations can be noisy or suboptimal to solve the problem [3]. For such challenging scenarios, imitation
learning algorithms inevitably struggle with unreliable statistical models; thus, how the uncertainty of estimated cost functions is handled dramatically affects imitation performance. Therefore, a thorough analysis of these issues is required to construct a robust algorithm.
Inverse reinforcement learning (IRL) is an algorithm for learning ground-truth rewards from expert
demonstrations where the expert acts optimally with respect to an unknown reward function [4, 5].
Traditional IRL studies solve the imitation problem based on iterative algorithms [6, 7], alternating between the reward estimation process and a reinforcement learning (RL) [8] algorithm. In contrast,
newer studies of adversarial imitation learning (AIL) [9, 10] rather suggest learning reward functions of a certain form “directly,” by using adversarial learning objectives [11] and nonlinear discriminative neural networks [10]. Compared to classical approaches, the AIL methods have shown great success
on control benchmarks in terms of scalability for challenging control tasks [12].
Technically, it is well known that AIL formulates a divergence minimization problem with its
discriminative signals, which incorporates fine-tuned estimations of the target densities [13]. Through the lens of differential geometries, the limitation of AIL naturally comes from the implication that minimizing the divergence does not guarantee unbiased progression due to constraints of the underlying space [14]. In order to ensure further stability, we argue that an IRL algorithm's progress
needs to be regulated, yielding gradual updates with respect to local geometries of policy distributions.
We claim that there are two issues leading to unconstrained policy updates: (1) a statistical divergence often cannot be accurately obtained for challenging problems, and (2) an immediate divergence between agent and expert densities does not guarantee unbiased learning directions. Our approach is connected to a collection of optimization processes called mirror descent (MD) [15]. For a sequence of parameters $\{w_t\}_{t=1}^{T}$ and a convex function $\Omega$, an MD update for a cost function $F_t$ is derived as
$$\nabla\Omega(w_{t+1}) = \nabla\Omega(w_t) - \eta_t \nabla F_t(w_t). \qquad (1)$$
In the equation, the gradient $\nabla\Omega(\cdot)$ creates a transformation that links a parametric space to its dual
space. Theoretically, MD is a first-order method for solving constrained problems, which enjoys
rigorous regret bounds for various geometries [16, 17] including probability spaces. Thus, applying
MD to the reward estimation process can be efficient in terms of the number of learning phases.
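To make Eq. (1) concrete, the following sketch (our illustration, not code from the paper) runs mirror descent on the probability simplex with the negative Shannon entropy as $\Omega$; in that case $\nabla\Omega$ is the elementwise logarithm up to a constant, and the update reduces to the familiar exponentiated-gradient rule. The target distribution `p_target` and the cost (a KL divergence to it) are arbitrary choices for the demonstration.

```python
import numpy as np

def md_step(w, grad_F, eta):
    """One MD step of Eq. (1) on the simplex with Omega = negative Shannon entropy.

    grad Omega(w) = log(w) + 1; the constant is absorbed by the normalization,
    so the update is the exponentiated-gradient rule."""
    z = np.log(w) - eta * grad_F(w)      # update in the dual (mirror) space
    w_next = np.exp(z - z.max())         # map back through grad Omega^* ...
    return w_next / w_next.sum()         # ... i.e., normalize onto the simplex

# Toy cost (an arbitrary choice for illustration): the KL divergence to a fixed
# target distribution p_target; its gradient w.r.t. w is log(w / p_target) + 1.
p_target = np.array([0.7, 0.2, 0.1])
grad_F = lambda w: np.log(w / p_target) + 1.0

w = np.full(3, 1.0 / 3.0)                # start from the uniform distribution
for t in range(1, 51):
    w = md_step(w, grad_F, eta=1.0 / t)  # decaying step sizes eta_t
print(np.round(w, 3))                    # approaches p_target
```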
In this paper, we derive an MD update rule in IRL upon a postulate of nonstationary estimations
of the expert density, resulting in convergent reward acquisition even for challenging problems.
Compared to MD algorithms in optimization studies, our methodology draws a sequence of functions
on an alternative space induced by a reward operator $\Psi_\Omega$ (Definition 1). To this end, we propose an AIL algorithm called mirror descent adversarial inverse reinforcement learning (MD-AIRL). Our empirical evidence showed that MD-AIRL outperforms the regularized adversarial IRL (RAIRL) [18] methods. For example, MD-AIRL showed higher performance in 30 distinct cases among 32 different configurations in challenging MuJoCo [19] benchmarks, and it also clearly showed higher
tolerance to suboptimal data. All of these results are strongly aligned with our theoretical analyses.
Table 1: A technical overview. Traditional IRL methods lack scalability, and RAIRL does not
guarantee convergence of its solution for realistic cases. MD-AIRL combines desirable properties.
Method            Reference   Scalability   Rewards   Bregman divergence   Iterative solutions   Convergence analyses
BC (1991)         [20]        ✓             ✗         ✗                    ✗                     ✓
MM-IRL (2004)     [6]         ✗             ✓         ✗                    ✓                     ✓
GAIL (2016)       [9]         ✓             ✓         ✗                    ✗                     ✗
RAIRL (2021)      [18]        ✓             ✓         ✓                    ✗                     ✗
MD-AIRL (ours)    —           ✓             ✓         ✓                    ✓                     ✓
Our contributions.
Our work is complementary to previous IRL studies; the theoretical and
technical contributions are built upon a novel perspective of considering iterative RL and IRL
algorithms as a combined optimization process with dual aspects. Comparing MD-AIRL and RAIRL,
both are highly generalized algorithms in terms of a variety of choices of divergence functions. Tab. 1
shows that MD-AIRL brings beneficial results in realistic situations of limited time and data, since our
approach is more aligned with earlier theoretical IRL studies providing formalized reward learning
schemes and convergence guarantees. In summary, we list our main contributions below:
• Instead of a monolithic estimation process of a global solution in AIL, we derive a sequence of reward functions that provides iterative local objectives (Section 4).
• We formally prove that rewards derived by an MD update rule guarantee the robust performance of divergence minimization along with a rigorous regret bound (Section 5).
• We propose a novel adversarial algorithm that is motivated by mirror descent, which is tolerant of unreliable discriminative signals of the AIL framework (Sections 6 and 7).
2 Related Works
Mirror descent.
We are interested in a family of statistical divergences called the Bregman
divergence [21]. The divergence generalizes constrained optimization problems such as least squares [22, 23], and it also has been applied in various subfields of machine learning [24, 25]. In differential geometries, the Bregman divergence is a first-order approximation for a metric tensor and satisfies metric-like properties [14, 26]. MD is also closely related to optimization methods regarding non-Euclidean geometries with a discretization of steps such as natural gradients [27, 28]. In the primal space, training with the infinitesimal limit of MD steps corresponds to a Riemannian gradient flow [29, 30]. In the RL domain, MD has been recently studied for policy optimization [31–33]. In this
paper, we focus on learning with suboptimal representations of policy, and our distinct goal is to draw
a robust reward learning scheme based on MD for the IRL problem.
Imitation learning.
As a statistical model for the information geometry [34], energy-based policies (i.e., Boltzmann distributions) appeared in early IRL studies, such as Bayesian IRL, natural gradient IRL, and maximum likelihood IRL [35–37], for modeling the expert distribution with parameterized functions. Notably, MaxEnt IRL [7, 38] is one of the representative classical IRL algorithms based on an information-theoretic perspective toward IRL solutions. Also, discriminators of AIL are trained by logistic regression; thus, the logit score of the discriminator defines an energy function that approximates the true data density of the expert distribution [39]. Other statistical entropies have also been applied to AIL, such as the Tsallis entropy [40]. On the one hand, our approach is closely related to RAIRL [18], which defined its AIL objective using the Bregman divergence. On the other
hand, this work further employs the Bregman divergence to derive iterative MD updates for reward
functions, resulting in theoretically pleasing properties while retaining the scalability of AIL.
Learning theory.
There have been considerable achievements in dealing with temporal costs $\{F_t\}_{t=1}^{\infty}$, often referred to as online learning [41]. The most ordinary approach is stochastic gradient descent (SGD): $w_{t+1} = w_t - \eta_t \nabla F_t(w_t)$. In particular, SGD is a desirable algorithm when the parameter $w_t$ resides in the Euclidean space since it ensures unbiased minimization of the expected cost. Apparently, policies appear in geometries of probabilities; thus, an incurred gradient may not be the direction of the steepest descent due to geometric constraints [27, 34]. An online form of MD in Eq. (1) is analogous to SGD for non-Euclidean spaces, where each local metric is specified by a Bregman divergence [30]. Our theoretical findings and proofs follow the results of online mirror descent (OMD) that appeared in previous literature for general aspects [15, 16, 42, 28, 17, 30]. Our
analyses extend existing theoretical results to IRL; at the same time, they are also highly general to
cover various online imitation learning problems which require making decisions sequentially.
3 Background
For sets $X$ and $Y$, let $Y^X$ be a set of functions from $X$ to $Y$, and let $\Delta_X$ ($\Delta_X^Y$) be a set of (conditional) probabilities over $X$ (conditioned on $Y$). We consider an MDP defined as a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ with the state space $\mathcal{S}$, the action space $\mathcal{A}$, the Markovian transition kernel $P \in \Delta_{\mathcal{S}}^{\mathcal{S} \times \mathcal{A}}$, the reward function $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, and the discount factor $\gamma \in [0, 1)$. Let a function $\Omega: \Delta_{\mathcal{A}} \to \mathbb{R}$ be strongly convex. Using $\Omega$, the Bregman divergence is defined as
$$D_\Omega(\pi^s \,\|\, \hat\pi^s) := \Omega(\pi^s) - \Omega(\hat\pi^s) - \big\langle \nabla\Omega(\hat\pi^s),\, \pi^s - \hat\pi^s \big\rangle_{\mathcal{A}},$$
where $\pi^s$ and $\hat\pi^s$ denote arbitrary policies for a given state $s$. For a representative divergence, one can consider the popular Kullback-Leibler (KL) divergence. The KL divergence is a Bregman divergence when $\Omega$ is specified as the negative Shannon entropy: $\Omega(\pi^s) = \sum_a \pi^s(a) \ln \pi^s(a)$.
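As a quick numerical sanity check (ours, not the paper's code), the snippet below evaluates $D_\Omega$ for the negative Shannon entropy on two categorical distributions and confirms that it coincides with the KL divergence.

```python
import numpy as np

def omega(p):
    """Negative Shannon entropy, Omega(p) = sum_a p(a) ln p(a)."""
    return np.sum(p * np.log(p))

def grad_omega(p):
    """Gradient of the negative Shannon entropy: log p + 1 (elementwise)."""
    return np.log(p) + 1.0

def bregman(p, q):
    """D_Omega(p || q) = Omega(p) - Omega(q) - <grad Omega(q), p - q>."""
    return omega(p) - omega(q) - np.dot(grad_omega(q), p - q)

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(np.isclose(bregman(p, q), kl(p, q)))   # True: the two divergences coincide
```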
Regularization of the policy distribution with respect to a convex $\Omega$ brings distinct properties to the learning agent [43, 44]. The objective of regularized RL is to find $\pi \in \Pi$ that maximizes the expected value of discounted cumulative returns along with a causal convex regularizer $\Omega$, i.e.,
$$J(\pi, r) := \mathbb{E}_\pi\Big[\textstyle\sum_{i=0}^{\infty} \gamma^i \big\{ r(s_i, a_i) - \Omega(\pi(\cdot \mid s_i)) \big\}\Big], \qquad (2)$$
where the subscript $\pi$ on the expectation indicates that each action is sampled by $\pi(\cdot \mid s_i)$ for the given MDP. In this setup, a regularized RL algorithm finds a unique solution in a subset of the conditional probability space denoted as $\Pi := \{[\pi^s]_{s \in \mathcal{S}}\} \subseteq \Delta_{\mathcal{A}}^{\mathcal{S}}$ constrained by the parameterization of a policy.
The objective of IRL is to find a function $r_E$ that rationalizes the behavior of an expert policy $\pi_E$. For an inner product $\langle \cdot, \cdot \rangle_{\mathcal{A}}$, consider $\Omega^*$, the Legendre-Fenchel transform (convex conjugate) of $\Omega$:
$$\forall q^s \in \mathbb{R}^{\mathcal{A}}, \quad \Omega^*(q^s) = \max_{\pi^s \in \Delta_{\mathcal{A}}} \big\langle \pi^s, q^s \big\rangle_{\mathcal{A}} - \Omega(\pi^s), \qquad (3)$$
where $q^s$ and $\pi^s$ denote the shorthand notation of $q(s, \cdot)$ and $\pi(\cdot \mid s)$. Differentiating both sides with respect to $q^s$, the gradient of the conjugate $\nabla\Omega^*$ maps $q^s$ to a policy distribution. One fundamental property in regularized IRL [43] is that $\pi_E$ is the maximizing argument of $\Omega^*$ for $q_E$, where $q_E$ is the regularized state-action value function $q_E(s, a) = \mathbb{E}_{\pi_E}\big[\sum_{i=0}^{\infty} \gamma^i \{ r_E(s_i, a_i) - \Omega(\pi_E^{s_i}) \} \mid s_0 = s, a_0 = a \big]$.
Note that the problem is ill-posed, and every $\hat{r}_E$ that makes its value function $\hat{q}_E$ satisfy $\pi_E^s = \nabla\Omega^*(\hat{q}_E^s)\ \forall s \in \mathcal{S}$ is a valid solution. Addressing this issue, Jeon et al. [18] proposed a reward operator $\Psi_\Omega: \Delta_{\mathcal{A}}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, providing a unique IRL solution by $\Psi_\Omega(\pi_E)$.
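For intuition, with the negative Shannon entropy the conjugate in Eq. (3) has the closed form $\Omega^*(q^s) = \ln \sum_a \exp q^s(a)$ and $\nabla\Omega^*$ is the softmax map, so a policy is recovered from state-action values by a softmax. The sketch below (our own, for this special case only) verifies the conjugacy relation numerically.

```python
import numpy as np

def conjugate(q):
    """Legendre-Fenchel transform of the negative Shannon entropy: log-sum-exp."""
    return np.log(np.sum(np.exp(q)))

def grad_conjugate(q):
    """grad Omega^*(q) is the softmax map, i.e., the maximizer of Eq. (3)."""
    e = np.exp(q - q.max())
    return e / e.sum()

def omega(p):
    """Negative Shannon entropy."""
    return np.sum(p * np.log(p))

q_s = np.array([1.0, 0.5, -0.2])       # some state-action values q(s, .)
pi_s = grad_conjugate(q_s)             # policy recovered from the values

# Eq. (3): Omega^*(q_s) equals <pi_s, q_s> - Omega(pi_s) at the maximizing policy.
print(np.isclose(conjugate(q_s), np.dot(pi_s, q_s) - omega(pi_s)))   # True
```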
Definition 1 (Regularized reward operators). Define the regularized reward operator $\Psi_\Omega$ as
$$\psi_\pi(s, a) := \Omega'(s, a; \pi) - \big\langle \pi^s, \nabla\Omega(\pi^s) \big\rangle_{\mathcal{A}} + \Omega(\pi^s), \quad \text{for } \Omega'(s, \cdot; \pi) := \nabla\Omega(\pi^s) = \nabla_p \Omega(p)\big|_{p = \pi(\cdot \mid s)}.$$
The reward function $\psi_E := \Psi_\Omega(\pi_E)$ replaces its state-action value function, since the sum of composite Bregman divergences derived from Eq. (2) allows reward learning in a greedy manner [18].
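The reward operator of Definition 1 can be evaluated directly for a tabular policy. The following sketch (ours, assuming the negative Shannon entropy as $\Omega$) computes $\psi_\pi(s, \cdot)$ and recovers the special case $\psi_\pi(s, a) = \ln \pi(a \mid s)$, i.e., the regularized reward is the log-policy.

```python
import numpy as np

def omega(p):
    """Negative Shannon entropy, Omega(p) = sum_a p(a) ln p(a)."""
    return np.sum(p * np.log(p))

def grad_omega(p):
    """Gradient of the negative Shannon entropy: log p + 1 (elementwise)."""
    return np.log(p) + 1.0

def psi(pi_s):
    """Definition 1 applied state-wise:
    psi_pi(s, .) = grad Omega(pi_s) - <pi_s, grad Omega(pi_s)> + Omega(pi_s)."""
    g = grad_omega(pi_s)
    return g - np.dot(pi_s, g) + omega(pi_s)

pi_s = np.array([0.6, 0.3, 0.1])                 # a categorical policy at some state s
print(np.allclose(psi(pi_s), np.log(pi_s)))      # True: psi equals ln pi(.|s)
```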
4 RL-IRL as a Proximal Method
Figure 1: A schematic illustration of the policy space $\Pi$ and the reward space $\mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. MD is locally constrained by a divergence (gray area), i.e., $D_\Omega(\cdot \,\|\, \pi_t)$. An MD update is performed for the reward function $\psi_t$ in an associated reward space of $\Omega$ defined by $\Psi_\Omega$, and $\pi_{t+1}$ is achieved in the desired space of $\Pi$ by applying $\nabla\Omega^*$ for the function $\psi_{t+1}$. The gray dashed lines provide another interpretation of MD with $\tilde\pi_{t+1}$ and the projection operator $P_\Omega$.
Associated reward functions.
We consider the RL-IRL processes as a sequential algorithm with local constraints and define sequences $\{\pi_t\}_{t=1}^{\infty}$ and $\{\psi_t\}_{t=1}^{\infty}$ that denote policies and associated reward functions, respectively. The associated reward functions are in a space $\Psi_\Omega(\Pi)$, which is an alternative space of the dual space, defined by the regularized reward operator $\Psi_\Omega$. Formally, we provide Lemma 1, which shows a bijective relation between the operators $\nabla\Omega^*$ and $\Psi_\Omega$ in the set $\Pi$. The proof is in Appendix A.
Lemma 1 (Natural isomorphism). Let $\psi \in \Psi_\Omega(\Pi)$ for $\Psi_\Omega(X) := \{\psi \mid \psi(s, a) = \psi_\pi(s, a),\ \forall s \in \mathcal{S}, a \in \mathcal{A}, \pi \in X\}$. Then, $\nabla\Omega^*(\psi)$ is unique, and for every $\pi = \nabla\Omega^*(\psi)$, $\pi \in \Pi$.
Fig. 1 illustrates that there is a unique $\psi_t$ for $\pi_t$ in every time step. Note that $\Psi_\Omega(\pi_t)$ is different from $\nabla\Omega(\pi_t)$; it is shifted by a vector $\mathbf{1}c$ with a constant $c = \Omega(\pi^s_t) - \langle \pi^s_t, \nabla\Omega(\pi^s_t) \rangle_{\mathcal{A}}$. Since the underlying space is a probability simplex, the operator $\nabla\Omega^*$ reconstructs the original point for both $\Psi_\Omega$ and $\nabla\Omega$, as the distributivity [43] $\Omega^*(y + \mathbf{1}c) = \Omega^*(y) + c$ holds (so $\nabla\Omega^*(y + \mathbf{1}c) = \nabla\Omega^*(y)$).
An alternative interpretation is of considering a projection (gray dashed line in Fig. 1). Suppose that a policy $\pi_t$ is updated to $\tilde\pi_{t+1} \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. The Bregman projection operator $P_\Omega$ is applied that locates the subsequent update $\pi_{t+1}$ to the “feasible” region, i.e., $P_\Omega(\tilde\pi_{t+1}) := \operatorname{argmin}_{\pi \in \Pi} \big[ D_\Omega(\pi^s \,\|\, \tilde\pi^s_{t+1}) \big]_{s \in \mathcal{S}}$.

Consequently, one can consider an updated reward function $\psi_{t+1}$ as a projected target of MD associated with an alternative parameterization of $\Pi$. For instance, the parameters of $\psi_t$ can construct a softmax policy for a discrete space, or a Gaussian policy for a continuous space. Using the reward function $\psi_{t+1}$, an arbitrary regularized RL process maximizing Eq. (2) at the $t$-th step [18]
$$J(\pi, \psi_{t+1}) = -\mathbb{E}_\pi\Big[\textstyle\sum_{i=0}^{\infty} \gamma^i D_\Omega\big(\pi(\cdot \mid s_i) \,\big\|\, \pi_{t+1}(\cdot \mid s_i)\big)\Big] \qquad (4)$$
becomes finding the next iteration $\pi_{t+1} = \nabla\Omega^*(\psi_{t+1})$ by maximizing the expected cumulative return. The equation shows that a regularized RL algorithm with the regularizer $\Omega$ forms a cumulative sum of Bregman divergences; thus, the policy $\pi_{t+1}$ is uniquely achieved by the property of divergence.
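The state-wise identity behind Eq. (4) is easy to verify numerically: under the Shannon regularizer, $\mathbb{E}_{a \sim \pi^s}[\psi_{t+1}(s, a)] - \Omega(\pi^s) = -D_\Omega(\pi^s \,\|\, \pi^s_{t+1})$, where $\psi_{t+1}(s, \cdot) = \ln \pi_{t+1}(\cdot \mid s)$. A small check of ours with tabular categorical policies:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

pi_s = np.array([0.2, 0.5, 0.3])       # candidate policy pi(.|s)
pi_next = np.array([0.6, 0.3, 0.1])    # target policy pi_{t+1}(.|s)

psi_next = np.log(pi_next)             # Psi_Omega(pi_{t+1})(s, .) under the Shannon regularizer
lhs = np.dot(pi_s, psi_next) - np.sum(pi_s * np.log(pi_s))   # E_a[psi] - Omega(pi_s)
print(np.isclose(lhs, -kl(pi_s, pi_next)))                   # True
```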
Online imitation learning.
Our setup starts from the apparent yet vital premise that an imitation learning algorithm does not retain the global target $\pi_E$ during training. That is, it is fundamentally uncertain to model global objectives (such as $J(\pi, \psi_E)$), which are not attainable for both RL and IRL. Instead, we hypothesize on the existence of a random process $\{\bar\pi_{E,t}\}_{t=1}^{\infty}$ where each estimation $\bar\pi_{E,t}$ resides in a closed, convex neighborhood of $\pi_E$, generated by an arbitrary estimation algorithm. Substituting $\psi_E$ with $\psi_{\bar\pi_{E,t}} = \Psi_\Omega(\bar\pi_{E,t})$, the nonstationary objective $J(\pi, \psi_{\bar\pi_{E,t}})$ forms a temporal cost:
$$F_t(\pi) = \mathbb{E}_\pi\Big[\textstyle\sum_{i=0}^{\infty} \gamma^i D_\Omega\big(\pi(\cdot \mid s_i) \,\big\|\, \bar\pi_{E,t}(\cdot \mid s_i)\big)\Big]. \qquad (5)$$
For the sake of better understanding, we considered an actual experiment depicted in Fig. 2. Suppose that the policies of the learning agent and the expert follow multivariate Gaussian distributions $\mathcal{N}([0,0]^\top, I)$ and $\mathcal{N}([5,3]^\top, \Sigma_E)$ with $|\Sigma_E| < 1$. Let a (suboptimal) reference policy $\bar\pi_{E,t}$ be independently fitted with a maximum likelihood estimator with a relatively high learning rate, starting from $\bar\pi_{E,1} = \pi_1$. The policy $\pi_t$ was trained by a cost function $D_\Omega(\cdot \,\|\, \bar\pi_{E,t})$ using the MD update rule in Eq. (1). In Fig. 2, we first observed that choosing a high step size constant $\eta$ accelerated the training speed mainly in the early phase. The results also showed that the performance of MD ($D_\Omega(\pi_t \,\|\, \pi_E)$) outperformed that of the referenced maximum likelihood estimation ($D_\Omega(\bar\pi_{E,t} \,\|\, \pi_E)$) by choosing an effectively low step size. This empirical evidence suggests that there are clear advantages in formalizing the training steps and scheduling the step sizes, especially for an unreliable statistical model $\bar\pi_{E,t}$.

Figure 2: (a) A policy $\pi_t$ learns from MD updates for temporal costs $D_\Omega(\cdot \,\|\, \bar\pi_{E,t})$. (b) The updates of $\pi_t$ vary by $\eta$, and the distance between $\pi_t$ and $\pi_E$ can be closer than the distance between $\bar\pi_{E,t}$ and $\pi_E$ when $t$ is sufficiently large and $\eta$ is effectively low. (c) Two plots show $D_\Omega(\pi_t \,\|\, \pi_E)$ associated with entropic regularizers (Shannon and Tsallis) for four different $\eta$ (10 trials), with the red baselines $D_\Omega(\bar\pi_{E,t} \,\|\, \pi_E)$.
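A simplified reproduction of the spirit of this experiment is sketched below (ours, not the paper's exact setup: only the Gaussian means are tracked, covariances are fixed to the identity, $\bar\pi_{E,t}$ is the sample mean of a small expert batch at every step, and the step size is a constant $\eta$). Under these assumptions the MD step of Eq. (1) reduces to an interpolation of means, and the iterate averages out the noise of the per-step estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_E = np.array([5.0, 3.0])            # expert mean; covariances are fixed to I here
mu_t = np.zeros(2)                     # agent starts at N([0, 0], I)
eta = 0.2                              # an effectively low, constant step size

def kl_isotropic(mu_a, mu_b):
    """KL divergence between N(mu_a, I) and N(mu_b, I)."""
    return 0.5 * np.sum((mu_a - mu_b) ** 2)

for t in range(1, 101):
    batch = rng.normal(mu_E, 1.0, size=(8, 2))   # small, noisy expert batch
    mu_bar = batch.mean(axis=0)                  # per-step MLE of the expert mean
    # For Gaussian means with identity covariance, the MD step of Eq. (1)
    # is the interpolation mu_t <- (1 - eta) * mu_t + eta * mu_bar.
    mu_t = (1.0 - eta) * mu_t + eta * mu_bar

print(kl_isotropic(mu_t, mu_E))        # divergence of the MD iterate to the expert
print(kl_isotropic(mu_bar, mu_E))      # divergence of the last noisy estimate, for comparison
```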
MD update rules.
As a result of these findings, we formulate subsequent MD steps with a regularized
reward function. Let $w_t$ be a parameter on a set $\mathcal{W}$ and $F_t: \mathcal{W} \to \mathbb{R}$ be a convex cost function from a class of functions $\mathcal{F}$ at the $t$-th step. Replacing the L2 proximity term of proximal gradient descent with a Bregman divergence, the proximal form of the MD update for Eq. (1) is written as [45]
$$\underset{w \in \mathcal{W}}{\text{minimize}} \;\; \big\langle \nabla F_t(w_t),\, w - w_t \big\rangle_{\mathcal{W}} + \alpha_t D_\Omega(w \,\|\, w_t), \qquad (6)$$
where $\alpha_t := 1/\eta_t$ denotes an inverse of the current step size $\eta_t$ [46]. Plugging each divergence of the cumulative cost $F_t$ to Eq. (6), the MD-IRL update for the subsequent reward function $\psi_{t+1} = \Psi_\Omega(\pi_{t+1})$ is derived by solving a problem
$$\begin{aligned}
&\underset{\pi^s \in \Pi^s}{\text{minimize}} \;\; \Big\langle \underbrace{\nabla D_\Omega(\pi^s_t \,\|\, \bar\pi^s_{E,t})}_{\nabla\Omega(\pi^s_t) - \nabla\Omega(\bar\pi^s_{E,t})},\, \pi^s - \pi^s_t \Big\rangle_{\mathcal{A}} + \alpha_t D_\Omega(\pi^s \,\|\, \pi^s_t) \\
\iff\; &\underset{\pi^s \in \Pi^s}{\text{minimize}} \;\; D_\Omega(\pi^s \,\|\, \bar\pi^s_{E,t}) - D_\Omega(\pi^s \,\|\, \pi^s_t) + \alpha_t D_\Omega(\pi^s \,\|\, \pi^s_t) \\
\iff\; &\underset{\pi^s \in \Pi^s}{\text{minimize}} \;\; \eta_t \underbrace{D_\Omega(\pi^s \,\|\, \bar\pi^s_{E,t})}_{\text{estimated expert}} + (1 - \eta_t) \underbrace{D_\Omega(\pi^s \,\|\, \pi^s_t)}_{\text{learning agent}} \qquad \forall s \in \mathcal{S}, \qquad (7)
\end{aligned}$$
where the gradient of $D_\Omega$ is taken with respect to its first argument $\pi^s_t$. Note that solving the optimization in Eq. (7) requires interaction between $\pi_t$ and the dynamics of the given environment in order to minimize $F_t$; thus, the corresponding RL process plays an essential role in sequential learning through the induced value measures. At a glance, the objective is analogous to finding an interpolation at each iteration, where the point is controlled by $\eta_t$. Fig. 3 shows that the uncertainty of $\pi_t$ (blue region) becomes minimal regardless of the persisting uncertainty of $\bar\pi_{E,t}$ (red region).
Figure 3: Illustrations of MD at the (a) $t$-th iteration and (b) $(t+1)$-th iteration where $\eta_t > \eta_{t+1}$. $\{\bar\pi_{E,t}\}_{t=1}^{\infty}$ is an arbitrary estimation process attained from a neighborhood of $\pi_E$ with respect to a norm. The MD update is taken inside the interval of $\pi_t$ and $\bar\pi_{E,t}$ using Eq. (7).
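When the policy class is unconstrained on the simplex and $\Omega$ is the negative Shannon entropy, the minimizer of the last line of Eq. (7) has a closed form: its dual coordinates interpolate as $\nabla\Omega(\pi^s_{t+1}) = \eta_t \nabla\Omega(\bar\pi^s_{E,t}) + (1 - \eta_t) \nabla\Omega(\pi^s_t)$, i.e., the update is a normalized geometric mixture. The sketch below (our illustration under these assumptions; the paper's algorithm instead parameterizes $\Pi$ and relies on an RL step) checks this against a brute-force search over the simplex.

```python
import numpy as np

def kl(p, q):
    """Bregman divergence under the Shannon regularizer (the KL divergence)."""
    return np.sum(p * np.log(p / q))

def md_objective(p, pi_t, pi_bar, eta):
    """Last line of Eq. (7): eta * D(p || pi_bar) + (1 - eta) * D(p || pi_t)."""
    return eta * kl(p, pi_bar) + (1.0 - eta) * kl(p, pi_t)

pi_t = np.array([0.5, 0.4, 0.1])       # current policy at some state s
pi_bar = np.array([0.1, 0.2, 0.7])     # current (noisy) estimate of the expert
eta = 0.3

# Closed-form minimizer: normalized geometric mixture (dual-space interpolation).
pi_next = pi_bar ** eta * pi_t ** (1.0 - eta)
pi_next /= pi_next.sum()

# Brute-force comparison: no random point on the simplex does better.
rng = np.random.default_rng(1)
candidates = rng.dirichlet(np.ones(3), size=1000)
best_random = min(md_objective(c, pi_t, pi_bar, eta) for c in candidates)
print(md_objective(pi_next, pi_t, pi_bar, eta) <= best_random)   # True
```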
5 Convergence Analyses
In this section, we present our theoretical results. The main goals of the following arguments are to
address (1) the convergence of MD updates for various cases and (2) the necessity of scheduling the amount of learning. Suppose that state instances of $s^{(t)}_i \in \tau_t$ cover the entire $\mathcal{S}$ by executing the policy $\pi_t$ in an infinite horizon. From this assumption, we define a temporal cost function at the time step $t$:
$$f(\pi_t, \tau_t) := \textstyle\sum_{i=0}^{\infty} \gamma^i D_\Omega\big(\pi_t(\cdot \mid s^{(t)}_i) \,\big\|\, \bar\pi_{E,t}(\cdot \mid s^{(t)}_i)\big), \qquad (8)$$
that involves $\pi_t$, and additionally a trajectory $\tau_t$, as inputs. We refer to the global objective as finding a unique fixed point $\pi^* \in \Pi$ that minimizes a total cost $F(\pi) := \mathbb{E}[f(\pi, \tau_t)]$, where the expectation is taken over trajectories of entire steps, i.e., $\lim_{t \to \infty} \mathbb{E}_{\tau_{1:t}}[f(\pi, \tau_t)]$. Taking the (stepwise) gradient for each $\pi(\cdot \mid s)$, an optimal policy $\pi^*$ is found by $\mathbb{E}[\nabla\Omega(\pi^*(\cdot \mid s)) - \nabla\Omega(\bar\pi_{E,t}(\cdot \mid s))] = 0$ when $t \to \infty$; hence, $\nabla\Omega(\pi^*(\cdot \mid s)) = \lim_{t \to \infty} \mathbb{E}[\nabla\Omega(\bar\pi_{E,t}(\cdot \mid s))]$. Introducing the optimal policy $\pi^*$ allows not only the specific situation when (1) $\pi_E = \pi^* \in \Pi$ and the estimation algorithm of $\bar\pi_{E,t}$ is actually convergent with $t \to \infty$, but also more general situations where (2) $\pi_E \notin \Pi$ or the estimated expert policy $\bar\pi_{E,t}$ does not converge; the algorithm finds convergence to a fixed point by scheduling updates.
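To see how scheduling alone yields a convergent iterate even when $\bar\pi_{E,t}$ keeps fluctuating (case (2) above), consider a single-state simplification in which each MD step is the closed-form dual-space interpolation from Eq. (7) under the Shannon regularizer. With $\eta_t = 1/t$, the dual coordinates of the iterate become the running average of $\nabla\Omega(\bar\pi_{E,t})$, matching the fixed-point condition $\nabla\Omega(\pi^*) = \lim_{t \to \infty} \mathbb{E}[\nabla\Omega(\bar\pi_{E,t})]$. The sketch below (ours, with an artificial noisy estimation process) demonstrates this.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(p):
    return p / p.sum()

pi_E = np.array([0.6, 0.3, 0.1])       # true expert policy at a single state
pi_t = np.full(3, 1.0 / 3.0)           # agent starts from the uniform policy

dual_log = []
for t in range(1, 2001):
    # A non-convergent estimation process: a fresh noisy perturbation of pi_E.
    pi_bar = normalize(pi_E * np.exp(0.3 * rng.normal(size=3)))
    eta = 1.0 / t                      # scheduled step sizes
    # Closed-form MD step under the Shannon regularizer (dual-space interpolation).
    pi_t = normalize(pi_bar ** eta * pi_t ** (1.0 - eta))
    dual_log.append(np.log(pi_bar))

# The iterate equals the normalized geometric mean of the estimates, which
# approximates the fixed point grad Omega(pi*) = lim E[grad Omega(pi_bar_E,t)].
fixed_point = normalize(np.exp(np.mean(dual_log, axis=0)))
print(np.round(pi_t, 3), np.round(fixed_point, 3))
```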