practice. Further, we provide solid theoretical justification for the three contributions mentioned above, and compare our algorithm with SOTA HIL and IL baselines on multiple Mujoco [17] continuous control tasks, where it significantly outperforms the others.
II. RELATED WORK
Imitation Learning. Imitation learning methods [18] seek
to learn to perform a task from expert demonstrations, where
the learner is given only samples of trajectories from the
expert and is not provided with any reinforcement signal, such as environmental rewards, which are usually hard to acquire in real-life scenarios. There are two main branches in this setting: behavioral cloning (BC) [19],
which learns a policy as a supervised learning problem
over state-action pairs from expert trajectories, and inverse
reinforcement learning (IRL) [20], which first infers a reward
function under which the expert is uniquely optimal and then
recovers the expert policy based on it. Behavioral cloning
only tends to succeed with large amounts of data, due to the
compounding error caused by covariate shift [21]. Inverse
reinforcement learning, while avoiding the compounding
error, is extremely expensive to solve and scale, since it requires running reinforcement learning to obtain the corresponding optimal policy in each iteration of the reward-function update. GAIL [9] and AIRL [12] have been proposed
to scale IRL for complex high-dimensional control tasks.
They realize IRL through an adversarial learning framework,
where they alternately update a policy and a discriminator
network. The discriminator serves as the reward function
and learns to differentiate between the expert demonstrations
and state-action pairs from the learned policy. Meanwhile, the policy is trained to generate trajectories that are difficult for the discriminator to distinguish from the expert data. Mathematical details are provided in Section III-A. AIRL explicitly recovers the reward function and achieves more robust and stable performance on challenging tasks [12], [13], [14], so we choose it as the base algorithm for our extension.
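To make the alternating update concrete, below is a minimal, self-contained sketch of one way to implement such a loop in PyTorch; the toy data, network sizes, and the single-step REINFORCE-style policy update are illustrative assumptions and not the exact procedure of GAIL [9] or AIRL [12].

```python
# Minimal sketch of a GAIL/AIRL-style alternating update (illustrative only).
# Assumptions: synthetic 3-D states, 1-D actions, a logistic discriminator over
# (s, a), and a Gaussian policy updated with a one-step REINFORCE-style gradient
# using the discriminator's output as a surrogate reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 3, 1
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy_mean = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
log_std = nn.Parameter(torch.zeros(action_dim))

disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
pi_opt = torch.optim.Adam(list(policy_mean.parameters()) + [log_std], lr=1e-3)

expert_s = torch.randn(256, state_dim)       # placeholder expert states
expert_a = torch.tanh(expert_s[:, :1])       # placeholder expert actions

for it in range(200):
    # Sample (s, a) pairs from the current policy on random states.
    s = torch.randn(256, state_dim)
    dist = torch.distributions.Normal(policy_mean(s), log_std.exp())
    a = dist.sample()                        # detached sample

    # Discriminator step: classify expert pairs (label 1) vs. policy pairs (label 0).
    logit_e = disc(torch.cat([expert_s, expert_a], dim=1))
    logit_p = disc(torch.cat([s, a], dim=1))
    d_loss = F.binary_cross_entropy_with_logits(logit_e, torch.ones_like(logit_e)) + \
             F.binary_cross_entropy_with_logits(logit_p, torch.zeros_like(logit_p))
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()

    # Policy step: treat the discriminator's logit as a reward and increase the
    # log-probability of actions that look more "expert-like".
    with torch.no_grad():
        reward = disc(torch.cat([s, a], dim=1)).squeeze(-1)
    log_prob = dist.log_prob(a).sum(-1)
    pi_loss = -(log_prob * (reward - reward.mean())).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
```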
Hierarchical Imitation Learning. Given the nature of
subtask decomposition in long-horizon tasks, hierarchical im-
itation learning can achieve better performance than imitation
learning by first forming micro-policies that accomplish the control required for each specific subtask and then learning a macro-policy that schedules among the micro-policies. The
micro-policies (a.k.a. skills) in RL can be modeled with the option framework proposed in [6], which extends the usual notion of actions to include options, i.e., closed-loop policies for taking actions over a period of time. We provide
further details about the option framework in Section III-
B. By integrating IL with options, hierarchical versions of the IL methods mentioned above have been
developed, including hierarchical behavioral cloning (HBC)
and hierarchical inverse reinforcement learning (HIRL). In
HBC, a policy is trained for each subtask through supervised learning on the corresponding state-action pairs, so the subtask annotations must be provided or inferred. In particular, the methods proposed in [22], [23] require data segmented with the subtask information, whereas [24], [25] infer the subtask information as hidden variables of a Hidden Markov Model [26] and solve HBC as an MLE problem with the Expectation–Maximization (EM) algorithm [16]. Despite its theoretical completeness, HBC is also vulnerable to compounding errors when demonstrations are limited. On the other hand, the HIRL
methods proposed in [7], [8] have extended GAIL with the
option framework to recover the hierarchical policy (i.e.,
the high-level and low-level policies mentioned above) from
unsegmented expert data. Specifically, [7] introduces a regularizer into the original GAIL objective to maximize the directed information between the generated trajectories and the subtask/option annotations. However, in their approach the high-level and low-level policies are trained in two separate stages, which inevitably leads to convergence to a poor local optimum. The approach proposed in [8], which claims to outperform [7] and HBC, replaces the occupancy measure in GAIL, which describes the distribution of state-action pairs, with an option-occupancy measure that encourages the hierarchical policy to generate state-action-option tuples whose distribution is similar to that of the expert demonstrations. However, it does not adopt the directed-information objective to strengthen the causal relationship between the option choice and the corresponding state-action sequence. In this paper, we propose a new HIL algorithm
based on AIRL, which takes advantage of the directed
information objective and updates the high-level and low-
level policies in an end-to-end fashion. Moreover, we provide
theoretical justification of our algorithm, and demonstrate its
superiority on challenging robotic control tasks.
III. BACKGROUND
In this section, we introduce the background knowledge of
our work, including AIRL and the One-step Option Frame-
work. These are defined with the Markov Decision Process
(MDP), denoted by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mu, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$ is the state transition function, $\mu: \mathcal{S} \rightarrow [0,1]$ is the distribution of the initial state, $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, and $\gamma \in (0,1]$ is the discount factor.
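For a finite state and action space, the tuple above can be represented concretely as follows; this is a minimal Python sketch for illustration, and the field names and tabular layout are our own assumptions rather than part of the formalism.

```python
# Minimal sketch of the MDP tuple (S, A, P, mu, R, gamma) for finite spaces.
# Assumption: states and actions are indexed 0..|S|-1 and 0..|A|-1, so the
# components can be stored as arrays.
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    P: np.ndarray       # transition tensor, shape (|S|, |A|, |S|); P[s, a] sums to 1
    mu: np.ndarray      # initial-state distribution, shape (|S|,)
    R: np.ndarray       # reward table, shape (|S|, |A|)
    gamma: float        # discount factor in (0, 1]

    def rollout(self, policy: np.ndarray, horizon: int, rng=None):
        """Sample a trajectory of (s, a, r) under a tabular policy of shape (|S|, |A|)."""
        rng = rng or np.random.default_rng()
        s = rng.choice(len(self.mu), p=self.mu)
        traj = []
        for _ in range(horizon):
            a = rng.choice(self.P.shape[1], p=policy[s])
            traj.append((s, a, self.R[s, a]))
            s = rng.choice(self.P.shape[2], p=self.P[s, a])
        return traj
```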
A. Adversarial Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) [20] aims to infer
an expert’s reward function from demonstrations, based on
which the policy of the expert can be recovered. As a
representative, Maximum Entropy IRL [27] solves it as a
maximum likelihood estimation (MLE) problem, shown in
Equation (1). $\tau_E \triangleq (S_0, A_0, \cdots, S_{T-1}, A_{T-1}, S_T)$ denotes the expert trajectory, i.e., a sequence of state-action pairs of horizon $T$. $Z_\vartheta$ is the partition function, defined as $Z_\vartheta = \int \widehat{P}_\vartheta(\tau_E)\,\mathrm{d}\tau_E$ (continuous $\mathcal{S}$ and $\mathcal{A}$) or $Z_\vartheta = \sum_{\tau_E} \widehat{P}_\vartheta(\tau_E)$ (discrete $\mathcal{S}$ and $\mathcal{A}$).

$$\max_{\vartheta}\ \mathbb{E}_{\tau_E}\big[\log P_\vartheta(\tau_E)\big], \quad P_\vartheta(\tau_E) = \widehat{P}_\vartheta(\tau_E)/Z_\vartheta, \quad \widehat{P}_\vartheta(\tau_E) = \mu(S_0)\prod_{t=0}^{T-1} P(S_{t+1}\,|\,S_t, A_t)\,\exp\big(R_\vartheta(S_t, A_t)\big) \tag{1}$$
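For intuition, taking the logarithm of the trajectory density in Equation (1) separates the reward terms from the dynamics and initial-state terms, which do not depend on $\vartheta$, and gives the well-known form of the MaxEnt IRL likelihood gradient:

$$\log P_\vartheta(\tau_E) = \log \mu(S_0) + \sum_{t=0}^{T-1} \log P(S_{t+1}\,|\,S_t, A_t) + \sum_{t=0}^{T-1} R_\vartheta(S_t, A_t) - \log Z_\vartheta,$$

$$\nabla_\vartheta\, \mathbb{E}_{\tau_E}\big[\log P_\vartheta(\tau_E)\big] = \mathbb{E}_{\tau_E}\Big[\sum_{t=0}^{T-1} \nabla_\vartheta R_\vartheta(S_t, A_t)\Big] - \mathbb{E}_{\tau \sim P_\vartheta}\Big[\sum_{t=0}^{T-1} \nabla_\vartheta R_\vartheta(S_t, A_t)\Big],$$

i.e., the MLE increases the reward on expert trajectories while decreasing it on trajectories that are likely under the current reward model.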