Option-Aware Adversarial Inverse Reinforcement Learning
for Robotic Control
Jiayu Chen1, Tian Lan2, and Vaneet Aggarwal1
Abstract: Hierarchical Imitation Learning (HIL) has been
proposed to recover highly-complex behaviors in long-horizon
tasks from expert demonstrations by modeling the task hi-
erarchy with the option framework. Existing methods either
overlook the causal relationship between the subtask and its
corresponding policy or cannot learn the policy in an end-to-
end fashion, which leads to suboptimality. In this work, we
develop a novel HIL algorithm based on Adversarial Inverse
Reinforcement Learning and adapt it with the Expectation-
Maximization algorithm in order to directly recover a hierar-
chical policy from the unannotated demonstrations. Further,
we introduce a directed information term to the objective
function to enhance the causality and propose a Variational
Autoencoder framework for learning with our objectives in an
end-to-end fashion. Theoretical justifications and evaluations
on challenging robotic control tasks are provided to show
the superiority of our algorithm. The code is available at
https://github.com/LucasCJYSDL/HierAIRL.
I. INTRODUCTION
Reinforcement Learning (RL) has achieved impressive
performance in a variety of scenarios, such as games [1],
[2], [3] and robotic control [4], [5]. However, most of its
applications rely on carefully-crafted, task-specific reward
signals to drive exploration and learning, limiting its use
in real-life scenarios. In this case, Imitation Learning (IL)
methods have been developed to acquire a policy for a certain
task based on the corresponding expert demonstrations (e.g.,
trajectories of state-action pairs) rather than reinforcement
signals. However, complex long-horizon tasks can often be
broken down and processed as a series of subtasks which can
serve as basic skills for completing various compound tasks.
In this case, learning a single monolithic policy with IL to
represent a structured activity can be challenging. Therefore,
Hierarchical Imitation Learning (HIL) has been proposed to
recover a two-level policy for a long-horizon task from the
demonstrations. Specifically, HIL trains low-level policies
(i.e., skills) for accomplishing the specific control for each
sub-task, and a high-level policy for scheduling the switching
of the skills. Such a hierarchical policy, which is usually
formulated with the option framework [6], makes full use of
the sub-structure between the parts within the activity and
has the potential for better performance.
1J. Chen and V. Aggarwal are with the School of Industrial
Engineering, Purdue University, West Lafayette, IN 47907, USA
(chen3686@purdue.edu, vaneet@purdue.edu). V. Aggarwal
is also with the CS Department, KAUST, Thuwal, Saudi Arabia.
2T. Lan is with the Department of Electrical and Computer Engi-
neering, George Washington University, Washington D.C., 20052, USA
(tlan@gwu.edu).

The state-of-the-art (SOTA) works on HIL [7], [8]
are developed based on Generative Adversarial Imitation
Learning (GAIL) [9], which is a widely-adopted IL algorithm.
In [7], they additionally introduce a directed information
[10] term to the GAIL objective function. In this way, their
method can enhance the causal relationship between the skill
choice and the corresponding state-action sequence to form
low-level policies for each subtask, while encouraging the
hierarchical policy to generate trajectories similar to the
expert ones in distribution through the GAIL objectives.
However, they update the high-level and low-level policies
in two separate stages. Specifically, the high-level policy
is learned with behavioral cloning, which is a supervised
IL algorithm vulnerable to compounding errors [11],
and remains fixed during the low-level policy learning with
GAIL. Given that the two-level policies are coupled with
each other, such a two-stage paradigm potentially leads to
sub-optimal solutions. On the other hand, in [8], they propose
to learn a hierarchical policy by option-occupancy measure
matching, that is, imitating the joint distribution of the
options, states, and actions of the expert demonstrations rather
than only matching the state-action distribution as GAIL does.
However, they overlook the causal relationship between the
subtask structure and the policy hierarchy, so the recovered
policy may degenerate into a poorly-performing monolithic
policy for the whole task, especially when the option anno-
tations of the expert demonstrations are not provided.
In this work, we propose a novel HIL algorithm – Hi-
erarchical Adversarial Inverse Reinforcement Learning (H-
AIRL), which integrates the SOTA IL algorithm Adversarial
Inverse Reinforcement Learning (AIRL) [12] with the option
framework through an objective based on the directed infor-
mation. Compared with GAIL, AIRL is able to recover the
expert reward function along with the expert policy and has
more robust and stable performance for challenging robotic
tasks [12], [13], [14]. From the algorithm perspective, our
contributions are as follows: (1) We propose a practical
lower bound of the directed information between the option
choice and the corresponding state-action sequences, which
can be modeled as a variational posterior and updated in a
Variational Autoencoder (VAE) [15] framework. This design
enables our algorithm to update the high-level and low-level
policy at the same time in an end-to-end fashion, which is an
improvement compared with [7]. (2) We redefine the AIRL
objectives on the extended state and action space, in order
to directly recover a hierarchical policy from the demon-
strations. (3) We provide an Expectation-Maximization (EM)
[16] adaptation of our algorithm so that it can be applied to
the expert demonstrations without the sub-task annotations
(i.e., unsegmented expert data) which are easier to obtain in
practice. Further, we provide solid theoretical justifications
for the three contributions above, along with comparisons of
our algorithm with SOTA HIL and IL baselines on multiple
MuJoCo [17] continuous control tasks, where our algorithm
significantly outperforms the others.
II. RELATED WORK
Imitation Learning. Imitation learning methods [18] seek
to learn to perform a task from expert demonstrations, where
the learner is given only samples of trajectories from the
expert and is not provided any reinforcement signals, such
as the environmental rewards which are usually hard to
acquire in real-life scenarios. There are two main branches
in this problem setting: behavioral cloning (BC) [19],
which learns a policy as a supervised learning problem
over state-action pairs from expert trajectories, and inverse
reinforcement learning (IRL) [20], which first infers a reward
function under which the expert is uniquely optimal and then
recovers the expert policy based on it. Behavioral cloning
only tends to succeed with large amounts of data, due to the
compounding error caused by covariate shift [21]. Inverse
reinforcement learning, while avoiding the compounding
error, is extremely expensive to solve and scale, since it
requires reinforcement learning to get the corresponding
optimal policy in each iteration of updating the reward
function. GAIL [9] and AIRL [12] have been proposed
to scale IRL for complex high-dimensional control tasks.
They realize IRL through an adversarial learning framework,
where they alternately update a policy and a discriminator
network. The discriminator serves as the reward function
and learns to differentiate between the expert demonstrations
and state-action pairs from the learned policy. Meanwhile, the
policy is trained to generate trajectories that are difficult
for the discriminator to distinguish from expert data. Mathe-
matical details are provided in Section III-A. AIRL explicitly
recovers the reward function and provides more robust and
stable performance across challenging tasks [12], [13], [14],
so we choose it as the base algorithm for our extension.
Hierarchical Imitation Learning. Given the nature of
subtask decomposition in long-horizon tasks, hierarchical im-
itation learning can achieve better performance than imitation
learning by forming micro-policies for accomplishing the
specific control for each subtask first and then learning a
macro-policy for scheduling among the micro-policies. The
micro-policies (a.k.a., skills) in RL can be modeled with
the option framework proposed in [6], which extends the
usual notion of actions to include options — the closed-loop
policies for taking actions over a period of time. We provide
further details about the option framework in Section III-
B. Through integrating IL with the options, the hierarchical
versions of the IL methods mentioned above have been
developed, including hierarchical behavioral cloning (HBC)
and hierarchical inverse reinforcement learning (HIRL). In
HBC, a policy for each subtask is trained through super-
vised learning with the corresponding state-action pairs, so
the subtask annotations need to be provided or
inferred. In particular, the methods proposed in [22], [23]
require segmented data with the subtask information, while
in [24], [25], the subtask information is inferred as hidden
variables in a Hidden Markov Model [26], and HBC is solved
as an MLE problem with the Expectation–Maximization
(EM) algorithm [16]. Despite its theoretical completeness,
HBC is also vulnerable to compounding errors when the
demonstrations are limited. On the other hand, the HIRL
methods proposed in [7], [8] have extended GAIL with the
option framework to recover the hierarchical policy (i.e.,
the high-level and low-level policies mentioned above) from
unsegmented expert data. Specifically, in [7], they introduce a
regularizer into the original GAIL objective function to max-
imize the directed information between generated trajectories
and the subtask/option annotations. However, the high-level
and low-level policies are trained in two separate stages in
their approach, which potentially leads to convergence
to a poor local optimum. As for the approach proposed
in [8] which claims to outperform [7] and HBC, it replaces
the occupancy measure in GAIL, which captures the
distribution of state-action pairs, with an option-occupancy
measure, to encourage the hierarchical policy to generate
state-action-option tuples with similar distribution to the ex-
pert demonstrations. However, they do not adopt the directed
information objective to enhance the causal relationship
between the option choice and the corresponding state-action
sequence. In this paper, we propose a new HIL algorithm
based on AIRL, which takes advantage of the directed
information objective and updates the high-level and low-
level policies in an end-to-end fashion. Moreover, we provide
theoretical justification of our algorithm, and demonstrate its
superiority on challenging robotic control tasks.
III. BACKGROUND
In this section, we introduce the background knowledge of
our work, including AIRL and the One-step Option Frame-
work. These are defined with the Markov Decision Process
(MDP), denoted by M = (S, A, P, µ, R, γ), where S is the
state space, A is the action space, P : S × A × S → [0, 1] is
the state transition function, µ : S → [0, 1] is the distribution
of the initial state, R : S × A → ℝ is the reward function,
and γ ∈ (0, 1] is the discount factor.
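For concreteness, the MDP tuple can be viewed as a simple data container. The sketch below is illustrative only: the class and field names are our own, and it assumes finite state and action spaces rather than the continuous MuJoCo control setting used in the experiments.

```python
# Minimal sketch of the MDP tuple M = (S, A, P, mu, R, gamma) as a data
# container. Names and the finite-space assumption are illustrative only.
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class MDP:
    n_states: int                       # |S|, assuming a finite state space
    n_actions: int                      # |A|, assuming a finite action space
    P: np.ndarray                       # transition tensor, P[s, a, s'] in [0, 1]
    mu: np.ndarray                      # initial-state distribution over S
    R: Callable[[int, int], float]      # reward function R(s, a)
    gamma: float = 0.99                 # discount factor in (0, 1]

    def step(self, s: int, a: int, rng: np.random.Generator) -> int:
        """Sample the next state S' ~ P(. | s, a)."""
        return int(rng.choice(self.n_states, p=self.P[s, a]))
```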
A. Adversarial Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) [20] aims to infer
an expert’s reward function from demonstrations, based on
which the policy of the expert can be recovered. As a
representative, Maximum Entropy IRL [27] solves it as a
maximum likelihood estimation (MLE) problem shown as
Equation (1). τ_E = (S_0, A_0, ..., S_{T-1}, A_{T-1}, S_T) denotes
the expert trajectory, i.e., a sequence of state-action pairs
of horizon T. Z_ϑ is the partition function, defined as
Z_ϑ = ∫ P̂_ϑ(τ_E) dτ_E (continuous S and A) or
Z_ϑ = Σ_{τ_E} P̂_ϑ(τ_E) (discrete S and A).

$$\max_{\vartheta} \; \mathbb{E}_{\tau_E}\left[\log P_{\vartheta}(\tau_E)\right], \quad P_{\vartheta}(\tau_E) = \widehat{P}_{\vartheta}(\tau_E) / Z_{\vartheta},$$
$$\widehat{P}_{\vartheta}(\tau_E) = \mu(S_0) \prod_{t=0}^{T-1} P(S_{t+1} \mid S_t, A_t) \exp\left(R_{\vartheta}(S_t, A_t)\right) \qquad (1)$$
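To illustrate the quantity being maximized in Equation (1), the following sketch computes the unnormalized trajectory log-likelihood log P̂_ϑ(τ) for a finite MDP; `mu`, `transition`, and `reward_fn` are hypothetical stand-ins for µ, P, and R_ϑ.

```python
# A minimal sketch of the unnormalized trajectory log-likelihood
# log P_hat_theta(tau) from Equation (1), for a finite MDP.
import numpy as np

def log_unnormalized_likelihood(states, actions, mu, transition, reward_fn):
    """states: [S_0, ..., S_T]; actions: [A_0, ..., A_{T-1}]."""
    logp = np.log(mu[states[0]])                                  # log mu(S_0)
    for t, a in enumerate(actions):
        logp += np.log(transition[states[t], a, states[t + 1]])   # log P(S_{t+1} | S_t, A_t)
        logp += reward_fn(states[t], a)                           # R_theta(S_t, A_t)
    return logp  # the partition function Z_theta is exactly what AIRL avoids computing
```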
Since Z_ϑ is intractable for large-scale state-action
spaces, the authors of [12] propose Adversarial Inverse Rein-
forcement Learning (AIRL) to solve this MLE problem in a
sample-based manner. They realize this by alternately
training a discriminator D_ϑ and a policy network π in an
adversarial setting. Specifically, the discriminator is trained
by minimizing the cross-entropy loss between the expert
demonstrations τ_E and the samples τ generated by π:

$$\min_{\vartheta} \; \sum_{t=0}^{T-1} -\mathbb{E}_{\tau_E}\left[\log D_{\vartheta}(S_t, A_t)\right] - \mathbb{E}_{\tau}\left[\log\left(1 - D_{\vartheta}(S_t, A_t)\right)\right] \qquad (2)$$

where D_ϑ(S, A) = exp(f_ϑ(S, A)) / [exp(f_ϑ(S, A)) + π(A|S)].
Meanwhile, the policy π is trained with off-the-
shelf RL algorithms using the reward function defined as
log D_ϑ(S, A) − log(1 − D_ϑ(S, A)). Further, they justify
that, at optimality, f_ϑ(S, A) can serve as the recovered
reward function R_ϑ(S, A) and π is the recovered expert
policy, which maximizes the entropy-regularized objective
$\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} R_{\vartheta}(S_t, A_t) - \log \pi(A_t|S_t)\right]$.
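To make the discriminator form concrete, here is a minimal PyTorch sketch of the AIRL logit f_ϑ(S, A) − log π(A|S) (whose sigmoid is D_ϑ), the cross-entropy loss of Equation (2), and the resulting policy reward log D_ϑ − log(1 − D_ϑ). The tensor arguments are assumed to be precomputed batches of f_ϑ values and policy log-probabilities; this is an illustration, not the paper's actual network code.

```python
# A minimal PyTorch sketch of the AIRL discriminator loss and reward.
import torch
import torch.nn.functional as F

def airl_logit(f_value: torch.Tensor, log_pi: torch.Tensor) -> torch.Tensor:
    # D_theta = exp(f) / (exp(f) + pi) = sigmoid(f - log pi), so the logit of D is:
    return f_value - log_pi

def discriminator_loss(f_expert, log_pi_expert, f_policy, log_pi_policy):
    # Cross-entropy of Eq. (2): expert pairs are labeled 1, policy pairs 0.
    loss_expert = F.binary_cross_entropy_with_logits(
        airl_logit(f_expert, log_pi_expert), torch.ones_like(f_expert))
    loss_policy = F.binary_cross_entropy_with_logits(
        airl_logit(f_policy, log_pi_policy), torch.zeros_like(f_policy))
    return loss_expert + loss_policy

def airl_reward(f_value, log_pi):
    # log D - log(1 - D) simplifies to f_theta(S, A) - log pi(A | S),
    # which is fed to an off-the-shelf RL algorithm to update pi.
    return f_value - log_pi
```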
B. One-step Option Framework
As proposed in [6], an option Z ∈ Z can be described with
three components: an initiation set I_Z ⊆ S, an intra-option
policy π_Z(A|S) : S × A → [0, 1], and a termination function
β_Z(S) : S → [0, 1]. An option Z is available in state S if
and only if S ∈ I_Z. Once the option is taken, actions are
selected according to π_Z until it terminates stochastically
according to β_Z, i.e., the termination probability at the
current state. A new option will be activated in this call-and-
return style by a high-level policy π_Z(Z|S) : S × Z → [0, 1]
once the last option terminates. In this way, π_Z(Z|S) and
π_Z(A|S) constitute a hierarchical policy for a certain task.
However, it's inconvenient to deal with the initiation set I_Z
and termination function β_Z while learning this hierarchical
policy. Thus, in [28], [8], they adopt the one-step option
framework. It's assumed that each option is available in each
state, i.e., I_Z = S, ∀Z ∈ Z. Also, the high-level and low-
level (i.e., intra-option) policies are redefined as π_θ and π_ϕ,
respectively:

$$\pi_{\theta}(Z \mid S, Z') = \beta_{Z'}(S)\,\pi_{\mathcal{Z}}(Z \mid S) + \left(1 - \beta_{Z'}(S)\right)\delta_{Z=Z'}, \qquad \pi_{\phi}(A \mid S, Z) = \pi_Z(A \mid S) \qquad (3)$$
where Z′ denotes the option of the last time step and δ_{Z=Z′} is
the indicator function. We can see that if the previous option
terminates (with probability β_{Z′}(S)), the agent will select a
new option according to π_Z(Z|S); otherwise, it will stick to
Z′. With the new definition and assumption, we can optimize
the hierarchical policy π_θ and π_ϕ without the extra need to
determine the exact beginning and termination condition of each
option. Nevertheless, π_θ(Z|S, Z′) still includes two separate
parts, i.e., β_{Z′}(S) and π_Z(Z|S), and due to the indicator
function, the update gradients of π_Z will be blocked/gated
by the termination function β_{Z′}(S). In this case, the authors
of [29] propose to marginalize the termination function
away and instead implement π_θ(Z|S, Z′) as an end-to-end
neural network (NN) with the Multi-Head Attention (MHA)
mechanism [30], which enables their algorithm to temporally
extend options in the absence of the termination function.

Fig. 1. Illustration of the probabilistic graphical model and its implementation with the one-step option model.

We provide more details on MHA and the structure design
of π_θ and π_ϕ in Appendix I¹. With the marginalized one-
step option framework, we only need to train the two NN-
based policies, i.e., π_θ and π_ϕ. In particular, we adopt the
Hierarchical Reinforcement Learning algorithm, i.e., SA,
proposed in [29] to learn π_θ and π_ϕ.
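As a concrete picture of how the two marginalized policies interact at execution time, the following is a minimal rollout sketch. `env`, `high_policy`, and `low_policy` are hypothetical placeholders (the high-level policy returns a distribution over options given the state and the previous option, the low-level policy a distribution over actions given the state and the current option), not the interfaces of the released code.

```python
# A minimal rollout sketch with the one-step option framework: at every step
# the agent first samples an option Z ~ pi_theta(. | S, Z_prev), then an
# action A ~ pi_phi(. | S, Z).
import numpy as np

def rollout(env, high_policy, low_policy, horizon, rng):
    """high_policy(s, z_prev) -> probs over options; low_policy(s, z) -> probs over actions."""
    s = env.reset()
    z_prev = None                                   # dummy "previous option" at t = 0
    trajectory = []
    for _ in range(horizon):
        option_probs = high_policy(s, z_prev)       # pi_theta(. | S_t, Z_{t-1})
        z = int(rng.choice(len(option_probs), p=option_probs))
        action_probs = low_policy(s, z)             # pi_phi(. | S_t, Z_t)
        a = int(rng.choice(len(action_probs), p=action_probs))
        s_next, done = env.step(a)                  # placeholder env API
        trajectory.append((s, z, a))
        s, z_prev = s_next, z
        if done:
            break
    return trajectory

# e.g., rollout(env, high_policy, low_policy, horizon=1000, rng=np.random.default_rng(0))
```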
IV. PROPOSED APPROACH
A. Optimization with the Directed Information Objective
Our work focuses on learning a hierarchical policy from
expert demonstrations through integrating the one-step option
framework with AIRL. In this section, we define the directed
information objective function for training the hierarchical
policy, fit it with the one-step option model, and propose how
to optimize it in an end-to-end fashion with an RNN-based
VAE structure, which is part of our novelty and contribution.
As mentioned in Section III-B, when observing a new state,
the hierarchical policy agent will first decide on its option
choice Z using the high-level policy π_θ and then select the
primitive action based on the low-level policy π_ϕ corresponding
to Z. In this case, the policy learned should
be conditioned on the option choice Z, and the option choice
is specific to each timestep t ∈ {0, ..., T}, so we view
the option choices Z_{0:T} as the local latent contexts in a
probabilistic graphical model shown as Figure 1. It can be
observed from Figure 1 that the local latent context Z_{0:T}
has a directed causal relationship with the trajectory X_{0:T} =
(X_0, ..., X_T) = ((A_{−1}, S_0), ..., (A_{T−1}, S_T)), where A_{−1}
is a dummy variable. Inspired by information theory [10],
[7], this kind of connection can be established by maximizing
the directed information (a.k.a. causal information) flow
from the trajectory to the latent factors of variation within
the trajectory, i.e., I(X_{0:T} → Z_{0:T}), which is defined as:
$$I(X_{0:T} \rightarrow Z_{0:T}) = \sum_{t=1}^{T} H(Z_t \mid Z_{0:t-1}) - H(Z_t \mid X_{0:t}, Z_{0:t-1})$$
$$= \sum_{t=1}^{T} \Big[ H(Z_t \mid Z_{0:t-1}) + \sum_{X_{0:t},\, Z_{0:t}} P(X_{0:t}, Z_{0:t}) \log P(Z_t \mid X_{0:t}, Z_{0:t-1}) \Big] \qquad (4)$$
¹All the appendices are available in the extended version of our paper at
https://github.com/LucasCJYSDL/HierAIRL/blob/main/ICRA.pdf
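Since the true posterior P(Z_t | X_{0:t}, Z_{0:t-1}) in Equation (4) is intractable, it is replaced with a learned variational posterior (contribution (1) in Section I). The sketch below shows one way such a posterior could be realized and how its log-probability on generated option-trajectory pairs could serve as a per-step bonus that strengthens the causal link; the GRU architecture, tensor shapes, and the weighting in the usage comment are all illustrative assumptions, not the paper's exact implementation.

```python
# A rough sketch of a variational posterior q_psi(Z_t | X_{0:t}, Z_{0:t-1})
# standing in for the intractable posterior in Eq. (4).
import torch
import torch.nn as nn

class OptionPosterior(nn.Module):
    def __init__(self, x_dim: int, n_options: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(x_dim + n_options, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_options)

    def forward(self, x_seq, z_prev_onehot):
        # x_seq: (B, T, x_dim) with X_t = (A_{t-1}, S_t); z_prev_onehot: (B, T, n_options)
        h, _ = self.rnn(torch.cat([x_seq, z_prev_onehot], dim=-1))
        return torch.log_softmax(self.head(h), dim=-1)   # log q_psi(. | X_{0:t}, Z_{0:t-1})

def directed_info_bonus(posterior, x_seq, z_prev_onehot, z_seq):
    # log q_psi(Z_t | X_{0:t}, Z_{0:t-1}) evaluated at the options actually taken; shape (B, T)
    log_q = posterior(x_seq, z_prev_onehot)
    return log_q.gather(-1, z_seq.unsqueeze(-1)).squeeze(-1)

# e.g., per-step reward = airl_reward(...) + 0.01 * directed_info_bonus(...)
```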