practice. Further, we provide solid theoretical justification for the three contributions mentioned above, and compare our algorithm with SOTA HIL and IL baselines on multiple Mujoco [17] continuous control tasks, where it significantly outperforms the others.
II. RELATED WORK
Imitation Learning. Imitation learning methods [18] seek
to learn to perform a task from expert demonstrations, where
the learner is given only samples of trajectories from the
expert and is not provided with any reinforcement signal, such as environmental rewards, which are usually hard to acquire in real-life scenarios. There are two main branches in this setting: behavioral cloning (BC) [19],
which learns a policy as a supervised learning problem
over state-action pairs from expert trajectories, and inverse
reinforcement learning (IRL) [20], which first infers a reward
function under which the expert is uniquely optimal and then
recovers the expert policy based on it. Behavioral cloning
only tends to succeed with large amounts of data, due to the
compounding error caused by covariate shift [21]. Inverse
reinforcement learning, while avoiding the compounding
error, is extremely expensive to solve and scale, since it requires running reinforcement learning to obtain the corresponding optimal policy in each iteration of the reward-function update. GAIL [9] and AIRL [12] have been proposed
to scale IRL for complex high-dimensional control tasks.
They realize IRL through an adversarial learning framework,
where they alternately update a policy and a discriminator
network. The discriminator serves as the reward function
and learns to differentiate between the expert demonstrations
and state-action pairs from the learned policy. Meanwhile, the policy is trained to generate trajectories that are difficult for the discriminator to distinguish from the expert data. Mathematical details are provided in Section III-A. AIRL explicitly recovers the reward function and achieves more robust and stable performance on challenging tasks [12], [13], [14], so we choose it as the base algorithm for our extension.
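To make the alternating update concrete, below is a minimal, self-contained sketch of one way to implement such a loop in PyTorch; the toy data, network sizes, and the single-step REINFORCE-style policy update are illustrative assumptions and not the exact procedure of GAIL [9] or AIRL [12].

```python
# Minimal sketch of a GAIL/AIRL-style alternating update (illustrative only).
# Assumptions: synthetic 3-D states, 1-D actions, a logistic discriminator over
# (s, a), and a Gaussian policy updated with a one-step REINFORCE-style gradient
# using the discriminator's output as a surrogate reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 3, 1
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy_mean = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
log_std = nn.Parameter(torch.zeros(action_dim))

disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
pi_opt = torch.optim.Adam(list(policy_mean.parameters()) + [log_std], lr=1e-3)

expert_s = torch.randn(256, state_dim)       # placeholder expert states
expert_a = torch.tanh(expert_s[:, :1])       # placeholder expert actions

for it in range(200):
    # Sample (s, a) pairs from the current policy on random states.
    s = torch.randn(256, state_dim)
    dist = torch.distributions.Normal(policy_mean(s), log_std.exp())
    a = dist.sample()                        # detached sample

    # Discriminator step: classify expert pairs (label 1) vs. policy pairs (label 0).
    logit_e = disc(torch.cat([expert_s, expert_a], dim=1))
    logit_p = disc(torch.cat([s, a], dim=1))
    d_loss = F.binary_cross_entropy_with_logits(logit_e, torch.ones_like(logit_e)) + \
             F.binary_cross_entropy_with_logits(logit_p, torch.zeros_like(logit_p))
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()

    # Policy step: treat the discriminator's logit as a reward and increase the
    # log-probability of actions that look more "expert-like".
    with torch.no_grad():
        reward = disc(torch.cat([s, a], dim=1)).squeeze(-1)
    log_prob = dist.log_prob(a).sum(-1)
    pi_loss = -(log_prob * (reward - reward.mean())).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
```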
Hierarchical Imitation Learning. Given the nature of
subtask decomposition in long-horizon tasks, hierarchical im-
itation learning can achieve better performance than imitation
learning by first forming micro-policies that accomplish the control required for each specific subtask and then learning a macro-policy that schedules among the micro-policies. The
micro-policies (a.k.a. skills) in RL can be modeled with the option framework proposed in [6], which extends the usual notion of actions to include options, i.e., closed-loop policies for taking actions over a period of time. We provide
further details about the option framework in Section III-
B. By integrating IL with options, hierarchical versions of the IL methods mentioned above have been
developed, including hierarchical behavioral cloning (HBC)
and hierarchical inverse reinforcement learning (HIRL). In
HBC, a policy is trained for each subtask through supervised learning on the corresponding state-action pairs, so the subtask annotations must be provided or inferred. In particular, the methods proposed in [22], [23] require data segmented with the subtask information, whereas [24], [25] infer the subtask information as hidden variables of a Hidden Markov Model [26] and solve HBC as an MLE problem with the Expectation–Maximization (EM) algorithm [16]. Despite its theoretical completeness, HBC is also vulnerable to compounding errors when demonstrations are limited. On the other hand, the HIRL
methods proposed in [7], [8] have extended GAIL with the
option framework to recover the hierarchical policy (i.e.,
the high-level and low-level policies mentioned above) from
unsegmented expert data. Specifically, [7] introduces a regularizer into the original GAIL objective to maximize the directed information between the generated trajectories and the subtask/option annotations. However, in their approach the high-level and low-level policies are trained in two separate stages, which inevitably leads to convergence to a poor local optimum. The approach proposed in [8], which claims to outperform [7] and HBC, replaces the occupancy measure in GAIL, which describes the distribution of state-action pairs, with an option-occupancy measure that encourages the hierarchical policy to generate state-action-option tuples whose distribution is similar to that of the expert demonstrations. However, it does not adopt the directed-information objective to strengthen the causal relationship between the option choice and the corresponding state-action sequence. In this paper, we propose a new HIL algorithm
based on AIRL, which takes advantage of the directed
information objective and updates the high-level and low-
level policies in an end-to-end fashion. Moreover, we provide
theoretical justification of our algorithm, and demonstrate its
superiority on challenging robotic control tasks.
III. BACKGROUND
In this section, we introduce the background knowledge of
our work, including AIRL and the One-step Option Frame-
work. These are defined with the Markov Decision Process
(MDP), denoted by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mu, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$ is the state transition function, $\mu: \mathcal{S} \rightarrow [0,1]$ is the distribution of the initial state, $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, and $\gamma \in (0,1]$ is the discount factor.
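For a finite state and action space, the tuple above can be represented concretely as follows; this is a minimal Python sketch for illustration, and the field names and tabular layout are our own assumptions rather than part of the formalism.

```python
# Minimal sketch of the MDP tuple (S, A, P, mu, R, gamma) for finite spaces.
# Assumption: states and actions are indexed 0..|S|-1 and 0..|A|-1, so the
# components can be stored as arrays.
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    P: np.ndarray       # transition tensor, shape (|S|, |A|, |S|); P[s, a] sums to 1
    mu: np.ndarray      # initial-state distribution, shape (|S|,)
    R: np.ndarray       # reward table, shape (|S|, |A|)
    gamma: float        # discount factor in (0, 1]

    def rollout(self, policy: np.ndarray, horizon: int, rng=None):
        """Sample a trajectory of (s, a, r) under a tabular policy of shape (|S|, |A|)."""
        rng = rng or np.random.default_rng()
        s = rng.choice(len(self.mu), p=self.mu)
        traj = []
        for _ in range(horizon):
            a = rng.choice(self.P.shape[1], p=policy[s])
            traj.append((s, a, self.R[s, a]))
            s = rng.choice(self.P.shape[2], p=self.P[s, a])
        return traj
```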
A. Adversarial Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) [20] aims to infer
an expert’s reward function from demonstrations, based on
which the policy of the expert can be recovered. As a
representative, Maximum Entropy IRL [27] solves it as a
maximum likelihood estimation (MLE) problem, shown in
Equation (1). $\tau_E \triangleq (S_0, A_0, \cdots, S_{T-1}, A_{T-1}, S_T)$ denotes the expert trajectory, i.e., a sequence of state-action pairs of horizon $T$. $Z_\vartheta$ is the partition function, defined as $Z_\vartheta = \int \widehat{P}_\vartheta(\tau_E)\,\mathrm{d}\tau_E$ (continuous $\mathcal{S}$ and $\mathcal{A}$) or $Z_\vartheta = \sum_{\tau_E} \widehat{P}_\vartheta(\tau_E)$ (discrete $\mathcal{S}$ and $\mathcal{A}$).

$$\max_{\vartheta}\ \mathbb{E}_{\tau_E}\big[\log P_\vartheta(\tau_E)\big], \quad P_\vartheta(\tau_E) = \widehat{P}_\vartheta(\tau_E)/Z_\vartheta, \quad \widehat{P}_\vartheta(\tau_E) = \mu(S_0)\prod_{t=0}^{T-1} P(S_{t+1}\,|\,S_t, A_t)\,\exp\big(R_\vartheta(S_t, A_t)\big) \tag{1}$$
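For intuition, taking the logarithm of the trajectory density in Equation (1) separates the reward terms from the dynamics and initial-state terms, which do not depend on $\vartheta$, and gives the well-known form of the MaxEnt IRL likelihood gradient:

$$\log P_\vartheta(\tau_E) = \log \mu(S_0) + \sum_{t=0}^{T-1} \log P(S_{t+1}\,|\,S_t, A_t) + \sum_{t=0}^{T-1} R_\vartheta(S_t, A_t) - \log Z_\vartheta,$$

$$\nabla_\vartheta\, \mathbb{E}_{\tau_E}\big[\log P_\vartheta(\tau_E)\big] = \mathbb{E}_{\tau_E}\Big[\sum_{t=0}^{T-1} \nabla_\vartheta R_\vartheta(S_t, A_t)\Big] - \mathbb{E}_{\tau \sim P_\vartheta}\Big[\sum_{t=0}^{T-1} \nabla_\vartheta R_\vartheta(S_t, A_t)\Big],$$

i.e., the MLE increases the reward on expert trajectories while decreasing it on trajectories that are likely under the current reward model.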