
observed actions and states. [2] proposed an iterative nested fixed point algorithm in which the inner
dynamic programming problem is solved repeatedly, followed by maximum likelihood updates of the
structural parameters. Over the years, a significant literature on alternative estimation methods
requiring less computational effort has been developed. For example, [3] and [4] proposed two-step
algorithms which avoid the repeated solution of the inner stochastic dynamic programming problem.
In the first step, a non-parametric estimator of the policy (also referred to as conditional choice
probabilities) is obtained, and the inverse of a map from differences in Bellman’s value function for
different states to randomized policies is computed. In the second step, a pseudo log-likelihood is
maximized.
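A standard example of this kind of inversion (in our notation, and not necessarily the exact map used in [3, 4]): if the unobserved utility shocks are i.i.d. Type-I extreme value, the choice probabilities take the logit form and the map admits a closed-form inverse,
\[
P(a \mid s) \;=\; \frac{\exp\big(v(s,a)\big)}{\sum_{a'} \exp\big(v(s,a')\big)}
\qquad\Longleftrightarrow\qquad
v(s,a) - v(s,a_0) \;=\; \log P(a \mid s) - \log P(a_0 \mid s),
\]
so the first-step estimates of the conditional choice probabilities pin down differences in the choice-specific value functions without solving the dynamic program.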
Two-step estimators may suffer from substantial finite sample bias if the estimated
policies in the first step are of poor quality. Sequential estimators which are recursively obtained by
alternating between pseudo-likelihood maximization and improved policy estimation are considered
in [5]. In general, the computational burden for all these methods is significant when the state space
is either discrete with large cardinality or continuous in high dimensions. Discretization may
be avoided by using forward Monte Carlo simulations [6, 7], but this also becomes computationally
demanding in high dimensions. A constrained optimization approach for maximum likelihood
estimation of dynamic discrete choice models is considered in [8]. However, the number of constraints
needed to represent Bellman’s equation becomes a significant computational burden when the state space
is either discrete with large cardinality or continuous in high dimensions.
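As a rough sketch of that constrained formulation (our notation; the precise form of the Bellman constraint depends on the distributional assumptions on the utility shocks), the structural parameters $\theta$ and the value function $V$ are estimated jointly:
\[
\max_{\theta,\,V}\;\sum_{i=1}^{n}\log P\big(a_i \mid s_i;\,V,\theta\big)
\quad\text{subject to}\quad
V(s) \;=\; \mathbb{E}_{\varepsilon}\Big[\max_{a}\big\{r_{\theta}(s,a)+\varepsilon_{a}+\beta\,\mathbb{E}\big[V(s')\mid s,a\big]\big\}\Big]
\;\;\text{for all } s.
\]
The constraint set contains one Bellman equation per state, which is exactly what makes large discrete or continuous state spaces costly.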
Recent work has addressed the computational challenges posed by high-dimensional state spaces.
For example, in [9] the authors extend the CCP estimator approach proposed in [3] by considering a
functional approximation approach coupled with a temporal difference (TD) algorithm to maximize the
pseudo-likelihood. In [10] the authors consider an approach to adjust the CCP estimator to account
for finite sample bias in high-dimensional settings.
The IRL literature features the seminal work [11], in which a model for the expert’s behavior is
formulated as the policy that maximizes entropy subject to a constraint requiring that the expected
features under such a policy match the empirical averages in the expert’s observation dataset.²
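For concreteness, that formulation can be sketched in our notation (not verbatim from [11]) as follows: given a feature map $\phi$ and expert trajectories $\tau^1,\dots,\tau^N$, the expert is modeled by the solution of
\[
\max_{\pi}\; H(\pi)
\quad\text{subject to}\quad
\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t}\phi(s_t,a_t)\Big]
\;=\;\frac{1}{N}\sum_{i=1}^{N}\sum_{t}\phi\big(s^i_t,a^i_t\big),
\]
where $H(\pi)$ denotes the entropy of the trajectory distribution induced by $\pi$ under the known dynamics.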
The
algorithms developed for maximum entropy estimation [11, 12, 13] have a nested-loop structure,
alternating between an outer loop with a reward update step and an inner loop that computes
explicit policy estimates. The computational burden of this nested structure is manageable
in tabular environments, but it becomes significant in high-dimensional settings requiring value
function approximation.
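To make the source of this burden concrete, the following is a minimal tabular sketch of the nested-loop structure (our own notation and simplifications, not the implementation of [11, 12, 13]): every outer update of linear reward weights requires re-solving an inner soft value iteration to recover the policy and its feature expectations.

```python
import numpy as np

def soft_value_iteration(r, P, gamma, iters=200):
    """Inner loop: soft (log-sum-exp) value iteration for a reward table r of shape (S, A)."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * (P @ V)                              # soft Q-values, shape (S, A)
        m = Q.max(axis=1)
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))   # numerically stable log-sum-exp
    Q = r + gamma * (P @ V)
    return np.exp(Q - V[:, None])                            # softmax policy pi(a | s)

def feature_expectations(pi, P, phi, rho0, horizon=50):
    """Expected cumulative features of policy pi from the initial state distribution rho0."""
    d, total = rho0.copy(), np.zeros(phi.shape[-1])
    for _ in range(horizon):
        sa = d[:, None] * pi                                 # state-action occupancy at time t
        total += np.einsum('sa,saf->f', sa, phi)             # accumulate expected feature counts
        d = np.einsum('sa,sat->t', sa, P)                    # propagate the state occupancy
    return total

def maxent_irl(phi, P, rho0, expert_features, gamma=0.95, lr=0.05, outer_iters=100):
    """Outer loop: gradient ascent on reward weights theta via feature matching."""
    theta = np.zeros(phi.shape[-1])
    for _ in range(outer_iters):
        r = phi @ theta                                      # linear reward r(s, a) = phi(s, a) . theta
        pi = soft_value_iteration(r, P, gamma)               # full inner solve at every outer step
        theta += lr * (expert_features - feature_expectations(pi, P, phi, rho0))
    return theta
```

In this tabular sketch the inner solve is a short loop over an $(S, A)$ table; once it must be replaced by value function approximation over a large or continuous state space, every outer iteration carries the cost of an approximate policy optimization subroutine.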
Recent works have developed algorithms to alleviate the computational burden of nested-loop
estimation methods. For example, in [14], the authors propose to transform the standard formulation
of IRL into a single-level problem by estimating the Q-function rather than estimating the reward
function and the associated optimal policy separately. However, the reward function implicitly defined by
the identified Q-function is a poor estimate, since it is not guaranteed to satisfy Bellman’s equation.
Finally, [15] considers an approach called f-IRL for estimating rewards based on the minimization of
several measures of divergence with respect to the expert’s state visitation measure.
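In our notation, this amounts to a problem of the form
\[
\min_{\theta}\; D_f\big(\rho_{E}\,\big\|\,\rho_{\theta}\big),
\]
where $\rho_{E}$ and $\rho_{\theta}$ denote the state visitation distributions of the expert and of the policy induced by the current reward $r_{\theta}(s)$, and $D_f$ is one of the divergences considered.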
The approach is limited to estimating rewards that depend only on the state. While the reported results are
based on a single-loop implementation, the paper does not provide a convergence guarantee supporting this
performance.
In contrast to the lines of work surveyed above, we focus our efforts on developing estimation
algorithms with finite-time guarantees for computing high-quality estimators. Towards addressing
² In section 6, we show that if the reward is linearly parametrized, the maximum entropy formulation in [11] is the
dual of the maximum likelihood formulation of the estimation problem.