Structural Estimation of Markov Decision Processes in
High-Dimensional State Space with Finite-Time Guarantees
Siliang Zeng†, Mingyi Hong†, Alfredo Garcia‡
†Department of Electrical and Computer Engineering,
University of Minnesota, MN, USA
‡Department of Industrial and Systems Engineering,
Texas A&M University, TX, USA
Email: {zeng0176, mhong}@umn.edu, alfredo.garcia@tamu.edu
March 4, 2024
Abstract
We consider the task of estimating a structural model of dynamic decisions by a human agent
based upon the observable history of implemented actions and visited states. This problem has an
inherent nested structure: in the inner problem, an optimal policy for a given reward function is
identified while in the outer problem, a measure of fit is maximized. Several approaches have been
proposed to alleviate the computational burden of this nested-loop structure, but these methods
still suffer from high complexity when the state space is either discrete with large cardinality or
continuous in high dimensions. Other approaches in the inverse reinforcement learning (IRL)
literature emphasize policy estimation at the expense of reward estimation accuracy.
In this paper we propose a single-loop estimation algorithm with finite time guarantees that is
equipped to deal with high-dimensional state spaces without compromising reward estimation
accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic
gradient step for likelihood maximization. We show that the proposed algorithm converges to a stationary solution with a finite-time guarantee. Further, if the reward is parameterized linearly, the algorithm converges sublinearly to the maximum likelihood estimator. For robotics control problems in MuJoCo and their transfer settings, the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks.¹

¹This paper is the journal version of [1].
1 Introduction
We consider the task of estimating a structural model of dynamic decisions by a single human agent
based upon the observable history of implemented actions and visited states. This problem has been
studied as the estimation of dynamic discrete choice (DDC) models in econometrics and inverse
reinforcement learning (IRL) in artificial intelligence and machine learning research.
[2] is a seminal piece in the literature on dynamic discrete choice estimation. In that paper, the estimation task is formulated as a bi-level optimization problem where the inner problem is a stochastic dynamic programming problem and the outer problem is the likelihood maximization of observed actions and states. [2] proposed an iterative nested fixed point algorithm in which the inner
dynamic programming problem is solved repeatedly followed by maximum likelihood updates of the
structural parameters. Over the years, a significant literature on alternative estimation methods
requiring less computational effort has been developed. For example, [3] and [4] proposed two-step
algorithms which avoid the repeated solution of the inner stochastic dynamic programming problem.
In the first step, a non-parametric estimator of the policy (also referred to as conditional choice probabilities, or CCPs) is obtained, and the inverse of a map from differences in Bellman's value function for different states to randomized policies is computed. In the second step, a pseudo log-likelihood is
maximized. Two-step estimators may suffer from substantial finite sample bias if the estimated
policies in the first step are of poor quality. Sequential estimators which are recursively obtained by
alternating between pseudo-likelihood maximization and improved policy estimation are considered
in [5]. In general, the computational burden of all these methods is significant when the state space is either discrete with large cardinality or continuous in high dimensions. Discretization may be avoided by using forward Monte Carlo simulations [6, 7], but this also becomes computationally demanding in high dimensions. A constrained optimization approach for maximum likelihood estimation of dynamic discrete choice models is considered in [8]. However, the number of constraints needed to represent Bellman's equation becomes a significant computational burden for discrete state spaces with large cardinality or continuous state spaces in high dimensions.
Recent work has addressed the computational challenges posed by high-dimensional state spaces. For example, in [9] the authors extend the CCP estimator approach proposed in [3] by considering a functional approximation approach coupled with a temporal difference (TD) algorithm to maximize the pseudo-likelihood. In [10] the authors consider an approach to adjust the CCP estimator to account for finite sample bias in high-dimensional settings.
The literature in IRL features the seminal work [11], in which a model for the expert's behavior is formulated as the policy that maximizes entropy subject to a constraint requiring that the expected features under such policy match the empirical averages in the expert's observation dataset.² The algorithms developed for maximum entropy estimation [11, 12, 13] have a nested-loop structure, alternating between an outer loop with a reward update step and an inner loop that calculates the explicit policy estimates. The computational burden of this nested structure is manageable in tabular environments, but it becomes significant in high-dimensional settings requiring value function approximation.

²In Section 6, we show that if the reward is linearly parametrized, the maximum entropy formulation in [11] is the dual of the maximum likelihood formulation of the estimation problem.
Recent works have developed algorithms to alleviate the computational burden of nested-loop estimation methods. For example, in [14] the authors propose to transform the standard formulation of IRL into a single-level problem by estimating the Q-function rather than estimating the reward function and the associated optimal policy separately. However, the reward function implicit in the identified Q-function is a poor estimate since it is not guaranteed to satisfy Bellman's equation. Finally, [15] considers an approach called $f$-IRL for estimating rewards based on the minimization of several measures of divergence with respect to the expert's state visitation measure. The approach is limited to estimating rewards that depend only on the state. While the results reported are based upon a single-loop implementation, the paper does not provide a convergence guarantee to support the performance.
In contrast to the lines of work surveyed above, we focus our efforts on developing estimation algorithms with finite-time guarantees for computing high-quality estimators. Towards addressing
this challenge, in this paper we propose a class of new algorithms which only require a finite number of computational steps for a class of (non-linearly parameterized) structural estimation problems, assuming the environment dynamics are known (or samples from the environment dynamics are available "on-line"). Specifically, the proposed algorithm has a single-loop structure wherein a single-step update of the estimated policy is followed by an update of the reward parameter. We show that the algorithm has strong theoretical guarantees: to achieve an $\epsilon$-approximate stationary solution for a non-linearly parameterized problem, it requires $\mathcal{O}(\epsilon^{-2})$ steps of policy and reward updates each. To our knowledge, it is the first algorithm with a finite-time guarantee for the structural estimation of an MDP under nonlinear parameterization of the reward function.
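As a minimal illustration of the single-loop structure just described (not the exact implementation analyzed later in the paper), the following sketch alternates one soft policy improvement step with one stochastic gradient step on the reward parameter, assuming a tabular MDP with a known kernel and a linear reward $r(s,a;\theta) = \theta^\top \phi(s,a)$. The function names, sampling scheme, and step sizes are illustrative assumptions, and the gradient surrogate (the discounted feature difference between expert data and the current policy) equals the likelihood gradient only when the policy is soft-optimal for the current reward.

```python
# Hypothetical single-loop sketch: tabular MDP, known kernel P, linear reward.
import numpy as np

def discounted_feature_sum(trajs, phi, gamma):
    """Empirical estimate of E_tau[ sum_t gamma^t phi(s_t, a_t) ]."""
    total = np.zeros(phi.shape[-1])
    for traj in trajs:
        for t, (s, a) in enumerate(traj):
            total += (gamma ** t) * phi[s, a]
    return total / max(len(trajs), 1)

def rollout(P, pi, rho, horizon, rng):
    """Sample one (state, action) trajectory from policy pi in the known model (P, rho)."""
    s = rng.choice(len(rho), p=rho)
    traj = []
    for _ in range(horizon):
        a = rng.choice(pi.shape[1], p=pi[s])
        traj.append((s, a))
        s = rng.choice(P.shape[2], p=P[s, a])
    return traj

def single_loop_mlirl(P, phi, rho, expert_trajs, gamma=0.99, n_iters=500,
                      step_size=0.05, horizon=200, n_rollouts=4, seed=0):
    rng = np.random.default_rng(seed)
    S, A, d = phi.shape
    theta, Q = np.zeros(d), np.zeros((S, A))
    mu_E = discounted_feature_sum(expert_trajs, phi, gamma)  # expert side of the gradient
    for _ in range(n_iters):
        # (i) one soft policy improvement step under the current reward theta:
        #     Q <- r_theta + gamma * E_{s'}[ logsumexp_a Q(s', a) ]
        r = phi @ theta                                            # (S, A) reward table
        Qmax = Q.max(axis=1)
        V = Qmax + np.log(np.exp(Q - Qmax[:, None]).sum(axis=1))   # stable logsumexp
        Q = r + gamma * (P @ V)
        pi = np.exp(Q - Q.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)                        # softmax policy
        # (ii) one stochastic gradient ascent step on the reward parameter, using the
        #      discounted feature difference between expert data and the current policy.
        agent_trajs = [rollout(P, pi, rho, horizon, rng) for _ in range(n_rollouts)]
        mu_pi = discounted_feature_sum(agent_trajs, phi, gamma)
        theta += step_size * (mu_E - mu_pi)
    return theta, pi
```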
We conduct extensive experiments to demonstrate that the proposed algorithm outperforms many state-of-the-art IRL algorithms in both policy estimation and reward recovery. In particular, when transferring to a new environment, state-of-the-art reinforcement learning (RL) algorithms trained with the rewards estimated by our method outperform those trained with rewards recovered by existing IRL and imitation learning benchmarks.
Finally, we consider the extension to the “off-line” case in which the estimation task also includes the environment dynamics. Referring to our recent work (see [16]), we consider a two-stage estimation approach. First, a maximum likelihood model of the dynamics is identified. However, this first-stage estimator of the environment dynamics may be inaccurate due to limited data coverage. Thus, in the second stage, a “conservative” reward estimator is obtained by introducing a penalty for model uncertainty.
The structure of this paper is as follows. In Sec. 2, we introduce the basic setting for structural estimation of MDPs. In Sec. 2.2, we introduce the problem formulation of maximum likelihood IRL. In Sec. 4, we introduce a single-loop algorithm for estimation and formalize a finite-time performance guarantee for high-dimensional state spaces. In Sec. 5, we present the convergence results for our proposed single-loop algorithm. In Sec. 6, we consider the case with linearly parameterized rewards to show that the proposed algorithm converges sublinearly to the maximum likelihood estimator. This result is proven by establishing a duality relationship between maximum entropy IRL and maximum likelihood IRL. In Sec. 7, we consider the case in which the agent's preferences can be represented by a reward that is a function of the state only. In Sec. 8, we outline the extension of the proposed algorithms to the offline case. Finally, in Sec. 9, we present the numerical results.
2 Background
2.1 Dynamic Discrete Choice Model
We now review the basic setting for the dynamic discrete choice model as given, for example, in [17]. At time $t \geq 0$, the agent implements an action $a_t$ from a finite (discrete) action space $\mathcal{A}$ and receives a reward $r(s_t, a_t; \theta) + \epsilon_t(a_t)$, where $s_t \in \mathcal{S}$ is the state at time $t$, $r(s_t, a_t; \theta)$ is the reward associated to the state-action pair $(s_t, a_t)$ with $\theta \in \mathbb{R}^p$ a parameter, and $\epsilon_t(a_t)$ is a random perturbation (the $a_t$-th entry of a perturbation vector $\epsilon_t \in \mathbb{R}^{|\mathcal{A}|}$) which is observable by the agent (decision maker) but not by the modeler.

Upon implementing the action $a_t \in \mathcal{A}$, the state evolves according to a Markov process with kernel $P(s_{t+1} \mid s_t, a_t)$. Moreover, let $\mu(\epsilon_t \mid s_t)$ denote the probability distribution of the random perturbation, which is a function of the state.

Let $\pi(\cdot \mid s_t, \epsilon_t)$ denote a randomized policy, i.e. $\pi(a_t \mid s_t, \epsilon_t)$ is the probability that action $a_t$ is implemented when the state is $s_t$ and the observed reward perturbation vector is $\epsilon_t$.
The agent's optimal policy is characterized by the value function:
$$V_\theta(s_0, \epsilon_0) = \max_{\pi}\; \mathbb{E}_{s_0\sim\rho,\,\tau\sim\pi}\Big[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t;\theta) + \epsilon_t(a_t)\big) \,\Big|\, s_0, \epsilon_0\Big],$$
where the expectation is taken with respect to $a_t \sim \pi(\cdot\mid s_t, \epsilon_t)$, $s_{t+1} \sim P(\cdot\mid s_t, a_t)$, $\epsilon_{t+1} \sim \mu(\cdot\mid s_{t+1})$, and $\gamma \in [0,1)$ is the discount factor. The Bellman equation is:
$$V_\theta(s_t, \epsilon_t) = \max_{a_t\in\mathcal{A}} \Big\{ r(s_t, a_t;\theta) + \epsilon_t(a_t) + \gamma\, \mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_t,a_t),\,\epsilon_{t+1}\sim\mu(\cdot\mid s_{t+1})}\big[V_\theta(s_{t+1}, \epsilon_{t+1})\big] \Big\} = \max_{a_t\in\mathcal{A}} \big\{ Q_\theta(s_t, a_t) + \epsilon_t(a_t) \big\},$$
where $Q_\theta : \mathcal{S}\times\mathcal{A} \mapsto \mathbb{R}$ is the fixed point of the soft-Bellman operator:
$$(\Lambda_\theta Q)(s_t, a_t) = r(s_t, a_t;\theta) + \gamma\, \mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_t,a_t),\,\epsilon_{t+1}\sim\mu(\cdot\mid s_{t+1})}\Big[\max_{a\in\mathcal{A}} \big( Q(s_{t+1}, a) + \epsilon_{t+1}(a)\big)\Big]. \tag{1}$$
As the realization of the reward perturbations is not observable by the modeler, a parametrized model of the agent's behavior is a map $\pi_\theta(\cdot \mid s_t)$ which satisfies Bellman's optimality as follows:
$$\pi_\theta(a_t \mid s_t) = P\Big(a_t \in \arg\max_{a\in\mathcal{A}} \big\{ Q_\theta(s_t, a) + \epsilon_t(a) \big\}\Big). \tag{2}$$
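To make the behavior model (2) concrete, the following sketch estimates the choice probabilities for a single state by Monte Carlo. The Q-values are hypothetical, and the standard Gumbel perturbation is an illustrative choice of $\mu$ that anticipates the equivalence discussed in Section 2.2; the model itself allows any state-dependent perturbation distribution.

```python
import numpy as np

def choice_probabilities(q_row, n_samples=200_000, seed=0):
    """Monte-Carlo estimate of pi_theta(a|s) = P(a = argmax_a' { Q_theta(s, a') + eps(a') })."""
    rng = np.random.default_rng(seed)
    A = q_row.shape[0]
    # One perturbation vector per draw; a common location shift of eps does not change
    # the argmax, so standard Gumbel and its zero-mean version give the same probabilities.
    eps = rng.gumbel(size=(n_samples, A))
    winners = np.argmax(q_row + eps, axis=1)
    return np.bincount(winners, minlength=A) / n_samples

q_row = np.array([1.0, 0.5, -0.2])          # hypothetical Q_theta(s, .) for one state
print(choice_probabilities(q_row))          # estimated choice probabilities
print(np.exp(q_row) / np.exp(q_row).sum())  # softmax of q_row, cf. (5c) in Section 2.2
```

With Gumbel perturbations the two printed vectors agree up to Monte-Carlo error, which is precisely the closed form derived in Section 2.2.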
Assume observations are in the form of expert state-action trajectories $\tau_E = \{(s_t, a_t)\}_{t\geq 0}$ drawn from a ground-truth (or “expert”) policy $\pi_E$, i.e. $a_t \sim \pi_E(\cdot\mid s_t)$, $s_{t+1} \sim P(\cdot\mid s_t, a_t)$ and $s_0 \sim \rho(\cdot)$, where $\rho(\cdot)$ denotes the initial distribution of the first state $s_0$. The expected discounted log-likelihood of observing such a trajectory under the model $\pi_\theta$ can be written as:
$$\mathbb{E}_{\tau_E\sim\pi_E}\Big[\sum_{t=0}^{\infty}\gamma^t \log\big(P(s_{t+1}\mid s_t,a_t)\,\pi_\theta(a_t\mid s_t)\big)\Big] = \mathbb{E}_{\tau_E\sim\pi_E}\Big[\sum_{t=0}^{\infty}\gamma^t \log\pi_\theta(a_t\mid s_t)\Big] + \mathbb{E}_{\tau_E\sim\pi_E}\Big[\sum_{t=0}^{\infty}\gamma^t \log P(s_{t+1}\mid s_t,a_t)\Big].$$
Given that the term $\mathbb{E}_{\tau_E\sim\pi_E}\big[\sum_{t=0}^{\infty}\gamma^t \log P(s_{t+1}\mid s_t,a_t)\big]$ is independent of the reward parameter $\theta$, the maximum likelihood estimation problem can be formulated as follows:
$$\max_{\theta}\; L(\theta) := \mathbb{E}_{\tau_E\sim\pi_E}\Big[\sum_{t=0}^{\infty}\gamma^t \log\pi_\theta(a_t\mid s_t)\Big] \tag{3a}$$
$$\text{s.t.}\;\; \pi_\theta(a_t\mid s_t) = P\Big(a_t \in \arg\max_{a\in\mathcal{A}}\big\{Q_\theta(s_t,a)+\epsilon_t(a)\big\}\Big), \tag{3b}$$
where $Q_\theta$ is the fixed point of the soft-Bellman operator in (1).
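As a concrete illustration of how the objective in (3a) is estimated from data, the following sketch computes the empirical discounted log-likelihood of a set of expert trajectories under a tabular policy model; the trajectory format and policy table are assumptions made for this example only.

```python
import numpy as np

def discounted_log_likelihood(expert_trajs, pi_theta, gamma=0.99):
    """Empirical estimate of L(theta) = E_{tau_E}[ sum_t gamma^t log pi_theta(a_t | s_t) ].

    expert_trajs : list of trajectories, each a list of (state, action) index pairs
    pi_theta     : array of shape (S, A), pi_theta[s, a] = model probability of action a in state s
    """
    values = []
    for traj in expert_trajs:
        ll = 0.0
        for t, (s, a) in enumerate(traj):
            ll += (gamma ** t) * np.log(pi_theta[s, a] + 1e-12)  # small constant guards against log(0)
        values.append(ll)
    return float(np.mean(values))
```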
In the next subsection we review the literature on the entropy-regularized RL model and then
highlight the formal equivalence of entropy-regularized IRL with the dynamic discrete choice model
just introduced in (3).
2.2 Maximum Likelihood Inverse Reinforcement Learning (ML-IRL)
A recent literature has considered MDP models with information processing costs [18, 19, 20, 21]. In these papers, optimal behavior is modeled as the solution to the following problem:
$$\max_{\pi\in\Pi}\; J_\theta(\pi;\rho) \triangleq \mathbb{E}_{s_0\sim\rho,\,\tau_A\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t;\theta) - c(\pi(\cdot\mid s_t))\big)\Big],$$
where $\rho(\cdot)$ denotes the initial distribution of the first state $s_0$, $\tau_A$ is a trajectory generated from the agent policy $\pi$, and $c(\cdot)$ is a function representing the information processing cost. A common specification is $c(\pi(\cdot\mid s_t)) = \alpha D_{KL}(\pi(\cdot\mid s_t)\,\|\,\pi_0(\cdot\mid s_t))$, where
$$D_{KL}(\pi(\cdot\mid s_t)\,\|\,\pi_0(\cdot\mid s_t)) = \sum_{a\in\mathcal{A}}\pi(a\mid s_t)\log\frac{\pi(a\mid s_t)}{\pi_0(a\mid s_t)}$$
is the Kullback-Leibler divergence between $\pi(\cdot\mid s_t)$ and a reference (or default) policy $\pi_0(\cdot\mid s_t)$, and $\alpha \geq 0$ is a scale parameter. As the objective function above can be re-scaled by $\frac{1}{\alpha}$, we can set $\alpha = 1$. To model no prior knowledge, the reference policy $\pi_0$ is the uniformly random policy, i.e. $\pi_0(a) = \frac{1}{|\mathcal{A}|}$ for any $a\in\mathcal{A}$. In this case, we can further rewrite the problem as:
$$\max_{\pi\in\Pi}\; J_\theta(\pi;\rho) \triangleq \mathbb{E}_{s_0\sim\rho,\,\tau_A\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t;\theta) + \mathcal{H}(\pi(\cdot\mid s_t))\big)\Big] - \frac{\log|\mathcal{A}|}{1-\gamma},$$
where $\mathcal{H}(\pi(\cdot\mid s_t)) = -\sum_{a\in\mathcal{A}}\pi(a\mid s_t)\log\pi(a\mid s_t)$ is the entropy of $\pi(\cdot\mid s_t)$. This model has also been recently used in the RL literature [22, 23, 24, 25], where it is commonly referred to as an entropy-regularized MDP.
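For completeness, the rewriting above uses the elementary identity for the KL cost with a uniform reference policy:
$$D_{KL}\big(\pi(\cdot\mid s_t)\,\|\,\pi_0(\cdot\mid s_t)\big) = \sum_{a\in\mathcal{A}}\pi(a\mid s_t)\log\big(\pi(a\mid s_t)\,|\mathcal{A}|\big) = -\mathcal{H}\big(\pi(\cdot\mid s_t)\big) + \log|\mathcal{A}|,$$
so that $r(s_t,a_t;\theta) - c(\pi(\cdot\mid s_t)) = r(s_t,a_t;\theta) + \mathcal{H}(\pi(\cdot\mid s_t)) - \log|\mathcal{A}|$, and summing the constant over the discounted horizon gives $\sum_{t=0}^{\infty}\gamma^t\log|\mathcal{A}| = \frac{\log|\mathcal{A}|}{1-\gamma}$.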
Denoting the expert policy as $\pi_E$ and assuming the entropy-regularized MDP model of behavior just described, the inverse reinforcement learning (IRL) problem can be formulated as follows:
$$\max_{\theta}\; L(\theta) := \mathbb{E}_{\tau_E\sim\pi_E}\Big[\sum_{t=0}^{\infty}\gamma^t\log\pi_\theta(a_t\mid s_t)\Big] \tag{4a}$$
$$\text{s.t.}\;\; \pi_\theta := \arg\max_{\pi}\; \mathbb{E}_{s_0\sim\rho,\,\tau_A\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t;\theta) + \mathcal{H}(\pi(\cdot\mid s_t))\big)\Big]. \tag{4b}$$
When the reward perturbations $\epsilon_t(a)$ follow an i.i.d. Gumbel distribution with zero mean and variance $\frac{\pi^2}{6}$ for $a\in\mathcal{A}$, the models of behavior (3b) and (4b) are equivalent (see Proposition 1 in [26]). Specifically, the fixed point $Q_\theta(s,a)$ of the Bellman operator $\Lambda_\theta$ in (1) and the optimal policy (2) are of the form:
$$Q_\theta(s,a) := r(s,a;\theta) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V_\theta(s')\big], \tag{5a}$$
$$V_\theta(s) = \log\sum_{\tilde a\in\mathcal{A}}\exp\big(Q_\theta(s,\tilde a)\big), \tag{5b}$$
$$\pi_\theta(a\mid s) = \frac{\exp\big(Q_\theta(s,a)\big)}{\sum_{\tilde a\in\mathcal{A}}\exp\big(Q_\theta(s,\tilde a)\big)}. \tag{5c}$$
As has been shown in [23, 25], the policy described in (5c) corresponds to the optimal policy in (4b), so that
$$V_\theta(s) = \max_{\pi}\; \mathbb{E}_{s_0\sim\rho,\,\tau_A\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t;\theta) + \mathcal{H}(\pi(\cdot\mid s_t))\big)\,\Big|\, s_0 = s\Big]. \tag{6a}$$
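To illustrate the closed forms (5a)-(5c), the following sketch iterates the soft-Bellman backup to its fixed point in a small tabular MDP with a known kernel and returns $Q_\theta$, $V_\theta$ and $\pi_\theta$; the array shapes, tolerance, and function names are assumptions made for this illustration only.

```python
import numpy as np

def soft_bellman_fixed_point(r, P, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Iterate Q <- r + gamma * E_{s'}[V(s')] with V(s) = logsumexp_a Q(s, a).

    r : reward table of shape (S, A) for a fixed theta, r[s, a] = r(s, a; theta)
    P : transition kernel of shape (S, A, S), P[s, a, s'] = P(s' | s, a)
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        Qmax = Q.max(axis=1)
        V = Qmax + np.log(np.exp(Q - Qmax[:, None]).sum(axis=1))  # (5b), stable logsumexp
        Q_new = r + gamma * (P @ V)                               # (5a)
        if np.max(np.abs(Q_new - Q)) < tol:
            Q = Q_new
            break
        Q = Q_new
    Qmax = Q.max(axis=1)
    V = Qmax + np.log(np.exp(Q - Qmax[:, None]).sum(axis=1))      # soft value, cf. (5b) and (6a)
    pi = np.exp(Q - Qmax[:, None])
    pi /= pi.sum(axis=1, keepdims=True)                           # (5c), softmax over actions
    return Q, V, pi

# Example on a random 3-state, 2-action MDP:
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # valid next-state distributions for each (s, a)
r = rng.normal(size=(3, 2))
Q, V, pi = soft_bellman_fixed_point(r, P)
```

Because the soft-Bellman operator with the log-sum-exp backup is a $\gamma$-contraction in the sup norm, this iteration converges geometrically to the unique fixed point.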