Structural Estimation of Markov Decision Processes in
High-Dimensional State Space with Finite-Time Guarantees
Siliang Zeng†, Mingyi Hong†, Alfredo Garcia‡
†Department of Electrical and Computer Engineering,
University of Minnesota, MN, USA
‡Department of Industrial and Systems Engineering,
Texas A&M University, TX, USA
Email: {zeng0176, mhong}@umn.edu, alfredo.garcia@tamu.edu
March 4, 2024
Abstract
We consider the task of estimating a structural model of dynamic decisions by a human agent
based upon the observable history of implemented actions and visited states. This problem has an
inherent nested structure: in the inner problem, an optimal policy for a given reward function is
identified while in the outer problem, a measure of fit is maximized. Several approaches have been
proposed to alleviate the computational burden of this nested-loop structure, but these methods
still suffer from high complexity when the state space is either discrete with large cardinality or
continuous in high dimensions. Other approaches in the inverse reinforcement learning (IRL)
literature emphasize policy estimation at the expense of reward estimation accuracy.
In this paper we propose a single-loop estimation algorithm with finite time guarantees that is
equipped to deal with high-dimensional state spaces without compromising reward estimation
accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic
gradient step for likelihood maximization. We show that the proposed algorithm converges to a stationary solution with a finite-time guarantee. Further, if the reward is parameterized linearly, the algorithm converges sublinearly to the maximum likelihood estimator. For robotics control problems in MuJoCo and their transfer settings, the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks.¹

¹This paper is the journal version of [1].
1 Introduction
We consider the task of estimating a structural model of dynamic decisions by a single human agent
based upon the observable history of implemented actions and visited states. This problem has been
studied as the estimation of dynamic discrete choice (DDC) models in econometrics and inverse
reinforcement learning (IRL) in artificial intelligence and machine learning research.
[2] is a seminal piece in the literature on dynamic discrete choice estimation. In that paper, the estimation task is formulated as a bi-level optimization problem where the inner problem is a stochastic dynamic programming problem and the outer problem is the likelihood maximization of observed actions and states. [2] proposed an iterative nested fixed point algorithm in which the inner
dynamic programming problem is solved repeatedly followed by maximum likelihood updates of the
structural parameters. Over the years, a significant literature on alternative estimation methods
requiring less computational effort has been developed. For example, [3] and [4] proposed two-step
algorithms which avoid the repeated solution of the inner stochastic dynamic programming problem.
In the first step, a non-parametric estimator of the policy (also referred to as conditional choice probabilities, or CCPs) is obtained, and the inverse of a map from differences in Bellman's value function for different states to randomized policies is computed. In the second step, a pseudo log-likelihood is
maximized. Two-step estimators may suffer from substantial finite sample bias if the estimated
policies in the first step are of poor quality. Sequential estimators which are recursively obtained by
alternating between pseudo-likelihood maximization and improved policy estimation are considered
in [5]. In general, the computational burden of all these methods is significant when the state space is either discrete with large cardinality or continuous in high dimensions. Discretization may be avoided by using forward Monte Carlo simulations [6, 7], but this also becomes computationally demanding in high dimensions. A constrained optimization approach for maximum likelihood estimation of dynamic discrete choice models is considered in [8]. However, the number of constraints needed to represent Bellman's equation becomes a significant computational burden for discrete state spaces with large cardinality or continuous state spaces in high dimensions.
Recent work has addressed the computational challenges posed by high-dimensional state spaces. For example, in [9] the authors extend the CCP estimator approach proposed in [3] by considering a functional approximation approach coupled with a temporal difference (TD) algorithm to maximize the pseudo-likelihood. In [10] the authors consider an approach to adjust the CCP estimator to account for finite sample bias in high-dimensional settings.
The literature in IRL features the seminal work [11], in which a model for the expert's behavior is formulated as the policy that maximizes entropy subject to a constraint requiring that the expected features under such policy match the empirical averages in the expert's observation dataset.² The algorithms developed for maximum entropy estimation [11, 12, 13] have a nested-loop structure, alternating between an outer loop with a reward update step and an inner loop that calculates the explicit policy estimates. The computational burden of this nested structure is manageable in tabular environments, but it becomes significant in high-dimensional settings requiring value function approximation.

²In Section 6, we show that if the reward is linearly parametrized, the maximum entropy formulation in [11] is the dual of the maximum likelihood formulation of the estimation problem.
Recent works have developed algorithms to alleviate the computational burden of nested-loop estimation methods. For example, in [14] the authors propose to transform the standard formulation of IRL into a single-level problem by estimating the Q-function rather than estimating the reward function and the associated optimal policy separately. However, the reward function implicit in the identified Q-function is a poor estimate since it is not guaranteed to satisfy Bellman's equation. Finally, [15] considers an approach called $f$-IRL for estimating rewards based on the minimization of several measures of divergence with respect to the expert's state visitation measure. The approach is limited to estimating rewards that depend only on the state. While the results reported are based upon a single-loop implementation, the paper does not provide a convergence guarantee to support the performance.
In contrast to the lines of work surveyed above, we focus our efforts on developing estimation algorithms with finite-time guarantees for computing high-quality estimators. Towards addressing
this challenge, in this paper we propose a class of new algorithms which only require a finite number of computational steps for a class of (non-linearly parameterized) structural estimation problems, assuming the environment dynamics are known (or samples from the environment dynamics are available "on-line"). Specifically, the proposed algorithm has a single-loop structure wherein a single-step update of the estimated policy is followed by an update of the reward parameter. We show that the algorithm has strong theoretical guarantees: to achieve an $\epsilon$-approximate stationary solution for a non-linearly parameterized problem, it requires $\mathcal{O}(\epsilon^{-2})$ steps of policy and reward updates each. To our knowledge, it is the first algorithm with a finite-time guarantee for the structural estimation of an MDP under nonlinear parameterization of the reward function.
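As a minimal illustration of the single-loop structure just described (not the exact implementation analyzed later in the paper), the following sketch alternates one soft policy improvement step with one stochastic gradient step on the reward parameter, assuming a tabular MDP with a known kernel and a linear reward $r(s,a;\theta) = \theta^\top \phi(s,a)$. The function names, sampling scheme, and step sizes are illustrative assumptions, and the gradient surrogate (the discounted feature difference between expert data and the current policy) equals the likelihood gradient only when the policy is soft-optimal for the current reward.

```python
# Hypothetical single-loop sketch: tabular MDP, known kernel P, linear reward.
import numpy as np

def discounted_feature_sum(trajs, phi, gamma):
    """Empirical estimate of E_tau[ sum_t gamma^t phi(s_t, a_t) ]."""
    total = np.zeros(phi.shape[-1])
    for traj in trajs:
        for t, (s, a) in enumerate(traj):
            total += (gamma ** t) * phi[s, a]
    return total / max(len(trajs), 1)

def rollout(P, pi, rho, horizon, rng):
    """Sample one (state, action) trajectory from policy pi in the known model (P, rho)."""
    s = rng.choice(len(rho), p=rho)
    traj = []
    for _ in range(horizon):
        a = rng.choice(pi.shape[1], p=pi[s])
        traj.append((s, a))
        s = rng.choice(P.shape[2], p=P[s, a])
    return traj

def single_loop_mlirl(P, phi, rho, expert_trajs, gamma=0.99, n_iters=500,
                      step_size=0.05, horizon=200, n_rollouts=4, seed=0):
    rng = np.random.default_rng(seed)
    S, A, d = phi.shape
    theta, Q = np.zeros(d), np.zeros((S, A))
    mu_E = discounted_feature_sum(expert_trajs, phi, gamma)  # expert side of the gradient
    for _ in range(n_iters):
        # (i) one soft policy improvement step under the current reward theta:
        #     Q <- r_theta + gamma * E_{s'}[ logsumexp_a Q(s', a) ]
        r = phi @ theta                                            # (S, A) reward table
        Qmax = Q.max(axis=1)
        V = Qmax + np.log(np.exp(Q - Qmax[:, None]).sum(axis=1))   # stable logsumexp
        Q = r + gamma * (P @ V)
        pi = np.exp(Q - Q.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)                        # softmax policy
        # (ii) one stochastic gradient ascent step on the reward parameter, using the
        #      discounted feature difference between expert data and the current policy.
        agent_trajs = [rollout(P, pi, rho, horizon, rng) for _ in range(n_rollouts)]
        mu_pi = discounted_feature_sum(agent_trajs, phi, gamma)
        theta += step_size * (mu_E - mu_pi)
    return theta, pi
```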
We conduct extensive experiments to demonstrate that the proposed algorithm outperforms many state-of-the-art IRL algorithms in both policy estimation and reward recovery. In particular, when transferring to a new environment, state-of-the-art reinforcement learning (RL) algorithms trained with the rewards estimated by our method outperform those trained with rewards recovered by existing IRL and imitation learning benchmarks.
Finally, we consider the extension to the “off-line” case in which the estimation task also includes the environment dynamics. Referring to our recent work (see [16]), we consider a two-stage estimation approach. First, a maximum likelihood model of the dynamics is identified. However, this first-stage estimator of the environment dynamics may be inaccurate due to limited data coverage. Thus, in the second stage, a “conservative” reward estimator is obtained by introducing a penalty for model uncertainty.
The structure of this paper is as follows. In Sec. 2, we introduce the basic setting for structural estimation of MDPs. In Sec. 2.2, we introduce the problem formulation of maximum likelihood IRL. In Sec. 4, we introduce a single-loop algorithm for estimation and formalize a finite-time performance guarantee for high-dimensional state spaces. In Sec. 5, we present the convergence results for our proposed single-loop algorithm. In Sec. 6, we consider the case with linearly parameterized rewards to show that the proposed algorithm converges sublinearly to the maximum likelihood estimator. This result is proven by establishing a duality relationship between maximum entropy IRL and maximum likelihood IRL. In Sec. 7, we consider the case in which the agent's preferences can be represented by a reward that is a function of the state only. In Sec. 8, we outline the extension of the proposed algorithms to the offline case. Finally, in Sec. 9, we present the numerical results.
2 Background
2.1 Dynamic Discrete Choice Model
We now review the basic setting for the dynamic discrete choice model as given, for example, in [17]. At time $t \geq 0$, the agent implements an action $a_t$ from a finite (discrete) action space $\mathcal{A}$ and receives a reward $r(s_t, a_t; \theta) + \epsilon_t(a_t)$, where $s_t \in \mathcal{S}$ is the state at time $t$, $r(s_t, a_t; \theta)$ is the reward associated to the state-action pair $(s_t, a_t)$ with $\theta \in \mathbb{R}^p$ a parameter, and $\epsilon_t(a_t)$ is a random perturbation (the $a_t$-th entry of a perturbation vector $\epsilon_t \in \mathbb{R}^{|\mathcal{A}|}$) which is observable by the agent (decision maker) but not by the modeler.

Upon implementing the action $a_t \in \mathcal{A}$, the state evolves according to a Markov process with kernel $P(s_{t+1} \mid s_t, a_t)$. Moreover, let $\mu(\epsilon_t \mid s_t)$ denote the probability distribution of the random perturbation, which is a function of the state.

Let $\pi(\cdot \mid s_t, \epsilon_t)$ denote a randomized policy, i.e. $\pi(a_t \mid s_t, \epsilon_t)$ is the probability that action $a_t$ is implemented when the state is $s_t$ and the observed reward perturbation vector is $\epsilon_t$.
The agent's optimal policy is characterized by the value function:
$$V_\theta(s_0, \epsilon_0) = \max_{\pi}\; \mathbb{E}_{s_0\sim\rho,\,\tau\sim\pi}\Big[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t;\theta) + \epsilon_t(a_t)\big) \,\Big|\, s_0, \epsilon_0\Big],$$
where the expectation is taken with respect to $a_t \sim \pi(\cdot\mid s_t, \epsilon_t)$, $s_{t+1} \sim P(\cdot\mid s_t, a_t)$, $\epsilon_{t+1} \sim \mu(\cdot\mid s_{t+1})$, and $\gamma \in [0,1)$ is the discount factor. The Bellman equation is:
$$V_\theta(s_t, \epsilon_t) = \max_{a_t\in\mathcal{A}} \Big\{ r(s_t, a_t;\theta) + \epsilon_t(a_t) + \gamma\, \mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_t,a_t),\,\epsilon_{t+1}\sim\mu(\cdot\mid s_{t+1})}\big[V_\theta(s_{t+1}, \epsilon_{t+1})\big] \Big\} = \max_{a_t\in\mathcal{A}} \big\{ Q_\theta(s_t, a_t) + \epsilon_t(a_t) \big\},$$
where $Q_\theta : \mathcal{S}\times\mathcal{A} \mapsto \mathbb{R}$ is the fixed point of the soft-Bellman operator:
$$(\Lambda_\theta Q)(s_t, a_t) = r(s_t, a_t;\theta) + \gamma\, \mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_t,a_t),\,\epsilon_{t+1}\sim\mu(\cdot\mid s_{t+1})}\Big[\max_{a\in\mathcal{A}} \big( Q(s_{t+1}, a) + \epsilon_{t+1}(a)\big)\Big]. \tag{1}$$
As the realization of the reward perturbations is not observable by the modeler, a parametrized model of the agent's behavior is a map $\pi_\theta(\cdot \mid s_t)$ which satisfies Bellman's optimality as follows:
$$\pi_\theta(a_t \mid s_t) = P\Big(a_t \in \arg\max_{a\in\mathcal{A}} \big\{ Q_\theta(s_t, a) + \epsilon_t(a) \big\}\Big). \tag{2}$$
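To make the behavior model (2) concrete, the following sketch estimates the choice probabilities for a single state by Monte Carlo. The Q-values are hypothetical, and the standard Gumbel perturbation is an illustrative choice of $\mu$ that anticipates the equivalence discussed in Section 2.2; the model itself allows any state-dependent perturbation distribution.

```python
import numpy as np

def choice_probabilities(q_row, n_samples=200_000, seed=0):
    """Monte-Carlo estimate of pi_theta(a|s) = P(a = argmax_a' { Q_theta(s, a') + eps(a') })."""
    rng = np.random.default_rng(seed)
    A = q_row.shape[0]
    # One perturbation vector per draw; a common location shift of eps does not change
    # the argmax, so standard Gumbel and its zero-mean version give the same probabilities.
    eps = rng.gumbel(size=(n_samples, A))
    winners = np.argmax(q_row + eps, axis=1)
    return np.bincount(winners, minlength=A) / n_samples

q_row = np.array([1.0, 0.5, -0.2])          # hypothetical Q_theta(s, .) for one state
print(choice_probabilities(q_row))          # estimated choice probabilities
print(np.exp(q_row) / np.exp(q_row).sum())  # softmax of q_row, cf. (5c) in Section 2.2
```

With Gumbel perturbations the two printed vectors agree up to Monte-Carlo error, which is precisely the closed form derived in Section 2.2.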
Assume observations are in the form of expert state-action trajectories $\tau_E = \{(s_t, a_t)\}_{t\geq 0}$ drawn from a ground-truth (or “expert”) policy $\pi_E$, i.e. $a_t \sim \pi_E(\cdot\mid s_t)$, $s_{t+1} \sim P(\cdot\mid s_t, a_t)$ and $s_0 \sim \rho(\cdot)$, where $\rho(\cdot)$ denotes the initial distribution of the first state $s_0$. The expected discounted log-likelihood of observing such a trajectory under the model $\pi_\theta$ can be written as:
$$\mathbb{E}_{\tau_E\sim\pi_E}\Big[\sum_{t=0}^{\infty}\gamma^t \log\big(P(s_{t+1}\mid s_t,a_t)\,\pi_\theta(a_t\mid s_t)\big)\Big] = \mathbb{E}_{\tau_E\sim\pi_E}\Big[\sum_{t=0}^{\infty}\gamma^t \log\pi_\theta(a_t\mid s_t)\Big] + \mathbb{E}_{\tau_E\sim\pi_E}\Big[\sum_{t=0}^{\infty}\gamma^t \log P(s_{t+1}\mid s_t,a_t)\Big].$$
Given that the term $\mathbb{E}_{\tau_E\sim\pi_E}\big[\sum_{t=0}^{\infty}\gamma^t \log P(s_{t+1}\mid s_t,a_t)\big]$ is independent of the reward parameter $\theta$, the maximum likelihood estimation problem can be formulated as follows:
$$\max_{\theta}\; L(\theta) := \mathbb{E}_{\tau_E\sim\pi_E}\Big[\sum_{t=0}^{\infty}\gamma^t \log\pi_\theta(a_t\mid s_t)\Big] \tag{3a}$$
$$\text{s.t.}\;\; \pi_\theta(a_t\mid s_t) = P\Big(a_t \in \arg\max_{a\in\mathcal{A}}\big\{Q_\theta(s_t,a)+\epsilon_t(a)\big\}\Big), \tag{3b}$$
where $Q_\theta$ is the fixed point of the soft-Bellman operator in (1).
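As a concrete illustration of how the objective in (3a) is estimated from data, the following sketch computes the empirical discounted log-likelihood of a set of expert trajectories under a tabular policy model; the trajectory format and policy table are assumptions made for this example only.

```python
import numpy as np

def discounted_log_likelihood(expert_trajs, pi_theta, gamma=0.99):
    """Empirical estimate of L(theta) = E_{tau_E}[ sum_t gamma^t log pi_theta(a_t | s_t) ].

    expert_trajs : list of trajectories, each a list of (state, action) index pairs
    pi_theta     : array of shape (S, A), pi_theta[s, a] = model probability of action a in state s
    """
    values = []
    for traj in expert_trajs:
        ll = 0.0
        for t, (s, a) in enumerate(traj):
            ll += (gamma ** t) * np.log(pi_theta[s, a] + 1e-12)  # small constant guards against log(0)
        values.append(ll)
    return float(np.mean(values))
```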
In the next subsection we review the literature on the entropy-regularized RL model and then
highlight the formal equivalence of entropy-regularized IRL with the dynamic discrete choice model
just introduced in (3).
2.2 Maximum Likelihood Inverse Reinforcement Learning (ML-IRL)
A recent literature has considered MDP models with information processing costs [18, 19, 20, 21]. In these papers, optimal behavior is modeled as the solution to the following problem:
$$\max_{\pi\in\Pi}\; J_\theta(\pi;\rho) \triangleq \mathbb{E}_{s_0\sim\rho,\,\tau_A\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t;\theta) - c(\pi(\cdot\mid s_t))\big)\Big],$$
where $\rho(\cdot)$ denotes the initial distribution of the first state $s_0$, $\tau_A$ is a trajectory generated from the agent policy $\pi$, and $c(\cdot)$ is a function representing the information processing cost. A common specification is $c(\pi(\cdot\mid s_t)) = \alpha D_{KL}(\pi(\cdot\mid s_t)\,\|\,\pi_0(\cdot\mid s_t))$, where
$$D_{KL}(\pi(\cdot\mid s_t)\,\|\,\pi_0(\cdot\mid s_t)) = \sum_{a\in\mathcal{A}}\pi(a\mid s_t)\log\frac{\pi(a\mid s_t)}{\pi_0(a\mid s_t)}$$
is the Kullback-Leibler divergence between $\pi(\cdot\mid s_t)$ and a reference (or default) policy $\pi_0(\cdot\mid s_t)$, and $\alpha \geq 0$ is a scale parameter. As the objective function above can be re-scaled by $\frac{1}{\alpha}$, we can set $\alpha = 1$. To model no prior knowledge, the reference policy $\pi_0$ is the uniformly random policy, i.e. $\pi_0(a) = \frac{1}{|\mathcal{A}|}$ for any $a\in\mathcal{A}$. In this case, we can further rewrite the problem as:
$$\max_{\pi\in\Pi}\; J_\theta(\pi;\rho) \triangleq \mathbb{E}_{s_0\sim\rho,\,\tau_A\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t;\theta) + \mathcal{H}(\pi(\cdot\mid s_t))\big)\Big] - \frac{\log|\mathcal{A}|}{1-\gamma},$$
where $\mathcal{H}(\pi(\cdot\mid s_t)) = -\sum_{a\in\mathcal{A}}\pi(a\mid s_t)\log\pi(a\mid s_t)$ is the entropy of $\pi(\cdot\mid s_t)$. This model has also been recently used in the RL literature [22, 23, 24, 25], where it is commonly referred to as an entropy-regularized MDP.
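For completeness, the rewriting above uses the elementary identity for the KL cost with a uniform reference policy:
$$D_{KL}\big(\pi(\cdot\mid s_t)\,\|\,\pi_0(\cdot\mid s_t)\big) = \sum_{a\in\mathcal{A}}\pi(a\mid s_t)\log\big(\pi(a\mid s_t)\,|\mathcal{A}|\big) = -\mathcal{H}\big(\pi(\cdot\mid s_t)\big) + \log|\mathcal{A}|,$$
so that $r(s_t,a_t;\theta) - c(\pi(\cdot\mid s_t)) = r(s_t,a_t;\theta) + \mathcal{H}(\pi(\cdot\mid s_t)) - \log|\mathcal{A}|$, and summing the constant over the discounted horizon gives $\sum_{t=0}^{\infty}\gamma^t\log|\mathcal{A}| = \frac{\log|\mathcal{A}|}{1-\gamma}$.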
Denoting the expert policy as $\pi_E$ and assuming the entropy-regularized MDP model of behavior just described, the inverse reinforcement learning (IRL) problem can be formulated as follows:
$$\max_{\theta}\; L(\theta) := \mathbb{E}_{\tau_E\sim\pi_E}\Big[\sum_{t=0}^{\infty}\gamma^t\log\pi_\theta(a_t\mid s_t)\Big] \tag{4a}$$
$$\text{s.t.}\;\; \pi_\theta := \arg\max_{\pi}\; \mathbb{E}_{s_0\sim\rho,\,\tau_A\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t;\theta) + \mathcal{H}(\pi(\cdot\mid s_t))\big)\Big]. \tag{4b}$$
When the reward perturbations $\epsilon_t(a)$ follow an i.i.d. Gumbel distribution with zero mean and variance $\frac{\pi^2}{6}$ for $a\in\mathcal{A}$, the models of behavior (3b) and (4b) are equivalent (see Proposition 1 in [26]). Specifically, the fixed point $Q_\theta(s,a)$ of the Bellman operator $\Lambda_\theta$ in (1) and the optimal policy (2) are of the form:
$$Q_\theta(s,a) := r(s,a;\theta) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V_\theta(s')\big], \tag{5a}$$
$$V_\theta(s) = \log\sum_{\tilde a\in\mathcal{A}}\exp\big(Q_\theta(s,\tilde a)\big), \tag{5b}$$
$$\pi_\theta(a\mid s) = \frac{\exp\big(Q_\theta(s,a)\big)}{\sum_{\tilde a\in\mathcal{A}}\exp\big(Q_\theta(s,\tilde a)\big)}. \tag{5c}$$
As has been shown in [23, 25], the policy described in (5c) corresponds to the optimal policy in (4b), so that
$$V_\theta(s) = \max_{\pi}\; \mathbb{E}_{s_0\sim\rho,\,\tau_A\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t;\theta) + \mathcal{H}(\pi(\cdot\mid s_t))\big)\,\Big|\, s_0 = s\Big]. \tag{6a}$$
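To illustrate the closed forms (5a)-(5c), the following sketch iterates the soft-Bellman backup to its fixed point in a small tabular MDP with a known kernel and returns $Q_\theta$, $V_\theta$ and $\pi_\theta$; the array shapes, tolerance, and function names are assumptions made for this illustration only.

```python
import numpy as np

def soft_bellman_fixed_point(r, P, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Iterate Q <- r + gamma * E_{s'}[V(s')] with V(s) = logsumexp_a Q(s, a).

    r : reward table of shape (S, A) for a fixed theta, r[s, a] = r(s, a; theta)
    P : transition kernel of shape (S, A, S), P[s, a, s'] = P(s' | s, a)
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        Qmax = Q.max(axis=1)
        V = Qmax + np.log(np.exp(Q - Qmax[:, None]).sum(axis=1))  # (5b), stable logsumexp
        Q_new = r + gamma * (P @ V)                               # (5a)
        if np.max(np.abs(Q_new - Q)) < tol:
            Q = Q_new
            break
        Q = Q_new
    Qmax = Q.max(axis=1)
    V = Qmax + np.log(np.exp(Q - Qmax[:, None]).sum(axis=1))      # soft value, cf. (5b) and (6a)
    pi = np.exp(Q - Qmax[:, None])
    pi /= pi.sum(axis=1, keepdims=True)                           # (5c), softmax over actions
    return Q, V, pi

# Example on a random 3-state, 2-action MDP:
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # valid next-state distributions for each (s, a)
r = rng.normal(size=(3, 2))
Q, V, pi = soft_bellman_fixed_point(r, P)
```

Because the soft-Bellman operator with the log-sum-exp backup is a $\gamma$-contraction in the sup norm, this iteration converges geometrically to the unique fixed point.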