
more fine-grained characterization of the effects of regularization, which leads to novel insights about
how to design better regularizers, and existing DICE estimators are subsumed as special cases of our
method when we choose very simple regularizers (see Remark 3 in Section 5).
Fitted-Q Evaluation (FQE)
Outside the MIS literature, one can obtain return and value-function estimation guarantees via FQE [DJW20; CJ19; LVY19; UIJKSX21]. However, it is well understood that FQE and related approaches require Bellman-completeness-type assumptions, such as the function class being closed under the Bellman operator. Even putting aside the difference between completeness and realizability, we allow for a user-specified error-measuring distribution, which is not available in FQE or any other existing method. The only distribution these methods are aware of is the data distribution $d^D$, and even so, FQE and its variants rarely provide guarantees on $\|\hat{q} - q^\pi\|_{2, d^D}$, but often on the Bellman error (e.g., $\|\hat{q} - \mathcal{T}^\pi \hat{q}\|_{2, d^D}$) instead [UIJKSX21]; obtaining guarantees on a distribution of interest often requires multiple indirect translations and loose relaxations.
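For readers unfamiliar with FQE, the following is a minimal sketch of its iterative regression with a linear function class; the interfaces `phi` (feature map), `pi` (target policy), and the ridge parameter `reg` are illustrative choices and not part of the cited works.

```python
import numpy as np

def fqe_linear(data, phi, pi, gamma, num_iters=200, reg=1e-6):
    """Minimal linear FQE sketch: repeatedly regress onto r + gamma * q_k(s', pi).

    data: list of (s, a, r, s_next) tuples sampled from d^D
    phi:  feature map, phi(s, a) -> np.ndarray of shape (d,)
    pi:   target policy, pi(s) -> iterable of (action, probability) pairs
    """
    d = phi(*data[0][:2]).shape[0]
    Phi = np.stack([phi(s, a) for (s, a, _, _) in data])        # (n, d) features
    rew = np.array([r for (_, _, r, _) in data])                # (n,) rewards
    theta = np.zeros(d)
    for _ in range(num_iters):
        # Regression targets: r + gamma * E_{a' ~ pi(.|s')}[q_k(s', a')]
        y = rew + gamma * np.array([
            sum(p * (phi(s_next, a_next) @ theta) for a_next, p in pi(s_next))
            for (_, _, _, s_next) in data
        ])
        # Least-squares fit under the data distribution d^D (ridge for stability)
        theta = np.linalg.solve(Phi.T @ Phi + reg * np.eye(d), Phi.T @ y)
    return lambda s, a: phi(s, a) @ theta                        # estimate of q^pi
```

Each iteration fits a regression target built from the current iterate under $d^D$, which is why the guarantees naturally obtained for such procedures are stated in terms of errors measured under the data distribution.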
LSTDQ
Our analyses focus on general function approximation. When restricted to linear classes,
function estimation guarantees for qπunder dDcan be obtained by LSTDQ methods [LP03; BY09;
DNP14] when the function class only satisfies realizability of
qπ
[PKBK22]. However, this requires
an additional matrix invertibility condition (see Assumption 3 of [PKBK22]), and it is still unclear
what this condition corresponds to in general function approximation.
4
Moreover, many general
methods—including MIS [UHJ20] and other minimax methods [ASM08; XCJMA21]—coincide
with LSTDQ in the linear case, so the aforementioned results can be viewed as a specialized analysis
leveraging the properties of linear classes.
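To make the linear case concrete, here is a minimal LSTDQ sketch; the interfaces `phi` and `pi` are illustrative, and the invertibility of the matrix `A` below is the kind of matrix condition referred to in the discussion above.

```python
import numpy as np

def lstdq(data, phi, pi, gamma):
    """Minimal LSTDQ sketch with linear features phi and target policy pi."""
    d = phi(*data[0][:2]).shape[0]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for (s, a, r, s_next) in data:
        f = phi(s, a)
        # Expected next-step feature under the target policy pi
        f_next = sum(p * phi(s_next, a_next) for a_next, p in pi(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    theta = np.linalg.solve(A, b)           # requires A to be invertible
    return lambda s, a: phi(s, a) @ theta   # linear estimate of q^pi
```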
PRO-RL [ZHHJL22]
Our key proof techniques are adapted from [ZHHJL22], whose goal is offline policy learning. They learn the importance weight function $w^\pi$ for a near-optimal $\pi$, and provide $\|\hat{w} - w^\pi\|_{2, d^D}$ guarantees as an intermediate result. Despite using similar technical tools, our most interesting and surprising results are in the value-function estimation setting, which is not considered by [ZHHJL22]. Our novel algorithmic insights, such as incorporating error-measuring distributions and approximate models in the regularizers, are also potentially useful in [ZHHJL22]'s policy-learning setting. Our analyses also reveal a number of important differences between OPE and offline policy learning, which will be discussed in Appendix A.
3 Preliminaries
We consider off-policy evaluation (OPE) in Markov Decision Processes (MDPs). An MDP is specified by its state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ ($\Delta(\cdot)$ is the probability simplex), reward function $R: \mathcal{S} \times \mathcal{A} \to \Delta([0,1])$, discount factor $\gamma \in [0, 1)$, and an initial state distribution $\mu_0 \in \Delta(\mathcal{S})$. We assume $\mathcal{S}$ and $\mathcal{A}$ are finite and discrete, but their cardinalities can be arbitrarily large. Given a target policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$, a random trajectory $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$ can be generated as $s_0 \sim \mu_0$, $a_t \sim \pi(\cdot|s_t)$, $r_t \sim R(\cdot|s_t, a_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$, $\forall t \geq 0$; we use $\mathbb{E}_\pi$ and $\mathbb{P}_\pi$ to refer to expectation and probability under such a distribution. The expected discounted return (or simply return) of $\pi$ is $J(\pi) := \mathbb{E}_\pi[\sum_t \gamma^t r_t]$. The Q-value function of $\pi$ is the unique solution of the Bellman equations $q^\pi = \mathcal{T}^\pi q^\pi$, with the Bellman operator $\mathcal{T}^\pi: \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ defined as $\forall q \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, $(\mathcal{T}^\pi q)(s, a) := \mathbb{E}_{r \sim R(\cdot|s,a)}[r] + \gamma (P^\pi q)(s, a)$. Here $P^\pi \in \mathbb{R}^{|\mathcal{S} \times \mathcal{A}| \times |\mathcal{S} \times \mathcal{A}|}$ is the state-action transition operator of $\pi$, defined as $(P^\pi q)(s, a) := \mathbb{E}_{s' \sim P(\cdot|s,a), a' \sim \pi(\cdot|s')}[q(s', a')]$. Functions over $\mathcal{S} \times \mathcal{A}$ (such as $q$) are also treated as $|\mathcal{S} \times \mathcal{A}|$-dimensional vectors interchangeably.
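As a purely illustrative rendering of these definitions, the tabular sketch below computes $q^\pi$ and $J(\pi)$ by solving the Bellman equations directly; the array layout (state-action pairs flattened as $s \cdot |\mathcal{A}| + a$) is an implementation choice introduced here.

```python
import numpy as np

def q_pi_and_return(P, R, pi, mu0, gamma):
    """Tabular rendering of the definitions above.

    P:   (S*A, S) matrix; row s*A + a is P(.|s, a)
    R:   (S*A,) vector of expected rewards E_{r ~ R(.|s,a)}[r]
    pi:  (S, A) matrix with pi[s, a] = pi(a|s)
    mu0: (S,) initial state distribution
    """
    S, A = pi.shape
    # State-action transition operator: P^pi[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s')
    P_pi = (P[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
    # q^pi is the unique solution of the Bellman equations q = R + gamma * P^pi q
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, R)
    # J(pi) = E_{s0 ~ mu0, a0 ~ pi(.|s0)}[q^pi(s0, a0)]
    nu0 = (mu0[:, None] * pi).reshape(S * A)
    return q.reshape(S, A), nu0 @ q
```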
In OPE, we want to estimate $q^\pi$ and other functions of interest based on a historical dataset collected by a possibly different policy. As a standard simplification, we assume that the offline dataset consists of $n$ i.i.d. tuples $\{(s_i, a_i, r_i, s_i')\}_{i=1}^n$ sampled as $(s_i, a_i) \sim d^D$, $r_i \sim R(\cdot|s_i, a_i)$, and $s_i' \sim P(\cdot|s_i, a_i)$. We call $d^D$ the (offline) data distribution. As another function of interest, the (marginalized importance) weight function $w^\pi$ is defined as $w^\pi(s, a) := d^\pi(s, a)/d^D(s, a)$, where $d^\pi(s, a) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \mathbb{P}_\pi[s_t = s, a_t = a]$ is the discounted state-action occupancy of $\pi$. For technical convenience we assume $d^D(s, a) > 0$ for all $(s, a)$, so that quantities like $w^\pi$ are always well defined and finite.⁵
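As an illustrative tabular sketch (array conventions as in the previous snippet), $d^\pi$ and $w^\pi$ can be computed in closed form from their definitions: the linear solve below is the definition $d^\pi = (1-\gamma)\sum_t \gamma^t \nu_0^\top (P^\pi)^t$ written as $(1-\gamma)\nu_0^\top (I - \gamma P^\pi)^{-1}$, where $\nu_0(s, a) := \mu_0(s)\pi(a|s)$ is notation introduced here for the initial state-action distribution.

```python
import numpy as np

def weight_function(P, pi, mu0, d_D, gamma):
    """Tabular rendering of d^pi and w^pi (array conventions as above).

    d_D: (S, A) offline data distribution, assumed strictly positive
    """
    S, A = pi.shape
    P_pi = (P[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
    nu0 = (mu0[:, None] * pi).reshape(S * A)   # distribution of (s0, a0)
    # d^pi (as a row vector) equals (1 - gamma) * nu0^T (I - gamma P^pi)^{-1}
    d_pi = (1 - gamma) * np.linalg.solve((np.eye(S * A) - gamma * P_pi).T, nu0)
    # w^pi(s, a) = d^pi(s, a) / d^D(s, a)
    return (d_pi / d_D.reshape(S * A)).reshape(S, A)
```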
Similarly to $q^\pi$, $w^\pi$ also satisfies a recursive equation, inherited from the Bellman
⁴ It is hinted by [UHJ20] that the invertibility is related to a loss-minimization condition in MIS, but the connection only holds for return estimation.
⁵ This assumption can be removed at the cost of more cumbersome derivations. Also, these density ratios can still take prohibitively large values even if they are finite, and we will need to make additional boundedness assumptions to enable finite-sample guarantees anyway, so their finiteness does not trivialize the analyses.