Beyond the Return: Off-policy Function Estimation
under User-specified Error-measuring Distributions
Audrey Huang
Computer Science
University of Illinois at Urbana-Champaign
audreyh5@illinois.edu
Nan Jiang
Computer Science
University of Illinois at Urbana-Champaign
nanjiang@illinois.edu
Abstract
Off-policy evaluation often refers to two related tasks: estimating the expected
return of a policy and estimating its value function (or other functions of interest,
such as density ratios). While recent works on marginalized importance sampling
(MIS) show that the former can enjoy provable guarantees under realizable function
approximation, the latter is only known to be feasible under much stronger assump-
tions such as prohibitively expressive discriminators. In this work, we provide
guarantees for off-policy function estimation under only realizability, by imposing
proper regularization on the MIS objectives. Compared to commonly used regularization in MIS, our regularizer is much more flexible and can account for an arbitrary user-specified distribution, under which the learned function will be close to the ground truth. We provide an exact characterization of the optimal dual solution that needs to be realized by the discriminator class, which determines the data-coverage assumption in the case of value-function learning. As another surprising observation, the regularizer can be altered to relax the data-coverage requirement, and completely eliminate it in the ideal case with strong side information.
1 Introduction
Off-policy evaluation (OPE) often refers to two related tasks in reinforcement learning (RL): estimat-
ing the expected return of a target policy using a dataset collected from a different behavior policy,
versus estimating the policy’s value function (or other functions of interest, such as density ratios).
The former is crucial to hyperparameter tuning and verifying the performance of a policy before
real-world deployment in offline RL [VLJY19; Pai+20; ZJ21]. The latter, on the other hand, plays
an important role in (both online and offline) training, often as the subroutine of actor-critic-style
algorithms [LP03; LSAB19], but is also generally more difficult than the former: if an accurate value
function is available, one could easily estimate the return by plugging in the initial distribution.
Between the two tasks, the theoretical nature of off-policy return estimation is relatively well
understood, especially in terms of the function-approximation assumptions needed for sample-
complexity guarantees. Among the available algorithms, importance sampling (IS) and its variants
[PSS00; TTG15; JL16] do not require any function approximation, but incur exponential-in-horizon
variance; Fitted-Q Evaluation [EGW05; LVY19] can enjoy polynomial sample complexity under
appropriate coverage assumptions, but the guarantee relies on the strong Bellman-completeness
assumption on the function class; marginalized importance sampling (MIS) methods, which have
gained significant attention recently [LLTZ18; XMW19; UHJ20; NCDL19], use two function classes
to simultaneously approximate the value and the density-ratio (or weight) function and optimize
minimax objectives. Notably, it is the only family of methods known to produce accurate return
estimates with a polynomial sample complexity, when the function classes only satisfy the relatively
weak realizability assumptions (i.e., they contain the true value and weight functions).
In comparison, little is known about off-policy function estimation, and the guarantees are generally
less desirable. Not only do the limitations of IS and FQE on return estimation carry over to this more
challenging task, but MIS also loses its major advantage over FQE: despite the somewhat misleading impression left by many prior works, that MIS can handle function estimation the same way as return estimation,¹ MIS for function estimation often requires unrealistic assumptions such as prohibitively expressive discriminators. For concreteness, a typical guarantee for function estimation from MIS looks like the following (see e.g., Theorem 4 of [LLTZ18] and Lemmas 1 and 3 of [UHJ20]):
Proposition 1 (Function estimation guarantee for MIS, informal). Suppose the offline data distribution $d^D$ satisfies $d^D(s,a) > 0, \forall s,a$. Given a value-function class $\mathcal{Q}$ with $q^\pi \in \mathcal{Q}$ and weight class $\mathcal{W} = \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$, we have $q^\pi = \arg\min_{q\in\mathcal{Q}} \max_{w\in\mathcal{W}} L(w, q)$ for some appropriate population loss function $L$.²
To enable the identification of the value function $q^\pi$, the result requires the discriminator class $\mathcal{W}$ to be the space of all possible functions over the state-action space ($\mathcal{W} = \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$). In the finite-sample regime, using such a class incurs a sample complexity that depends on the size of the state-action space, which completely defeats the purpose of function approximation.
In addition, these results only hold asymptotically, where the function of interest can be exactly identified in a point-wise manner. Such an overly strong guarantee is unrealistic in the finite-sample regime, where one can only hope to approximate the function well in an average sense under some distribution, i.e., finite-sample performance guarantees should ideally bound $\|\hat q - q^\pi\|_{2,\nu}$ for the learned $\hat q$, where $\|\cdot\|_{2,\nu}$ is the $\nu$-weighted 2-norm. Such fine-grained analyses are non-existent in MIS. Even in the broader literature, such results not only require Bellman-completeness-type assumptions [UIJKSX21], they also come with some fixed $\nu$ (which is not necessarily $d^D$; see Section 2), and the user has no freedom in choosing $\nu$. This creates a gap in the literature, as downstream learning algorithms that use off-policy function estimation as a subroutine often assume the estimation to be accurate under specific distributions [KL02; AYBBLSW19] (see details in Appendix A).
To summarize, below are two important open problems on off-policy function estimation:
1. Is it possible to obtain polynomial³ sample complexity for off-policy function estimation, using function classes that only satisfy realizability-type assumptions?
2. Can we specify a distribution $\nu$ to the estimation algorithm, such that the learned function will be close to the ground truth under $\nu$?
In this work, we answer both open questions in the positive. By imposing proper regularization on the MIS objectives, we provide off-policy function estimation guarantees under only realizability assumptions on the function classes. Compared to commonly used regularization in MIS [NCDL19; ND20; YNDLS20], our regularizer is much more flexible and can account for an arbitrary user-specified distribution $\nu$, under which the learned function will be close to the ground truth. We provide an exact characterization of the optimal dual solution that needs to be realized by the discriminator, which determines the data-coverage assumption in value-function learning. As another surprising observation, the regularizer can be altered to relax the data-coverage requirement, and in the ideal case completely eliminate it when strong side information is available. Proof-of-concept experiments are also conducted to validate our theoretical predictions.
2 Related Works
Regularization in MIS
The use of regularization is very common in the MIS literature, especially in DICE algorithms [NCDL19; NDKCLS19; YNDLS20]. However, most prior works that consider regularization use tabular derivations and seldom provide finite-sample function-approximation guarantees even for return estimation, let alone function estimation. (An exception is the work of [UIJKSX21], who analyze related estimators under Bellman-completeness-type assumptions; see the next paragraph.) More importantly, prior works provide only a limited understanding of how the choice of regularization affects learning guarantees, and have considered only naïve forms of regularization (state-action-independent and typically under $d^D$), under which different forms of regularization are essentially treated equally under a coarse-grained theory [YNDLS20]. In contrast, we provide a much more fine-grained characterization of the effects of regularization, which leads to novel insights about how to design better regularizers, and existing DICE estimators are subsumed as special cases of our method when we choose very simple regularizers (see Remark 3 in Section 5).

¹ For example, [LSAB19] assumed a weight estimation oracle and cited [LLTZ18] as a possible instance.
² Concretely, one can choose $L$ as Eq. (4) without the regularization term, which recovers MQL in [UHJ20].
³ By "polynomial", we mean polynomial in the horizon, the statistical capacities and the boundedness of the function classes, and the parameter that measures the degree of data coverage.
Fitted-Q Evaluation (FQE)
Outside the MIS literature, one can obtain return and value-function
estimation guarantees via FQE [DJW20; CJ19; LVY19; UIJKSX21]. However, it is well understood
that FQE and related approaches require Bellman-completeness-type assumptions, such as the
function class being closed under the Bellman operator. Even putting aside the difference between
completeness vs. realizability, we allow for a user-specified error-measuring distribution, which is
not available in FQE or any other existing method. The only distribution these methods are aware of
is the data distribution $d^D$, and even so, FQE and variants rarely provide guarantees on $\|\hat q - q^\pi\|_{2,d^D}$, but often on the Bellman error (e.g., $\|\hat q - \mathcal{T}^\pi \hat q\|_{2,d^D}$) instead [UIJKSX21], and obtaining guarantees on a distribution of interest often requires multiple indirect translations and loose relaxations.
LSTDQ
Our analyses focus on general function approximation. When restricted to linear classes, function estimation guarantees for $q^\pi$ under $d^D$ can be obtained by LSTDQ methods [LP03; BY09; DNP14] when the function class only satisfies realizability of $q^\pi$ [PKBK22]. However, this requires an additional matrix invertibility condition (see Assumption 3 of [PKBK22]), and it is still unclear what this condition corresponds to in general function approximation.⁴ Moreover, many general methods (including MIS [UHJ20] and other minimax methods [ASM08; XCJMA21]) coincide with LSTDQ in the linear case, so the aforementioned results can be viewed as a specialized analysis leveraging the properties of linear classes.

⁴ It is hinted by [UHJ20] that the invertibility is related to a loss minimization condition in MIS, but the connection only holds for return estimation.
PRO-RL [ZHHJL22]
Our key proof techniques are adapted from [ZHHJL22], whose goal is offline policy learning. They learn the importance weight function $w^\pi$ for a near-optimal $\pi$, and provide $\|\hat w - w^\pi\|_{2,d^D}$ guarantees as an intermediate result. Despite using similar technical tools, our most interesting and surprising results are in the value-function estimation setting, which is not considered by [ZHHJL22]. Our novel algorithmic insights, such as incorporating error-measuring distributions and approximate models in the regularizers, are also potentially useful in [ZHHJL22]'s policy learning setting. Our analyses also reveal a number of important differences between OPE and offline policy learning, which will be discussed in Appendix A.
3 Preliminaries
We consider off-policy evaluation (OPE) in Markov Decision Processes (MDPs). An MDP is specified by its state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $P: \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$ ($\Delta(\cdot)$ is the probability simplex), reward function $R: \mathcal{S}\times\mathcal{A} \to \Delta([0,1])$, discount factor $\gamma \in [0,1)$, and an initial state distribution $\mu_0 \in \Delta(\mathcal{S})$. We assume $\mathcal{S}$ and $\mathcal{A}$ are finite and discrete, but their cardinalities can be arbitrarily large. Given a target policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$, a random trajectory $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$ can be generated as $s_0 \sim \mu_0$, $a_t \sim \pi(\cdot|s_t)$, $r_t \sim R(\cdot|s_t, a_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$, $\forall t \ge 0$; we use $\mathbb{E}_\pi$ and $\mathbb{P}_\pi$ to refer to expectation and probability under such a distribution. The expected discounted return (or simply return) of $\pi$ is $J(\pi) := \mathbb{E}_\pi[\sum_t \gamma^t r_t]$. The Q-value function of $\pi$ is the unique solution of the Bellman equations $q^\pi = \mathcal{T}^\pi q^\pi$, with the Bellman operator $\mathcal{T}^\pi: \mathbb{R}^{\mathcal{S}\times\mathcal{A}} \to \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ defined as $\forall q \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$, $(\mathcal{T}^\pi q)(s,a) := \mathbb{E}_{r\sim R(\cdot|s,a)}[r] + \gamma (P^\pi q)(s,a)$. Here $P^\pi \in \mathbb{R}^{|\mathcal{S}\times\mathcal{A}|\times|\mathcal{S}\times\mathcal{A}|}$ is the state-action transition operator of $\pi$, defined as $(P^\pi q)(s,a) := \mathbb{E}_{s'\sim P(\cdot|s,a),\, a'\sim\pi(\cdot|s')}[q(s',a')]$. Functions over $\mathcal{S}\times\mathcal{A}$ (such as $q$) are also treated as $|\mathcal{S}\times\mathcal{A}|$-dimensional vectors interchangeably.
In OPE, we want to estimate $q^\pi$ and other functions of interest based on a historical dataset collected by a possibly different policy. As a standard simplification, we assume that the offline dataset consists of $n$ i.i.d. tuples $\{(s_i, a_i, r_i, s_i')\}_{i=1}^n$ sampled as $(s_i, a_i) \sim d^D$, $r_i \sim R(\cdot|s_i, a_i)$, and $s_i' \sim P(\cdot|s_i, a_i)$. We call $d^D$ the (offline) data distribution. As another function of interest, the (marginalized importance) weight function $w^\pi$ is defined as $w^\pi(s,a) := d^\pi(s,a)/d^D(s,a)$, where $d^\pi(s,a) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}_\pi[s_t = s, a_t = a]$ is the discounted state-action occupancy of $\pi$. For technical convenience we assume $d^D(s,a) > 0\ \forall s,a$, so that quantities like $w^\pi$ are always well defined and finite.⁵ Similarly to $q^\pi$, $w^\pi$ also satisfies a recursive equation, inherited from the Bellman flow equation for $d^\pi$: $d^\pi = (1-\gamma)\mu_0^\pi + \gamma \widetilde{P}^\pi d^\pi$, where $(s,a) \sim \mu_0^\pi \Leftrightarrow s \sim \mu_0, a \sim \pi(\cdot|s)$ is the initial state-action distribution, and $\widetilde{P}^\pi := (P^\pi)^\top$ is the transpose of the transition matrix.

⁵ It will be trivial to remove this assumption at the cost of cumbersome derivations. Also, these density ratios can still take prohibitively large values even if they are finite, and we will need to make additional boundedness assumptions to enable finite-sample guarantees anyway, so their finiteness does not trivialize the analyses.
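As a concrete illustration of these definitions, the following minimal sketch (not from the paper; the randomly generated MDP and all variable names are ours) computes $q^\pi$, $d^\pi$, and $w^\pi$ exactly in a small tabular MDP by solving the Bellman equations and the Bellman flow equation as linear systems.

```python
# Minimal tabular sketch (illustrative, not from the paper): compute q^pi, d^pi, and
# w^pi exactly in a small randomly generated MDP, directly from the definitions above.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

P = rng.dirichlet(np.ones(S), size=(S, A))       # transition dynamics P(s'|s,a)
r = rng.uniform(size=(S, A))                     # expected rewards, in [0, 1]
pi = rng.dirichlet(np.ones(A), size=S)           # target policy pi(a|s)
mu0 = rng.dirichlet(np.ones(S))                  # initial state distribution

# State-action transition operator P^pi: (P^pi)[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s').
P_pi = np.einsum("sax,xb->saxb", P, pi).reshape(S * A, S * A)

# q^pi is the unique solution of q = T^pi q, i.e., (I - gamma * P^pi) q = E[r].
q_pi = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))

# d^pi is the unique solution of the Bellman flow equation
# d = (1 - gamma) * mu0^pi + gamma * (P^pi)^T d.
mu0_pi = (mu0[:, None] * pi).reshape(S * A)      # initial state-action distribution
d_pi = np.linalg.solve(np.eye(S * A) - gamma * P_pi.T, (1 - gamma) * mu0_pi)

# With a fully supported data distribution d^D, the weight function is w^pi = d^pi / d^D.
d_D = rng.dirichlet(np.ones(S * A))              # assumed > 0 everywhere
w_pi = d_pi / d_D

print(q_pi.reshape(S, A))                        # Q-value function of pi
print(w_pi.reshape(S, A))                        # marginalized importance weights
```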
Function Approximation
We will use function classes $\mathcal{Q}$ and $\mathcal{W}$ to approximate $q^\pi$ and $w^\pi$, respectively. We assume finite $\mathcal{Q}$ and $\mathcal{W}$, and the extension to infinite classes under appropriate complexity measures (e.g., covering number) is provided in Appendix G.
Additional Notation    $\|\cdot\|_{2,\nu} := \sqrt{\mathbb{E}_\nu[(\cdot)^2]}$ is the weighted 2-norm of a function under distribution $\nu$. We also use a standard shorthand $f(s, \pi) := \mathbb{E}_{a\sim\pi(\cdot|s)}[f(s, a)]$. Elementwise multiplication between two vectors $u$ and $v$ of the same dimension is $u \circ v$, and elementwise division is $u/v$.
4 Value-function Estimation
In this section we show how to estimate $\hat q \approx q^\pi$ with guarantees on $\|\hat q - q^\pi\|_{2,\nu}$ for a user-specified $\nu$, and identify the assumptions under which provable sample-complexity guarantees can be obtained. We begin with the familiar Bellman equations, to which $q^\pi$ is the unique solution:
$$\mathbb{E}_{r\sim R(\cdot|s,a)}[r] + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}[q(s', \pi)] - q(s, a) = 0, \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A}. \tag{1}$$
While the above set of equations uniquely determines $q = q^\pi$, this is only true if we can enforce all $|\mathcal{S}\times\mathcal{A}|$ constraints, which is intractable in large state-space problems. In fact, even estimating (a candidate $q$'s violation of) a single constraint is infeasible, as that requires sampling from the same state multiple times, which is related to the infamous double-sampling problem [Bai95].
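To make the double-sampling issue concrete, here is the standard bias decomposition (a textbook fact, not specific to this paper) showing why a single draw of $(r, s')$ per $(s,a)$ cannot even estimate one squared constraint violation:

```latex
% For fixed (s,a), let Z := r + \gamma\, q(s',\pi) with r \sim R(\cdot|s,a), s' \sim P(\cdot|s,a).
% The expected squared empirical residual then decomposes as
\mathbb{E}\big[(Z - q(s,a))^2 \,\big|\, s,a\big]
  \;=\; \big((\mathcal{T}^\pi q)(s,a) - q(s,a)\big)^2 \;+\; \mathrm{Var}\big[Z \,\big|\, s,a\big],
% so the quantity of interest (the first term) is confounded with reward/transition noise
% (the second term); separating the two would require two independent draws of (r, s')
% from the same (s,a), which a single offline tuple does not provide.
```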
To overcome this challenge, prior MIS works often relax Eq. (1) by taking a weighted combination of these equations, e.g.,
$$\mathbb{E}_{d^D}\big[w(s,a)\,\big(r(s,a) + \gamma q(s',\pi) - q(s,a)\big)\big] = 0, \quad \forall w \in \mathcal{W}. \tag{2}$$
Instead of enforcing $|\mathcal{S}\times\mathcal{A}|$ equations, we only enforce their linear combinations; the linear coefficients are $d^D(s,a)\cdot w(s,a)$, and $w$ belongs to a class $\mathcal{W}$ with limited statistical capacity to enable sample-efficient estimation. While each constraint in Eq. (2) can now be efficiently checked on data, this comes at a significant cost: a solution to Eq. (2) is not necessarily $q^\pi$. Prior works handle this dilemma by aiming lower: instead of learning $\hat q \approx q^\pi$, they only learn a $\hat q$ that can approximate the policy's return, i.e., $\mathbb{E}_{s\sim\mu_0}[\hat q(s,\pi)] \approx J(\pi) = \mathbb{E}_{s\sim\mu_0}[q^\pi(s,\pi)]$. While [UHJ20] show that the latter is possible when $w^\pi \in \mathcal{W}$, they also show explicit counterexamples where $\hat q \ne q^\pi$ even with infinite data. As a result, how to estimate $\hat q \approx q^\pi$ under comparable realizability assumptions (instead of the prohibitive $\mathcal{W} = \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ as in Proposition 1) is still an open problem.
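For reference, the empirical analogue of the left-hand side of Eq. (2) for a single pair $(q, w)$ is straightforward to compute from the offline tuples; a hedged sketch is below (our own helper, with $q$, $w$, and $\pi$ represented as tabular arrays). The same quantity reappears as the second term of the losses in Eq. (4) and Eq. (5).

```python
# Illustrative helper (not from the paper): empirical version of the left-hand side of
# Eq. (2) for one (q, w) pair, from offline tuples (s_i, a_i, r_i, s'_i).
import numpy as np

def empirical_bellman_constraint(q, w, pi, s, a, r, s_next, gamma):
    """Returns (1/n) * sum_i w(s_i,a_i) * (r_i + gamma * q(s'_i, pi) - q(s_i, a_i)).
    q, w: (S, A) arrays; pi: (S, A) array of pi(a|s); s, a, r, s_next: length-n arrays."""
    q_next_pi = (q[s_next] * pi[s_next]).sum(axis=1)  # q(s'_i, pi) = E_{a'~pi(.|s'_i)}[q(s'_i, a')]
    td = r + gamma * q_next_pi - q[s, a]              # per-sample Bellman residual
    return np.mean(w[s, a] * td)
```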
4.1 Estimator
We now describe our approach to solving this problem. Recall that the goal is to obtain error bounds for $\|\hat q - q^\pi\|_{2,\nu}$ for some distribution $\nu \in \Delta(\mathcal{S}\times\mathcal{A})$ specified by the user. Note that we do not require information about the $r$ and $s'$ that are generated after $(s,a) \sim \nu$ and only care about the $(s,a)$ marginal itself, so the user can pick $\nu$ in an arbitrary manner without knowing the transition and the reward functions of the MDP. We assume that $\nu$ is given in a way that we can take its expectation $\mathbb{E}_{(s,a)\sim\nu}[(\cdot)]$, and the extension to the case where $\nu$ is given via samples is straightforward.
To achieve this goal, we first turn Eq. (1) into an equivalent constrained convex program: given a collection of strongly convex and differentiable functions $\{f_{s,a}: \mathbb{R}\to\mathbb{R}\}_{s,a}$ (we will refer to the collection as $f$ and discuss its choice later), consider
$$\min_{q}\ \ \mathbb{E}_{(s,a)\sim\nu}[f_{s,a}(q(s,a))] \tag{3}$$
$$\text{s.t.}\ \ \mathbb{E}_{r\sim R(\cdot|s,a)}[r] + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}[q(s',\pi)] - q(s,a) = 0, \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A}.$$
The constraints here are the same as Eq. (1). Since Eq. (1) uniquely determines $q = q^\pi$, the feasible space of Eq. (3) is a singleton, so we can impose any objective function on top of these constraints (here we use $\mathbb{E}_{(s,a)\sim\nu}[f_{s,a}(q(s,a))]$) and it will not change the optimal solution (which is always $q^\pi$, the only feasible point). As we will see, however, $f = \{f_{s,a}: \mathbb{R}\to\mathbb{R}\}_{s,a}$ will serve as an important regularizer in the function-approximation setting and is crucial to our estimation guarantees.
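The claim that the feasible set is a singleton, and hence that the objective does not affect the solution, can be checked numerically in the tabular case. The sketch below (illustrative only; it uses cvxpy as a generic convex-programming tool and a randomly generated MDP of our own) solves Eq. (3) for several arbitrary choices of $\nu$ and recovers the same $q^\pi$ each time.

```python
# Numerical sanity check (illustrative, not from the paper): with the Bellman
# constraints of Eq. (3) imposed exactly, the solution is q^pi for any nu.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))               # P(s'|s,a)
r = rng.uniform(size=(S, A)).reshape(S * A)              # expected rewards
pi = rng.dirichlet(np.ones(A), size=S)                   # pi(a|s)
P_pi = np.einsum("sax,xb->saxb", P, pi).reshape(S * A, S * A)

q_pi = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r)  # ground-truth q^pi

for _ in range(3):                                        # three arbitrary error-measuring nu's
    nu = rng.dirichlet(np.ones(S * A))
    q = cp.Variable(S * A)
    objective = cp.Minimize(cp.sum(cp.multiply(nu, cp.square(q))))  # E_nu[f_{s,a}(q)], f(x) = x^2
    constraints = [r + gamma * (P_pi @ q) - q == 0]                  # Eq. (1), all (s,a)
    cp.Problem(objective, constraints).solve()
    assert np.allclose(q.value, q_pi, atol=1e-4)          # the choice of nu does not matter
```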
Remark 1 ($(s,a)$-dependence of $f$). Regularizers in prior works are $(s,a)$-independent [NCDL19; YNDLS20; ZHHJL22]. As we will see in Section 4.3, allowing for $(s,a)$-dependence is very important for designing regularizers with improved guarantees and performance.
We now rewrite (3) in its Lagrangian form, with $d^D \circ w$ serving the role of the dual variables:
$$\min_q \max_w\ L^q_f(q, w) := \mathbb{E}_\nu[f_{s,a}(q(s,a))] + \mathbb{E}_{d^D}\big[w(s,a)\,\big(r(s,a) + \gamma q(s',\pi) - q(s,a)\big)\big]. \tag{4}$$
Finally, our actual estimator approximates Eq. (4) via finite-sample approximation of the population loss $L^q_f$, and searches over restricted function classes $\mathcal{Q}$ and $\mathcal{W}$ for $q$ and $w$, respectively:
$$\hat q = \arg\min_{q\in\mathcal{Q}} \max_{w\in\mathcal{W}}\ \hat L^q_f(q, w), \tag{5}$$
where $\hat L^q_f(q, w) = \mathbb{E}_\nu[f_{s,a}(q(s,a))] + \frac{1}{n}\sum_{i=1}^n w(s_i, a_i)\,\big(r_i + \gamma q(s_i', \pi) - q(s_i, a_i)\big)$.
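To make Eq. (5) concrete, here is a minimal sketch (our own illustration, not the authors' implementation) that takes the finite-class setting of Section 3 literally: $\mathcal{Q}$ and $\mathcal{W}$ are small finite sets of tabular functions, $f_{s,a}(x) = \frac{1}{2}x^2$, and the user-specified $\nu$ is given as a table of probabilities, so the minimax problem can be solved by enumeration.

```python
# Sketch of the estimator in Eq. (5) with finite classes Q and W (illustrative only).
import numpy as np

def empirical_loss(q, w, nu, s, a, r, s_next, pi, gamma):
    """hat L^q_f(q, w) with f_{s,a}(x) = 0.5 x^2:
       E_nu[0.5 q(s,a)^2] + (1/n) sum_i w(s_i,a_i) (r_i + gamma q(s'_i,pi) - q(s_i,a_i))."""
    reg = 0.5 * np.sum(nu * q ** 2)                    # regularizer under the user-specified nu
    q_next_pi = (q[s_next] * pi[s_next]).sum(axis=1)   # q(s'_i, pi)
    td = r + gamma * q_next_pi - q[s, a]               # per-sample Bellman residual
    return reg + np.mean(w[s, a] * td)

def estimate_q(Q_class, W_class, nu, s, a, r, s_next, pi, gamma):
    """hat q = argmin_{q in Q} max_{w in W} hat L^q_f(q, w), by enumeration."""
    best_q, best_val = None, np.inf
    for q in Q_class:                                  # outer minimization over Q
        worst = max(empirical_loss(q, w, nu, s, a, r, s_next, pi, gamma) for w in W_class)
        if worst < best_val:
            best_q, best_val = q, worst
    return best_q
```

With richer, parameterized classes one would instead run stochastic gradient descent-ascent on $\hat L^q_f$; the enumeration above is only meant to mirror the finite-class assumption and the structure of Eq. (5).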
Intuition for identification
Before giving the detailed finite-sample analysis, we provide some high-level intuitions for why we can obtain the desired guarantee on $\|\hat q - q^\pi\|_{2,\nu}$. Note that Eq. (4) is structurally similar to Eq. (2), and we still cannot verify the Bellman equation for $q^\pi$ in a per-state-action manner, so the caveat of Eq. (2) seems to remain; why can we identify $q^\pi$ under $\nu$?
The key here is to show that it suffices to check the loss function $L^q_f$ only under a special choice of $w$ (as opposed to all of $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$). Importantly, this special $w$ is not $w = w^\pi$;⁶ rather, it is the saddle point of our regularized objective $L^q_f$: let $(q^\pi, w^*_f)$ be a saddle point of $L^q_f$ (we will give the closed form of $w^*_f$ later). As long as $w^*_f \in \mathcal{W}$ (even if $\mathcal{W}$ is extremely "simple" and contains nothing but $w^*_f$), we can identify $q^\pi$.
To see that, it is instructive to consider the special case of $\mathcal{W} = \{w^*_f\}$ and the limit of infinite data. In this case, our estimator becomes $\arg\min_{q\in\mathcal{Q}} L^q_f(q, w^*_f)$. By the definition of a saddle point:
$$L^q_f(q^\pi, w^*_f) \le L^q_f(q, w^*_f), \quad \forall q.$$
While this shows that $q^\pi$ is a minimizer of the loss, it does not imply that it is a unique minimizer. However, identification immediately follows from the convexity brought by regularization: since each $f_{s,a}: \mathbb{R}\to\mathbb{R}$ is strongly convex, $q \mapsto \mathbb{E}_\nu[f_{s,a}(q(s,a))]$ as a mapping from $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ to $\mathbb{R}$ is strongly convex under $\|\cdot\|_{2,\nu}$ (see Lemma 7 in Appendix B for a formal statement and proof), and $L^q_f(q, w^*_f)$ inherits such convexity since the other terms are affine in $q$. It is then obvious that $q^\pi$ is the unique minimizer of $L^q_f(\cdot, w^*_f)$ up to $\|\cdot\|_{2,\nu}$, that is, any minimizer of $L^q_f(\cdot, w^*_f)$ must agree with $q^\pi$ on $(s,a)$ pairs supported on $\nu$. Our finite-sample analysis below shows that the above reasoning is robust to finite-sample errors and the inclusion of functions other than $w^*_f$ in $\mathcal{W}$.
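One way to make this uniqueness argument quantitative (a standard first-order bound, sketched here under the strong-convexity condition formalized as Assumption 1 below) is:

```latex
% q^pi minimizes q \mapsto L^q_f(q, w^*_f), and this map is M_q-strongly convex with
% respect to \|\cdot\|_{2,\nu} (the TD term is affine in q). Hence, for any q,
L^q_f(q, w^*_f) \;-\; L^q_f(q^\pi, w^*_f) \;\ge\; \frac{M_q}{2}\,\big\| q - q^\pi \big\|_{2,\nu}^2,
% so any (approximate) minimizer of L^q_f(\cdot, w^*_f) must be close to q^\pi precisely
% in the user-specified norm \|\cdot\|_{2,\nu} that we set out to control.
```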
4.2 Finite-sample Guarantees
In this subsection we state the formal guarantee of our estimator for $q^\pi$ and the assumptions under which the guarantee holds. We start with the condition on the regularization function $f$:

Assumption 1 (Strong convexity of $f$). Assume $f_{s,a}: \mathbb{R}\to\mathbb{R}$ is nonnegative, differentiable, and $M_q$-strongly convex for each $s\in\mathcal{S}, a\in\mathcal{A}$. In addition, assume both $f_{s,a}$ and its derivative $f'_{s,a}$ take finite values for any finite input.
This assumption can be concretely satisfied by a simple choice of $f_{s,a}(x) = \frac{1}{2}x^2$, which is independent of $(s,a)$ and yields $M_q = 1$. Alternative choices of $f$ will be discussed in Section 4.3; a brief illustrative sketch of $(s,a)$-independent versus $(s,a)$-dependent choices is given below.
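The sketch covers the two kinds of choices (illustrative only; the $(s,a)$-dependent variant centered at a side-information guess $\tilde q$ is a hypothetical example of Remark 1, not necessarily the construction used in Section 4.3):

```python
# Two example regularizers f_{s,a} for the estimator sketch above (illustrative only).
import numpy as np

def reg_quadratic(q, nu):
    """(s,a)-independent default f_{s,a}(x) = 0.5 x^2, i.e., E_nu[0.5 q(s,a)^2]; M_q = 1."""
    return 0.5 * np.sum(nu * q ** 2)

def reg_centered(q, nu, q_tilde):
    """Hypothetical (s,a)-dependent choice f_{s,a}(x) = 0.5 (x - q_tilde(s,a))^2, which
    pulls the solution toward a side-information guess q_tilde; still 1-strongly convex,
    so Assumption 1 is satisfied."""
    return 0.5 * np.sum(nu * (q - q_tilde) ** 2)
```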
Next are the realizability and boundedness of $\mathcal{W}$ and $\mathcal{Q}$:

Assumption 2 (Realizability). Suppose $w^*_f \in \mathcal{W}$ and $q^\pi \in \mathcal{Q}$.

Assumption 3 (Boundedness of $\mathcal{W}$ and $\mathcal{Q}$). Suppose $\mathcal{W}$ and $\mathcal{Q}$ are bounded, that is,
$$C^q_{\mathcal{Q}} := \max_{q\in\mathcal{Q}} \|q\|_\infty < \infty, \qquad C^q_{\mathcal{W}} := \max_{w\in\mathcal{W}} \|w\|_\infty < \infty.$$
As a remark, Assumption 2 implicitly assumes the existence of $w^*_f$. As we will see in Section 4.3, the existence and finiteness of $w^*_f$ is automatically guaranteed given the finiteness of $f'_{s,a}$ (Assumption 1) and $d^D(s,a) > 0\ \forall s,a$. More importantly, Assumptions 2 and 3 together imply that $\|q^\pi\|_\infty \le C^q_{\mathcal{Q}}$ and $\|w^*_f\|_\infty \le C^q_{\mathcal{W}}$, which puts constraints on how small $C^q_{\mathcal{Q}}$ and $C^q_{\mathcal{W}}$ can be. For example, it is common to assume that $C^q_{\mathcal{Q}} = \frac{1}{1-\gamma}$, i.e., the maximum possible return when rewards are bounded in $[0,1]$.

⁶ In fact, $w^\pi$ should not appear in our analysis at all: $w^\pi$ is defined w.r.t. the initial distribution of the MDP, $\mu_0$, which has nothing to do with our goal of bounding $\|\hat q - q^\pi\|_{2,\nu}$.