Local Metric Learning for Off-Policy Evaluation in
Contextual Bandits with Continuous Actions
Haanvid Lee1, Jongmin Lee2, Yunseon Choi1, Wonseok Jeon†,
Byung-Jun Lee3,4, Yung-Kyun Noh5,6, Kee-Eung Kim1
1KAIST, 2UC Berkeley, 3Korea Univ., 4Gauss Labs Inc., 5Hanyang Univ., 6KIAS
haanvid@kaist.ac.kr, jongmin.lee@berkeley.edu, cys9506@kaist.ac.kr
byungjunlee@korea.ac.kr, nohyung@hanyang.ac.kr, kekim@kaist.ac.kr
Abstract
We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces. Our work is motivated by practical scenarios where the target policy needs to be deterministic due to domain requirements, such as prescription of treatment dosage and duration in medicine. Although importance sampling (IS) provides a basic principle for OPE, it is ill-posed for a deterministic target policy with continuous actions. Our main idea is to relax the target policy and pose the problem as kernel-based estimation, where we learn the kernel metric in order to minimize the overall mean squared error (MSE). We present an analytic solution for the optimal metric, based on an analysis of bias and variance. Whereas prior work has been limited to scalar action spaces or kernel bandwidth selection, our work takes a step further by handling vector action spaces and optimizing the metric. We show that our estimator is consistent and significantly reduces the MSE compared to baseline OPE methods in experiments on various domains.
1 Introduction
In order to deploy a contextual bandit policy to a real-world environment, such as personalized pricing [1], treatments [2], recommendation [3], and advertisements [4], the performance of the policy should be evaluated beforehand to decide whether the trained policy is suitable for deployment. This is because the interaction of the policy with the environment could be costly and/or dangerous. For example, using an erroneous medical treatment policy to prescribe drugs for patients could result in dire consequences. There thus emerges a necessity for an algorithm that can evaluate a policy's performance without having it interact with the environment in an online manner. Such algorithms are called "off-policy evaluation" (OPE) algorithms [5]. OPE algorithms for contextual bandits evaluate a target policy by estimating its expected reward from data sampled by a behavior policy, without the target policy interacting with the environment.
Previous works on OPE for contextual bandits have mainly focused on environments with finite actions [4, 6-10]. These works can be largely divided into three approaches [6, 9, 11]. The first approach is the direct method (DM), which learns an environment model for policy evaluation. DM is known to have low variance [9]. However, since the environment model is learned with function approximation, the estimator is biased. The second approach is importance sampling (IS), which corrects the data distribution induced by a behavior policy to that of a target policy [12] and uses the corrected distribution to approximate the expected reward of the target policy. IS estimates are unbiased when the behavior and target policies are given. However, IS estimates can have large variance when there is a large mismatch between the behavior and target policy distributions [9]. The last approach is doubly robust (DR) estimation, which uses DM to reduce the variance of IS while keeping the unbiasedness of an IS estimate [6].

†The research was done while at Mila/McGill University, but the author is currently employed by Qualcomm Technologies Inc.
Although there are existing works on IS and DR that can be applied to continuous action spaces [13-16], most of them cannot be easily extended to evaluate deterministic contextual bandit policies with continuous actions. This is because the IS weights used for both IS and DR estimators are almost surely zero for deterministic target policies in continuous action spaces [2]. However, in practice, such OPE algorithms are needed. For example, treatment prescription policies should not stochastically prescribe drugs to patients.
To meet this need, there are works for evaluating deterministic contextual bandit policies with continuous actions [2, 5, 17]. These works focus on measuring the similarity between behavior and target actions for assigning IS ratios. However, they either assume a single-dimensional action space [17], which cannot be straightforwardly extended to multiple action dimensions, or use Euclidean distances as the similarity measure [2, 5]. In general, similarity measures should be learned locally at a state so that differences between behavior and target actions are weighed differently in each action dimension. For example, in the case of multi-drug prescription, the synergies and side effects of the prescribed drugs are often very complex. Moreover, different kinds of drugs are likely to have different degrees of effect from person to person [18, 19].
To this end, we propose local kernel metric learning for IS (KMIS) OPE estimation of deterministic contextual bandit policies with multidimensional continuous action spaces. Our proposed method learns a Mahalanobis distance metric [20-23] locally at each state that lengthens or shortens the distance between behavior and target actions to reduce the MSE. The metric-applied kernels measure the similarities between actions according to the Mahalanobis distances induced by the metric. In our work, we first analytically show that the leading-order bias [2] of a kernel-based IS estimator becomes a dominant factor in the leading-order MSE [2] as the action dimension increases, given the optimal bandwidth [2] that minimizes the leading-order MSE without a metric. Then we derive the kernel metric that minimizes the upper bound of the leading-order bias, which is bandwidth-agnostic. Our analysis shows that the convergence speed of a kernel-based IS OPE estimator can be improved with the application of the KMIS metrics. In the experiments, we demonstrate that the MSEs of kernel-based IS estimators are significantly reduced when combined with the proposed kernel metric learning. We report empirical results on various synthetic domains as well as a real-world dataset.
2 Related Work
The works on OPE of deterministic contextual bandits with continuous actions can be largely divided into importance sampling (IS) [2, 5] and doubly robust (DR) estimators [17, 24]. Both lines of work eliminate the problem of almost surely obtaining zero IS estimates when given a deterministic target policy and a stochastic behavior policy in a continuous action space, in one of two ways: most of the works relax the deterministic target policy, which can be seen as a Dirac delta function [2], to a kernel [2, 5, 24], while the other work discretizes the action space [17].
Among these, kernel-based IS methods use a kernel to measure the similarities between the target and behavior actions and focus on selecting the bandwidth of the kernel [2, 5, 24]. Su et al. [5] proposed a bandwidth selection algorithm that uses Lepski's principle from nonparametric statistics [25]. Kallus and Zhou [2] derived the leading-order MSE of a kernel-based IS OPE estimator and chose the optimal bandwidth that minimizes the leading-order MSE. One limitation of the existing kernel-based IS methods [2, 5, 24] is that they use Euclidean distances for measuring the similarities between behavior and target actions. The Euclidean distance is inadequate for this purpose, since assigning a high similarity to actions having similar rewards is what induces less bias (the bias is defined in Section 3.2). The other limitation is that these methods use one global bandwidth for the whole action space [17]. Since the second-order derivative of the expected reward w.r.t. the action is related to the leading-order bias derived by Kallus and Zhou [2], the optimal bandwidth that balances the bias-variance trade-off may vary across actions.
As for discretization methods, Cai et al. [17] proposed deep jump learning (DJL), which avoids the limitation of kernel-based methods caused by using one global bandwidth by adaptively discretizing one-dimensional continuous action spaces. The action space is discretized so that each discretized action interval has similar expected rewards given a state. By using the discretized intervals for DR estimation, DJL can estimate the policy value more accurately than kernel-based methods when the second-order derivatives of the expected reward change significantly across the action space. In such cases, kernel-based methods use the same bandwidth for all actions even though the optimal bandwidth varies depending on the action. DJL, on the other hand, can discretize the action space into smaller intervals in regions where the second-order derivative is large, and into larger intervals where it is small. However, their work focuses on domains with a single action dimension and cannot be easily extended to environments with multidimensional action spaces.
To tackle the limitation of kernel-based methods caused by using Euclidean distances, kernel metric learning can be applied to shrink or extend distances in the directions that reduce the MSE. Metric learning has been used for nearest-neighbor classification [22, 26-29] and kernel regression [22, 23]. Among these, our work is inspired by the work of Noh et al. [23], which learns a metric for reducing the bias and MSE of kernel regression. In their work, they assume a generative model of the data in which the input and output are jointly Gaussian. Under this assumption, they derive the Mahalanobis metric that reduces the bias and MSE of Nadaraya-Watson kernel regression. We learn our metric in a similar fashion in the context of OPE for contextual bandits, except that we do not make the generative assumption on the data, as the assumption is unnatural in our problem setting.
3 Preliminaries
3.1 Problem Setting
In this work, we focus on OPE of a deterministic target policy in an environment with a multidimensional continuous action space $\mathcal{A} \subset \mathbb{R}^{D_A}$, state space $\mathcal{S} \subset \mathbb{R}^{D_S}$, reward $r \in \mathbb{R}$ sampled from the conditional reward distribution $p(r|s,a)$, and state distribution $p(s)$. The deterministic target policy $\pi$ can be regarded as having a Dirac delta distribution $\pi(a|s) = \delta(\pi(s) - a)$, whose probability density function (PDF) value is zero everywhere except at the target action $\pi(s)$ selected given a state.

The offline dataset $D = \{(s_i, a_i, r_i)\}_{i=1}^{N}$ used for the evaluation of the deterministic target policy $\pi$ is sampled from the environment using a known stochastic behavior policy $\pi_b : \mathcal{S} \rightarrow \Delta(\mathcal{A})$. We assume that the support of the behavior policy $\pi_b$ contains the actions selected by the target policy $\pi(s)$. The goal of OPE is to evaluate the target policy value $\rho^\pi = \mathbb{E}_{s \sim p(s),\, a \sim \pi(a|s),\, r \sim p(r|s,a)}[r]$ using $D$ and without $\pi$ interacting with the environment.
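To ground the notation, the following is a minimal sketch of an offline dataset $D$ of this form; the linear target policy, Gaussian behavior policy, and toy reward model are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_A, N = 3, 2, 5000           # state dim, action dim, sample size

def target_policy(s):              # deterministic target pi(s): a linear map (toy choice)
    W = np.array([[0.5, -0.2, 0.1],
                  [0.3,  0.4, -0.1]])
    return W @ s

def behavior_pdf(a, s, sigma=0.5): # known Gaussian behavior policy pi_b(a|s)
    mu = 0.8 * target_policy(s)    # behavior is centered near (but not at) the target action
    diff = a - mu
    return np.exp(-0.5 * diff @ diff / sigma**2) / ((2 * np.pi * sigma**2) ** (D_A / 2))

def sample_behavior(s, sigma=0.5):
    return 0.8 * target_policy(s) + sigma * rng.standard_normal(D_A)

def reward(s, a):                  # toy reward: peaked at the target action, plus noise
    return float(np.exp(-np.sum((a - target_policy(s))**2)) + 0.1 * rng.standard_normal())

# Offline dataset D = {(s_i, a_i, r_i)} logged by the behavior policy
states  = rng.standard_normal((N, D_S))
actions = np.stack([sample_behavior(s) for s in states])
rewards = np.array([reward(s, a) for s, a in zip(states, actions)])
```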
3.2 Bandwidth Selection for the Isotropic Kernel-Based IS Estimator
One way to evaluate the target policy value using the offline data $D$ sampled with $\pi_b$ is to perform IS estimation. The IS ratios correct the action sampling distribution for the expectation from $\pi_b$ to $\pi$. However, since the density of a deterministic target policy $\pi$ is a Dirac delta function, its PDF value at a behavior action sampled from $\pi_b$ is almost surely zero. Existing works deal with this problem by relaxing the target policy $\pi$ to an isotropic kernel $K$ with a bandwidth $h$ and computing the IS estimate of the policy value $\hat\rho_K$ as in Eq. (1) [2, 5, 24]:
$$
\rho^\pi = \mathbb{E}_{s \sim p(s),\, a \sim \pi_b(a|s),\, r \sim p(r|s,a)}\!\left[\frac{\pi(a|s)}{\pi_b(a|s)}\, r\right] \;\approx\; \frac{1}{N h^{D_A}} \sum_{i=1}^{N} K\!\left(\frac{a_i - \pi(s_i)}{h}\right) \frac{r_i}{\pi_b(a_i|s_i)} \;=:\; \hat\rho_K. \tag{1}
$$
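A minimal sketch of the kernelized IS estimate in Eq. (1), continuing the toy setup above with an isotropic Gaussian kernel; the bandwidth value here is arbitrary, and Section 3.2 discusses how it should be chosen.

```python
def gaussian_kernel(u):
    # Isotropic Gaussian kernel on R^{D_A}: integrates to 1, symmetric, int u u^T K(u) du = I
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / ((2 * np.pi) ** (u.shape[-1] / 2))

def kernel_is_estimate(states, actions, rewards, h):
    # Eq. (1): (1 / (N h^{D_A})) * sum_i K((a_i - pi(s_i)) / h) * r_i / pi_b(a_i | s_i)
    N, D_A = actions.shape
    u = np.stack([(a - target_policy(s)) / h for s, a in zip(states, actions)])
    w = gaussian_kernel(u) / np.array([behavior_pdf(a, s) for s, a in zip(states, actions)])
    return np.sum(w * rewards) / (N * h ** D_A)

print(kernel_is_estimate(states, actions, rewards, h=0.3))
```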
As the Dirac delta function can be regarded as a kernel whose bandwidth $h$ approaches zero, the relaxation of the Dirac delta function to a kernel can be seen as increasing its $h$. With the relaxation, the bias of the kernel-based IS estimate, $\operatorname{Bias}(\hat\rho_K) := \mathbb{E}_{s \sim p(s),\, a \sim \pi_b(a|s),\, r \sim p(r|s,a)}[\hat\rho_K] - \rho^\pi$, increases while its variance is reduced. As the bias and the variance compose the MSE of the estimate, the bandwidth that best balances between them and reduces the MSE should be selected. Kallus and Zhou [2] derived the leading-order MSE (LOMSE) in terms of the bandwidth $h$, sample size $N$, and action dimension $D_A$ for selecting a bandwidth (Eq. (2)), assuming that $h \to 0$ and $\frac{1}{N h^{D_A}} \to 0$ as $N \to \infty$ (derivation in Appendix A.1). They also derived the optimal bandwidth $h^*$ that minimizes the LOMSE (derivation in Appendix A.2).
[Figure 1: reward contours over a two-dimensional action space given a state $s$, comparing (a) an isotropic Gaussian kernel $K(a - a_t) = \frac{1}{2\pi}\exp\!\big(\!-\frac{(a - a_t)^\top (a - a_t)}{2}\big)$ with (b) a metric-applied kernel $K(z - z_t) = \frac{1}{2\pi}\exp\!\big(\!-\frac{(a - a_t)^\top A(s)(a - a_t)}{2}\big)$.]

Figure 1: Illustration of bias reduction in a kernel-based IS estimate by the metric $A(s)$ locally learned at a given state $s$. The contour lines are drawn for the reward over the action space given $s$. Although behavior actions $a'$ and $a''$ are away from the target action $a_t$ ($= \pi(s)$) by an equal Euclidean distance, their corresponding rewards are different. When the (a) isotropic Gaussian kernel is used, bias can come from $a'$ since the similarity measures of $a'$ and $a''$ from $a_t$ are the same even though their rewards are different. However, when the (b) metric is applied, the bias is reduced, as the similarity measure is higher on $a''$, which has a reward similar to that of $a_t$, than on $a'$.
$$
\operatorname{LOMSE}(h, N, D_A) = \underbrace{h^4 C_b}_{(\text{leading-order bias})^2} + \underbrace{\frac{C_v}{N h^{D_A}}}_{\text{leading-order variance}}, \tag{2}
$$
$$
h^* = \arg\min_h \operatorname{LOMSE}(h, N, D_A) = \left(\frac{D_A C_v}{4 N C_b}\right)^{\frac{1}{D_A + 4}}, \tag{3}
$$
$$
C_b := \frac{1}{4}\,\mathbb{E}_{s \sim p(s)}\!\left[\nabla_a^2\, \bar r(s,a)\big|_{a=\pi(s)}\right]^2, \qquad
C_v := R(K)\, \mathbb{E}_{s \sim p(s)}\!\left[\frac{\mathbb{E}[r^2 \mid s, a=\pi(s)]}{\pi_b(a=\pi(s)|s)}\right],
$$
where the first term in Eq. (2) is the squared leading-order bias and the second term is the leading-order variance, $C_b$ and $C_v$ are constants related to the leading-order bias and variance, respectively, the expected reward is $\bar r(s,a) := \mathbb{E}[r|s,a]$, $\nabla_a^2$ denotes the Laplacian operator w.r.t. the action $a$, and the roughness of the kernel is $R(K) := \int K(u)^2\, du$. The kernel used for the derivation satisfies $\int K(u)\, du = 1$ and $K(u) = K(-u)$ for all $u$. For simplicity, we assumed a Gaussian kernel and used the property $\int u u^\top K(u)\, du = I$.
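As an illustration of Eqs. (2) and (3), the following sketch computes $h^*$ and the resulting LOMSE from given constants $C_b$ and $C_v$; the constants used here are placeholders, and in practice they would have to be estimated, e.g., from a fitted reward model and the known behavior density.

```python
def optimal_bandwidth(C_b, C_v, N, D_A):
    # Eq. (3): h* = (D_A * C_v / (4 * N * C_b)) ** (1 / (D_A + 4))
    return (D_A * C_v / (4.0 * N * C_b)) ** (1.0 / (D_A + 4))

def lomse(h, C_b, C_v, N, D_A):
    # Eq. (2): squared leading-order bias + leading-order variance
    return h**4 * C_b + C_v / (N * h**D_A)

C_b, C_v, N, D_A = 2.0, 1.5, 5000, 2   # placeholder constants for illustration
h_star = optimal_bandwidth(C_b, C_v, N, D_A)
print(h_star, lomse(h_star, C_b, C_v, N, D_A))
```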
In Section 4, we use the leading-order bias and variance [2] in the derivation of our proposed metric. We also use the optimal bandwidth [2] for analyzing the properties of the kernel-based IS estimator with and without a metric.
3.3 Mahalanobis Distance Metric
Relaxing the Dirac delta target policy $\pi$ to a kernel $K$ prevents the kernel-based IS estimator from estimating zero almost surely. When an isotropic kernel is used for the relaxation, the differences between the target and behavior actions in all directions are treated equally in measuring the similarity between the actions with the kernel in Eq. (1). However, to produce a more accurate OPE estimate, the difference between the two actions in some directions should be discounted relative to the others when measuring their similarity. For this, the Mahalanobis distance can be used. We define the Mahalanobis distance between two $D_A$-dimensional vectors $a_i \in \mathbb{R}^{D_A}$ and $a_j \in \mathbb{R}^{D_A}$ with the metric $A \in \mathbb{R}^{D_A \times D_A}$ as in Eq. (4). Applying the Mahalanobis distance metric $A$ ($= L L^\top$) to a kernel function can be seen as linearly transforming the kernel inputs with the transformation matrix $L$ ($L^\top a = z$):
$$
\|a_i - a_j\|_A := \sqrt{(a_i - a_j)^\top A\, (a_i - a_j)} = \|z_i - z_j\|, \qquad (A \succ 0,\; A^\top = A,\; |A| = 1). \tag{4}
$$
Figure 1 illustrates a case where the Mahalanobis metric is locally learned at a given state for reducing
the bias of a kernel-based IS estimate. The bias of the estimate is reduced by altering the shape of the
kernel to produce a higher similarity measure on a behavior action that has a similar reward to the
target action.
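A small sketch of how a metric-applied Gaussian kernel could be evaluated once a metric $A$ is available; the helper names are illustrative.

```python
import numpy as np

def mahalanobis_sq(a_i, a_j, A):
    # Squared Mahalanobis distance ||a_i - a_j||_A^2 = (a_i - a_j)^T A (a_i - a_j)
    d = a_i - a_j
    return d @ A @ d

def metric_gaussian_kernel(a, a_t, A):
    # Gaussian kernel with the Mahalanobis distance; since |A| = 1, the normalizing constant
    # is the same as for the isotropic kernel (equivalently, K applied to z = L^T a with A = L L^T).
    D_A = len(a)
    return np.exp(-0.5 * mahalanobis_sq(a, a_t, A)) / ((2 * np.pi) ** (D_A / 2))
```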
4 Local Metric Learning for Kernel-Based IS
4.1 Local Metric Learning via LOMSE Reduction
To reduce the LOMSE (Eq. (2)) of the kernel-based IS estimate by learning a metric, we first analyze the LOMSE when the optimal bandwidth (Eq. (3)) is applied to the kernel. Figure 1 illustrates how the bias of an IS estimate is induced by the kernel relaxation. The bias can worsen in high-dimensional action spaces. To show this analytically, we adapt Proposition 4 of Noh et al. [23] and show that the $C_b$ contained in the squared leading-order bias (Eq. (2)) becomes a dominant term in the LOMSE of kernel-based IS OPE, given the optimal bandwidth, as the action dimension increases.
Proposition 1. (Adapted from Noh et al. [23]) For a high-dimensional action space ($D_A \gg 4$) and given the optimal bandwidth $h^*$ (Eq. (3)), the squared leading-order bias dominates the leading-order variance in the LOMSE. Furthermore, $\operatorname{LOMSE}(h^*, N, D_A)$ can be approximated by $C_b$ in Eq. (2):
$$
\operatorname{LOMSE}(h^*, N, D_A) = N^{-\frac{4}{D_A+4}} \left[\left(\frac{D_A}{4}\right)^{\frac{4}{D_A+4}} + \left(\frac{4}{D_A}\right)^{\frac{D_A}{D_A+4}}\right] C_b^{\frac{D_A}{D_A+4}}\, C_v^{\frac{4}{D_A+4}} \;\approx\; C_b. \tag{5}
$$
The proof follows by plugging Eq. (3) into Eq. (2) and taking the limit $D_A \to \infty$ (see Appendix B.1). Proposition 1 implies that in an environment with a high-dimensional action space, the MSE can be significantly reduced by reducing the $C_b$ contained in the squared leading-order bias (Eq. (2)). Therefore, we aim to reduce $C_b$, or, in a bandwidth-agnostic manner, the leading-order bias, with a Mahalanobis distance metric that shortens the distance between the target and behavior actions in the directions where their corresponding rewards are similar.
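As a quick numerical check of Proposition 1 (a sketch with arbitrary constants), comparing the two terms of Eq. (2) at the optimal bandwidth shows the squared-bias term growing relative to the variance term as $D_A$ increases; their ratio is exactly $D_A/4$.

```python
C_b, C_v, N = 1.0, 1.0, 10_000                                 # arbitrary constants for illustration
for D_A in [1, 2, 4, 8, 16, 32]:
    h_star = (D_A * C_v / (4 * N * C_b)) ** (1 / (D_A + 4))    # Eq. (3)
    sq_bias  = h_star**4 * C_b                                 # squared leading-order bias at h*
    variance = C_v / (N * h_star**D_A)                         # leading-order variance at h*
    print(D_A, sq_bias / variance)                             # ratio equals D_A / 4
```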
As mentioned in Section 3.3, applying a state-dependent metric ($A(s) = L(s) L(s)^\top$) to a kernel is equivalent to linearly transforming the input vectors of the kernel. Therefore, $C_b$ with a metric $A : \mathcal{S} \to \mathbb{R}^{D_A \times D_A}$ (i.e., $C_{b,A}$) should be analyzed with the linearly transformed actions $L(s)^\top a$ (derivation in Appendix A.3), and we aim to find an optimal $A$ that minimizes it:
$$
\min_{\substack{A :\; A(s) \succ 0,\; A(s) = A(s)^\top,\; |A(s)| = 1\; \forall s}} \; C_{b,A} = \frac{1}{4}\, \mathbb{E}_{s \sim p(s)}\!\left[\operatorname{tr}\!\left(A(s)^{-1} H_a \bar r(s,a)\big|_{a=\pi(s)}\right)\right]^2, \tag{6}
$$
where $H_a$ denotes the Hessian operator w.r.t. the action $a$. In the derivation, we assume a Gaussian kernel and use the property $\int u u^\top K(u)\, du = I$. Still, optimizing the function $A$ in Eq. (6) itself is challenging since it requires considering the overall effect of each metric matrix $A(s)$ on the objective function. Therefore, we instead consider minimizing the following upper bound, which allows us to compute a closed-form metric matrix for each state in a nonparametric way:
$$
\min_{\substack{A :\; A(s) \succ 0,\; A(s) = A(s)^\top,\; |A(s)| = 1\; \forall s}} \; U_{b,A} = \frac{1}{4}\, \mathbb{E}_{s \sim p(s)}\!\left[\operatorname{tr}\!\left(A(s)^{-1} H_a \bar r(s,a)\big|_{a=\pi(s)}\right)^2\right]. \tag{7}
$$
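To make the two objectives concrete, here is a sketch of Monte Carlo versions of Eqs. (6) and (7), assuming the Hessians $H_a \bar r(s,a)\big|_{a=\pi(s)}$ at sampled states are available; in practice they would come from a fitted reward model, and the helper name is illustrative.

```python
import numpy as np

def c_b_A_and_upper_bound(hessians, metrics):
    # Monte Carlo versions over sampled states s_1, ..., s_m:
    #   C_{b,A} ~ (1/4) * ( mean_j tr(A(s_j)^{-1} H_j) )^2        (Eq. (6))
    #   U_{b,A} ~ (1/4) *   mean_j tr(A(s_j)^{-1} H_j)^2          (Eq. (7), >= C_{b,A} by Jensen)
    traces = np.array([np.trace(np.linalg.solve(A, H))
                       for A, H in zip(metrics, hessians)])
    return 0.25 * traces.mean() ** 2, 0.25 * (traces ** 2).mean()
```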
In the following Theorem 1, we introduce the optimal Mahalanobis metric matrix $A^*(s)$ that minimizes $U_{b,A}$.

Theorem 1. (Adapted from Noh et al. [22]) Assume that $p(r|s,a)$ is twice differentiable w.r.t. the action $a$. Let $\Lambda_+(s)$ and $\Lambda_-(s)$ be diagonal matrices of the positive and negative eigenvalues of the Hessian $H_a \mathbb{E}[r|s,a]\big|_{a=\pi(s)}$, let $U_+(s)$ and $U_-(s)$ be the matrices of eigenvectors corresponding to $\Lambda_+(s)$ and $\Lambda_-(s)$, respectively, and let $d_+(s)$ and $d_-(s)$ be the numbers of positive and negative eigenvalues of the Hessian. Then the metric $A^*(s)$ that minimizes $U_{b,A}$ is
$$
A^*(s) = \alpha(s)\, [U_+(s)\; U_-(s)]\; \underbrace{\begin{bmatrix} d_+(s)\Lambda_+(s) & 0 \\ 0 & -d_-(s)\Lambda_-(s) \end{bmatrix}}_{=:\,M(s)}\; [U_+(s)\; U_-(s)]^\top, \tag{8}
$$
where $\alpha(s) := |M(s)|^{-1/(d_+(s)+d_-(s))}$.
The proof involves solving a Lagrangian to minimize the square of the trace term in Eq. (7) under the constraints on $A(s)$ (see Appendix B.2). In the special case where the Hessian matrix contains both positive and negative eigenvalues for all states, the metric reduces $C_{b,A}$ to zero (see Appendix B.2). $A^*(s)$ is computed locally at a state $s$ and the corresponding target action $\pi(s)$, as the optimal metric is derived from the Hessian at that point, $H_a \bar r(s,a)\big|_{a=\pi(s)}$. When the optimal metric is applied to the kernel, it measures the similarity between a behavior and target action using the Mahalanobis distance instead of the Euclidean distance, as shown in Figure 1 and in Eq. (9):
$$
\|a - \pi(s)\|_{A^*(s)}^2 = \alpha(s) \sum_{i=1}^{d_+(s)} \left\{\sqrt{d_+(s)\,\lambda_{+,i}(s)}\; u_{+,i}(s)^\top (a - \pi(s))\right\}^2 + \alpha(s) \sum_{i=1}^{d_-(s)} \left\{\sqrt{-d_-(s)\,\lambda_{-,i}(s)}\; u_{-,i}(s)^\top (a - \pi(s))\right\}^2, \tag{9}
$$
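Below is a sketch of how the metric in Eq. (8) could be computed from the Hessian of an estimated expected reward at $(s, \pi(s))$, together with a check that the resulting quadratic form matches the eigen-expansion in Eq. (9). The Hessian here is a placeholder; in practice it would be obtained from a learned reward model, and the function name is illustrative.

```python
import numpy as np

def kmis_metric(H, eps=1e-8):
    # Eq. (8): A*(s) from the eigendecomposition of the reward Hessian at (s, pi(s)).
    lam, V = np.linalg.eigh(H)                 # eigenvalues (ascending) and eigenvectors
    pos, neg = lam > eps, lam < -eps
    d_plus, d_minus = int(pos.sum()), int(neg.sum())
    # M(s) = diag(d_+ Lambda_+, -d_- Lambda_-): all entries positive, so A*(s) is positive definite
    m_diag = np.concatenate([d_plus * lam[pos], -d_minus * lam[neg]])
    U = np.hstack([V[:, pos], V[:, neg]])      # [U_+(s)  U_-(s)]
    alpha = np.prod(m_diag) ** (-1.0 / len(m_diag))   # enforces |A*(s)| = 1 (full-rank Hessian)
    return alpha * (U * m_diag) @ U.T, alpha, lam, V, pos, neg

# Placeholder Hessian with both positive and negative curvature
# (in practice, estimated from a learned reward model at a = pi(s))
H = np.array([[2.0, 0.3],
              [0.3, -1.0]])
A_star, alpha, lam, V, pos, neg = kmis_metric(H)
print(np.linalg.det(A_star))                   # ~1.0 by construction

# Eq. (9) check: the quadratic form v^T A*(s) v equals the eigen-expansion
v = np.array([0.4, -0.2])                      # v = a - pi(s)
lhs = v @ A_star @ v
rhs = alpha * (
    sum((np.sqrt(pos.sum() * l) * (u @ v)) ** 2 for l, u in zip(lam[pos], V[:, pos].T)) +
    sum((np.sqrt(-neg.sum() * l) * (u @ v)) ** 2 for l, u in zip(lam[neg], V[:, neg].T))
)
print(np.isclose(lhs, rhs))                    # True
```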