Local Metric Learning for Off-Policy Evaluation in
Contextual Bandits with Continuous Actions
Haanvid Lee1, Jongmin Lee2, Yunseon Choi1, Wonseok Jeon†,
Byung-Jun Lee3,4, Yung-Kyun Noh5,6, Kee-Eung Kim1
1KAIST, 2UC Berkeley, 3Korea Univ., 4Gauss Labs Inc., 5Hanyang Univ., 6KIAS
haanvid@kaist.ac.kr, jongmin.lee@berkeley.edu, cys9506@kaist.ac.kr
byungjunlee@korea.ac.kr, nohyung@hanyang.ac.kr, kekim@kaist.ac.kr
Abstract
We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces. Our work is motivated by practical scenarios where the target policy needs to be deterministic due to domain requirements, such as prescription of treatment dosage and duration in medicine. Although importance sampling (IS) provides a basic principle for OPE, it is ill-posed for a deterministic target policy with continuous actions. Our main idea is to relax the target policy and pose the problem as kernel-based estimation, where we learn the kernel metric in order to minimize the overall mean squared error (MSE). We present an analytic solution for the optimal metric, based on an analysis of bias and variance. Whereas prior work has been limited to scalar action spaces or kernel bandwidth selection, our work takes a step further by handling vector action spaces and optimizing the metric. We show that our estimator is consistent and significantly reduces the MSE compared to baseline OPE methods in experiments on various domains.
1 Introduction
In order to deploy a contextual bandit policy to a real-world environment, such as personalized pricing [1], treatments [2], recommendation [3], and advertisements [4], the performance of the policy should be evaluated beforehand to decide whether the trained policy is suitable for deployment. This is because the interaction of the policy with the environment could be costly and/or dangerous. For example, using an erroneous medical treatment policy to prescribe drugs for patients could result in dire consequences. There thus emerges a necessity for an algorithm that can evaluate a policy's performance without having it interact with the environment in an online manner. Such algorithms are called "off-policy evaluation" (OPE) algorithms [5]. OPE algorithms for contextual bandits evaluate a target policy by estimating its expected reward from data sampled by a behavior policy, without the target policy interacting with the environment.
Previous works on OPE for contextual bandits have mainly focused on environments with finite actions [4, 6-10]. These works can be largely divided into three approaches [6, 9, 11]. The first approach is the direct method (DM), which learns an environment model for policy evaluation. DM is known to have low variance [9]. However, since the environment model is learned with function approximation, the estimator is biased. The second approach is importance sampling (IS), which corrects the data distribution induced by a behavior policy to that of a target policy [12] and uses the corrected distribution to approximate the expected reward of the target policy. IS estimates are unbiased when the behavior and target policies are given. However, IS estimates can have large variance when there is a large mismatch between the behavior and target policy distributions [9]. The last approach is doubly robust (DR) estimation, which uses DM to reduce the variance of IS while keeping the unbiasedness of an IS estimate [6].

†The research was done while at Mila/McGill University, but the author is currently employed by Qualcomm Technologies Inc.
Although there are existing works on IS and DR that can be applied to continuous action spaces [13-16], most of them cannot be easily extended to evaluate deterministic contextual bandit policies with continuous actions. This is because the IS weights used for both IS and DR estimators are almost surely zero for deterministic target policies in continuous action spaces [2]. However, in practice, such OPE algorithms are needed. For example, treatment prescription policies should not stochastically prescribe drugs to patients.
To meet this need, there are works for evaluating deterministic contextual bandit policies with continuous actions [2, 5, 17]. These works focus on measuring the similarity between behavior and target actions for assigning IS ratios. However, they either assume a single-dimensional action space [17], which cannot be straightforwardly extended to multiple action dimensions, or use Euclidean distances as the similarity measure [2, 5]. In general, similarity measures should be learned locally at a state so that differences between behavior and target actions are weighed differently in each action dimension. For example, in the case of multi-drug prescription, the synergies and side effects of the prescribed drugs are often very complex. Moreover, different kinds of drugs are likely to have different degrees of effect from person to person [18, 19].
To this end, we propose local kernel metric learning for IS (KMIS) OPE estimation of deterministic contextual bandit policies with multidimensional continuous action spaces. Our proposed method learns a Mahalanobis distance metric [20-23] locally at each state that lengthens or shortens the distance between behavior and target actions to reduce the MSE. The metric-applied kernels measure the similarities between actions according to the Mahalanobis distances induced by the metric. In our work, we first analytically show that the leading-order bias [2] of a kernel-based IS estimator becomes a dominant factor in the leading-order MSE [2] as the action dimension increases, given the optimal bandwidth [2] that minimizes the leading-order MSE without a metric. Then we derive the kernel metric that minimizes the upper bound of the leading-order bias, which is bandwidth-agnostic. Our analysis shows that the convergence speed of a kernel-based IS OPE estimator can be improved with the application of the KMIS metrics. In the experiments, we demonstrate that the MSEs of kernel-based IS estimators are significantly reduced when combined with the proposed kernel metric learning. We report empirical results on various synthetic domains as well as a real-world dataset.
2 Related Work
The works on OPE of deterministic contextual bandits with continuous actions can be largely divided into importance sampling (IS) [2, 5] and doubly robust (DR) estimators [17, 24]. Both lines of work eliminate the problem of almost surely obtaining zero IS estimates when given a deterministic target policy and a stochastic behavior policy in a continuous action space, in one of two ways: most of the works relax the deterministic target policy, which can be seen as a Dirac delta function [2], to a kernel [2, 5, 24], while the other work discretizes the action space [17].
Among these, kernel-based IS methods use a kernel to measure the similarities between the target and behavior actions and focus on selecting the bandwidth of the kernel [2, 5, 24]. Su et al. [5] proposed a bandwidth selection algorithm that uses Lepski's principle from nonparametric statistics [25]. Kallus and Zhou [2] derived the leading-order MSE of a kernel-based IS OPE estimator and chose the optimal bandwidth that minimizes the leading-order MSE. One limitation of the existing kernel-based IS methods [2, 5, 24] is that they use Euclidean distances for measuring the similarities between behavior and target actions. The Euclidean distance is inadequate for this purpose, since assigning a high similarity to actions having similar rewards is what induces less bias (the bias is defined in Section 3.2). The other limitation is that these methods use one global bandwidth for the whole action space [17]. Since the second-order derivative of the expected reward w.r.t. the action is related to the leading-order bias derived by Kallus and Zhou [2], the optimal bandwidth that balances the bias-variance trade-off may vary across actions.
As for discretization methods, Cai et al. [17] proposed deep jump learning (DJL), which avoids the limitation of kernel-based methods caused by using one global bandwidth by adaptively discretizing one-dimensional continuous action spaces. The action space is discretized so that each discretized action interval has similar expected rewards given a state. By using the discretized intervals for DR estimation, DJL can estimate the policy value more accurately than kernel-based methods when the second-order derivatives of the expected reward change significantly across the action space. In such cases, kernel-based methods use the same bandwidth for all actions even though the optimal bandwidth varies depending on the action. DJL, on the other hand, can discretize the action space into smaller intervals in regions where the second-order derivative is large, and into larger intervals where it is small. However, their work focuses on domains with a single action dimension and cannot be easily extended to environments with multidimensional action spaces.
To tackle the limitation of kernel-based methods caused by using Euclidean distances, kernel metric learning can be applied to shrink or extend distances in the directions that reduce the MSE. Metric learning has been used for nearest-neighbor classification [22, 26-29] and kernel regression [22, 23]. Among these, our work is inspired by the work of Noh et al. [23], which learns a metric for reducing the bias and MSE of kernel regression. In their work, they assume a generative model of the data in which the input and output are jointly Gaussian. Under this assumption, they derive the Mahalanobis metric that reduces the bias and MSE of Nadaraya-Watson kernel regression. We learn our metric in a similar fashion in the context of OPE for contextual bandits, except that we do not make the generative assumption on the data, as the assumption is unnatural in our problem setting.
3 Preliminaries
3.1 Problem Setting
In this work, we focus on OPE of a deterministic target policy in an environment with a multidimensional continuous action space $\mathcal{A} \subset \mathbb{R}^{D_A}$, state space $\mathcal{S} \subset \mathbb{R}^{D_S}$, reward $r \in \mathbb{R}$ sampled from the conditional reward distribution $p(r|s,a)$, and state distribution $p(s)$. The deterministic target policy $\pi$ can be regarded as having a Dirac delta distribution $\pi(a|s) = \delta(\pi(s) - a)$, whose probability density function (PDF) value is zero everywhere except at the target action $\pi(s)$ selected given a state.

The offline dataset $D = \{(s_i, a_i, r_i)\}_{i=1}^{N}$ used for the evaluation of the deterministic target policy $\pi$ is sampled from the environment using a known stochastic behavior policy $\pi_b : \mathcal{S} \rightarrow \Delta(\mathcal{A})$. We assume that the support of the behavior policy $\pi_b$ contains the actions selected by the target policy $\pi(s)$. The goal of OPE is to evaluate the target policy value $\rho^\pi = \mathbb{E}_{s \sim p(s),\, a \sim \pi(a|s),\, r \sim p(r|s,a)}[r]$ using $D$ and without $\pi$ interacting with the environment.
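To ground the notation, the following is a minimal sketch of an offline dataset $D$ of this form; the linear target policy, Gaussian behavior policy, and toy reward model are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_A, N = 3, 2, 5000           # state dim, action dim, sample size

def target_policy(s):              # deterministic target pi(s): a linear map (toy choice)
    W = np.array([[0.5, -0.2, 0.1],
                  [0.3,  0.4, -0.1]])
    return W @ s

def behavior_pdf(a, s, sigma=0.5): # known Gaussian behavior policy pi_b(a|s)
    mu = 0.8 * target_policy(s)    # behavior is centered near (but not at) the target action
    diff = a - mu
    return np.exp(-0.5 * diff @ diff / sigma**2) / ((2 * np.pi * sigma**2) ** (D_A / 2))

def sample_behavior(s, sigma=0.5):
    return 0.8 * target_policy(s) + sigma * rng.standard_normal(D_A)

def reward(s, a):                  # toy reward: peaked at the target action, plus noise
    return float(np.exp(-np.sum((a - target_policy(s))**2)) + 0.1 * rng.standard_normal())

# Offline dataset D = {(s_i, a_i, r_i)} logged by the behavior policy
states  = rng.standard_normal((N, D_S))
actions = np.stack([sample_behavior(s) for s in states])
rewards = np.array([reward(s, a) for s, a in zip(states, actions)])
```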
3.2 Bandwidth Selection for the Isotropic Kernel-Based IS Estimator
One way to evaluate the target policy value using the offline data $D$ sampled with $\pi_b$ is to perform IS estimation. The IS ratios correct the action sampling distribution for the expectation from $\pi_b$ to $\pi$. However, since the density of a deterministic target policy $\pi$ is a Dirac delta function, its PDF value at a behavior action sampled from $\pi_b$ is almost surely zero. Existing works deal with this problem by relaxing the target policy $\pi$ to an isotropic kernel $K$ with a bandwidth $h$ and computing the IS estimate of the policy value $\hat\rho_K$ as in Eq. (1) [2, 5, 24]:
$$
\rho^\pi = \mathbb{E}_{s \sim p(s),\, a \sim \pi_b(a|s),\, r \sim p(r|s,a)}\!\left[\frac{\pi(a|s)}{\pi_b(a|s)}\, r\right] \;\approx\; \frac{1}{N h^{D_A}} \sum_{i=1}^{N} K\!\left(\frac{a_i - \pi(s_i)}{h}\right) \frac{r_i}{\pi_b(a_i|s_i)} \;=:\; \hat\rho_K. \tag{1}
$$
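A minimal sketch of the kernelized IS estimate in Eq. (1), continuing the toy setup above with an isotropic Gaussian kernel; the bandwidth value here is arbitrary, and Section 3.2 discusses how it should be chosen.

```python
def gaussian_kernel(u):
    # Isotropic Gaussian kernel on R^{D_A}: integrates to 1, symmetric, int u u^T K(u) du = I
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / ((2 * np.pi) ** (u.shape[-1] / 2))

def kernel_is_estimate(states, actions, rewards, h):
    # Eq. (1): (1 / (N h^{D_A})) * sum_i K((a_i - pi(s_i)) / h) * r_i / pi_b(a_i | s_i)
    N, D_A = actions.shape
    u = np.stack([(a - target_policy(s)) / h for s, a in zip(states, actions)])
    w = gaussian_kernel(u) / np.array([behavior_pdf(a, s) for s, a in zip(states, actions)])
    return np.sum(w * rewards) / (N * h ** D_A)

print(kernel_is_estimate(states, actions, rewards, h=0.3))
```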
As the Dirac delta function can be regarded as a kernel whose bandwidth $h$ approaches zero, the relaxation of the Dirac delta function to a kernel can be seen as increasing its $h$. With the relaxation, the bias of the kernel-based IS estimate, $\operatorname{Bias}(\hat\rho_K) := \mathbb{E}_{s \sim p(s),\, a \sim \pi_b(a|s),\, r \sim p(r|s,a)}[\hat\rho_K] - \rho^\pi$, increases while its variance is reduced. As the bias and the variance compose the MSE of the estimate, the bandwidth that best balances between them and reduces the MSE should be selected. Kallus and Zhou [2] derived the leading-order MSE (LOMSE) in terms of the bandwidth $h$, sample size $N$, and action dimension $D_A$ for selecting a bandwidth (Eq. (2)), assuming that $h \to 0$ and $\frac{1}{N h^{D_A}} \to 0$ as $N \to \infty$ (derivation in Appendix A.1). They also derived the optimal bandwidth $h^*$ that minimizes the LOMSE (derivation in Appendix A.2).
[Figure 1: reward contours over a two-dimensional action space given a state $s$, comparing (a) an isotropic Gaussian kernel $K(a - a_t) = \frac{1}{2\pi}\exp\!\big(\!-\frac{(a - a_t)^\top (a - a_t)}{2}\big)$ with (b) a metric-applied kernel $K(z - z_t) = \frac{1}{2\pi}\exp\!\big(\!-\frac{(a - a_t)^\top A(s)(a - a_t)}{2}\big)$.]

Figure 1: Illustration of bias reduction in a kernel-based IS estimate by the metric $A(s)$ locally learned at a given state $s$. The contour lines are drawn for the reward over the action space given $s$. Although behavior actions $a'$ and $a''$ are away from the target action $a_t$ ($= \pi(s)$) by an equal Euclidean distance, their corresponding rewards are different. When the (a) isotropic Gaussian kernel is used, bias can come from $a'$ since the similarity measures of $a'$ and $a''$ from $a_t$ are the same even though their rewards are different. However, when the (b) metric is applied, the bias is reduced, as the similarity measure is higher on $a''$, which has a reward similar to that of $a_t$, than on $a'$.
$$
\operatorname{LOMSE}(h, N, D_A) = \underbrace{h^4 C_b}_{(\text{leading-order bias})^2} + \underbrace{\frac{C_v}{N h^{D_A}}}_{\text{leading-order variance}}, \tag{2}
$$
$$
h^* = \arg\min_h \operatorname{LOMSE}(h, N, D_A) = \left(\frac{D_A C_v}{4 N C_b}\right)^{\frac{1}{D_A + 4}}, \tag{3}
$$
$$
C_b := \frac{1}{4}\,\mathbb{E}_{s \sim p(s)}\!\left[\nabla_a^2\, \bar r(s,a)\big|_{a=\pi(s)}\right]^2, \qquad
C_v := R(K)\, \mathbb{E}_{s \sim p(s)}\!\left[\frac{\mathbb{E}[r^2 \mid s, a=\pi(s)]}{\pi_b(a=\pi(s)|s)}\right],
$$
where the first term in Eq. (2) is the squared leading-order bias and the second term is the leading-order variance, $C_b$ and $C_v$ are constants related to the leading-order bias and variance, respectively, the expected reward is $\bar r(s,a) := \mathbb{E}[r|s,a]$, $\nabla_a^2$ denotes the Laplacian operator w.r.t. the action $a$, and the roughness of the kernel is $R(K) := \int K(u)^2\, du$. The kernel used for the derivation satisfies $\int K(u)\, du = 1$ and $K(u) = K(-u)$ for all $u$. For simplicity, we assumed a Gaussian kernel and used the property $\int u u^\top K(u)\, du = I$.
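As an illustration of Eqs. (2) and (3), the following sketch computes $h^*$ and the resulting LOMSE from given constants $C_b$ and $C_v$; the constants used here are placeholders, and in practice they would have to be estimated, e.g., from a fitted reward model and the known behavior density.

```python
def optimal_bandwidth(C_b, C_v, N, D_A):
    # Eq. (3): h* = (D_A * C_v / (4 * N * C_b)) ** (1 / (D_A + 4))
    return (D_A * C_v / (4.0 * N * C_b)) ** (1.0 / (D_A + 4))

def lomse(h, C_b, C_v, N, D_A):
    # Eq. (2): squared leading-order bias + leading-order variance
    return h**4 * C_b + C_v / (N * h**D_A)

C_b, C_v, N, D_A = 2.0, 1.5, 5000, 2   # placeholder constants for illustration
h_star = optimal_bandwidth(C_b, C_v, N, D_A)
print(h_star, lomse(h_star, C_b, C_v, N, D_A))
```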
In Section 4, we use the leading-order bias and variance [2] in the derivation of our proposed metric. We also use the optimal bandwidth [2] for analyzing the properties of the kernel-based IS estimator with and without a metric.
3.3 Mahalanobis Distance Metric
Relaxing the Dirac delta target policy $\pi$ to a kernel $K$ prevents the kernel-based IS estimator from estimating zero almost surely. When an isotropic kernel is used for the relaxation, the differences between the target and behavior actions in all directions are treated equally in measuring the similarity between the actions with the kernel in Eq. (1). However, to produce a more accurate OPE estimate, the difference between the two actions in some directions should be discounted relative to the others when measuring their similarity. For this, the Mahalanobis distance can be used. We define the Mahalanobis distance between two $D_A$-dimensional vectors $a_i \in \mathbb{R}^{D_A}$ and $a_j \in \mathbb{R}^{D_A}$ with the metric $A \in \mathbb{R}^{D_A \times D_A}$ as in Eq. (4). Applying the Mahalanobis distance metric $A$ ($= L L^\top$) to a kernel function can be seen as linearly transforming the kernel inputs with the transformation matrix $L$ ($L^\top a = z$):
$$
\|a_i - a_j\|_A := \sqrt{(a_i - a_j)^\top A\, (a_i - a_j)} = \|z_i - z_j\|, \qquad (A \succ 0,\; A^\top = A,\; |A| = 1). \tag{4}
$$
Figure 1 illustrates a case where the Mahalanobis metric is locally learned at a given state for reducing
the bias of a kernel-based IS estimate. The bias of the estimate is reduced by altering the shape of the
kernel to produce a higher similarity measure on a behavior action that has a similar reward to the
target action.
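A small sketch of how a metric-applied Gaussian kernel could be evaluated once a metric $A$ is available; the helper names are illustrative.

```python
import numpy as np

def mahalanobis_sq(a_i, a_j, A):
    # Squared Mahalanobis distance ||a_i - a_j||_A^2 = (a_i - a_j)^T A (a_i - a_j)
    d = a_i - a_j
    return d @ A @ d

def metric_gaussian_kernel(a, a_t, A):
    # Gaussian kernel with the Mahalanobis distance; since |A| = 1, the normalizing constant
    # is the same as for the isotropic kernel (equivalently, K applied to z = L^T a with A = L L^T).
    D_A = len(a)
    return np.exp(-0.5 * mahalanobis_sq(a, a_t, A)) / ((2 * np.pi) ** (D_A / 2))
```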
4 Local Metric Learning for Kernel-Based IS
4.1 Local Metric Learning via LOMSE Reduction
To reduce the LOMSE (Eq. (2)) of the kernel-based IS estimate by learning a metric, we first analyze the LOMSE when the optimal bandwidth (Eq. (3)) is applied to the kernel. Figure 1 illustrates how the bias of an IS estimate is induced by the kernel relaxation. The bias can worsen in high-dimensional action spaces. To show this analytically, we adapt Proposition 4 of Noh et al. [23] and show that the $C_b$ contained in the squared leading-order bias (Eq. (2)) becomes a dominant term in the LOMSE of kernel-based IS OPE, given the optimal bandwidth, as the action dimension increases.
Proposition 1. (Adapted from Noh et al. [23]) For a high-dimensional action space ($D_A \gg 4$) and given the optimal bandwidth $h^*$ (Eq. (3)), the squared leading-order bias dominates the leading-order variance in the LOMSE. Furthermore, $\operatorname{LOMSE}(h^*, N, D_A)$ can be approximated by $C_b$ in Eq. (2):
$$
\operatorname{LOMSE}(h^*, N, D_A) = N^{-\frac{4}{D_A+4}} \left[\left(\frac{D_A}{4}\right)^{\frac{4}{D_A+4}} + \left(\frac{4}{D_A}\right)^{\frac{D_A}{D_A+4}}\right] C_b^{\frac{D_A}{D_A+4}}\, C_v^{\frac{4}{D_A+4}} \;\approx\; C_b. \tag{5}
$$
The proof follows by plugging Eq. (3) into Eq. (2) and taking the limit $D_A \to \infty$ (see Appendix B.1). Proposition 1 implies that in an environment with a high-dimensional action space, the MSE can be significantly reduced by reducing the $C_b$ contained in the squared leading-order bias (Eq. (2)). Therefore, we aim to reduce $C_b$, or, in a bandwidth-agnostic manner, the leading-order bias, with a Mahalanobis distance metric that shortens the distance between the target and behavior actions in the directions where their corresponding rewards are similar.
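As a quick numerical check of Proposition 1 (a sketch with arbitrary constants), comparing the two terms of Eq. (2) at the optimal bandwidth shows the squared-bias term growing relative to the variance term as $D_A$ increases; their ratio is exactly $D_A/4$.

```python
C_b, C_v, N = 1.0, 1.0, 10_000                                 # arbitrary constants for illustration
for D_A in [1, 2, 4, 8, 16, 32]:
    h_star = (D_A * C_v / (4 * N * C_b)) ** (1 / (D_A + 4))    # Eq. (3)
    sq_bias  = h_star**4 * C_b                                 # squared leading-order bias at h*
    variance = C_v / (N * h_star**D_A)                         # leading-order variance at h*
    print(D_A, sq_bias / variance)                             # ratio equals D_A / 4
```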
As mentioned in Section 3.3, applying a state-dependent metric ($A(s) = L(s) L(s)^\top$) to a kernel is equivalent to linearly transforming the input vectors of the kernel. Therefore, $C_b$ with a metric $A : \mathcal{S} \to \mathbb{R}^{D_A \times D_A}$ (i.e., $C_{b,A}$) should be analyzed with the linearly transformed actions $L(s)^\top a$ (derivation in Appendix A.3), and we aim to find an optimal $A$ that minimizes it:
$$
\min_{\substack{A :\; A(s) \succ 0,\; A(s) = A(s)^\top,\; |A(s)| = 1\; \forall s}} \; C_{b,A} = \frac{1}{4}\, \mathbb{E}_{s \sim p(s)}\!\left[\operatorname{tr}\!\left(A(s)^{-1} H_a \bar r(s,a)\big|_{a=\pi(s)}\right)\right]^2, \tag{6}
$$
where $H_a$ denotes the Hessian operator w.r.t. the action $a$. In the derivation, we assume a Gaussian kernel and use the property $\int u u^\top K(u)\, du = I$. Still, optimizing the function $A$ in Eq. (6) itself is challenging since it requires considering the overall effect of each metric matrix $A(s)$ on the objective function. Therefore, we instead consider minimizing the following upper bound, which allows us to compute a closed-form metric matrix for each state in a nonparametric way:
$$
\min_{\substack{A :\; A(s) \succ 0,\; A(s) = A(s)^\top,\; |A(s)| = 1\; \forall s}} \; U_{b,A} = \frac{1}{4}\, \mathbb{E}_{s \sim p(s)}\!\left[\operatorname{tr}\!\left(A(s)^{-1} H_a \bar r(s,a)\big|_{a=\pi(s)}\right)^2\right]. \tag{7}
$$
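To make the two objectives concrete, here is a sketch of Monte Carlo versions of Eqs. (6) and (7), assuming the Hessians $H_a \bar r(s,a)\big|_{a=\pi(s)}$ at sampled states are available; in practice they would come from a fitted reward model, and the helper name is illustrative.

```python
import numpy as np

def c_b_A_and_upper_bound(hessians, metrics):
    # Monte Carlo versions over sampled states s_1, ..., s_m:
    #   C_{b,A} ~ (1/4) * ( mean_j tr(A(s_j)^{-1} H_j) )^2        (Eq. (6))
    #   U_{b,A} ~ (1/4) *   mean_j tr(A(s_j)^{-1} H_j)^2          (Eq. (7), >= C_{b,A} by Jensen)
    traces = np.array([np.trace(np.linalg.solve(A, H))
                       for A, H in zip(metrics, hessians)])
    return 0.25 * traces.mean() ** 2, 0.25 * (traces ** 2).mean()
```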
In the following Theorem 1, we introduce the optimal Mahalanobis metric matrix $A^*(s)$ that minimizes $U_{b,A}$.

Theorem 1. (Adapted from Noh et al. [22]) Assume that $p(r|s,a)$ is twice differentiable w.r.t. the action $a$. Let $\Lambda_+(s)$ and $\Lambda_-(s)$ be diagonal matrices of the positive and negative eigenvalues of the Hessian $H_a \mathbb{E}[r|s,a]\big|_{a=\pi(s)}$, let $U_+(s)$ and $U_-(s)$ be the matrices of eigenvectors corresponding to $\Lambda_+(s)$ and $\Lambda_-(s)$, respectively, and let $d_+(s)$ and $d_-(s)$ be the numbers of positive and negative eigenvalues of the Hessian. Then the metric $A^*(s)$ that minimizes $U_{b,A}$ is
$$
A^*(s) = \alpha(s)\, [U_+(s)\; U_-(s)]\; \underbrace{\begin{bmatrix} d_+(s)\Lambda_+(s) & 0 \\ 0 & -d_-(s)\Lambda_-(s) \end{bmatrix}}_{=:\,M(s)}\; [U_+(s)\; U_-(s)]^\top, \tag{8}
$$
where $\alpha(s) := |M(s)|^{-1/(d_+(s)+d_-(s))}$.
The proof involves solving a Lagrangian to minimize the square of the trace term in Eq. (7) under the constraints on $A(s)$ (see Appendix B.2). In the special case where the Hessian matrix contains both positive and negative eigenvalues for all states, the metric reduces $C_{b,A}$ to zero (see Appendix B.2). $A^*(s)$ is computed locally at a state $s$ and the corresponding target action $\pi(s)$, as the optimal metric is derived from the Hessian at that point, $H_a \bar r(s,a)\big|_{a=\pi(s)}$. When the optimal metric is applied to the kernel, it measures the similarity between a behavior and target action using the Mahalanobis distance instead of the Euclidean distance, as shown in Figure 1 and in Eq. (9):
$$
\|a - \pi(s)\|_{A^*(s)}^2 = \alpha(s) \sum_{i=1}^{d_+(s)} \left\{\sqrt{d_+(s)\,\lambda_{+,i}(s)}\; u_{+,i}(s)^\top (a - \pi(s))\right\}^2 + \alpha(s) \sum_{i=1}^{d_-(s)} \left\{\sqrt{-d_-(s)\,\lambda_{-,i}(s)}\; u_{-,i}(s)^\top (a - \pi(s))\right\}^2, \tag{9}
$$
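Below is a sketch of how the metric in Eq. (8) could be computed from the Hessian of an estimated expected reward at $(s, \pi(s))$, together with a check that the resulting quadratic form matches the eigen-expansion in Eq. (9). The Hessian here is a placeholder; in practice it would be obtained from a learned reward model, and the function name is illustrative.

```python
import numpy as np

def kmis_metric(H, eps=1e-8):
    # Eq. (8): A*(s) from the eigendecomposition of the reward Hessian at (s, pi(s)).
    lam, V = np.linalg.eigh(H)                 # eigenvalues (ascending) and eigenvectors
    pos, neg = lam > eps, lam < -eps
    d_plus, d_minus = int(pos.sum()), int(neg.sum())
    # M(s) = diag(d_+ Lambda_+, -d_- Lambda_-): all entries positive, so A*(s) is positive definite
    m_diag = np.concatenate([d_plus * lam[pos], -d_minus * lam[neg]])
    U = np.hstack([V[:, pos], V[:, neg]])      # [U_+(s)  U_-(s)]
    alpha = np.prod(m_diag) ** (-1.0 / len(m_diag))   # enforces |A*(s)| = 1 (full-rank Hessian)
    return alpha * (U * m_diag) @ U.T, alpha, lam, V, pos, neg

# Placeholder Hessian with both positive and negative curvature
# (in practice, estimated from a learned reward model at a = pi(s))
H = np.array([[2.0, 0.3],
              [0.3, -1.0]])
A_star, alpha, lam, V, pos, neg = kmis_metric(H)
print(np.linalg.det(A_star))                   # ~1.0 by construction

# Eq. (9) check: the quadratic form v^T A*(s) v equals the eigen-expansion
v = np.array([0.4, -0.2])                      # v = a - pi(s)
lhs = v @ A_star @ v
rhs = alpha * (
    sum((np.sqrt(pos.sum() * l) * (u @ v)) ** 2 for l, u in zip(lam[pos], V[:, pos].T)) +
    sum((np.sqrt(-neg.sum() * l) * (u @ v)) ** 2 for l, u in zip(lam[neg], V[:, neg].T))
)
print(np.isclose(lhs, rhs))                    # True
```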