variance when there is a large mismatch between the behavior and target policy distributions [9]. The last approach is doubly robust (DR), which uses DM to reduce the variance of IS while keeping the unbiasedness of an IS estimate [6].
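In our notation, with logged samples $(x_i, a_i, r_i)$ drawn under a behavior policy $\pi_b$ and a reward model $\hat{r}$ fitted by DM, the standard DR estimator for a stochastic target policy $\pi_e$ takes the form (a generic sketch, not the exact estimator of [6]):
\[
\hat{V}_{\mathrm{DR}} = \frac{1}{n}\sum_{i=1}^{n}\Big[\,\mathbb{E}_{a\sim\pi_e(\cdot\mid x_i)}\big[\hat{r}(x_i,a)\big] \;+\; \frac{\pi_e(a_i\mid x_i)}{\pi_b(a_i\mid x_i)}\big(r_i-\hat{r}(x_i,a_i)\big)\Big],
\]
which remains unbiased as long as either the reward model or the behavior policy density is correct.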
Although there are existing works on IS and DR that can be applied to continuous action spaces [13–16], most of them cannot be easily extended to evaluate deterministic contextual bandit policies with continuous actions. This is because the IS weights used by both IS and DR estimators are almost surely zero for deterministic target policies in continuous action spaces [2]. In practice, however, such OPE algorithms are needed: treatment prescription policies, for example, should not stochastically prescribe drugs to patients.
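To see why the weights vanish, consider the vanilla IS estimator written in our notation:
\[
\hat{V}_{\mathrm{IS}} = \frac{1}{n}\sum_{i=1}^{n}\frac{\pi_e(a_i\mid x_i)}{\pi_b(a_i\mid x_i)}\,r_i .
\]
A deterministic target policy $\pi$ corresponds to the degenerate density $\pi_e(a\mid x)=\delta(a-\pi(x))$, and since the logged actions are drawn from a continuous behavior distribution, $a_i=\pi(x_i)$ holds with probability zero; every IS weight, and hence the estimate itself, is therefore zero almost surely.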
To meet these needs, several works evaluate deterministic contextual bandit policies with continuous actions [2, 5, 17]. These works focus on measuring the similarity between behavior and target actions to assign IS ratios. However, they either assume a one-dimensional action space [17], which cannot be straightforwardly extended to multiple action dimensions, or use Euclidean distances as the similarity measure [2, 5]. In general, similarity measures should be learned locally at each state so that differences between behavior and target actions are weighted differently in each action dimension. For example, in multi-drug prescription, the synergies and side effects of the prescribed drugs are often very complex, and different kinds of drugs are likely to have different degrees of effect from person to person [18, 19].
To this end, we propose local kernel metric learning for IS (KMIS) OPE estimation of deterministic contextual bandit policies with multidimensional continuous action spaces. Our method learns a Mahalanobis distance metric [20–23] locally at each state that lengthens or shortens the distance between behavior and target actions so as to reduce the MSE. The metric-applied kernels measure the similarities between actions according to the Mahalanobis distances induced by the metric (an illustrative sketch is given at the end of this section). We first show analytically that, given the optimal bandwidth [2] that minimizes the leading-order MSE [2] without a metric, the leading-order bias [2] of a kernel-based IS estimator becomes a dominant factor in the leading-order MSE as the action dimension increases. We then derive the kernel metric that minimizes the upper bound of the leading-order bias; the derived metric is bandwidth-agnostic. Our analysis shows that the convergence speed of a kernel-based IS OPE estimator can be improved by applying the KMIS metric. In the experiments, we demonstrate that the MSEs of kernel-based IS estimators are significantly reduced when combined with the proposed kernel metric learning, and we report empirical results on various synthetic domains as well as a real-world dataset.
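For intuition only, the following minimal Python sketch shows how a state-dependent Mahalanobis metric A reweights the difference between a logged and a target action in each action dimension; the Gaussian kernel choice, the function name, the normalization, and the example metric are ours for illustration and are not the exact KMIS estimator.

import numpy as np

def metric_kernel_weight(a_logged, a_target, A, h):
    # Gaussian kernel weight between a logged action and the target action
    # under a positive-definite Mahalanobis metric A and bandwidth h.
    diff = np.asarray(a_logged, dtype=float) - np.asarray(a_target, dtype=float)
    d = diff.shape[0]
    maha_sq = float(diff @ A @ diff)                      # squared Mahalanobis distance
    # sqrt(det(A)) keeps the metric-applied kernel normalized as a density
    norm_const = np.sqrt(np.linalg.det(A)) / ((2 * np.pi) ** (d / 2) * h ** d)
    return norm_const * np.exp(-0.5 * maha_sq / h ** 2)

# A metric that penalizes differences in the first action dimension more heavily:
A = np.diag([4.0, 0.25])
w1 = metric_kernel_weight([0.1, 0.8], [0.0, 0.0], A, h=0.5)  # small difference in dim 1
w2 = metric_kernel_weight([0.8, 0.1], [0.0, 0.0], A, h=0.5)  # large difference in dim 1
# w1 > w2: the same Euclidean distance yields different similarities under the metric.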
2 Related Work
Works on OPE of deterministic contextual bandits with continuous actions can be largely divided into importance sampling (IS) [2, 5] and doubly robust (DR) estimators [17, 24]. Both lines of work address, in one of two ways, the problem that IS estimates are almost surely zero when a deterministic target policy is evaluated against a stochastic behavior policy in a continuous action space: most works relax the deterministic target policy, which can be seen as a Dirac delta function [2], to a kernel [2, 5, 24], while the other work discretizes the action space [17].
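Concretely, for a $d$-dimensional action space, relaxing the Dirac delta to a kernel $K$ with bandwidth $h$ yields a kernel-based IS estimator of the form (our notation)
\[
\hat{V}_{h} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^{d}}\,K\!\left(\frac{a_i-\pi(x_i)}{h}\right)\frac{r_i}{\pi_b(a_i\mid x_i)},
\]
which trades the almost-surely-zero IS weights for a bias that grows with the bandwidth.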
Among these, kernel-based IS methods use a kernel to measure the similarities between the target and behavior actions and focus on selecting the bandwidth of the kernel [2, 5, 24]. Su et al. [5] proposed a bandwidth selection algorithm that applies Lepski's principle from nonparametric statistics [25]. Kallus and Zhou [2] derived the leading-order MSE of a kernel-based IS OPE estimator and chose the optimal bandwidth that minimizes it. One limitation of the existing kernel-based IS methods [2, 5, 24] is that they use Euclidean distances to measure the similarities between behavior and target actions. The Euclidean distance is inadequate for this purpose, since less bias is induced when high similarity is assigned to actions with similar rewards (the bias is defined in Section 3.2). The other limitation is that these methods use one global bandwidth for the whole action space [17]. Since the leading-order bias derived by Kallus and Zhou [2] depends on the second-order derivative of the expected reward w.r.t. the action, the optimal bandwidth that balances the bias-variance trade-off may vary across actions.
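For reference, the leading-order analysis in [2] follows standard kernel-smoothing rates: with a second-order kernel, the leading-order MSE behaves as
\[
\mathrm{MSE}(h) \;\approx\; C_{b}\,h^{4} \;+\; \frac{C_{v}}{n\,h^{d}},
\]
where, roughly, $C_b$ depends on the curvature (second derivative) of the expected reward at the target action and $C_v$ depends on the kernel and the behavior policy density, so that minimizing over $h$ gives the optimal rate $h^{*}=O(n^{-1/(d+4)})$.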
As for discretization methods, Cai et al. [17] proposed deep jump learning (DJL), which avoids the global-bandwidth limitation of kernel-based methods by adaptively discretizing a one-dimensional continuous action space. The action space is discretized to have similar expected