CONDITIONALLY RISK-AVERSE CONTEXTUAL BANDITS
Mónika Farsang
Vienna University of Technology
monika.farsang@tuwien.ac.at
Paul Mineiro
Microsoft Research
pmineiro@microsoft.com
Wangda Zhang
Microsoft Research
wangdazhang@microsoft.com
ABSTRACT
Contextual bandits with average-case statistical guarantees are inadequate in risk-averse situations because they might trade off degraded worst-case behaviour for better average performance. Designing a risk-averse contextual bandit is challenging because exploration is necessary but risk-aversion is sensitive to the entire distribution of rewards; nonetheless we exhibit the first risk-averse contextual bandit algorithm with an online regret guarantee. We conduct experiments in diverse scenarios where worst-case outcomes should be avoided, spanning dynamic pricing, inventory management, and self-tuning software, including a production exascale data processing system.
1 Introduction
Contextual bandits [Auer et al., 2002, Langford and Zhang, 2007] are a mature technology with numerous applications:
however, adoption has been most aggressive in recommendation scenarios [Bouneffouf and Rish, 2019], where the
worst-case outcome is user annoyance. At the other extreme are medical and defense scenarios where worst-case
outcomes are literally fatal. In between are scenarios of interest where bad outcomes are tolerable but should be avoided,
e.g., logistics; finance; and self-tuning software, where the term tail catastrophe highlights the inadequacy of average
case performance guarantees in real-world applications [Marcus et al., 2021]. These scenarios demand risk-aversion, i.e.,
decisions should sacrifice average performance in order to avoid worst-case outcomes, and incorporating risk-aversion
into contextual bandits would facilitate adoption. More generally, risk aversion is essential for making informed
decisions that align with the risk preferences of the decision maker by balancing the potential benefits and risks of a
particular action.
This paper solves risk-averse decision making for contextual bandits via reduction to regression, resulting in the
first risk-averse contextual bandit algorithm with an online regret guarantee. The regret guarantee applies over
adversarially chosen context sequences and includes the exploration choices made by the algorithm. The approach
utilizes arbitrary (online learnable) function classes and extends to infinite action spaces; introduces no computational
overhead relative to the risk-neutral setting; introduces statistical overhead directly related to the desired level of
risk-aversion, with no overhead in the risk-neutral limit; and composes with other innovations within the Decision-to-
Estimation framework [Foster et al., 2021], e.g., linear representability [Zhu and Mineiro, 2022].
We make the following contributions:
• We explain the problem setting (Section 2) with careful definitions which facilitate the application of theory and reveal the unique status of expectile loss.
• We state the resulting algorithms (Section 3), which arise via application of the Estimation-to-Decisions framework [Foster et al., 2021].
• We discuss the superior utility of expectile loss for algorithm design over the more commonly used risk measures VaR and CVaR (Sections 2 and 3).
• We provide experimental support for the technique via diverse scenarios (Section 4). Empirically, tail control is proportionally inexpensive relative to average-case degradation, justifying the criticism of average-case guarantees in the self-tuning software literature.
2 Problem Setting
This section contains tedious exposition, necessary because (i) this work draws heavily on results from mathematical
finance that cannot be presumed known by the general machine learning audience; and (ii) careful definitions are key to
our contribution. For the impatient reader wanting to skip directly to Section 3, we provide the following summary: use
expectile loss. The rest of this section answers the question "why?".
Contextual Bandits  We describe the contextual bandit problem, which proceeds over $T$ rounds. At each round $t \in [T]$, the learner receives a context $x_t \in \mathcal{X}$ (the context space), selects an action $a_t \in \mathcal{A}$ (the action space), and then observes a loss $l_t(a_t)$, where $l_t : \mathcal{A} \to [0,1]$ is the underlying loss function. We assume that for each round $t$, conditioned on $x_t$, $l_t$ is sampled from a distribution $P_{l_t}(\cdot \mid x_t)$. We allow both the contexts $x_1, \ldots, x_T$ and the distributions $P_{l_1}, \ldots, P_{l_T}$ to be selected in an arbitrary, potentially adaptive fashion based on the history.
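For concreteness, a minimal sketch of this interaction protocol follows; the environment interface (`env.context`, `env.loss`) and the `Learner` protocol are illustrative names introduced here, not notation from the paper.

```python
from typing import Any, Protocol

class Learner(Protocol):
    def act(self, x: Any) -> int: ...                            # choose a_t after seeing x_t
    def update(self, x: Any, a: int, loss: float) -> None: ...   # consume the observed l_t(a_t)

def run_contextual_bandit(env, learner: Learner, T: int) -> list:
    """One pass of the protocol: context -> action -> realized loss in [0, 1]."""
    losses = []
    for t in range(T):
        x_t = env.context(t)            # environment (possibly adversarial) reveals x_t
        a_t = learner.act(x_t)          # learner selects a_t, possibly at random
        loss = env.loss(t, x_t, a_t)    # only l_t(a_t) is observed (bandit feedback)
        learner.update(x_t, a_t, loss)
        losses.append(loss)
    return losses
```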
Risk Measures  In seminal work, Artzner et al. [1999] presented an axiomatic approach to measuring risk. A risk measure is a function which maps a random variable to $\mathbb{R} \cup \{\infty\}$ and obeys certain axioms such as normalization, translation contravariance, and monotonicity. Risk measures embed previous approaches to measuring risk: we refer interested readers to Meyfredi [2004].
Conditional Risk-Aversion When considering extensions of risk-averse bandit algorithms to the contextual setting,
two possible choices are apparent: marginal risk-aversion, corresponding to applying a risk measure to the distribution
of losses realized over the joint context-action distribution; and conditional risk-aversion, corresponding to computing a
risk measure on a per-context basis and then summing over encountered contexts. For now our focus is conditional
risk-aversion, but after introducing terminology, we revisit the relationship between these two at the end of this section.
Contextual Bandit Regret  Conditional risk-aversion motivates our definition of regret for finite action sets,

$$\mathrm{Reg}_{\mathrm{CB}}(T) \doteq \sum_{t=1}^{T} \mathbb{E}_{a_t}\!\left[ \rho\big((l_t)_{a_t}\big) - \min_{a} \rho\big((l_t)_{a}\big) \,\Big|\, x_t \right], \qquad (1)$$

where $\rho$ is a risk measure, and the expectation is with respect to (the algorithm's) action distribution; note $\rho$ is a function of the adversary's loss random variable and not the realization. For infinite action sets we use a smoothed regret criterion: instead of competing with the best action, we compete with any action distribution $Q$ with limited concentration $\left\|\frac{dQ}{d\mu}\right\|_\infty \le h^{-1}$ relative to a reference measure $\mu$,

$$\mathrm{Reg}_{\mathrm{CB}}^{(h,\mu)}(T) \doteq \sum_{t=1}^{T} \left( \mathbb{E}_{a_t}\!\left[ \rho\big((l_t)_{a_t}\big) \,\big|\, x_t \right] - \min_{Q \,:\, \left\|\frac{dQ}{d\mu}\right\|_\infty \le h^{-1}} \mathbb{E}_{a \sim Q}\!\left[ \rho\big((l_t)_{a}\big) \,\big|\, x_t \right] \right). \qquad (2)$$

Note the finite action regret is a special case, corresponding to the uniform reference measure $\mu$ and $h^{-1} = |\mathcal{A}|$. In practice $\mu$ is a hyperparameter while $h$ can be tuned using contextual bandit meta-learning: see the experiments for details.
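To see why the finite-action regret is a special case (a quick check added here for concreteness, not taken from the paper): with $\mu = \mathrm{Unif}(\mathcal{A})$ and $h^{-1} = |\mathcal{A}|$, the constraint becomes

$$\left\|\tfrac{dQ}{d\mu}\right\|_\infty \le |\mathcal{A}| \iff \forall a \in \mathcal{A} : Q(a) \le 1,$$

which every distribution $Q$ satisfies; in particular the point mass on the best action is feasible, so the inner minimum in equation (2) equals $\min_a \rho((l_t)_a)$ and equation (2) reduces to equation (1).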
Reduction to Regression  We attack the contextual bandit problem via reduction to regression, working with a user-specified class of regression functions $\mathcal{F} \subseteq (\mathcal{X} \times \mathcal{A} \to [0,1])$ that aims to estimate a risk measure $\rho$ of the conditional loss distribution. We make the following realizability assumption¹,

$$\forall a \in \mathcal{A},\ t \in [T] : \exists f^{\star} \in \mathcal{F} : f^{\star}(x_t, a) = \rho\big((l_t)_a\big),$$

i.e., our function class includes a function which correctly estimates the value of the risk measure arising from any action $a$ in context $x_t$. This constrains the adversary's choices, as $l_t$ must be consistent with realizability, but there are many random variables that achieve a particular risk value.
Motivation for EVaR  We describe additional desirable properties of a risk measure which ultimately determine our choice of risk measure. A law-invariant risk measure is invariant to transformations of the random variable that preserve the distribution of outcomes, i.e., is a function of distribution only [Kusuoka, 2001]. An elicitable risk measure can be defined as the minimum of the expectation of a loss function. Because our algorithm operates via reduction to regression, we require an elicitable risk measure. A coherent risk measure satisfies the additional axiom of convexity: coherence is desirable because it implies risk reduction from diversification. To avoid confusion, note the convexity of a risk measure is with respect to stochastic mixtures of random variables, i.e., $\forall t \in [0,1] : \rho(tX + (1-t)Y) \le t\,\rho(X) + (1-t)\,\rho(Y)$. For elicitable risk measures, this is a distinct property from the convexity of the elicitation loss.

¹Foster et al. [2020] demonstrate misspecification is tolerable, but we do not complicate the exposition here.
Ziegel [2016] shows the class of elicitable law-invariant coherent risk measures for real-valued random variables is precisely the expectile Value at Risk ($\mathrm{EVaR}_q$) for $q \in \left(0, \tfrac{1}{2}\right]$, defined as

$$\mathrm{EVaR}_q(D) = \operatorname*{arg\,min}_{\hat{v} \in [0,1]} \mathbb{E}_{v \sim D}\!\left[ (1-q)\big((v - \hat{v})^+\big)^2 + q\big((\hat{v} - v)^+\big)^2 \right], \qquad (3)$$

where $(x)^+ = \max(x, 0)$. This asymmetrical strongly convex loss encourages overprediction relative to the mean, implying infrequent large losses correspond to increased risk. A minimizer of equation (3) is called an expectile. Certain technical qualifications are necessary for the minimum to be achieved (bounded realization suffices). We refer to the elicitation loss function as expectile loss.
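To make equation (3) concrete, here is a minimal sketch that computes a sample $q$-expectile by iteratively reweighted averaging; this batch solver is our own illustration (the algorithm itself only needs the expectile loss inside a regression oracle), and the function name and tolerances are placeholders.

```python
import numpy as np

def expectile(values, q, tol=1e-10, max_iter=1000):
    """q-expectile of a sample: the minimizer of the expectile loss
    (1-q)*((v - vhat)^+)^2 + q*((vhat - v)^+)^2, found by iteratively
    reweighted averaging (a standard fixed-point scheme, not from the paper)."""
    v = np.asarray(values, dtype=float)
    vhat = v.mean()                              # q = 1/2 recovers the mean
    for _ in range(max_iter):
        w = np.where(v > vhat, 1.0 - q, q)       # losses above vhat weigh more when q < 1/2
        new = np.average(v, weights=w)           # first-order condition: weighted mean
        if abs(new - vhat) < tol:
            break
        vhat = new
    return vhat
```

For $q = \tfrac{1}{2}$ this recovers the sample mean, and as $q \to 0$ the expectile moves toward the largest observed loss, matching the risk-averse reading of equation (3).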
$\mathrm{EVaR}$ is less familiar to the machine learning community than $\mathrm{VaR}$ or $\mathrm{CVaR}$, but it is a popular risk measure in financial applications [Bellini and Di Bernardino, 2017], whose proponents champion the superior finite-sample guarantees induced by strong convexity [Rossello, 2022]. Waltrup et al. [2015] reveal connections between $\mathrm{EVaR}$ and the risk measures $\mathrm{VaR}$ and $\mathrm{CVaR}$; in particular they note that both $\mathrm{VaR}$ and $\mathrm{CVaR}$ can be computed from $\mathrm{EVaR}$.² See Section 3 for additional commentary.
When $q \in \left(\tfrac{1}{2}, 1\right)$, $\mathrm{EVaR}_q$ is risk-seeking. While not our focus, the analysis remains valid; therefore we state results in terms of $\min(q, 1-q)$.
Regression Oracle  We assume access to an online regression oracle $\mathrm{Alg}_{\mathrm{Reg}}$, which is an algorithm for sequential prediction under strongly convex losses using $\mathcal{F}$ as a benchmark class. More specifically, the oracle operates in the following protocol: at each round $t \in [T]$, the algorithm receives a context $x_t \in \mathcal{X}$ and makes a prediction $\hat{f}_t$, where $\hat{f}_t(x_t, a)$ is interpreted as the prediction for action $a$; it then observes an action $a_t \in \mathcal{A}$ and realized outcome $l_t(a_t) \in [0,1]$, and incurs the instantaneous expectile loss

$$g_t(\hat{f}_t) \doteq \left[ (1-q)\big((v - \hat{v})^+\big)^2 + q\big((\hat{v} - v)^+\big)^2 \right]_{v = l_t(a_t),\ \hat{v} = \hat{f}_t(x_t, a_t)}.$$

We assume $\mathrm{Alg}_{\mathrm{Reg}}$ guarantees that for any (potentially adaptively chosen) sequence $(x_t, a_t, l_t)_{t=1}^{T}$,

$$\sum_{t=1}^{T} \Big( g_t(\hat{f}_t) - g_t(f^{\star}) \Big) \le \mathrm{Reg}_{\mathrm{EVaR}_q}(T), \qquad (4)$$

for some (non-data-dependent) function $\mathrm{Reg}_{\mathrm{EVaR}_q}(T)$. Online regression is well studied, with many known algorithms in various cases: e.g., for linear $\mathcal{F}$ on the $d$-dimensional hypersphere, online Newton step achieves $\mathrm{Reg}_{\mathrm{EVaR}_q}(T) = O\!\left(\tfrac{d}{\min(q,1-q)} \log(T)\right)$ [Hazan et al., 2007]. Furthermore, for any finite $\mathcal{F}$ we can achieve $\mathrm{Reg}_{\mathrm{EVaR}_q}(T) = O\!\left(\tfrac{1}{\min(q,1-q)} \log |\mathcal{F}|\right)$ using Vovk's aggregation algorithm [Vovk, 1998]. Section 2.3 of Foster and Rakhlin [2020] has a more complete list of oracles.
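One concrete instantiation of $\mathrm{Alg}_{\mathrm{Reg}}$ might be online gradient descent on the expectile loss with a linear model over joint (context, action) features. The sketch below is our illustration under assumed feature and step-size choices; it is not the oracle used in the paper's experiments.

```python
import numpy as np

class ExpectileRegressionOracle:
    """A sketch of Alg_Reg: online linear regression under the expectile loss.
    predict() returns per-action risk estimates; update() takes one SGD step
    at the observed (x_t, a_t, l_t(a_t))."""
    def __init__(self, dim, q, lr=0.1):
        self.w = np.zeros(dim)
        self.q, self.lr = q, lr

    def predict(self, feats):
        # feats: array of shape (num_actions, dim); estimates clipped to [0, 1]
        # for action selection only (the update below uses the raw prediction).
        return np.clip(feats @ self.w, 0.0, 1.0)

    def update(self, feat, loss):
        # feat: feature vector of the played (x_t, a_t); loss: realized l_t(a_t)
        vhat = float(feat @ self.w)
        # gradient of (1-q)*((v - vhat)^+)^2 + q*((vhat - v)^+)^2 w.r.t. vhat
        grad = 2.0 * ((1.0 - self.q) * -max(loss - vhat, 0.0)
                      + self.q * max(vhat - loss, 0.0))
        self.w -= self.lr * grad * feat
```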
Optimization Oracle  We assume an approximate (possibly randomized) optimization oracle $\mathrm{Alg}_{\mathrm{Opt}} : \mathcal{F} \times \Delta(\mathcal{A}) \times \mathbb{R}_+ \to \Delta(\mathcal{A})$ which guarantees

$$\forall \hat{f} \in \mathcal{F} : \mathbb{E}_{\hat{a} \sim \mathrm{Alg}_{\mathrm{Opt}}(\hat{f}, \mu, \delta)}\!\left[ \mathbb{E}_{a \sim \mu}\!\left[ \max\big(0,\ \hat{f}(\hat{a}) - \hat{f}(a)\big) \right] \right] \le \delta,$$

i.e., given an (estimated reward) function $\hat{f}$ the optimization oracle can find an approximate minimizer $\hat{a}$ w.r.t. the reference measure $\mu$. For finite action sets we can compute $\mathrm{Alg}_{\mathrm{Opt}}$ in $O(|\mathcal{A}|)$ for all $\mu$ with $\delta = 0$. For infinite action sets we can compute $\mathrm{Alg}_{\mathrm{Opt}}$ with high probability via the empirical argmin over $O\!\left(\tfrac{1}{\delta}\right)$ i.i.d. samples from $\mu$, independent of the cardinality or dimensionality of the action space. Of course specific function classes may admit superior customized strategies.
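The infinite-action construction above also admits a direct sketch: draw roughly $1/\delta$ i.i.d. actions from the reference measure $\mu$ and return the empirical argmin under $\hat{f}$. The sampler interface (`sample_mu`) and constants are assumptions on our part; for finite action sets one would simply scan all $|\mathcal{A}|$ actions with $\delta = 0$.

```python
import math
import random

def alg_opt(f_hat, sample_mu, delta, rng=random.Random(0)):
    """Approximate optimization oracle Alg_Opt(f_hat, mu, delta) for infinite
    action sets: empirical argmin of f_hat over ceil(1/delta) i.i.d. draws
    from the reference measure mu.  `sample_mu(rng)` is an assumed
    user-supplied sampler returning one action; requires delta > 0."""
    n = math.ceil(1.0 / delta)
    candidates = [sample_mu(rng) for _ in range(n)]
    return min(candidates, key=f_hat)
```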
²The relationship involves differences, which induces ambiguous curvature and is therefore not viable for incorporating $\mathrm{VaR}$ or $\mathrm{CVaR}$ into decision-to-estimation.
Algorithm 1 Finite Action Set
1: for $t = 1, 2, \ldots, T$ do
2:   Receive context $x_t$.
3:   $\hat{f}_t \leftarrow \mathrm{Alg}_{\mathrm{Reg}}.\mathrm{predict}(x_t)$.
4:   $\hat{a}_t \leftarrow \mathrm{Alg}_{\mathrm{Opt}}(\hat{f}_t, 0)$.
5:   Sample $a_t \sim \mathrm{AL}(\hat{f}_t, \hat{a}_t)$.
6:   Play $a_t$ and observe loss $l_t(a_t)$.
7:   Call $\mathrm{Alg}_{\mathrm{Reg}}.\mathrm{update}(x_t, a_t, l_t)$.

Algorithm 2 Infinite Action Set
1: for $t = 1, 2, \ldots, T$ do
2:   Receive context $x_t$.
3:   $\hat{f}_t \leftarrow \mathrm{Alg}_{\mathrm{Reg}}.\mathrm{predict}(x_t)$.
4:   $\hat{a}_t \leftarrow \mathrm{Alg}_{\mathrm{Opt}}\!\left(\hat{f}_t, \frac{1}{4\theta\gamma}\right)$.
5:   Sample $a_t \sim \mathrm{Cont\text{-}AL}(\hat{f}_t, \hat{a}_t)$.
6:   Play $a_t$ and observe loss $l_t(a_t)$.
7:   Call $\mathrm{Alg}_{\mathrm{Reg}}.\mathrm{update}(x_t, a_t, l_t)$.

Figure 1: (Left) Finite action set with exact optimization oracle. (Right) Infinite action set with approximate optimization oracle. Hyperparameters $\mu$ and $h$ are elided to facilitate comparison.
Marginal vs. conditional, revisited  Now consider an oblivious stationary environment where $(x, l)$ is drawn from a fixed joint distribution $D$: further assume a law-invariant risk measure to ease exposition, i.e., assume $\rho$ is a function of distribution only. Marginal risk-aversion regret for a policy $\pi : \mathcal{X} \to P(\mathcal{A})$ over a policy class $\Pi$ is defined as

$$\mathrm{Reg}_{\mathrm{Marg}}(\pi) \doteq \rho\big(D_{\mathrm{Marg}}(\pi)\big) - \min_{\pi^{\star} \in \Pi} \rho\big(D_{\mathrm{Marg}}(\pi^{\star})\big),$$

where $D_{\mathrm{Marg}}(\pi)$ is defined via

$$\mathbb{E}_{z \sim D_{\mathrm{Marg}}(\pi)}[f(z)] \doteq \mathbb{E}_{\substack{(x,l) \sim D \\ a \sim \pi(x)}}[f(l_a)].$$

Marginal risk-aversion does not correspond to the expectation of a per-context function, because the risk measure is a function of the complete distribution. Thus, if we apply an online-to-batch conversion to a conditional risk-aversion regret guarantee, we end up with a regret guarantee with respect to the expected conditional risk under $D$ rather than the marginal risk. For coherent risk measures, minimizing expected conditional risk minimizes an upper bound on marginal risk, which is sensible. However this is unlike the risk-neutral setting, where an adversarial guarantee provides a tight stochastic guarantee. In financial parlance, an algorithm designed for the stochastic case could benefit from diversification opportunities across contexts. However, conditional risk-aversion is the appropriate metric for scenarios where re-distributing risk across contexts is not acceptable, e.g., software quality-of-service guarantees where the contexts are customers.
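To spell out the "upper bound" step (an informal check added here, not an argument from the paper): writing $D_{\mathrm{Marg}}(\pi \mid x)$ for the conditional loss distribution at a fixed context $x$ with $a \sim \pi(x)$, the marginal distribution $D_{\mathrm{Marg}}(\pi)$ is a mixture over $x \sim D$ of these conditional distributions, so the mixture-convexity axiom together with law-invariance gives

$$\rho\big(D_{\mathrm{Marg}}(\pi)\big) \le \mathbb{E}_{x \sim D}\Big[ \rho\big(D_{\mathrm{Marg}}(\pi \mid x)\big) \Big],$$

so controlling the expected conditional risk (the quantity delivered by online-to-batch conversion) also controls the marginal risk from above.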
Conditional risk alternative  For conditional risk there is a plausible alternative definition. Equation (1) is defined by averaging the per-action risk over the policy action distribution, but another quantity of interest is the risk measure of the complete conditional (on context) action distribution. Due to coherence of the risk measure, the definition in equation (1) upper bounds this alternative,

$$\mathbb{E}_{a_t}\!\left[ \rho\big((l_t)_{a_t}\big) \,\big|\, x_t \right] \ge \rho\big(D(l_t, a_t \mid x_t)\big),$$

where $D(l_t, a_t \mid x_t)$ is the joint distribution of the action and loss under the algorithm's conditional action distribution. Fortunately, unlike the marginal vs. conditional case, this is tight because we are competing with the best single action and the bound is tight for degenerate distributions. Thus optimizing our regret also controls the risk measure of the complete conditional action distribution.
3 Algorithms
Proofs are elided to the supplemental material. The proof technique has useful generality, e.g., it enables the use of an approximate minimizer in the continuous case.
By using the Estimation-to-Decision framework, we derive the resulting algorithm, which is the first risk-averse
contextual bandit with an online guarantee. We present two versions for finite and infinite action sets.
3.1 Finite Action Set
Algorithm 1 states the finite action version of our algorithm. It is the SquareCB algorithm [Foster and Rakhlin, 2020]
instantiated with an expectile loss regression oracle.
Theorem 3.1. Algorithm 1 guarantees $\mathrm{Reg}_{\mathrm{CB}}(T) \le O\!\left(\tfrac{1}{\theta}\sqrt{|\mathcal{A}|\, T\, \mathrm{Reg}_{\mathrm{EVaR}_q}(T)}\right)$, where $\theta = \min(q, 1-q)$.
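Putting the pieces together, below is a sketch of Algorithm 1 in Python. We take the $\mathrm{AL}$ sampler to be the inverse-gap-weighting distribution of SquareCB [Foster and Rakhlin, 2020], which this section identifies as the underlying algorithm; the feature map `featurize`, the learning-rate schedule `gamma_schedule`, and the oracle interface are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def inverse_gap_weights(preds, ahat, gamma):
    """SquareCB-style exploration over a finite action set: each action's
    probability decays with its estimated-risk gap to the greedy action ahat."""
    preds = np.asarray(preds, dtype=float)
    K = len(preds)
    p = 1.0 / (K + gamma * (preds - preds[ahat]))
    p[ahat] = 0.0
    p[ahat] = 1.0 - p.sum()          # remaining mass goes to the greedy action
    return p

def run_algorithm1(env, oracle, featurize, T, gamma_schedule,
                   rng=np.random.default_rng(1)):
    """Sketch of Algorithm 1: expectile regression oracle + inverse gap weighting."""
    for t in range(T):
        x = env.context(t)
        feats = featurize(x)                     # shape (num_actions, dim), assumed
        preds = oracle.predict(feats)            # estimated q-expectiles per action
        ahat = int(np.argmin(preds))             # greedy (lowest estimated risk) action
        p = inverse_gap_weights(preds, ahat, gamma_schedule(t))
        a = int(rng.choice(len(preds), p=p))     # sample a_t ~ AL(f_hat_t, ahat_t)
        loss = env.loss(t, x, a)                 # observe l_t(a_t)
        oracle.update(feats[a], loss)            # online expectile regression update
```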