CONDITIONALLY RISK-AVERSE CONTEXTUAL BANDITS
Mónika Farsang
Vienna University of Technology
monika.farsang@tuwien.ac.at
Paul Mineiro
Microsoft Research
pmineiro@microsoft.com
Wangda Zhang
Microsoft Research
wangdazhang@microsoft.com
ABSTRACT
Contextual bandits with average-case statistical guarantees are inadequate in risk-averse situations because they might trade off degraded worst-case behaviour for better average performance. Designing a risk-averse contextual bandit is challenging because exploration is necessary but risk-aversion is sensitive to the entire distribution of rewards; nonetheless we exhibit the first risk-averse contextual bandit algorithm with an online regret guarantee. We conduct experiments in diverse scenarios where worst-case outcomes should be avoided, spanning dynamic pricing, inventory management, and self-tuning software, including a production exascale data processing system.
1 Introduction
Contextual bandits [Auer et al., 2002, Langford and Zhang, 2007] are a mature technology with numerous applications:
however, adoption has been most aggressive in recommendation scenarios [Bouneffouf and Rish, 2019], where the
worst-case outcome is user annoyance. At the other extreme are medical and defense scenarios where worst-case
outcomes are literally fatal. In between are scenarios of interest where bad outcomes are tolerable but should be avoided,
e.g., logistics; finance; and self-tuning software, where the term tail catastrophe highlights the inadequacy of average
case performance guarantees in real-world applications [Marcus et al., 2021]. These scenarios demand risk-aversion, i.e.,
decisions should sacrifice average performance in order to avoid worst-case outcomes, and incorporating risk-aversion
into contextual bandits would facilitate adoption. More generally, risk aversion is essential for making informed
decisions that align with the risk preferences of the decision maker by balancing the potential benefits and risks of a
particular action.
This paper solves risk-averse decision making for contextual bandits via reduction to regression, resulting in the
first risk-averse contextual bandit algorithm with an online regret guarantee. The regret guarantee applies over
adversarially chosen context sequences and includes the exploration choices made by the algorithm. The approach
utilizes arbitrary (online learnable) function classes and extends to infinite action spaces; introduces no computational
overhead relative to the risk-neutral setting; introduces statistical overhead directly related to the desired level of
risk-aversion, with no overhead in the risk-neutral limit; and composes with other innovations within the Decision-to-
Estimation framework [Foster et al., 2021], e.g., linear representability [Zhu and Mineiro, 2022].
We make the following contributions:
• We explain the problem setting (Section 2) with careful definitions which facilitate the application of theory and reveal the unique status of expectile loss.
• We state the resulting algorithms (Section 3), which arise via application of the Estimation-to-Decisions framework [Foster et al., 2021].
• We discuss the superior utility of expectile loss for algorithm design over the more commonly used risk measures VaR and CVaR (Sections 2 and 3).
• We provide experimental support for the technique via diverse scenarios (Section 4). Empirically, tail control is proportionally inexpensive relative to average-case degradation, justifying the criticism of average-case guarantees in the self-tuning software literature.
2 Problem Setting
This section contains tedious exposition, necessary because (i) this work draws heavily on results from mathematical
finance that cannot be presumed known by the general machine learning audience; and (ii) careful definitions are key to
our contribution. For the impatient reader wanting to skip directly to Section 3, we provide the following summary: use
expectile loss. The rest of this section answers the question "why?".
Contextual Bandits  We describe the contextual bandit problem, which proceeds over $T$ rounds. At each round $t \in [T]$, the learner receives a context $x_t \in \mathcal{X}$ (the context space), selects an action $a_t \in \mathcal{A}$ (the action space), and then observes a loss $l_t(a_t)$, where $l_t : \mathcal{A} \to [0,1]$ is the underlying loss function. We assume that for each round $t$, conditioned on $x_t$, $l_t$ is sampled from a distribution $P_{l_t}(\cdot \mid x_t)$. We allow both the contexts $x_1, \ldots, x_T$ and the distributions $P_{l_1}, \ldots, P_{l_T}$ to be selected in an arbitrary, potentially adaptive fashion based on the history.
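For concreteness, a minimal sketch of this interaction protocol follows; the environment interface (`env.context`, `env.loss`) and the `Learner` protocol are illustrative names introduced here, not notation from the paper.

```python
from typing import Any, Protocol

class Learner(Protocol):
    def act(self, x: Any) -> int: ...                            # choose a_t after seeing x_t
    def update(self, x: Any, a: int, loss: float) -> None: ...   # consume the observed l_t(a_t)

def run_contextual_bandit(env, learner: Learner, T: int) -> list:
    """One pass of the protocol: context -> action -> realized loss in [0, 1]."""
    losses = []
    for t in range(T):
        x_t = env.context(t)            # environment (possibly adversarial) reveals x_t
        a_t = learner.act(x_t)          # learner selects a_t, possibly at random
        loss = env.loss(t, x_t, a_t)    # only l_t(a_t) is observed (bandit feedback)
        learner.update(x_t, a_t, loss)
        losses.append(loss)
    return losses
```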
Risk Measures  In seminal work, Artzner et al. [1999] presented an axiomatic approach to measuring risk. A risk measure is a function which maps a random variable to $\mathbb{R} \cup \{\infty\}$ and obeys certain axioms such as normalization, translation contravariance, and monotonicity. Risk measures embed previous approaches to measuring risk: we refer interested readers to Meyfredi [2004].
Conditional Risk-Aversion When considering extensions of risk-averse bandit algorithms to the contextual setting,
two possible choices are apparent: marginal risk-aversion, corresponding to applying a risk measure to the distribution
of losses realized over the joint context-action distribution; and conditional risk-aversion, corresponding to computing a
risk measure on a per-context basis and then summing over encountered contexts. For now our focus is conditional
risk-aversion, but after introducing terminology, we revisit the relationship between these two at the end of this section.
Contextual Bandit Regret  Conditional risk-aversion motivates our definition of regret for finite action sets,

$$\mathrm{Reg}_{\mathrm{CB}}(T) \doteq \sum_{t=1}^{T} \mathbb{E}_{a_t}\!\left[ \rho\big((l_t)_{a_t}\big) - \min_{a} \rho\big((l_t)_{a}\big) \,\Big|\, x_t \right], \qquad (1)$$

where $\rho$ is a risk measure, and the expectation is with respect to (the algorithm's) action distribution; note $\rho$ is a function of the adversary's loss random variable and not the realization. For infinite action sets we use a smoothed regret criterion: instead of competing with the best action, we compete with any action distribution $Q$ with limited concentration $\left\|\frac{dQ}{d\mu}\right\|_\infty \le h^{-1}$ relative to a reference measure $\mu$,

$$\mathrm{Reg}_{\mathrm{CB}}^{(h,\mu)}(T) \doteq \sum_{t=1}^{T} \left( \mathbb{E}_{a_t}\!\left[ \rho\big((l_t)_{a_t}\big) \,\big|\, x_t \right] - \min_{Q \,:\, \left\|\frac{dQ}{d\mu}\right\|_\infty \le h^{-1}} \mathbb{E}_{a \sim Q}\!\left[ \rho\big((l_t)_{a}\big) \,\big|\, x_t \right] \right). \qquad (2)$$

Note the finite action regret is a special case, corresponding to the uniform reference measure $\mu$ and $h^{-1} = |\mathcal{A}|$. In practice $\mu$ is a hyperparameter while $h$ can be tuned using contextual bandit meta-learning: see the experiments for details.
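To see why the finite-action regret is a special case (a quick check added here for concreteness, not taken from the paper): with $\mu = \mathrm{Unif}(\mathcal{A})$ and $h^{-1} = |\mathcal{A}|$, the constraint becomes

$$\left\|\tfrac{dQ}{d\mu}\right\|_\infty \le |\mathcal{A}| \iff \forall a \in \mathcal{A} : Q(a) \le 1,$$

which every distribution $Q$ satisfies; in particular the point mass on the best action is feasible, so the inner minimum in equation (2) equals $\min_a \rho((l_t)_a)$ and equation (2) reduces to equation (1).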
Reduction to Regression  We attack the contextual bandit problem via reduction to regression, working with a user-specified class of regression functions $\mathcal{F} \subseteq (\mathcal{X} \times \mathcal{A} \to [0,1])$ that aims to estimate a risk measure $\rho$ of the conditional loss distribution. We make the following realizability assumption¹,

$$\forall a \in \mathcal{A},\ t \in [T] : \exists f^{\star} \in \mathcal{F} : f^{\star}(x_t, a) = \rho\big((l_t)_a\big),$$

i.e., our function class includes a function which correctly estimates the value of the risk measure arising from any action $a$ in context $x_t$. This constrains the adversary's choices, as $l_t$ must be consistent with realizability, but there are many random variables that achieve a particular risk value.
Motivation for EVaR  We describe additional desirable properties of a risk measure which ultimately determine our choice of risk measure. A law-invariant risk measure is invariant to transformations of the random variable that preserve the distribution of outcomes, i.e., is a function of distribution only [Kusuoka, 2001]. An elicitable risk measure can be defined as the minimum of the expectation of a loss function. Because our algorithm operates via reduction to regression, we require an elicitable risk measure. A coherent risk measure satisfies the additional axiom of convexity: coherence is desirable because it implies risk reduction from diversification. To avoid confusion, note the convexity of a risk measure is with respect to stochastic mixtures of random variables, i.e., $\forall t \in [0,1] : \rho(tX + (1-t)Y) \le t\,\rho(X) + (1-t)\,\rho(Y)$. For elicitable risk measures, this is a distinct property from the convexity of the elicitation loss.

¹Foster et al. [2020] demonstrate misspecification is tolerable, but we do not complicate the exposition here.
Ziegel [2016] shows the class of elicitable law-invariant coherent risk measures for real-valued random variables is precisely the expectile Value at Risk ($\mathrm{EVaR}_q$) for $q \in \left(0, \tfrac{1}{2}\right]$, defined as

$$\mathrm{EVaR}_q(D) = \operatorname*{arg\,min}_{\hat{v} \in [0,1]} \mathbb{E}_{v \sim D}\!\left[ (1-q)\big((v - \hat{v})^+\big)^2 + q\big((\hat{v} - v)^+\big)^2 \right], \qquad (3)$$

where $(x)^+ = \max(x, 0)$. This asymmetrical strongly convex loss encourages overprediction relative to the mean, implying infrequent large losses correspond to increased risk. A minimizer of equation (3) is called an expectile. Certain technical qualifications are necessary for the minimum to be achieved (bounded realization suffices). We refer to the elicitation loss function as expectile loss.
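To make equation (3) concrete, here is a minimal sketch that computes a sample $q$-expectile by iteratively reweighted averaging; this batch solver is our own illustration (the algorithm itself only needs the expectile loss inside a regression oracle), and the function name and tolerances are placeholders.

```python
import numpy as np

def expectile(values, q, tol=1e-10, max_iter=1000):
    """q-expectile of a sample: the minimizer of the expectile loss
    (1-q)*((v - vhat)^+)^2 + q*((vhat - v)^+)^2, found by iteratively
    reweighted averaging (a standard fixed-point scheme, not from the paper)."""
    v = np.asarray(values, dtype=float)
    vhat = v.mean()                              # q = 1/2 recovers the mean
    for _ in range(max_iter):
        w = np.where(v > vhat, 1.0 - q, q)       # losses above vhat weigh more when q < 1/2
        new = np.average(v, weights=w)           # first-order condition: weighted mean
        if abs(new - vhat) < tol:
            break
        vhat = new
    return vhat
```

For $q = \tfrac{1}{2}$ this recovers the sample mean, and as $q \to 0$ the expectile moves toward the largest observed loss, matching the risk-averse reading of equation (3).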
$\mathrm{EVaR}$ is less familiar to the machine learning community than $\mathrm{VaR}$ or $\mathrm{CVaR}$, but it is a popular risk measure in financial applications [Bellini and Di Bernardino, 2017], whose proponents champion the superior finite-sample guarantees induced by strong convexity [Rossello, 2022]. Waltrup et al. [2015] reveal connections between $\mathrm{EVaR}$ and the risk measures $\mathrm{VaR}$ and $\mathrm{CVaR}$; in particular they note that both $\mathrm{VaR}$ and $\mathrm{CVaR}$ can be computed from $\mathrm{EVaR}$.² See Section 3 for additional commentary.
When $q \in \left(\tfrac{1}{2}, 1\right)$, $\mathrm{EVaR}_q$ is risk-seeking. While not our focus, the analysis remains valid; therefore we state results in terms of $\min(q, 1-q)$.
Regression Oracle  We assume access to an online regression oracle $\mathrm{Alg}_{\mathrm{Reg}}$, which is an algorithm for sequential prediction under strongly convex losses using $\mathcal{F}$ as a benchmark class. More specifically, the oracle operates in the following protocol: at each round $t \in [T]$, the algorithm receives a context $x_t \in \mathcal{X}$ and makes a prediction $\hat{f}_t$, where $\hat{f}_t(x_t, a)$ is interpreted as the prediction for action $a$; it then observes an action $a_t \in \mathcal{A}$ and realized outcome $l_t(a_t) \in [0,1]$, and incurs the instantaneous expectile loss

$$g_t(\hat{f}_t) \doteq \left[ (1-q)\big((v - \hat{v})^+\big)^2 + q\big((\hat{v} - v)^+\big)^2 \right]_{v = l_t(a_t),\ \hat{v} = \hat{f}_t(x_t, a_t)}.$$

We assume $\mathrm{Alg}_{\mathrm{Reg}}$ guarantees that for any (potentially adaptively chosen) sequence $(x_t, a_t, l_t)_{t=1}^{T}$,

$$\sum_{t=1}^{T} \Big( g_t(\hat{f}_t) - g_t(f^{\star}) \Big) \le \mathrm{Reg}_{\mathrm{EVaR}_q}(T), \qquad (4)$$

for some (non-data-dependent) function $\mathrm{Reg}_{\mathrm{EVaR}_q}(T)$. Online regression is well studied, with many known algorithms in various cases: e.g., for linear $\mathcal{F}$ on the $d$-dimensional hypersphere, online Newton step achieves $\mathrm{Reg}_{\mathrm{EVaR}_q}(T) = O\!\left(\tfrac{d}{\min(q,1-q)} \log(T)\right)$ [Hazan et al., 2007]. Furthermore, for any finite $\mathcal{F}$ we can achieve $\mathrm{Reg}_{\mathrm{EVaR}_q}(T) = O\!\left(\tfrac{1}{\min(q,1-q)} \log |\mathcal{F}|\right)$ using Vovk's aggregation algorithm [Vovk, 1998]. Section 2.3 of Foster and Rakhlin [2020] has a more complete list of oracles.
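One concrete instantiation of $\mathrm{Alg}_{\mathrm{Reg}}$ might be online gradient descent on the expectile loss with a linear model over joint (context, action) features. The sketch below is our illustration under assumed feature and step-size choices; it is not the oracle used in the paper's experiments.

```python
import numpy as np

class ExpectileRegressionOracle:
    """A sketch of Alg_Reg: online linear regression under the expectile loss.
    predict() returns per-action risk estimates; update() takes one SGD step
    at the observed (x_t, a_t, l_t(a_t))."""
    def __init__(self, dim, q, lr=0.1):
        self.w = np.zeros(dim)
        self.q, self.lr = q, lr

    def predict(self, feats):
        # feats: array of shape (num_actions, dim); estimates clipped to [0, 1]
        # for action selection only (the update below uses the raw prediction).
        return np.clip(feats @ self.w, 0.0, 1.0)

    def update(self, feat, loss):
        # feat: feature vector of the played (x_t, a_t); loss: realized l_t(a_t)
        vhat = float(feat @ self.w)
        # gradient of (1-q)*((v - vhat)^+)^2 + q*((vhat - v)^+)^2 w.r.t. vhat
        grad = 2.0 * ((1.0 - self.q) * -max(loss - vhat, 0.0)
                      + self.q * max(vhat - loss, 0.0))
        self.w -= self.lr * grad * feat
```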
Optimization Oracle  We assume an approximate (possibly randomized) optimization oracle $\mathrm{Alg}_{\mathrm{Opt}} : \mathcal{F} \times \Delta(\mathcal{A}) \times \mathbb{R}_+ \to \Delta(\mathcal{A})$ which guarantees

$$\forall \hat{f} \in \mathcal{F} : \mathbb{E}_{\hat{a} \sim \mathrm{Alg}_{\mathrm{Opt}}(\hat{f}, \mu, \delta)}\!\left[ \mathbb{E}_{a \sim \mu}\!\left[ \max\big(0,\ \hat{f}(\hat{a}) - \hat{f}(a)\big) \right] \right] \le \delta,$$

i.e., given an (estimated reward) function $\hat{f}$ the optimization oracle can find an approximate minimizer $\hat{a}$ w.r.t. the reference measure $\mu$. For finite action sets we can compute $\mathrm{Alg}_{\mathrm{Opt}}$ in $O(|\mathcal{A}|)$ for all $\mu$ with $\delta = 0$. For infinite action sets we can compute $\mathrm{Alg}_{\mathrm{Opt}}$ with high probability via the empirical argmin over $O\!\left(\tfrac{1}{\delta}\right)$ i.i.d. samples from $\mu$, independent of the cardinality or dimensionality of the action space. Of course specific function classes may admit superior customized strategies.
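The infinite-action construction above also admits a direct sketch: draw roughly $1/\delta$ i.i.d. actions from the reference measure $\mu$ and return the empirical argmin under $\hat{f}$. The sampler interface (`sample_mu`) and constants are assumptions on our part; for finite action sets one would simply scan all $|\mathcal{A}|$ actions with $\delta = 0$.

```python
import math
import random

def alg_opt(f_hat, sample_mu, delta, rng=random.Random(0)):
    """Approximate optimization oracle Alg_Opt(f_hat, mu, delta) for infinite
    action sets: empirical argmin of f_hat over ceil(1/delta) i.i.d. draws
    from the reference measure mu.  `sample_mu(rng)` is an assumed
    user-supplied sampler returning one action; requires delta > 0."""
    n = math.ceil(1.0 / delta)
    candidates = [sample_mu(rng) for _ in range(n)]
    return min(candidates, key=f_hat)
```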
²The relationship involves differences, which induces ambiguous curvature and is therefore not viable for incorporating $\mathrm{VaR}$ or $\mathrm{CVaR}$ into decision-to-estimation.
Algorithm 1 Finite Action Set
1: for $t = 1, 2, \ldots, T$ do
2:   Receive context $x_t$.
3:   $\hat{f}_t \leftarrow \mathrm{Alg}_{\mathrm{Reg}}.\mathrm{predict}(x_t)$.
4:   $\hat{a}_t \leftarrow \mathrm{Alg}_{\mathrm{Opt}}(\hat{f}_t, 0)$.
5:   Sample $a_t \sim \mathrm{AL}(\hat{f}_t, \hat{a}_t)$.
6:   Play $a_t$ and observe loss $l_t(a_t)$.
7:   Call $\mathrm{Alg}_{\mathrm{Reg}}.\mathrm{update}(x_t, a_t, l_t)$.

Algorithm 2 Infinite Action Set
1: for $t = 1, 2, \ldots, T$ do
2:   Receive context $x_t$.
3:   $\hat{f}_t \leftarrow \mathrm{Alg}_{\mathrm{Reg}}.\mathrm{predict}(x_t)$.
4:   $\hat{a}_t \leftarrow \mathrm{Alg}_{\mathrm{Opt}}\!\left(\hat{f}_t, \frac{1}{4\theta\gamma}\right)$.
5:   Sample $a_t \sim \mathrm{Cont\text{-}AL}(\hat{f}_t, \hat{a}_t)$.
6:   Play $a_t$ and observe loss $l_t(a_t)$.
7:   Call $\mathrm{Alg}_{\mathrm{Reg}}.\mathrm{update}(x_t, a_t, l_t)$.

Figure 1: (Left) Finite action set with exact optimization oracle. (Right) Infinite action set with approximate optimization oracle. Hyperparameters $\mu$ and $h$ are elided to facilitate comparison.
Marginal vs. conditional, revisited  Now consider an oblivious stationary environment where $(x, l)$ is drawn from a fixed joint distribution $D$: further assume a law-invariant risk measure to ease exposition, i.e., assume $\rho$ is a function of distribution only. Marginal risk-aversion regret for a policy $\pi : \mathcal{X} \to P(\mathcal{A})$ over a policy class $\Pi$ is defined as

$$\mathrm{Reg}_{\mathrm{Marg}}(\pi) \doteq \rho\big(D_{\mathrm{Marg}}(\pi)\big) - \min_{\pi^{\star} \in \Pi} \rho\big(D_{\mathrm{Marg}}(\pi^{\star})\big),$$

where $D_{\mathrm{Marg}}(\pi)$ is defined via

$$\mathbb{E}_{z \sim D_{\mathrm{Marg}}(\pi)}[f(z)] \doteq \mathbb{E}_{\substack{(x,l) \sim D \\ a \sim \pi(x)}}[f(l_a)].$$

Marginal risk-aversion does not correspond to the expectation of a per-context function, because the risk measure is a function of the complete distribution. Thus, if we apply an online-to-batch conversion to a conditional risk-aversion regret guarantee, we end up with a regret guarantee with respect to the expected conditional risk under $D$ rather than the marginal risk. For coherent risk measures, minimizing expected conditional risk minimizes an upper bound on marginal risk, which is sensible. However this is unlike the risk-neutral setting, where an adversarial guarantee provides a tight stochastic guarantee. In financial parlance, an algorithm designed for the stochastic case could benefit from diversification opportunities across contexts. However, conditional risk-aversion is the appropriate metric for scenarios where re-distributing risk across contexts is not acceptable, e.g., software quality-of-service guarantees where the contexts are customers.
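To spell out the "upper bound" step (an informal check added here, not an argument from the paper): writing $D_{\mathrm{Marg}}(\pi \mid x)$ for the conditional loss distribution at a fixed context $x$ with $a \sim \pi(x)$, the marginal distribution $D_{\mathrm{Marg}}(\pi)$ is a mixture over $x \sim D$ of these conditional distributions, so the mixture-convexity axiom together with law-invariance gives

$$\rho\big(D_{\mathrm{Marg}}(\pi)\big) \le \mathbb{E}_{x \sim D}\Big[ \rho\big(D_{\mathrm{Marg}}(\pi \mid x)\big) \Big],$$

so controlling the expected conditional risk (the quantity delivered by online-to-batch conversion) also controls the marginal risk from above.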
Conditional risk alternative  For conditional risk there is a plausible alternative definition. Equation (1) is defined by averaging the per-action risk over the policy action distribution, but another quantity of interest is the risk measure of the complete conditional (on context) action distribution. Due to coherence of the risk measure, the definition in equation (1) upper bounds this alternative,

$$\mathbb{E}_{a_t}\!\left[ \rho\big((l_t)_{a_t}\big) \,\big|\, x_t \right] \ge \rho\big(D(l_t, a_t \mid x_t)\big),$$

where $D(l_t, a_t \mid x_t)$ is the joint distribution of the action and loss under the algorithm's conditional action distribution. Fortunately, unlike the marginal vs. conditional case, this is tight because we are competing with the best single action and the bound is tight for degenerate distributions. Thus optimizing our regret also controls the risk measure of the complete conditional action distribution.
3 Algorithms
Proofs are elided to the supplemental material. The proof technique has useful generality, e.g., it enables the use of an approximate minimizer in the continuous case.
By using the Estimation-to-Decision framework, we derive the resulting algorithm, which is the first risk-averse
contextual bandit with an online guarantee. We present two versions for finite and infinite action sets.
3.1 Finite Action Set
Algorithm 1 states the finite action version of our algorithm. It is the SquareCB algorithm [Foster and Rakhlin, 2020]
instantiated with an expectile loss regression oracle.
Theorem 3.1. Algorithm 1 guarantees $\mathrm{Reg}_{\mathrm{CB}}(T) \le O\!\left(\tfrac{1}{\theta}\sqrt{|\mathcal{A}|\, T\, \mathrm{Reg}_{\mathrm{EVaR}_q}(T)}\right)$, where $\theta = \min(q, 1-q)$.
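Putting the pieces together, below is a sketch of Algorithm 1 in Python. We take the $\mathrm{AL}$ sampler to be the inverse-gap-weighting distribution of SquareCB [Foster and Rakhlin, 2020], which this section identifies as the underlying algorithm; the feature map `featurize`, the learning-rate schedule `gamma_schedule`, and the oracle interface are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def inverse_gap_weights(preds, ahat, gamma):
    """SquareCB-style exploration over a finite action set: each action's
    probability decays with its estimated-risk gap to the greedy action ahat."""
    preds = np.asarray(preds, dtype=float)
    K = len(preds)
    p = 1.0 / (K + gamma * (preds - preds[ahat]))
    p[ahat] = 0.0
    p[ahat] = 1.0 - p.sum()          # remaining mass goes to the greedy action
    return p

def run_algorithm1(env, oracle, featurize, T, gamma_schedule,
                   rng=np.random.default_rng(1)):
    """Sketch of Algorithm 1: expectile regression oracle + inverse gap weighting."""
    for t in range(T):
        x = env.context(t)
        feats = featurize(x)                     # shape (num_actions, dim), assumed
        preds = oracle.predict(feats)            # estimated q-expectiles per action
        ahat = int(np.argmin(preds))             # greedy (lowest estimated risk) action
        p = inverse_gap_weights(preds, ahat, gamma_schedule(t))
        a = int(rng.choice(len(preds), p=p))     # sample a_t ~ AL(f_hat_t, ahat_t)
        loss = env.loss(t, x, a)                 # observe l_t(a_t)
        oracle.update(feats[a], loss)            # online expectile regression update
```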