
2 Problem Setting
This section contains tedious exposition, necessary because (i) this work draws heavily on results from mathematical
finance that cannot be presumed known by the general machine learning audience; and (ii) careful definitions are key to
our contribution. For the impatient reader wanting to skip directly to Section 3, we provide the following summary: use
expectile loss. The rest of this section answers the question "why?".
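For concreteness (the standard definition, recalled here for the impatient reader; Section 3 supplies the details): the expectile loss at level $\tau \in (0,1)$ is the asymmetric squared loss, which for a prediction $m$ and outcome $y$ is
$$\ell_\tau(m; y) = \big|\tau - \mathbf{1}[y \le m]\big|\,(y - m)^2,$$
whose minimizer in expectation is the $\tau$-expectile of $y$.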
Contextual Bandits We describe the contextual bandit problem, which proceeds over $T$ rounds. At each round $t \in [T]$, the learner receives a context $x_t \in \mathcal{X}$ (the context space), selects an action $a_t \in \mathcal{A}$ (the action space), and then observes a loss $l_t(a_t)$, where $l_t : \mathcal{A} \to [0,1]$ is the underlying loss function. We assume that for each round $t$, conditioned on $x_t$, $l_t$ is sampled from a distribution $P_{l_t}(\cdot \mid x_t)$. We allow both the contexts $x_1, \ldots, x_T$ and the distributions $P_{l_1}, \ldots, P_{l_T}$ to be selected in an arbitrary, potentially adaptive fashion based on the history.
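To make the interaction protocol concrete, here is a minimal sketch in Python; the environment, the placeholder policy, and all names are our own illustration of the protocol, not part of the paper.

```python
# Minimal sketch of the contextual bandit protocol described above.
import numpy as np

rng = np.random.default_rng(0)
T, n_actions = 100, 5

for t in range(T):
    x_t = rng.normal(size=3)           # context x_t in X (here R^3, for illustration)
    a_t = rng.integers(n_actions)      # learner's action a_t in A (placeholder policy)
    # The environment draws a loss function l_t ~ P_{l_t}(. | x_t) and reveals
    # only the scalar l_t(a_t) in [0,1]; losses of unplayed actions stay hidden
    # (the bandit feedback constraint).
    l_t = rng.uniform(size=n_actions)  # stand-in for the sampled loss function
    observed_loss = l_t[a_t]
```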
Risk Measures In seminal work, Artzner et al. [1999] presented an axiomatic approach to measuring risk. A risk measure is a function which maps a random variable to $\mathbb{R} \cup \{\infty\}$ and obeys certain axioms such as normalization, translation contravariance, and monotonicity. Risk measures subsume previous approaches to measuring risk: we refer the interested reader to Meyfredi [2004].
Conditional Risk-Aversion When considering extensions of risk-averse bandit algorithms to the contextual setting, two possible choices are apparent: marginal risk-aversion, which applies a risk measure to the distribution of losses realized over the joint context-action distribution; and conditional risk-aversion, which computes a risk measure on a per-context basis and then sums over encountered contexts. Our focus is conditional risk-aversion; we revisit the relationship between the two at the end of this section, once the necessary terminology is in place.
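To make the distinction concrete, here is a small numerical sketch (our own illustration; the empirical 90th-percentile stand-in for $\rho$ is purely illustrative and not the risk measure the paper ultimately adopts).

```python
# Contrast marginal vs. conditional risk-aversion on simulated per-context
# loss samples. The risk measure rho is an illustrative stand-in.
import numpy as np

rng = np.random.default_rng(0)
rho = lambda losses: np.quantile(losses, 0.9)

# losses_by_context[i] = loss realizations observed in context i
losses_by_context = [rng.uniform(size=50) for _ in range(10)]

# Marginal: apply rho once to the pooled loss distribution.
marginal_risk = rho(np.concatenate(losses_by_context))

# Conditional: apply rho per context, then sum over encountered contexts.
conditional_risk = sum(rho(losses) for losses in losses_by_context)
```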
Contextual Bandit Regret Conditional risk-aversion motivates our definition of regret for finite action sets,
$$\mathrm{Reg}_{\mathrm{CB}}(T) \doteq \sum_{t=1}^{T} \mathbb{E}_{a_t}\!\left[\, \rho\big((l_t)_{a_t}\big) - \min_{a} \rho\big((l_t)_a\big) \,\Big|\, x_t \right], \qquad (1)$$
where $\rho$ is a risk measure, and the expectation is with respect to (the algorithm's) action distribution; note $\rho$ is a function of the adversary's loss random variable and not the realization. For infinite action sets we use a smoothed regret criterion: instead of competing with the best action, we compete with any action distribution $Q$ with limited concentration $\frac{dQ}{d\mu} \leq h^{-1}$ relative to a reference measure $\mu$,
$$\mathrm{Reg}_{\mathrm{CB}}^{(h,\mu)}(T) \doteq \sum_{t=1}^{T} \left( \mathbb{E}_{a_t}\!\left[ \rho\big((l_t)_{a_t}\big) \,\Big|\, x_t \right] - \min_{Q \,:\, \frac{dQ}{d\mu} \leq h^{-1}} \mathbb{E}_{a \sim Q}\!\left[ \rho\big((l_t)_a\big) \,\Big|\, x_t \right] \right). \qquad (2)$$
Note the finite action regret is a special case, corresponding to the uniform reference measure $\mu$ and $h^{-1} = |\mathcal{A}|$. In practice $\mu$ is a hyperparameter, while $h$ can be tuned using contextual bandit meta-learning: see the experiments for details.
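To connect (1) and (2), the following sketch (our own illustration; all names are hypothetical) computes the per-round smoothed comparator for a finite action set with uniform $\mu$: the minimizing $Q$ greedily places the maximum allowed mass on the lowest-risk actions, and with $h^{-1} = |\mathcal{A}|$ the cap reaches one, recovering the single best action of (1).

```python
# Per-round smoothed comparator from definition (2), finite action set,
# uniform reference measure mu.
import numpy as np

def smoothed_comparator(risks, h):
    """min over Q with dQ/dmu <= 1/h of E_{a~Q}[risks[a]], mu uniform."""
    K = len(risks)
    cap = (1.0 / h) / K          # maximum probability Q may place on one action
    order = np.argsort(risks)    # fill lowest-risk actions first
    q = np.zeros(K)
    remaining = 1.0
    for a in order:
        q[a] = min(cap, remaining)
        remaining -= q[a]
    return float(q @ risks)

risks = np.array([0.9, 0.2, 0.5, 0.7])
# With h^{-1} = |A| the cap is 1, so we recover min_a risks[a]: the finite-action case.
assert smoothed_comparator(risks, h=1 / len(risks)) == risks.min()
```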
Reduction to Regression We attack the contextual bandit problem via reduction to regression, working with a user-specified class of regression functions $\mathcal{F} \subseteq (\mathcal{X} \times \mathcal{A} \to [0,1])$ that aims to estimate a risk measure $\rho$ of the conditional loss distribution. We make the following realizability assumption¹:
$$\forall a \in \mathcal{A},\ t \in [T] : \exists f^* \in \mathcal{F} : f^*(x_t, a) = \rho\big((l_t)_a\big),$$
i.e., our function class includes a function which correctly estimates the value of the risk measure arising from any action $a$ in context $x_t$. This constrains the adversary's choices, as $l_t$ must be consistent with realizability, but there are many random variables that achieve a particular risk value.
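Given the section summary's prescription ("use expectile loss"), one natural instantiation of the regression step is asymmetric least squares; the sketch below is our own illustration (the linear model, feature map, and training loop are assumptions, not the paper's algorithm).

```python
# Hedged sketch: estimating the tau-expectile of the conditional loss via
# reduction to regression with the asymmetric squared (expectile) loss.
import numpy as np

def expectile_loss_grad(pred, y, tau):
    """Gradient of |tau - 1[y <= pred]| * (y - pred)^2 with respect to pred."""
    w = np.where(y <= pred, 1.0 - tau, tau)  # asymmetric weight
    return -2.0 * w * (y - pred)

def fit_risk_regressor(phi, y, tau=0.9, lr=0.05, epochs=500):
    """phi: feature map of (context, action) pairs; y: observed losses l_t(a_t)."""
    theta = np.zeros(phi.shape[1])
    for _ in range(epochs):
        grad = expectile_loss_grad(phi @ theta, y, tau)
        theta -= lr * (phi.T @ grad) / len(y)
    return theta  # f(x, a) = phi(x, a) @ theta estimates rho((l_t)_a)
```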
Motivation for EVaR We describe additional desirable properties of a risk measure which ultimately determine our choice of risk measure. A law-invariant risk measure is invariant to transformations of the random variable that preserve the distribution of outcomes, i.e., it is a function of the distribution only [Kusuoka, 2001]. An elicitable risk measure can be defined as the minimizer of the expectation of a loss function. Because our algorithm operates via reduction to regression, we require an elicitable risk measure. A coherent risk measure satisfies the additional axiom of convexity: coherence is desirable because it implies risk reduction from diversification. To avoid confusion, note the convexity of a risk measure
¹Foster et al. [2020] demonstrate that misspecification is tolerable, but we do not complicate the exposition here.
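To ground the elicitability definition above, the following sketch (our own illustration) checks numerically that the $\tau$-expectile arises as the minimizer of the expected asymmetric squared loss.

```python
# Numerical check of elicitability: the tau-expectile minimizes the
# expected asymmetric squared loss. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(size=10_000)
tau = 0.9

def expected_loss(m):
    w = np.where(y <= m, 1.0 - tau, tau)
    return np.mean(w * (y - m) ** 2)

grid = np.linspace(0.0, 10.0, 2001)
m_star = grid[np.argmin([expected_loss(m) for m in grid])]
# m_star solves the first-order condition
#   tau * E[(Y - m)_+] = (1 - tau) * E[(m - Y)_+],
# the defining equation of the tau-expectile.
```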