MAXIMUM ENTROPY EXPLORATION IN CONTEXTUAL BANDITS
WITH NEURAL NETWORKS AND ENERGY BASED MODELS
Adam Elwood, Marco Leonardi, Ashraf Mohamed, Alessandro Rozza
lastminute.com group
Vicolo de Calvi, 2, Chiasso, Switzerland
ABSTRACT
Contextual bandits can solve a huge range of real-world problems. However, current popular
algorithms to solve them either rely on linear models, or unreliable uncertainty estimation in non-
linear models, which are required to deal with the exploration-exploitation trade-off. Inspired by
theories of human cognition, we introduce novel techniques that use maximum entropy exploration,
relying on neural networks to find optimal policies in settings with both continuous and discrete
action spaces. We present two classes of models, one with neural networks as reward estimators, and
the other with energy based models, which model the probability of obtaining an optimal reward
given an action. We evaluate the performance of these models in static and dynamic contextual bandit
simulation environments. We show that both techniques outperform well-known standard algorithms,
with energy based models having the best overall performance. This provides practitioners with
new techniques that perform well in static and dynamic settings, and are particularly well suited to
non-linear scenarios with continuous action spaces.
Keywords energy based · maximum entropy · contextual bandits · neural networks
1 Introduction
In recent years, machine learning has been applied to solve a large array of concrete scientific and business problems
(Silver et al., 2016; Portugal et al., 2018; Sarker, 2021). The rapid advancements have mainly been due to the increased
access to large datasets and computing resources. However, many real-world scenarios require online decision making. They generally do not come with readily available datasets that cover the phase space in question; instead, the data must be collected as decisions are made. These kinds of problems generically come under the banner of reinforcement learning, in which a sequential series of actions must be made in an environment and previous decisions influence future ones.
One class of reinforcement learning problem that is particularly relevant to modern technology businesses is known as
the contextual bandit, an extension of the multi-armed bandit problem (Bouneffouf et al., 2020). In contextual bandit
algorithms, actions must be chosen given the state of the system, which is specified by its context. Actions are chosen
so as to maximise the total reward over time. The result of a particular action is obtained immediately and can be used
to inform future decisions. For optimal performance, these actions should be chosen to trade-off exploration of phase
space with exploitation of the most rewarding behaviour. Contextual bandits are relevant in many business applications,
such as dynamic pricing and recommender systems.
There are lots of machine learning models capable of making predictions about a reward given an input action and
context. Artificial neural networks (NNs) are one of the most popular choices. However, these models are typically
brittle, in that they still give confident answers outside of the data distribution they have been trained on, where they
are likely to be wrong. A policy for choosing actions in a contextual bandit scenario therefore needs an exploration
component added on top of the underlying reward estimator.
One approach to the above issue is to estimate the uncertainties in the predictions made by the neural network. Actions can then be chosen via Thompson sampling, a Bayesian methodology for making sequential decisions under uncertainty (Thompson, 1933; Agrawal and Goyal, 2013; Gopalan et al., 2014). However, finding accurate and efficient ways of estimating these uncertainties remains challenging.
Another approach is maximum entropy exploration, sometimes known as Active Inference or Boltzmann exploration.
This is also popular in Neuroscience as a model of the way the human brain works (Friston et al., 2006; Friston, 2009, 2010; Brown and Friston, 2012; Adams et al., 2013; Schwartenbeck et al., 2013; Marković et al., 2021; Smith et al.,
2022). In maximum entropy exploration, a policy is built that maintains a high entropy over the action space, ensuring
it tries lots of different actions, while still aiming for the best possible reward. This has been introduced for contextual
bandit problems with a discrete action space (Lee et al., 2020). In this work we extend this approach to work with a
continuous action space.
Energy Based Models (EBMs) are particularly well suited to maximum entropy exploration, due to the close relationship
of EBMs with Boltzmann distributions (Levine, 2018). While straightforward neural networks trained with cross-
entropy or mean-squared-error losses can work well as reward estimators, they are prone to brittleness. Conversely,
EBMs naturally build uncertainty into their formalisation. Instead of giving a definite answer on the best action to play, energy based functions define a distribution over possible actions through the shape of the energy function. Actions can then be
found by sampling from this function with techniques based on Markov Chain Monte Carlo (MCMC). These types of
models have been considered in full reinforcement learning scenarios (Haarnoja et al., 2017; Du et al., 2019a). In this
work, we introduce a method to apply EBMs based on NNs to contextual bandit problems.
In this paper we introduce two new contextual bandit algorithms based on maximum entropy exploration. Both
algorithms are able to make decisions in continuous action spaces, a key use case that has not been studied as thoroughly
as discrete action spaces. Our main contributions can be summarised as follows:
• Introducing a technique for maximum entropy exploration with neural networks estimating rewards in contextual bandits with a continuous action space, sampling using Hamiltonian Monte Carlo;
• A novel algorithm that uses Energy Based Models based on neural networks to solve contextual bandit problems;
• Testing our algorithms in different simulation environments (including with dynamic environments), giving practitioners a guide to which algorithms to use in different scenarios.
2 RELATED WORK
As they are very relevant to many industry applications, contextual bandits have been widely studied, with many
different algorithms proposed, see for example (Bietti et al., 2021) and (Bouneffouf et al., 2020).
Many of the most successful algorithms rely on linear methods for interpreting the context, where it is easier to evaluate
output uncertainty (Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013). This is necessary, because the most
commonly applied exploration strategies, Thompson Sampling (Thompson, 1933) and the Upper Confidence Bound
(UCB) algorithm (Lai and Robbins, 1985), rely on keeping track of uncertainties and updating them as data are collected.
However, several techniques for non-linear contextual bandit algorithms have been proposed, using methods based
on neural networks with different approaches to predict uncertainties in the output (Riquelme et al., 2018; Zhou et al.,
2020; Zhang et al., 2020; Kassraie and Krause, 2022).
2.1 Entropy based exploration in contextual bandits
As an alternative to Thompson Sampling and UCB, in this work we focus on entropy based exploration strategies, with
an emphasis on their application to non-linear contextual bandit problems. This approach has been researched in the
reinforcement learning (Kaelbling et al., 1996; Sutton and Barto, 2018) and Multi Armed Bandit literature (Kuleshov
and Precup, 2014; Marković et al., 2021).
For the contextual bandit use case, non-linear maximum entropy based exploration with a discrete action space has
been considered by (Lee et al., 2020). In this case the non-linearity comes from neural networks, which are used to
estimate a reward.
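As a rough illustration of this kind of discrete-action approach, the sketch below samples actions from a Boltzmann (softmax) distribution over the reward estimates of a small feed-forward network. The network size, number of actions, and temperature are illustrative assumptions, not the exact setup of Lee et al. (2020).

```python
import torch

# Minimal sketch of discrete-action Boltzmann exploration: a feed-forward
# network estimates the reward of each action given the context, and actions
# are sampled from a softmax over those estimates. All sizes and the
# temperature below are illustrative assumptions.

N_ACTIONS = 5      # assumed number of discrete actions
CONTEXT_DIM = 8    # assumed context dimension
ALPHA = 0.1        # assumed exploration temperature

reward_net = torch.nn.Sequential(
    torch.nn.Linear(CONTEXT_DIM, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, N_ACTIONS),
)

def sample_action(context: torch.Tensor) -> int:
    """Sample an action from pi(a|s) proportional to exp(r_hat(s, a) / alpha)."""
    with torch.no_grad():
        r_hat = reward_net(context)                    # estimated reward per action
        probs = torch.softmax(r_hat / ALPHA, dim=-1)   # Boltzmann policy
    return int(torch.multinomial(probs, num_samples=1).item())

context = torch.randn(CONTEXT_DIM)   # a random context for demonstration
print("chosen action:", sample_action(context))
```

Higher temperatures flatten the softmax and so increase exploration; as the temperature goes to zero the policy becomes greedy with respect to the reward estimates.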
2.2 Energy based models in reinforcement learning
Many problems in machine learning, contextual bandits included, revolve around modelling a probability density function, $p(x)$ for $x \in \mathbb{R}^D$. These probability densities can always be expressed in the form of a scalar energy function $E_\theta(x)$ (LeCun et al., 2006):

$$p(x) = \frac{\exp(-E_\theta(x))}{\int_{x'} \exp(-E_\theta(x'))\, dx'} \quad (1)$$

which allows many machine learning problems to be reformulated as an energy based modelling task (Grathwohl et al., 2019). The difficulty with this reformulation comes in estimating the integral in the denominator of Eq. 1, which is usually intractable. However, if this difficulty can be overcome, a scalar function, $E_\theta(x)$, is learned which can be evaluated at any value of $x$, providing a fully generative model.
Another advantage of EBMs in a reinforcement learning setting is that sampling from them naturally leads to maximum entropy exploration (Levine, 2018). This has been applied to solve full reinforcement learning problems in both model-based (Du et al., 2019a) and model-free (Heess et al., 2013; Haarnoja et al., 2017) formulations. However, it has not yet been applied to specifically solve the contextual bandit problem.
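To illustrate the sampling step that such approaches rely on, the sketch below draws approximate samples from a neural energy function with unadjusted Langevin dynamics, one common MCMC-style technique; it needs only gradients of the energy, never the intractable normaliser of Eq. 1. The network, step size, and number of steps are illustrative assumptions rather than the configurations used in the cited works.

```python
import torch

# Minimal sketch: a neural network defines a scalar energy E_theta(x), and
# approximate samples from p(x) proportional to exp(-E_theta(x)) are drawn
# with unadjusted Langevin dynamics, which only needs gradients of the
# energy, not the intractable normaliser in Eq. 1.

DIM = 2            # illustrative dimensionality of x
STEP_SIZE = 1e-2   # assumed Langevin step size
N_STEPS = 100      # assumed number of MCMC steps

energy_net = torch.nn.Sequential(
    torch.nn.Linear(DIM, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

def langevin_sample(n_samples: int) -> torch.Tensor:
    """Draw approximate samples from exp(-E_theta(x)) via Langevin dynamics."""
    x = torch.randn(n_samples, DIM)                   # random initialisation
    for _ in range(N_STEPS):
        x = x.detach().requires_grad_(True)
        energy = energy_net(x).sum()
        grad = torch.autograd.grad(energy, x)[0]      # dE/dx
        noise = torch.randn_like(x)
        # Step downhill in energy plus Gaussian noise.
        x = x - 0.5 * STEP_SIZE * grad + (STEP_SIZE ** 0.5) * noise
    return x.detach()

samples = langevin_sample(n_samples=16)
print(samples.shape)  # torch.Size([16, 2])
```

In a contextual bandit setting, the natural analogue is to condition the energy on the observed context and sample over actions, which is the direction taken in the remainder of this paper.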
3 ALGORITHMS TO SOLVE CONTEXTUAL BANDIT PROBLEMS WITH MAXIMUM ENTROPY EXPLORATION
In this section we introduce two classes of algorithms for solving contextual bandit problems with NNs, using exploration strategies based on entropy maximisation. In each case the algorithm defines a policy, $\pi(a|s_i)$, which gives the probability of playing action $a$, given the observation of state $s_i$. The policy is applied by sampling actions from the policy, $a \sim \pi$, at a particular time step. This policy is then updated given the rewards observed in previous time steps by retraining the NNs.
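As a minimal sketch of what this retraining step could look like (the architecture, loss, optimiser, and schedule here are illustrative assumptions, not the paper's exact training procedure), a reward estimator taking the context and action as input can be refitted on the collected triplets with a mean-squared-error loss:

```python
import torch

# Minimal sketch (illustrative assumptions throughout): the reward estimator
# r_hat_theta(s, a) is periodically refitted on the triplets {s_i, a_i, r_i}
# collected so far, here with a mean-squared-error loss.

CONTEXT_DIM = 8

reward_net = torch.nn.Sequential(
    torch.nn.Linear(CONTEXT_DIM + 1, 64),   # input: context concatenated with action
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
optimiser = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def retrain(states: torch.Tensor, actions: torch.Tensor, rewards: torch.Tensor,
            n_epochs: int = 10) -> None:
    """Refit the reward estimator on all triplets observed in previous time steps."""
    inputs = torch.cat([states, actions.unsqueeze(1)], dim=1)
    for _ in range(n_epochs):
        optimiser.zero_grad()
        preds = reward_net(inputs).squeeze(1)
        loss = torch.nn.functional.mse_loss(preds, rewards)
        loss.backward()
        optimiser.step()

# Example usage with a small random batch of collected triplets.
states = torch.randn(32, CONTEXT_DIM)
actions = torch.rand(32)
rewards = torch.rand(32)
retrain(states, actions, rewards)
```

In practice this refit would be repeated at regular intervals as new triplets arrive, so that the policy always samples from an up-to-date reward estimate.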
3.1 Contextual bandit problem formulation
Contextual bandit problems require an algorithm to make the choice of an action, $a \in \mathcal{A}$ (where $\mathcal{A}$ is an action space), upon observing the context state of the environment, $s \in \mathcal{S}$ (where $\mathcal{S}$ is a context space). Upon making the action, a reward, $r \in \mathcal{R}$, is received. For each state observed, an action is chosen and the reward recorded. This results in a dataset, $X$, being built up over a run, consisting of triplets, $\{s_i, a_i, r_i\} \in X$, where $i \in \mathbb{N}$ is the time step of a particular triplet. At any step $i$, the data available for choosing the action $a_i$ consists of the set of triplets $\{s_j, a_j, r_j\}$ where $j < i$.
The goal of the problem is to maximise the expected reward over an indefinite time horizon, where an arbitrary number
of actions, N, can be played. This is usually measured in terms of the regret:
$$R_N = \sum_{i=0}^{N} \left[ r^*_i - r^{a_i}_i \right] \quad (2)$$

where $r^*_i$ is the best possible reward at the time step $i$ and $r^{a_i}_i$ is the reward at time step $i$ received by the action played, $a_i$. A more successful action choosing policy will have a lower regret.
In this work we assume $\mathcal{A} \subseteq \mathbb{R}$; $\mathcal{S} \subseteq \mathbb{R}^n$, where $n$ is the dimension of the context vector and depends on the particular problem being considered; and $\mathcal{R} \subseteq \mathbb{R}$, where many of the problems considered assume $\mathcal{R} \subseteq [0, 1]$.
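To make the formulation concrete, the following sketch plays through a run with a placeholder environment and a placeholder (uniform random) policy, accumulating the dataset $X$ of triplets and the regret of Eq. 2. The context generator, reward function, and policy are hypothetical stand-ins, not the simulation environments used in the experiments.

```python
import numpy as np

# Minimal sketch of a contextual bandit run with the notation of Section 3.1.
# The environment and policy below are hypothetical placeholders used only to
# illustrate how the dataset X of triplets {s_i, a_i, r_i} and the regret of
# Eq. 2 are accumulated.

rng = np.random.default_rng(0)
CONTEXT_DIM = 3
N_STEPS = 1000

def observe_context() -> np.ndarray:
    """Placeholder context generator: s_i in R^n."""
    return rng.uniform(size=CONTEXT_DIM)

def true_reward(s: np.ndarray, a: float) -> float:
    """Placeholder reward in [0, 1]: peaks when the action matches the mean context."""
    return float(np.exp(-10.0 * (a - s.mean()) ** 2))

def policy(s: np.ndarray) -> float:
    """Placeholder policy: a uniform random action in A = [0, 1]."""
    return float(rng.uniform())

X = []          # dataset of (s_i, a_i, r_i) triplets
regret = 0.0
for i in range(N_STEPS):
    s_i = observe_context()
    a_i = policy(s_i)
    r_i = true_reward(s_i, a_i)
    X.append((s_i, a_i, r_i))
    # Best possible reward r*_i, known here only because the simulator is explicit.
    best = max(true_reward(s_i, a) for a in np.linspace(0.0, 1.0, 101))
    regret += best - r_i            # Eq. 2, accumulated over the run

print(f"total regret after {N_STEPS} steps: {regret:.1f}")
```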
3.2 Maximum entropy exploration
First we define a reward estimator, $\hat{r}_\theta(s_i, a)$, which gives the expected reward of action $a$ in state $s_i$ and is parameterised by the vector $\theta$, similar to the approach in (Lee et al., 2020). In maximum entropy exploration, the policy is defined as follows:
$$\pi(a|s_i) = \arg\max_\pi \left( \mathbb{E}_{a \sim \pi}\left[\hat{r}_\theta(s_i, a)\right] + \alpha H(\pi) \right) \quad (3)$$

where $H(\pi) = \mathbb{E}_{a \sim \pi}\left[-\log(\pi)\right]$ is the Shannon entropy. This can then be solved with a softmax:

$$\pi(a|s_i) = \frac{e^{\hat{r}_\theta(a, s_i)/\alpha}}{\int_{a'} e^{\hat{r}_\theta(a', s_i)/\alpha}\, da'} \quad (4)$$
This approach finds a policy that trades off maximising the expected reward with a chosen action (the first term in Eq. 3) against trying a range of different actions, which gives a large Shannon entropy (the second term in Eq. 3). The degree of this trade-off is controlled by the $\alpha \in \mathbb{R}^+$ parameter, which is typically chosen to be at the same scale as the expected reward. Larger $\alpha$ values result in more exploration. Models for $\hat{r}_\theta$ should be chosen to have a fairly flat prior across the states and actions upon initialisation, which will encourage exploration in the early stages of a contextual bandit run.
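One simple way to sample from the policy of Eq. 4 when the action space is an interval such as $[0, 1]$ is to evaluate the reward estimator on a fine grid of actions and normalise with a softmax, as sketched below. The grid resolution, network, and $\alpha$ value are illustrative assumptions; the continuous-action algorithms introduced in this work instead sample the policy with Hamiltonian Monte Carlo.

```python
import torch

# Minimal sketch of the maximum entropy policy of Eq. 4 on a discretised
# action grid over A = [0, 1]. The reward estimator and alpha are illustrative
# assumptions; the algorithms introduced in this paper sample the continuous
# policy with Hamiltonian Monte Carlo rather than a grid.

CONTEXT_DIM = 8
ALPHA = 0.1
GRID = torch.linspace(0.0, 1.0, steps=256)   # discretisation of the action space

reward_net = torch.nn.Sequential(
    torch.nn.Linear(CONTEXT_DIM + 1, 64),    # input: context concatenated with action
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

def sample_action(context: torch.Tensor) -> float:
    """Sample a ~ pi(a|s) with pi proportional to exp(r_hat(s, a) / alpha)."""
    with torch.no_grad():
        ctx = context.unsqueeze(0).expand(GRID.shape[0], -1)  # repeat context per grid point
        inputs = torch.cat([ctx, GRID.unsqueeze(1)], dim=1)
        r_hat = reward_net(inputs).squeeze(1)                 # r_hat(s, a) on the grid
        probs = torch.softmax(r_hat / ALPHA, dim=0)           # Eq. 4, discretised
    idx = int(torch.multinomial(probs, num_samples=1).item())
    return float(GRID[idx])

context = torch.randn(CONTEXT_DIM)
print("sampled action:", sample_action(context))
```

As the grid is refined, the discretised distribution approaches the continuous policy of Eq. 4, but evaluating a dense grid at every step becomes expensive, which is one motivation for MCMC-based sampling in the continuous case.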