(Thompson, 1933; Agrawal and Goyal, 2013; Gopalan et al., 2014). However, finding accurate and efficient ways of estimating these uncertainties remains challenging.
Another approach is maximum entropy exploration, sometimes known as Active Inference or Boltzmann exploration.
This approach is also popular in neuroscience as a model of how the human brain works (Friston et al., 2006; Friston, 2009,
2010; Brown and Friston, 2012; Adams et al., 2013; Schwartenbeck et al., 2013; Marković et al., 2021; Smith et al.,
2022). In maximum entropy exploration, a policy is built that maintains high entropy over the action space, ensuring it explores a wide range of actions while still aiming for the best possible reward. This approach has been introduced for contextual bandit problems with a discrete action space (Lee et al., 2020). In this work we extend it to continuous action spaces.
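To make the idea concrete, in the discrete-action setting this amounts to drawing actions from a softmax (Boltzmann) distribution over estimated rewards. The sketch below is purely illustrative and is not the algorithm of Lee et al. (2020); the reward estimates and temperature are placeholder values.

```python
import numpy as np

def boltzmann_policy(reward_estimates, temperature=0.2):
    """Sample an action index from a softmax over estimated rewards.

    A high temperature keeps the policy close to uniform (more exploration,
    higher entropy); a low temperature concentrates on the greedy action.
    """
    logits = np.asarray(reward_estimates, dtype=float) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return np.random.choice(len(probs), p=probs)

# Example: estimated rewards for three discrete actions in some context.
action = boltzmann_policy([0.2, 0.5, 0.1])
```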
Energy Based Models (EBMs) are particularly well suited to maximum entropy exploration, due to the close relationship of EBMs with Boltzmann distributions (Levine, 2018). While straightforward neural networks trained with cross-entropy or mean-squared-error losses can work well as reward estimators, they are prone to brittleness. Conversely, EBMs naturally build uncertainty into their formalisation. Instead of giving a single definitive answer for the best action to play, an energy function defines a distribution over possible actions through its shape. Actions can then be found by sampling from this distribution with techniques based on Markov Chain Monte Carlo (MCMC). These types of models have been considered in full reinforcement learning scenarios (Haarnoja et al., 2017; Du et al., 2019a). In this work, we introduce a method to apply EBMs based on neural networks to contextual bandit problems.
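As a rough illustration of this idea (and not the specific algorithm introduced later in this paper), given a learned energy function one can draw actions for a context from the implied Boltzmann distribution with a simple random-walk Metropolis sampler; the energy function below is an arbitrary placeholder standing in for a neural network.

```python
import numpy as np

def sample_action_mcmc(energy_fn, context, n_steps=200, step_size=0.1, dim=1):
    """Draw an action from p(a | context) proportional to exp(-E(context, a))
    using random-walk Metropolis; in practice energy_fn would be a neural network."""
    action = np.zeros(dim)
    for _ in range(n_steps):
        proposal = action + step_size * np.random.randn(dim)
        # Accept with probability exp(E(current) - E(proposal)).
        log_accept = energy_fn(context, action) - energy_fn(context, proposal)
        if np.log(np.random.rand()) < log_accept:
            action = proposal
    return action

# Placeholder energy: low energy (high probability) near a context-dependent optimum.
toy_energy = lambda x, a: float(np.sum((a - 0.5 * x) ** 2))
chosen_action = sample_action_mcmc(toy_energy, context=np.array([1.0]))
```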
In this paper we introduce two new contextual bandit algorithms based on maximum entropy exploration. Both
algorithms are able to make decisions in continuous action spaces, a key use case that has not been studied as thoroughly
as discrete action spaces. Our main contributions can be summarised as follows:
• Introducing a technique for maximum entropy exploration with neural networks estimating rewards in contextual bandits with a continuous action space, sampling actions using Hamiltonian Monte Carlo (a generic sketch of such a sampler follows this list);
• A novel algorithm that uses Energy Based Models based on neural networks to solve contextual bandit problems;
• Testing our algorithms in different simulation environments (including dynamic environments), giving practitioners a guide to which algorithms to use in different scenarios.
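As referenced in the first contribution above, actions can be drawn with Hamiltonian Monte Carlo from a Boltzmann distribution over the estimated reward, p(a | x) proportional to exp(r(x, a) / alpha). The sketch below is a generic HMC sampler under that assumed target, not the exact algorithm presented later; the toy reward, its gradient, and all hyperparameters are illustrative (with a neural network estimator the action gradient would come from automatic differentiation).

```python
import numpy as np

def hmc_sample_action(reward_fn, grad_fn, context, alpha=0.1,
                      n_samples=50, n_leapfrog=10, step_size=0.05, dim=1):
    """Sample an action from p(a | x) proportional to exp(r(x, a) / alpha)
    with basic Hamiltonian Monte Carlo. grad_fn returns dr/da."""
    neg_log_p = lambda a: -reward_fn(context, a) / alpha
    grad_neg_log_p = lambda a: -grad_fn(context, a) / alpha

    action = np.zeros(dim)
    for _ in range(n_samples):
        momentum = np.random.randn(dim)                 # resample momentum
        a_new, p_new = action.copy(), momentum.copy()
        # Leapfrog integration of the Hamiltonian dynamics.
        p_new = p_new - 0.5 * step_size * grad_neg_log_p(a_new)
        for _ in range(n_leapfrog):
            a_new = a_new + step_size * p_new
            p_new = p_new - step_size * grad_neg_log_p(a_new)
        p_new = p_new + 0.5 * step_size * grad_neg_log_p(a_new)
        # Metropolis correction on the total energy (potential + kinetic).
        h_old = neg_log_p(action) + 0.5 * np.dot(momentum, momentum)
        h_new = neg_log_p(a_new) + 0.5 * np.dot(p_new, p_new)
        if np.log(np.random.rand()) < h_old - h_new:
            action = a_new
    return action

# Toy reward peaked at an action that depends on the context.
reward = lambda x, a: -float(np.sum((a - x) ** 2))
grad_reward = lambda x, a: -2.0 * (a - x)
chosen_action = hmc_sample_action(reward, grad_reward, context=np.array([0.3]))
```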
2 RELATED WORK
As they are very relevant to many industry applications, contextual bandits have been widely studied and many different algorithms have been proposed; see, for example, Bietti et al. (2021) and Bouneffouf et al. (2020).
Many of the most successful algorithms rely on linear methods for interpreting the context, where it is easier to evaluate
output uncertainty (Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013). This is necessary because the most
commonly applied exploration strategies, Thompson Sampling (Thompson, 1933) and the Upper Confidence Bound
(UCB) algorithm (Lai and Robbins, 1985), rely on keeping track of uncertainties and updating them as data are collected.
However, several techniques for non-linear contextual bandit algorithms have been proposed, using methods based
on neural networks with different approaches to predict uncertainties in the output (Riquelme et al., 2018; Zhou et al.,
2020; Zhang et al., 2020; Kassraie and Krause, 2022).
2.1 Entropy based exploration in contextual bandits
As an alternative to Thompson Sampling and UCB, in this work we focus on entropy based exploration strategies, with
an emphasis on their application to non-linear contextual bandit problems. This approach has been researched in the
reinforcement learning (Kaelbling et al., 1996; Sutton and Barto, 2018) and Multi Armed Bandit literature (Kuleshov
and Precup, 2014; Marković et al., 2021).
For the contextual bandit use case, non-linear maximum entropy based exploration with a discrete action space has been considered by Lee et al. (2020). In this case the non-linearity comes from neural networks, which are used to estimate the reward.
2.2 Energy based models in reinforcement learning
Many problems in machine learning, contextual bandits included, revolve around modelling a probability density function $p(x)$ for $x \in \mathbb{R}^D$. These probability densities can always be expressed in the form of a scalar energy function