MAXIMUM ENTROPY EXPLORATION IN CONTEXTUAL BANDITS
WITH NEURAL NETWORKS AND ENERGY BASED MODELS
Adam Elwood, Marco Leonardi, Ashraf Mohamed, Alessandro Rozza
lastminute.com group
Vicolo de Calvi, 2, Chiasso, Switzerland
ABSTRACT
Contextual bandits can solve a huge range of real-world problems. However, current popular
algorithms to solve them either rely on linear models, or unreliable uncertainty estimation in non-
linear models, which are required to deal with the exploration-exploitation trade-off. Inspired by
theories of human cognition, we introduce novel techniques that use maximum entropy exploration,
relying on neural networks to find optimal policies in settings with both continuous and discrete
action spaces. We present two classes of models, one with neural networks as reward estimators, and
the other with energy based models, which model the probability of obtaining an optimal reward
given an action. We evaluate the performance of these models in static and dynamic contextual bandit
simulation environments. We show that both techniques outperform well-known standard algorithms,
with energy based models having the best overall performance. This provides practitioners with
new techniques that perform well in static and dynamic settings, and are particularly well suited to
non-linear scenarios with continuous action spaces.
Keywords energy based · maximum entropy · contextual bandits · neural networks
1 Introduction
In recent years, machine learning has been applied to solve a large array of concrete scientific and business problems
(Silver et al., 2016; Portugal et al., 2018; Sarker, 2021). The rapid advancements have mainly been due to the increased
access to large datasets and computing resources. However, many real-world scenarios require online decision making. They generally do not come with readily available datasets that cover the phase space in question; instead, the data must be collected as decisions are made. These kinds of problems generically come under the banner of reinforcement learning, in which a sequential series of actions must be made in an environment and previous decisions influence future ones.
One class of reinforcement learning problem that is particularly relevant to modern technology businesses is known as
the contextual bandit, an extension of the multi-armed bandit problem (Bouneffouf et al., 2020). In contextual bandit
algorithms, actions must be chosen given the state of the system, which is specified by its context. Actions are chosen
so as to maximise the total reward over time. The result of a particular action is obtained immediately and can be used
to inform future decisions. For optimal performance, these actions should be chosen to trade-off exploration of phase
space with exploitation of the most rewarding behaviour. Contextual bandits are relevant in many business applications,
such as dynamic pricing and recommender systems.
There are lots of machine learning models capable of making predictions about a reward given an input action and
context. Artificial neural networks (NNs) are one of the most popular choices. However, these models are typically
brittle, in that they still give confident answers outside of the data distribution they have been trained on, where they
are likely to be wrong. A policy for choosing actions in a contextual bandit scenario therefore needs an exploration
component added on top of the underlying reward estimator.
One approach to the above issue is to estimate the uncertainties in the predictions made by the neural network. Actions can then be chosen via Thompson sampling, a Bayesian methodology for making sequential decisions under uncertainty (Thompson, 1933; Agrawal and Goyal, 2013; Gopalan et al., 2014). However, finding accurate and efficient ways of estimating these uncertainties remains challenging.
Another approach is maximum entropy exploration, sometimes known as Active Inference or Boltzmann exploration.
This is also popular in Neuroscience as a model of the way the human brain works (Friston et al., 2006; Friston, 2009, 2010; Brown and Friston, 2012; Adams et al., 2013; Schwartenbeck et al., 2013; Marković et al., 2021; Smith et al.,
2022). In maximum entropy exploration, a policy is built that maintains a high entropy over the action space, ensuring
it tries lots of different actions, while still aiming for the best possible reward. This has been introduced for contextual
bandit problems with a discrete action space (Lee et al., 2020). In this work we extend this approach to work with a
continuous action space.
Energy Based Models (EBMs) are particularly well suited to maximum entropy exploration, due to the close relationship
of EBMs with Boltzmann distributions (Levine, 2018). While straightforward neural networks trained with cross-
entropy or mean-squared-error losses can work well as reward estimators, they are prone to brittleness. Conversely,
EBMs naturally build uncertainty into their formalisation. Instead of giving a definite answer on the best action to play, energy based functions define a distribution over possible actions through the shape of the energy function. Actions can then be
found by sampling from this function with techniques based on Markov Chain Monte Carlo (MCMC). These types of
models have been considered in full reinforcement learning scenarios (Haarnoja et al., 2017; Du et al., 2019a). In this
work, we introduce a method to apply EBMs based on NNs to contextual bandit problems.
In this paper we introduce two new contextual bandit algorithms based on maximum entropy exploration. Both
algorithms are able to make decisions in continuous action spaces, a key use case that has not been studied as thoroughly
as discrete action spaces. Our main contributions can be summarised as follows:
• Introducing a technique for maximum entropy exploration with neural networks estimating rewards in contextual bandits with a continuous action space, sampling using Hamiltonian Monte Carlo;
• A novel algorithm that uses Energy Based Models based on neural networks to solve contextual bandit problems;
• Testing our algorithms in different simulation environments (including with dynamic environments), giving practitioners a guide to which algorithms to use in different scenarios.
2 RELATED WORK
As they are very relevant to many industry applications, contextual bandits have been widely studied, with many
different algorithms proposed, see for example (Bietti et al., 2021) and (Bouneffouf et al., 2020).
Many of the most successful algorithms rely on linear methods for interpreting the context, where it is easier to evaluate
output uncertainty (Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013). This is necessary, because the most
commonly applied exploration strategies, Thompson Sampling (Thompson, 1933) and the Upper Confidence Bound
(UCB) algorithm (Lai and Robbins, 1985), rely on keeping track of uncertainties and updating them as data are collected.
However, several techniques for non-linear contextual bandit algorithms have been proposed, using methods based
on neural networks with different approaches to predict uncertainties in the output (Riquelme et al., 2018; Zhou et al.,
2020; Zhang et al., 2020; Kassraie and Krause, 2022).
2.1 Entropy based exploration in contextual bandits
As an alternative to Thompson Sampling and UCB, in this work we focus on entropy based exploration strategies, with
an emphasis on their application to non-linear contextual bandit problems. This approach has been researched in the
reinforcement learning (Kaelbling et al., 1996; Sutton and Barto, 2018) and Multi Armed Bandit literature (Kuleshov
and Precup, 2014; Marković et al., 2021).
For the contextual bandit use case, non-linear maximum entropy based exploration with a discrete action space has
been considered by (Lee et al., 2020). In this case the non-linearity comes from neural networks, which are used to
estimate a reward.
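As a rough illustration of this kind of discrete-action approach, the sketch below samples actions from a Boltzmann (softmax) distribution over the reward estimates of a small feed-forward network. The network size, number of actions, and temperature are illustrative assumptions, not the exact setup of Lee et al. (2020).

```python
import torch

# Minimal sketch of discrete-action Boltzmann exploration: a feed-forward
# network estimates the reward of each action given the context, and actions
# are sampled from a softmax over those estimates. All sizes and the
# temperature below are illustrative assumptions.

N_ACTIONS = 5      # assumed number of discrete actions
CONTEXT_DIM = 8    # assumed context dimension
ALPHA = 0.1        # assumed exploration temperature

reward_net = torch.nn.Sequential(
    torch.nn.Linear(CONTEXT_DIM, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, N_ACTIONS),
)

def sample_action(context: torch.Tensor) -> int:
    """Sample an action from pi(a|s) proportional to exp(r_hat(s, a) / alpha)."""
    with torch.no_grad():
        r_hat = reward_net(context)                    # estimated reward per action
        probs = torch.softmax(r_hat / ALPHA, dim=-1)   # Boltzmann policy
    return int(torch.multinomial(probs, num_samples=1).item())

context = torch.randn(CONTEXT_DIM)   # a random context for demonstration
print("chosen action:", sample_action(context))
```

Higher temperatures flatten the softmax and so increase exploration; as the temperature goes to zero the policy becomes greedy with respect to the reward estimates.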
2.2 Energy based models in reinforcement learning
Many problems in machine learning, contextual bandits included, revolve around modelling a probability density function, $p(x)$ for $x \in \mathbb{R}^D$. These probability densities can always be expressed in the form of a scalar energy function $E_\theta(x)$ (LeCun et al., 2006):

$$p(x) = \frac{\exp(-E_\theta(x))}{\int_{x'} \exp(-E_\theta(x'))\, dx'} \quad (1)$$

which allows many machine learning problems to be reformulated as an energy based modelling task (Grathwohl et al., 2019). The difficulty with this reformulation comes in estimating the integral in the denominator of Eq. 1, which is usually intractable. However, if this difficulty can be overcome, a scalar function, $E_\theta(x)$, is learned which can be evaluated at any value of $x$, providing a fully generative model.
Another advantage of EBMs in a reinforcement learning setting is that sampling from them naturally leads to maximum entropy exploration (Levine, 2018). This has been applied to solve full reinforcement learning problems in both model-based (Du et al., 2019a) and model-free (Heess et al., 2013; Haarnoja et al., 2017) formulations. However, it has not yet been applied to specifically solve the contextual bandit problem.
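To illustrate the sampling step that such approaches rely on, the sketch below draws approximate samples from a neural energy function with unadjusted Langevin dynamics, one common MCMC-style technique; it needs only gradients of the energy, never the intractable normaliser of Eq. 1. The network, step size, and number of steps are illustrative assumptions rather than the configurations used in the cited works.

```python
import torch

# Minimal sketch: a neural network defines a scalar energy E_theta(x), and
# approximate samples from p(x) proportional to exp(-E_theta(x)) are drawn
# with unadjusted Langevin dynamics, which only needs gradients of the
# energy, not the intractable normaliser in Eq. 1.

DIM = 2            # illustrative dimensionality of x
STEP_SIZE = 1e-2   # assumed Langevin step size
N_STEPS = 100      # assumed number of MCMC steps

energy_net = torch.nn.Sequential(
    torch.nn.Linear(DIM, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

def langevin_sample(n_samples: int) -> torch.Tensor:
    """Draw approximate samples from exp(-E_theta(x)) via Langevin dynamics."""
    x = torch.randn(n_samples, DIM)                   # random initialisation
    for _ in range(N_STEPS):
        x = x.detach().requires_grad_(True)
        energy = energy_net(x).sum()
        grad = torch.autograd.grad(energy, x)[0]      # dE/dx
        noise = torch.randn_like(x)
        # Step downhill in energy plus Gaussian noise.
        x = x - 0.5 * STEP_SIZE * grad + (STEP_SIZE ** 0.5) * noise
    return x.detach()

samples = langevin_sample(n_samples=16)
print(samples.shape)  # torch.Size([16, 2])
```

In a contextual bandit setting, the natural analogue is to condition the energy on the observed context and sample over actions, which is the direction taken in the remainder of this paper.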
3 ALGORITHMS TO SOLVE CONTEXTUAL BANDIT PROBLEMS WITH MAXIMUM ENTROPY EXPLORATION
In this section we introduce two classes of algorithms for solving contextual bandit problems with NNs, using exploration strategies based on entropy maximisation. In each case the algorithm defines a policy, $\pi(a|s_i)$, which gives the probability of playing action $a$, given the observation of state $s_i$. The policy is applied by sampling actions from the policy, $a \sim \pi$, at a particular time step. This policy is then updated given the rewards observed in previous time steps by retraining the NNs.
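As a minimal sketch of what this retraining step could look like (the architecture, loss, optimiser, and schedule here are illustrative assumptions, not the paper's exact training procedure), a reward estimator taking the context and action as input can be refitted on the collected triplets with a mean-squared-error loss:

```python
import torch

# Minimal sketch (illustrative assumptions throughout): the reward estimator
# r_hat_theta(s, a) is periodically refitted on the triplets {s_i, a_i, r_i}
# collected so far, here with a mean-squared-error loss.

CONTEXT_DIM = 8

reward_net = torch.nn.Sequential(
    torch.nn.Linear(CONTEXT_DIM + 1, 64),   # input: context concatenated with action
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
optimiser = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def retrain(states: torch.Tensor, actions: torch.Tensor, rewards: torch.Tensor,
            n_epochs: int = 10) -> None:
    """Refit the reward estimator on all triplets observed in previous time steps."""
    inputs = torch.cat([states, actions.unsqueeze(1)], dim=1)
    for _ in range(n_epochs):
        optimiser.zero_grad()
        preds = reward_net(inputs).squeeze(1)
        loss = torch.nn.functional.mse_loss(preds, rewards)
        loss.backward()
        optimiser.step()

# Example usage with a small random batch of collected triplets.
states = torch.randn(32, CONTEXT_DIM)
actions = torch.rand(32)
rewards = torch.rand(32)
retrain(states, actions, rewards)
```

In practice this refit would be repeated at regular intervals as new triplets arrive, so that the policy always samples from an up-to-date reward estimate.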
3.1 Contextual bandit problem formulation
Contextual bandit problems require an algorithm to make the choice of an action, $a \in \mathcal{A}$ (where $\mathcal{A}$ is an action space), upon observing the context state of the environment, $s \in \mathcal{S}$ (where $\mathcal{S}$ is a context space). Upon making the action, a reward, $r \in \mathcal{R}$, is received. For each state observed, an action is chosen and the reward recorded. This results in a dataset, $X$, being built up over a run, consisting of triplets, $\{s_i, a_i, r_i\} \in X$, where $i \in \mathbb{N}$ is the time step of a particular triplet. At any step $i$, the data available for choosing the action $a_i$ consists of the set of triplets $\{s_j, a_j, r_j\}$ where $j < i$.
The goal of the problem is to maximise the expected reward over an indefinite time horizon, where an arbitrary number
of actions, N, can be played. This is usually measured in terms of the regret:
$$R_N = \sum_{i=0}^{N} \left[ r^*_i - r^{a_i}_i \right] \quad (2)$$

where $r^*_i$ is the best possible reward at the time step $i$ and $r^{a_i}_i$ is the reward at time step $i$ received by the action played, $a_i$. A more successful action choosing policy will have a lower regret.
In this work we assume $\mathcal{A} \subseteq \mathbb{R}$; $\mathcal{S} \subseteq \mathbb{R}^n$, where $n$ is the dimension of the context vector and depends on the particular problem being considered; and $\mathcal{R} \subseteq \mathbb{R}$, where many of the problems considered assume $\mathcal{R} \subseteq [0, 1]$.
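To make the formulation concrete, the following sketch plays through a run with a placeholder environment and a placeholder (uniform random) policy, accumulating the dataset $X$ of triplets and the regret of Eq. 2. The context generator, reward function, and policy are hypothetical stand-ins, not the simulation environments used in the experiments.

```python
import numpy as np

# Minimal sketch of a contextual bandit run with the notation of Section 3.1.
# The environment and policy below are hypothetical placeholders used only to
# illustrate how the dataset X of triplets {s_i, a_i, r_i} and the regret of
# Eq. 2 are accumulated.

rng = np.random.default_rng(0)
CONTEXT_DIM = 3
N_STEPS = 1000

def observe_context() -> np.ndarray:
    """Placeholder context generator: s_i in R^n."""
    return rng.uniform(size=CONTEXT_DIM)

def true_reward(s: np.ndarray, a: float) -> float:
    """Placeholder reward in [0, 1]: peaks when the action matches the mean context."""
    return float(np.exp(-10.0 * (a - s.mean()) ** 2))

def policy(s: np.ndarray) -> float:
    """Placeholder policy: a uniform random action in A = [0, 1]."""
    return float(rng.uniform())

X = []          # dataset of (s_i, a_i, r_i) triplets
regret = 0.0
for i in range(N_STEPS):
    s_i = observe_context()
    a_i = policy(s_i)
    r_i = true_reward(s_i, a_i)
    X.append((s_i, a_i, r_i))
    # Best possible reward r*_i, known here only because the simulator is explicit.
    best = max(true_reward(s_i, a) for a in np.linspace(0.0, 1.0, 101))
    regret += best - r_i            # Eq. 2, accumulated over the run

print(f"total regret after {N_STEPS} steps: {regret:.1f}")
```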
3.2 Maximum entropy exploration
First we define a reward estimator, $\hat{r}_\theta(s_i, a)$, which gives the expected reward of action $a$ in state $s_i$ and is parameterised by the vector $\theta$, similar to the approach in (Lee et al., 2020). In maximum entropy exploration, the policy is defined as follows:
$$\pi(a|s_i) = \arg\max_\pi \left( \mathbb{E}_{a \sim \pi}\left[\hat{r}_\theta(s_i, a)\right] + \alpha H(\pi) \right) \quad (3)$$

where $H(\pi) = \mathbb{E}_{a \sim \pi}\left[-\log(\pi)\right]$ is the Shannon entropy. This can then be solved with a softmax:

$$\pi(a|s_i) = \frac{e^{\hat{r}_\theta(a, s_i)/\alpha}}{\int_{a'} e^{\hat{r}_\theta(a', s_i)/\alpha}\, da'} \quad (4)$$
This approach finds a policy that trades off maximising the expected reward with a chosen action (the first term in Eq. 3) against trying a range of different actions, which gives a large Shannon entropy (the second term in Eq. 3). The degree of this trade-off is controlled by the $\alpha \in \mathbb{R}^+$ parameter, which is typically chosen to be at the same scale as the expected reward. Larger $\alpha$ values result in more exploration. Models for $\hat{r}_\theta$ should be chosen to have a fairly flat prior across the states and actions upon initialisation, which will encourage exploration in the early stages of a contextual bandit run.
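One simple way to sample from the policy of Eq. 4 when the action space is an interval such as $[0, 1]$ is to evaluate the reward estimator on a fine grid of actions and normalise with a softmax, as sketched below. The grid resolution, network, and $\alpha$ value are illustrative assumptions; the continuous-action algorithms introduced in this work instead sample the policy with Hamiltonian Monte Carlo.

```python
import torch

# Minimal sketch of the maximum entropy policy of Eq. 4 on a discretised
# action grid over A = [0, 1]. The reward estimator and alpha are illustrative
# assumptions; the algorithms introduced in this paper sample the continuous
# policy with Hamiltonian Monte Carlo rather than a grid.

CONTEXT_DIM = 8
ALPHA = 0.1
GRID = torch.linspace(0.0, 1.0, steps=256)   # discretisation of the action space

reward_net = torch.nn.Sequential(
    torch.nn.Linear(CONTEXT_DIM + 1, 64),    # input: context concatenated with action
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

def sample_action(context: torch.Tensor) -> float:
    """Sample a ~ pi(a|s) with pi proportional to exp(r_hat(s, a) / alpha)."""
    with torch.no_grad():
        ctx = context.unsqueeze(0).expand(GRID.shape[0], -1)  # repeat context per grid point
        inputs = torch.cat([ctx, GRID.unsqueeze(1)], dim=1)
        r_hat = reward_net(inputs).squeeze(1)                 # r_hat(s, a) on the grid
        probs = torch.softmax(r_hat / ALPHA, dim=0)           # Eq. 4, discretised
    idx = int(torch.multinomial(probs, num_samples=1).item())
    return float(GRID[idx])

context = torch.randn(CONTEXT_DIM)
print("sampled action:", sample_action(context))
```

As the grid is refined, the discretised distribution approaches the continuous policy of Eq. 4, but evaluating a dense grid at every step becomes expensive, which is one motivation for MCMC-based sampling in the continuous case.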