
MEET: A MONTE CARLO EXPLORATION-EXPLOITATION TRADE-OFF FOR BUFFER
SAMPLING
*Julius Ott1,2, *Lorenzo Servadei1,2, Jose Arjona-Medina3, Enrico Rinaldi4,
Gianfranco Mauro1, Daniela Sánchez Lopera1,2, Michael Stephan1,
Thomas Stadelmayer1, Avik Santra1, Robert Wille2
1Infineon Technologies AG, 2Technical University of Munich,
3Johannes Kepler University Linz, 4University of Michigan
*Equal contribution
ABSTRACT
Data selection is essential for any data-based optimization technique, such as Reinforcement Learning. State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent. However, they do not incorporate uncertainty in the Q-Value estimation. Consequently, they cannot adapt the sampling strategy, including exploration and exploitation of transitions, to the complexity of the task. To address this, this paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off. It is enabled by the uncertainty estimation of the Q-Value function, which guides the sampling to explore more significant transitions and, thus, learn a more efficient policy. Experiments on classical control tasks demonstrate stable results across various environments. They show that the proposed method outperforms state-of-the-art sampling strategies for dense rewards w.r.t. convergence and peak performance by 26% on average.
Index Terms—uncertainty estimation, experience replay,
reinforcement learning
1. INTRODUCTION
In Deep Reinforcement Learning (DRL) applications, the buffer, where experiences are saved, represents a key component. In fact, learning from stored experiences leverages supervised learning techniques, in which Deep Learning excels [1]. Seminal work has shown how buffer sampling techniques improve the performance of DRL models over distributions observed during training [2]. Consequently, how to sample from the buffer plays an important role in the learning process. In this context, a major component of the buffer sampling strategy concerns the uncertainty of the agent in choosing the optimal action. This influences the trade-off between exploration and exploitation in the buffer sampling strategy.
In the literature, the concept of uncertainty has been applied to tasks performed by a Machine Learning (ML) model over unseen data distributions. Those are called Out-of-Distribution (OOD) data, i.e., samples for which the model has high uncertainty. Thus, in the state of the art, the assessment of that uncertainty is typically used for OOD detection. For instance, [3] proposes an uncertainty-based OOD-classification framework called UBOOD, which uses the epistemic uncertainty of the agent's value function to classify OOD samples. In particular, UBOOD compares two uncertainty estimation methods: dropout- and bootstrap-based. The highest performance is achieved using bootstrap-based estimators, which leverage the bootstrap neural network (BootDQN) [4].
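For reference, the following Python sketch (assuming PyTorch) illustrates the bootstrap-based idea: K independent Q-value heads share a torso, and the disagreement between heads serves as an epistemic-uncertainty signal. The class and argument names (BootstrappedQNet, n_heads) are illustrative and not taken from [3, 4].

    # Bootstrap-style Q-network with K independent heads on a shared torso.
    # The spread of the head predictions is used as an epistemic-uncertainty proxy.
    import torch
    import torch.nn as nn

    class BootstrappedQNet(nn.Module):
        def __init__(self, state_dim: int, action_dim: int, n_heads: int = 10, hidden: int = 256):
            super().__init__()
            # Shared feature extractor followed by K independent Q-value heads.
            self.torso = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
            self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_heads)])

        def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
            # Returns a (batch, K) tensor of per-head Q-value estimates.
            z = self.torso(torch.cat([state, action], dim=-1))
            return torch.cat([head(z) for head in self.heads], dim=-1)

    def epistemic_uncertainty(q_heads: torch.Tensor) -> torch.Tensor:
        # Disagreement across heads (standard deviation) as the uncertainty estimate.
        return q_heads.std(dim=-1)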
Inspired by [3, 4], this paper employs a bootstrap mechanism with multiple heads for determining the uncertainty in the Q-Value estimation. This is exploited by the proposed novel algorithm: a Monte Carlo Exploration-Exploitation Trade-Off (MEET) for buffer sampling. Thanks to the Q-Value uncertainty estimation, MEET enables an optimized selection of the transitions for training Off-Policy Reinforcement Learning (RL) algorithms and maximizes their return. We evaluate MEET on continuous control problems provided by the MuJoCo1 physics simulation engine. Results show that MEET performs consistently in terms of convergence speed and improves the performance by 26% in challenging, continuous control environments.
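To make the underlying idea concrete, the sketch below shows, in Python with NumPy, how a per-transition uncertainty signal could bias buffer sampling between uniform replay and uncertainty-seeking replay. It is only an illustration of the principle, not the MEET algorithm itself (which is defined in Section 3); the function and parameter names (sample_indices, beta) are hypothetical.

    # Uncertainty-guided replay sampling: beta = 0 recovers uniform replay,
    # while larger beta focuses the batch on transitions with uncertain Q-values.
    import numpy as np

    def sample_indices(uncertainties: np.ndarray, batch_size: int, beta: float) -> np.ndarray:
        # Softmax over scaled uncertainties; subtract the max for numerical stability.
        logits = beta * (uncertainties - uncertainties.max())
        probs = np.exp(logits)
        probs /= probs.sum()
        return np.random.choice(len(uncertainties), size=batch_size, replace=False, p=probs)

    # Example: 10,000 stored transitions, draw a batch of 256 with moderate exploration.
    uncert = np.abs(np.random.randn(10_000))
    batch_idx = sample_indices(uncert, batch_size=256, beta=2.0)

The single coefficient beta plays the role of a temperature that trades off exploitation of the whole buffer against exploration of poorly understood transitions.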
The remainder of this paper is structured as follows: in Section 2, we present the background related to continuous RL, buffer sampling, and uncertainty estimation. Furthermore, we motivate the necessity of the proposed approach. In Section 3, we introduce the proposed buffer sampling strategy, while Section 4 describes the performed experiments on public benchmarks and the obtained results. Finally, Section 5 concludes the paper.
2. BACKGROUND AND RELATED WORK
In this section, we present concepts related to the approach introduced in this paper. To this end, we first discuss characteristics of continuous RL, and then we review the role of uncertainty in RL.
2.1. Continuous Reinforcement Learning
Traditional RL methods often assume a finite action space. In real-world applications, however, RL methods frequently face a continuous action space. Different methods have been developed to extend existing approaches to continuous action spaces. One prominent example is the Deterministic Policy Gradient (DPG) method [5]. Using the same approach as for stochastic policies, the parameters are updated in the direction of
1github.com/deepmind/mujoco