MEET: A MONTE CARLO EXPLORATION-EXPLOITATION TRADE-OFF FOR BUFFER SAMPLING
⋆Julius Ott1,2, ⋆Lorenzo Servadei1,2, Jose Arjona-Medina3, Enrico Rinaldi4,
Gianfranco Mauro1, Daniela Sánchez Lopera1,2, Michael Stephan1,
Thomas Stadelmayer1, Avik Santra1, Robert Wille2
1Infineon Technologies AG, 2Technical University of Munich,
3Johannes Kepler University Linz, 4University of Michigan
⋆Equal contribution
ABSTRACT
Data selection is essential for any data-based optimization technique, such as Reinforcement Learning. State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent. However, they do not incorporate uncertainty in the Q-Value estimation. Consequently, they cannot adapt the sampling strategy, including the exploration and exploitation of transitions, to the complexity of the task. To address this, the paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off. It is enabled by the uncertainty estimation of the Q-Value function, which guides the sampling towards more significant transitions and, thus, a more efficient policy. Experiments on classical control environments demonstrate stable results across a variety of tasks and show that the proposed method outperforms state-of-the-art sampling strategies for dense rewards with respect to convergence and peak performance by 26% on average.
Index Terms— uncertainty estimation, experience replay,
reinforcement learning
1. INTRODUCTION
In Deep Reinforcement Learning (DRL) applications, the buffer in which experiences are saved is a key component. In fact, learning from stored experiences leverages supervised learning techniques, at which Deep Learning excels [1]. Seminal work has shown how buffer sampling techniques improve the performance of DRL models over the distributions observed during training [2]. Consequently, how to sample from the buffer plays an important role in the learning process. In this context, a major component of the buffer sampling strategy concerns the uncertainty of the agent in choosing the optimal action, which influences the trade-off between exploration and exploitation in the sampling strategy.
In the literature, the concept of uncertainty has been applied to tasks performed by a Machine Learning (ML) model over unseen data distributions. These are called Out-of-Distribution (OOD) data, i.e., samples for which the model has high uncertainty. In the state of the art, such uncertainty estimates are therefore typically used for OOD detection. For instance, [3] proposes an uncertainty-based OOD-classification framework called UBOOD, which uses the epistemic uncertainty of the agent's value function to classify OOD samples. In particular, UBOOD compares two uncertainty estimation methods, dropout-based and bootstrap-based. The highest performance is achieved using bootstrap-based estimators, which leverage the bootstrapped neural network (BootDQN) [4].
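To make the bootstrap idea concrete, the snippet below is a minimal PyTorch sketch, not code from the paper: the names MultiHeadQNetwork, num_heads, and hidden_dim are illustrative assumptions. It shows a Q-network whose output heads share a common torso, so that the spread of the head predictions can serve as an epistemic-uncertainty estimate for a state-action pair.

import torch
import torch.nn as nn


class MultiHeadQNetwork(nn.Module):
    """Bootstrap-style Q-network: one shared torso, several output heads."""

    def __init__(self, state_dim, action_dim, num_heads=10, hidden_dim=256):
        super().__init__()
        # Shared torso processes the state-action pair once.
        self.torso = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Independent output heads, each an individual Q-Value estimator.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_heads)]
        )

    def forward(self, state, action):
        features = self.torso(torch.cat([state, action], dim=-1))
        # Stack head outputs: shape (num_heads, batch_size, 1).
        return torch.stack([head(features) for head in self.heads], dim=0)

    def q_value_and_uncertainty(self, state, action):
        q_all = self.forward(state, action)
        # Mean over heads as the Q-estimate, standard deviation over heads
        # as the epistemic-uncertainty proxy.
        return q_all.mean(dim=0), q_all.std(dim=0)

In a BootDQN-style setup, each head would additionally be trained on its own bootstrapped subset of the replay data; that masking logic is omitted here for brevity.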
Inspired by [3, 4], this paper employs a bootstrap mechanism with multiple heads to determine the uncertainty in the Q-Value estimation. This is exploited by the proposed novel algorithm: a Monte Carlo Exploration-Exploitation Trade-Off (MEET) for buffer sampling. Thanks to the Q-Value uncertainty estimation, MEET enables an optimized selection of the transitions for training Off-Policy Reinforcement Learning (RL) algorithms and maximizes their return. We evaluate MEET on continuous control problems provided by the MuJoCo physics simulation engine (github.com/deepmind/mujoco). Results show that MEET performs consistently in terms of convergence speed and improves performance by 26% in challenging, continuous control environments.
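As an illustration only, the sketch below shows one way such an exploration-exploitation trade-off could be realized at the buffer level; the actual MEET sampling rule is defined in Section 3, and the class UncertaintyAwareReplayBuffer, its trade_off parameter, and the scoring rule are assumptions made here for illustration. The buffer blends an exploitation signal (the absolute TD-error) with an exploration signal (the Q-Value uncertainty, e.g., the head standard deviation produced by a bootstrapped network as sketched above) into per-transition sampling probabilities.

import numpy as np


class UncertaintyAwareReplayBuffer:
    """Samples transitions by a mix of TD-error and Q-Value uncertainty."""

    def __init__(self, capacity, trade_off=0.5, eps=1e-6):
        self.capacity = capacity
        self.trade_off = trade_off  # 0 = pure exploitation, 1 = pure exploration
        self.eps = eps              # keeps every transition sampleable
        self.transitions, self.td_errors, self.uncertainties = [], [], []

    def add(self, transition, td_error, uncertainty):
        if len(self.transitions) >= self.capacity:
            # FIFO eviction once the buffer is full.
            self.transitions.pop(0)
            self.td_errors.pop(0)
            self.uncertainties.pop(0)
        self.transitions.append(transition)
        self.td_errors.append(abs(td_error))
        self.uncertainties.append(uncertainty)

    def sample(self, batch_size):
        exploit = np.asarray(self.td_errors) + self.eps
        explore = np.asarray(self.uncertainties) + self.eps
        # Convex combination of two normalized priority distributions.
        score = ((1.0 - self.trade_off) * exploit / exploit.sum()
                 + self.trade_off * explore / explore.sum())
        probs = score / score.sum()
        idx = np.random.choice(len(self.transitions), size=batch_size, p=probs)
        return [self.transitions[i] for i in idx], idx

With trade_off = 0 this degenerates to TD-error-based prioritization, while trade_off = 1 samples purely by uncertainty; intermediate values realize the kind of trade-off that MEET targets.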
The remainder of this paper is structured as follows: in Section 2, we present the background related to continuous RL, buffer sampling, and uncertainty estimation. Furthermore, we motivate the necessity of the proposed approach. In Section 3, we introduce the proposed buffer sampling strategy, while Section 4 describes the experiments performed on publicly available environments and the obtained results. Finally, Section 5 concludes the paper.
2. BACKGROUND AND RELATED WORK
In this section, we present concepts related to the approach introduced in this paper. To this end, we first present the characteristics of continuous RL, and then we review the role of uncertainty in RL.
2.1. Continuous Reinforcement Learning
Traditional RL methods often assume a finite action space. In real-world applications, however, RL methods frequently face a continuous action space. Different methods have been developed to extend existing approaches to continuous action spaces. One prominent example is the Deterministic Policy Gradient (DPG) method [5]. Using the same approach as for stochastic policies, the policy parameters are updated in the direction of the gradient of the action-value function.
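As a brief reminder of the standard result from [5], quoted here for context with the usual notation rather than anything specific to this paper, the parameters θ of a deterministic policy μ_θ are adjusted along the deterministic policy gradient

\[
\nabla_{\theta} J(\mu_{\theta})
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_{\theta}\, \mu_{\theta}(s)\;
      \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)}
    \right],
\]

where ρ^μ denotes the discounted state distribution induced by the policy and Q^μ its action-value function.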