
MEET: A MONTE CARLO EXPLORATION-EXPLOITATION TRADE-OFF FOR BUFFER
SAMPLING
*Julius Ott1,2, *Lorenzo Servadei1,2, Jose Arjona-Medina3, Enrico Rinaldi4,
Gianfranco Mauro1, Daniela Sánchez Lopera1,2, Michael Stephan1,
Thomas Stadelmayer1, Avik Santra1, Robert Wille2
1Infineon Technologies AG, 2Technical University of Munich,
3Johannes Kepler University Linz, 4University of Michigan
*Equal contribution
ABSTRACT
Data selection is essential for any data-based optimization technique, such as Reinforcement Learning. State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent. However, they do not incorporate uncertainty in the Q-Value estimation. Consequently, they cannot adapt the sampling strategy, including exploration and exploitation of transitions, to the complexity of the task. To address this, this paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off. It is enabled by the uncertainty estimation of the Q-Value function, which guides the sampling to explore more significant transitions and, thus, learn a more efficient policy. Experiments on classical control tasks demonstrate stable results across various environments. They show that the proposed method outperforms state-of-the-art sampling strategies for dense rewards w.r.t. convergence and peak performance by 26% on average.
Index Terms—uncertainty estimation, experience replay,
reinforcement learning
1. INTRODUCTION
In Deep Reinforcement Learning (DRL) applications, the buffer, where experiences are saved, represents a key component. In fact, learning from stored experiences leverages supervised learning techniques, in which Deep Learning excels [1]. Seminal work has shown how buffer sampling techniques improve the performance of DRL models over distributions observed during training [2]. Consequently, how to sample from the buffer plays an important role in the learning process. In this context, a major component of the buffer sampling strategy concerns the uncertainty of the agent in choosing the optimal action. This influences the trade-off between exploration and exploitation in the buffer sampling strategy.
In the literature, the concept of uncertainty has been applied to tasks performed by a Machine Learning (ML) model over unseen data distributions. Those are called Out-of-Distribution (OOD) data, i.e., samples for which the model has high uncertainty. Thus, in the state of the art, the assessment of that uncertainty is typically used for OOD detection. For instance, [3] proposes an uncertainty-based OOD-classification framework called UBOOD, which uses the epistemic uncertainty of the agent's value function to classify OOD samples. In particular, UBOOD compares two uncertainty estimation methods: dropout- and bootstrap-based. The highest performance is achieved using bootstrap-based estimators, which leverage the bootstrap neural network (BootDQN) [4].
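For reference, the following Python sketch (assuming PyTorch) illustrates the bootstrap-based idea: K independent Q-value heads share a torso, and the disagreement between heads serves as an epistemic-uncertainty signal. The class and argument names (BootstrappedQNet, n_heads) are illustrative and not taken from [3, 4].

    # Bootstrap-style Q-network with K independent heads on a shared torso.
    # The spread of the head predictions is used as an epistemic-uncertainty proxy.
    import torch
    import torch.nn as nn

    class BootstrappedQNet(nn.Module):
        def __init__(self, state_dim: int, action_dim: int, n_heads: int = 10, hidden: int = 256):
            super().__init__()
            # Shared feature extractor followed by K independent Q-value heads.
            self.torso = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
            self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_heads)])

        def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
            # Returns a (batch, K) tensor of per-head Q-value estimates.
            z = self.torso(torch.cat([state, action], dim=-1))
            return torch.cat([head(z) for head in self.heads], dim=-1)

    def epistemic_uncertainty(q_heads: torch.Tensor) -> torch.Tensor:
        # Disagreement across heads (standard deviation) as the uncertainty estimate.
        return q_heads.std(dim=-1)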
Inspired by [3, 4], this paper employs a bootstrap mechanism with multiple heads for determining the uncertainty in the Q-Value estimation. This is exploited by the proposed novel algorithm: a Monte Carlo Exploration-Exploitation Trade-Off (MEET) for buffer sampling. Thanks to the Q-Value uncertainty estimation, MEET enables an optimized selection of the transitions for training Off-Policy Reinforcement Learning (RL) algorithms and maximizes their return. We evaluate MEET on continuous control problems provided by the MuJoCo1 physics simulation engine. Results show that MEET performs consistently in terms of convergence speed and improves the performance by 26% in challenging, continuous control environments.
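To make the underlying idea concrete, the sketch below shows, in Python with NumPy, how a per-transition uncertainty signal could bias buffer sampling between uniform replay and uncertainty-seeking replay. It is only an illustration of the principle, not the MEET algorithm itself (which is defined in Section 3); the function and parameter names (sample_indices, beta) are hypothetical.

    # Uncertainty-guided replay sampling: beta = 0 recovers uniform replay,
    # while larger beta focuses the batch on transitions with uncertain Q-values.
    import numpy as np

    def sample_indices(uncertainties: np.ndarray, batch_size: int, beta: float) -> np.ndarray:
        # Softmax over scaled uncertainties; subtract the max for numerical stability.
        logits = beta * (uncertainties - uncertainties.max())
        probs = np.exp(logits)
        probs /= probs.sum()
        return np.random.choice(len(uncertainties), size=batch_size, replace=False, p=probs)

    # Example: 10,000 stored transitions, draw a batch of 256 with moderate exploration.
    uncert = np.abs(np.random.randn(10_000))
    batch_idx = sample_indices(uncert, batch_size=256, beta=2.0)

The single coefficient beta plays the role of a temperature that trades off exploitation of the whole buffer against exploration of poorly understood transitions.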
The remainder of this paper is structured as follows: in Section 2, we present the background related to continuous RL, buffer sampling, and uncertainty estimation. Furthermore, we motivate the necessity of the proposed approach. In Section 3, we introduce the proposed buffer sampling strategy, while Section 4 describes the performed experiments on public benchmarks and the obtained results. Finally, Section 5 concludes the paper.
2. BACKGROUND AND RELATED WORK
In this section, we present concepts related to the approach introduced in this paper. To this end, we first discuss characteristics of continuous RL, and then we review the role of uncertainty in RL.
2.1. Continuous Reinforcement Learning
Traditional RL methods often assume a finite action space. In real-world applications, however, RL methods frequently face a continuous action space. Different methods have been developed to extend existing approaches to continuous action spaces. One prominent example is the Deterministic Policy Gradient (DPG) method [5]. Using the same approach as for stochastic policies, the parameters are updated in the direction of
1github.com/deepmind/mujoco