Interpretable Option Discovery using Deep
Q-Learning and Variational Autoencoders
Per-Arne Andersen, Morten Goodwin, and Ole-Christoffer Granmo
Department of ICT, University of Agder, Grimstad, Norway
{per.andersen,morten.goodwin,ole.granmo}@uia.no
Abstract. Deep Reinforcement Learning (RL) is unquestionably a robust framework to train autonomous agents in a wide variety of disciplines. However, traditional deep and shallow model-free RL algorithms suffer from low sample efficiency and inadequate generalization for sparse state spaces. The options framework with temporal abstractions [18] is perhaps the most promising method to solve these problems, but it still has noticeable shortcomings. It only guarantees local convergence, and it is challenging to automate initiation and termination conditions, which in practice are commonly hand-crafted.
Our proposal, the Deep Variational Q-Network (DVQN), combines deep generative- and reinforcement learning. The algorithm finds good policies from a Gaussian-distributed latent-space, which is especially useful for defining options. The DVQN algorithm uses MSE with KL-divergence as regularization, combined with traditional Q-Learning updates. The algorithm learns a latent-space that represents good policies with state clusters for options. We show that the DVQN algorithm is a promising approach for identifying initiation and termination conditions for option-based reinforcement learning. Experiments show that the DVQN algorithm, with automatic initiation and termination, has comparable performance to Rainbow and can maintain stability when trained for extended periods after convergence.
Keywords: Deep Reinforcement Learning · Clustering · Options · Hierarchical Reinforcement Learning · Latent-space representation
1 Introduction
The interest in deep Reinforcement Learning (RL) is rapidly growing due to significant progress on several RL problems [2]. Deep RL has shown excellent abilities in a wide variety of domains, such as video games, robotics, and natural language processing [16,14,13]. A current trend in applied RL has been to treat neural networks as black-boxes without regard for the latent-space structure. While unorganized latent-vectors are acceptable for model-free RL, they are disadvantageous for schemes such as option-based RL. In option-based RL, the policy splits into sub-policies that perform individual behaviors based on the current state of the agent. A sub-policy, or option, is selected with initiation criteria and ended with a termination signal. The current state-of-the-art
in option-based RL primarily uses hand-crafted options. Option-based RL algorithms work well for simple environments but perform poorly in more complicated tasks. There is, to the best of our knowledge, no literature that addresses good option selection for difficult control tasks. There are efforts to automate option selection [17], but no method achieves notable performance across various environments.
This paper proposes a novel deep learning architecture for Q-learning using variational autoencoders that learn to organize similar states in a vast latent-space. The algorithm derives good policies from a latent-space that features interpretability and the ability to classify sub-spaces for automatic option generation. Furthermore, we can produce human-interpretable visual representations from the latent-space that directly reflect the state-space structure. We call this architecture DVQN, for Deep Variational Q-Networks, and study the learned latent-space on classic RL problems from the OpenAI Gym [4].
The paper is organized as follows. Section 3 introduces preliminary literature for the proposed algorithm. Section 4 presents the proposed algorithm architecture. Section 5 outlines the experiment setup and presents empirical evidence of the algorithm's performance. Section 2 briefly surveys work that is similar to our contribution. Finally, Section 6 summarises the work of this paper and outlines a roadmap for future work.
2 Related Work
There are numerous attempts in the literature to improve interpretability in deep learning algorithms, but primarily in the supervised setting. [22] provides an in-depth survey of interpretability with Convolutional Neural Networks (CNNs). Our approach is similar to the work of [20], where the authors propose an architecture for visual perception of the DQN algorithm. The difference, however, is primarily our focus on the interpretability of the latent-space distribution via methods commonly found in variational autoencoders. There are similar efforts to combine Q-Learning with Variational Autoencoders, such as [19,11], which show promising theoretical results but with limited focus on interpretability. [1] did notable work on interpretability, using a KL-divergence distance for optimization, but did not find convincing evidence for a deeper understanding of the model. The focus of our contribution deviates here, finding significant value in a shallow and organized latent-space.
Options. The learned latent-space is valuable for the selection of options in hierarchical reinforcement learning (HRL). There is increasing engagement in HRL research because of several appealing benefits such as sample efficiency and model simplicity [3]. Despite its growing attention, there are few advancements within this field compared to model-free RL. The options framework [18] is perhaps the most promising approach for HRL in terms of intuition and convergence guarantees. Specifically, the options framework defines semi-Markov decision processes (SMDPs), an extension of the traditional MDP framework [21]. SMDPs feature temporal abstractions where multiple discrete time steps are
generalized into a single step. These abstract steps are what define an option, where the option is a subset of the state-space. In the proposed algorithm, the structure of the latent-space provides such temporal abstractions from which options can form.
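To make the framework concrete, an option is typically formalized as a triple ⟨I, π, β⟩ of an initiation set, an intra-option policy, and a termination condition [18]. The sketch below is only an illustrative container for that triple, assuming discrete integer-indexed states and actions; the class and field names are our own and are not part of the proposed algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class Option:
    """Illustrative option <I, pi, beta> in the sense of the options framework [18]."""
    initiation_set: Set[int]             # I: states where the option may be selected
    policy: Callable[[int], int]         # pi: maps a state to a primitive action
    termination: Callable[[int], float]  # beta: probability of terminating in a state

    def can_initiate(self, state: int) -> bool:
        # The option is only available in states belonging to its initiation set.
        return state in self.initiation_set
```

In this terminology, the automatic initiation and termination conditions discussed in this paper correspond to learning I and β from the latent-space rather than hand-crafting them.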
3 Background
The algorithm is formalized under the conventional Markov decision process tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is a (finite) set of all possible states, $A$ is a (finite) set of all possible actions, $P$ defines the probabilistic transition function $P(s_{t+1} = s' \mid s, a)$ where $s$ is the previous state and $s'$ is the transition state, and $R$ is the reward function $R(r_{t+1} \mid s, a)$. Finally, $\gamma \in [0, 1]$ is a discount factor that determines the importance of future states: lower $\gamma$ values decrease the importance of future states, while higher values increase it.
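For completeness, the Q-Learning updates referred to throughout this paper follow the standard temporal-difference rule, restated here as a textbook reference; the learning rate $\alpha$ is not introduced elsewhere in this section:
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big].
\]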
4 Deep Variational Q-Networks
Our contribution is a deep Q-learning algorithm that finds good policies in an organized latent-space from variational autoencoders.¹ Empirically, the algorithm shows performance comparable to traditional model-free deep Q-Network variants. We name our method the Deep Variational Q-Network (DVQN); it combines two emerging algorithms, the variational autoencoder (VAE) [12] and the deep Q-Network (DQN) [14].
In traditional deep Q-Networks, the (latent-space) hidden layers are treated as a black-box. In contrast, the objective of the variational autoencoder is to reconstruct the input and organize the latent-vector so that similar (data) states are modeled adjacently under a Gaussian distribution.
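As a point of reference, the following minimal sketch shows how a VAE-style encoder produces such a Gaussian latent-vector through the re-parametrization trick. It is written in PyTorch as an assumption; the layer sizes and module names are illustrative and do not describe the paper's architecture.

```python
import torch
import torch.nn as nn


class GaussianEncoder(nn.Module):
    """Illustrative VAE-style encoder mapping a state to a Gaussian latent-vector.

    Layer sizes and names are placeholders, not the paper's architecture.
    """

    def __init__(self, state_dim: int, latent_dim: int):
        super().__init__()
        self.hidden = nn.Linear(state_dim, 128)
        self.mu = nn.Linear(128, latent_dim)      # mean of q(z | s)
        self.logvar = nn.Linear(128, latent_dim)  # log-variance of q(z | s)

    def forward(self, state: torch.Tensor):
        h = torch.relu(self.hidden(state))
        mu, logvar = self.mu(h), self.logvar(h)
        # Re-parametrization trick: sample z differentiably from N(mu, sigma^2).
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)
        return z, mu, logvar
```

In an architecture of this kind, the sampled z is what a Q-head and a decoder would consume downstream.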
In DQN, the latent-space is sparse and hard to interpret, both for humans and for option-based machinery. By introducing a VAE mechanism into the algorithm, we expect far better interpretability for creating options in RL, which is the primary motivation for this contribution. Variational autoencoders are, in contrast to deep RL, concerned with the organization of the latent-space representation, and are commonly used to generate clusters of similar data visualized with t-SNE or PCA [23]. The DVQN algorithm introduces three significant properties. First, the algorithm fits the data as a Gaussian distribution. This reduces the policy-space, which in practice reduces the probability of the policy drifting away from global minima. Second, the algorithm is generative and does not require exploration schemes such as ε-greedy, because exploration is handled by the re-parametrization during training. Third, the algorithm can learn the transition function and, if desirable, generate training data directly from the latent-space parameters, similar to the work of [8].
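To make the combined objective concrete, the following is a minimal sketch of a loss that mixes a Q-Learning TD term with MSE reconstruction and KL-divergence regularization, as described in the abstract. The function signature, argument names, and equal-weight defaults are our own assumptions and do not reproduce the published implementation.

```python
import torch
import torch.nn.functional as F


def dvqn_loss(q_values, target_q, reconstruction, state, mu, logvar,
              recon_weight: float = 1.0, kl_weight: float = 1.0):
    """Illustrative DVQN-style objective (assumed form, not the authors' code)."""
    # Temporal-difference loss between predicted Q-values and bootstrapped targets.
    td_loss = F.mse_loss(q_values, target_q)
    # Autoencoder reconstruction term: MSE between the input state and its reconstruction.
    recon_loss = F.mse_loss(reconstruction, state)
    # KL divergence between q(z | s) = N(mu, sigma^2) and the unit Gaussian prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return td_loss + recon_weight * recon_loss + kl_weight * kl
```

Scaling the KL term trades off how strongly the latent-space is pulled towards a unit Gaussian against the Q-Learning and reconstruction objectives.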
Figure 1 illustrates the architecture of the algorithm. The architecture follows general trends in similar RL literature but has notable contributions. First,
¹ The code will be published upon publication.