Interpretable Option Discovery using Deep
Q-Learning and Variational Autoencoders
Per-Arne Andersen, Morten Goodwin, and Ole-Christoffer Granmo
Department of ICT, University of Agder, Grimstad, Norway
{per.andersen,morten.goodwin,ole.granmo}@uia.no
Abstract. Deep Reinforcement Learning (RL) is unquestionably a robust framework to train autonomous agents in a wide variety of disciplines. However, traditional deep and shallow model-free RL algorithms suffer from low sample efficiency and inadequate generalization for sparse state spaces. The options framework with temporal abstractions [18] is perhaps the most promising method to solve these problems, but it still has noticeable shortcomings. It only guarantees local convergence, and it is challenging to automate initiation and termination conditions, which in practice are commonly hand-crafted.
Our proposal, the Deep Variational Q-Network (DVQN), combines deep generative- and reinforcement learning. The algorithm finds good policies from a Gaussian-distributed latent-space, which is especially useful for defining options. The DVQN algorithm uses MSE with KL-divergence as regularization, combined with traditional Q-Learning updates. The algorithm learns a latent-space that represents good policies with state clusters for options. We show that the DVQN algorithm is a promising approach for identifying initiation and termination conditions for option-based reinforcement learning. Experiments show that the DVQN algorithm, with automatic initiation and termination, has comparable performance to Rainbow and can maintain stability when trained for extended periods after convergence.
Keywords: Deep Reinforcement Learning · Clustering · Options · Hierarchical Reinforcement Learning · Latent-space representation
1 Introduction
The interest in deep Reinforcement Learning (RL) is rapidly growing due to significant progress on several RL problems [2]. Deep RL has shown excellent abilities in a wide variety of domains, such as video games, robotics, and natural language processing [16,14,13]. A current trend in applied RL has been to treat neural networks as black-boxes without regard for the latent-space structure. While unorganized latent-vectors are acceptable for model-free RL, they are disadvantageous for schemes such as option-based RL. In option-based RL, the policy splits into sub-policies that perform individual behaviors based on the current state of the agent. A sub-policy, or option, is selected with initiation criteria and ended with a termination signal. The current state-of-the-art
in option-based RL primarily uses hand-crafted options. Option-based RL algorithms work well for simple environments but perform poorly in more complicated tasks. There is, to the best of our knowledge, no literature that addresses good option selection for difficult control tasks. There are efforts to automate option selection [17], but no method achieves notable performance across various environments.
This paper proposes a novel deep learning architecture for Q-learning using variational autoencoders that learn to organize similar states in a vast latent-space. The algorithm derives good policies from a latent-space that features interpretability and the ability to classify sub-spaces for automatic option generation. Furthermore, we can produce human-interpretable visual representations from the latent-space that directly reflect the state-space structure. We call this architecture DVQN, for Deep Variational Q-Networks, and study the learned latent-space on classic RL problems from the OpenAI Gym [4].
The paper is organized as follows. Section 3 introduces preliminary literature for the proposed algorithm. Section 4 presents the proposed algorithm architecture. Section 5 outlines the experiment setup and presents empirical evidence of the algorithm's performance. Section 2 briefly surveys work that is similar to our contribution. Finally, Section 6 summarises the work of this paper and outlines a roadmap for future work.
2 Related Work
There are numerous attempts in the literature to improve interpretability in deep learning algorithms, but primarily in the supervised setting. [22] provides an in-depth survey of interpretability with Convolutional Neural Networks (CNNs). Our approach is similar to the work of [20], where the authors propose an architecture for visual perception of the DQN algorithm. The difference, however, is primarily our focus on the interpretability of the latent-space distribution via methods commonly found in variational autoencoders. There are similar efforts to combine Q-Learning with Variational Autoencoders, such as [19,11], which show promising theoretical results but with limited focus on interpretability. [1] did notable work on interpretability, using a KL-divergence distance for optimization, but did not find convincing evidence for a deeper understanding of the model. The focus of our contribution deviates here, finding significant value in a shallow and organized latent-space.
Options. The learned latent-space is valuable for the selection of options in hierarchical reinforcement learning (HRL). There is increasing engagement in HRL research because of several appealing benefits such as sample efficiency and model simplicity [3]. Despite its growing attention, there are few advancements within this field compared to model-free RL. The options framework [18] is perhaps the most promising approach for HRL in terms of intuition and convergence guarantees. Specifically, the options framework defines semi-Markov decision processes (SMDPs), an extension of the traditional MDP framework [21]. SMDPs feature temporal abstractions where multiple discrete time steps are
generalized into a single step. These abstract steps are what define an option, where the option is a subset of the state-space. In the proposed algorithm, the structure of the latent-space provides such temporal abstractions from which options can form.
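To make the framework concrete, an option is typically formalized as a triple ⟨I, π, β⟩ of an initiation set, an intra-option policy, and a termination condition [18]. The sketch below is only an illustrative container for that triple, assuming discrete integer-indexed states and actions; the class and field names are our own and are not part of the proposed algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class Option:
    """Illustrative option <I, pi, beta> in the sense of the options framework [18]."""
    initiation_set: Set[int]             # I: states where the option may be selected
    policy: Callable[[int], int]         # pi: maps a state to a primitive action
    termination: Callable[[int], float]  # beta: probability of terminating in a state

    def can_initiate(self, state: int) -> bool:
        # The option is only available in states belonging to its initiation set.
        return state in self.initiation_set
```

In this terminology, the automatic initiation and termination conditions discussed in this paper correspond to learning I and β from the latent-space rather than hand-crafting them.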
3 Background
The algorithm is formalized under the conventional Markov decision process tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is a (finite) set of all possible states, $A$ is a (finite) set of all possible actions, $P$ defines the probabilistic transition function $P(s_{t+1} = s' \mid s, a)$ where $s$ is the previous state and $s'$ is the transition state, and $R$ is the reward function $R(r_{t+1} \mid s, a)$. Finally, $\gamma \in [0, 1]$ is a discount factor that determines the importance of future states: lower $\gamma$ values decrease the importance of future states, while higher values increase it.
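For completeness, the Q-Learning updates referred to throughout this paper follow the standard temporal-difference rule, restated here as a textbook reference; the learning rate $\alpha$ is not introduced elsewhere in this section:
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big].
\]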
4 Deep Variational Q-Networks
Our contribution is a deep Q-learning algorithm that finds good policies in an organized latent-space from variational autoencoders.¹ Empirically, the algorithm shows performance comparable to traditional model-free deep Q-Network variants. We name our method the Deep Variational Q-Network (DVQN); it combines two emerging algorithms, the variational autoencoder (VAE) [12] and the deep Q-Network (DQN) [14].
In traditional deep Q-Networks, the (latent-space) hidden layers are treated as a black-box. In contrast, the objective of the variational autoencoder is to reconstruct the input and organize the latent-vector so that similar (data) states are modeled adjacently under a Gaussian distribution.
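As a point of reference, the following minimal sketch shows how a VAE-style encoder produces such a Gaussian latent-vector through the re-parametrization trick. It is written in PyTorch as an assumption; the layer sizes and module names are illustrative and do not describe the paper's architecture.

```python
import torch
import torch.nn as nn


class GaussianEncoder(nn.Module):
    """Illustrative VAE-style encoder mapping a state to a Gaussian latent-vector.

    Layer sizes and names are placeholders, not the paper's architecture.
    """

    def __init__(self, state_dim: int, latent_dim: int):
        super().__init__()
        self.hidden = nn.Linear(state_dim, 128)
        self.mu = nn.Linear(128, latent_dim)      # mean of q(z | s)
        self.logvar = nn.Linear(128, latent_dim)  # log-variance of q(z | s)

    def forward(self, state: torch.Tensor):
        h = torch.relu(self.hidden(state))
        mu, logvar = self.mu(h), self.logvar(h)
        # Re-parametrization trick: sample z differentiably from N(mu, sigma^2).
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)
        return z, mu, logvar
```

In an architecture of this kind, the sampled z is what a Q-head and a decoder would consume downstream.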
In DQN, the latent-space is sparse and hard to interpret, both for humans and for option-based machinery. By introducing a VAE mechanism into the algorithm, we expect far better interpretability for creating options in RL, which is the primary motivation for this contribution. Variational autoencoders are, in contrast to deep RL, concerned with the organization of the latent-space representation, and are commonly used to generate clusters of similar data visualized with t-SNE or PCA [23]. The DVQN algorithm introduces three significant properties. First, the algorithm fits the data as a Gaussian distribution. This reduces the policy-space, which in practice reduces the probability of the policy drifting away from global minima. Second, the algorithm is generative and does not require exploration schemes such as ε-greedy, because exploration is handled by the re-parametrization during training. Third, the algorithm can learn the transition function and, if desirable, generate training data directly from the latent-space parameters, similar to the work of [8].
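To make the combined objective concrete, the following is a minimal sketch of a loss that mixes a Q-Learning TD term with MSE reconstruction and KL-divergence regularization, as described in the abstract. The function signature, argument names, and equal-weight defaults are our own assumptions and do not reproduce the published implementation.

```python
import torch
import torch.nn.functional as F


def dvqn_loss(q_values, target_q, reconstruction, state, mu, logvar,
              recon_weight: float = 1.0, kl_weight: float = 1.0):
    """Illustrative DVQN-style objective (assumed form, not the authors' code)."""
    # Temporal-difference loss between predicted Q-values and bootstrapped targets.
    td_loss = F.mse_loss(q_values, target_q)
    # Autoencoder reconstruction term: MSE between the input state and its reconstruction.
    recon_loss = F.mse_loss(reconstruction, state)
    # KL divergence between q(z | s) = N(mu, sigma^2) and the unit Gaussian prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return td_loss + recon_weight * recon_loss + kl_weight * kl
```

Scaling the KL term trades off how strongly the latent-space is pulled towards a unit Gaussian against the Q-Learning and reconstruction objectives.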
Figure 1 illustrates the architecture of the algorithm. The architecture follows general trends in similar RL literature but has notable contributions. First,
¹ The code will be published upon publication.