CostNet: An End-to-End Framework for
Goal-Directed Reinforcement Learning
Per-Arne Andersen[0000-0002-7742-4907], Morten Goodwin[0000-0001-6331-702X], and Ole-Christoffer Granmo[0000-0002-7287-030X]
Department of ICT, University of Agder, Grimstad, Norway
{per.andersen,morten.goodwin,ole.granmo}@uia.no
Abstract. Reinforcement Learning (RL) is a general framework concerned with an agent that seeks to maximize rewards in an environment. The learning typically happens through trial and error using explorative methods such as ε-greedy. There are two approaches, model-based and model-free reinforcement learning, that show concrete results in several disciplines. Model-based RL learns a model of the environment for learning the policy, while model-free approaches are fully explorative and exploitative without considering the underlying environment dynamics. Model-free RL works conceptually well in simulated environments, and empirical evidence suggests that trial and error leads to near-optimal behavior with enough training. On the other hand, model-based RL aims to be sample efficient, and studies show that it requires far less training in the real environment to learn a good policy.
A significant challenge with RL is that it relies on a well-defined reward function to work well in complex environments, and such a reward function is challenging to define. Goal-Directed RL is an alternative method that learns an intrinsic reward function with emphasis on a few explored trajectories that reveal the path to the goal state.
This paper introduces a novel reinforcement learning algorithm for predicting the distance between two states in a Markov Decision Process. The learned distance function works as an intrinsic reward that fuels the agent's learning. Using the distance metric as a reward, we show that the algorithm performs comparably to model-free RL while being significantly more sample efficient in several test environments.
Keywords: Reinforcement Learning · Markov Decision Processes · Neural Networks · Representation Learning · Goal-Directed Reinforcement Learning
1 Introduction
Goal-directed reinforcement learning (GDRL) separates the learning into two
phases, where phase one aims to solve the goal-directed exploration problem
(GDE). To solve the GDE problem, the agent must determine at least one viable
path from the initial state to the goal state. In phase two, the agent uses the
learned path to find a near-optimal path. The two phases iterate until the agent's policy converges.
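To make the two phases concrete, the following is a minimal, self-contained sketch of goal-directed exploration followed by path refinement on a toy chain environment. The environment, helper names, and step budget are illustrative assumptions, not the paper's implementation.

```python
import random

# Toy deterministic chain: states 0..N-1, goal at N-1, actions {-1, +1}.
N, GOAL = 10, 9

def step(state, action):
    next_state = min(max(state + action, 0), N - 1)
    return next_state, next_state == GOAL

# Phase 1: goal-directed exploration -- find at least one viable path to the goal.
def explore_for_path(max_steps=1000):
    state, path = 0, [0]
    for _ in range(max_steps):
        next_state, done = step(state, random.choice([-1, 1]))
        path.append(next_state)
        state = next_state
        if done:
            return path
    return None

# Phase 2: exploit the discovered path -- move it toward a near-optimal one
# by erasing cycles (segments between repeated visits to the same state).
def refine_path(path):
    last_visit = {s: i for i, s in enumerate(path)}
    refined, i = [], 0
    while i < len(path):
        refined.append(path[i])
        i = last_visit[path[i]] + 1
    return refined

if __name__ == "__main__":
    found = explore_for_path()
    if found is not None:
        print("explored path length:", len(found))
        print("refined path length:", len(refine_path(found)))
```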
Reinforcement learning (RL) algorithms fall into two categories. Model-free RL learns a policy or a value function through interaction with the environment and succeeds in various simulated domains, including video games [19, 25], robotics [12, 15], and autonomous vehicles [7, 24], but this comes at the cost of efficiency. Specifically, model-free approaches suffer from low sample efficiency, which is a fundamental limitation for application in real-world physical systems.
On the other hand, model-based reinforcement learning (MBRL) aims to learn a predictive model of the environment to increase sample efficiency. The agent samples from the learned predictive model, which reduces the required interaction with the environment. However, it is challenging to achieve good accuracy of the predictive model in many domains, especially in high-complexity environments. With high complexity comes high modeling error (model bias), which is perhaps the most common cause of unstable and collapsing policies in model-based RL. Recent work in model-based RL focuses primarily on learning high-dimensional and complex predictive models with graphics as part of the MDP. This complicates the model severely and limits long-horizon predictions, as the prediction error increases exponentially.
This paper addresses this issue with a combination of GDRL and MBRL by learning a predictive model and a distance model that describes the distance between two states. The learned predictive model abstracts the state space to a distance between state and goal, which reduces the state complexity significantly. The learned distance is applied to the reward function of Deep Q-Learning (DQN) [18] and accelerates learning effectively (see the sketch after the list below). The proposed algorithm, CostNet, is an end-to-end solution for goal-directed reinforcement learning, and the main contributions are summarized as follows.
1. CostNet, a model for estimating the distance between arbitrary states and terminal states,
2. a modified DQN objective for efficient goal-directed reinforcement learning, and
3. a demonstration that the proposed method achieves excellent performance in simulated grid-like environments.
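As an illustration of contribution 2, the following is a minimal sketch of how a learned state-to-goal distance estimate could be folded into the DQN reward signal. The DistanceNet architecture, the shaping coefficient beta, and the function names are assumptions for illustration, not the authors' exact formulation (which is detailed in Section 4).

```python
import torch
import torch.nn as nn

class DistanceNet(nn.Module):
    """Predicts an estimated distance (cost-to-go) from a state to the goal."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus())  # Softplus keeps distances non-negative

    def forward(self, state):
        return self.net(state).squeeze(-1)

def shaped_reward(extrinsic_r, state, next_state, distance_net, beta=0.1):
    """Intrinsic bonus: reduction in estimated distance to the goal.

    Moving to a state that the distance model judges closer to the goal
    yields a positive bonus on top of the environment reward.
    """
    with torch.no_grad():
        bonus = distance_net(state) - distance_net(next_state)
    return extrinsic_r + beta * bonus

# Example usage with random tensors standing in for environment states.
if __name__ == "__main__":
    dnet = DistanceNet(state_dim=4)
    s, s_next = torch.randn(4), torch.randn(4)
    print(shaped_reward(torch.tensor(0.0), s, s_next, dnet))
```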
The paper is organized as follows. Section 2 details the preliminaries for the proposed method. Section 3 presents a detailed overview of related work. Section 4 introduces CostNet, a novel algorithm for cost-directed reinforcement learning. Section 5 thoroughly presents the results of the proposed approach, and Section 6 summarizes the work and proposes future work in goal-directed reinforcement learning.
2 Background
Model-based reinforcement learning builds a model of the environment to derive its behavioral policy. The underlying mechanism is a Markov Decision Process (MDP), which mathematically defines the synergy between state, reward, and actions as a tuple $M = (S, A, T, R)$, where $S = \{s_n, \ldots, s_{t+n}\}$ is a set of possible states and $A = \{a_n, \ldots, a_{t+n}\}$ is a set of possible actions. The state transition function $T : S \times A \times S \to [0, 1]$, which the predictive model tries to learn, is a probability function such that $T_{a_t}(s_t, s_{t+1})$ is the probability that the current state $s_t$ transitions to $s_{t+1}$ given that the agent chooses action $a_t$. The reward function is $R : S \times A \to \mathbb{R}$, where $R_{a_t}(s_t, s_{t+1})$ returns the immediate reward received when taking action $a_t$ in state $s_t$ with transition to $s_{t+1}$. The policy takes the form $\pi = \{s_1, a_1, s_2, a_2, \ldots, s_n, a_n\}$, where $\pi(a \mid s)$ denotes the chosen action given a state.
Model-based reinforcement learning divides primarily into three categories: 1) Dyna-based, 2) policy-search-based, and 3) shooting-based algorithms, of which this work concerns Dyna-based approaches. The Dyna algorithm from [26] trains in two steps. First, the algorithm collects experience from interaction with the environment using a policy from a model-free algorithm (e.g., Q-learning). This experience is used to learn an estimated model of the environment, also referred to as a predictive model. Second, the agent policy samples imagined data generated by the predictive model and updates its parameters towards optimal behavior.
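For concreteness, below is a minimal Dyna-style loop that interleaves direct Q-learning updates with planning updates drawn from a learned tabular model. The toy chain environment and the hyperparameters are illustrative assumptions, not taken from [26].

```python
import random
from collections import defaultdict

# Toy chain environment: states 0..9, actions 0 (left) / 1 (right), reward 1 at state 9.
def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), 9)
    return s2, float(s2 == 9), s2 == 9

Q = defaultdict(float)   # action-value table: Q[(state, action)]
model = {}               # learned model: (state, action) -> (next_state, reward)
alpha, gamma, eps, planning_steps = 0.1, 0.95, 0.1, 20

for episode in range(200):
    s, done = 0, False
    while not done:
        # Act epsilon-greedily in the real environment.
        a = random.randint(0, 1) if random.random() < eps else max((0, 1), key=lambda x: Q[(s, x)])
        s2, r, done = env_step(s, a)
        # Step 1a: direct RL update from real experience (Q-learning).
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
        # Step 1b: model learning from the same experience.
        model[(s, a)] = (s2, r)
        # Step 2: planning -- update from imagined transitions sampled from the model.
        for _ in range(planning_steps):
            (ps, pa), (ps2, pr) = random.choice(list(model.items()))
            Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, 0)], Q[(ps2, 1)]) - Q[(ps, pa)])
        s = s2
```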
Autoencoders are commonly used in supervised learning to encode arbitrary input into a compact representation and to reconstruct the original data from the encoding using a decoder. The purpose of autoencoders is to compress redundant data into a densely packed vector form. In its simplest form, an autoencoder consists of a feed-forward neural network where the input and output layers have equal neuron capacity and the hidden layer, used to compress the data, is smaller. The model consists of an encoder $Q(z \mid X)$, a latent variable distribution $P(z)$, and a decoder $P(\hat{X} \mid z)$. The input $X$ is a vector that represents only a fraction of the ground truth. The objective is for the autoencoder to learn the distribution of all possible training samples, including data not in the training set but nevertheless part of the distribution $P(X)$. The final objective for the model is $\mathbb{E}[\log P(X \mid z)] - D_{KL}[Q(z \mid X) \,\|\, P(z)]$, where the first term denotes the reconstruction loss, similar to standard autoencoders, and the second term is the distance between the estimated latent space and the ground-truth space. The ground-truth latent space is difficult to define, and therefore it is assumed to be a Gaussian; hence, the learned distribution should also be a Gaussian.
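The following is a minimal sketch of this objective as a variational autoencoder loss in PyTorch under the Gaussian prior assumption described above. The layer sizes, Bernoulli reconstruction likelihood, and input dimensionality are arbitrary illustrative choices, not CostNet's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hidden, z_dim), nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z = mu + sigma * epsilon.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def elbo_loss(x, x_hat, mu, logvar):
    # Reconstruction term: E[log P(X | z)], here a Bernoulli likelihood.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL term: D_KL[Q(z | X) || P(z)] with a standard-normal prior, in closed form.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Usage on a random batch standing in for real data.
if __name__ == "__main__":
    vae, x = VAE(), torch.rand(32, 784)
    x_hat, mu, logvar = vae(x)
    print(elbo_loss(x, x_hat, mu, logvar).item())
```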
3 Related Work
Pioneering work of the goal-directed viewpoint of reinforcement learning, uni-
formly suggests that pre-processing of the state-representation (i.e., model-based
RL) and careful reward modeling is the preferred method to perform efficient
GDRL. The following section introduces related work in GDRL and relevant
model-based reinforcement learning methods1.
1The reader is referred to [20] for an in-depth survey of MBRL-based methods.