CostNet: An End-to-End Framework for
Goal-Directed Reinforcement Learning
Per-Arne Andersen[0000-0002-7742-4907], Morten Goodwin[0000-0001-6331-702X], and Ole-Christoffer Granmo[0000-0002-7287-030X]
Department of ICT, University of Agder, Grimstad, Norway
{per.andersen,morten.goodwin,ole.granmo}@uia.no
Abstract. Reinforcement Learning (RL) is a general framework concerned with an agent that seeks to maximize rewards in an environment. The learning typically happens through trial and error using explorative methods such as ε-greedy. There are two approaches, model-based and model-free reinforcement learning, that show concrete results in several disciplines. Model-based RL learns a model of the environment for learning the policy, while model-free approaches are fully explorative and exploitative without considering the underlying environment dynamics. Model-free RL works conceptually well in simulated environments, and empirical evidence suggests that trial and error leads to near-optimal behavior with enough training. On the other hand, model-based RL aims to be sample efficient, and studies show that it requires far less training in the real environment to learn a good policy.
A significant challenge with RL is that it relies on a well-defined reward function to work well in complex environments, and such a reward function is challenging to define. Goal-Directed RL is an alternative method that learns an intrinsic reward function with emphasis on a few explored trajectories that reveal the path to the goal state.
This paper introduces a novel reinforcement learning algorithm for predicting the distance between two states in a Markov Decision Process. The learned distance function works as an intrinsic reward that fuels the agent's learning. Using the distance metric as a reward, we show that the algorithm performs comparably to model-free RL while being significantly more sample efficient in several test environments.
Keywords: Reinforcement Learning · Markov Decision Processes · Neural Networks · Representation Learning · Goal-Directed Reinforcement Learning
1 Introduction
Goal-directed reinforcement learning (GDRL) separates the learning into two
phases, where phase one aims to solve the goal-directed exploration problem
(GDE). To solve the GDE problem, the agent must determine at least one viable
path from the initial state to the goal state. In phase two, the agent uses the
learned path to find a near-optimal path. The two phases iterate until the agent's policy converges.
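To make the two phases concrete, the following is a minimal, self-contained sketch of goal-directed exploration followed by path refinement on a toy chain environment. The environment, helper names, and step budget are illustrative assumptions, not the paper's implementation.

```python
import random

# Toy deterministic chain: states 0..N-1, goal at N-1, actions {-1, +1}.
N, GOAL = 10, 9

def step(state, action):
    next_state = min(max(state + action, 0), N - 1)
    return next_state, next_state == GOAL

# Phase 1: goal-directed exploration -- find at least one viable path to the goal.
def explore_for_path(max_steps=1000):
    state, path = 0, [0]
    for _ in range(max_steps):
        next_state, done = step(state, random.choice([-1, 1]))
        path.append(next_state)
        state = next_state
        if done:
            return path
    return None

# Phase 2: exploit the discovered path -- move it toward a near-optimal one
# by erasing cycles (segments between repeated visits to the same state).
def refine_path(path):
    last_visit = {s: i for i, s in enumerate(path)}
    refined, i = [], 0
    while i < len(path):
        refined.append(path[i])
        i = last_visit[path[i]] + 1
    return refined

if __name__ == "__main__":
    found = explore_for_path()
    if found is not None:
        print("explored path length:", len(found))
        print("refined path length:", len(refine_path(found)))
```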
Reinforcement learning (RL) algorithms fall into two categories. Model-free RL learns a policy or a value function through interaction with the environment and succeeds in various simulated domains, including video games [19, 25], robotics [12, 15], and autonomous vehicles [7, 24], but this comes at the cost of efficiency. Specifically, model-free approaches suffer from low sample efficiency, which is a fundamental limitation for application in real-world physical systems.
On the other hand, model-based reinforcement learning (MBRL) aims to learn a predictive model of the environment to increase sample efficiency. The agent samples from the learned predictive model, which reduces the required interaction with the environment. However, it is challenging to achieve good accuracy of the predictive model in many domains, especially in high-complexity environments. With high complexity comes high modeling error (model bias), which is perhaps the most common cause of unstable and collapsing policies in model-based RL. Recent work in model-based RL focuses primarily on learning high-dimensional and complex predictive models with graphics as part of the MDP. This complicates the model severely and limits long-horizon predictions, as the prediction error increases exponentially.
This paper addresses this issue with a combination of GDRL and MBRL by learning a predictive model and a distance model that describes the distance between two states. The learned predictive model abstracts the state space to a distance between state and goal, which reduces the state complexity significantly. The learned distance is applied to the reward function of Deep Q-Learning (DQN) [18] and accelerates learning effectively (see the sketch after the list below). The proposed algorithm, CostNet, is an end-to-end solution for goal-directed reinforcement learning, and the main contributions are summarized as follows.
1. CostNet, a model for estimating the distance between arbitrary states and terminal states,
2. a modified DQN objective for efficient goal-directed reinforcement learning, and
3. a demonstration that the proposed method achieves excellent performance in simulated grid-like environments.
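As an illustration of contribution 2, the following is a minimal sketch of how a learned state-to-goal distance estimate could be folded into the DQN reward signal. The DistanceNet architecture, the shaping coefficient beta, and the function names are assumptions for illustration, not the authors' exact formulation (which is detailed in Section 4).

```python
import torch
import torch.nn as nn

class DistanceNet(nn.Module):
    """Predicts an estimated distance (cost-to-go) from a state to the goal."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus())  # Softplus keeps distances non-negative

    def forward(self, state):
        return self.net(state).squeeze(-1)

def shaped_reward(extrinsic_r, state, next_state, distance_net, beta=0.1):
    """Intrinsic bonus: reduction in estimated distance to the goal.

    Moving to a state that the distance model judges closer to the goal
    yields a positive bonus on top of the environment reward.
    """
    with torch.no_grad():
        bonus = distance_net(state) - distance_net(next_state)
    return extrinsic_r + beta * bonus

# Example usage with random tensors standing in for environment states.
if __name__ == "__main__":
    dnet = DistanceNet(state_dim=4)
    s, s_next = torch.randn(4), torch.randn(4)
    print(shaped_reward(torch.tensor(0.0), s, s_next, dnet))
```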
The paper is organized as follows. Section 2 details the preliminaries for the proposed method. Section 3 presents a detailed overview of related work. Section 4 introduces CostNet, a novel algorithm for cost-directed reinforcement learning. Section 5 thoroughly presents the results of the proposed approach, and Section 6 summarizes the work and proposes future work in goal-directed reinforcement learning.
2 Background
Model-based reinforcement learning builds a model of the environment to derive its behavioral policy. The underlying mechanism is a Markov Decision Process (MDP), which mathematically defines the synergy between state, reward, and actions as a tuple $M = (S, A, T, R)$, where $S = \{s_n, \ldots, s_{t+n}\}$ is a set of possible states and $A = \{a_n, \ldots, a_{t+n}\}$ is a set of possible actions. The state transition function $T : S \times A \times S \to [0, 1]$, which the predictive model tries to learn, is a probability function such that $T_{a_t}(s_t, s_{t+1})$ is the probability that the current state $s_t$ transitions to $s_{t+1}$ given that the agent chooses action $a_t$. The reward function is $R : S \times A \to \mathbb{R}$, where $R_{a_t}(s_t, s_{t+1})$ returns the immediate reward received when taking action $a_t$ in state $s_t$ with transition to $s_{t+1}$. The policy takes the form $\pi = \{s_1, a_1, s_2, a_2, \ldots, s_n, a_n\}$, where $\pi(a \mid s)$ denotes the chosen action given a state.
Model-based reinforcement learning divides primarily into three categories: 1) Dyna-based, 2) policy-search-based, and 3) shooting-based algorithms, of which this work concerns Dyna-based approaches. The Dyna algorithm from [26] trains in two steps. First, the algorithm collects experience from interaction with the environment using a policy from a model-free algorithm (e.g., Q-learning). This experience is used to learn an estimated model of the environment, also referred to as a predictive model. Second, the agent policy samples imagined data generated by the predictive model and updates its parameters towards optimal behavior.
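For concreteness, below is a minimal Dyna-style loop that interleaves direct Q-learning updates with planning updates drawn from a learned tabular model. The toy chain environment and the hyperparameters are illustrative assumptions, not taken from [26].

```python
import random
from collections import defaultdict

# Toy chain environment: states 0..9, actions 0 (left) / 1 (right), reward 1 at state 9.
def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), 9)
    return s2, float(s2 == 9), s2 == 9

Q = defaultdict(float)   # action-value table: Q[(state, action)]
model = {}               # learned model: (state, action) -> (next_state, reward)
alpha, gamma, eps, planning_steps = 0.1, 0.95, 0.1, 20

for episode in range(200):
    s, done = 0, False
    while not done:
        # Act epsilon-greedily in the real environment.
        a = random.randint(0, 1) if random.random() < eps else max((0, 1), key=lambda x: Q[(s, x)])
        s2, r, done = env_step(s, a)
        # Step 1a: direct RL update from real experience (Q-learning).
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
        # Step 1b: model learning from the same experience.
        model[(s, a)] = (s2, r)
        # Step 2: planning -- update from imagined transitions sampled from the model.
        for _ in range(planning_steps):
            (ps, pa), (ps2, pr) = random.choice(list(model.items()))
            Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, 0)], Q[(ps2, 1)]) - Q[(ps, pa)])
        s = s2
```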
Autoencoders are commonly used in supervised learning to encode arbitrary input into a compact representation and to reconstruct the original data from the encoding using a decoder. The purpose of autoencoders is to compress redundant data into a densely packed vector form. In its simplest form, an autoencoder consists of a feed-forward neural network where the input and output layers have equal neuron capacity and the hidden layer, used to compress the data, is smaller. The model consists of an encoder $Q(z \mid X)$, a latent variable distribution $P(z)$, and a decoder $P(\hat{X} \mid z)$. The input $X$ is a vector that represents only a fraction of the ground truth. The objective is for the autoencoder to learn the distribution of all possible training samples, including data not in the training set but nevertheless part of the distribution $P(X)$. The final objective for the model is $\mathbb{E}[\log P(X \mid z)] - D_{KL}[Q(z \mid X) \,\|\, P(z)]$, where the first term denotes the reconstruction loss, similar to standard autoencoders, and the second term is the distance between the estimated latent space and the ground-truth space. The ground-truth latent space is difficult to define, and therefore it is assumed to be a Gaussian; hence, the learned distribution should also be a Gaussian.
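The following is a minimal sketch of this objective as a variational autoencoder loss in PyTorch under the Gaussian prior assumption described above. The layer sizes, Bernoulli reconstruction likelihood, and input dimensionality are arbitrary illustrative choices, not CostNet's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hidden, z_dim), nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z = mu + sigma * epsilon.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def elbo_loss(x, x_hat, mu, logvar):
    # Reconstruction term: E[log P(X | z)], here a Bernoulli likelihood.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL term: D_KL[Q(z | X) || P(z)] with a standard-normal prior, in closed form.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Usage on a random batch standing in for real data.
if __name__ == "__main__":
    vae, x = VAE(), torch.rand(32, 784)
    x_hat, mu, logvar = vae(x)
    print(elbo_loss(x, x_hat, mu, logvar).item())
```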
3 Related Work
Pioneering work of the goal-directed viewpoint of reinforcement learning, uni-
formly suggests that pre-processing of the state-representation (i.e., model-based
RL) and careful reward modeling is the preferred method to perform efficient
GDRL. The following section introduces related work in GDRL and relevant
model-based reinforcement learning methods1.
1The reader is referred to [20] for an in-depth survey of MBRL-based methods.