only a few parameters.
• Using the dynamic model parameters directly as a context for the general policy.
• Combining model-based and model-free RL.
II. RELATED WORK
RL has shown success in numerous domains, such as
playing Atari games [1], [8], playing Go [9], and driving
autonomous vehicles [3], [10]. Some RL algorithms are designed for a single, specific environment [11], [12], while others can learn to master multiple environments [2], [8]; however, many require separate training for each environment.
Several meta-RL approaches have been proposed to mitigate the need for long training in every new environment. We begin by describing methods that, similarly to ours, learn a context-conditioned, general policy, but construct the context vector in different ways. We note that some previous works term the different training environments “tasks”, since they emphasize changes in the reward function. However, since our work focuses on environments with different dynamics (transition functions), we use the term “environments”. In [13], the environment properties
are predicted by a neural network based on a fixed, small
number of steps. However, this approach requires explicitly
defining the representative environment properties. More-
over, it assumes that these properties can be estimated based
on the immediate environmental dynamics. Fakoor et al. [14] introduce TD3-context, a TD3-based RL agent that uses a recurrent neural network (RNN), which receives the recent states and rewards as input, to create a context vector.
However, even though RNN variants such as LSTM [15] and GRU [16] are designed to capture long-term history, in practice the number of previous states the RNN effectively considers is limited [7]. Therefore, if an event that characterizes the environment occurs too early, the RNN will “forget” it and fail to provide an accurate context to the policy. In our method, RAMP,
the context consists of the parameters of a global, dynamic
model, which is not limited by the history length. Other
approaches use an RNN directly as the policy, conditioning it on the transitions and rewards of the previous episode [6], [17], instead of creating a context vector for a general policy. These approaches are also vulnerable to the same RNN memory limitation.
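For concreteness, the following is a minimal sketch of such a recurrence-based context encoder (the module name, dimensions, and window length K are our illustrative choices, not the exact architecture of [14]):

```python
import torch
import torch.nn as nn

class RNNContextEncoder(nn.Module):
    """Illustrative GRU encoder mapping the K most recent (state, action, reward)
    tuples to a fixed-size context vector for a context-conditioned policy."""

    def __init__(self, state_dim, action_dim, context_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(state_dim + action_dim + 1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, context_dim)

    def forward(self, states, actions, rewards):
        # states: (B, K, state_dim), actions: (B, K, action_dim), rewards: (B, K, 1)
        x = torch.cat([states, actions, rewards], dim=-1)
        _, h = self.gru(x)               # final hidden state: (1, B, hidden_dim)
        return self.head(h.squeeze(0))   # context vector: (B, context_dim)
```

Since the policy only sees the encoder's output for the window it was given, an environment-defining event that lies outside (or effectively beyond the memory of) that window cannot be reflected in the context.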
Finn et al. [5] proposed a different principle for meta-learning, termed “Model-Agnostic Meta-Learning” (MAML). In MAML, the neural network parameters are trained such that the model can be adapted to a new environment by updating all parameters with only a small number of gradient-descent steps. However, the training process of MAML may be challenging [18]. Furthermore, MAML uses on-policy RL and is therefore unsuitable for the more sample-efficient off-policy methods, such as the one used in our approach. Nevertheless,
since MAML can also be used for regression, we compare
our multi-environment dynamic model learning method to
MAML in Sec. V-A.
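As a concrete illustration of the MAML principle (not the configuration used in Sec. V-A; the sinusoid regression task, network size, and learning rates below are our own illustrative choices), consider the following self-contained sketch:

```python
import torch

def forward(params, x):
    """Tiny functional MLP, so that adapted parameters remain differentiable."""
    w1, b1, w2, b2 = params
    return torch.tanh(x @ w1 + b1) @ w2 + b2

def inner_adapt(params, x_s, y_s, lr_inner=0.01, steps=1):
    """MAML inner loop: a few gradient steps on the support set of one task."""
    adapted = list(params)
    for _ in range(steps):
        loss = ((forward(adapted, x_s) - y_s) ** 2).mean()
        grads = torch.autograd.grad(loss, adapted, create_graph=True)
        adapted = [p - lr_inner * g for p, g in zip(adapted, grads)]
    return adapted

def make_param(*shape, scale=0.1):
    return (torch.randn(*shape) * scale).requires_grad_()

# Meta-parameters of a small network for 1-D regression.
params = [make_param(1, 40), make_param(40), make_param(40, 1), make_param(1)]
meta_opt = torch.optim.Adam(params, lr=1e-3)

for _ in range(1000):                        # meta-training iterations
    meta_opt.zero_grad()
    for _ in range(4):                       # batch of sampled tasks/environments
        amp = torch.rand(1) * 4 + 1          # per-task variation (here: amplitude)
        x_s, x_q = torch.rand(10, 1) * 6, torch.rand(10, 1) * 6
        y_s, y_q = amp * torch.sin(x_s), amp * torch.sin(x_q)
        adapted = inner_adapt(params, x_s, y_s)                 # adapt on the support set
        ((forward(adapted, x_q) - y_q) ** 2).mean().backward()  # meta-loss on the query set
    meta_opt.step()                          # outer (meta) update of the initial parameters
```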
Some proposed meta-RL methods are suitable for off-policy learning [14], [19]. Meta-Q-Learning (MQL) [14] adapts the policy to new environments by using data from multiple previous environments stored in the replay buffer. The transitions from the replay buffer are reweighted to match
the current environment. We compare our method, RAMP,
to MQL in our testing environment in Sec. V-B.2.
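Such reweighting can be illustrated with a standard propensity-score estimate (a sketch of the general idea only, not the exact procedure of [14]; the feature matrices and classifier choice are our assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(buffer_feats, new_feats):
    """Train a classifier to distinguish old replay-buffer transitions from
    transitions of the new environment, then weight each buffer transition by
    the estimated odds p(new | x) / p(old | x)."""
    X = np.vstack([buffer_feats, new_feats])
    y = np.concatenate([np.zeros(len(buffer_feats)), np.ones(len(new_feats))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_new = clf.predict_proba(buffer_feats)[:, 1]
    w = p_new / np.clip(1.0 - p_new, 1e-6, None)
    return w / w.mean()    # normalized weights applied to the off-policy loss
```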
In contrast to all of these model-free meta-RL methods, model-based meta-RL methods have also been proposed. In model-based meta-RL, the agent learns a model
that can quickly adapt to the dynamics of a new environment.
Clavera et al. [20] propose to use recurrence-based or gradient-
based (MAML) online adaptation for learning the model.
Similarly, Lee et al. [21] train a model that is conditioned on an encoding of the previous transitions. In contrast to model-free RL, which learns a direct mapping (i.e., a policy) from states to actions, model-based RL computes the actions by planning (using a model-predictive controller) based on the learned model. In our work, we combine the model-free and model-based approaches, resulting in rapid learning of the environment dynamic model and of a direct policy that does not require planning.
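To make the distinction concrete, the following sketch shows how a model-based agent might compute an action by random-shooting model-predictive control (the `model.predict` and `reward_fn` interfaces are hypothetical):

```python
import numpy as np

def plan_action(model, reward_fn, state, action_dim, horizon=15, n_candidates=500):
    """Random-shooting MPC: sample candidate action sequences, roll them out
    through the learned dynamics model, and return the first action of the
    sequence with the highest predicted return."""
    seqs = np.random.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    states = np.repeat(state[None, :], n_candidates, axis=0)
    for t in range(horizon):
        next_states = model.predict(states, seqs[:, t])        # learned dynamics: (s, a) -> s'
        returns += reward_fn(states, seqs[:, t], next_states)  # known or learned reward
        states = next_states
    return seqs[np.argmax(returns), 0]
```

RAMP avoids this per-step planning loop: the learned model's parameters serve only as the context of a directly trained policy.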
III. PROBLEM DEFINITION
We consider a set of N environments that are modeled as Markov decision processes $M_k = \{S, A, T_k, R\}$, $k \in \{1, \dots, N\}$. All environments share the same state space $S$, action space $A$, and reward function $R$, and differ only by their unknown transition functions $T_k$. These $N$ environments are randomly split into training environments $M_{\text{train}}$ and testing environments $M_{\text{test}}$.
The meta-RL agent is trained on the $M_{\text{train}}$ environments and must then adapt separately to each of the $M_{\text{test}}$ environments. That is, the agent is permitted to interact with the $M_{\text{train}}$ environments for an unlimited number of episodes. Then, the meta-RL agent is given only a short opportunity to interact with each of the $M_{\text{test}}$ environments (e.g., a single episode or a fixed number of time steps) and to update its policy based on this interaction. Overall, the agent's goal is to maximize the average expected discounted return over the $M_{\text{test}}$ environments.
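Written explicitly, with $\gamma \in (0, 1)$ denoting the discount factor (our notation; it is left implicit above), the objective is
\[
\max_{\pi} \;\; \frac{1}{|M_{\text{test}}|} \sum_{M_k \in M_{\text{test}}} \mathbb{E}_{\pi,\, T_k}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right],
\]
where the expectation is taken over trajectories generated in environment $M_k$ by the policy adapted from the short interaction allowed in that environment.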
IV. RAMP
RAMP is trained in two phases: in the first phase, a multi-environment dynamic model is learned; in the second phase, the parameters of that dynamic model are used as the context for the multi-environment policy of the reinforcement learning agent. The following sections first describe how the multi-environment dynamic model is learned by exploiting the environments' common structure, and then describe the reinforcement learning agent.
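The core idea of the second phase can be sketched as follows (a minimal illustration with made-up dimensions and a simple concatenation-based conditioning; the actual RAMP architecture is described in the remainder of this section):

```python
import torch
import torch.nn as nn

state_dim, action_dim, env_param_dim = 8, 2, 16

# Phase 1 output (illustrative): the per-environment parameters of the learned
# multi-environment dynamic model, flattened into a single vector.
env_model_params = torch.randn(env_param_dim)

# Phase 2 (illustrative): a single policy conditioned on the state together
# with those model parameters, which act as the environment's context.
policy = nn.Sequential(
    nn.Linear(state_dim + env_param_dim, 64), nn.ReLU(),
    nn.Linear(64, action_dim), nn.Tanh(),
)

state = torch.randn(state_dim)
action = policy(torch.cat([state, env_model_params]))   # context-conditioned action
```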
A. Multi-Environment Dynamic Model
Attempting to approximate the transition function $T_k$ of each environment by an individual neural network is likely to work well for the training environments, but it is unlikely to generalize to the testing environments, for which we have only a limited set of data points. However, since