Meta-Reinforcement Learning Using Model Parameters
Gabriel Hartmann1,2 and Amos Azaria2
Abstract: In meta-reinforcement learning, an agent is
trained in multiple different environments and attempts to learn
a meta-policy that can efficiently adapt to a new environment.
This paper presents RAMP, a Reinforcement learning Agent
using Model Parameters that utilizes the idea that a neural
network trained to predict environment dynamics encapsulates
the environment information. RAMP is constructed in two
phases: in the first phase, a multi-environment parameterized
dynamic model is learned. In the second phase, the model
parameters of the dynamic model are used as context for the
multi-environment policy of the model-free reinforcement learning
agent. We show the performance of our novel method in simulated
experiments and compare it to existing methods.
I. INTRODUCTION
Common approaches for developing controllers do not rely
on machine learning. Instead, engineers manually construct
the controller based on general information about the world
and the problem. After repetitively testing the controller in
the environment, the engineer improves the controller based
on the feedback from these tests. That is, a human is an
essential part of this iterative process. Reinforcement Learn-
ing (RL) reduces human effort by automatically learning
from interaction with the environment. Instead of explicitly
designing and improving a controller, the engineer develops
a general RL agent that learns to improve the controller’s
performance without human intervention. The RL agent is
usually general and does not include specific information
about the target environment; this allows it to adapt to dif-
ferent environments. Indeed, RL agents may achieve higher
performance compared to human-crafted controllers [1]–[3].
However, RL agents usually require training from the ground
up for every new environment, which requires extensive
interaction in the new environment.
One solution to speed up the training time is to explicitly
provide human-crafted information about the environment
(context) to the RL agent [4]. However, such a solution
requires explicitly analyzing the target environment, which
may be challenging and time-consuming.
Instead of relying on the human understanding of the
problem for providing such context, a meta-Reinforcement
Learning (meta-RL) agent can learn to extract a proper envi-
ronmental context. To that end, a meta-RL agent is trained on
extended interaction in multiple different environments, and
then, after a short interaction in a new, unseen environment, it
is required to perform well in it [5], [6].

This research was supported, in part, by the Ministry of Science &
Technology, Israel.
1Department of Mechanical Engineering and Mechatronics, Ariel University, Israel
2Department of Computer Science, Ariel University, Israel
gabrielh@ariel.ac.il, amos.azaria@ariel.ac.il

Specifically, a meta-RL algorithm that is based on context extraction is composed
of two phases. First, in the meta-learning phase, the agent
learns a general policy suitable for all environments given a
context. Additionally, in this phase, the meta-RL agent learns
how to extract a context from samples obtained from an
environment. Second, in the adaptation phase, the meta-RL
agent conducts a short interaction in the new environment,
and the context is extracted from it. This context is then fed
to the general policy, which acts in the new environment.
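As a rough illustration of this two-phase scheme, the sketch below shows what the adaptation phase might look like, assuming a Gym-style environment interface; extract_context and general_policy are hypothetical stand-ins for whatever context extractor and general policy the meta-learning phase produced, not names from the paper.

```python
def adapt_and_act(env, extract_context, general_policy, adapt_steps=100):
    """Adaptation phase of a context-based meta-RL agent (illustrative only).

    1. Collect a short interaction in the new environment.
    2. Extract a context from that interaction.
    3. Act with the general, context-conditioned policy.
    """
    # 1. Short interaction, here with a random behavior policy.
    history = []
    state = env.reset()
    for _ in range(adapt_steps):
        action = env.action_space.sample()
        next_state, reward, done, info = env.step(action)
        history.append((state, action, reward, next_state))
        state = env.reset() if done else next_state

    # 2. Context extraction (an RNN encoder in prior work,
    #    dynamic-model parameters in RAMP).
    context = extract_context(history)

    # 3. One evaluation episode with the general policy.
    state, total_reward, done = env.reset(), 0.0, False
    while not done:
        action = general_policy(state, context)
        state, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward
```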
One common approach for context extraction is using
a Recurrent Neural Network (RNN). That is, the RNN
receives the history of the states, actions, and rewards and
is trained to output a context that is useful for the general
policy. However, the RNN's limited long-term memory
usually restricts the effective history length [7]. Additionally,
since the context vector is not directly interpretable, it is
difficult to examine the learning process and understand whether
the RNN has learned to extract the representative properties of
the environments.
In this paper, we introduce RAMP – a Reinforcement
learning Agent using Model Parameters. We utilize the idea
that a neural network trained to predict environment dynam-
ics encapsulates the environment properties; therefore, its pa-
rameters can be used as the context for the policy. During the
meta-RL phase, RAMP learns a neural network that predicts
the environment dynamics for each environment. However,
since the number of the neural network’s parameters is
usually high, it is challenging for the policy to use the entire
set of parameters as its context. Therefore, the majority of
the model’s parameters are shared between all environments,
and only a small set of parameters is trained separately
in each environment. In that way, the environment-specific
parameters represent the specific environment properties.
Consequently, a general policy uses only these parameters as
context and outputs actions that are suitable for that particular
environment. One advantage of RAMP is that the history
length used for the context extraction is not limited because
the context is extracted from a global dynamic model.
Additionally, the combination of model learning and RL in
RAMP makes the training process more transparent since it
is possible to evaluate the performance of the model learning
process independently. We demonstrate the effectiveness of
RAMP in several simulated experiments in Sec. V.
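To make this parameter-splitting idea concrete, the following PyTorch-style sketch (our illustration, not the authors' implementation; all layer sizes and names are assumptions) shows a dynamic model whose parameters are divided into a large shared part and a small per-environment vector that doubles as the policy context.

```python
import torch
import torch.nn as nn

class MultiEnvDynamicsModel(nn.Module):
    """Predicts the next state from (state, action) and an env-specific parameter vector."""

    def __init__(self, state_dim, action_dim, env_param_dim, n_envs, hidden=256):
        super().__init__()
        # Large set of parameters shared across all environments.
        self.shared = nn.Sequential(
            nn.Linear(state_dim + action_dim + env_param_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
        # Small set of parameters trained separately for each environment.
        self.env_params = nn.Parameter(torch.zeros(n_envs, env_param_dim))

    def forward(self, state, action, env_idx):
        z = self.env_params[env_idx]          # environment-specific parameters
        x = torch.cat([state, action, z], dim=-1)
        return self.shared(x)                 # predicted next state


class ContextConditionedPolicy(nn.Module):
    """Model-free policy that receives the env-specific model parameters as context."""

    def __init__(self, state_dim, action_dim, env_param_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + env_param_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Tanh(),
        )

    def forward(self, state, env_params):
        return self.net(torch.cat([state, env_params], dim=-1))
```

In this sketch, adapting to a new environment amounts to fitting only a fresh row of env_params on a short interaction while the shared trunk and the policy stay frozen.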
To summarize, the contributions of this paper are:
• Suggesting a novel method for meta-reinforcement learning.
• Presenting a multi-environment dynamic model learning method that adapts to new environments by updating only a few parameters.
• Using the dynamic model parameters directly as a context for the general policy.
• Combining model-based and model-free RL.
II. RELATED WORK
RL has shown success in numerous domains, such as
playing Atari games [1], [8], playing Go [9], and driving
autonomous vehicles [3], [10]. Some RL agents are designed for one
specific environment [11], [12], while others can learn to
master multiple environments [2], [8]; however, many algo-
rithms require separate training for each environment.
Several meta-RL approaches have been proposed to mitigate
the need for long training in each new environment. We
begin by describing methods that, similarly to ours, learn
a context-conditioned, general policy. However, they construct
the context vector in different ways. We note that some previous
works term the different training environments “tasks” since
they emphasize the changes in the reward
function. However, since our work focuses on environments
with different dynamics (transition functions), we use the
term “environments”. In [13], the environment properties
are predicted by a neural network based on a fixed, small
number of steps. However, this approach requires explicitly
defining the representative environment properties. More-
over, it assumes that these properties can be estimated based
on the immediate environmental dynamics. Fakoor et al.
[14] introduce TD3-context, a TD3-based RL agent that
uses a recurrent neural network (RNN), which receives the
recent states and rewards as input, to create a context vector.
However, even though RNN variants such as LSTM [15]
and GRU [16] are designed to retain long-term history, in practice,
the number of previous states considered by the RNN is
limited [7]. Therefore, if an event that defines an environment
occurs too early, the RNN will “forget” it and not provide
an accurate context to the policy. In our method, RAMP,
the context consists of the parameters of a global dynamic
model and is therefore not limited by the history length. Other
approaches use the RNN directly as a policy, based on the
transitions and rewards during the previous episode [6], [17],
instead of creating a context vector for a general policy.
These approaches are also vulnerable to this RNN memory
limitation.
Finn et al. [5] proposed a different principle for meta-
learning termed “Model-Agnostic Meta-Learning” (MAML).
In MAML, the neural network parameters are trained such
that the model can adapt to a new environment by updating
all parameters with only a small number of gradient-descent
steps. However, the training process of MAML
may be challenging [18]. Furthermore, MAML uses on-policy
RL and therefore cannot exploit the more sample-efficient
off-policy methods used in our approach. Nevertheless,
since MAML can also be used for regression, we compare
our multi-environment dynamic model learning method to
MAML in Sec. V-A.
Some proposed meta-RL methods are suitable for off-
policy learning [14], [19]. Meta-Q-learning (MQL) [14]
adapts the policy to new environments by using data from
multiple previous environments stored in the replay buffer.
The transitions from the replay buffer are reweighted to match
the current environment. We compare our method, RAMP,
to MQL in our testing environment in Sec. V-B.2.
In contrast to these meta-RL methods, which are all
model-free, model-based meta-RL methods have also been
proposed. In model-based meta-RL, the agent learns a model
that can quickly adapt to the dynamics of a new environment.
Ignasi et al. [20] propose to use recurrence-based or gradient-
based (MAML) online adaptation for learning the model.
Similarly, Lee et al. [21] train a model that is conditioned on
the encoded previous transitions. In contrast to model-free
RL, which learns a direct mapping (i.e., a policy) between
states and actions, model-based RL computes the actions
by planning (using a model-predictive controller) based on
the learned model. In our work, we combine the model-free
and model-based approaches, resulting in rapid learning of
the environment dynamic model and a direct policy without
the need for planning.
III. PROBLEM DEFINITION
We consider a set of $N$ environments that are modeled as Markov Decision Processes $M_k = \{S, A, T_k, R\}$, $k \in \{1, \ldots, N\}$. All environments share the same state space $S$, action space $A$, and reward function $R$, and differ only by their unknown transition function $T_k$. These $N$ environments are randomly split into training environments $M_{train}$ and testing environments $M_{test}$.

The meta-RL agent is trained on the $M_{train}$ environments and must adapt separately to each of the $M_{test}$ environments. That is, the agent is permitted to interact with the $M_{train}$ environments for an unlimited number of episodes. Then, the meta-RL agent is given only a short opportunity to interact with each of the $M_{test}$ environments (e.g., a single episode or a fixed number of time steps) and to update its policy based on this interaction. Overall, the agent's goal is to maximize the average expected discounted return across the $M_{test}$ environments.
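Written out (a standard formulation consistent with this description; $\gamma$ denotes the discount factor and $\pi_k$ the policy after the short adaptation in environment $M_k$, neither of which is defined in the excerpt above), the objective is

$$\max \; \frac{1}{|M_{test}|} \sum_{M_k \in M_{test}} \mathbb{E}_{\pi_k, T_k}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)\right].$$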
IV. RAMP
RAMP is constructed in two phases: in the first phase,
a multi-environment dynamic model is learned, and in the
second phase, the model parameters of the dynamic model
are used as context for the multi-environment policy of
the reinforcement learning agent. The following sections
first describe how the multi-environment dynamic model is
learned by exploiting the environments’ common structure.
In the second part, we describe the reinforcement learning
agent.
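The two phases can be outlined as follows (an illustrative sketch only; collect, fit_dynamics, and rl_update are hypothetical helpers, and model is assumed to expose per-environment parameters env_params as in the earlier architecture sketch):

```python
import random

def train_ramp(model, policy, train_envs, collect, fit_dynamics, rl_update,
               model_iters=1000, policy_iters=100000, rollout_steps=200):
    """High-level outline of the two RAMP phases (illustrative, not the authors' code)."""
    # Phase 1: learn the multi-environment dynamic model.
    for _ in range(model_iters):
        for k, env in enumerate(train_envs):
            batch = collect(env, rollout_steps)
            # Gradient step on the shared parameters and on env_params[k] only.
            fit_dynamics(model, batch, env_idx=k)

    # Phase 2: learn a general, context-conditioned, model-free policy.
    for _ in range(policy_iters):
        k = random.randrange(len(train_envs))
        context = model.env_params[k].detach()  # model parameters as context
        rl_update(policy, train_envs[k], context)

    # At test time: freeze the shared model and the policy, fit only a new
    # env-specific parameter vector from a short interaction, and condition
    # the policy on it.
    return model, policy
```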
A. Multi-Environment Dynamic Model
Attempting to approximate the transition function of each
environment $T_k$ by an individual neural network is likely
to work well for the training environments. However, it is
unlikely to generalize to the testing environments, as we have
only a limited set of data points for them. However, since