only a few parameters.
• Using the dynamic model parameters directly as a context for the general policy.
• Combining model-based and model-free RL.
II. RELATED WORK
RL has shown success in numerous domains, such as
playing Atari games [1], [8], playing Go [9], and driving
autonomous vehicles [3], [10]. Some RL algorithms are designed for a single, specific environment [11], [12], while others can learn to master multiple environments [2], [8]; however, many require separate training for each environment.
Several meta-RL approaches have been proposed to mitigate the need for long training in every new environment. We begin by describing methods that, similarly to ours, learn a context-conditioned, general policy, but construct the context vector in different ways. We note that some previous works term the different training environments “tasks”, since they emphasize changes in the reward function. However, since our work focuses on environments with different dynamics (transition functions), we use the term “environments”. In [13], the environment properties
are predicted by a neural network based on a fixed, small
number of steps. However, this approach requires explicitly
defining the representative environment properties. More-
over, it assumes that these properties can be estimated based
on the immediate environmental dynamics. Fakoor et al. [14] introduce TD3-context, a TD3-based RL agent that uses a recurrent neural network (RNN), which receives the recent states and rewards as input, to create a context vector.
However, even though RNN variants such as LSTM [15] and GRU [16] are designed to capture long-term history, in practice the number of previous states the RNN effectively considers is limited [7]. Therefore, if an event that characterizes the environment occurs too early, the RNN will “forget” it and fail to provide an accurate context to the policy. In our method, RAMP,
the context consists of the parameters of a global, dynamic
model, which is not limited by the history length. Other
approaches use an RNN directly as the policy, conditioning it on the transitions and rewards of the previous episode [6], [17], instead of creating a context vector for a general policy. These approaches are also vulnerable to the same RNN memory limitation.
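For concreteness, the following is a minimal sketch of such a recurrence-based context encoder (the module name, dimensions, and window length K are our illustrative choices, not the exact architecture of [14]):

```python
import torch
import torch.nn as nn

class RNNContextEncoder(nn.Module):
    """Illustrative GRU encoder mapping the K most recent (state, action, reward)
    tuples to a fixed-size context vector for a context-conditioned policy."""

    def __init__(self, state_dim, action_dim, context_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(state_dim + action_dim + 1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, context_dim)

    def forward(self, states, actions, rewards):
        # states: (B, K, state_dim), actions: (B, K, action_dim), rewards: (B, K, 1)
        x = torch.cat([states, actions, rewards], dim=-1)
        _, h = self.gru(x)               # final hidden state: (1, B, hidden_dim)
        return self.head(h.squeeze(0))   # context vector: (B, context_dim)
```

Since the policy only sees the encoder's output for the window it was given, an environment-defining event that lies outside (or effectively beyond the memory of) that window cannot be reflected in the context.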
Finn et al. [5] proposed a different principle for meta-learning, termed “Model-Agnostic Meta-Learning” (MAML). In MAML, the neural network parameters are trained such that the model can be adapted to a new environment by updating all parameters with only a small number of gradient-descent steps. However, the training process of MAML may be challenging [18]. Furthermore, MAML uses on-policy RL and is therefore unsuitable for the more sample-efficient off-policy methods, such as the one used in our approach. Nevertheless,
since MAML can also be used for regression, we compare
our multi-environment dynamic model learning method to
MAML in Sec. V-A.
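As a concrete illustration of the MAML principle (not the configuration used in Sec. V-A; the sinusoid regression task, network size, and learning rates below are our own illustrative choices), consider the following self-contained sketch:

```python
import torch

def forward(params, x):
    """Tiny functional MLP, so that adapted parameters remain differentiable."""
    w1, b1, w2, b2 = params
    return torch.tanh(x @ w1 + b1) @ w2 + b2

def inner_adapt(params, x_s, y_s, lr_inner=0.01, steps=1):
    """MAML inner loop: a few gradient steps on the support set of one task."""
    adapted = list(params)
    for _ in range(steps):
        loss = ((forward(adapted, x_s) - y_s) ** 2).mean()
        grads = torch.autograd.grad(loss, adapted, create_graph=True)
        adapted = [p - lr_inner * g for p, g in zip(adapted, grads)]
    return adapted

def make_param(*shape, scale=0.1):
    return (torch.randn(*shape) * scale).requires_grad_()

# Meta-parameters of a small network for 1-D regression.
params = [make_param(1, 40), make_param(40), make_param(40, 1), make_param(1)]
meta_opt = torch.optim.Adam(params, lr=1e-3)

for _ in range(1000):                        # meta-training iterations
    meta_opt.zero_grad()
    for _ in range(4):                       # batch of sampled tasks/environments
        amp = torch.rand(1) * 4 + 1          # per-task variation (here: amplitude)
        x_s, x_q = torch.rand(10, 1) * 6, torch.rand(10, 1) * 6
        y_s, y_q = amp * torch.sin(x_s), amp * torch.sin(x_q)
        adapted = inner_adapt(params, x_s, y_s)                 # adapt on the support set
        ((forward(adapted, x_q) - y_q) ** 2).mean().backward()  # meta-loss on the query set
    meta_opt.step()                          # outer (meta) update of the initial parameters
```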
Some proposed meta-RL methods are suitable for off-policy learning [14], [19]. Meta-Q-Learning (MQL) [14] adapts the policy to new environments by using data from multiple previous environments stored in the replay buffer. The transitions from the replay buffer are reweighted to match
the current environment. We compare our method, RAMP,
to MQL in our testing environment in Sec. V-B.2.
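Such reweighting can be illustrated with a standard propensity-score estimate (a sketch of the general idea only, not the exact procedure of [14]; the feature matrices and classifier choice are our assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(buffer_feats, new_feats):
    """Train a classifier to distinguish old replay-buffer transitions from
    transitions of the new environment, then weight each buffer transition by
    the estimated odds p(new | x) / p(old | x)."""
    X = np.vstack([buffer_feats, new_feats])
    y = np.concatenate([np.zeros(len(buffer_feats)), np.ones(len(new_feats))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_new = clf.predict_proba(buffer_feats)[:, 1]
    w = p_new / np.clip(1.0 - p_new, 1e-6, None)
    return w / w.mean()    # normalized weights applied to the off-policy loss
```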
In contrast to all of these model-free meta-RL methods, model-based meta-RL methods have also been proposed. In model-based meta-RL, the agent learns a model
that can quickly adapt to the dynamics of a new environment.
Clavera et al. [20] propose to use recurrence-based or gradient-
based (MAML) online adaptation for learning the model.
Similarly, Lee et al. [21] train a model that is conditioned on an encoding of the previous transitions. In contrast to model-free RL, which learns a direct mapping (i.e., a policy) from states to actions, model-based RL computes the actions by planning (using a model-predictive controller) based on the learned model. In our work, we combine the model-free and model-based approaches, resulting in rapid learning of the environment dynamic model and of a direct policy that does not require planning.
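To make the distinction concrete, the following sketch shows how a model-based agent might compute an action by random-shooting model-predictive control (the `model.predict` and `reward_fn` interfaces are hypothetical):

```python
import numpy as np

def plan_action(model, reward_fn, state, action_dim, horizon=15, n_candidates=500):
    """Random-shooting MPC: sample candidate action sequences, roll them out
    through the learned dynamics model, and return the first action of the
    sequence with the highest predicted return."""
    seqs = np.random.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    states = np.repeat(state[None, :], n_candidates, axis=0)
    for t in range(horizon):
        next_states = model.predict(states, seqs[:, t])        # learned dynamics: (s, a) -> s'
        returns += reward_fn(states, seqs[:, t], next_states)  # known or learned reward
        states = next_states
    return seqs[np.argmax(returns), 0]
```

RAMP avoids this per-step planning loop: the learned model's parameters serve only as the context of a directly trained policy.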
III. PROBLEM DEFINITION
We consider a set of N environments that are modeled as Markov decision processes $M_k = \{S, A, T_k, R\}$, $k \in \{1, \dots, N\}$. All environments share the same state space $S$, action space $A$, and reward function $R$, and differ only by their unknown transition functions $T_k$. These $N$ environments are randomly split into training environments $M_{\text{train}}$ and testing environments $M_{\text{test}}$.
The meta-RL agent is trained on the $M_{\text{train}}$ environments and must then adapt separately to each of the $M_{\text{test}}$ environments. That is, the agent is permitted to interact with the $M_{\text{train}}$ environments for an unlimited number of episodes. Then, the meta-RL agent is given only a short opportunity to interact with each of the $M_{\text{test}}$ environments (e.g., a single episode or a fixed number of time steps) and to update its policy based on this interaction. Overall, the agent's goal is to maximize the average expected discounted return over the $M_{\text{test}}$ environments.
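Written explicitly, with $\gamma \in (0, 1)$ denoting the discount factor (our notation; it is left implicit above), the objective is
\[
\max_{\pi} \;\; \frac{1}{|M_{\text{test}}|} \sum_{M_k \in M_{\text{test}}} \mathbb{E}_{\pi,\, T_k}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right],
\]
where the expectation is taken over trajectories generated in environment $M_k$ by the policy adapted from the short interaction allowed in that environment.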
IV. RAMP
RAMP is trained in two phases: in the first phase, a multi-environment dynamic model is learned; in the second phase, the parameters of that dynamic model are used as the context for the multi-environment policy of the reinforcement learning agent. The following sections first describe how the multi-environment dynamic model is learned by exploiting the environments' common structure, and then describe the reinforcement learning agent.
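The core idea of the second phase can be sketched as follows (a minimal illustration with made-up dimensions and a simple concatenation-based conditioning; the actual RAMP architecture is described in the remainder of this section):

```python
import torch
import torch.nn as nn

state_dim, action_dim, env_param_dim = 8, 2, 16

# Phase 1 output (illustrative): the per-environment parameters of the learned
# multi-environment dynamic model, flattened into a single vector.
env_model_params = torch.randn(env_param_dim)

# Phase 2 (illustrative): a single policy conditioned on the state together
# with those model parameters, which act as the environment's context.
policy = nn.Sequential(
    nn.Linear(state_dim + env_param_dim, 64), nn.ReLU(),
    nn.Linear(64, action_dim), nn.Tanh(),
)

state = torch.randn(state_dim)
action = policy(torch.cat([state, env_model_params]))   # context-conditioned action
```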
A. Multi-Environment Dynamic Model
Attempting to approximate the transition function $T_k$ of each environment by an individual neural network is likely to work well for the training environments, but it is unlikely to generalize to the testing environments, for which we have only a limited set of data points. However, since