FLAP cannot learn to explore to reduce uncertainty. Finally, Xian et al. [16] use hypernetworks
to predict model dynamics and then use model predictive control. However, this approach still requires
planning to make use of an uncertain model, whereas model-free RL learns a policy that explores
optimally in order to obtain data for adaptation. To the best of our knowledge, using a general procedure
trained to arbitrarily modify the weights of a model-free policy has never been tried in RL.
Hypernetworks. Hypernetworks, or similar architectures, have been used in supervised learning
(SL), multi-task RL, and meta-SL. Hypernetworks have been used in the supervised learning litera-
ture for sequence modelling [2], as well as in continual learning and image classification [3], where
it was shown that the hypernetwork initialization scheme was crucial for performance. Similar mod-
els have also been used in multi-task RL and meta-SL, but not meta-RL. For instance, in multi-task
RL, Yu et al. [17] use a network conditioned on a task encoding to produce the weights and biases
for every other layer in another network conditioned on state. In meta-SL, there have also been
attempts to use one network to adapt the weights of another, whether as a general function of the dataset
[18, 19, 20], as a function of an embedding adapted by gradient descent [21], or by adding deltas in
a way framed as learning to optimize [22, 23]. The abundance of representations in meta-SL suggests
there is a similarly large space of representation-based methods to explore in meta-RL. Our work –
getting hypernetworks to work in practice for meta-RL – can be seen as a first step towards applying
all of these methods in meta-RL.
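As a concrete illustration of the weight-generation pattern discussed above, the sketch below shows a single linear layer whose weights and bias are produced by a hypernetwork conditioned on a task encoding. The PyTorch layout and all sizes are illustrative assumptions, not taken from any of the cited implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperLinear(nn.Module):
    """Illustrative sketch: a linear layer whose weights and bias are generated
    by a hypernetwork conditioned on a task encoding."""

    def __init__(self, enc_dim, in_dim, out_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # Hypernetwork: maps the task encoding to all parameters of the layer.
        self.weight_gen = nn.Linear(enc_dim, in_dim * out_dim)
        self.bias_gen = nn.Linear(enc_dim, out_dim)

    def forward(self, x, task_enc):
        # Generate the layer's parameters from the task encoding.
        w = self.weight_gen(task_enc).view(self.out_dim, self.in_dim)
        b = self.bias_gen(task_enc)
        # Apply the generated layer to the state-conditioned input.
        return F.linear(x, w, b)

# Example with made-up sizes: an 8-dim task encoding generates a 4 -> 2 layer.
layer = HyperLinear(enc_dim=8, in_dim=4, out_dim=2)
out = layer(torch.randn(4), torch.randn(8))
```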
3 Background
3.1 Problem Setting
An RL task is formalized as a Markov Decision Process (MDP). We define an MDP as a tuple
$(\mathcal{S}, \mathcal{A}, R, P, \gamma)$. At time-step $t$, the agent inhabits a state, $s_t \in \mathcal{S}$, observable by the agent. The agent
takes an action $a_t \in \mathcal{A}$. The MDP then transitions to state $s_{t+1} \sim P(s_{t+1} | s_t, a_t): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}_{\geq 0}$,
and the agent receives reward $r_t = R(s_t, a_t): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ upon entering $s_{t+1}$. Given
a discount factor, $\gamma \in [0, 1)$, the agent acts to maximize the expected future discounted reward,
$R(\tau) = \sum_{r_t \in \tau} \gamma^t r_t$, where $\tau$ is the agent's trajectory over an episode in the MDP. To maximize this
return, the agent takes actions sampled from a learned policy, $\pi(a|s): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}_{+}$.
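For concreteness, a minimal sketch of the discounted return defined above, computed for a made-up list of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum over t of gamma^t * r_t for one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Illustrative rewards from a short three-step episode.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9^2 * 1.0 = 0.81
```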
Meta-RL algorithms learn an RL algorithm, i.e., a mapping from the data sampled from a single
MDP, $M \sim p(M)$, to a policy. Since an RL algorithm generally needs multiple episodes of inter-
action to produce a reasonable policy, the algorithm conditions on $\tau$, which is the entire sequence of
states, actions, and rewards within $M$. As in the RL setting, this sequence up to time-step $t$ forms
a trajectory $\tau_t \in (\mathcal{S} \times \mathcal{A} \times \mathbb{R})^t$. Here, however, $\tau$ may span multiple episodes, and so we use the
same symbol, but refer to it as a meta-episode. The policy is then a meta-episode dependent policy,
$\pi_\theta(a|s, \tau)$, parameterized by the meta-parameters, $\theta$.
We define the objective in meta-RL as finding meta-parameters $\theta$ that maximize the sum of the
returns in the meta-episode across a distribution of tasks (MDPs):
$$\arg\max_\theta \; \mathbb{E}_{M \sim p(M)} \, \mathbb{E}_{\tau} \left[ R(\tau) \mid \pi_\theta(\cdot), M \right] \quad (1)$$
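The objective in Equation 1 can be read as a Monte Carlo estimate: sample tasks from $p(M)$, roll out the meta-episode-conditioned policy in each, and average the resulting returns. The sketch below assumes hypothetical `sample_task` and `policy` interfaces and is only meant to make the sampling structure explicit.

```python
def estimate_meta_objective(sample_task, policy, num_tasks, episodes_per_task, gamma=0.99):
    """Monte Carlo estimate of objective (1): average over sampled tasks of the
    discounted return accumulated over each meta-episode.

    `sample_task()` and `policy(s, tau)` are hypothetical interfaces: the former
    returns an environment with reset() / step(a) -> (s, r, done), the latter
    implements pi_theta(a | s, tau).
    """
    total = 0.0
    for _ in range(num_tasks):
        env = sample_task()           # M ~ p(M)
        tau = []                      # meta-episode: (s, a, r) tuples across episodes
        meta_return = 0.0
        for _ in range(episodes_per_task):
            s, done, t = env.reset(), False, 0
            while not done:
                a = policy(s, tau)                 # pi_theta(a | s, tau)
                s_next, r, done = env.step(a)
                tau.append((s, a, r))
                meta_return += (gamma ** t) * r    # discounted within each episode
                t += 1
                s = s_next
        total += meta_return
    return total / num_tasks
```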
3.2 Policy Architecture
We consider meta-RL agents capable of adaptation at every time-step, and adaptation within one
episode is required to solve some of our benchmarks. In such methods [5, 10, 11], the history is
generally summarized by a function, $g$, into an embedding that represents relevant task information.
We write this embedding as $e = g(\tau)$, and call $g$ the task encoder. The policy, represented as a
multi-layer perceptron, then conditions on this task embedding as an input, instead of on the history
directly, which we write as $\pi_\theta(a|s, e)$. We call this the standard architecture, shown in Figure 2.
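A minimal sketch of the standard architecture: a task encoder $g$ maps the history to an embedding $e$, and an MLP policy conditions on the state and $e$. The GRU encoder, the discrete action head, and all dimensions below are illustrative assumptions, not the specific components of any cited method.

```python
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """g: summarizes the history tau of concatenated (s, a, r) transitions into
    an embedding e. A GRU is used here purely for illustration."""
    def __init__(self, transition_dim, embed_dim):
        super().__init__()
        self.rnn = nn.GRU(transition_dim, embed_dim, batch_first=True)

    def forward(self, tau):            # tau: (batch, time, transition_dim)
        _, h = self.rnn(tau)
        return h[-1]                   # e = g(tau), shape (batch, embed_dim)

class StandardPolicy(nn.Module):
    """pi_theta(a | s, e): an MLP conditioned on the state and task embedding."""
    def __init__(self, state_dim, embed_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s, e):
        return torch.softmax(self.net(torch.cat([s, e], dim=-1)), dim=-1)

# Illustrative shapes: 5-dim transitions, 3-dim states, 8-dim embeddings, 4 actions.
encoder, policy = TaskEncoder(5, 8), StandardPolicy(3, 8, 4)
e = encoder(torch.randn(1, 10, 5))     # embed a 10-step history
action_probs = policy(torch.randn(1, 3), e)
```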
In this paper, we primarily build off of VariBAD [5], which can be seen as an instance of the standard
architecture where the task encoder is the mean and variance from a recurrent variational auto-
encoder (VAE) [24] trained using a self-supervised loss. In other words, the task is inferred as a latent
variable optimized for reconstructing a meta-episode. See Zintgraf et al. [5] for details. Additionally,
we evaluate the addition of hypernetworks to RL$^2$ [11] on the most challenging benchmark (see Section
5.2). In RL$^2$, the task encoder is a recurrent neural network trained end-to-end on Equation 1.
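For intuition, the sketch below shows the interface of a VariBAD-style task encoder: a recurrent network mapping the history to the mean and variance of a latent task variable. The decoder and the self-supervised reconstruction loss are omitted, and all sizes are assumptions; see Zintgraf et al. [5] for the actual model.

```python
import torch
import torch.nn as nn

class RecurrentVAEEncoder(nn.Module):
    """Sketch of a VariBAD-style task encoder: a recurrent network maps the
    history to the mean and variance of a latent task variable. The decoder
    and reconstruction loss are omitted; sizes are illustrative."""
    def __init__(self, transition_dim, hidden_dim, latent_dim):
        super().__init__()
        self.rnn = nn.GRU(transition_dim, hidden_dim, batch_first=True)
        self.mean = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, tau):            # tau: (batch, time, transition_dim)
        _, h = self.rnn(tau)
        mu, logvar = self.mean(h[-1]), self.logvar(h[-1])
        # The embedding e passed to the policy is (mu, logvar); a latent sample
        # can also be drawn with the reparameterization trick if needed.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return mu, logvar, z
```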