
that prepare for distribution shifts by optimizing the worst-case empirical risk against a set of task
distributions that lie within a bounded distance of the original training task distribution (often
referred to as an uncertainty set). This allows meta-policies to deal with potential test-time task
distribution shift, bounding their worst-case test-time regret for distributional shifts within the chosen
uncertainty set. However, choosing an appropriate uncertainty set can be quite challenging without
further information about the test environment, significantly impacting the test-time performance
of algorithms under distribution shift. Large uncertainty sets provide resilience to a wider range of
distribution shifts, but the resulting meta-policy adapts very slowly at test time; smaller uncertainty
sets enable faster test-time adaptation, but leave the meta-policy brittle to task distribution shifts. Can
we get the best of both worlds?
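To make the trade-off concrete, a distributionally robust meta-training objective over an uncertainty set of radius ε can be sketched as follows (the notation here is illustrative, not necessarily the paper's exact formulation):
\[
\min_{\theta} \;\; \max_{p' \,:\, D(p', p_{\text{train}}) \le \epsilon} \;\; \mathbb{E}_{\mathcal{T} \sim p'} \left[ \mathrm{Regret}\big(\pi^{\text{meta}}_{\theta}, \mathcal{T}\big) \right],
\]
where $p_{\text{train}}$ is the training task distribution, $D$ is a divergence between task distributions, and $\epsilon$ sets the size of the uncertainty set: a larger $\epsilon$ covers more severe shifts but forces the meta-policy to be more conservative, while a smaller $\epsilon$ permits faster adaptation on in-distribution tasks.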
Figure 1: Failure of Typical Meta-RL. On meta-training tasks, πmeta explores effectively and quickly learns the optimal behavior (top row). When test tasks come from a slightly larger task distribution, exploration fails catastrophically, resulting in poor adaptation behavior (bottom row).
Our key insight is that we can prepare for a variety of
potential test-time distribution shifts by constructing and
training against different uncertainty sets at training time.
By preparing for adaptation against each of these uncer-
tainty sets, an agent is able to adapt to a variety of poten-
tial test-time distribution shifts by adaptively choosing the
most appropriate level of distributional robustness for the
test distribution at hand. We introduce a conceptual frame-
work called distributionally adaptive meta reinforcement
learning, formalizing this idea. At train time, the agent
learns robust meta-policies with widening uncertainty sets,
preemptively accounting for different levels of test-time
distribution shift that may be encountered. At test time,
the agent infers the level of distribution shift it is faced
with, and then uses the corresponding meta-policy to adapt
to the new task (Figure 2). In doing so, the agent can adap-
tively choose the best level of robustness for the test-time
task distribution, preserving the fast adaptation benefits of
meta-RL, while also ensuring good asymptotic performance under distribution shift. We instantiate a
practical algorithm in this framework (DiAMetR), using learned generative models to imagine new
task distributions close to the provided training tasks that can be used to train robust meta-policies.
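As a rough sketch of this train/test recipe (the function names and the selection rule below are hypothetical placeholders, not DiAMetR's exact subroutines), the control flow looks like:

```python
from typing import Callable, Dict, Sequence

# Hypothetical sketch of the adaptive-robustness recipe described above.
# `train_robust_meta_policy` and `evaluate_adaptation` stand in for the
# paper's actual meta-training and test-time evaluation subroutines.

def train_meta_policies(
    epsilons: Sequence[float],
    train_robust_meta_policy: Callable[[float], object],
) -> Dict[float, object]:
    """Meta-train one robust meta-policy per uncertainty-set radius epsilon."""
    return {eps: train_robust_meta_policy(eps) for eps in epsilons}

def select_and_deploy(
    meta_policies: Dict[float, object],
    evaluate_adaptation: Callable[[object], float],
) -> object:
    """At test time, briefly run each candidate meta-policy on the new task
    distribution and keep the one with the best few-episode adaptation return,
    implicitly inferring the level of distribution shift."""
    scores = {eps: evaluate_adaptation(pi) for eps, pi in meta_policies.items()}
    best_eps = max(scores, key=scores.get)  # most appropriate robustness level
    return meta_policies[best_eps]
```

The point of the selection step is that the robustness level becomes a test-time decision rather than a training-time commitment, which is what preserves fast adaptation when the shift turns out to be small.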
The contribution of this paper is to propose a framework for making meta-reinforcement learning
resilient to a variety of task distribution shifts, and DiAMetR, a practical algorithm instantiating
the framework. DiAMetR trains a population of meta-policies to be robust to different degrees of
distribution shifts and then adaptively chooses a meta-policy to deploy based on the inferred test-time
distribution shift. Our experiments verify the utility of adaptive distributional robustness under
test-time task distribution shift in a number of simulated robotics domains.
2 Related Work
Meta-reinforcement learning algorithms aim to leverage a distribution of training tasks to “learn
a reinforcement learning algorithm” that is able to learn quickly on new tasks drawn from the
same distribution. A variety of algorithms have been proposed for meta-RL, including memory-based [7, 25], gradient-based [10, 35, 13] and latent-variable based [33, 48, 47, 11] schemes. These
algorithms show the ability to generalize to new tasks drawn from the same distribution, and have been
applied to problems ranging from robotics [27, 47, 18] to computer science education [43]. This line of work has been extended to operate in scenarios without requiring any pre-specified task distribution [12, 16], in offline settings [6, 28, 26] or in hard (meta-)exploration settings [49, 46], making them
more broadly applicable to a wider class of problems. However, most meta-RL algorithms assume
source and target tasks are drawn from the same distribution, an assumption rarely met in practice.
Our work shows how the machinery of meta-RL can be made compatible with distribution shift at
test time, using ideas from distributional robustness. Some recent work shows that model-based meta-reinforcement learning can be made robust to a particular level of distribution shift [23, 20]
by learning a shared dynamics model against adversarially chosen task distributions. We show that we
can build model-free meta-reinforcement learning algorithms, which are not just robust to a particular
level of distribution shift, but can adapt to various levels of shift.
Distributional robustness methods have been studied in the context of building supervised learning
systems that are robust to the test distribution being different from the training one. The key idea