Distributionally Adaptive Meta Reinforcement Learning

Anurag Ajay∗†‡§, Abhishek Gupta∗†§, Dibya Ghosh¶, Sergey Levine¶, Pulkit Agrawal†‡§
Improbable AI Lab†
MIT-IBM Watson AI Lab‡
Massachusetts Institute of Technology§
University of California, Berkeley¶
∗ denotes equal contribution. Authors are also affiliated with the Computer Science and Artificial Intelligence Laboratory (CSAIL). Correspondence to aajay@mit.edu and abhgupta@cs.washington.edu.
Abstract
Meta-reinforcement learning algorithms provide a data-driven way to acquire poli-
cies that quickly adapt to many tasks with varying rewards or dynamics functions.
However, learned meta-policies are often effective only on the exact task distribu-
tion on which they were trained and struggle in the presence of distribution shift
of test-time rewards or transition dynamics. In this work, we develop a frame-
work for meta-RL algorithms that are able to behave appropriately under test-time
distribution shifts in the space of tasks. Our framework centers on an adaptive
approach to distributional robustness that trains a population of meta-policies to
be robust to varying levels of distribution shift. When evaluated on a potentially
shifted test-time distribution of tasks, this allows us to choose the meta-policy with
the most appropriate level of robustness, and use it to perform fast adaptation. We
formally show how our framework allows for improved regret under distribution
shift, and empirically show its efficacy on simulated robotics problems under a
wide range of distribution shifts.
1 Introduction
The diversity and dynamism of the real world require reinforcement learning (RL) agents that
can quickly adapt and learn new behaviors when placed in novel situations. Meta reinforcement
learning provides a framework for conferring this ability to RL agents, by learning a “meta-policy”
trained to adapt as quickly as possible to tasks from a provided training distribution [39, 10, 33, 48].
Unfortunately, meta-RL agents assume that tasks are always drawn from the training task distribution and often behave erratically when asked to adapt to tasks beyond the training distribution [5, 8]. As
an example of this negative transfer, consider using meta-learning to teach a robot to navigate to
goals quickly (illustrated in Figure 1). The resulting meta-policy learns to quickly adapt and walk to
any target location specified in the training distribution, but explores poorly and fails to adapt to any
location not in that distribution. This is particularly problematic for the meta-learning setting, since
the scenarios where we need the ability to learn quickly are usually exactly those where the agent
experiences distribution shift. This type of meta-distribution shift afflicts a number of real-world
problems including autonomous vehicle driving [9], in-hand manipulation [17, 1], and quadruped locomotion [24, 22, 18], where the training task distribution may not encompass all real-world scenarios.
In this work, we study meta-RL algorithms that learn meta-policies resilient to task distribution shift at
test time. We assume the test-time distribution shift to be unknown but fixed. One approach to enable
this resiliency is to leverage the framework of distributional robustness [37], training meta-policies
that prepare for distribution shifts by optimizing the worst-case empirical risk against a set of task
distributions which lie within a bounded distance from the original training task distribution (often
referred to as an uncertainty set). This allows meta-policies to deal with potential test-time task
distribution shift, bounding their worst-case test-time regret for distributional shifts within the chosen
uncertainty set. However, choosing an appropriate uncertainty set can be quite challenging without
further information about the test environment, significantly impacting the test-time performance
of algorithms under distribution shift. Large uncertainty sets allow resiliency to a wider range of
distribution shifts, but the resulting meta-policy adapts very slowly at test time; smaller uncertainty
sets enable faster test-time adaptation, but leave the meta-policy brittle to task distribution shifts. Can
we get the best of both worlds?
Figure 1: Failure of Typical Meta-RL. On meta-training tasks, πmeta explores effectively and quickly learns the optimal behavior (top row). When test tasks come from a slightly larger task distribution, exploration fails catastrophically, resulting in poor adaptation behavior (bottom row).
Our key insight is that we can prepare for a variety of
potential test-time distribution shifts by constructing and
training against different uncertainty sets at training time.
By preparing for adaptation against each of these uncer-
tainty sets, an agent is able to adapt to a variety of poten-
tial test-time distribution shifts by adaptively choosing the
most appropriate level of distributional robustness for the
test distribution at hand. We introduce a conceptual frame-
work called distributionally adaptive meta reinforcement
learning, formalizing this idea. At train time, the agent
learns robust meta-policies with widening uncertainty sets,
preemptively accounting for different levels of test-time
distribution shift that may be encountered. At test time,
the agent infers the level of distribution shift it is faced
with, and then uses the corresponding meta-policy to adapt
to the new task (Figure 2). In doing so, the agent can adap-
tively choose the best level of robustness for the test-time
task distribution, preserving the fast adaptation benefits of
meta RL, while also ensuring good asymptotic performance under distribution shift. We instantiate a
practical algorithm in this framework (DiAMetR), using learned generative models to imagine new
task distributions close to the provided training tasks that can be used to train robust meta-policies.
The contribution of this paper is to propose a framework for making meta-reinforcement learning
resilient to a variety of task distribution shifts, and DiAMetR, a practical algorithm instantiating
the framework. DiAMetR trains a population of meta-policies to be robust to different degrees of
distribution shifts and then adaptively chooses a meta-policy to deploy based on the inferred test-time
distribution shift. Our experiments verify the utility of adaptive distributional robustness under
test-time task distribution shift in a number of simulated robotics domains.
2 Related Work
Meta-reinforcement learning algorithms aim to leverage a distribution of training tasks to “learn a reinforcement learning algorithm” that is able to learn quickly on new tasks drawn from the same distribution. A variety of algorithms have been proposed for meta-RL, including memory-based [7, 25], gradient-based [10, 35, 13] and latent-variable based [33, 48, 47, 11] schemes. These
algorithms show the ability to generalize to new tasks drawn from the same distribution, and have been
applied to problems ranging from robotics [27, 47, 18] to computer science education [43]. This line
of work has been extended to operate in scenarios without requiring any pre-specified task distribution
[12, 16], in offline settings [6, 28, 26] or in hard (meta-)exploration settings [49, 46], making them
more broadly applicable to a wider class of problems. However, most meta-RL algorithms assume
source and target tasks are drawn from the same distribution, an assumption rarely met in practice.
Our work shows how the machinery of meta-RL can be made compatible with distribution shift at
test time, using ideas from distributional robustness. Some recent work shows that model-based meta-reinforcement learning can be made robust to a particular level of distribution shift [23, 20] by learning a shared dynamics model against adversarially chosen task distributions. We show that we can build model-free meta-reinforcement learning algorithms that are not just robust to a particular level of distribution shift, but can adapt to various levels of shift.
Distributional robustness methods have been studied in the context of building supervised learning
systems that are robust to the test distribution being different than the training one. The key idea
Figure 2: During meta-train, DiAMetR learns a meta-policy $\pi^{\epsilon_1}_{\text{meta}}$ and task distribution model $T_\omega(s, a, z)$ on the train task distribution. Then, it uses the task distribution model to imagine different shifted test task distributions on which it learns different meta-policies $\{\pi^{\epsilon_i}_{\text{meta}}\}_{i=2}^{M}$, each corresponding to a different level of robustness. During meta-test, it chooses an appropriate meta-policy based on inferred test task distribution shift with Thompson sampling and then quickly adapts the selected meta-policy to individual tasks.
is to train a model to not just minimize empirical risk, but instead learn a model that has the
lowest worst-case empirical risk among an “uncertainty set” of distributions that are boundedly close to the empirical training distribution [37, 21, 2, 15]. If the uncertainty set and optimization are chosen carefully, these methods have been shown to obtain models that are robust to small amounts of distribution shift at test time [37, 21, 2, 15], finding applications in problems like federated learning [15] and image classification [21]. This has been extended to the min-max robustness setting for specific algorithms like model-agnostic meta-learning [3], but such approaches are critically dependent on correct specification of the appropriate uncertainty set and are applicable primarily in supervised learning settings. Alternatively, several RL techniques aim to directly tackle the robustness problem, aiming to learn policies robust to adversarial perturbations [41, 45, 32, 31]. [44] conditions the policy on
uncertainty sets to make it robust to different perturbation sets. While these methods are able to
learn conservative, robust policies, they are unable to adapt to new tasks as DiAMetR does in the
meta-reinforcement learning setting. In our work, rather than choosing a single uncertainty set, we
learn many meta-policies for widening uncertainty sets, thereby accounting for different levels of
test-time distribution shift.
3 Preliminaries
Meta-Reinforcement Learning aims to learn a fast reinforcement learning algorithm or a “meta-policy” that can quickly maximize performance on tasks T from some distribution p(T). Formally, each task T is a Markov decision process (MDP) M = (S, A, P, R, γ, µ0); the goal is to exploit regularities in the structure of rewards and environment dynamics across tasks in p(T) to acquire effective exploration and adaptation mechanisms that enable learning on new tasks much faster than learning the task naively from scratch. A meta-policy (or fast learning algorithm) πmeta maps a history of environment experience h ∈ (S × A × R)∗ in a new task to an action a, and is trained to acquire optimal behaviors on tasks from p(T) within k episodes:
$$
\min_{\pi_{\text{meta}}} \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\big[\text{Regret}(\pi_{\text{meta}}, \mathcal{T})\big],
$$
$$
\text{Regret}(\pi_{\text{meta}}, \mathcal{T}) = J(\pi^*_{\mathcal{T}}) - \mathbb{E}_{a^{(i)}_t \sim \pi_{\text{meta}}(\cdot \mid h^{(i)}_t),\, \mathcal{T}}\!\left[\frac{1}{k}\sum_{i=1}^{k}\sum_{t=1}^{T} r^{(i)}_t\right], \qquad J(\pi^*_{\mathcal{T}}) = \max_{\pi}\, \mathbb{E}_{\pi,\mathcal{T}}\!\left[\sum_t r_t\right]
$$
$$
\text{where } h^{(i)}_t = \big(s^{(i)}_{1:t},\, r^{(i)}_{1:t},\, a^{(i)}_{1:t-1}\big) \cup \big(s^{(j)}_{1:T},\, r^{(j)}_{1:T},\, a^{(j)}_{1:T}\big)_{j=1}^{i-1}. \tag{1}
$$
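To make Eq. 1 concrete, the following is a minimal sketch (our own illustration, not the authors' implementation) of estimating the regret of a history-conditioned meta-policy on one sampled task over k episodes. The `env_for_task` factory, the `optimal_return` oracle standing in for J(π*_T), and the `meta_policy.act` interface are hypothetical placeholders.

```python
def empirical_regret(meta_policy, task, env_for_task, optimal_return, k=2, horizon=100):
    """Monte-Carlo estimate of Regret(pi_meta, T) from Eq. 1 for a single task T.

    The meta-policy is history-conditioned: it sees the cross-episode stream of
    (state, action, reward) tuples, so exploration in episode i can inform
    behavior in episode i + 1.
    """
    env = env_for_task(task)                 # build the MDP for this task
    history = []                             # h_t: all (s, a, r) tuples so far
    avg_return = 0.0
    for episode in range(k):
        s = env.reset()
        for t in range(horizon):
            a = meta_policy.act(history, s)  # a_t ~ pi_meta(. | h_t)
            s_next, r, done, _ = env.step(a)
            history.append((s, a, r))
            avg_return += r / k              # accumulates (1/k) sum_i sum_t r_t^(i)
            s = s_next
            if done:
                break
    return optimal_return(task) - avg_return  # J(pi*_T) minus achieved average return
```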
Intuitively, the meta-policy has two components: an exploration mechanism that ensures that appro-
priate reward signal is found for all tasks in the training distribution, and an adaptation mechanism
that uses the collected exploratory data to generate optimal actions for the current task. In practice,
the meta-policy may be represented explicitly as an exploration policy conjoined with a policy
update [10, 33], or implicitly as a black-box RNN [7, 48]. We use the terminology “meta-policies” interchangeably with that of “fast-adaptation” algorithms, since our practical implementation builds on [30] (which represents the adaptation mechanism using a black-box RNN). Our work focuses on the setting where there is potential drift between ptrain(T), the task distribution we have access to during training, and ptest(T), the task distribution of interest during evaluation.
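For illustration, a black-box recurrent meta-policy of the kind referenced above can be sketched as follows (a minimal PyTorch module of our own, not the architecture of [30]): the recurrent hidden state summarizes the (state, action, reward) stream across episodes and thus implicitly carries the adaptation mechanism.

```python
import torch
import torch.nn as nn

class RNNMetaPolicy(nn.Module):
    """Black-box recurrent meta-policy: the GRU hidden state plays the role of
    the history h_t, accumulated across all episodes of a meta-episode."""

    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs, prev_action, prev_reward, hidden=None):
        # obs: (batch, time, obs_dim); prev_action: (batch, time, act_dim);
        # prev_reward: (batch, time, 1). Returns action preferences and the
        # updated hidden state to carry into the next environment step/episode.
        x = torch.cat([obs, prev_action, prev_reward], dim=-1)
        out, hidden = self.gru(x, hidden)
        return self.head(out), hidden
```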
Distributional robustness [37] learns models that do not minimize empirical risk against the training distribution, but instead prepare for distribution shift by optimizing the worst-case empirical risk
Figure 3: During the meta-train phase, DiAMetR learns a family of meta-policies robust to varying levels of distribution shift (as characterized by ϵi). During the meta-test phase, given a potentially shifted test-time distribution of tasks, DiAMetR chooses the meta-policy with the most appropriate level of robustness and uses it to perform fast adaptation for new tasks sampled from the same shifted test task distribution.
against a set of data distributions close to the training distribution (called an uncertainty set):
$$
\min_{\theta} \max_{\phi} \; \mathbb{E}_{x \sim q_\phi(x)}\big[\,l(x; \theta)\,\big] \quad \text{s.t.} \quad D\big(p_{\text{train}}(x)\,\|\,q_\phi(x)\big) \leq \epsilon \tag{2}
$$
This optimization finds the model parameters θ that minimize the worst-case risk l over distributions qφ(x) in an ϵ-ball (measured by an f-divergence) from the training distribution ptrain(x).
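As a toy, self-contained illustration of Eq. 2 (not the algorithm used in this paper), the sketch below runs distributionally robust optimization of a linear model over a finite set of training groups. For simplicity the divergence constraint is replaced by a KL(q‖p) penalty with weight λ, whose inner maximizer has the closed form q_i ∝ p_i exp(l_i/λ); the data, the loss, λ, and the learning rate are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data split into three groups (stand-ins for sub-distributions of p_train).
groups = [rng.normal(loc=m, scale=1.0, size=(200, 2)) for m in (0.0, 1.0, 3.0)]
labels = [np.full(200, y) for y in (0.0, 1.0, 1.0)]
p_train = np.array([0.6, 0.3, 0.1])   # empirical group proportions

theta = np.zeros(2)                   # linear model parameters
lam = 1.0                             # KL penalty strength (looser ball ~ smaller lam)

def group_losses(theta):
    # Mean squared error of the linear predictor within each group.
    return np.array([np.mean((x @ theta - y) ** 2) for x, y in zip(groups, labels)])

for step in range(500):
    losses = group_losses(theta)
    # Inner max (adversary): closed-form tilt of p_train toward high-loss groups.
    q = p_train * np.exp(losses / lam)
    q /= q.sum()
    # Outer min (learner): gradient step on the adversarially reweighted loss.
    grad = sum(q_i * 2.0 * x.T @ (x @ theta - y) / len(x)
               for q_i, x, y in zip(q, groups, labels))
    theta -= 0.01 * grad
```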
4 Distributionally Adaptive Meta-Reinforcement Learning
In this section, we develop a framework for learning meta-policies that, given access to a training distribution of tasks ptrain(T), are still able to adapt to tasks from a test-time distribution ptest(T) that is similar but not identical to the training distribution. We introduce a framework for distributionally
adaptive meta-RL below and instantiate it as a practical method in Section 5.
4.1 Known Level of Test-Time Distribution Shift
We begin by studying a simplified problem where we can exactly quantify the degree to which
the test distribution deviates from the training distribution. Suppose we know that ptest satisfies D(ptest(T) || ptrain(T)) < ϵ for some ϵ > 0, where D(·∥·) is a probability divergence on the set of task distributions (e.g., an f-divergence [34] or a Wasserstein distance [40]). A natural learning objective to learn a meta-policy under this assumption is to minimize the worst-case test-time regret across any test task distribution q(T) that is within some ϵ divergence of the train distribution:
$$
\min_{\pi_{\text{meta}}} R(\pi_{\text{meta}}, p_{\text{train}}(\mathcal{T}), \epsilon), \qquad
R(\pi_{\text{meta}}, p_{\text{train}}(\mathcal{T}), \epsilon) = \max_{q(\mathcal{T})} \; \mathbb{E}_{\mathcal{T} \sim q(\mathcal{T})}\big[\text{Regret}(\pi_{\text{meta}}, \mathcal{T})\big] \;\; \text{s.t.} \;\; D\big(p_{\text{train}}(\mathcal{T})\,\|\,q(\mathcal{T})\big) \leq \epsilon \tag{3}
$$
Solving this optimization problem results in a meta-policy that has been trained to adapt to tasks
from a wider task distribution than the original training distribution. It is worthwhile distinguishing
this robust meta-objective, which incentivizes a robust adaptation mechanism to a wider set of tasks,
from robust objectives in standard RL, which produce base policies robust to a wider set of dynamics
conditions. The objective in Eq. 3 incentivizes an agent to explore and adapt more broadly, not act more conservatively as standard robust RL methods [32] would encourage. Naturally, the quality of the robust meta-policy depends on the size of the uncertainty set. If ϵ is large, or the geometry of the divergence poorly reflects natural task variations, then the robust policy will have to adapt to an overly large set of tasks, potentially degrading the speed of adaptation.
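To make the structure of Eq. 3 explicit, here is a rough sketch (ours, with many simplifications) that alternates between an adversary reweighting a finite pool of candidate tasks toward those with high regret and a learner meta-training on tasks sampled from the reweighted distribution. The `candidate_tasks` pool, the `estimate_regret` and `meta_policy.update` interfaces, and the heuristic mapping from ϵ to a KL-penalty strength λ are assumptions of this sketch; DiAMetR instead imagines shifted task distributions with a learned generative model (Section 5).

```python
import numpy as np

def train_robust_meta_policy(meta_policy, candidate_tasks, p_train, estimate_regret,
                             epsilon, iters=1000):
    """Approximate solution of Eq. 3 for one robustness level epsilon.

    The inner max over q(T) is restricted to reweightings of a finite task pool,
    and the divergence constraint is approximated by a KL penalty whose strength
    shrinks as epsilon grows (a heuristic stand-in for the exact constraint).
    """
    lam = 1.0 / (epsilon + 1e-3)             # larger epsilon -> weaker penalty
    rng = np.random.default_rng(0)
    for _ in range(iters):
        regrets = np.array([estimate_regret(meta_policy, task) for task in candidate_tasks])
        # Adversary: up-weight tasks on which the current meta-policy adapts poorly.
        q = p_train * np.exp(regrets / lam)
        q /= q.sum()
        # Learner: meta-train on a task sampled from the adversarial distribution.
        task = candidate_tasks[rng.choice(len(candidate_tasks), p=q)]
        meta_policy.update(task)
    return meta_policy
```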
4.2 Handling Arbitrary Levels of Distribution Shift
In practice, it is not known how the test distribution ptest deviates from the training distribution, and consequently it is challenging to determine what ϵ to use in the meta-robustness objective. We propose to overcome this via an adaptive strategy: train meta-policies for varying degrees of distribution shift, and at test time infer which level of robustness is most appropriate through experience.
We train a population of meta-policies $\{\pi^{(i)}_{\text{meta}}\}_{i=1}^{M}$, each solving the distributionally robust meta-RL objective (Eq. 3) for a different level of robustness ϵi:
$$
\Big\{\pi^{\epsilon_i}_{\text{meta}} := \arg\min_{\pi_{\text{meta}}} R(\pi_{\text{meta}}, p_{\text{train}}(\mathcal{T}), \epsilon_i)\Big\}_{i=1}^{M} \quad \text{where } \epsilon_M > \epsilon_{M-1} > \dots > \epsilon_1 = 0 \tag{4}
$$
In choosing a spectrum of ϵi, we learn a set of meta-policies that have been trained on increasingly large sets of tasks: at one end (i = 1), the meta-policy is trained only on the original training distribution, and at the other (i = M), the meta-policy is trained to adapt to any possible task within the parametric family of tasks. These policies span a tradeoff between being robust to a wider set of task distributions with larger ϵ (allowing for larger distribution shifts), and being able to adapt quickly to any given task with smaller ϵ (allowing for better per-task regret minimization).
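Continuing the earlier sketch (and reusing its hypothetical stand-ins), the population of Eq. 4 is then just a loop over a widening, user-chosen schedule of robustness levels, with ϵ1 = 0 recovering standard meta-RL on ptrain:

```python
# Illustrative schedule only; the spacing of robustness levels is a design choice.
epsilons = [0.0, 0.1, 0.2, 0.4, 0.8]
meta_policies = [
    train_robust_meta_policy(make_meta_policy(), candidate_tasks, p_train,
                             estimate_regret, epsilon=eps)
    for eps in epsilons
]
```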
With a set of meta-policies in hand, we must now decide how to leverage test-time experience to
discover the right one to use for the actual test distribution ptest. We recognize that the problem of policy selection can be treated as a stochastic multi-armed bandit problem (precise formulation in Appendix C), where pulling arm i corresponds to running the meta-policy $\pi^{\epsilon_i}_{\text{meta}}$ for an entire meta-episode (k task episodes). If a zero-regret bandit algorithm (e.g., Thompson sampling [42]) is used, then after a certain number of test-time meta-episodes, we can guarantee that the meta-policy selection mechanism will converge to the meta-policy that best balances the tradeoff between adapting quickly while still being able to adapt to all the tasks from ptest(T).
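A minimal sketch of this selection step with Gaussian Thompson sampling is shown below (our own simplification, not the exact procedure in the paper); `run_meta_episode(pi)` is a hypothetical helper that runs one full meta-episode of k task episodes on a freshly sampled test task and returns its average return, and the standard-normal prior with unit observation noise is an illustrative modeling choice.

```python
import numpy as np

def select_meta_policy(meta_policies, run_meta_episode, num_meta_episodes=100, seed=0):
    """Thompson sampling over meta-policies: one bandit arm per robustness level."""
    rng = np.random.default_rng(seed)
    n = np.zeros(len(meta_policies))   # number of pulls per arm
    s = np.zeros(len(meta_policies))   # sum of observed meta-episode returns per arm
    for _ in range(num_meta_episodes):
        # With a N(0, 1) prior and unit observation noise, the posterior over each
        # arm's mean return is Gaussian with mean s_i/(n_i + 1) and variance 1/(n_i + 1).
        samples = rng.normal(s / (n + 1.0), 1.0 / np.sqrt(n + 1.0))
        arm = int(np.argmax(samples))
        reward = run_meta_episode(meta_policies[arm])
        n[arm] += 1
        s[arm] += reward
    return meta_policies[int(np.argmax(s / np.maximum(n, 1.0)))]
```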
To summarize our framework for distributionally adaptive meta-RL, we train a population of meta-
policies at varying levels of robustness on a distributionally robust objective that forces the learned
adaptation mechanism to also be robust to tasks not in the training task distribution. At test-time, we
use a bandit algorithm to select the meta-policy whose adaptation mechanism has the best tradeoff
between robustness and speed of adaptation specifically on the test task distribution. Combining
distributional robustness with test-time adaptation allows the adaptation mechanism to work even
if distribution shift is present, while obviating the decreased performance that usually accompanies
overly conservative, distributionally robust solutions.
4.3 Analysis
To provide some intuition on the properties of this algorithm, we formally analyze adaptive distributional robustness in a simplified meta-RL problem involving tasks Tg corresponding to reaching some unknown goal g in a deterministic MDP M, exactly at the final timestep of an episode. We assume that all goals are reachable, and use the family of meta-policies that use a stochastic exploratory policy π until the goal is discovered and return to the discovered goal in all future episodes. The performance of a meta-policy on a task Tg under this model can be expressed in terms of the state distribution of the exploratory policy: $\text{Regret}(\pi_{\text{meta}}, \mathcal{T}_g) = \frac{1}{d^T_\pi(g)}$. This particular framework has been studied in [12, 19], and is a simple, interpretable framework for analysis.
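The robustness/speed tradeoff that the analysis formalizes can be seen numerically in a toy instance of this 1/d_π(g) regret model (the numbers below are ours, purely for illustration). For a fixed task distribution p, minimizing Σ_g p(g)/d(g) over exploration distributions d gives d ∝ √p (a one-line Lagrange-multiplier calculation), so we compare that specialized explorer against a uniform one under a shifted test distribution.

```python
import numpy as np

# |S| = 10 goals; p_train puts mass (1 - beta) uniformly on |S0| = 2 goals, beta on the rest.
S, S0, beta = 10, 2, 0.1
p_train = np.concatenate([np.full(S0, (1 - beta) / S0),
                          np.full(S - S0, beta / (S - S0))])

def expected_regret(d_explore, p_tasks):
    # E_{T_g ~ p}[Regret] = sum_g p(g) / d_pi(g) under the 1/d_pi(g) regret model.
    return float(np.sum(p_tasks / d_explore))

d_specialized = np.sqrt(p_train) / np.sqrt(p_train).sum()  # optimal explorer for p_train
d_uniform = np.full(S, 1.0 / S)                            # maximally robust explorer
p_shift = np.full(S, 1.0 / S)                              # a TV-shifted test distribution

print(expected_regret(d_specialized, p_train))  # 5.0  : fast adaptation in-distribution
print(expected_regret(d_specialized, p_shift))  # ~16.7: degrades sharply under shift
print(expected_regret(d_uniform, p_shift))      # 10.0 : robust, but slower everywhere
```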
We seek to understand performance under distribution shift when the original training task distribution
is relatively concentrated on a subset of possible tasks. We choose the training distribution ptrain(Tg) = (1 − β) Uniform(S0) + β Uniform(S \ S0), so that ptrain is concentrated on tasks involving a subset of the state space S0 ⊂ S, with β a parameter dictating the level of concentration, and consider test distributions that perturb it under the TV metric. Our main result compares the performance of a meta-policy trained to an ϵ2-level of robustness when the true test distribution deviates by ϵ1.
Proposition 4.1. Let $\bar{\epsilon}_i = \min\{\epsilon_i + \beta,\ 1 - \frac{|S_0|}{|S|}\}$. There exists q(T) satisfying $D_{TV}(p_{\text{train}}, q) \leq \epsilon_1$ where an ϵ2-robust meta-policy incurs excess regret over the optimal ϵ1-robust meta-policy:
$$
\mathbb{E}_{q(\mathcal{T})}\!\left[\text{Regret}(\pi^{\epsilon_2}_{\text{meta}}, \mathcal{T}) - \text{Regret}(\pi^{\epsilon_1}_{\text{meta}}, \mathcal{T})\right] \;\geq\; \left(c(\epsilon_1, \epsilon_2) + \frac{1}{c(\epsilon_1, \epsilon_2)} - 2\right) \sqrt{\bar{\epsilon}_1 (1 - \bar{\epsilon}_1)\, |S_0| \left(|S| - |S_0|\right)} \tag{5}
$$
The scale of regret depends on $c(\epsilon_1, \epsilon_2) = \sqrt{\frac{\bar{\epsilon}_2^{-1} - 1}{\bar{\epsilon}_1^{-1} - 1}}$, a measure of the mismatch between ϵ1 and ϵ2.
We first compare robust and non-robust solutions by analyzing the bound when ϵ2 = 0. In the regime of β ≪ 1, excess regret scales as $O(\epsilon_1 \sqrt{1/\beta})$, meaning that the robust solution is most necessary when the training distribution is heavily concentrated on a small subset of tasks.