Distributionally Adaptive Meta Reinforcement Learning

Anurag Ajay∗†‡§, Abhishek Gupta∗†§, Dibya Ghosh¶, Sergey Levine¶, Pulkit Agrawal†‡§
Improbable AI Lab†
MIT-IBM Watson AI Lab‡
Massachusetts Institute of Technology§
University of California, Berkeley¶
∗ denotes equal contribution. Authors are also affiliated with the Computer Science and Artificial Intelligence Laboratory (CSAIL). Correspondence to aajay@mit.edu and abhgupta@cs.washington.edu.
Abstract
Meta-reinforcement learning algorithms provide a data-driven way to acquire poli-
cies that quickly adapt to many tasks with varying rewards or dynamics functions.
However, learned meta-policies are often effective only on the exact task distribu-
tion on which they were trained and struggle in the presence of distribution shift
of test-time rewards or transition dynamics. In this work, we develop a frame-
work for meta-RL algorithms that are able to behave appropriately under test-time
distribution shifts in the space of tasks. Our framework centers on an adaptive
approach to distributional robustness that trains a population of meta-policies to
be robust to varying levels of distribution shift. When evaluated on a potentially
shifted test-time distribution of tasks, this allows us to choose the meta-policy with
the most appropriate level of robustness, and use it to perform fast adaptation. We
formally show how our framework allows for improved regret under distribution
shift, and empirically show its efficacy on simulated robotics problems under a
wide range of distribution shifts.
1 Introduction
The diversity and dynamism of the real world require reinforcement learning (RL) agents that
can quickly adapt and learn new behaviors when placed in novel situations. Meta reinforcement
learning provides a framework for conferring this ability to RL agents, by learning a “meta-policy”
trained to adapt as quickly as possible to tasks from a provided training distribution [39, 10, 33, 48].
Unfortunately, meta-RL agents assume that tasks are always drawn from the training task distribution and often behave erratically when asked to adapt to tasks beyond the training distribution [5, 8]. As
an example of this negative transfer, consider using meta-learning to teach a robot to navigate to
goals quickly (illustrated in Figure 1). The resulting meta-policy learns to quickly adapt and walk to
any target location specified in the training distribution, but explores poorly and fails to adapt to any
location not in that distribution. This is particularly problematic for the meta-learning setting, since
the scenarios where we need the ability to learn quickly are usually exactly those where the agent
experiences distribution shift. This type of meta-distribution shift afflicts a number of real-world
problems including autonomous vehicle driving [9], in-hand manipulation [17, 1], and quadruped locomotion [24, 22, 18], where the training task distribution may not encompass all real-world scenarios.
In this work, we study meta-RL algorithms that learn meta-policies resilient to task distribution shift at
test time. We assume the test-time distribution shift to be unknown but fixed. One approach to enable
this resiliency is to leverage the framework of distributional robustness [37], training meta-policies
that prepare for distribution shifts by optimizing the worst-case empirical risk against a set of task
distributions which lie within a bounded distance from the original training task distribution (often
referred to as an uncertainty set). This allows meta-policies to deal with potential test-time task
distribution shift, bounding their worst-case test-time regret for distributional shifts within the chosen
uncertainty set. However, choosing an appropriate uncertainty set can be quite challenging without
further information about the test environment, significantly impacting the test-time performance
of algorithms under distribution shift. Large uncertainty sets allow resiliency to a wider range of
distribution shifts, but the resulting meta-policy adapts very slowly at test time; smaller uncertainty
sets enable faster test-time adaptation, but leave the meta-policy brittle to task distribution shifts. Can
we get the best of both worlds?
Figure 1: Failure of Typical Meta-RL. On meta-training tasks, πmeta explores effectively and quickly learns the optimal behavior (top row). When test tasks come from a slightly larger task distribution, exploration fails catastrophically, resulting in poor adaptation behavior (bottom row).
Our key insight is that we can prepare for a variety of
potential test-time distribution shifts by constructing and
training against different uncertainty sets at training time.
By preparing for adaptation against each of these uncer-
tainty sets, an agent is able to adapt to a variety of poten-
tial test-time distribution shifts by adaptively choosing the
most appropriate level of distributional robustness for the
test distribution at hand. We introduce a conceptual frame-
work called distributionally adaptive meta reinforcement
learning, formalizing this idea. At train time, the agent
learns robust meta-policies with widening uncertainty sets,
preemptively accounting for different levels of test-time
distribution shift that may be encountered. At test time,
the agent infers the level of distribution shift it is faced
with, and then uses the corresponding meta-policy to adapt
to the new task (Figure 2). In doing so, the agent can adap-
tively choose the best level of robustness for the test-time
task distribution, preserving the fast adaptation benefits of
meta RL, while also ensuring good asymptotic performance under distribution shift. We instantiate a
practical algorithm in this framework (DiAMetR), using learned generative models to imagine new
task distributions close to the provided training tasks that can be used to train robust meta-policies.
The contribution of this paper is to propose a framework for making meta-reinforcement learning
resilient to a variety of task distribution shifts, and DiAMetR, a practical algorithm instantiating
the framework. DiAMetR trains a population of meta-policies to be robust to different degrees of
distribution shifts and then adaptively chooses a meta-policy to deploy based on the inferred test-time
distribution shift. Our experiments verify the utility of adaptive distributional robustness under
test-time task distribution shift in a number of simulated robotics domains.
2 Related Work
Meta-reinforcement learning algorithms aim to leverage a distribution of training tasks to “learn a reinforcement learning algorithm” that is able to learn quickly on new tasks drawn from the same distribution. A variety of algorithms have been proposed for meta-RL, including memory-based [7, 25], gradient-based [10, 35, 13] and latent-variable based [33, 48, 47, 11] schemes. These
algorithms show the ability to generalize to new tasks drawn from the same distribution, and have been
applied to problems ranging from robotics [27, 47, 18] to computer science education [43]. This line
of work has been extended to operate in scenarios without requiring any pre-specified task distribution
[12, 16], in offline settings [6, 28, 26] or in hard (meta-)exploration settings [49, 46], making them
more broadly applicable to a wider class of problems. However, most meta-RL algorithms assume
source and target tasks are drawn from the same distribution, an assumption rarely met in practice.
Our work shows how the machinery of meta-RL can be made compatible with distribution shift at
test time, using ideas from distributional robustness. Some recent work shows that model-based meta-reinforcement learning can be made robust to a particular level of distribution shift [23, 20] by learning a shared dynamics model against adversarially chosen task distributions. We show that we can build model-free meta-reinforcement learning algorithms that are not just robust to a particular level of distribution shift, but can adapt to various levels of shift.
Distributional robustness methods have been studied in the context of building supervised learning
systems that are robust to the test distribution being different than the training one. The key idea
Figure 2: During meta-train, DiAMetR learns a meta-policy $\pi^{\epsilon_1}_{\text{meta}}$ and task distribution model $T_\omega(s, a, z)$ on the train task distribution. Then, it uses the task distribution model to imagine different shifted test task distributions on which it learns different meta-policies $\{\pi^{\epsilon_i}_{\text{meta}}\}_{i=2}^{M}$, each corresponding to a different level of robustness. During meta-test, it chooses an appropriate meta-policy based on inferred test task distribution shift with Thompson sampling and then quickly adapts the selected meta-policy to individual tasks.
is to train a model to not just minimize empirical risk, but instead learn a model that has the
lowest worst-case empirical risk among an “uncertainty set” of distributions that are boundedly close to the empirical training distribution [37, 21, 2, 15]. If the uncertainty set and optimization are chosen carefully, these methods have been shown to obtain models that are robust to small amounts of distribution shift at test time [37, 21, 2, 15], finding applications in problems like federated learning [15] and image classification [21]. This has been extended to the min-max robustness setting for specific algorithms like model-agnostic meta-learning [3], but such approaches are critically dependent on correct specification of the appropriate uncertainty set and are applicable primarily in supervised learning settings. Alternatively, several RL techniques aim to directly tackle the robustness problem, aiming to learn policies robust to adversarial perturbations [41, 45, 32, 31]. [44] conditions the policy on
uncertainty sets to make it robust to different perturbation sets. While these methods are able to
learn conservative, robust policies, they are unable to adapt to new tasks as DiAMetR does in the
meta-reinforcement learning setting. In our work, rather than choosing a single uncertainty set, we
learn many meta-policies for widening uncertainty sets, thereby accounting for different levels of
test-time distribution shift.
3 Preliminaries
Meta-Reinforcement Learning aims to learn a fast reinforcement learning algorithm or a “meta-policy” that can quickly maximize performance on tasks T from some distribution p(T). Formally, each task T is a Markov decision process (MDP) M = (S, A, P, R, γ, µ0); the goal is to exploit regularities in the structure of rewards and environment dynamics across tasks in p(T) to acquire effective exploration and adaptation mechanisms that enable learning on new tasks much faster than learning the task naively from scratch. A meta-policy (or fast learning algorithm) πmeta maps a history of environment experience h ∈ (S × A × R)∗ in a new task to an action a, and is trained to acquire optimal behaviors on tasks from p(T) within k episodes:
$$
\min_{\pi_{\text{meta}}} \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\big[\text{Regret}(\pi_{\text{meta}}, \mathcal{T})\big],
$$
$$
\text{Regret}(\pi_{\text{meta}}, \mathcal{T}) = J(\pi^*_{\mathcal{T}}) - \mathbb{E}_{a^{(i)}_t \sim \pi_{\text{meta}}(\cdot \mid h^{(i)}_t),\, \mathcal{T}}\!\left[\frac{1}{k}\sum_{i=1}^{k}\sum_{t=1}^{T} r^{(i)}_t\right], \qquad J(\pi^*_{\mathcal{T}}) = \max_{\pi}\, \mathbb{E}_{\pi,\mathcal{T}}\!\left[\sum_t r_t\right]
$$
$$
\text{where } h^{(i)}_t = \big(s^{(i)}_{1:t},\, r^{(i)}_{1:t},\, a^{(i)}_{1:t-1}\big) \cup \big(s^{(j)}_{1:T},\, r^{(j)}_{1:T},\, a^{(j)}_{1:T}\big)_{j=1}^{i-1}. \tag{1}
$$
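To make Eq. 1 concrete, the following is a minimal sketch (our own illustration, not the authors' implementation) of estimating the regret of a history-conditioned meta-policy on one sampled task over k episodes. The `env_for_task` factory, the `optimal_return` oracle standing in for J(π*_T), and the `meta_policy.act` interface are hypothetical placeholders.

```python
def empirical_regret(meta_policy, task, env_for_task, optimal_return, k=2, horizon=100):
    """Monte-Carlo estimate of Regret(pi_meta, T) from Eq. 1 for a single task T.

    The meta-policy is history-conditioned: it sees the cross-episode stream of
    (state, action, reward) tuples, so exploration in episode i can inform
    behavior in episode i + 1.
    """
    env = env_for_task(task)                 # build the MDP for this task
    history = []                             # h_t: all (s, a, r) tuples so far
    avg_return = 0.0
    for episode in range(k):
        s = env.reset()
        for t in range(horizon):
            a = meta_policy.act(history, s)  # a_t ~ pi_meta(. | h_t)
            s_next, r, done, _ = env.step(a)
            history.append((s, a, r))
            avg_return += r / k              # accumulates (1/k) sum_i sum_t r_t^(i)
            s = s_next
            if done:
                break
    return optimal_return(task) - avg_return  # J(pi*_T) minus achieved average return
```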
Intuitively, the meta-policy has two components: an exploration mechanism that ensures that appro-
priate reward signal is found for all tasks in the training distribution, and an adaptation mechanism
that uses the collected exploratory data to generate optimal actions for the current task. In practice,
the meta-policy may be represented explicitly as an exploration policy conjoined with a policy
update [10, 33], or implicitly as a black-box RNN [7, 48]. We use the terminology “meta-policies” interchangeably with that of “fast-adaptation” algorithms, since our practical implementation builds on [30] (which represents the adaptation mechanism using a black-box RNN). Our work focuses on the setting where there is potential drift between ptrain(T), the task distribution we have access to during training, and ptest(T), the task distribution of interest during evaluation.
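For illustration, a black-box recurrent meta-policy of the kind referenced above can be sketched as follows (a minimal PyTorch module of our own, not the architecture of [30]): the recurrent hidden state summarizes the (state, action, reward) stream across episodes and thus implicitly carries the adaptation mechanism.

```python
import torch
import torch.nn as nn

class RNNMetaPolicy(nn.Module):
    """Black-box recurrent meta-policy: the GRU hidden state plays the role of
    the history h_t, accumulated across all episodes of a meta-episode."""

    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs, prev_action, prev_reward, hidden=None):
        # obs: (batch, time, obs_dim); prev_action: (batch, time, act_dim);
        # prev_reward: (batch, time, 1). Returns action preferences and the
        # updated hidden state to carry into the next environment step/episode.
        x = torch.cat([obs, prev_action, prev_reward], dim=-1)
        out, hidden = self.gru(x, hidden)
        return self.head(out), hidden
```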
Distributional robustness [37] learns models that do not minimize empirical risk against the training distribution, but instead prepare for distribution shift by optimizing the worst-case empirical risk
Figure 3: During the meta-train phase, DiAMetR learns a family of meta-policies robust to varying levels of distribution shift (as characterized by ϵi). During the meta-test phase, given a potentially shifted test-time distribution of tasks, DiAMetR chooses the meta-policy with the most appropriate level of robustness and uses it to perform fast adaptation for new tasks sampled from the same shifted test task distribution.
against a set of data distributions close to the training distribution (called an uncertainty set):
$$
\min_{\theta} \max_{\phi} \; \mathbb{E}_{x \sim q_\phi(x)}\big[\,l(x; \theta)\,\big] \quad \text{s.t.} \quad D\big(p_{\text{train}}(x)\,\|\,q_\phi(x)\big) \leq \epsilon \tag{2}
$$
This optimization finds the model parameters θ that minimize the worst-case risk l over distributions qφ(x) in an ϵ-ball (measured by an f-divergence) from the training distribution ptrain(x).
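As a toy, self-contained illustration of Eq. 2 (not the algorithm used in this paper), the sketch below runs distributionally robust optimization of a linear model over a finite set of training groups. For simplicity the divergence constraint is replaced by a KL(q‖p) penalty with weight λ, whose inner maximizer has the closed form q_i ∝ p_i exp(l_i/λ); the data, the loss, λ, and the learning rate are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data split into three groups (stand-ins for sub-distributions of p_train).
groups = [rng.normal(loc=m, scale=1.0, size=(200, 2)) for m in (0.0, 1.0, 3.0)]
labels = [np.full(200, y) for y in (0.0, 1.0, 1.0)]
p_train = np.array([0.6, 0.3, 0.1])   # empirical group proportions

theta = np.zeros(2)                   # linear model parameters
lam = 1.0                             # KL penalty strength (looser ball ~ smaller lam)

def group_losses(theta):
    # Mean squared error of the linear predictor within each group.
    return np.array([np.mean((x @ theta - y) ** 2) for x, y in zip(groups, labels)])

for step in range(500):
    losses = group_losses(theta)
    # Inner max (adversary): closed-form tilt of p_train toward high-loss groups.
    q = p_train * np.exp(losses / lam)
    q /= q.sum()
    # Outer min (learner): gradient step on the adversarially reweighted loss.
    grad = sum(q_i * 2.0 * x.T @ (x @ theta - y) / len(x)
               for q_i, x, y in zip(q, groups, labels))
    theta -= 0.01 * grad
```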
4 Distributionally Adaptive Meta-Reinforcement Learning
In this section, we develop a framework for learning meta-policies that, given access to a training distribution of tasks ptrain(T), are still able to adapt to tasks from a test-time distribution ptest(T) that is similar but not identical to the training distribution. We introduce a framework for distributionally
adaptive meta-RL below and instantiate it as a practical method in Section 5.
4.1 Known Level of Test-Time Distribution Shift
We begin by studying a simplified problem where we can exactly quantify the degree to which
the test distribution deviates from the training distribution. Suppose we know that ptest satisfies D(ptest(T) || ptrain(T)) < ϵ for some ϵ > 0, where D(·∥·) is a probability divergence on the set of task distributions (e.g., an f-divergence [34] or a Wasserstein distance [40]). A natural learning objective to learn a meta-policy under this assumption is to minimize the worst-case test-time regret across any test task distribution q(T) that is within some ϵ divergence of the train distribution:
$$
\min_{\pi_{\text{meta}}} R(\pi_{\text{meta}}, p_{\text{train}}(\mathcal{T}), \epsilon), \qquad
R(\pi_{\text{meta}}, p_{\text{train}}(\mathcal{T}), \epsilon) = \max_{q(\mathcal{T})} \; \mathbb{E}_{\mathcal{T} \sim q(\mathcal{T})}\big[\text{Regret}(\pi_{\text{meta}}, \mathcal{T})\big] \;\; \text{s.t.} \;\; D\big(p_{\text{train}}(\mathcal{T})\,\|\,q(\mathcal{T})\big) \leq \epsilon \tag{3}
$$
Solving this optimization problem results in a meta-policy that has been trained to adapt to tasks
from a wider task distribution than the original training distribution. It is worthwhile distinguishing
this robust meta-objective, which incentivizes a robust adaptation mechanism to a wider set of tasks,
from robust objectives in standard RL, which produce base policies robust to a wider set of dynamics
conditions. The objective in Eq. 3 incentivizes an agent to explore and adapt more broadly, not act more conservatively as standard robust RL methods [32] would encourage. Naturally, the quality of the robust meta-policy depends on the size of the uncertainty set. If ϵ is large, or the geometry of the divergence poorly reflects natural task variations, then the robust policy will have to adapt to an overly large set of tasks, potentially degrading the speed of adaptation.
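To make the structure of Eq. 3 explicit, here is a rough sketch (ours, with many simplifications) that alternates between an adversary reweighting a finite pool of candidate tasks toward those with high regret and a learner meta-training on tasks sampled from the reweighted distribution. The `candidate_tasks` pool, the `estimate_regret` and `meta_policy.update` interfaces, and the heuristic mapping from ϵ to a KL-penalty strength λ are assumptions of this sketch; DiAMetR instead imagines shifted task distributions with a learned generative model (Section 5).

```python
import numpy as np

def train_robust_meta_policy(meta_policy, candidate_tasks, p_train, estimate_regret,
                             epsilon, iters=1000):
    """Approximate solution of Eq. 3 for one robustness level epsilon.

    The inner max over q(T) is restricted to reweightings of a finite task pool,
    and the divergence constraint is approximated by a KL penalty whose strength
    shrinks as epsilon grows (a heuristic stand-in for the exact constraint).
    """
    lam = 1.0 / (epsilon + 1e-3)             # larger epsilon -> weaker penalty
    rng = np.random.default_rng(0)
    for _ in range(iters):
        regrets = np.array([estimate_regret(meta_policy, task) for task in candidate_tasks])
        # Adversary: up-weight tasks on which the current meta-policy adapts poorly.
        q = p_train * np.exp(regrets / lam)
        q /= q.sum()
        # Learner: meta-train on a task sampled from the adversarial distribution.
        task = candidate_tasks[rng.choice(len(candidate_tasks), p=q)]
        meta_policy.update(task)
    return meta_policy
```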
4.2 Handling Arbitrary Levels of Distribution Shift
In practice, it is not known how the test distribution ptest deviates from the training distribution, and consequently it is challenging to determine what ϵ to use in the meta-robustness objective. We propose to overcome this via an adaptive strategy: train meta-policies for varying degrees of distribution shift, and at test time infer which level of robustness is most appropriate through experience.
We train a population of meta-policies $\{\pi^{(i)}_{\text{meta}}\}_{i=1}^{M}$, each solving the distributionally robust meta-RL objective (Eq. 3) for a different level of robustness ϵi:
$$
\Big\{\pi^{\epsilon_i}_{\text{meta}} := \arg\min_{\pi_{\text{meta}}} R(\pi_{\text{meta}}, p_{\text{train}}(\mathcal{T}), \epsilon_i)\Big\}_{i=1}^{M} \quad \text{where } \epsilon_M > \epsilon_{M-1} > \dots > \epsilon_1 = 0 \tag{4}
$$
In choosing a spectrum of ϵi, we learn a set of meta-policies that have been trained on increasingly large sets of tasks: at one end (i = 1), the meta-policy is trained only on the original training distribution, and at the other (i = M), the meta-policy is trained to adapt to any possible task within the parametric family of tasks. These policies span a tradeoff between being robust to a wider set of task distributions with larger ϵ (allowing for larger distribution shifts), and being able to adapt quickly to any given task with smaller ϵ (allowing for better per-task regret minimization).
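Continuing the earlier sketch (and reusing its hypothetical stand-ins), the population of Eq. 4 is then just a loop over a widening, user-chosen schedule of robustness levels, with ϵ1 = 0 recovering standard meta-RL on ptrain:

```python
# Illustrative schedule only; the spacing of robustness levels is a design choice.
epsilons = [0.0, 0.1, 0.2, 0.4, 0.8]
meta_policies = [
    train_robust_meta_policy(make_meta_policy(), candidate_tasks, p_train,
                             estimate_regret, epsilon=eps)
    for eps in epsilons
]
```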
With a set of meta-policies in hand, we must now decide how to leverage test-time experience to
discover the right one to use for the actual test distribution ptest. We recognize that the problem of policy selection can be treated as a stochastic multi-armed bandit problem (precise formulation in Appendix C), where pulling arm i corresponds to running the meta-policy $\pi^{\epsilon_i}_{\text{meta}}$ for an entire meta-episode (k task episodes). If a zero-regret bandit algorithm (e.g., Thompson sampling [42]) is used, then after a certain number of test-time meta-episodes, we can guarantee that the meta-policy selection mechanism will converge to the meta-policy that best balances the tradeoff between adapting quickly while still being able to adapt to all the tasks from ptest(T).
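A minimal sketch of this selection step with Gaussian Thompson sampling is shown below (our own simplification, not the exact procedure in the paper); `run_meta_episode(pi)` is a hypothetical helper that runs one full meta-episode of k task episodes on a freshly sampled test task and returns its average return, and the standard-normal prior with unit observation noise is an illustrative modeling choice.

```python
import numpy as np

def select_meta_policy(meta_policies, run_meta_episode, num_meta_episodes=100, seed=0):
    """Thompson sampling over meta-policies: one bandit arm per robustness level."""
    rng = np.random.default_rng(seed)
    n = np.zeros(len(meta_policies))   # number of pulls per arm
    s = np.zeros(len(meta_policies))   # sum of observed meta-episode returns per arm
    for _ in range(num_meta_episodes):
        # With a N(0, 1) prior and unit observation noise, the posterior over each
        # arm's mean return is Gaussian with mean s_i/(n_i + 1) and variance 1/(n_i + 1).
        samples = rng.normal(s / (n + 1.0), 1.0 / np.sqrt(n + 1.0))
        arm = int(np.argmax(samples))
        reward = run_meta_episode(meta_policies[arm])
        n[arm] += 1
        s[arm] += reward
    return meta_policies[int(np.argmax(s / np.maximum(n, 1.0)))]
```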
To summarize our framework for distributionally adaptive meta-RL, we train a population of meta-
policies at varying levels of robustness on a distributionally robust objective that forces the learned
adaptation mechanism to also be robust to tasks not in the training task distribution. At test-time, we
use a bandit algorithm to select the meta-policy whose adaptation mechanism has the best tradeoff
between robustness and speed of adaptation specifically on the test task distribution. Combining
distributional robustness with test-time adaptation allows the adaptation mechanism to work even
if distribution shift is present, while obviating the decreased performance that usually accompanies
overly conservative, distributionally robust solutions.
4.3 Analysis
To provide some intuition on the properties of this algorithm, we formally analyze adaptive distributional robustness in a simplified meta-RL problem involving tasks Tg corresponding to reaching some unknown goal g in a deterministic MDP M, exactly at the final timestep of an episode. We assume that all goals are reachable, and use the family of meta-policies that use a stochastic exploratory policy π until the goal is discovered and return to the discovered goal in all future episodes. The performance of a meta-policy on a task Tg under this model can be expressed in terms of the state distribution of the exploratory policy: $\text{Regret}(\pi_{\text{meta}}, \mathcal{T}_g) = \frac{1}{d^T_\pi(g)}$. This particular framework has been studied in [12, 19], and is a simple, interpretable framework for analysis.
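The robustness/speed tradeoff that the analysis formalizes can be seen numerically in a toy instance of this 1/d_π(g) regret model (the numbers below are ours, purely for illustration). For a fixed task distribution p, minimizing Σ_g p(g)/d(g) over exploration distributions d gives d ∝ √p (a one-line Lagrange-multiplier calculation), so we compare that specialized explorer against a uniform one under a shifted test distribution.

```python
import numpy as np

# |S| = 10 goals; p_train puts mass (1 - beta) uniformly on |S0| = 2 goals, beta on the rest.
S, S0, beta = 10, 2, 0.1
p_train = np.concatenate([np.full(S0, (1 - beta) / S0),
                          np.full(S - S0, beta / (S - S0))])

def expected_regret(d_explore, p_tasks):
    # E_{T_g ~ p}[Regret] = sum_g p(g) / d_pi(g) under the 1/d_pi(g) regret model.
    return float(np.sum(p_tasks / d_explore))

d_specialized = np.sqrt(p_train) / np.sqrt(p_train).sum()  # optimal explorer for p_train
d_uniform = np.full(S, 1.0 / S)                            # maximally robust explorer
p_shift = np.full(S, 1.0 / S)                              # a TV-shifted test distribution

print(expected_regret(d_specialized, p_train))  # 5.0  : fast adaptation in-distribution
print(expected_regret(d_specialized, p_shift))  # ~16.7: degrades sharply under shift
print(expected_regret(d_uniform, p_shift))      # 10.0 : robust, but slower everywhere
```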
We seek to understand performance under distribution shift when the original training task distribution
is relatively concentrated on a subset of possible tasks. We choose the training distribution ptrain(Tg) = (1 − β) Uniform(S0) + β Uniform(S \ S0), so that ptrain is concentrated on tasks involving a subset of the state space S0 ⊂ S, with β a parameter dictating the level of concentration, and consider test distributions that perturb it under the TV metric. Our main result compares the performance of a meta-policy trained to an ϵ2-level of robustness when the true test distribution deviates by ϵ1.
Proposition 4.1. Let $\bar{\epsilon}_i = \min\{\epsilon_i + \beta,\ 1 - \frac{|S_0|}{|S|}\}$. There exists q(T) satisfying $D_{TV}(p_{\text{train}}, q) \leq \epsilon_1$ where an ϵ2-robust meta-policy incurs excess regret over the optimal ϵ1-robust meta-policy:
$$
\mathbb{E}_{q(\mathcal{T})}\!\left[\text{Regret}(\pi^{\epsilon_2}_{\text{meta}}, \mathcal{T}) - \text{Regret}(\pi^{\epsilon_1}_{\text{meta}}, \mathcal{T})\right] \;\geq\; \left(c(\epsilon_1, \epsilon_2) + \frac{1}{c(\epsilon_1, \epsilon_2)} - 2\right) \sqrt{\bar{\epsilon}_1 (1 - \bar{\epsilon}_1)\, |S_0| \left(|S| - |S_0|\right)} \tag{5}
$$
The scale of regret depends on $c(\epsilon_1, \epsilon_2) = \sqrt{\frac{\bar{\epsilon}_2^{-1} - 1}{\bar{\epsilon}_1^{-1} - 1}}$, a measure of the mismatch between ϵ1 and ϵ2.
We first compare robust and non-robust solutions by analyzing the bound when ϵ2 = 0. In the regime of β ≪ 1, excess regret scales as $O(\epsilon_1 \sqrt{1/\beta})$, meaning that the robust solution is most necessary when the training distribution is heavily concentrated on a small subset of tasks.