FLAP cannot learn to explore to reduce uncertainty. Finally, Xian et al. [16] use hypernetworks
to predict model dynamics and then use model predictive control. However, this approach still requires
planning to make use of an uncertain model, whereas model-free RL learns a policy that explores
optimally in order to obtain data for adaptation. To the best of our knowledge, using a general procedure
trained to arbitrarily modify the weights of a model-free policy has never been tried in RL.
Hypernetworks. Hypernetworks, or similar architectures, have been used in supervised learning
(SL), multi-task RL, and meta-SL. Hypernetworks have been used in the supervised learning litera-
ture for sequence modelling [2], as well as in continual learning and image classification [3], where
it was shown that the hypernetwork initialization scheme was crucial for performance. Similar mod-
els have also been used in multi-task RL and meta-SL, but not meta-RL. For instance, in multi-task
RL, Yu et al. [17] use a network conditioned on a task encoding to produce the weights and biases
for every other layer in another network conditioned on state. In meta-SL, there have also been
attempts to use one network to adapt the weights of another, whether as a general function of the dataset
[18, 19, 20], as a function of an embedding adapted by gradient descent [21], or by adding deltas in
a way framed as learning to optimize [22, 23]. The abundance of representations in meta-SL suggests
there is a similarly large space of representation-based methods to explore in meta-RL. Our work –
getting hypernetworks to work in practice for meta-RL – can be seen as a first step towards applying
all of these methods in meta-RL.
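As a concrete illustration of the weight-generation pattern discussed above, the sketch below shows a single linear layer whose weights and bias are produced by a hypernetwork conditioned on a task encoding. The PyTorch layout and all sizes are illustrative assumptions, not taken from any of the cited implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperLinear(nn.Module):
    """Illustrative sketch: a linear layer whose weights and bias are generated
    by a hypernetwork conditioned on a task encoding."""

    def __init__(self, enc_dim, in_dim, out_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # Hypernetwork: maps the task encoding to all parameters of the layer.
        self.weight_gen = nn.Linear(enc_dim, in_dim * out_dim)
        self.bias_gen = nn.Linear(enc_dim, out_dim)

    def forward(self, x, task_enc):
        # Generate the layer's parameters from the task encoding.
        w = self.weight_gen(task_enc).view(self.out_dim, self.in_dim)
        b = self.bias_gen(task_enc)
        # Apply the generated layer to the state-conditioned input.
        return F.linear(x, w, b)

# Example with made-up sizes: an 8-dim task encoding generates a 4 -> 2 layer.
layer = HyperLinear(enc_dim=8, in_dim=4, out_dim=2)
out = layer(torch.randn(4), torch.randn(8))
```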
3 Background
3.1 Problem Setting
An RL task is formalized as a Markov Decision Process (MDP). We define an MDP as a tuple
$(\mathcal{S}, \mathcal{A}, R, P, \gamma)$. At time-step $t$, the agent inhabits a state, $s_t \in \mathcal{S}$, observable by the agent. The agent
takes an action $a_t \in \mathcal{A}$. The MDP then transitions to state $s_{t+1} \sim P(s_{t+1} | s_t, a_t): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}_{\geq 0}$,
and the agent receives reward $r_t = R(s_t, a_t): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ upon entering $s_{t+1}$. Given
a discount factor, $\gamma \in [0, 1)$, the agent acts to maximize the expected future discounted reward,
$R(\tau) = \sum_{r_t \in \tau} \gamma^t r_t$, where $\tau$ is the agent's trajectory over an episode in the MDP. To maximize this
return, the agent takes actions sampled from a learned policy, $\pi(a|s): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}_{+}$.
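For concreteness, a minimal sketch of the discounted return defined above, computed for a made-up list of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum over t of gamma^t * r_t for one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Illustrative rewards from a short three-step episode.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9^2 * 1.0 = 0.81
```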
Meta-RL algorithms learn an RL algorithm, i.e., a mapping from the data sampled from a single
MDP, $M \sim p(M)$, to a policy. Since an RL algorithm generally needs multiple episodes of inter-
action to produce a reasonable policy, the algorithm conditions on $\tau$, which is the entire sequence of
states, actions, and rewards within $M$. As in the RL setting, this sequence up to time-step $t$ forms
a trajectory $\tau_t \in (\mathcal{S} \times \mathcal{A} \times \mathbb{R})^t$. Here, however, $\tau$ may span multiple episodes, and so we use the
same symbol, but refer to it as a meta-episode. The policy is then a meta-episode dependent policy,
$\pi_\theta(a|s, \tau)$, parameterized by the meta-parameters, $\theta$.
We define the objective in meta-RL as finding meta-parameters $\theta$ that maximize the sum of the
returns in the meta-episode across a distribution of tasks (MDPs):
$$\arg\max_\theta \; \mathbb{E}_{M \sim p(M)} \, \mathbb{E}_{\tau} \left[ R(\tau) \mid \pi_\theta(\cdot), M \right] \quad (1)$$
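The objective in Equation 1 can be read as a Monte Carlo estimate: sample tasks from $p(M)$, roll out the meta-episode-conditioned policy in each, and average the resulting returns. The sketch below assumes hypothetical `sample_task` and `policy` interfaces and is only meant to make the sampling structure explicit.

```python
def estimate_meta_objective(sample_task, policy, num_tasks, episodes_per_task, gamma=0.99):
    """Monte Carlo estimate of objective (1): average over sampled tasks of the
    discounted return accumulated over each meta-episode.

    `sample_task()` and `policy(s, tau)` are hypothetical interfaces: the former
    returns an environment with reset() / step(a) -> (s, r, done), the latter
    implements pi_theta(a | s, tau).
    """
    total = 0.0
    for _ in range(num_tasks):
        env = sample_task()           # M ~ p(M)
        tau = []                      # meta-episode: (s, a, r) tuples across episodes
        meta_return = 0.0
        for _ in range(episodes_per_task):
            s, done, t = env.reset(), False, 0
            while not done:
                a = policy(s, tau)                 # pi_theta(a | s, tau)
                s_next, r, done = env.step(a)
                tau.append((s, a, r))
                meta_return += (gamma ** t) * r    # discounted within each episode
                t += 1
                s = s_next
        total += meta_return
    return total / num_tasks
```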
3.2 Policy Architecture
We consider meta-RL agents capable of adaptation at every time-step, and adaptation within one
episode is required to solve some of our benchmarks. In such methods [5, 10, 11], the history is
generally summarized by a function, $g$, into an embedding that represents relevant task information.
We write this embedding as $e = g(\tau)$, and call $g$ the task encoder. The policy, represented as a
multi-layer perceptron, then conditions on this task embedding as an input, instead of on the history
directly, which we write as $\pi_\theta(a|s, e)$. We call this the standard architecture, shown in Figure 2.
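A minimal sketch of the standard architecture: a task encoder $g$ maps the history to an embedding $e$, and an MLP policy conditions on the state and $e$. The GRU encoder, the discrete action head, and all dimensions below are illustrative assumptions, not the specific components of any cited method.

```python
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """g: summarizes the history tau of concatenated (s, a, r) transitions into
    an embedding e. A GRU is used here purely for illustration."""
    def __init__(self, transition_dim, embed_dim):
        super().__init__()
        self.rnn = nn.GRU(transition_dim, embed_dim, batch_first=True)

    def forward(self, tau):            # tau: (batch, time, transition_dim)
        _, h = self.rnn(tau)
        return h[-1]                   # e = g(tau), shape (batch, embed_dim)

class StandardPolicy(nn.Module):
    """pi_theta(a | s, e): an MLP conditioned on the state and task embedding."""
    def __init__(self, state_dim, embed_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s, e):
        return torch.softmax(self.net(torch.cat([s, e], dim=-1)), dim=-1)

# Illustrative shapes: 5-dim transitions, 3-dim states, 8-dim embeddings, 4 actions.
encoder, policy = TaskEncoder(5, 8), StandardPolicy(3, 8, 4)
e = encoder(torch.randn(1, 10, 5))     # embed a 10-step history
action_probs = policy(torch.randn(1, 3), e)
```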
In this paper, we primarily build off of VariBAD [5], which can be seen as an instance of the standard
architecture where the task encoder is the mean and variance from a recurrent variational auto-
encoder (VAE) [24] trained using a self-supervised loss. In other words, the task is inferred as a latent
variable optimized for reconstructing a meta-episode. See Zintgraf et al. [5] for details. Additionally,
we evaluate the addition of hypernetworks to RL$^2$ [11] on the most challenging benchmark (see Section
5.2). In RL$^2$, the task encoder is a recurrent neural network trained end-to-end on Equation 1.
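For intuition, the sketch below shows the interface of a VariBAD-style task encoder: a recurrent network mapping the history to the mean and variance of a latent task variable. The decoder and the self-supervised reconstruction loss are omitted, and all sizes are assumptions; see Zintgraf et al. [5] for the actual model.

```python
import torch
import torch.nn as nn

class RecurrentVAEEncoder(nn.Module):
    """Sketch of a VariBAD-style task encoder: a recurrent network maps the
    history to the mean and variance of a latent task variable. The decoder
    and reconstruction loss are omitted; sizes are illustrative."""
    def __init__(self, transition_dim, hidden_dim, latent_dim):
        super().__init__()
        self.rnn = nn.GRU(transition_dim, hidden_dim, batch_first=True)
        self.mean = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, tau):            # tau: (batch, time, transition_dim)
        _, h = self.rnn(tau)
        mu, logvar = self.mean(h[-1]), self.logvar(h[-1])
        # The embedding e passed to the policy is (mu, logvar); a latent sample
        # can also be drawn with the reparameterization trick if needed.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return mu, logvar, z
```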