Hypernetworks in Meta-Reinforcement Learning
Jacob Beck
Department of Computer Science
University of Oxford, United Kingdom
jacob beck@alumni.brown.edu
Matthew Jackson
Department of Engineering Science
University of Oxford, United Kingdom
jackson@robots.ox.ac.uk
Risto Vuorio
Department of Computer Science
University of Oxford, United Kingdom
risto.vuorio@keble.ox.ac.uk
Shimon Whiteson
Department of Computer Science
University of Oxford, United Kingdom
shimon.whiteson@cs.ox.ac.uk
Abstract: Training a reinforcement learning (RL) agent on a real-world robotics
task remains generally impractical due to sample inefficiency. Multi-task RL and
meta-RL aim to improve sample efficiency by generalizing over a distribution of
related tasks. However, doing so is difficult in practice: in multi-task RL, state-of-the-art methods often fail to outperform a degenerate solution that simply learns
each task separately. Hypernetworks are a promising path forward since they
replicate the separate policies of the degenerate solution while also allowing for
generalization across tasks, and are applicable to meta-RL. However, evidence
from supervised learning suggests hypernetwork performance is highly sensitive
to the initialization. In this paper, we 1) show that hypernetwork initialization
is also a critical factor in meta-RL, and that naive initializations yield poor per-
formance; 2) propose a novel hypernetwork initialization scheme that matches or
exceeds the performance of a state-of-the-art approach proposed for supervised
settings, as well as being simpler and more general; and 3) use this method to
show that hypernetworks can improve performance in meta-RL by evaluating on
multiple simulated robotics benchmarks.
Keywords: Meta-Learning, Reinforcement Learning, Hypernetworks
1 Introduction
Deep reinforcement learning (RL) has helped solve previously intractable problems but remains highly sample inefficient. This sample inefficiency makes deploying RL impractical, particularly in settings where data must be collected in the real world. For example, a robot's actions have the potential
to inflict damage on both itself and its surroundings. Multi-task RL and meta-RL aim to improve
sample efficiency on novel tasks by generalizing over a distribution of related tasks. However,
such generalization has proven difficult in practice. In fact, multi-task RL methods often fail to
outperform a degenerate solution that simply trains a separate policy for each task [1].
One promising way to improve generalization is with a hypernetwork, a neural network that pro-
duces the parameters for another network, called the base network [2]. In multi-task RL, using a
hypernetwork that conditions on the task ID to generate task-specific parameters can replicate the
separate policies of the degenerate solution, while also allowing generalization across tasks. Fur-
thermore, unlike the degenerate solution, hypernetworks can also be applied to meta-RL, where task
IDs are not provided and test tasks may be novel, by conditioning them on the output of a task
encoder.
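To make the architecture concrete, the sketch below shows a single hypernetwork-generated linear layer in PyTorch. It is an illustration only: the class name, layer sizes, and the choice to generate a single layer (rather than the full base network) are assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn as nn


class TaskConditionedLinear(nn.Module):
    """A base-network layer whose weight and bias are generated by a hypernetwork head."""

    def __init__(self, embed_dim: int, in_features: int, out_features: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Hypernetwork head: maps a task embedding to a flat vector holding W and b.
        self.head = nn.Linear(embed_dim, out_features * in_features + out_features)

    def forward(self, x: torch.Tensor, task_embedding: torch.Tensor) -> torch.Tensor:
        # x: [batch, in_features], task_embedding: [batch, embed_dim]
        params = self.head(task_embedding)  # [batch, out*in + out]
        w_flat, b = params.split(
            [self.out_features * self.in_features, self.out_features], dim=-1
        )
        w = w_flat.view(-1, self.out_features, self.in_features)  # per-sample weight matrix
        return torch.einsum("bi,boi->bo", x, w) + b  # base-layer forward pass


# In multi-task RL, the embedding could be a learned task-ID embedding; in meta-RL,
# it would come from a task encoder conditioned on the agent's history.
layer = TaskConditionedLinear(embed_dim=8, in_features=17, out_features=64)
hidden = layer(torch.randn(4, 17), task_embedding=torch.randn(4, 8))
```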
However, hypernetworks come with their own challenges. Since hypernetworks generate base net-
work parameters, the initialization of parameters in the hypernetwork determines the initialization
of the base network it produces. Evidence suggests hypernetwork performance is highly sensitive
to the initialization scheme in supervised learning [3]. However, to our knowledge this question has
[Figure 1 plots: (a) return on Cheetah-Dir over training steps for hypernetwork initialization methods (Bias-HyperInit, HFI, Kaiming); (b) test success percentage on Pick-Place for the Hypernetwork and VariBAD architectures.]
Figure 1: Naive initializations such as Kaiming [4] fail for hypernetworks, whereas our proposed
Bias-HyperInit does not and matches the state of the art, HFI [3] (claims 1, 2). Adding hypernet-
works with the proposed Bias-HyperInit significantly improves the state-of-the-art meta-RL method,
VariBAD [5] (claim 3).
not been considered in meta-RL. In this paper, we show that hypernetwork initialization is also a
critical factor in meta-RL, and that naive initializations yield poor performance.
Furthermore, we propose two novel initialization schemes: Bias-HyperInit and Weight-HyperInit.
Both produce strong results, with the former matching or exceeding the performance of the state-of-
the-art hypernetwork initialization method designed for supervised learning [3]. Moreover, both pro-
posed methods are simpler and more general than this existing method, in that they may be applied
to arbitrary base network architectures and target base network initializations without additional
derivation. Using Bias-HyperInit, we present results that substantially improve upon a state-of-the-art method on a range of meta-RL benchmarks.
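Both schemes are specified later in the paper. Purely as an illustration of the underlying idea of making a hypernetwork's initial output match a chosen base-network initialization regardless of its input, the sketch below zeroes a hypernetwork head's weights and places a target initialization for the generated layer in its bias (compatible with the head sketched above). The function name and the particular target initialization are assumptions, not the authors' released code.

```python
import math
import torch
import torch.nn as nn


def bias_hyper_init_(head: nn.Linear, in_features: int, out_features: int) -> None:
    """Make the head's initial output equal a target base-layer init, independent of its input."""
    with torch.no_grad():
        head.weight.zero_()  # initially, the generated parameters ignore the task embedding
        # Target initialization for the generated base layer (fan-in uniform bound, as one choice).
        bound = math.sqrt(1.0 / in_features)
        w_init = torch.empty(out_features * in_features).uniform_(-bound, bound)
        b_init = torch.zeros(out_features)
        head.bias.copy_(torch.cat([w_init, b_init]))  # bias holds [W_init, b_init] of the base layer
```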
Applying hypernetworks to meta-RL, we make the following contributions (examples in Figure 1):
1. We empirically demonstrate that initialization is a critical factor in the performance of
hypernetworks in meta-RL, and that naive initializations fail to learn reliably;
2. We propose a novel hypernetwork initialization scheme that matches or exceeds the per-
formance of a state-of-the-art approach proposed for supervised settings, as well as being
simpler and more general; and
3. We use this method to show that hypernetworks can improve a state-of-the-art method on
a range of meta-RL benchmarks (grid-world [5], MuJoCo [6], and Meta-World [1]).
2 Related Work
Meta-RL. Despite the advantages of hypernetworks [2], they remain relatively unexplored in
meta-RL. We use hypernetworks to arbitrarily update a policy’s parameters at every time-step,
whereas all prior work we are aware of restricts this procedure in some way. Many procedures in few-shot meta-RL build off of MAML [7] to adapt the parameters of a policy network using a policy gradient [7,8,9]. Such methods require the estimation of a policy gradient, which reduces sample efficiency when faster adaptation is possible, as in our benchmarks [5]. Most meta-learning procedures capable of zero-shot adaptation use an RNN (or convolutions) that can represent an arbitrary update function [5,10,11]. These methods generally update a set of activations on which a
fixed policy is then conditioned, whereas hypernetworks update all policy parameters. We include
a state-of-the-art method from this class in our evaluations [5]. There are also unsupervised meth-
ods in zero-shot meta-RL for weight updates [12,13], but none can produce a fully general learning procedure, since they make use of local and unsupervised heuristics. Sarafian et al. [14] use hypernetworks in the context of meta-RL, but the policy network, not the hypernetwork, is conditioned on the RNN used for adaptation, preventing the hypernetwork from representing a general learning procedure. FLAP [15] learns to infer a set of weights trained in the multi-task setting; however, since the adaptation procedure is not trained on a meta-RL objective, it is constrained. For example,
FLAP cannot learn to explore to reduce uncertainty. Finally, Xian et al. [16] use hypernetworks to predict model dynamics and then use model predictive control. However, this approach still requires planning to make use of an uncertain model, whereas model-free RL learns a policy that explores optimally in order to obtain data for adaptation. To the best of our knowledge, using a general procedure trained to arbitrarily modify the weights of a model-free policy has not previously been tried in RL.
Hypernetworks. Hypernetworks, or similar architectures, have been used in supervised learning
(SL), multi-task RL, and meta-SL. Hypernetworks have been used in the supervised learning litera-
ture for sequence modelling [2], as well as in continual learning and image classification [3], where
it was shown that the hypernetwork initialization scheme was crucial for performance. Similar mod-
els have also been used in multi-task RL and meta-SL, but not meta-RL. For instance, in multi-task
RL, Yu et al. [17] use a network conditioned on a task encoding to produce the weights and biases
for every other layer in another network conditioned on state. In meta-SL, there have also been attempts to use one network to adapt the weights of another: as a general function of the dataset [18,19,20], conditioned on an embedding adapted by gradient descent [21], or by adding deltas in a way framed as learning to optimize [22,23]. The abundance of such representations in meta-SL suggests there is a similarly large space of representation-based methods to explore in meta-RL. Our work –
getting hypernetworks to work in practice for meta-RL – can be seen as a first step towards applying
all of these methods in meta-RL.
3 Background
3.1 Problem Setting
An RL task is formalized as a Markov decision process (MDP). We define an MDP as a tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$. At time-step $t$, the agent inhabits a state $s_t \in \mathcal{S}$, observable by the agent. The agent takes an action $a_t \in \mathcal{A}$. The MDP then transitions to state $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$, where $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$, and the agent receives reward $r_t = R(s_t, a_t)$, where $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, upon entering $s_{t+1}$. Given a discount factor $\gamma \in [0, 1)$, the agent acts to maximize the expected future discounted reward, $R(\tau) = \sum_{r_t \in \tau} \gamma^t r_t$, where $\tau$ is the agent's trajectory over an episode in the MDP. To maximize this return, the agent takes actions sampled from a learned policy, $\pi(a \mid s) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_{+}$.
Meta-RL algorithms learn an RL algorithm, i.e., a mapping from the data sampled from a single MDP, $\mathcal{M} \sim p(\mathcal{M})$, to a policy. Since an RL algorithm generally needs multiple episodes of interaction to produce a reasonable policy, the algorithm conditions on $\tau$, which is the entire sequence of states, actions, and rewards within $\mathcal{M}$. As in the RL setting, this sequence up to time-step $t$ forms a trajectory $\tau_t \in (\mathcal{S} \times \mathcal{A} \times \mathbb{R})^t$. Here, however, $\tau$ may span multiple episodes, and so we use the same symbol, but refer to it as a meta-episode. The policy is then a meta-episode-dependent policy, $\pi_\theta(a \mid s, \tau)$, parameterized by the meta-parameters $\theta$.
We define the objective in meta-RL as finding meta-parameters $\theta$ that maximize the sum of the returns in the meta-episode across a distribution of tasks (MDPs):

$$\arg\max_{\theta} \; \mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})} \left[ \mathbb{E}_{\tau} \left[ R(\tau) \mid \pi_{\theta}(\cdot), \mathcal{M} \right] \right] \qquad (1)$$
3.2 Policy Architecture
We consider meta-RL agents capable of adaptation at every time-step, and adaptation within one
episode is required to solve some of our benchmarks. In such methods [5,10,11], the history is
generally summarized by a function, $g$, into an embedding that represents relevant task information. We write this embedding as $e = g(\tau)$, and call $g$ the task encoder. The policy, represented as a multi-layer perceptron, then conditions on this task embedding as an input, instead of on the history directly, which we write as $\pi_\theta(a \mid s, e)$. We call this the standard architecture, shown in Figure 2.
In this paper, we primarily build off of VariBAD [5], which can be seen as an instance of the standard
architecture where the task encoder is the mean and variance from a recurrent variational auto-
encoder (VAE) [24] trained using a self-supervised loss. In other words, the task is inferred as a latent
variable optimized for reconstructing a meta-episode. See Zintgraf et al. [5] for details. Additionally, we evaluate the addition of hypernetworks to RL$^2$ [11] on the most challenging benchmark (see Section 5.2). In RL$^2$, the task encoder is a recurrent neural network trained end-to-end on Equation 1.
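A minimal sketch of this standard architecture follows, with a GRU task encoder and an MLP policy head. The layer sizes, the choice of GRU, and the transition encoding are placeholder assumptions; in particular, the sketch omits VariBAD's VAE decoder and self-supervised reconstruction loss.

```python
import torch
import torch.nn as nn


class StandardMetaPolicy(nn.Module):
    """Task encoder g summarizes the history into e = g(tau); the policy conditions on [s, e]."""

    def __init__(self, state_dim: int, action_dim: int, embed_dim: int = 8, hidden: int = 64):
        super().__init__()
        # Task encoder g: consumes (s, a, r) transitions and maintains a recurrent summary.
        self.encoder = nn.GRU(state_dim + action_dim + 1, embed_dim, batch_first=True)
        # Policy head: conditions on the current state and the task embedding.
        self.policy = nn.Sequential(
            nn.Linear(state_dim + embed_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # history: [batch, T, state_dim + action_dim + 1]; state: [batch, state_dim]
        _, h = self.encoder(history)   # h: [1, batch, embed_dim]
        e = h.squeeze(0)               # task embedding e = g(tau)
        return self.policy(torch.cat([state, e], dim=-1))  # action logits / means


# Example usage with placeholder dimensions (17-dim state, 6-dim action, history of 10 steps).
policy = StandardMetaPolicy(state_dim=17, action_dim=6)
out = policy(torch.randn(4, 17), history=torch.randn(4, 10, 17 + 6 + 1))
```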