Discovered Policy Optimisation
Chris Lu
FLAIR, University of Oxford
christopher.lu@exeter.ox.ac.uk
Jakub Grudzien Kuba∗ †
BAIR, UC Berkeley
kuba@berkeley.edu
Alistair Letcher
aletcher.github.io
ahp.letcher@gmail.com
Luke Metz
Google Brain
Luke.s.metz@gmail.com
Christian Schroeder de Witt
FLAIR, University of Oxford
cs@robots.ox.ac.uk
Jakob Foerster
FLAIR, University of Oxford
jakob.foerster@eng.ox.ac.uk
Abstract
Tremendous progress has been made in reinforcement learning (RL) over the past
decade. Most of these advancements came through the continual development
of new algorithms, which were designed using a combination of mathematical
derivations, intuitions, and experimentation. Such an approach of creating algo-
rithms manually is limited by human understanding and ingenuity. In contrast,
meta-learning provides a toolkit for automatic machine learning method optimi-
sation, potentially addressing this flaw. However, black-box approaches which
attempt to discover RL algorithms with minimal prior structure have thus far not
outperformed existing hand-crafted algorithms. Mirror Learning, a framework which includes
RL algorithms such as PPO, offers a potential middle-ground starting point: while
every method in this framework comes with theoretical guarantees, the components
that differentiate them are open to design. In this paper we explore the Mirror
Learning space by meta-learning a “drift” function. We refer to the immediate
result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original
insights into policy optimisation which we use to formulate a novel, closed-form
RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax
environments confirm state-of-the-art performance of LPO and DPO, as well as
their transfer to unseen settings.
1 Introduction
Recent advancements in deep learning have allowed reinforcement learning algorithms [35, RL] to successfully tackle large-scale problems [36, 34]. As a result, great efforts have been put into designing methods that are capable of training neural-network policies in increasingly complex tasks [33, 31, 24, 9]. Among the most practical such algorithms are TRPO [31] and PPO [32], which are known for their performance and stability [2]. Nevertheless, although these research threads have delivered a handful of successful techniques, their design relies on concepts handcrafted by humans, rather than discovered in a learning process. As a possible consequence, these methods often suffer from various flaws, such as brittleness to hyperparameter settings [31, 12] and a lack of robustness guarantees.
The most promising alternative approach, algorithm discovery, thus far has been a “tough nut to crack”. Popular approaches in meta-RL [30, 7, 4] are unable to generalise to tasks that lie outside of their training distribution. Alternatively, many approaches that attempt to meta-learn more general algorithms [25, 16] fail to outperform existing handcrafted algorithms and lack theoretical guarantees.
∗Equal contribution.
†Work done while at FLAIR, University of Oxford.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.05639v2 [cs.LG] 13 Oct 2022
Recently, Mirror Learning [18], a new theoretical framework, introduced an infinite space of provably correct algorithms, all of which share the same template. In a nutshell, a Mirror Learning algorithm is defined by four attributes, but in this work we focus on the drift function. A drift function guides the agent's update, usually by penalising large changes. Any Mirror Learning algorithm provably achieves monotonic improvement of the return, and converges to an optimal policy [18]. Popular RL
methods such as TRPO [31] and PPO [32] are instances of this framework.
In this paper, we use meta-learning to discover a new state-of-the-art (SOTA) RL algorithm within
the Mirror Learning space. Our algorithm thus inherits theoretical convergence guarantees by
construction. Specifically, we parameterise a drift function with a neural network, which we then
meta-train using evolution strategies [29, ES]. The outcome of this meta-training is a specific Mirror
Learning algorithm which we name Learnt Policy Optimisation (LPO).
While having a neural network representation of a novel, high-performing drift function is a great
first step, our next goal is to understand the relevant algorithmic features of this drift function. Our
analysis reveals that LPO's drift function discovered, for example, optimism about actions that scored low
rewards in the past, a feature we refer to as rollback. Building upon these insights, we propose a new,
closed-form algorithm which we name Discovered Policy Optimisation (DPO). We evaluate LPO
and DPO in the Brax [8] continuous control environments, where they obtain superior performance
compared to PPO. Importantly, both LPO and DPO generalise to environments that were not used for
training LPO. To our knowledge, DPO is the first theoretically-sound, scalable deep RL algorithm
that was discovered via meta-learning.
2 Related Work
Over the last few years, researchers have put significant effort into designing and developing algorithmic improvements in reinforcement learning. Fujimoto et al. [9] combine DDPG policy training with estimates of pessimistic Bellman targets from a separate critic. Hsu et al. [15] stabilise the previously unsuccessful [32] KL-penalised version of PPO and improve its robustness through novel policy design choices. Haarnoja et al. [13] introduce a mechanism that automatically adjusts the temperature parameter of the entropy bonus in SAC. However, none of these hand-crafted efforts succeeds in fully mitigating common RL pathologies, such as sensitivity to hyperparameter choices and lack of domain generalisation [4]. This motivates radically expanding the RL algorithm search
space through automated means [27].
Popular approaches in meta-RL have shown that agents can learn to quickly adapt over a pre-specified distribution of tasks. RL² equips a learning agent with a recurrent neural network that retains state across episode boundaries to adapt the agent's behaviour to the current environment [4]. Similarly, a MAML agent meta-learns policy parameters which can adapt to a range of tasks with a few steps of gradient descent [7]. However, both RL² and MAML usually only meta-learn across narrow domains and are not expected to generalise well to truly unseen environments.
Xu et al. [40] introduce an actor-critic method that adjusts its hyperparameters online using meta-gradients that are updated with every few inner iterations. Similarly, STAC [42] uses implementation techniques from IMPALA [5] and auxiliary loss-guided meta-parameter tuning to further improve on
this approach.
Such advances have inspired extending meta-gradient RL techniques to more ambitious objectives,
including the discovery of algorithms ab initio. Notably, Oh et al. [25] succeeded in meta-learning an RL algorithm, LPG, that can solve simple tasks efficiently without explicitly relying on concepts such as value functions and policy gradients. Similarly, Evolved Policy Gradients [14, EPG] meta-trains a policy loss function, represented by a neural network, with Evolution Strategies [29, ES]. Although EPG surpasses PPO in average performance, it suffers from much larger variance [14] and is not expected to perform well on environments with dynamics that differ greatly from the training distribution. MetaGenRL [17], instead, meta-learns the loss function for deterministic policies, which are inherently less affected by estimator variance [33]. MetaGenRL, however, fails to improve upon DDPG [21] in terms of performance, despite building upon it. Neither EPG nor MetaGenRL has resulted in the discovery of novel analytical RL algorithms, perhaps due to the limited interpretability of the learnt loss functions. Lastly, Co-Reyes et al. [3], Garau et al. [10] and Alet et al. [1] discover and improve standard RL conventions by symbolically evolving algorithms represented as graphs, which leads to improved performance in simple tasks. However, none of these trained-from-scratch methods inherits correctness
guarantees, limiting our confidence in the generality of their abilities. In contrast, our method, LPO, is meta-developed in a Mirror Learning space [18], where every algorithm is guaranteed to converge to an optimal policy. As a result of this construction, meta-training of LPO is easier than that of methods that learn “from scratch”, and it achieves strong performance across environments. Furthermore, thanks to the clear meta-structure of Mirror Learning, LPO is interpretable and lets us discover new learning strategies. This lets us introduce DPO, an efficient algorithm with a closed-form formulation that exploits the discovered learning concepts.
3 Background
In this section, we introduce the essential concepts required to comprehend our contribution—the
RL and meta-RL problem formulations, as well as the Mirror Learning and Evolution Strategies
frameworks for solving them.
3.1 Reinforcement Learning
Formulation
We formulate the reinforcement learning (RL) problem as a Markov decision process (MDP) [35] represented by a tuple $\langle \mathcal{S}, \mathcal{A}, R, P, \gamma, d \rangle$, which defines the experience of a learning agent as follows: at time step $t \in \mathbb{N}$, the agent is at state $s_t \in \mathcal{S}$ (where $s_0 \sim d$) and takes an action $a_t \in \mathcal{A}$ according to its stochastic policy $\pi(\cdot \mid s_t)$, which is a member of the policy space $\Pi$. The environment then emits the reward $R(s_t, a_t)$ and transits to the next state $s_{t+1}$, drawn from the transition function, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The agent aims to maximise the expected value of the total discounted return,
\[
\eta(\pi) \triangleq \mathbb{E}[R^{\gamma} \mid \pi] = \mathbb{E}_{s_0 \sim d,\, a_{0:\infty} \sim \pi,\, s_{1:\infty} \sim P}\Big[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \Big]. \tag{1}
\]
The agent guides its learning process with value functions that evaluate the expected return conditioned on states or state-action pairs:
\[
V^{\pi}(s) \triangleq \mathbb{E}[R^{\gamma} \mid \pi, s_0 = s] \quad \text{(the state value function)},
\]
\[
Q^{\pi}(s, a) \triangleq \mathbb{E}[R^{\gamma} \mid \pi, s_0 = s, a_0 = a] \quad \text{(the state-action value function)}.
\]
The function that the agent is concerned about most is the advantage function, which computes the relative values of actions at different states,
\[
A^{\pi}(s, a) \triangleq Q^{\pi}(s, a) - V^{\pi}(s). \tag{2}
\]
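To make these definitions concrete, the following minimal sketch (ours, not from the paper; JAX is assumed, and the critic values are taken as given) estimates the discounted return-to-go and a Monte Carlo advantage from a single sampled trajectory.

```python
import jax.numpy as jnp
from jax import lax

def discounted_returns(rewards: jnp.ndarray, gamma: float) -> jnp.ndarray:
    """Return-to-go at every step: R_t = sum_{k >= t} gamma^(k - t) r_k."""
    def step(carry, r):
        g = r + gamma * carry
        return g, g
    # Scan backwards over the trajectory, then flip back to forward order.
    _, returns = lax.scan(step, 0.0, rewards[::-1])
    return returns[::-1]

def monte_carlo_advantages(rewards: jnp.ndarray, values: jnp.ndarray, gamma: float) -> jnp.ndarray:
    """A_hat(s_t, a_t) = R_t - V_hat(s_t), with V_hat given by a learned critic."""
    return discounted_returns(rewards, gamma) - values
```

In practice, lower-variance estimators such as GAE are typically preferred, but the Monte Carlo form above matches Equation (2) most directly.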
Policy Optimisation
In fact, by updating its policy simply to maximise the advantage function at every state, the agent is guaranteed to improve its policy, $\eta(\pi_{\text{new}}) \geq \eta(\pi_{\text{old}})$ [35]. This fact, although requiring a maximisation operation that is intractable in large state-space settings tackled by deep RL (where the policy $\pi_\theta$ is parameterised by weights $\theta$ of a neural network), has inspired a range of algorithms that perform it approximately. For example, A2C [24] updates the policy by a step of policy gradient (PG) ascent,
\[
\theta_{k+1} = \theta_k + \frac{\alpha}{B} \sum_{b=1}^{B} A^{\pi_{\theta_k}}(s_b, a_b)\, \nabla_\theta \log \pi_{\theta_k}(a_b \mid s_b), \qquad \alpha \in (0, 1), \tag{3}
\]
estimated from a batch of $B$ transitions. Nevertheless, such simple adoptions of generalized policy iteration [35, GPI] suffer from large variance and instability [43, 33, 32]. Hence, methods that constrain (either explicitly or implicitly) the policy update size are preferred [31]. Among the most popular, as well as successful ones, is Proximal Policy Optimization [32, PPO], inspired by trust region learning [31], which updates its policy by maximising the PPO-clip objective,
\[
\pi_{k+1} = \arg\max_{\pi \in \Pi}\ \mathbb{E}_{s \sim \rho_{\pi_k},\, a \sim \pi_k}\Big[ \min\Big( \frac{\pi(a \mid s)}{\pi_k(a \mid s)} A^{\pi_k}(s, a),\ \mathrm{clip}\Big(\frac{\pi(a \mid s)}{\pi_k(a \mid s)}, 1 \pm \epsilon\Big) A^{\pi_k}(s, a) \Big) \Big], \tag{4}
\]
where the $\mathrm{clip}(\cdot, 1 \pm \epsilon)$ operator clips (if necessary) the input so that it stays within the interval $[1 - \epsilon, 1 + \epsilon]$. In deep RL, the maximisation oracle in Equation (4) is approximated by a few steps of gradient ascent on policy parameters.
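As an illustration only (our own sketch, not the authors' implementation), the sample-based PPO-clip objective of Equation (4) can be written in terms of log-probability ratios; the function and argument names are ours.

```python
import jax.numpy as jnp

def ppo_clip_objective(log_probs: jnp.ndarray,
                       old_log_probs: jnp.ndarray,
                       advantages: jnp.ndarray,
                       epsilon: float = 0.2) -> jnp.ndarray:
    """Sample estimate of the PPO-clip objective in Equation (4), to be maximised."""
    ratio = jnp.exp(log_probs - old_log_probs)            # pi(a|s) / pi_k(a|s)
    clipped = jnp.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return jnp.mean(jnp.minimum(ratio * advantages, clipped * advantages))
```

The negative of this quantity is then minimised for a few gradient steps on $\theta$, which is how the maximisation oracle in Equation (4) is approximated in practice.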
Meta-RL
The above approaches to policy optimisation rely on human-possessed knowledge, and thus are limited by humans' understanding of the problem. The goal of meta-RL is to instead optimise the learning algorithm using machine learning. Formally, suppose that an RL algorithm $\mathrm{alg}_\phi$, parameterised by $\phi$, trains an agent for $K$ iterations. Meta-RL aims to find the meta-parameter $\phi = \phi^{*}$ such that the expected return of the output policy, $\mathbb{E}[\eta(\pi_K) \mid \mathrm{alg}_\phi]$, is maximised.
3.2 Mirror Learning
A Mirror Learning agent [18], in addition to value functions, has access to the following operators: the drift function $\mathcal{D}_{\pi_k}(\pi \mid s)$, which, intuitively, evaluates the significance of change from policy $\pi_k$ to $\pi$ at state $s$; the neighbourhood operator $\mathcal{N}(\pi_k)$, which forms a region around the policy $\pi_k$; as well as sampling and drift distributions $\beta_{\pi_k}(s)$ and $\nu^{\pi}_{\pi_k}(s)$ over states. With these defined, a Mirror Learning algorithm updates an agent's policy by maximising the mirror objective
\[
\pi_{k+1} = \arg\max_{\pi \in \mathcal{N}(\pi_k)}\ \mathbb{E}_{s \sim \beta_{\pi_k},\, a \sim \pi}\big[ A^{\pi_k}(s, a) \big] - \mathbb{E}_{s \sim \nu^{\pi}_{\pi_k}}\big[ \mathcal{D}_{\pi_k}(\pi \mid s) \big]. \tag{5}
\]
If, for all policies $\pi$ and $\pi_k$, the drift function satisfies the following conditions:
1. It is non-negative everywhere and zero at identity, $\mathcal{D}_{\pi_k}(\pi \mid s) \geq \mathcal{D}_{\pi_k}(\pi_k \mid s) = 0$,
2. Its gradient with respect to $\pi$ is zero at $\pi = \pi_k$,
then the Mirror Learning algorithm attains the monotonic improvement property, $\eta(\pi_{k+1}) \geq \eta(\pi_k)$, and converges to the optimal return, $\eta(\pi_k) \to \eta(\pi^{*})$, as $k \to \infty$ [18]. A Mirror Learning agent can be implemented in practice by specifying functional forms of the drift function and neighbourhood operator, and parameterising the policy of the agent with a neural network, $\pi_\theta$. As such, the agent approximates the objective in Equation (5) by sample averages, and maximises it with an optimisation method, like gradient ascent. PPO is a valid instance of Mirror Learning, with the drift function:
\[
\mathcal{D}^{\mathrm{PPO}}_{\pi_k}(\pi \mid s) \triangleq \mathbb{E}_{a \sim \pi_k}\Big[ \mathrm{ReLU}\Big( \Big( \frac{\pi(a \mid s)}{\pi_k(a \mid s)} - \mathrm{clip}\Big(\frac{\pi(a \mid s)}{\pi_k(a \mid s)}, 1 \pm \epsilon\Big) \Big) A^{\pi_k}(s, a) \Big) \Big]. \tag{6}
\]
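For illustration (again our own sketch, not code from the paper), the PPO drift of Equation (6) admits the following sample estimate over actions drawn from $\pi_k$.

```python
import jax.numpy as jnp

def ppo_drift(log_probs: jnp.ndarray,
              old_log_probs: jnp.ndarray,
              advantages: jnp.ndarray,
              epsilon: float = 0.2) -> jnp.ndarray:
    """Sample estimate of the PPO drift function in Equation (6)."""
    ratio = jnp.exp(log_probs - old_log_probs)            # pi(a|s) / pi_k(a|s)
    clipped = jnp.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # ReLU of the part of the surrogate that clipping removes.
    return jnp.mean(jnp.maximum((ratio - clipped) * advantages, 0.0))
```

One can check that $\min(rA, \mathrm{clip}(r, 1 \pm \epsilon)A) = rA - \mathrm{ReLU}((r - \mathrm{clip}(r, 1 \pm \epsilon))A)$, so maximising the PPO-clip objective in Equation (4) amounts to maximising the expected advantage minus this drift (up to the choice of sampling and drift distributions), which is how PPO fits the Mirror Learning template.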
While it is possible to explicitly constrain the neighbourhood of the policy update [31], some algorithms do so implicitly. For example, as the maximisation oracle of PPO (see Equation (4)) takes the form of $N$ steps of gradient ascent with learning rate $\alpha$ and gradient clipping threshold $c$, it implicitly employs a neighbourhood that is a Euclidean ball of radius $N\alpha c$ around $\theta_k$.
Different Mirror Learning algorithms can differ in multiple aspects, such as sample complexity and wall-clock time efficiency [18]. Depending on the setting, different properties may be desirable. In this paper, we optimise for the return of the $K$-th iterate, $\eta(\pi_K)$.
3.3 Evolution Strategies
Evolution Strategies [28, 29, ES] is a backpropagation-free approach to function optimisation. At their core lies the following identity, which holds for any continuously differentiable function $F$ of $\phi$, and any positive scalar $\sigma$:
\[
\nabla_\phi\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[ F(\phi + \sigma \epsilon) \big] = \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[ F(\phi + \sigma \epsilon)\, \epsilon \big], \tag{7}
\]
where $\mathcal{N}(0, I)$ denotes the standard multivariate normal distribution. By taking the limit $\sigma \to 0$, the gradient on the left-hand side recovers the gradient $\nabla_\phi F(\phi)$. These facts inspire an approach to optimising $F$ with respect to $\phi$ without estimating gradients with backpropagation: for a random sample $\epsilon_1, \dots, \epsilon_n \sim \mathcal{N}(0, I)$, the vector $\frac{1}{n\sigma} \sum_{i=1}^{n} F(\phi + \sigma \epsilon_i)\, \epsilon_i$ is an unbiased gradient estimate.
To reduce the variance of this estimator, antithetic sampling is commonly used [26]. In the context of meta-RL, where $\phi$ is the meta-parameter of an RL algorithm $\mathrm{alg}_\phi$, the role of $F(\phi)$ is played by the average return after training, $F(\phi) = \mathbb{E}[\eta(\pi_K) \mid \phi]$. As opposed to the meta-gradient approaches described in Section 2, ES does not require backpropagation of the gradient through the whole training episode, a cumbersome procedure which, often approximated by truncated backpropagation, introduces bias [38, 39, 25, 6, 23].
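To make this concrete, here is a minimal sketch (our own; `train_and_evaluate` is a hypothetical stand-in for running $\mathrm{alg}_\phi$ for $K$ iterations and returning $\eta(\pi_K)$) of an antithetic ES gradient estimate for the meta-objective $F(\phi) = \mathbb{E}[\eta(\pi_K) \mid \phi]$.

```python
import jax
import jax.numpy as jnp

def es_meta_gradient(train_and_evaluate, phi: jnp.ndarray, key: jax.Array,
                     sigma: float = 0.03, n_pairs: int = 32) -> jnp.ndarray:
    """Antithetic ES estimate of the gradient in Equation (7).

    `train_and_evaluate(phi, key)` is a hypothetical meta-objective evaluation:
    it runs alg_phi for K training iterations and returns the final return eta(pi_K).
    """
    eps_key, eval_key = jax.random.split(key)
    eps = jax.random.normal(eps_key, (n_pairs,) + phi.shape)   # eps_i ~ N(0, I)
    eval_keys = jax.random.split(eval_key, n_pairs)

    # Evaluate the meta-objective at antithetic perturbations phi +/- sigma * eps_i.
    f_plus = jax.vmap(train_and_evaluate)(phi + sigma * eps, eval_keys)
    f_minus = jax.vmap(train_and_evaluate)(phi - sigma * eps, eval_keys)

    # (1 / (2 n sigma)) * sum_i (F(phi + sigma eps_i) - F(phi - sigma eps_i)) eps_i
    weights = (f_plus - f_minus) / (2.0 * n_pairs * sigma)
    return jnp.tensordot(weights, eps, axes=1)
```

The meta-parameter is then updated by a step of gradient ascent on this estimate; in this sketch, $\phi$ is assumed to be a flat parameter vector.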