Discovered Policy Optimisation
Chris Lu
FLAIR, University of Oxford
christopher.lu@exeter.ox.ac.uk
Jakub Grudzien Kuba∗ †
BAIR, UC Berkeley
kuba@berkeley.edu
Alistair Letcher
aletcher.github.io
ahp.letcher@gmail.com
Luke Metz
Google Brain
Luke.s.metz@gmail.com
Christian Schroeder de Witt
FLAIR, University of Oxford
cs@robots.ox.ac.uk
Jakob Foerster
FLAIR, University of Oxford
jakob.foerster@eng.ox.ac.uk
Abstract
Tremendous progress has been made in reinforcement learning (RL) over the past
decade. Most of these advancements came through the continual development
of new algorithms, which were designed using a combination of mathematical
derivations, intuitions, and experimentation. Such an approach of creating algo-
rithms manually is limited by human understanding and ingenuity. In contrast,
meta-learning provides a toolkit for automatic machine learning method optimi-
sation, potentially addressing this flaw. However, black-box approaches which
attempt to discover RL algorithms with minimal prior structure have thus far not
outperformed existing hand-crafted algorithms. Mirror Learning, a framework which includes
RL algorithms such as PPO, offers a potential middle-ground starting point: while
every method in this framework comes with theoretical guarantees, the components
that differentiate them are open to design. In this paper we explore the Mirror
Learning space by meta-learning a “drift” function. We refer to the immediate
result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original
insights into policy optimisation which we use to formulate a novel, closed-form
RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax
environments confirm state-of-the-art performance of LPO and DPO, as well as
their transfer to unseen settings.
1 Introduction
Recent advancements in deep learning have allowed reinforcement learning algorithms [35, RL] to successfully tackle large-scale problems [36, 34]. As a result, great efforts have been put into designing methods that are capable of training neural-network policies in increasingly complex tasks [33, 31, 24, 9]. Among the most practical such algorithms are TRPO [31] and PPO [32], which are known for their performance and stability [2]. Nevertheless, although these research threads have delivered a handful of successful techniques, their design relies on concepts handcrafted by humans, rather than discovered in a learning process. As a possible consequence, these methods often suffer from various flaws, such as brittleness to hyperparameter settings [31, 12] and a lack of robustness guarantees.
The most promising alternative approach, algorithm discovery, thus far has been a “tough nut to crack”. Popular approaches in meta-RL [30, 7, 4] are unable to generalise to tasks that lie outside of their training distribution. Alternatively, many approaches that attempt to meta-learn more general algorithms [25, 16] fail to outperform existing handcrafted algorithms and lack theoretical guarantees.
∗Equal contribution.
†Work done while at FLAIR, University of Oxford.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.05639v2 [cs.LG] 13 Oct 2022
Recently, Mirror Learning [18], a new theoretical framework, introduced an infinite space of provably correct algorithms, all of which share the same template. In a nutshell, a Mirror Learning algorithm is defined by four attributes, but in this work we focus on the drift function. A drift function guides the agent's update, usually by penalising large changes. Any Mirror Learning algorithm provably achieves monotonic improvement of the return, and converges to an optimal policy [18]. Popular RL
methods such as TRPO [31] and PPO [32] are instances of this framework.
In this paper, we use meta-learning to discover a new state-of-the-art (SOTA) RL algorithm within
the Mirror Learning space. Our algorithm thus inherits theoretical convergence guarantees by
construction. Specifically, we parameterise a drift function with a neural network, which we then
meta-train using evolution strategies [29, ES]. The outcome of this meta-training is a specific Mirror
Learning algorithm which we name Learnt Policy Optimisation (LPO).
While having a neural network representation of a novel, high-performing drift function is a great
first step, our next goal is to understand the relevant algorithmic features of this drift function. Our
analysis reveals that LPO's drift function discovered, for example, optimism about actions that scored low
rewards in the past, a feature we refer to as rollback. Building upon these insights, we propose a new,
closed-form algorithm which we name Discovered Policy Optimisation (DPO). We evaluate LPO
and DPO in the Brax [8] continuous control environments, where they obtain superior performance
compared to PPO. Importantly, both LPO and DPO generalise to environments that were not used for
training LPO. To our knowledge, DPO is the first theoretically-sound, scalable deep RL algorithm
that was discovered via meta-learning.
2 Related Work
Over the last few years, researchers have put significant effort into designing and developing algorithmic improvements in reinforcement learning. Fujimoto et al. [9] combine DDPG policy training with estimates of pessimistic Bellman targets from a separate critic. Hsu et al. [15] stabilise the previously unsuccessful [32] KL-penalised version of PPO and improve its robustness through novel policy design choices. Haarnoja et al. [13] introduce a mechanism that automatically adjusts the temperature parameter of the entropy bonus in SAC. However, none of these hand-crafted efforts succeeds in fully mitigating common RL pathologies, such as sensitivity to hyperparameter choices and lack of domain generalisation [4]. This motivates radically expanding the RL algorithm search
space through automated means [27].
Popular approaches in meta-RL have shown that agents can learn to quickly adapt over a pre-specified distribution of tasks. RL² equips a learning agent with a recurrent neural network that retains state across episode boundaries to adapt the agent's behaviour to the current environment [4]. Similarly, a MAML agent meta-learns policy parameters which can adapt to a range of tasks with a few steps of gradient descent [7]. However, both RL² and MAML usually only meta-learn across narrow domains and are not expected to generalise well to truly unseen environments.
Xu et al. [40] introduce an actor-critic method that adjusts its hyperparameters online using meta-gradients that are updated with every few inner iterations. Similarly, STAC [42] uses implementation techniques from IMPALA [5] and auxiliary loss-guided meta-parameter tuning to further improve on
this approach.
Such advances have inspired extending meta-gradient RL techniques to more ambitious objectives,
including the discovery of algorithms ab initio. Notably, Oh et al. [25] succeeded in meta-learning an RL algorithm, LPG, that can solve simple tasks efficiently without explicitly relying on concepts such as value functions and policy gradients. Similarly, Evolved Policy Gradients [14, EPG] meta-trains a policy loss function, represented by a neural network, with Evolution Strategies [29, ES]. Although EPG surpasses PPO in average performance, it suffers from much larger variance [14] and is not expected to perform well on environments with dynamics that differ greatly from the training distribution. MetaGenRL [17], instead, meta-learns the loss function for deterministic policies, which are inherently less affected by estimator variance [33]. MetaGenRL, however, fails to improve upon DDPG [21] in terms of performance, despite building upon it. Neither EPG nor MetaGenRL has resulted in the discovery of novel analytical RL algorithms, perhaps due to the limited interpretability of the learnt loss functions. Lastly, Co-Reyes et al. [3], Garau et al. [10] and Alet et al. [1] discover and improve standard RL conventions by symbolically evolving algorithms represented as graphs, which leads to improved performance in simple tasks. However, none of these trained-from-scratch methods inherits correctness
guarantees, limiting our confidence in the generality of their abilities. In contrast, our method, LPO, is meta-developed in a Mirror Learning space [18], where every algorithm is guaranteed to converge to an optimal policy. As a result of this construction, meta-training of LPO is easier than that of methods that learn “from scratch”, and it achieves strong performance across environments. Furthermore, thanks to the clear meta-structure of Mirror Learning, LPO is interpretable and lets us discover new learning strategies. This lets us introduce DPO, an efficient algorithm with a closed-form formulation that exploits the discovered learning concepts.
3 Background
In this section, we introduce the essential concepts required to comprehend our contribution—the
RL and meta-RL problem formulations, as well as the Mirror Learning and Evolution Strategies
frameworks for solving them.
3.1 Reinforcement Learning
Formulation
We formulate the reinforcement learning (RL) problem as a Markov decision process (MDP) [35] represented by a tuple $\langle \mathcal{S}, \mathcal{A}, R, P, \gamma, d \rangle$, which defines the experience of a learning agent as follows: at time step $t \in \mathbb{N}$, the agent is at state $s_t \in \mathcal{S}$ (where $s_0 \sim d$) and takes an action $a_t \in \mathcal{A}$ according to its stochastic policy $\pi(\cdot \mid s_t)$, which is a member of the policy space $\Pi$. The environment then emits the reward $R(s_t, a_t)$ and transits to the next state $s_{t+1}$, drawn from the transition function, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The agent aims to maximise the expected value of the total discounted return,
\[
\eta(\pi) \triangleq \mathbb{E}[R^{\gamma} \mid \pi] = \mathbb{E}_{s_0 \sim d,\, a_{0:\infty} \sim \pi,\, s_{1:\infty} \sim P}\Big[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \Big]. \tag{1}
\]
The agent guides its learning process with value functions that evaluate the expected return conditioned on states or state-action pairs:
\[
V^{\pi}(s) \triangleq \mathbb{E}[R^{\gamma} \mid \pi, s_0 = s] \quad \text{(the state value function)},
\]
\[
Q^{\pi}(s, a) \triangleq \mathbb{E}[R^{\gamma} \mid \pi, s_0 = s, a_0 = a] \quad \text{(the state-action value function)}.
\]
The function that the agent is concerned about most is the advantage function, which computes the relative values of actions at different states,
\[
A^{\pi}(s, a) \triangleq Q^{\pi}(s, a) - V^{\pi}(s). \tag{2}
\]
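To make these definitions concrete, the following minimal sketch (ours, not from the paper; JAX is assumed, and the critic values are taken as given) estimates the discounted return-to-go and a Monte Carlo advantage from a single sampled trajectory.

```python
import jax.numpy as jnp
from jax import lax

def discounted_returns(rewards: jnp.ndarray, gamma: float) -> jnp.ndarray:
    """Return-to-go at every step: R_t = sum_{k >= t} gamma^(k - t) r_k."""
    def step(carry, r):
        g = r + gamma * carry
        return g, g
    # Scan backwards over the trajectory, then flip back to forward order.
    _, returns = lax.scan(step, 0.0, rewards[::-1])
    return returns[::-1]

def monte_carlo_advantages(rewards: jnp.ndarray, values: jnp.ndarray, gamma: float) -> jnp.ndarray:
    """A_hat(s_t, a_t) = R_t - V_hat(s_t), with V_hat given by a learned critic."""
    return discounted_returns(rewards, gamma) - values
```

In practice, lower-variance estimators such as GAE are typically preferred, but the Monte Carlo form above matches Equation (2) most directly.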
Policy Optimisation
In fact, by updating its policy simply to maximise the advantage function at every state, the agent is guaranteed to improve its policy, $\eta(\pi_{\text{new}}) \geq \eta(\pi_{\text{old}})$ [35]. This fact, although requiring a maximisation operation that is intractable in large state-space settings tackled by deep RL (where the policy $\pi_\theta$ is parameterised by weights $\theta$ of a neural network), has inspired a range of algorithms that perform it approximately. For example, A2C [24] updates the policy by a step of policy gradient (PG) ascent,
\[
\theta_{k+1} = \theta_k + \frac{\alpha}{B} \sum_{b=1}^{B} A^{\pi_{\theta_k}}(s_b, a_b)\, \nabla_\theta \log \pi_{\theta_k}(a_b \mid s_b), \qquad \alpha \in (0, 1), \tag{3}
\]
estimated from a batch of $B$ transitions. Nevertheless, such simple adoptions of generalized policy iteration [35, GPI] suffer from large variance and instability [43, 33, 32]. Hence, methods that constrain (either explicitly or implicitly) the policy update size are preferred [31]. Among the most popular, as well as successful ones, is Proximal Policy Optimization [32, PPO], inspired by trust region learning [31], which updates its policy by maximising the PPO-clip objective,
\[
\pi_{k+1} = \arg\max_{\pi \in \Pi}\ \mathbb{E}_{s \sim \rho_{\pi_k},\, a \sim \pi_k}\Big[ \min\Big( \frac{\pi(a \mid s)}{\pi_k(a \mid s)} A^{\pi_k}(s, a),\ \mathrm{clip}\Big(\frac{\pi(a \mid s)}{\pi_k(a \mid s)}, 1 \pm \epsilon\Big) A^{\pi_k}(s, a) \Big) \Big], \tag{4}
\]
where the $\mathrm{clip}(\cdot, 1 \pm \epsilon)$ operator clips (if necessary) the input so that it stays within the interval $[1 - \epsilon, 1 + \epsilon]$. In deep RL, the maximisation oracle in Equation (4) is approximated by a few steps of gradient ascent on policy parameters.
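As an illustration only (our own sketch, not the authors' implementation), the sample-based PPO-clip objective of Equation (4) can be written in terms of log-probability ratios; the function and argument names are ours.

```python
import jax.numpy as jnp

def ppo_clip_objective(log_probs: jnp.ndarray,
                       old_log_probs: jnp.ndarray,
                       advantages: jnp.ndarray,
                       epsilon: float = 0.2) -> jnp.ndarray:
    """Sample estimate of the PPO-clip objective in Equation (4), to be maximised."""
    ratio = jnp.exp(log_probs - old_log_probs)            # pi(a|s) / pi_k(a|s)
    clipped = jnp.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return jnp.mean(jnp.minimum(ratio * advantages, clipped * advantages))
```

The negative of this quantity is then minimised for a few gradient steps on $\theta$, which is how the maximisation oracle in Equation (4) is approximated in practice.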
Meta-RL
The above approaches to policy optimisation rely on human-possessed knowledge, and thus are limited by humans' understanding of the problem. The goal of meta-RL is to instead optimise the learning algorithm using machine learning. Formally, suppose that an RL algorithm $\mathrm{alg}_\phi$, parameterised by $\phi$, trains an agent for $K$ iterations. Meta-RL aims to find the meta-parameter $\phi = \phi^{*}$ such that the expected return of the output policy, $\mathbb{E}[\eta(\pi_K) \mid \mathrm{alg}_\phi]$, is maximised.
3.2 Mirror Learning
A Mirror Learning agent [18], in addition to value functions, has access to the following operators: the drift function $\mathcal{D}_{\pi_k}(\pi \mid s)$, which, intuitively, evaluates the significance of change from policy $\pi_k$ to $\pi$ at state $s$; the neighbourhood operator $\mathcal{N}(\pi_k)$, which forms a region around the policy $\pi_k$; as well as sampling and drift distributions $\beta_{\pi_k}(s)$ and $\nu^{\pi}_{\pi_k}(s)$ over states. With these defined, a Mirror Learning algorithm updates an agent's policy by maximising the mirror objective
\[
\pi_{k+1} = \arg\max_{\pi \in \mathcal{N}(\pi_k)}\ \mathbb{E}_{s \sim \beta_{\pi_k},\, a \sim \pi}\big[ A^{\pi_k}(s, a) \big] - \mathbb{E}_{s \sim \nu^{\pi}_{\pi_k}}\big[ \mathcal{D}_{\pi_k}(\pi \mid s) \big]. \tag{5}
\]
If, for all policies $\pi$ and $\pi_k$, the drift function satisfies the following conditions:
1. It is non-negative everywhere and zero at identity, $\mathcal{D}_{\pi_k}(\pi \mid s) \geq \mathcal{D}_{\pi_k}(\pi_k \mid s) = 0$,
2. Its gradient with respect to $\pi$ is zero at $\pi = \pi_k$,
then the Mirror Learning algorithm attains the monotonic improvement property, $\eta(\pi_{k+1}) \geq \eta(\pi_k)$, and converges to the optimal return, $\eta(\pi_k) \to \eta(\pi^{*})$, as $k \to \infty$ [18]. A Mirror Learning agent can be implemented in practice by specifying functional forms of the drift function and neighbourhood operator, and parameterising the policy of the agent with a neural network, $\pi_\theta$. As such, the agent approximates the objective in Equation (5) by sample averages, and maximises it with an optimisation method, like gradient ascent. PPO is a valid instance of Mirror Learning, with the drift function:
\[
\mathcal{D}^{\mathrm{PPO}}_{\pi_k}(\pi \mid s) \triangleq \mathbb{E}_{a \sim \pi_k}\Big[ \mathrm{ReLU}\Big( \Big( \frac{\pi(a \mid s)}{\pi_k(a \mid s)} - \mathrm{clip}\Big(\frac{\pi(a \mid s)}{\pi_k(a \mid s)}, 1 \pm \epsilon\Big) \Big) A^{\pi_k}(s, a) \Big) \Big]. \tag{6}
\]
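For illustration (again our own sketch, not code from the paper), the PPO drift of Equation (6) admits the following sample estimate over actions drawn from $\pi_k$.

```python
import jax.numpy as jnp

def ppo_drift(log_probs: jnp.ndarray,
              old_log_probs: jnp.ndarray,
              advantages: jnp.ndarray,
              epsilon: float = 0.2) -> jnp.ndarray:
    """Sample estimate of the PPO drift function in Equation (6)."""
    ratio = jnp.exp(log_probs - old_log_probs)            # pi(a|s) / pi_k(a|s)
    clipped = jnp.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # ReLU of the part of the surrogate that clipping removes.
    return jnp.mean(jnp.maximum((ratio - clipped) * advantages, 0.0))
```

One can check that $\min(rA, \mathrm{clip}(r, 1 \pm \epsilon)A) = rA - \mathrm{ReLU}((r - \mathrm{clip}(r, 1 \pm \epsilon))A)$, so maximising the PPO-clip objective in Equation (4) amounts to maximising the expected advantage minus this drift (up to the choice of sampling and drift distributions), which is how PPO fits the Mirror Learning template.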
While it is possible to explicitly constrain the neighbourhood of the policy update [31], some algorithms do so implicitly. For example, as the maximisation oracle of PPO (see Equation (4)) takes the form of $N$ steps of gradient ascent with learning rate $\alpha$ and gradient clipping threshold $c$, it implicitly employs a neighbourhood that is a Euclidean ball of radius $N\alpha c$ around $\theta_k$.
Different Mirror Learning algorithms can differ in multiple aspects, such as sample complexity and wall-clock time efficiency [18]. Depending on the setting, different properties may be desirable. In this paper, we optimise for the return of the $K$-th iterate, $\eta(\pi_K)$.
3.3 Evolution Strategies
Evolution Strategies [28, 29, ES] is a backpropagation-free approach to function optimisation. At their core lies the following identity, which holds for any continuously differentiable function $F$ of $\phi$, and any positive scalar $\sigma$:
\[
\nabla_\phi\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[ F(\phi + \sigma \epsilon) \big] = \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[ F(\phi + \sigma \epsilon)\, \epsilon \big], \tag{7}
\]
where $\mathcal{N}(0, I)$ denotes the standard multivariate normal distribution. By taking the limit $\sigma \to 0$, the gradient on the left-hand side recovers the gradient $\nabla_\phi F(\phi)$. These facts inspire an approach to optimising $F$ with respect to $\phi$ without estimating gradients with backpropagation: for a random sample $\epsilon_1, \dots, \epsilon_n \sim \mathcal{N}(0, I)$, the vector $\frac{1}{n\sigma} \sum_{i=1}^{n} F(\phi + \sigma \epsilon_i)\, \epsilon_i$ is an unbiased gradient estimate.
To reduce the variance of this estimator, antithetic sampling is commonly used [26]. In the context of meta-RL, where $\phi$ is the meta-parameter of an RL algorithm $\mathrm{alg}_\phi$, the role of $F(\phi)$ is played by the average return after training, $F(\phi) = \mathbb{E}[\eta(\pi_K) \mid \phi]$. As opposed to the meta-gradient approaches described in Section 2, ES does not require backpropagation of the gradient through the whole training episode, a cumbersome procedure which, often approximated by truncated backpropagation, introduces bias [38, 39, 25, 6, 23].
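To make this concrete, here is a minimal sketch (our own; `train_and_evaluate` is a hypothetical stand-in for running $\mathrm{alg}_\phi$ for $K$ iterations and returning $\eta(\pi_K)$) of an antithetic ES gradient estimate for the meta-objective $F(\phi) = \mathbb{E}[\eta(\pi_K) \mid \phi]$.

```python
import jax
import jax.numpy as jnp

def es_meta_gradient(train_and_evaluate, phi: jnp.ndarray, key: jax.Array,
                     sigma: float = 0.03, n_pairs: int = 32) -> jnp.ndarray:
    """Antithetic ES estimate of the gradient in Equation (7).

    `train_and_evaluate(phi, key)` is a hypothetical meta-objective evaluation:
    it runs alg_phi for K training iterations and returns the final return eta(pi_K).
    """
    eps_key, eval_key = jax.random.split(key)
    eps = jax.random.normal(eps_key, (n_pairs,) + phi.shape)   # eps_i ~ N(0, I)
    eval_keys = jax.random.split(eval_key, n_pairs)

    # Evaluate the meta-objective at antithetic perturbations phi +/- sigma * eps_i.
    f_plus = jax.vmap(train_and_evaluate)(phi + sigma * eps, eval_keys)
    f_minus = jax.vmap(train_and_evaluate)(phi - sigma * eps, eval_keys)

    # (1 / (2 n sigma)) * sum_i (F(phi + sigma eps_i) - F(phi - sigma eps_i)) eps_i
    weights = (f_plus - f_minus) / (2.0 * n_pairs * sigma)
    return jnp.tensordot(weights, eps, axes=1)
```

The meta-parameter is then updated by a step of gradient ascent on this estimate; in this sketch, $\phi$ is assumed to be a flat parameter vector.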