Recently, Mirror Learning [18], a new theoretical framework, introduced an infinite space of provably correct algorithms, all of which share the same template. In a nutshell, a Mirror Learning algorithm is defined by four attributes, but in this work we focus on the drift function. A drift function guides the agent's update, usually by penalising large changes. Any Mirror Learning algorithm provably achieves monotonic improvement of the return, and converges to an optimal policy [18]. Popular RL methods such as TRPO [31] and PPO [32] are instances of this framework.
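Schematically, and with notation simplified relative to the full definition in [18] (which also involves neighbourhood operators and sampling distributions), a drift-regularised update can be written as below; the symbols here are illustrative rather than the exact formulation of [18].
\[
  \pi_{\text{new}}
  = \operatorname*{arg\,max}_{\pi}\;
    \mathbb{E}_{s}\Big[
      \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ A_{\pi_{\text{old}}}(s, a) \big]
      - \mathcal{D}_{\pi_{\text{old}}}(\pi \mid s)
    \Big],
  \qquad
  \mathcal{D}_{\pi_{\text{old}}}(\pi \mid s) \ge 0,
  \quad
  \mathcal{D}_{\pi_{\text{old}}}(\pi_{\text{old}} \mid s) = 0.
\]
Because the drift penalty is non-negative and vanishes at the current policy, it only discourages large policy changes without altering which policies are optimal.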
In this paper, we use meta-learning to discover a new state-of-the-art (SOTA) RL algorithm within the Mirror Learning space. Our algorithm thus inherits theoretical convergence guarantees by construction. Specifically, we parameterise a drift function with a neural network, which we then meta-train using evolution strategies [29, ES]. The outcome of this meta-training is a specific Mirror Learning algorithm, which we name Learnt Policy Optimisation (LPO).
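As a concrete illustration (not the paper's exact setup), the meta-training loop can be sketched as follows: a small drift network is parameterised by a flat weight vector, and an antithetic ES update perturbs those weights and follows the fitness gradient. The parameter layout, hyperparameters, and the placeholder inner_loop_return, which stands in for a full RL run that would actually use drift_net, are all hypothetical.

# Minimal sketch of ES meta-training for a neural-network drift function.
# Illustrative only: inner_loop_return, the parameter layout, and all
# hyperparameters are hypothetical placeholders, not the paper's actual setup.
import jax
import jax.numpy as jnp


def drift_net(flat_params, ratio, advantage):
    """Tiny MLP mapping (probability ratio, advantage) to a scalar drift penalty."""
    w1 = flat_params[:32].reshape(16, 2)   # input weights
    b1 = flat_params[32:48]                # hidden biases
    w2 = flat_params[48:64]                # output weights
    h = jnp.tanh(w1 @ jnp.stack([ratio - 1.0, advantage]) + b1)
    return w2 @ h


def inner_loop_return(flat_params):
    """Placeholder fitness: in practice, train a policy with the drift defined
    by flat_params (e.g. in Brax) and return its final evaluation score."""
    return -jnp.sum(flat_params ** 2)  # dummy objective, for illustration only


def es_step(flat_params, key, pop_size=64, sigma=0.02, lr=0.01):
    """One OpenAI-style antithetic ES update of the meta-parameters."""
    eps = jax.random.normal(key, (pop_size // 2, flat_params.shape[0]))
    eps = jnp.concatenate([eps, -eps], axis=0)        # antithetic pairs
    fitness = jax.vmap(inner_loop_return)(flat_params + sigma * eps)
    ranks = jnp.argsort(jnp.argsort(fitness))         # rank-normalise fitness
    weights = ranks / (pop_size - 1) - 0.5
    grad = weights @ eps / (pop_size * sigma)         # ES gradient estimate
    return flat_params + lr * grad


key = jax.random.PRNGKey(0)
params = jnp.zeros(64)  # flattened drift-network parameters
for _ in range(10):
    key, subkey = jax.random.split(key)
    params = es_step(params, subkey)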
While having a neural network representation of a novel, high-performing drift function is a great first step, our next goal is to understand the relevant algorithmic features of this drift function. Our analysis reveals that LPO's drift function encodes, for example, optimism about actions that scored low rewards in the past, a feature we refer to as rollback. Building upon these insights, we propose a new, closed-form algorithm which we name Discovered Policy Optimisation (DPO). We evaluate LPO and DPO in the Brax [8] continuous control environments, where they obtain superior performance compared to PPO. Importantly, both LPO and DPO generalise to environments that were not used for training LPO. To our knowledge, DPO is the first theoretically sound, scalable deep RL algorithm to be discovered via meta-learning.
2 Related Work
Over the last few years, researchers have put significant effort into designing and developing algorithmic improvements in reinforcement learning. Fujimoto et al. [9] combine DDPG policy training with estimates of pessimistic Bellman targets from a separate critic. Hsu et al. [15] stabilise the previously unsuccessful [32] KL-penalised version of PPO and improve its robustness through novel policy design choices. Haarnoja et al. [13] introduce a mechanism that automatically adjusts the temperature parameter of the entropy bonus in SAC. However, none of these hand-crafted efforts succeeds in fully mitigating common RL pathologies, such as sensitivity to hyperparameter choices and lack of domain generalisation [4]. This motivates radically expanding the RL algorithm search
space through automated means [27].
Popular approaches in meta-RL have shown that agents can learn to quickly adapt over a pre-specified distribution of tasks. RL² equips a learning agent with a recurrent neural network that retains state across episode boundaries to adapt the agent's behaviour to the current environment [4]. Similarly, a MAML agent meta-learns policy parameters which can adapt to a range of tasks with a few steps of gradient descent [7]. However, both RL² and MAML usually only meta-learn across narrow domains and are not expected to generalise well to truly unseen environments.
Xu et al. [40] introduce an actor-critic method that adjusts its hyperparameters online using meta-gradients that are updated every few inner iterations. Similarly, STAC [42] uses implementation techniques from IMPALA [5] and auxiliary loss-guided meta-parameter tuning to further improve on this approach.
Such advances have inspired extending meta-gradient RL techniques to more ambitious objectives,
including the discovery of algorithms ab initio. Notably, Oh et al. [25] succeeded in meta-learning an RL algorithm, LPG, that can solve simple tasks efficiently without explicitly relying on concepts such as value functions and policy gradients. Similarly, Evolved Policy Gradients [14, EPG] meta-trains a neural-network policy loss function with Evolution Strategies [29, ES]. Although EPG surpasses PPO in average performance, it suffers from much larger variance [14] and is not expected to perform well on environments with dynamics that differ greatly from the training distribution. MetaGenRL [17], instead, meta-learns the loss function for deterministic policies, which are inherently less affected by estimators' variance [33]. MetaGenRL, however, fails to improve upon DDPG [21] in terms of performance, despite building upon it. Neither EPG nor MetaGenRL has resulted in the discovery
of novel analytical RL algorithms, perhaps due to the limited interpretability of the loss functions
learnt. Lastly, Co-Reyes et al. [3], Garau et al. [10] and Alet et al. [1] discover and improve standard RL conventions by symbolically evolving algorithms represented as graphs, which leads to improved performance in simple tasks. However, none of those trained-from-scratch methods inherit correctness