
effects between factors, not just the isolated effects of the modifications they introduce. This reinforces the case for
agents leveraging these systematic curricula, originally designed for human players, through transfer learning.
Contributions: (1) We show that discrete changes to Atari game environments, challenging for human players, also modulate the performance of a popular model-free deep reinforcement learning algorithm starting from scratch. (2) Zero-shot transfer of policies trained on one game variation and tested on others can be significant, but performance is far from uniform across the respective environments. (3) Interestingly, zero-shot transfer variance from default game experts is also well explained by game design factors, especially their interactions. (4) We empirically evaluate the performance of value-function finetuning, a general transfer learning technique compatible with model-free deep RL, and confirm that it can lead to positive transfer from basic game experts. (5) We point out that more complex challenges of transfer with deep RL are captured by Atari game variations, e.g. appropriate source task selection, fast policy transfer and evaluation using experts, as well as data-efficient behaviour adaptation more generally.
2 BACKGROUND
Reinforcement Learning. A Markov Decision Process (MDP) (Puterman, 1994) is the classic abstraction used to characterise the sequential interaction loop between an action-taking agent and its environment, which responds with observations and rewards (Sutton & Barto, 2018). Formally, a finite MDP is a tuple $\mathcal{M} = \langle \mathcal{X}, \mathcal{A}, T, R, \gamma \rangle$, where $\mathcal{X}$ is the state space and $\mathcal{A}$ is the action space, both finite sets; $T : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \mapsto [0, 1]$ is the stochastic transition function, which maps each state and action to a probability distribution over possible future states, $T(x, a, x') = P(x' \mid x, a)$; $R : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \mapsto \mathbb{R}$ is the reward distribution function, with $r : \mathcal{X} \times \mathcal{A} \mapsto \mathbb{R}$ the expected immediate reward $r(x, a) = \mathbb{E}_{T}[R(x, a, x')]$; and $\gamma \in [0, 1]$ is the discount factor (Bellman, 1957). A stochastic policy is the action selection strategy $\pi : \mathcal{X} \times \mathcal{A} \mapsto [0, 1]$, which maps states to probability distributions over actions, $\pi(x) = P(a \mid x)$. The discounted sum of future rewards is the random variable $Z^{\pi}_{\mathcal{M}}(x, a) = \sum_{t=0}^{\infty} \gamma^{t} r(x_t, a_t)$, where $x_0 = x$, $a_0 = a$, $x_t \sim T(\cdot \mid x_{t-1}, a_{t-1})$ and $a_t \sim \pi(\cdot \mid x_t)$. Given an MDP $\mathcal{M}$ and a policy $\pi$, the value function is the expectation over the discounted sum of future rewards, also called the expected return: $V^{\pi}_{\mathcal{M}}(x) = \mathbb{E}[Z^{\pi}_{\mathcal{M}}(x, \pi(x))]$. The goal of Reinforcement Learning (RL) is to find a policy $\pi^{*}_{\mathcal{M}} : \mathcal{X} \mapsto \mathcal{A}$ which is optimal, in the sense that it maximises expected return in $\mathcal{M}$. The state-action Q-function is defined as $Q^{\pi}_{\mathcal{M}}(x, a) = \mathbb{E}[Z^{\pi}_{\mathcal{M}}(x, a)]$ and it satisfies the Bellman equation $Q^{\pi}_{\mathcal{M}}(x, a) = \mathbb{E}_{T}[R(x, a, x')] + \gamma \mathbb{E}_{T, \pi}[Q^{\pi}_{\mathcal{M}}(x', a')]$ for all states $x \in \mathcal{X}$ and actions $a \in \mathcal{A}$. Mnih et al. (2013) adapted a reinforcement learning algorithm called Q-Learning (Watkins, 1989) to train deep neural networks end-to-end, mastering several Atari 2600 games, with inputs consisting of high-dimensional observations, in the form of console screen pixels, and differences in game scores. For more details, please consult Appendix A.
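To ground these definitions, the following is a minimal tabular Q-learning sketch in the spirit of Watkins (1989): a sample-based update towards the Bellman optimality target $r + \gamma \max_{a'} Q(x', a')$, rather than the deep variant of Mnih et al. (2013). The two-state MDP, step size and exploration schedule below are illustrative assumptions, not details from this paper.

```python
import numpy as np

# Toy MDP: 2 states, 2 actions (illustrative assumption, not from the paper).
# T[x, a, x'] gives P(x' | x, a); R[x, a] is the expected immediate reward r(x, a).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
gamma = 0.9

rng = np.random.default_rng(0)
Q = np.zeros((2, 2))          # tabular Q(x, a)
alpha, eps = 0.1, 0.1         # step size and epsilon-greedy exploration rate

x = 0
for step in range(50_000):
    # epsilon-greedy action selection
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[x].argmax())
    x_next = int(rng.choice(2, p=T[x, a]))
    r = R[x, a]
    # Q-learning update: sampled counterpart of the Bellman (optimality) equation
    Q[x, a] += alpha * (r + gamma * Q[x_next].max() - Q[x, a])
    x = x_next

print(np.round(Q, 2))
```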
Note that value functions depend critically on all aspects of the MDP. For any policy $\pi$, in general $Q^{\pi}_{\mathcal{M}} \neq Q^{\pi}_{\mathcal{M}'}$ for MDPs $\mathcal{M}$ and $\mathcal{M}'$ defined over the same state and action sets, with $T \neq T'$ or $R \neq R'$. Even if differences in dynamics or rewards are isolated to a subset of $\mathcal{X} \times \mathcal{A}$, changes may be induced across the support of $Q^{\pi}_{\mathcal{M}'}$, since value functions are expectations over sums of discounted future rewards, issued according to $R'$, along sequences of states decided entirely by the new environment dynamics $T'$ when following a fixed behaviour policy $\pi$. Nevertheless, many particular cases of interest exist, which we discuss below.
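As a concrete toy illustration of this point, the sketch below evaluates the same fixed policy in two small MDPs that differ only in the dynamics of a single state-action pair, by solving the Bellman equation above as a linear system; the resulting $Q^{\pi}$ tables differ in every entry, not just the modified one. All names and numbers here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def q_pi(T, R, pi, gamma=0.9):
    """Solve Q^pi exactly from the Bellman equation, written as a linear system
    over state-action pairs: Q = r + gamma * P_pi @ Q."""
    nx, na = R.shape
    r = R.reshape(nx * na)
    # P_pi[(x, a), (x', a')] = T(x, a, x') * pi(a' | x')
    P_pi = np.einsum('xay,yb->xayb', T, pi).reshape(nx * na, nx * na)
    return np.linalg.solve(np.eye(nx * na) - gamma * P_pi, r).reshape(nx, na)

# Two toy MDPs that differ only in the dynamics of one state-action pair.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
T_prime = T.copy()
T_prime[0, 1] = [0.8, 0.2]         # only T(x=0, a=1, .) is modified

pi = np.full((2, 2), 0.5)          # fixed uniform behaviour policy

print(q_pi(T, R, pi) - q_pi(T_prime, R, pi))  # every entry of Q^pi shifts
```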
Transfer learning. Described and motivated in its general form by Caruana (1997); Thrun & Pratt (1998); Bengio (2012), the goal of transfer learning is to use knowledge acquired from one or more source tasks to improve the learning process, or its outcomes, for one or more target tasks, e.g. by using fewer resources compared to learning from scratch. When this is achieved, we call it positive transfer. We often further qualify transfer learning by the metric used to measure specific effects. Several transfer learning metrics have been defined for the RL setting (Taylor & Stone, 2009), but none capture all aspects of interest on their own. A first metric we use is “jumpstart” or “zero-shot” transfer, which is performance on the target task before gaining access to its data. Another highly relevant metric is “performance with fixed training epochs” (Zhu et al., 2020), defined as returns achieved in the target task after using fixed computation and data budgets under transfer learning conditions.
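As a rough sketch of how these two metrics could be computed from logged evaluation returns (the function and variable names below are our own illustrative assumptions, not an interface from the paper or any library):

```python
import numpy as np

def jumpstart(returns_transfer):
    """Zero-shot / jumpstart transfer: mean return on the target task
    before any target-task training (evaluation round at step 0)."""
    return float(np.mean(returns_transfer[0]))

def fixed_budget_performance(returns_transfer, returns_scratch, budget_steps, eval_steps):
    """Performance with fixed training epochs: target-task returns after a fixed
    data budget, for transfer learning vs. learning from scratch."""
    idx = int(np.searchsorted(eval_steps, budget_steps))
    return (float(np.mean(returns_transfer[idx])),
            float(np.mean(returns_scratch[idx])))

# returns_* have shape (num_evaluation_rounds, num_episodes); eval_steps gives
# the environment-step count at which each evaluation round was run.
eval_steps = np.array([0, 100_000, 200_000, 400_000])
returns_transfer = np.array([[12., 10.], [30., 28.], [55., 60.], [80., 75.]])
returns_scratch  = np.array([[ 1.,  0.], [ 5.,  6.], [20., 18.], [45., 50.]])

print(jumpstart(returns_transfer))                          # 11.0
print(fixed_budget_performance(returns_transfer, returns_scratch,
                               budget_steps=200_000, eval_steps=eval_steps))
```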
One way to classify such approaches in RL is by the format of knowledge being transferred (Zhu et al., 2020), commonly: datasets, predictions, and/or parameters of neural networks encoding representations of observations, policies, value-functions or approximate state-transition “world models” acquired using source tasks. Another way to classify transfer learning approaches in RL is by their respective sets of assumptions about source and target tasks, also well illustrated by their associated benchmark domains: (1) Differences are limited to observations, but the underlying MDP is the same, e.g. domain adaptation and randomisation (Tobin et al., 2017), mastered through the generalisation abilities of large models (Cobbe et al., 2019; 2020). (2) MDP states and dynamics are the same, but reward functions are different: successor features and representations (Barreto et al., 2017). (3) Overlapping state spaces with similar dynamics, e.g. multitask training and policy transfer (Rusu et al., 2016a; Schmitt et al., 2018), contextual parametric approaches, e.g. UVFAs (Schaul et al., 2015) and collections of skills/goals/options (Barreto et al., 2019). (4) Sus-