Probing Transfer in Deep Reinforcement Learning without Task Engineering

Published at 1st Conference on Lifelong Learning Agents, 2022
PROBING TRANSFER IN DEEP REINFORCEMENT LEARNING
WITHOUT TASK ENGINEERING
Andrei A. Rusu, Sebastian Flennerhag, Dushyant Rao, Razvan Pascanu, Raia Hadsell
DeepMind, UK
{andrei, flennerhag, dushyantr, razp, raia}@deepmind.com
ABSTRACT
We evaluate the use of original game curricula supported by the Atari 2600 console as a heterogeneous transfer benchmark for deep reinforcement learning agents. Game designers created curricula using combinations of several discrete modifications to the basic versions of games such as Space Invaders, Breakout and Freeway, making them progressively more challenging for human players. By formally organising these modifications into several factors of variation, we are able to show that Analyses of Variance (ANOVA) are a potent tool for studying the effects of human-relevant domain changes on the learning and transfer performance of a deep reinforcement learning agent. Since no manual task engineering is needed on our part, leveraging the original multi-factorial design avoids the pitfalls of unintentionally biasing the experimental setup. We find that game design factors have a large and statistically significant impact on an agent's ability to learn, and so do their combinatorial interactions. Furthermore, we show that zero-shot transfer from the basic games to their respective variations is possible, but the variance in performance is also largely explained by interactions between factors. As such, we argue that Atari game curricula offer a challenging benchmark for transfer learning in RL, one that can help the community better understand the generalisation capabilities of RL agents along dimensions which meaningfully impact human generalisation performance. As a start, we report that value-function finetuning of regularly trained agents achieves positive transfer in a majority of cases, but significant headroom for algorithmic innovation remains. We conclude with the observation that selective transfer from multiple variants could further improve performance.
1 INTRODUCTION
A key open challenge in artificial intelligence is training reinforcement learning (RL) agents which generally achieve high returns when faced with critical changes to their environments (Schaul et al., 2018), motivated by the impressive flexibility of animal and human learning. One way to approach the problem is through the prism of generalisation across related but distinct environments, also called transfer learning (Pan & Yang, 2009; Taylor & Stone, 2009) and comprehensively reviewed in the RL setting by Zhu et al. (2020). Many purpose-built benchmarks serve investigations into more specific research questions, e.g. transfer learning in particular cases where additional assumptions hold. While useful for progress, the challenge of transfer learning in the more general case remains.
Motivated by visual observation similarity, Machado et al. (2018) suggest using the newest iteration of the Arcade Learning Environment (ALE) (Bellemare et al., 2013) to study transfer learning between single-player game variants, or "flavours", found in the curricula of many Atari game titles. We will use the terms default or basic game interchangeably to refer to the environment recommended by respective manuals as the entry point. We call all other distinct games variations or variants of their respective default game. Hence, each Atari game title we consider provides a curriculum consisting of a default and its variants, all designed to teach and challenge human players in novel ways. Studying transfer within curricula ensures that environments are related, and that meaningful knowledge reuse should be possible and beneficial. Interestingly, differences in game variant dynamics, subtle changes in observations which are crucial for optimal behaviour, novel environment states, as well as new player abilities challenge unified approaches to transfer learning across variations. Farebrother et al. (2018) argue that Deep Q-Network (DQN) agents (Mnih et al., 2015), trained with appropriate regularisation, can be effective in zero-shot and finetuning transfer scenarios. In this work we study the learning and transfer performance of an updated version of the DQN agent, called Rainbow-IQN (Toromanoff et al., 2019). We use ANOVA to quantitatively confirm the suspected link between game design factors and agent performance. While current approaches occasionally achieve meaningful transfer from default games, they have limited success for variations with several modifications. Our analyses reveal this is due to strong interaction
effects between factors, not just the isolated effects of the modifications they introduce. This reinforces the case for
agents leveraging these systematic curricula, originally designed for human players, through transfer learning.
Contributions: (1) We show that discrete changes to Atari game environments, challenging for human players, also modulate the performance of a popular model-free deep reinforcement learning algorithm starting from scratch. (2) Zero-shot transfer of policies trained on one game variation and tested on others can be significant, but performance is far from uniform across respective environments. (3) Interestingly, zero-shot transfer variance from default game experts is also well explained by game design factors, especially their interactions. (4) We empirically evaluate the performance of value-function finetuning, a general transfer learning technique compatible with model-free deep RL, and confirm that it can lead to positive transfer from basic game experts. (5) We point out that more complex challenges of transfer with deep RL are captured by Atari game variations, e.g. appropriate source task selection, fast policy transfer and evaluation using experts, as well as data efficient behaviour adaptation more generally.
2 BACKGROUND
Reinforcement Learning. A Markov Decision Process (MDP) (Puterman, 1994) is the classic abstraction used to characterise the sequential interaction loop between an action-taking agent and its environment, which responds with observations and rewards (Sutton & Barto, 2018). Formally, a finite MDP is a tuple $\mathcal{M} = \langle \mathcal{X}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$, where $\mathcal{X}$ is the state space and $\mathcal{A}$ is the action space, both finite sets; $\mathcal{T} : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \mapsto [0,1]$ is the stochastic transition function, which maps each state and action to a probability distribution over possible future states, $\mathcal{T}(x, a, x') = P(x' \mid x, a)$; $\mathcal{R} : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \mapsto \mathbb{R}$ is the reward distribution function, with $r : \mathcal{X} \times \mathcal{A} \mapsto \mathbb{R}$ the expected immediate reward $r(x, a) = \mathbb{E}_{\mathcal{T}}[\mathcal{R}(x, a, x')]$; and $\gamma \in [0, 1]$ is the discount factor (Bellman, 1957). A stochastic policy function is the action selection strategy $\pi : \mathcal{X} \times \mathcal{A} \mapsto [0, 1]$, which maps states to probability distributions over actions, $\pi(x) = P(a \mid x)$. The discounted sum of future rewards is the random variable $Z^{\pi}_{\mathcal{M}}(x, a) = \sum_{t=0}^{\infty} \gamma^{t} r(x_t, a_t)$, where $x_0 = x$, $a_0 = a$, $x_t \sim \mathcal{T}(\cdot \mid x_{t-1}, a_{t-1})$ and $a_t \sim \pi(\cdot \mid x_t)$. Given an MDP $\mathcal{M}$ and a policy $\pi$, the value function is the expectation over the discounted sum of future rewards, also called the expected return: $V^{\pi}_{\mathcal{M}}(x) = \mathbb{E}[Z^{\pi}_{\mathcal{M}}(x, \pi(x))]$. The goal of Reinforcement Learning (RL) is to find a policy $\pi^{*}_{\mathcal{M}} : \mathcal{X} \mapsto \mathcal{A}$ which is optimal, in the sense that it maximises expected return in $\mathcal{M}$. The state-action Q-function is defined as $Q^{\pi}_{\mathcal{M}}(x, a) = \mathbb{E}[Z^{\pi}_{\mathcal{M}}(x, a)]$ and it satisfies the Bellman equation $Q^{\pi}_{\mathcal{M}}(x, a) = \mathbb{E}_{\mathcal{T}}[\mathcal{R}(x, a, x')] + \gamma\, \mathbb{E}_{\mathcal{T}, \pi}[Q^{\pi}_{\mathcal{M}}(x', a')]$ for all states $x \in \mathcal{X}$ and actions $a \in \mathcal{A}$. Mnih et al. (2013) adapted a reinforcement learning algorithm called Q-Learning (Watkins, 1989) to train deep neural networks end-to-end, mastering several Atari 2600 games, with inputs consisting of high-dimensional observations, in the form of console screen pixels, and differences in game scores. For more details please consult Appendix A.
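To make the Q-Learning update that DQN builds on concrete, here is a minimal tabular sketch; the Gym-style `env` interface, the epsilon-greedy exploration scheme and all constants are illustrative assumptions, not details of the agent used in this paper.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=500,
                       alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    """Minimal tabular Q-Learning (Watkins, 1989): after each transition
    (x, a, r, x'), move Q(x, a) towards r + gamma * max_a' Q(x', a')."""
    rng = rng or np.random.default_rng()
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            # Epsilon-greedy behaviour policy over the current Q estimates.
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(q[x]))
            x_next, r, done = env.step(a)  # assumed toy interface: (state, reward, done)
            target = r + (0.0 if done else gamma * np.max(q[x_next]))
            q[x, a] += alpha * (target - q[x, a])
            x = x_next
    return q
```

DQN replaces the table with a convolutional network trained on minibatches from a replay buffer, but the target it regresses towards has the same form.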
Note that value functions depend critically on all aspects of the MDP. For any policy $\pi$, in general $Q^{\pi}_{\mathcal{M}} \neq Q^{\pi}_{\mathcal{M}'}$ for MDPs $\mathcal{M}$ and $\mathcal{M}'$ defined over the same state and action sets, with $\mathcal{T} \neq \mathcal{T}'$ or $\mathcal{R} \neq \mathcal{R}'$. Even if differences in dynamics or rewards are isolated to a subset of $\mathcal{X} \times \mathcal{A}$, changes may be induced across the support of $Q^{\pi}_{\mathcal{M}'}$, since value functions are expectations over sums of discounted future rewards, issued according to $\mathcal{R}'$, along sequences of states decided entirely by the new environment dynamics $\mathcal{T}'$ when following a fixed behaviour policy $\pi$. Nevertheless, many particular cases of interest exist, which we discuss below.
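To see why a local change can shift the value function everywhere, consider a small numerical sketch on a toy deterministic chain MDP (invented purely for illustration): changing the reward of only the final transition rescales $Q^{\pi}$ at every earlier state through discounting.

```python
import numpy as np

def q_pi_chain(n_states=5, gamma=0.9, final_reward=1.0):
    # Deterministic chain: the single action moves one step right; only the
    # last transition pays `final_reward`, every other transition pays 0.
    q = np.zeros(n_states)
    # Backward induction of the Bellman equation for this single policy.
    for s in reversed(range(n_states)):
        r = final_reward if s == n_states - 1 else 0.0
        next_q = q[s + 1] if s + 1 < n_states else 0.0
        q[s] = r + gamma * next_q
    return q

print(q_pi_chain(final_reward=1.0))  # [0.6561 0.729  0.81   0.9    1.    ]
print(q_pi_chain(final_reward=2.0))  # every entry doubles: the local change propagates
```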
Transfer learning. Described and motivated in its general form by Caruana (1997); Thrun & Pratt (1998); Bengio
(2012), the goal of transfer learning is to use knowledge acquired from one or more source tasks to improve the
learning process, or its outcomes, for one or more target tasks, e.g. by using fewer resources compared to learning
from scratch. When this is achieved, we call it positive transfer. We often further qualify transfer learning by the
metric used to measure specific effects. Several transfer learning metrics have been defined for the RL setting (Taylor
& Stone,2009), but none capture all aspects of interest on their own. A first metric we use is “jumpstart” or “zero-shot”
transfer, which is performance on the target task before gaining access to its data. Another highly relevant metric is
“performance with fixed training epochs” (Zhu et al.,2020), defined as returns achieved in the target task after using
fixed computation and data budgets under transfer learning conditions.
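Both metrics can be read directly off evaluation curves. Below is a minimal sketch of how we interpret them, assuming hypothetical arrays of evaluation returns recorded every `eval_every` environment steps; the names and numbers are illustrative, not results from the paper.

```python
import numpy as np

def jumpstart(returns_with_transfer):
    """Zero-shot / jumpstart transfer: performance on the target task before
    any target-task data has been used for learning."""
    return returns_with_transfer[0]

def fixed_budget_gain(returns_with_transfer, returns_from_scratch,
                      budget_steps, eval_every):
    """'Performance with fixed training epochs': difference in target-task
    returns after the same computation and data budget, with vs. without transfer."""
    idx = budget_steps // eval_every
    return returns_with_transfer[idx] - returns_from_scratch[idx]

# Hypothetical evaluation curves, one entry per 1M environment steps.
transfer = np.array([120.0, 340.0, 520.0, 610.0])
scratch  = np.array([  5.0,  90.0, 250.0, 400.0])
print(jumpstart(transfer))                                          # 120.0
print(fixed_budget_gain(transfer, scratch, 3_000_000, 1_000_000))   # 210.0
```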
One way to classify such approaches in RL is by the format of knowledge being transferred (Zhu et al., 2020), commonly: datasets, predictions, and/or parameters of neural networks encoding representations of observations, policies, value-functions or approximate state-transition "world models" acquired using source tasks. Another way to classify transfer learning approaches in RL is by their respective sets of assumptions about source and target tasks, also well illustrated by their associated benchmark domains: (1) Differences are limited to observations, but the underlying MDP is the same, e.g. domain adaptation and randomisation (Tobin et al., 2017), mastered through the generalisation abilities of large models (Cobbe et al., 2019; 2020). (2) MDP states and dynamics are the same, but reward functions are different: successor features and representations (Barreto et al., 2017). (3) Overlapping state spaces with similar dynamics, e.g. multitask training and policy transfer (Rusu et al., 2016a; Schmitt et al., 2018), contextual parametric approaches, e.g. UVFAs (Schaul et al., 2015), and collections of skills/goals/options (Barreto et al., 2019). (4) Suspected
but unqualified overlaps between tasks (Parisotto et al., 2016; Rusu et al., 2016b; 2017). (5) Large, curated collections of similar environments designed to facilitate complex transfer and fast adaptation through meta-learning (Yu et al., 2020; Hospedales et al., 2020). All these works capture important sub-problems of interest, and clever design of specialised benchmarks has greatly aided progress. Atari game variations (Machado et al., 2018) offer the exciting prospect of direct comparisons between generalisations of transfer learning methods originally developed under different sets of assumptions, by measuring their performance along dimensions of variation which are meaningful and challenging for human players, one of many interesting criteria cutting across specialised paradigms.
Figure 1: Variant naming convention and factorial design matrices for Space Invaders (top), Breakout (bottom left) and
Freeway (bottom right). All factors of variation are categorical and highlighted only when taking different values from
default games. Binary factors are abbreviated by their initial, e.g. “Moving Shields” is plotted as ‘M’. Non-binary
factors are: Breakout “Rules” with additional values: ‘T’ for “Timed Breakout”, or ‘U’ for “Breakthru”; Breakout
“Extras” with additional values: ‘S’ for “Steerable”, ‘C’ for “Catch”, or ‘I’ for “Invisible”; Freeway “Traffic” with
levels: ‘K’ for “Thick”, ‘R’ for “Thicker”, or ‘T’ for “Thickest”. Colours indicate the “game mode”, and hatching
denotes the “difficulty” switch being activated. Colours and horizontal labels correspond to those used in subsequent
figures. Labels include factor abbreviations to supplement the ALE naming conventions.
2.1 ALE ATARI AS A TRANSFER LEARNING BENCHMARK
With the first version of the Arcade Learning Environment (ALE), Bellemare et al. (2013) gave access to over 50 different Atari 2600 game titles through a unified observation and action interface, modelled after the standard reinforcement learning loop. The ALE proved an excellent development test-bed for building general deep RL agents, in no small part due to its diversity and being devoid of experimenter bias, since games were not modified by researchers.
The second and latest version of the ALE (Machado et al., 2018) opens up a wealth of game variations for transfer learning research. This is achieved by emulating the functions of the "difficulty" and "game select" switches for single-player games. The original cartridges of many game titles came with variations which could be selected and played using these switches. We assign unique identifiers to game variants using the X YZ notation, with X ∈ {0, 1} denoting the position of the "difficulty" switch, and YZ ∈ {00, 01, 02, ...} indicating the selected "game mode". Value ranges are game specific and identical to those used by the ALE code-base (Machado et al., 2018). For example, the entry game version of each title is denoted as 0 00, which we call the default or basic game. All game variations are different environments that still feature the main concepts of the default. Furthermore, the variants were designed to serve as curricula for human players, hence positive transfer should be possible. Most importantly for our purposes, the ALE remains largely free of experimenter bias. Farebrother et al. (2018) selected a few representative variants from four game titles for experiments. We aim to analyse entire curricula, thus we consider the top three most popular titles from their list, according to sales, and use all their variations, for a total of 72 distinct environments.
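As a rough sketch of how the emulated switches are exposed, the ale-py bindings allow querying and setting modes and difficulties per ROM; the exact ROM path, the chosen mode value and its mapping to the invisible-invaders variant are assumptions for illustration, and API details may differ across ale-py releases.

```python
from ale_py import ALEInterface

ale = ALEInterface()
ale.setFloat("repeat_action_probability", 0.25)  # "sticky actions" (Machado et al., 2018)
ale.loadROM("space_invaders.bin")                # local ROM path is an assumption

print(ale.getAvailableModes())         # "game select" values supported by this ROM
print(ale.getAvailableDifficulties())  # "difficulty" switch values, e.g. [0, 1]

ale.setMode(8)        # hypothetically selects variant YZ = 08 of Space Invaders
ale.setDifficulty(0)  # the X digit of the X YZ identifier
ale.reset_game()
```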
The original designers created game variations using combinations of discrete modifications to a default "entry" game, qualitatively described in accompanying game manuals. We organise these discrete modifications into formal factors of variation, following closely the original game design matrices sometimes explicitly plotted in manuals. In Figure 1 we formalise variant naming conventions relative to default games. We briefly explain the meanings of all design factors below, to help build an intuitive understanding of this heterogeneous collection of transfer learning scenarios.
Space Invaders. The player controls an Earth-based laser cannon which can move horizontally along the bottom of
the screen. There is a grid of approaching aliens above the player, shooting laser bombs, and the objective of the game
is to eliminate all the aliens before they reach the Earth, especially their Command Ships. Three destructible shields
above the cannon offer some protection. The game ends once any alien reaches the Earth or the player’s cannon is hit a
third time with laser bombs. Game variants are combinations of five binary factors which, when activated, modify the
default: (1) The difficulty switch widens the player’s laser cannon, making it an easier target for enemy laser bombs.
(2) Shields move, and thus are harder to use. (3) Enemy laser bombs zigzag, which makes them harder to predict.
(4) Enemy laser bombs drop faster. (5) Invaders are invisible, and only hitting one with the laser briefly illuminates
the others. This creates a total of 32 variants of Space Invaders.
Breakout. In order to achieve the best score in any variant, the player needs to completely break down walls made of six layers of coloured bricks by knocking off said bricks with a ball, which is served and bounced towards the wall with a controllable paddle that moves horizontally along the bottom of the screen. The ball also bounces off the screen
edges, except for the bottom edge, where balls either come in contact with the paddle or are lost. Players have a total
of five balls at their disposal, and the game ends when all balls are lost. Points are scored when bricks are knocked off
the wall, and the ball accelerates on contact with bricks in the top three rows or after twelve consecutive hits. Game
variants are created by all combinations of three factors: (1) The binary difficulty switch reduces the paddle’s width by
a quarter, making it easier to miss the ball. (2) The precise rules which determine how points accumulate: (2-a) with
standard “Breakout” rules, e.g. in the default game (0 00), the player must completely break down two walls of bricks,
one after the other, while losing the fewest balls; (2-b) under "Timed Breakout" rules, the player must completely
break a single wall as fast as possible, no matter how many of the five balls are used; (2-c) with “Breakthru” rules,
the player needs to break two walls consecutively, but the ball does not bounce off bricks, unlike in previous variants;
the ball keeps going through the wall, quickly picking up speed and accumulating points. (3) How the ball is aimed at
the wall and “extras”: (3-a) the ball simply bounces off the paddle at a position dependent angle; (3-b) the player can
also steer the ball in flight; (3-c) the player is also able to catch the ball and release it at a slower speed; (3-d) the wall is invisible and is only briefly illuminated when a brick is knocked off, so the player needs to remember its configuration
in order to aim effectively. All factors together define 24 variants of Breakout.
Freeway. In all variants, the goal of the player is to safely get a chicken across ten lanes of freeway traffic as many
times as possible in 2 minutes and 16 seconds. Variants are constructed as combinations of three design factors:
(1) The difficulty switch controls whether the chicken is knocked back one lane, or all the way to the kerb, when hit
by incoming vehicles. (2) Traffic “thickness”, defined as four levels of traffic density. (3) Vehicle speeds across lanes,
either constant or randomised. Hence, there are 16 variants of Freeway.
3 METHODOLOGY
Experimental setup. Following the latest ALE benchmark recommendations (Machado et al.,2018), information
about player “lives” is not disclosed to agents, and stochasticity is introduced using randomised action replay with
probability 25%, also known as “sticky actions”. Irrespective of redundancies, agents act using the full set of 18
actions. Environment observations (Atari screen frames) were pre-processed according to standard practice (Mnih
et al.,2015;Hessel et al.,2018). We report agent returns at the end of training, averaged over the last 2 million steps.
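A minimal sketch of the "sticky actions" protocol described above, written as a Gymnasium-style wrapper; the class and interface are assumptions for illustration (the ALE also implements sticky actions natively at the emulator frame level, which interacts slightly differently with frame skipping).

```python
import random
import gymnasium as gym

class StickyActions(gym.Wrapper):
    """With probability `stick_prob`, repeat the previously executed action
    instead of the one requested by the agent (Machado et al., 2018)."""

    def __init__(self, env, stick_prob=0.25):
        super().__init__(env)
        self.stick_prob = stick_prob
        self.last_action = 0  # NOOP by convention

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if random.random() < self.stick_prob:
            action = self.last_action  # the environment "sticks" to the old action
        self.last_action = action
        return self.env.step(action)
```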
Expert Agent Training. We use the Rainbow-IQN model-free deep reinforcement learning algorithm (Hessel et al.,
2018;Dabney et al.,2018;Toromanoff et al.,2019) since it is available to the community in several implementations,
e.g. Castro et al. (2018), and is effective with widely available commodity hardware and open-source software.
Rainbow (Hessel et al.,2018) collects a number of improvements to DQN (Mnih et al.,2013;2015), of which we used
Double Q-Learning (Van Hasselt et al.,2016), Prioritised Replay (Schaul et al.,2016) and multi-step learning (Sutton
& Barto,2018). Following Castro et al. (2018), we did not use the dueling network architecture (Wang et al.,2016) or
noisy networks (Fortunato et al.,2018). We replaced the distributional RL approach C51 (Bellemare et al.,2017) with
its more general form (IQN), since it has been shown to be superior (Dabney et al.,2018;Toromanoff et al.,2019).
We used the standard limit on environment interactions of 200 million steps (Mnih et al.,2015;2016;Dabney et al.,
2018;Machado et al.,2018;Toromanoff et al.,2019). Our study is the first to report Rainbow-IQN results for Atari
game variants, hence we use independent hyper-parameter selection for expert training and finetuning experiments.
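Two of the retained Rainbow components, multi-step learning and Double Q-Learning, combine in the bootstrap target. The sketch below shows this target computation in isolation, with hypothetical callables standing in for the online and target networks; it is not the agent's actual implementation, which additionally uses IQN's quantile critic and prioritised replay.

```python
import numpy as np

def n_step_double_q_target(rewards, discounts, next_obs, online_q, target_q):
    """n-step return plus a double Q-learning bootstrap (Van Hasselt et al., 2016):
    the online network selects the bootstrap action, the target network evaluates it.

    rewards:   the n immediate rewards r_t, ..., r_{t+n-1}
    discounts: per-step discounts (gamma, or 0 at episode termination), same length
    online_q, target_q: callables mapping an observation to a vector of Q-values
    """
    n_step_return, cumulative_discount = 0.0, 1.0
    for r, d in zip(rewards, discounts):
        n_step_return += cumulative_discount * r
        cumulative_discount *= d
    best_action = int(np.argmax(online_q(next_obs)))  # action selection: online net
    bootstrap = target_q(next_obs)[best_action]       # action evaluation: target net
    return n_step_return + cumulative_discount * bootstrap
```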
Expert Agent Finetuning. The experimental setup for finetuning follows closely that for agent training, except for
two important differences: (1) We used 10 million environment steps to adapt to new variants, following Farebrother
et al. (2018), which is 20× less data than what variant-experts are trained with. The aim of transfer learning is to
reduce resources needed for acquiring new knowledge and behaviours, hence we are interested in improving sample
complexity using transfer. (2) The hyper-parameter grid was slightly adapted in order to improve chances of fast
learning in this reduced data regime. Further details, including parameter grids, are listed in Appendix B.
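In practice, value-function finetuning amounts to initialising the target-task agent from a source-task expert and then training as usual on the new variant. The sketch below uses hypothetical PyTorch names; the paper's actual training loop, replay buffer and optimiser settings are not reproduced here.

```python
import torch

def start_finetuning(make_q_network, expert_checkpoint_path, learning_rate):
    """Initialise a target-task agent from a source-task expert: copy all
    value-network parameters, then continue Q-learning on target-task data."""
    online_net = make_q_network()
    target_net = make_q_network()

    expert_state = torch.load(expert_checkpoint_path, map_location="cpu")
    online_net.load_state_dict(expert_state)   # warm start from the expert
    target_net.load_state_dict(expert_state)

    optimiser = torch.optim.Adam(online_net.parameters(), lr=learning_rate)
    # From here on, training proceeds as from scratch, but on the target
    # variant and with the reduced 10 million step interaction budget.
    return online_net, target_net, optimiser
```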
Statistical Analyses. A multi-factor Analysis of Variance (ANOVA) (Girden, 1992) is a statistical procedure which investigates the influence that two or more independent variables, or factors, have on a dependent variable. The factors can take two or more categorical values, and experimental designs are often balanced: they use equal numbers of independent observations for all combinations of factor values, no fewer than three. Other assumptions are normality of deviations from group means, and equal or similar variances; see the discussion in Appendix C. ANOVA can be used to reveal statistically significant evidence against the null hypothesis that group means are all the same. We study the influence of game design factors, i.e. discrete modifications to default games, on the average returns of learned policies.
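For concreteness, here is a minimal two-factor sketch of such an analysis using statsmodels; the column names, the factor subset and the returns are invented for illustration, whereas our actual analyses use all design factors of each game title (Appendix C).

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical table: one row per trained agent, with its final average return
# and the categorical design factors of the variant it was trained on.
df = pd.DataFrame({
    "returns":    [612.0, 598.5, 605.1, 401.2, 388.7, 395.0,
                   575.3, 560.1, 568.8, 120.4, 133.8, 127.5],
    "difficulty": ["0"] * 6 + ["1"] * 6,
    "invisible":  ["off", "off", "off", "on", "on", "on"] * 2,
})

# Fit a linear model with main effects and their interaction, then run the ANOVA.
model = ols("returns ~ C(difficulty) * C(invisible)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F statistics and p-values per factor and interaction
```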
Study Limitations. The relationship between hyper-parameters and the quality of policies learned with deep RL is complex and poorly understood. Mnih et al. (2015) used task-agnostic DQN hyper-parameters, tuned across a few basic games. Later works introduce several interacting modifications, including changes to sets of hyper-parameters (Schaul
et al.,2016;Dabney et al.,2018;Toromanoff et al.,2019). With the risk of maximisation bias, we used grid searches
per game variant in order to mitigate the greater risks of incorrect inferences about Atari curricula due to poor hyper-
parameter settings or divergence. Future works may perform fine-grained sensitivity analyses, using more than two
random seeds. Due to computational limitations, we instead report the returns of the top-three agents from our grids.
4 EXPERIMENTS
We would like to characterise the performance of general transfer learning strategies across heterogeneous scenarios,
designed to progressively challenge human players. While the shapes of learning curves are expected to be different
between game variants, agent performance is thought to be bounded only by its inductive biases, the learning algorithm
and practical limitations on resources, such as computation or interactions (Silver et al., 2021). We say that some game
variant is “harder” for a given agent if the average returns of learned policies are lower given equal resource budgets.
We aim to explain this in terms of the factors of variation which combinatorially define environments within curricula,
knowing that they also impact human player performance. Comparing raw scores across the benchmark is difficult if
the underlying tasks are “harder” to learn from scratch, or if variants have different scoring scales, as is sometimes the
case here. Hence, we must first establish the performance of our agent learning game variants in isolation. On this
basis, we then introduce a relative scoring scheme which enables conclusions to be drawn at the benchmark level.
4.1 TRAINING VARIANT-EXPERTS
We aim to answer the following questions: (1) Does our agent achieve similar levels of performance when learning
variants of the same game from scratch? (2) If not, what explains differences in performance across variants? In
Figure 2 we report means and standard deviations of the top three variant-experts trained from scratch for 200 million
environment steps. Our scores on default games are largely in line with those reported by other implementations of
basic IQN agents, e.g. Castro et al. (2018). On remaining variants, we find that Rainbow-IQN performance varies
widely for some game titles, less so on Freeway, even with independent hyper-parameter tuning. Overall, our agent’s
scores are below variant performance ceilings, suggesting that some game variations may be harder to learn. While
this may be at times due to the agent’s inductive biases—in particular, invisible objects—it is unlikely to fully explain
observed variation, because such changes to basic games do not always have a detrimental impact on their own. For
example, experts achieve significantly different levels of performance on otherwise identical variants of Space Invaders which differ only in visible vs. invisible aliens (0 00 vs. 0 08). However, experts have virtually equal performance under the same change to environment observations across Breakout variants, i.e. visible vs. invisible walls of bricks (e.g. 0 00 vs. 0 12). Rather,
we hypothesise that some variants are inherently harder to learn for the chosen agent, and thus a fixed training budget
leads to performance differences across variations. This imposes a challenging bottleneck for transfer learning, which
places high priority on limiting the resources expended for acquiring behaviours which maximise cumulative returns.
Statistical Analyses. We verify that variants have meaningful impact on learning by rejecting the null hypothesis that
expert performance is the same across all combinations of the design factors which define game variants. We perform
multi-factor Analysis of Variance (ANOVA) tests separately for each game title, and report results in Table 1. Interaction effects are statistically significant in all cases, supporting the hypothesis that, although conceptually similar, design factors introduce significant changes to agent learning dynamics, even within the same game. In the post-hoc