
We present an overview of related work in Section 2, before describing our experimental protocol in Section 3.
Section 4 discusses the effect of using the Actor-Mimic algorithm for consolidation on the performance obtained
within a given set of tasks. Section 5 focuses on knowledge transfer and generalization to new tasks. Section 6
provides a comparison baseline illustrating that consolidation mitigates the effects of negative transfer. We conclude
in Section 7.
2 Background and Related Work
Distillation was originally proposed by [4] and is among the most promising methods to achieve transfer between
tasks. [7] and [9] extend the core idea of learning several functions as one, and add an incentive to also copy the
features in order to guide the training process. Similarly, [10] builds a central network, encoding common behaviors,
to share knowledge between tasks.
One major challenge in RL today is lifelong learning, i.e. how to solve different tasks sequentially while avoiding
catastrophic forgetting. Different approaches exist to tackle this problem. We follow the division into three categories
proposed in [11]. One possibility is to periodically modify the network architecture when facing new tasks in order
to enhance its representative power [12, 13, 14]. Another approach is to use regularization to preserve previously
acquired knowledge [15, 16, 17]. Finally, the lifelong learning problem can be reduced to a multi-task learning one
by using a rehearsal strategy, memorizing every task encountered [18, 19, 20, 21, 22]. These three main categories are
not mutually exclusive, and many of these algorithms combine techniques from more than one category.
The idea of alternating between an active phase of pattern-separated learning and a passive phase of generalization,
as inspired by CLS theory, has also been explored before. In particular, [23] introduces the PLAID algorithm that
progressively grows a central network using distillation on newly encountered tasks. Similarly, [24] successively
compresses different expert networks into a knowledge base that is then reused by new experts via lateral layer-wise
connections [13]. [25] introduces a Self-Organising Map into DRL to simulate complementary learning in a neocortical
and a hippocampal system, improving learning on grid-world control tasks and demonstrating the biological plausibility
of an artificial CLS.
Instead of learning to solve multiple tasks, another possibility is to learn how to be efficient at learning: this is the
meta-learning approach. One intuitive way to achieve this is by using a meta-algorithm to output a set of neural
network weights which are then used as initialization for solving new tasks [26]. On the other hand, [27] proposes the
use of a second network whose role is to deactivate parts of a main neural network. By analogy with the human brain,
this second network is called the neuromodulatory network, as it is responsible for activating or deactivating parts of
the main network depending on the task at hand. Finally, [28] proposes a framework for meta-algorithms which divides
them into a “What” part, whose objective is to identify the task currently being solved from context data, and a “How”
part, responsible for producing a set of parameters for a neural network able to solve this task.
3 Actor-Mimic Networks for Consolidation in Lifelong Learning
In order to study the consolidation process and its interaction with knowledge transfer, we explore the use of the Actor-
Mimic Network (AMN) algorithm [7], which acts as a policy distillation algorithm with an additional incentive to imitate
the teacher’s features. In standard policy distillation, as proposed by [4], the distilled network — also called student
network — learns to reproduce the output of multiple expert networks (policy regression objective) using supervised
learning. On top of this, the AMN algorithm adds a feature regression objective that regularizes the features of
the student network (defined as the outputs of the second-to-last layer) towards the features of the experts. Intuitively,
the policy regression objective teaches the student how it should act, while the feature regression objective conveys the
result of the expert’s “thinking process”, indicating why it should act that way.
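To make this concrete, the following PyTorch sketch combines the two objectives in a single loss. It is an illustration under simplifying assumptions rather than the exact formulation of [7]: the expert policy is taken as a softmax over its Q-values, the student features are regressed directly onto the expert features, and temperature and beta are placeholder hyper-parameters.

import torch.nn.functional as F

def amn_loss(student_logits, student_features,
             expert_q_values, expert_features,
             temperature=1.0, beta=0.01):
    """Simplified Actor-Mimic objective: policy regression + feature regression."""
    # Policy regression: cross-entropy between the expert policy
    # (softmax over its Q-values) and the student's policy.
    teacher_policy = F.softmax(expert_q_values / temperature, dim=-1)
    policy_loss = -(teacher_policy
                    * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    # Feature regression: pull the student's penultimate-layer features
    # towards the expert's.
    feature_loss = F.mse_loss(student_features, expert_features)
    return policy_loss + beta * feature_loss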
The AMN algorithm makes it possible to consolidate several expert networks at the same time while extracting features
containing the same information as those of the experts. As the input states of target tasks can be quite different in nature (e.g.
graphical features, color palette, etc.), it is a desirable property for the extracted features in the consolidated network
to represent abstract concepts that facilitate generalization across tasks. To evaluate this property, we propose a new
training protocol composed of two phases that emulate day-night cycles: an active learning phase in which neural
networks — coined “expert networks” — are trained individually on a set of visual RL tasks, and a passive imitation
phase in which the knowledge acquired by all experts is consolidated into a central AMN that retains knowledge in
the long term.
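Schematically, and with purely illustrative helper names (train_on, sample_states and distill_from are assumptions, not actual APIs), one cycle of this protocol could be written as:

def day_night_training(tasks, experts, amn, num_cycles):
    """Two-phase training protocol; all helper methods are illustrative."""
    for cycle in range(num_cycles):
        # Active ("day") phase: each expert is trained on its own task
        # with a standard RL algorithm (Rainbow in our experiments).
        for task, expert in zip(tasks, experts):
            expert.train_on(task)
        # Passive ("night") phase: the knowledge of all experts is
        # consolidated into the central AMN via the Actor-Mimic loss.
        for task, expert in zip(tasks, experts):
            states = expert.sample_states(task)
            amn.distill_from(expert, states)
    return amn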
During the active phase, each expert network is trained on its corresponding task using a standard RL algorithm. We
use Rainbow [29] in the present study, as implemented in the Dopamine framework [30], with the typical architecture