
recognize which task it is currently solving. Appending a one-hot encoding of the task id to the state is a common way of obtaining disjoint state spaces in experiments.
In the general MTRL setting, we place no strict restrictions on which tasks are involved, but we assume that tasks in the full task distribution share some similarities. In real applications, depending on how a task is defined, MTRL can be divided into Multi-Goal MTRL and Multi-Skill MTRL. In the former, the task set is defined by various “goals” in the same environment. The reward function $r^{\tau}$ differs for each goal, but the state space and transition function remain the same. Typical examples of this multi-goal setting are locomotion tasks like Cheetah-Velocity/Direction³ and all kinds of goal-conditioned manipulation tasks [18]. In the latter, besides changes in goals within the same environment, the task set also involves different environments that share similar dynamics (transition functions). This occurs mostly in manipulation tasks where different environments train different skills of a robot; a natural example is the Meta-World [36] benchmark, which includes multiple goal-conditioned manipulation tasks using the same robot arm. In this setting, the state space changes across skills since the robot manipulates different objects (cf. Figure 1). In both the multi-goal and multi-skill settings, we have to combine the set of MDPs into a universal multi-task MDP and find a universal policy that works for all tasks. For multi-goal tasks, we append the “goal” information to the state; for multi-skill tasks, we append the “goal” (usually a position) as well as the “skill” (usually a one-hot encoding). After obtaining the state $S^{\tau}$, the corresponding transition and reward $P^{\tau}, r^{\tau}$ can be defined accordingly.
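As a concrete illustration of this construction, the sketch below forms the universal state by concatenating the raw observation with the goal and, in the multi-skill case, a one-hot skill encoding; the function and argument names are our own illustrative choices, not from any released implementation.

```python
import numpy as np

def universal_state(obs: np.ndarray, goal: np.ndarray,
                    skill_id: int = None, num_skills: int = None) -> np.ndarray:
    """Form the universal multi-task state S^tau.

    Multi-goal setting: append only the goal (e.g., target velocity or position).
    Multi-skill setting: additionally append a one-hot encoding of the skill/task id.
    """
    parts = [obs, goal]
    if skill_id is not None:
        skill = np.zeros(num_skills, dtype=obs.dtype)
        skill[skill_id] = 1.0
        parts.append(skill)
    return np.concatenate(parts)
```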
3.2 Challenges in Multi-Task Reinforcement Learning
Parameter-Sharing.
Multi-task learning aims to learn a single model that can be applied to a set of different tasks. Sharing parameters allows us to take advantage of the similarities among tasks. However, the gaps in content and difficulty between tasks raise the questions of both which tasks should share parameters and which parameters should be shared. A poor design may result in a low success rate on certain tasks that could have been solved if trained separately. The challenge is therefore to design an effective structure for solving the MTRL task.
Multi-Task Training Stability.
Although we assume some similarity within the task set used for multi-task learning, conflicts between different skills may affect the whole training process [35]. Moreover, failures such as loss explosion on some tasks can severely affect the training of other tasks due to parameter sharing [24]. In multi-task training with a large number of tasks, the uncertainty of single-task training is amplified. The challenge is therefore to design an algorithm that avoids the negative influence brought by parameter sharing among multiple tasks.
4 Parameter-Compositional Multi-Task RL
Motivated by the challenges in training universal policies for multiple tasks discussed in Section 3, we present a Parameter-Compositional approach to MTRL. The proposed approach is conceptually simple, yet offers opportunities for addressing the MTRL challenges, as detailed in the sequel.
4.1 Formulation
In this section, we describe how we formulate the parameter-compositional framework for MTRL.
Given a task $\tau \sim \mathcal{T}$, where $\mathcal{T}$ denotes the set of tasks with $|\mathcal{T}| = T$, we use $\theta^{\tau} \in \mathbb{R}^{n}$ to denote the vector of all the trainable parameters of the model (i.e., policy and critic networks) for task $\tau$. We employ the following decomposition for the task parameter vector $\theta^{\tau}$:
$$\theta^{\tau} = \Phi w^{\tau}, \qquad (1)$$
where $\Phi = [\phi_{1}, \phi_{2}, \cdots, \phi_{i}, \cdots, \phi_{K}] \in \mathbb{R}^{n \times K}$ denotes a matrix formed by a set of $K$ parameter vectors $\{\phi_{i}\}_{i=1}^{K}$ (referred to as the parameter set, a term also overloaded to refer to $\Phi$), each of which has the same dimensionality as $\theta^{\tau}$, i.e., $\phi_{i} \in \mathbb{R}^{n}$. $w^{\tau} \in \mathbb{R}^{K}$ is a compositional vector, implemented as a trainable embedding vector for the task index $\tau$. We refer to a model with parameters in the form of Eqn. (1) as a parameter-compositional model.
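To make the decomposition concrete, a minimal PyTorch-style sketch is given below. The class and method names (`ParamCompositional`, `task_parameters`) and the initialization are our own illustrative choices; the composed flat vector would still need to be reshaped into the actual policy and critic weights (e.g., via a functional forward pass) so that gradients flow to both $\Phi$ and $w^{\tau}$.

```python
import torch
import torch.nn as nn

class ParamCompositional(nn.Module):
    """Illustrative sketch of Eqn. (1): theta^tau = Phi w^tau.

    n: total number of trainable parameters of the per-task model,
    K: size of the parameter set, T: number of tasks.
    """

    def __init__(self, n: int, K: int, T: int):
        super().__init__()
        # Shared parameter set Phi = [phi_1, ..., phi_K] in R^{n x K}.
        self.Phi = nn.Parameter(0.01 * torch.randn(n, K))
        # Task-specific compositional vectors w^tau in R^K, one row per task index.
        self.w = nn.Embedding(T, K)

    def task_parameters(self, task_idx: torch.Tensor) -> torch.Tensor:
        """Compose theta^tau = Phi w^tau for a batch of task indices."""
        w_tau = self.w(task_idx)        # shape (batch, K)
        return w_tau @ self.Phi.t()     # shape (batch, n)
```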
³ By Cheetah/Ant-Velocity/Direction, we refer to tasks that have the same dynamics as the standard locomotion tasks but with a goal of running at a specific velocity or in a specific direction.