PaCo: Parameter-Compositional Multi-Task
Reinforcement Learning
Lingfeng Sun1∗† Haichao Zhang2 Wei Xu2 Masayoshi Tomizuka1
1University of California, Berkeley 2Horizon Robotics
lingfengsun@berkeley.edu {haichao.zhang, wei.xu}@horizon.ai tomizuka@berkeley.edu
∗Equal contribution. †Work done while interning at Horizon Robotics.
Abstract
The purpose of multi-task reinforcement learning (MTRL) is to train a single policy
that can be applied to a set of different tasks. Sharing parameters allows us to take
advantage of the similarities among tasks. However, the gaps in content and difficulty
between different tasks raise challenges in deciding which tasks should share
parameters and what parameters should be shared, as well as optimization
challenges due to parameter sharing. In this work, we introduce a parameter-
compositional approach (PaCo) as an attempt to address these challenges. In this
framework, a policy subspace represented by a set of parameters is learned. Policies
for all the single tasks lie in this subspace and can be composed by interpolating
with the learned set. It allows not only flexible parameter sharing but also a natural
way to improve training. We demonstrate state-of-the-art performance on
Meta-World benchmarks, verifying the effectiveness of the proposed approach.
1 Introduction
Deep reinforcement learning (RL) has made massive progress in solving complex tasks in different
domains. Despite the success of RL in various robotic tasks, most of the improvements are restricted
to single tasks in locomotion or manipulation. Although many similar tasks with different targets
and interacting objects are accomplished by the same robot, they are usually defined as individual
tasks and solved separately. In contrast, humans, as intelligent agents, usually spend less time
learning similar tasks and can acquire new skills using existing ones. This motivates us to think
about the advantages of efficiently training a set of tasks with certain similarities together. Multi-task
reinforcement learning (MTRL) aims to train an effective policy that can be applied to the same robot
to solve different tasks. Compared to training each task separately, a multi-task policy should be
efficient in the number of parameters and training samples and benefit from the sharing process.
The key challenge in multi-task RL methods is determining what should be shared among tasks
and how to share. It is reasonable to assume the existence of similarities among all the tasks
picked (usually on the same robot) since training completely different tasks together is meaningless.
However, the gaps between different tasks can be significant even within the set. For tasks using
the same skill but with different goals, it’s natural to share all the parameters and add the goal into
state representation to turn the policy into a goal-conditioned policy. For tasks with different skills,
sharing policy parameters can be efficient for related tasks but may bring additional difficulties for
uncorrelated skills (e.g., push and peg-insert-side in Meta-World [36]), due to the additional learning challenges brought by conflicts between tasks.
Figure 1: Example tasks from Meta-World [36].
Recent works on multi-task RL have proposed different methods for this problem, which can be roughly divided into three categories. Some focus on modeling shared structures for the sub-policies of different tasks [3, 33], while others focus more on algorithms and aim to handle conflicting gradients from different task losses during training [35]. In addition, many works attempt to select or learn better representations as task conditioning for the policies [24]. In this paper, we focus on the share-structure design for multiple tasks. We propose a parameter-compositional MTRL method that learns a task-agnostic parameter set forming a subspace in the policy parameter space for all tasks. We infer the task-specific policy in this subspace using a compositional vector for each task. Instead of interpolating different policies’ outputs in the action space, we directly compose the policies in the parameter space. In this way, two different tasks can have identical or independent policies. With different subspace dimensions (i.e., sizes of the parameter set) and additional constraints, this compositional formulation can unify many previous works on sharing structures in MTRL. Moreover, separating the task-specific and the task-agnostic parameter sets brings advantages in dealing with unstable RL training on certain tasks, which helps improve multi-task training stability.
The key contributions of our work are summarized as follows: i) We present a general Parameter-Compositional (PaCo) MTRL training framework that can learn representative parameter sets used to compose policies for different tasks. ii) We introduce a scheme to stabilize MTRL training by leveraging PaCo’s decompositional structure. iii) We validate the state-of-the-art performance of PaCo on the Meta-World benchmark compared with a number of existing methods.
2 Preliminaries
2.1 Markov Decision Process (MDP)
A discrete-time Markov decision process is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \mu, \gamma)$, where $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $P$ is the transition process between states; $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function; $\mu \in \mathcal{P}(\mathcal{S})$ is the distribution of the initial state; and $\gamma \in [0, 1]$ is the discount factor. At each time step $t$, the learning agent generates the action with a policy $\pi(a_t|s_t)$ as the decision. The goal is to learn a policy to maximize the accumulated discounted return.
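As a small, purely illustrative example of this objective, the snippet below computes the accumulated discounted return $\sum_t \gamma^t r_t$ for a hypothetical reward sequence; the reward values and discount factor are arbitrary choices, not values from the paper.

```python
# Toy illustration of the accumulated discounted return G = sum_t gamma^t * r_t.
# The reward sequence and gamma below are arbitrary, for demonstration only.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):      # accumulate from the last step backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))   # 1.0 + 0.99**2 * 2.0 = 2.9602
```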
2.2 Soft Actor-Critic
In the scope of this work, we will use Soft Actor-Critic (SAC) [9] to train the universal policy for the multi-task RL problem. SAC is an off-policy actor-critic method that uses the maximum entropy framework. The parameters in the SAC framework include the policy network $\pi(a_t|s_t)$ used in evaluation and the critic network $Q(s_t, a_t)$ as a soft Q-function. A temperature parameter $\alpha$ is used to maintain the entropy level of the policy. In multi-task learning, the one-hot id of the task skill and the goal are appended to the state. Different from single-task SAC, multiple tasks may have different learning dynamics. Therefore, we follow previous works [24] and assign a separate temperature $\alpha_\tau$ to each task with a different skill. The policy and critic optimization procedure remains the same as in the single-task setting.
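To make the per-task temperature concrete, here is a minimal sketch (not the reference implementation) of how a separate, learnable $\alpha_\tau$ per task could be maintained and updated with the standard SAC temperature objective; the target entropy, learning rate, and batch construction are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the reference code): one learnable temperature
# alpha_tau per task, updated with the usual SAC temperature objective.
import torch

num_tasks = 10
target_entropy = -4.0                                      # assumed target, e.g. -|A|
log_alphas = torch.nn.Parameter(torch.zeros(num_tasks))    # log alpha_tau for each task
alpha_opt = torch.optim.Adam([log_alphas], lr=3e-4)

def alpha_loss(task_ids: torch.Tensor, log_probs: torch.Tensor) -> torch.Tensor:
    """Each sample in the batch uses the temperature of its own task."""
    log_alpha = log_alphas[task_ids]                       # gather alpha_tau per sample
    return -(log_alpha * (log_probs + target_entropy).detach()).mean()

# dummy batch: per-sample task indices and policy log-probabilities
task_ids = torch.randint(0, num_tasks, (256,))
log_probs = torch.randn(256)
loss = alpha_loss(task_ids, log_probs)
alpha_opt.zero_grad(); loss.backward(); alpha_opt.step()
```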
3 Revisiting and Analyzing Multi-Task Reinforcement Learning
3.1 Multi-Task Reinforcement Learning Setting
Each single task can be defined by a unique MDP, and changes in the state space, action space, transition, or reward function can result in completely different tasks. In MTRL, instead of solving a single MDP, we solve a set of MDPs from a task family using a universal policy $\pi_\theta(a|s)$. The first assumption for MTRL is a universal shared state space $\mathcal{S}$, with each task having a disjoint state space $\mathcal{S}_\tau \subset \mathcal{S}$, where $\tau \in \mathcal{T}$ is any task from the full task distribution. In this way, the policy is able to recognize which task it is currently solving. Adding a one-hot encoding of the task id to the state is a common implementation for obtaining disjoint state spaces in experiments.
In the general MTRL setting, we do not have strict restrictions on which tasks are involved, but we assume that the tasks in the full task distribution share some similarities. In real applications, depending on how a task is defined, we can divide the problem into Multi-Goal MTRL and Multi-Skill MTRL. For the former, the task set is defined by various “goals” in the same environment. The reward function $r_\tau$ is different for each goal, but the state and transition remain the same. Typical examples of this multi-goal setting are locomotion tasks like Cheetah-Velocity/Direction³ and all kinds of goal-conditioned manipulation tasks [18]. For the latter, besides changes in goals within the same environment, the task set also involves different environments that share similar dynamics (transition functions). This happens more often in manipulation tasks, where different environments train different skills of a robot; one natural example is the Meta-World [36] benchmark, which includes multiple goal-conditioned manipulation tasks using the same robot arm. In this setting, the state space of different tasks changes across different skills since the robot is manipulating different objects (c.f. Figure 1). In both the multi-goal and multi-skill settings, we have to form the set of MDPs into a universal multi-task MDP and find a universal policy that works for all tasks. For multi-goal tasks, we need to append the “goal” information to the state; for multi-skill tasks, we need to append the “goal” (usually a position) as well as the “skill” (usually a one-hot encoding). After obtaining the state $\mathcal{S}_\tau$, the corresponding transition and reward $P_\tau, r_\tau$ can be defined accordingly.
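As a minimal sketch of this state construction, the snippet below appends the goal and a one-hot skill/task encoding to a raw observation. The dimensions and variable names are arbitrary placeholders for illustration, not the benchmark's actual sizes.

```python
# Minimal sketch of the augmented multi-task state: [observation, goal, one-hot task id].
# The dimensions below are arbitrary placeholders, not Meta-World's actual sizes.
import numpy as np

NUM_TASKS = 10

def augment_state(obs: np.ndarray, goal: np.ndarray, task_id: int) -> np.ndarray:
    """Build s_tau so the universal policy can tell which task it is solving."""
    one_hot = np.zeros(NUM_TASKS, dtype=obs.dtype)
    one_hot[task_id] = 1.0
    return np.concatenate([obs, goal, one_hot])

s = augment_state(np.zeros(12, dtype=np.float32), np.zeros(3, dtype=np.float32), task_id=4)
print(s.shape)   # (25,) = 12 + 3 + 10
```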
3.2 Challenges in Multi-Task Reinforcement Learning
Parameter-Sharing.
Multi-task learning aims to learn a single model that can be applied to a set
of different tasks. Sharing parameters allows us to take advantage of the similarities among tasks.
However, the gaps in content and difficulty between tasks raise challenges in deciding both
which tasks should share parameters and what parameters should be shared. Failure in this design
may result in a low success rate on certain tasks that could have been solved if trained separately. This
is a challenge in designing an effective structure to solve the MTRL task.
Multi-Task Training Stability.
Although we have assumed some similarity in the task sets used
for multi-task learning, conflicts between different skills may affect the whole training process [35].
Also, failures such as loss explosion in some tasks can severely affect the training of other tasks due to parameter sharing [24]. In multi-task training with a large number of tasks, the uncertainty of single-task training is amplified. This is a challenge in designing an algorithm that avoids the negative influence brought by parameter sharing among multiple tasks.
4 Parameter-Compositional Multi-Task RL
Motivated by the challenges in training universal policies for multiple tasks discussed in Section 3, we
will present a Parameter-Compositional approach to MTRL. The proposed approach is conceptually
simple, yet offers opportunities in addressing the MTRL challenges as detailed in the sequel.
4.1 Formulation
In this section, we describe how we formulate the parameter-compositional framework for MTRL.
Given a task $\tau \sim \mathcal{T}$, where $\mathcal{T}$ denotes the set of tasks with $|\mathcal{T}| = T$, we use $\theta_\tau \in \mathbb{R}^n$ to denote the vector of all the trainable parameters of the model (i.e., policy and critic networks) for task $\tau$. We employ the following decomposition for the task parameter vector $\theta_\tau$:
$$\theta_\tau = \Phi w_\tau, \qquad (1)$$
where $\Phi = [\phi_1, \phi_2, \cdots, \phi_i, \cdots, \phi_K] \in \mathbb{R}^{n \times K}$ denotes a matrix formed by a set of $K$ parameter vectors $\{\phi_i\}_{i=1}^{K}$ (referred to as the parameter set, a term also overloaded to refer to $\Phi$), each of which has the same dimensionality as $\theta_\tau$, i.e., $\phi_i \in \mathbb{R}^n$. $w_\tau \in \mathbb{R}^K$ is a compositional vector, which is implemented as a trainable embedding vector for the task index $\tau$. We refer to a model with parameters in the form of Eqn.(1) as a parameter-compositional model.
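To make Eqn.(1) concrete, the following is a minimal sketch (not the authors' implementation) that composes the flat parameter vector of a tiny one-layer policy as $\theta_\tau = \Phi w_\tau$; the network size, initialization scale, and shapes are illustrative assumptions.

```python
# Minimal sketch of the parameter-compositional model in Eqn. (1): theta_tau = Phi @ w_tau.
# The one-layer "policy" and all sizes here are illustrative assumptions.
import torch

state_dim, action_dim = 4, 2
n = state_dim * action_dim + action_dim      # number of parameters of the tiny policy
K, T = 5, 10                                 # size of the parameter set and number of tasks

Phi = torch.nn.Parameter(0.1 * torch.randn(n, K))   # shared parameter set [phi_1, ..., phi_K]
W = torch.nn.Parameter(0.1 * torch.randn(K, T))     # column tau is the compositional vector w_tau

def policy_mean(state: torch.Tensor, task_id: int) -> torch.Tensor:
    """Compose task-specific parameters, then apply them as a linear policy head."""
    theta_tau = Phi @ W[:, task_id]                         # Eqn. (1)
    weight = theta_tau[: state_dim * action_dim].reshape(action_dim, state_dim)
    bias = theta_tau[state_dim * action_dim:]
    return state @ weight.T + bias

a = policy_mean(torch.randn(3, state_dim), task_id=7)      # batch of 3 states for task 7
print(a.shape)                                              # torch.Size([3, 2])
```

In the full method, the same composition would apply to all trainable policy and critic parameters rather than to a single linear layer as in this toy sketch.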
³By Cheetah/Ant-Velocity/Direction, we refer to the tasks that have the same dynamics as the standard locomotion tasks but with a goal of running at a specific velocity or in a specific direction.
Figure 2: Parameter-Compositional method (PaCo) for multi-task reinforcement learning. In this framework, the network parameter vector $\theta_\tau$ for a task $\tau$ is instantiated in a compositional form based on the shared base parameter set $\Phi$ and the task-specific compositional vector $w_\tau$. Then the networks are used in the standard way for generating actions or computing the loss [9]. During training, $\Phi$ will be impacted by all the task losses, while $w_\tau$ is impacted by the corresponding task loss only.
In the presence of a single task, the decomposition in Eqn.(1) brings no additional benefits, as it is essentially equivalent to the standard way of parameterizing the model. However, when faced with multiple tasks, as in the MTRL setting considered in this work, the decomposition in Eqn.(1) offers opportunities for tackling the challenges posed by the MTRL setting. More concretely, since Eqn.(1) decomposes the parameters into two parts, i) the task-agnostic $\Phi$ and ii) the task-aware $w_\tau$, we can share the task-agnostic $\Phi$ across all the tasks while still ensuring task awareness via $w_\tau$, leading to:
$$[\theta_1, \cdots, \theta_\tau, \cdots, \theta_T] = \Phi\,[w_1, \cdots, w_\tau, \cdots, w_T], \qquad \text{i.e.,}\ \ \Theta = \Phi W. \qquad (2)$$
For MTRL, let $J_\tau(\theta_\tau)$ denote the summation of both actor and critic losses, implemented in the same way as in SAC [9], for task $\tau$. The multi-task loss is defined as the summation of the individual losses $J_\tau$ across tasks: $J_\Theta \triangleq \sum_\tau J_\tau(\theta_\tau)$, where $\Theta$ denotes the collection of all the trainable parameters of both actor and critic networks. Together with Eqn.(2), it can be observed that the multi-task loss $J_\Theta$ contributes to the learning of the model parameters in two ways (see the sketch following this list):
• $\partial J_\Theta / \partial \Phi = \sum_\tau \partial J_\tau / \partial \Phi$: all the $T$ tasks contribute to the learning of the shared parameter set $\Phi$;
• $\partial J_\Theta / \partial W = \sum_\tau \partial J_\tau / \partial w_\tau$: as for the training of the task-specific compositional vectors, each task loss $J_\tau$ impacts only its own task-specific compositional vector $w_\tau$.
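The following toy sketch (with quadratic stand-in losses, not the SAC objectives) illustrates this gradient structure: the shared $\Phi$ receives gradients from every active task loss, while each column $w_\tau$ of $W$ receives a gradient only from its own task loss.

```python
# Toy illustration of the gradient routing implied by Theta = Phi @ W (Eqn. (2)).
# The quadratic per-task "losses" are stand-ins for the actual SAC losses J_tau.
import torch

n, K, T = 6, 3, 4
Phi = torch.nn.Parameter(torch.randn(n, K))
W = torch.nn.Parameter(torch.randn(K, T))

Theta = Phi @ W                              # column tau is theta_tau = Phi @ w_tau
per_task_loss = (Theta ** 2).sum(dim=0)      # stand-in for J_tau(theta_tau), one per task
J = per_task_loss[:2].sum()                  # suppose only tasks 0 and 1 contribute this step
J.backward()

print(W.grad[:, :2].abs().sum() > 0)         # True: w_0, w_1 get gradients from their own losses
print(W.grad[:, 2:].abs().sum() == 0)        # True: w_2, w_3 are untouched by other tasks' losses
print(Phi.grad.abs().sum() > 0)              # True: the shared Phi is shaped by all active tasks
```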
The PaCo framework is illustrated in Figure 2. Additional implementation details about PaCo
are provided in Appendix A.2. The proposed approach has several attractive properties towards
addressing the MTRL challenges discussed earlier:
• the compositional form of parameters as in Eqn.(2) offers flexible parameter sharing between tasks, by learning the appropriate compositional vectors for each task over the shared parameter set;
• because of the clear separation between task-specific and task-agnostic parameters, it also offers a natural solution for improving the stability of MTRL training, as detailed in the sequel.
In addition, the separation between task-specific and task-agnostic information has other benefits that are beyond the scope of the current work. For example, the task-agnostic parameter set $\Phi$ could be reused as a pre-trained policy basis in some transfer scenarios (initial attempts in Appendix A.3).
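A hypothetical sketch of that transfer idea, under the assumption that one simply freezes the learned parameter set and trains only a new compositional vector for an unseen task, could look as follows; the loss is a stand-in, not an RL objective from the paper.

```python
# Hypothetical sketch: reuse a pretrained, frozen Phi as a policy basis and learn
# only a new compositional vector w_new for an unseen task.
import torch

n, K = 6, 3
Phi = torch.randn(n, K).requires_grad_(False)        # pretrained, task-agnostic basis (frozen)
w_new = torch.nn.Parameter(0.1 * torch.randn(K))     # only this vector is trained for the new task
opt = torch.optim.Adam([w_new], lr=1e-3)

theta_new = Phi @ w_new                              # composed parameters for the new task
loss = (theta_new ** 2).sum()                        # stand-in for the new task's training loss
opt.zero_grad(); loss.backward(); opt.step()         # updates w_new only; Phi stays fixed
```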
4.2 Stable Multi-Task Reinforcement Learning
One inherent challenge in MTRL is the interference during training among tasks due to parameter sharing. One consequence of this is that the failure of training on one task may adversely impact the training of other tasks [28, 35]. For example, it has been empirically observed that some task losses may explode during training on Meta-World [24], which will contribute a significant portion in