PaCo: Parameter-Compositional Multi-Task
Reinforcement Learning
Lingfeng Sun1∗† Haichao Zhang2 Wei Xu2 Masayoshi Tomizuka1
1University of California, Berkeley 2Horizon Robotics
lingfengsun@berkeley.edu {haichao.zhang, wei.xu}@horizon.ai tomizuka@berkeley.edu
∗Equal contribution. †Work done while interning at Horizon Robotics.
Abstract
The purpose of multi-task reinforcement learning (MTRL) is to train a single policy
that can be applied to a set of different tasks. Sharing parameters allows us to take
advantage of the similarities among tasks. However, the gaps in content and difficulty
between different tasks raise challenges in deciding which tasks should share
parameters and what parameters should be shared, as well as optimization
challenges due to parameter sharing. In this work, we introduce a parameter-
compositional approach (PaCo) as an attempt to address these challenges. In this
framework, a policy subspace represented by a set of parameters is learned. Policies
for all the single tasks lie in this subspace and can be composed by interpolating
with the learned set. It allows not only flexible parameter sharing but also a natural
way to improve training. We demonstrate state-of-the-art performance on
Meta-World benchmarks, verifying the effectiveness of the proposed approach.
1 Introduction
Deep reinforcement learning (RL) has made massive progress in solving complex tasks in different
domains. Despite the success of RL in various robotic tasks, most of the improvements are restricted
to single tasks in locomotion or manipulation. Although many similar tasks with different targets
and interacting objects are accomplished by the same robot, they are usually defined as individual
tasks and solved separately. In contrast, humans, as intelligent agents, usually spend less time
learning similar tasks and can acquire new skills using existing ones. This motivates us to think
about the advantages of efficiently training a set of tasks with certain similarities together. Multi-task
reinforcement learning (MTRL) aims to train an effective policy that can be applied to the same robot
to solve different tasks. Compared to training each task separately, a multi-task policy should be
efficient in the number of parameters and training samples and benefit from the sharing process.
The key challenge in multi-task RL methods is determining what should be shared among tasks
and how to share. It is reasonable to assume the existence of similarities among all the tasks
picked (usually on the same robot) since training completely different tasks together is meaningless.
However, the gaps between different tasks can be significant even within the set. For tasks using
the same skill but with different goals, it’s natural to share all the parameters and add the goal into
state representation to turn the policy into a goal-conditioned policy. For tasks with different skills,
sharing policy parameters can be efficient for related tasks but may bring additional difficulties for
uncorrelated skills (e.g., push and peg-insert-side in Meta-World [36]), due to the additional learning challenges brought by conflicts between tasks.
Figure 1: Example tasks from Meta-World [36].
Recent works on multi-task RL have proposed different methods for this problem, which can be roughly divided into three categories. Some focus on modeling shared structures for the sub-policies of different tasks [3, 33], while others focus more on algorithms and aim to handle conflicting gradients from different task losses during training [35]. In addition, many works attempt to select or learn better representations as task conditioning for the policies [24]. In this paper, we focus on the share-structure design for multiple tasks. We propose a parameter-compositional MTRL method that learns a task-agnostic parameter set forming a subspace in the policy parameter space for all tasks. We infer the task-specific policy in this subspace using a compositional vector for each task. Instead of interpolating different policies’ outputs in the action space, we directly compose the policies in the parameter space. In this way, two different tasks can have identical or independent policies. With different subspace dimensions (i.e., sizes of the parameter set) and additional constraints, this compositional formulation can unify many previous works on sharing structures in MTRL. Moreover, separating the task-specific and the task-agnostic parameter sets brings advantages in dealing with unstable RL training on certain tasks, which helps improve multi-task training stability.
The key contributions of our work are summarized as follows: i) We present a general Parameter-Compositional (PaCo) MTRL training framework that can learn representative parameter sets used to compose policies for different tasks. ii) We introduce a scheme to stabilize MTRL training by leveraging PaCo’s decompositional structure. iii) We validate the state-of-the-art performance of PaCo on the Meta-World benchmark compared with a number of existing methods.
2 Preliminaries
2.1 Markov Decision Process (MDP)
A discrete-time Markov decision process is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \mu, \gamma)$, where $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $P$ is the transition process between states; $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function; $\mu \in \mathcal{P}(\mathcal{S})$ is the distribution of the initial state; and $\gamma \in [0, 1]$ is the discount factor. At each time step $t$, the learning agent generates the action with a policy $\pi(a_t|s_t)$ as the decision. The goal is to learn a policy to maximize the accumulated discounted return.
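As a small, purely illustrative example of this objective, the snippet below computes the accumulated discounted return $\sum_t \gamma^t r_t$ for a hypothetical reward sequence; the reward values and discount factor are arbitrary choices, not values from the paper.

```python
# Toy illustration of the accumulated discounted return G = sum_t gamma^t * r_t.
# The reward sequence and gamma below are arbitrary, for demonstration only.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):      # accumulate from the last step backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))   # 1.0 + 0.99**2 * 2.0 = 2.9602
```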
2.2 Soft Actor-Critic
In the scope of this work, we will use Soft Actor-Critic (SAC) [9] to train the universal policy for the multi-task RL problem. SAC is an off-policy actor-critic method that uses the maximum entropy framework. The parameters in the SAC framework include the policy network $\pi(a_t|s_t)$ used in evaluation and the critic network $Q(s_t, a_t)$ as a soft Q-function. A temperature parameter $\alpha$ is used to maintain the entropy level of the policy. In multi-task learning, the one-hot id of the task skill and the goal are appended to the state. Different from single-task SAC, multiple tasks may have different learning dynamics. Therefore, we follow previous works [24] and assign a separate temperature $\alpha_\tau$ to each task with a different skill. The policy and critic optimization procedure remains the same as in the single-task setting.
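To make the per-task temperature concrete, here is a minimal sketch (not the reference implementation) of how a separate, learnable $\alpha_\tau$ per task could be maintained and updated with the standard SAC temperature objective; the target entropy, learning rate, and batch construction are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the reference code): one learnable temperature
# alpha_tau per task, updated with the usual SAC temperature objective.
import torch

num_tasks = 10
target_entropy = -4.0                                      # assumed target, e.g. -|A|
log_alphas = torch.nn.Parameter(torch.zeros(num_tasks))    # log alpha_tau for each task
alpha_opt = torch.optim.Adam([log_alphas], lr=3e-4)

def alpha_loss(task_ids: torch.Tensor, log_probs: torch.Tensor) -> torch.Tensor:
    """Each sample in the batch uses the temperature of its own task."""
    log_alpha = log_alphas[task_ids]                       # gather alpha_tau per sample
    return -(log_alpha * (log_probs + target_entropy).detach()).mean()

# dummy batch: per-sample task indices and policy log-probabilities
task_ids = torch.randint(0, num_tasks, (256,))
log_probs = torch.randn(256)
loss = alpha_loss(task_ids, log_probs)
alpha_opt.zero_grad(); loss.backward(); alpha_opt.step()
```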
3 Revisiting and Analyzing Multi-Task Reinforcement Learning
3.1 Multi-Task Reinforcement Learning Setting
Each single task can be defined by a unique MDP, and changes in the state space, action space, transition, or reward function can result in completely different tasks. In MTRL, instead of solving a single MDP, we solve a set of MDPs from a task family using a universal policy $\pi_\theta(a|s)$. The first assumption for MTRL is a universal shared state space $\mathcal{S}$, with each task having a disjoint state space $\mathcal{S}_\tau \subset \mathcal{S}$, where $\tau \in \mathcal{T}$ is any task from the full task distribution. In this way, the policy is able to recognize which task it is currently solving. Adding a one-hot encoding of the task id to the state is a common implementation for obtaining disjoint state spaces in experiments.
In the general MTRL setting, we do not have strict restrictions on which tasks are involved, but we assume that the tasks in the full task distribution share some similarities. In real applications, depending on how a task is defined, we can divide the problem into Multi-Goal MTRL and Multi-Skill MTRL. For the former, the task set is defined by various “goals” in the same environment. The reward function $r_\tau$ is different for each goal, but the state and transition remain the same. Typical examples of this multi-goal setting are locomotion tasks like Cheetah-Velocity/Direction³ and all kinds of goal-conditioned manipulation tasks [18]. For the latter, besides changes in goals within the same environment, the task set also involves different environments that share similar dynamics (transition functions). This happens more often in manipulation tasks, where different environments train different skills of a robot; one natural example is the Meta-World [36] benchmark, which includes multiple goal-conditioned manipulation tasks using the same robot arm. In this setting, the state space of different tasks changes across different skills since the robot is manipulating different objects (c.f. Figure 1). In both the multi-goal and multi-skill settings, we have to form the set of MDPs into a universal multi-task MDP and find a universal policy that works for all tasks. For multi-goal tasks, we need to append the “goal” information to the state; for multi-skill tasks, we need to append the “goal” (usually a position) as well as the “skill” (usually a one-hot encoding). After obtaining the state $\mathcal{S}_\tau$, the corresponding transition and reward $P_\tau, r_\tau$ can be defined accordingly.
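As a minimal sketch of this state construction, the snippet below appends the goal and a one-hot skill/task encoding to a raw observation. The dimensions and variable names are arbitrary placeholders for illustration, not the benchmark's actual sizes.

```python
# Minimal sketch of the augmented multi-task state: [observation, goal, one-hot task id].
# The dimensions below are arbitrary placeholders, not Meta-World's actual sizes.
import numpy as np

NUM_TASKS = 10

def augment_state(obs: np.ndarray, goal: np.ndarray, task_id: int) -> np.ndarray:
    """Build s_tau so the universal policy can tell which task it is solving."""
    one_hot = np.zeros(NUM_TASKS, dtype=obs.dtype)
    one_hot[task_id] = 1.0
    return np.concatenate([obs, goal, one_hot])

s = augment_state(np.zeros(12, dtype=np.float32), np.zeros(3, dtype=np.float32), task_id=4)
print(s.shape)   # (25,) = 12 + 3 + 10
```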
3.2 Challenges in Multi-Task Reinforcement Learning
Parameter-Sharing.
Multi-task learning aims to learn a single model that can be applied to a set
of different tasks. Sharing parameters allows us to take advantage of the similarities among tasks.
However, the gaps in content and difficulty between tasks raise challenges in deciding both
which tasks should share parameters and what parameters should be shared. Failure in this design
may result in a low success rate on certain tasks that could have been solved if trained separately. This
is a challenge in designing an effective structure to solve the MTRL task.
Multi-Task Training Stability.
Although we have assumed some similarity in the task sets used
for multi-task learning, conflicts between different skills may affect the whole training process [35].
Also, failures such as loss explosion in some tasks can severely affect the training of other tasks due to parameter sharing [24]. In multi-task training with a large number of tasks, the uncertainty of single-task training is amplified. This is a challenge in designing an algorithm that avoids the negative influence brought by parameter sharing among multiple tasks.
4 Parameter-Compositional Multi-Task RL
Motivated by the challenges in training universal policies for multiple tasks discussed in Section 3, we
will present a Parameter-Compositional approach to MTRL. The proposed approach is conceptually
simple, yet offers opportunities in addressing the MTRL challenges as detailed in the sequel.
4.1 Formulation
In this section, we describe how we formulate the parameter-compositional framework for MTRL.
Given a task $\tau \sim \mathcal{T}$, where $\mathcal{T}$ denotes the set of tasks with $|\mathcal{T}| = T$, we use $\theta_\tau \in \mathbb{R}^n$ to denote the vector of all the trainable parameters of the model (i.e., policy and critic networks) for task $\tau$. We employ the following decomposition for the task parameter vector $\theta_\tau$:
$$\theta_\tau = \Phi w_\tau, \qquad (1)$$
where $\Phi = [\phi_1, \phi_2, \cdots, \phi_i, \cdots, \phi_K] \in \mathbb{R}^{n \times K}$ denotes a matrix formed by a set of $K$ parameter vectors $\{\phi_i\}_{i=1}^{K}$ (referred to as the parameter set, a term also overloaded to refer to $\Phi$), each of which has the same dimensionality as $\theta_\tau$, i.e., $\phi_i \in \mathbb{R}^n$. $w_\tau \in \mathbb{R}^K$ is a compositional vector, which is implemented as a trainable embedding vector for the task index $\tau$. We refer to a model with parameters in the form of Eqn.(1) as a parameter-compositional model.
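To make Eqn.(1) concrete, the following is a minimal sketch (not the authors' implementation) that composes the flat parameter vector of a tiny one-layer policy as $\theta_\tau = \Phi w_\tau$; the network size, initialization scale, and shapes are illustrative assumptions.

```python
# Minimal sketch of the parameter-compositional model in Eqn. (1): theta_tau = Phi @ w_tau.
# The one-layer "policy" and all sizes here are illustrative assumptions.
import torch

state_dim, action_dim = 4, 2
n = state_dim * action_dim + action_dim      # number of parameters of the tiny policy
K, T = 5, 10                                 # size of the parameter set and number of tasks

Phi = torch.nn.Parameter(0.1 * torch.randn(n, K))   # shared parameter set [phi_1, ..., phi_K]
W = torch.nn.Parameter(0.1 * torch.randn(K, T))     # column tau is the compositional vector w_tau

def policy_mean(state: torch.Tensor, task_id: int) -> torch.Tensor:
    """Compose task-specific parameters, then apply them as a linear policy head."""
    theta_tau = Phi @ W[:, task_id]                         # Eqn. (1)
    weight = theta_tau[: state_dim * action_dim].reshape(action_dim, state_dim)
    bias = theta_tau[state_dim * action_dim:]
    return state @ weight.T + bias

a = policy_mean(torch.randn(3, state_dim), task_id=7)      # batch of 3 states for task 7
print(a.shape)                                              # torch.Size([3, 2])
```

In the full method, the same composition would apply to all trainable policy and critic parameters rather than to a single linear layer as in this toy sketch.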
³By Cheetah/Ant-Velocity/Direction, we refer to the tasks that have the same dynamics as the standard locomotion tasks but with a goal of running at a specific velocity or in a specific direction.
Figure 2: Parameter-Compositional method (PaCo) for multi-task reinforcement learning. In this framework, the network parameter vector $\theta_\tau$ for a task $\tau$ is instantiated in a compositional form based on the shared base parameter set $\Phi$ and the task-specific compositional vector $w_\tau$. Then the networks are used in the standard way for generating actions or computing the loss [9]. During training, $\Phi$ will be impacted by all the task losses, while $w_\tau$ is impacted by the corresponding task loss only.
In the presence of a single task, the decomposition in Eqn.(1) brings no additional benefits, as it is essentially equivalent to the standard way of parameterizing the model. However, when faced with multiple tasks, as in the MTRL setting considered in this work, the decomposition in Eqn.(1) offers opportunities for tackling the challenges posed by the MTRL setting. More concretely, since Eqn.(1) decomposes the parameters into two parts, i) the task-agnostic $\Phi$ and ii) the task-aware $w_\tau$, we can share the task-agnostic $\Phi$ across all the tasks while still ensuring task awareness via $w_\tau$, leading to:
$$[\theta_1, \cdots, \theta_\tau, \cdots, \theta_T] = \Phi\,[w_1, \cdots, w_\tau, \cdots, w_T], \qquad \text{i.e.,}\ \ \Theta = \Phi W. \qquad (2)$$
For MTRL, let $J_\tau(\theta_\tau)$ denote the summation of both actor and critic losses, implemented in the same way as in SAC [9], for task $\tau$. The multi-task loss is defined as the summation of the individual losses $J_\tau$ across tasks: $J_\Theta \triangleq \sum_\tau J_\tau(\theta_\tau)$, where $\Theta$ denotes the collection of all the trainable parameters of both actor and critic networks. Together with Eqn.(2), it can be observed that the multi-task loss $J_\Theta$ contributes to the learning of the model parameters in two ways (see the sketch following this list):
• $\partial J_\Theta / \partial \Phi = \sum_\tau \partial J_\tau / \partial \Phi$: all the $T$ tasks contribute to the learning of the shared parameter set $\Phi$;
• $\partial J_\Theta / \partial W = \sum_\tau \partial J_\tau / \partial w_\tau$: as for the training of the task-specific compositional vectors, each task loss $J_\tau$ impacts only its own task-specific compositional vector $w_\tau$.
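The following toy sketch (with quadratic stand-in losses, not the SAC objectives) illustrates this gradient structure: the shared $\Phi$ receives gradients from every active task loss, while each column $w_\tau$ of $W$ receives a gradient only from its own task loss.

```python
# Toy illustration of the gradient routing implied by Theta = Phi @ W (Eqn. (2)).
# The quadratic per-task "losses" are stand-ins for the actual SAC losses J_tau.
import torch

n, K, T = 6, 3, 4
Phi = torch.nn.Parameter(torch.randn(n, K))
W = torch.nn.Parameter(torch.randn(K, T))

Theta = Phi @ W                              # column tau is theta_tau = Phi @ w_tau
per_task_loss = (Theta ** 2).sum(dim=0)      # stand-in for J_tau(theta_tau), one per task
J = per_task_loss[:2].sum()                  # suppose only tasks 0 and 1 contribute this step
J.backward()

print(W.grad[:, :2].abs().sum() > 0)         # True: w_0, w_1 get gradients from their own losses
print(W.grad[:, 2:].abs().sum() == 0)        # True: w_2, w_3 are untouched by other tasks' losses
print(Phi.grad.abs().sum() > 0)              # True: the shared Phi is shaped by all active tasks
```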
The PaCo framework is illustrated in Figure 2. Additional implementation details about PaCo
are provided in Appendix A.2. The proposed approach has several attractive properties towards
addressing the MTRL challenges discussed earlier:
• the compositional form of parameters as in Eqn.(2) offers flexible parameter sharing between tasks, by learning the appropriate compositional vectors for each task over the shared parameter set;
• because of the clear separation between task-specific and task-agnostic parameters, it also offers a natural solution for improving the stability of MTRL training, as detailed in the sequel.
In addition, the separation between task-specific and task-agnostic information has other benefits that are beyond the scope of the current work. For example, the task-agnostic parameter set $\Phi$ could be reused as a pre-trained policy basis in some transfer scenarios (initial attempts in Appendix A.3).
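A hypothetical sketch of that transfer idea, under the assumption that one simply freezes the learned parameter set and trains only a new compositional vector for an unseen task, could look as follows; the loss is a stand-in, not an RL objective from the paper.

```python
# Hypothetical sketch: reuse a pretrained, frozen Phi as a policy basis and learn
# only a new compositional vector w_new for an unseen task.
import torch

n, K = 6, 3
Phi = torch.randn(n, K).requires_grad_(False)        # pretrained, task-agnostic basis (frozen)
w_new = torch.nn.Parameter(0.1 * torch.randn(K))     # only this vector is trained for the new task
opt = torch.optim.Adam([w_new], lr=1e-3)

theta_new = Phi @ w_new                              # composed parameters for the new task
loss = (theta_new ** 2).sum()                        # stand-in for the new task's training loss
opt.zero_grad(); loss.backward(); opt.step()         # updates w_new only; Phi stays fixed
```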
4.2 Stable Multi-Task Reinforcement Learning
One inherent challenge in MTRL is the interference during training among tasks due to parameter sharing. One consequence of this is that the failure of training on one task may adversely impact the training of other tasks [28, 35]. For example, it has been empirically observed that some task losses may explode during training on Meta-World [24], which will contribute a significant portion in