SIMPLE EMERGENT ACTION REPRESENTATIONS FROM
MULTI-TASK POLICY TRAINING
Pu Hua1,4, Yubei Chen2, Huazhe Xu1,3,4
1Tsinghua University, 2Center for Data Science, New York University, 3Shanghai AI Lab,
4Shanghai Qi Zhi Institute
ABSTRACT
The low-level sensory and motor signals in deep reinforcement learning, which
exist in high-dimensional spaces such as image observations or motor torques,
are inherently challenging to understand or utilize directly for downstream tasks.
While sensory representations have been extensively studied, the representations
of motor actions are still an area of active exploration. Our work reveals that
a space containing meaningful action representations emerges when a multi-task
policy network takes as inputs both states and task embeddings. Moderate con-
straints are added to improve its representation ability. Therefore, interpolated
or composed embeddings can function as a high-level interface within this space,
providing instructions to the agent for executing meaningful action sequences.
Empirical results demonstrate that the proposed action representations are effec-
tive for intra-action interpolation and inter-action composition with limited or
no additional learning. Furthermore, our approach exhibits superior task adap-
tation ability compared to strong baselines in Mujoco locomotion tasks. Our work
sheds light on the promising direction of learning action representations for effi-
cient, adaptable, and composable RL, forming the basis of abstract action planning
and the understanding of motor signal space. Project page: https://sites.
google.com/view/emergent-action-representation/
1 INTRODUCTION
Deep reinforcement learning (RL) has shown great success in learning near-optimal policies for
performing low-level actions with pre-defined reward functions. However, reusing this learned
knowledge to efficiently accomplish new tasks remains challenging. In contrast, humans naturally
summarize low-level muscle movements into high-level action representations, such as “pick up”
or “turn left”, which can be reused in novel tasks with slight modifications. As a result, we carry
out the most complicated movements without thinking about the detailed joint motions or muscle
contractions, relying instead on high-level action representations (Kandel et al., 2021). By analogy
with such abilities of humans, we ask the question: can RL agents have action representations of
low-level motor controls, which can be reused, modified, or composed to perform new tasks?
As pointed out in Kandel et al. (2021), “the task of the motor systems is the reverse of the task
of the sensory systems. Sensory processing generates an internal representation in the brain of the
outside world or of the state of the body. Motor processing begins with an internal representation:
the desired purpose of movement.” In the past decade, representation learning has made significant
progress in representing high-dimensional sensory signals, such as images and audio, to reveal the
geometric and semantic structures hidden in raw signals (Bengio et al., 2013; Chen et al., 2018;
Kornblith et al., 2019; Chen et al., 2020; Baevski et al., 2020; Radford et al., 2021; Bardes et al.,
2021; Bommasani et al., 2021; He et al., 2022; Chen et al., 2022). With the generalization ability of
sensory representation learning, downstream control tasks can be accomplished efficiently, as shown
by recent studies (Nair et al., 2022; Xiao et al., 2022; Yuan et al., 2022). While there have been sig-
nificant advances in sensory representation learning, action representation learning remains largely
unexplored. To address this gap, we aim to investigate the topic and discover generalizable action
representations that can be reused or efficiently adapted to perform new tasks. An important concept
*Denotes equal contributions.
in sensory representation learning is pretraining with a comprehensive task or set of tasks, followed
by reusing the resulting latent representation. We plan to extend this approach to action represen-
tation learning and explore its potential for enhancing the efficiency and adaptability of reinforce-
ment learning agents. We propose a multi-task policy network that enables a set of tasks to share
the same latent action representation space. Further, the time-variant sensory representations and
time-invariant action representations are decoupled and then concatenated as the sensory-action rep-
resentations, which are finally transformed by a policy network to form the low-level action control.
Surprisingly, when trained on a comprehensive set of tasks, this simple structure learns an emergent
self-organized action representation that can be reused for various downstream tasks. In particular,
we demonstrate the efficacy of this representation in Mujoco locomotion environments, showing
zero-shot interpolation/composition and few-shot task adaptation in the representation space, out-
performing strong meta RL baselines. Additionally, we find that the decoupled time-variant sensory
representation exhibits equivariant properties. The evidence elucidates that reusable and generaliz-
able action representations may lead to efficient, adaptable, and composable RL, thus forming the
basis of abstract action planning and understanding motor signal space. The primary contributions
in this work are listed as follows:
1. We put forward the idea of leveraging emergent action representations from multi-task
learners to better understand motor action space and accomplish task generalization.
2. We decouple the state-related and task-related information of the sensory-action represen-
tations and reuse them to conduct action planning more efficiently.
3. Our approach is a strong adapter, which achieves higher rewards with fewer steps than
strong meta RL baselines when adapting to new tasks.
4. Our approach supports intra-action interpolation as well as inter-action composition by
modifying and composing the learned action representations.
Next, we begin our technical discussion right below and defer the discussion of related literature to the end of the paper.
2 PRELIMINARIES
Soft Actor-Critic. In this paper, our approach is built on Soft Actor-Critic (SAC) (Haarnoja et al.,
2018). SAC is a stable off-policy actor-critic algorithm based on the maximum entropy reinforce-
ment learning framework, in which the actor maximizes both the returns and the entropy. We leave
more details of SAC in Appendix A.
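For reference, SAC maximizes the expected return augmented with a policy-entropy bonus; a standard way to write this maximum-entropy objective (the exact objectives used in this work are given in Appendix A) is

    J(π) = Σ_t E_{(s_t, a_t) ∼ ρ_π} [ r(s_t, a_t) + α H(π(· | s_t)) ],

where α is a temperature coefficient that trades off return against policy entropy H.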
Task Distribution. We assume the tasks that the agent may meet are drawn from a pre-defined task distribution p(T). Each task in p(T) corresponds to a Markov Decision Process (MDP). Therefore, a task T can be defined by a tuple (S, A, P, p_0, R), in which S and A are respectively the state and action spaces, P the transition probability, p_0 the initial state distribution, and R the reward function.
The concept of a task distribution is frequently employed in meta RL problems, but we extend it slightly to better match the setting of this work. We divide task distributions into two main categories: “uni-modal” task distributions and “multi-modal” task distributions. Concretely, the two scenarios are defined as follows:
Definition 1 (Uni-modal task distribution): In a uni-modal task distribution, there is only one
modality among all the tasks in the task distribution. For example, in HalfCheetah-Vel, a Mujoco
locomotion environment, we train the agent to run at different target velocities. Therefore, running
is the only modality in this task distribution.
Definition 2 (Multi-modal task distribution): In contrast to a uni-modal task distribution, a multi-modal task distribution contains multiple modalities; it includes tasks from several different uni-modal task distributions. For instance, we design a multi-modal task distribution called HalfCheetah-Run-Jump, which contains two modalities, HalfCheetah-BackVel and HalfCheetah-BackJump. The former has been defined above, and the latter contains tasks that train the agent to jump with different reward weights. In our implementation, we actually train four motions in this environment: running, walking, jumping, and standing. More details are given in Section 4 and Appendix B.1.
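To make the distinction concrete, a training task set can be thought of as a list of (modality, parameter) pairs. The sketch below is purely illustrative; the task names follow the environments above, but the parameter values are our own placeholders, not the configuration used in the paper (see Appendix B.1).

    # Illustrative only: uni-modal vs. multi-modal training task sets.
    # Parameter values are placeholders, not the paper's actual settings.

    # Uni-modal (HalfCheetah-Vel): a single behavior, varying target velocity.
    uni_modal_tasks = [{"modality": "run", "target_vel": v}
                       for v in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]

    # Multi-modal (HalfCheetah-Run-Jump): several behaviors, each with its own parameters.
    multi_modal_tasks = (
        [{"modality": "run",   "target_vel": v}    for v in (1.0, 2.0, 3.0)]
        + [{"modality": "walk",  "target_vel": v}    for v in (0.3, 0.6)]
        + [{"modality": "jump",  "reward_weight": w} for w in (0.5, 1.0)]
        + [{"modality": "stand", "reward_weight": w} for w in (1.0,)]
    )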
Figure 1: Emergent action representations from multi-task training. The sensory information
and task information are encoded separately. When both are concatenated, an action decoder de-
codes them into a low-level action.
3 EMERGENT ACTION REPRESENTATIONS FROM MULTI-TASK TRAINING
In this section, we first introduce the sensory-action decoupled policy network architecture. Next,
we discuss the multitask policy training details, along with the additional constraints to the task
embedding for the emergence of action representations. Lastly, we demonstrate the emergence of
action representations through various phenomena and applications.
3.1 MULTITASK POLICY NETWORK AND TRAINING
Decoupled embedding and concatenated decoding. An abstract high-level task, e.g., “move forward”, typically changes more slowly than the transient sensory states. As a simplification, we decouple the latent representation into a time-variant sensory embedding Z_{s_t} and a time-invariant task embedding Z_T, as shown in Figure 1. These embeddings are concatenated to form a sensory-action embedding Z_A(s_t, T) = [Z_{s_t}, Z_T], which is transformed by the policy network (action decoder) ψ to output a low-level action distribution p(a_t) = ψ(a_t | Z_{s_t}, Z_T), e.g., over motor torques. The action decoder ψ is a multi-layer perceptron (MLP) that outputs a Gaussian distribution over the low-level action space A.
Latent sensory embedding (LSE). The low-level sensory state information is encoded by an MLP state encoder φ into a latent sensory embedding Z_{s_t} = φ(s_t) ∈ R^m. It includes the proprioceptive information of each time step. The LSE is time-variant in an RL trajectory, and the state encoder is shared among different tasks. We use LSE and sensory representation interchangeably in this paper.
Latent task embedding (LTE). A latent task embedding Z_T ∈ R^d encodes the time-invariant knowledge of a specific task. Suppose we train N different tasks; their embeddings form an LTE set {Z_{T_N}}. These N different tasks share the same state encoder φ and action decoder ψ; in other words, these N tasks share the same policy network interface, except that their task embeddings differ. For implementation, we adopt a fully-connected encoder, which takes as input the one-hot encodings of the different training tasks, to initialize the set {Z_{T_N}}. This task encoder is learnable during training.
After training, the LTE interface can be reused as a high-level action interface. Hence, we use LTE
and action representation interchangeably in this paper.
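A minimal sketch of this decoupled architecture in PyTorch is given below. The layer widths, the embedding dimensions m and d, and the log-std clamping are our assumptions for illustration, not the paper's reported hyperparameters; the tanh squashing used by SAC is omitted for brevity.

    import torch
    import torch.nn as nn

    class EARPolicy(nn.Module):
        """Sketch: state encoder phi, one-hot task encoder (-> LTE), action decoder psi."""

        def __init__(self, state_dim, action_dim, num_tasks, m=64, d=8, hidden=256):
            super().__init__()
            # phi: low-level state s_t -> time-variant latent sensory embedding Z_st (LSE)
            self.state_encoder = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, m))
            # one-hot task id -> time-invariant latent task embedding Z_T (LTE), learnable
            self.task_encoder = nn.Linear(num_tasks, d, bias=False)
            # psi: concatenated [Z_st, Z_T] -> Gaussian over low-level actions (e.g., torques)
            self.action_decoder = nn.Sequential(
                nn.Linear(m + d, hidden), nn.ReLU(), nn.Linear(hidden, 2 * action_dim))

        def forward(self, state, task_onehot):
            z_s = self.state_encoder(state)                                         # LSE
            z_t = nn.functional.normalize(self.task_encoder(task_onehot), dim=-1)   # LTE on the unit sphere
            mu, log_std = self.action_decoder(torch.cat([z_s, z_t], dim=-1)).chunk(2, dim=-1)
            return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

At test time, the same decoder can be conditioned on any vector in the LTE space, which is what enables the interpolation, composition, and adaptation experiments described below.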
Training of the multi-task policy networks. A detailed description of the multi-task training is given in Algorithm 1. When computing objectives and their gradients, we use the policy π, parameterized by ω, to denote all the parameters in the state encoder, the action decoder, and {Z_{T_N}}. The overall training procedure is based on SAC. The only difference is that the policy network and the Q networks additionally take as input the LTE Z_T and a task label, respectively. During training,
we also apply two techniques to constrain this space: 1) we normalize the LTEs so that they lie on a hypersphere; 2) we inject random noise into the LTEs to enhance the smoothness of the space. An ablation study on these two constraints is included in Appendix B.7.
Algorithm 1 Multi-task Training
Input: training task set {T_N} ∼ p(T), θ_1, θ_2, ω
θ̄_1 ← θ_1, θ̄_2 ← θ_2, B ← ∅
Initialize the LTE set {Z_{T_N}} for {T_N}
for each pre-train epoch do
    for T_i in {T_N} do
        Sample a batch B_i of multi-task RL transitions with π_ω
        B ← B ∪ B_i
    end for
end for
for each train epoch do
    Sample an RL batch b ∼ B
    for all transition data in b do
        Z_{s_t} = φ(s_t)
        Z̃_{T_i} = normalize(Z_{T_i} + n), where n ∼ N(0, σ²)
        Sample action a_t ∼ ψ(· | Z_{s_t}, Z̃_{T_i}) for computing the SAC objectives
    end for
    for each optimization step do
        Compute the SAC objectives J(α), J_π(ω), J_Q(θ) with b based on Equations 2, 3, and 4
        Update the SAC parameters
    end for
end for
Output: the trained state encoder φ, action decoder ψ, and set of LTEs {Z_{T_N}}
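The smoothing constraint inside the training loop amounts to perturbing each LTE with Gaussian noise and projecting it back onto the unit hypersphere before sampling actions. A small sketch, where the noise scale σ is a placeholder rather than the paper's value:

    import torch
    import torch.nn.functional as F

    def perturb_lte(z_t, sigma=0.1):
        """Z~_T = normalize(Z_T + n), n ~ N(0, sigma^2 I): noisy LTE projected back onto the unit sphere."""
        noise = sigma * torch.randn_like(z_t)
        return F.normalize(z_t + noise, dim=-1)

    # During training, actions are then sampled as a_t ~ psi(. | Z_st, perturb_lte(Z_T)).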
3.2 THE EMERGENCE OF ACTION REPRESENTATION
After we train the multi-task policy network with a comprehensive set of tasks, where the LTE
vectors in {Z_{T_N}} share the same embedding space, we find that {Z_{T_N}} self-organizes into a ge-
ometrically and semantically meaningful structure. Tasks with the same modality are embedded
in a continuous fashion, which facilitates intra-task interpolation. Surprisingly, the composition
of task embeddings from different modalities leads to novel tasks, e.g., “run” + “jump” = “jump
run”. Further, the action representation can be used for efficient task adaptation. Visualization also
reveals interesting geometric structures in task embedding and sensory representation spaces. In
this subsection, we dive into these intriguing phenomena, demonstrating the emergence of action
representation and showing the generalization of the emergent action representation.
Task interpolation & composition. After training the RL agent to accomplish multiple tasks, we select two pre-trained tasks and generate a new LTE by linearly combining the LTEs of the two chosen tasks. The newly-generated task embedding is expected to guide the agent to perform a different task. The generated LTE is defined by:

    Z′ = f(β Z_{T_i} + (1 − β) Z_{T_j}),    (1)

where i, j are the indices of the selected tasks and Z_{T_i}, Z_{T_j} are their corresponding LTEs. β is a hyperparameter ranging in (0, 1). The function f(·) is a regularization function related to the pre-defined structure of the LTE space; in this paper, f(·) is a normalization function that rescales the result of the combination onto the unit sphere.
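In code, Equation 1 is simply a renormalized convex combination of two learned LTEs; a sketch, assuming the LTEs are stored as unit-norm torch vectors:

    import torch
    import torch.nn.functional as F

    def mix_ltes(z_i, z_j, beta):
        """Eq. 1 sketch: Z' = f(beta * Z_Ti + (1 - beta) * Z_Tj), with f projecting onto the unit sphere."""
        return F.normalize(beta * z_i + (1.0 - beta) * z_j, dim=-1)

    # Intra-action interpolation: z_i, z_j come from the same modality (e.g., running at two speeds).
    # Inter-action composition:   z_i, z_j come from different modalities (e.g., "run" and "jump").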
A new task is interpolated by applying the aforementioned operation to the LTEs of tasks sampled from a uni-modal distribution. The interpolated task usually has the same semantic meaning as the source tasks while differing in specific parameters, e.g., running at a different speed. A new task is composed by applying the same operation to tasks sampled from a multi-modal distribution. The newly composed task usually lies in a new modality between the source tasks. For example, when we compose “run” and “jump” together, we obtain a combination of the two: an agent running while trying to jump.
Efficient adaptation. We find that an agent trained with the multi-task policy network can adapt to unseen tasks quickly by optimizing only the LTEs. This shows that the LTEs learn a general pattern of the overall task distribution. When given a new task after pre-training, the agent explores the LTE space to find a suitable LTE for the task. Specifically, we perform the gradient-free cross-entropy method (CEM) (De Boer et al., 2005) in the LTE space to accomplish the desired task. A detailed description can be found in Algorithm 2.
Geometric structures of the LTEs and LSEs. We then explore what the sensory representation space and the action representation space look like after multi-task training. To understand their geometric structures, we visualize the LSEs and LTEs. The detailed results of our analysis are presented in Section 4.5.
Algorithm 2 Adaptation via LTE Optimization
Input: adaptation task T ∼ p(T), φ, ψ, capacity of the elite set m, number of samples n
Initialize the elite set Z_e with m randomly sampled LTEs from the LTE space
for each adaptation epoch do
    Initialize the overall test set Z ← ∅
    for Z_i in Z_e do
        Sample n LTEs Z_{i1}, ..., Z_{in} near Z_i
        Z ← Z ∪ {Z_i, Z_{i1}, ..., Z_{in}}
    end for
    for Z_j in Z do
        while not done do
            Z_{s_t} = φ(s_t)
            a_t ∼ ψ(· | Z_{s_t}, Z_j)
            r_t = R(s_t, a_t | T)
            s_{t+1} ∼ p(s_{t+1} | s_t, a_t)
        end while
    end for
    Sort the task embeddings in Z by the cumulative reward of their trajectories
    Select the top m LTEs in Z to update Z_e
end for
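A rough sketch of this CEM-style search over the LTE space follows. The sampling noise scale, the number of epochs, and the rollout helper evaluate_return (assumed to run one episode with the frozen encoder/decoder conditioned on a candidate LTE and return its cumulative reward) are illustrative assumptions, not the paper's settings.

    import torch
    import torch.nn.functional as F

    def adapt_lte_cem(evaluate_return, d, m=5, n=10, epochs=20, noise=0.2):
        """Sketch of Algorithm 2: gradient-free elite search for an LTE that solves a new task."""
        # Elite set Z_e: m random unit-norm LTEs.
        elite = F.normalize(torch.randn(m, d), dim=-1)
        for _ in range(epochs):
            # Sample n candidates near each elite LTE and keep the elites themselves.
            perturbed = elite.repeat_interleave(n, dim=0) + noise * torch.randn(m * n, d)
            candidates = torch.cat([elite, F.normalize(perturbed, dim=-1)], dim=0)
            # Roll out the frozen policy once per candidate and score by cumulative reward.
            returns = torch.tensor([evaluate_return(z) for z in candidates])
            elite = candidates[returns.topk(m).indices]
        return elite[0]  # highest-scoring LTE found for the adaptation task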
4 EXPERIMENTS
In this section, we first demonstrate the training process and performance of the multi-task policy
network. Then, we use the LTEs as a high-level action interface to instruct the agents to perform
unseen skills through interpolation without any training. After that, we conduct experiments to
evaluate the effectiveness of the LTEs in task adaptation. Lastly, we visualize the LSEs and LTEs
to further understand the structure of the state and action representation. We use emergent action
representation (EAR) to refer to the policy using the LTEs.
4.1 EXPERIMENTAL SETUPS
Environments. We evaluate our method on five locomotion control environments (HalfCheetah-
Vel, Ant-Dir, Hopper-Vel, Walker-Vel, HalfCheetah-Run-Jump) based on OpenAI Gym and the
Mujoco simulator. Detailed descriptions of these RL benchmarks are listed in Appendix B.1. Be-
yond the locomotion domain, we also conduct a simple test of our method in the domain of robot manipulation; detailed results are provided in Appendix B.6.
Baselines. We compare EAR-SAC, the emergent-action-representation-based SAC, with several multi-task RL and meta RL baselines. For multi-task RL baselines, we use multi-head multi-task
SAC (MHMT-SAC) and one-hot embedding SAC (OHE-SAC; for ablation). For meta RL baselines,
we use MAML (Finn et al., 2017) and PEARL (Rakelly et al., 2019). Detailed descriptions of these
baselines are listed in Appendix B.2.