SIMPLE EMERGENT ACTION REPRESENTATIONS FROM
MULTI-TASK POLICY TRAINING
Pu Hua1,4, Yubei Chen2, Huazhe Xu1,3,4
1Tsinghua University, 2Center for Data Science, New York University, 3Shanghai AI Lab,
4Shanghai Qi Zhi Institute
ABSTRACT
The low-level sensory and motor signals in deep reinforcement learning, which
exist in high-dimensional spaces such as image observations or motor torques,
are inherently challenging to understand or utilize directly for downstream tasks.
While sensory representations have been extensively studied, the representations
of motor actions are still an area of active exploration. Our work reveals that
a space containing meaningful action representations emerges when a multi-task
policy network takes as inputs both states and task embeddings. Moderate con-
straints are added to improve its representation ability. Therefore, interpolated
or composed embeddings can function as a high-level interface within this space,
providing instructions to the agent for executing meaningful action sequences.
Empirical results demonstrate that the proposed action representations are effec-
tive for intra-action interpolation and inter-action composition with limited or
no additional learning. Furthermore, our approach exhibits superior task adap-
tation ability compared to strong baselines in Mujoco locomotion tasks. Our work
sheds light on the promising direction of learning action representations for effi-
cient, adaptable, and composable RL, forming the basis of abstract action planning
and the understanding of motor signal space. Project page: https://sites.
google.com/view/emergent-action-representation/
1 INTRODUCTION
Deep reinforcement learning (RL) has shown great success in learning near-optimal policies for
performing low-level actions with pre-defined reward functions. However, reusing this learned
knowledge to efficiently accomplish new tasks remains challenging. In contrast, humans naturally
summarize low-level muscle movements into high-level action representations, such as “pick up”
or “turn left”, which can be reused in novel tasks with slight modifications. As a result, we carry
out the most complicated movements without thinking about the detailed joint motions or muscle
contractions, relying instead on high-level action representations (Kandel et al., 2021). By analogy
with such abilities of humans, we ask the question: can RL agents have action representations of
low-level motor controls, which can be reused, modified, or composed to perform new tasks?
As pointed out in Kandel et al. (2021), “the task of the motor systems is the reverse of the task
of the sensory systems. Sensory processing generates an internal representation in the brain of the
outside world or of the state of the body. Motor processing begins with an internal representation:
the desired purpose of movement.” In the past decade, representation learning has made significant
progress in representing high-dimensional sensory signals, such as images and audio, to reveal the
geometric and semantic structures hidden in raw signals (Bengio et al., 2013; Chen et al., 2018;
Kornblith et al., 2019; Chen et al., 2020; Baevski et al., 2020; Radford et al., 2021; Bardes et al.,
2021; Bommasani et al., 2021; He et al., 2022; Chen et al., 2022). With the generalization ability of
sensory representation learning, downstream control tasks can be accomplished efficiently, as shown
by recent studies (Nair et al., 2022; Xiao et al., 2022; Yuan et al., 2022). While there have been sig-
nificant advances in sensory representation learning, action representation learning remains largely
unexplored. To address this gap, we aim to investigate the topic and discover generalizable action
representations that can be reused or efficiently adapted to perform new tasks. An important concept
*Denotes equal contributions.
in sensory representation learning is pretraining with a comprehensive task or set of tasks, followed
by reusing the resulting latent representation. We plan to extend this approach to action represen-
tation learning and explore its potential for enhancing the efficiency and adaptability of reinforce-
ment learning agents. We propose a multi-task policy network that enables a set of tasks to share
the same latent action representation space. Further, the time-variant sensory representations and
time-invariant action representations are decoupled and then concatenated as the sensory-action rep-
resentations, which are finally transformed by a policy network to form the low-level action control.
Surprisingly, when trained on a comprehensive set of tasks, this simple structure learns an emergent
self-organized action representation that can be reused for various downstream tasks. In particular,
we demonstrate the efficacy of this representation in Mujoco locomotion environments, showing
zero-shot interpolation/composition and few-shot task adaptation in the representation space, out-
performing strong meta RL baselines. Additionally, we find that the decoupled time-variant sensory
representation exhibits equivariant properties. The evidence elucidates that reusable and generaliz-
able action representations may lead to efficient, adaptable, and composable RL, thus forming the
basis of abstract action planning and understanding motor signal space. The primary contributions
in this work are listed as follows:
1. We put forward the idea of leveraging emergent action representations from multi-task
learners to better understand motor action space and accomplish task generalization.
2. We decouple the state-related and task-related information of the sensory-action represen-
tations and reuse them to conduct action planning more efficiently.
3. Our approach is a strong adapter, which achieves higher rewards with fewer steps than
strong meta RL baselines when adapting to new tasks.
4. Our approach supports intra-action interpolation as well as inter-action composition by
modifying and composing the learned action representations.
Next, we begin our technical discussion right below and defer the discussion of related literature to the end of the paper.
2 PRELIMINARIES
Soft Actor-Critic. In this paper, our approach is built on Soft Actor-Critic (SAC) (Haarnoja et al.,
2018). SAC is a stable off-policy actor-critic algorithm based on the maximum entropy reinforce-
ment learning framework, in which the actor maximizes both the returns and the entropy. We leave
more details of SAC in Appendix A.
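For reference, SAC maximizes the expected return augmented with a policy-entropy bonus; a standard way to write this maximum-entropy objective (the exact objectives used in this work are given in Appendix A) is

    J(π) = Σ_t E_{(s_t, a_t) ∼ ρ_π} [ r(s_t, a_t) + α H(π(· | s_t)) ],

where α is a temperature coefficient that trades off return against policy entropy H.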
Task Distribution. We assume the tasks that the agent may meet are drawn from a pre-defined task distribution p(T). Each task in p(T) corresponds to a Markov Decision Process (MDP). Therefore, a task T can be defined by a tuple (S, A, P, p_0, R), in which S and A are respectively the state and action spaces, P the transition probability, p_0 the initial state distribution, and R the reward function.
The concept of a task distribution is frequently employed in meta RL problems, but we extend it slightly to better match the setting of this work. We divide task distributions into two main categories: “uni-modal” task distributions and “multi-modal” task distributions. Concretely, the two scenarios are defined as follows:
Definition 1 (Uni-modal task distribution): In a uni-modal task distribution, there is only one
modality among all the tasks in the task distribution. For example, in HalfCheetah-Vel, a Mujoco
locomotion environment, we train the agent to run at different target velocities. Therefore, running
is the only modality in this task distribution.
Definition 2 (Multi-modal task distribution): In contrast to a uni-modal task distribution, a multi-modal task distribution contains multiple modalities; it includes tasks from several different uni-modal task distributions. For instance, we design a multi-modal task distribution called HalfCheetah-Run-Jump, which contains two modalities, HalfCheetah-BackVel and HalfCheetah-BackJump. The former has been defined above, and the latter contains tasks that train the agent to jump with different reward weights. In our implementation, we actually train four motions in this environment: running, walking, jumping, and standing. More details are given in Section 4 and Appendix B.1.
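To make the distinction concrete, a training task set can be thought of as a list of (modality, parameter) pairs. The sketch below is purely illustrative; the task names follow the environments above, but the parameter values are our own placeholders, not the configuration used in the paper (see Appendix B.1).

    # Illustrative only: uni-modal vs. multi-modal training task sets.
    # Parameter values are placeholders, not the paper's actual settings.

    # Uni-modal (HalfCheetah-Vel): a single behavior, varying target velocity.
    uni_modal_tasks = [{"modality": "run", "target_vel": v}
                       for v in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]

    # Multi-modal (HalfCheetah-Run-Jump): several behaviors, each with its own parameters.
    multi_modal_tasks = (
        [{"modality": "run",   "target_vel": v}    for v in (1.0, 2.0, 3.0)]
        + [{"modality": "walk",  "target_vel": v}    for v in (0.3, 0.6)]
        + [{"modality": "jump",  "reward_weight": w} for w in (0.5, 1.0)]
        + [{"modality": "stand", "reward_weight": w} for w in (1.0,)]
    )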
Figure 1: Emergent action representations from multi-task training. The sensory information
and task information are encoded separately. When both are concatenated, an action decoder de-
codes them into a low-level action.
3 EMERGENT ACTION REPRESENTATIONS FROM MULTI-TASK TRAINING
In this section, we first introduce the sensory-action decoupled policy network architecture. Next,
we discuss the multitask policy training details, along with the additional constraints to the task
embedding for the emergence of action representations. Lastly, we demonstrate the emergence of
action representations through various phenomena and applications.
3.1 MULTITASK POLICY NETWORK AND TRAINING
Decoupled embedding and concatenated decoding. An abstract high-level task, e.g., “move forward”, typically changes more slowly than the transient sensory states. As a simplification, we decouple the latent representation into a time-variant sensory embedding Z_{s_t} and a time-invariant task embedding Z_T, as shown in Figure 1. These embeddings are concatenated to form a sensory-action embedding Z_A(s_t, T) = [Z_{s_t}, Z_T], which is transformed by the policy network (action decoder) ψ to output a low-level action distribution p(a_t) = ψ(a_t | Z_{s_t}, Z_T), e.g., over motor torques. The action decoder ψ is a multi-layer perceptron (MLP) that outputs a Gaussian distribution over the low-level action space A.
Latent sensory embedding (LSE). The low-level sensory state information is encoded by an MLP state encoder φ into a latent sensory embedding Z_{s_t} = φ(s_t) ∈ R^m. It includes the proprioceptive information of each time step. The LSE is time-variant in an RL trajectory, and the state encoder is shared among different tasks. We use LSE and sensory representation interchangeably in this paper.
Latent task embedding (LTE). A latent task embedding Z_T ∈ R^d encodes the time-invariant knowledge of a specific task. Suppose we train N different tasks; their embeddings form an LTE set {Z_{T_N}}. These N different tasks share the same state encoder φ and action decoder ψ; in other words, these N tasks share the same policy network interface, except that their task embeddings differ. For implementation, we adopt a fully-connected encoder, which takes as input the one-hot encodings of the different training tasks, to initialize the set {Z_{T_N}}. This task encoder is learnable during training.
After training, the LTE interface can be reused as a high-level action interface. Hence, we use LTE
and action representation interchangeably in this paper.
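A minimal sketch of this decoupled architecture in PyTorch is given below. The layer widths, the embedding dimensions m and d, and the log-std clamping are our assumptions for illustration, not the paper's reported hyperparameters; the tanh squashing used by SAC is omitted for brevity.

    import torch
    import torch.nn as nn

    class EARPolicy(nn.Module):
        """Sketch: state encoder phi, one-hot task encoder (-> LTE), action decoder psi."""

        def __init__(self, state_dim, action_dim, num_tasks, m=64, d=8, hidden=256):
            super().__init__()
            # phi: low-level state s_t -> time-variant latent sensory embedding Z_st (LSE)
            self.state_encoder = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, m))
            # one-hot task id -> time-invariant latent task embedding Z_T (LTE), learnable
            self.task_encoder = nn.Linear(num_tasks, d, bias=False)
            # psi: concatenated [Z_st, Z_T] -> Gaussian over low-level actions (e.g., torques)
            self.action_decoder = nn.Sequential(
                nn.Linear(m + d, hidden), nn.ReLU(), nn.Linear(hidden, 2 * action_dim))

        def forward(self, state, task_onehot):
            z_s = self.state_encoder(state)                                         # LSE
            z_t = nn.functional.normalize(self.task_encoder(task_onehot), dim=-1)   # LTE on the unit sphere
            mu, log_std = self.action_decoder(torch.cat([z_s, z_t], dim=-1)).chunk(2, dim=-1)
            return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

At test time, the same decoder can be conditioned on any vector in the LTE space, which is what enables the interpolation, composition, and adaptation experiments described below.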
Training of the multi-task policy networks. A detailed description of the multi-task training is given in Algorithm 1. When computing objectives and their gradients, we use the policy π, parameterized by ω, to denote all the parameters in the state encoder, the action decoder, and {Z_{T_N}}. The overall training procedure is based on SAC. The only difference is that the policy network and the Q networks additionally take as input the LTE Z_T and a task label, respectively. During training,
we also apply two techniques to constrain this space: 1) we normalize the LTEs so that they lie on a hypersphere; 2) we inject random noise into the LTEs to enhance the smoothness of the space. An ablation study on these two constraints is included in Appendix B.7.
Algorithm 1 Multi-task Training
Input: training task set {T_N} ∼ p(T), θ_1, θ_2, ω
θ̄_1 ← θ_1, θ̄_2 ← θ_2, B ← ∅
Initialize the LTE set {Z_{T_N}} for {T_N}
for each pre-train epoch do
    for T_i in {T_N} do
        Sample a batch B_i of multi-task RL transitions with π_ω
        B ← B ∪ B_i
    end for
end for
for each train epoch do
    Sample an RL batch b ∼ B
    for all transition data in b do
        Z_{s_t} = φ(s_t)
        Z̃_{T_i} = normalize(Z_{T_i} + n), where n ∼ N(0, σ²)
        Sample action a_t ∼ ψ(· | Z_{s_t}, Z̃_{T_i}) for computing the SAC objectives
    end for
    for each optimization step do
        Compute the SAC objectives J(α), J_π(ω), J_Q(θ) with b based on Equations 2, 3, and 4
        Update the SAC parameters
    end for
end for
Output: the trained state encoder φ, action decoder ψ, and set of LTEs {Z_{T_N}}
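The smoothing constraint inside the training loop amounts to perturbing each LTE with Gaussian noise and projecting it back onto the unit hypersphere before sampling actions. A small sketch, where the noise scale σ is a placeholder rather than the paper's value:

    import torch
    import torch.nn.functional as F

    def perturb_lte(z_t, sigma=0.1):
        """Z~_T = normalize(Z_T + n), n ~ N(0, sigma^2 I): noisy LTE projected back onto the unit sphere."""
        noise = sigma * torch.randn_like(z_t)
        return F.normalize(z_t + noise, dim=-1)

    # During training, actions are then sampled as a_t ~ psi(. | Z_st, perturb_lte(Z_T)).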
3.2 THE EMERGENCE OF ACTION REPRESENTATION
After we train the multi-task policy network with a comprehensive set of tasks, where the LTE
vectors in {Z_{T_N}} share the same embedding space, we find that {Z_{T_N}} self-organizes into a ge-
ometrically and semantically meaningful structure. Tasks with the same modality are embedded
in a continuous fashion, which facilitates intra-task interpolation. Surprisingly, the composition
of task embeddings from different modalities leads to novel tasks, e.g., “run” + “jump” = “jump
run”. Further, the action representation can be used for efficient task adaptation. Visualization also
reveals interesting geometric structures in task embedding and sensory representation spaces. In
this subsection, we dive into these intriguing phenomena, demonstrating the emergence of action
representation and showing the generalization of the emergent action representation.
Task interpolation & composition. After training the RL agent to accomplish multiple tasks, we select two pre-trained tasks and generate a new LTE by linearly combining the LTEs of the two chosen tasks. The newly-generated task embedding is expected to guide the agent to perform a different task. The generated LTE is defined by:

    Z′ = f(β Z_{T_i} + (1 − β) Z_{T_j}),    (1)

where i, j are the indices of the selected tasks and Z_{T_i}, Z_{T_j} are their corresponding LTEs. β is a hyperparameter ranging in (0, 1). The function f(·) is a regularization function related to the pre-defined structure of the LTE space; in this paper, f(·) is a normalization function that rescales the result of the combination onto the unit sphere.
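In code, Equation 1 is simply a renormalized convex combination of two learned LTEs; a sketch, assuming the LTEs are stored as unit-norm torch vectors:

    import torch
    import torch.nn.functional as F

    def mix_ltes(z_i, z_j, beta):
        """Eq. 1 sketch: Z' = f(beta * Z_Ti + (1 - beta) * Z_Tj), with f projecting onto the unit sphere."""
        return F.normalize(beta * z_i + (1.0 - beta) * z_j, dim=-1)

    # Intra-action interpolation: z_i, z_j come from the same modality (e.g., running at two speeds).
    # Inter-action composition:   z_i, z_j come from different modalities (e.g., "run" and "jump").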
A new task is interpolated by applying the aforementioned operation to the LTEs of tasks sampled from a uni-modal distribution. The interpolated task usually has the same semantic meaning as the source tasks while differing in specific parameters, e.g., running at a different speed. A new task is composed by applying the same operation to tasks sampled from a multi-modal distribution. The newly composed task usually lies in a new modality between the source tasks. For example, when we compose “run” and “jump” together, we obtain a combination of the two: an agent running while trying to jump.
Efficient adaptation. We find that an agent trained with the multi-task policy network can adapt to unseen tasks quickly by optimizing only the LTEs. This shows that the LTEs learn a general pattern of the overall task distribution. When given a new task after pre-training, the agent explores the LTE space to find a suitable LTE for the task. Specifically, we perform the gradient-free cross-entropy method (CEM) (De Boer et al., 2005) in the LTE space to accomplish the desired task. A detailed description can be found in Algorithm 2.
Geometric structures of the LTEs and LSEs. We then explore what the sensory representation space and the action representation space look like after multi-task training. To understand their geometric structures, we visualize the LSEs and LTEs. The detailed results of our analysis are presented in Section 4.5.
Algorithm 2 Adaptation via LTE Optimization
Input: adaptation task T ∼ p(T), φ, ψ, capacity of the elite set m, number of samples n
Initialize the elite set Z_e with m randomly sampled LTEs from the LTE space
for each adaptation epoch do
    Initialize the overall test set Z ← ∅
    for Z_i in Z_e do
        Sample n LTEs Z_{i1}, ..., Z_{in} near Z_i
        Z ← Z ∪ {Z_i, Z_{i1}, ..., Z_{in}}
    end for
    for Z_j in Z do
        while not done do
            Z_{s_t} = φ(s_t)
            a_t ∼ ψ(· | Z_{s_t}, Z_j)
            r_t = R(s_t, a_t | T)
            s_{t+1} ∼ p(s_{t+1} | s_t, a_t)
        end while
    end for
    Sort the task embeddings in Z by the cumulative reward of their trajectories
    Select the top m LTEs in Z to update Z_e
end for
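A rough sketch of this CEM-style search over the LTE space follows. The sampling noise scale, the number of epochs, and the rollout helper evaluate_return (assumed to run one episode with the frozen encoder/decoder conditioned on a candidate LTE and return its cumulative reward) are illustrative assumptions, not the paper's settings.

    import torch
    import torch.nn.functional as F

    def adapt_lte_cem(evaluate_return, d, m=5, n=10, epochs=20, noise=0.2):
        """Sketch of Algorithm 2: gradient-free elite search for an LTE that solves a new task."""
        # Elite set Z_e: m random unit-norm LTEs.
        elite = F.normalize(torch.randn(m, d), dim=-1)
        for _ in range(epochs):
            # Sample n candidates near each elite LTE and keep the elites themselves.
            perturbed = elite.repeat_interleave(n, dim=0) + noise * torch.randn(m * n, d)
            candidates = torch.cat([elite, F.normalize(perturbed, dim=-1)], dim=0)
            # Roll out the frozen policy once per candidate and score by cumulative reward.
            returns = torch.tensor([evaluate_return(z) for z in candidates])
            elite = candidates[returns.topk(m).indices]
        return elite[0]  # highest-scoring LTE found for the adaptation task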
4 EXPERIMENTS
In this section, we first demonstrate the training process and performance of the multi-task policy
network. Then, we use the LTEs as a high-level action interface to instruct the agents to perform
unseen skills through interpolation without any training. After that, we conduct experiments to
evaluate the effectiveness of the LTEs in task adaptation. Lastly, we visualize the LSEs and LTEs
to further understand the structure of the state and action representation. We use emergent action
representation (EAR) to refer to the policy using the LTEs.
4.1 EXPERIMENTAL SETUPS
Environments. We evaluate our method on five locomotion control environments (HalfCheetah-
Vel, Ant-Dir, Hopper-Vel, Walker-Vel, HalfCheetah-Run-Jump) based on OpenAI Gym and the
Mujoco simulator. Detailed descriptions of these RL benchmarks are listed in Appendix B.1. Be-
yond the locomotion domain, we also conduct a simple test of our method in the domain of robot manipulation; detailed results are provided in Appendix B.6.
Baselines. We compare EAR-SAC, the emergent-action-representation-based SAC, with several multi-task RL and meta RL baselines. For multi-task RL baselines, we use multi-head multi-task
SAC (MHMT-SAC) and one-hot embedding SAC (OHE-SAC; for ablation). For meta RL baselines,
we use MAML (Finn et al., 2017) and PEARL (Rakelly et al., 2019). Detailed descriptions of these
baselines are listed in Appendix B.2.