
in sensory representation learning is pretraining with a comprehensive task or set of tasks, followed
by reusing the resulting latent representation. We plan to extend this approach to action represen-
tation learning and explore its potential for enhancing the efficiency and adaptability of reinforce-
ment learning agents. We propose a multi-task policy network that enables a set of tasks to share
the same latent action representation space. Further, the time-variant sensory representations and time-invariant action representations are decoupled and then concatenated into sensory-action representations, which are finally transformed by a policy network into low-level action control.
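As a rough illustration, the following sketch shows one way such a decoupled architecture could be wired up; the module names, network sizes, and the use of a per-task embedding are assumptions made for exposition, not the exact implementation.

```python
import torch
import torch.nn as nn

class MultiTaskPolicy(nn.Module):
    """Minimal sketch: a time-variant sensory encoder and a time-invariant
    per-task action representation are concatenated and decoded by a shared
    policy network. All names and sizes are illustrative assumptions."""

    def __init__(self, obs_dim, act_dim, num_tasks, sens_dim=64, task_dim=16):
        super().__init__()
        # Time-variant sensory representation: recomputed from the state at every step.
        self.sensory_encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, sens_dim)
        )
        # Time-invariant action representation: one learned vector per task.
        self.task_embedding = nn.Embedding(num_tasks, task_dim)
        # Shared policy network maps the concatenated representation to low-level actions.
        self.policy = nn.Sequential(
            nn.Linear(sens_dim + task_dim, 128), nn.ReLU(), nn.Linear(128, act_dim)
        )

    def forward(self, obs, task_id):
        s = self.sensory_encoder(obs)      # changes over time with the state
        z = self.task_embedding(task_id)   # fixed within a task
        return self.policy(torch.cat([s, z], dim=-1))
```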
Surprisingly, when trained on a comprehensive set of tasks, this simple structure learns an emergent
self-organized action representation that can be reused for various downstream tasks. In particular,
we demonstrate the efficacy of this representation in Mujoco locomotion environments, showing
zero-shot interpolation/composition and few-shot task adaptation in the representation space, out-
performing strong meta RL baselines. Additionally, we find that the decoupled time-variant sensory
representation exhibits equivariant properties. This evidence suggests that reusable and generalizable action representations may lead to efficient, adaptable, and composable RL, thus forming a basis for abstract action planning and for understanding motor signal space. The primary contributions of this work are as follows:
1. We put forward the idea of leveraging emergent action representations from multi-task
learners to better understand motor action space and accomplish task generalization.
2. We decouple the state-related and task-related information of the sensory-action represen-
tations and reuse them to conduct action planning more efficiently.
3. Our approach is a strong adapter: when adapting to new tasks, it achieves higher rewards with fewer steps than strong meta RL baselines.
4. Our approach supports intra-action interpolation as well as inter-action composition by modifying and composing the learned action representations (see the sketch after this list).
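To make the last point concrete, a minimal sketch of what interpolation and composition in the learned representation space could look like is given below; the function names and the simple linear blending are illustrative assumptions rather than the exact operations used later.

```python
import torch

def interpolate(z_a: torch.Tensor, z_b: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Intra-action interpolation: blend two action representations from the
    same modality, e.g. two running speeds (illustrative assumption)."""
    return alpha * z_a + (1.0 - alpha) * z_b

def compose(z_list, weights=None) -> torch.Tensor:
    """Inter-action composition: weighted combination of representations from
    different modalities, e.g. running and jumping (illustrative assumption)."""
    if weights is None:
        weights = [1.0 / len(z_list)] * len(z_list)
    return sum(w * z for w, z in zip(weights, z_list))
```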
Next, we begin our technical discussion below and defer the discussion of related literature to the end of the paper.
2 PRELIMINARIES
Soft Actor-Critic. In this paper, our approach is built on Soft Actor-Critic (SAC) (Haarnoja et al., 2018). SAC is a stable off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework, in which the actor maximizes both the expected return and the policy entropy. We leave more details of SAC to Appendix A.
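For concreteness, the maximum-entropy objective optimized by SAC (Haarnoja et al., 2018) can be written as
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],$$
where $\rho_\pi$ denotes the state-action marginals induced by $\pi$ and the temperature $\alpha$ trades off reward against entropy.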
Task Distribution. We assume the tasks that the agent may meet are drawn from a pre-defined task distribution $p(\mathcal{T})$. Each task in $p(\mathcal{T})$ corresponds to a Markov Decision Process (MDP). Therefore, a task $\mathcal{T}$ can be defined by a tuple $(\mathcal{S}, \mathcal{A}, P, p_0, R)$, in which $\mathcal{S}$ and $\mathcal{A}$ are respectively the state and action space, $P$ the transition probability, $p_0$ the initial state distribution, and $R$ the reward function.
The concept of task distribution is frequently employed in meta RL problems, but we extend it to better match the setting of this work. We divide task distributions into two main categories, “uni-modal” task distributions and “multi-modal” task distributions. Concretely, the two scenarios are defined as follows:
• Definition 1 (Uni-modal task distribution): In a uni-modal task distribution, all tasks share a single modality. For example, in HalfCheetah-Vel, a Mujoco locomotion environment, we train the agent to run at different target velocities; running is therefore the only modality in this task distribution.
• Definition 2 (Multi-modal task distribution): In contrast to a uni-modal task distribution, a multi-modal task distribution contains multiple modalities, i.e., it includes tasks from several different uni-modal task distributions. For instance, we design a multi-modal task distribution called HalfCheetah-Run-Jump, which contains two modalities, HalfCheetah-BackVel and HalfCheetah-BackJump. The former has been defined above, and the latter contains tasks that train the agent to jump with different reward weights. In our implementation, we train four motions in this environment: running, walking, jumping, and standing. We leave more details to Section 4 and Appendix B.1.
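As a concrete illustration of the two settings, the following sketch shows how tasks might be sampled from each type of distribution; the dictionary fields, velocity ranges, and reward weights are hypothetical values for exposition only.

```python
import random

def sample_uni_modal_task():
    """Uni-modal: all tasks share one modality (e.g. running); only the
    target velocity varies, as in HalfCheetah-Vel (ranges are assumed)."""
    return {"modality": "run", "target_vel": random.uniform(0.0, 3.0)}

def sample_multi_modal_task():
    """Multi-modal: tasks come from several uni-modal distributions,
    e.g. backward running and jumping in HalfCheetah-Run-Jump (assumed)."""
    if random.random() < 0.5:
        return {"modality": "back_run", "target_vel": random.uniform(0.0, 3.0)}
    return {"modality": "jump", "reward_weight": random.uniform(0.5, 2.0)}
```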