Low-Rank Modular Reinforcement Learning via
Muscle Synergy
Heng Dong
IIIS, Tsinghua University
drdhxi@gmail.com
Tonghan Wang
Harvard University
twang1@g.harvard.edu
Jiayuan Liu
IIIS, Tsinghua University
georgejiayuan@gmail.com
Chongjie Zhang
IIIS, Tsinghua University
chongjie@tsinghua.edu.cn
Abstract
Modular Reinforcement Learning (RL) decentralizes the control of multi-joint
robots by learning policies for each actuator. Previous work on modular RL has
proven its ability to control morphologically different agents with a shared actuator
policy. However, with the increase in the Degree of Freedom (DoF) of robots,
training a morphology-generalizable modular controller becomes exponentially
difficult. Motivated by the way the human central nervous system controls numer-
ous muscles, we propose a Synergy-Oriented LeARning (SOLAR) framework that
exploits the redundant nature of DoF in robot control. Actuators are grouped into
synergies by an unsupervised learning method, and a synergy action is learned to
control multiple actuators in synchrony. In this way, we achieve a low-rank control
at the synergy level. We extensively evaluate our method on a variety of robot
morphologies, and the results show its superior efficiency and generalizability,
especially on robots with a large DoF like Humanoid++ and UNIMALs.
1 Introduction
Deep reinforcement learning (RL) has contributed significantly to the sensorimotor control of both
simulated [Heess et al., 2017, Zhu et al., 2020] and real-world [Levine et al., 2016, Mahmood
et al., 2018] robots. Monolithic learning is a popular paradigm for learning control policies. In this
paradigm, a policy inferring a joint action for all limb actuators based on a global sensory state is
learned. Although monolithic learning has made impressive progress [Chen et al., 2020, Kuznetsov
et al., 2020], it has two major shortcomings. First, the input and output spaces are large. For robots with more joints, learning control policies puts a heavy burden on the representational capacity of
neural networks. Second, the input and output dimensions are fixed, making it inflexible to transfer
the learned control policies to robots with different morphologies.
Modular reinforcement learning provides an elegant solution to these problems. In this learning
paradigm [Wang et al., 2018], the control policy is decentralized [Peng et al., 2021], and each limb
actuator is controlled by a local policy. Recent research efforts show that the local policies can learn
high-performance and transferable control strategies by sharing parameters [Huang et al., 2020],
communicating with each other by message passing [Huang et al., 2020], and adaptively paying attention
to other actuators via graph neural networks [Kurin et al., 2020]. By exploiting the flexibility and
generalizability provided by modularity, a modular policy can now control robots of up to thousands
of morphologies [Gupta et al., 2021a].
*These authors contributed equally to this work.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Despite the significant progress, modular reinforcement learning is still limited in terms of the
complexity of morphological structures that can be controlled and struggles on robots with many
joints like Humanoid [Kurin et al., 2020]. The large degree of control freedom presents a major
challenge for learning control policies. A natural question is why humans can control hundreds of muscles with dexterity while the most advanced RL policies can control fewer than fifteen actuators.
Studies on muscle synergies [d’Avella et al., 2003] may provide an answer. The human central nervous system decreases control complexity by producing a small number of electrical signals and
activating muscles in groups [Ting and McKay, 2007]. Muscle synergy is the coordination of muscles
that are activated in synchrony. With muscle synergies, the human nervous system achieves low-rank
control over its actuators. In this paper, we draw inspiration from muscle synergies to reduce the control complexity and improve the learning performance of modular RL.
The first challenge of incorporating muscle synergies into modular RL is to discover a synergy
structure that can promote policy learning. Neuroscience researchers factorize electrical signals [Saito
et al., 2018, Falaki et al., 2017, Kieliba et al., 2018] to analyze the synergy structure, but policy
signals are sub-optimal or even absent during reinforcement learning. We thus exploit the functional
similarity and morphological context of actuators and use a clustering algorithm to identify actuators
in the same synergy. The intuition is that muscles in a synergy typically serve the same functional
purpose and have similar morphological contexts. We quantify the functional similarity by the
influence of an actuator’s action on the global value function, and the morphological structure is
encoded as a distance matrix. To use the two types of information simultaneously, we adopt the
affinity propagation algorithm [Frey and Dueck, 2007]. The synergy structure is updated periodically
during learning to promptly reflect changes in value functions.
To exploit the discovered synergy structure, we design a synergy-aware architecture for policy
learning. The major novelty here is that the policy learns action selection for each synergy, and the
synergy actions are transformed linearly to obtain actuator actions. Since the number of synergies is typically much smaller than the number of actuators, we effectively learn a low-rank control policy in which the physical
actions are a linear mapping from a low-dimensional action space. Moreover, for better processing
state information, the synergy-aware policy adopts a two-level transformer structure, which first
aggregates information within each synergy and then processes information across synergies.
We evaluate our Synergy-Oriented LeARning (SOLAR) framework on two MuJoCo [Todorov et al.,
2012] locomotion benchmarks [Huang et al., 2020, Gupta et al., 2021b] in multi-task, zero-shot, and single-task settings. SOLAR significantly outperforms previous state-of-the-art algorithms in terms of both sample efficiency and final performance in all tested settings, especially on robots with a large DoF like Humanoid++ [Huang et al., 2020] and UNIMALs [Gupta et al., 2021b]. Performance
comparison and the visualization of learned synergy structures strongly support the effectiveness of
our synergy discovery method and synergy-aware transformer-based policy learning approach. Our
experimental results reveal the low-rank nature of multi-joint robot control signals.
2 Background
Modular RL. Modular Reinforcement Learning decentralizes the control of multi-joint robots by learning policies for each actuator. Each joint has its own control policy, and the policies coordinate with each other via various message passing schemes. Modular RL usually needs to deal with agents with
different morphologies. To do so, Wang et al. [2018] and Pathak et al. [2019] represent the robot’s
morphology as a graph and use GNNs as policy and message passing networks. Huang et al. [2020]
use both bottom-up and top-down message passing schemes through the links between joints for coordination. All of these GNN-based works show the benefits of modular policies over a monolithic
policy in tasks tackling different morphologies. However, recently, Kurin et al. [2020] validated a
hypothesis that any benefit GNNs can extract from morphological structures is outweighed by the
difficulty of message passing across multiple hops. They further propose a transformer-based method,
AMORPHEUS, that utilizes self-attention mechanisms as a message passing approach. AMORPHEUS
outperforms prior works, and our work is based on AMORPHEUS. Previous works mainly focused on effective message passing schemes, while our work aims at reducing the learning complexity when the DoF of the robot is large.
Muscle Synergy. How the human central nervous system (CNS) coordinates the activation of a
large number of muscles during movement is still an open question. According to numerous studies,
the CNS activates muscles in groups to decrease the complexity required to control each individual
muscle [d’Avella et al., 2003, Ting and McKay, 2007]. According to muscle synergy theory, the
CNS produces a small number of signals. The combinations of these signals are distributed to the
muscles [Wojtara et al., 2014]. Muscle synergy is the term for the coordination of muscles that
activate at the same time [Ferrante et al., 2016]. A synergy can include multiple muscles, and a
muscle can belong to multiple synergies. Synergies produce complicated activation patterns for a set
of muscles during the performance of a task, which is commonly measured using electromyography
(EMG) [Tresch et al., 2002, Singh et al., 2018]. EMG signals are typically recorded as a matrix in which each column holds the activation signals at one moment and each row holds the activation of one muscle [Rabbi et al., 2020].
Factorization methods applied to this matrix are used to extract muscle synergies from muscle activation patterns. The four most commonly used factorization methods are non-negative matrix factorization [Steele
et al., 2015, Schwartz et al., 2016, Lee and Seung, 1999, Rozumalski et al., 2017, Shuman et al.,
2016, Saito et al., 2018], principal component analysis [Ting and Macpherson, 2005, Ting et al.,
2015, Danion and Latash, 2010, Falaki et al., 2017], independent component analysis [Hyvärinen and
Oja, 2000, Hart and Giszter, 2013], and factor analysis [Kieliba et al., 2018, Saito et al., 2015].
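To make the factorization step concrete, below is a minimal sketch (not from the paper) of extracting synergies from an EMG-like matrix with non-negative matrix factorization in scikit-learn; the matrix shape, the random data, and the choice of four synergies are illustrative assumptions.

```python
# Minimal sketch: muscle synergy extraction from an EMG-like matrix via NMF.
# The EMG data here is synthetic and the number of synergies is an assumption.
import numpy as np
from sklearn.decomposition import NMF

n_muscles, n_timesteps = 16, 500
rng = np.random.default_rng(0)
emg = rng.random((n_muscles, n_timesteps))   # rows: muscles, columns: time points

n_synergies = 4  # assumed; in practice chosen by explained variance
model = NMF(n_components=n_synergies, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(emg)   # (n_muscles, n_synergies): each column weights muscles in one synergy
H = model.components_          # (n_synergies, n_timesteps): each synergy's activation over time

# A low reconstruction error with few components is the usual evidence of low-rank structure.
reconstruction = W @ H
print("relative error:", np.linalg.norm(emg - reconstruction) / np.linalg.norm(emg))
```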
In the field of robot control, only a few works [Palli et al., 2014, Wimböck et al., 2011, Ficuciello
et al., 2016] have exploited the idea of muscle synergy for dimensionality reduction to simplify control. However, these works usually first use motion datasets from humans to obtain the synergy space and then learn to control in this synergy space. In contrast, our work learns the synergy space
simultaneously with the control policy in the synergy space.
Affinity propagation [Frey and Dueck, 2007] is a clustering algorithm based on multi-round message passing between input data points. It does not require the number of clusters to be specified in advance and proceeds by finding an exemplar for each instance. Data points that choose the same exemplar belong to the same cluster.
Suppose $\{x_i\}_{i=1}^{n}$ is a set of data points. Define $S \in \mathbb{R}^{n \times n}$ as a similarity matrix. When $i \neq j$, the element $s_{i,j}$ at the $i$th row and $j$th column is the similarity between $x_i$ and $x_j$, which can be measured as, for example, the negative squared distance of the two data points. When $i = j$, the element $s_{i,j}$ represents how likely the corresponding instance is to become an exemplar. The vector of diagonal elements, $(s_{1,1}, s_{2,2}, \ldots, s_{n,n})$, is called the preference. The non-diagonal elements of $S$ constitute the affinity matrix.
The algorithm takes $S$ as input and proceeds by updating two matrices: the responsibility matrix $R$, whose values $r_{i,j}$ represent how well-suited $x_j$ is to be the exemplar for $x_i$; and the availability matrix $A$, whose values $a_{i,j}$ quantify the appropriateness of $x_i$ picking $x_j$ as its exemplar [Frey and Dueck, 2007]. These two matrices are initialized to zero and can be regarded as log-probability tables. The algorithm then alternates between two message-passing steps. First, the responsibility matrix is updated:
$$r_{i,j} \leftarrow s_{i,j} - \max_{j' \neq j}\big(a_{i,j'} + s_{i,j'}\big). \tag{1}$$
Then, the availability matrix is updated:
$$a_{i,j} \leftarrow \min\Big(0,\; r_{j,j} + \sum_{i' \notin \{i,j\}} \max(0, r_{i',j})\Big) \;\;\text{for } i \neq j; \qquad a_{j,j} \leftarrow \sum_{i' \neq j} \max(0, r_{i',j}). \tag{2}$$
Messages are passed until the clusters stabilize or the pre-determined number of iterations is reached.
Then the exemplar of $i$ is $\arg\max_j \big(r_{i,j} + a_{i,j}\big)$.
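For reference, a NumPy sketch of these updates is shown below; the damping factor is an addition for numerical stability (standard in practice, e.g. in sklearn.cluster.AffinityPropagation, but not part of Eqs. (1) and (2)).

```python
# NumPy sketch of the affinity propagation updates in Eqs. (1) and (2).
import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.5):
    """Cluster via message passing on a similarity matrix S whose diagonal holds the preferences."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibility matrix
    A = np.zeros((n, n))  # availability matrix
    rows = np.arange(n)
    for _ in range(n_iter):
        # Eq. (1): r[i, j] <- s[i, j] - max_{j' != j}(a[i, j'] + s[i, j'])
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new
        # Eq. (2): availabilities accumulate positive responsibilities column-wise
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())            # keep r[j, j] itself in the column sum
        col_sum = Rp.sum(axis=0)
        A_new = np.minimum(0, col_sum[None, :] - Rp)  # drops the i-th term for entry (i, j)
        np.fill_diagonal(A_new, col_sum - Rp.diagonal())
        A = damping * A + (1 - damping) * A_new
    # Exemplar of point i: argmax_j (r[i, j] + a[i, j]); shared exemplars define clusters.
    return np.argmax(R + A, axis=1)
```

Calling affinity_propagation(S) returns one exemplar index per data point; points that share an exemplar form one cluster, matching the description above.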
3 Method
In this section, we present our Synergy-Oriented LeARning (SOLAR) scheme that incorporates the
muscle synergy mechanism into modular reinforcement learning to reduce its learning complexity.
Our method has two major components. The first one is an unsupervised learning module that utilizes
the morphological structure and value information to discover the synergy hierarchy. The second
is a novel attention-based policy architecture that supports synergy-aware learning. Both of the
components are specially designed to enable the control of robots with different morphologies. We
first introduce the problem settings and then describe the details of the two components.
Problem settings. We consider $N$ robots, each with a unique morphology. Agent $n$ contains $K_n$ limb actuators that are connected together to constitute its overall morphological structure. Examples of such robots that are studied in this paper include Humanoid++ and UNIMALs. At each discrete timestep $t$, actuator $k \in \{1, 2, \ldots, K_n\}$ of a robot $n \in \{1, 2, \ldots, N\}$ receives a local sensory state $s^{n,k}_t$ as input and outputs an individual torque value $a^{n,k}_t$ for the corresponding actuator. Then the robot $n$ executes the joint action $\mathbf{a}^n_t = \{a^{n,k}_t\}_{k=1}^{K_n}$ at time $t$, after which the environment returns the next state $\mathbf{s}^n_{t+1} = \{s^{n,k}_{t+1}\}_{k=1}^{K_n}$ corresponding to all limbs of the agent $n$ and a collective reward $r^n_t(\mathbf{s}^n_t, \mathbf{a}^n_t)$ for the whole morphology. We learn a policy $\pi_\theta$ to generate actions based on states. The learning objective of the policy is to maximize the expected return on all the tasks:
$$J(\theta) = \mathbb{E}_{\theta}\Bigg[\sum_{n=1}^{N} \sum_{t=0}^{\infty} \gamma^{t}\, r^{n}_{t}(\mathbf{s}^{n}_{t}, \mathbf{a}^{n}_{t})\Bigg], \tag{3}$$
where $\gamma$ is a discount factor. We adopt an actor-critic framework for policy learning. The critic is shared among all tasks and estimates the Q-function for each robot $n$:
$$Q^{\pi_\theta}(\mathbf{s}^{n}, \mathbf{a}^{n}) = \mathbb{E}_{\theta}\Bigg[\sum_{t=0}^{\infty} \gamma^{t}\, r^{n}_{t}(\mathbf{s}^{n}_{t}, \mathbf{a}^{n}_{t}) \,\Big|\, \mathbf{s}^{n}_{0} = \mathbf{s}^{n},\, \mathbf{a}^{n}_{0} = \mathbf{a}^{n}\Bigg]. \tag{4}$$
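As a concrete reading of this setting, the sketch below (hypothetical env and policy interfaces, not the paper's code) shows one robot's control loop: each actuator maps its local state to a torque, the joint action is executed, and the discounted return of Eq. (3) is accumulated.

```python
# Hypothetical interfaces for illustration only: env.reset() returns one local state per
# actuator, env.step(joint_action) returns (next local states, collective reward, done),
# and actuator_policy(k, s) maps actuator k's local state to its torque.
from typing import Callable, List

def rollout_return(env, actuator_policy: Callable[[int, list], float],
                   gamma: float = 0.99, horizon: int = 1000) -> float:
    """Accumulate the discounted return of Eq. (3) for a single robot."""
    local_states: List = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        # Each actuator k produces an individual torque a^{n,k}_t from its local state s^{n,k}_t.
        joint_action = [actuator_policy(k, s) for k, s in enumerate(local_states)]
        # The robot executes the joint action; a single collective reward is returned
        # for the whole morphology, together with the next local state of every limb.
        local_states, reward, done = env.step(joint_action)
        ret += discount * reward
        discount *= gamma
        if done:
            break
    return ret
```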
[Figure 1 diagram: for robot $n$, actuator states $s^{n,k}$ are processed by intra-synergy self-attention and inter-synergy self-attention into synergy actions, which an action transformation maps to per-actuator actions $a^{n,k}$; the synergy structure (synergy mask $\mathbf{M}^n$) is obtained by affinity propagation on a similarity matrix built from the preference $\Delta\tilde{Q}$ and the affinity $\exp(-\mathbf{D})$.]
Figure 1: Synergy-aware policy learning. The intra-synergy attention module aggregates actuator
information within each synergy. The inter-synergy attention module synthesizes information from
all synergies to produce synergy actions. Synergy actions are then transformed linearly to obtain
actuator actions. Actuator actions are of a lower rank, reducing the control complexity.
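The following PyTorch sketch illustrates the structure in Figure 1 at a high level; the hidden size, number of heads, mean-pooling within synergies, and a fixed-width per-synergy linear transformation are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative synergy-aware policy sketch (assumed layer sizes and pooling; not the
# paper's exact implementation). Assumes at most max_synergy_size actuators per synergy.
import torch
import torch.nn as nn

class SynergyAwarePolicy(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 128, n_heads: int = 4,
                 max_synergy_size: int = 8):
        super().__init__()
        self.embed = nn.Linear(state_dim, hidden_dim)
        # Intra-synergy self-attention: actuators attend to actuators of the same synergy.
        self.intra_attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        # Inter-synergy self-attention: synergy representations attend to each other.
        self.inter_attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        self.synergy_action = nn.Linear(hidden_dim, 1)       # one scalar action per synergy
        # Linear action transformation: a synergy action is mapped to its members' actions.
        self.action_transform = nn.Linear(1, max_synergy_size)

    def forward(self, states: torch.Tensor, synergy_ids: torch.Tensor) -> torch.Tensor:
        """states: (K, state_dim) local actuator states; synergy_ids: (K,) cluster labels."""
        h = self.embed(states)                               # (K, hidden_dim)
        synergy_reprs, members = [], []
        for sid in synergy_ids.unique():
            mask = synergy_ids == sid
            hs = h[mask].unsqueeze(0)                        # (1, K_s, hidden_dim)
            hs, _ = self.intra_attn(hs, hs, hs)              # aggregate within the synergy
            synergy_reprs.append(hs.mean(dim=1))             # pool to one synergy token
            members.append(mask)
        z = torch.stack(synergy_reprs, dim=1)                # (1, n_synergies, hidden_dim)
        z, _ = self.inter_attn(z, z, z)                      # aggregate across synergies
        synergy_act = self.synergy_action(z).squeeze(0)      # (n_synergies, 1)
        actions = torch.zeros(states.shape[0])
        for i, mask in enumerate(members):
            k_s = int(mask.sum())
            # Low-rank control: actuator actions are a linear map of the synergy action.
            actions[mask] = self.action_transform(synergy_act[i])[:k_s]
        return torch.tanh(actions)
```

The point the sketch makes explicit is that the learned action space has one dimension per synergy, and per-actuator torques are recovered by a linear map, which is the low-rank control described above.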
3.1 Discovering synergy structure
In neuroscience, muscle synergies are usually discovered by factorizing electrical muscle signals recorded while tasks are performed [Todorov and Ghahramani, 2004, Rabbi et al., 2020]. Factorization is a method applied in hindsight: it statistically analyzes the optimal control policies of animals as embodied in the electrical signals. By contrast, in reinforcement learning, we do not have the optimal control policies in advance. A synergy hierarchy learned from non-optimal policies is likely to be incorrect, which would hamper policy learning. Therefore, we propose to learn the synergy hierarchy with an unsupervised learning method that incorporates morphological information in addition to learning information. In this section, we describe our synergy hierarchy discovery method.
Intuitively, actuators in the same synergy are activated simultaneously and together accomplish a motion of an end effector. This hints that actuators with similar functions should belong to the same synergy. Formally, the function of an actuator (the $k$th actuator of robot $n$) can be modelled by its influence on the value function:
$$\Delta Q^{n,k} = \mathbb{E}_{\mathbf{s}^{n},\, a^{n,k},\, \mathbf{a}^{n,\text{-}k}}\Big[Q^{\pi}\big(\mathbf{s}^{n}, [a^{n,k}, \mathbf{a}^{n,\text{-}k}]\big) - Q^{\pi}\big(\mathbf{s}^{n}, [b^{n,k}, \mathbf{a}^{n,\text{-}k}]\big)\Big], \tag{5}$$
where $[\cdot,\cdot]$ combines two terms, $a^{n,k}$ is the actual action of actuator $k$, $b^{n,k}$ is a default action of actuator $k$ ($b^{n,k} = 0$ in MuJoCo environments), and $\mathbf{a}^{n,\text{-}k}$ denotes the actions of the actuators of robot $n$ except for the $k$th one. In practice, we use a softmax function to regularize $\Delta Q^{n,k}$: $\tilde{Q}^{n,k} = \exp(\Delta Q^{n,k}) / \sum_{j} \exp(\Delta Q^{n,j})$.
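A brief sketch of how Eq. (5) and a morphological distance matrix could be turned into the similarity matrix that affinity propagation consumes; the critic interface, the Monte-Carlo estimate, and the exp(-D) affinity are assumptions consistent with the description above and the labels in Figure 1.

```python
# Illustrative sketch (assumed critic interface and combination rule) of building the
# similarity matrix for synergy discovery from Eq. (5) and a morphological distance matrix D.
import numpy as np

def delta_q(critic, states, actions, k, default=0.0):
    """Monte-Carlo estimate of Delta Q^{n,k} from a batch of (state, joint-action) samples."""
    baseline_actions = actions.copy()
    baseline_actions[:, k] = default                 # replace actuator k's action with b^{n,k} = 0
    return np.mean(critic(states, actions) - critic(states, baseline_actions))

def similarity_matrix(critic, states, actions, D):
    """Preference (diagonal) from softmax-regularized Delta Q; affinity from exp(-D)."""
    K = actions.shape[1]
    dq = np.array([delta_q(critic, states, actions, k) for k in range(K)])
    q_tilde = np.exp(dq) / np.exp(dq).sum()          # softmax regularization of Delta Q
    S = np.exp(-D)                                   # off-diagonal: morphological affinity
    np.fill_diagonal(S, q_tilde)                     # diagonal: preference of each actuator
    return S
```

The resulting matrix can be clustered with the affinity propagation procedure from Section 2, yielding one cluster label per actuator as the synergy assignment, refreshed periodically during training.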