Low-Rank Modular Reinforcement Learning via
Muscle Synergy
Heng Dong
IIIS, Tsinghua University
drdhxi@gmail.com
Tonghan Wang
Harvard University
twang1@g.harvard.edu
Jiayuan Liu
IIIS, Tsinghua University
georgejiayuan@gmail.com
Chongjie Zhang
IIIS, Tsinghua University
chongjie@tsinghua.edu.cn
Abstract
Modular Reinforcement Learning (RL) decentralizes the control of multi-joint
robots by learning policies for each actuator. Previous work on modular RL has
proven its ability to control morphologically different agents with a shared actuator
policy. However, with the increase in the Degree of Freedom (DoF) of robots,
training a morphology-generalizable modular controller becomes exponentially
difficult. Motivated by the way the human central nervous system controls numer-
ous muscles, we propose a Synergy-Oriented LeARning (SOLAR) framework that
exploits the redundant nature of DoF in robot control. Actuators are grouped into
synergies by an unsupervised learning method, and a synergy action is learned to
control multiple actuators in synchrony. In this way, we achieve a low-rank control
at the synergy level. We extensively evaluate our method on a variety of robot
morphologies, and the results show its superior efficiency and generalizability,
especially on robots with a large DoF like Humanoid++ and UNIMALs.
1 Introduction
Deep reinforcement learning (RL) has contributed significantly to the sensorimotor control of both
simulated [Heess et al., 2017, Zhu et al., 2020] and real-world [Levine et al., 2016, Mahmood
et al., 2018] robots. Monolithic learning is a popular paradigm for learning control policies. In this
paradigm, a policy inferring a joint action for all limb actuators based on a global sensory state is
learned. Although monolithic learning has made impressive progress [Chen et al., 2020, Kuznetsov
et al., 2020], it has two major shortcomings. First, the input and output spaces are large. For robots with more joints, learning control policies puts a heavy burden on the representational capacity of
neural networks. Second, the input and output dimensions are fixed, making it inflexible to transfer
the learned control policies to robots with different morphologies.
Modular reinforcement learning provides an elegant solution to these problems. In this learning
paradigm [Wang et al., 2018], the control policy is decentralized [Peng et al., 2021], and each limb
actuator is controlled by a local policy. Recent research efforts show that the local policies can learn
high-performance and transferable control strategies by sharing parameters [Huang et al., 2020],
communicating with each other by message passing [Huang et al., 2020], and adaptively paying attention
to other actuators via graph neural networks [Kurin et al., 2020]. By exploiting the flexibility and
generalizability provided by modularity, a modular policy can now control robots of up to thousands
of morphologies [Gupta et al., 2021a].
*These authors contributed equally to this work.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Despite the significant progress, modular reinforcement learning is still limited in terms of the
complexity of morphological structures that can be controlled and struggles on robots with many
joints like Humanoid [Kurin et al., 2020]. The large degree of control freedom presents a major
challenge for learning control policies. A natural question is why humans can control hundreds of muscles with dexterity while the most advanced RL policies can control fewer than fifteen actuators.
Studies on muscle synergies [d’Avella et al., 2003] may provide an answer. The human central nervous system decreases control complexity by producing a small number of electrical signals and
activating muscles in groups [Ting and McKay, 2007]. Muscle synergy is the coordination of muscles
that are activated in synchrony. With muscle synergies, the human nervous system achieves low-rank
control over its actuators. In this paper, we draw inspiration from muscle synergies to reduce the control complexity and improve the learning performance of modular RL.
The first challenge of incorporating muscle synergies into modular RL is to discover a synergy
structure that can promote policy learning. Neuroscience researchers factorize electrical signals [Saito
et al., 2018, Falaki et al., 2017, Kieliba et al., 2018] to analyze the synergy structure, but policy
signals are sub-optimal or even absent during reinforcement learning. We thus exploit the functional
similarity and morphological context of actuators and use a clustering algorithm to identify actuators
in the same synergy. The intuition is that muscles in a synergy typically serve the same functional
purpose and have similar morphological contexts. We quantify the functional similarity by the
influence of an actuator’s action on the global value function, and the morphological structure is
encoded as a distance matrix. To use the two types of information simultaneously, we adopt the
affinity propagation algorithm [Frey and Dueck, 2007]. The synergy structure is updated periodically
during learning to promptly reflect changes in value functions.
To exploit the discovered synergy structure, we design a synergy-aware architecture for policy
learning. The major novelty here is that the policy learns action selection for each synergy, and the
synergy actions are transformed linearly to obtain actuator actions. Since the number of synergies is typically much smaller than the number of actuators, we effectively learn a low-rank control policy in which the physical
actions are a linear mapping from a low-dimensional action space. Moreover, for better processing
state information, the synergy-aware policy adopts a two-level transformer structure, which first
aggregates information within each synergy and then processes information across synergies.
We evaluate our Synergy-Oriented LeARning (SOLAR) framework on two MuJoCo [Todorov et al.,
2012] locomotion benchmarks [Huang et al., 2020, Gupta et al., 2021b] in multi-task, zero-shot, and single-task settings. SOLAR significantly outperforms previous state-of-the-art algorithms in terms of both sample efficiency and final performance in all tested settings, especially on robots with a large DoF like Humanoid++ [Huang et al., 2020] and UNIMALs [Gupta et al., 2021b]. Performance
comparison and the visualization of learned synergy structures strongly support the effectiveness of
our synergy discovery method and synergy-aware transformer-based policy learning approach. Our
experimental results reveal the low-rank nature of multi-joint robot control signals.
2 Background
Modular RL. Modular Reinforcement Learning decentralizes the control of multi-joint robots by learning policies for each actuator. Each joint has its own control policy, and the policies coordinate with each other via various message passing schemes. Modular RL usually needs to deal with agents with
different morphologies. To do so, Wang et al. [2018] and Pathak et al. [2019] represent the robot’s
morphology as a graph and use GNNs as policy and message passing networks. Huang et al. [2020]
use both bottom-up and top-down message passing schemes through the links between joints for coordination. All of these GNN-based works show the benefits of modular policies over a monolithic
policy in tasks tackling different morphologies. However, recently, Kurin et al. [2020] validated a
hypothesis that any benefit GNNs can extract from morphological structures is outweighed by the
difficulty of message passing across multiple hops. They further propose a transformer-based method,
AMORPHEUS, that utilizes self-attention mechanisms as a message passing approach. AMORPHEUS
outperforms prior works, and our work is based on AMORPHEUS. Previous works mainly focused on effective message passing schemes, while our work aims at reducing the learning complexity when the DoF of the robot is large.
Muscle Synergy. How the human central nervous system (CNS) coordinates the activation of a
large number of muscles during movement is still an open question. According to numerous studies,
the CNS activates muscles in groups to decrease the complexity required to control each individual
muscle [d’Avella et al., 2003, Ting and McKay, 2007]. According to muscle synergy theory, the
CNS produces a small number of signals. The combinations of these signals are distributed to the
muscles [Wojtara et al., 2014]. Muscle synergy is the term for the coordination of muscles that
activate at the same time [Ferrante et al., 2016]. A synergy can include multiple muscles, and a
muscle can belong to multiple synergies. Synergies produce complicated activation patterns for a set
of muscles during the performance of a task, which is commonly measured using electromyography
(EMG) [Tresch et al., 2002, Singh et al., 2018]. EMG signals are typically recorded as a matrix in which each column holds the activation signals at one moment and each row holds the activation of one muscle [Rabbi et al., 2020].
Factorization methods applied to this matrix are used to extract muscle synergies from muscle activation patterns. The four most commonly used factorization methods are non-negative matrix factorization [Steele
et al., 2015, Schwartz et al., 2016, Lee and Seung, 1999, Rozumalski et al., 2017, Shuman et al.,
2016, Saito et al., 2018], principal component analysis [Ting and Macpherson, 2005, Ting et al.,
2015, Danion and Latash, 2010, Falaki et al., 2017], independent component analysis [Hyvärinen and
Oja, 2000, Hart and Giszter, 2013], and factor analysis [Kieliba et al., 2018, Saito et al., 2015].
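To make the factorization step concrete, below is a minimal sketch (not from the paper) of extracting synergies from an EMG-like matrix with non-negative matrix factorization in scikit-learn; the matrix shape, the random data, and the choice of four synergies are illustrative assumptions.

```python
# Minimal sketch: muscle synergy extraction from an EMG-like matrix via NMF.
# The EMG data here is synthetic and the number of synergies is an assumption.
import numpy as np
from sklearn.decomposition import NMF

n_muscles, n_timesteps = 16, 500
rng = np.random.default_rng(0)
emg = rng.random((n_muscles, n_timesteps))   # rows: muscles, columns: time points

n_synergies = 4  # assumed; in practice chosen by explained variance
model = NMF(n_components=n_synergies, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(emg)   # (n_muscles, n_synergies): each column weights muscles in one synergy
H = model.components_          # (n_synergies, n_timesteps): each synergy's activation over time

# A low reconstruction error with few components is the usual evidence of low-rank structure.
reconstruction = W @ H
print("relative error:", np.linalg.norm(emg - reconstruction) / np.linalg.norm(emg))
```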
In the field of robot control, only a few works [Palli et al., 2014, Wimböck et al., 2011, Ficuciello
et al., 2016] have exploited the idea of muscle synergy for dimensionality reduction to simplify control. However, these works usually first use motion datasets from humans to obtain the synergy space and then learn to control in this synergy space. In contrast, our work learns the synergy space
simultaneously with the control policy in the synergy space.
Affinity propagation [Frey and Dueck, 2007] is a clustering algorithm based on multi-round message passing between input data points. It does not require the number of clusters to be specified in advance and proceeds by finding an exemplar for each instance. Data points that choose the same exemplar belong to the same cluster.
Suppose $\{x_i\}_{i=1}^{n}$ is a set of data points. Define $S \in \mathbb{R}^{n \times n}$ as a similarity matrix. When $i \neq j$, the element $s_{i,j}$ at the $i$th row and $j$th column is the similarity between $x_i$ and $x_j$, which can be measured as, for example, the negative squared distance of the two data points. When $i = j$, the element $s_{i,j}$ represents how likely the corresponding instance is to become an exemplar. The vector of diagonal elements, $(s_{1,1}, s_{2,2}, \ldots, s_{n,n})$, is called the preference. The non-diagonal elements of $S$ constitute the affinity matrix.
The algorithm takes $S$ as input and proceeds by updating two matrices: the responsibility matrix $R$, whose values $r_{i,j}$ represent how well-suited $x_j$ is to be the exemplar for $x_i$; and the availability matrix $A$, whose values $a_{i,j}$ quantify the appropriateness of $x_i$ picking $x_j$ as its exemplar [Frey and Dueck, 2007]. These two matrices are initialized to zero and can be regarded as log-probability tables. The algorithm then alternates between two message-passing steps. First, the responsibility matrix is updated:
$$r_{i,j} \leftarrow s_{i,j} - \max_{j' \neq j}\big(a_{i,j'} + s_{i,j'}\big). \tag{1}$$
Then, the availability matrix is updated:
$$a_{i,j} \leftarrow \min\Big(0,\; r_{j,j} + \sum_{i' \notin \{i,j\}} \max(0, r_{i',j})\Big) \;\;\text{for } i \neq j; \qquad a_{j,j} \leftarrow \sum_{i' \neq j} \max(0, r_{i',j}). \tag{2}$$
Messages are passed until the clusters stabilize or the pre-determined number of iterations is reached.
Then the exemplar of $i$ is $\arg\max_j \big(r_{i,j} + a_{i,j}\big)$.
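For reference, a NumPy sketch of these updates is shown below; the damping factor is an addition for numerical stability (standard in practice, e.g. in sklearn.cluster.AffinityPropagation, but not part of Eqs. (1) and (2)).

```python
# NumPy sketch of the affinity propagation updates in Eqs. (1) and (2).
import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.5):
    """Cluster via message passing on a similarity matrix S whose diagonal holds the preferences."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibility matrix
    A = np.zeros((n, n))  # availability matrix
    rows = np.arange(n)
    for _ in range(n_iter):
        # Eq. (1): r[i, j] <- s[i, j] - max_{j' != j}(a[i, j'] + s[i, j'])
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new
        # Eq. (2): availabilities accumulate positive responsibilities column-wise
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())            # keep r[j, j] itself in the column sum
        col_sum = Rp.sum(axis=0)
        A_new = np.minimum(0, col_sum[None, :] - Rp)  # drops the i-th term for entry (i, j)
        np.fill_diagonal(A_new, col_sum - Rp.diagonal())
        A = damping * A + (1 - damping) * A_new
    # Exemplar of point i: argmax_j (r[i, j] + a[i, j]); shared exemplars define clusters.
    return np.argmax(R + A, axis=1)
```

Calling affinity_propagation(S) returns one exemplar index per data point; points that share an exemplar form one cluster, matching the description above.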
3 Method
In this section, we present our Synergy-Oriented LeARning (SOLAR) scheme that incorporates the
muscle synergy mechanism into modular reinforcement learning to reduce its learning complexity.
Our method has two major components. The first one is an unsupervised learning module that utilizes
the morphological structure and value information to discover the synergy hierarchy. The second
is a novel attention-based policy architecture that supports synergy-aware learning. Both of the
components are specially designed to enable the control of robots with different morphologies. We
first introduce the problem settings and then describe the details of the two components.
Problem settings. We consider $N$ robots, each with a unique morphology. Agent $n$ contains $K_n$ limb actuators that are connected together to constitute its overall morphological structure. Examples of such robots that are studied in this paper include Humanoid++ and UNIMALs. At each discrete timestep $t$, actuator $k \in \{1, 2, \ldots, K_n\}$ of a robot $n \in \{1, 2, \ldots, N\}$ receives a local sensory state $s^{n,k}_t$ as input and outputs an individual torque value $a^{n,k}_t$ for the corresponding actuator. Then the robot $n$ executes the joint action $\mathbf{a}^n_t = \{a^{n,k}_t\}_{k=1}^{K_n}$ at time $t$, after which the environment returns the next state $\mathbf{s}^n_{t+1} = \{s^{n,k}_{t+1}\}_{k=1}^{K_n}$ corresponding to all limbs of the agent $n$ and a collective reward $r^n_t(\mathbf{s}^n_t, \mathbf{a}^n_t)$ for the whole morphology. We learn a policy $\pi_\theta$ to generate actions based on states. The learning objective of the policy is to maximize the expected return on all the tasks:
$$J(\theta) = \mathbb{E}_{\theta}\Bigg[\sum_{n=1}^{N} \sum_{t=0}^{\infty} \gamma^{t}\, r^{n}_{t}(\mathbf{s}^{n}_{t}, \mathbf{a}^{n}_{t})\Bigg], \tag{3}$$
where $\gamma$ is a discount factor. We adopt an actor-critic framework for policy learning. The critic is shared among all tasks and estimates the Q-function for each robot $n$:
$$Q^{\pi_\theta}(\mathbf{s}^{n}, \mathbf{a}^{n}) = \mathbb{E}_{\theta}\Bigg[\sum_{t=0}^{\infty} \gamma^{t}\, r^{n}_{t}(\mathbf{s}^{n}_{t}, \mathbf{a}^{n}_{t}) \,\Big|\, \mathbf{s}^{n}_{0} = \mathbf{s}^{n},\, \mathbf{a}^{n}_{0} = \mathbf{a}^{n}\Bigg]. \tag{4}$$
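As a concrete reading of this setting, the sketch below (hypothetical env and policy interfaces, not the paper's code) shows one robot's control loop: each actuator maps its local state to a torque, the joint action is executed, and the discounted return of Eq. (3) is accumulated.

```python
# Hypothetical interfaces for illustration only: env.reset() returns one local state per
# actuator, env.step(joint_action) returns (next local states, collective reward, done),
# and actuator_policy(k, s) maps actuator k's local state to its torque.
from typing import Callable, List

def rollout_return(env, actuator_policy: Callable[[int, list], float],
                   gamma: float = 0.99, horizon: int = 1000) -> float:
    """Accumulate the discounted return of Eq. (3) for a single robot."""
    local_states: List = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        # Each actuator k produces an individual torque a^{n,k}_t from its local state s^{n,k}_t.
        joint_action = [actuator_policy(k, s) for k, s in enumerate(local_states)]
        # The robot executes the joint action; a single collective reward is returned
        # for the whole morphology, together with the next local state of every limb.
        local_states, reward, done = env.step(joint_action)
        ret += discount * reward
        discount *= gamma
        if done:
            break
    return ret
```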
[Figure 1 diagram: for robot $n$, actuator states $s^{n,k}$ are processed by intra-synergy self-attention and inter-synergy self-attention into synergy actions, which an action transformation maps to per-actuator actions $a^{n,k}$; the synergy structure (synergy mask $\mathbf{M}^n$) is obtained by affinity propagation on a similarity matrix built from the preference $\Delta\tilde{Q}$ and the affinity $\exp(-\mathbf{D})$.]
Figure 1: Synergy-aware policy learning. The intra-synergy attention module aggregates actuator
information within each synergy. The inter-synergy attention module synthesizes information from
all synergies to produce synergy actions. Synergy actions are then transformed linearly to obtain
actuator actions. Actuator actions are of a lower rank, reducing the control complexity.
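The following PyTorch sketch illustrates the structure in Figure 1 at a high level; the hidden size, number of heads, mean-pooling within synergies, and a fixed-width per-synergy linear transformation are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative synergy-aware policy sketch (assumed layer sizes and pooling; not the
# paper's exact implementation). Assumes at most max_synergy_size actuators per synergy.
import torch
import torch.nn as nn

class SynergyAwarePolicy(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 128, n_heads: int = 4,
                 max_synergy_size: int = 8):
        super().__init__()
        self.embed = nn.Linear(state_dim, hidden_dim)
        # Intra-synergy self-attention: actuators attend to actuators of the same synergy.
        self.intra_attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        # Inter-synergy self-attention: synergy representations attend to each other.
        self.inter_attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        self.synergy_action = nn.Linear(hidden_dim, 1)       # one scalar action per synergy
        # Linear action transformation: a synergy action is mapped to its members' actions.
        self.action_transform = nn.Linear(1, max_synergy_size)

    def forward(self, states: torch.Tensor, synergy_ids: torch.Tensor) -> torch.Tensor:
        """states: (K, state_dim) local actuator states; synergy_ids: (K,) cluster labels."""
        h = self.embed(states)                               # (K, hidden_dim)
        synergy_reprs, members = [], []
        for sid in synergy_ids.unique():
            mask = synergy_ids == sid
            hs = h[mask].unsqueeze(0)                        # (1, K_s, hidden_dim)
            hs, _ = self.intra_attn(hs, hs, hs)              # aggregate within the synergy
            synergy_reprs.append(hs.mean(dim=1))             # pool to one synergy token
            members.append(mask)
        z = torch.stack(synergy_reprs, dim=1)                # (1, n_synergies, hidden_dim)
        z, _ = self.inter_attn(z, z, z)                      # aggregate across synergies
        synergy_act = self.synergy_action(z).squeeze(0)      # (n_synergies, 1)
        actions = torch.zeros(states.shape[0])
        for i, mask in enumerate(members):
            k_s = int(mask.sum())
            # Low-rank control: actuator actions are a linear map of the synergy action.
            actions[mask] = self.action_transform(synergy_act[i])[:k_s]
        return torch.tanh(actions)
```

The point the sketch makes explicit is that the learned action space has one dimension per synergy, and per-actuator torques are recovered by a linear map, which is the low-rank control described above.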
3.1 Discovering synergy structure
In neuroscience, muscle synergies are usually discovered by factorizing electrical muscle signals recorded while tasks are performed [Todorov and Ghahramani, 2004, Rabbi et al., 2020]. Factorization is a method applied in hindsight: it statistically analyzes the optimal control policies of animals as embodied in the electrical signals. By contrast, in reinforcement learning, we do not have the optimal control policies in advance. A synergy hierarchy learned from non-optimal policies is likely to be incorrect, which would hamper policy learning. Therefore, we propose to learn the synergy hierarchy with an unsupervised learning method that incorporates morphological information in addition to learning information. In this section, we describe our synergy hierarchy discovery method.
Intuitively, actuators in the same synergy are activated simultaneously and together accomplish a motion of an end effector. This hints that actuators with similar functions should belong to the same synergy. Formally, the function of an actuator (the $k$th actuator of robot $n$) can be modelled by its influence on the value function:
$$\Delta Q^{n,k} = \mathbb{E}_{\mathbf{s}^{n},\, a^{n,k},\, \mathbf{a}^{n,\text{-}k}}\Big[Q^{\pi}\big(\mathbf{s}^{n}, [a^{n,k}, \mathbf{a}^{n,\text{-}k}]\big) - Q^{\pi}\big(\mathbf{s}^{n}, [b^{n,k}, \mathbf{a}^{n,\text{-}k}]\big)\Big], \tag{5}$$
where $[\cdot,\cdot]$ combines two terms, $a^{n,k}$ is the actual action of actuator $k$, $b^{n,k}$ is a default action of actuator $k$ ($b^{n,k} = 0$ in MuJoCo environments), and $\mathbf{a}^{n,\text{-}k}$ denotes the actions of the actuators of robot $n$ except for the $k$th one. In practice, we use a softmax function to regularize $\Delta Q^{n,k}$: $\tilde{Q}^{n,k} = \exp(\Delta Q^{n,k}) / \sum_{j} \exp(\Delta Q^{n,j})$.
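A brief sketch of how Eq. (5) and a morphological distance matrix could be turned into the similarity matrix that affinity propagation consumes; the critic interface, the Monte-Carlo estimate, and the exp(-D) affinity are assumptions consistent with the description above and the labels in Figure 1.

```python
# Illustrative sketch (assumed critic interface and combination rule) of building the
# similarity matrix for synergy discovery from Eq. (5) and a morphological distance matrix D.
import numpy as np

def delta_q(critic, states, actions, k, default=0.0):
    """Monte-Carlo estimate of Delta Q^{n,k} from a batch of (state, joint-action) samples."""
    baseline_actions = actions.copy()
    baseline_actions[:, k] = default                 # replace actuator k's action with b^{n,k} = 0
    return np.mean(critic(states, actions) - critic(states, baseline_actions))

def similarity_matrix(critic, states, actions, D):
    """Preference (diagonal) from softmax-regularized Delta Q; affinity from exp(-D)."""
    K = actions.shape[1]
    dq = np.array([delta_q(critic, states, actions, k) for k in range(K)])
    q_tilde = np.exp(dq) / np.exp(dq).sum()          # softmax regularization of Delta Q
    S = np.exp(-D)                                   # off-diagonal: morphological affinity
    np.fill_diagonal(S, q_tilde)                     # diagonal: preference of each actuator
    return S
```

The resulting matrix can be clustered with the affinity propagation procedure from Section 2, yielding one cluster label per actuator as the synergy assignment, refreshed periodically during training.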