Local Connection Reinforcement Learning Method
for Efficient Control of Robotic Peg-in-Hole Assembly
Yuhang Gai, Jiwen Zhang, Dan Wu*, and Ken Chen
All authors are from the State Key Laboratory of Tribology in Advanced Equipment,
Department of Mechanical Engineering, Tsinghua University, Beijing, China.
*Corresponding author: Dan Wu (phone: 1-391-083-2965; e-mail: wud@mail.tsinghua.edu.cn).
Abstract: Traditional control methods of robotic peg-in-hole assembly rely on complex contact state analysis. Reinforcement learning (RL) is gradually becoming a preferred method for controlling robotic peg-in-hole assembly tasks. However, the training process of RL is quite time-consuming because RL methods are usually globally connected, meaning that all state components are assumed to be inputs of the policies for all action components, which enlarges the action space and state space to be explored. In this paper, we first define the continuous space serialized Shapley value (CS3) and construct a connection graph to clarify the correlativity of action components with state components. We then propose a local connection reinforcement learning (LCRL) method based on the connection graph, which eliminates the influence of irrelevant state components on the selection of action components. The simulation and experiment results demonstrate that the control strategy obtained through the LCRL method improves the stability and rapidity of the control process. The LCRL method also enhances the data-efficiency and increases the final reward of the training process.
Keywords: robotic assembly, compliance control, local connection reinforcement learning, connection graph
1. Introduction
1.1. Robotic Assembly Control
Robotic assembly based on off-line planning can neither coordinate stress nor guarantee precision between assembly objects [1]. Hence, the robot is usually guided by external feedback of vision or force information to execute assembly tasks. The perception range of vision is limited and easily affected by the environment, whereas force information more directly reflects the stress and pose error between assembly objects. Hence, compliance control methods based on force feedback are more widely used [2].
Compliance control methods used in assembly tasks construct the mapping between force/moment and
the relative pose of assembly objects [3]. According to whether the structure and parameters of a controller
are adaptive, compliance control methods can be divided into three categories: constant compliance control
methods, artificially designed adaptive compliance control methods, and learning-based adaptive
compliance control methods [4].
The structure and parameters of a constant compliance controller are pre-configured and always constant
in the assembly process [5]-[7]. Constant compliance control methods can only solve simple assembly tasks
with weak nonlinear dynamics. However, the dynamics of peg-in-hole assembly tasks are continuously
changing and strongly nonlinear. The capacity of the constant compliance controller may be insufficient to
handle the peg-in-hole assembly tasks.
Hence, adaptive compliance control methods have been proposed to obtain better control performance on assembly tasks [8]-[10]. Variable compliance centre and variable compliance parameters are the most common artificially designed adaptive compliance control methods. The variable compliance centre method changes the structure of the controller by converting motion and force/moment information from the robot and sensor to the dynamic compliance centre. The variable compliance parameters method changes the parameters of the controller according to an artificially designed adaptive law. However, the performance of artificially designed adaptive compliance controllers is limited by human experience.
Learning-based adaptive compliance control methods have been proposed to obtain better performance on assembly tasks [11]-[14]. RL is the most widely employed learning method [15]-[21]. RL abstracts an assembly task as a Markov decision process (MDP) and supplies a policy to guide each control step of the assembly process. Because learning-based adaptive compliance control methods construct the adaptive law through data-driven methods, the adaptive law is closer to optimal for the current assembly task. Learning-based adaptive controllers perform much better than artificially designed ones in assembly tasks. However, the biggest dilemma is that learning-based methods are time-consuming and not stable enough for industrial applications.
1.2. Efficient RL-Based Control
RL is widely used in the field of continuous control [22]. Compared with artificially designed controllers, RL tends to use networks in place of analytic feedback control laws and exploration and exploitation in place of empirical design. However, RL is plagued by long training times [23]-[26], especially in high-dimensional continuous action and state spaces. When the action space and state space are high-dimensional and continuous, exploration becomes relatively inefficient, which reduces the data-efficiency and convergence speed of the training process. One key technique for improving data-efficiency and convergence speed is to optimize the dimensionality of the action space and state space.
A primary method for accelerating the training process is to construct a mapping from the action space or state space to a low-dimensional latent space and then learn in the latent space [27]-[31]. Because the dimensionality of the latent space is smaller, the size of the space to be explored is reduced, which makes the exploration of RL more efficient. Gaussian process regression is usually employed to learn the mapping between the latent space and the action or state space. In the fields of multi-agent RL and sequential action space RL, extracting and training in a latent space is also an effective means to improve data-efficiency and convergence speed [32][33].
Hierarchical RL optimizes dimensionality by decomposing a high-dimensional action space into a sequence of high-level and low-level sub-action spaces [34]-[36]. Hierarchical RL then learns a policy consisting of multiple layers, each responsible for a different level of control. The dimensionality of the low-level sub-action spaces is smaller than that of the original action space. The action space to be explored in hierarchical RL is thus divided and conquered, improving data-efficiency and convergence speed. Compared with training in a latent space, hierarchical RL has the potential to perform better on a task but suffers from the difficulty of selecting the low-level sub-action spaces.
An action dimensionality extension (ADE) method is proposed in [37], which draws on the ideas of latent space and hierarchical RL but works differently. The ADE method first constructs a low-dimensional action space according to the similarity between action components and trains a primitive agent efficiently. The agent is then extended into the high-dimensional original space and continues to be trained to obtain better performance on the task. ADE combines the advantages of higher data-efficiency in a low-dimensional action space and better performance in a high-dimensional action space.
All the methods mentioned above can accelerate RL by optimizing the dimensionality of the action and state spaces. However, the policies are still globally connected; specifically, the policy for each action component is decided by all state components. Some irrelevant state components damage the performance of the policies and also decrease the data-efficiency and convergence speed. If the dependence of action components on state components is known, the state space can be divided into several sub-state spaces, and a local connection policy can be constructed for each action component, which is beneficial to data-efficiency and convergence speed.
1.3. Motivation and Contribution
Motivated by constructing controllers more efficiently through RL, this paper first defines CS3 to judge the effect of an action component on a state component in complex tasks. Then a connection graph is constructed based on CS3 to clarify the correlativity of action components with state components and to define the inputs of the policies. The LCRL method based on the connection graph is proposed to eliminate the influence of irrelevant state components on the policies for action components, thus accelerating the training process.
The main contributions of this paper are as follows. First, we propose the LCRL method to accelerate convergence and increase the final reward of RL algorithms. The LCRL method is based on the definitions of CS3 and the connection graph, which reveal the dependence of action components on state components. Besides, the LCRL method is implemented to construct a learning-based compliance controller for the peg-in-hole assembly task, which results in lower force/moment and guarantees a more stable control process.
The rest of the paper is organized as follows. Section 2 introduces the LCRL method. Section 3 develops the control method of robotic peg-in-hole assembly using the LCRL method. Sections 4 and 5 provide simulation and experiment verifications. Section 6 summarizes the research work of this paper.
2. Local Connection Reinforcement Learning
2.1. Reinforcement Learning in Continuous Space
RL abstracts arbitrary control problems into a Markov decision process (MDP) $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the state transition function, $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, and $\gamma \in [0,1)$ is the discount rate. Here we take the state space $\mathcal{S}$ to be an $m$-dimensional space and the action space $\mathcal{A}$ to be an $n$-dimensional space. State $s$ is an $m$-dimensional vector $s = [s^1, s^2, \ldots, s^m]$. Action $a$ is an $n$-dimensional vector $a = [a^1, a^2, \ldots, a^n]$. The state transition function is recorded as $\mathcal{P}(s_{t+1} \mid s_t, a_t)$, which gives the probability of the next state $s_{t+1}$ once the current state $s_t$ and action $a_t$ are determined. The state transition function is usually implicit in complex tasks. The targets of tasks are shaped through the reward function $r(s_t, a_t, s_{t+1})$. The long-term value can be evaluated through the sum of discounted rewards:

$$Q(s_t, a_t) = \sum_{i=0}^{\infty} \gamma^i \, r(s_{t+i}, a_{t+i}, s_{t+i+1}) \qquad (1)$$
The terminal goal of RL is to train an optimal policy $\pi^*$ that maximizes the value function:

$$\pi^*(s) = \arg\max_{a} Q(s, a) \qquad (2)$$
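To make Eqs. (1)-(2) concrete, the following minimal Python sketch computes a finite-horizon discounted return and a greedy action over a sampled candidate set. The Q-function, the candidate actions, and the horizon truncation are illustrative assumptions; in the continuous spaces considered in this paper the argmax is realized by a learned policy rather than by enumeration.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Finite-horizon approximation of Eq. (1): sum of gamma^i * r_i."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def greedy_action(q_function, state, candidate_actions):
    """Illustration of Eq. (2): pick the candidate action with the largest Q-value.
    In continuous action space this argmax is usually realized by a trained actor
    network; enumeration over candidates is used here only for illustration."""
    q_values = [q_function(state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(q_values))]

# Toy usage with a hypothetical quadratic Q-function.
q = lambda s, a: -float(np.sum((np.asarray(a) - np.asarray(s)) ** 2))
print(discounted_return([1.0, 0.5, 0.25]))
print(greedy_action(q, [0.2, -0.1], [[0.0, 0.0], [0.2, -0.1], [1.0, 1.0]]))
```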
For normalized continuous action and state spaces, $\mathcal{S}$ can be taken apart into $m$ orthogonal subspaces $\mathcal{S}^j,\ j = 1, 2, \ldots, m$. Subspace $\mathcal{S}^j$ involves the states all of whose components are zero except component $s^j$:

$$\mathcal{S}^j = \left\{ s \mid s^k = 0,\ \forall k \neq j \right\} \qquad (3)$$

Similarly, $\mathcal{A}$ can be taken apart into $n$ orthogonal subspaces $\mathcal{A}^i,\ i = 1, 2, \ldots, n$. Subspace $\mathcal{A}^i$ involves the actions all of whose components are zero except component $a^i$:

$$\mathcal{A}^i = \left\{ a \mid a^k = 0,\ \forall k \neq i \right\} \qquad (4)$$
Note that $\mathcal{S}$, $\mathcal{A}$, $\mathcal{S}^j$, and $\mathcal{A}^i$ are not strictly linear spaces because their elements are bounded. Except for the boundedness of the elements, $\mathcal{S}$, $\mathcal{A}$, $\mathcal{S}^j$, and $\mathcal{A}^i$ satisfy all the properties of a linear space. In the following contents, we still use the related concepts of linear space when this does not cause confusion.
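As a quick illustration of Eqs. (3)-(4), the sketch below projects a vector onto one of the orthogonal subspaces by zeroing every other component; the function name and example values are ours, not from the paper.

```python
import numpy as np

def project_to_subspace(vector, index):
    """Keep only the component at `index` and zero the rest,
    mapping s to its representative in S^j (or a to A^i)."""
    projected = np.zeros_like(vector)
    projected[index] = vector[index]
    return projected

s = np.array([0.2, -0.5, 0.1])
print(project_to_subspace(s, 1))   # -> [ 0.  -0.5  0. ]
```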
(a) Global connection RL. (b) Local connection RL.
Fig. 1. Definitions of global connection RL and local connection RL.
At each control step, the state $s_t$ at step $t$ is collected according to the status of the environment. The action $a_t$ is determined by the RL policy $a_t = \pi(s_t)$. As shown in Fig. 1, according to the mapping between action subspaces and state subspaces as well as the dependence of action components on state components, RL methods can be divided into two categories:
Global connection RL (GCRL): each action subspace is mapped to all state subspaces and each action component is determined by all state components.
Local connection RL (LCRL): each action subspace is mapped to part of the state subspaces and each action component is determined by part of the state components.
In existing RL methods, the policy $a_t = \pi(s_t)$ is globally connected by default. Some irrelevant state components have a negative influence on the selection of action components. Hence, the GCRL method may cause a potential curse of dimensionality and reduce the data-efficiency of the training process. If the local connection relation is set appropriately, it is possible to avoid mapping irrelevant state subspaces to some action subspaces, thus ensuring both optimality and data-efficiency. The specific LCRL method is proposed in the following subsections.
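The distinction between the two categories can be illustrated with a small Python sketch: a globally connected policy feeds the full state to every action head, while a locally connected policy masks the state with one row of a connection matrix before computing each action component. The connection matrix, the per-component policies, and the example values below are hypothetical stand-ins, not the controller developed later in the paper.

```python
import numpy as np

# Hypothetical connection graph G (n actions x m states): G[i, j] = 1 means
# action component a^i is allowed to depend on state component s^j.
G = np.array([[1, 1, 0],
              [0, 1, 1]])

def local_connection_policy(state, per_component_policies, G):
    """Compute each action component only from the state components selected
    by its row of the connection graph (the LCRL case); a globally connected
    policy would instead pass the full state to every component."""
    action = np.zeros(G.shape[0])
    for i, policy_i in enumerate(per_component_policies):
        masked_state = state[G[i].astype(bool)]  # keep only the relevant state components
        action[i] = policy_i(masked_state)
    return action

# Toy per-component policies standing in for trained networks.
policies = [lambda x: float(np.tanh(x.sum())),
            lambda x: float(np.mean(x))]
print(local_connection_policy(np.array([0.1, -0.3, 0.7]), policies, G))
```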
2.2. Connection Graph
The basic idea of the LCRL method is that if some action component $a^i$ does not affect some state component $s^j$ during several state transitions, the selection of $a^i$ is independent of $s^j$. The key step is depicting the effect of $a^i$ on $s^j$. As indicated by the state transition function, the next state is decided by the current state $s_t$ and action $a_t$, and $s^j$ may continuously change even when the action is a zero vector. The key metric of the effect of $a^i$ on $s^j$ is therefore the bias between the state trajectories of $s^j$ before and after activating $a^i$. Here we propose the concept of CS3 to judge the effect of $a^i$ on $s^j$:
$$\phi(a^i, s^j) = \mathbb{E}_{\mathcal{M} \subseteq \mathcal{A} \backslash \mathcal{A}^i,\; a^{(1)} \in \mathcal{M},\; a^{(2)} \in \mathcal{A}^i,\; s_0 = s'_0} \left[ \sum_{t=1}^{m} \left| s'^{\,j}_t\!\left(a^{(1)} + a^{(2)},\, s'_{t-1}\right) - s^j_t\!\left(a^{(1)},\, s_{t-1}\right) \right| \right] \qquad (5)$$
where the CS3 value $\phi(\cdot)$ describes the effect of $a^i$ on $s^j$ and has a form similar to the Shapley function [38]. The larger the CS3 value is, the more $a^i$ affects $s^j$. $\mathbb{E}$ is the expectation operator, $\mathcal{A} \backslash \mathcal{A}^i$ denotes the orthogonal complement of $\mathcal{A}^i$ in $\mathcal{A}$, and $\mathcal{M}$ represents a subspace of $\mathcal{A} \backslash \mathcal{A}^i$. $a^{(1)}$ and $a^{(2)}$ are vectors in subspaces $\mathcal{M}$ and $\mathcal{A}^i$, respectively. In summary, CS3 describes the effect of $a^i$ on $s^j$ by comparing the state trajectories before and after adding $a^{(2)}$ in $\mathcal{A}^i$ to $a^{(1)}$ in $\mathcal{M}$.
Since RL-based control methods are always implemented in continuous space, CS3 is defined in the manner of sampling and expectation, which helps to decrease the algorithm complexity. Another unique feature of CS3 is that it is defined along a state trajectory whose length equals the dimensionality of the state space, $m$, instead of at a single state point. That is because an action component may not change a state component in one step, and in some cases the effect only appears after several steps. Besides, integrating $a^{(2)} \in \mathcal{A}^i$ (setting $a^i_t$ to be non-zero) may either increase or decrease the coming state $s^j_{t+1}$, so the terminal states $s^j_m$ and $s'^j_m$ may be equal. For example, the state trajectory after adding $a^{(2)}$ may be 1→0→1, while the state trajectory before adding $a^{(2)}$ is 1→1→1. The terminal states are the same, but the sums of the absolute values of the state biases are different, which shows the rationality of CS3. Hence, the serialized feature allows the long-term effect to be considered, which is not only beneficial but also necessary in some cases to reflect the effect of $a^i$ on $s^j$.
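A Monte-Carlo sketch of how a CS3 value of the kind in Eq. (5) could be estimated by sampling is given below. The `simulate` callback, the uniform sampling ranges, and the use of a constant action held over the whole rollout are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def estimate_cs3(simulate, i, j, n_dims, m_dims, n_samples=100, horizon=None):
    """Monte-Carlo estimate of a CS3-style value phi(a^i, s^j) as in Eq. (5).

    `simulate(s0, actions)` is assumed to return the state trajectory obtained
    by applying the action sequence `actions` from initial state `s0`.
    The rollout horizon defaults to the state dimensionality m, as in the paper.
    """
    horizon = horizon or m_dims
    total = 0.0
    for _ in range(n_samples):
        s0 = np.random.uniform(-1.0, 1.0, size=m_dims)      # shared initial state s_0 = s'_0
        a1 = np.random.uniform(-1.0, 1.0, size=n_dims)
        a1[i] = 0.0                                          # a^(1) lies in A \ A^i
        a2 = np.zeros(n_dims)
        a2[i] = np.random.uniform(-1.0, 1.0)                 # a^(2) lies in A^i
        traj_base = simulate(s0, [a1] * horizon)             # trajectory before activating a^i
        traj_pert = simulate(s0, [a1 + a2] * horizon)        # trajectory after adding a^(2)
        total += sum(abs(sp[j] - sb[j]) for sp, sb in zip(traj_pert, traj_base))
    return total / n_samples
```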
Based on the CS3 value $\phi(a^i, s^j)$ for arbitrary $i$ and $j$, the effect of all action components on all state components can be encoded in a connection graph $G = [G_{ij}]_{n \times m}$. Each element of $G$ is 1 or 0, which indicates whether or not the action component contributes to changing the state component.
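One simple way to turn the matrix of CS3 values into the binary connection graph $G$ is thresholding, sketched below; the threshold value and the example numbers are assumptions for illustration, since the rule for deriving $G$ from CS3 has not been specified at this point in the text.

```python
import numpy as np

def build_connection_graph(cs3_values, threshold=0.05):
    """Binarize an n x m matrix of CS3 values into the connection graph G,
    where G[i, j] = 1 means action component a^i is judged to affect s^j."""
    return (np.asarray(cs3_values) > threshold).astype(int)

cs3 = np.array([[0.80, 0.01, 0.30],
                [0.02, 0.65, 0.00]])
print(build_connection_graph(cs3))
# [[1 0 1]
#  [0 1 0]]
```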