Local Connection Reinforcement Learning Method
for Efficient Control of Robotic Peg-in-Hole Assembly
Yuhang Gai, Jiwen Zhang, Dan Wu*, and Ken Chen
All authors are from the State Key Laboratory of Tribology in Advanced Equipment,
Department of Mechanical Engineering, Tsinghua University, Beijing, China.
*Corresponding author: Dan Wu (phone: 1-391-083-2965; e-mail: wud@mail.tsinghua.edu.cn).
Abstract: Traditional control methods of robotic peg-in-hole assembly rely on complex contact state analysis. Reinforcement learning (RL) is gradually becoming a preferred method for controlling robotic peg-in-hole assembly tasks. However, the training process of RL is quite time-consuming because RL methods are usually globally connected, meaning that all state components are assumed to be inputs of the policies for all action components, which enlarges the action space and state space to be explored. In this paper, we first define the continuous space serialized Shapley value (CS3) and construct a connection graph to clarify the correlativity of action components with state components. We then propose a local connection reinforcement learning (LCRL) method based on the connection graph, which eliminates the influence of irrelevant state components on the selection of action components. The simulation and experiment results demonstrate that the control strategy obtained through the LCRL method improves the stability and rapidity of the control process. The LCRL method also enhances the data-efficiency and increases the final reward of the training process.
Keywords: robotic assembly, compliance control, local connection reinforcement learning, connection graph
1. Introduction
1.1. Robotic Assembly Control
Robotic assembly based on off-line planning can neither coordinate stress nor guarantee precision between assembly objects [1]. Hence, the robot is usually guided by external feedback of vision or force information to execute assembly tasks. The perception range of vision is limited and easily affected by the environment, whereas force information more directly reflects the stress and pose error between assembly objects. Hence, compliance control methods based on force feedback are more widely used [2].
Compliance control methods used in assembly tasks construct the mapping between force/moment and
the relative pose of assembly objects [3]. According to whether the structure and parameters of a controller
are adaptive, compliance control methods can be divided into three categories: constant compliance control
methods, artificially designed adaptive compliance control methods, and learning-based adaptive
compliance control methods [4].
The structure and parameters of a constant compliance controller are pre-configured and always constant
in the assembly process [5]-[7]. Constant compliance control methods can only solve simple assembly tasks
with weak nonlinear dynamics. However, the dynamics of peg-in-hole assembly tasks are continuously
changing and strongly nonlinear. The capacity of the constant compliance controller may be insufficient to
handle the peg-in-hole assembly tasks.
Hence, adaptive compliance control methods have been proposed to obtain better control performance on assembly tasks [8]-[10]. Variable compliance centre and variable compliance parameters are the most common artificially designed adaptive compliance control methods. The variable compliance centre method changes the structure of the controller by converting motion and force/moment information from the robot and sensor to the dynamic compliance centre. The variable compliance parameters method changes the parameters of the controller according to an artificially designed adaptive law. However, the performance of artificially designed adaptive compliance controllers is limited by human experience.
Learning-based adaptive compliance control methods have been proposed to obtain better performance on assembly tasks [11]-[14]. RL is the most widely employed learning method [15]-[21]. RL abstracts an assembly task as a Markov decision process (MDP) and supplies a policy to guide each control step of the assembly process. Because learning-based adaptive compliance control methods construct the adaptive law through data-driven methods, the adaptive law is closer to optimal for the current assembly task. Learning-based adaptive controllers perform much better than artificially designed ones in assembly tasks. However, the biggest dilemma is that learning-based methods are time-consuming and not stable enough for industrial applications.
1.2. Efficient RL-Based Control
RL is widely used in the field of continuous control [22]. Compared with artificially designed controllers, RL tends to use networks in place of analytic feedback control laws and exploration and exploitation in place of empirical design. However, RL is plagued by long training times [23]-[26], especially in high-dimensional continuous action and state spaces. When the action space and state space are high-dimensional and continuous, exploration becomes relatively inefficient, which reduces the data-efficiency and convergence speed of the training process. One key technique for improving data-efficiency and convergence speed is to optimize the dimensionality of the action space and state space.
A primary method for accelerating the training process is to construct a mapping from the action space or state space to a low-dimensional latent space and then learn in the latent space [27]-[31]. Because the dimensionality of the latent space is smaller, the size of the space to be explored is reduced, which makes the exploration of RL more efficient. Gaussian process regression is usually employed to learn the mapping between the latent space and the action or state space. In the fields of multi-agent RL and sequential action space RL, extracting and training in a latent space is also an effective means to improve data-efficiency and convergence speed [32][33].
Hierarchical RL optimizes dimensionality by decomposing a high-dimensional action space into a sequence of high-level and low-level sub-action spaces [34]-[36]. Hierarchical RL then learns a policy consisting of multiple layers, each responsible for a different level of control. The dimensionality of the low-level sub-action spaces is smaller than that of the original action space. The action space to be explored in hierarchical RL is thus divided and conquered, improving data-efficiency and convergence speed. Compared with training in a latent space, hierarchical RL has the potential to perform better on a task but suffers from the difficulty of selecting the low-level sub-action spaces.
An action dimensionality extension (ADE) method is proposed in [37], which draws on the ideas of latent space and hierarchical RL but works differently. The ADE method first constructs a low-dimensional action space according to the similarity between action components and trains a primitive agent efficiently. The agent is then extended into the high-dimensional original space and continues to be trained to obtain better performance on the task. ADE combines the advantages of higher data-efficiency in a low-dimensional action space and better performance in a high-dimensional action space.
All the methods mentioned above can accelerate RL by optimizing the dimensionality of the action and state spaces. However, the policies are still globally connected; specifically, the policy for each action component is decided by all state components. Some irrelevant state components damage the performance of the policies and also decrease the data-efficiency and convergence speed. If the dependence of action components on state components is known, the state space can be divided into several sub-state spaces, and a local connection policy can be constructed for each action component, which is beneficial to data-efficiency and convergence speed.
1.3. Motivation and Contribution
Motivated by constructing controllers more efficiently through RL, this paper first defines CS3 to judge the effect of an action component on a state component in complex tasks. Then a connection graph is constructed based on CS3 to clarify the correlativity of action components with state components and to define the inputs of the policies. The LCRL method based on the connection graph is proposed to eliminate the influence of irrelevant state components on the policies for action components, thus accelerating the training process.
The main contributions of this paper are as follows. First, we propose the LCRL method to accelerate convergence and increase the final reward of RL algorithms. The LCRL method is based on the definitions of CS3 and the connection graph, which reveal the dependence of action components on state components. Besides, the LCRL method is implemented to construct a learning-based compliance controller for the peg-in-hole assembly task, which results in lower force/moment and guarantees a more stable control process.
The rest of the paper is organized as follows. Section 2 introduces the LCRL method. Section 3 develops the control method of robotic peg-in-hole assembly using the LCRL method. Sections 4 and 5 provide simulation and experiment verifications. Section 6 summarizes the research work of this paper.
2. Local Connection Reinforcement Learning
2.1. Reinforcement Learning in Continuous Space
RL abstracts arbitrary control problems into a Markov decision process (MDP) $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the state transition function, $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, and $\gamma \in [0,1)$ is the discount rate. Here we take the state space $\mathcal{S}$ to be an $m$-dimensional space and the action space $\mathcal{A}$ to be an $n$-dimensional space. State $s$ is an $m$-dimensional vector $s = [s^1, s^2, \ldots, s^m]$. Action $a$ is an $n$-dimensional vector $a = [a^1, a^2, \ldots, a^n]$. The state transition function is recorded as $\mathcal{P}(s_{t+1} \mid s_t, a_t)$, which gives the probability of the next state $s_{t+1}$ once the current state $s_t$ and action $a_t$ are determined. The state transition function is usually implicit in complex tasks. The targets of tasks are shaped through the reward function $r(s_t, a_t, s_{t+1})$. The long-term value can be evaluated through the sum of discounted rewards:

$$Q(s_t, a_t) = \sum_{i=0}^{\infty} \gamma^i \, r(s_{t+i}, a_{t+i}, s_{t+i+1}) \qquad (1)$$
The terminal goal of RL is to train an optimal policy $\pi^*$ that maximizes the value function:

$$\pi^*(s) = \arg\max_{a} Q(s, a) \qquad (2)$$
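To make Eqs. (1)-(2) concrete, the following minimal Python sketch computes a finite-horizon discounted return and a greedy action over a sampled candidate set. The Q-function, the candidate actions, and the horizon truncation are illustrative assumptions; in the continuous spaces considered in this paper the argmax is realized by a learned policy rather than by enumeration.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Finite-horizon approximation of Eq. (1): sum of gamma^i * r_i."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def greedy_action(q_function, state, candidate_actions):
    """Illustration of Eq. (2): pick the candidate action with the largest Q-value.
    In continuous action space this argmax is usually realized by a trained actor
    network; enumeration over candidates is used here only for illustration."""
    q_values = [q_function(state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(q_values))]

# Toy usage with a hypothetical quadratic Q-function.
q = lambda s, a: -float(np.sum((np.asarray(a) - np.asarray(s)) ** 2))
print(discounted_return([1.0, 0.5, 0.25]))
print(greedy_action(q, [0.2, -0.1], [[0.0, 0.0], [0.2, -0.1], [1.0, 1.0]]))
```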
For normalized continuous action and state spaces, $\mathcal{S}$ can be taken apart into $m$ orthogonal subspaces $\mathcal{S}^j,\ j = 1, 2, \ldots, m$. Subspace $\mathcal{S}^j$ involves the states all of whose components are zero except component $s^j$:

$$\mathcal{S}^j = \left\{ s \mid s^k = 0,\ \forall k \neq j \right\} \qquad (3)$$

Similarly, $\mathcal{A}$ can be taken apart into $n$ orthogonal subspaces $\mathcal{A}^i,\ i = 1, 2, \ldots, n$. Subspace $\mathcal{A}^i$ involves the actions all of whose components are zero except component $a^i$:

$$\mathcal{A}^i = \left\{ a \mid a^k = 0,\ \forall k \neq i \right\} \qquad (4)$$
Note that $\mathcal{S}$, $\mathcal{A}$, $\mathcal{S}^j$, and $\mathcal{A}^i$ are not strictly linear spaces because their elements are bounded. Except for the boundedness of the elements, $\mathcal{S}$, $\mathcal{A}$, $\mathcal{S}^j$, and $\mathcal{A}^i$ satisfy all the properties of a linear space. In the following contents, we still use the related concepts of linear space when this does not cause confusion.
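As a quick illustration of Eqs. (3)-(4), the sketch below projects a vector onto one of the orthogonal subspaces by zeroing every other component; the function name and example values are ours, not from the paper.

```python
import numpy as np

def project_to_subspace(vector, index):
    """Keep only the component at `index` and zero the rest,
    mapping s to its representative in S^j (or a to A^i)."""
    projected = np.zeros_like(vector)
    projected[index] = vector[index]
    return projected

s = np.array([0.2, -0.5, 0.1])
print(project_to_subspace(s, 1))   # -> [ 0.  -0.5  0. ]
```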
(a) Global connection RL. (b) Local connection RL.
Fig. 1. Definitions of global connection RL and local connection RL.
At each control step, the state $s_t$ at step $t$ is collected according to the status of the environment. The action $a_t$ is determined by the RL policy $a_t = \pi(s_t)$. As shown in Fig. 1, according to the mapping between action subspaces and state subspaces as well as the dependence of action components on state components, RL methods can be divided into two categories:
Global connection RL (GCRL): each action subspace is mapped to all state subspaces and each action component is determined by all state components.
Local connection RL (LCRL): each action subspace is mapped to part of the state subspaces and each action component is determined by part of the state components.
In existing RL methods, the policy $a_t = \pi(s_t)$ is globally connected by default. Some irrelevant state components have a negative influence on the selection of action components. Hence, the GCRL method may cause a potential curse of dimensionality and reduce the data-efficiency of the training process. If the local connection relation is set appropriately, it is possible to avoid mapping irrelevant state subspaces to some action subspaces, thus ensuring both optimality and data-efficiency. The specific LCRL method is proposed in the following subsections.
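The distinction between the two categories can be illustrated with a small Python sketch: a globally connected policy feeds the full state to every action head, while a locally connected policy masks the state with one row of a connection matrix before computing each action component. The connection matrix, the per-component policies, and the example values below are hypothetical stand-ins, not the controller developed later in the paper.

```python
import numpy as np

# Hypothetical connection graph G (n actions x m states): G[i, j] = 1 means
# action component a^i is allowed to depend on state component s^j.
G = np.array([[1, 1, 0],
              [0, 1, 1]])

def local_connection_policy(state, per_component_policies, G):
    """Compute each action component only from the state components selected
    by its row of the connection graph (the LCRL case); a globally connected
    policy would instead pass the full state to every component."""
    action = np.zeros(G.shape[0])
    for i, policy_i in enumerate(per_component_policies):
        masked_state = state[G[i].astype(bool)]  # keep only the relevant state components
        action[i] = policy_i(masked_state)
    return action

# Toy per-component policies standing in for trained networks.
policies = [lambda x: float(np.tanh(x.sum())),
            lambda x: float(np.mean(x))]
print(local_connection_policy(np.array([0.1, -0.3, 0.7]), policies, G))
```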
2.2. Connection Graph
The basic idea of the LCRL method is that if some action component $a^i$ does not affect some state component $s^j$ during several state transitions, the selection of $a^i$ is independent of $s^j$. The key step is depicting the effect of $a^i$ on $s^j$. As indicated by the state transition function, the next state is decided by the current state $s_t$ and action $a_t$, and $s^j$ may continuously change even when the action is a zero vector. The key metric of the effect of $a^i$ on $s^j$ is therefore the bias between the state trajectories of $s^j$ before and after activating $a^i$. Here we propose the concept of CS3 to judge the effect of $a^i$ on $s^j$:
$$\phi(a^i, s^j) = \mathbb{E}_{\mathcal{M} \subseteq \mathcal{A} \backslash \mathcal{A}^i,\; a^{(1)} \in \mathcal{M},\; a^{(2)} \in \mathcal{A}^i,\; s_0 = s'_0} \left[ \sum_{t=1}^{m} \left| s'^{\,j}_t\!\left(a^{(1)} + a^{(2)},\, s'_{t-1}\right) - s^j_t\!\left(a^{(1)},\, s_{t-1}\right) \right| \right] \qquad (5)$$
where the CS3 value $\phi(\cdot)$ describes the effect of $a^i$ on $s^j$ and has a form similar to the Shapley function [38]. The larger the CS3 value is, the more $a^i$ affects $s^j$. $\mathbb{E}$ is the expectation operator, $\mathcal{A} \backslash \mathcal{A}^i$ denotes the orthogonal complement of $\mathcal{A}^i$ in $\mathcal{A}$, and $\mathcal{M}$ represents a subspace of $\mathcal{A} \backslash \mathcal{A}^i$. $a^{(1)}$ and $a^{(2)}$ are vectors in subspaces $\mathcal{M}$ and $\mathcal{A}^i$, respectively. In summary, CS3 describes the effect of $a^i$ on $s^j$ by comparing the state trajectories before and after adding $a^{(2)}$ in $\mathcal{A}^i$ to $a^{(1)}$ in $\mathcal{M}$.
Since RL-based control methods are always implemented in continuous space, CS3 is defined in the manner of sampling and expectation, which helps to decrease the algorithm complexity. Another unique feature of CS3 is that it is defined along a state trajectory whose length equals the dimensionality of the state space, $m$, instead of at a single state point. That is because an action component may not change a state component in one step, and in some cases the effect only appears after several steps. Besides, integrating $a^{(2)} \in \mathcal{A}^i$ (setting $a^i_t$ to be non-zero) may either increase or decrease the coming state $s^j_{t+1}$, so the terminal states $s^j_m$ and $s'^j_m$ may be equal. For example, the state trajectory after adding $a^{(2)}$ may be 1→0→1, while the state trajectory before adding $a^{(2)}$ is 1→1→1. The terminal states are the same, but the sums of the absolute values of the state biases are different, which shows the rationality of CS3. Hence, the serialized feature allows the long-term effect to be considered, which is not only beneficial but also necessary in some cases to reflect the effect of $a^i$ on $s^j$.
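A Monte-Carlo sketch of how a CS3 value of the kind in Eq. (5) could be estimated by sampling is given below. The `simulate` callback, the uniform sampling ranges, and the use of a constant action held over the whole rollout are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def estimate_cs3(simulate, i, j, n_dims, m_dims, n_samples=100, horizon=None):
    """Monte-Carlo estimate of a CS3-style value phi(a^i, s^j) as in Eq. (5).

    `simulate(s0, actions)` is assumed to return the state trajectory obtained
    by applying the action sequence `actions` from initial state `s0`.
    The rollout horizon defaults to the state dimensionality m, as in the paper.
    """
    horizon = horizon or m_dims
    total = 0.0
    for _ in range(n_samples):
        s0 = np.random.uniform(-1.0, 1.0, size=m_dims)      # shared initial state s_0 = s'_0
        a1 = np.random.uniform(-1.0, 1.0, size=n_dims)
        a1[i] = 0.0                                          # a^(1) lies in A \ A^i
        a2 = np.zeros(n_dims)
        a2[i] = np.random.uniform(-1.0, 1.0)                 # a^(2) lies in A^i
        traj_base = simulate(s0, [a1] * horizon)             # trajectory before activating a^i
        traj_pert = simulate(s0, [a1 + a2] * horizon)        # trajectory after adding a^(2)
        total += sum(abs(sp[j] - sb[j]) for sp, sb in zip(traj_pert, traj_base))
    return total / n_samples
```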
Based on the CS3 value $\phi(a^i, s^j)$ for arbitrary $i$ and $j$, the effect of all action components on all state components can be encoded in a connection graph $G = [G_{ij}]_{n \times m}$. Each element of $G$ is 1 or 0, which indicates whether or not the action component contributes to changing the state component.
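One simple way to turn the matrix of CS3 values into the binary connection graph $G$ is thresholding, sketched below; the threshold value and the example numbers are assumptions for illustration, since the rule for deriving $G$ from CS3 has not been specified at this point in the text.

```python
import numpy as np

def build_connection_graph(cs3_values, threshold=0.05):
    """Binarize an n x m matrix of CS3 values into the connection graph G,
    where G[i, j] = 1 means action component a^i is judged to affect s^j."""
    return (np.asarray(cs3_values) > threshold).astype(int)

cs3 = np.array([[0.80, 0.01, 0.30],
                [0.02, 0.65, 0.00]])
print(build_connection_graph(cs3))
# [[1 0 1]
#  [0 1 0]]
```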