
In addition, most methods rely on solving inverse kinematics or dynamics equations to map from task space to joint space for collision avoidance, which requires a highly accurate model of the robot's dynamics and does not generalize easily to diverse tasks and different robots [14, 15].
In recent years, deep reinforcement learning (RL) has emerged as a promising solution to this challenge by providing a data-driven approach to motion planning [16]. RL algorithms learn from interactions with the environment, enabling robots to perform complex goal-reaching tasks while avoiding obstacles in real time. RL-based methods for robotic manipulation have gained popularity and are increasingly used as an alternative to traditional analytical control systems [17, 18, 19, 20, 21, 22], as they have shown great potential for improving the accuracy and efficiency of goal-reaching tasks. For instance,
Adel et al. [23] proposed a reinforcement learning framework
that combines nonlinear model predictive control with obstacle
avoidance. This approach was evaluated on a 6-DoF robot
manipulator and demonstrated its ability to successfully avoid
collisions with static obstacles. However, it struggles to handle obstacles in dynamic environments.
Research has shown that the presence of moving obstacles can
greatly increase the difficulty of motion planning tasks [24].
Sehgal et al. proposed a method based on deep deterministic policy gradient (DDPG) and hindsight experience replay (HER) that uses a genetic algorithm (GA) to fine-tune parameter values. They experimented on six robotic manipulation tasks and achieved better results than the baselines [25].
Franceschetti et al. presented an extensive comparison of trust region policy optimization (TRPO) and deep Q-network with normalized advantage functions (DQN-NAF) against other state-of-the-art algorithms, namely DDPG and vanilla policy gradient (VPG) [26]. Unlike our work, these studies concentrate only on reaching a single target position. For multi-target trajectory planning, Wang et al. introduced action ensembles based on the Poisson distribution (AEP) into PPO; their method can be easily extended so that the end-effector tracks a specific trajectory [27]. However, while the workspace they consider is sufficient for space robots, it is insufficient for industrial robots, and their approach did not produce favorable outcomes when applied to a larger workspace after training. Thus, the algorithm requires further development. In another work, Kumar et al. proposed a simple, versatile joint-level controller based on PPO. Experiments showed that the method achieves an error comparable to traditional methods while greatly simplifying the process by automatically handling redundancy, joint limits, and acceleration or deceleration profiles [28]. Nevertheless, the output of the neural network is the velocity of the end-effector.
Additionally, the majority of DRL-based research performs learning in task space rather than joint space [29, 30, 31], which tends to produce weak results for reaching tasks [32, 29]. Furthermore, such approaches still require solving inverse kinematics and cannot accomplish reaching tasks when obstacles are close to the manipulator's links.
In this paper, we propose a novel model-free deep reinforcement learning approach, called Improved PPO (IPPO), to tackle multi-target reaching while avoiding obstacles in dynamic environments. An overview of the method is depicted in Fig. 1. In particular, we train a deep policy that maps from task space to joint space for a 6-DoF manipulator. To improve the effectiveness of the model's output on reaching tasks, an action-ensembles method is introduced, and the policy is designed to participate directly in the value-function update of PPO.
Additionally, since training such a task on a real robot is time-consuming and laborious, we develop a simulation environment in Gazebo to train the model, as Gazebo yields a smaller Sim-to-Real gap than other simulators. However, because training robots in Gazebo is computationally expensive and requires a long training time, we propose a Sim-to-Sim method that significantly reduces the training time. Finally, the trained model is deployed directly on a real-robot setup without fine-tuning.
In comparison to prior works [33, 34, 35, 36], our research makes three significant contributions: (i) the distance between obstacles and the manipulator's links is computed with a geometry-based method, which improves the reaching task in the presence of obstacles; (ii) an action-ensembles approach is introduced to enhance the efficiency of the policy; and (iii) an adaptive discount factor for PPO is designed, allowing the policy to participate directly in the value-function update. Empirical results demonstrate that the proposed approach outperforms other baseline methods in different testing scenarios.
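As an illustration of contribution (i), the following minimal sketch shows one common way to compute such a geometry-based clearance when each link is approximated by a line segment (a capsule) and each obstacle by a sphere. The function names, radii, and example coordinates are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def point_to_segment_distance(p, a, b):
    """Shortest distance from point p to the segment [a, b]."""
    ab = b - a
    # Project p onto the segment axis and clamp to the endpoints.
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    closest = a + t * ab
    return np.linalg.norm(p - closest)

def link_obstacle_clearance(joint_a, joint_b, obs_center, obs_radius, link_radius=0.05):
    """Clearance between a link (capsule of radius link_radius) and a spherical obstacle."""
    d = point_to_segment_distance(obs_center, joint_a, joint_b)
    return d - obs_radius - link_radius

# Example: an obstacle lying beside the upper-arm link.
shoulder = np.array([0.0, 0.0, 0.3])
elbow = np.array([0.0, 0.0, 0.8])
obstacle = np.array([0.2, 0.0, 0.55])
print(link_obstacle_clearance(shoulder, elbow, obstacle, obs_radius=0.05))  # about 0.10 m
```

A clearance of this form can be evaluated per link and fed into the reward or used as a collision threshold; negative values indicate penetration of the safety envelope.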
II. PRELIMINARIES
Our research focuses on developing an effective method for obstacle avoidance in reaching tasks for manipulators. To achieve this goal, the manipulator must safely interact with the environment over many episodes. To design our training process, we initially chose Gazebo due to its better compatibility with the Robot Operating System (ROS) compared to PyBullet. However, DRL methods tend to require long training times in Gazebo. To mitigate this challenge, we created a similar environment in PyBullet for initial training, and then transferred and evaluated the learned model in Gazebo through a Sim-to-Sim transfer process. Finally, we tested the efficiency of the model on a real robot via Sim-to-Real transfer. The simulation environments in Gazebo and PyBullet are depicted in Fig. 2. In both environments, a UR5e robot equipped with a Robotiq 140 gripper is used as the manipulator. During the training and testing phases, the poses of the target (shown in red) and the obstacles (shown in blue) are randomly set within the workspace.
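As a rough illustration of this setup, the sketch below re-samples random target and obstacle poses in PyBullet at the start of each episode. The workspace bounds, the helper name randomize_episode, and the commented-out robot URDF path are assumptions for illustration, not the paper's actual assets or code.

```python
import numpy as np
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)  # headless physics server; use p.GUI to visualize
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.loadURDF("plane.urdf")
# Hypothetical path: the actual UR5e + Robotiq 140 description depends on the local setup.
# robot = p.loadURDF("ur5e_robotiq140.urdf", basePosition=[0, 0, 0], useFixedBase=True)
target = p.loadURDF("sphere_small.urdf", basePosition=[0.4, 0.0, 0.3])
obstacle = p.loadURDF("cube_small.urdf", basePosition=[0.3, 0.2, 0.3])

# Assumed workspace bounds (metres) in the robot base frame.
WS_LOW = np.array([0.25, -0.35, 0.10])
WS_HIGH = np.array([0.60, 0.35, 0.50])

def randomize_episode():
    """Re-sample random target and obstacle poses at the start of each episode."""
    target_pos = np.random.uniform(WS_LOW, WS_HIGH)
    obstacle_pos = np.random.uniform(WS_LOW, WS_HIGH)
    p.resetBasePositionAndOrientation(target, target_pos, [0, 0, 0, 1])
    p.resetBasePositionAndOrientation(obstacle, obstacle_pos, [0, 0, 0, 1])
    return target_pos, obstacle_pos

print(randomize_episode())
```

Because PyBullet and Gazebo expose the same scene in this way, the policy trained against the PyBullet environment can later be evaluated in Gazebo with the same observation and action conventions, which is what the Sim-to-Sim transfer step relies on.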
A. Proximal Policy Optimization (PPO)
In this work, we use PPO, one of the state-of-the-art on-policy DRL methods, since it is known for its stability and effectiveness in various environments. The algorithm balances exploration and exploitation to find a policy that maximizes the reward. PPO also offers a trade-off between stability and high sample efficiency, making it suitable for complex robotic environments with obstacle avoidance. In particular, PPO is