IPPO: Obstacle Avoidance for Robotic Manipulators in Joint
Space via Improved Proximal Policy Optimization
Yongliang Wang
Department of Artificial Intelligence
Bernoulli Institute, Faculty of Science and Engineering
University of Groningen, The Netherlands
Email: yongliang.wang@rug.nl
Hamidreza Kasaei
Department of Artificial Intelligence
Bernoulli Institute, Faculty of Science and Engineering
University of Groningen, The Netherlands
Email: hamidreza.kasaei@rug.nl
Abstract—Reaching random targets while avoiding obstacles is a challenging task for robotic manipulators. In this study,
we propose a novel model-free reinforcement learning approach
based on proximal policy optimization (PPO) for training a
deep policy to map the task space to the joint space of a 6-
DoF manipulator. To facilitate the training process in a large
workspace, we develop an efficient representation of environmen-
tal inputs and outputs. The calculation of the distance between
obstacles and manipulator links is incorporated into the state
representation using a geometry-based method. Additionally, to
enhance the performance of the model in reaching tasks, we
introduce the action ensembles method and design the policy
to directly participate in value function updates in PPO. To
overcome the challenges associated with training in real-robot
environments, we develop a simulation environment in Gazebo
to train the model as it produces a smaller Sim-to-Real gap
compared to other simulators. However, training in Gazebo
is time-intensive. To address this issue, we propose a Sim-
to-Sim method to significantly reduce the training time. The
trained model is then directly applied in a real-robot setup
without fine-tuning. To evaluate the performance of the proposed
approach, we perform several rounds of experiments in both
simulated and real robots. We also compare the performance
of the proposed approach with six baselines. The experimental
results demonstrate the effectiveness of the proposed method
in performing reaching tasks with and without obstacles. Our method outperformed the selected baselines by a large margin
in different reaching task scenarios. A video of these experiments
has been attached to the paper as supplementary material.
I. INTRODUCTION
Goal reaching is a crucial capability for robotic manipulators, as it is a fundamental requirement for many
robotic applications [1, 2, 3, 4, 5]. In human-centric environ-
ments, robots often operate in complex and dynamic domains
where the presence of obstacles makes motion planning a
challenging task. Motion planning for high-degree-of-freedom robotic manipulators in dynamic environments is particularly difficult because the required mathematical models are complex and hard to establish. Classical approaches, such as the rapidly exploring random trees (RRT) algorithm [6]
and the node control-bidirectional RRT (NC-BRRT) algo-
rithm [7], have limitations in handling dynamic environments
and often require prior knowledge of the surroundings, leading
to intensive computation. Previous research has shown that
such motion planning methods are inadequate for dynamic
domains [1, 8, 9], leading to a need for the development of advanced methods that can effectively handle these challenges.

Fig. 1. Overview of the proposed approach (IPPO): the agent is first trained in the Pybullet environment to learn the mapping from the task space to the joint space of the UR5e manipulator. The trained policy is then used in Gazebo for Sim-to-Sim adaptation, followed by direct application in real-robot scenarios without further fine-tuning.
Linghuan et al. [10] proposed an adaptive neural network
bounded control scheme for an n-link rigid robotic manipulator
with unknown dynamics. However, their method necessitated
prior knowledge of environmental limitations. Current path-planning approaches for dynamic environments rely on prior knowledge of the surroundings and require intense online computation [11, 12]. When the environment is complex, the excessive online computation can make the system unresponsive to changes in its state. To sum up, when targets and obstacles change at random, motion planning for high-degree-of-freedom manipulators becomes notoriously challenging in such uncertain environments, as the required mathematical models are complex and difficult to establish [13]. Meanwhile, traditional control approaches are often unable to navigate in unstructured environments.
Besides, most methods rely on solving inverse kinematics or
dynamics equations to map the task space to joint space for
collision avoidance, which requires a highly accurate model of
the robot’s dynamics and is not easily generalized to diverse
tasks and different robots [14, 15].
In recent years, deep Reinforcement Learning (RL) has emerged as a promising solution to this challenge by providing a data-driven approach to motion planning [16]. RL algorithms can learn
from interactions with the environment, enabling robots to
perform complex goal-reaching tasks while avoiding obstacles
in real-time. RL-based methods for robotic manipulation have
gained popularity and are increasingly being used as an alter-
native to traditional analytical control systems [17, 18, 19, 20,
21, 22], as they have shown great potential for improving the
accuracy and efficiency of goal-reaching tasks. For instance,
Adel et al. [23] proposed a reinforcement learning framework
that combines nonlinear model predictive control with obstacle
avoidance. This approach was evaluated on a 6-DoF robot
manipulator and demonstrated its ability to successfully avoid
collisions with static obstacles. However, it has limitations
when it comes to handling obstacles in dynamic environments.
Research has shown that the presence of moving obstacles can
greatly increase the difficulty of motion planning tasks [24].
Sehgal et al. proposed a deep deterministic policy gradient (DDPG) and hindsight experience replay (HER) based method that uses the genetic algorithm (GA) to fine-tune the parameter values. They experimented on six robotic manipulation tasks and achieved better results than the baselines [25].
Franceschetti et al. presented an extensive comparison of trust region policy optimization (TRPO) and deep Q-network with normalized advantage functions (DQN-NAF) against other state-of-the-art algorithms, namely DDPG and vanilla policy gradient (VPG) [26]. Unlike our work, these studies concentrate only on reaching a single target position. For multi-target trajectory planning, Wang et al. introduced action ensembles based on the Poisson distribution (AEP) into PPO; their method can easily be extended to tasks in which the end-effector tracks a specific trajectory [27]. For space robots, such a workspace is sufficient to complete the tasks, but for industrial robots it is not: their approach did not produce favorable outcomes when applied to a larger workspace after training. Thus, the algorithm requires further
development. In another work, Kumar et al. proposed a
simple, versatile joint-level controller via PPO. Experiments showed that the method achieves error comparable to traditional methods, while greatly simplifying the process by automatically handling redundancy, joint limits, and acceleration or deceleration profiles [28]. Nevertheless, the output of the neural network is the velocity of the end-effector. Additionally, the majority of DRL-based research performs learning in task space rather than joint space [29, 30, 31], which tends to produce weak results for reaching tasks [32, 29]. Furthermore, such approaches still need to calculate
the inverse kinematics and cannot accomplish reaching tasks
when obstacles are close to the manipulator’s links.
In this paper, we propose a novel model-free deep rein-
forcement learning approach, called Improved PPO (IPPO), to
tackle reaching multiple targets while avoiding obstacles in
dynamic environments. An overview of the method is depicted
in Fig. 1. In particular, we train a deep policy to map from
task space to joint space for a 6-DoF manipulator. To improve
the effectiveness of the model's output on reaching tasks, the action ensembles method is introduced, and the policy is designed to directly participate in the value function update in PPO. Additionally, since training such a task on a real robot is time-consuming and strenuous, we develop a simulation environ-
ment to train the model in Gazebo as it produces a smaller
Sim-to-Real gap compared to other simulators. However, as
training robots in Gazebo is computationally expensive and
requires a long training time, we propose a Sim-to-Sim method
to significantly reduce the training time. Finally, the trained
model is directly used in a real-robot setup without fine-tuning.
In comparison to prior works [33, 34, 35, 36], our research
makes three significant contributions: (i) The calculation of
the distance between obstacles and the manipulator’s links is
done using a geometry-based method, which improves the
reaching task in the presence of obstacles. (ii) An action
ensembles approach is introduced to enhance the efficiency
of the policy. (iii) An adaptive discount factor for PPO is
designed, allowing the policy to directly participate in the
value function update. Empirical results demonstrate that the
proposed approach outperforms other baseline methods in
different testing scenarios.
II. PRELIMINARY
Our research focuses on developing an effective method
for obstacle avoidance in reaching tasks for manipulators. To
achieve this goal, we aim to make the manipulator safely
interact with the environment multiple times. To design our
training process, we initially chose Gazebo due to its better
compatibility with the Robot Operating System (ROS) com-
pared to Pybullet. However, DRL methods tend to have a long
training time in Gazebo. To mitigate this challenge, we created
a similar environment in Pybullet for initial training, and then
transferred and evaluated the learned model in Gazebo through
a Sim-to-Sim transfer process. Finally, we tested the efficiency
of the model on a real robot using Sim-to-Real transfer. The
simulation environments in Gazebo and Pybullet are depicted
in Fig. 2. In both environments, a UR5e robot equipped with a Robotiq 140 gripper is utilized as the manipulator. During the training
and testing phases, the pose of the target (represented in red)
and obstacles (represented in blue) is randomly set within the
workspace.
A. Proximal Policy Optimization (PPO)
In this work, we use PPO, one of the state-of-the-art
online DRL methods, since it is known for its stability and
effectiveness in various environments. The algorithm balances
exploration and exploitation to find a policy that maximizes the
reward. PPO also offers a trade-off between stability and high
sample efficiency, making it suitable for complex robotic environments with obstacle avoidance. In particular, PPO is a policy gradient method that alternates between
sampling data through environmental interaction and optimiz-
ing a clipped surrogate objective function using stochastic
gradient descent [37]. The clipped surrogate objective function
improves training stability by limiting the size of the policy
change at each step. In PPO, the clipped surrogate objective
function is designed as follows:
L^{CLIP}(\theta_\pi) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta_\pi)\hat{A}_t,\; \mathrm{clip}(r_t(\theta_\pi),\, 1-\epsilon,\, 1+\epsilon)\hat{A}_t \right) \right]    (1)

\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\delta_{T-1}    (2)

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)    (3)

V(s) = \mathbb{E}_{s,a\sim\pi}\left[ G(s) \mid s \right]    (4)

G(s) = \sum_{i=t}^{\infty} \gamma^{i-t} r(s_i)    (5)
where \theta_\pi denotes the parameters of the policy neural network, r_t(\theta_\pi) is the probability ratio, and \hat{A}_t represents the generalized advantage estimator (GAE), which is used to calculate the policy gradient. The reward value at step t is denoted by r_t, and \epsilon is a constant between 0 and 1, which is set to 0.2 in the baseline algorithm. \gamma = 0.99, V(s) refers to the expected return of state s, and G represents the discounted cumulative reward. Likewise, V_{target}(s) is the target value. Additionally, the value loss function is expressed as follows:

L^{V}(\theta_V) = \mathbb{E}_{s,a\sim\pi}\left[ (V(s) - V_{target}(s))^2 \right]    (6)
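As a concrete reference, the snippet below sketches how the clipped surrogate objective (1) and the value loss (6) are typically computed in PyTorch. It assumes that the advantages and target values have already been estimated (e.g., via GAE as in (2)); the function and tensor names are illustrative and not taken from the paper.

```python
import torch

def ppo_losses(new_log_probs, old_log_probs, advantages, values, target_values, eps=0.2):
    """Clipped surrogate policy loss (Eq. 1) and squared-error value loss (Eq. 6)."""
    # Probability ratio r_t(theta_pi) between the updated and the old policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipped surrogate objective; negated because optimizers minimize.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Value-function regression toward the target values V_target(s).
    value_loss = ((values - target_values) ** 2).mean()
    return policy_loss, value_loss
```

In practice, the two losses are combined (often with an entropy bonus) and minimized with stochastic gradient descent, as described above.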
B. Sim-to-Sim and Sim-to-Real Adaptation
As stated earlier, we train a DRL policy in Pybullet first and
then transfer it to Gazebo to reduce training time [38, 39, 40].
The final goal is to evaluate the policy’s performance in a
real-world scenario through a Sim-to-Real adaptation [41, 42].
While previous research in the field of learning navigation and manipulation policies has focused on bridging the Sim-to-Real gap in domain adaptation [18, 43, 44, 45], our work differs in that obstacle avoidance is accomplished in joint space.

Fig. 2. The environment of the 6-DoF manipulator in Pybullet (left) and Gazebo (right): the goal is shown by a red block and obstacles are highlighted by blue spheres.

Our primary objective is to achieve high accuracy
in simulation for real-world applications. The Sim-to-Sim
transfer method is efficient in reducing training time and
evaluating the robustness of the proposed approach against noise and inaccurate robot models before deploying the learned
model on a real-robot platform. Therefore, we consider both
Sim-to-Sim and Sim-to-Real transfers [41, 46, 47] in order to
quickly train and evaluate the proposed model in a variety of
tasks and domains. As illustrated in Fig. 1, we train a deep
policy in the Pybullet environment to learn the mapping from
the task space to the joint space of the UR5e manipulator.
The learned policy is then subjected to a Sim-to-Sim phase in
the Gazebo environment, allowing us to evaluate and test the
policy within a simulated environment prior to deployment in
real-world scenarios. Finally, the policy is directly applied on a real robot without further refinement or fine-tuning.
III. STRATEGY FOR LEARNING
We adopt PPO to accomplish obstacle avoidance with the
mapping from task space to joint space. For reinforcement
learning, one of the important aspects is to devise a good
learning strategy [29, 39, 48], which includes selecting an
appropriate state and action representation [49, 50, 51]. The
strategy is implemented as a deep policy, which is designed
as a multi-layer perceptron (MLP) network with two hidden layers.
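For illustration, a minimal PyTorch sketch of such a two-hidden-layer policy network is given below; the hidden width of 256 units, the tanh activations, and the Gaussian action head are our own assumptions, since these details are not specified in this section.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Two-hidden-layer MLP mapping the state vector to a joint-space action distribution."""

    def __init__(self, state_dim=19, action_dim=6, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, action_dim)              # mean joint-position action
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # state-independent exploration noise

    def forward(self, state):
        h = self.body(state)
        return self.mean(h), self.log_std.exp()
```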
A. State and Action Representations
It is crucial for DRL methods to choose the appropriate
state and action space. Most researchers prefer to represent
both states and actions in task space, which is ineffective for
avoiding collision between links and obstacles. To accomplish
collision avoidance in the whole workspace, we consider the positions of the six joints, the end-effector, and the target as part of the state representation. Furthermore, the errors along the X, Y, and Z axes, and the distances between obstacles and the five links are also included in the state representation. It is worth mentioning that
we do not consider the distance of the obstacles to the base
link. Therefore, the state is represented as a vector s \in \mathbb{R}^{19}.
For the action representation, we use the positions of the six joints, which avoids complex and time-consuming inverse kinematics calculations for mapping from task space to joint space. In the following subsections, we discuss the state and
action spaces in more detail.
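The geometry-based distance calculation itself is not detailed in this section; as a hedged illustration, the sketch below shows one common formulation that models each link as a line segment between consecutive robot frames and each obstacle as a sphere. The function names, the spherical-obstacle assumption, and the frame layout are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def point_to_segment_distance(p, a, b):
    """Shortest distance from point p to the line segment with endpoints a and b."""
    ab = b - a
    # Project p onto the segment axis and clamp the projection to [0, 1].
    t = np.clip(np.dot(p - a, ab) / (np.dot(ab, ab) + 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def link_obstacle_clearances(frame_positions, obstacle_center, obstacle_radius):
    """Clearance between a spherical obstacle and each link, skipping the base link.

    frame_positions: (7, 3) array of Cartesian positions of the base and the six
    joint frames; each link is the segment between two consecutive frames.
    """
    clearances = []
    for a, b in zip(frame_positions[1:-1], frame_positions[2:]):
        d = point_to_segment_distance(obstacle_center, a, b) - obstacle_radius
        clearances.append(max(d, 0.0))
    return np.array(clearances)  # five distances, one per considered link
```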
1) States in Reaching Task without Obstacles: In the case
of the obstacle-free goal-reaching task, we represent the state
as:
s_t = \langle q_t, p_e, p_t, error \rangle    (7)

where q_t = (q_{t1}, \ldots, q_{t6}) is the position of the six joints, p_e = (p_{ex}, p_{ey}, p_{ez}) represents the position of the end-effector, p_t = (p_{tx}, p_{ty}, p_{tz}) refers to the target position, and error = (e, e_x, e_y, e_z) is the error vector containing the absolute distance and the distances along the X, Y, and Z axes, respectively.
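As a small worked example of Eq. (7), the obstacle-free state can be assembled into a flat vector as sketched below; the helper function is hypothetical and stands in for the corresponding simulator queries.

```python
import numpy as np

def build_state(q, p_e, p_t):
    """Assemble the obstacle-free state s_t = <q_t, p_e, p_t, error> of Eq. (7).

    q:   six joint positions (rad), p_e: end-effector position, p_t: target position.
    """
    diff = p_t - p_e
    # error = (e, e_x, e_y, e_z): absolute distance followed by the per-axis errors.
    error = np.concatenate(([np.linalg.norm(diff)], diff))
    return np.concatenate([q, p_e, p_t, error])  # 6 + 3 + 3 + 4 = 16 dimensions
```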