
In addition, most methods rely on solving inverse kinematics or dynamics equations to map from task space to joint space for collision avoidance, which requires a highly accurate model of the robot's dynamics and does not generalize easily to diverse tasks and different robots [14, 15].
In recent years, deep reinforcement learning (RL) has emerged as a promising solution to this challenge by providing a data-driven approach to motion planning [16]. RL algorithms learn from interactions with the environment, enabling robots to perform complex goal-reaching tasks while avoiding obstacles in real time. RL-based methods for robotic manipulation have gained popularity and are increasingly used as an alternative to traditional analytical control systems [17, 18, 19, 20, 21, 22], as they have shown great potential for improving the accuracy and efficiency of goal-reaching tasks. For instance,
Adel et al. [23] proposed a reinforcement learning framework
that combines nonlinear model predictive control with obstacle
avoidance. This approach was evaluated on a 6-DoF robot
manipulator and demonstrated its ability to successfully avoid
collisions with static obstacles. However, it struggles to handle obstacles in dynamic environments.
Research has shown that the presence of moving obstacles can
greatly increase the difficulty of motion planning tasks [24].
Sehgal et al. proposed a method based on deep deterministic policy gradient (DDPG) and hindsight experience replay (HER) that uses a genetic algorithm (GA) to fine-tune parameter values. They experimented on six robotic manipulation tasks and achieved better results than the baselines [25].
Franceschetti et al. presented an extensive comparison of trust region policy optimization (TRPO) and deep Q-network with normalized advantage functions (DQN-NAF) against other state-of-the-art algorithms, namely DDPG and vanilla policy gradient (VPG) [26]. Unlike our work, these studies concentrate only on reaching a single target position. For multi-target trajectory planning, Wang et al. introduced action ensembles based on the Poisson distribution (AEP) into PPO; their method can be easily extended so that the end-effector tracks a specific trajectory [27]. However, while the workspace they consider is sufficient for space robots, it is insufficient for industrial robots, and their approach did not produce favorable outcomes when applied to a larger workspace after training. Thus, the algorithm requires further development. In another work, Kumar et al. proposed a simple, versatile joint-level controller based on PPO. Experiments showed that the method achieves an error comparable to traditional methods while greatly simplifying the process by automatically handling redundancy, joint limits, and acceleration or deceleration profiles [28]. Nevertheless, the output of the neural network is the velocity of the end-effector.
Additionally, the majority of DRL-based research performs learning in task space rather than joint space [29, 30, 31], which tends to produce weak results for reaching tasks [32, 29]. Furthermore, such approaches still require solving inverse kinematics and cannot accomplish reaching tasks when obstacles are close to the manipulator's links.
In this paper, we propose a novel model-free deep reinforcement learning approach, called Improved PPO (IPPO), to tackle multi-target reaching while avoiding obstacles in dynamic environments. An overview of the method is depicted in Fig. 1. In particular, we train a deep policy that maps from task space to joint space for a 6-DoF manipulator. To improve the effectiveness of the model's output on reaching tasks, an action-ensembles method is introduced, and the policy is designed to participate directly in the value-function update of PPO.
Additionally, since training such a task on a real robot is time-consuming and laborious, we develop a simulation environment in Gazebo to train the model, as Gazebo yields a smaller Sim-to-Real gap than other simulators. However, because training robots in Gazebo is computationally expensive and requires a long training time, we propose a Sim-to-Sim method that significantly reduces the training time. Finally, the trained model is deployed directly on a real-robot setup without fine-tuning.
In comparison to prior works [33, 34, 35, 36], our research makes three significant contributions: (i) the distance between obstacles and the manipulator's links is computed with a geometry-based method, which improves the reaching task in the presence of obstacles; (ii) an action-ensembles approach is introduced to enhance the efficiency of the policy; and (iii) an adaptive discount factor for PPO is designed, allowing the policy to participate directly in the value-function update. Empirical results demonstrate that the proposed approach outperforms other baseline methods in different testing scenarios.
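As an illustration of contribution (i), the following minimal sketch shows one common way to compute such a geometry-based clearance when each link is approximated by a line segment (a capsule) and each obstacle by a sphere. The function names, radii, and example coordinates are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def point_to_segment_distance(p, a, b):
    """Shortest distance from point p to the segment [a, b]."""
    ab = b - a
    # Project p onto the segment axis and clamp to the endpoints.
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    closest = a + t * ab
    return np.linalg.norm(p - closest)

def link_obstacle_clearance(joint_a, joint_b, obs_center, obs_radius, link_radius=0.05):
    """Clearance between a link (capsule of radius link_radius) and a spherical obstacle."""
    d = point_to_segment_distance(obs_center, joint_a, joint_b)
    return d - obs_radius - link_radius

# Example: an obstacle lying beside the upper-arm link.
shoulder = np.array([0.0, 0.0, 0.3])
elbow = np.array([0.0, 0.0, 0.8])
obstacle = np.array([0.2, 0.0, 0.55])
print(link_obstacle_clearance(shoulder, elbow, obstacle, obs_radius=0.05))  # about 0.10 m
```

A clearance of this form can be evaluated per link and fed into the reward or used as a collision threshold; negative values indicate penetration of the safety envelope.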
II. PRELIMINARIES
Our research focuses on developing an effective method for obstacle avoidance in reaching tasks for manipulators. To achieve this goal, the manipulator must safely interact with the environment over many episodes. To design our training process, we initially chose Gazebo due to its better compatibility with the Robot Operating System (ROS) compared to PyBullet. However, DRL methods tend to require long training times in Gazebo. To mitigate this challenge, we created a similar environment in PyBullet for initial training, and then transferred and evaluated the learned model in Gazebo through a Sim-to-Sim transfer process. Finally, we tested the efficiency of the model on a real robot via Sim-to-Real transfer. The simulation environments in Gazebo and PyBullet are depicted in Fig. 2. In both environments, a UR5e robot equipped with a Robotiq 140 gripper is used as the manipulator. During the training and testing phases, the poses of the target (shown in red) and the obstacles (shown in blue) are randomly set within the workspace.
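As a rough illustration of this setup, the sketch below re-samples random target and obstacle poses in PyBullet at the start of each episode. The workspace bounds, the helper name randomize_episode, and the commented-out robot URDF path are assumptions for illustration, not the paper's actual assets or code.

```python
import numpy as np
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)  # headless physics server; use p.GUI to visualize
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.loadURDF("plane.urdf")
# Hypothetical path: the actual UR5e + Robotiq 140 description depends on the local setup.
# robot = p.loadURDF("ur5e_robotiq140.urdf", basePosition=[0, 0, 0], useFixedBase=True)
target = p.loadURDF("sphere_small.urdf", basePosition=[0.4, 0.0, 0.3])
obstacle = p.loadURDF("cube_small.urdf", basePosition=[0.3, 0.2, 0.3])

# Assumed workspace bounds (metres) in the robot base frame.
WS_LOW = np.array([0.25, -0.35, 0.10])
WS_HIGH = np.array([0.60, 0.35, 0.50])

def randomize_episode():
    """Re-sample random target and obstacle poses at the start of each episode."""
    target_pos = np.random.uniform(WS_LOW, WS_HIGH)
    obstacle_pos = np.random.uniform(WS_LOW, WS_HIGH)
    p.resetBasePositionAndOrientation(target, target_pos, [0, 0, 0, 1])
    p.resetBasePositionAndOrientation(obstacle, obstacle_pos, [0, 0, 0, 1])
    return target_pos, obstacle_pos

print(randomize_episode())
```

Because PyBullet and Gazebo expose the same scene in this way, the policy trained against the PyBullet environment can later be evaluated in Gazebo with the same observation and action conventions, which is what the Sim-to-Sim transfer step relies on.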
A. Proximal Policy Optimization (PPO)
In this work, we use PPO, one of the state-of-the-art on-policy DRL methods, since it is known for its stability and effectiveness in various environments. The algorithm balances exploration and exploitation to find a policy that maximizes the reward. PPO also offers a trade-off between stability and high sample efficiency, making it suitable for complex robotic environments with obstacle avoidance. In particular, PPO is