Smooth Trajectory Collision Avoidance through
Deep Reinforcement Learning
Sirui Song, Kirk Saunders, Ye Yue, Jundong Liu
School of Electrical Engineering and Computer Science,
Ohio University, Athens, OH 45701
Corresponding author: Dr. Jundong Liu. Email: liuj1@ohio.edu. This project is supported in part by the Ohio University OURC program.
Abstract—Collision avoidance is a crucial task in vision-guided
autonomous navigation. Solutions based on deep reinforcement
learning (DRL) have become increasingly popular. In this work, we
propose several novel agent state and reward function designs
to tackle two critical issues in DRL-based navigation solutions:
1) smoothness of the trained flight trajectories; and 2) model
generalization to handle unseen environments.
Formulated under a DRL framework, our model relies on a
margin reward and smoothness constraints to ensure UAVs fly
smoothly while greatly reducing the chance of collision. The
proposed smoothness reward minimizes a combination of first-
order and second-order derivatives of flight trajectories, which
can also drive the points to be evenly distributed, leading to
stable flight speed. To enhance the agent’s capability of handling
new unseen environments, two practical setups are proposed to
improve the invariance of both the state and reward function
when the agent is deployed in different scenes. Experiments demonstrate the
effectiveness of our overall design and individual components.
Index Terms—Deep reinforcement learning, collision avoid-
ance, UAV, smoothness, rewards.
I. INTRODUCTION
Autonomous navigation capability is of great importance
for unmanned aerial vehicles (UAVs) to fly in complex en-
vironments where communication might be limited. Collision
avoidance (CA) is among the most crucial components of high-
performance autonomy and thus has been extensively studied.
Generally speaking, the existing CA solutions can be grouped
into two categories: geometry-based and learning-based solu-
tions. Geometry-based solutions are commonly formulated as
a two-step procedure: first detect obstacles and estimate the
geometry surrounding the UAV, then run a path-planning step
to identify a traversable route for an escape maneuver.
Learning-based CA solutions extract patterns from training
data to perceive environments and make maneuver decisions.
Such solutions can be broadly divided into two categories:
supervised learning-based and reinforcement learning-based.
The former performs perception and decision-making simul-
taneously, predicting control policies directly from raw input
images [1]–[5]. Supervised methods are straightforward,
but they normally require a large amount of labeled training
samples, which are often difficult or expensive to obtain.
Reinforcement learning [6], on the other hand, relies on a scalar
reward function to motivate the learning agent and explores
policies through trial and error. Combined with neural networks,
deep reinforcement learning (DRL) has been shown to achieve
superhuman performance on a number of games by fully
exploring raw images [7]–[9]. DRL-based collision avoidance
has also been recently proposed [10]–[13]. In order
to reduce cost and increase effectiveness, such training is often
first carried out within a certain simulation environment.
While remarkable progress has been made in DRL-based
navigation solutions, insufficient attention has been given to
two critical issues: 1) smoothness of the navigation trajec-
tories; and 2) model generalization to handle unseen envi-
ronments. For the former, Kahn et al. [14] proposed an RL-based
solution that seeks a tradeoff between collision uncertainty
and the speed of UAV motion. When collision uncertainty is high,
the robot/UAV is slowed down, and vice versa.
The smoothness of the flight trajectories, however, is not di-
rectly addressed. Hasanzade et al. [15] proposed an RL-based
UAV navigation solution based on a trajectory re-planning
algorithm, where high-order B-splines are used to define and
specify flight trajectories. Due to the local support property of
B-splines, such trajectories can be updated quickly, allowing
small UAVs to navigate aggressively in cluttered environments.
However, new knots need to be inserted over the training
process for the re-planning procedure to be fully realized,
negatively impacting the overall trajectory smoothness.
Model generalization is a critical issue in machine learning,
especially for DRL solutions. Many current DRL works,
however, were evaluated on the same environments as they
were trained on, such as Atari [16], MuJoCo [17] and OpenAI
Gym [18]. For UAV training, there is an additional sim-to-real
layer, which complicates the problem even more. Kong et al.
[19] explored the generalization of various DRL algorithms by
training them in different (but not unseen) environments.
Doukui et al. [20] tackled this issue by mapping exteroceptive
sensors, robot state, and goal information to continuous ve-
locity control inputs, but their approach was only tested on
unseen targets rather than unseen scenes.
In this work, we address the aforementioned issues with
novel designs for agent state and reward functions. To ensure
the smoothness of the learned flight trajectories, we inte-
grate two curve smoothness terms, based on first-order and
second-order derivatives respectively, into the agent reward
functions. To improve the agent's generalization capability,
two practical setups, shallow depth and unit vector towards
the target, are adopted to boost the robustness of the state
and reward function in dealing with new environments. The
proposed designs are trained and tested in simulation scenes
with large geometric obstacles. Experiments demonstrate the
effectiveness of our overall design and individual components.
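To make the smoothness terms mentioned above concrete, the following is a minimal sketch of a finite-difference smoothness penalty over recent trajectory points. It is our own illustration, not the authors' released code; the weights w1 and w2 are hypothetical.

```python
import numpy as np

def smoothness_penalty(points, w1=1.0, w2=1.0):
    """Penalty built from finite-difference approximations of the
    first and second derivatives of a trajectory.

    points: (N, 2) or (N, 3) array of recent UAV positions, oldest first.
    Returns a non-negative scalar; the agent reward would subtract it.
    """
    pts = np.asarray(points, dtype=float)
    if len(pts) < 3:
        return 0.0
    d1 = np.diff(pts, n=1, axis=0)   # velocity-like terms
    d2 = np.diff(pts, n=2, axis=0)   # acceleration-like terms
    # Penalizing squared first differences also pushes the points to be
    # evenly spaced, which corresponds to a stable flight speed.
    return w1 * np.sum(d1 ** 2) + w2 * np.sum(d2 ** 2)
```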
II. METHOD
In this work, a multirotor UAV takes off at a designated
starting point and navigates autonomously towards a destina-
tion. The line segment connecting the start and end points
is regarded as a predefined path, along which certain objects
have been placed as obstacles.
A. Design and environment setup
Our overall design goal is to fly the UAV mostly along
the predefined route while being able to avoid the obstacles.
Such a capability is trained through DRL with the following
considerations. Firstly, to ensure the UAV flies along the
predetermined path, we minimize the deviation of the drone's
trajectory from that path. Secondly, to ensure the drone avoids
collisions while flying smoothly, we set up a variety of rewards,
including rewards for margin, arrival, and smoothness, as well
as a penalty for collisions. In addition, we aim to design a
DRL agent with good generalization capability for handling
unseen environments.
States, actions, and rewards are three basic components of
most DRL algorithms. In this work, the state $s_t$ at time $t$ is
defined to include three components: 1) the depth map of the
current camera view; 2) the current velocity of the UAV; and
3) a unit vector pointing from the UAV's current position to
the target.
Our choices of 1) and 3) are both made with model generalization
in mind. At each time point, a depth image is obtained from
the onboard monocular camera of the UAV. In order to limit
the impact of environment changes and thus improve the
generalization capability of our model, we focus on nearby
objects and ignore those beyond a certain distance. We call
this truncated depth image the shallow depth, in contrast with
the original deep depth. An example pair is shown in Fig. 1.
Fig. 1: An example pair of deep and shallow depth maps. (a) Original deep depth; (b) truncated shallow depth.
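A minimal sketch of the shallow-depth idea is given below, assuming the simulator returns a metric depth image as a NumPy array; the 10 m cut-off is an illustrative value, not one taken from the paper.

```python
import numpy as np

def to_shallow_depth(depth_m, max_range=10.0):
    """Truncate a metric depth map so that everything beyond
    `max_range` meters is treated as background.

    depth_m: (H, W) float array of depths in meters.
    Returns an (H, W) float array normalized to [0, 1].
    """
    shallow = np.clip(depth_m, 0.0, max_range)
    return shallow / max_range
```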
We include the unit vector to the target as part of the agent's
state, which will later also be used in our proposed reward
function. This is in contrast to using the Euclidean distance
from the UAV to the destination. Compared with distances,
unit vectors are scale invariant, and therefore generalize better
to new environments of different sizes. Each action $a_t$ at time $t$
is defined as $(v_{x_t}, v_{y_t})$,
a velocity vector with x-axis and y-axis components. The
proposed reward functions will be explained in the next
subsection.
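The sketch below illustrates the scale-invariant goal encoding and an assembled agent state; the field names and state layout are our own guesses for illustration, not the paper's exact implementation.

```python
import numpy as np

def unit_vector_to_target(position, target, eps=1e-8):
    """Unit vector from the UAV position to the target.

    Unlike the raw Euclidean distance, this quantity does not depend on
    the size of the scene, which is what gives it better generalization.
    """
    diff = np.asarray(target, dtype=float) - np.asarray(position, dtype=float)
    return diff / (np.linalg.norm(diff) + eps)

def build_state(depth_stack, velocity, position, target):
    """Assemble the three state components described above.

    depth_stack: list/array of the last few shallow-depth frames.
    velocity:    current (vx, vy) of the UAV.
    """
    return {
        "depth_stack": np.asarray(depth_stack, dtype=np.float32),
        "velocity": np.asarray(velocity, dtype=np.float32),
        "goal_dir": unit_vector_to_target(position, target).astype(np.float32),
    }
```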
We choose Deep Deterministic Policy Gradient (DDPG)
[21] as the DRL algorithm to train the flight policy of the UAV.
DDPG uses an actor-critic method, in which the critic network
learns the action-value function (Q value) and the actor network
learns the policy, which is updated using gradients from the
critic. The actor network outputs real-valued vectors, which enables
the DDPG model to directly learn actions in a continuous space.
The detailed network structure and state composition of our
DDPG model can be found in Fig. 2.
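For concreteness, a stripped-down PyTorch version of the actor and critic heads is sketched below; the layer sizes and velocity bound are illustrative assumptions, and the depth stack is assumed to have already been encoded into a flat feature vector (see the LSTM sketch further down).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the encoded state to a continuous (vx, vy) action."""
    def __init__(self, state_dim, action_dim=2, max_speed=1.0):
        super().__init__()
        self.max_speed = max_speed
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),  # output in [-1, 1]
        )

    def forward(self, state):
        return self.max_speed * self.net(state)

class Critic(nn.Module):
    """Estimates Q(state, action) for the DDPG update."""
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```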
Furthermore, for each state, we keep historical information
and stack together depth images from several consecutive time
points. This is designed to alleviate blind spot issues in flight
and allow us to monitor the flight trajectory for smoothness
control. To reduce the dimensionality of the depth map, we
use an LSTM network on the depth stack, as shown in Fig. 2,
to capture basic information before feeding it into the actor
and critic networks.
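A hedged sketch of such a depth-stack encoder: each frame is flattened (a small CNN could equally be used), the sequence is passed through an LSTM over the time dimension, and the last hidden state is forwarded to the actor and critic. All sizes here are assumptions, not values read off Fig. 2.

```python
import torch
import torch.nn as nn

class DepthStackEncoder(nn.Module):
    """Reduce a stack of consecutive depth frames to a single feature vector."""
    def __init__(self, frame_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=frame_dim, hidden_size=hidden_dim,
                            batch_first=True)

    def forward(self, depth_stack):
        # depth_stack: (batch, time, H, W); flatten each frame to a vector.
        b, t = depth_stack.shape[:2]
        frames = depth_stack.reshape(b, t, -1)
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]  # (batch, hidden_dim) summary of the stack
```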
B. Reward functions
In this work, the overall reward $r_t$ at time $t$ is designed to
include multiple components, each of which corresponds to a
desired system condition. Note that our design goals include:
1) avoiding collisions, 2) enhancing model generalization, and
3) encouraging smooth flights. The overall $r_t$ is given as
follows:

$$r_t = R_{\text{margin}} + R_{\text{towards}} + R_{\text{smooth}} +
\begin{cases}
R_g & \text{if at destination,} \\
R_c & \text{if collision,} \\
R_f & \text{if normal flight,}
\end{cases}$$
where $R_{\text{margin}}$ and $R_{\text{smooth}}$ denote the rewards that ensure margin
and smoothness, respectively; $R_{\text{towards}}$ is aimed at attracting the
UAV to fly towards the target; $R_g$ is awarded if the UAV
reaches the end point; $R_c$ is a penalty (negative reward) if a
collision happens; and $R_f$ contains a reward for flying forward
and a penalty for any deviation from the predefined route.
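Putting the pieces together, a minimal sketch of the per-step reward under the case structure above; the terminal-event handling and the numeric constants are illustrative assumptions, not values from the paper.

```python
def step_reward(r_margin, r_towards, r_smooth, at_destination, collided,
                r_goal=10.0, r_collision=-10.0, r_flight=0.1):
    """Combine the shaping terms with the terminal/flight reward, following
    the case structure of the equation above. The constants here are
    placeholders, not the values used in the paper."""
    if at_destination:
        event = r_goal        # R_g: reached the end point
    elif collided:
        event = r_collision   # R_c: penalty for collision
    else:
        event = r_flight      # R_f: forward progress / deviation term
    return r_margin + r_towards + r_smooth + event
```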
$R_{\text{margin}}$ is designed to penalize the UAV for getting too close to
the obstacles. Two margin zones, a soft margin and a hard margin,
are set up, as shown in Fig. 3. When the drone flies into
the soft margin zone, it will be pushed back with a moderate
force. If it enters the hard margin zone, the system should
provide a rapidly increasing repulsive force to prevent the
drone from getting closer to the obstacle. Computationally,
this two-margin design is implemented as:
$$R_{\text{margin}} =
\begin{cases}
-C_1 \, (d_{\text{soft}} - d_{\text{obs}})/(d_{\text{soft}} - d_{\text{hard}}) & \text{in the soft margin,} \\
-C_2 / d_{\text{obs}} & \text{in the hard margin,} \\
0 & \text{otherwise,}
\end{cases}$$
where $C_1$ and $C_2$ are positive constants; $d_{\text{obs}}$ represents the
minimum distance from the drone to the nearest obstacle; and $d_{\text{soft}}$
and $d_{\text{hard}}$ denote the distance thresholds that define the soft-margin
and hard-margin zones, respectively.
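A sketch of the two-zone margin reward under the reading above (leading negative signs, since the term is described as a penalty); the constants and zone radii below are placeholders, not the paper's values.

```python
def margin_reward(d_obs, d_soft=3.0, d_hard=1.0, c1=1.0, c2=1.0):
    """Penalty that grows gently inside the soft-margin zone and sharply
    inside the hard-margin zone.

    d_obs: minimum distance from the drone to the nearest obstacle (meters).
    """
    if d_obs <= d_hard:
        # Rapidly increasing repulsion as the drone approaches the obstacle.
        return -c2 / max(d_obs, 1e-6)
    if d_obs <= d_soft:
        # Moderate, linearly increasing push-back between the two boundaries.
        return -c1 * (d_soft - d_obs) / (d_soft - d_hard)
    return 0.0
```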