Smooth Trajectory Collision Avoidance through
Deep Reinforcement Learning
Sirui Song, Kirk Saunders, Ye Yue, Jundong Liu
School of Electrical Engineering and Computer Science,
Ohio University, Athens, OH 45701
Corresponding author: Dr. Jundong Liu. Email: liuj1@ohio.edu. This project is supported in part by the Ohio University OURC program.
Abstract—Collision avoidance is a crucial task in vision-guided
autonomous navigation. Solutions based on deep reinforcement
learning (DRL) have become increasingly popular. In this work, we
propose several novel agent state and reward function designs
to tackle two critical issues in DRL-based navigation solutions:
1) smoothness of the trained flight trajectories; and 2) model
generalization to handle unseen environments.
Formulated under a DRL framework, our model relies on a
margin reward and smoothness constraints to ensure UAVs fly
smoothly while greatly reducing the chance of collision. The
proposed smoothness reward minimizes a combination of first-
order and second-order derivatives of flight trajectories, which
can also drive the points to be evenly distributed, leading to
stable flight speed. To enhance the agent’s capability of handling
new unseen environments, two practical setups are proposed to
improve the invariance of both the state and reward function
when the agent is deployed in different scenes. Experiments demonstrate the
effectiveness of our overall design and individual components.
Index Terms—Deep reinforcement learning, collision avoid-
ance, UAV, smoothness, rewards.
I. INTRODUCTION
Autonomous navigation capability is of great importance
for unmanned aerial vehicles (UAVs) to fly in complex en-
vironments where communication might be limited. Collision
avoidance (CA) is among the most crucial components of high-
performance autonomy and thus has been extensively studied.
Generally speaking, the existing CA solutions can be grouped
into two categories: geometry-based and learning-based solu-
tions. Geometry-based solutions are commonly formulated as
a two-step procedure: first detect obstacles and estimate the
geometry surrounding the UAV, then run a path-planning step
to identify a traversable route for an escape maneuver.
Learning-based CA solutions extract patterns from training
data to perceive environments and make maneuver decisions.
Such solutions can be broadly divided into two categories:
supervised learning-based and reinforcement learning-based.
The former performs perception and decision-making simul-
taneously, predicting control policies directly from raw input
images [1]–[5]. Supervised methods are straightforward,
but they normally require a large amount of labeled training
samples, which are often difficult or expensive to obtain.
Reinforcement learning [6], on the other hand, relies on a scalar
reward function to motivate the learning agent and explores
policies through trial and error. Combined with neural networks,
deep reinforcement learning (DRL) has been shown to achieve
superhuman performance on a number of games by fully
exploring raw images [7]–[9]. DRL-based collision avoidance
has also been recently proposed [10]–[13]. In order
to reduce cost and increase effectiveness, such training is often
first carried out within a certain simulation environment.
While remarkable progress has been made in DRL-based
navigation solutions, insufficient attention has been given to
two critical issues: 1) smoothness of the navigation trajec-
tories; and 2) model generalization to handle unseen envi-
ronments. For the former, Kahn et al. [14] proposed an RL-based
solution that seeks a tradeoff between collision uncertainty
and the speed of UAV motion. When collision uncertainty is high,
the robot/UAV is slowed down, and vice versa.
The smoothness of the flight trajectories, however, is not di-
rectly addressed. Hasanzade et al. [15] proposed an RL-based
UAV navigation solution based on a trajectory re-planning
algorithm, where high-order B-splines are used to define and
specify flight trajectories. Due to the local support property of
B-splines, such trajectories can be updated quickly, allowing
small UAVs to navigate aggressively in cluttered environments.
However, new knots need to be inserted over the training
process for the re-planning procedure to be fully realized,
negatively impacting the overall trajectory smoothness.
Model generalization is a critical issue in machine learning,
especially for DRL solutions. Many current DRL works,
however, were evaluated on the same environments as they
were trained on, such as Atari [16], MuJoCo [17] and OpenAI
Gym [18]. For UAV training, there is an additional sim-to-real
layer, which complicates the problem even more. Kong et al.
[19] explored the generalization of various DRL algorithms by
training them in different (but not unseen) environments.
Doukui et al. [20] tackled this issue by mapping exteroceptive
sensors, robot state, and goal information to continuous ve-
locity control inputs, but their approach was only tested on
unseen targets rather than unseen scenes.
In this work, we address the aforementioned issues with
novel designs for agent state and reward functions. To ensure
the smoothness of the learned flight trajectories, we inte-
grate two curve smoothness terms, based on first-order and
second-order derivatives respectively, into the agent reward
functions. To improve the agent's generalization capability,
two practical setups, shallow depth and unit vector towards
the target, are adopted to boost the robustness of the state
and reward function in dealing with new environments. The
proposed designs are trained and tested in simulation scenes
with large geometric obstacles. Experiments demonstrate the
effectiveness of our overall design and individual components.
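To make the smoothness terms mentioned above concrete, the following is a minimal sketch of a finite-difference smoothness penalty over recent trajectory points. It is our own illustration, not the authors' released code; the weights w1 and w2 are hypothetical.

```python
import numpy as np

def smoothness_penalty(points, w1=1.0, w2=1.0):
    """Penalty built from finite-difference approximations of the
    first and second derivatives of a trajectory.

    points: (N, 2) or (N, 3) array of recent UAV positions, oldest first.
    Returns a non-negative scalar; the agent reward would subtract it.
    """
    pts = np.asarray(points, dtype=float)
    if len(pts) < 3:
        return 0.0
    d1 = np.diff(pts, n=1, axis=0)   # velocity-like terms
    d2 = np.diff(pts, n=2, axis=0)   # acceleration-like terms
    # Penalizing squared first differences also pushes the points to be
    # evenly spaced, which corresponds to a stable flight speed.
    return w1 * np.sum(d1 ** 2) + w2 * np.sum(d2 ** 2)
```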
II. METHOD
In this work, a multirotor UAV takes off at a designated
starting point and navigates autonomously towards a destina-
tion. The line segment connecting the start and end points
is regarded as a predefined path, along which certain objects
have been placed as obstacles.
A. Design and environment setup
Our overall design goal is to fly the UAV mostly along
the predefined route while being able to avoid the obstacles.
Such a capability is trained through DRL with the following
considerations. Firstly, to ensure the UAV flies along the
predetermined path, we minimize the deviation of the drone's
trajectory from that path. Secondly, to ensure the drone avoids
collisions while flying smoothly, we set up a variety of rewards,
including rewards for margin, arrival, and smoothness, as well
as a penalty for collisions. In addition, we aim to design a
DRL agent with good generalization capability for handling
unseen environments.
States, actions, and rewards are three basic components of
most DRL algorithms. In this work, the state $s_t$ at time $t$ is
defined to include three components: 1) the depth map of the
current camera view; 2) the current velocity of the UAV; and
3) a unit vector pointing from the UAV's current position to
the target.
Our choices of 1) and 3) are both made with model generalization
in mind. At each time point, a depth image is obtained from
the onboard monocular camera of the UAV. In order to limit
the impact of environment changes and thus improve the
generalization capability of our model, we focus on nearby
objects and ignore those beyond a certain distance. We call
this truncated depth image the shallow depth, in contrast with
the original deep depth. An example pair is shown in Fig. 1.
Fig. 1: An example pair of deep and shallow depth maps. (a) Original deep depth; (b) truncated shallow depth.
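A minimal sketch of the shallow-depth idea is given below, assuming the simulator returns a metric depth image as a NumPy array; the 10 m cut-off is an illustrative value, not one taken from the paper.

```python
import numpy as np

def to_shallow_depth(depth_m, max_range=10.0):
    """Truncate a metric depth map so that everything beyond
    `max_range` meters is treated as background.

    depth_m: (H, W) float array of depths in meters.
    Returns an (H, W) float array normalized to [0, 1].
    """
    shallow = np.clip(depth_m, 0.0, max_range)
    return shallow / max_range
```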
We include the unit vector to the target as part of the agent's
state, which will later also be used in our proposed reward
function. This is in contrast to using the Euclidean distance
from the UAV to the destination. Compared with distances,
unit vectors are scale invariant, and therefore generalize better
to new environments of different sizes. Each action $a_t$ at time $t$
is defined as $(v_{x_t}, v_{y_t})$,
a velocity vector with x-axis and y-axis components. The
proposed reward functions will be explained in the next
subsection.
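The sketch below illustrates the scale-invariant goal encoding and an assembled agent state; the field names and state layout are our own guesses for illustration, not the paper's exact implementation.

```python
import numpy as np

def unit_vector_to_target(position, target, eps=1e-8):
    """Unit vector from the UAV position to the target.

    Unlike the raw Euclidean distance, this quantity does not depend on
    the size of the scene, which is what gives it better generalization.
    """
    diff = np.asarray(target, dtype=float) - np.asarray(position, dtype=float)
    return diff / (np.linalg.norm(diff) + eps)

def build_state(depth_stack, velocity, position, target):
    """Assemble the three state components described above.

    depth_stack: list/array of the last few shallow-depth frames.
    velocity:    current (vx, vy) of the UAV.
    """
    return {
        "depth_stack": np.asarray(depth_stack, dtype=np.float32),
        "velocity": np.asarray(velocity, dtype=np.float32),
        "goal_dir": unit_vector_to_target(position, target).astype(np.float32),
    }
```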
We choose Deep Deterministic Policy Gradient (DDPG)
[21] as the DRL algorithm to train the flight policy of the UAV.
DDPG uses an actor-critic method, in which the critic network
learns the action-value function (Q value) and the actor network
learns the policy, which is updated using gradients from the
critic. The actor network outputs real-valued vectors, which enables
the DDPG model to directly learn actions in a continuous space.
The detailed network structure and state composition of our
DDPG model can be found in Fig. 2.
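For concreteness, a stripped-down PyTorch version of the actor and critic heads is sketched below; the layer sizes and velocity bound are illustrative assumptions, and the depth stack is assumed to have already been encoded into a flat feature vector (see the LSTM sketch further down).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the encoded state to a continuous (vx, vy) action."""
    def __init__(self, state_dim, action_dim=2, max_speed=1.0):
        super().__init__()
        self.max_speed = max_speed
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),  # output in [-1, 1]
        )

    def forward(self, state):
        return self.max_speed * self.net(state)

class Critic(nn.Module):
    """Estimates Q(state, action) for the DDPG update."""
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```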
Furthermore, for each state, we keep historical information
and stack together depth images from several consecutive time
points. This is designed to alleviate blind spot issues in flight
and allow us to monitor the flight trajectory for smoothness
control. To reduce the dimensionality of the depth map, we
use an LSTM network on the depth stack, as shown in Fig. 2,
to capture basic information before feeding it into the actor
and critic networks.
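A hedged sketch of such a depth-stack encoder: each frame is flattened (a small CNN could equally be used), the sequence is passed through an LSTM over the time dimension, and the last hidden state is forwarded to the actor and critic. All sizes here are assumptions, not values read off Fig. 2.

```python
import torch
import torch.nn as nn

class DepthStackEncoder(nn.Module):
    """Reduce a stack of consecutive depth frames to a single feature vector."""
    def __init__(self, frame_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=frame_dim, hidden_size=hidden_dim,
                            batch_first=True)

    def forward(self, depth_stack):
        # depth_stack: (batch, time, H, W); flatten each frame to a vector.
        b, t = depth_stack.shape[:2]
        frames = depth_stack.reshape(b, t, -1)
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]  # (batch, hidden_dim) summary of the stack
```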
B. Reward functions
In this work, the overall reward $r_t$ at time $t$ is designed to
include multiple components, each of which corresponds to a
desired system condition. Note that our design goals include:
1) avoiding collisions, 2) enhancing model generalization, and
3) encouraging smooth flights. The overall $r_t$ is given as
follows:

$$r_t = R_{\text{margin}} + R_{\text{towards}} + R_{\text{smooth}} +
\begin{cases}
R_g & \text{if at destination,} \\
R_c & \text{if collision,} \\
R_f & \text{if normal flight,}
\end{cases}$$
where $R_{\text{margin}}$ and $R_{\text{smooth}}$ denote the rewards that ensure margin
and smoothness, respectively; $R_{\text{towards}}$ is aimed at attracting the
UAV to fly towards the target; $R_g$ is awarded if the UAV
reaches the end point; $R_c$ is a penalty (negative reward) if a
collision happens; and $R_f$ contains a reward for flying forward
and a penalty for any deviation from the predefined route.
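Putting the pieces together, a minimal sketch of the per-step reward under the case structure above; the terminal-event handling and the numeric constants are illustrative assumptions, not values from the paper.

```python
def step_reward(r_margin, r_towards, r_smooth, at_destination, collided,
                r_goal=10.0, r_collision=-10.0, r_flight=0.1):
    """Combine the shaping terms with the terminal/flight reward, following
    the case structure of the equation above. The constants here are
    placeholders, not the values used in the paper."""
    if at_destination:
        event = r_goal        # R_g: reached the end point
    elif collided:
        event = r_collision   # R_c: penalty for collision
    else:
        event = r_flight      # R_f: forward progress / deviation term
    return r_margin + r_towards + r_smooth + event
```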
$R_{\text{margin}}$ is designed to penalize the UAV for getting too close to
the obstacles. Two margin zones, a soft margin and a hard margin,
are set up, as shown in Fig. 3. When the drone flies into
the soft margin zone, it will be pushed back with a moderate
force. If it enters the hard margin zone, the system should
provide a rapidly increasing repulsive force to prevent the
drone from getting closer to the obstacle. Computationally,
this two-margin design is implemented as:
$$R_{\text{margin}} =
\begin{cases}
-C_1 \, (d_{\text{soft}} - d_{\text{obs}})/(d_{\text{soft}} - d_{\text{hard}}) & \text{in the soft margin,} \\
-C_2 / d_{\text{obs}} & \text{in the hard margin,} \\
0 & \text{otherwise,}
\end{cases}$$
where $C_1$ and $C_2$ are positive constants; $d_{\text{obs}}$ represents the
minimum distance from the drone to the nearest obstacle; and $d_{\text{soft}}$
and $d_{\text{hard}}$ denote the distance thresholds that define the soft-margin
and hard-margin zones, respectively.
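A sketch of the two-zone margin reward under the reading above (leading negative signs, since the term is described as a penalty); the constants and zone radii below are placeholders, not the paper's values.

```python
def margin_reward(d_obs, d_soft=3.0, d_hard=1.0, c1=1.0, c2=1.0):
    """Penalty that grows gently inside the soft-margin zone and sharply
    inside the hard-margin zone.

    d_obs: minimum distance from the drone to the nearest obstacle (meters).
    """
    if d_obs <= d_hard:
        # Rapidly increasing repulsion as the drone approaches the obstacle.
        return -c2 / max(d_obs, 1e-6)
    if d_obs <= d_soft:
        # Moderate, linearly increasing push-back between the two boundaries.
        return -c1 * (d_soft - d_obs) / (d_soft - d_hard)
    return 0.0
```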