Smooth Trajectory Collision Avoidance through
Deep Reinforcement Learning
Sirui Song, Kirk Saunders, Ye Yue, Jundong Liu∗
School of Electrical Engineering and Computer Science,
Ohio University, Athens, OH 45701
Abstract—Collision avoidance is a crucial task in vision-guided autonomous navigation. Solutions based on deep reinforcement learning (DRL) have become increasingly popular. In this work, we propose several novel agent state and reward function designs to tackle two critical issues in DRL-based navigation solutions: 1) smoothness of the trained flight trajectories; and 2) model generalization to handle unseen environments.
Formulated under a DRL framework, our model relies on a margin reward and smoothness constraints to ensure that UAVs fly smoothly while greatly reducing the chance of collision. The proposed smoothness reward minimizes a combination of first-order and second-order derivatives of the flight trajectory, which also drives the trajectory points to be evenly distributed, leading to a stable flight speed. To enhance the agent’s capability of handling new, unseen environments, two practical setups are proposed to improve the invariance of both the state and the reward function when the agent is deployed in different scenes. Experiments demonstrate the effectiveness of our overall design and of its individual components.
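As a rough sketch of the smoothness reward described above (our own notation and weights, an assumption rather than the paper’s exact formulation), for consecutive trajectory points $p_{t-1}, p_t, p_{t+1}$ one may define

\[
R_{\mathrm{smooth}} = -\lambda_1 \sum_{t} \lVert p_{t+1} - p_t \rVert^2 \, - \, \lambda_2 \sum_{t} \lVert p_{t+1} - 2p_t + p_{t-1} \rVert^2 ,
\]

where the first-order term discourages unevenly spaced waypoints (encouraging a stable flight speed), the second-order term discourages sharp changes of direction, and the weights $\lambda_1, \lambda_2 \ge 0$ balance the two.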
Index Terms—Deep reinforcement learning, collision avoidance, UAV, smoothness, rewards.
I. INTRODUCTION
Autonomous navigation capability is of great importance
for unmanned aerial vehicles (UAVs) to fly in complex en-
vironments where communication might be limited. Collision
avoidance (CA) is among the most crucial components of high-
performance autonomy and thus has been extensively studied.
Generally speaking, the existing CA solutions can be grouped
into two categories: geometry-based and learning-based solu-
tions. Geometry-based solutions are commonly formulated as a two-step procedure: first, detect obstacles and estimate the geometry surrounding the UAV; then, run a path-planning step to identify a traversable route for an escape maneuver.
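As a rough illustration of this two-step pattern (a generic sketch under our own simplifying assumptions, with hypothetical helper names such as detect_obstacles and plan_escape_heading, not a specific method from the literature cited in this paper), one might detect blocked directions from a depth image and then steer toward the widest open gap:

import numpy as np

def detect_obstacles(depth_image, threshold=3.0):
    # Step 1 (perception): a column is "blocked" if its closest depth
    # reading falls within `threshold` meters of the camera.
    nearest = depth_image.min(axis=0)
    return nearest < threshold

def plan_escape_heading(blocked, fov_deg=90.0):
    # Step 2 (planning): pick the center of the widest unblocked run of
    # columns and convert it to a yaw offset within the field of view.
    best_start, best_len, start, run = 0, 0, 0, 0
    for i, b in enumerate(np.append(blocked, True)):  # sentinel closes the last run
        if not b:
            if run == 0:
                start = i
            run += 1
        else:
            if run > best_len:
                best_start, best_len = start, run
            run = 0
    if best_len == 0:
        return None  # no traversable gap; the caller should stop or turn around
    center = best_start + best_len // 2
    return (center / len(blocked) - 0.5) * fov_deg

# Toy usage with a synthetic 480x640 depth image (values in meters).
depth = np.full((480, 640), 10.0)
depth[:, 200:320] = 1.5  # an obstacle slightly left of center
yaw = plan_escape_heading(detect_obstacles(depth))
print(f"suggested yaw offset: {yaw:.1f} degrees")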
Learning-based CA solutions extract patterns from training
data to perceive environments and make maneuver decisions.
Such solutions can be broadly divided into two categories:
supervised learning-based and reinforcement learning-based.
The former performs perception and decision-making simul-
taneously, predicting control policies directly from raw input
images [1]–[5]. Such supervised methods are straightforward, but they normally require a large number of labeled training samples, which are often difficult or expensive to obtain.
Reinforcement learning [6], on the other hand, relies on a scalar reward function to motivate the learning agent and explores policies through trial and error.
∗Corresponding author: Dr. Jundong Liu. Email: liuj1@ohio.edu. This project is supported in part by the Ohio University OURC program.
Combined with neural networks,
deep reinforcement learning (DRL) has been shown to achieve
superhuman performance on a number of games by fully
exploring raw images [7]–[9]. DRL-based collision avoidance solutions have also been proposed recently [10]–[13]. To reduce cost and increase effectiveness, such training is often first carried out in a simulation environment.
While remarkable progress has been made in DRL-based
navigation solutions, insufficient attention has been given to
two critical issues: 1) smoothness of the navigation trajec-
tories; and 2) model generalization to handle unseen envi-
ronments. For the former, Kahn et al. [14] proposed an RL-based solution that seeks a tradeoff between collision uncertainty and the speed of UAV motion: when collision uncertainty is high, the robot/UAV is commanded to move more slowly, and vice versa.
The smoothness of the flight trajectories, however, is not di-
rectly addressed. Hasanzade et al. [15] proposed an RL-based UAV navigation solution built on a trajectory re-planning algorithm, in which high-order B-splines are used to represent flight trajectories. Thanks to the local support property of B-splines, such trajectories can be updated quickly, allowing small UAVs to navigate cluttered environments aggressively. However, new knots need to be inserted during training for the re-planning procedure to be fully realized, which negatively impacts the overall trajectory smoothness.
Model generalization is a critical issue in machine learning,
especially for DRL solutions. Many current DRL works,
however, were evaluated on the same environments as they
were trained on, such as Atari [16], MuJoCo [17] and OpenAI
Gym [18]. For UAV training, there is an additional sim-to-real
layer, which complicates the problem even more. Kong et al. [19] explored the generalization of various DRL algorithms by training them in different (but not unseen) environments. Doukui et al. [20] tackled this issue by mapping exteroceptive sensor readings, the robot state, and goal information to continuous velocity control inputs, but their approach was tested only on unseen targets rather than unseen scenes.
In this work, we address the aforementioned issues with
novel designs for agent state and reward functions. To ensure
the smoothness of the learned flight trajectories, we inte-
grate two curve smoothness terms, based on first-order and
second-order derivatives respectively, into the agent reward