reinforcement learning algorithm, this poses the risk that the
agent's behavior overfits to this single scenario. Consequently,
the policy would perform well only on the given scenario.
Therefore, existing works on path planning using reinforcement
learning have modeled the target behavior as random. The
work in [4] optimizes the path of a range-only or bearing-only
sensor and models the targets as either stationary or following
a 2-D Brownian motion model. The work in [5] considers
a target in an urban context, which moves randomly on a
partially occluded road grid. The observer learns a behavior to
always keep the target in its field of view. In [6], the authors
train a policy to localize stationary targets, which are placed
randomly based on a prior. We note that some of the works
based on online optimization also evaluate their policies using
random target behavior [7]. Commonly, this random behavior
does not follow a specific intent of the target, but instead is
based on a fixed probability distribution over the action space.
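Such an intent-free target can be sketched as follows: a 2-D Brownian-motion-like random walk whose step at each time instant is drawn from a fixed distribution, independent of the observer. This is a minimal illustration of the class of random target models discussed above; the function name and all parameters are our own choices, not taken from the cited works.

```python
import random

def random_target_trajectory(steps=100, step_std=1.0, seed=0):
    """Intent-free target model: a 2-D random walk.

    Each step is drawn from a fixed zero-mean Gaussian, independent
    of the observer's state (illustrative parameters only).
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    trajectory = [(x, y)]
    for _ in range(steps):
        x += rng.gauss(0.0, step_std)  # action drawn from a fixed distribution,
        y += rng.gauss(0.0, step_std)  # regardless of where the observer is
        trajectory.append((x, y))
    return trajectory

traj = random_target_trajectory()
```

Because the distribution never depends on the observer, a policy trained against such a target never faces an adversarial reaction, which motivates the game-theoretic view taken below.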
However, the assumption that a target moves randomly
without intent is often not met in practice. The targets typically
have some intent in their behavior. An alternative to the
random model would be to train the policy on real data of
target behavior. However, this would require a large amount
of data to avoid overfitting to specific trajectories. In
addition, targets might behave differently once the observer
policy changes. Lastly, real-world data is often difficult and
costly to acquire.
Alternatively, a game-theoretic approach can be taken. In-
stead of modeling specific target trajectories, we assume that
the target has the intent to maximally degrade the tracking
performance. When training an observer policy under this
worst-case assumption, we can expect to achieve better track-
ing performance for other target behaviors. Such a worst-case
target is known as an evading target and has been considered
in several places in the literature for path planning based
on online optimization. The work in [8] includes, among
others, an evasive target model. The target knows the position
of the tracking UAVs and always moves away from the
closest observer. A more elaborate avoidance model is used
in [9], where the ground-based target optimizes its trajectory
to hide from an observer between obstacles. The observer
takes this target behavior into account when optimizing its
own trajectory.
Still, optimizing with respect to a hard-coded evasive target
model can lead to overfitting. In this work, we therefore take
the approach of training the policy of the evading target in
parallel to the policy of the tracking UAV. The target is another
agent whose goal is to degrade the tracking performance
of the observer as much as possible. If the observer policy
overspecializes to the current policy of the target, the
target can, ideally, learn that a change in its behavior leads
to less accurate tracking by the observer and, consequently,
adapt its behavior.
Extending the field of RL to multiple agents is called multi-
agent reinforcement learning. In this setting, the policy learned
by each agent not only depends on the environment, but also
on the learned policies of the other agents. As the other agents
might have different goals, the policy of each agent needs to
take the policies of the other agents into account. When these
other agents improve in a competitive setting, it becomes
successively harder for an agent to achieve its own goals.
This feedback can be interpreted as a form
of curriculum learning [10]. Co-training of an agent and its
antagonist has led to several noteworthy breakthroughs in
recent years, especially in the form of self-play, where the
policy plays against a potentially different version of itself.
Exemplary applications include learning to play Go [11] or StarCraft [12].
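The co-adaptation dynamic described above can be illustrated with a deliberately small toy example (the game and all names are our own illustration, not part of the cited works): two identical players repeatedly best-respond to each other's empirical action statistics in rock-paper-scissors, a form of fictitious self-play. As each player adapts, exploiting a fixed strategy becomes harder, and the empirical play frequencies approach the uniform mixed equilibrium.

```python
from collections import Counter

# Toy zero-sum game standing in for the observer/target interaction.
ACTIONS = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # a beats BEATS[a]

def best_response(opponent_counts):
    """Best-respond to the opponent's empirically most frequent action."""
    most_common = max(ACTIONS, key=lambda a: opponent_counts[a])
    return next(a for a in ACTIONS if BEATS[a] == most_common)

def fictitious_self_play(rounds=3000):
    # counts[i] holds the empirical action counts of player i
    # (initialized to 1 so the first best response is well defined).
    counts = [Counter({a: 1 for a in ACTIONS}), Counter({a: 1 for a in ACTIONS})]
    for _ in range(rounds):
        a0 = best_response(counts[1])  # player 0 adapts to player 1's statistics
        a1 = best_response(counts[0])  # player 1 adapts to player 0's statistics
        counts[0][a0] += 1
        counts[1][a1] += 1
    total = sum(counts[0].values())
    return {a: counts[0][a] / total for a in ACTIONS}

freqs = fictitious_self_play()  # empirical frequencies near 1/3 each
```

In the setting of this paper, the observer and the evading target play the roles of the two players: each agent's improvement raises the difficulty for the other, producing the implicit curriculum mentioned above.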
In this paper, we formulate a setting related to pursuit-
evasion problems. In these tasks, a single pursuer or a group
needs to catch one or multiple evading targets by reaching
their position. Several solutions to this problem are based on
explicitly modeling the agent behavior [13]–[15]. Recently,
learning based solutions have been investigated [16], [17].
Training an additional policy for the evading targets can lead
to a complex co-evolution of strategies [18]. While addressing
a similar application, the problem studied in this paper differs
from the pursuit-evasion problems mentioned above.
In the traditional pursuit-evasion setting, pursuers are required
to reach the position of the target. In contrast, this paper
addresses a sensor management problem, where a pursuer
needs to achieve optimal measurement geometry towards the
evading target. The goal is not to reach the target position, but
instead to optimally localize it.
In this paper, we do not consider the actual transfer from the
simulation to a real system. Such a transfer adds further complexities
beyond the behavior of the target, as the sensor model and
the movement model of all platforms must correspond to the
real system. Such a sim-to-real transfer is actively researched
in the reinforcement learning and robotics communities, using
techniques such as domain randomization, with promising results
[19], [20].
In this paper, we apply multi-agent reinforcement learning
to the problem of tracking an evading target. In Section II
we describe the tracking approach, the sensor management
problem and the training method. In Section III we show
simulative results and explanations for the trained policies.
Finally, Section IV concludes the paper.
II. METHOD
A. Multi-Agent Reinforcement Learning
Due to its direct relationship to the trained agent behavior,
the simulation environment has to be carefully designed and
parameterized. In reinforcement learning, the environment is
commonly modeled as a Markov decision process (MDP) and
can be described as a tuple (S, A, P, R, γ). More precisely,
it consists of the state space S, the action space A, the state
transition probabilities P : S × A × S → [0, 1], the reward function
R : S × A → ℝ, and the discount factor γ ∈ [0, 1]. Typically,
agents cannot observe the environment state directly, but only
through a projection Ω : S → O into an observation space O.
Altogether, this yields a Partially Observable Markov Decision
Process (POMDP), defined as (S, A, P, R, γ, O, Ω).
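To make the notation concrete, the tuple (S, A, P, R, γ, O, Ω) can be mirrored in code. The following is a minimal sketch with a hypothetical two-state instance; the states, actions, probabilities, and rewards are illustrative only and not the environment used in this paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = str
Action = str
Obs = str

@dataclass
class POMDP:
    """Container mirroring the tuple (S, A, P, R, gamma, O, Omega)."""
    states: Tuple[State, ...]                              # S
    actions: Tuple[Action, ...]                            # A
    P: Dict[Tuple[State, Action], Dict[State, float]]      # P(s' | s, a)
    R: Callable[[State, Action], float]                    # R : S x A -> R
    gamma: float                                           # discount factor
    observations: Tuple[Obs, ...]                          # O
    Omega: Callable[[State], Obs]                          # projection S -> O

# Hypothetical two-state toy instance: the agent tries to stay "near"
# a target but only observes a coarse distance bucket.
toy = POMDP(
    states=("near", "far"),
    actions=("approach", "wait"),
    P={
        ("near", "approach"): {"near": 0.9, "far": 0.1},
        ("near", "wait"):     {"near": 0.5, "far": 0.5},
        ("far", "approach"):  {"near": 0.6, "far": 0.4},
        ("far", "wait"):      {"near": 0.1, "far": 0.9},
    },
    R=lambda s, a: 1.0 if s == "near" else 0.0,
    gamma=0.95,
    observations=("close", "distant"),
    Omega=lambda s: "close" if s == "near" else "distant",
)
```

The projection Ω discards information (here, the exact distance), which is exactly what makes the decision process partially observable rather than a plain MDP.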