Co-Training an Observer and an Evading Target
André Brandenburger, Folker Hoffmann, Alexander Charlish
Fraunhofer FKIE
{andre.brandenburger,folker.hoffmann,alexander.charlish}@fkie.fraunhofer.de
Abstract—Reinforcement learning (RL) is already widely applied to applications such as robotics, but it is only sparsely used in sensor management. In this paper, we apply the popular Proximal Policy Optimization (PPO) approach to a multi-agent UAV tracking scenario. While recorded data of real scenarios can accurately reflect the real world, the required amount of data is not always available. Simulation data, however, is typically cheap to generate, but the utilized target behavior is often naive and only vaguely represents the real world. In this paper, we utilize multi-agent RL to jointly generate protagonistic and antagonistic policies and overcome the data generation problem, as the policies are generated on-the-fly and adapt continuously. This way, we are able to clearly outperform baseline methods and robustly generate competitive policies. In addition, we investigate explainable artificial intelligence (XAI) by interpreting feature saliency and generating an easy-to-read decision tree as a simplified policy.
I. INTRODUCTION
Reinforcement learning (RL) [1] offers the promise of learning the behavior of an agent, requiring only the specification of its reward function. RL could therefore lead to a generic way to perform sensor management, in which only the sensing objective needs to be defined by the system designer. The reinforcement learning algorithm then automatically learns a behavior, called the policy, to fulfill this objective. In many applications, the agents could theoretically learn their policies online in the real world; however, the poor performance during early stages of training and in novel, unseen situations makes such an approach impractical. Instead, learning the behavior in a simulated environment and afterwards transferring it to the real world is more feasible. To achieve this, the environment must specify the movement of the non-cooperative targets to be tracked. One could use pre-defined, fixed trajectories, which induces the risk of overfitting to those scenarios. Alternatively, it is possible to let the targets move randomly, which is less realistic. In this work, we instead follow the approach of training against a worst-case target, which counteracts the sensor management. This is achieved by training a target policy with inverted rewards in parallel to the sensor management. The policies are trained using multi-agent reinforcement learning based on observations of the platform states and target estimates. A schematic overview of the method can be seen in Fig. 1.

Fig. 1: A schematic overview of our method. We solve a classical UAV control task through reinforcement learning. Two separate policies choose actions $a_t$ based on realistic environment observations $o_t$. An EKF is utilized to filter noisy sensor measurements $z_t$. Most importantly, we employ contradicting reward signals $r_t$ for the individual policies to induce antagonistic behaviors.
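To make the co-training structure concrete, the following Python sketch shows one way such a loop could be organized: both agents act in the same episodes, and the evading target simply receives the negated observer reward. The environment, the discrete heading action set, and the random placeholder policies are hypothetical stand-ins (a toy distance-based reward instead of the tracking-accuracy reward, and no actual PPO update), so this is a minimal illustration of the loop structure rather than the authors' implementation.

```python
import numpy as np

# Hypothetical discrete action set: eight possible headings for each platform.
HEADINGS = np.linspace(-np.pi, np.pi, 8, endpoint=False)

class ToyPursuitEnv:
    """Minimal stand-in environment: both platforms move with constant speed on a
    2-D plane; the observer reward is the negative distance to the target
    (the paper rewards tracking accuracy instead, this is only a placeholder)."""
    def __init__(self, rng, steps=50):
        self.rng, self.steps = rng, steps

    def reset(self):
        self.t = 0
        self.p_obs = self.rng.uniform(-50.0, 50.0, size=2)
        self.p_tgt = self.rng.uniform(-50.0, 50.0, size=2)
        return self._observations()

    def step(self, a_obs, a_tgt):
        self.p_obs += 5.0 * np.array([np.cos(HEADINGS[a_obs]), np.sin(HEADINGS[a_obs])])
        self.p_tgt += 2.0 * np.array([np.cos(HEADINGS[a_tgt]), np.sin(HEADINGS[a_tgt])])
        self.t += 1
        r_obs = -np.linalg.norm(self.p_obs - self.p_tgt)  # observer reward
        return self._observations(), r_obs, self.t >= self.steps

    def _observations(self):
        rel = self.p_tgt - self.p_obs
        return rel.copy(), -rel.copy()  # each agent observes the relative position

class RandomPolicy:
    """Placeholder for a PPO learner; it samples random headings and its update()
    is a no-op where a real implementation would run PPO optimization epochs."""
    def __init__(self, rng):
        self.rng = rng

    def act(self, obs):
        return int(self.rng.integers(len(HEADINGS)))

    def update(self, rollout):
        pass

def co_train(env, observer, target, episodes=10):
    """Both agents act in the same episodes; the target receives the negated
    observer reward, which makes the interaction a zero-sum game."""
    for _ in range(episodes):
        obs_o, obs_t = env.reset()
        roll_o, roll_t, done = [], [], False
        while not done:
            a_o, a_t = observer.act(obs_o), target.act(obs_t)
            (obs_o, obs_t), r_o, done = env.step(a_o, a_t)
            roll_o.append((obs_o, a_o, r_o))
            roll_t.append((obs_t, a_t, -r_o))  # inverted reward for the evader
        observer.update(roll_o)
        target.update(roll_t)

rng = np.random.default_rng(0)
co_train(ToyPursuitEnv(rng), RandomPolicy(rng), RandomPolicy(rng))
```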
We consider a sensor path planning problem, where the trajectory of an unmanned aerial vehicle (UAV) is optimized. The UAV tracks a mobile ground-based target using a range-bearing sensor, which is restricted in its field of view (FOV). Such measurements are typical for a radar. Utilizing path planning to optimize the performance of a sensor is a classical problem of sensor management, and several algorithms have been proposed, mostly based on online trajectory optimization. A common way to evaluate and demonstrate these methods is based on pre-defined target trajectories. For example, [2] optimizes the trajectories of bearing-only sensors and [3] optimizes the trajectories of heterogeneous sensor platforms containing range and/or bearing sensors. In both cases, results are shown for linearly moving targets.
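As an illustration of this measurement model, the sketch below generates a noisy range-bearing measurement for a sensor with a limited field of view. The FOV width and noise standard deviations are assumed example values rather than the parameters used in the paper, and in the full system such measurements would be fed to the EKF shown in Fig. 1.

```python
import numpy as np

def range_bearing_measurement(p_sensor, heading, p_target, rng,
                              fov=np.deg2rad(120.0),
                              sigma_r=5.0, sigma_b=np.deg2rad(2.0)):
    """Noisy range-bearing measurement of a ground target, or None if the target
    lies outside the sensor's field of view. The FOV width and noise levels are
    illustrative assumptions, not the values used in the paper."""
    rel = np.asarray(p_target, float) - np.asarray(p_sensor, float)
    range_true = np.linalg.norm(rel)
    bearing_true = np.arctan2(rel[1], rel[0])
    # Wrapped angle between the sensor boresight (its heading) and the target direction.
    off_boresight = np.arctan2(np.sin(bearing_true - heading),
                               np.cos(bearing_true - heading))
    if abs(off_boresight) > fov / 2.0:
        return None  # target not visible, no detection at this time step
    z_r = range_true + rng.normal(0.0, sigma_r)
    z_b = bearing_true + rng.normal(0.0, sigma_b)
    return np.array([z_r, z_b])

rng = np.random.default_rng(1)
print(range_bearing_measurement([0.0, 0.0], 0.0, [100.0, 20.0], rng))
```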
Evaluating a sensor path planning algorithm on a pre-defined scenario is reasonable if the algorithm is guaranteed to also work for other situations. However, when using a reinforcement learning algorithm, this imposes the risk that the agent behavior overfits to this single scenario. Consequently, the policy would only perform well on the given scenario. Therefore, existing works in path planning using reinforcement learning have modeled the target behavior as random. The work in [4] optimizes the path of a range-only or bearing-only sensor and models the targets as either stationary or following a 2-D Brownian motion model. The work in [5] considers a target in an urban context, which moves randomly on a partially occluded road grid. The observer learns a behavior to always keep the target in its field of view. In [6], the authors train a policy to localize stationary targets, which are placed randomly based on a prior. We note that some of the works based on online optimization also evaluate their policies using random target behavior [7]. Commonly, this random behavior does not follow a specific intent of the target, but instead is based on a fixed probability distribution on the action space.
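For reference, the intent-free random behavior used in these works can be as simple as the following 2-D Brownian-motion sketch, where each position increment is drawn from a fixed zero-mean Gaussian. The diffusion strength sigma is an arbitrary example value, not one taken from the cited papers.

```python
import numpy as np

def brownian_target_trajectory(p0, steps, dt=1.0, sigma=1.0, rng=None):
    """2-D Brownian-motion target: each position increment is zero-mean Gaussian
    noise, independent of any intent. sigma is an assumed diffusion strength."""
    if rng is None:
        rng = np.random.default_rng()
    increments = rng.normal(0.0, sigma * np.sqrt(dt), size=(steps, 2))
    return np.asarray(p0, float) + np.cumsum(increments, axis=0)

trajectory = brownian_target_trajectory([0.0, 0.0], steps=100,
                                        rng=np.random.default_rng(2))
```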
However, the assumption that a target moves randomly without intent is often not met in practice. The targets typically have some intent in their behavior. An alternative to the random model would be to collect real data of target behavior, on which a policy is trained. This would require a large amount of data to avoid overfitting on specific trajectories. In addition, targets might behave differently when the observer policy changes. Lastly, real-world data is mostly difficult and costly to acquire.
Alternatively, a game-theoretic approach can be taken. Instead of modeling specific target trajectories, we assume that the target has the intent to maximally degrade the tracking performance. When training an observer policy under this worst-case assumption, we can expect to achieve better tracking performance for other target behaviors. Such a worst-case target is known as an evading target and has been considered in several places in the literature for path planning based on online optimization. The work in [8] includes, among others, an evasive target model. The target knows the position of the tracking UAVs and always moves away from the closest observer. A more elaborate avoidance model is used in [9], where the ground-based target optimizes its trajectory to hide from an observer between obstacles. The observer takes this target behavior into account when optimizing its own trajectory.
Still, optimizing with respect to a hard-coded evasive target model can lead to overfitting. In this work, we therefore take the approach of training the policy of the evading target in parallel to the policy of the tracking UAV. The target is another agent, whose goal is to deteriorate the tracking performance of the observer as much as possible. If the observer policy specializes too strongly to the current policy of the target, the target could ideally learn that a change in its behavior leads to the observer tracking it less accurately and, consequently, choose another behavior.
Extending the field of RL to multiple agents is called multi-agent reinforcement learning. In this setting, the policy learned by each agent not only depends on the environment, but also on the learned policies of the other agents. As the other agents might have different goals, the policy of each agent needs to take the policies of the other agents into account. When these other agents improve in a competitive setting, this leads to a successively increasing difficulty for an agent to achieve its own goals. This feedback can be interpreted as a form of curriculum learning [10]. Co-training of an agent and its antagonist has led to several noteworthy breakthroughs in recent years, especially in the form of self-play, where the policy plays against a potentially different version of itself. Exemplary applications are learning Go [11] or StarCraft [12].
In this paper, we formulate a setting related to pursuit-evasion problems. In these tasks, a single pursuer or a group of pursuers needs to catch one or multiple evading targets by reaching their position. Several solutions to this problem are based on explicitly modeling the agent behavior [13]–[15]. Recently, learning-based solutions have been investigated [16], [17]. Training an additional policy for the evading targets can lead to a complex co-evolution of strategies [18]. While addressing a similar application, the problem studied in this paper differs from the pursuit-evasion category mentioned in previous work. In the traditional pursuit-evasion setting, pursuers are required to reach the position of the target. In contrast, this paper addresses a sensor management problem, where a pursuer needs to achieve an optimal measurement geometry towards the evading target. The goal is not to reach the target position, but instead to optimally localize it.
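The difference between the two problem classes is easiest to see in the reward signals. The sketch below contrasts a classical capture-style pursuit reward with a localization reward based on the trace of the estimate covariance; the latter is a common sensor-management surrogate and not necessarily the exact reward used in this paper, and the capture radius and example numbers are arbitrary.

```python
import numpy as np

def capture_reward(p_pursuer, p_target, capture_radius=5.0):
    """Classical pursuit-evasion objective: get within a capture radius of the
    target (shaped here by the negative distance while outside that radius)."""
    dist = np.linalg.norm(np.asarray(p_pursuer, float) - np.asarray(p_target, float))
    return 1.0 if dist <= capture_radius else -dist

def localization_reward(estimate_covariance):
    """Sensor-management objective considered here: keep the target estimate
    accurate. The negative trace of the (EKF) position covariance is one common
    surrogate; the paper's exact reward may differ."""
    P = np.asarray(estimate_covariance, float)
    return -float(np.trace(P[:2, :2]))  # penalize position uncertainty

print(capture_reward([0.0, 0.0], [30.0, 40.0]))            # -50.0
print(localization_reward(np.diag([4.0, 9.0, 1.0, 1.0])))  # -13.0
```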
In this paper, we do not consider the actual transfer from the simulation to a real system. Such a transfer adds complexities beyond the behavior of the target, as the sensor model and the movement model of all platforms must correspond to the real system. This sim-to-real transfer is actively researched in the reinforcement learning and robotics communities, using techniques such as domain randomization, with promising results [19], [20].
In this paper, we apply multi-agent reinforcement learning to the problem of tracking an evading target. In Section II we describe the tracking approach, the sensor management problem and the training method. In Section III we show simulation results and explanations for the trained policies. Finally, Section IV concludes the paper.
II. METHOD
A. Multi-Agent Reinforcement Learning
Due to its direct relationship to the trained agent behavior, the simulation environment has to be carefully designed and parameterized. In reinforcement learning, the environment is commonly modelled as a Markov decision process (MDP) and can be described as a tuple $(S, A, P, R, \gamma)$. More precisely, it consists of the state space $S$, the action space $A$, the state transition probabilities $P: S \times A \times S \rightarrow [0, 1]$, the reward function $R: S \times A \rightarrow \mathbb{R}$ and the discount factor $\gamma \in [0, 1]$. Typically, agents cannot observe the environment state directly, but only through a projection $\Omega: S \rightarrow O$ for an observation space $O$. Altogether, this yields a Partially Observable Markov Decision Process (POMDP), defined as $(S, A, P, R, \gamma, O, \Omega)$.
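For illustration, the following sketch spells this tuple out for a finite state and action space. The environment considered in this paper is continuous, so the container and the toy two-state example below are only conceptual stand-ins for the formal definition, with all names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np

@dataclass
class POMDP:
    """Illustrative container mirroring the tuple (S, A, P, R, gamma, O, Omega)
    for a finite state/action space; only a conceptual sketch."""
    states: Sequence            # S
    actions: Sequence           # A
    transition: np.ndarray      # P[s, a, s'] in [0, 1]
    reward: Callable            # R(s, a) -> float
    gamma: float                # discount factor in [0, 1]
    observations: Sequence      # O
    observe: Callable           # Omega(s) -> observation

    def step(self, rng, s, a):
        """Sample s' ~ P(. | s, a) and return (observation, reward, next state)."""
        s_next = rng.choice(len(self.states), p=self.transition[s, a])
        return self.observe(s_next), self.reward(s, a), s_next

# Toy two-state example: any action deterministically toggles the state.
rng = np.random.default_rng(3)
P = np.zeros((2, 2, 2))
P[0, :, 1] = 1.0
P[1, :, 0] = 1.0
pomdp = POMDP(states=[0, 1], actions=[0, 1], transition=P,
              reward=lambda s, a: float(s == a), gamma=0.95,
              observations=[0, 1], observe=lambda s: s)
print(pomdp.step(rng, 0, 1))
```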