reinforcement learning algorithm, this poses the risk that the
agent's behavior overfits to this single scenario. Consequently,
the policy would perform well only on the given scenario.
Therefore, existing works on path planning using reinforcement
learning have modeled the target behavior as random. The
work in [4] optimizes the path of a range-only or bearing-only
sensor and models the targets as either stationary or following
a 2-D Brownian motion model. The work in [5] considers
a target in an urban context, which moves randomly on a
partially occluded road grid. The observer learns a behavior to
always keep the target in its field of view. In [6], the authors
train a policy to localize stationary targets, which are placed
randomly based on a prior. We note that some of the works
based on online optimization also evaluate their policies using
random target behavior [7]. Commonly, this random behavior
does not follow a specific intent of the target, but instead is
based on a fixed probability distribution over the action space.
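Such an intent-free target can be sketched as follows: a 2-D Brownian-motion-like random walk whose step at each time instant is drawn from a fixed distribution, independent of the observer. This is a minimal illustration of the class of random target models discussed above; the function name and all parameters are our own choices, not taken from the cited works.

```python
import random

def random_target_trajectory(steps=100, step_std=1.0, seed=0):
    """Intent-free target model: a 2-D random walk.

    Each step is drawn from a fixed zero-mean Gaussian, independent
    of the observer's state (illustrative parameters only).
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    trajectory = [(x, y)]
    for _ in range(steps):
        x += rng.gauss(0.0, step_std)  # action drawn from a fixed distribution,
        y += rng.gauss(0.0, step_std)  # regardless of where the observer is
        trajectory.append((x, y))
    return trajectory

traj = random_target_trajectory()
```

Because the distribution never depends on the observer, a policy trained against such a target never faces an adversarial reaction, which motivates the game-theoretic view taken below.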
However, the assumption that a target moves randomly
without intent is often not met in practice. The targets typically
have some intent in their behavior. An alternative to the
random model would be to train the policy on real data of
target behavior. However, this would require a large amount
of data to avoid overfitting to specific trajectories. In
addition, targets might behave differently once the observer
policy changes. Lastly, real-world data is often difficult and
costly to acquire.
Alternatively, a game-theoretic approach can be taken. In-
stead of modeling specific target trajectories, we assume that
the target has the intent to maximally degrade the tracking
performance. When training an observer policy under this
worst-case assumption, we can expect to achieve better track-
ing performance for other target behaviors. Such a worst-case
target is known as an evading target and has been considered
in several places in the literature for path planning based
on online optimization. The work in [8] includes, among
others, an evasive target model. The target knows the position
of the tracking UAVs and always moves away from the
closest observer. A more elaborate avoidance model is used
in [9], where the ground-based target optimizes its trajectory
to hide from an observer between obstacles. The observer
takes this target behavior into account when optimizing its
own trajectory.
Still, optimizing with respect to a hard-coded evasive target
model can lead to overfitting. In this work, we therefore take
the approach of training the policy of the evading target in
parallel to the policy of the tracking UAV. The target is another
agent whose goal is to degrade the tracking performance
of the observer as much as possible. If the observer policy
overspecializes to the current policy of the target, the
target can, ideally, learn that a change in its behavior leads
to less accurate tracking by the observer and, consequently,
adapt its behavior.
Extending the field of RL to multiple agents is called multi-
agent reinforcement learning. In this setting, the policy learned
by each agent not only depends on the environment, but also
on the learned policies of the other agents. As the other agents
might have different goals, the policy of each agent needs to
take the policies of the other agents into account. When these
other agents improve in a competitive setting, it becomes
successively harder for an agent to achieve its own goals.
This feedback can be interpreted as a form
of curriculum learning [10]. Co-training of an agent and its
antagonist has led to several noteworthy breakthroughs in
recent years, especially in the form of self-play, where the
policy plays against a potentially different version of itself.
Exemplary applications include learning to play Go [11] or StarCraft [12].
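The co-adaptation dynamic described above can be illustrated with a deliberately small toy example (the game and all names are our own illustration, not part of the cited works): two identical players repeatedly best-respond to each other's empirical action statistics in rock-paper-scissors, a form of fictitious self-play. As each player adapts, exploiting a fixed strategy becomes harder, and the empirical play frequencies approach the uniform mixed equilibrium.

```python
from collections import Counter

# Toy zero-sum game standing in for the observer/target interaction.
ACTIONS = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # a beats BEATS[a]

def best_response(opponent_counts):
    """Best-respond to the opponent's empirically most frequent action."""
    most_common = max(ACTIONS, key=lambda a: opponent_counts[a])
    return next(a for a in ACTIONS if BEATS[a] == most_common)

def fictitious_self_play(rounds=3000):
    # counts[i] holds the empirical action counts of player i
    # (initialized to 1 so the first best response is well defined).
    counts = [Counter({a: 1 for a in ACTIONS}), Counter({a: 1 for a in ACTIONS})]
    for _ in range(rounds):
        a0 = best_response(counts[1])  # player 0 adapts to player 1's statistics
        a1 = best_response(counts[0])  # player 1 adapts to player 0's statistics
        counts[0][a0] += 1
        counts[1][a1] += 1
    total = sum(counts[0].values())
    return {a: counts[0][a] / total for a in ACTIONS}

freqs = fictitious_self_play()  # empirical frequencies near 1/3 each
```

In the setting of this paper, the observer and the evading target play the roles of the two players: each agent's improvement raises the difficulty for the other, producing the implicit curriculum mentioned above.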
In this paper, we formulate a setting related to pursuit-
evasion problems. In these tasks, a single pursuer or a group
needs to catch one or multiple evading targets by reaching
their position. Several solutions to this problem are based on
explicitly modeling the agent behavior [13]–[15]. Recently,
learning based solutions have been investigated [16], [17].
Training an additional policy for the evading targets can lead
to a complex co-evolution of strategies [18]. While addressing
a similar application, the problem studied in this paper differs
from the pursuit-evasion problems mentioned above.
In the traditional pursuit-evasion setting, pursuers are required
to reach the position of the target. In contrast, this paper
addresses a sensor management problem, where a pursuer
needs to achieve optimal measurement geometry towards the
evading target. The goal is not to reach the target position, but
instead to optimally localize it.
In this paper, we do not consider the actual transfer from the
simulation to a real system. Such a transfer adds further complexities
beyond the behavior of the target, as the sensor model and
the movement model of all platforms must correspond to the
real system. Such a sim-to-real transfer is actively researched
in the reinforcement learning and robotics communities, using
techniques such as domain randomization, with promising results
[19], [20].
In this paper, we apply multi-agent reinforcement learning
to the problem of tracking an evading target. In Section II
we describe the tracking approach, the sensor management
problem and the training method. In Section III we show
simulative results and explanations for the trained policies.
Finally, Section IV concludes the paper.
II. METHOD
A. Multi-Agent Reinforcement Learning
Due to its direct relationship to the trained agent behavior,
the simulation environment has to be carefully designed and
parameterized. In reinforcement learning, the environment is
commonly modeled as a Markov decision process (MDP) and
can be described as a tuple (S, A, P, R, γ). More precisely,
it consists of the state space S, the action space A, the state
transition probabilities P : S × A × S → [0, 1], the reward function
R : S × A → ℝ, and the discount factor γ ∈ [0, 1]. Typically,
agents cannot observe the environment state directly, but only
through a projection Ω : S → O into an observation space O.
Altogether, this yields a Partially Observable Markov Decision
Process (POMDP), defined as (S, A, P, R, γ, O, Ω).
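To make the notation concrete, the tuple (S, A, P, R, γ, O, Ω) can be mirrored in code. The following is a minimal sketch with a hypothetical two-state instance; the states, actions, probabilities, and rewards are illustrative only and not the environment used in this paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = str
Action = str
Obs = str

@dataclass
class POMDP:
    """Container mirroring the tuple (S, A, P, R, gamma, O, Omega)."""
    states: Tuple[State, ...]                              # S
    actions: Tuple[Action, ...]                            # A
    P: Dict[Tuple[State, Action], Dict[State, float]]      # P(s' | s, a)
    R: Callable[[State, Action], float]                    # R : S x A -> R
    gamma: float                                           # discount factor
    observations: Tuple[Obs, ...]                          # O
    Omega: Callable[[State], Obs]                          # projection S -> O

# Hypothetical two-state toy instance: the agent tries to stay "near"
# a target but only observes a coarse distance bucket.
toy = POMDP(
    states=("near", "far"),
    actions=("approach", "wait"),
    P={
        ("near", "approach"): {"near": 0.9, "far": 0.1},
        ("near", "wait"):     {"near": 0.5, "far": 0.5},
        ("far", "approach"):  {"near": 0.6, "far": 0.4},
        ("far", "wait"):      {"near": 0.1, "far": 0.9},
    },
    R=lambda s, a: 1.0 if s == "near" else 0.0,
    gamma=0.95,
    observations=("close", "distant"),
    Omega=lambda s: "close" if s == "near" else "distant",
)
```

The projection Ω discards information (here, the exact distance), which is exactly what makes the decision process partially observable rather than a plain MDP.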