An Opponent-Aware Reinforcement Learning
Method for Team-to-Team Multi-Vehicle Pursuit via
Maximizing Mutual Information Indicator
Qinwen Wang, Xinhang Li, Zheng Yuan, Yiying Yang, Chen Xu, and Lin Zhang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
{wangqinwen, lixinhang, yuanzheng, yyying, chen.xu, zhanglin}@bupt.edu.cn

This work was supported by the National Natural Science Foundation of China (Grant No. 62071179) and project A02B01C01-201916D2.
Abstract—The pursuit-evasion game in the Smart City has a profound impact on the Multi-vehicle Pursuit (MVP) problem, in which police cars cooperatively pursue suspected vehicles. Existing studies on the MVP problem tend to set evading vehicles to move randomly or along a fixed prescribed route. Opponent modeling has shown considerable promise in tackling the non-stationarity caused by adversarial agents. However, most such methods focus on two-player competitive games and simple scenarios free of environmental interference. This paper considers a Team-to-Team Multi-vehicle Pursuit (T2TMVP) problem in a complicated urban traffic scene where the evading vehicles adopt pre-trained dynamic strategies to make decisions intelligently. To solve this problem, we propose an opponent-aware reinforcement learning via maximizing mutual information indicator (OARLM2I2) method to improve pursuit efficiency in the complicated environment. First, a sequential encoding-based opponents' joint strategy modeling (SEOJSM) mechanism is proposed to generate the evading vehicles' joint strategy model, which assists the multi-agent decision-making process based on deep Q-network (DQN). Then, we design a mutual information-united loss, simultaneously considering the reward fed back from the environment and the effectiveness of the opponents' joint strategy model, to update the pursuing vehicles' decision-making process. Extensive experiments based on SUMO demonstrate that our method outperforms other baselines by 21.48% on average in reducing pursuit time. The code is available at https://github.com/ANT-ITS/OARLM2I2.
Index Terms—intelligent transportation, team-to-team multi-vehicle pursuit, multi-agent reinforcement learning
I. INTRODUCTION
With the development of the Smart City, the Intelligent Transportation System (ITS) [1], effectively leveraging Internet of Vehicles (IoV) technology, has a profound impact on people's lives [2], [3]. Multi-vehicle pursuit (MVP), a special and practically meaningful problem in ITS, has attracted wide attention. For example, the vehicle pursuit guideline [4] published by the New York Police Department details the tactical operations that improve pursuit efficiency when cooperatively pursuing suspected vehicles.
Essentially, the MVP problem can be modeled as a pursuit-evasion game (PEG). In recent years, multi-agent reinforcement learning (MARL), showing significant advances in intelligent decision-making, has proven to be a fruitful method in PEG. Aiming at improving the cooperation between pursuers, [5] and [6] separately introduced curriculum learning and cross-task transfer learning into PEG. [7] proposed attention-enhanced reinforcement learning to address communication issues in multi-agent cooperation. For homogeneous agents in MVP, [8] proposed a transformer-based time and team reinforcement learning scheme. In addition to cooperation, some studies focus on the influence of opponents. [9] focused on predicting the future trajectory of the opponent to improve pursuit efficiency. However, these studies ignore the influence of the opponent's strategy, especially when the opponent follows a dynamic strategy, which brings extreme non-stationarity to the pursuit and thus increases both the difficulty and the randomness of a successful capture.
Opponent modeling is integrated into MARL as a promising solution [10] for building up cognition of the opponent's dynamic strategy and alleviating the non-stationarity during the pursuit. In self-play scenarios, [11] recursively reasons about the opponent's reactions to the protagonist's potential behaviors and finds the best response. Targeting the non-stationarity brought by the opponent's changing behaviors, [12] learned a general policy adaptive to changeable strategies. [13] used a policy distillation method to realize accurate policy detection and reuse in the face of non-stationary opponents. [14] learned low-level latent dynamics of the opponent and leveraged a stability reward to stabilize the opponent's strategy, reducing the non-stationarity in tasks. However, the aforementioned methods do not adapt to the team-to-team multi-vehicle pursuit problem. On the one hand, state-of-the-art methods focus only on two-player games and are difficult to adapt to team-to-team competitions, because both generating and modeling the complex strategies of opponents are challenging. On the other hand, existing MARL-based opponent modeling methods are rarely applied to MVP scenarios with complicated road structures and traffic restrictions.
This paper considers a team-to-team multi-vehicle pursuit (T2TMVP) problem in a complicated urban traffic scene. The evading vehicles adopt pre-trained policies to choose optimal actions rather than moving randomly or along a fixed route; this is what we call dynamic strategies. The main target of this paper is alleviating the non-stationarity brought by the dynamic strategies of evading vehicles and further improving pursuit efficiency. For this purpose, an opponent-aware reinforcement learning via maximizing mutual information indicator (OARLM2I2) method is proposed.
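To make the core idea concrete before the formal treatment, the following is a minimal PyTorch sketch of a DQN update whose loss unites a temporal-difference term with a variational lower bound on the mutual information between an encoded opponent-strategy latent and the opponents' observed actions. It is an illustrative sketch under stated assumptions, not the OARLM2I2 implementation detailed later in this paper: all names (OpponentEncoder, QNet, mi_united_loss, opp_action_head, mi_weight) are hypothetical.

```python
# Illustrative sketch only; module and parameter names are assumptions,
# not the paper's actual SEOJSM/OARLM2I2 architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpponentEncoder(nn.Module):
    """Encodes a sequence of observed opponent states into a latent
    strategy embedding (a stand-in for a sequential strategy model)."""
    def __init__(self, obs_dim, hidden_dim, latent_dim):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, opp_obs_seq):           # (batch, time, obs_dim)
        _, h = self.gru(opp_obs_seq)          # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0))        # (batch, latent_dim)

class QNet(nn.Module):
    """Q-network conditioned on the pursuer's observation and the
    opponent strategy embedding."""
    def __init__(self, obs_dim, latent_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions))

    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))

def mi_united_loss(q_net, target_q_net, encoder, opp_action_head, batch,
                   gamma=0.99, mi_weight=0.1):
    """Standard DQN TD loss minus a variational mutual-information bound,
    so minimizing the total loss improves Q-values while keeping the
    opponent model informative about the opponents' actions."""
    obs, act, rew, next_obs, done, opp_seq, opp_act = batch
    z = encoder(opp_seq)                      # latent opponent strategy
    q = q_net(obs, z).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q_net(next_obs, z).max(dim=1).values
        target = rew + gamma * (1.0 - done) * next_q
    td_loss = F.mse_loss(q, target)
    # Variational bound on I(z; opponent action): log q(a_opp | z)
    # up to an additive constant, estimated via a classification head.
    mi_bound = -F.cross_entropy(opp_action_head(z), opp_act)
    return td_loss - mi_weight * mi_bound
```

Here, lowering the united loss simultaneously reduces the TD error and raises the mutual-information bound, which mirrors the intended role of the mutual information indicator: the pursuers' decision-making is updated jointly with a measure of how effective the opponents' joint strategy model is.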