An Opponent-Aware Reinforcement Learning
Method for Team-to-Team Multi-Vehicle Pursuit via
Maximizing Mutual Information Indicator
Qinwen Wang, Xinhang Li, Zheng Yuan, Yiying Yang, Chen Xu, and Lin Zhang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
{wangqinwen, lixinhang, yuanzheng, yyying, chen.xu, zhanglin}@bupt.edu.cn

This work was supported by the National Natural Science Foundation of China (Grant No. 62071179) and project A02B01C01-201916D2.
Abstract—The pursuit-evasion game in the Smart City has a profound impact on the multi-vehicle pursuit (MVP) problem, in which police cars cooperatively pursue suspected vehicles. Existing studies on the MVP problem tend to make evading vehicles move randomly or along a fixed, prescribed route. Opponent modeling has shown considerable promise in tackling the non-stationarity caused by adversary agents. However, most such methods focus on two-player competitive games and simple scenarios without environmental interference. This paper considers a Team-to-Team Multi-vehicle Pursuit (T2TMVP) problem in a complicated urban traffic scene where the evading vehicles adopt pre-trained dynamic strategies to make decisions intelligently. To solve this problem, we propose an opponent-aware reinforcement learning via maximizing mutual information indicator (OARLM2I2) method to improve pursuit efficiency in the complicated environment. First, a sequential encoding-based opponents' joint strategy modeling (SEOJSM) mechanism is proposed to generate the evading vehicles' joint strategy model, which assists the deep Q-network (DQN)-based multi-agent decision-making process. Then, we design a mutual information-united loss, simultaneously considering the reward fed back from the environment and the effectiveness of the opponents' joint strategy model, to update the pursuing vehicles' decision-making process. Extensive experiments based on SUMO demonstrate that our method outperforms other baselines by 21.48% on average in reducing pursuit time. The code is available at https://github.com/ANT-ITS/OARLM2I2.
Index Terms—intelligent transportation, team-to-team multi-vehicle pursuit, multi-agent reinforcement learning
I. INTRODUCTION
With the development of the Smart City, the Intelligent Transportation System (ITS) [1], effectively leveraging Internet of Vehicles (IoV) technology, has a profound impact on people's lives [2], [3]. Multi-vehicle pursuit (MVP), a special and practically meaningful problem in ITS, has attracted wide attention. For example, the vehicle pursuit guideline [4] published by the New York Police Department details tactical operations to improve pursuit efficiency when cooperatively pursuing suspected vehicles.
Essentially, the MVP problem can be modeled as a pursuit-evasion game (PEG). In recent years, multi-agent reinforcement learning (MARL), which has shown significant advances in intelligent decision-making, has proven to be a fruitful method in PEG. Aiming to improve cooperation among pursuers, [5], [6] separately introduced curriculum learning and cross-task transfer learning into PEG. [7] proposed attention-enhanced reinforcement learning to address communication issues in multi-agent cooperation. As for homogeneous agents in MVP, [8] proposed a transformer-based time and team reinforcement learning scheme. In addition to cooperation, some studies focus on the influence of opponents. [9] focused on predicting the future trajectory of the opponent to promote pursuit efficiency. However, these studies ignore the influence of the opponent's strategy, especially when the opponent follows a dynamic strategy, which brings extreme non-stationarity to the pursuit and thus increases both the difficulty and the randomness of a successful capture.
The opponent modeling method is integrated into MARL as a promising solution [10] for building up cognition of the opponent's dynamic strategy and alleviating the non-stationarity during the pursuit. In self-play scenarios, [11] recursively reasons about the opponent's reactions to the protagonist's potential behaviors and finds the best response. Targeting the non-stationarity brought by an opponent's changing behaviors, [12] learned a general policy adaptive to changeable strategies. [13] used a policy distillation method to realize accurate policy detection and reuse in the face of non-stationary opponents. [14] learned the low-level latent dynamics of the opponent and leveraged a stability reward to stabilize the opponent's strategy, reducing the non-stationarity in tasks. However, the aforementioned methods do not adapt well to the team-to-team multi-vehicle pursuit problem. On the one hand, state-of-the-art methods focus only on two-player games and are difficult to adapt to team-to-team competitions, because both generating and modeling the complex joint strategies of opponent teams are challenging. On the other hand, existing MARL-based opponent modeling methods are rarely applied to MVP scenarios with complicated road structures and traffic restrictions.
This paper considers a team-to-team multi-vehicle pursuit (T2TMVP) problem in a complicated urban traffic scene. The evading vehicles adopt pre-trained policies to choose optimal actions rather than moving randomly or along a fixed route; we refer to these as dynamic strategies. The main target of this paper is to alleviate the non-stationarity brought by the dynamic strategies of the evading vehicles and to further improve pursuit efficiency. For this purpose, an opponent-aware reinforcement learning via maximizing mutual information indicator (OARLM2I2) method is proposed, as shown in Fig. 1.
Fig. 1. Overall architecture of OARLM2I2. (a) Complicated urban traffic scene for the T2TMVP problem, consisting of traffic lights, background vehicles, evading vehicles, and pursuing vehicles. (b) SEOJSM mechanism, which models the dynamic strategies of the opponents, assisted by the mutual information-united loss. (c) Multi-agent reinforcement learning framework for the pursuing agents. Each pursuing agent adopts DQN to make decisions with the assistance of the opponents' joint strategy model. (d) State-sensitive joint dynamic strategy of the opponents. Each evading agent leverages Q-learning to select the actions with the highest Q-values.
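As a toy illustration of the evaders' action selection described in Fig. 1(d), the sketch below shows tabular Q-learning with greedy selection of the highest-valued action. The state discretization and hyperparameters here are hypothetical placeholders, not the paper's own design; the state-sensitive strategy actually used is the subject of Section III.

import numpy as np

n_states, n_actions = 16, 4           # hypothetical discretization, not the paper's
Q = np.zeros((n_states, n_actions))   # tabular Q-values for one evading agent
alpha, gamma = 0.1, 0.95              # illustrative learning rate and discount

def q_update(s, a, r, s_next):
    # One tabular Q-learning step toward the bootstrapped target.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def select_action(s):
    # Greedy selection: the action with the highest Q-value.
    return int(Q[s].argmax())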
OARLM2I2 is equipped with the sequential encoding-based opponents' joint strategy modeling (SEOJSM) mechanism to extract the joint features of the evading vehicles' Q-learning-based dynamic strategies. Meanwhile, the DQN-based pursuing vehicles make efficient decisions by leveraging the joint partial observation together with the joint strategy model of the evading vehicles, and the mutual information between the two serves as an indicator to update the SEOJSM mechanism. The main contributions of this paper are as follows:
1. This paper models the team-to-team multi-vehicle pursuit (T2TMVP) problem in a complicated urban traffic scene. Two competitive teams, the pursuing vehicle team and the evading vehicle team, separately make flexible decisions according to intelligent dynamic strategies.
2. This paper proposes an opponent-aware reinforcement learning via maximizing mutual information indicator (OARLM2I2) method to improve pursuit efficiency for the T2TMVP problem. A sequential encoding-based opponents' joint strategy modeling (SEOJSM) mechanism is deliberately designed to help tackle the non-stationarity brought by the dynamic strategies of the evading vehicles (a minimal sketch of such a strategy encoder is given after this list).
3. This paper leverages a novel mutual information-united loss to train OARLM2I2. The mutual information-united loss comprehensively considers the effectiveness of the decision-making network and the opponents' joint strategy model (the sketch after this list also illustrates a loss of this form).
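As referenced in contributions 2 and 3, the following minimal PyTorch sketch illustrates the three ingredients named above: a sequential encoder producing an opponents' joint strategy embedding, a pursuer Q-network conditioned on that embedding, and a mutual information-united loss. This is an illustration under explicit assumptions rather than the authors' implementation: the GRU encoder, the layer sizes, the concatenation-based conditioning, and the InfoNCE-style mutual information lower bound are all choices of this sketch; the released code at https://github.com/ANT-ITS/OARLM2I2 is the definitive reference.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointStrategyEncoder(nn.Module):
    """Sequentially encodes each evader's observed trajectory with a shared
    GRU, then fuses the per-evader summaries into one joint strategy
    embedding (the SEOJSM idea; the exact architecture is assumed here)."""
    def __init__(self, obs_dim, n_evaders, hidden_dim=64, embed_dim=32):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.fuse = nn.Linear(hidden_dim * n_evaders, embed_dim)

    def forward(self, traj):                # traj: (B, n_evaders, T, obs_dim)
        B, n, T, d = traj.shape
        _, h = self.gru(traj.reshape(B * n, T, d))  # h: (1, B*n, hidden_dim)
        return self.fuse(h.squeeze(0).reshape(B, -1))

class PursuerQNet(nn.Module):
    """DQN head for one pursuing agent: Q-values conditioned on the agent's
    partial observation concatenated with the joint strategy embedding."""
    def __init__(self, obs_dim, embed_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions))

    def forward(self, obs, strategy_emb):
        return self.net(torch.cat([obs, strategy_emb], dim=-1))

def mi_united_loss(q_values, actions, td_targets, strategy_emb, decision_emb,
                   lam=0.1):
    """DQN temporal-difference loss plus a mutual-information term; lam and
    the InfoNCE-style estimator are assumptions of this sketch. decision_emb
    can be, e.g., a hidden-layer feature of the decision network."""
    # Standard DQN TD loss on the actions actually taken.
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q_taken, td_targets)
    # InfoNCE-style lower bound on I(strategy_emb; decision_emb): matched
    # (diagonal) pairs within the batch are positives, all others negatives.
    logits = decision_emb @ strategy_emb.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    mi_lower_bound = -F.cross_entropy(logits, labels)
    # Maximizing the mutual information indicator = subtracting its bound.
    return td_loss - lam * mi_lower_bound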
The outline of this article is given as follows. Section
II introduces the T2TMVP problem statement and problem
instantiation. In Section III, the state-sensitive joint dynamic
strategy of evading vehicles is introduced, and the SEOJSM
mechanism is proposed. Section IV details the deep Q-network
for pursuing agents and the training process with the mu-
tual information-united loss. Section V provides experiment
settings and sufficient experiments to verify the effectiveness
of the proposed OARLM2I2 method. Finally, the conclusion and
future work are presented in Section VI.
II. T2TMVP PROBLEM STATEMENT AND INSTANTIATION
In this section, we first state the T2TMVP problem. Then,
we instantiate the T2TMVP problem as a partially observable Markov decision process (POMDP).
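For reference, such an instantiation conventionally takes the form of the tuple $\langle \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{N}, P, R, \{\Omega_i\}_{i=1}^{N}, O, \gamma \rangle$, where $\mathcal{S}$ is the global state space, $\mathcal{A}_i$ and $\Omega_i$ are the action and observation spaces of pursuing agent $i$, $P(s' \mid s, \boldsymbol{a})$ is the transition function, $R$ is the reward function, $O$ maps states to each agent's partial observation, and $\gamma \in [0, 1)$ is the discount factor; the paper's exact notation may differ from this standard form.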
A. T2TMVP Problem Statement
This paper considers a team-to-team multi-vehicle pursuit (T2TMVP) problem in a complicated urban traffic scene, as shown in Fig. 2. Competition is the central theme of T2TMVP, and the two competing teams of vehicles make intelligent decisions to separately accomplish their own goals.