An Opponent-Aware Reinforcement Learning
Method for Team-to-Team Multi-Vehicle Pursuit via
Maximizing Mutual Information Indicator
Qinwen Wang, Xinhang Li, Zheng Yuan, Yiying Yang, Chen Xu, and Lin Zhang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
{wangqinwen, lixinhang, yuanzheng, yyying, chen.xu, zhanglin}@bupt.edu.cn

This work was supported by the National Natural Science Foundation of China (Grant No. 62071179) and project A02B01C01-201916D2.
Abstract—The pursuit-evasion game in the Smart City has a profound impact on the Multi-vehicle Pursuit (MVP) problem, in which police cars cooperatively pursue suspected vehicles. Existing studies on the MVP problem tend to set evading vehicles to move randomly or along a fixed prescribed route. Opponent modeling has shown considerable promise in tackling the non-stationarity caused by adversarial agents. However, most such methods focus on two-player competitive games and simple scenarios free of environmental interference. This paper considers a Team-to-Team Multi-vehicle Pursuit (T2TMVP) problem in a complicated urban traffic scene where the evading vehicles adopt pre-trained dynamic strategies to make decisions intelligently. To solve this problem, we propose an opponent-aware reinforcement learning via maximizing mutual information indicator (OARLM2I2) method to improve pursuit efficiency in the complicated environment. First, a sequential encoding-based opponents' joint strategy modeling (SEOJSM) mechanism is proposed to generate the evading vehicles' joint strategy model, which assists the multi-agent decision-making process based on deep Q-network (DQN). Then, we design a mutual information-united loss, simultaneously considering the reward fed back from the environment and the effectiveness of the opponents' joint strategy model, to update the pursuing vehicles' decision-making process. Extensive experiments based on SUMO demonstrate that our method outperforms other baselines by 21.48% on average in reducing pursuit time. The code is available at https://github.com/ANT-ITS/OARLM2I2.
Index Terms—intelligent transportation, team-to-team multi-vehicle pursuit, multi-agent reinforcement learning
I. INTRODUCTION
With the development of the Smart City, the Intelligent Transportation System (ITS) [1], effectively leveraging Internet of Vehicles (IoV) technology, has a profound impact on people's lives [2], [3]. Multi-vehicle pursuit (MVP), a special and practically meaningful problem in ITS, has attracted wide attention. For example, the vehicle pursuit guideline [4] published by the New York Police Department details the tactical operations that improve pursuit efficiency when cooperatively pursuing suspected vehicles.
Essentially, the MVP problem can be modeled as a pursuit-evasion game (PEG). In recent years, multi-agent reinforcement learning (MARL), showing significant advances in intelligent decision-making, has proven to be a fruitful method in PEG. Aiming at improving the cooperation between pursuers, [5] and [6] separately introduced curriculum learning and cross-task transfer learning into PEG. [7] proposed attention-enhanced reinforcement learning to address communication issues in multi-agent cooperation. For homogeneous agents in MVP, [8] proposed a transformer-based time and team reinforcement learning scheme. In addition to cooperation, some studies focus on the influence of opponents. [9] focused on predicting the future trajectory of the opponent to improve pursuit efficiency. However, these studies ignore the influence of the opponent's strategy, especially when the opponent follows a dynamic strategy, which brings extreme non-stationarity to the pursuit and thus increases both the difficulty and the randomness of a successful capture.
Opponent modeling is integrated into MARL as a promising solution [10] for building up cognition of the opponent's dynamic strategy and alleviating the non-stationarity during the pursuit. In self-play scenarios, [11] recursively reasons about the opponent's reactions to the protagonist's potential behaviors and finds the best response. Targeting the non-stationarity brought by the opponent's changing behaviors, [12] learned a general policy adaptive to changeable strategies. [13] used a policy distillation method to realize accurate policy detection and reuse in the face of non-stationary opponents. [14] learned low-level latent dynamics of the opponent and leveraged a stability reward to stabilize the opponent's strategy, reducing the non-stationarity in tasks. However, the aforementioned methods do not adapt to the team-to-team multi-vehicle pursuit problem. On the one hand, state-of-the-art methods focus only on two-player games and are difficult to adapt to team-to-team competitions, because both generating and modeling the complex strategies of opponents are challenging. On the other hand, existing MARL-based opponent modeling methods are rarely applied to MVP scenarios with complicated road structures and traffic restrictions.
This paper considers a team-to-team multi-vehicle pursuit (T2TMVP) problem in a complicated urban traffic scene. The evading vehicles adopt pre-trained policies to choose optimal actions rather than moving randomly or along a fixed route; this is what we call dynamic strategies. The main target of this paper is alleviating the non-stationarity brought by the dynamic strategies of evading vehicles and further improving pursuit efficiency. For this purpose, an opponent-aware reinforcement learning via maximizing mutual information indicator (OARLM2I2) method is proposed.
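To make the core idea concrete before the formal treatment, the following is a minimal PyTorch sketch of a DQN update whose loss unites a temporal-difference term with a variational lower bound on the mutual information between an encoded opponent-strategy latent and the opponents' observed actions. It is an illustrative sketch under stated assumptions, not the OARLM2I2 implementation detailed later in this paper: all names (OpponentEncoder, QNet, mi_united_loss, opp_action_head, mi_weight) are hypothetical.

```python
# Illustrative sketch only; module and parameter names are assumptions,
# not the paper's actual SEOJSM/OARLM2I2 architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpponentEncoder(nn.Module):
    """Encodes a sequence of observed opponent states into a latent
    strategy embedding (a stand-in for a sequential strategy model)."""
    def __init__(self, obs_dim, hidden_dim, latent_dim):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, opp_obs_seq):           # (batch, time, obs_dim)
        _, h = self.gru(opp_obs_seq)          # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0))        # (batch, latent_dim)

class QNet(nn.Module):
    """Q-network conditioned on the pursuer's observation and the
    opponent strategy embedding."""
    def __init__(self, obs_dim, latent_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions))

    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))

def mi_united_loss(q_net, target_q_net, encoder, opp_action_head, batch,
                   gamma=0.99, mi_weight=0.1):
    """Standard DQN TD loss minus a variational mutual-information bound,
    so minimizing the total loss improves Q-values while keeping the
    opponent model informative about the opponents' actions."""
    obs, act, rew, next_obs, done, opp_seq, opp_act = batch
    z = encoder(opp_seq)                      # latent opponent strategy
    q = q_net(obs, z).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q_net(next_obs, z).max(dim=1).values
        target = rew + gamma * (1.0 - done) * next_q
    td_loss = F.mse_loss(q, target)
    # Variational bound on I(z; opponent action): log q(a_opp | z)
    # up to an additive constant, estimated via a classification head.
    mi_bound = -F.cross_entropy(opp_action_head(z), opp_act)
    return td_loss - mi_weight * mi_bound
```

Here, lowering the united loss simultaneously reduces the TD error and raises the mutual-information bound, which mirrors the intended role of the mutual information indicator: the pursuers' decision-making is updated jointly with a measure of how effective the opponents' joint strategy model is.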