Graded-Q Reinforcement Learning with
Information-Enhanced State Encoder for
Hierarchical Collaborative Multi-Vehicle Pursuit
Yiying Yang, Xinhang Li, Zheng Yuan, Qinwen Wang, Chen Xu, and Lin Zhang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
{yyying, lixinhang, yuanzheng, wangqinwen, chen.xu, zhanglin}@bupt.edu.cn
Abstract—Multi-vehicle pursuit (MVP), as a problem abstracted from various real-world scenarios, is becoming a hot research topic in Intelligent Transportation System (ITS). The combination of Artificial Intelligence (AI) and connected vehicles has greatly promoted research progress on MVP. However, existing works on MVP pay little attention to the importance of information exchange and cooperation among pursuing vehicles in complex urban traffic environments. This paper proposes a graded-Q reinforcement learning with information-enhanced state encoder (GQRL-IESE) framework to address this hierarchical collaborative multi-vehicle pursuit (HCMVP) problem. In the GQRL-IESE, a cooperative graded Q scheme is proposed to facilitate the decision-making of pursuing vehicles and improve pursuit efficiency. Each pursuing vehicle uses a deep Q network (DQN) to make decisions based on its encoded state, and a coordinated Q optimizing network then adjusts these individual decisions according to the current traffic information to obtain the globally optimal action set. In addition, an information-enhanced state encoder is designed to extract critical information from multiple perspectives and uses an attention mechanism to help each pursuing vehicle effectively determine its target. Extensive experimental results based on SUMO indicate that the proposed GQRL-IESE reduces the total number of pursuit timesteps by 47.64% on average compared with other methods, demonstrating its excellent pursuit efficiency. Code is available at https://github.com/ANT-ITS/GQRL-IESE.
Index Terms—cooperative multi-agent reinforcement learning,
hierarchical collaborative multi-vehicle pursuit, GQRL-IESE
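To make the graded-Q decision flow described in the abstract concrete before the formal exposition, the following minimal PyTorch-style sketch shows how per-vehicle DQN outputs could be adjusted by a coordinated Q optimizing network. All class names, layer sizes, and the concatenation-based aggregation are illustrative assumptions rather than the exact released implementation.

import torch
import torch.nn as nn

class VehicleDQN(nn.Module):
    """Per-vehicle Q-network acting on an encoded local state (illustrative)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, encoded_state: torch.Tensor) -> torch.Tensor:
        # Individual Q-values, shape (batch, n_actions).
        return self.net(encoded_state)

class CoordinatedQOptimizer(nn.Module):
    """Adjusts individual Q-values using global traffic information (illustrative)."""
    def __init__(self, n_vehicles: int, n_actions: int, traffic_dim: int, hidden: int = 128):
        super().__init__()
        self.n_vehicles, self.n_actions = n_vehicles, n_actions
        self.net = nn.Sequential(
            nn.Linear(n_vehicles * n_actions + traffic_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_vehicles * n_actions),
        )

    def forward(self, individual_q: torch.Tensor, traffic_info: torch.Tensor) -> torch.Tensor:
        # individual_q: (batch, n_vehicles, n_actions); traffic_info: (batch, traffic_dim).
        flat = individual_q.flatten(start_dim=1)
        adjusted = self.net(torch.cat([flat, traffic_info], dim=-1))
        return adjusted.view(-1, self.n_vehicles, self.n_actions)

# Joint action selection: each vehicle then acts greedily on its adjusted Q-values,
# e.g., actions = coordinator(q_individual, traffic_info).argmax(dim=-1)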
I. INTRODUCTION
The Intelligent Transportation System (ITS), as an essential
part of the smart city, is greatly facilitated by the development
of emerging technologies. The Internet of Vehicles (IoV) enables ITS to realize dynamic and intelligent traffic management [1], [2]. The pursuit-evasion game (PEG), as a realistic problem for studying the self-learning and autonomous control of multiple agents, has been extensively studied in many fields, such as spacecraft control [3] and robot control [4]. Multi-vehicle pursuit (MVP), as an embodiment of PEG in ITS, is subject to more constraints, such as complex road structures, additional traffic participants, and traffic rules. A patrol guide released by the New York City Police Department representatively describes an MVP game, in which multiple police vehicles cooperate to capture single or multiple suspect vehicles [5].
This work was supported by the National Natural Science Foundation of China (Grant No. 62071179) and project A02B01C01-201916D2.
Regarding MVP, several works have explored game theory-based methods. [6] focused on the multi-player pursuit game with malicious pursuers and constructed a nonzero-sum game framework to train pursuers with different emotional intentions to complete the task. [7] developed a model predictive control method to address the problem of the pursuers' limited information, in which each pursuer focused only on its opponents' information. [8] adopted a graph-theoretic method to learn the interactions between perception-limited agents and employed a minimax strategy to maintain safe operation when the system failed to reach the Nash equilibrium. However, it is difficult for these methods to construct a suitable objective function, and they pay little attention to cooperation among pursuers in dynamic traffic environments, which directly affects the effectiveness of the pursuit.
Cooperative multi-agent reinforcement learning (CoMARL) has been widely used in the coordinated control of multi-agent systems (MASs), such as traffic light control [9] and network resource allocation [10]. CoMARL aims to maximize the expected long-term common cumulative reward of all agents by learning a series of optimal policies or action sets [11].
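Concretely, for $N$ agents sharing a team reward $r_t$ with discount factor $\gamma \in [0, 1)$, this common objective can be stated in the standard form (a textbook formulation rather than a quotation from [11]):
$$\max_{\pi_1,\dots,\pi_N} \; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, a_t^{i} \sim \pi_i(\cdot \mid o_t^{i}),\ i = 1,\dots,N\Big],$$
where each agent $i$ draws its action $a_t^{i}$ from policy $\pi_i$ conditioned on its local observation $o_t^{i}$.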
There is growing research interest in applying CoMARL to the MVP problem due to the powerful coordination mechanism and real-time decision-making ability of CoMARL. [12] developed a probabilistic reward-based reinforcement learning (RL) method based on multi-agent deep deterministic policy gradient (MADDPG), in which all pursuing agents are trained by a critic network, to accomplish the pursuit. [13] designed a target prediction network within a conventional multi-agent reinforcement learning framework to better assist the agents in decision-making. [14] introduced adversarial attack tricks and adversarial learning based on MADDPG to help agents learn more robust strategies. [15] added a Transformer on top of QMIX to learn from historical observations along both the temporal and team dimensions, thereby promoting pursuers to learn cooperative pursuit strategies. [16] developed a CoMARL framework combining collaborative exploration and attention-based QMIX to complete tasks coordinately, and the collaborative effectiveness of this framework was verified on a predator-prey scenario. However, these CoMARL methods for MVP operate in open or grid environments, and the complex traffic environments and traffic rule constraints will