
MARL to utilize the shared spatial and temporal information among CAVs. Compared with the baseline model, our solution achieves higher safety metrics and overall returns in challenging scenarios.
•We introduce coordination mechanisms to MARL with information sharing and cooperative policy learning. Our experimental results show that cooperation among CAVs improves the collision-free rate and overall return.
II. RELATED WORK
a) Planning and Control of Autonomous Vehicles: End-to-end learning maps the observed environment directly to output control signals for steering angle and acceleration; it has been realized with CNN-based supervised learning [8] and CBF-based deep reinforcement learning [9], though both only consider lane-keeping without lane-changing behavior. Another popular approach is to separate the learning and control phases: learning methods produce a high-level decision, such as “go straight” or “go left” [10], or whether or not to yield to another vehicle [11]. It is also effective to first extract image features and then apply control based on these features [12]. However, the works mentioned above do not consider the connection between CAVs, whereas we consider how CAVs should use information sharing to improve the safety and efficiency of the system, and design an MARL-based algorithm such that CAVs cooperatively take actions in challenging driving scenarios.
b) GCN, Transformer and Deep MARL: How to design a neural network structure that exploits communication among CAVs to improve the system’s safety or efficiency in policy learning has not yet been addressed.
Recent advances like GCN [13] and Transformer [14], [15]
show their advantages in processing spatial and temporal
properties of data. We utilize a GCN-Transformer structure to
capture the spatial-temporal information of driving scenarios
to improve the coordination among CAVs. To the best of
our knowledge, we are the first to design a GCN-Transformer
structure-based deep constrained MARL framework to utilize
the shared information among CAVs. We validate that this
design improves the safety rates and total rewards for CAVs
in challenging scenarios with traffic hazards.
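As a rough illustration of how such a pipeline composes (a toy sketch, not the architecture proposed here), the snippet below chains a single normalized graph-convolution layer over the V2V communication graph with a plain self-attention step over one agent’s embedding sequence; the fully connected adjacency, feature sizes, and random weights are all illustrative assumptions.

```python
import numpy as np

def gcn_layer(X, A, W):
    # X: (n_agents, f_in) node features, A: adjacency with self-loops, W: (f_in, f_out)
    D = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = D @ A @ D                       # symmetric normalization D^{-1/2} A D^{-1/2}
    return np.maximum(A_hat @ X @ W, 0.0)   # aggregate neighbors, then ReLU

def self_attention(H):
    # H: (T, d) sequence of one agent's GCN embeddings over T time steps
    scores = H @ H.T / np.sqrt(H.shape[1])                      # scaled dot products
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)               # row-wise softmax
    return weights @ H                                          # temporal mixing

# toy setting: 3 CAVs, 4 raw features, 8-dim embeddings, 5 time steps
rng = np.random.default_rng(0)
A = np.ones((3, 3))                          # fully connected V2V graph incl. self-loops
W = rng.standard_normal((4, 8))
frames = [gcn_layer(rng.standard_normal((3, 4)), A, W) for _ in range(5)]
seq = np.stack([f[0] for f in frames])       # agent 0's spatial embedding over time
out = self_attention(seq)
print(out.shape)                             # → (5, 8)
```

The GCN step mixes information across agents at each time step; the attention step then mixes across time, which is the spatial-then-temporal ordering described above.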
c) Constrained MDP and Safe RL: Existing multi-agent reinforcement learning (MARL) literature [16], [17], [18], [19] has not fully addressed the challenges faced by CAVs. A Constrained Markov Decision Process (CMDP) [20], [21] learns a policy that maximizes the total reward while keeping the total cost under certain constraints. However, the cost or the constraint does not explicitly represent all the safety requirements of physical dynamic systems and cannot be directly applied to solve CAV challenges. A recent advance with a formal safety guarantee is model predictive shielding (MPS), which also works for multi-agent systems [22], [23]. However, its safety guarantee assumes an accurate model of the vehicles, which is difficult to obtain in practice. Control Barrier Functions have been used to map unsafe actions to a safe action set in MARL [24], but that work does not consider how to design a spatial-temporal encoder actor or critic network structure for challenging scenarios with hazard vehicles. In this work, we integrate the strengths of both constrained MARL and a CBF-based safety shield to further improve the safety of CAVs under the threat of traffic hazards.
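To make the shielding idea concrete, here is a minimal toy sketch of filtering a discrete action set with a discrete-time CBF-style condition h(x') ≥ (1 − α)h(x); the one-step longitudinal gap model, the safety margin, and all numeric values are illustrative assumptions, not the safety shield used in this paper.

```python
def cbf_safe_actions(gap, rel_speed, actions, dt=1.0, margin=5.0, alpha=0.2):
    """Keep only accelerations whose one-step successor state satisfies
    the discrete-time CBF condition h(x') >= (1 - alpha) * h(x),
    with h = gap to the lead vehicle minus a safety margin."""
    h = gap - margin
    safe = []
    for a in actions:
        # toy longitudinal model: ego acceleration a changes the closing rate
        rel_speed_next = rel_speed + a * dt
        gap_next = gap - rel_speed_next * dt
        if gap_next - margin >= (1 - alpha) * h:
            safe.append(a)
    return safe

# candidate accelerations (m/s^2) with a closing rate of 2 m/s at an 8 m gap:
# only hard braking keeps the barrier from decaying too fast
print(cbf_safe_actions(gap=8.0, rel_speed=2.0, actions=[-3.0, 0.0, 2.0]))  # → [-3.0]
```

In a MARL setting such a filter sits between the learned policy and the actuator: the policy proposes an action, and the shield substitutes a safe one whenever the proposal fails the barrier condition.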
III. PROBLEM FORMULATION
A. Problem Description
We consider the cooperative policy-learning problem for CAVs in challenging scenarios that occur at a multi-lane urban intersection or on a multi-lane highway (as shown in Fig. 1). Other traffic participants include unconnected vehicles (UCVs) and a hazard vehicle (HAZV). Meanwhile, infrastructures that have sensing, communication, and computation abilities also play a supportive role for CAVs.
A CAV agent is primarily supported by its own observation o_i, the shared observation o_{N_i} from neighboring agents N_i via V2V communication, and the shared observation o_inf from the road infrastructure. Specifically, N_i provides extra sensor measurements and sensor-detection data, such as lane detection with camera images and object detection with LiDARs [25]. o_inf consists of messages broadcast to CAVs by road infrastructure, for example a radar unit that broadcasts the detected speed and location of nearby vehicles.
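A CAV’s augmented state can be pictured as a simple container holding the three observation sources; the class and field names below are hypothetical and only illustrate how o_i, o_{N_i}, and o_inf might be assembled in practice.

```python
from dataclasses import dataclass, field

@dataclass
class EgoObservation:
    location: tuple       # GPS location l_i
    velocity: float       # v_i
    acceleration: float   # alpha_i
    detections: list      # on-board camera / LiDAR object-detection results det_i

@dataclass
class AugmentedState:
    """State of CAV i assembled from the three observation sources."""
    ego: EgoObservation                                  # o_i: self-observation
    neighbors: dict = field(default_factory=dict)        # o_{N_i}: keyed by agent id
    infrastructure: list = field(default_factory=list)   # o_inf: roadside broadcasts

# hypothetical example: one V2V neighbor and one radar broadcast about the HAZV
ego = EgoObservation((10.0, 2.5), 8.0, 0.5, ["car_ahead"])
state = AugmentedState(ego)
state.neighbors[2] = EgoObservation((14.0, 2.5), 7.0, 0.0, ["hazard_left"])
state.infrastructure.append({"vehicle": "HAZV", "speed": 15.0, "loc": (30.0, 5.0)})
print(len(state.neighbors), len(state.infrastructure))  # → 1 1
```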
B. Constrained MARL Problem Formulation
A constrained MARL problem is defined as a tuple G = (S, A, P, {r_i}, {c_i}, G, γ), where G := (N, E) is the communication network of all CAV agents; S is the joint state space of all agents: S := S_1 × ··· × S_n. The state space of agent i, S_i = {o_i, o_{j∈N_i}, o_inf}, contains information from three sources: the self-observation o_i from vehicle i’s own odometers and sensors, the observations o_{j∈N_i} shared by other connected agents, and the observation o_inf shared by the infrastructure. The observation of each CAV is o_i = {(l_i, v_i, α_i), det_i}, where (l_i, v_i, α_i) are the GPS location, velocity, and acceleration of agent i, and det_i is the object-detection result from the vision-based sensors (on-board camera and 3D point-cloud LiDAR). The joint action set is A := A_1 × ··· × A_n, where A_i = {a_{i,1}, a_{i,2}, ..., a_{i,4+k}} is the discrete finite action space of agent i, and
•a_{i,1}: KEEP-LANE-SPEED - the CAV i maintains its current speed in the current lane.
•a_{i,2}: CHANGE-LANE-LEFT - the CAV i changes to its left lane. In the experiment, by taking a_{i,2} we set the target waypoint on the left lane.
•a_{i,3}: CHANGE-LANE-RIGHT - the CAV i changes to its right lane. In the experiment, by taking a_{i,3} we set the target waypoint on the right lane.
•a_{i,4}: BRAKE - in the experiment, the CAV i’s actuator computes a brake value brake_i^t ∈ [0, 0.5] at time t.
•a_{i,5}, a_{i,6}, ..., a_{i,4+k} are k discretized throttle intervals. Given the available throttle value set [0, 1] in the simulator, we set a_{i,4+j} = [(j−1)/k, j/k]. By choosing the action a_{i,5}, for example, the actuator of vehicle i will