occur, demonstrating the potential of simple object-level
fusion to handle dynamic errors.
• The proposed system gives a general solution independent of the type and model of onboard sensors and can be easily extended to vehicle-to-everything-based scenarios. Since only object-level information is transmitted, the system provides a low-cost solution with a low communication burden and easy implementation.
The rest of the paper is organized as follows: Section II introduces the related work on cooperative perception and optimal transport. Section III formulates the problem. Section IV presents the proposed object-level cooperative perception framework and the detailed algorithms. Experimental results and discussion are presented in Section V.
II. RELATED WORK
A. Cooperative Perception
Recent studies mainly focus on the aggregation of multi-
agent information to improve the average precision of percep-
tion results. Arnold et al. evaluated the performance of early and late fusion, as well as their hybrid combination schemes,
in driving scenarios using infrastructure sensors [2]. F-Cooper
introduced feature-level data fusion, which extracts and aggregates feature maps of the raw sensor data with deep learning networks and then detects objects on the fused feature map
[3]. V2VNet aggregated the feature information received from
nearby vehicles and took the downstream motion forecasting
performance into consideration [4]. OPV2V released the first
large-scale simulated V2V cooperation dataset and presented
a benchmark with 16 implemented models, within which we
implement our models [5]. However, these existing studies are
vulnerable to location and pose errors that are common and
inevitable in real-world applications.
FPV-RCNN introduced a location error correction module based on key-point matching before feature fusion to make the model more robust [6]. Vadivelu et al. proposed a deep learning-based framework to estimate potential errors [7], but it relies on feature-level fusion, which requires high computational capacity and does not generalize across different scenarios. Gao et al. proposed a graph matching-based method to identify the correspondence between cooperative vehicles, which can promote robustness against spatial errors [8]. They formulated the problem as a non-convex constrained optimization problem and developed a sampling-based algorithm to solve it; however, the problem is difficult and time-consuming to solve, which hinders its real-world application. In this paper, we take these errors into account and design an efficient and robust object-level cooperative perception framework.
B. Optimal Transport Theory
Optimal transport (OT) theory has been widely used for assignment problems in various fields. In the field of intelligent vehicles, the Hungarian algorithm is one of the most popular variants of optimal transport methods and has been widely used to match two target sets for its effectiveness and low complexity of $O(n^3)$. For instance, Cai et al. used it to assign vehicles to the generated goals in a formation so as to minimize the overall number of lane changes [9].
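As a concrete illustration of this kind of object-level matching (the toy coordinates and the Euclidean-distance cost below are our own assumptions for the sketch, not the exact formulation of [9]), the optimal assignment can be computed with SciPy's linear_sum_assignment:

```python
# Minimal sketch: Hungarian-style assignment between two detected object
# sets via scipy.optimize.linear_sum_assignment. The coordinates and the
# Euclidean-distance cost are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

ego_positions = np.array([[0.0,  0.0, 0.0],
                          [5.0,  1.0, 0.0],
                          [9.0, -2.0, 0.0]])   # objects detected by the Ego
cav_positions = np.array([[0.2, -0.1, 0.0],
                          [5.3,  0.8, 0.0],
                          [9.1, -1.7, 0.0]])   # objects detected by a CAV

# Pairwise Euclidean distances form the m-by-n assignment cost matrix.
cost = np.linalg.norm(
    ego_positions[:, None, :] - cav_positions[None, :, :], axis=-1)

rows, cols = linear_sum_assignment(cost)        # optimal matching, O(n^3)
for i, j in zip(rows, cols):
    print(f"Ego object {i} <-> CAV object {j} (distance {cost[i, j]:.2f} m)")
```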
For the perception problem, Sinkhorn's matrix scaling algorithm [10] is more powerful for its high efficiency on the graphics processing unit (GPU): in 2013, Cuturi smoothed the classical optimal transport problem with an entropic regularization term [11], which made the OT problem amenable to GPU computation and accelerated its solution far beyond conventional methods. In recent years, OT with the Sinkhorn algorithm has shown strong performance on several vision tasks alongside the rapid development of GPUs. For example, Sarlin et al. [12] formulated the assignment of graph features as a differentiable OT problem and achieved state-of-the-art performance on image matching. Qin et al. [13] applied OT theory to the point cloud registration problem and developed a method roughly 100 times faster than traditional ones. Owing to the efficiency of OT and the Sinkhorn algorithm, they are deployed in this work to find the object correspondences between the observations of the Ego vehicle and the CAVs.
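To make the computational pattern concrete, the following is a minimal NumPy sketch of Sinkhorn scaling for the entropically regularized OT problem of [11]; the uniform marginals, regularization weight, and fixed iteration count are illustrative assumptions:

```python
# Minimal sketch of Sinkhorn matrix scaling [10] for entropy-regularized
# optimal transport [11]; marginals, reg, and n_iters are illustrative.
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=100):
    """Approximate transport plan between uniform marginals."""
    m, n = cost.shape
    a = np.full(m, 1.0 / m)                # source marginal (uniform)
    b = np.full(n, 1.0 / n)                # target marginal (uniform)
    K = np.exp(-cost / reg)                # Gibbs kernel of the cost
    u = np.ones(m)
    for _ in range(n_iters):               # alternating row/column scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]     # plan P = diag(u) K diag(v)

P = sinkhorn(np.random.rand(4, 5))
print(P.sum(axis=1), P.sum(axis=0))        # row/column sums match a and b
```

Each iteration reduces to dense matrix-vector products, which is why the algorithm parallelizes naturally on a GPU, as noted above.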
III. PROBLEM FORMULATION
We consider a distributed cooperative perception scenario in which any cooperative CAV can share its local state and the information of its detected objects with the Ego vehicle. Let $\mathcal{X} = \{o_i,\ i = 1, 2, \dots, m\}$ be the object set detected by the Ego vehicle and $\mathcal{Y} = \{o_j,\ j = 1, 2, \dots, n\}$ be the object set detected by the CAV. Object $i$ is represented as a 6D vector $o_i = [x_i^T, \theta_i^T]^T$, where $x_i \in \mathbb{R}^3$ and $\theta_i \in \mathbb{R}^3$ are the 3D position and orientation, respectively.
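For concreteness, this representation can be sketched as follows (the numeric values and the roll-pitch-yaw reading of $\theta_i$ are illustrative assumptions):

```python
# Minimal sketch of the 6D object vector o_i = [x_i^T, theta_i^T]^T;
# the values and angle convention are illustrative assumptions.
import numpy as np

x_i = np.array([12.4, -3.1, 0.6])       # 3D position in the local frame
theta_i = np.array([0.0, 0.0, 1.57])    # 3D orientation (e.g., roll/pitch/yaw)
o_i = np.concatenate([x_i, theta_i])    # 6D object representation
```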
Cooperative fusion transforms $\mathcal{Y}$ into the Ego frame and aggregates it with $\mathcal{X}$. However, errors in the states of both connected vehicles cause an inaccurate estimate of the relative transform, which is to be corrected in this paper. The first challenge is to determine the co-visible region and associate the co-visible objects, given the local states of the Ego vehicle and the CAV as well as the noisy measurements $\mathcal{X}$ and $\mathcal{Y}$, provided that the co-visible object set $\mathcal{M}$ is attainable. The second problem is to estimate a transform $\mathcal{F}$, defined by a rotation matrix $R \in SO(3)$ and a translation vector $t \in \mathbb{R}^3$, between the objects in $\mathcal{X}$ and $\mathcal{Y}$ that approaches the accurate spatial transform. It can be formulated as the following optimization problem
$$\min_{\mathcal{F}} \sum_{(i,j) \in \mathcal{M}} \left\| x_i - \mathcal{F}(y_j) \right\|^2 \qquad (1)$$
where $x_i$ denotes the position vector of $o_i \in \mathcal{X}$ (and similarly $y_j$ that of $o_j \in \mathcal{Y}$), and $(i, j)$ is a possible object pair representing the same target. Operator $\mathcal{F}(\cdot)$ is defined as $\mathcal{F}(y) = R y + t$.
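Given fixed correspondences $\mathcal{M}$, problem (1) admits the classical closed-form orthogonal Procrustes (Kabsch) solution via the singular value decomposition; the sketch below illustrates that standard solution on synthetic data and is offered as background, not as the exact solver adopted by our framework:

```python
# Minimal sketch: closed-form (Kabsch/Procrustes) minimizer of Eq. (1)
# for known matched pairs; synthetic data, illustrative only.
import numpy as np

def estimate_transform(x, y):
    """R in SO(3), t in R^3 minimizing sum ||x_i - (R y_i + t)||^2."""
    x_c, y_c = x - x.mean(axis=0), y - y.mean(axis=0)
    H = y_c.T @ x_c                              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = x.mean(axis=0) - R @ y.mean(axis=0)
    return R, t

# Recover a known rotation about the z-axis and a known translation.
rng = np.random.default_rng(0)
y = rng.normal(size=(10, 3))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
x = y @ R_true.T + t_true
R, t = estimate_transform(x, y)
print(np.allclose(R, R_true), np.allclose(t, t_true))   # True True
```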
The third task is to complete the fusion using the estimated transform so as to maximize the perception capability of the Ego vehicle.
IV. PROPOSED METHOD
The proposed fusion framework consists of four submodules: preprocessing, co-visible object association, optimal transform estimation, and global fusion and dynamic mapping.