and controllers [1, 17], interpretable representations [3, 17], post-hoc explanations [4], and advising
the planner via natural language [5, 6]. In contrast, our proposed architecture explains decisions of a
tracker by providing interpretable SCMs as a proxy for its network’s reasoning procedure.
2 Interpretable Tracking Design
Our design follows the tracking-by-detection paradigm, matching every incoming detection to a
corresponding track. Fundamentally, this can be thought of as a link prediction problem in a bipartite
graph (see fig. 2), where track and detection nodes are constrained to have a single edge. Our nuanced
decomposition has seven possible decisions: two detection-and-track decisions, appearance match
and BBOX match; two detection-only decisions, newborn track and false positive detection; and three
track-only decisions, out of range track, false positive track, and occluded track. To make decisions,
we assume access to knowledge sources commonly predicted by 3D detectors [8, 15, 18]: BEV
feature map, 3D bounding boxes, position, velocity, and confidence scores. We also assume additional
information sources: appearance features for each detection (e.g., extracted from the BEV feature
map), an occlusion map of the scene in BEV (produced by the detector), and the ego vehicle’s current
state. The key idea of our design is the integration of a highly structured end-to-end optimizable
tracker with SCMs that represent its decision-making domain knowledge.
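The seven-way decomposition above can be written down as a small data structure. The following is a minimal sketch, not the authors' implementation: the names `Decision`, `DETECTION_AND_TRACK`, `DETECTION_ONLY`, `TRACK_ONLY`, `involves_track`, and `involves_detection` are hypothetical and introduced only for illustration.

```python
from enum import Enum


class Decision(Enum):
    """The seven link decisions named in the text (identifiers are illustrative)."""
    # Detection-and-track decisions
    APPEARANCE_MATCH = "appearance match"
    BBOX_MATCH = "BBOX match"
    # Detection-only decisions
    NEWBORN_TRACK = "newborn track"
    FALSE_POSITIVE_DETECTION = "false positive detection"
    # Track-only decisions
    OUT_OF_RANGE_TRACK = "out of range track"
    FALSE_POSITIVE_TRACK = "false positive track"
    OCCLUDED_TRACK = "occluded track"


# Grouping by which node types of the bipartite graph take part in the decision.
DETECTION_AND_TRACK = frozenset({Decision.APPEARANCE_MATCH, Decision.BBOX_MATCH})
DETECTION_ONLY = frozenset({Decision.NEWBORN_TRACK, Decision.FALSE_POSITIVE_DETECTION})
TRACK_ONLY = frozenset({
    Decision.OUT_OF_RANGE_TRACK,
    Decision.FALSE_POSITIVE_TRACK,
    Decision.OCCLUDED_TRACK,
})


def involves_track(decision: Decision) -> bool:
    """True if the decision assigns an outcome to a track node."""
    return decision in DETECTION_AND_TRACK or decision in TRACK_ONLY


def involves_detection(decision: Decision) -> bool:
    """True if the decision assigns an outcome to a detection node."""
    return decision in DETECTION_AND_TRACK or decision in DETECTION_ONLY
```

Every decision falls into exactly one of the three groups, which mirrors the constraint that each track and each detection receives a single edge.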
Base Tracker
Starting from a LiDAR-based 3D object detection backbone (e.g., [14]), our tracker
(fig. 1) predicts bounding boxes of current detections and an occlusion map for the current scene,
and it extracts appearance feature vectors for each detection from the BEV feature map. At timestep
T, track and detection features are passed to a graph neural network, yielding detection-informed
track features and track-informed detection features, respectively. These features are then fed to subnetworks for each
possible decision. One and only one link decision for each track and each detection is selected.
This is accomplished via Hungarian matching [7], where each decision corresponds to an edge in
the bipartite graph. Training the network simply involves enforcing a margin between all correct
decisions and all incorrect decisions. Once these detections are matched to tracks or become newborn
tracks, their feature vectors are fed to corresponding LSTMs to compute track features (not shown
for simplicity). These track-level representations are then used to forecast the track’s trajectory for
h timesteps into the future.
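To illustrate how "one and only one" decision per track and per detection can be enforced, the sketch below pads the track-by-detection score matrix with dummy rows and columns so that unmatched nodes take a track-only or detection-only decision. This padding scheme, the function names `build_score_matrix` and `assign`, and the example scores are assumptions made for illustration; the text does not specify them. For brevity the sketch searches all permutations instead of running Hungarian matching, which is only feasible for very small inputs but yields the same optimal assignment.

```python
from itertools import permutations

NEG_INF = float("-inf")


def build_score_matrix(match, track_only, det_only):
    """Pad scores to a square (T+D) x (T+D) matrix.

    Rows: T tracks, then D dummy rows (one per detection).
    Cols: D detections, then T dummy columns (one per track).
    """
    n_t, n_d = len(track_only), len(det_only)
    size = n_t + n_d
    scores = [[0.0] * size for _ in range(size)]
    for i in range(n_t):
        for j in range(n_d):
            scores[i][j] = match[i][j]  # track i linked to detection j
        for k in range(n_t):
            # A track may only use its own dummy column (track-only decision).
            scores[i][n_d + k] = track_only[i] if k == i else NEG_INF
    for r in range(n_d):
        for j in range(n_d):
            # A detection may only use its own dummy row (detection-only decision).
            scores[n_t + r][j] = det_only[j] if r == j else NEG_INF
    # Dummy-row/dummy-column entries stay 0.0 and carry no decision.
    return scores


def assign(match, track_only, det_only):
    """Return one decision per track and per detection, maximizing total score."""
    n_t, n_d = len(track_only), len(det_only)
    scores = build_score_matrix(match, track_only, det_only)
    size = n_t + n_d
    # Brute-force stand-in for Hungarian matching (assumption: tiny inputs only).
    best = max(
        permutations(range(size)),
        key=lambda cols: sum(scores[r][c] for r, c in enumerate(cols)),
    )
    track_decisions, det_decisions = {}, {}
    for r, c in enumerate(best):
        if r < n_t:
            track_decisions[r] = ("match", c) if c < n_d else ("track-only", None)
        if c < n_d:
            det_decisions[c] = ("match", r) if r < n_t else ("detection-only", None)
    return track_decisions, det_decisions
```

For example, with `match=[[0.9, 0.1], [0.2, 0.3]]`, `track_only=[0.2, 0.8]`, and `det_only=[0.1, 0.7]`, track 0 is linked to detection 0, track 1 receives a track-only decision, and detection 1 receives a detection-only decision. In practice each track-only or detection-only score would itself be the best of the corresponding subnetwork outputs.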
Tracking Decision SCMs
Due to uncertainty in the inputs to the SCMs (computed by the detector),
we must assume that the SCMs have access to oracle models that can correct errors in the
detector-computed inputs (otherwise they could not infer the correct output). These oracle models are, of course,
imaginary, but we can obtain their outputs via the ground truth labels. Due to space constraints, each
SCM is shown in the appendix (see figs. 4, 5, 6, 7, 8, 9, and 10).
Track Only Decisions These decisions are made when a track is not matched to any detection and
represent one of three causes: the track has gone out of the detectable range of the ego vehicle, the
track is a false positive, or the detection corresponding to the current track is occluded. The track-only
SCMs use three main high-level binary variables to make decisions. Each computes the track’s
bounding box at the current time step and uses it to determine two intermediate nodes: whether the
track matches any detection and whether the track is out of range. The SCM for out of range track
decides that the track is out of range if the out-of-range node is true and no detection matches the track.
The SCMs for occluded track and false positive track use additional occlusion information to make
their predictions. Respectively, they decide that a track is occluded if the occlusion information marks it as occluded and
it is neither out of range nor matched to any detection, and that a track is a false positive if it is neither
occluded nor out of range nor matched to any detection.
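The three track-only rules described above reduce to simple Boolean expressions over the intermediate nodes. The sketch below is an illustrative reading of the text, not the SCMs from the appendix: the class `TrackNodes` and the function names are hypothetical, and the node values are assumed to be already computed (e.g., by the oracle models).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackNodes:
    """Binary intermediate nodes for one track (names are illustrative)."""
    matches_detection: bool  # does the track match any detection?
    out_of_range: bool       # is the track's current box outside the detectable range?
    occluded: bool           # does the occlusion information mark the track as occluded?


def out_of_range_track(n: TrackNodes) -> bool:
    # Out of range, and no detection matched the track.
    return n.out_of_range and not n.matches_detection


def occluded_track(n: TrackNodes) -> bool:
    # Occluded, not out of range, and no detection matched the track.
    return n.occluded and not n.out_of_range and not n.matches_detection


def false_positive_track(n: TrackNodes) -> bool:
    # Neither occluded, nor out of range, nor matched to any detection.
    return not n.occluded and not n.out_of_range and not n.matches_detection
```

Under this reading, exactly one of the three rules fires for any unmatched track, and none fires for a matched track, which is consistent with the requirement that each track receives a single decision.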
Detection Only Decisions These decisions are made whenever a detection is not matched to a track.
This can occur in two possible situations: the detection is correct but has not been tracked before or
the detection is a false positive. The detection-only SCMs make decisions by assessing the validity
of detections and whether they match with existing tracks. If a detection is determined to be valid
but matches no track, it is declared a newborn track by the corresponding SCM. If a detection is
determined to have an invalid appearance and an invalid bounding box, then the associated SCM
decides that it is a false positive detection.
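The detection-only rules can be sketched in the same Boolean style. This is a hedged illustration only: the function and parameter names are hypothetical, and the sketch assumes that a detection counts as "valid" when its appearance or its bounding box is valid, which is the logical complement of the false positive condition stated above but is not spelled out in the text.

```python
def is_valid_detection(valid_appearance: bool, valid_bbox: bool) -> bool:
    # Assumption: valid means at least one of appearance or bounding box is valid.
    return valid_appearance or valid_bbox


def newborn_track(valid_appearance: bool, valid_bbox: bool, matches_track: bool) -> bool:
    # A valid detection that matches no existing track starts a new track.
    return is_valid_detection(valid_appearance, valid_bbox) and not matches_track


def false_positive_detection(valid_appearance: bool, valid_bbox: bool) -> bool:
    # Both the appearance and the bounding box are judged invalid.
    return not valid_appearance and not valid_bbox
```

With this assumption, an unmatched detection is either a newborn track or a false positive detection, never both.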
Detection & Track Decisions These decisions are taken when a detection is matched to a track by
appearance or BBOX. The SCMs for these cases are correspondingly simple.