Interpretable Deep Tracking
Benjamin Thérien Krzysztof Czarnecki
Department of Computer Science
University of Waterloo
{btherien,k2czarne}@uwaterloo.ca
Abstract
Imagine experiencing a crash as the passenger of an autonomous vehicle. Wouldn’t
you want to know why it happened? Current end-to-end optimizable deep neural
networks (DNNs) in 3D detection, multi-object tracking, and motion forecasting
provide little to no explanations about how they make their decisions. To help
bridge this gap, we design an end-to-end optimizable multi-object tracking architec-
ture and training protocol inspired by the recently proposed method of interchange
intervention training (IIT). By enumerating different tracking decisions and as-
sociated reasoning procedures, we can train individual networks to reason about
the possible decisions via IIT. Each network’s decisions can be explained by the
high-level structural causal model (SCM) it is trained in alignment with. Moreover,
our proposed model learns to rank these outcomes, leveraging the promise of deep
learning in end-to-end training, while being inherently interpretable.
1 Introduction
As autonomous driving systems (ADS) become more and more capable, they are deployed at
increasing levels of autonomy. Yet, most DNNs proposed as part of such systems are still black
boxes, slowing this progress. Interpretability is essential for increasing social acceptance, legality,
and safety of ADS. Passengers should have the right to understand why the vehicle they are riding
in made a certain decision [4]. In the event of a crash, legal entities require the ADS to produce
explanations for its actions. Interpretable architectures can also improve vehicle safety by providing a
better understanding of failure cases and enabling the detection and handling of errors during online
processing. Our proposed interpretable tracker is end-to-end optimizable and can be trained for
performance and interpretability, allowing it to effectively trade off these two desirable properties.
Many interpretability methods for DNNs exist [11]; however, only a few fulfill our desired characteristics. We would like to explain the decisions our model makes, while incurring minimal performance
degradation compared to a black-box model. Post-hoc techniques, while potentially achieving this
goal, are unreliable [9] and not amenable to online processing. In contrast, intrinsically interpretable
models avoid these drawbacks. Existing techniques typically involve distilling a DNN’s knowledge
into a more interpretable model [13] or using the interpretable model outright [11]. While intrinsic interpretability would allow for online processing, these techniques are not ideal as they cannot leverage an end-to-end trained DNN at inference. Interchange intervention training (IIT) [2] is a recently proposed method that addresses this problem. This technique fulfills all our desired criteria, and it is the interpretability engine of our proposed tracker.
In the following, we study how to instill interpretability into an end-to-end trainable multi-object
tracker. We make two main contributions. First, we handcraft structural causal models (SCMs) for
each tracking decision, so that they can be used to train our network via IIT. Second, we propose
how these SCMs can be integrated into a 3D detection, multi-object tracking, and motion forecasting
network, similar to [10], enabling end-to-end training and interpretability. Other relevant works also apply interpretability techniques to autonomous driving, but most focus on interpretable planners
and controllers [1, 17], interpretable representations [3, 17], post-hoc explanations [4], and advising the planner via natural language [5, 6]. In contrast, our proposed architecture explains decisions of a tracker by providing interpretable SCMs as a proxy for its network's reasoning procedure.
2 Interpretable Tracking Design
Our design follows the tracking-by-detection paradigm, matching every incoming detection to a
corresponding track. Fundamentally, this can be thought of as a link prediction problem in a bipartite
graph (see fig. 2), where track and detection nodes are constrained to have a single edge. Our nuanced
decomposition has seven possible decisions: two detection-and-track decisions, appearance match
and BBOX match; two detection-only decisions, newborn track and false positive detection; and three
track-only decisions, out of range track, false positive track, and occluded track. To make decisions,
we assume access to knowledge sources commonly predicted by 3D detectors [8, 15, 18]: BEV
feature map, 3D bounding boxes, position, velocity, and confidence scores. We also assume additional
information sources: appearance features for each detection (e.g., extracted from the BEV feature
map), an occlusion map of the scene in BEV (produced by the detector), and the ego vehicle’s current
state. The key idea of our design is the integration of a highly structured end-to-end optimizable
tracker with SCMs that represent its decision-making domain knowledge.
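To make this decision space concrete, the sketch below enumerates the seven decisions and bundles the assumed knowledge sources into a plain container. All names here are our own illustrative choices, not identifiers from the paper's code.

```python
from dataclasses import dataclass
from enum import Enum, auto

import numpy as np


class Decision(Enum):
    """The seven tracking decisions, grouped by the entities they involve."""
    APPEARANCE_MATCH = auto()          # detection-and-track
    BBOX_MATCH = auto()                # detection-and-track
    NEWBORN_TRACK = auto()             # detection-only
    FALSE_POSITIVE_DETECTION = auto()  # detection-only
    OUT_OF_RANGE_TRACK = auto()        # track-only
    FALSE_POSITIVE_TRACK = auto()      # track-only
    OCCLUDED_TRACK = auto()            # track-only


@dataclass
class KnowledgeSources:
    """Per-frame inputs assumed available from the detector and ego vehicle."""
    bev_features: np.ndarray   # (C, H, W) BEV feature map
    boxes: np.ndarray          # (N, 7) 3D boxes (e.g., x, y, z, l, w, h, yaw)
    velocities: np.ndarray     # (N, 2) BEV velocities
    scores: np.ndarray         # (N,) detection confidence scores
    appearance: np.ndarray     # (N, D) per-detection appearance features
    occlusion_map: np.ndarray  # (H, W) BEV occlusion map of the scene
    ego_state: np.ndarray      # current ego vehicle state
```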
Base Tracker
Starting from a LiDAR-based 3D object detection backbone (e.g., [14]), our tracker
(fig. 1) predicts bounding boxes of current detections and an occlusion map for the current scene,
and it extracts appearance feature vectors for each detection from the BEV feature map. At timestep
T, track and detection features are passed to a graph neural network, yielding detection-informed
and track-informed features respectively. These features are then fed to subnetworks for each
possible decision. One and only one link decision for each track and each detection is selected.
This is accomplished via Hungarian matching [7], where each decision corresponds to an edge in
the bipartite graph. Training the network simply involves enforcing a margin between all correct
decisions and all incorrect decisions. Once these detections are matched to tracks or become newborn
tracks, their feature vectors are fed to corresponding LSTMs to compute track features (not shown
for simplicity). These track-level representations are then used to forecast the track’s trajectory for
h timesteps into the future.
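To illustrate, the sketch below shows one standard way to realize the one-decision-per-node constraint with Hungarian matching: the track-detection score matrix is augmented with diagonal dummy entries carrying the best track-only and detection-only scores, so that the optimal assignment gives every track and every detection exactly one decision. A simple margin loss over decision scores follows. The construction and all identifiers are our own assumptions; the paper does not spell out these details.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

FORBIDDEN = -1e9  # large negative score for disallowed assignments


def solve_link_prediction(pair_scores, track_only_scores, det_only_scores):
    """Assign exactly one decision to every track and every detection.

    pair_scores:       (M, N) best detection-and-track score per (track, det)
    track_only_scores: (M,)   best track-only decision score per track
    det_only_scores:   (N,)   best detection-only decision score per detection
    """
    M, N = pair_scores.shape
    scores = np.full((M + N, M + N), FORBIDDEN)
    scores[:M, :N] = pair_scores                                # real edges
    scores[np.arange(M), N + np.arange(M)] = track_only_scores  # track dummies
    scores[M + np.arange(N), np.arange(N)] = det_only_scores    # det dummies
    scores[M:, N:] = 0.0                                        # dummy-dummy filler
    rows, cols = linear_sum_assignment(scores, maximize=True)
    pairs      = [(r, c) for r, c in zip(rows, cols) if r < M and c < N]
    track_only = [r for r, c in zip(rows, cols) if r < M and c >= N]
    det_only   = [c for r, c in zip(rows, cols) if r >= M and c < N]
    return pairs, track_only, det_only


def margin_loss(decision_scores, correct_mask, margin=1.0):
    """Hinge loss pushing every correct decision's score above every
    incorrect decision's score by at least `margin`."""
    correct = decision_scores[correct_mask]
    incorrect = decision_scores[~correct_mask]
    gaps = margin - correct[:, None] + incorrect[None, :]
    return np.maximum(gaps, 0.0).mean()
```

Here each entry of `pair_scores` would be the larger of the appearance match and BBOX match subnetwork scores for that (track, detection) pair, with the winning decision type recorded alongside the edge.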
Tracking Decision SCMs
Due to uncertainty in the inputs to the SCMs (computed by the detector),
we must assume that the SCMs have access to oracle models which can somehow correct errors in the
detector’s input (otherwise they could not infer the correct output). These oracle models are, of course,
imaginary, but we can obtain their outputs via the ground truth labels. Due to space constraints, each
SCM is shown in the appendix (see figures: 4, 5, 6, 7, 8, 9, and 10).
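To sketch what IIT asks of the network, the toy code below performs a single interchange intervention: the hidden units aligned with one SCM variable are copied from a source run into a base run, and the network is penalized when its counterfactual prediction disagrees with the SCM's. The module and the `scm_counterfactual` callable are hypothetical stand-ins for exposition; see [2] for the full method.

```python
import torch
import torch.nn as nn


class DecisionSubnetwork(nn.Module):
    """Toy stand-in for one decision subnetwork (illustrative only)."""

    def __init__(self, d_in: int = 8, d_hidden: int = 16, n_out: int = 2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.dec = nn.Linear(d_hidden, n_out)  # decision logits

    def forward(self, x):
        return self.dec(self.enc(x))


def interchange_loss(net, scm_counterfactual, base, source, aligned_units,
                     criterion=nn.CrossEntropyLoss()):
    """One IIT term for a single aligned SCM variable.

    `aligned_units` indexes the hidden units aligned with the variable;
    `scm_counterfactual(base, source)` returns the label the SCM produces
    when that variable takes its value from the source input.
    """
    h_base, h_source = net.enc(base), net.enc(source)
    h_swap = h_base.clone()
    h_swap[:, aligned_units] = h_source[:, aligned_units]  # the intervention
    logits = net.dec(h_swap)
    return criterion(logits, scm_counterfactual(base, source))
```

The full objective would sum such a term for every aligned variable of every decision SCM and add it, with a weighting coefficient, to the margin-based tracking loss, which is how performance and interpretability can be traded off.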
Track Only Decisions These decisions are made when a track is not matched to any detection and
represent one of three causes: the track has gone out of the detectable range of the ego vehicle, the
track is a false positive, or the detection corresponding to the current track is occluded. The track-only
SCMs use three main high-level binary variables to make decisions. Each computes the track’s
bounding box at the current time step and uses it to determine two intermediate nodes: whether the
track matches any detection and whether the track is out of range. The SCM for out of range track
predicts that the track is out of range if it is predicted to be so and no detection matched the track.
The SCMs for occluded track and false positive track use additional occlusion information to make
their predictions. They predict, respectively, that a track is occluded if it is predicted to be occluded and is neither out of range nor matched to any detection, and that a track is a false positive if it is neither occluded nor out of range nor matched to any detection.
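Written as plain boolean structural equations, this track-only logic reduces to the short sketch below (a paraphrase of the prose; the SCMs in the appendix are authoritative). It reproduces the truth table shown in Figure 1.

```python
from typing import Optional


def track_only_decision(matches_detection: bool, is_occluded: bool,
                        is_oor: bool) -> Optional[str]:
    """Final track-only decision from the three high-level binary variables."""
    if matches_detection:
        return None  # resolved by a detection-and-track decision instead
    if is_oor:
        return "out_of_range_track"    # out of range takes priority
    if is_occluded:
        return "occluded_track"        # occluded, in range, unmatched
    return "false_positive_track"      # unmatched, in range, not occluded
```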
Detection Only Decisions These decisions are made whenever a detection is not matched to a track.
This can occur in two possible situations: the detection is correct but has not been tracked before or
the detection is a false positive. The detection-only SCMs make decisions by assessing the validity
of detections and whether they match with existing tracks. If a detection is determined to be valid
but matches no track, it is declared a newborn track by the corresponding SCM. If a detection is
determined to have an invalid appearance and an invalid bounding box, then the associated SCM
decides that it is a false positive detection.
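The detection-only logic admits the same kind of sketch, with validity predicates assumed to come from the appearance and bounding-box checks described above (again our paraphrase, not the paper's code):

```python
from typing import Optional


def detection_only_decision(matches_track: bool, appearance_valid: bool,
                            bbox_valid: bool) -> Optional[str]:
    """Final detection-only decision for a single detection."""
    if matches_track:
        return None  # resolved by a detection-and-track decision instead
    if appearance_valid or bbox_valid:
        return "newborn_track"             # valid but previously untracked
    return "false_positive_detection"      # invalid appearance and invalid BBOX
```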
Detection & Track Decisions These decisions are taken when a detection is matched to a track by
appearance or BBOX. The SCMs for these cases are correspondingly simple.
[Figure 1 occupies this page. Left: the LiDAR input at time T is processed by a 3D object detector (3D backbone, detection head, and segmentation head) into an occlusion map, a BEV feature map, and detected bounding boxes, from which MLPs extract per-detection appearance and BBOX features. Center: a GNN combines these detection features with the active track memory (track appearance features, track BBOX features, and motion forecasts for T made at T-1) and the ego state, yielding detection-informed track features. Right: these features are aligned with the nodes of the occluded-track SCM at time T (predicted BBOX, matches BBOX, matches appearance, matches detection, is occluded, is out of range), and the resulting seven decisions solve the constrained link prediction problem between tracks and detections, which updates the tracks.]

Truth table for the SCM's final Occluded Track decision:

Scenario  Matches Detection  Is Occluded  Is OOR  Occluded Track?
A         1                  1            1       0
B         1                  0            1       0
C         1                  1            0       0
D         1                  0            0       0
E         0                  1            1       0
F         0                  0            1       0
G         0                  1            0       1
H         0                  0            0       0
Figure 1: Proposed network structure and example alignment for the Occluded Track decision. The diagram depicts an alignment between the SCM for the occluded track decision and the DNN structure; only one alignment is shown due to space constraints, and the remaining six can be formulated similarly. We provide the associated causal models in the appendix.