Interpretable Deep Tracking

Benjamin Thérien Krzysztof Czarnecki

Department of Computer Science

University of Waterloo

{btherien,k2czarne}@uwaterloo.ca

Abstract

Imagine experiencing a crash as the passenger of an autonomous vehicle. Wouldn’t you want to know why it happened? Current end-to-end optimizable deep neural networks (DNNs) in 3D detection, multi-object tracking, and motion forecasting provide little to no explanation of how they make their decisions. To help bridge this gap, we design an end-to-end optimizable multi-object tracking architecture and training protocol inspired by the recently proposed method of interchange intervention training (IIT). By enumerating different tracking decisions and associated reasoning procedures, we can train individual networks to reason about the possible decisions via IIT. Each network’s decisions can be explained by the high-level structural causal model (SCM) it is trained in alignment with. Moreover, our proposed model learns to rank these outcomes, leveraging the promise of deep learning in end-to-end training, while being inherently interpretable.

1 Introduction

As autonomous driving systems (ADS) become more capable, they are deployed at increasing levels of autonomy. Yet most DNNs proposed as parts of such systems are still black boxes, which slows this progress. Interpretability is essential for increasing the social acceptance, legality, and safety of ADS. Passengers should have the right to understand why the vehicle they are riding in made a certain decision [ ]. In the event of a crash, legal entities require the ADS to produce explanations for its actions. Interpretable architectures can also improve vehicle safety by providing a better understanding of failure cases and by enabling the detection and handling of errors during online processing. Our proposed interpretable tracker is end-to-end optimizable and can be trained for both performance and interpretability, allowing it to effectively trade off these two desirable properties.

Many interpretability methods for DNNs exist [ ]; however, only a few fulfill our desired characteristics. We would like to explain the decisions our model makes while incurring minimal performance degradation compared to a black-box model. Post-hoc techniques, while potentially achieving this goal, are unreliable [ ] and not amenable to online processing. In contrast, intrinsically interpretable models avoid these drawbacks. Existing techniques typically involve distilling a DNN’s knowledge into a more interpretable model [ ] or using the interpretable model outright [ ]. While intrinsic interpretability would allow for online processing, these techniques are not ideal, as they cannot leverage an end-to-end trained DNN at inference. Interchange intervention training (IIT) [ ] is a recently proposed method that addresses this problem. This technique fulfills all our desired criteria, and it is the interpretability engine of our proposed tracker.

In the following, we study how to instill interpretability into an end-to-end trainable multi-object tracker. We make two main contributions. First, we handcraft structural causal models (SCMs) for each tracking decision, so that they can be used to train our network via IIT. Second, we propose how these SCMs can be integrated into a 3D detection, multi-object tracking, and motion forecasting network, similar to [ ], enabling end-to-end training and interpretability. Other relevant works also apply interpretability techniques to autonomous driving, but most focus on interpretable planners and controllers [ ], interpretable representations [ ], post-hoc explanations [ ], and advising the planner via natural language [ ]. In contrast, our proposed architecture explains decisions of a tracker by providing interpretable SCMs as a proxy for its network’s reasoning procedure.

Preprint. Under review.

arXiv:2210.01266v1 [cs.CV] 3 Oct 2022

2 Interpretable Tracking Design

Our design follows the tracking-by-detection paradigm, matching every incoming detection to a corresponding track. Fundamentally, this can be thought of as a link prediction problem in a bipartite graph (see fig. 2), where track and detection nodes are constrained to have a single edge. Our decomposition has seven possible decisions: two detection-and-track decisions, appearance match and BBOX match; two detection-only decisions, newborn track and false positive detection; and three track-only decisions, out of range track, false positive track, and occluded track. To make decisions, we assume access to knowledge sources commonly predicted by 3D detectors [ ]: a bird's-eye-view (BEV) feature map, 3D bounding boxes, position, velocity, and confidence scores. We also assume additional information sources: appearance features for each detection (e.g., extracted from the BEV feature map), an occlusion map of the scene in BEV (produced by the detector), and the ego vehicle’s current state. The key idea of our design is the integration of a highly structured end-to-end optimizable tracker with SCMs that represent its decision-making domain knowledge.
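
The seven decisions and their three groups can be written down as a small data structure. The following is only an illustrative sketch: the names `DecisionType`, `Decision`, `label`, and `kind` are hypothetical and do not come from our implementation.

```python
from enum import Enum


class DecisionType(Enum):
    """Which node types a decision involves (hypothetical naming)."""
    DETECTION_AND_TRACK = "detection-and-track"
    DETECTION_ONLY = "detection-only"
    TRACK_ONLY = "track-only"


class Decision(Enum):
    """The seven tracking decisions, each tagged with its group."""
    APPEARANCE_MATCH = ("appearance match", DecisionType.DETECTION_AND_TRACK)
    BBOX_MATCH = ("BBOX match", DecisionType.DETECTION_AND_TRACK)
    NEWBORN_TRACK = ("newborn track", DecisionType.DETECTION_ONLY)
    FALSE_POSITIVE_DETECTION = ("false positive detection", DecisionType.DETECTION_ONLY)
    OUT_OF_RANGE_TRACK = ("out of range track", DecisionType.TRACK_ONLY)
    FALSE_POSITIVE_TRACK = ("false positive track", DecisionType.TRACK_ONLY)
    OCCLUDED_TRACK = ("occluded track", DecisionType.TRACK_ONLY)

    def __init__(self, label, kind):
        # Unpack the tuple value into readable attributes.
        self.label = label
        self.kind = kind


def decisions_of(kind):
    """Return all decisions belonging to one group."""
    return [d for d in Decision if d.kind is kind]
```

Grouping by node type makes explicit which decisions compete for a track node, which for a detection node, and which for both.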

Base Tracker

Starting from a LiDAR-based 3D object detection backbone (e.g. [ ]), our tracker (fig. 1) predicts bounding boxes of current detections and an occlusion map for the current scene, and it extracts appearance feature vectors for each detection from the BEV feature map. At timestep T, track and detection features are passed to a graph neural network (GNN), yielding detection-informed track features and track-informed detection features, respectively. These features are then fed to subnetworks for each possible decision. One and only one link decision is selected for each track and each detection. This is accomplished via Hungarian matching [ ], where each decision corresponds to an edge in the bipartite graph. Training the network simply involves enforcing a margin between all correct decisions and all incorrect decisions. Once detections are matched to tracks or become newborn tracks, their feature vectors are fed to corresponding LSTMs to compute track features (not shown for simplicity). These track-level representations are then used to forecast the track’s trajectory over future timesteps.
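
The constrained link prediction step can be illustrated with a small, self-contained sketch. It rests on several assumptions that are not specified above: each subnetwork output is converted to a cost (lower is better), only the cheapest decision of each group is kept per node, and the bipartite graph is padded with dummy nodes so that unmatched tracks and detections fit into one square assignment problem. For brevity, a brute-force search over permutations stands in for the Hungarian algorithm; both return a minimum-cost assignment, but the brute-force version is only practical for tiny examples. All function names are hypothetical.

```python
from itertools import permutations

INF = float("inf")


def build_cost_matrix(match_cost, track_only_cost, det_only_cost):
    """Build a square (M+N) x (N+M) cost matrix.

    Rows 0..M-1 are tracks, rows M..M+N-1 are dummy rows (one per detection).
    Columns 0..N-1 are detections, columns N..N+M-1 are dummy columns (one per track).
    """
    m, n = len(track_only_cost), len(det_only_cost)
    cost = [[INF] * (m + n) for _ in range(m + n)]
    for i in range(m):
        for j in range(n):
            cost[i][j] = match_cost[i][j]        # track i linked to detection j
        cost[i][n + i] = track_only_cost[i]      # track i takes a track-only decision
    for j in range(n):
        cost[m + j][j] = det_only_cost[j]        # detection j takes a detection-only decision
        for i in range(m):
            cost[m + j][n + i] = 0.0             # dummy-to-dummy pairs cost nothing
    return cost


def solve_assignment(cost):
    """Exhaustive minimum-cost assignment (stand-in for Hungarian matching)."""
    size = len(cost)
    best_perm, best_total = None, INF
    for perm in permutations(range(size)):
        total = sum(cost[row][perm[row]] for row in range(size))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


def decode(perm, m, n):
    """Translate the assignment into one decision per track and per detection."""
    decisions = []
    for i in range(m):
        if perm[i] < n:
            decisions.append(("match", i, perm[i]))
        else:
            decisions.append(("track_only", i, None))
    for j in range(n):
        if perm[m + j] == j:
            decisions.append(("detection_only", None, j))
    return decisions


# Toy example with two tracks and two detections (made-up costs).
match_cost = [[0.1, 0.9], [0.8, 0.7]]
track_only_cost = [0.6, 0.2]
det_only_cost = [0.5, 0.3]
cost = build_cost_matrix(match_cost, track_only_cost, det_only_cost)
perm, total = solve_assignment(cost)
decisions = decode(perm, m=2, n=2)
```

In this toy example, track 0 is linked to detection 0, while track 1 and detection 1 each receive a decision from their own group, so every node ends up with exactly one edge.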

Tracking Decision SCMs

Because the inputs to the SCMs are computed by the detector and are therefore uncertain, we must assume that the SCMs have access to oracle models that can correct errors in the detector’s output (otherwise they could not infer the correct output). These oracle models are, of course, imaginary, but we can obtain their outputs via the ground truth labels. Due to space constraints, each SCM is shown in the appendix (see figures 4, 5, 6, 7, 8, 9, and 10).

Track Only Decisions These decisions are made when a track is not matched to any detection and represent one of three causes: the track has gone out of the detectable range of the ego vehicle, the track is a false positive, or the detection corresponding to the current track is occluded. The track-only SCMs use three main high-level binary variables to make decisions. Each computes the track’s bounding box at the current time step and uses it to determine two intermediate nodes: whether the track matches any detection and whether the track is out of range. The SCM for out of range track predicts that the track is out of range if it is predicted to be so and no detection matched the track. The SCMs for occluded track and false positive track use additional occlusion information to make their predictions. The former predicts that a track is occluded if it is predicted to be so and is neither out of range nor matched to any detection; the latter predicts that a track is a false positive if it is neither occluded nor out of range nor matched to any detection.
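
The output rules of the three track-only SCMs can be sketched as Boolean functions of the three high-level binary variables. This is a minimal sketch of the final decision logic only; the function and argument names are hypothetical, and the computation of the intermediate variables from bounding boxes and the occlusion map is omitted.

```python
from itertools import product


def out_of_range_track(matches_detection, is_out_of_range, is_occluded):
    # Out of range and not matched to any detection.
    return is_out_of_range and not matches_detection


def occluded_track(matches_detection, is_out_of_range, is_occluded):
    # Occluded, not out of range, and not matched to any detection.
    return is_occluded and not is_out_of_range and not matches_detection


def false_positive_track(matches_detection, is_out_of_range, is_occluded):
    # Neither occluded, nor out of range, nor matched to any detection.
    return not is_occluded and not is_out_of_range and not matches_detection


def truth_table():
    """Enumerate all eight variable settings with the three SCM outputs."""
    rows = []
    for m, r, o in product([False, True], repeat=3):
        outputs = (
            out_of_range_track(m, r, o),
            occluded_track(m, r, o),
            false_positive_track(m, r, o),
        )
        rows.append(((m, r, o), outputs))
    return rows
```

Enumerating the table shows that exactly one track-only decision fires whenever the track matches no detection, and none fires when it does.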

Detection Only Decisions These decisions are made whenever a detection is not matched to a track. This can occur in two possible situations: the detection is correct but has not been tracked before, or the detection is a false positive. The detection-only SCMs make decisions by assessing the validity of detections and whether they match with existing tracks. If a detection is determined to be valid but matches no track, it is declared a newborn track by the corresponding SCM. If a detection is determined to have an invalid appearance and an invalid bounding box, then the associated SCM decides that it is a false positive detection.
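
The two detection-only rules can be sketched in the same way. One point is an assumption rather than something stated above: a detection is treated as valid unless both its appearance and its bounding box are invalid, which makes the two decisions complementary for unmatched detections. The names are hypothetical.

```python
def detection_is_valid(valid_appearance, valid_bbox):
    # Assumption: a detection counts as valid unless BOTH cues are invalid.
    return valid_appearance or valid_bbox


def newborn_track(valid_appearance, valid_bbox, matches_track):
    # A valid detection that matches no existing track starts a new track.
    return detection_is_valid(valid_appearance, valid_bbox) and not matches_track


def false_positive_detection(valid_appearance, valid_bbox, matches_track):
    # Invalid appearance and invalid bounding box, with no matching track.
    return (not valid_appearance) and (not valid_bbox) and not matches_track
```

Under this assumption, every unmatched detection receives exactly one of the two decisions.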

Detection & Track Decisions These decisions are taken when a detection is matched to a track by appearance or BBOX. The SCMs for these cases are correspondingly simple.
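
To make the role of the SCMs in IIT more concrete, the sketch below shows an interchange intervention on the high-level side only, using a toy version of the occluded track SCM. The intermediate variables are computed for a base input, one of them is replaced by its value under a source input, and the SCM output is recomputed. In IIT, the aligned hidden representation of the network is swapped in the same way, and the network is trained to predict this counterfactual output. The input fields (`best_overlap`, `distance_m`, `max_range_m`, `in_occluded_area`) and the 0.5 overlap threshold are hypothetical simplifications, not the knowledge sources or thresholds of our SCMs.

```python
def intermediates(x):
    """Compute the high-level binary variables from a toy input dict."""
    return {
        "matches_detection": x["best_overlap"] >= 0.5,
        "is_out_of_range": x["distance_m"] > x["max_range_m"],
        "is_occluded": x["in_occluded_area"],
    }


def occluded_track_output(v):
    """Output rule of the occluded track SCM."""
    return v["is_occluded"] and not v["is_out_of_range"] and not v["matches_detection"]


def interchange_intervention(base, source, variable):
    """Run the SCM on `base`, but take `variable` from the run on `source`."""
    values = intermediates(base)
    values[variable] = intermediates(source)[variable]
    return occluded_track_output(values)


# Base input: a nearby, unmatched track lying in an occluded area.
base = {"best_overlap": 0.0, "distance_m": 30.0, "max_range_m": 50.0, "in_occluded_area": True}
# Source input: the same track, but far beyond the detectable range.
source = {"best_overlap": 0.0, "distance_m": 80.0, "max_range_m": 50.0, "in_occluded_area": True}

factual = occluded_track_output(intermediates(base))
counterfactual = interchange_intervention(base, source, "is_out_of_range")
```

Here the factual output for the base input is an occluded track, while the counterfactual output after swapping in the source's out-of-range value is not.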

[Figure 1 diagram not reproduced. Recoverable panel labels: LiDAR Input at T; 3D Object Detector (Backbone, Detection Head, Segmentation Head); Detector Output at T (BEV Feature Map, Occlusion Map, BBOXES); Detection Feature Extraction; Active Track Memory at T; Ego State at T; GNN; Aligned Network at T; Structural Causal Model at T; Truth Table For SCM Final Decision; x7 Decisions; "Use decisions to solve constrained link prediction problem"; "Update tracks using decisions". Legend: Input Knowledge Source, Intermediate Node, Track-only Decision, Detection-only Decision, Detection & Track Decision, Network Information Flow, SCM Directed Edge, Alignment Operator.]

Figure 1: Proposed network structure and example alignment for the Occluded Track decision. The diagram depicts an alignment between the SCM for an occluded detection decision and the DNN structure. The diagram only depicts one alignment due to space constraints; the remaining six can be formulated similarly. We provide the associated causal models in the appendix.
