
views and three distinct channels. However, its training set and test set share the same classes of activities, so models trained on it may overfit to these seen actions and thus fail to generalize in real-world scenarios. In comparison, DAD [1] is devised for open-set recognition: in addition to the categories in the training set, its test set also comprises unseen types of activities, as illustrated in Table 1. This database is therefore more appropriate for estimating the real-world performance of DMSs. However, the DAD test labels only indicate whether or not a driver is engaging in driving or NDRAs, without specifying the categories of the actions. This binary labelling confines its usage to anomaly detection; as a result, DMSs developed on this dataset remain oblivious to the type of potentially distracting activity. Such granularity may be critical in designing a composite attention metric in which different types of NDRAs carry different sensitivities. Hence, we find that a richer set of labels is essential for studying NDRAs and their impact on overall driver attention.
Figure 1: During the collection of DAD [1], two cameras with synchronized depth and IR modalities were installed at the top and front of the cabin, capturing the movements of the hands and of the body, respectively. Panels: (a) Front & Depth; (b) Front & IR; (c) Top & Depth; (d) Top & IR. The resolution of each video frame is 224 × 171.
2.2 Vision-Based Driver Monitoring Systems
Image-based DMSs. [10] proposes an approach based on object detection, in which the hand and face regions of the driver are detected first and then fed into an ensemble of convolutional neural networks (CNNs). Building on this architecture, the model proposed in [13] further includes a skin-segmentation branch to address the variable lighting conditions that degrade the performance of RGB-based models. In contrast to such multi-branch structures, [14] leverages classic CNN classifiers such as VGG [15], also achieving state-of-the-art results. This finding motivates us to build our work on pre-trained ResNets [16] and MobileNet [17].
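As a concrete illustration, the sketch below builds such an image-based classifier on a pre-trained ResNet-18 with PyTorch/torchvision; the class count and the dummy batch are placeholders rather than the configuration used in this work.

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # hypothetical: driving plus nine NDRA categories

# Load an ImageNet-pre-trained backbone and swap in a new classification head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Forward pass on a dummy batch of driver images (B, 3, 224, 224).
images = torch.randn(4, 3, 224, 224)
logits = model(images)        # (4, NUM_CLASSES)
pred = logits.argmax(dim=1)   # predicted activity per image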
Video-based DMSs. Currently, most video-based methods [1], [12] leverage 3D CNN classifiers pre-trained on Kinetics-600 [18], [19]. [12] proposes a DMS on the DMD dataset [12] that first exploits 3D CNNs (MobileNet-V2 [17] and ShuffleNet-V2 [20]) as the backbone to extract spatio-temporal features and then uses a consensus module to further capture the temporal correlations between frames. In [1], 3D ResNet-18 [16], 3D MobileNet-V2 [17] and 3D ShuffleNet-V2 [20] are utilised to extract embeddings, and contrastive learning with noise estimation [21] is adopted to optimize the cosine similarities of these representations. However, [22] suggests that spatial details are more influential than temporal relations, and we also show that consecutive frames share highly similar high-level representations. These two observations indicate that processing every frame within a video clip may not be cost-effective: an image-based DMS need not run inference on each frame and therefore has the potential for real-time performance.
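To make the redundancy argument concrete, the sketch below (a rough illustration, assuming a frozen torchvision ResNet-18 as the frame encoder and a randomly generated clip, not the experiment reported later) measures the cosine similarity between the embeddings of consecutive frames.

import torch
import torch.nn.functional as F
from torchvision import models

# Frozen 2D encoder producing a 512-d global feature per frame.
encoder = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
encoder.fc = torch.nn.Identity()
encoder.eval()

clip = torch.randn(16, 3, 224, 224)  # 16 consecutive frames (dummy data)
with torch.no_grad():
    emb = F.normalize(encoder(clip), dim=1)  # (16, 512), unit-normalised

# Cosine similarity between each frame and its successor.
sim = (emb[:-1] * emb[1:]).sum(dim=1)        # (15,)
print(sim.mean())  # values near 1 indicate redundant frames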
2.3 Multimodal Feature Fusion
The fundamental problem in multimodality is fusion. Previous multimodal DMSs [1], [12] are based on decision-level fusion, while combining multimodal features is rarely studied. We regard multimodal feature fusion as a special case of general feature fusion, for which various approaches have been proposed over time. Earlier ones leverage linear operations such as concatenation (e.g. Inception [23]–[25]) and addition (e.g. ResNet [16], FPN [26], U-Net [27]), as sketched below. This rigid linear aggrega-
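A minimal sketch of the two linear fusion operations named above, applied to dummy depth and IR feature maps of illustrative shape (not the fusion module proposed in this paper):

import torch
import torch.nn as nn

depth_feat = torch.randn(4, 256, 14, 14)  # features from a depth branch
ir_feat = torch.randn(4, 256, 14, 14)     # features from an IR branch

# Addition (as in ResNet/FPN skip connections): channel counts must match.
fused_add = depth_feat + ir_feat                      # (4, 256, 14, 14)

# Concatenation (as in Inception/U-Net): channels stack, so a 1x1 convolution
# is commonly applied afterwards to restore the original width.
fused_cat = torch.cat([depth_feat, ir_feat], dim=1)   # (4, 512, 14, 14)
fused_cat = nn.Conv2d(512, 256, kernel_size=1)(fused_cat)  # (4, 256, 14, 14)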