Real-Time Driver Monitoring Systems through Modality and View
Analysis
Yiming Ma
University of Warwick
Coventry, UK
Victor Sanchez
University of Warwick
Coventry, UK
Soodeh Nikan
Ford Motor Company
USA
Devesh Upadhyay
Ford Motor Company
USA
Bhushan Atote
University of Warwick
Coventry, UK
Tanaya Guha
University of Glasgow
Glasgow, UK
October 19, 2022
Abstract
Driver distractions are known to be the dominant cause of road accidents. While monitoring systems can detect non-driving-related activities and help reduce the associated risks, they must be accurate and efficient to be applicable. Unfortunately, state-of-the-art methods prioritize accuracy while ignoring latency because they leverage cross-view and multimodal videos in which consecutive frames are highly similar. Thus, in this paper, we pursue time-effective detection models by neglecting the temporal relation between video frames and investigate the importance of each sensing modality in detecting drivers' activities. Experiments demonstrate that 1) our proposed algorithms are real-time and achieve similar performance (97.5% AUC-PR) with significantly reduced computation compared with video-based models; 2) the top view with the infrared channel is more informative than any other single modality. Furthermore, we enhance the DAD dataset by manually annotating its test set to enable multiclassification. We also thoroughly analyze the influence of visual sensor types and their placements on the prediction of each class. The code and the new labels will be released.
1 Introduction
While the proliferation of on-road traffic has dramatically benefited society, it has also increased fatal traffic accidents. According to the World Health Organization, as of June 2022 there are about 1.3 million fatalities and more than 20 million injuries from car crashes every year, and these accidents cost most countries approximately 3% of their GDP. One of the dominant contributing factors in these crashes is human error stemming from inattention, such as preoccupation with mobile phones while driving. Hence, for L2+ self-driving-enabled cars, it is crucial to develop effective driver monitoring systems (DMSs) that estimate the driver's readiness for driving and take over control when necessary to prevent accidents.
As an essential information source, vision is often exploited by DMSs to detect drivers' non-driving-related activities (NDRAs). DAD [1] is one of the latest video databases for vision-based monitoring systems. In addition to the rich diversity of NDRAs in its training set, its test set also contains unseen types of actions. Since drivers may perform an unbounded variety of actions, the unseen cases in the DAD test set make it better suited to generalizing to realistic driving, and we therefore build our work on this open-set recognition dataset.
NDRAs in DAD training set    | NDRA labels we annotated for the DAD test set
Talking on the phone - left  | Talking on the phone - left; Adjusting side mirror; Wearing glasses
Talking on the phone - right | Talking on the phone - right; Adjusting clothes; Taking off glasses
Messaging left               | Messaging left; Adjusting glasses; Picking up something
Messaging right              | Messaging right; Adjusting rear-view mirror; Wiping sweat
Talking with passengers      | Talking with passengers; Adjusting sunroof; Touching face/hair
Reaching behind              | Reaching behind; Wiping nose; Sneezing
Adjusting radio              | Adjusting radio; Head dropping (dozing off); Coughing
Drinking                     | Drinking; Eating; Reading
Table 1: Details of the non-driving-related activities (NDRAs) in the DAD dataset [1]. The left column lists the activities present in both the training and test sets; the right column lists our per-class test-set annotations, in which any activity not appearing in the left column is exclusive to the test set. The original DAD dataset only provides binary (normal/anomalous) labels for its test set, as it was initially developed for anomaly detection.
However, the test-set labels of DAD are binary, indicating only whether or not drivers are participating in NDRAs. This inadequacy hinders the multiclassification of drivers' activities, which is essential since different activities may not be equally hazardous. Hence, we manually annotate the test set with the specific activity names to allow the recognition of these actions and class-based evaluation of DMSs.
On the other hand, the latest vision-based DMSs are not sufficiently efficient. In-cockpit vision systems (including DAD) usually consist of visual sensors of various types (e.g. RGB) installed at different locations (e.g. top) to provide videos of the driver from diverse streams and views. Thus, to maximize the accuracy of activity detection, existing approaches often employ spatial and temporal information from all modalities and views. These methods unavoidably introduce billions of floating-point operations (FLOPs) into inference, which embedded devices cannot perform in real time. Therefore, in this paper, we aim to develop a real-time DMS by neglecting the temporal dimension and only employing the most informative vision source. Experiments on DAD demonstrate that 1) neighboring frames within a sequence are highly similar; 2) single-modal architectures based on the top view and the infrared modality can also achieve state-of-the-art performance. These two findings validate our approach.
Our contributions are summarized as follows:
1. We propose efficient image-based DMSs whose performance (97.5% AUC-PR and 95.6% AUC-ROC) is comparable to that of state-of-the-art video-based models. Unlike other methods' substantial computation load, our models' low latency makes them deployable in the real world.

2. We analyze the performance of our models on each view and modality and identify the most economical camera placement (top) and sensor type (IR) for DMSs to leverage.

3. We annotate the test set of DAD with specific activities, thereby enabling the multiclassification of drivers' actions on it. This finer-grained labelling allows detailed per-class analysis of algorithms, so that detecting the most dangerous distractions (e.g. using mobile phones) can be prioritized.
2 Related Work
2.1 In-Car Vision-Based Datasets
Early datasets focus only on parts of the driver's body, such as the head [2]–[6] or hands [7]–[9]. Although these may be valuable for other tasks, such as gesture recognition and pose estimation, they typically capture only a narrow range of whole-body movements and therefore cover very few NDRAs.
Later datasets [10]–[12] instead concentrate on body actions. AUC-DD [10] is based on a single sensing modality, with images collected from a side-view perspective and the RGB stream. By contrast, DMD [12] is a multimodal and video-based dataset, currently the largest for SAE L2-L3 autonomous driving. It consists of 41 hours of videos recorded from three different views and three distinct channels. However, its training and test sets share the same classes of activities, so models trained on it may overfit to these seen actions and thus fail to generalize well to real-world scenarios. In comparison, DAD [1] is devised for open-set recognition. In addition to the categories in the training set, its test set also comprises unseen types of activities, as illustrated in Table 1. Therefore, this database is more appropriate for estimating DMSs' real-world performance. However, the DAD test labels only indicate whether or not a driver is engaging in driving or NDRAs, without specifying the categories of actions. This binary labelling confines its usage to anomaly detection. As a result, DMSs developed using this dataset remain oblivious to the type of potentially distracting activity. This granularity may be critical in designing a composite attention metric in which different types of NDRAs carry different sensitivities. Hence, we find that a richer set of labels is essential for studying NDRAs and their impact on overall driver attention.
Figure 1: (a) Front & Depth, (b) Front & IR, (c) Top & Depth, (d) Top & IR. During the collection of DAD [1], two cameras with synchronized depth and IR modalities were installed at the top and front of the cabin, capturing the movement of the hands and parts of the body, respectively. The resolution of each video frame is 224 × 171.
2.2 Vision-Based Driver Monitoring Systems
Image-based DMSs. [10] proposes an approach based on object detection, in which the hand and face regions of the driver are first detected and then fed into an ensemble of convolutional neural networks (CNNs). Building on this architecture, the model proposed in [13] further includes a skin segmentation branch to address the variable lighting conditions that degrade the performance of RGB-based models. In contrast to such multi-branch structures, [14] leverages classic CNN classifiers such as VGG [15], also achieving state-of-the-art results. This finding motivates us to build our work on pre-trained ResNets [16] and MobileNet [17].
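As a minimal sketch of the kind of image-based classifier this observation motivates, the snippet below adapts a pre-trained 2D ResNet-18 by replacing its classification head; the number of classes and the input handling are illustrative assumptions, not the exact configuration used in our experiments.

# Minimal sketch of an image-based NDRA classifier built on a pre-trained
# 2D backbone, as motivated above. NUM_CLASSES and the input shape are
# illustrative assumptions, not values taken from this paper.
import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_CLASSES = 9  # hypothetical: 8 training-set NDRAs + normal driving

model = resnet18(weights="IMAGENET1K_V1")                # ImageNet pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace the classification head

# For single-channel IR/depth frames, the channel can simply be replicated
# to three so that the pre-trained first convolution can be reused.
frame = torch.rand(1, 1, 224, 171).repeat(1, 3, 1, 1)
logits = model(frame)  # per-class scores for one frame, with no temporal modelling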
Video-based DMSs. Currently, most video-based methods [1], [12] leverage 3D CNN classifiers pre-trained on Kinetics-600 [18], [19]. [12] proposes a DMS on the DMD dataset that first exploits 3D CNNs (MobileNet-V2 [17] and ShuffleNet-V2 [20]) as the backbone to extract spatio-temporal features and then uses a consensus module to further capture the temporal correlations between frames. In [1], 3D ResNet-18 [16], 3D MobileNet-V2 [17] and 3D ShuffleNet-V2 [20] are utilised to extract embeddings, and contrastive learning with noise estimation [21] is adopted to optimize the cosine similarities of these representations. However, [22] suggests that spatial details are more influential than temporal relations. We also show that consecutive frames within a clip are highly similar. These two observations indicate that leveraging all frames within a video clip may not be cost-effective, so an image-based DMS does not need to make inferences for every frame, thereby showing the potential for real-time performance.
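The snippet below is a minimal sketch of how this frame-redundancy observation can be checked, by measuring the cosine similarity between neighboring frames of a clip; the clip tensor, its length, and resolution are placeholders rather than actual DAD data.

# Minimal sketch for quantifying how similar neighboring frames are within a
# clip. The clip tensor here is a random placeholder; in practice it would
# hold decoded video frames from the dataset.
import torch
import torch.nn.functional as F

clip = torch.rand(16, 1, 224, 171)                       # (frames, channels, H, W), e.g. a 16-frame IR clip
flat = clip.flatten(start_dim=1)                         # one vector per frame
sims = F.cosine_similarity(flat[:-1], flat[1:], dim=1)   # similarity of each consecutive pair
print(f"mean consecutive-frame similarity: {sims.mean().item():.3f}")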
2.3 Multimodal Feature Fusion
The fundamental problem in multimodality is fusion. Previous multimodal DMSs [1], [12] rely on decision-level fusion, while combining multimodal features is rarely studied. We regard multimodal feature fusion as a particular case of general feature fusion, for which various approaches have been proposed over time. Earlier ones leverage linear operations such as concatenation (e.g. Inception [23]–[25]) and addition (e.g. ResNet [16], FPN [26], U-Net [27]). This rigid linear aggrega-
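As a minimal sketch of the two linear fusion operations just mentioned, the snippet below fuses feature maps from two modality branches by addition and by concatenation followed by a 1×1 convolution; the branch names, channel counts, and spatial sizes are illustrative assumptions.

# Minimal sketch of the two linear fusion operations mentioned above,
# assuming feature maps from two modality branches (e.g. IR and depth)
# with matching shapes. Channel counts and spatial sizes are illustrative.
import torch
import torch.nn as nn

ir_feat = torch.randn(1, 256, 14, 14)              # features from an IR branch
depth_feat = torch.randn(1, 256, 14, 14)           # features from a depth branch

added = ir_feat + depth_feat                       # addition, as in ResNet/FPN/U-Net-style fusion
concat = torch.cat([ir_feat, depth_feat], dim=1)   # concatenation, as in Inception-style fusion

# A 1x1 convolution is a common way to project the concatenated features
# back to the original channel count.
reduce = nn.Conv2d(512, 256, kernel_size=1)
fused = reduce(concat)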