
views and three distinct channels. However, its training set and test set share the same classes of activities, so models trained on it may overfit to these seen actions and thus fail to generalize in real-world scenarios. In comparison, DAD [1] is devised for open-set recognition: in addition to the categories in the training set, its test set also comprises unseen types of activities, as illustrated in Table 1. This database is therefore more appropriate for estimating the real-world performance of DMSs. However, the DAD test labels only indicate whether or not a driver is engaging in driving or NDRAs, without specifying the categories of the actions. This binary labelling confines its usage to anomaly detection; as a result, DMSs developed on this dataset remain oblivious to the type of potentially distracting activity. Such granularity may be critical in designing a composite attention metric in which different types of NDRAs carry different sensitivities. Hence, we find that a richer set of labels is essential for studying NDRAs and their impact on overall driver attention.
Figure 1: During the collection of DAD [1], two cameras with synchronized depth and IR modalities were installed at the top and front of the cabin, capturing the movements of the hands and of the body, respectively. Panels: (a) Front & Depth; (b) Front & IR; (c) Top & Depth; (d) Top & IR. The resolution of each video frame is 224 × 171.
2.2 Vision-Based Driver Monitoring Systems
Image-based DMSs. [10] proposes an approach based on object detection, in which the hand and face regions of the driver are detected first and then fed into an ensemble of convolutional neural networks (CNNs). Building on this architecture, the model proposed in [13] further includes a skin-segmentation branch to address the variable lighting conditions that degrade the performance of RGB-based models. In contrast to such multi-branch structures, [14] leverages classic CNN classifiers such as VGG [15], also achieving state-of-the-art results. This finding motivates us to build our work on pre-trained ResNets [16] and MobileNet [17].
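As a concrete illustration, the sketch below builds such an image-based classifier on a pre-trained ResNet-18 with PyTorch/torchvision; the class count and the dummy batch are placeholders rather than the configuration used in this work.

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # hypothetical: driving plus nine NDRA categories

# Load an ImageNet-pre-trained backbone and swap in a new classification head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Forward pass on a dummy batch of driver images (B, 3, 224, 224).
images = torch.randn(4, 3, 224, 224)
logits = model(images)        # (4, NUM_CLASSES)
pred = logits.argmax(dim=1)   # predicted activity per image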
Video-based DMSs. Currently, most video-based methods [1], [12] leverage 3D CNN classifiers pre-trained on Kinetics-600 [18], [19]. [12] proposes a DMS on the DMD dataset [12] that first exploits 3D CNNs (MobileNet-V2 [17] and ShuffleNet-V2 [20]) as the backbone to extract spatio-temporal features and then uses a consensus module to further capture the temporal correlations between frames. In [1], 3D ResNet-18 [16], 3D MobileNet-V2 [17] and 3D ShuffleNet-V2 [20] are utilised to extract embeddings, and contrastive learning with noise estimation [21] is adopted to optimize the cosine similarities of these representations. However, [22] suggests that spatial details are more influential than temporal relations, and we also show that consecutive frames share highly similar high-level representations. These two observations indicate that processing every frame within a video clip may not be cost-effective: an image-based DMS need not run inference on each frame and therefore has the potential for real-time performance.
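To make the redundancy argument concrete, the sketch below (a rough illustration, assuming a frozen torchvision ResNet-18 as the frame encoder and a randomly generated clip, not the experiment reported later) measures the cosine similarity between the embeddings of consecutive frames.

import torch
import torch.nn.functional as F
from torchvision import models

# Frozen 2D encoder producing a 512-d global feature per frame.
encoder = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
encoder.fc = torch.nn.Identity()
encoder.eval()

clip = torch.randn(16, 3, 224, 224)  # 16 consecutive frames (dummy data)
with torch.no_grad():
    emb = F.normalize(encoder(clip), dim=1)  # (16, 512), unit-normalised

# Cosine similarity between each frame and its successor.
sim = (emb[:-1] * emb[1:]).sum(dim=1)        # (15,)
print(sim.mean())  # values near 1 indicate redundant frames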
2.3 Multimodal Feature Fusion
The fundamental problem in multimodality is fusion. Previous multimodal DMSs [1], [12] are based on decision-level fusion, while combining multimodal features is rarely studied. We regard multimodal feature fusion as a special case of general feature fusion, for which various approaches have been proposed over time. Earlier ones leverage linear operations such as concatenation (e.g. Inception [23]–[25]) and addition (e.g. ResNet [16], FPN [26], U-Net [27]), as sketched below. This rigid linear aggrega-
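A minimal sketch of the two linear fusion operations named above, applied to dummy depth and IR feature maps of illustrative shape (not the fusion module proposed in this paper):

import torch
import torch.nn as nn

depth_feat = torch.randn(4, 256, 14, 14)  # features from a depth branch
ir_feat = torch.randn(4, 256, 14, 14)     # features from an IR branch

# Addition (as in ResNet/FPN skip connections): channel counts must match.
fused_add = depth_feat + ir_feat                      # (4, 256, 14, 14)

# Concatenation (as in Inception/U-Net): channels stack, so a 1x1 convolution
# is commonly applied afterwards to restore the original width.
fused_cat = torch.cat([depth_feat, ir_feat], dim=1)   # (4, 512, 14, 14)
fused_cat = nn.Conv2d(512, 256, kernel_size=1)(fused_cat)  # (4, 256, 14, 14)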