autonomous driving. In [7], dos Santos et al. study the relationship between reliability and GPU precision (half, single, and double) for object detection tasks. Other reliability-related work can be found in model uncertainty estimation [10]. To the best of our knowledge, none of these works investigates reliability or uncertainty for multimodal applications, in particular for robotic bin picking.
Multimodal Data Fusion:
Multimodal learning [22, 17, 23, 8, 14] has been rigorously studied. In multimodal learning, there are three types of data fusion: early fusion, intermediate fusion, and late fusion. Each corresponds to merging information at the input, intermediate, and output stage, respectively.
Early fusion involves combining and pre-processing inputs. A simple example is replacing the blue channel of RGB with the depth channel [18]. Late fusion merges the low-dimensional outputs of all networks. For example, Simonyan et al. [27] combine spatial and temporal network outputs with i) averaging, and ii) a linear Support Vector Machine [28]. Early fusion and late fusion are simpler to implement but have a lower-dimensional representation compared to intermediate fusion.
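For concreteness, a minimal sketch of the two simpler schemes is given below; the toy networks, tensor shapes, and the choice of averaging are illustrative assumptions, not details taken from the cited works.

```python
import torch
import torch.nn as nn

# Illustrative sketch of early vs. late fusion for RGB + depth inputs.
rgb = torch.rand(1, 3, 224, 224)    # RGB image (R, G, B channels)
depth = torch.rand(1, 1, 224, 224)  # aligned depth map

# --- Early fusion: merge at the input, e.g. replace the blue channel
# with the depth channel and feed a single network.
early_input = torch.cat([rgb[:, :2], depth], dim=1)          # (R, G, D)
early_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(8, 10))
early_logits = early_net(early_input)

# --- Late fusion: run one network per modality and merge the
# low-dimensional outputs, e.g. by averaging the class scores.
rgb_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                        nn.Linear(8, 10))
depth_net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(8, 10))
late_logits = (rgb_net(rgb) + depth_net(depth)) / 2
```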
Intermediate fusion involves merging high-dimensional feature vectors. Common intermediate fusion techniques include concatenation [22] and weighted summation [1].
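A minimal sketch of these two intermediate fusion operations, under assumed feature-map shapes (the 1x1 convolution and the uniform weights are illustrative choices only):

```python
import torch
import torch.nn as nn

# Two modality branches produce high-dimensional feature maps.
rgb_feat = torch.rand(1, 64, 56, 56)    # feature map from an RGB branch
depth_feat = torch.rand(1, 64, 56, 56)  # feature map from a depth branch

# Concatenation along the channel dimension, followed by a 1x1 convolution
# to return to the original channel count.
concat_fused = nn.Conv2d(128, 64, kernel_size=1)(
    torch.cat([rgb_feat, depth_feat], dim=1))

# Weighted summation with per-modality weights (learnable in practice;
# here simply initialized to be uniform).
w = torch.softmax(nn.Parameter(torch.zeros(2)), dim=0)
sum_fused = w[0] * rgb_feat + w[1] * depth_feat
```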
Recently, more advanced techniques have been developed to dynamically merge the modalities. In [29], Wang et al. propose a feature channel exchange technique based on Batch Normalization's [16] scaling factor to dynamically fuse the modalities. In [5], Cao et al. propose to replace the basic convolution operator with ShapeConv to achieve RGB and depth fusion at the basic operator level. In [32], Xue et al. focus on the efficiency aspect of multimodal learning and propose a hard gating function which outputs a one-hot encoded vector to select modalities. In robotic grasping, Back et al. [1] take the weighted summation approach and propose a multi-scale feature fusion module by applying a 1x1 convolutional layer to the feature layers before passing them into a feature pyramid network (FPN) [19].
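As a rough sketch of the channel-exchange idea (our reading of [29]; the threshold value and module layout below are assumptions rather than the authors' exact formulation), channels whose Batch Normalization scaling factor is near zero can be replaced by the corresponding channels of the other modality:

```python
import torch
import torch.nn as nn

bn_rgb = nn.BatchNorm2d(64)
bn_depth = nn.BatchNorm2d(64)
threshold = 1e-2  # assumed cut-off on the BN scaling factor (gamma)

def exchange(rgb_feat, depth_feat):
    rgb_out, depth_out = bn_rgb(rgb_feat), bn_depth(depth_feat)
    # Channels whose BN scaling factor is close to zero carry little
    # information; swap them with the other modality's channels.
    rgb_mask = (bn_rgb.weight.abs() < threshold).view(1, -1, 1, 1)
    depth_mask = (bn_depth.weight.abs() < threshold).view(1, -1, 1, 1)
    rgb_new = torch.where(rgb_mask, depth_out, rgb_out)
    depth_new = torch.where(depth_mask, rgb_out, depth_out)
    return rgb_new, depth_new

rgb_new, depth_new = exchange(torch.rand(2, 64, 56, 56),
                              torch.rand(2, 64, 56, 56))
```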
The aforementioned works are designed to optimize overall network performance, but at the same time they introduce dependencies among modality features, which make the system extremely vulnerable in the case of an abnormal event, such as an input sensor failure. In this paper, we address the multimodal fusion
strategy from the system reliability perspective, where our goal is to design a simple yet effective
network architecture that enables sub-modal systems to work independently as well as collaboratively
to increase the overall system reliability.
Ensemble learning:
Ensemble learning typically involves training multiple weak learners and aggregating their predictions to improve predictive performance [35]. One of the simplest approaches to construct ensembles is bagging [3], where weak learners are trained on randomly-sampled subsets of a dataset and subsequently have their predictions combined via averaging or voting techniques [35].
Instead of aggregating predictions directly, one may also use a meta-learner which considers the input data as well as each weak learner's predictions in order to make a final prediction, a technique known as stacking [30]. Boosting [9] is another common approach in which weak learners are added sequentially and training samples are re-weighted according to the previous learner's errors, effectively attempting to correct its mistakes.
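For concreteness, a minimal scikit-learn sketch of bagging and stacking; the synthetic dataset, base learners, and hyperparameters are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: weak learners trained on randomly-sampled subsets,
# predictions combined by voting/averaging.
bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=25,
    max_samples=0.5,   # each learner sees a random 50% subset
    random_state=0,
).fit(X_train, y_train)

# Stacking: a meta-learner combines the weak learners' predictions
# (passthrough=True also gives it the input features).
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
    passthrough=True,
).fit(X_train, y_train)

print(bagging.score(X_test, y_test), stacking.score(X_test, y_test))
```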
While ensemble learning has long been a common technique in classical machine learning, it can be
expensive to apply to deep learning due to the increased computational complexity and training time
of deep neural networks. Of particular relevance to this work is the application of ensemble learning
to multimodal deep learning problems. In multimodal problems, the data distributions typically
differ significantly between modalities and thus may violate the assumptions of certain ensembling
techniques [20]. Nevertheless, ensemble methods have been applied to a variety of multimodal problems [20, 21, 6, 34]. For example, Menon et al. [21] train modality-specific convolutional neural networks on three different magnetic resonance imaging modalities and combine the models' predictions via majority voting. In [34], Zhou et al. use a stacking-based approach to combine the outputs of neural networks trained on text, audio, and video inputs, thereby reducing noise and inter-modality conflicts.
Rather than combining multiple models with a typical ensembling strategy, in this work we consider a dynamic ensemble where multiple unimodal systems are dynamically fused into a single network. This network is capable of both unimodal operation, using each of its inputs independently, and multimodal operation, through the fusion of the constituent unimodal systems.
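A purely illustrative sketch of how such a dynamic ensemble could be wired (the encoders, heads, and shapes are assumptions, not the architecture proposed in this paper): per-modality encoders feed independent unimodal heads as well as a fused head, so each branch can still produce a prediction when the other input is missing.

```python
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                                     nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                                     nn.Flatten())
        self.depth_enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1),
                                       nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                                       nn.Flatten())
        self.rgb_head = nn.Linear(16, num_classes)     # unimodal RGB output
        self.depth_head = nn.Linear(16, num_classes)   # unimodal depth output
        self.fused_head = nn.Linear(32, num_classes)   # multimodal output

    def forward(self, rgb=None, depth=None):
        # Each branch works on its own; the fused head is only used when
        # both modalities are available.
        outputs = {}
        f_rgb = self.rgb_enc(rgb) if rgb is not None else None
        f_depth = self.depth_enc(depth) if depth is not None else None
        if f_rgb is not None:
            outputs["rgb"] = self.rgb_head(f_rgb)
        if f_depth is not None:
            outputs["depth"] = self.depth_head(f_depth)
        if f_rgb is not None and f_depth is not None:
            outputs["fused"] = self.fused_head(torch.cat([f_rgb, f_depth], dim=1))
        return outputs

net = MultimodalNet()
print(net(rgb=torch.rand(1, 3, 64, 64)).keys())        # RGB only
print(net(rgb=torch.rand(1, 3, 64, 64),
          depth=torch.rand(1, 1, 64, 64)).keys())       # both modalities
```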