MMRNet: Improving Reliability for Multimodal
Object Detection and Segmentation for Bin Picking
via Multimodal Redundancy
Yuhao Chen1, Hayden Gunraj1, E. Zhixuan Zeng1,
Robbie Meyer1, Maximilian Gilles2, Alexander Wong1
1University of Waterloo, Canada
2Karlsruhe Institute of Technology, Germany
{yuhao.chen1, hayden.gunraj, ezzeng, robbie.meyer, alexander.wong}@uwaterloo.ca,
maximilian.gilles@kit.edu
Abstract
Recently, there has been tremendous interest in Industry 4.0 infrastructure to
address labor shortages in global supply chains. Deploying artificial intelligence-
enabled robotic bin picking systems in the real world has become particularly important
for reducing the stress and physical demands on workers while increasing the speed and
efficiency of warehouses. To this end, artificial intelligence-enabled robotic bin
picking systems may be used to automate order picking, but with the risk of causing
expensive damage during an abnormal event such as a sensor failure. As such,
reliability becomes a critical factor for translating artificial intelligence research to
real-world applications and products. In this paper, we propose a reliable object
detection and segmentation system with MultiModal Redundancy (MMRNet)
for tackling object detection and segmentation for robotic bin picking using data
from different modalities. This is the first system that introduces the concept of
multimodal redundancy to address sensor failure issues during deployment. In
particular, we realize the multimodal redundancy framework with a gate fusion
module and dynamic ensemble learning. Finally, we present a new label-free
multimodal consistency (MC) score that utilizes the output from all modalities to
measure the overall system output reliability and uncertainty. Through experiments,
we demonstrate that in the event of a missing modality, our system provides much
more reliable performance compared to baseline models. We also demonstrate
that our MC score is a more reliable indicator for outputs at inference time
compared to model-generated confidence scores, which are often over-confident.
1 Introduction
Global labor shortages and the need for resilient supply chains have accelerated companies' upgrades
to Industry 4.0 and introduced a range of technologies such as big data, cloud computing, the internet of
things (IoT), robotics, and artificial intelligence (AI) into production systems. With warehouses and
manufacturing units becoming smart environments, a crucial objective is to develop an autonomous
flow of both material and information, and robotic bin picking plays an essential role in this task.
Robotic bin picking has been an active area of research for many decades given the complexity of the
task, ranging from joint control and trajectory planning [15] to object identification [1] and grasp
detection [4]. In particular, we examine the object detection and segmentation task in autonomous bin
picking. Unlike object detection and segmentation in other areas such as autonomous driving,
a robotic vision system works in environments that are very close to the camera, dealing with heavy
occlusions, shadows, dense object layouts, and complex stacking relations. It plays an essential role
in a robot's perception system.

Figure 1: The dynamic modality weight shifting of our network ensures reliable overall performance
when a modality is missing. The heatmaps in rows 2-4 show the average gate weights of each modality
at a single feature scale; yellow indicates high weight, dark purple indicates low weight.
Deep neural networks have been proven effective for object detection and segmentation [13, 24]. However,
deploying such systems in robotic picking applications is challenging due to the many sources of
uncertainty present in practical scenarios. Real-world bin scenes may consist of a wide variety of
unknown or occluded items arranged in a virtually unlimited number of poses under variable
lighting conditions. In addition to the variability of real-world bin scenes, errors in the camera
system can make a computer vision system unreliable. Camera sensors are prone to noise and can
fail in various situations such as specular reflections (missing values), black areas (missing depth),
overexposure, blur, and artifacts. In practice, commercial systems are expected to run 24/7 to be
feasible, which increases the risk of imaging sensor failures compared to research environments. If
not accounted for, sensor failures can lead to wrongly commissioned orders and, in the worst case,
to product and hardware damage, resulting in expensive recall campaigns or production downtime.
Therefore, vision systems capable of handling uncertain inputs and producing reliable predictions
under sensor errors are critical to creating fail-safe applications.
One approach to creating fault-tolerant object detection and segmentation systems is to introduce
system duplication, where portions of the system are duplicated to allow the system to continue
to operate despite failures of its constituent parts. This approach assumes that failures are caused
by either input sensor failures or computational failures. However, duplication may not provide
fault-tolerance in situations where the system is operating correctly but its sensors are unable to
adequately measure the inputs. For example, a camera may fail to adequately image a piece of glass
due to its transparency, and so the use of a second identical camera cannot address this issue. In
addition, deep neural networks, as a data-driven approach, are designed to capture the feature distributions
of the input dataset; simply duplicating these networks will not detect features that lie outside
the training distribution. Instead, we add image data from a depth sensor as an additional modality
to capture object feature characteristics from a different perspective. More specifically, depth data
has very simple texture yet rich geometric features that are more transferable to unseen objects than
RGB data.
A good system duplication design duplicates components that are more likely to fail, preventing
any disruption in the information flow from the system input to output. Non-data-driven methods
have well-defined, explicit logic to control the information flow. In comparison, a deep learning system
learns the input-output mapping through high-dimensional implicit feature representations. A
typical deep learning model encodes input information through a backbone network into a high-
dimensional latent representation, and downstream tasks use the representation to predict low-
dimensional outputs. Consequently, a large amount of information is lost during the dimensionality
reduction of downstream tasks. However, in robotic bin picking, unseen items may contain highly
complex image characteristics that require both RGB and depth to work collaboratively. For example,
RGB backbones are better at detecting transparent objects and depth backbones are better at detecting
dark objects. A pair of eyeglasses with a black frame, for instance, requires the RGB backbone to focus
on the glass parts and the depth backbone to focus on the frame for a complete detection and segmentation.
After dimensionality reduction, a simple aggregation of two low-quality detections will
create another low-quality detection, and additional result-merging networks or explicit merging logic
will introduce errors and instabilities into the system. An effective modality fusion technique that will
dynamically fuse modality features with limited loss of information is therefore greatly desired. In
addition, merging modality features may introduce dependencies among them, causing unexpected
model behavior when one modality's features are absent. We tackle this problem with a multimodal
redundancy framework consisting of two key techniques: 1) a multi-scale soft-gating mechanism
that lets the network learn to weigh and combine features across modalities dynamically, and 2) a
dynamic ensemble learning strategy that trains the sub-systems independently and collaboratively
in an alternating fashion. With this framework, only one modality needs to be present for the model
to operate.
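To make the idea concrete, the following is a minimal PyTorch sketch of a soft-gating fusion module at a single feature scale. The 1x1-convolution gate, the per-pixel softmax, and all names here are illustrative assumptions, not the exact published architecture:

```python
import torch
import torch.nn as nn

class GateFusionModule(nn.Module):
    """Illustrative soft-gating fusion of RGB and depth features at one scale."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict two per-pixel gate weights from the concatenated modality
        # features; a softmax makes them non-negative and sum to 1 across
        # the two modalities (a "soft" gate rather than a hard selection).
        self.gate = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(torch.cat([rgb_feat, depth_feat], dim=1)), dim=1)
        # If one modality is absent, its features can be zeroed out and the
        # learned gate shifts weight toward the remaining modality.
        return weights[:, 0:1] * rgb_feat + weights[:, 1:2] * depth_feat
```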
Finally, we propose a novel multimodal consistency (MC) score as a more objective reliability
indicator for the system output based on the overlaps of detected bounding boxes and segmentation
masks. This can be used as an indicator for model uncertainty on individual predictions, as well as
model reliability on particular datasets.
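As a rough illustration of how such a consistency measure can be computed, the sketch below shows only a mask-overlap term; this is our simplification, since the paper's MC score also incorporates bounding-box overlaps:

```python
import numpy as np

def mask_consistency(mask_rgb: np.ndarray, mask_depth: np.ndarray) -> float:
    """Illustrative mask-overlap term: IoU between the segmentation masks
    predicted by the RGB-only and depth-only passes of the system."""
    intersection = np.logical_and(mask_rgb, mask_depth).sum()
    union = np.logical_or(mask_rgb, mask_depth).sum()
    # High IoU -> the modalities agree, suggesting a reliable prediction;
    # low IoU -> disagreement, flagging an uncertain output.
    return float(intersection) / float(union) if union > 0 else 0.0
```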
Through experiments, we demonstrate that in the event of a missing modality, our MMRNet provides
much more reliable performance compared to baseline models. When depth is removed, our
network's performance drop is within 1%, whereas other models suffer performance drops greater than
6%. When RGB is removed, our network's performance drop is within 11%, whereas other models
suffer performance drops greater than 80%. Furthermore, we demonstrate that our MC score is a
more reliable indicator of output confidence during inference compared to the often over-confident
model-generated confidence scores. We summarize our contributions as follows:
• A multimodal redundancy framework consisting of a multi-scale soft-gating feature fusion
module and a dynamic ensemble learning strategy, allowing trained sub-systems to operate
both independently and collaboratively.
• A multimodal consistency score to describe the reliability of the system output.
2 Related Work
Reliability study for deep learning-based systems:
Deep learning-based methods are data-driven, encoding the decision-making process through
continuous latent vectors, which makes model behavior hard to predict and fix. Only a few studies
focus on the reliability aspect of deep learning-based systems. In [26], Santhanam et al. list
differences between traditional and deep learning-based software systems and discuss the challenges
involved in the development of reliable deep learning-based systems. In [31], Xu et al. study the
reliability of object detection systems in
autonomous driving. In [7], dos Santos et al. study the relationship between reliability and GPU
precision (half, single, and double) for object detection tasks. Other reliability-related work can
be found in model uncertainty estimation [10]. To the best of our knowledge, none of these works
investigates reliability or uncertainty for multimodal applications, in particular for robotic bin picking.
Multimodal Data Fusion:
Multimodal learning [22, 17, 23, 8, 14] has been rigorously studied. In multimodal learning, there
are three types of data fusion: early fusion, intermediate fusion, and late fusion, corresponding to
merging information at the input, intermediate, and output stages, respectively. Early fusion involves
combining and pre-processing inputs; a simple example is replacing the blue channel of RGB with
the depth channel [18]. Late fusion merges the low-dimensional outputs of all networks. For example,
Simonyan et al. [27] combine spatial and temporal network outputs with i) averaging, and ii) a linear
Support Vector Machine [28]. Early fusion and late fusion are simpler to implement but have a
lower-dimensional representation compared to intermediate fusion. Intermediate fusion involves
merging high-dimensional feature vectors; common approaches include concatenation [22] and
weighted summation [1]. Recently, more advanced techniques have been developed to dynamically
merge modalities. In [29], Wang et al. propose a feature channel exchange technique based on
Batch Normalization's [16] scaling factor to dynamically fuse the modalities. In [5], Cao et al.
propose to replace the basic convolution operator with ShapeConv to achieve RGB and depth fusion
at the basic operator level. In [32], Xue et al. focus on the efficiency aspect of multimodal learning
and propose a hard gating function which outputs a one-hot encoded vector to select modalities.
In robotic grasping, Back et al. [1] take the weighted summation approach and propose a multi-scale
feature fusion module by applying a 1x1 convolutional layer to the feature layers before passing
them into a feature pyramid network (FPN) [19].
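For concreteness, here is a minimal sketch of the early-fusion example above (replacing the blue channel with depth, as described for [18]); the depth normalization and RGB channel ordering are our assumptions:

```python
import numpy as np

def early_fusion_rgd(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Replace the blue channel of an HxWx3 RGB image with a normalized
    HxW depth map, yielding an 'RGD' input for a standard 3-channel network."""
    d = (depth - depth.min()) / (np.ptp(depth) + 1e-8)  # scale depth to [0, 1]
    fused = rgb.astype(np.float32) / 255.0
    fused[..., 2] = d  # index 2 = blue, assuming RGB channel order
    return fused
```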
The aforementioned works are designed to optimize overall network performance but at the same
time introduce dependencies among modality features, making them extremely vulnerable in the case
of an abnormal event such as an input sensor failure. In this paper, we address the multimodal fusion
strategy from the system reliability perspective, where our goal is to design a simple yet effective
network architecture that enables sub-modal systems to work independently as well as collaboratively
to increase the overall system reliability.
Ensemble learning:
Ensemble learning typically involves training multiple weak learners and
aggregating their predictions to improve predictive performance [35]. One of the simplest approaches
to constructing ensembles is bagging [3], where weak learners are trained on randomly-sampled subsets
of a dataset and subsequently have their predictions combined via averaging or voting techniques [35].
Instead of aggregating predictions directly, one may also use a meta-learner which considers the
input data as well as each weak learner's predictions in order to make a final prediction, a technique
known as stacking [30]. Boosting [9] is another common approach in which weak learners are added
sequentially, with each re-weighting training samples based on the previous learner's mistakes in an
attempt to correct them.
While ensemble learning has long been a common technique in classical machine learning, it can be
expensive to apply to deep learning due to the increased computational complexity and training time
of deep neural networks. Of particular relevance to this work is the application of ensemble learning
to multimodal deep learning problems. In multimodal problems, the data distributions typically
differ significantly between modalities and thus may violate the assumptions of certain ensembling
techniques [20]. Nevertheless, ensemble methods have been applied to a variety of multimodal
problems [20, 21, 6, 34]. For example, Menon et al. [21] trained modality-specific convolutional
neural networks on three different magnetic resonance imaging modalities and combined the models'
predictions via majority voting. In [34], Zhou et al. used a stacking-based approach to combine
the outputs of neural networks trained on text, audio, and video inputs, thereby reducing noise and
inter-modality conflicts.
Rather than combining multiple models with a typical ensembling strategy, in this work we consider
a dynamic ensemble where multiple unimodal systems are dynamically fused into a single network.
This network is capable of both unimodal operation, using each of its inputs independently, and
multimodal operation through the fusion of its constituent unimodal systems.
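A minimal sketch of what one training step of such a dynamic ensemble could look like follows; the per-step random choice of branch and the model interface are our assumptions, and the paper's exact alternating schedule may differ:

```python
import random

def dynamic_ensemble_step(model, rgb, depth, targets, optimizer):
    # Hypothetical interface: model(rgb, depth, targets) returns a loss and
    # tolerates a missing (None) modality. Each step randomly trains the
    # RGB branch alone, the depth branch alone, or the fused network, so
    # the sub-systems learn to work both independently and collaboratively.
    mode = random.choice(["rgb", "depth", "fused"])
    rgb_in = rgb if mode in ("rgb", "fused") else None
    depth_in = depth if mode in ("depth", "fused") else None
    loss = model(rgb_in, depth_in, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```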