autonomous driving. In [7], dos Santos et al. study the relationship between reliability and GPU precision (half, single, and double) for object detection tasks. Other reliability-related work can be found in model uncertainty estimation [10]. To the best of our knowledge, none of these works investigates reliability or uncertainty for multimodal applications, in particular for robotic bin picking.
Multimodal Data Fusion:
Multimodal learning [22, 17, 23, 8, 14] has been rigorously studied. In multimodal learning, there are three types of data fusion: early fusion, intermediate fusion, and late fusion. Each corresponds to merging information at the input, intermediate, and output stage, respectively.
Early fusion involves combining and pre-processing inputs. A simple example is replacing the blue channel of RGB with the depth channel [18]. Late fusion merges the low-dimensional outputs of all networks. For example, Simonyan et al. [27] combine spatial and temporal network outputs with i) averaging, and ii) a linear Support Vector Machine [28]. Early fusion and late fusion are simpler to implement but have a lower-dimensional representation compared to intermediate fusion.
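For concreteness, a minimal sketch of the two simpler schemes is given below; the toy networks, tensor shapes, and the choice of averaging are illustrative assumptions, not details taken from the cited works.

```python
import torch
import torch.nn as nn

# Illustrative sketch of early vs. late fusion for RGB + depth inputs.
rgb = torch.rand(1, 3, 224, 224)    # RGB image (R, G, B channels)
depth = torch.rand(1, 1, 224, 224)  # aligned depth map

# --- Early fusion: merge at the input, e.g. replace the blue channel
# with the depth channel and feed a single network.
early_input = torch.cat([rgb[:, :2], depth], dim=1)          # (R, G, D)
early_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(8, 10))
early_logits = early_net(early_input)

# --- Late fusion: run one network per modality and merge the
# low-dimensional outputs, e.g. by averaging the class scores.
rgb_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                        nn.Linear(8, 10))
depth_net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(8, 10))
late_logits = (rgb_net(rgb) + depth_net(depth)) / 2
```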
Intermediate fusion involves merging high-dimensional feature vectors. Common intermediate fusion techniques include concatenation [22] and weighted summation [1].
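A minimal sketch of these two intermediate fusion operations, under assumed feature-map shapes (the 1x1 convolution and the uniform weights are illustrative choices only):

```python
import torch
import torch.nn as nn

# Two modality branches produce high-dimensional feature maps.
rgb_feat = torch.rand(1, 64, 56, 56)    # feature map from an RGB branch
depth_feat = torch.rand(1, 64, 56, 56)  # feature map from a depth branch

# Concatenation along the channel dimension, followed by a 1x1 convolution
# to return to the original channel count.
concat_fused = nn.Conv2d(128, 64, kernel_size=1)(
    torch.cat([rgb_feat, depth_feat], dim=1))

# Weighted summation with per-modality weights (learnable in practice;
# here simply initialized to be uniform).
w = torch.softmax(nn.Parameter(torch.zeros(2)), dim=0)
sum_fused = w[0] * rgb_feat + w[1] * depth_feat
```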
Recently, more advanced techniques have been developed to dynamically merge the modalities. In [29], Wang et al. propose a feature channel exchange technique based on Batch Normalization's [16] scaling factor to dynamically fuse the modalities. In [5], Cao et al. propose to replace the basic convolution operator with ShapeConv to achieve RGB and depth fusion at the basic operator level. In [32], Xue et al. focus on the efficiency aspect of multimodal learning and propose a hard gating function which outputs a one-hot encoded vector to select modalities. In robotic grasping, Back et al. [1] take the weighted summation approach and propose a multi-scale feature fusion module by applying a 1x1 convolutional layer to the feature layers before passing them into a feature pyramid network (FPN) [19].
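As a rough sketch of the channel-exchange idea (our reading of [29]; the threshold value and module layout below are assumptions rather than the authors' exact formulation), channels whose Batch Normalization scaling factor is near zero can be replaced by the corresponding channels of the other modality:

```python
import torch
import torch.nn as nn

bn_rgb = nn.BatchNorm2d(64)
bn_depth = nn.BatchNorm2d(64)
threshold = 1e-2  # assumed cut-off on the BN scaling factor (gamma)

def exchange(rgb_feat, depth_feat):
    rgb_out, depth_out = bn_rgb(rgb_feat), bn_depth(depth_feat)
    # Channels whose BN scaling factor is close to zero carry little
    # information; swap them with the other modality's channels.
    rgb_mask = (bn_rgb.weight.abs() < threshold).view(1, -1, 1, 1)
    depth_mask = (bn_depth.weight.abs() < threshold).view(1, -1, 1, 1)
    rgb_new = torch.where(rgb_mask, depth_out, rgb_out)
    depth_new = torch.where(depth_mask, rgb_out, depth_out)
    return rgb_new, depth_new

rgb_new, depth_new = exchange(torch.rand(2, 64, 56, 56),
                              torch.rand(2, 64, 56, 56))
```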
The aforementioned works are designed to optimize overall network performance, but at the same time they introduce dependencies among modality features, which make the system extremely vulnerable in the case of an abnormal event, such as an input sensor failure. In this paper, we address the multimodal fusion
strategy from the system reliability perspective, where our goal is to design a simple yet effective
network architecture that enables sub-modal systems to work independently as well as collaboratively
to increase the overall system reliability.
Ensemble learning:
Ensemble learning typically involves training multiple weak learners and aggregating their predictions to improve predictive performance [35]. One of the simplest approaches to construct ensembles is bagging [3], where weak learners are trained on randomly-sampled subsets of a dataset and subsequently have their predictions combined via averaging or voting techniques [35].
Instead of aggregating predictions directly, one may also use a meta-learner which considers the input data as well as each weak learner's predictions in order to make a final prediction, a technique known as stacking [30]. Boosting [9] is another common approach in which weak learners are added sequentially and training samples are re-weighted according to the previous learner's errors, effectively attempting to correct its mistakes.
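For concreteness, a minimal scikit-learn sketch of bagging and stacking; the synthetic dataset, base learners, and hyperparameters are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: weak learners trained on randomly-sampled subsets,
# predictions combined by voting/averaging.
bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=25,
    max_samples=0.5,   # each learner sees a random 50% subset
    random_state=0,
).fit(X_train, y_train)

# Stacking: a meta-learner combines the weak learners' predictions
# (passthrough=True also gives it the input features).
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
    passthrough=True,
).fit(X_train, y_train)

print(bagging.score(X_test, y_test), stacking.score(X_test, y_test))
```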
While ensemble learning has long been a common technique in classical machine learning, it can be
expensive to apply to deep learning due to the increased computational complexity and training time
of deep neural networks. Of particular relevance to this work is the application of ensemble learning
to multimodal deep learning problems. In multimodal problems, the data distributions typically
differ significantly between modalities and thus may violate the assumptions of certain ensembling
techniques [20]. Nevertheless, ensemble methods have been applied to a variety of multimodal problems [20, 21, 6, 34]. For example, Menon et al. [21] train modality-specific convolutional neural networks on three different magnetic resonance imaging modalities and combine the models' predictions via majority voting. In [34], Zhou et al. use a stacking-based approach to combine the outputs of neural networks trained on text, audio, and video inputs, thereby reducing noise and inter-modality conflicts.
Rather than combining multiple models with a typical ensembling strategy, in this work we consider a dynamic ensemble where multiple unimodal systems are dynamically fused into a single network. This network is capable of both unimodal operation, using each of its inputs independently, and multimodal operation, through the fusion of the constituent unimodal systems.
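A purely illustrative sketch of how such a dynamic ensemble could be wired (the encoders, heads, and shapes are assumptions, not the architecture proposed in this paper): per-modality encoders feed independent unimodal heads as well as a fused head, so each branch can still produce a prediction when the other input is missing.

```python
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                                     nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                                     nn.Flatten())
        self.depth_enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1),
                                       nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                                       nn.Flatten())
        self.rgb_head = nn.Linear(16, num_classes)     # unimodal RGB output
        self.depth_head = nn.Linear(16, num_classes)   # unimodal depth output
        self.fused_head = nn.Linear(32, num_classes)   # multimodal output

    def forward(self, rgb=None, depth=None):
        # Each branch works on its own; the fused head is only used when
        # both modalities are available.
        outputs = {}
        f_rgb = self.rgb_enc(rgb) if rgb is not None else None
        f_depth = self.depth_enc(depth) if depth is not None else None
        if f_rgb is not None:
            outputs["rgb"] = self.rgb_head(f_rgb)
        if f_depth is not None:
            outputs["depth"] = self.depth_head(f_depth)
        if f_rgb is not None and f_depth is not None:
            outputs["fused"] = self.fused_head(torch.cat([f_rgb, f_depth], dim=1))
        return outputs

net = MultimodalNet()
print(net(rgb=torch.rand(1, 3, 64, 64)).keys())        # RGB only
print(net(rgb=torch.rand(1, 3, 64, 64),
          depth=torch.rand(1, 1, 64, 64)).keys())       # both modalities
```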