MFFN: Multi-view Feature Fusion Network for Camouflaged Object Detection
Dehua Zheng1, Xiaochen Zheng2, Laurence T. Yang1*, Yuan Gao3, Chenlu Zhu4, Yiheng Ruan4
1Huazhong University of Science and Technology, China 2ETH Zürich, Switzerland
3Hainan University, China 4Hubei Chutian Expressway Digital Technology, China
dwardzheng@hust.edu.cn xzheng@student.ethz.ch ltyang@gmail.com
Abstract
Recent research on camouflaged object detection (COD) aims to segment highly concealed objects hidden in complex surroundings. Tiny, fuzzy camouflaged objects have visually indistinguishable properties, and current single-view COD detectors are sensitive to background distractors. Therefore, the blurred boundaries and variable shapes of camouflaged objects are difficult to fully capture with a single-view detector. To overcome these obstacles, we propose
a behavior-inspired framework, called Multi-view Feature
Fusion Network (MFFN), which mimics the human behav-
iors of finding indistinct objects in images, i.e., observing
from multiple angles, distances, and perspectives. Specifically,
the key idea behind it is to generate multiple ways of obser-
vation (multi-view) by data augmentation and apply them
as inputs. MFFN captures critical boundary and semantic
information by comparing and fusing extracted multi-view
features. In addition, our MFFN exploits the dependence
and interaction between views and channels. Specifically,
our methods leverage the complementary information be-
tween different views through a two-stage attention module
called Co-attention of Multi-view (CAMV). We also design a local-overall module called Channel Fusion Unit (CFU) to explore the channel-wise contextual clues of diverse feature maps in an iterative manner. Experimental results show that our method performs favorably against existing state-of-the-art methods when trained with the same data.
The code will be available at https://github.com/dwardzheng/MFFN_COD.
1. Introduction
Camouflage is a mechanism [4] by which organisms
protect themselves in nature. Camouflaged object detec-
tion (COD) is a countermeasure against the camouflage
mechanism, aiming to capture the slight differences be-
*Corresponding author.
[Figure 1 image grid: columns Image, GT, Ours, SINet.]
Figure 1: Visualization of camouflaged animal detection. The state-of-the-art and classic single-view COD model SINet [6] is confused by background that shares high similarity with the target objects and misses much boundary and region-shape information (indicated by orange arrows). Our multi-view scheme eliminates these distractors and performs more efficiently and effectively.
tween the object and the background to obtain accurate de-
tection results. Unlike general object detection and salient
object detection, in which the objects and background can
be easily distinguished by human eyes or advanced deep
learning models, COD is more challenging because it re-
quires a sufficient amount of visual input and prior knowl-
edge [46] to address the complicated problem caused by the
highly intrinsic similarity between the target object and the
background. Thus, COD has a wide range of valuable appli-
cations in promoting the search and detection of biological
species [43], assisting the medical diagnosis with medical
images [41, 13], and improving the detection of pests and
diseases in agriculture [10].
Recently, many studies have put emphasis on learning from a fixed single view with either auxiliary tasks [18, 32, 34, 58, 67, 15], uncertainty discovery [20, 26], or vision transformers [56, 38], and the proposed methods have achieved significant progress. Nevertheless, due to the visual insignificance of camouflaged objects and the contextual insufficiency of single-view input, they still struggle to precisely
recognize camouflaged objects and their performance needs
to be improved. We found that the current COD meth-
ods are easily distracted by negative factors from decep-
tive background/surroundings, as illustrated in Fig. 1. As
a result, it is hard to mine discriminative and fine-grained semantic cues of camouflaged objects, which makes it difficult to accurately segment them from a confusing background and to predict uncertain regions. Meanwhile,
we learn that when people observe a concealed object in
images, they usually adjust the viewing distance, change
the viewing angle, and change the viewing position to find
the target object more accurately. Inspired by it, we aim
to design a simple yet efficient and effective strategy. The
aforementioned considerations motivate us to consider the
semantic and context exploration problem with multi-view.
We argue that corresponding clues, correlations, and mutual
constraints can be better obtained by utilizing information
from different viewpoints of the scene (e.g., changing observation distances and angles) as complementary information. Further-
more, we argue that carefully designing the encoded feature
fusion modules can help the encoder learn accurate infor-
mation corresponding to boundary and semantics. Taking
these into mind, our research will focus on the following
three aspects: (1) how to exploit the effects of different types of views on the COD task and how to combine multi-view features to achieve the best detection effect; (2) how to bet-
ter fuse the features from multiple views based on correla-
tion awareness and how to enhance the semantic expression
ability of multi-view feature maps without increasing model
complexity; (3) how to incrementally explore the potential
context relationships of a multi-channel feature map.
To address these pain points of the COD task, we propose a Multi-view Feature Fusion Network (MFFN) to make up for the semantic deficiency of fixed-view observation. First, we use multi-view raw data, generated by different data augmentations, as the inputs of a backbone extractor with shared weights. We im-
plement a ResNet model as the backbone extractor integrat-
ing the feature pyramid network (FPN) [24] to focus on ob-
ject information of different scales. In addition, we design
a Co-attention of Multi-view (CAMV) module to integrate
multi-view features and to explore the correlation between
different view types. CAMV consists of two stages of at-
tention operation. In the first stage, the inherent correlation
and complementary analysis are mainly conducted for mul-
tiple viewing distances and angles to obtain the view fea-
tures with a unified scale. In the second stage, the external
constraint relations between viewing angles and distances
are further leveraged to enhance feature maps’ semantic ex-
pression. For the enhanced multi-view feature tensor, we
design a Channel Fusion Unit (CFU) to further exploit the
correlation between contexts. In the CFU module, we first
carry out up-down feature interaction between channel di-
mensions and then carry out progressive iteration on the
overall features. CAMV is applied to the multi-view features at each scale of the FPN architecture, and the CFU module incorporates the previous layer's information as the feature maps of each scale are progressively restored to the original resolution. Finally, the prediction is obtained by a sigmoid operation, and it further benefits from the UAL design.
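To make the data flow just described concrete, the following is a minimal PyTorch-style sketch of the pipeline skeleton under stated assumptions: the backbone returns one feature map per pyramid level, and the CAMV/CFU entries are stand-in modules; names such as `MFFNSkeleton`, `camv_modules`, and the 64-channel head are illustrative and not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MFFNSkeleton(nn.Module):
    """Illustrative skeleton of the multi-view pipeline (not the released code).

    backbone     : shared FPN-style extractor; for one image it returns a list of
                   per-level feature maps, all assumed to share one channel width.
    camv_modules : one fusion module per pyramid level (placeholder for CAMV).
    cfu_modules  : one decoding module per pyramid level (placeholder for CFU); each
                   takes the fused tensor plus the previous level's output and is
                   assumed to handle any upsampling internally.
    """

    def __init__(self, backbone, camv_modules, cfu_modules, head_channels=64):
        super().__init__()
        self.backbone = backbone
        self.camv = nn.ModuleList(camv_modules)
        self.cfu = nn.ModuleList(cfu_modules)
        self.head = nn.Conv2d(head_channels, 1, kernel_size=3, padding=1)

    def forward(self, views):
        # views: list of augmented inputs [I_O, I_V, I_D, I_C1, I_C2], each (B, 3, H, W)
        per_view_feats = [self.backbone(v) for v in views]

        out = None
        for lvl in range(len(self.camv)):
            # view combining layer: channel-wise concat of same-level features
            mv_tensor = torch.cat([feats[lvl] for feats in per_view_feats], dim=1)
            en_tensor = self.camv[lvl](mv_tensor)   # two-stage co-attention fusion
            out = self.cfu[lvl](en_tensor, out)     # hierarchical channel fusion / decoding

        return torch.sigmoid(self.head(out))        # probability map of camouflaged object
```

The per-level loop mirrors the FPN structure in Fig. 2, where each CFU consumes both the current en-tensor and the previous level's output.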
Our contributions can be summarized as follows: 1) We propose the MFFN model to address the challenging problems faced by single-view COD models. MFFN captures complementary information acquired from different viewing angles and distances and discovers progressive connections between contexts.
2) We design the CAMV module to mine the complementary relationships within and between different types of view features and to enhance the semantic expression ability of multi-view feature tensors, and we use the CFU module to conduct progressive context-cue mining.
3) Our model is tested on three datasets, CHAMELEON [42], COD10K [6] and NC4K [32], and quantitative analysis is conducted with five common evaluation metrics, $S_m$ [7], $F^w_\beta$ [33], MAE, $F_\beta$ [1] and $E_m$ [8], on all of which our model achieves superior results.
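Among these measures, MAE is the simplest to state: it averages the absolute per-pixel difference between the predicted probability map and the binary ground-truth mask. A minimal sketch in PyTorch (the remaining metrics, e.g. $S_m$ and $E_m$, follow their original definitions and are not reproduced here):

```python
import torch

def mean_absolute_error(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """MAE between a predicted probability map and a binary ground-truth mask.

    pred: (H, W) or (B, H, W), values in [0, 1]
    gt:   same shape, values in {0, 1}
    """
    return (pred.float() - gt.float()).abs().mean()

# example: mae = mean_absolute_error(torch.sigmoid(logits).squeeze(1), masks)
```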
2. Related work
Salient Object Detection (SOD). SOD is in essence a segmentation task: it first computes a saliency map and then merges and segments the salient objects. In previous studies, traditional methods based on handcrafted features paid more attention to color [2, 23], texture [54, 23], contrast [39, 16] and so on, but they lack advantages in complex scenes and structured description. With the development of CNNs, SOD algorithms have advanced rapidly. Li et al. [22] combine local information with global information to overcome the tendency of purely local models to highlight object boundaries rather than the whole object. The idea of multi-level feature design has since been widely applied [25, 66, 14, 19]. Simi-
lar to COD, clear boundary information is crucial for SOD
task [40, 63, 44]. The development of attention mechanisms provides more schemes for exploring the correlations between the channel and spatial dimensions [37, 9, 48], and their application improves the performance of SOD models [28, 62, 51]. SOD, however, faces simpler background surroundings. Although excellent performance can be obtained by applying the relevant models to the COD task, specific designs are still needed to remove the interference from the background surroundings.
Camouflaged Object Detection (COD). In recent years,
some studies have applied multi-task learning to detect
camouflaged objects. Le et al. [18] introduced the binary
[Figure 2 diagram: multi-view inputs I_O, I_V (vertical), I_D (diagonal), I_C1, I_C2 (close) enter the shared Feature Pyramid Encoder; a View Combining Layer concatenates the same-level features Cat(f_i^O, f_i^V, f_i^D, f_i^C1, f_i^C2) into mv-tensors; CAMV1-CAMV5 (Co-Attention of Multi-View Fusion) produce en-tensors f_1-f_5; CFU1-CFU5 with ConvBlock and Upsample form the Hierarchical Channel Fusion Decoder.]
Figure 2: Overview of our model structure. We generate multiple views (diagonally and vertically flipped views, close-looking views) of the data by different transformation methods. The shared pyramid feature encoder is applied to extract hierarchical features of different scales corresponding to the different view choices. The view combining layer concatenates features of the same level from different views ($f_i^D$, $f_i^V$, $f_i^O$, $f_i^{C1}$, $f_i^{C2}$) channel-wise and outputs multi-view feature tensors (mv-tensors). The model feeds the mv-tensors into CAMVs and obtains multi-view enhanced feature tensors (en-tensors) $f_i$. CAMV is adopted to fuse features and aggregate vital clues between different views by a two-stage co-attention mechanism. The en-tensors are further decoded while the contextual correlations are simultaneously exploited by the hierarchical channel fusion units. In the end, a probability map of the camouflaged object in the input image is computed by several convolutional blocks.
classification task as the second branch and auxiliary task of
camouflaged object segmentation. Zhu et al. [67] proposed
a new boundary-guided separated attention network (BSA-
NET), which uses two streams of separated attention mod-
ules to highlight the boundaries of camouflaged objects.
Lv et al. [32] proposed a multi-task learning framework to
jointly localize and segment the camouflaged objects while
inferring their ranks. Zhai et al. [58] designed a mutual
graph learning model to detect the edge and region of the
objects simultaneously. There are some uncertainty-aware
methods. Li et al. [20] proposed an uncertainty-aware
framework containing a joint network for both salient and
camouflaged object detection. Yang et al. [56] introduced
Bayesian learning into the uncertainty-guided transformer
reasoning model. Liu et al. [26] designed an aleatoric
uncertainty estimation network to indicate the prediction
awareness. Sun et al. [45] placed emphasis on rich global
context information with the integration of cross-level fea-
tures. Pei et al. [38] applied a one-stage location-sensing
transformer and further fused the features from transformer
and CNN. Bio-inspired methods have also been proposed. For example, [35, 34, 6] use multi-scale information, but only from a single view. Meanwhile, [35] shows that single-view information is not sufficient for accurately detecting camouflaged objects. We hereby argue that view generation and selection might play an important role, and we aim to develop our model by mimicking the behavior of humans who, when trying to understand complicated concealed objects, alter the way they observe an image. Our proposed method exploits visual perception knowledge and semantic cues by aggregating complementary information from multiple views. Accordingly, our model is simple yet efficient and effective in comprehensively understanding the scene and accurately segmenting the camouflaged objects.
3. Method
Motivation. Motivated by the challenges of single-view
COD models, we attempt to capture boundary and regional
semantic information with rich viewing angles and flexible
viewing distances. In order to merge diverse context infor-
mation from features of multi-view inputs and FPN multi-
level outputs, we design a feature fusion module based on a two-stage attention mechanism to obtain enhanced feature tensors, which also avoids redundant structural design. To
leverage the rich information contained in channel dimen-
sions, we design a local-overall context/cues mining struc-
ture based on channel-wise integration. Meanwhile, it also
enhances the information expression of the feature tensors.
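The concrete CAMV design is specified later in the paper; purely as an illustration of the two-stage idea (first refining each view's channels on its own, then re-weighting all views jointly), the sketch below uses squeeze-and-excitation-style channel attention in two passes. Every name and the attention form itself are assumptions for illustration, not the actual CAMV module.

```python
import torch
import torch.nn as nn

def _se_block(channels: int, reduction: int = 4) -> nn.Sequential:
    """Squeeze-and-excitation style channel gate, used purely for illustration."""
    return nn.Sequential(
        nn.AdaptiveAvgPool2d(1),
        nn.Conv2d(channels, channels // reduction, kernel_size=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels // reduction, channels, kernel_size=1),
        nn.Sigmoid(),
    )

class TwoStageViewAttention(nn.Module):
    """Generic two-pass channel-attention fusion over a multi-view tensor.

    Stage 1 re-weights the channels of each view independently (within-view fusion);
    stage 2 re-weights the concatenated channels of all views jointly (cross-view
    fusion). This is a stand-in for CAMV, not its actual implementation.
    """

    def __init__(self, channels_per_view: int, num_views: int = 5, reduction: int = 4):
        super().__init__()
        self.cpv = channels_per_view
        self.stage1 = nn.ModuleList(
            [_se_block(channels_per_view, reduction) for _ in range(num_views)]
        )
        self.stage2 = _se_block(channels_per_view * num_views, reduction)

    def forward(self, mv_tensor: torch.Tensor) -> torch.Tensor:
        # mv_tensor: (B, num_views * channels_per_view, H, W)
        chunks = torch.split(mv_tensor, self.cpv, dim=1)
        refined = [c * gate(c) for c, gate in zip(chunks, self.stage1)]
        x = torch.cat(refined, dim=1)
        return x * self.stage2(x)   # enhanced multi-view feature tensor (en-tensor)
```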
3.1. Multi-view Generation
As shown in Fig. 1, the single-view model misses nec-
essary boundary, region, and shape information. Inspired
by human behavior, taking complementary views of obser-
vation into account will overcome this defect and we de-
sign three different views: distance, angle, and perspective
view. We obtain different distance views through resize operations, keeping the ratio gap between scales larger than 0.5 to increase the distinction. We
get different angle views by mirror transformation, includ-
ing horizontal, vertical and diagonal mirror transformation.
We obtain different perspective views through affine trans-
formation. Specifically, three corresponding points on the
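As a rough illustration, such views can be produced with standard image transforms; the scale factors, the flip set, and the affine parameters in this sketch are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torchvision.transforms.functional as TF

def generate_views(img: torch.Tensor) -> dict:
    """Produce the original plus distance/angle/perspective views of one image tensor.

    img: (C, H, W) tensor. Scale factors, the flip set, and the affine parameters
    below are illustrative assumptions, not the paper's exact settings.
    """
    _, h, w = img.shape
    views = {"original": img}

    # distance views: rescale with a ratio gap larger than 0.5 to keep views distinct
    for name, s in (("close", 1.5), ("far", 0.75)):
        views[name] = TF.resize(img, [int(h * s), int(w * s)])

    # angle views: horizontal, vertical, and "diagonal" (both axes) mirror transforms
    views["h_flip"] = TF.hflip(img)
    views["v_flip"] = TF.vflip(img)
    views["d_flip"] = TF.vflip(TF.hflip(img))

    # perspective view: a mild affine/shear transform as a stand-in
    views["affine"] = TF.affine(img, angle=0.0, translate=[0, 0],
                                scale=1.0, shear=[10.0, 0.0])
    return views
```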