MFFN: Multi-view Feature Fusion Network for Camouflaged Object Detection
Dehua Zheng1, Xiaochen Zheng2, Laurence T. Yang1*, Yuan Gao3, Chenlu Zhu4, Yiheng Ruan4
1Huazhong University of Science and Technology, China 2ETH Zürich, Switzerland
3Hainan University, China 4Hubei Chutian Expressway Digital Technology, China
dwardzheng@hust.edu.cn xzheng@student.ethz.ch ltyang@gmail.com
Abstract
Recent research on camouflaged object detection (COD) aims to segment highly concealed objects hidden in complex surroundings. Tiny, fuzzy camouflaged objects have visually indistinguishable properties, and current single-view COD detectors are sensitive to background distractors. Therefore, the blurred boundaries and variable shapes of camouflaged objects are difficult to fully capture with a single-view detector. To overcome these obstacles, we propose
a behavior-inspired framework, called Multi-view Feature
Fusion Network (MFFN), which mimics the human behav-
iors of finding indistinct objects in images, i.e., observing
from multiple angles, distances, and perspectives. Specifically,
the key idea behind it is to generate multiple ways of obser-
vation (multi-view) by data augmentation and apply them
as inputs. MFFN captures critical boundary and semantic
information by comparing and fusing extracted multi-view
features. In addition, our MFFN exploits the dependence
and interaction between views and channels. Specifically,
our methods leverage the complementary information be-
tween different views through a two-stage attention module
called Co-attention of Multi-view (CAMV). We also design a local-overall module called Channel Fusion Unit (CFU) to explore the channel-wise contextual clues of diverse feature maps in an iterative manner. Experimental results show that our method performs favorably against existing state-of-the-art methods when trained with the same data.
The code will be available at https://github.com/dwardzheng/MFFN_COD.
1. Introduction
Camouflage is a mechanism [4] by which organisms
protect themselves in nature. Camouflaged object detec-
tion (COD) is a countermeasure against the camouflage
mechanism, aiming to capture the slight differences be-
*Corresponding author.
[Figure 1 image grid: columns Image, GT, Ours, SINet.]
Figure 1: Visualization of camouflaged animal detection. The state-of-the-art and classic single-view COD model SINet [6] is confused by background that shares high similarity with the target objects and misses much boundary and region-shape information (indicated by orange arrows). Our multi-view scheme eliminates these distractors and performs more efficiently and effectively.
tween the object and the background to obtain accurate de-
tection results. Unlike general object detection and salient
object detection, in which the objects and background can
be easily distinguished by human eyes or advanced deep
learning models, COD is more challenging because it re-
quires a sufficient amount of visual input and prior knowl-
edge [46] to address the complicated problem caused by the
highly intrinsic similarity between the target object and the
background. Thus, COD has a wide range of valuable appli-
cations in promoting the search and detection of biological
species [43], assisting the medical diagnosis with medical
images [41, 13], and improving the detection of pests and
diseases in agriculture [10].
Recently, many studies have put emphasis on learning from a fixed single view with either auxiliary tasks [18, 32, 34, 58, 67, 15], uncertainty discovery [20, 26], or vision transformers [56, 38], and the proposed methods have achieved significant progress. Nevertheless, due to the visual insignificance of camouflaged objects and the contextual insufficiency of single-view input, they still struggle to precisely
recognize camouflaged objects and their performance needs
to be improved. We found that the current COD meth-
ods are easily distracted by negative factors from decep-
tive background/surroundings, as illustrated in Fig. 1. As
a result, it is hard to mine discriminative and fine-grained semantic cues of camouflaged objects, which makes it difficult to accurately segment them from a confusing background and to predict uncertain regions. Meanwhile,
we learn that when people observe a concealed object in
images, they usually adjust the viewing distance, change
the viewing angle, and change the viewing position to find
the target object more accurately. Inspired by it, we aim
to design a simple yet efficient and effective strategy. The
aforementioned considerations motivate us to consider the
semantic and context exploration problem with multi-view.
We argue that corresponding clues, correlations, and mutual
constraints can be better obtained by utilizing information
from different viewpoints of the scene (e.g., changing observation distances and angles) as complementary information. Further-
more, we argue that carefully designing the encoded feature
fusion modules can help the encoder learn accurate infor-
mation corresponding to boundary and semantics. Taking
these into mind, our research will focus on the following
three aspects: (1) how to exploit the effects of different types of views on the COD task and how to combine multi-view features to achieve the best detection effect; (2) how to bet-
ter fuse the features from multiple views based on correla-
tion awareness and how to enhance the semantic expression
ability of multi-view feature maps without increasing model
complexity; (3) how to incrementally explore the potential
context relationships of a multi-channel feature map.
To address these pain points of the COD task, we propose a Multi-view Feature Fusion Network (MFFN) to make up for the semantic deficiency of fixed-view observation. First, we use multi-view raw data, generated by different data augmentations, as the inputs of a backbone extractor with shared weights. We im-
plement a ResNet model as the backbone extractor integrat-
ing the feature pyramid network (FPN) [24] to focus on ob-
ject information of different scales. In addition, we design
a Co-attention of Multi-view (CAMV) module to integrate
multi-view features and to explore the correlation between
different view types. CAMV consists of two stages of at-
tention operation. In the first stage, the inherent correlation
and complementary analysis are mainly conducted for mul-
tiple viewing distances and angles to obtain the view fea-
tures with a unified scale. In the second stage, the external
constraint relations between viewing angles and distances
are further leveraged to enhance feature maps’ semantic ex-
pression. For the enhanced multi-view feature tensor, we
design a Channel Fusion Unit (CFU) to further exploit the
correlation between contexts. In the CFU module, we first
carry out up-down feature interaction between channel di-
mensions and then carry out progressive iteration on the
overall features. CAMV is applied to the multi-view features at each scale of the FPN architecture, and the CFU module incorporates the previous layer's information as the feature maps of each scale are progressively restored to the original resolution. Finally, the prediction is obtained by a sigmoid operation, and it further benefits from the UAL design.
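To make the data flow just described concrete, the following is a minimal PyTorch-style sketch of the pipeline skeleton under stated assumptions: the backbone returns one feature map per pyramid level, and the CAMV/CFU entries are stand-in modules; names such as `MFFNSkeleton`, `camv_modules`, and the 64-channel head are illustrative and not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MFFNSkeleton(nn.Module):
    """Illustrative skeleton of the multi-view pipeline (not the released code).

    backbone     : shared FPN-style extractor; for one image it returns a list of
                   per-level feature maps, all assumed to share one channel width.
    camv_modules : one fusion module per pyramid level (placeholder for CAMV).
    cfu_modules  : one decoding module per pyramid level (placeholder for CFU); each
                   takes the fused tensor plus the previous level's output and is
                   assumed to handle any upsampling internally.
    """

    def __init__(self, backbone, camv_modules, cfu_modules, head_channels=64):
        super().__init__()
        self.backbone = backbone
        self.camv = nn.ModuleList(camv_modules)
        self.cfu = nn.ModuleList(cfu_modules)
        self.head = nn.Conv2d(head_channels, 1, kernel_size=3, padding=1)

    def forward(self, views):
        # views: list of augmented inputs [I_O, I_V, I_D, I_C1, I_C2], each (B, 3, H, W)
        per_view_feats = [self.backbone(v) for v in views]

        out = None
        for lvl in range(len(self.camv)):
            # view combining layer: channel-wise concat of same-level features
            mv_tensor = torch.cat([feats[lvl] for feats in per_view_feats], dim=1)
            en_tensor = self.camv[lvl](mv_tensor)   # two-stage co-attention fusion
            out = self.cfu[lvl](en_tensor, out)     # hierarchical channel fusion / decoding

        return torch.sigmoid(self.head(out))        # probability map of camouflaged object
```

The per-level loop mirrors the FPN structure in Fig. 2, where each CFU consumes both the current en-tensor and the previous level's output.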
Our contributions can be summarized as follows: 1) We propose the MFFN model to address the challenging problems faced by single-view COD models. MFFN captures complementary information acquired from different viewing angles and distances and discovers progressive connections between contexts.
2) We design the CAMV module to mine the complementary relationships within and between different types of view features and to enhance the semantic expression ability of multi-view feature tensors, and we use the CFU module to conduct progressive context-cue mining.
3) Our model is tested on three datasets, CHAMELEON [42], COD10K [6] and NC4K [32], and quantitative analysis is conducted with five common evaluation metrics, $S_m$ [7], $F^w_\beta$ [33], MAE, $F_\beta$ [1] and $E_m$ [8], on all of which our model achieves superior results.
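Among these measures, MAE is the simplest to state: it averages the absolute per-pixel difference between the predicted probability map and the binary ground-truth mask. A minimal sketch in PyTorch (the remaining metrics, e.g. $S_m$ and $E_m$, follow their original definitions and are not reproduced here):

```python
import torch

def mean_absolute_error(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """MAE between a predicted probability map and a binary ground-truth mask.

    pred: (H, W) or (B, H, W), values in [0, 1]
    gt:   same shape, values in {0, 1}
    """
    return (pred.float() - gt.float()).abs().mean()

# example: mae = mean_absolute_error(torch.sigmoid(logits).squeeze(1), masks)
```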
2. Related work
Salient Object Detection (SOD). SOD is in essence a segmentation task: it first computes a saliency map and then merges and segments the salient objects. In previous studies, traditional methods based on handcrafted features paid more attention to color [2, 23], texture [54, 23], contrast [39, 16] and so on, but they lack advantages in complex scenes and structured description. With the development of CNNs, SOD algorithms have advanced rapidly. Li et al. [22] combine local information with global information to overcome the tendency of purely local models to highlight object boundaries rather than the whole object. The idea of multi-level feature design has since been widely applied [25, 66, 14, 19]. Simi-
lar to COD, clear boundary information is crucial for SOD
task [40, 63, 44]. The development of attention mechanisms provides more schemes for exploring the correlations between the channel and spatial dimensions [37, 9, 48], and their application improves the performance of SOD models [28, 62, 51]. SOD, however, faces simpler background surroundings. Although excellent performance can be obtained by applying the relevant models to the COD task, specific designs are still needed to remove the interference from the background surroundings.
Camouflaged Object Detection (COD). In recent years,
some studies have applied multi-task learning to detect
camouflaged objects. Le et al. [18] introduced the binary
[Figure 2 diagram: multi-view inputs I_O, I_V (vertical), I_D (diagonal), I_C1, I_C2 (close) enter the shared Feature Pyramid Encoder; a View Combining Layer concatenates the same-level features Cat(f_i^O, f_i^V, f_i^D, f_i^C1, f_i^C2) into mv-tensors; CAMV1-CAMV5 (Co-Attention of Multi-View Fusion) produce en-tensors f_1-f_5; CFU1-CFU5 with ConvBlock and Upsample form the Hierarchical Channel Fusion Decoder.]
Figure 2: Overview of our model structure. We generate multiple views (diagonally and vertically flipped views, close-looking views) of the data by different transformation methods. The shared pyramid feature encoder is applied to extract hierarchical features of different scales corresponding to the different view choices. The view combining layer concatenates features of the same level from different views ($f_i^D$, $f_i^V$, $f_i^O$, $f_i^{C1}$, $f_i^{C2}$) channel-wise and outputs multi-view feature tensors (mv-tensors). The model feeds the mv-tensors into CAMVs and obtains multi-view enhanced feature tensors (en-tensors) $f_i$. CAMV is adopted to fuse features and aggregate vital clues between different views by a two-stage co-attention mechanism. The en-tensors are further decoded while the contextual correlations are simultaneously exploited by the hierarchical channel fusion units. In the end, a probability map of the camouflaged object in the input image is computed by several convolutional blocks.
classification task as the second branch and auxiliary task of
camouflaged object segmentation. Zhu et al. [67] proposed
a new boundary-guided separated attention network (BSA-
NET), which uses two streams of separated attention mod-
ules to highlight the boundaries of camouflaged objects.
Lv et al. [32] proposed a multi-task learning framework to
jointly localize and segment the camouflaged objects while
inferring their ranks. Zhai et al. [58] designed a mutual
graph learning model to detect the edge and region of the
objects simultaneously. There are some uncertainty-aware
methods. Li et al. [20] proposed an uncertainty-aware
framework containing a joint network for both salient and
camouflaged object detection. Yang et al. [56] introduced
Bayesian learning into the uncertainty-guided transformer
reasoning model. Liu et al. [26] designed an aleatoric
uncertainty estimation network to indicate the prediction
awareness. Sun et al. [45] placed emphasis on rich global
context information with the integration of cross-level fea-
tures. Pei et al. [38] applied a one-stage location-sensing
transformer and further fused the features from transformer
and CNN. Bio-inspired methods have also been proposed. For example, [35, 34, 6] use multi-scale information, but only from a single view. Meanwhile, [35] shows that single-view information is not sufficient for accurately detecting camouflaged objects. We hereby argue that view generation and selection might play an important role, and we aim to develop our model by mimicking the behavior of humans who, when trying to understand complicated concealed objects, alter the way they observe an image. Our proposed method exploits visual perception knowledge and semantic cues by aggregating complementary information from multiple views. Accordingly, our model is simple yet efficient and effective in comprehensively understanding the scene and accurately segmenting the camouflaged objects.
3. Method
Motivation. Motivated by the challenges of single-view
COD models, we attempt to capture boundary and regional
semantic information with rich viewing angles and flexible
viewing distances. In order to merge diverse context infor-
mation from features of multi-view inputs and FPN multi-
level outputs, we design a feature fusion module based on a two-stage attention mechanism to obtain enhanced feature tensors, which also avoids redundant structural design. To
leverage the rich information contained in channel dimen-
sions, we design a local-overall context/cues mining struc-
ture based on channel-wise integration. Meanwhile, it also
enhances the information expression of the feature tensors.
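The concrete CAMV design is specified later in the paper; purely as an illustration of the two-stage idea (first refining each view's channels on its own, then re-weighting all views jointly), the sketch below uses squeeze-and-excitation-style channel attention in two passes. Every name and the attention form itself are assumptions for illustration, not the actual CAMV module.

```python
import torch
import torch.nn as nn

def _se_block(channels: int, reduction: int = 4) -> nn.Sequential:
    """Squeeze-and-excitation style channel gate, used purely for illustration."""
    return nn.Sequential(
        nn.AdaptiveAvgPool2d(1),
        nn.Conv2d(channels, channels // reduction, kernel_size=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels // reduction, channels, kernel_size=1),
        nn.Sigmoid(),
    )

class TwoStageViewAttention(nn.Module):
    """Generic two-pass channel-attention fusion over a multi-view tensor.

    Stage 1 re-weights the channels of each view independently (within-view fusion);
    stage 2 re-weights the concatenated channels of all views jointly (cross-view
    fusion). This is a stand-in for CAMV, not its actual implementation.
    """

    def __init__(self, channels_per_view: int, num_views: int = 5, reduction: int = 4):
        super().__init__()
        self.cpv = channels_per_view
        self.stage1 = nn.ModuleList(
            [_se_block(channels_per_view, reduction) for _ in range(num_views)]
        )
        self.stage2 = _se_block(channels_per_view * num_views, reduction)

    def forward(self, mv_tensor: torch.Tensor) -> torch.Tensor:
        # mv_tensor: (B, num_views * channels_per_view, H, W)
        chunks = torch.split(mv_tensor, self.cpv, dim=1)
        refined = [c * gate(c) for c, gate in zip(chunks, self.stage1)]
        x = torch.cat(refined, dim=1)
        return x * self.stage2(x)   # enhanced multi-view feature tensor (en-tensor)
```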
3.1. Multi-view Generation
As shown in Fig. 1, the single-view model misses nec-
essary boundary, region, and shape information. Inspired
by human behavior, taking complementary views of obser-
vation into account will overcome this defect and we de-
sign three different views: distance, angle, and perspective
view. We obtain different distance views through resize operations, keeping the ratio gap between scales larger than 0.5 to increase the distinction. We
get different angle views by mirror transformation, includ-
ing horizontal, vertical and diagonal mirror transformation.
We obtain different perspective views through affine trans-
formation. Specifically, three corresponding points on the
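As a rough illustration, such views can be produced with standard image transforms; the scale factors, the flip set, and the affine parameters in this sketch are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torchvision.transforms.functional as TF

def generate_views(img: torch.Tensor) -> dict:
    """Produce the original plus distance/angle/perspective views of one image tensor.

    img: (C, H, W) tensor. Scale factors, the flip set, and the affine parameters
    below are illustrative assumptions, not the paper's exact settings.
    """
    _, h, w = img.shape
    views = {"original": img}

    # distance views: rescale with a ratio gap larger than 0.5 to keep views distinct
    for name, s in (("close", 1.5), ("far", 0.75)):
        views[name] = TF.resize(img, [int(h * s), int(w * s)])

    # angle views: horizontal, vertical, and "diagonal" (both axes) mirror transforms
    views["h_flip"] = TF.hflip(img)
    views["v_flip"] = TF.vflip(img)
    views["d_flip"] = TF.vflip(TF.hflip(img))

    # perspective view: a mild affine/shear transform as a stand-in
    views["affine"] = TF.affine(img, angle=0.0, translate=[0, 0],
                                scale=1.0, shear=[10.0, 0.0])
    return views
```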