recognize camouflaged objects, and their performance needs to be improved. We found that current COD methods are easily distracted by negative factors from deceptive backgrounds/surroundings, as illustrated in Fig. 1. As a result, it is hard to mine discriminative and fine-grained semantic cues of camouflaged objects, making it difficult to accurately segment camouflaged objects from a confusing background and to predict uncertain regions.
Meanwhile, we observe that when people look for a concealed object in an image, they usually adjust the viewing distance, change the viewing angle, and shift the viewing position to locate the target object more accurately. Inspired by this, we aim to design a simple yet efficient and effective strategy. These considerations motivate us to study the semantic and context exploration problem from a multi-view perspective.
We argue that corresponding clues, correlations, and mutual constraints can be better obtained by using information from different viewpoints of the scene (e.g., changing observation distances and angles) as complements. Furthermore, we argue that carefully designed encoded-feature fusion modules can help the encoder learn accurate boundary and semantic information. With these in mind, our research focuses on the following three aspects: (1) how to exploit the effects of different types of views on the COD task, and how to combine multi-view features to achieve the best detection performance; (2) how to better fuse the features from multiple views based on correlation awareness, and how to enhance the semantic expressiveness of multi-view feature maps without increasing model complexity; (3) how to incrementally explore the potential contextual relationships of a multi-channel feature map.
To address these pain points of the COD task, we propose a Multi-view Feature Fusion Network (MFFN) that makes up for the semantic deficiency of fixed-view observation. First, we take multi-view raw data, generated by different data augmentations, as the inputs of a backbone extractor with shared weights. We implement a ResNet model as the backbone extractor, integrating a feature pyramid network (FPN) [24] to focus on object information at different scales. In addition, we design a Co-attention of Multi-view (CAMV) module to integrate multi-view features and to explore the correlation between different view types. CAMV consists of two stages of attention operations. The first stage mainly analyzes the inherent correlation and complementarity among multiple viewing distances and angles to obtain view features with a unified scale. The second stage further leverages the external constraint relations between viewing angles and distances to enhance the semantic expression of the feature maps. For the enhanced multi-view feature tensor, we design a Channel Fusion Unit (CFU) to further exploit the correlation between contexts. In the CFU module, we first perform up-down feature interaction across channel dimensions and then progressively iterate over the overall features. CAMV is applied to the multi-view attention features of the different-scale feature maps in the FPN architecture, and the CFU module incorporates the previous layer's information as the feature maps of each scale are eventually restored to the original size. Finally, the prediction is obtained by a sigmoid operation and further benefits from the UAL design.
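To make the overall data flow concrete, the following PyTorch-style sketch mimics the pipeline described above: augmented views (rescaling as a proxy for viewing distance, flips as a proxy for viewing angle), shared-weight feature extraction, a CAMV-like two-stage fusion, a CFU-like progressive refinement, and a sigmoid prediction. This is our own minimal simplification, not the released MFFN code; all module internals, view choices, and hyper-parameters here are illustrative assumptions.

```python
# Minimal sketch of the multi-view pipeline (simplified stand-ins, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_views(x):
    """Simulate viewing distances (rescaling) and angles (flips) via augmentation."""
    h, w = x.shape[-2:]
    far = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
    far = F.interpolate(far, size=(h, w), mode="bilinear", align_corners=False)
    hflip = torch.flip(x, dims=[-1])  # change of viewing angle
    vflip = torch.flip(x, dims=[-2])
    return [x, far, hflip, vflip]     # original + 3 extra views

class SharedBackbone(nn.Module):
    """Stand-in for the shared-weight ResNet+FPN extractor."""
    def __init__(self, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, out_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class CoAttentionFusion(nn.Module):
    """CAMV-like fusion: (1) align and merge the views, (2) re-weight with channel attention."""
    def __init__(self, ch, n_views):
        super().__init__()
        self.align = nn.Conv2d(ch * n_views, ch, 1)   # stage 1: merge views at a unified scale
        self.gate = nn.Sequential(                    # stage 2: channel attention gate
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
    def forward(self, feats):
        # undo the flips so all view features are spatially aligned before merging
        feats = [feats[0], feats[1],
                 torch.flip(feats[2], dims=[-1]), torch.flip(feats[3], dims=[-2])]
        fused = self.align(torch.cat(feats, dim=1))
        return fused * self.gate(fused)

class ChannelFusionUnit(nn.Module):
    """CFU-like progressive refinement: iterate a residual channel-mixing step."""
    def __init__(self, ch, steps=2):
        super().__init__()
        self.steps = steps
        self.mix = nn.Conv2d(ch, ch, 3, padding=1)
    def forward(self, x):
        for _ in range(self.steps):  # progressive iteration over the whole feature map
            x = x + F.relu(self.mix(x))
        return x

class MiniMFFN(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.backbone = SharedBackbone(ch)
        self.camv = CoAttentionFusion(ch, n_views=4)
        self.cfu = ChannelFusionUnit(ch)
        self.head = nn.Conv2d(ch, 1, 1)
    def forward(self, x):
        feats = [self.backbone(v) for v in make_views(x)]  # shared weights across views
        fused = self.cfu(self.camv(feats))
        logits = F.interpolate(self.head(fused), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        return torch.sigmoid(logits)  # final prediction map

mask = MiniMFFN()(torch.randn(2, 3, 128, 128))  # -> (2, 1, 128, 128)
```

In the full model, this fusion would be applied at every FPN scale rather than to a single feature map as above.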
Our contributions can be summarized as follows: 1) We propose the MFFN model to solve the challenging problems faced by single-view COD models. MFFN can capture the complementary information acquired from different viewing angles and distances and discover the progressive connections between contexts.
2) We design the CAMV module to mine the complementary relationships within and between different types of view features and to enhance the semantic expressiveness of multi-view feature tensors, and we use the CFU module to conduct progressive context cue mining.
3) Our model is tested on three datasets, CHAMELEON [42], COD10K [6], and NC4K [32], and quantitative analysis is conducted on five common evaluation metrics, $S_m$ [7], $F^w_\beta$ [33], MAE, $F_\beta$ [1], and $E_m$ [8], achieving superior results on all of them.
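For reference, the simplest of these metrics, MAE, is the standard pixel-wise mean absolute error between the predicted map $P$ and the ground-truth mask $G$ over a $W \times H$ image (the well-known definition used across SOD/COD papers, not specific to this work):

$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \big| P(x, y) - G(x, y) \big|.$$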
2. Related work
Salient Object Detection (SOD). SOD is essentially a segmentation task: it first computes a saliency map and then merges and segments the salient objects. In previous studies, traditional methods based on hand-crafted features paid more attention to color [2, 23], texture [54, 23], contrast [39, 16], and so on, but they lack advantages in complex scenes and structured description. With the development of CNNs, SOD algorithms have achieved leapfrog development. Li et al. [22] combine local information with global information to overcome the problem that locally-based models highlight the object boundary but not the overall object. The design idea of multi-level feature model structures has been widely applied [25, 66, 14, 19]. Similar to COD, clear boundary information is crucial for the SOD task [40, 63, 44]. The development of attention mechanisms provides more schemes for exploring the correlations between the channel and spatial dimensions [37, 9, 48], and their application improves the performance of SOD models [28, 62, 51]. SOD faces simpler background surroundings; although excellent performance can be obtained by applying the relevant models to the COD task, a specific design is still needed to remove interference from the background surroundings.
Camouflaged Object Detection (COD). In recent years, some studies have applied multi-task learning to detect camouflaged objects. Le et al. [18] introduced the binary