Synthetic Data Supervised Salient Object Detection

Zhenyu Wu
State Key Laboratory of Virtual
Reality Technology and Systems,
Beihang University
Beijing, China
Lin Wang
School of Transportation Science and
Engineering, Beihang University
Beijing, China
Wei Wang
School of Computer Science and
Technology, Harbin Institute of
Technology
Shenzhen, China
Tengfei Shi
State Key Laboratory of Virtual
Reality Technology and Systems,
Beihang University
Beijing, China
Chenglizhao Chen
College of Computer Science and
Technology, China University of
Petroleum (East China)
Qingdao, China
Aimin Hao
State Key Laboratory of Virtual
Reality Technology and Systems,
Beihang University, Beijing
Peng Cheng Laboratory, Shenzhen
China
Shuo Li
Department of Medical Imaging,
Western University
London, Canada
ABSTRACT
Although deep salient object detection (SOD) has achieved remark-
able progress, deep SOD models are extremely data-hungry, requir-
ing large-scale pixel-wise annotations to deliver such promising re-
sults. In this paper, we propose a novel yet eective method for SOD,
coined SODGAN, which can generate innite high-quality image-
mask pairs requiring only a few labeled data, and these synthesized
pairs can replace the human-labeled DUTS-TR to train any o-the-
shelf SOD model. Its contribution is three-fold.
1)
Our proposed
diusion embedding network can address the manifold mismatch
and is tractable for the latent code generation, better matching
with the ImageNet latent space.
2)
For the rst time, our proposed
few-shot saliency mask generator can synthesize innite accurate
image synchronized saliency masks with a few labeled data.
3)
Our
proposed quality-aware discriminator can select highquality synthe-
sized image-mask pairs from noisy synthetic data pool, improving
the quality of synthetic data. For the rst time, our SODGAN tackles
SOD with synthetic data directly generated from the generative
model, which opens up a new research paradigm for SOD. Exten-
sive experimental results show that the saliency model trained
on synthetic data can achieve
98.4%
F-measure of the saliency
model trained on the DUTS-TR. Moreover, our approach achieves a
new SOTA performance in semi/weakly-supervised methods, and
even outperforms several fully-supervised SOTA methods. Code is
available at https://github.com/wuzhenyubuaa/SODGAN
Corresponding Author: Chenglizhao Chen, cclz123@163.com
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM ’22, October 10–14, 2022, Lisboa, Portugal.
©2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9203-7/22/10. . . $15.00
https://doi.org/10.1145/3503161.3547930
CCS CONCEPTS
• Computing methodologies → Artificial intelligence; Machine learning.
KEYWORDS
Salient object detection, Synthetic data, Semi-supervised learning
ACM Reference Format:
Zhenyu Wu, Lin Wang, Wei Wang, Tengfei Shi, Chenglizhao Chen, Aimin
Hao, and Shuo Li. 2022. Synthetic Data Supervised Salient Object Detection.
In Proceedings of the 30th ACM International Conference on Multimedia (MM
’22), October 10–14, 2022, Lisboa, Portugal. ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3503161.3547930
1 INTRODUCTION
Salient object detection (SOD) aims to segment interesting objects that attract human attention in an image. As a fundamental tool, it can be leveraged in various applications, including scene understanding [60], semantic segmentation [59], and image editing [5, 15]. Recently, SOD has achieved significant progress [12, 19, 33, 36, 42, 53, 57] due to the development of deep models. However, deep networks are extremely data-hungry, typically requiring pixel-level human-annotated datasets to achieve high performance (see Fig. 1.a). Labeling large-scale datasets with pixel-level annotations for SOD is very time-consuming; e.g., in the SOC dataset [10], generally more than five people were asked to annotate the same image to guarantee label consistency, and another ten viewers were asked to cross-check the quality of the annotations.
To alleviate the dependency on pixel-wise annotation, many weakly-supervised SOD methods [17, 37, 49] have been devised. Typically, image-level labels (see Fig. 1.c) are utilized in [17, 37] for saliency localization, and the models are then iteratively finetuned with predicted saliency maps. Additionally, scribble annotations (see Fig. 1.b) have been proposed recently in [52] to reduce the uncertainty of image-level labels. Although these methods are free of pixel-level annotations, they suffer from various disadvantages, including low prediction accuracy, complex training strategies, dedicated network architectures, and the need for extra data (e.g., edges) to obtain high-quality saliency maps.

arXiv:2210.13835v1 [cs.CV] 25 Oct 2022

(a) PFSN [22] (b) SCWS [48] (c) MWS [49] (d) Ours
Figure 1: The saliency model trained on synthetic data outperforms SOTA weakly-supervised methods, and is even competitive with fully-supervised models.
In this paper, we propose a new paradigm, SODGAN (see Fig. 1.d), for SOD, which can generate infinite high-quality image-mask pairs with a few labeled data to replace the human-labeled DUTS-TR [37] dataset. Concretely, our SODGAN has three stages. Stage 1: learning a few-shot saliency mask generator to synthesize image-synchronous masks, while utilizing an existing generative adversarial network (BigGAN [3]) to generate realistic images. Stage 2: selecting high-quality image-mask pairs from the synthetic data pool. Stage 3: training a saliency network on these filtered image-mask pairs. However, there are three main challenges with this approach: 1) Lacking pixel-wise labeled data as the training dataset to learn a segmentor, because BigGAN was trained on ImageNet, which was designed for classification tasks without pixel-level labels. 2) Discovering a meaningful direction in the GAN latent space to disentangle foreground salient objects from backgrounds is nontrivial, often requiring domain knowledge and laborious engineering. 3) Low-quality image-mask pairs exist in the synthesized datasets.
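As a hedged sketch (not the authors' actual code), the three-stage pipeline above can be expressed as a generic generate-filter-train loop; `sample_pair`, `quality_score`, and `train_fn` are placeholder callables standing in for the BigGAN + mask generator, the quality-aware discriminator, and off-the-shelf SOD training, respectively:

```python
def sodgan_pipeline(sample_pair, quality_score, train_fn,
                    n_samples, keep_threshold=0.5):
    """Hypothetical sketch of the SODGAN pipeline.

    sample_pair()       -> (image, mask): Stage 1, draw a synthetic pair
                           (BigGAN image plus a few-shot generated mask).
    quality_score(i, m) -> float in [0, 1]: Stage 2, the quality-aware
                           discriminator's confidence in the pair.
    train_fn(pairs)     -> model: Stage 3, train any off-the-shelf SOD model.
    """
    # Stage 1: build a (noisy) synthetic data pool.
    pool = [sample_pair() for _ in range(n_samples)]
    # Stage 2: keep only pairs the discriminator rates as high quality.
    kept = [(img, msk) for img, msk in pool
            if quality_score(img, msk) > keep_threshold]
    # Stage 3: train a saliency network on the filtered pairs.
    return train_fn(kept)
```

Because every stage is passed in as a callable, the same skeleton works with any backbone SOD model, which mirrors the generality claim made later in the paper.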
To tackle these three challenges, first, we present a diffusion embedding network (DEN) (see Sec. 3.2) to utilize the existing well-annotated dataset (i.e., DUTS-TR), which can infer an image's latent code that matches the ImageNet latent code space; thus, the existing labeled DUTS-TR dataset can provide pixel-wise labels for ImageNet. Second, in contrast to the existing works [13, 26, 31] focusing on the latent space, we propose a few-shot saliency mask generator to automatically discover meaningful directions in the GAN's feature space (see Sec. 3.3), which can synthesize infinite high-quality image-synchronized saliency masks with a few labeled data. Third, we propose a quality-aware discriminator (see Sec. 3.4) to select high-quality synthesized image-mask pairs from the noisy synthetic data pool, improving the quality of synthetic data.
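One plausible way to realize the selection step is to rank the pool by discriminator score and keep the top fraction. This is a sketch under assumptions, not the paper's implementation; `score_fn` stands in for the quality-aware discriminator and `keep_ratio` is a hypothetical hyperparameter:

```python
def select_high_quality(pairs, score_fn, keep_ratio=0.3):
    """Rank synthetic image-mask pairs by a quality score and keep the
    top fraction. `score_fn(image, mask)` returns a scalar quality score;
    higher means the pair looks more like a clean image-mask pair."""
    ranked = sorted(pairs, key=lambda p: score_fn(*p), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:n_keep]
```

Ranking (rather than hard thresholding) has the practical advantage that the size of the filtered training set is known in advance, at the cost of an extra sort over the pool.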
Our SODGAN has several desirable properties. a) Fewer labels. Our approach eliminates large-scale pixel-level supervision, requiring only a few labeled data, which reduces annotation costs. b) High performance. We demonstrate that the saliency model trained on synthetic data directly generated from GANs achieves, on average, 98.4% of the F-measure of the saliency model trained on the DUTS-TR dataset. Moreover, our SODGAN achieves new SOTA performance among semi/weakly-supervised methods, and even outperforms some fully-supervised methods. c) Generality. The synthetic data can be used to train any off-the-shelf SOD model without the need for special architectures, showing strong generalization capabilities on real test datasets. We summarize the key contributions as follows:
For the rst time, our SODGAN tackles SOD with synthetic
data directly generated from the generative model, which
opens up a new research paradigm for semi-supervised SOD
and signicantly reduces the annotation costs.
Our proposed the DEN can address manifold mismatch and
is tractable for the latent code generation, better matching
with the ImageNet latent space.
Our lightweight few-shot saliency mask generator can syn-
thesize innite accurate image-synchronous saliency masks
with a few labeled data.
Our proposed quality-aware discriminator can select high-
quality synthesized image-mask pairs from the noisy syn-
thetic data pool, improving the quality of synthetic data.
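The 98.4% figure quoted earlier is a ratio of F-measures. For reference, the saliency F-measure with the conventional β² = 0.3 can be computed as below; this is the standard SOD evaluation metric, not code from the paper, and it assumes flattened binary predictions:

```python
def f_measure(pred, gt, beta_sq=0.3):
    """F-measure over flat binary {0, 1} label lists, with beta^2 = 0.3
    as is conventional in SOD evaluation."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)   # fraction of predicted salient pixels that are correct
    recall = tp / sum(gt)        # fraction of ground-truth salient pixels recovered
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

A relative score such as 98.4% would then be `f_measure` of the synthetic-data model divided by `f_measure` of the DUTS-TR model, averaged over the benchmark datasets.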
2 RELATED WORK
Semi/Weakly-supervised SOD Approaches. With recent advances in semi/weakly-supervised learning, a few existing works exploit the potential of training saliency detectors on image-level [17, 37, 49], region-level [48, 51, 52], and limited pixel-level [41, 44, 50, 58] labeled data to relax the dependency on manually annotated pixel-level saliency masks. For image-level supervision, these approaches [17, 37, 49] follow the same technical route, i.e., producing initial saliency maps with image-level labels and then further refining them via iterative training. Recently, scribble annotation was proposed in [48, 52], but it requires large-scale scribble annotations (10,553 images) and extra data information (e.g., edges) to recover integral object structure. Differences. Distinct from all these works, our approach provides a new paradigm for semi-supervised SOD. In particular, we introduce SODGAN, a generative model, which can generate infinite high-quality image-mask pairs requiring minimal manual intervention. These generated pairs can then be used for training any existing SOD approach.
Latent Interpretability of GANs. Previous works have shown that GAN latent spaces are endowed with human-interpretable semantic arithmetic. A line of recent works [6, 13, 26, 31, 32, 47] employs explicit human-provided supervision to identify interpretable directions in the latent space. For instance, [13, 31] use classifiers pretrained on the CelebA [21] dataset to produce pseudo labels for the generated images and their latent codes. Another active line of study on GANs [1, 2, 4, 23, 34, 35, 55] targets the object segmentation task. [1] and [4] are based on the idea of decomposing the generative process in a layer-wise fashion. Other works [2, 23, 35] exploit the idea that an object's location or appearance can be perturbed without affecting image realism. Differences. In contrast to existing works manipulating the latent space, our approach is able to discover interpretable directions in the GAN's feature space, which allows complete control over the diversity of object categories and can automatically find the expected directions.