
Figure 1: Comparison of (a) PFSN [22], (b) SCWS [48], (c) MWS [49], and (d) Ours. The saliency model trained on synthetic data outperforms SOTA weakly-supervised methods, and is even competitive with fully-supervised models.
prediction accuracy, complex training strategy, dedicated network
architecture, and extra data information (e.g., edge) to obtain high-
quality saliency maps.
In this paper, we propose a new paradigm, SODGAN (see Fig. 1d), for SOD, which can generate infinite high-quality image-mask pairs from only a few labeled data to replace the human-labeled DUTS-TR [37] dataset. Concretely, our SODGAN has three stages (sketched below): Stage 1. Learning a few-shot saliency mask generator to synthesize image-synchronous masks, while utilizing an existing generative adversarial network (BigGAN [3]) to generate realistic images. Stage 2. Selecting high-quality image-mask pairs from the synthetic data pool. Stage 3. Training a saliency network on these filtered image-mask pairs. However, there are three main challenges with this approach: 1) Lacking pixel-wise labeled data as the training set to learn a segmentor, because BigGAN was trained on ImageNet, which was designed for classification tasks and provides no pixel-level labels. 2) Discovering a meaningful direction in the GAN latent space to disentangle foreground salient objects from backgrounds is nontrivial, and often requires domain knowledge and laborious engineering. 3) Low-quality image-mask pairs exist in the synthesized datasets.
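To make the three stages concrete, the following is a minimal sketch of the pipeline, assuming a pretrained BigGAN generator and hypothetical MaskGenerator, QualityDiscriminator, and SaliencyNet components (the names, interfaces, and hyperparameters are illustrative, not the actual implementation):

```python
import torch

# Assumed components (hypothetical interfaces):
#   biggan(z, y, return_features=True) -> synthetic image + intermediate features
#   mask_generator(feats)              -> saliency mask predicted from features
#   quality_disc(img, mask)            -> scalar quality score for the pair
#   saliency_net                       -> any off-the-shelf SOD network

def synthesize_dataset(biggan, mask_generator, quality_disc,
                       num_samples=10000, keep_ratio=0.5, device="cuda"):
    """Stage 1 + Stage 2: generate image-mask pairs, then keep the best ones."""
    pairs = []
    for _ in range(num_samples):
        z = torch.randn(1, 128, device=device)            # latent code
        y = torch.randint(0, 1000, (1,), device=device)   # ImageNet class
        with torch.no_grad():
            img, feats = biggan(z, y, return_features=True)  # Stage 1: image
            mask = mask_generator(feats)                      # Stage 1: mask
            score = quality_disc(img, mask)                   # Stage 2: score
        pairs.append((img.cpu(), mask.cpu(), score.item()))
    pairs.sort(key=lambda p: p[2], reverse=True)              # Stage 2: filter
    return [(img, mask) for img, mask, _ in pairs[:int(keep_ratio * len(pairs))]]

def train_sod(saliency_net, synthetic_pairs, optimizer, criterion):
    """Stage 3: train any off-the-shelf SOD model on the filtered synthetic pairs."""
    for img, mask in synthetic_pairs:
        pred = saliency_net(img)
        loss = criterion(pred, mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```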
To tackle these three challenges, first, we present a diffusion embedding network (DEN) (see Sec. 3.2) to utilize the existing well-annotated dataset (i.e., DUTS-TR): it infers an image's latent code that matches the ImageNet latent code space, so the existing labeled DUTS-TR dataset can provide pixel-wise labels for ImageNet.
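The actual DEN is defined in Sec. 3.2; purely as an illustration of the embedding idea (a generic GAN-inversion-style sketch, not the paper's diffusion formulation), an encoder can be trained to map a real labeled image into the generator's latent space via reconstruction through a frozen BigGAN:

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Illustrative encoder: image -> BigGAN latent code (hypothetical, not the paper's DEN)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, img):
        return self.backbone(img)

def embedding_step(encoder, frozen_biggan, img, class_label, optimizer):
    """One training step: the inferred latent code should reconstruct the real
    image through the frozen generator (only the encoder is updated)."""
    z = encoder(img)
    recon = frozen_biggan(z, class_label)
    loss = nn.functional.l1_loss(recon, img)   # simple reconstruction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```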
Second, in contrast to existing works [13, 26, 31] that focus on the latent space, we propose a few-shot saliency mask generator that automatically discovers meaningful directions in the GAN feature space (see Sec. 3.3), which can synthesize infinite high-quality image-synchronized saliency masks from only a few labeled data.
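As a rough sketch of what such a mask generator can look like (our illustration; the actual architecture is given in Sec. 3.3), a lightweight head can map upsampled intermediate BigGAN features to a per-pixel saliency mask and be trained on only a handful of labeled examples:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FewShotMaskHead(nn.Module):
    """Lightweight head over generator features -> saliency mask (illustrative only)."""
    def __init__(self, feat_channels, hidden=64, out_size=128):
        super().__init__()
        self.out_size = out_size
        self.head = nn.Sequential(
            nn.Conv2d(sum(feat_channels), hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 1),      # 1-channel saliency logit
        )

    def forward(self, feature_maps):
        # Resize all intermediate generator feature maps to a common resolution
        # and concatenate them along the channel dimension.
        feats = [F.interpolate(f, size=(self.out_size, self.out_size),
                               mode="bilinear", align_corners=False)
                 for f in feature_maps]
        return torch.sigmoid(self.head(torch.cat(feats, dim=1)))

def train_few_shot(head, labeled_pairs, epochs=200, lr=1e-3):
    """Few-shot training on a handful of (generator features, human mask) pairs."""
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for feats, gt_mask in labeled_pairs:
            pred = head(feats)
            loss = F.binary_cross_entropy(pred, gt_mask)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```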
Third, we propose a quality-aware discriminator (see Sec. 3.4) that selects high-quality synthesized image-mask pairs from the noisy synthetic data pool, improving the quality of the synthetic data.
Our SODGAN has several desirable properties. a) Fewer labels. Our approach eliminates large-scale pixel-level supervision, requiring only a few labeled data, which reduces annotation costs. b) High performance. We demonstrate that the saliency model trained on synthetic data directly generated from GANs achieves, on average, 98.4% of the F-measure of the saliency model trained on the DUTS-TR dataset. Moreover, our SODGAN achieves new SOTA performance among semi/weakly-supervised methods, and even outperforms some fully supervised methods. c) Generality. The synthetic data can be used to train any off-the-shelf SOD model without the need for special architectures, showing strong generalization capabilities on real test datasets. We summarize the key contributions as follows:
• For the first time, our SODGAN tackles SOD with synthetic data directly generated from a generative model, which opens up a new research paradigm for semi-supervised SOD and significantly reduces annotation costs.
• Our proposed DEN addresses manifold mismatch and is tractable for latent code generation, better matching the ImageNet latent space.
• Our lightweight few-shot saliency mask generator can synthesize infinite accurate image-synchronous saliency masks from only a few labeled data.
• Our proposed quality-aware discriminator can select high-quality synthesized image-mask pairs from the noisy synthetic data pool, improving the quality of the synthetic data.
2 RELATED WORK
Semi/Weakly-supervised SOD Approaches. With recent advances in semi/weakly-supervised learning, a few existing works exploit the potential of training saliency detectors on image-level [17, 37, 49], region-level [48, 51, 52], and limited pixel-level [41, 44, 50, 58] labeled data to relax the dependence on manually annotated pixel-level saliency masks. For image-level supervision, these approaches [17, 37, 49] follow the same technical route, i.e., producing initial saliency maps with image-level labels and then further refining them via iterative training. Recently, scribble annotation was proposed in [48, 52], but it requires large-scale scribble annotations (10,553 images) and extra data information (e.g., edges) to recover integral object structure. Differences. Distinct from all these works, our approach provides a new paradigm for semi-supervised SOD. In particular, we introduce SODGAN, a generative model that can generate infinite high-quality image-mask pairs with minimal manual intervention. These generated pairs can then be used to train any existing SOD approach.
Latent Interpretability of GANs. Previous works have shown that GAN latent spaces are endowed with human-interpretable semantic arithmetic. A line of recent works [6, 13, 26, 31, 32, 47] employs explicit human-provided supervision to identify interpretable directions in the latent space. For instance, [13, 31] use classifiers pretrained on the CelebA [21] dataset to produce pseudo labels for the generated images and their latent codes. Another active line of study on GANs [1, 2, 4, 23, 34, 35, 55] targets the object segmentation task. [1] and [4] are based on the idea of decomposing the generative process in a layer-wise fashion. Other works [2, 23, 35] exploit the idea that an object's location or appearance can be perturbed without affecting image realism. Differences. In contrast to existing works that manipulate the latent space, our approach discovers interpretable directions in the GAN feature space, which allows complete control over the diversity of object categories and can automatically find the expected directions.