Promising or Elusive? Unsupervised Object
Segmentation from Real-world Single Images
Yafei Yang Bo Yang
vLAR Group, The Hong Kong Polytechnic University
ya-fei.yang@connect.polyu.hk bo.yang@polyu.edu.hk
Abstract
In this paper, we study the problem of unsupervised object segmentation from
single images. We do not introduce a new algorithm, but systematically investi-
gate the effectiveness of existing unsupervised models on challenging real-world
images. We firstly introduce four complexity factors to quantitatively measure
the distributions of object- and scene-level biases in appearance and geometry for
datasets with human annotations. With the aid of these factors, we empirically
find that, not surprisingly, existing unsupervised models catastrophically fail to
segment generic objects in real-world images, although they can easily achieve
excellent performance on numerous simple synthetic datasets, due to the vast gap
in objectness biases between synthetic and real images. By conducting extensive
experiments on multiple groups of ablated real-world datasets, we ultimately find
that the key factors underlying the colossal failure of existing unsupervised models
on real-world images are the challenging distributions of object- and scene-level
biases in appearance and geometry. Because of this, the inductive biases introduced
in existing unsupervised models can hardly capture the diverse object distribu-
tions. Our research results suggest that future work should exploit more explicit
objectness biases in the network design.
1 Introduction
The capability of automatically identifying individual objects from complex visual observations
is a central aspect of human intelligence [53]. It serves as the key building block for higher-level cognition tasks such as planning and reasoning [28]. In recent years, a plethora of models have been proposed to segment objects from single static images in an unsupervised fashion: from the early AIR [22] and MONet [8] to the recent SPACE [39], SlotAtt [40], GENESIS-V2 [20], etc. They
jointly learn to represent and segment multiple objects from a single image, without needing any
human annotations in training. This process is often called perceptual grouping/binding or object-
centric learning. These methods and their variants have achieved impressive segmentation results on
numerous synthetic scene datasets such as dSprites [42] and CLEVR [33]. Such advances come with
great expectations that the unsupervised techniques would likely close the gap with fully-supervised
methods for real-world visual understanding. However, little work has systematically investigated the true potential of the emerging unsupervised object segmentation models on complex real-world images such as the COCO dataset [38]. This naturally raises an essential question:
Is it promising or even possible to segment generic objects from real-world single images using
(existing) unsupervised methods?
Figure 1: The failure of SlotAtt [40] on three real-world images (right-hand side), although it can perfectly segment simple objects on three synthetic images (left-hand side). (Rows: single images; predicted object masks.)

What is an object?
Answering the above question involves another fundamental question: what is an object? Exactly 100 years ago in Gestalt psychology, Wertheimer [61] first introduced a set of Gestalt principles such as proximity, similarity and continuation to heuristically define visual data as objects. However, these factors are highly subjective, whilst real-world generic objects
are far more complex with extremely diverse appearances and shapes. Therefore, it is practically
impossible to quantitatively define what is an object, i.e., the objectness, from visual inputs (e.g., a
set of image pixels). Nevertheless, to thoroughly understand whether unsupervised methods can truly
learn objectness akin to the psychological process of humans, it is vital to investigate the underlying
factors that potentially facilitate or otherwise hinder the ability of unsupervised models. In this regard,
by drawing on Gestalt principles, we instead define a series of new factors to quantitatively measure
the complexity of objects and scenes in Section 2. By taking into account both the appearance and
geometry of objects and scenes, our complexity factors explicitly assess the difficulty of segmenting
a specific object. For example, a chair with colorful textures tends to have higher complexity than
a single-color ball for unsupervised methods. With the aid of these factors, we extensively study
whether and how existing unsupervised models can discover objects in Section 4.
What is the problem of unsupervised object segmentation from single images?
A large number of models [63] aim to tackle the problem of unsupervised object segmentation from single images.
They share several key problem settings: 1) all training images do not have any human annotations;
2) every single image has multiple objects; 3) each image is treated as a static data point without
any dynamic or temporal information; 4) all models are trained from scratch without requiring any
pretrained networks on additional datasets. Ultimately, the goal of these models is to segment all individual objects as accurately as the ground truth human annotations. In this paper, we regard these settings as the basic and necessary part of unsupervised object segmentation from single images, and empirically evaluate how well the existing models can perform on real-world images.
Contributions and findings.
This paper addresses the essential question regarding the potential of
unsupervised segmentation of generic objects from real-world single images. Our contributions are:
• We firstly introduce 4 complexity factors to quantitatively measure the difficulty of objects and scenes. These factors are key to investigating the true potential of existing unsupervised models.
• We extensively evaluate current unsupervised approaches in a large-scale experimental study. We implement 4 representative methods and train more than 130 models on 6 curated datasets from scratch. The datasets, code and pretrained models are available at https://github.com/vLAR-group/UnsupObjSeg.
• We analyze our experimental results and find that: 1) existing unsupervised object segmentation models cannot discover generic objects from single real-world images, although they can achieve outstanding performance on synthetic datasets, as qualitatively illustrated in Figure 1; 2) the challenging distributions of both object- and scene-level biases in appearance and geometry in real-world images are the key factors causing the failure of existing models; 3) the inductive biases introduced in existing unsupervised models are fundamentally mismatched with the objectness biases exhibited in real-world images, and therefore fail to discover real objectness.
Related Work.
Recently, ClevrTex [35] and the concurrent work [43] also study unsupervised object segmentation on single images. Through evaluation on (complex) synthetic datasets only, both works focus on benchmarking the effectiveness of particular network designs of baselines. By comparison, our paper aims to explore which objectness distribution gaps between synthetic and real-world images cause the failure of existing models, and how. The recent work [60], which investigates video object discovery, is orthogonal to ours, as motion signals do not exist in single images.
Scope of this research.
This paper does not investigate unsupervised object discovery on saliency maps [59], static multi-views or dynamic videos [63]. Recent methods [10; 30] requiring models pretrained on monolithic object images such as ImageNet [47] are not evaluated either.
2 Complexity Factors
Figure 2: Complexity in appearance and geometry for objects and scenes (top row: Objects #1 to #3; bottom row: Scenes #1 to #3).
As illustrated in the top row of Figure 2, an individual object,
represented by a set of color pixels painted within a mask,
can vary significantly given different types of appearance and
geometric shape. A specific scene, represented by a set of
objects placed within an image, can also differ vastly given
different types of relative appearance and geometric layout
between objects, as illustrated in the bottom row. Unarguably, such variation and complexity of appearance and geometry at both the object level and the scene level directly affects humans' ability to precisely separate all objects. Naturally, the performance of unsupervised segmentation models is also expected to be influenced by this variation. In this regard, we carefully define
the following two groups of factors to quantitatively describe
the complexity of different datasets.
2.1 Object-level Complexity Factors
For a specific object, all its information can be described by appearance and geometry. Therefore, we define the following two factors to measure the complexity of appearance and geometry respectively.
Notably, both factors are nicely invariant to the object scale.
Object Color Gradient:
This factor aims to calculate how frequently the appearance changes
within the object mask. In particular, given the RGB image and mask of an object, we firstly convert RGB into grayscale and then apply the Sobel filter [51] to compute the gradients horizontally and vertically for each pixel within the mask. The final gradient value is obtained by averaging over all object pixels. Note that the object boundary pixels are removed to avoid interference from the background. Numerically, the higher this factor is, the more complex texture and/or lighting effects the object has, and therefore it is likely harder to segment.
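A minimal sketch of this computation with OpenCV and NumPy is shown below; the Sobel kernel size, the grayscale normalization, and the one-pixel boundary erosion are our assumptions rather than the authors' released implementation (exact details are deferred to the appendix).

```python
import cv2
import numpy as np

def object_color_gradient(rgb, mask, boundary_px=1):
    """Average Sobel gradient magnitude over interior object pixels.

    rgb:  (H, W, 3) uint8 RGB image.
    mask: (H, W) binary object mask.
    boundary_px: number of boundary-erosion iterations (assumed value).
    """
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)   # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)   # vertical gradient
    grad = np.sqrt(gx ** 2 + gy ** 2)
    # Erode the mask so boundary pixels (contaminated by background) are excluded.
    interior = cv2.erode(mask.astype(np.uint8), np.ones((3, 3), np.uint8),
                         iterations=boundary_px).astype(bool)
    if interior.sum() == 0:
        return 0.0
    return float(grad[interior].mean())
```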
Object Shape Concavity:
This factor is designed to evaluate how irregular the object boundary
is. Particularly, given an object (binary) mask, denoted as $M_{obj} \in \mathbb{R}^{H \times W}$, we firstly find the smallest convex polygon mask $M_{cvx} \in \mathbb{R}^{H \times W}$ that surrounds the object mask using an existing algorithm [19], and then the object shape concavity value is computed as $1 - \sum M_{obj} / \sum M_{cvx}$. Clearly, the higher this factor is, the more irregular the object shape is, and the trickier segmentation becomes.
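The sketch below follows the formula above, using scikit-image's convex_hull_image in place of the convex-polygon algorithm [19] cited by the paper; this library choice is an assumption.

```python
import numpy as np
from skimage.morphology import convex_hull_image

def object_shape_concavity(mask):
    """Concavity = 1 - sum(M_obj) / sum(M_cvx), with M_cvx the convex hull of the mask."""
    m_obj = mask.astype(bool)
    if m_obj.sum() == 0:
        return 0.0
    m_cvx = convex_hull_image(m_obj)   # smallest convex region covering the object mask
    return float(1.0 - m_obj.sum() / m_cvx.sum())
```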
2.2 Scene-level Complexity Factors
For a specific image, in addition to the object-level complexity, the spatial and appearance relation-
ships between all objects can also incur extra difficulty for segmentation. We define the following two
factors to quantify the complexity of relative appearance and geometry between objects in an image.
Inter-object Color Similarity:
This factor intends to assess the appearance similarity between all
objects in the same image. Specifically, we firstly calculate the average color for each object, and
then compute the pair-wise Euclidean distances of object colors, obtaining a $K \times K$ matrix where $K$ represents the number of objects. The average color distance is calculated by averaging the matrix excluding diagonal entries, and the final inter-object color similarity is computed as $1 - \text{average color distance} / (255 \times 3)$. Intuitively, the higher this factor is, the more similar all objects appear
to be, the less distinctive each object is, and it is harder to separate each object.
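A sketch of this scene-level factor is given below; reading "average color" as the per-object mean RGB value is our interpretation of the text.

```python
import numpy as np

def inter_object_color_similarity(rgb, masks):
    """1 - mean pairwise Euclidean distance of per-object mean colors / (255 * 3).

    rgb:   (H, W, 3) uint8 image.
    masks: list of K >= 2 binary (H, W) masks, one per object.
    """
    colors = np.stack([rgb[m.astype(bool)].astype(np.float64).mean(axis=0) for m in masks])
    dist = np.linalg.norm(colors[:, None, :] - colors[None, :, :], axis=-1)  # (K, K) distances
    off_diag = dist[~np.eye(len(masks), dtype=bool)]                         # exclude diagonal entries
    return float(1.0 - off_diag.mean() / (255.0 * 3.0))
```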
Inter-object Shape Variation:
This factor aims to measure the relative geometry diversity between
all objects in the image. We firstly calculate the diagonal length of the bounding box for each object, and then compute the pair-wise absolute differences for all object diagonal lengths, obtaining a $K \times K$ matrix. The final inter-object shape variation is the average of the matrix excluding diagonal entries. The higher this factor is, the more diverse and imbalanced the object sizes within an image are, and therefore segmenting both gigantic and tiny objects is likely more challenging.
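The following sketch mirrors the description above; the [0,1] normalization applied in Figure 3 (detailed in the appendix) is omitted here.

```python
import numpy as np

def inter_object_shape_variation(masks):
    """Mean pairwise absolute difference of bounding-box diagonal lengths (unnormalized)."""
    diags = []
    for m in masks:
        ys, xs = np.nonzero(m)
        height = ys.max() - ys.min() + 1
        width = xs.max() - xs.min() + 1
        diags.append(np.hypot(height, width))           # bounding-box diagonal length
    diags = np.asarray(diags)
    diff = np.abs(diags[:, None] - diags[None, :])       # (K, K) pairwise differences
    off_diag = diff[~np.eye(len(masks), dtype=bool)]     # exclude diagonal entries
    return float(off_diag.mean())
```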
By capturing the appearance and geometry at both the object and scene levels, the four factors are designed to quantify the complexity of objects and images. For illustration, Figure 3 shows sample images for the four factors at different values. The higher the values, the more complex the objects and scenes. In fact, these factors are carefully selected from more than 10 candidates because they are empirically more suitable to differentiate the gaps between synthetic and real-world images, and they eventually serve as key indicators to diagnose existing unsupervised models in Section 4. Calculation details of the four factors and other candidates are in the appendix.

Figure 3: Sample objects and scenes for the four factors (object-level: Object Color Gradient, Object Shape Concavity; scene-level: Inter-object Color Similarity, Inter-object Shape Variation) at different complexity values. All complexity values are normalized to the range of [0,1].
3 Experimental Design
3.1 Considered Methods
A range of works have explored unsupervised object segmentation in recent years. They are typically formulated as (variational) autoencoders (AE/VAE) [36] or generative adversarial networks (GAN) [24]. GAN-based models [13; 3; 7; 56; 5; 58; 1] are usually limited to identifying a single foreground object and can hardly discover multiple objects due to training instabilities, and are therefore not considered in this paper. As shown in Table 1, the majority of existing models are based on AE/VAE and can be generally divided into two groups according to the object representation:
Factor-based models
: Each object is represented by explicit factors such as size, position, ap-
pearance, etc., and the whole image is a spatial organization of multiple objects. Basically, such
representation explicitly enforces objects to be bounded within particular regions.
Layer-based models
: Each object is represented by an image layer, i.e., a binary mask, and the
whole image is a spatial mixture of multiple object layers, as sketched below. Intuitively, this representation does not impose strict spatial constraints, and instead is more flexible in clustering similar pixels into objects.
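To make the layer-based representation concrete, the sketch below shows the generic spatial-mixture composition these models share: K decoded object layers are blended with pixel-wise normalized masks. Tensor names and shapes are illustrative and not taken from any particular baseline.

```python
import torch
import torch.nn.functional as F

def compose_layers(rgb_layers, mask_logits):
    """Blend K object layers into one image with a pixel-wise spatial mixture.

    rgb_layers:  (B, K, 3, H, W) per-layer RGB reconstructions.
    mask_logits: (B, K, 1, H, W) per-layer mask logits.
    """
    masks = F.softmax(mask_logits, dim=1)        # masks sum to 1 at every pixel
    image = (masks * rgb_layers).sum(dim=1)      # (B, 3, H, W) reconstructed image
    return image, masks
```

Factor-based models, by contrast, typically decode each object into an explicitly bounded region (e.g., a glimpse placed back into the image by a spatial transformer) rather than a full-image layer.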
In order to decompose the input images into objects, these approaches introduce different types
of network architecture, loss functions, and regularization terms as inductive biases. These biases
broadly include: 1) variational encoding which encourages the disentanglement of latent variables; 2)
iterative inference which likely ends up with better scene representations over occlusions; 3) object
relationship regularization such as depth estimation and autoregressive prior which aims at capturing
the dependency of multiple objects; and many other biases. With different combinations of these
biases, many methods have shown outstanding performance on synthetic datasets. Among them, we select 4 representative models for our investigation: 1) AIR [22], 2) MONet [8], 3) IODINE [25], and 4) SlotAtt [40]. We also add the fully-supervised Mask R-CNN [29] as an additional baseline for comprehensive comparison. Implementation details are provided in the appendix.
3.2 Considered Datasets
We consider two groups of datasets for extensive benchmarking and analysis: 1) three commonly-used synthetic datasets: dSprites [42], Tetris [34] and CLEVR [33]; 2) three real-world datasets: YCB [9], ScanNet [17], and COCO [38], representing the small-scale, indoor- and outdoor-level real scenes
respectively. Naturally, objects and scenes in different datasets tend to have very different types of
biases. For example, the objects in dSprites tend to have the single-color bias, while COCO does not.
Generally, the object-level biases can be divided into: 1) appearance biases including different textures
and lighting effects, and 2) geometry biases including the object shape and occlusions. Similarly,
the scene-level biases include: 1) appearance biases such as the color similarity between all objects,
and 2) geometry biases such as the diversity of all object shapes. In fact, our complexity factors
introduced in Section 2 are designed to capture these biases well. Table 2 qualitatively summarizes
the biases of selected datasets. We may hypothesize that the large gaps of biases between synthetic
and real-world datasets would have a huge impact on the effectiveness of existing models.
Table 1: Existing unsupervised models for object segmentation on single images. Each model
includes different inductive biases, such as variational autoencoding (VAE), iterative inference (Iter),
object relationship regularization (Rel), etc.
Factor-based Models | Inductive Biases (VAE/Iter/Rel) | Layer-based Models | Inductive Biases (VAE/Iter/Rel)
CST-VAE [31] ICLRW'16 | X | Tagger [26] NIPS'16 | X
AIR [22] NIPS'16 | X | RC [26] ICLRW'16 | X
SPAIR [16] AAAI'19 | X | NEM [27] NIPS'17 | X
SuPAIR [54] ICML'19 | X | MONet [8] arXiv'19 | X
GMIO [64] ICML'19 | X X | IODINE [25] ICML'19 | X X
ASR [62] NeurIPS'19 | X | ECON [57] ICLRW'20 | X X
SPACE [39] ICLR'20 | X | GENESIS [21] ICLR'20 | X X
GNM [32] NeurIPS'20 | X X | SlotAtt [40] NeurIPS'20 | X
SPLIT [11] arXiv'20 | X | GENESIS-V2 [20] NeurIPS'21 | X X
OCIC [2] arXiv'20 | X X | R-MONet [50] arXiv'21 | X X
GSGN [18] ICLR'21 | X X | CAE [41] arXiv'22 | X
Table 2: The object- and scene-level biases in appearance and geometry of the considered datasets.
| Bias | dSprites [42] (syn.) | Tetris [34] (syn.) | CLEVR [33] (syn.) | YCB [9] (real) | ScanNet [17] (real) | COCO [38] (real) |
| Object-level Appearance: Texture | simple | simple | simple | diverse | simple | diverse |
| Object-level Appearance: Lighting | no | no | synthetic | real | real | real |
| Object-level Geometry: Shape | simple | simple | simple | simple | diverse | diverse |
| Object-level Geometry: Occlusion | minor | no | minor | severe | severe | severe |
| Scene-level Appearance: Similarity | low | low | high | high | high | high |
| Scene-level Geometry: Diversity | low | low | low | high | high | high |
To guarantee the fairness and consistency of all experiments, we carefully prepare all six datasets using the same protocols below. Preparation details for each dataset are provided in the appendix.
• All images are rerendered or cropped to the same resolution of 128×128.
• Each image has about 2 to 6 solid objects with a blank background.
• Each dataset has about 10000 images for training and 2000 images for testing.
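The snippet below is a purely hypothetical sketch of how one annotated image could be converted to follow these protocols; the function name, the black background, the object-count filter, and the resizing choices are all assumptions rather than the released preparation code.

```python
import cv2
import numpy as np

def prepare_sample(rgb, masks, out_size=128, min_objects=2, max_objects=6):
    """Hypothetical preparation step: keep images with 2-6 annotated objects,
    paste the object pixels onto a blank background, and resize to 128x128."""
    if not (min_objects <= len(masks) <= max_objects):
        return None                                       # skip images outside the object-count range
    canvas = np.zeros_like(rgb)                           # blank (black) background
    for m in masks:
        canvas[m.astype(bool)] = rgb[m.astype(bool)]      # keep only annotated object pixels
    canvas = cv2.resize(canvas, (out_size, out_size), interpolation=cv2.INTER_AREA)
    out_masks = [cv2.resize(m.astype(np.uint8), (out_size, out_size),
                            interpolation=cv2.INTER_NEAREST) for m in masks]
    return canvas, out_masks
```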
3.3 Considered Metrics
Having the six representative datasets and four existing unsupervised methods at hand, we choose the following metrics to evaluate object segmentation performance: 1) the AP score, which is widely used for object detection and segmentation [23]; 2) the PQ score, which is used to measure non-overlapping panoptic segmentation [37]; and 3) Precision and Recall scores. A predicted mask is considered correct if its IoU against a ground truth mask is above 0.5. All objects are treated as a single class. The blank background is not taken into account for fair comparison. To compute AP, we simply treat the mean value of the soft object mask as the object confidence score. Note that the alternative metrics ARI [45] and segmentation covering (SC) [4] are not considered as they can be easily saturated.
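As a concrete illustration of the matching rule and the confidence heuristic, consider the sketch below; the 0.5 threshold used to binarize soft masks and the reading of "mean value of the soft object mask" as a mean over all pixels are our assumptions.

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 0.0

def score_and_match(soft_mask, gt_masks, iou_thresh=0.5, bin_thresh=0.5):
    """Return (confidence, is_correct) for one predicted soft mask.

    confidence: mean of the soft mask values (one reading of the paper's heuristic).
    is_correct: True if the binarized mask overlaps some GT mask with IoU > iou_thresh.
    """
    confidence = float(soft_mask.mean())
    pred = soft_mask > bin_thresh                      # binarize the soft mask (assumed threshold)
    is_correct = any(mask_iou(pred, gt) > iou_thresh for gt in gt_masks)
    return confidence, is_correct
```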
4 Key Experimental Results
4.1 Can current unsupervised models succeed on real-world datasets?
First of all, we evaluate all baselines on our six datasets. In particular, we train each model from scratch on each dataset separately. For fair evaluation, we carefully tune the hyperparameters of each model on every dataset and fully optimize the networks until convergence. Figure 4 compares
the quantitative results. It can be seen that all methods demonstrate satisfactory segmentation results
on synthetic datasets, especially the recent strong baselines IODINE and SlotAtt. However, not
surprisingly, all unsupervised methods fail catastrophically on the three real-world datasets.