Promising or Elusive? Unsupervised Object
Segmentation from Real-world Single Images
Yafei Yang Bo Yang
vLAR Group, The Hong Kong Polytechnic University
ya-fei.yang@connect.polyu.hk bo.yang@polyu.edu.hk
Abstract
In this paper, we study the problem of unsupervised object segmentation from
single images. We do not introduce a new algorithm, but systematically investi-
gate the effectiveness of existing unsupervised models on challenging real-world
images. We firstly introduce four complexity factors to quantitatively measure
the distributions of object- and scene-level biases in appearance and geometry for
datasets with human annotations. With the aid of these factors, we empirically
find that, not surprisingly, existing unsupervised models catastrophically fail to
segment generic objects in real-world images, although they can easily achieve
excellent performance on numerous simple synthetic datasets, due to the vast gap
in objectness biases between synthetic and real images. By conducting extensive
experiments on multiple groups of ablated real-world datasets, we ultimately find
that the key factors underlying the colossal failure of existing unsupervised models
on real-world images are the challenging distributions of object- and scene-level
biases in appearance and geometry. Because of this, the inductive biases introduced
in existing unsupervised models can hardly capture the diverse object distribu-
tions. Our research results suggest that future work should exploit more explicit
objectness biases in the network design.
1 Introduction
The capability of automatically identifying individual objects from complex visual observations
is a central aspect of human intelligence [53]. It serves as the key building block for higher-level cognition tasks such as planning and reasoning [28]. In recent years, a plethora of models have been proposed to segment objects from single static images in an unsupervised fashion: from the early AIR [22] and MONet [8] to the recent SPACE [39], SlotAtt [40], GENESIS-V2 [20], etc. They
jointly learn to represent and segment multiple objects from a single image, without needing any
human annotations in training. This process is often called perceptual grouping/binding or object-
centric learning. These methods and their variants have achieved impressive segmentation results on
numerous synthetic scene datasets such as dSprites [42] and CLEVR [33]. Such advances come with
great expectations that the unsupervised techniques would likely close the gap with fully-supervised
methods for real-world visual understanding. However, little work has systematically investigated the true potential of the emerging unsupervised object segmentation models on complex real-world images such as the COCO dataset [38]. This naturally raises an essential question:
Is it promising or even possible to segment generic objects from real-world single images using
(existing) unsupervised methods?
Figure 1: The failure of SlotAtt [40] on three real-world images (right-hand side), although it can perfectly segment simple objects on three synthetic images (left-hand side). (Rows: single images; predicted object masks.)

What is an object?
Answering the above question involves another fundamental question: what is an object? Exactly 100 years ago in Gestalt psychology, Wertheimer [61] first introduced a set of Gestalt principles such as proximity, similarity and continuation to heuristically define visual data as objects. However, these factors are highly subjective, whilst real-world generic objects
are far more complex with extremely diverse appearances and shapes. Therefore, it is practically
impossible to quantitatively define what is an object, i.e., the objectness, from visual inputs (e.g., a
set of image pixels). Nevertheless, to thoroughly understand whether unsupervised methods can truly
learn objectness akin to the psychological process of humans, it is vital to investigate the underlying
factors that potentially facilitate or otherwise hinder the ability of unsupervised models. In this regard,
by drawing on Gestalt principles, we instead define a series of new factors to quantitatively measure
the complexity of objects and scenes in Section 2. By taking into account both the appearance and
geometry of objects and scenes, our complexity factors explicitly assess the difficulty of segmenting
a specific object. For example, a chair with colorful textures tends to have higher complexity than
a single-color ball for unsupervised methods. With the aid of these factors, we extensively study
whether and how existing unsupervised models can discover objects in Section 4.
What is the problem of unsupervised object segmentation from single images?
A large number of models [63] aim to tackle the problem of unsupervised object segmentation from single images.
They share several key problem settings: 1) all training images do not have any human annotations;
2) every single image has multiple objects; 3) each image is treated as a static data point without
any dynamic or temporal information; 4) all models are trained from scratch without requiring any
pretrained networks on additional datasets. Ultimately, the goal of these models is to segment all individual objects as accurately as the ground truth human annotations. In this paper, we regard these settings as the basic and necessary part of unsupervised object segmentation from single images, and empirically evaluate how well the existing models can perform on real-world images.
Contributions and findings.
This paper addresses the essential question regarding the potential of
unsupervised segmentation of generic objects from real-world single images. Our contributions are:
• We firstly introduce 4 complexity factors to quantitatively measure the difficulty of objects and scenes. These factors are key to investigating the true potential of existing unsupervised models.
• We extensively evaluate current unsupervised approaches in a large-scale experimental study. We implement 4 representative methods and train more than 130 models on 6 curated datasets from scratch. The datasets, code and pretrained models are available at https://github.com/vLAR-group/UnsupObjSeg.
• We analyze our experimental results and find that: 1) existing unsupervised object segmentation models cannot discover generic objects from single real-world images, although they can achieve outstanding performance on synthetic datasets, as qualitatively illustrated in Figure 1; 2) the challenging distributions of both object- and scene-level biases in appearance and geometry in real-world images are the key factors causing the failure of existing models; 3) the inductive biases introduced in existing unsupervised models are fundamentally mismatched with the objectness biases exhibited in real-world images, and therefore fail to discover real objectness.
Related Work.
Recently, ClevrTex [35] and the concurrent work [43] also study unsupervised object segmentation on single images. Through evaluation on (complex) synthetic datasets only, both works focus on benchmarking the effectiveness of particular network designs of baselines. By comparison, our paper aims to explore which objectness distribution gaps between synthetic and real-world images cause the failure of existing models, and how. The recent work [60], which investigates video object discovery, is orthogonal to ours, as motion signals do not exist in single images.
Scope of this research.
This paper does not investigate unsupervised object discovery on saliency maps [59], static multi-views or dynamic videos [63]. Recent methods [10; 30] requiring models pretrained on monolithic object images such as ImageNet [47] are not evaluated either.
2 Complexity Factors
Figure 2: Complexity in appearance and geometry for objects and scenes (top row: Objects #1 to #3; bottom row: Scenes #1 to #3).
As illustrated in the top row of Figure 2, an individual object,
represented by a set of color pixels painted within a mask,
can vary significantly given different types of appearance and
geometric shape. A specific scene, represented by a set of
objects placed within an image, can also differ vastly given
different types of relative appearance and geometric layout
between objects, as illustrated in the bottom row. Unarguably, such variation and complexity of appearance and geometry at both the object level and the scene level directly affects humans' ability to precisely separate all objects. Naturally, the performance of unsupervised segmentation models is also expected to be influenced by this variation. In this regard, we carefully define
the following two groups of factors to quantitatively describe
the complexity of different datasets.
2.1 Object-level Complexity Factors
For a specific object, all its information can be described by appearance and geometry. Therefore, we define the following two factors to measure the complexity of appearance and geometry respectively.
Notably, both factors are nicely invariant to the object scale.
Object Color Gradient:
This factor aims to calculate how frequently the appearance changes
within the object mask. In particular, given the RGB image and mask of an object, we firstly convert RGB into grayscale and then apply the Sobel filter [51] to compute the gradients horizontally and vertically for each pixel within the mask. The final gradient value is obtained by averaging over all object pixels. Note that the object boundary pixels are removed to avoid interference from the background. Numerically, the higher this factor is, the more complex texture and/or lighting effects the object has, and therefore it is likely harder to segment.
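A minimal sketch of this computation with OpenCV and NumPy is shown below; the Sobel kernel size, the grayscale normalization, and the one-pixel boundary erosion are our assumptions rather than the authors' released implementation (exact details are deferred to the appendix).

```python
import cv2
import numpy as np

def object_color_gradient(rgb, mask, boundary_px=1):
    """Average Sobel gradient magnitude over interior object pixels.

    rgb:  (H, W, 3) uint8 RGB image.
    mask: (H, W) binary object mask.
    boundary_px: number of boundary-erosion iterations (assumed value).
    """
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)   # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)   # vertical gradient
    grad = np.sqrt(gx ** 2 + gy ** 2)
    # Erode the mask so boundary pixels (contaminated by background) are excluded.
    interior = cv2.erode(mask.astype(np.uint8), np.ones((3, 3), np.uint8),
                         iterations=boundary_px).astype(bool)
    if interior.sum() == 0:
        return 0.0
    return float(grad[interior].mean())
```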
Object Shape Concavity:
This factor is designed to evaluate how irregular the object boundary
is. Particularly, given an object (binary) mask, denoted as $M_{obj} \in \mathbb{R}^{H \times W}$, we firstly find the smallest convex polygon mask $M_{cvx} \in \mathbb{R}^{H \times W}$ that surrounds the object mask using an existing algorithm [19], and then the object shape concavity value is computed as $1 - \sum M_{obj} / \sum M_{cvx}$. Clearly, the higher this factor is, the more irregular the object shape is, and the trickier segmentation becomes.
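The sketch below follows the formula above, using scikit-image's convex_hull_image in place of the convex-polygon algorithm [19] cited by the paper; this library choice is an assumption.

```python
import numpy as np
from skimage.morphology import convex_hull_image

def object_shape_concavity(mask):
    """Concavity = 1 - sum(M_obj) / sum(M_cvx), with M_cvx the convex hull of the mask."""
    m_obj = mask.astype(bool)
    if m_obj.sum() == 0:
        return 0.0
    m_cvx = convex_hull_image(m_obj)   # smallest convex region covering the object mask
    return float(1.0 - m_obj.sum() / m_cvx.sum())
```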
2.2 Scene-level Complexity Factors
For a specific image, in addition to the object-level complexity, the spatial and appearance relation-
ships between all objects can also incur extra difficulty for segmentation. We define the following two
factors to quantify the complexity of relative appearance and geometry between objects in an image.
Inter-object Color Similarity:
This factor intends to assess the appearance similarity between all
objects in the same image. Specifically, we firstly calculate the average color for each object, and
then compute the pair-wise Euclidean distances of object colors, obtaining a $K \times K$ matrix where $K$ represents the number of objects. The average color distance is calculated by averaging the matrix excluding diagonal entries, and the final inter-object color similarity is computed as $1 - \text{average color distance} / (255 \times 3)$. Intuitively, the higher this factor is, the more similar all objects appear
to be, the less distinctive each object is, and it is harder to separate each object.
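A sketch of this scene-level factor is given below; reading "average color" as the per-object mean RGB value is our interpretation of the text.

```python
import numpy as np

def inter_object_color_similarity(rgb, masks):
    """1 - mean pairwise Euclidean distance of per-object mean colors / (255 * 3).

    rgb:   (H, W, 3) uint8 image.
    masks: list of K >= 2 binary (H, W) masks, one per object.
    """
    colors = np.stack([rgb[m.astype(bool)].astype(np.float64).mean(axis=0) for m in masks])
    dist = np.linalg.norm(colors[:, None, :] - colors[None, :, :], axis=-1)  # (K, K) distances
    off_diag = dist[~np.eye(len(masks), dtype=bool)]                         # exclude diagonal entries
    return float(1.0 - off_diag.mean() / (255.0 * 3.0))
```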
Inter-object Shape Variation:
This factor aims to measure the relative geometry diversity between
all objects in the image. We firstly calculate the diagonal length of the bounding box for each object, and then compute the pair-wise absolute differences for all object diagonal lengths, obtaining a $K \times K$ matrix. The final inter-object shape variation is the average of the matrix excluding diagonal entries. The higher this factor is, the more diverse and imbalanced the object sizes within an image are, and therefore segmenting both gigantic and tiny objects is likely more challenging.
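The following sketch mirrors the description above; the [0,1] normalization applied in Figure 3 (detailed in the appendix) is omitted here.

```python
import numpy as np

def inter_object_shape_variation(masks):
    """Mean pairwise absolute difference of bounding-box diagonal lengths (unnormalized)."""
    diags = []
    for m in masks:
        ys, xs = np.nonzero(m)
        height = ys.max() - ys.min() + 1
        width = xs.max() - xs.min() + 1
        diags.append(np.hypot(height, width))           # bounding-box diagonal length
    diags = np.asarray(diags)
    diff = np.abs(diags[:, None] - diags[None, :])       # (K, K) pairwise differences
    off_diag = diff[~np.eye(len(masks), dtype=bool)]     # exclude diagonal entries
    return float(off_diag.mean())
```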
By capturing the appearance and geometry at both the object and scene levels, the four factors are designed to quantify the complexity of objects and images. For illustration, Figure 3 shows sample images for the four factors at different values. The higher the values, the more complex the objects and scenes. In fact, these factors are carefully selected from more than 10 candidates because they are empirically more suitable to differentiate the gaps between synthetic and real-world images, and they eventually serve as key indicators to diagnose existing unsupervised models in Section 4. Calculation details of the four factors and other candidates are in the appendix.

Figure 3: Sample objects and scenes for the four factors (object-level: Object Color Gradient, Object Shape Concavity; scene-level: Inter-object Color Similarity, Inter-object Shape Variation) at different complexity values. All complexity values are normalized to the range of [0,1].
3 Experimental Design
3.1 Considered Methods
A range of works have explored unsupervised object segmentation in recent years. They are typically formulated as (variational) autoencoders (AE/VAE) [36] or generative adversarial networks (GAN) [24]. GAN-based models [13; 3; 7; 56; 5; 58; 1] are usually limited to identifying a single foreground object and can hardly discover multiple objects due to training instabilities, and are therefore not considered in this paper. As shown in Table 1, the majority of existing models are based on AE/VAE and can be generally divided into two groups according to the object representation:
Factor-based models
: Each object is represented by explicit factors such as size, position, ap-
pearance, etc., and the whole image is a spatial organization of multiple objects. Basically, such
representation explicitly enforces objects to be bounded within particular regions.
Layer-based models
: Each object is represented by an image layer, i.e., a binary mask, and the
whole image is a spatial mixture of multiple object layers, as sketched below. Intuitively, this representation does not impose strict spatial constraints, and instead is more flexible in clustering similar pixels into objects.
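To make the layer-based representation concrete, the sketch below shows the generic spatial-mixture composition these models share: K decoded object layers are blended with pixel-wise normalized masks. Tensor names and shapes are illustrative and not taken from any particular baseline.

```python
import torch
import torch.nn.functional as F

def compose_layers(rgb_layers, mask_logits):
    """Blend K object layers into one image with a pixel-wise spatial mixture.

    rgb_layers:  (B, K, 3, H, W) per-layer RGB reconstructions.
    mask_logits: (B, K, 1, H, W) per-layer mask logits.
    """
    masks = F.softmax(mask_logits, dim=1)        # masks sum to 1 at every pixel
    image = (masks * rgb_layers).sum(dim=1)      # (B, 3, H, W) reconstructed image
    return image, masks
```

Factor-based models, by contrast, typically decode each object into an explicitly bounded region (e.g., a glimpse placed back into the image by a spatial transformer) rather than a full-image layer.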
In order to decompose the input images into objects, these approaches introduce different types
of network architecture, loss functions, and regularization terms as inductive biases. These biases
broadly include: 1) variational encoding which encourages the disentanglement of latent variables; 2)
iterative inference which likely ends up with better scene representations over occlusions; 3) object
relationship regularization such as depth estimation and autoregressive prior which aims at capturing
the dependency of multiple objects; and many other biases. With different combinations of these
biases, many methods have shown outstanding performance on synthetic datasets. Among them, we select 4 representative models for our investigation: 1) AIR [22], 2) MONet [8], 3) IODINE [25], and 4) SlotAtt [40]. We also add the fully-supervised Mask R-CNN [29] as an additional baseline for comprehensive comparison. Implementation details are provided in the appendix.
3.2 Considered Datasets
We consider two groups of datasets for extensive benchmarking and analysis: 1) three commonly-used synthetic datasets: dSprites [42], Tetris [34] and CLEVR [33]; 2) three real-world datasets: YCB [9], ScanNet [17], and COCO [38], representing the small-scale, indoor- and outdoor-level real scenes
respectively. Naturally, objects and scenes in different datasets tend to have very different types of
biases. For example, the objects in dSprites tend to have the single-color bias, while COCO does not.
Generally, the object-level biases can be divided into: 1) appearance biases including different textures
and lighting effects, and 2) geometry biases including the object shape and occlusions. Similarly,
the scene-level biases include: 1) appearance biases such as the color similarity between all objects,
and 2) geometry biases such as the diversity of all object shapes. In fact, our complexity factors
introduced in Section 2 are designed to capture these biases well. Table 2 qualitatively summarizes
the biases of selected datasets. We may hypothesize that the large gaps of biases between synthetic
and real-world datasets would have a huge impact on the effectiveness of existing models.
Table 1: Existing unsupervised models for object segmentation on single images. Each model
includes different inductive biases, such as variational autoencoding (VAE), iterative inference (Iter),
object relationship regularization (Rel), etc.
Factor-based Models | Inductive Biases (VAE/Iter/Rel) | Layer-based Models | Inductive Biases (VAE/Iter/Rel)
CST-VAE [31] ICLRW'16 | X | Tagger [26] NIPS'16 | X
AIR [22] NIPS'16 | X | RC [26] ICLRW'16 | X
SPAIR [16] AAAI'19 | X | NEM [27] NIPS'17 | X
SuPAIR [54] ICML'19 | X | MONet [8] arXiv'19 | X
GMIO [64] ICML'19 | X X | IODINE [25] ICML'19 | X X
ASR [62] NeurIPS'19 | X | ECON [57] ICLRW'20 | X X
SPACE [39] ICLR'20 | X | GENESIS [21] ICLR'20 | X X
GNM [32] NeurIPS'20 | X X | SlotAtt [40] NeurIPS'20 | X
SPLIT [11] arXiv'20 | X | GENESIS-V2 [20] NeurIPS'21 | X X
OCIC [2] arXiv'20 | X X | R-MONet [50] arXiv'21 | X X
GSGN [18] ICLR'21 | X X | CAE [41] arXiv'22 | X
Table 2: The object- and scene-level biases in appearance and geometry of the considered datasets.
| Bias | dSprites [42] (syn.) | Tetris [34] (syn.) | CLEVR [33] (syn.) | YCB [9] (real) | ScanNet [17] (real) | COCO [38] (real) |
| Object-level Appearance: Texture | simple | simple | simple | diverse | simple | diverse |
| Object-level Appearance: Lighting | no | no | synthetic | real | real | real |
| Object-level Geometry: Shape | simple | simple | simple | simple | diverse | diverse |
| Object-level Geometry: Occlusion | minor | no | minor | severe | severe | severe |
| Scene-level Appearance: Similarity | low | low | high | high | high | high |
| Scene-level Geometry: Diversity | low | low | low | high | high | high |
To guarantee the fairness and consistency of all experiments, we carefully prepare all six datasets using the same protocols below. Preparation details for each dataset are provided in the appendix.
• All images are rerendered or cropped to the same resolution of 128×128.
• Each image has about 2 to 6 solid objects with a blank background.
• Each dataset has about 10000 images for training and 2000 images for testing.
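The snippet below is a purely hypothetical sketch of how one annotated image could be converted to follow these protocols; the function name, the black background, the object-count filter, and the resizing choices are all assumptions rather than the released preparation code.

```python
import cv2
import numpy as np

def prepare_sample(rgb, masks, out_size=128, min_objects=2, max_objects=6):
    """Hypothetical preparation step: keep images with 2-6 annotated objects,
    paste the object pixels onto a blank background, and resize to 128x128."""
    if not (min_objects <= len(masks) <= max_objects):
        return None                                       # skip images outside the object-count range
    canvas = np.zeros_like(rgb)                           # blank (black) background
    for m in masks:
        canvas[m.astype(bool)] = rgb[m.astype(bool)]      # keep only annotated object pixels
    canvas = cv2.resize(canvas, (out_size, out_size), interpolation=cv2.INTER_AREA)
    out_masks = [cv2.resize(m.astype(np.uint8), (out_size, out_size),
                            interpolation=cv2.INTER_NEAREST) for m in masks]
    return canvas, out_masks
```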
3.3 Considered Metrics
Having the six representative datasets and four existing unsupervised methods at hand, we choose the following metrics to evaluate object segmentation performance: 1) the AP score, which is widely used for object detection and segmentation [23]; 2) the PQ score, which is used to measure non-overlapping panoptic segmentation [37]; and 3) Precision and Recall scores. A predicted mask is considered correct if its IoU against a ground truth mask is above 0.5. All objects are treated as a single class. The blank background is not taken into account for fair comparison. To compute AP, we simply treat the mean value of the soft object mask as the object confidence score. Note that the alternative metrics ARI [45] and segmentation covering (SC) [4] are not considered as they can be easily saturated.
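As a concrete illustration of the matching rule and the confidence heuristic, consider the sketch below; the 0.5 threshold used to binarize soft masks and the reading of "mean value of the soft object mask" as a mean over all pixels are our assumptions.

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 0.0

def score_and_match(soft_mask, gt_masks, iou_thresh=0.5, bin_thresh=0.5):
    """Return (confidence, is_correct) for one predicted soft mask.

    confidence: mean of the soft mask values (one reading of the paper's heuristic).
    is_correct: True if the binarized mask overlaps some GT mask with IoU > iou_thresh.
    """
    confidence = float(soft_mask.mean())
    pred = soft_mask > bin_thresh                      # binarize the soft mask (assumed threshold)
    is_correct = any(mask_iou(pred, gt) > iou_thresh for gt in gt_masks)
    return confidence, is_correct
```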
4 Key Experimental Results
4.1 Can current unsupervised models succeed on real-world datasets?
First of all, we evaluate all baselines on our six datasets. In particular, we train each model from scratch on each dataset separately. For fair evaluation, we carefully tune the hyperparameters of each model on every dataset and fully optimize the networks until convergence. Figure 4 compares
the quantitative results. It can be seen that all methods demonstrate satisfactory segmentation results
on synthetic datasets, especially the recent strong baselines IODINE and SlotAtt. However, not
surprisingly, all unsupervised methods fail catastrophically on the three real-world datasets.