In this paper, we propose a method that lies in between these two classes, i.e. single-image-based and
video-based methods. Our model learns to segment objects in still images, and is thus based on appearance,
but it is trained without supervision, using video as the learning signal. The learning process can
be summarized as follows. Given an image, each pixel is assigned to a slot that represents a certain
object. The quality of the assignments is then measured by the coherence of the (unobserved) optical
flow within the extracted regions. Because predicting optical flow from a still image is intrinsically
ambiguous, the method models distributions of probable flow patterns within each region. The idea is
that (rigid) objects generate characteristic flow patterns that can be used to distinguish between them.
Note that, because the segmentation network operates on a single image, it learns to partition all
objects contained in the scene, not just the ones that actually move in the video, i.e. it solves an object
instance segmentation problem rather than a motion segmentation one.
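To make this training signal concrete, the following is a minimal sketch (our illustration, not the authors' code): a segmentation network produces soft per-pixel slot assignments from a single frame, and each slot is scored by how well the flow estimated from the video is explained by a simple parametric flow model inside that region. The function name, the affine model choice, and the weighted least-squares fit are all illustrative assumptions.

```python
# Illustrative sketch of a flow-coherence training signal (not the authors'
# implementation). Soft masks come from a segmentation network applied to a
# single image; the flow comes from the video and is used only as a target.
import torch

def flow_coherence_loss(masks, flow, coords):
    """masks: (K, H, W) soft slot assignments summing to 1 over K;
    flow: (2, H, W) optical flow; coords: (2, H, W) pixel coordinates."""
    K, H, W = masks.shape
    f = flow.reshape(2, -1).T                        # (N, 2) flow vectors
    x = coords.reshape(2, -1).T                      # (N, 2) pixel positions
    X = torch.cat([x, torch.ones(H * W, 1)], dim=1)  # homogeneous coords (N, 3)
    loss = 0.0
    for k in range(K):
        w = masks[k].reshape(-1, 1)
        sw = w.sqrt()
        # Weighted least-squares fit of an affine flow model f ~ X @ A_k,
        # weighting each pixel by its soft membership in slot k.
        A = torch.linalg.lstsq(X * sw, f * sw).solution   # (3, 2)
        residual = ((X @ A - f) ** 2).sum(dim=1, keepdim=True)
        loss = loss + (w * residual).sum()
    return loss / (H * W)

# Toy usage with random inputs: K=4 slots on a 32x32 grid.
H, W, K = 32, 32, 4
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")
coords = torch.stack([xs, ys])
masks = torch.softmax(torch.randn(K, H, W), dim=0)
flow = torch.randn(2, H, W)
print(flow_coherence_loss(masks, flow, coords))
```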
We derive closed-form distributions for the flow field generated by rigid objects moving in the scene,
together with efficient expressions for evaluating the probability of an observed flow under these models.
The problem of decomposing the image into regions is then cast as a standard image segmentation
task, for which any off-the-shelf segmentation network can be used.
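For intuition, here is one way such a closed form can arise (our notation, under the simplifying assumption of a per-region parametric flow model with a Gaussian prior on its parameters; the paper's exact derivation may differ). Writing $f_{R_k}$ for the stacked flow inside region $R_k$ and $M_k$ for the design matrix built from its pixel coordinates, the linear-Gaussian model marginalizes in closed form:

```latex
% Illustrative linear-Gaussian marginalization, not the paper's derivation.
f_{R_k} = M_k \theta_k + \epsilon, \qquad
\theta_k \sim \mathcal{N}(\mu, \Sigma), \qquad
\epsilon \sim \mathcal{N}(0, \sigma^2 I)
\;\Longrightarrow\;
p(f_{R_k}) = \int \mathcal{N}(f_{R_k}; M_k\theta, \sigma^2 I)\,
             \mathcal{N}(\theta; \mu, \Sigma)\, d\theta
           = \mathcal{N}\!\big(f_{R_k};\; M_k\mu,\; M_k \Sigma M_k^\top + \sigma^2 I\big).
```

Because $\Sigma$ is low-dimensional, the resulting Gaussian log-density can be evaluated efficiently, for instance via the matrix inversion lemma, rather than by inverting the full per-region covariance.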
As our method uses videos to train an image-based model, we introduce two new datasets which are
straightforward video extensions of the existing image datasets CLEVR [25] and CLEVRTEX [27].
These datasets are built by animating the objects with initial velocities and using a physics simulation
to generate realistic object movement. Our datasets are constructed under the realistic assumption that
not all objects move at all times, so motion cannot serve as the sole cue for objectness. This reflects
scenarios such as a workbench, where a person interacts with only a small number of objects at a time.
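As a rough sketch of this recipe (assumed tooling, not the authors' generation pipeline), a physics engine such as PyBullet can integrate the motion after only a random subset of objects is given an initial velocity; the shapes and parameters below are placeholders.

```python
# Toy sketch of the dataset recipe: give some objects an initial velocity,
# let a physics engine integrate the motion, and leave the rest static.
import random
import pybullet as p

p.connect(p.DIRECT)                      # headless physics simulation
p.setGravity(0, 0, -9.8)

plane = p.createMultiBody(               # static ground plane (mass 0)
    baseCollisionShapeIndex=p.createCollisionShape(p.GEOM_PLANE))
bodies = []
for i in range(6):                       # six spheres standing in for objects
    sphere = p.createCollisionShape(p.GEOM_SPHERE, radius=0.2)
    body = p.createMultiBody(baseMass=1.0, baseCollisionShapeIndex=sphere,
                             basePosition=[i * 0.6 - 1.5, 0, 0.2])
    bodies.append(body)

# Only some objects move, so motion alone cannot signal objectness.
for body in random.sample(bodies, k=3):
    p.resetBaseVelocity(body, linearVelocity=[random.uniform(-2, 2),
                                              random.uniform(-2, 2), 0])

for _ in range(240):                     # simulate ~1 s at 240 Hz
    p.stepSimulation()                   # a real pipeline would render frames here
```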
Empirically, we validate our model against several ablations and baselines. We compare our approach
to existing unsupervised multi-object segmentation methods, achieving state-of-the-art performance.
We demonstrate particularly strong performance in visually complex scenes, even with objects and
textures unseen at test time. Comparisons with image-based models, and in particular adding our
motion-aware formulation to existing models, show substantial improvements, confirming
that motion is an important cue for learning objectness. Furthermore, we show that our learned segmenter,
which operates on still images, produces better segmentations than current video-based methods
that use motion information at test time. Finally, we apply our method to real-world self-driving
scenarios, where it outperforms prior work.
2 Related Work
Multi-object decomposition.
Learning unsupervised object segmentation for static scenes is a
well-researched problem in computer vision [5, 9, 11, 12, 13, 14, 17, 18, 23, 34, 37, 47, 48, 58].
These methods aim to decompose the scene into constituent parts, e.g. the different foreground
objects and the background. Glimpse-based methods [9, 14, 23, 34] find input patches (glimpses) that
contain the objects in the scene. These methods learn object descriptors that encode their properties
(e.g. position, number, size of the objects) using variational inference, composing glimpses into the
final picture. More related to ours are approaches that learn per-pixel object masks [5, 11, 12, 13, 37].
MONet [5] and IODINE [19] employ multiple encoding-decoding steps to sequentially explain the
scene as a collection of regions. Slot Attention [37] uses a multi-step soft clustering-like scheme to
find the regions simultaneously. In all cases, learning is posed as an image reconstruction problem.
In order to align learnable slots with semantic objects, models have to make efficient use of the limited
representation available for each region, e.g. by learning to explain only visual appearance. This
principle, however, is difficult to extend to visually complex data [27] and relies on custom, specialized
architectures. Instead, our method allows any standard segmentation architecture to be used,
which we train to predict regions that are most likely described by rigid motion patterns.
Video-based multi-object decomposition.
Another line of work extends the unsupervised object
decomposition problem to videos [21, 24, 26, 29, 30, 31, 33, 46, 54, 56, 57, 61]. Many of these
methods work mainly with simpler datasets [30, 46, 61] and require sequential frames for training.
For example, SCALOR [24] is a glimpse-based method that discovers and propagates objects across
frames to learn intermediate object latents. SIMONe [26] processes the whole video at once, learning
a temporal scene representation and time-invariant object representations simultaneously. Slot
Attention for Video (SAVi) [28] poses the multi-object problem as optical-flow prediction using