Unsupervised Multi-object Segmentation
by Predicting Probable Motion Patterns
Laurynas Karazija, Subhabrata Choudhury,
Iro Laina, Christian Rupprecht, Andrea Vedaldi
Visual Geometry Group
University of Oxford
Oxford, UK
{laurynas,subha,iro,chrisr,vedaldi}@robots.ox.ac.uk
Abstract
We propose a new approach to learn to segment multiple image objects without
manual supervision. The method can extract objects from still images, but uses
videos for supervision. While prior works have considered motion for segmen-
tation, a key insight is that, while motion can be used to identify objects, not
all objects are necessarily in motion: the absence of motion does not imply the
absence of objects. Hence, our model learns to predict image regions that are
likely to contain motion patterns characteristic of objects moving rigidly. It does
not predict specific motion, which cannot be done unambiguously from a still
image, but a distribution of possible motions, which includes the possibility that
an object does not move at all. We demonstrate the advantage of this approach
over its deterministic counterpart and show state-of-the-art unsupervised object
segmentation performance on simulated and real-world benchmarks, surpassing
methods that use motion even at test time. As our approach is applicable to a variety of network architectures that segment scenes, we also apply it to existing image reconstruction-based models, showing drastic improvements. Project page and code:
https://www.robots.ox.ac.uk/~vgg/research/ppmp.
1 Introduction
Humans have an innate ability to segment individual objects in a picture, but learning this capability
with an algorithm usually relies on manual supervision. In this paper, we consider the problem of
learning to segment objects from visual data alone, without externally provided labels. Algorithms
for this task usually assume that objects are seen in different configurations and in front of different
backgrounds. They then exploit cues such as the visual consistency and the co-occurrence of
characteristic object parts to learn to discover and segment individual object instances.
Most such methods use still images as input and train with a reconstruction objective [19, 23, 37].
They work well on simple synthetic scenes, but struggle in scenes with more complex visual appearance [27]. This has motivated the development of algorithms that use videos as input and can thus observe the motion of the objects as evidence of their presence. A common way of using motion for unsupervised learning is to seek a compact representation that reconstructs the video itself [24, 26, 33, 55]. Effectively, such methods seek a compressed representation of appearance, but do not entirely sidestep the difficult task of modeling it. This has motivated authors to look instead at reconstructing the video's optical flow [28, 59]. In fact, optical flow measures the motion of the objects directly and is much simpler to model than the objects' appearance.
Authors contributed equally.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
In this paper, we propose a method that lies in-between these two classes, i.e. single image-based and
video-based. Our model learns to segment objects in still images, and is thus based on appearance, but
learns to do so using video as a learning signal, in an unsupervised manner. The learning process can
be summarized as follows. Given an image, each pixel is assigned to a slot that represents a certain
object. The quality of the assignments is then measured by the coherence of the (unobserved) optical
flow within the extracted regions. Because predicting optical flow from a still image is intrinsically
ambiguous, the method models distributions of probable flow patterns within each region. The idea is
that (rigid) objects generate characteristic flow patterns that can be used to distinguish between them.
Note that, because the segmentation network is based on a single image, it will learn to partition all
objects contained in the scene, not just the ones that actually move in the video, i.e. solving an object
instance segmentation problem, rather than motion segmentation.
We derive closed-form distributions for the flow field generated by rigid objects moving in the scene.
We also derive efficient expressions for the calculation of the flow probability under such models. The
problem of decomposing the image into a number of regions is cast as a standard image segmentation
task and an off-the-shelf neural network can be used for it.
As our method uses videos to train an image-based model, we introduce two new datasets which are
straightforward video extensions of the existing image datasets CLEVR [25] and CLEVRTEX [27].
These datasets are built by animating the objects with initial velocities and using a physics simulation
to generate realistic object movement. Our datasets are constructed with the realistic assumption that
not all objects are moving at all times. This means that motion alone cannot be used as the sole cue
for objectness and reflects scenarios such as a workbench where a person only interacts with a small
number of objects at a time.
Empirically, we validate our model against several ablations and baselines. We compare our approach to existing unsupervised multi-object segmentation methods, achieving state-of-the-art performance. We demonstrate particularly strong performance in visually complex scenes, even with unseen objects and textures at test time. Our experiments in comparison to image-based models, and in particular adding our motion-aware formulation to existing models, show substantial improvements, confirming that motion is an important cue for learning objectness. Furthermore, we show that our learned segmenter, which operates on still images, produces better segmentation results than current video-based methods that use motion information at test time. Finally, we also apply our method to real-world self-driving scenarios, where we show superior performance to prior work.
2 Related Work
Multi-object decomposition. Learning unsupervised object segmentation for static scenes is a well-researched problem in computer vision [5, 9, 11, 12, 13, 14, 17, 18, 23, 34, 37, 47, 48, 58].
These methods aim to decompose the scene into constituent parts, e.g. the different foreground objects and the background. Glimpse-based methods [9, 14, 23, 34] find input patches (glimpses) that contain the objects in the scene. These methods learn object descriptors that encode their properties (e.g. position, number, size of the objects) using variational inference, composing glimpses into the final picture. More related to ours are approaches that learn per-pixel object masks [5, 11, 12, 13, 37].
MONet [5] and IODINE [19] employ multiple encoding-decoding steps to sequentially explain the scene as a collection of regions. Slot Attention [37] uses a multi-step soft clustering-like scheme to find the regions simultaneously. In all cases, learning is posed as an image reconstruction problem. In order to align learnable slots with semantic objects, models have to make efficient use of the limited representation available for each region, such as learning to explain only visual appearance. This principle, however, is difficult to extend to visually complex data [27] and relies on custom specialized architectures. Instead, our method allows any standard segmentation architecture to be used, which we train to predict regions that are most likely described by rigid motion patterns.
Video-based multi-object decomposition. Another line of work extends the unsupervised object decomposition problem to videos [21, 24, 26, 29, 30, 31, 33, 46, 54, 56, 57, 61]. Many of these methods work mainly with simpler datasets [30, 46, 61] and require sequential frames for training. For example, SCALOR [24] is a glimpse-based method that discovers and propagates objects across frames to learn intermediate object latents. SIMONe [26] processes the whole video at once, learning both a temporal scene representation and time-invariant object representations simultaneously. Slot Attention for Video (SAVi) [28] poses the multi-object problem as optical-flow prediction using sequential frames as input. The internal slot-attention mechanism drives the network to learn regions that move in a simple and consistent manner. Unlike our work, it does not assume a specific motion model but relies on directly regressing the flow. It is computationally more expensive and struggles when only one or a few frames are available.
Unsupervised video object segmentation. Unsupervised video object segmentation (VOS) is a popular problem in computer vision [10, 15, 32, 39, 44, 51, 53, 59, 60] that focuses on extracting the most salient object in the scene. Many approaches treat the problem as a motion segmentation task, as the background typically shows a dominant motion independent of the salient object. Motion Grouping [59] employs the Slot Attention architecture to reconstruct optical flow from itself, avoiding appearance information entirely. Another related line of work [8, 41, 42, 52] employs approximate motion models. These approaches rely on a point estimate of the motion model parameters. In contrast, we adopt a more principled probabilistic approach, placing a prior on the motion parameters and integrating them out. To deal with flow outliers that do not conform to a rigid motion model, Mahendran et al. [41] use a histogram matching-based loss, and GWM [8] over-segments the scene, relying on spectral clustering to produce a binary segmentation during inference. Instead, we model the noise in our formulation directly. Finally, Meunier et al. [42] rely on flow as input, limiting the method to videos only.
3 Method
Let a frame $\mathcal{I} \in \mathbb{R}^{3\times H\times W}$ of a video and its optical flow $f \in \mathbb{R}^{2\times H\times W}$ be defined on the $H \times W$ lattice. The optical flow is a local summary of the motion from one frame to the next. We use it to supervise a network $\Phi$ that, given the (single) image $\mathcal{I}$ as input, predicts soft assignments of each pixel to up to $K$ different image regions, outputting an $H \times W$ collection of probability vectors $\Phi(\mathcal{I}) \in \hat{\Delta}_K^{H\times W} \subset [0,1]^{K\times H\times W}$, where $\hat{\Delta}_K$ is the $(K-1)$-dimensional simplex. The quality of the regions is measured based on how likely they are to contain flow patterns typical of the motion of independent objects.
In more detail, we represent the predicted image regions by a hard $K$-way pixel assignment (mask) $m \in \Delta_K^{H\times W} \subset \{0,1\}^{K\times H\times W}$, where $\Delta_K$ is the space of $K$-dimensional one-hot vectors. Each mask is a sample from the categorical distribution output by the network, i.e. $m \sim p_\Phi(m \mid \mathcal{I}) = \mathrm{Categorical}[\Phi(\mathcal{I})]$. Note that there is one categorical distribution for each pixel and that these are mutually independent.
We then assume that the flow depends only on the regions, in the sense that $p_\Phi(f, m \mid \mathcal{I}) = p(f \mid m)\, p_\Phi(m \mid \mathcal{I})$, where $p(f \mid m)$ is a model of the distribution of the flow field given the regions. The log-likelihood of the model $\Phi$ is bounded from below by:
$$\log p_\Phi(f \mid \mathcal{I}) = \log \mathbb{E}_{m \sim p_\Phi(m \mid \mathcal{I})}\left[p(f \mid m)\right] \geq \mathbb{E}_{m \sim p_\Phi(m \mid \mathcal{I})}\left[\log p(f \mid m)\right].$$
Furthermore, inspired by the ELBO, we regularize the model's prediction $p_\Phi(m \mid \mathcal{I})$ by taking its KL divergence from a uniform prior $p_0(m)$, obtaining the learning objective
$$\mathcal{L}_\beta = -\mathbb{E}_{m \sim p_\Phi(m \mid \mathcal{I})}\left[\log p(f \mid m)\right] + \beta\, D_{\mathrm{KL}}\!\left(p_\Phi(m \mid \mathcal{I}) \,\|\, p_0(m)\right). \tag{1}$$
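For concreteness, the objective can be estimated with Monte-Carlo samples roughly as follows. This is a minimal PyTorch sketch under our own naming, not the authors' code: `log_p_flow` stands in for the motion model $p(f \mid m)$ derived below, the value of $\beta$ is a placeholder, and the hard samples make this version non-differentiable, which the Gumbel-Softmax relaxation described later addresses.

```python
import math
import torch
from torch.distributions import OneHotCategorical

def L_beta(logits, flow, log_p_flow, beta=0.01, n_samples=3):
    """Monte-Carlo estimate of Eq. (1) for one image.

    logits:     (K, H, W) unnormalized region scores Phi(I).
    flow:       (2, H, W) optical flow used as supervision.
    log_p_flow: callable (m, flow) -> log p(f | m), the motion model below.
    """
    probs = logits.softmax(dim=0)                        # per-pixel categorical
    dist = OneHotCategorical(probs=probs.permute(1, 2, 0))
    nll = 0.0
    for _ in range(n_samples):
        m = dist.sample().permute(2, 0, 1)               # hard (K, H, W) masks
        nll = nll - log_p_flow(m, flow) / n_samples
    # KL from the per-pixel categorical to the uniform prior p0(m).
    K = logits.shape[0]
    kl = (probs * (probs.clamp_min(1e-12).log() + math.log(K))).sum(0).mean()
    return nll + beta * kl
```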
Next, we introduce the closed-form motion model $p(f \mid m)$ in Eq. (1) and then explain how the Gumbel-Softmax trick can be used to train the network.
Approximate motion models for optical flow. We now turn to describing the models of motion used in our work, which play a role in assessing the likelihood $p(f \mid m)$ of the optical flow. Optical flow measures the coordinate change of pixels between neighboring frames, which arises due to the motion of the camera and objects. We consider the rigid-body motion of some object $k$.
Let $\mathbf{x}_k^t, \mathbf{y}_k^t \in \mathbb{R}^{n_k}$ be the spatial locations of the pixels belonging to region/object $k$ at time $t$, where $n_k$ is the number of pixels in the region. For convenience, we stack the coordinates in a single vector $\Omega_k^t = (\mathbf{x}_k^t, \mathbf{y}_k^t) \in \mathbb{R}^{2n_k}$. The pixels comprising this object undergo a coordinate change from $\Omega_k^t$ to $\Omega_k^{t+1}$, giving rise to the optical flow for this object as $f_k = \Omega_k^{t+1} - \Omega_k^t$. We assume this underlying 3D rigid-body motion can be approximated using a linear 2D parametric model $\Pi_\theta$ with parameters $\theta$, so that:
$$\Omega_k^{t+1} = \Pi_\theta(\Omega_k^t) + \epsilon, \qquad f_k = \Pi_\theta(\Omega_k^t) - \Omega_k^t + \epsilon, \tag{2}$$
where $\epsilon$ captures the residual error of the approximation. Several forms of models are available (see [1, 4] for an overview). Here, we consider two such models: the translation of an object within the camera plane, and an affine motion, given respectively by the linear functions:
$$\Pi^{\mathrm{tr}}_\theta(\Omega_k^t) = \Omega_k^t + \underbrace{\begin{pmatrix} \mathbf{1}_{n_k} & \mathbf{0} \\ \mathbf{0} & \mathbf{1}_{n_k} \end{pmatrix}}_{P^{\mathrm{tr}}_k} \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}, \qquad \Pi^{\mathrm{aff}}_\theta(\Omega_k^t) = \underbrace{\begin{pmatrix} \mathbf{x}^t & \mathbf{y}^t & \mathbf{1}_{n_k} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{x}^t & \mathbf{y}^t & \mathbf{1}_{n_k} \end{pmatrix}}_{P^{\mathrm{aff}}_k} \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_6 \end{pmatrix}, \tag{3}$$
where $\mathbf{1}_{n_k}$ is a vector of $n_k$ ones and the matrix $P_k$ contains the coefficients of the model.
The affine model supports object rotation, scaling and shearing in addition to translation. It is often a
sufficient approximation to real-world optical flow, provided the objects are rigid, convex, and mainly
rotating in-plane.
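As an illustration, the two basis matrices in Eq. (3) can be assembled from the pixel coordinates along the following lines (a PyTorch sketch under the stacked-coordinate convention $\Omega = (\mathbf{x}, \mathbf{y})$; the function names are ours):

```python
import torch

def translation_basis(n_k: int) -> torch.Tensor:
    """P_tr in Eq. (3): maps theta = (theta_1, theta_2) to a per-pixel shift."""
    ones, zeros = torch.ones(n_k), torch.zeros(n_k)
    # Stacked-coordinate convention: Omega = (x_1..x_nk, y_1..y_nk).
    col1 = torch.cat([ones, zeros])   # multiplies theta_1 (x shift)
    col2 = torch.cat([zeros, ones])   # multiplies theta_2 (y shift)
    return torch.stack([col1, col2], dim=1)                      # (2*n_k, 2)

def affine_basis(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """P_aff in Eq. (3); x, y are the n_k pixel coordinates of region k."""
    ones, zeros = torch.ones_like(x), torch.zeros_like(x)
    top = torch.stack([x, y, ones, zeros, zeros, zeros], dim=1)  # new x
    bot = torch.stack([zeros, zeros, zeros, x, y, ones], dim=1)  # new y
    return torch.cat([top, bot], dim=0)                          # (2*n_k, 6)
```

A quick consistency check: multiplying `affine_basis(x, y)` by $\theta = (1, 0, 0, 0, 1, 0)^\top$ reproduces $\Omega$ itself, i.e. zero flow, which matches the no-motion prior mean $\mu$ used later.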
We can then use the motion equations (2) to construct the distribution $p(f \mid m)$ by assuming a prior on the motion parameters and by marginalizing over it. Specifically, denote by $m_k$ the $k$-th slice of the tensor $m$ encoding the regions (i.e. the mask of the $k$-th region). We assume that regions are statistically independent and decompose the log-likelihood $\log p(f \mid m)$ as:
$$\log p(f \mid m) = \sum_k \log p(f_k \mid m_k) = \sum_k \log \int p(f_k, \theta_k \mid m_k)\, d\theta_k. \tag{4}$$
Assuming that each object has i.i.d. parameters $\theta_k$ with a Gaussian prior $\mathcal{N}(\theta; \mu, \Sigma)$, and assuming $\epsilon$ is zero-mean Gaussian noise with variance $\sigma^2$, Eq. (2) gives the marginal optical flow likelihood for segment $k$:
$$p(f_k \mid m_k) = \mathcal{N}\!\left(f_k;\ \Pi_\mu(\Omega_k) - \Omega_k,\ P_k \Sigma P_k^\top + \sigma^2 I\right), \tag{5}$$
where $I$ is the identity matrix. A practical issue with Eq. (5) is that, if segment $k$ contains $n_k = \sum_i (m_k)_i$ pixels, then the covariance matrix $P_k \Sigma P_k^\top + \sigma^2 I$ has dimension $2n_k \times 2n_k$. Inverting such a matrix in the evaluation of the Gaussian log-density is very slow except for very small regions. Furthermore, it is not obvious how to relax Eq. (4) to support gradient-based learning, e.g. through the Gumbel-Softmax approximation. We solve these problems in the next section.
Expressions for the likelihood. We now derive expressions for Eq. (5) that are efficient and lead to a natural relaxation for use in Gumbel-Softmax sampling. Given the definitions $F_k = f_k - \Pi_\mu(\Omega_k) + \Omega_k$ and $\Lambda = \Sigma^{-1}$, we can rewrite Eq. (5) as:
$$p(f_k \mid m_k) = \left( (2\pi\sigma^2)^{2n_k}\, \frac{\det S_k}{\det \Lambda} \right)^{-\frac{1}{2}} e^{-\frac{d^2}{2\sigma^2}}, \qquad d^2 = F_k^\top F_k - \frac{1}{\sigma^2}\, F_k^\top P_k S_k^{-1} P_k^\top F_k, \tag{6}$$
where $S_k = \frac{1}{\sigma^2} P_k^\top P_k + \Lambda$. The significant advantage of this form is that it involves the computation of the inverse and determinant of the matrix $S_k$, whose size is only $2\times 2$ (for the translation model) or $6\times 6$ (for the affine one), instead of the much larger $2n_k \times 2n_k$.
We can more explicitly introduce the dependency on the region assignments $m$ by defining selector matrices $R_k \in \{0,1\}^{2n_k \times 2n}$ (with $n = \sum_k n_k = HW$) that extract the $x$ and $y$ coordinates of the pixels that belong to the corresponding region, i.e. $\Omega_k = R_k \Omega$. We can then also write $F_k = R_k F$ and $P_k = R_k P$. Furthermore, the product of the selectors $L_k = R_k^\top R_k \in \{0,1\}^{2n \times 2n}$ can be written directly as a function of the assignment $m$ as $L_k(m) = \mathrm{diag}(m_k, m_k)$. Plugging these back into Eq. (6), we obtain expressions involving $L_k$ only:
$$n_k = \tfrac{1}{2}\,|L_k|_1, \qquad S_k = \frac{1}{\sigma^2} P^\top L_k P + \Lambda, \qquad d^2 = F^\top L_k F - \frac{1}{\sigma^2}\,(F^\top L_k P)\, S_k^{-1}\, (P^\top L_k F). \tag{7}$$
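A direct transcription of Eqs. (6) and (7) then only ever factorizes $D \times D$ matrices, with $D \in \{2, 6\}$. The sketch below is our own reading of the formulas, assuming the same stacked $(\mathbf{x}, \mathbf{y})$ ordering as above; it also accepts relaxed masks, which is what makes the Gumbel-Softmax substitution discussed later straightforward.

```python
import math
import torch

def segment_log_likelihood(F_vec, P, mk, Lam, sigma2):
    """log p(f_k | m_k) via Eqs. (6)-(7).

    F_vec: (2n,) residual flow F = f - Pi_mu(Omega) + Omega, stacked (x, y).
    P:     (2n, D) model basis (D=2 for translation, D=6 for affine).
    mk:    (n,) mask of region k; relaxed values in [0, 1] also work.
    Lam:   (D, D) prior precision Lambda = Sigma^{-1}.
    """
    w = torch.cat([mk, mk])                      # diagonal of L_k = diag(mk, mk)
    n_k = mk.sum()                               # = |L_k|_1 / 2
    S = P.T @ (w[:, None] * P) / sigma2 + Lam    # (D, D), cheap to factorize
    PtLF = P.T @ (w * F_vec)                     # (D,) projected residual
    d2 = F_vec @ (w * F_vec) - PtLF @ torch.linalg.solve(S, PtLF) / sigma2
    return (-n_k * math.log(2 * math.pi * sigma2)
            - 0.5 * (torch.logdet(S) - torch.logdet(Lam))
            - d2 / (2 * sigma2))
```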
Translation-only model. Further simplifications are possible for specific models. For instance, for the translation-only model, assuming that $\Lambda = \mathrm{diag}(1/\tau^2, 1/\tau^2)$, then $S_k = \frac{1}{\sigma^2}\,\mathrm{diag}\!\left(n_k + \sigma^2/\tau^2,\ n_k + \sigma^2/\tau^2\right)$ and, after some calculations, we obtain the expression:
$$\log p(f \mid m) = -n \log 2\pi\sigma^2 - \sum_k \log \frac{n_k + \sigma^2/\tau^2}{\sigma^2/\tau^2} - \frac{1}{2\sigma^2}\, F^\top \Big( I - \sum_k \frac{1}{n_k + \sigma^2/\tau^2} \begin{pmatrix} m_k m_k^\top & 0 \\ 0 & m_k m_k^\top \end{pmatrix} \Big) F.$$
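Written out, the translation-only log-likelihood needs only mask-weighted sums of the flow, with no matrix factorizations at all. A sketch following the expression above (whose signs we reconstructed from the general case, so treat the conventions as an assumption to be checked against the Appendix):

```python
import math
import torch

def translation_log_likelihood(Fx, Fy, masks, sigma2, tau2):
    """Closed-form log p(f | m) for the translation-only model.

    Fx, Fy: (n,) x and y components of the residual flow F.
    masks:  (K, n) hard or relaxed masks, summing to one over K per pixel.
    """
    n = Fx.shape[0]
    nk = masks.sum(dim=1)                           # (K,) region sizes
    r = sigma2 / tau2                               # noise-to-prior ratio
    quad = Fx @ Fx + Fy @ Fy                        # F^T I F
    for k in range(masks.shape[0]):
        mk = masks[k]
        quad = quad - ((mk @ Fx) ** 2 + (mk @ Fy) ** 2) / (nk[k] + r)
    return (-n * math.log(2 * math.pi * sigma2)
            - torch.log((nk + r) / r).sum()
            - quad / (2 * sigma2))
```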
Affine model. For the affine model, the expression for $\log p(f \mid m)$ does not simplify as much. Still, by exploiting the structure of the matrix $P^{\mathrm{aff}}_k$, we can reduce the calculations to the computation of inverses and determinants of small $3\times 3$ matrices, which can be implemented efficiently in closed form. Please see the Appendix for the derivation. Unless otherwise stated, the mean vector is set to $\mu = (1\ 0\ 0\ 0\ 1\ 0)^\top$, centering the prior on the no-motion point.
Gumbel-Softmax. In order to train the network using gradient descent, we need a differentiable version of the loss (1). To do so, we use the re-parametrizable Gumbel-Softmax relaxation [22, 40]. The Gumbel-Softmax relaxation replaces categorical samples $m \in \Delta_K^{H\times W}$ from the distribution $p_\Phi(m \mid \mathcal{I}) = \mathrm{Categorical}[\Phi(\mathcal{I})]$ with continuous samples $\hat{m} \in \hat{\Delta}_K^{H\times W}$ from the distribution $\mathrm{GumbelSoftmax}[\Phi(\mathcal{I})]$. We take $N = 3$ samples from this distribution to evaluate the expected negative log-likelihood, further reducing variance. Then we simply substitute $\hat{m}$ for $m$ in Eq. (1), leading to differentiable quantities.
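In PyTorch, this amounts to drawing the relaxed samples with `torch.nn.functional.gumbel_softmax`, e.g. as below (the temperature `tau` is a hyper-parameter we assume here, not a value taken from the paper):

```python
import torch
import torch.nn.functional as F

def sample_relaxed_masks(logits, n_samples=3, tau=1.0):
    """Differentiable stand-ins for the hard masks m in Eq. (1).

    logits: (K, H, W) = Phi(I). Returns (n_samples, K, H, W) soft masks
    lying in the relaxed simplex over the K slots at every pixel.
    """
    return torch.stack([
        F.gumbel_softmax(logits, tau=tau, hard=False, dim=0)
        for _ in range(n_samples)
    ])
```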
Post-processing. Eq. (5) naturally encourages the model to form larger regions to explain parts of the scene that move in a consistent (under the assumed prior) manner. However, we find that this can also lead to the model grouping together objects that only coincidentally move together (e.g. all objects mostly falling due to gravity in one of the datasets). Furthermore, optical flow is ambiguous around object edges and occlusions. To address both the object grouping and the occlusion boundary issue, we use a simple post-processing step: we isolate connected components in the model output, selecting the $K$ largest masks, discarding any that are smaller than 0.1% of the image area, and combining the left-over and discarded ones with the largest mask overall.
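A sketch of this step using `scipy.ndimage.label` (our own reading of the procedure; the exact merge rule may differ in the released code):

```python
import numpy as np
from scipy import ndimage

def postprocess(seg: np.ndarray, K: int, min_frac: float = 0.001) -> np.ndarray:
    """Split predicted regions into connected components and keep the K largest.

    seg: (H, W) integer map of argmax region ids from the network.
    Components smaller than min_frac of the image area, and any beyond the
    K largest, are merged into the largest component overall (id 0).
    """
    H, W = seg.shape
    comps = []
    for rid in np.unique(seg):
        lab, n = ndimage.label(seg == rid)        # per-region components
        for c in range(1, n + 1):
            comps.append(lab == c)
    comps.sort(key=lambda c: c.sum(), reverse=True)
    out = np.zeros((H, W), dtype=np.int64)        # 0 = largest component
    kept = 0
    for c in comps[1:]:
        if kept + 1 < K and c.sum() >= min_frac * H * W:
            kept += 1
            out[c] = kept
        # else: stays merged with component 0 (the largest mask)
    return out
```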
Warp loss. Occasionally, the optical flow used to supervise our model can be noisy, as it is estimated by other methods. This noise is also unlikely to be isotropic, as some surfaces are easier to estimate than others. Rather than supporting heterogeneous noise and approximation error (Eq. (2)), we instead prioritize parts of the scene covered by higher-quality flow. To this end, we introduce an additional loss term that simply enforces consistency between adjacent frames $\mathcal{I}_1, \mathcal{I}_2$. In particular, it warps the predicted mask distributions $\Phi(\mathcal{I}_1), \Phi(\mathcal{I}_2)$ using the optical flow, weighted by the error of warping the frames themselves, as follows:
$$\mathcal{L}_{\mathrm{warp}}(\mathcal{I}_1, \mathcal{I}_2, f_1, b_2) = w(\mathcal{I}_2, f_1(\mathcal{I}_1)) \cdot d\big(\Phi(\mathcal{I}_2), f_1(\Phi(\mathcal{I}_1))\big) + w(\mathcal{I}_1, b_2(\mathcal{I}_2)) \cdot d\big(\Phi(\mathcal{I}_1), b_2(\Phi(\mathcal{I}_2))\big), \tag{8}$$
$$w(\mathcal{I}_a, \mathcal{I}_b) = 1 - \mathrm{norm}(|\mathcal{I}_a - \mathcal{I}_b|), \qquad d(p, q) = D_{\mathrm{KL}}(p \,\|\, q)/2 + D_{\mathrm{KL}}(q \,\|\, p)/2,$$
where $f_1(\cdot)$ indicates warping by the forward optical flow $f_1$ (or by the backward flow $b_2$). The symmetrized KL divergence $d(\cdot)$ measures agreement between the predicted and warped mask distributions, weighted by the absolute error of the warped frames, normalized to $[0,1]$. While the use of this term is not central to our method, we include it to show how tolerance to noisy optical flow can be improved. We do not use the warp loss (Eq. (8)) in our experiments, unless otherwise indicated by (WL). In that case, the final loss is simply the sum of the two terms: $\mathcal{L}_\beta + \mathcal{L}_{\mathrm{warp}}$.
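A sketch of Eq. (8) in PyTorch, using backward warping via `grid_sample`. The flow convention (x displacement first, in pixels) and the normalization of the photometric error (frames assumed in $[0,1]$, channel-mean absolute error used directly as `norm`) are our assumptions, not details from the paper:

```python
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Backward-warp a (C, H, W) tensor x by a (2, H, W) flow via grid_sample.

    Assumed convention: flow[0] is the x displacement in pixels, flow[1]
    the y displacement, so warp(I1, fwd) resamples I1 towards I2.
    """
    _, H, W = x.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    gx = 2 * (xs + flow[0]) / (W - 1) - 1          # normalize to [-1, 1]
    gy = 2 * (ys + flow[1]) / (H - 1) - 1
    grid = torch.stack([gx, gy], dim=-1)[None]     # (1, H, W, 2)
    return F.grid_sample(x[None], grid, align_corners=True)[0]

def warp_loss(I1, I2, m1, m2, fwd, bwd):
    """Eq. (8): photometric-weighted symmetric KL between warped masks.

    I1, I2: (3, H, W) frames in [0, 1]; m1, m2: (K, H, W) masks Phi(I).
    """
    def term(Ia, ma, Ib_warped, mb_warped):
        w = 1 - (Ia - Ib_warped).abs().mean(0, keepdim=True)   # w(Ia, Ib)
        pa = ma.clamp_min(1e-8)
        pb = mb_warped.clamp_min(1e-8)
        d = 0.5 * (pa * (pa / pb).log() + pb * (pb / pa).log()).sum(0, keepdim=True)
        return (w * d).mean()

    return (term(I2, m2, warp(I1, fwd), warp(m1, fwd))
            + term(I1, m1, warp(I2, bwd), warp(m2, bwd)))
```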
4 Experiments
Our method lies in between image-based and video-based segmentation approaches, because it uses
videos for supervision, but trains an image segmentation network that operates on still images only.
We thus evaluate our approach under a number of settings. Firstly, we evaluate how well motion
can be used to supervise an object instance segmenter that operates on still images. Secondly, we
compare such a segmenter to state-of-the-art object segmentation methods that use motion also at test
time (and are thus advantaged compared to our model). We conduct further analysis to validate our
modelling assumptions and to assess the model's reliance on the quality of the optical flow used for supervision.
Finally, we apply our method to a real-world setting.
4.1 Experimental setup
Datasets. We evaluate our method on video and still-image datasets. For video-based data, we use the Multi-Object Video (MOVi) datasets, released as part of Kubric [20]. Specifically, we employ the MOVi-{A,C,D,E} versions. MOVi-A is similar to CLEVR [25] in terms of visual complexity and contains videos of 3–10 falling objects on a simple, gray background. MOVi-C is significantly more