In this paper, we propose a method that lies in between these two classes, i.e. single-image-based and
video-based methods. Our model learns to segment objects in still images, and is thus based on appearance,
but it is trained without supervision, using video as the learning signal. The learning process can
be summarized as follows. Given an image, each pixel is assigned to a slot that represents a certain
object. The quality of the assignments is then measured by the coherence of the (unobserved) optical
flow within the extracted regions. Because predicting optical flow from a still image is intrinsically
ambiguous, the method models distributions of probable flow patterns within each region. The idea is
that (rigid) objects generate characteristic flow patterns that can be used to distinguish between them.
Note that, because the segmentation network operates on a single image, it learns to partition all
objects contained in the scene, not just the ones that actually move in the video, i.e. it solves an object
instance segmentation problem rather than a motion segmentation one.
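To make this training signal concrete, the following is a minimal sketch (our illustration, not the authors' code): a segmentation network produces soft per-pixel slot assignments from a single frame, and each slot is scored by how well the flow estimated from the video is explained by a simple parametric flow model inside that region. The function name, the affine model choice, and the weighted least-squares fit are all illustrative assumptions.

```python
# Illustrative sketch of a flow-coherence training signal (not the authors'
# implementation). Soft masks come from a segmentation network applied to a
# single image; the flow comes from the video and is used only as a target.
import torch

def flow_coherence_loss(masks, flow, coords):
    """masks: (K, H, W) soft slot assignments summing to 1 over K;
    flow: (2, H, W) optical flow; coords: (2, H, W) pixel coordinates."""
    K, H, W = masks.shape
    f = flow.reshape(2, -1).T                        # (N, 2) flow vectors
    x = coords.reshape(2, -1).T                      # (N, 2) pixel positions
    X = torch.cat([x, torch.ones(H * W, 1)], dim=1)  # homogeneous coords (N, 3)
    loss = 0.0
    for k in range(K):
        w = masks[k].reshape(-1, 1)
        sw = w.sqrt()
        # Weighted least-squares fit of an affine flow model f ~ X @ A_k,
        # weighting each pixel by its soft membership in slot k.
        A = torch.linalg.lstsq(X * sw, f * sw).solution   # (3, 2)
        residual = ((X @ A - f) ** 2).sum(dim=1, keepdim=True)
        loss = loss + (w * residual).sum()
    return loss / (H * W)

# Toy usage with random inputs: K=4 slots on a 32x32 grid.
H, W, K = 32, 32, 4
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")
coords = torch.stack([xs, ys])
masks = torch.softmax(torch.randn(K, H, W), dim=0)
flow = torch.randn(2, H, W)
print(flow_coherence_loss(masks, flow, coords))
```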
We derive closed-form distributions for the flow field generated by rigid objects moving in the scene,
together with efficient expressions for evaluating the probability of an observed flow under these models.
The problem of decomposing the image into regions is then cast as a standard image segmentation
task, for which any off-the-shelf segmentation network can be used.
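For intuition, here is one way such a closed form can arise (our notation, under the simplifying assumption of a per-region parametric flow model with a Gaussian prior on its parameters; the paper's exact derivation may differ). Writing $f_{R_k}$ for the stacked flow inside region $R_k$ and $M_k$ for the design matrix built from its pixel coordinates, the linear-Gaussian model marginalizes in closed form:

```latex
% Illustrative linear-Gaussian marginalization, not the paper's derivation.
f_{R_k} = M_k \theta_k + \epsilon, \qquad
\theta_k \sim \mathcal{N}(\mu, \Sigma), \qquad
\epsilon \sim \mathcal{N}(0, \sigma^2 I)
\;\Longrightarrow\;
p(f_{R_k}) = \int \mathcal{N}(f_{R_k}; M_k\theta, \sigma^2 I)\,
             \mathcal{N}(\theta; \mu, \Sigma)\, d\theta
           = \mathcal{N}\!\big(f_{R_k};\; M_k\mu,\; M_k \Sigma M_k^\top + \sigma^2 I\big).
```

Because $\Sigma$ is low-dimensional, the resulting Gaussian log-density can be evaluated efficiently, for instance via the matrix inversion lemma, rather than by inverting the full per-region covariance.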
As our method uses videos to train an image-based model, we introduce two new datasets which are
straightforward video extensions of the existing image datasets CLEVR [25] and CLEVRTEX [27].
These datasets are built by animating the objects with initial velocities and using a physics simulation
to generate realistic object movement. Our datasets are constructed under the realistic assumption that
not all objects move at all times, so motion cannot serve as the sole cue for objectness. This reflects
scenarios such as a workbench, where a person interacts with only a small number of objects at a time.
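As a rough sketch of this recipe (assumed tooling, not the authors' generation pipeline), a physics engine such as PyBullet can integrate the motion after only a random subset of objects is given an initial velocity; the shapes and parameters below are placeholders.

```python
# Toy sketch of the dataset recipe: give some objects an initial velocity,
# let a physics engine integrate the motion, and leave the rest static.
import random
import pybullet as p

p.connect(p.DIRECT)                      # headless physics simulation
p.setGravity(0, 0, -9.8)

plane = p.createMultiBody(               # static ground plane (mass 0)
    baseCollisionShapeIndex=p.createCollisionShape(p.GEOM_PLANE))
bodies = []
for i in range(6):                       # six spheres standing in for objects
    sphere = p.createCollisionShape(p.GEOM_SPHERE, radius=0.2)
    body = p.createMultiBody(baseMass=1.0, baseCollisionShapeIndex=sphere,
                             basePosition=[i * 0.6 - 1.5, 0, 0.2])
    bodies.append(body)

# Only some objects move, so motion alone cannot signal objectness.
for body in random.sample(bodies, k=3):
    p.resetBaseVelocity(body, linearVelocity=[random.uniform(-2, 2),
                                              random.uniform(-2, 2), 0])

for _ in range(240):                     # simulate ~1 s at 240 Hz
    p.stepSimulation()                   # a real pipeline would render frames here
```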
Empirically, we validate our model against several ablations and baselines. We compare our approach
to existing unsupervised multi-object segmentation methods, achieving state-of-the-art performance.
We demonstrate particularly strong performance in visually complex scenes, even with objects and
textures unseen at test time. Comparisons with image-based models, and in particular adding our
motion-aware formulation to existing models, show substantial improvements, confirming
that motion is an important cue for learning objectness. Furthermore, we show that our learned segmenter,
which operates on still images, produces better segmentations than current video-based methods
that use motion information at test time. Finally, we apply our method to real-world self-driving
scenarios, where it outperforms prior work.
2 Related Work
Multi-object decomposition.
Learning unsupervised object segmentation for static scenes is a
well-researched problem in computer vision [5, 9, 11, 12, 13, 14, 17, 18, 23, 34, 37, 47, 48, 58].
These methods aim to decompose the scene into constituent parts, e.g. the different foreground
objects and the background. Glimpse-based methods [9, 14, 23, 34] find input patches (glimpses) that
contain the objects in the scene. These methods learn object descriptors that encode their properties
(e.g. position, number, size of the objects) using variational inference, composing glimpses into the
final picture. More related to ours are approaches that learn per-pixel object masks [5, 11, 12, 13, 37].
MONet [5] and IODINE [19] employ multiple encoding-decoding steps to sequentially explain the
scene as a collection of regions. Slot Attention [37] uses a multi-step soft clustering-like scheme to
find the regions simultaneously. In all cases, learning is posed as an image reconstruction problem.
In order to align learnable slots with semantic objects, models have to make efficient use of the limited
representation available for each region, e.g. by learning to explain only visual appearance. This
principle, however, is difficult to extend to visually complex data [27] and relies on custom, specialized
architectures. Instead, our method allows any standard segmentation architecture to be used,
which we train to predict regions that are most likely described by rigid motion patterns.
Video-based multi-object decomposition.
Another line of work extends the unsupervised object
decomposition problem to videos [21, 24, 26, 29, 30, 31, 33, 46, 54, 56, 57, 61]. Many of these
methods work mainly with simpler datasets [30, 46, 61] and require sequential frames for training.
For example, SCALOR [24] is a glimpse-based method that discovers and propagates objects across
frames to learn intermediate object latents. SIMONe [26] processes the whole video at once, learning
a temporal scene representation and time-invariant object representations simultaneously. Slot
Attention for Video (SAVi) [28] poses the multi-object problem as optical-flow prediction using