maximize the likelihood of whole video sequences during training so as to learn a more consistent object representation and the amodal representation. However, the major tasks for those models are object discovery and representation, and they are tested on simpler datasets; self-supervised object discovery in real-world complex scenes like the driving scenes in [9] remains too challenging for these methods. Without objects being discovered, no proper amodal prediction can be expected.
Dense correspondence and motion
Our goal is to achieve amodal segmentation using an untethered process, which requires object motion signals. There were studies [8, 2] on correspondence and motion before the deep learning era. FlowNet [5] and its follow-up work FlowNet2 [14] train deep networks in a supervised way using simulation videos. Truong et al. [34] propose GLU-Net, a global-local universal network for dense correspondences. However, motion in the occluded area cannot be estimated with those methods. Occlusion and correspondence estimation depend on each other, a typical chicken-and-egg problem [15]; we need to model additional priors.
Video inpainting
A related but different task is video inpainting. Existing video inpainting methods fill the spatio-temporal holes by encouraging spatial and temporal coherence and smoothness [40, 10, 12], rather than particularly inferring the occluded objects; object-level knowledge is not explicitly leveraged to inform model learning. Recently, Ke et al. [17] learn object completion by contributing the large-scale dataset YouTube-VOI, where occlusion masks are generated using high-fidelity simulation to provide training signals. Nevertheless, a reality gap remains between synthetic occlusions and real-world amodal masks. Accordingly, our model is designed to mine amodal supervision signals from easily accessible raw videos by properly exploiting spatiotemporal information.
Domain generalization and test-time adaptation
Transfer learning [4] and domain adaptation [25] are general approaches for improving the performance of predictive models when training and test data come from different distributions. Sun et al. [31] propose Test-Time Training, which differs from finetuning, where some labeled data are available in the test domain, and from domain adaptation, where there is access to both train and test samples. They design a self-supervised loss that is trained together with the supervised loss; at test time, the self-supervised loss is applied to the test sample. Wang et al. [37] propose fully test-time adaptation by test entropy minimization. Related topics are Deep Image Prior [36] and Deep Video Prior [19], which directly optimize on test samples without training on a training set. Our model is self-supervised and thus fits naturally into the test-time adaptation framework. We will see how it works for our method in challenging adaptation scenarios.
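For concreteness, below is a minimal PyTorch sketch of entropy-minimization test-time adaptation in the spirit of [37]; the function name, the model, and the optimizer setup are illustrative assumptions of ours, not a reproduction of that method.

```python
import torch
import torch.nn.functional as F

def tta_entropy_step(model, x, optimizer):
    """One adaptation step: minimize prediction entropy on an unlabeled test batch."""
    logits = model(x)                                   # (B, C) class logits
    log_probs = F.log_softmax(logits, dim=1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

# Following [37], typically only the affine parameters of normalization layers are updated:
# params = [p for m in model.modules()
#           if isinstance(m, torch.nn.BatchNorm2d) for p in m.parameters()]
# optimizer = torch.optim.SGD(params, lr=1e-3)
```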
3 Method
Notations
Given the input video $\{I_t\}_{t=1}^{T}$ of $T$ frames with $K$ objects, the task is to generate the amodal "binary" segmentation mask sequences $\mathcal{M} = \{M_t^k\}$ for each object in every frame. On the raw frames, we obtain the image patch $I_t^k$ and visible modal segmentation mask $\mathcal{V} = \{V_t^k\}$. Further, we also obtain the optical flow $\Delta V_t = (\Delta V_{x,t}^k, \Delta V_{y,t}^k)$ such that $I_{t+1}[x + \Delta V_{x,t}[x, y],\, y + \Delta V_{y,t}[x, y]] \approx I_t[x, y]$. This information can be retrieved from human annotation or extracted with off-the-shelf models, and we use it as given input to our model.
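As a quick sanity check on this flow-consistency relation, it can be verified numerically; the sketch below is our own illustration, using nearest-neighbour rounding for simplicity where real pipelines would interpolate bilinearly.

```python
import numpy as np

def flow_residual(I_t, I_t1, dVx, dVy):
    """Mean absolute residual of I_{t+1}[x + dVx[x,y], y + dVy[x,y]] vs. I_t[x,y].

    I_t, I_t1 : (H, W) frames; dVx, dVy : (H, W) flow components in pixels.
    Arrays are indexed [y, x], whereas the text writes I_t[x, y].
    """
    H, W = I_t.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    xt = np.clip(np.round(xs + dVx).astype(int), 0, W - 1)  # target x in frame t+1
    yt = np.clip(np.round(ys + dVy).astype(int), 0, H - 1)  # target y in frame t+1
    return float(np.abs(I_t1[yt, xt].astype(float) - I_t.astype(float)).mean())
```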
3.1 Overview of Our SaVos Learning Problem
The key insight of the amodal segmentation task is to maximally exploit and explore visual prior patterns to explain away the occluded object parts [35]. Such prior patterns include, but are not limited to, (1) type prior: the statistics of images, or the shape of certain types of objects; and (2) spatiotemporal prior: the currently occluded part of an object might be visible in other frames, as illustrated in Figure 1. Under a self-supervised setting, we exploit temporal correlation among frames relying exclusively on the data itself.
Specifically, SaVos generates training supervision signals by investigating the relations between the amodal masks and visible masks on neighboring frames. The key assumption is that "some part is occluded at some time, but not all parts all the time", and that the deformation of past visible parts can be approximately learned. That is, gleaning visible parts over enough frames produces enough evidence to complete an object. The inductive bias we can and must leverage is spatiotemporal continuity. We note that parts that remain occluded all the time cannot be recovered unless there are other priors.
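To make this assumption concrete, the sketch below unions flow-warped past visible masks into a pseudo-amodal target for the last frame. The nearest-neighbour `warp_mask` helper and the plain union rule are simplifications of our own (SaVos learns the deformation rather than chaining raw flow), and, as noted above, parts never visible in any frame remain unrecoverable.

```python
import numpy as np

def warp_mask(mask, dVx, dVy):
    """Warp a binary (H, W) mask from frame t to frame t+1 using forward flow."""
    H, W = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    xt = np.clip(np.round(xs + dVx[ys, xs]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + dVy[ys, xs]).astype(int), 0, H - 1)
    out[yt, xt] = 1
    return out

def pseudo_amodal_last_frame(visible_masks, flows_x, flows_y):
    """Union of all visible masks V_t^k, each chained forward to frame T.

    visible_masks : list of T binary (H, W) arrays for one object
    flows_x, flows_y : lists of T-1 forward-flow components between consecutive frames
    """
    T = len(visible_masks)
    acc = visible_masks[-1].astype(bool)
    for t in range(T - 1):
        m = visible_masks[t]
        for s in range(t, T - 1):  # chain warps t -> t+1 -> ... -> T
            m = warp_mask(m, flows_x[s], flows_y[s])
        acc |= m.astype(bool)
    return acc.astype(np.uint8)   # parts occluded in every frame stay missing
```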