Self-supervised Amodal Video Object Segmentation
Jian Yao1  Yuxin Hong2*  Chiyu Wang3*  Tianjun Xiao4†  Tong He4
Francesco Locatello4  David Wipf4  Yanwei Fu2†  Zheng Zhang4
1School of Management, Fudan University
2School of Data Science, Fudan University
3University of California, Berkeley
4Amazon Web Services
{jianyao20, yxhong20, yanweifu}@fudan.edu.cn, wcy_james@berkeley.edu
{tianjux, htong, locatelf, daviwipf, zhaz}@amazon.com
*Work completed during an internship at AWS Shanghai AI Labs.
†Correspondence authors are Tianjun Xiao and Yanwei Fu.
Preprint. Under review. arXiv:2210.12733v1 [cs.CV] 23 Oct 2022
Abstract
Amodal perception requires inferring the full shape of an object that is partially
occluded. This task is particularly challenging on two levels: (1) it requires more
information than is contained in the instantaneous retinal or imaging-sensor input, and (2) it
is difficult to obtain enough well-annotated amodal labels for supervision. To this
end, this paper develops a new framework of Self-supervised amodal Video object
segmentation (SaVos). Our method efficiently leverages the visual information of
video temporal sequences to infer the amodal mask of objects. The key intuition
is that the occluded part of an object can be explained away if that part is visible
in other frames, possibly deformed as long as the deformation can be reasonably
learned. Accordingly, we derive a novel self-supervised learning paradigm that
efficiently utilizes the visible object parts as the supervision to guide the training
on videos. In addition to learning a type prior for completing masks of known object types,
SaVos also learns a spatiotemporal prior, which is likewise useful for the amodal
task and can generalize to unseen types. The proposed framework achieves
state-of-the-art performance on the synthetic amodal segmentation benchmark
FISHBOWL and the real-world benchmark KINS-Video-Car. Further, it lends
itself well to being transferred to novel distributions using test-time adaptation,
outperforming existing models even after the transfer to a new distribution.
1 Introduction
Cognitive scientists have found that the human vision system contains several hierarchies. Visual
perception [27] first carves a scene at its physical joints, decomposing it into initial object
representations by grouping and simple completion. At this point, the representation is tethered to
the retinal sensor [13]. Then, correspondence or motion along the temporal dimension is built to form
an object representation that is untethered from the retinal reference frame through operations like
spatiotemporal aggregation, tracking, inference, and prediction [6]. The more stable untethered
representation is then ready to be raised from the perception system to the cognitive system for
higher-level action and symbolic cognition [23]. Machine learning, especially with artificial neural
networks, has progressed tremendously on tethered vision tasks such as detection and modal
segmentation. The natural next step is to climb to the next rung of the ladder by tackling
untethered vision.
This paper studies the task of amodal segmentation, which aims at inferring the whole shape of an
object over both its visible and occluded parts. It has critical applications in robot manipulation
and autonomous driving [24]. Conceptually, this task sits on the bridge between tethered and
untethered representations. Amodal segmentation requires prior knowledge. One option that has been
explored in the literature is to use the tethered representation together with prior knowledge about
the object type to obtain the amodal mask. Alternatively, we can obtain amodal masks using the
untethered representation by building dense object motion across frames to explain away occlusion,
which we refer to as the spatiotemporal prior. We prefer to explore the second option, since the
dependence on a type prior makes the first approach hard to generalize, considering that the
frequency distribution of visual categories in daily life is long-tailed.
Following this direction, we propose a Self-supervised amodal Video object segmentation (SaVos)
pipeline which simultaneously models the amodal mask and the dense object motion on the amodal mask.
Unlike traditional optical flow or correspondence networks, our approach does not require explicit
visual correspondence across pixels, which would be impossible to establish under occlusion. Instead,
modeling motion using temporal information allows us to complete dense amodal motion predictions.
The architecture is built for spatiotemporal modeling, which generalizes better than relying on type
priors. Despite that, we show that SaVos automatically finds its way to learn a type prior as well,
since learning types helps the encoder-decoder-style architecture make predictions. As a result,
generalization under distribution shift, for example to unseen object types, remains challenging. To
address this issue, we need to suppress the type prior and amplify the spatiotemporal prior when
making predictions. This is achieved by combining SaVos with test-time adaptation. Critically,
we found that our model is “adaptation-friendly”: it can naturally be improved with test-time
adaptation techniques without any change to the self-supervised loss, achieving a significant boost in
generalization performance.
We make several contributions in this paper:
(1) We propose a Self-supervised amodal Video object segmentation (SaVos) training pipeline built
upon the intuition that the occluded part of an object can be explained away if that part is visible in
other frames (Figure 1), possibly deformed as long as the deformation can be reasonably learned.
The pipeline turns visible masks in other frames into amodal self-supervision signals.
(2) The proposed approach simultaneously models the amodal mask and the dense amodal object
motion. The dense amodal object motion builds the bridge between different frames to achieve the
transition from visible masks to amodal supervision. To address the challenge of predicting motion
in the occluded area, we propose a novel architecture design that takes advantage of the inductive
bias from spatiotemporal modeling and the common-fate principle of Gestalt psychology [39].
The proposed method shows state-of-the-art amodal segmentation performance in the self-supervised
setting on several simulation and real-world benchmarks.
(3) The proposed SaVos model shows strong generalization performance under drastic distribution
shifts between training and test data when combined with one-shot test-time adaptation. We
empirically demonstrate that, by applying test-time adaptation without any change to the loss, SaVos
trained on a synthetic fish dataset can even outperform a competitor trained on the target real-world
driving-car dataset. Interestingly, applying test-time adaptation to an image-level baseline model
does not bring the same improvement observed with SaVos. This provides a unique perspective for
comparing different models: checking how effectively test-time adaptation works on them.
2 Related works
Untethered vision and amodal segmentation. Human vision forms a hierarchy by grouping retinal
signals into initial object concepts; the representation then untethers from the immediate retinal
sensor input by grouping spatiotemporally disjoint pieces. Such untethered representations have been
studied in various contexts [23, 21, 28, 26, 34, 22]. In particular, amodal segmentation [46] is the task
of inferring the shape of an object over both its visible and occluded parts. There are several image
amodal datasets such as COCOA [46] and KINS [24], as well as a video amodal dataset, SAIL-VOS [11],
created with the GTA game engine. Unfortunately, SAIL-VOS has frequent camera view switches and is
therefore not an ideal testbed for video tracking or motion. Several efforts have been made towards
amodal segmentation on these datasets [46, 24, 7, 45, 41, 44, 33, 17, 43, 30, 20]. Generally speaking,
most of these methods operate at the image level and model type priors with shape statistics; as such,
it is challenging to extend them to open-world applications where object category distributions are
long-tailed. Amodal segmentation is also related to structured generative models [47, 16, 29, 18].
These models attempt to maximize the likelihood of whole video sequences during training so as to
learn more consistent object representations and amodal representations. However, the main tasks for
those models are object discovery and representation, and they are evaluated on simpler datasets;
self-supervised object discovery in real-world complex scenes such as the driving scenes in [9]
remains too challenging for these methods. Without objects being discovered, no proper amodal
prediction can be expected.
Dense correspondence and motion. Our goal is to achieve amodal perception using an untethered
process, which requires object motion signals. Correspondence and motion were studied [8, 2] before
the deep learning era. FlowNet [5] and its follow-up work FlowNet2 [14] train deep networks in
a supervised way using simulation videos. Truong et al. [34] propose GLU-Net, a global-local
universal network for dense correspondences. However, motion in the occluded area cannot be
estimated with those methods. Occlusion and correspondence estimation depend on each other, a typical
chicken-and-egg problem [15]. We therefore need to model additional priors.
Video inpainting. A related but different task is video inpainting. Existing video inpainting methods
fill spatio-temporal holes by encouraging spatial and temporal coherence and smoothness
[40, 10, 12], rather than specifically inferring the occluded objects. Object-level knowledge is not
explicitly leveraged to inform model learning. Recently, Ke et al. [17] learn object completion
by contributing the large-scale dataset Youtube-VOI, in which occlusion masks are generated with
high-fidelity simulation to provide a training signal. Nevertheless, a reality gap remains between
synthetic occlusions and amodal masks in the real world. Accordingly, our model is designed to
learn the amodal supervision signal from easily accessible raw videos, provided that the
spatiotemporal information is properly mined.
Domain generalization and test-time adaptation. Transfer learning [4] and domain adaptation [25]
are general approaches for improving the performance of predictive models when training and test
data come from different distributions. Sun et al. [31] propose Test-Time Training, which differs
from finetuning, where some labeled data are available in the test domain, and from domain adaptation,
where both training and test samples are accessible. They design a self-supervised loss that is
trained jointly with the supervised loss and, at test time, apply the self-supervised loss to the
test sample. Wang et al. [37] propose fully test-time adaptation via test-entropy minimization. A
related line of work is Deep Image Prior [36] and Deep Video Prior [19], which optimize directly on
test samples without training on a training set. Our model is self-supervised and thus fits naturally
into the test-time adaptation framework. We will see how it works for our method in challenging
adaptation scenarios.
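To make this recipe concrete, below is a minimal PyTorch-style sketch of one-shot test-time
adaptation as described above: the same self-supervised loss used during training is minimized on a
single unlabeled test video before predicting. The function and argument names
(one_shot_test_time_adapt, self_sup_loss, test_video, steps, lr) are illustrative assumptions, not
the exact SaVos implementation.

```python
import copy
import torch

def one_shot_test_time_adapt(model, self_sup_loss, test_video, steps=50, lr=1e-4):
    """Adapt a copy of the trained model to one unlabeled test video by
    minimizing the same self-supervised loss used at training time."""
    adapted = copy.deepcopy(model)                  # keep the source model intact
    optimizer = torch.optim.Adam(adapted.parameters(), lr=lr)
    adapted.train()
    for _ in range(steps):
        optimizer.zero_grad()
        outputs = adapted(test_video)               # forward pass on the test video
        loss = self_sup_loss(outputs, test_video)   # no labels required
        loss.backward()
        optimizer.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(test_video)                  # final amodal predictions
```

Because the loss is the same as in training, no architectural change is needed; only the number of
adaptation steps and the learning rate are new hyperparameters introduced at test time.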
3 Method
Notations. Given an input video $\{I_t\}_{t=1}^{T}$ of $T$ frames containing $K$ objects, the task is
to generate the amodal “binary” segmentation mask sequences $\mathcal{M} = \{M^k_t\}$ for each object
$k$ in every frame $t$. From the raw frames, we obtain the image patch $I^k_t$ and the visible (modal)
segmentation mask $\mathcal{V} = \{V^k_t\}$. Further, we also obtain the optical flow
$\Delta V_t = \{\Delta V^k_{x,t}, \Delta V^k_{y,t}\}$ such that
$I_{t+1}[x + \Delta V_{x,t}[x, y],\, y + \Delta V_{y,t}[x, y]] \approx I_t[x, y]$. This information
can be retrieved from human annotation or extracted with off-the-shelf models, and we use it as given
input to our model.
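As a concrete reading of the flow convention above, the following is a minimal NumPy sketch that
measures how well a given flow field satisfies
$I_{t+1}[x + \Delta V_{x,t}[x, y],\, y + \Delta V_{y,t}[x, y]] \approx I_t[x, y]$ on a pair of frames.
It assumes grayscale H×W frames, (row, column) array indexing, and nearest-neighbour rounding; the
function name and shapes are illustrative, not part of the SaVos code.

```python
import numpy as np

def flow_consistency_error(I_t, I_t1, dVx, dVy):
    """Mean photometric residual of the flow definition above.
    I_t, I_t1 : frames t and t+1 as float arrays of shape (H, W)
    dVx, dVy  : horizontal / vertical flow from frame t to t+1, shape (H, W)
    """
    H, W = I_t.shape
    ys, xs = np.mgrid[0:H, 0:W]                        # pixel coordinates (row, col)
    xs_w = np.clip(np.round(xs + dVx).astype(int), 0, W - 1)
    ys_w = np.clip(np.round(ys + dVy).astype(int), 0, H - 1)
    # I_{t+1} sampled at the flow-displaced locations should match I_t.
    residual = np.abs(I_t1[ys_w, xs_w] - I_t)
    return residual.mean()
```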
3.1 Overview of Our SaVos Learning Problem
The key insight of the amodal segmentation task is to maximally exploit and explore visual prior
patterns to explain away the occluded object parts [35]. Such prior patterns include, but are not
limited to, (1) the type prior: the statistics of images, or the shapes of certain types of objects;
and (2) the spatiotemporal prior: the currently occluded part of an object might be visible in other
frames, as illustrated in Figure 1. Under a self-supervised setting, we exploit the temporal
correlation among frames, relying exclusively on the data itself.
Specifically, SaVos generates training supervision signals by investigating the relations between
the amodal masks and the visible masks of neighboring frames. The key assumption is that “some part
is occluded at some time, but not all parts all the time”, and that the deformation of the past
visible parts can be approximately learned. That is, gleaning visible parts over enough frames
produces enough evidence to complete an object. The inductive bias we can and must leverage is
spatiotemporal continuity. We note that parts that remain occluded all the time cannot be recovered
unless there are other priors.
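To make this supervision idea tangible, here is a minimal PyTorch sketch of one such signal: the
predicted amodal mask at frame t is warped by the predicted dense amodal motion and asked to cover
the visible mask observed at frame t+1. The forward warp is approximated by backward sampling with
the negated flow, and the tensor names, shapes, and loss weighting are illustrative assumptions
rather than the exact SaVos loss.

```python
import torch
import torch.nn.functional as F

def visible_part_supervision(M_t, flow_t, V_t1):
    """Warp the predicted amodal mask of frame t with the predicted dense amodal
    motion and require the warped mask to cover the visible mask at frame t+1.
    M_t    : (1, 1, H, W) predicted amodal mask at frame t, values in [0, 1]
    flow_t : (1, 2, H, W) predicted dense motion from frame t to t+1, in pixels
    V_t1   : (1, 1, H, W) visible (modal) mask at frame t+1, float in {0, 1}
    """
    _, _, H, W = M_t.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Approximate forward warping: each target pixel reads frame t at p - flow(p).
    grid_x = (xs - flow_t[:, 0]) * 2.0 / (W - 1) - 1.0   # normalize to [-1, 1]
    grid_y = (ys - flow_t[:, 1]) * 2.0 / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)          # (1, H, W, 2)
    M_warped = F.grid_sample(M_t, grid, align_corners=True)
    # Pixels that are visible at t+1 must be explained by the warped amodal mask.
    return F.binary_cross_entropy(M_warped.clamp(1e-6, 1 - 1e-6), V_t1, weight=V_t1)
```

Weighting the loss by V_t1 restricts supervision to pixels where evidence is actually visible at
t+1, which is the sense in which visible parts in other frames explain away the current occlusion.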