maximize the likelihood of whole video sequences during training so as to learn a more consistent object representation and the amodal representation. However, the major tasks for those models are object discovery and representation, and they are tested on simpler datasets; self-supervised object discovery in real-world complex scenes like the driving scenes in [9] remains too challenging for these methods. Without objects being discovered, no proper amodal prediction can be expected.
Dense correspondence and motion
Our goal is to achieve amodal segmentation using an untethered process, which requires object motion signals. There were studies [8, 2] on correspondence and motion before the deep learning era. FlowNet [5] and its follow-up work FlowNet2 [14] train deep networks in a supervised way using simulation videos. Truong et al. [34] propose GLU-Net, a global-local universal network for dense correspondences. However, motion in the occluded area cannot be estimated with those methods. Occlusion and correspondence estimation depend on each other, a typical chicken-and-egg problem [15]; we need to model additional priors.
Video inpainting
A related but different task is video inpainting. Existing video inpainting methods fill the spatio-temporal holes by encouraging spatial and temporal coherence and smoothness [40, 10, 12], rather than particularly inferring the occluded objects; object-level knowledge is not explicitly leveraged to inform model learning. Recently, Ke et al. [17] learn object completion by contributing the large-scale dataset YouTube-VOI, where occlusion masks are generated using high-fidelity simulation to provide training signals. Nevertheless, a reality gap remains between synthetic occlusions and real-world amodal masks. Accordingly, our model is designed to mine amodal supervision signals from easily accessible raw videos by properly exploiting spatiotemporal information.
Domain generalization and test-time adaptation
Transfer learning [4] and domain adaptation [25] are general approaches for improving the performance of predictive models when training and test data come from different distributions. Sun et al. [31] propose Test-Time Training, which differs from finetuning, where some labeled data are available in the test domain, and from domain adaptation, where there is access to both train and test samples. They design a self-supervised loss that is trained together with the supervised loss; at test time, the self-supervised loss is applied to the test sample. Wang et al. [37] propose fully test-time adaptation by test entropy minimization. Related topics are Deep Image Prior [36] and Deep Video Prior [19], which directly optimize on test samples without training on a training set. Our model is self-supervised and thus fits naturally into the test-time adaptation framework. We will see how it works for our method in challenging adaptation scenarios.
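For concreteness, below is a minimal PyTorch sketch of entropy-minimization test-time adaptation in the spirit of [37]; the function name, the model, and the optimizer setup are illustrative assumptions of ours, not a reproduction of that method.

```python
import torch
import torch.nn.functional as F

def tta_entropy_step(model, x, optimizer):
    """One adaptation step: minimize prediction entropy on an unlabeled test batch."""
    logits = model(x)                                   # (B, C) class logits
    log_probs = F.log_softmax(logits, dim=1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

# Following [37], typically only the affine parameters of normalization layers are updated:
# params = [p for m in model.modules()
#           if isinstance(m, torch.nn.BatchNorm2d) for p in m.parameters()]
# optimizer = torch.optim.SGD(params, lr=1e-3)
```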
3 Method
Notations
Given the input video $\{I_t\}_{t=1}^{T}$ of $T$ frames with $K$ objects, the task is to generate the amodal "binary" segmentation mask sequences $\mathcal{M} = \{M_t^k\}$ for each object in every frame. On the raw frames, we obtain the image patch $I_t^k$ and visible modal segmentation mask $\mathcal{V} = \{V_t^k\}$. Further, we also obtain the optical flow $\Delta V_t = (\Delta V_{x,t}^k, \Delta V_{y,t}^k)$ such that $I_{t+1}[x + \Delta V_{x,t}[x, y],\, y + \Delta V_{y,t}[x, y]] \approx I_t[x, y]$. This information can be retrieved from human annotation or extracted with off-the-shelf models, and we use it as given input to our model.
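As a quick sanity check on this flow-consistency relation, it can be verified numerically; the sketch below is our own illustration, using nearest-neighbour rounding for simplicity where real pipelines would interpolate bilinearly.

```python
import numpy as np

def flow_residual(I_t, I_t1, dVx, dVy):
    """Mean absolute residual of I_{t+1}[x + dVx[x,y], y + dVy[x,y]] vs. I_t[x,y].

    I_t, I_t1 : (H, W) frames; dVx, dVy : (H, W) flow components in pixels.
    Arrays are indexed [y, x], whereas the text writes I_t[x, y].
    """
    H, W = I_t.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    xt = np.clip(np.round(xs + dVx).astype(int), 0, W - 1)  # target x in frame t+1
    yt = np.clip(np.round(ys + dVy).astype(int), 0, H - 1)  # target y in frame t+1
    return float(np.abs(I_t1[yt, xt].astype(float) - I_t.astype(float)).mean())
```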
3.1 Overview of Our SaVos Learning Problem
The key insight of the amodal segmentation task is to maximally exploit and explore visual prior patterns to explain away the occluded object parts [35]. Such prior patterns include, but are not limited to, (1) type prior: the statistics of images, or the shape of certain types of objects; and (2) spatiotemporal prior: the currently occluded part of an object might be visible in other frames, as illustrated in Figure 1. Under a self-supervised setting, we exploit temporal correlation among frames relying exclusively on the data itself.
Specifically, SaVos generates training supervision signals by investigating the relations between the amodal masks and visible masks on neighboring frames. The key assumption is that "some part is occluded at some time, but not all parts all the time", and that the deformation of past visible parts can be approximately learned. That is, gleaning visible parts over enough frames produces enough evidence to complete an object. The inductive bias we can and must leverage is spatiotemporal continuity. We note that parts that remain occluded all the time cannot be recovered unless there are other priors.
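To make this assumption concrete, the sketch below unions flow-warped past visible masks into a pseudo-amodal target for the last frame. The nearest-neighbour `warp_mask` helper and the plain union rule are simplifications of our own (SaVos learns the deformation rather than chaining raw flow), and, as noted above, parts never visible in any frame remain unrecoverable.

```python
import numpy as np

def warp_mask(mask, dVx, dVy):
    """Warp a binary (H, W) mask from frame t to frame t+1 using forward flow."""
    H, W = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    xt = np.clip(np.round(xs + dVx[ys, xs]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + dVy[ys, xs]).astype(int), 0, H - 1)
    out[yt, xt] = 1
    return out

def pseudo_amodal_last_frame(visible_masks, flows_x, flows_y):
    """Union of all visible masks V_t^k, each chained forward to frame T.

    visible_masks : list of T binary (H, W) arrays for one object
    flows_x, flows_y : lists of T-1 forward-flow components between consecutive frames
    """
    T = len(visible_masks)
    acc = visible_masks[-1].astype(bool)
    for t in range(T - 1):
        m = visible_masks[t]
        for s in range(t, T - 1):  # chain warps t -> t+1 -> ... -> T
            m = warp_mask(m, flows_x[s], flows_y[s])
        acc |= m.astype(bool)
    return acc.astype(np.uint8)   # parts occluded in every frame stay missing
```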