CLIP-FLOW: CONTRASTIVE LEARNING BY SEMI-SUPERVISED ITERATIVE PSEUDO LABELING FOR OPTICAL FLOW ESTIMATION
Zhiqi Zhang, Nitin Bansal, Changjiang Cai, Pan Ji, Qingan Yan, Xiangyu Xu & Yi Xu
OPPO US Research Center, Innopeak Technology, USA
{zhiqi.zhang,pan.ji,nitin.bansal,changjiang.cai,qingan.yan,
xiangyu.xu,yi.xu}@innopeaktech.com
ABSTRACT
Synthetic datasets are often used to pretrain end-to-end optical flow networks, due to the lack of a large amount of labeled, real scene data. But major drops in accuracy occur when moving from synthetic to real scenes. How do we better transfer the knowledge learned from synthetic to real domains? To this end, we propose CLIP-Flow, a semi-supervised iterative pseudo labeling framework to transfer the pretraining knowledge to the target real domain. We leverage large-scale, unlabeled real data to facilitate transfer learning with the supervision of iteratively updated pseudo ground truth labels, bridging the domain gap between the synthetic and the real. In addition, we propose a contrastive flow loss on reference features and the features warped by pseudo ground truth flows, to further boost accurate matching and dampen mismatching due to motion, occlusion, or noisy pseudo labels. We adopt RAFT as the backbone and obtain an F1-all error of 4.11%, i.e., a 19% error reduction from RAFT (5.10%), ranking 2nd on the KITTI 2015 benchmark at the time of submission. Our framework can also be extended to other models, e.g., CRAFT, reducing the F1-all error from 4.79% to 4.66% on the KITTI 2015 benchmark.
1 INTRODUCTION
Optical flow is critical in many high-level vision problems, such as action recognition (Simonyan & Zisserman, 2014; Sevilla-Lara et al., 2018; Sun et al., 2018b), video segmentation (Yang et al., 2021; Yang & Ramanan, 2021) and editing (Bonneel et al., 2015), and autonomous driving (Janai et al., 2020). Traditional methods (Horn & Schunck, 1981; Menze et al., 2015; Ranftl et al., 2014; Zach et al., 2007) mainly formulate flow estimation as an optimization problem over hand-crafted features. The optimization searches over the space of dense displacement fields between a pair of input images, which is often time-consuming. Recently, data-driven deep learning methods (Dosovitskiy et al., 2015; Ilg et al., 2017; Teed & Deng, 2020) have proven successful in estimating optical flow, thanks to the availability of various high-quality synthetic datasets (Butler et al., 2012b; Dosovitskiy et al., 2015; Mayer et al., 2016; Krispin et al., 2016).

However, most recent works (Dosovitskiy et al., 2015; Ilg et al., 2017; Teed & Deng, 2020; Jeong et al., 2022) mainly train on synthetic datasets, given that there are no sufficiently large labeled real optical flow datasets for training a deep learning model. State-of-the-art (SOTA) models consistently achieve more accurate results on synthetic datasets like Sintel (Butler et al., 2012a) than on real scene datasets like KITTI 2015 (Menze & Geiger, 2015). This is mainly because the model tends to overfit the small training set, which is echoed in Tab. 1: for all of the previous SOTA methods, there is a large gap between the training F1-all error and the test F1-all error when training and testing on the KITTI dataset. Therefore, we argue that this performance gap stems from a dearth of real training data and a large distribution gap between synthetic data and real scene data. Although a model can fit all kinds of synthetic data almost perfectly, it performs rather unsatisfactorily when dealing with real data. Our proposed work focuses on bridging this glaring performance gap between synthetic data and real scene data.
As in previous data-driven approaches, smarter and longer training strategies prove beneficial and help in obtaining better optical flow results. Through our work, we also try to answer the following two questions: (i) How can we take advantage of current SOTA optical flow models to further consolidate gains on real datasets? and (ii) How can we use semi-supervised learning, along with contrastive feature representation learning strategies, to effectively utilize the huge amount of unlabeled real data at our disposal?
Unsupervised visual representation learning (He et al., 2020; Chen et al., 2020b) has proven successful in boosting most major vision tasks, such as image classification, object detection, and semantic segmentation, to name a few. Works such as (He et al., 2020; Chen et al., 2020b) also emphasize the importance of the contrastive loss when dealing with large, dense datasets. Given that optical flow tasks generally lack real ground truth labels, we ask whether leveraging unsupervised visual representation learning can boost optical flow performance. To answer this question, we examine the impact of contrastive learning and pseudo labeling during training under a semi-supervised setting. In particular, we conduct exhaustive experiments using the KITTI-Raw (Geiger et al., 2013) and KITTI 2015 (Menze & Geiger, 2015) datasets to evaluate the performance gain, and show encouraging results. We believe the gain in the model's performance reflects the fact that employing representation learning techniques such as contrastive learning helps in achieving a much more refined 4D cost correlation volume. To constrain the flow per pixel, we employ a simple positional encoding of 2D Cartesian coordinates on the input frames, as suggested in (Liu et al., 2018), which further consolidates the gain achieved by contrastive learning. We follow this up with iterative-flow-refinement training using pseudo labeling, which further builds on the previous gains to give us SOTA results. We would also like to highlight that we follow a specific, well-calibrated training strategy to fully exploit the gains of our method. Without loss of generality, we use RAFT (Teed & Deng, 2020) as the backbone network for our experiments. To fairly compare our method with existing SOTA methods, we tested our proposed method on the KITTI 2015 test set, where we achieve the best F1-all error among all published methods by a significant margin.
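To make the positional encoding mentioned above concrete, the sketch below shows a CoordConv-style (Liu et al., 2018) implementation that appends normalized 2D Cartesian coordinate channels to the input frames. The function name and the [-1, 1] normalization range are our illustrative choices, not the exact code used in our experiments:

```python
import torch

def add_coord_channels(frames: torch.Tensor) -> torch.Tensor:
    """Append normalized 2D Cartesian coordinate channels to a batch of frames.

    frames: (B, C, H, W) tensor, e.g. RGB images with C = 3.
    Returns a (B, C + 2, H, W) tensor whose two extra channels hold the x and y
    pixel coordinates scaled to [-1, 1], as in CoordConv (Liu et al., 2018).
    """
    b, _, h, w = frames.shape
    ys = torch.linspace(-1.0, 1.0, h, device=frames.device)
    xs = torch.linspace(-1.0, 1.0, w, device=frames.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")  # each (H, W)
    coords = torch.stack([grid_x, grid_y], dim=0)           # (2, H, W)
    coords = coords.unsqueeze(0).expand(b, -1, -1, -1)      # (B, 2, H, W)
    return torch.cat([frames, coords], dim=1)
```

Under this sketch, the encoded frames would simply replace the raw RGB inputs, with the encoder's first layer widened by two channels.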
To summarize our main contributions: 1) we provide a detailed training strategy that uses SSL methods on top of the well-known RAFT model to improve SOTA performance for optical flow estimation; 2) we present ways to employ contrastive learning and pseudo labeling effectively and intelligently, such that both jointly help in improving upon existing benchmarks; 3) we discuss the positive impact of a simple 2D positional encoding, which benefits flow training on both the Sintel and KITTI 2015 datasets.
2 RELATED WORK
Optical flow estimation. Maximizing visual similarity between neighboring frames by formulating the problem as energy minimization (Black & Anandan, 1993; Bruhn et al., 2005; Sun et al., 2014) has been the primary approach to optical flow estimation. Previous works such as (Dosovitskiy et al., 2015; Ilg et al., 2017; Ranjan & Black, 2017; Sun et al., 2018a; 2019; Hui et al., 2018; 2020; Zou et al., 2018) have established the efficacy of deep neural networks in estimating optical flow under both supervised and self-supervised settings. Iterative improvements in model architecture and better regularization terms have been primarily responsible for the better results. However, most of these works struggle to handle occlusions and small, fast-moving objects, to capture global motion, and to rectify and recover from early mistakes.
To overcome these limitations, Teed & Deng (2020) proposed RAFT, which adopts a learning-to-optimize strategy using a recurrent GRU-based decoder to iteratively update a flow field $\mathbf{f}$ initialized at zero. Inspired by the success of RAFT, a number of variants have emerged, such as CRAFT (Sui et al., 2022), GMA (Jiang et al., 2021), Sparse Volume RAFT (Jiang et al., 2021) and FlowFormer (Huang et al., 2022), all of which benefit from the all-pair correlation volume way of estimating optical flow. The current state-of-the-art work RAFT-OCTC (Jeong et al., 2022) also uses a RAFT-based architecture, and it imposes consistency on various proxy tasks to improve flow estimation. Considering RAFT's effectiveness, generalizability, and relatively small model size, we adopt RAFT (Teed & Deng, 2020) as our base architecture and employ semi-supervised iterative pseudo labeling together with the contrastive flow loss to achieve a state-of-the-art result on the KITTI 2015 (Menze & Geiger, 2015) benchmark.
Figure 1: Overview of our Contrastive Flow Model (CLIP-Flow). Input frames are first positionally encoded according to their individual 2D pixel locations. Subsequently, a contrastive loss is applied on the encoded features to enforce pixel-wise consistency using the ground truth flow.
Semi-Supervised and Representation Learning. Semi-supervised learning (SSL) and representation learning have shown success for a range of computer vision tasks, both during pretext task training and on specific downstream tasks. Most of these methods leverage contrastive learning (Chen et al., 2020a;b; He et al., 2020), clustering (Caron et al., 2020), and pseudo-labeling (Caron et al., 2021; Chen & He, 2021; Grill et al., 2020; Hoyer et al., 2021) as an enforcing mechanism. Recent works such as (Caron et al., 2020; 2021; Chen et al., 2020a;b; 2021; Grill et al., 2020; He et al., 2020; Xie et al., 2021b; Yun et al., 2022) have empirically shown the benefits of SSL for downstream tasks such as image classification, object detection, and instance and semantic segmentation. These works leverage contrastive losses in different shapes and forms to facilitate better representation learning. For example, (He et al., 2020; Chen et al., 2020b) advocate finding positive and negative keys with respect to a given encoded query to enforce a contrastive loss. Along similar lines, for dense prediction tasks, studies such as (O Pinheiro et al., 2020; Xiao et al., 2021; Xie et al., 2021a) enforce matching of overlapping regions between two augmented images, whereas (Yun et al., 2022) forms positive/negative pairs between adjacent patches. As part of our approach, we use a contrastive flow loss between features of neighboring frames, where we draw one-to-one positive pair relations between the reference features and the warped features using (pseudo) ground truth flow or flow estimates, as shown in Fig. 1.
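To make this one-to-one pairing concrete, the sketch below shows one plausible form of such a pixel-wise contrastive objective, in the InfoNCE style of (He et al., 2020; Chen et al., 2020b). The function name, temperature, and masking are our illustrative choices; our actual loss is defined in Sec. 3.3. It is intended for 1/8-resolution feature maps, since the logit matrix grows quadratically with the number of pixels:

```python
import torch
import torch.nn.functional as F

def contrastive_flow_loss(feat_ref, feat_warp, valid, tau=0.07):
    """InfoNCE-style loss over dense features (illustrative sketch).

    feat_ref:  (B, C, H, W) features of the reference frame.
    feat_warp: (B, C, H, W) features of the second frame, warped back to the
               reference view by the (pseudo) ground truth flow.
    valid:     (B, H, W) boolean mask of in-bounds, well-matched pixels.
    For each pixel, the warped feature at the same location is the positive
    key; all other locations in the image act as negatives.
    """
    b, c, h, w = feat_ref.shape
    q = F.normalize(feat_ref.flatten(2).transpose(1, 2), dim=-1)   # (B, HW, C)
    k = F.normalize(feat_warp.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    logits = torch.bmm(q, k.transpose(1, 2)) / tau                 # (B, HW, HW)
    # The positive class for query pixel i is key index i.
    target = torch.arange(h * w, device=q.device).expand(b, -1)    # (B, HW)
    loss = F.cross_entropy(logits.transpose(1, 2), target, reduction="none")
    mask = valid.flatten(1).float()
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)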
Pseudo labeling is another important approach in the SSL training paradigm. Some studies leverage pseudo labeling to generate training labels as part of consistency training (Yun et al., 2019; Olsson et al., 2021; Hoyer et al., 2021), while other works use pseudo labeling to improve pretext task training, as in (Caron et al., 2021; Chen & He, 2021; Grill et al., 2020). In contrast, we use pseudo labeling as an iterative refinement mechanism, through which we effectively distill correct flow estimates for the KITTI-Raw dataset (Geiger et al., 2013). To the best of our knowledge, ours is the first work that leverages both contrastive learning and pseudo labeling for estimating optical flow in an SSL fashion.
3 APPROACH
In this section, we describe our method CLIP-Flow, a semi-supervised framework for optical flow estimation via iterative pseudo labeling and a contrastive flow loss. Using the two SOTA optical flow networks RAFT (Teed & Deng, 2020) and CRAFT (Sui et al., 2022) as backbones (c.f. Sec. 3.1), we obtain non-trivial improvements by leveraging our iterative pseudo labeling (PL) (c.f. Sec. 3.2) and the proposed contrastive flow loss (c.f. Sec. 3.3). It should be noted that CLIP-Flow can be easily extended to other optical flow networks, e.g., FlowNet (Dosovitskiy et al., 2015; Ilg et al., 2017), SpyNet (Ranjan & Black, 2017) and PWC-Net (Sun et al., 2018a), with little modification.
3.1 PRELIMINARIES
Given two consecutive RGB images $I_1, I_2 \in \mathbb{R}^{H \times W \times 3}$, the optical flow $\mathbf{f} \in \mathbb{R}^{H \times W \times 2}$ is defined as a dense 2D motion field $\mathbf{f} = (f^u, f^v)$, which maps each pixel $(u, v)$ in $I_1$ to its counterpart $(u', v')$ in $I_2$, with $u' = u + f^u$ and $v' = v + f^v$.
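For intuition, this definition directly yields a differentiable warping operator: sampling $I_2$ at $(u + f^u, v + f^v)$ reconstructs the reference view, which is also how features can be warped for the contrastive loss in Sec. 3.3. A minimal PyTorch sketch, with the function name and interface being our own choices:

```python
import torch
import torch.nn.functional as F

def warp_by_flow(img2: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp I2 back to the view of I1 using forward flow defined on I1.

    img2: (B, C, H, W) second image (or feature map).
    flow: (B, 2, H, W) flow (f^u, f^v) mapping each pixel of I1 into I2.
    Returns img2 sampled at (u + f^u, v + f^v), i.e. an estimate of I1.
    """
    b, _, h, w = img2.shape
    ys = torch.arange(h, device=img2.device, dtype=img2.dtype)
    xs = torch.arange(w, device=img2.device, dtype=img2.dtype)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    u = grid_x.unsqueeze(0) + flow[:, 0]          # (B, H, W) target x coords
    v = grid_y.unsqueeze(0) + flow[:, 1]          # (B, H, W) target y coords
    # grid_sample expects (x, y) coordinates normalized to [-1, 1]
    grid = torch.stack([2 * u / (w - 1) - 1, 2 * v / (h - 1) - 1], dim=-1)
    return F.grid_sample(img2, grid, align_corners=True)
```

Pixels mapped out of bounds or onto occluded content are sampled incorrectly, which is one source of the mismatches our contrastive loss is designed to dampen.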
RAFT. Among end-to-end optical flow methods (Dosovitskiy et al., 2015; Ilg et al., 2017; Ranjan & Black, 2017; Sun et al., 2018a; Sui et al., 2022; Jeong et al., 2022; Teed & Deng, 2020), RAFT (Teed & Deng, 2020) features a learning-to-optimize strategy using a recurrent GRU-based decoder to iteratively update a flow field $\mathbf{f}$ initialized at zero. Specifically, it extracts features from the input images $I_1$ and $I_2$ using a convolutional encoder $g_\theta$, outputting features at 1/8 resolution, i.e., $g_\theta(I_1) \in \mathbb{R}^{H' \times W' \times C}$ and $g_\theta(I_2) \in \mathbb{R}^{H' \times W' \times C}$, where $H' = H/8$ and $W' = W/8$ are the spatial dimensions and $C = 256$ is the feature dimension. In addition, a context network $h_\theta$ is applied to the first input image $I_1$. All-pair visual similarity is then computed by constructing a 4D correlation volume $V \in \mathbb{R}^{H' \times W' \times H' \times W'}$ between the features $g_\theta(I_1)$ and $g_\theta(I_2)$. It can be computed via matrix multiplication as $V = g_\theta(I_1) \cdot g_\theta(I_2)^T$, i.e., $\mathbb{R}^{(H' \cdot W') \times C} \times \mathbb{R}^{C \times (H' \cdot W')} \mapsto \mathbb{R}^{(H' \cdot W') \times (H' \cdot W')}$, which is further reshaped to $V \in \mathbb{R}^{H' \times W' \times H' \times W'}$. RAFT then builds a 4-layer correlation pyramid $\{V^s\}_{s=1}^{4}$ by pooling the last two dimensions of $V$ with kernel sizes $2^{s-1}$, respectively. The GRU-based decoder estimates a sequence of flow estimates $\{\mathbf{f}_1, \dots, \mathbf{f}_T\}$ ($T = 12$ or $24$) from a zero-initialized $\mathbf{f}_0 = \mathbf{0}$. RAFT attains high accuracy, strong generalization, and high efficiency. We take RAFT as the backbone and achieve boosted performance, i.e., F1-all errors of 4.11 (ours) vs. 5.10 (RAFT) on KITTI 2015 (Menze & Geiger, 2015) (c.f. Tab. 1).
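As a concrete reference for the shapes involved, a condensed PyTorch sketch of the all-pair correlation volume and pyramid follows. The helper name is ours, and RAFT's lookup operator that indexes this pyramid during GRU updates is omitted:

```python
import torch
import torch.nn.functional as F

def correlation_pyramid(f1: torch.Tensor, f2: torch.Tensor, levels: int = 4):
    """Build RAFT-style all-pair correlation volumes.

    f1, f2: (B, C, H', W') encoder features g_theta(I1), g_theta(I2)
            at 1/8 resolution with C = 256.
    Returns a list of `levels` volumes; level s keeps the first two spatial
    dims and average-pools the last two down by a factor of 2^(s-1).
    """
    b, c, h, w = f1.shape
    # V[b, i, j, u, v] = <f1[b, :, i, j], f2[b, :, u, v]> / sqrt(C)
    v = torch.einsum("bchw,bcuv->bhwuv", f1, f2) / c ** 0.5  # (B,H',W',H',W')
    v = v.reshape(b * h * w, 1, h, w)  # treat each source pixel as a batch item
    pyramid = []
    for _ in range(levels):
        pyramid.append(v.view(b, h, w, *v.shape[-2:]))
        v = F.avg_pool2d(v, kernel_size=2, stride=2)
    return pyramid
```

Level 1 here corresponds to kernel size $2^0 = 1$ (no pooling); the successive 2x average pooling reproduces the cumulative kernel sizes $2^{s-1}$.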
CRAFT. To overcome the challenges of large displacements with motion blur and the limited field of view due to the locality of convolutional features in RAFT, CRAFT (Sui et al., 2022) leverages transformer layers to learn global features by considering long-range dependencies, and hence revitalizes the 4D correlation volume $V$ computation of RAFT. We also use CRAFT as the backbone and attain an improvement, i.e., F1-all errors of 4.66 (ours) vs. 4.79 (CRAFT) on KITTI 2015 (Menze & Geiger, 2015) (c.f. Tab. 1).
3.2 ITERATIVE PSEUDO LABELING
Deep learning based optical flow methods are usually pretrained on synthetic data¹ and finetuned on small real datasets. This begs an important question: how can we effectively transfer the knowledge learned from the synthetic domain to real-world scenarios and bridge the big gap between them? Our semi-supervised framework is proposed to improve performance on real datasets $D_R$ by iteratively transferring the knowledge learned from synthetic data $D_S$ and/or the few available labeled real datasets $D_R^{tr}$ (with sparse or dense ground truth optical flow labels). Without loss of generality, we assume that the real data $D_R$ consists of i) a small amount of training data $D_R^{tr}$ (e.g., the KITTI 2015 (Menze & Geiger, 2015) training set with 200 image pairs), owing to the expensive and tedious labeling by humans, ii) a set of test data $D_R^{te}$ (e.g., the KITTI 2015 test set with 200 pairs), and iii) a large amount of unlabeled data $D_R^{u}$ (e.g., the KITTI raw dataset (Geiger et al., 2013) with 84,642 image pairs) that is quite similar to the test domain. Therefore, we propose to use the unlabeled, real KITTI raw data by generating pseudo ground truth labels with a master (or teacher) model, thereby transferring the knowledge from pretraining on synthetic data or small real data to the KITTI 2015 test set.

As shown in Fig. 2, our semi-supervised iterative pseudo labeling training strategy includes 3 steps: 1) training on a large amount of unlabeled data ($D_R^u$) supervised by a master (or teacher) model, which is initially a model pretrained on large-scale synthetic and small real datasets; 2) conducting k-fold cross validation on the labeled real dataset ($D_R^{tr}$) to find the best hyper-parameters, e.g., the number of finetuning steps $S_{ft}$; and 3) finetuning our model on the labeled dataset ($D_R^{tr}$) using the best hyper-parameters selected above, and promoting the finetuned model to be the new master (or teacher) model, repeating these steps until the predefined number of iterations $N$ is reached or the gain in evaluation accuracy on the test set ($D_R^{te}$) becomes marginal. The detailed algorithm is illustrated in Alg. 1.
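For clarity, the loop in Alg. 1 can be summarized by the following Python sketch. The callables passed in (`train`, `cross_validate`, `finetune`, `evaluate`) and the default values are placeholders we introduce for illustration, not a transcription of our actual implementation:

```python
from typing import Callable, Iterable

def iterative_pseudo_labeling(
    teacher, student,
    d_unlabeled: Iterable,     # unlabeled real pairs (I1, I2), e.g. KITTI raw
    d_train: Iterable,         # small labeled real set, e.g. KITTI 2015 train
    d_test: Iterable,          # set used to monitor the gain per iteration
    train: Callable,           # trains a model on (I1, I2, flow) triplets
    cross_validate: Callable,  # k-fold selection of finetuning steps S_ft
    finetune: Callable,        # finetunes a model for a given number of steps
    evaluate: Callable,        # returns the F1-all error (lower is better)
    n_iters: int = 4, tol: float = 1e-3,
):
    prev_err = float("inf")
    for _ in range(n_iters):
        # Step 1: pseudo-label the unlabeled real data with the current
        # teacher, then train the student under this supervision.
        pseudo = ((i1, i2, teacher(i1, i2)) for i1, i2 in d_unlabeled)
        student = train(student, pseudo)
        # Step 2: choose hyper-parameters (e.g. finetuning steps S_ft) by
        # k-fold cross validation on the small labeled real set.
        s_ft = cross_validate(student, d_train, k=5)
        # Step 3: finetune on the labeled set, then promote the finetuned
        # model to be the teacher for the next iteration.
        student = finetune(student, d_train, steps=s_ft)
        teacher = student
        err = evaluate(student, d_test)
        if prev_err - err < tol:  # stop once the gain becomes marginal
            break
        prev_err = err
    return student
```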
Semi-supervised learning on unlabeled real dataset. Our proposed iterative pseudo labeling
method aims at dealing with real imagery that usually lacks ground truth labels and is difficult to be
¹We assume the synthetic data is large-scale and has ground truth optical flow maps.