CLIP-FLOW: CONTRASTIVE LEARNING BY SEMI-SUPERVISED ITERATIVE PSEUDO LABELING FOR OPTICAL FLOW ESTIMATION
Zhiqi Zhang, Nitin Bansal, Changjiang Cai, Pan Ji, Qingan Yan, Xiangyu Xu & Yi Xu
OPPO US Research Center, Innopeak Technology, USA
{zhiqi.zhang,pan.ji,nitin.bansal,changjiang.cai,qingan.yan,
xiangyu.xu,yi.xu}@innopeaktech.com
ABSTRACT
Synthetic datasets are often used to pretrain end-to-end optical flow networks, due to the lack of a large amount of labeled, real scene data. But major drops in accuracy occur when moving from synthetic to real scenes. How do we better transfer the knowledge learned from synthetic to real domains? To this end, we propose CLIP-Flow, a semi-supervised iterative pseudo labeling framework to transfer the pretraining knowledge to the target real domain. We leverage large-scale, unlabeled real data to facilitate transfer learning with the supervision of iteratively updated pseudo ground truth labels, bridging the domain gap between the synthetic and the real. In addition, we propose a contrastive flow loss on reference features and the features warped by pseudo ground truth flows, to further boost accurate matching and dampen mismatching due to motion, occlusion, or noisy pseudo labels. We adopt RAFT as the backbone and obtain an F1-all error of 4.11%, i.e., a 19% error reduction from RAFT (5.10%), ranking 2nd on the KITTI 2015 benchmark at the time of submission. Our framework can also be extended to other models, e.g., CRAFT, reducing the F1-all error from 4.79% to 4.66% on the KITTI 2015 benchmark.
1 INTRODUCTION
Optical flow is critical in many high-level vision problems, such as action recognition (Simonyan & Zisserman, 2014; Sevilla-Lara et al., 2018; Sun et al., 2018b), video segmentation (Yang et al., 2021; Yang & Ramanan, 2021) and editing (Bonneel et al., 2015), and autonomous driving (Janai et al., 2020). Traditional methods (Horn & Schunck, 1981; Menze et al., 2015; Ranftl et al., 2014; Zach et al., 2007) mainly formulate flow estimation as an optimization problem over hand-crafted features. The optimization searches over the space of dense displacement fields between a pair of input images, which is often time-consuming. Recently, data-driven deep learning methods (Dosovitskiy et al., 2015; Ilg et al., 2017; Teed & Deng, 2020) have proven successful in estimating optical flow, thanks to the availability of various high-quality synthetic datasets (Butler et al., 2012b; Dosovitskiy et al., 2015; Mayer et al., 2016; Krispin et al., 2016).

However, most recent works (Dosovitskiy et al., 2015; Ilg et al., 2017; Teed & Deng, 2020; Jeong et al., 2022) mainly train on synthetic datasets, given that there are no sufficiently large labeled real optical flow datasets for training a deep learning model. State-of-the-art (SOTA) models consistently achieve more accurate results on synthetic datasets like Sintel (Butler et al., 2012a) than on real scene datasets like KITTI 2015 (Menze & Geiger, 2015). This is mainly because the model tends to overfit the small training set, which is echoed in Tab. 1: for all of the previous SOTA methods, there is a large gap between the training F1-all error and the test F1-all error when training and testing on the KITTI dataset. Therefore, we argue that this performance gap stems from a dearth of real training data and a large distribution gap between synthetic data and real scene data. Although a model can fit all kinds of synthetic data almost perfectly, it performs rather unsatisfactorily when dealing with real data. Our proposed work focuses on bridging this glaring performance gap between synthetic data and real scene data.
As in previous data-driven approaches, smarter and longer training strategies prove beneficial and help in obtaining better optical flow results. Through our work, we also try to answer the following two questions: (i) How can we take advantage of current SOTA optical flow models to further consolidate gains on real datasets? and (ii) How can we use semi-supervised learning, along with contrastive feature representation learning strategies, to effectively utilize the huge amount of unlabeled real data at our disposal?
Unsupervised visual representation learning (He et al., 2020; Chen et al., 2020b) has proven successful in boosting most major vision tasks, such as image classification, object detection, and semantic segmentation, to name a few. Works such as (He et al., 2020; Chen et al., 2020b) also emphasize the importance of the contrastive loss when dealing with large, dense datasets. Given that optical flow tasks generally lack real ground truth labels, we ask whether leveraging unsupervised visual representation learning can boost optical flow performance. To answer this question, we examine the impact of contrastive learning and pseudo labeling during training under a semi-supervised setting. In particular, we conduct exhaustive experiments using the KITTI-Raw (Geiger et al., 2013) and KITTI 2015 (Menze & Geiger, 2015) datasets to evaluate the performance gain, and show encouraging results. We believe the gain in the model's performance reflects the fact that employing representation learning techniques such as contrastive learning helps in achieving a much more refined 4D cost correlation volume. To constrain the flow per pixel, we employ a simple positional encoding of 2D Cartesian coordinates on the input frames, as suggested in (Liu et al., 2018), which further consolidates the gain achieved by contrastive learning. We follow this up with iterative-flow-refinement training using pseudo labeling, which further builds on the previous gains to give us SOTA results. We would also like to highlight that we follow a specific, well-calibrated training strategy to fully exploit the gains of our method. Without loss of generality, we use RAFT (Teed & Deng, 2020) as the backbone network for our experiments. To fairly compare our method with existing SOTA methods, we tested our proposed method on the KITTI 2015 test set, where we achieve the best F1-all error among all published methods by a significant margin.
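To make the positional encoding mentioned above concrete, the sketch below shows a CoordConv-style (Liu et al., 2018) implementation that appends normalized 2D Cartesian coordinate channels to the input frames. The function name and the [-1, 1] normalization range are our illustrative choices, not the exact code used in our experiments:

```python
import torch

def add_coord_channels(frames: torch.Tensor) -> torch.Tensor:
    """Append normalized 2D Cartesian coordinate channels to a batch of frames.

    frames: (B, C, H, W) tensor, e.g. RGB images with C = 3.
    Returns a (B, C + 2, H, W) tensor whose two extra channels hold the x and y
    pixel coordinates scaled to [-1, 1], as in CoordConv (Liu et al., 2018).
    """
    b, _, h, w = frames.shape
    ys = torch.linspace(-1.0, 1.0, h, device=frames.device)
    xs = torch.linspace(-1.0, 1.0, w, device=frames.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")  # each (H, W)
    coords = torch.stack([grid_x, grid_y], dim=0)           # (2, H, W)
    coords = coords.unsqueeze(0).expand(b, -1, -1, -1)      # (B, 2, H, W)
    return torch.cat([frames, coords], dim=1)
```

Under this sketch, the encoded frames would simply replace the raw RGB inputs, with the encoder's first layer widened by two channels.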
To summarize our main contributions: 1) we provide a detailed training strategy that uses SSL methods on top of the well-known RAFT model to improve SOTA performance for optical flow estimation; 2) we present ways to employ contrastive learning and pseudo labeling effectively and intelligently, such that both jointly help in improving upon existing benchmarks; 3) we discuss the positive impact of a simple 2D positional encoding, which benefits flow training on both the Sintel and KITTI 2015 datasets.
2 RELATED WORK
Optical flow estimation. Maximizing visual similarity between neighboring frames by formulating the problem as energy minimization (Black & Anandan, 1993; Bruhn et al., 2005; Sun et al., 2014) has been the primary approach to optical flow estimation. Previous works such as (Dosovitskiy et al., 2015; Ilg et al., 2017; Ranjan & Black, 2017; Sun et al., 2018a; 2019; Hui et al., 2018; 2020; Zou et al., 2018) have established the efficacy of deep neural networks in estimating optical flow under both supervised and self-supervised settings. Iterative improvements in model architecture and better regularization terms have been primarily responsible for the better results. However, most of these works struggle to handle occlusions and small, fast-moving objects, to capture global motion, and to rectify and recover from early mistakes.
To overcome these limitations, Teed & Deng (2020) proposed RAFT, which adopts a learning-to-optimize strategy using a recurrent GRU-based decoder to iteratively update a flow field $\mathbf{f}$ initialized at zero. Inspired by the success of RAFT, a number of variants have emerged, such as CRAFT (Sui et al., 2022), GMA (Jiang et al., 2021), Sparse Volume RAFT (Jiang et al., 2021) and FlowFormer (Huang et al., 2022), all of which benefit from the all-pair correlation volume way of estimating optical flow. The current state-of-the-art work RAFT-OCTC (Jeong et al., 2022) also uses a RAFT-based architecture, and it imposes consistency on various proxy tasks to improve flow estimation. Considering RAFT's effectiveness, generalizability, and relatively small model size, we adopt RAFT (Teed & Deng, 2020) as our base architecture and employ semi-supervised iterative pseudo labeling together with the contrastive flow loss to achieve a state-of-the-art result on the KITTI 2015 (Menze & Geiger, 2015) benchmark.
Figure 1: Overview of our Contrastive Flow Model (CLIP-Flow). Input frames are first positionally encoded according to their individual 2D pixel locations. Subsequently, a contrastive loss is applied on the encoded features to enforce pixel-wise consistency using the ground truth flow.
Semi-Supervised and Representation Learning. Semi-supervised learning (SSL) and representation learning have shown success for a range of computer vision tasks, both during pretext task training and on specific downstream tasks. Most of these methods leverage contrastive learning (Chen et al., 2020a;b; He et al., 2020), clustering (Caron et al., 2020), and pseudo-labeling (Caron et al., 2021; Chen & He, 2021; Grill et al., 2020; Hoyer et al., 2021) as an enforcing mechanism. Recent works such as (Caron et al., 2020; 2021; Chen et al., 2020a;b; 2021; Grill et al., 2020; He et al., 2020; Xie et al., 2021b; Yun et al., 2022) have empirically shown the benefits of SSL for downstream tasks such as image classification, object detection, and instance and semantic segmentation. These works leverage contrastive losses in different shapes and forms to facilitate better representation learning. For example, (He et al., 2020; Chen et al., 2020b) advocate finding positive and negative keys with respect to a given encoded query to enforce a contrastive loss. Along similar lines, for dense prediction tasks, studies such as (O Pinheiro et al., 2020; Xiao et al., 2021; Xie et al., 2021a) enforce matching of overlapping regions between two augmented images, whereas (Yun et al., 2022) forms positive/negative pairs between adjacent patches. As part of our approach, we use a contrastive flow loss between features of neighboring frames, where we draw one-to-one positive pair relations between the reference features and the warped features using (pseudo) ground truth flow or flow estimates, as shown in Fig. 1.
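To make this one-to-one pairing concrete, the sketch below shows one plausible form of such a pixel-wise contrastive objective, in the InfoNCE style of (He et al., 2020; Chen et al., 2020b). The function name, temperature, and masking are our illustrative choices; our actual loss is defined in Sec. 3.3. It is intended for 1/8-resolution feature maps, since the logit matrix grows quadratically with the number of pixels:

```python
import torch
import torch.nn.functional as F

def contrastive_flow_loss(feat_ref, feat_warp, valid, tau=0.07):
    """InfoNCE-style loss over dense features (illustrative sketch).

    feat_ref:  (B, C, H, W) features of the reference frame.
    feat_warp: (B, C, H, W) features of the second frame, warped back to the
               reference view by the (pseudo) ground truth flow.
    valid:     (B, H, W) boolean mask of in-bounds, well-matched pixels.
    For each pixel, the warped feature at the same location is the positive
    key; all other locations in the image act as negatives.
    """
    b, c, h, w = feat_ref.shape
    q = F.normalize(feat_ref.flatten(2).transpose(1, 2), dim=-1)   # (B, HW, C)
    k = F.normalize(feat_warp.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    logits = torch.bmm(q, k.transpose(1, 2)) / tau                 # (B, HW, HW)
    # The positive class for query pixel i is key index i.
    target = torch.arange(h * w, device=q.device).expand(b, -1)    # (B, HW)
    loss = F.cross_entropy(logits.transpose(1, 2), target, reduction="none")
    mask = valid.flatten(1).float()
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)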
Pseudo labeling is another important approach in the SSL training paradigm. Some studies leverage pseudo labeling to generate training labels as part of consistency training (Yun et al., 2019; Olsson et al., 2021; Hoyer et al., 2021), while other works use pseudo labeling to improve pretext task training, as in (Caron et al., 2021; Chen & He, 2021; Grill et al., 2020). In contrast, we use pseudo labeling as an iterative refinement mechanism, through which we effectively distill correct flow estimates for the KITTI-Raw dataset (Geiger et al., 2013). To the best of our knowledge, ours is the first work that leverages both contrastive learning and pseudo labeling for estimating optical flow in an SSL fashion.
3 APPROACH
In this section, we describe our method CLIP-Flow, a semi-supervised framework for optical flow estimation via iterative pseudo labeling and a contrastive flow loss. Using the two SOTA optical flow networks RAFT (Teed & Deng, 2020) and CRAFT (Sui et al., 2022) as backbones (c.f. Sec. 3.1), we obtain non-trivial improvements by leveraging our iterative pseudo labeling (PL) (c.f. Sec. 3.2) and the proposed contrastive flow loss (c.f. Sec. 3.3). It should be noted that CLIP-Flow can be easily extended to other optical flow networks, e.g., FlowNet (Dosovitskiy et al., 2015; Ilg et al., 2017), SpyNet (Ranjan & Black, 2017) and PWC-Net (Sun et al., 2018a), with little modification.
3.1 PRELIMINARIES
Given two consecutive RGB images $I_1, I_2 \in \mathbb{R}^{H \times W \times 3}$, the optical flow $\mathbf{f} \in \mathbb{R}^{H \times W \times 2}$ is defined as a dense 2D motion field $\mathbf{f} = (f^u, f^v)$, which maps each pixel $(u, v)$ in $I_1$ to its counterpart $(u', v')$ in $I_2$, with $u' = u + f^u$ and $v' = v + f^v$.
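For intuition, this definition directly yields a differentiable warping operator: sampling $I_2$ at $(u + f^u, v + f^v)$ reconstructs the reference view, which is also how features can be warped for the contrastive loss in Sec. 3.3. A minimal PyTorch sketch, with the function name and interface being our own choices:

```python
import torch
import torch.nn.functional as F

def warp_by_flow(img2: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp I2 back to the view of I1 using forward flow defined on I1.

    img2: (B, C, H, W) second image (or feature map).
    flow: (B, 2, H, W) flow (f^u, f^v) mapping each pixel of I1 into I2.
    Returns img2 sampled at (u + f^u, v + f^v), i.e. an estimate of I1.
    """
    b, _, h, w = img2.shape
    ys = torch.arange(h, device=img2.device, dtype=img2.dtype)
    xs = torch.arange(w, device=img2.device, dtype=img2.dtype)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    u = grid_x.unsqueeze(0) + flow[:, 0]          # (B, H, W) target x coords
    v = grid_y.unsqueeze(0) + flow[:, 1]          # (B, H, W) target y coords
    # grid_sample expects (x, y) coordinates normalized to [-1, 1]
    grid = torch.stack([2 * u / (w - 1) - 1, 2 * v / (h - 1) - 1], dim=-1)
    return F.grid_sample(img2, grid, align_corners=True)
```

Pixels mapped out of bounds or onto occluded content are sampled incorrectly, which is one source of the mismatches our contrastive loss is designed to dampen.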
RAFT. Among end-to-end optical flow methods (Dosovitskiy et al., 2015; Ilg et al., 2017; Ranjan & Black, 2017; Sun et al., 2018a; Sui et al., 2022; Jeong et al., 2022; Teed & Deng, 2020), RAFT (Teed & Deng, 2020) features a learning-to-optimize strategy using a recurrent GRU-based decoder to iteratively update a flow field $\mathbf{f}$ initialized at zero. Specifically, it extracts features from the input images $I_1$ and $I_2$ using a convolutional encoder $g_\theta$, outputting features at 1/8 resolution, i.e., $g_\theta(I_1) \in \mathbb{R}^{H' \times W' \times C}$ and $g_\theta(I_2) \in \mathbb{R}^{H' \times W' \times C}$, where $H' = H/8$ and $W' = W/8$ are the spatial dimensions and $C = 256$ is the feature dimension. In addition, a context network $h_\theta$ is applied to the first input image $I_1$. All-pair visual similarity is then computed by constructing a 4D correlation volume $V \in \mathbb{R}^{H' \times W' \times H' \times W'}$ between the features $g_\theta(I_1)$ and $g_\theta(I_2)$. It can be computed via matrix multiplication as $V = g_\theta(I_1) \cdot g_\theta(I_2)^T$, i.e., $\mathbb{R}^{(H' \cdot W') \times C} \times \mathbb{R}^{C \times (H' \cdot W')} \mapsto \mathbb{R}^{(H' \cdot W') \times (H' \cdot W')}$, which is further reshaped to $V \in \mathbb{R}^{H' \times W' \times H' \times W'}$. RAFT then builds a 4-layer correlation pyramid $\{V^s\}_{s=1}^{4}$ by pooling the last two dimensions of $V$ with kernel sizes $2^{s-1}$, respectively. The GRU-based decoder estimates a sequence of flow estimates $\{\mathbf{f}_1, \dots, \mathbf{f}_T\}$ ($T = 12$ or $24$) from a zero-initialized $\mathbf{f}_0 = \mathbf{0}$. RAFT attains high accuracy, strong generalization, and high efficiency. We take RAFT as the backbone and achieve boosted performance, i.e., F1-all errors of 4.11 (ours) vs. 5.10 (RAFT) on KITTI 2015 (Menze & Geiger, 2015) (c.f. Tab. 1).
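As a concrete reference for the shapes involved, a condensed PyTorch sketch of the all-pair correlation volume and pyramid follows. The helper name is ours, and RAFT's lookup operator that indexes this pyramid during GRU updates is omitted:

```python
import torch
import torch.nn.functional as F

def correlation_pyramid(f1: torch.Tensor, f2: torch.Tensor, levels: int = 4):
    """Build RAFT-style all-pair correlation volumes.

    f1, f2: (B, C, H', W') encoder features g_theta(I1), g_theta(I2)
            at 1/8 resolution with C = 256.
    Returns a list of `levels` volumes; level s keeps the first two spatial
    dims and average-pools the last two down by a factor of 2^(s-1).
    """
    b, c, h, w = f1.shape
    # V[b, i, j, u, v] = <f1[b, :, i, j], f2[b, :, u, v]> / sqrt(C)
    v = torch.einsum("bchw,bcuv->bhwuv", f1, f2) / c ** 0.5  # (B,H',W',H',W')
    v = v.reshape(b * h * w, 1, h, w)  # treat each source pixel as a batch item
    pyramid = []
    for _ in range(levels):
        pyramid.append(v.view(b, h, w, *v.shape[-2:]))
        v = F.avg_pool2d(v, kernel_size=2, stride=2)
    return pyramid
```

Level 1 here corresponds to kernel size $2^0 = 1$ (no pooling); the successive 2x average pooling reproduces the cumulative kernel sizes $2^{s-1}$.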
CRAFT. To overcome the challenges of large displacements with motion blur and the limited field of view due to the locality of convolutional features in RAFT, CRAFT (Sui et al., 2022) leverages transformer layers to learn global features by considering long-range dependencies, and hence revitalizes the 4D correlation volume $V$ computation of RAFT. We also use CRAFT as the backbone and attain an improvement, i.e., F1-all errors of 4.66 (ours) vs. 4.79 (CRAFT) on KITTI 2015 (Menze & Geiger, 2015) (c.f. Tab. 1).
3.2 ITERATIVE PSEUDO LABELING
Deep learning based optical flow methods are usually pretrained on synthetic data¹ and finetuned on small real datasets. This begs an important question: how can we effectively transfer the knowledge learned from the synthetic domain to real-world scenarios and bridge the big gap between them? Our semi-supervised framework is proposed to improve performance on real datasets $D_R$ by iteratively transferring the knowledge learned from synthetic data $D_S$ and/or the few available labeled real datasets $D_R^{tr}$ (with sparse or dense ground truth optical flow labels). Without loss of generality, we assume that the real data $D_R$ consists of i) a small amount of training data $D_R^{tr}$ (e.g., the KITTI 2015 (Menze & Geiger, 2015) training set with 200 image pairs), owing to the expensive and tedious labeling by humans, ii) a set of test data $D_R^{te}$ (e.g., the KITTI 2015 test set with 200 pairs), and iii) a large amount of unlabeled data $D_R^{u}$ (e.g., the KITTI raw dataset (Geiger et al., 2013) with 84,642 image pairs) that is quite similar to the test domain. Therefore, we propose to use the unlabeled, real KITTI raw data by generating pseudo ground truth labels with a master (or teacher) model, thereby transferring the knowledge from pretraining on synthetic data or small real data to the KITTI 2015 test set.

As shown in Fig. 2, our semi-supervised iterative pseudo labeling training strategy includes 3 steps: 1) training on a large amount of unlabeled data ($D_R^u$) supervised by a master (or teacher) model, which is initially a model pretrained on large-scale synthetic and small real datasets; 2) conducting k-fold cross validation on the labeled real dataset ($D_R^{tr}$) to find the best hyper-parameters, e.g., the number of finetuning steps $S_{ft}$; and 3) finetuning our model on the labeled dataset ($D_R^{tr}$) using the best hyper-parameters selected above, and promoting the finetuned model to be the new master (or teacher) model, repeating these steps until the predefined number of iterations $N$ is reached or the gain in evaluation accuracy on the test set ($D_R^{te}$) becomes marginal. The detailed algorithm is illustrated in Alg. 1.
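For clarity, the loop in Alg. 1 can be summarized by the following Python sketch. The callables passed in (`train`, `cross_validate`, `finetune`, `evaluate`) and the default values are placeholders we introduce for illustration, not a transcription of our actual implementation:

```python
from typing import Callable, Iterable

def iterative_pseudo_labeling(
    teacher, student,
    d_unlabeled: Iterable,     # unlabeled real pairs (I1, I2), e.g. KITTI raw
    d_train: Iterable,         # small labeled real set, e.g. KITTI 2015 train
    d_test: Iterable,          # set used to monitor the gain per iteration
    train: Callable,           # trains a model on (I1, I2, flow) triplets
    cross_validate: Callable,  # k-fold selection of finetuning steps S_ft
    finetune: Callable,        # finetunes a model for a given number of steps
    evaluate: Callable,        # returns the F1-all error (lower is better)
    n_iters: int = 4, tol: float = 1e-3,
):
    prev_err = float("inf")
    for _ in range(n_iters):
        # Step 1: pseudo-label the unlabeled real data with the current
        # teacher, then train the student under this supervision.
        pseudo = ((i1, i2, teacher(i1, i2)) for i1, i2 in d_unlabeled)
        student = train(student, pseudo)
        # Step 2: choose hyper-parameters (e.g. finetuning steps S_ft) by
        # k-fold cross validation on the small labeled real set.
        s_ft = cross_validate(student, d_train, k=5)
        # Step 3: finetune on the labeled set, then promote the finetuned
        # model to be the teacher for the next iteration.
        student = finetune(student, d_train, steps=s_ft)
        teacher = student
        err = evaluate(student, d_test)
        if prev_err - err < tol:  # stop once the gain becomes marginal
            break
        prev_err = err
    return student
```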
Semi-supervised learning on unlabeled real dataset. Our proposed iterative pseudo labeling
method aims at dealing with real imagery that usually lacks ground truth labels and is difficult to be
¹We assume the synthetic data is large-scale and has ground truth optical flow maps.