
3.1 PRELIMINARIES
Given two consecutive RGB images $I_1, I_2 \in \mathbb{R}^{H \times W \times 3}$, the optical flow $\mathbf{f} \in \mathbb{R}^{H \times W \times 2}$ is defined as a dense 2D motion field $\mathbf{f} = (f_u, f_v)$, which maps each pixel $(u, v)$ in $I_1$ to its counterpart $(u', v')$ in $I_2$, with $u' = u + f_u$ and $v' = v + f_v$.
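To make the mapping concrete, the following toy sketch (ours, not code from any released implementation) evaluates the warped coordinates for every pixel:

```python
import numpy as np

# Toy illustration of the flow definition above (our sketch): map every
# pixel (u, v) in I1 to its counterpart (u', v') in I2.
H, W = 4, 5
f = np.random.randn(H, W, 2).astype(np.float32)  # dense 2D motion field (f_u, f_v)

v, u = np.mgrid[0:H, 0:W]   # pixel coordinates: u is horizontal, v is vertical
u_prime = u + f[..., 0]     # u' = u + f_u
v_prime = v + f[..., 1]     # v' = v + f_v
```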
RAFT. Among end-to-end optical flow methods (Dosovitskiy et al., 2015; Ilg et al., 2017; Ranjan & Black, 2017; Sun et al., 2018a; Sui et al., 2022; Jeong et al., 2022; Teed & Deng, 2020), RAFT (Teed & Deng, 2020) features a learning-to-optimize strategy that uses a recurrent GRU-based decoder to iteratively update a flow field $\mathbf{f}$, initialized at zero. Specifically, it extracts features from the input images $I_1$ and $I_2$ with a convolutional encoder $g_\theta$, producing features at 1/8 resolution, i.e., $g_\theta(I_1) \in \mathbb{R}^{H' \times W' \times C}$ and $g_\theta(I_2) \in \mathbb{R}^{H' \times W' \times C}$, where $H' = H/8$ and $W' = W/8$ are the spatial dimensions and $C = 256$ is the feature dimension. A context network $h_\theta$ is also applied to the first input image $I_1$. All-pairs visual similarity is then computed by constructing a 4D correlation volume $V \in \mathbb{R}^{H' \times W' \times H' \times W'}$ between the features $g_\theta(I_1)$ and $g_\theta(I_2)$. It can be computed via matrix multiplication as $V = g_\theta(I_1) \cdot g_\theta(I_2)^{T}$, i.e., $\mathbb{R}^{(H' \cdot W') \times C} \times \mathbb{R}^{C \times (H' \cdot W')} \mapsto \mathbb{R}^{(H' \cdot W') \times (H' \cdot W')}$, which is further reshaped to $V \in \mathbb{R}^{H' \times W' \times H' \times W'}$. RAFT then builds a 4-layer correlation pyramid $\{V^{s}\}_{s=1}^{4}$ by pooling the last two dimensions of $V$ with kernel sizes $2^{s-1}$, respectively. The GRU-based decoder estimates a sequence of flow estimates $\{\mathbf{f}_1, \ldots, \mathbf{f}_T\}$ ($T = 12$ or $24$) from a zero-initialized $\mathbf{f}_0 = \mathbf{0}$. RAFT attains high accuracy, strong generalization, and high efficiency. We take RAFT as the backbone and achieve boosted performance, i.e., F1-all errors of 4.11 (ours) vs. 5.10 (RAFT) on KITTI-2015 (Menze & Geiger, 2015) (cf. Tab. 1).
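As a concrete illustration of the correlation-volume construction, the sketch below reproduces the matrix-multiplication view in a few lines of PyTorch. It is a minimal re-implementation for exposition, not RAFT's released code, and the small feature sizes are placeholders:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of RAFT's all-pairs correlation volume and pyramid
# (our re-implementation for illustration; toy sizes, B = batch).
B, C, H8, W8 = 1, 256, 12, 20           # features at 1/8 resolution
feat1 = torch.randn(B, C, H8, W8)       # g_theta(I1)
feat2 = torch.randn(B, C, H8, W8)       # g_theta(I2)

# (H'W', C) x (C, H'W') -> (H'W', H'W'), then reshape to 4D
f1 = feat1.flatten(2).transpose(1, 2)   # (B, H'W', C)
f2 = feat2.flatten(2)                   # (B, C, H'W')
V = torch.bmm(f1, f2)                   # (B, H'W', H'W')
V = V.view(B, H8, W8, H8, W8)           # V in R^{H' x W' x H' x W'}

# 4-level pyramid: pool the LAST two dimensions; pooling level s-1 by a
# factor of 2 is equivalent to pooling V with kernel size 2^{s-1}.
pyramid = [V.view(B * H8 * W8, 1, H8, W8)]
for s in range(2, 5):
    pyramid.append(F.avg_pool2d(pyramid[-1], kernel_size=2, stride=2))
```

During the iterative GRU updates, RAFT looks up these pyramid levels in a neighborhood around the current flow estimate; we omit the lookup step here.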
CRAFT. To overcome the challenges of large displacements with motion blur, and the limited field of view caused by the locality of convolutional features in RAFT, CRAFT (Sui et al., 2022) proposes to leverage transformer layers that learn global features by modeling long-range dependencies, and hence to revitalize the computation of the 4D correlation volume $V$ as in RAFT. We also use CRAFT as the backbone and attain an improvement, i.e., F1-all errors of 4.66 (ours) vs. 4.79 (CRAFT) on KITTI-2015 (Menze & Geiger, 2015) (cf. Tab. 1).
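To illustrate the idea of globalizing features with attention before computing $V$, here is a schematic sketch; it is our simplification with placeholder sizes, not CRAFT's actual cross-frame transformer layers:

```python
import torch
import torch.nn as nn

# Schematic sketch (NOT CRAFT's actual architecture): self-attention over
# the flattened 1/8-resolution feature map lets every pixel aggregate
# long-range context before the 4D correlation volume is built.
B, C, H8, W8 = 1, 256, 12, 20
feat = torch.randn(B, C, H8, W8)                 # convolutional features

tokens = feat.flatten(2).transpose(1, 2)         # (B, H'W', C): one token per pixel
attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
global_tokens, _ = attn(tokens, tokens, tokens)  # all-pairs attention
global_feat = global_tokens.transpose(1, 2).reshape(B, C, H8, W8)
```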
3.2 ITERATIVE PSEUDO LABELING
Deep learning based optical flow methods are usually pretrained on synthetic data¹ and finetuned on small real datasets. This begs an important question: how can the knowledge learned in the synthetic domain be effectively transferred to real-world scenarios, bridging the big gap between them? Our semi-supervised framework is proposed to improve performance on real datasets $\mathcal{D}_R$ by iteratively transferring the knowledge learned from synthetic data $\mathcal{D}_S$ and/or a few available real datasets $\mathcal{D}_R^{tr}$ (with sparse or dense ground truth optical flow labels). Without loss of generality, we assume that the real data $\mathcal{D}_R$ consists of i) a small amount of training data $\mathcal{D}_R^{tr}$ (e.g., the KITTI 2015 (Menze & Geiger, 2015) training set with 200 image pairs), owing to the expensive and tedious labeling by humans, ii) a set of testing data $\mathcal{D}_R^{te}$ (e.g., the KITTI 2015 test set with 200 pairs), and iii) a large amount of unlabeled data $\mathcal{D}_R^{u}$ (e.g., the KITTI raw dataset (Geiger et al., 2013) with 84,642 image pairs) which is quite similar to the test domain. Therefore, we propose to exploit the unlabeled, real KITTI Raw data by generating pseudo ground truth labels with a master (or teacher) model, so as to transfer the knowledge from pretraining on synthetic data or small real data to the real KITTI 2015 test set.
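A minimal sketch of the pseudo-label generation step is shown below; `teacher` stands for any pretrained flow network and `unlabeled_pairs` for an iterable over KITTI Raw image pairs, both hypothetical names of ours rather than identifiers from the paper's code:

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_pairs):
    """Run the current teacher on unlabeled pairs to get pseudo ground truth.

    `teacher` and `unlabeled_pairs` are hypothetical placeholders.
    """
    teacher.eval()
    pseudo = []
    for img1, img2 in unlabeled_pairs:
        flow = teacher(img1, img2)      # predicted flow serves as the pseudo label
        pseudo.append((img1, img2, flow))
    return pseudo
```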
As shown in Fig. 2, our semi-supervised iterative pseudo labeling training strategy includes 3 steps: 1) training on a large amount of unlabeled data ($\mathcal{D}_R^{u}$) supervised by a master (or teacher) model, which is initially chosen as a model pretrained on large-scale synthetic and small real datasets; 2) conducting k-fold cross validation on the labeled real dataset ($\mathcal{D}_R^{tr}$) to find the best hyper-parameters, e.g., the number of training steps $S_{ft}$ for finetuning; and 3) finetuning our model on the labeled dataset ($\mathcal{D}_R^{tr}$) using the best hyper-parameters selected above, and updating the finetuned model as a new version of the master (or teacher) model so as to repeat these steps in the next iteration, until the predefined number of iterations $N$ is reached or the gain in evaluation accuracy on the test set ($\mathcal{D}_R^{te}$) is marginal. The detailed algorithm is illustrated in Alg. 1.
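For concreteness, a minimal sketch of the loop follows; `train_on_pseudo`, `kfold_select_steps`, `finetune`, and `evaluate` are hypothetical placeholders for the paper's actual routines, and `generate_pseudo_labels` is the sketch given above:

```python
def iterative_pseudo_labeling(teacher, D_u, D_tr, D_te, N, eps=1e-3):
    """Our sketch of Alg. 1; all helper functions are hypothetical placeholders."""
    prev_err = float("inf")
    for _ in range(N):
        # Step 1: supervise a student on unlabeled data with pseudo labels
        pseudo = generate_pseudo_labels(teacher, D_u)
        student = train_on_pseudo(pseudo)
        # Step 2: k-fold cross validation on D_tr picks hyper-parameters,
        # e.g. the number of finetuning steps S_ft
        S_ft = kfold_select_steps(student, D_tr, k=5)
        # Step 3: finetune on D_tr; the result becomes the new teacher
        teacher = finetune(student, D_tr, steps=S_ft)
        # Stop early once the gain on the test split becomes marginal
        err = evaluate(teacher, D_te)
        if prev_err - err < eps:
            break
        prev_err = err
    return teacher
```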
Semi-supervised learning on unlabeled real dataset. Our proposed iterative pseudo labeling method aims at dealing with real imagery that usually lacks ground truth labels and is difficult to annotate.
¹We assume the synthetic data is large-scale and has ground truth optical flow maps.