three tasks in an iterated fashion to determine the smooth global background motion between consecutive frames: detecting moving objects, inpainting the flow in those regions, and smoothing the per-pixel optical flow vectors across time. There also exist methods that make use of epipolar
geometry [4] or various geometrically motivated subspace constraints [13]. The latter technique requires fairly long feature tracks, which may not be available in many real-world videos. Finally, many techniques that use 3D information have also been proposed, for example methods based on structure from motion [12], a depth camera [26], or a light field camera [24].
Related work (Deep Learning Approaches): Deep neural network (DNN)-based approaches to video stabilization have become very popular in recent years. The work
in [35] represents the warp fields using the weights of an unsupervised DNN that minimizes the sum of two terms: a regularizer encouraging the warp fields to be piecewise linear, and a fidelity term penalizing the distance between corresponding pixels in consecutive frames of the stabilized video. This approach, though elegant, must re-optimize the network weights from scratch for every video and therefore incurs a very high computational cost.
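For concreteness, the following is a minimal PyTorch sketch of such a two-term objective. The function name, tensor shapes, and the use of second-order spatial differences as the piecewise-linearity regularizer are illustrative assumptions on our part, not the exact formulation of [35].

```python
import torch
import torch.nn.functional as F

def two_term_loss(warp_a, warp_b, frame_a, frame_b, lam=0.1):
    # warp_a, warp_b: (1, H, W, 2) sampling grids in normalized [-1, 1]
    # coordinates; frame_a, frame_b: (1, 3, H, W) consecutive unstable frames.
    # Fidelity term: corresponding pixels of the two stabilized frames should
    # agree (approximated here by a pixel-wise difference after warping).
    stab_a = F.grid_sample(frame_a, warp_a, align_corners=True)
    stab_b = F.grid_sample(frame_b, warp_b, align_corners=True)
    fidelity = (stab_a - stab_b).abs().mean()

    # Regularizer: second-order spatial differences vanish wherever the warp
    # is linear, so penalizing them encourages piecewise-linear warp fields.
    def second_diff(w):
        d_y = w[:, 2:, :, :] - 2 * w[:, 1:-1, :, :] + w[:, :-2, :, :]
        d_x = w[:, :, 2:, :] - 2 * w[:, :, 1:-1, :] + w[:, :, :-2, :]
        return d_y.abs().mean() + d_x.abs().mean()

    return fidelity + lam * (second_diff(warp_a) + second_diff(warp_b))
```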
The work in [36] trains a DNN offline on a large video dataset with synthetic unstable motion. Its weights are trained in an unsupervised fashion to generate warp fields that (1) have dominant low-frequency content in the Fourier domain, and (2) yield minimal distance between corresponding pixels in consecutive frames of the stabilized video.
The method uses frame-to-frame optical flow as its initial input and requires several pre-processing steps: (1) identifying regions containing moving objects in the optical flow fields, using segmentation masks for typical foreground objects obtained from [39]; (2) identifying regions of inaccurate optical flow; and (3) inpainting all such regions using the PCA-based approach from [32].
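As a rough illustration of property (1), the sketch below low-pass filters a warp field by masking its 2D Fourier spectrum; note that [36] obtains this property through training rather than explicit masking, and the shapes and cutoff here are assumptions.

```python
import torch

def lowpass_warp(warp, keep=8):
    # warp: (H, W, 2) warp field; keep only the lowest `keep` spatial
    # frequencies along each axis (the four corners of the unshifted FFT,
    # which preserves conjugate symmetry for a real-valued field).
    spec = torch.fft.fft2(warp, dim=(0, 1))
    mask = torch.zeros_like(spec)
    mask[:keep, :keep] = 1
    mask[-keep:, :keep] = 1
    mask[:keep, -keep:] = 1
    mask[-keep:, -keep:] = 1
    return torch.fft.ifft2(spec * mask, dim=(0, 1)).real
```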
The work in [2] trains two DNNs that perform video stabilization via frame interpolation, smoothing the motion between consecutive frames. The
(i−1)th and (i+1)th frames are linearly warped mid-way to-
ward each other using the bidirectional optical flow between
them. The resulting warped frames are passed through a U-
Net [22] to generate the ith intermediate ‘stabilized’ frame.
This interpolation process is carried out iteratively, which may accumulate blur. To prevent this, the intermediate stabilized frames are also passed through a ResNet [8]. The
motion smoothing in this approach is always linear, without any adaptation of the smoothing parameters to the motion at different time instants or at different depths.
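For illustration, here is a minimal sketch of this fixed half-way warping, approximated by backward-warping each frame with half of the corresponding flow; the shapes and normalization conventions are assumptions, and the U-Net and ResNet stages of [2] are omitted.

```python
import torch
import torch.nn.functional as F

def half_warp(frame_a, frame_b, flow_ab, flow_ba):
    # frame_a, frame_b: (1, 3, H, W) frames (i-1) and (i+1); flow_ab and
    # flow_ba: (1, 2, H, W) bidirectional optical flow in pixel units.
    _, _, H, W = frame_a.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (1, H, W, 2)

    def to_norm(flow):  # pixel displacements -> normalized grid offsets
        return torch.stack(
            (flow[:, 0] * 2 / (W - 1), flow[:, 1] * 2 / (H - 1)), dim=-1)

    # Warp each frame half-way along its flow toward the other; a U-Net
    # would then fuse the two half-warped frames into the i-th frame.
    half_a = F.grid_sample(frame_a, base + 0.5 * to_norm(flow_ab),
                           align_corners=True)
    half_b = F.grid_sample(frame_b, base + 0.5 * to_norm(flow_ba),
                           align_corners=True)
    return half_a, half_b
```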
Similar in spirit to [2], the works in [18, 34] perform full-frame video stabilization by additionally inpainting the frame borders.
However, the approach in [18] is computationally expen-
sive. The approach in [30] trains a Siamese network to gen-
erate a warp grid for video stabilization using stable and
unstable video pairs from the DeepStab dataset [33]. Their approach is based purely on color, without using any motion parameters, and does not perform very well. The approach in [33] uses spatial transformer networks along with adversarial networks for video stabilization, but suffers from inadequate training data. The work in [37], called PWStableNet, uses a supervised training approach based on a cascade of encoder-decoder units to optimize a combination of criteria, such as fidelity w.r.t. the underlying stable video and various motion- and feature-based characteristics. This approach is limited by training data scarcity and poor generalizability.
Overview of Proposed Approach: A major contribution
of our work is a novel method of estimating the global
motion between frames of an unstable video, proposed in
Sec. 2. Our method trains a network, GLOBALFLOWNET, in a teacher-student fashion so that it imposes a smooth and compact representation of the global motion and is not influenced by the motion in regions containing moving objects. It yields global motion representations that are more general than 2D affine or homography transformations and does not require tracking of salient feature points.
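As a hedged sketch of the distillation idea, the student's compact global flow can be fit to a teacher optical-flow network only outside detected moving-object regions; the names, shapes, and masking scheme below are illustrative assumptions, with the precise formulation given in Sec. 2.

```python
import torch

def masked_distillation_loss(student_flow, teacher_flow, moving_mask):
    # student_flow, teacher_flow: (1, 2, H, W) flow fields; moving_mask:
    # (1, 1, H, W), equal to 1 in regions containing moving objects.
    static = 1.0 - moving_mask
    # Penalize disagreement with the teacher only over static (background)
    # pixels, so moving objects do not influence the global motion estimate.
    diff = (student_flow - teacher_flow).abs() * static
    return diff.sum() / (2 * static.sum() + 1e-8)
```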
Given a pretrained GLOBALFLOWNET, we achieve video stabilization
using a two-stage process comprising a novel global affine parameter smoothing step (Sec. 3.1) and a novel residual-level smoothing step (Sec. 3.2) that operates on low-frequency discrete cosine transform (DCT) coefficients of the residual flow. Both steps smooth the parameters in the temporal direction. The first step provides a very useful initial condition, whereas the second step is necessary to significantly improve stabilization performance, because it works with a global motion model that, despite being very compact, is much more general than 2D affine or homography transformations.
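For intuition, the sketch below temporally smooths both sets of parameters with a Gaussian filter as a stand-in smoother; the actual steps in Secs. 3.1 and 3.2 solve dedicated formulations, and the shapes and sigmas here are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_motion_parameters(affine_params, dct_coeffs,
                             sigma_affine=10.0, sigma_dct=5.0):
    # affine_params: (T, 6) per-frame global affine parameters;
    # dct_coeffs: (T, K) low-frequency DCT coefficients of the residual flow.
    # Both are filtered along the temporal axis only.
    smooth_affine = gaussian_filter1d(affine_params, sigma_affine, axis=0)
    smooth_dct = gaussian_filter1d(dct_coeffs, sigma_dct, axis=0)
    return smooth_affine, smooth_dct
```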
Our overall approach to video stabilization is simple, computationally efficient, and interpretable. In extensive experiments (see Sec. 4), it outperforms state-of-the-art techniques in terms of stability measures. We also propose a new video stabilization measure that uses the low-frequency representation from Sec. 2.2 to quantify the temporal smoothness of the global motion between successive pairs of frames. Our measure uses a more general motion model than existing measures, which largely rely on affine transformations.
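As an illustration of the kind of score such a measure can produce, here is a hypothetical smoothness score over per-frame low-frequency motion coefficients: the fraction of each coefficient trajectory's energy that lies in the lowest temporal frequencies (higher is smoother). The exact definition of our measure is given in Sec. 4.

```python
import numpy as np

def smoothness_score(coeffs):
    # coeffs: (T, K) low-frequency motion coefficients for T frame pairs.
    spec = np.abs(np.fft.rfft(coeffs - coeffs.mean(0), axis=0)) ** 2
    low = spec[:max(2, spec.shape[0] // 10)].sum(0)  # lowest ~10% of freqs
    total = spec.sum(0) + 1e-12
    return float((low / total).mean())  # in [0, 1]; higher = smoother
```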
2. Global Motion Estimation
A key step in video stabilization is the estimation of the global motion between consecutive (or temporally nearby) video frames, followed by temporal smoothing of the motion parameters. The difference between the original global motion and the global motion in a stabilized
original global motion and the global motion in a stabilized
video constitutes the warp field, which when applied to the
unstable frames, stabilizes the video. An ideal global mo-
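A minimal sketch of this relationship, assuming dense flow fields in pixel units and backward warping via bilinear sampling (the conventions are our assumptions):

```python
import torch
import torch.nn.functional as F

def stabilize_frame(frame, flow_orig, flow_smooth):
    # frame: (1, 3, H, W) unstable frame; flow_orig, flow_smooth:
    # (1, 2, H, W) original and temporally smoothed global flow, in pixels.
    _, _, H, W = frame.shape
    warp = flow_smooth - flow_orig  # the warp field described above
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (1, H, W, 2)
    offs = torch.stack(
        (warp[:, 0] * 2 / (W - 1), warp[:, 1] * 2 / (H - 1)), dim=-1)
    return F.grid_sample(frame, base + offs, align_corners=True)
```

An ideal global mo-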