GlobalFlowNet: Video Stabilization using Deep Distilled Global Motion
Estimates
Jerin Geo James Devansh Jain Ajit Rajwade
Indian Institute of Technology Bombay
{jeringeo,devanshdvj,ajitvr}@cse.iitb.ac.in
Abstract
Videos shot by laymen using hand-held cameras contain undesirable shaky motion. Estimating the global motion between successive frames, in a manner not influenced by moving objects, is central to many video stabilization techniques, but poses significant challenges. A large body of work uses 2D affine transformations or homography for the global motion. However, in this work, we introduce a more general representation scheme, which adapts any existing optical flow network to ignore the moving objects and obtain a spatially smooth approximation of the global motion between video frames. We achieve this by a knowledge distillation approach, where we first introduce a low-pass filter module into the optical flow network to constrain the predicted optical flow to be spatially smooth. This becomes our student network, named GLOBALFLOWNET. Then, using the original optical flow network as the teacher network, we train the student network using a robust loss function. Given a trained GLOBALFLOWNET, we stabilize videos using a two-stage process. In the first stage, we correct the instability in affine parameters using a quadratic programming approach constrained by a user-specified cropping limit to control loss of field of view. In the second stage, we stabilize the video further by smoothing global motion parameters, expressed using a small number of discrete cosine transform (DCT) coefficients. In extensive experiments on a variety of different videos, our technique outperforms state-of-the-art techniques in terms of subjective quality and different quantitative measures of video stability. Additionally, we present a new measure for the evaluation of video stabilization based on the flow generated by GLOBALFLOWNET, and argue that it rests on a more general motion model than the affine motion model on which most existing measures are based. The source code is publicly available at https://github.com/GlobalFlowNet/GlobalFlowNet
1. Introduction
Videos acquired by amateur photographers or lay users with hand-held cameras or mobile phones are subject to a large magnitude of undesirable and discontinuous motion. The process of eliminating or reducing this undesirable motion is called video stabilization. In some setups, the camera can be mounted on stable stands or dollies, but this is infeasible in many commonplace scenarios. Some cameras are equipped with hardware such as gyroscopes for stabilization, but the state of the art in video stabilization still adopts software-based approaches due to the gyroscope's cost, weight and error-prone motion estimation [20,23]. Apart from casual hand-held photography, the need for video stabilization also arises in endoscopy [10], underwater imaging [21] and aerial photography from drones and helicopters [11]. Many video stabilization techniques consist of three broad steps: (1) estimation of the motion between consecutive or temporally neighboring frames assuming a suitable motion model, (2) temporal motion smoothing assuming an appropriate motion model for the underlying stable video, and (3) re-targeting or warping of the frames of the unstable video so as to generate a stabilized video. There exists a large body of literature on video stabilization, with differences in the way these three steps are executed. Several of these techniques are summarized below.
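To make the three steps concrete before surveying specific techniques, the following is a minimal toy sketch of this generic pipeline, assuming a pure-translation motion model estimated by phase correlation and Gaussian temporal smoothing. Both are deliberate simplifications, and the function names and parameter choices are ours; the method proposed in this paper replaces them with a far more general learned motion model and constrained smoothing.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, shift

def translation(a, b):
    """Estimate the 2D translation from frame a to frame b via phase correlation."""
    F = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    r = np.fft.ifft2(F / (np.abs(F) + 1e-8)).real
    dy, dx = np.unravel_index(np.argmax(r), r.shape)
    # Displacements beyond half the frame size wrap around to negative shifts.
    dy -= a.shape[0] if dy > a.shape[0] // 2 else 0
    dx -= a.shape[1] if dx > a.shape[1] // 2 else 0
    return np.array([dy, dx], dtype=float)

def stabilize(frames, sigma=5.0):
    """frames: list of 2D grayscale arrays. Returns re-targeted (warped) frames."""
    # Step 1: motion estimation between consecutive frames.
    steps = [translation(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    # Accumulate the per-frame motion into a camera trajectory.
    path = np.vstack([np.zeros(2), np.cumsum(steps, axis=0)])
    # Step 2: temporal smoothing of the motion parameters.
    smooth_path = gaussian_filter1d(path, sigma, axis=0)
    # Step 3: warp each frame by the correction (smoothed minus original path).
    return [shift(f, smooth_path[i] - path[i], mode='nearest')
            for i, f in enumerate(frames)]
```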
Related work (Classical Approaches): Many traditional techniques assume that the motion between consecutive frames can be approximated using 2D affine transformations or homographies [6,19], and seek to smooth a sequence of such parameters to render a stabilized video. For computing the parameterized motion, many of these techniques make use of robust point tracking methods [6,15,19]. However, 2D motion models cannot accurately account for the motion between consecutive video frames for scenes with significant depth variation or significant camera motion. Some methods such as [15] approximate the motion between consecutive frames by means of patch-wise 2D models or homographies.
The method in [17] performs three tasks in an iterated fashion: determining the smooth global background motion between consecutive frames by detecting moving objects, inpainting the flow in those regions, and smoothing per-pixel optical flow vectors across time. There also exist methods which make use of epipolar geometry [4] or various geometrically motivated subspace constraints [13]. The latter technique requires fairly long feature tracks which may not be available in many real-world videos. Finally, many techniques which use 3D information have also been proposed, for example methods that use structure from motion [12], a depth camera [26] or a light field camera [24].
Related work (Deep Learning Approaches): Deep neural network (DNN) based approaches for video stabilization have become very popular in recent years. The work in [35] represents the warp fields using the weights of an unsupervised DNN, which minimizes the sum of two terms: a regularizer that encourages the warp fields to be piecewise linear, and a fidelity term which minimizes the distance between corresponding pixels in consecutive frames of the stabilized video. This approach, though elegant, must evolve the network weights afresh for every video, and has very high computational cost. The work in [36] trains a DNN offline on a large video dataset with synthetic unstable motion. In an unsupervised fashion, the weights of the DNN are evolved so as to generate warp fields that (1) have dominant low-frequency content in the Fourier domain, and (2) yield minimal distance between corresponding pixels in consecutive frames of the stabilized video. The work uses frame-to-frame optical flow as initial input and requires a number of pre-processing steps to: (1) identify regions with moving objects from the optical flow fields using a variety of segmentation masks for typical foreground objects obtained from [39], (2) identify regions of inaccurate optical flow, and (3) inpaint all such regions using the PCA-based approach from [32]. The work in [2] trains two DNNs for performing video stabilization via frame interpolation to smooth the motion between consecutive frames. The (i-1)th and (i+1)th frames are linearly warped mid-way toward each other using the bidirectional optical flow between them. The resulting warped frames are passed through a U-Net [22] to generate the ith intermediate ‘stabilized’ frame. This interpolation process is carried out iteratively, which may accumulate blur. To prevent this, the intermediate stabilized frames are also passed through a ResNet [8]. The motion smoothing in this approach is always linear, without any adaptation of the smoothing parameters to the motion at different time instants or at different depths. Similar in spirit to [2], the works in [18,34] perform full-frame video stabilization by bringing in border-based frame inpainting. However, the approach in [18] is computationally expensive. The approach in [30] trains a Siamese network to generate a warp grid for video stabilization using stable and unstable video pairs from the DeepStab dataset [33]. Their approach is based purely on color, without using any motion parameters, and does not perform very well. The approach in [33] uses spatial transformer networks along with adversarial networks for video stabilization, but suffers from problems due to inadequate training data. The work in [37], called PWStableNet, uses a supervised training approach based on a cascade of encoder-decoder units to optimize a combination of criteria such as fidelity w.r.t. the underlying stable video, and various motion- and feature-based characteristics. This approach has limitations in terms of training data scarcity and generalizability.
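For intuition, here is a schematic sketch of the mid-way warping idea described above for [2]. The flow-halving approximation and the plain averaging are our simplifications for exposition (the actual method fuses the warped frames with a learned U-Net and handles occlusions), so this is illustrative only.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def backward_warp(img, flow):
    """Resample img at grid + flow, where flow[..., 0] is the row (dy)
    and flow[..., 1] the column (dx) displacement."""
    gy, gx = np.mgrid[0:img.shape[0], 0:img.shape[1]].astype(float)
    return map_coordinates(img, [gy + flow[..., 0], gx + flow[..., 1]],
                           order=1, mode='nearest')

def midway_frame(prev, nxt, flow_mid_to_prev, flow_mid_to_next):
    """Schematic mid-way interpolation: pull both neighbours toward the
    (virtual) middle frame and blend. The flows to the middle frame can be
    approximated by halving the bidirectional flows between prev and nxt;
    in [2] the blend is a learned U-Net rather than a plain mean."""
    a = backward_warp(prev, flow_mid_to_prev)
    b = backward_warp(nxt, flow_mid_to_next)
    return 0.5 * (a + b)
```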
Overview of Proposed Approach: A major contribution of our work is a novel method of estimating the global motion between frames of an unstable video, proposed in Sec. 2. Our method involves training a network GLOBALFLOWNET in a teacher-student fashion, in such a way that it imposes a smooth and compact representation for the global motion and is designed not to be influenced by the motion in regions containing moving objects. Our method yields global motion representations that are more general than 2D affine or homography transformations, and it does not require any salient feature point tracking.
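A rough sketch of what such a distillation objective could look like is given below; the DCT-based low-pass module, the Charbonnier penalty standing in for the robust loss, and all names are our own illustrative assumptions rather than the paper's exact design.

```python
import torch

def dct_matrix(n, device=None):
    """Orthonormal DCT-II basis matrix of size n x n."""
    i = torch.arange(n, device=device, dtype=torch.float32)
    D = torch.cos(torch.pi * (2.0 * i[None, :] + 1.0) * i[:, None] / (2.0 * n))
    D[0] /= 2.0 ** 0.5
    return D * (2.0 / n) ** 0.5

def lowpass(flow, k=8):
    """Hypothetical low-pass module: reconstruct each flow channel of a
    (B, C, H, W) tensor from its k x k lowest-frequency 2D DCT coefficients,
    forcing the predicted flow to be spatially smooth by construction."""
    B, C, H, W = flow.shape
    Dh, Dw = dct_matrix(H, flow.device), dct_matrix(W, flow.device)
    coeff = Dh @ flow @ Dw.T                  # forward 2D DCT
    mask = torch.zeros(H, W, device=flow.device)
    mask[:k, :k] = 1.0                        # keep only low frequencies
    return Dh.T @ (coeff * mask) @ Dw         # inverse 2D DCT

def distillation_loss(student, teacher, frames, eps=1e-3):
    """Robust (Charbonnier) distance between the frozen teacher's flow and
    the student's spatially smooth, low-pass-constrained flow."""
    with torch.no_grad():
        target = teacher(frames)              # unconstrained optical flow
    pred = lowpass(student(frames))           # smooth global-motion estimate
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```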
Given a pre-trained GLOBALFLOWNET, we achieve video stabilization using a two-stage process comprising a novel global affine parameter smoothing step (Sec. 3.1) and a novel residual-level smoothing step (Sec. 3.2) involving low-frequency discrete cosine transform (DCT) coefficients of the residual flow. Both these steps smooth the parameters in the temporal direction. The first step acts as a very useful initial condition, whereas the second step is necessary to significantly improve stabilization performance, because it works with a global motion model that, despite being very compact, is much more general than 2D affine or homography models. Our overall approach for video stabilization is simple, computationally efficient and interpretable. In extensive experiments (see Sec. 4), it outperforms state-of-the-art techniques in terms of stability measures. We also propose a new video stabilization measure which uses the low-frequency representation from Sec. 2.2 to quantify the temporal smoothness of the global motion between successive pairs of frames. Our measure uses a more general motion model than existing measures, which largely rely on affine transformations.
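As an illustration of the second stage, the sketch below keeps only the low-frequency DCT coefficients of each residual flow field and smooths each retained coefficient along time. The Gaussian filter and the choices of k and sigma are stand-in assumptions for exposition, not the constrained smoothing actually used in Sec. 3.2.

```python
import numpy as np
from scipy.fft import dctn, idctn
from scipy.ndimage import gaussian_filter1d

def smooth_residual_flow(flows, k=8, sigma=3.0):
    """flows: array of shape (T, H, W, 2) with per-frame residual flow fields.
    Returns flows reconstructed from temporally smoothed low-frequency DCT
    coefficients, so the residual motion varies slowly over time."""
    T, H, W, _ = flows.shape
    # Represent each flow component by its k x k lowest-frequency DCT coefficients.
    coeffs = np.stack([[dctn(flows[t, :, :, c], norm='ortho')[:k, :k]
                        for c in range(2)] for t in range(T)])  # (T, 2, k, k)
    # Smooth every retained coefficient along the temporal axis.
    coeffs = gaussian_filter1d(coeffs, sigma, axis=0)
    # Zero-pad back to full resolution and invert the DCT.
    out = np.zeros_like(flows)
    for t in range(T):
        for c in range(2):
            full = np.zeros((H, W))
            full[:k, :k] = coeffs[t, c]
            out[t, :, :, c] = idctn(full, norm='ortho')
    return out
```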
2. Global Motion Estimation
A key step in video stabilization is the estimation of global motion between consecutive video frames (or temporally nearby video frames), followed by temporal smoothing of the motion parameters. The difference between the original global motion and the global motion in a stabilized video constitutes the warp field, which, when applied to the unstable frames, stabilizes the video. An ideal global mo-