GlobalFlowNet Video Stabilization using Deep Distilled Global Motion Estimates Jerin Geo James Devansh Jain Ajit Rajwade

2025-05-06 0 0 1.02MB 10 页 10玖币

侵权投诉

GlobalFlowNet: Video Stabilization using Deep Distilled Global Motion

Estimates

Jerin Geo James Devansh Jain Ajit Rajwade

Indian Institute of Technology Bombay

{jeringeo,devanshdvj,ajitvr}@cse.iitb.ac.in

Abstract

Videos shot by laymen using hand-held cameras contain

undesirable shaky motion. Estimating the global motion

between successive frames, in a manner not inﬂuenced by

moving objects, is central to many video stabilization tech-

niques, but poses signiﬁcant challenges. A large body of

work uses 2D afﬁne transformations or homography for the

global motion. However, in this work, we introduce a more

general representation scheme, which adapts any existing

optical ﬂow network to ignore the moving objects and ob-

tain a spatially smooth approximation of the global motion

between video frames. We achieve this by a knowledge dis-

tillation approach, where we ﬁrst introduce a low pass ﬁlter

module into the optical ﬂow network to constrain the pre-

dicted optical ﬂow to be spatially smooth. This becomes our

student network, named as GLOBALFLOWNET. Then, us-

ing the original optical ﬂow network as the teacher network,

we train the student network using a robust loss function.

Given a trained GLOBALFLOWNET, we stabilize videos

using a two stage process. In the ﬁrst stage, we correct

the instability in afﬁne parameters using a quadratic pro-

gramming approach constrained by a user-speciﬁed crop-

ping limit to control loss of ﬁeld of view. In the second

stage, we stabilize the video further by smoothing global

motion parameters, expressed using a small number of dis-

crete cosine transform coefﬁcients. In extensive experiments

on a variety of different videos, our technique outperforms

state of the art techniques in terms of subjective quality and

different quantitative measures of video stability. Addition-

ally, we present a new measure for evaluation of video stabi-

lization based on the ﬂow generated by GLOBALFLOWNET

and argue that it is based on a more general motion model

in contrast to the afﬁne motion model on which most exist-

ing measures are based. The source code is publicly avail-

able at https://github.com/GlobalFlowNet/GlobalFlowNet

1. Introduction

Videos acquired by amateur photographers or lay users

from hand-held cameras or mobile phones are subject to

a large magnitude of undesirable and discontinuous mo-

tion. The process of eliminating or reducing this unde-

sirable motion is called video stabilization. In some se-

tups, the camera can be mounted on stable stands or dol-

lies, but this is infeasible in many commonplace scenarios.

Some cameras are equipped with hardware such as gyro-

scopes for stabilization, but the state of the art in video

stabilization still adopts software-based approaches due to

the gyroscope’s cost, weight and error-pone motion es-

timation [20,23]. Apart from casual hand-held photog-

raphy, the need for video stabilization also arises in en-

doscopy [10], underwater imaging [21] and aerial photogra-

phy from drones/helicopters [11]. Many video stabilization

techniques consist of three broad steps: (1) estimation of

the motion between consecutive or temporally neighboring

frames assuming a suitable motion model, (2) temporal mo-

tion smoothing assuming an appropriate motion model for

the underlying stable video, and (3) re-targeting or warping

of the frames of the unstable video so as to generate a stabi-

lized video. There exists a large body of literature on video

stabilization, with differences in the way these three steps

are executed. Several of these techniques are summarized

below.

Related work (Classical Approaches): Many traditional

techniques assume that the motion between consecutive

frames can be approximated using 2D afﬁne transforma-

tions or homographies [6,19], and seek to smooth a se-

quence of such parameters to render a stabilized video. For

computing the parameterized motion, many of these tech-

niques make use of robust point tracking methods [6,15,19].

However, 2D motion models cannot accurately account for

the motion between consecutive video frames for scenes

with signiﬁcant depth variation or signiﬁcant camera mo-

tion. Some methods such as [15] approximate the motion

between consecutive frames by means of patch-wise 2D

models or homographies. The method in [17] performs

arXiv:2210.13769v3 [cs.CV] 4 Nov 2022

three tasks in an iterated fashion: determining the smooth

global background motion between consecutive frames by

detecting moving objects, inpainting the ﬂow in those re-

gions, and smoothing per-pixel optical ﬂow vectors across

time. There also exist methods which make use of epipolar

geometry [4] or various geometrically motivated subspace

constraints [13]. The latter technique requires fairly long

feature tracks which may not be available in many real-

world videos. Finally, many techniques which use 3D in-

formation have also been proposed, for example methods

that use structure from motion [12], a depth camera [26] or

a light ﬁeld camera [24].

Related work (Deep Learning Approaches): Deep neu-

ral network (DNN) based approaches for video stabiliza-

tion have become very popular in recent years. The work

in [35] represents the warp ﬁelds using the weights of an

unsupervised DNN, which minimizes the sum of two terms:

a regularizer that encourages the warp ﬁelds to be piece-

wise linear, and a ﬁdelity term which minimizes the dis-

tance between corresponding pixels in consecutive frames

in the stabilized video. This approach, though elegant, must

evolve the network weights afresh for every video, and has

very high computational cost. The work in [36] trains a

DNN ofﬂine on a large video dataset with synthetic unsta-

ble motion. In an unsupervised fashion, the weights of the

DNN are evolved so as to generate warp ﬁelds that (1) have

dominant low-frequency content in the Fourier domain, and

(2) yield minimal distance between corresponding pixels in

consecutive frames of the stabilized video. The work uses

frame-to-frame optical ﬂow as initial input and requires a

number of pre-processing steps to: (1) identify regions with

moving objects from the optical ﬂow ﬁelds using a vari-

ety of segmentation masks for typical foreground objects

obtained from [39], (2) identify regions of inaccurate op-

tical ﬂow, and (3) inpaint all such regions using the PCA-

based approach from [32]. The work in [2] trains two DNNs

for performing video stabilization via frame interpolation

to smooth the motion between consecutive frames. The

(i−1)th and (i+1)th frames are linearly warped mid-way to-

ward each other using the bidirectional optical ﬂow between

them. The resulting warped frames are passed through a U-

Net [22] to generate the ith intermediate ‘stabilized’ frame.

This interpolation process is carried out iteratively which

may accumulate blur. To prevent this, the intermediate sta-

bilized frames are also passed through a Resnet [8]. The

motion smoothing in this approach is always linear with-

out any adaptation of the smoothing parameters to the mo-

tion at different time instants or at different depths. Similar

in spirit to [2], work in [18,34] perform full-frame video

stabilization by bringing in border-based frame inpainting.

However, the approach in [18] is computationally expen-

sive. The approach in [30] trains a Siamese network to gen-

erate a warp grid for video stabilization using stable and

unstable video pairs from the Deepstab dataset [33]. Their

approach is based purely on color without using any motion

parameters and does not perform very well. The approach

in [33] uses spatial transformer networks along with ad-

versarial networks for video stabilization, but suffers from

problems due to inadequate training data. The work in [37],

which is called PWStableNet, uses a supervised training ap-

proach based on a cascade of encoder-decoder units to opti-

mize a combination of criteria such as ﬁdelity w.r.t. the un-

derlying stable video, and various motion and feature-based

characteristics. This approach has limitations in terms of

training data scarcity and generalizability.

Overview of Proposed Approach: A major contribution

of our work is a novel method of estimating the global

motion between frames of an unstable video, proposed in

Sec. 2. Our method involves training a network GLOB-

ALFLOWNET in a teacher-student fashion in such a way

that it imposes a smooth and compact representation for the

global motion and is designed to not be inﬂuenced by the

motion in regions containing moving objects. Our method

yields global motion representations that are more general

than 2D afﬁne or homography transformation and does not

require any salient feature point tracking. Given a pre-

trained GLOBALFLOWNET, we achieve video stabilization

using a two-stage process comprising a novel global afﬁne

parameter smoothing step (Sec. 3.1) and a novel residual

level smoothing step (Sec. 3.2) involving low frequency dis-

crete cosine transform (DCT) coefﬁcients of the residual

ﬂow. Both these steps smooth the parameters in the tem-

poral direction. The ﬁrst step acts as a very useful initial

condition, whereas the second step is necessary to signiﬁ-

cantly improve stabilization performance. This is because

it works with a global motion model that despite being very

compact, is much more general than just 2D afﬁne or ho-

mography. Our overall approach for video stabilization is

simple, computationally efﬁcient and interpretable. In ex-

tensive experiments (see Sec. 4), it outperforms state of the

art techniques in terms of stability measures. We also pro-

pose a new video stabilization measure which uses the low

frequency representation from Sec. 2.2 to quantify the tem-

poral smoothness of the global motion between successive

pairs of frames. Our measure uses a more general motion

model than existing measures which largely use afﬁne trans-

formations.

2. Global Motion Estimation

A key step in video stabilization is the estimation of

global motion between consecutive video frames (or tempo-

rally nearby video frames), followed by temporal smooth-

ing of the motion parameters. The difference between the

original global motion and the global motion in a stabilized

video constitutes the warp ﬁeld, which when applied to the

unstable frames, stabilizes the video. An ideal global mo-

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

GlobalFlowNet:VideoStabilizationusingDeepDistilledGlobalMotionEstimatesJerinGeoJamesDevanshJainAjitRajwadeIndianInstituteofTechnologyBombay{jeringeo,devanshdvj,ajitvr}@cse.iitb.ac.inAbstractVideosshotbylaymenusinghand-heldcamerascontainundesirableshakymotion.Estimatingtheglobalmotionbetweensuccessiv...

展开>> 收起<<

GlobalFlowNet Video Stabilization using Deep Distilled Global Motion Estimates Jerin Geo James Devansh Jain Ajit Rajwade.pdf

共10页,预览2页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

GlobalFlowNet Video Stabilization using Deep Distilled Global Motion Estimates Jerin Geo James Devansh Jain Ajit Rajwade

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: