BVI-VFI: A Video Quality Database for Video
Frame Interpolation
Duolikun Danier, Student Member, IEEE, Fan Zhang, Member, IEEE, and David R. Bull, Fellow, IEEE
Abstract—Video frame interpolation (VFI) is a fundamental
research topic in video processing, which is currently attracting
increased attention across the research community. While the
development of more advanced VFI algorithms has been ex-
tensively researched, there remains little understanding of how
humans perceive the quality of interpolated content and how
well existing objective quality assessment methods perform when
measuring the perceived quality. In order to narrow this research
gap, we have developed a new video quality database named
BVI-VFI, which contains 540 distorted sequences generated by
applying five commonly used VFI algorithms to 36 diverse
source videos with various spatial resolutions and frame rates.
We collected more than 10,800 quality ratings for these videos
through a large scale subjective study involving 189 human
subjects. Based on the collected subjective scores, we further
analysed the influence of VFI algorithms and frame rates on
the perceptual quality of interpolated videos. Moreover, we
benchmarked the performance of 33 classic and state-of-the-
art objective image/video quality metrics on the new database,
and demonstrated the urgent requirement for more accurate
bespoke quality assessment methods for VFI. To facilitate further
research in this area, we have made BVI-VFI publicly available
at https://github.com/danier97/BVI-VFI-database.
Index Terms—Video quality database, subjective quality assess-
ment, video frame interpolation, perceptual quality, BVI-VFI.
I. INTRODUCTION
Video frame interpolation (VFI) is an important video
processing technique which is used to synthesise intermediate
frames between every two consecutive frames in a video
sequence. Conventionally known as motion interpolation, VFI
has been employed to increase the frame rates of content
captured at low frame rates, and to compensate for motion
blur in LCD displays [1]. Although the “soap opera effect”
(the unnaturalness in perception caused by the extra non-
real frames being displayed) can arise as a byproduct of
such a process, various previous studies [2–8] have confirmed
that high-frame-rate (HFR) formats provide improved view-
ing experience in terms of perceived motion smoothness,
perceived realism, and immersiveness. Since VFI enables
HFR formats to be generated from low-frame-rate videos,
it offers significant promise for improving perceptual video
quality. VFI also offers broader utility and has attracted increasing interest across many applications beyond HFR format generation in recent years; these include slow-motion generation [9], video compression [10], medical imaging [11] and animation production [12].
This work involved collecting data from human participants. The relevant experiments were approved by the Faculty of Engineering Research Ethics Committee of the University of Bristol (Ref 10739; PI: David R. Bull; Title: Subjective Quality Study on Video Frame Interpolation).
The authors are with the Bristol Vision Institute, University of Bristol, Bristol BS8 1TH, U.K. (e-mail: duolikun.danier@bristol.ac.uk; fan.zhang@bristol.ac.uk; dave.bull@bristol.ac.uk).
The authors acknowledge funding from the China Scholarship Council, the University of Bristol and the UKRI MyWorld Strength in Places Programme.
Other than frame repetition and averaging, early VFI meth-
ods employed in televisions and other display devices were
mainly based on motion estimation and compensation, where
motion vectors between frames were used to interpolate the
intermediate pixels [1]. Recently, driven by the development
of various deep learning techniques and more powerful com-
putational hardware, there has been a surge in the reporting of
new video frame interpolation methods, which are generally
classified into two groups: flow- and kernel-based. While flow-
based methods rely on optical flow to warp reference frames,
kernel-based approaches estimate local interpolation kernels to
synthesise output pixels. Although these VFI approaches have
delivered significant improvements in terms of interpolation
performance [13–18], challenging scenarios still exist that
cause interpolation failure; these often relate to content con-
taining large motions, dynamic textures, and occlusions [14].
While there is ongoing activity to develop new VFI methods
that tackle these challenges, the perceptual quality assessment
of frame interpolated content remains underinvestigated. Cur-
rently, the most widely adopted approach for assessing VFI
performance is to calculate the distortion between the interpo-
lated frames and their original ground-truth counterparts using
image quality assessment (IQA) models including PSNR,
SSIM [19], and LPIPS [20]. More recently, new perceptually
oriented image and video quality metrics have been developed
for other applications such as video compression, with no-
table examples including VSI [21], DISTS [22], VMAF [23],
FAST [24] and C3DVQA [25]. However, none of these models
has been fully evaluated on frame interpolated videos against
subjective ground truth. Due to this concern, in order to
accurately assess VFI performance, some researchers [14, 26]
have resorted to benchmarking based on subjective opinion
scores through psychophysical experiments; these however
are very time consuming and resource-heavy. In this context,
there is an urgent need to develop a video quality database
containing diverse VFI content alongside reliable subjective
score metadata, which can be employed to investigate the
competence of existing quality metrics for the VFI task.
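For illustration, the prevailing frame-level evaluation protocol described above can be sketched as follows. The function names are ours, and the snippet assumes frames stored as 8-bit numpy arrays; PSNR only is shown here, as SSIM and LPIPS require their respective reference implementations.

```python
import numpy as np

def psnr(ref: np.ndarray, dis: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a ground-truth and an interpolated frame."""
    mse = np.mean((ref.astype(np.float64) - dis.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def evaluate_vfi(gt_frames, interp_frames) -> float:
    """Average per-frame PSNR over all interpolated frames of a sequence."""
    scores = [psnr(g, d) for g, d in zip(gt_frames, interp_frames)]
    return float(np.mean(scores))

# Toy example: a constant offset of 1 on an 8-bit frame gives PSNR of about 48.13 dB.
gt = [np.full((4, 4), 128, dtype=np.uint8)]
pred = [np.full((4, 4), 129, dtype=np.uint8)]
print(round(evaluate_vfi(gt, pred), 2))
```

Averaging a per-frame metric in this way captures spatial fidelity but, as the text notes, says nothing about temporal artefacts such as judder.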
Although there have been few reports of research in this area [27, 28], the associated databases either contain human
opinion scores only on single interpolated images, or only
focus on slow-motion videos at a fixed frame rate. Also, the
video sequences in [28] suffer from compression artefacts in addition to VFI-related distortions, making it difficult to decouple these during assessment.
arXiv:2210.00823v3 [eess.IV] 21 Oct 2023
We have previously addressed
these issues in [29], where a small video quality database for
VFI was developed based on a limited subjective study and a
benchmark experiment only involving a few objective quality
metrics. To overcome these limits and make further progress
in understanding the perceptual quality of frame interpolated
videos, in this paper we extend our previous work [29] and
present a new video quality database, BVI-VFI, which contains
540 interpolated videos generated by various VFI algorithms,
covering different frame rates, spatial resolutions and diverse
content types. The database also includes subjective quality
scores for all videos collected through a large-scale subjective
experiment. Based on the subjective data, we performed a
much more comprehensive evaluation of existing objective
quality metrics, involving 33 conventional and learning-based
image and video quality models. This work differs from our
previous work [29] in the following aspects.
• Compared to the original 36 reference and 180 distorted sequences, the new BVI-VFI database contains 108 reference and 540 distorted sequences.
• While the study in [29] concerns only HD videos, in this work we cover three resolutions: 960×540, 1920×1080, and 3840×2160.
• The study on the subjective data in [29] was limited to the effect of frame rate. In this work we additionally analyse the impact of various video features on the perceived quality of frame interpolated videos.
• While in [29] only eight full-reference quality metrics were benchmarked, in this work we evaluate 33 image/video quality metrics, covering both full- and no-reference categories. Additionally, we perform cross-validation experiments to better reflect the performance of learning-based metrics.
The primary contributions are summarised below.
• We developed the first bespoke video quality database for frame interpolation, BVI-VFI, which covers multiple frame rates (30-120fps) and spatial resolutions (540p-2160p). It contains 540 distorted sequences generated by five different video frame interpolation methods from 36 source videos, which uniformly cover a wide range of video features.
• We conducted a large-scale laboratory-based psychophysical experiment to collect subjective quality ground truths for all the videos in BVI-VFI.
• We performed a quantitative comparison of 33 classic and state-of-the-art image/video quality assessment methods on the BVI-VFI database. Cross-validation experiments were also performed for learning-based metrics.
• The proposed new database serves as an important platform for developing and validating new quality metrics for video frame interpolation. It can also be used as a test dataset for benchmarking VFI algorithms due to its content diversity.
The rest of the paper is organised as follows. We first
briefly review the relevant literature in Section II, and then
describe the process of source video collection and test se-
quence generation in Section III. The subjective experiment,
the data processing procedures and analysis of the collected
subjective opinions are presented in Section IV. Section V
summarises the comparative study results for 33 quality as-
sessment methods. Finally, Section VI draws conclusions and
outlines potential future work.
II. RELATED WORK
In this section, we first describe previous works in video
frame interpolation, and then summarise the related research
work on objective video quality metrics and subjective quality
assessment in the context of VFI.
A. Video Frame Interpolation
Early attempts [30, 31] to perform video frame interpolation
typically used estimated optical flow maps to warp input
frames. This paradigm, referred to as flow-based VFI, was
further developed in the learning-based VFI literature [32].
These methods adopted various techniques to enhance interpo-
lation quality, including the use of contextual information [33],
designing bespoke flow estimation module [9, 15, 16, 32,
34–37], employing a coarse-to-fine refinement strategy [38–
41], developing new warping operations [17, 42, 43] and
adopting higher-order motion modelling with additional input
frames [13, 18, 44]. Some researchers argue that the imposi-
tion of a one-to-one mapping between the target and source
pixels can limit the ability of flow-based methods to handle
complex motions. This has led to the development of kernel-
based methods [45–54] that predict adaptive local interpolation kernels to synthesise the output pixels. This creates a
many-to-one mapping between the source and target pixels,
supporting additional degrees of freedom. Moreover, other
researchers reported the limitations of predicting fixed-shaped
kernels [45, 46], and introduced deformable kernels [55] to
achieve improved interpolation performance. Finally, observ-
ing that fixed kernel sizes can limit the captured motion
magnitude, some VFI methods [14, 56, 57] combine flow-
based and kernel-based approaches in a single framework to
benefit from both model types.
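To make the flow-based paradigm concrete, the core warping step, backward-warping a reference frame with an optical flow field via bilinear sampling, might be sketched as follows. This is an illustrative numpy implementation of the general technique, not code from any cited method.

```python
import numpy as np

def backward_warp(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp a greyscale frame (H, W) by an optical flow field (H, W, 2)
    using bilinear sampling; flow[y, x] points from the target pixel
    back into the source frame."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Source sampling positions, clipped to the frame border.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = sx - x0, sy - y0
    f = frame.astype(np.float64)
    # Bilinear blend of the four neighbouring source pixels.
    top = f[y0, x0] * (1 - wx) + f[y0, x1] * wx
    bot = f[y1, x0] * (1 - wx) + f[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

A kernel-based method would instead predict, per output pixel, a small local kernel that is convolved with a source patch, yielding the many-to-one mapping described above.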
Besides the flow-/kernel-based classes, other VFI paradigms
exist, for example based on pixel hallucination [26, 58], phase
information [59, 60], event cameras [61–63], unsupervised
learning [64, 65], and meta-learning [66]. More recently, the
joint problem of deblurring and interpolation has also been
addressed in [67, 68].
B. Objective Quality Assessment for VFI
In the current VFI literature, the commonly adopted ap-
proach for benchmarking interpolation performance is to mea-
sure the distortion between an interpolated video and its
ground-truth version. The most popular methods are PSNR,
SSIM [19] and LPIPS [20], all of which are applied at the level
of a single image or frame. There are also many image/video
quality metrics developed for other applications, including
approaches based on classic signal processing methods, e.g.,
MS-SSIM [69], VIF [70], VSI [21], FAST [24], SpEED [71],
VIQE [72] and ST-RRED [73].

TABLE I
THE UNIFORMITY AND RANGE CHARACTERISTICS OF THE 36 SOURCE SEQUENCES IN THE BVI-VFI DATABASE.
Feature          SI    TI    CF    DTP   MV
Uniformity (0-1) 0.93  0.85  0.95  0.87  0.87
Range (0-1)      0.88  0.97  0.74  0.99  0.99

More recently, machine learning techniques have been employed in the development of perceptual metrics including VMAF [23], C3DVQA [25] and
CONTRIQUE [74]. Alongside these generic quality metrics,
assessment methods that were designed to specifically model
the effect of frame rate/spatial resolution down-sampling or
frame interpolation have been reported, including FRQM [75],
ST-GREED [76], VSTR [77], FAVER [78] and FloLPIPS.
However, none of these methods have been rigorously bench-
marked due to the lack of databases with diverse content and
ground-truth metadata.
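Such benchmarking typically correlates metric outputs against subjective scores. For instance, the Spearman rank-order correlation coefficient (SROCC) commonly reported in these studies can be sketched as below; this is a tie-free illustration, and production code would normally use a library routine that handles tied ranks.

```python
import numpy as np

def srocc(metric_scores, mos) -> float:
    """Spearman rank-order correlation between objective scores and mean
    opinion scores, assuming no tied values (Pearson correlation of ranks)."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v), dtype=np.float64)
        r[order] = np.arange(len(v), dtype=np.float64)
        return r
    rx, ry = ranks(np.asarray(metric_scores)), ranks(np.asarray(mos))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))
```

A metric that ranks videos exactly as viewers do scores 1.0; a metric that inverts the ranking scores -1.0.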
C. Subjective Quality Assessment for VFI
Various subjective quality databases exist that support studies into how the human visual system (HVS) perceives video
quality. These include those developed in the context of video
compression, e.g., the early VQEG FR-TV Phase I [79],
LIVE VQA [80], LIVE Mobile [81], CSIQ-VQA [82], BVI-
HD [83], LIVE-SJTU [84], TVG [85], CSCVQ [86] and
LIVE Livestream [87]. Video quality databases also exist that
support investigations into distortions due to video parameter
variations (e.g., frame rate, spatial resolution and bit depth,
with or without video compression artefacts), including MCL-
V [88], MCML-4K-UHD [89], BVI-SR [7], BVI-HFR [7],
BVI-BD [90], FRD-VQA [91], LIVE-YT-HFR [92], AVT-
VQDB-UHD-1 [93] and ETRI-LIVE STSVQ [94].
There are, however, very few examples of databases that
contain content distorted by different VFI methods; the most
relevant contributions being [27] and [28]. In [27], subjective
quality scores were collected by showing viewers individual
interpolated frames instead of video sequences. This method-
ology is limited since it does not consider temporal arte-
facts which could significantly influence the perceived video
quality [95]. The later database, KosMo-1k [28], addresses
this issue by collecting subjective opinions when viewing
interpolated videos. However, the video sequences were played
in slow-motion (one specific use case of VFI) and, in addition,
all the distorted sequences were contaminated by video com-
pression artefacts, making it difficult for subjects to isolate VFI
artefacts. Given these limitations, none of these databases
can be recommended for evaluating the performance of VFI
quality metrics. Hence there is an urgent requirement for a
bespoke video quality database.
III. THE BVI-VFI DATABASE
This section describes the approach used to select the source
sequences in the BVI-VFI database, and how the test videos
were generated.
A. Reference Sequences
The BVI-VFI database contains 108 reference sequences in
total, which were generated from 36 different source videos
captured at 120fps and resampled to 60 and 30 fps. The
36 source videos have three different spatial resolutions: 12
at 3840×2160 (UHD-1), 12 at 1920×1080 (HD), and 12 at
960×540. For each resolution group, we: (i) first created a
selection pool consisting of 120fps, YUV 4:2:0 8 bit video
candidates collected from various sources; (ii) then calculated
five video features: Motion Vector (MV), Dynamic Texture Parameter (DTP), Spatial Information (SI), Temporal Information (TI) and Colourfulness (CF) for each candidate; (iii)
used the selection algorithm described in [83] to determine
the final source sequences in the BVI-VFI database. This
selection procedure ensures that extreme cases of each feature
are included, thus covering the most challenging VFI scenarios
such as large motions and dynamic textures (see Fig. 1
for examples). Among all these five features, MV and DTP
describe the overall motion magnitude and textural complexity
in the video whereas the other three features are employed to
characterise the range and diversity of spatio-temporal activity
and color in the video database [7, 92, 94]. A description of
MV, SI, TI and CF can be found in [96, 97], and DTP is
described in [83].
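Two of these features can be made concrete. Assuming the ITU-T P.910 definitions of SI and TI that are standard in such studies (the paper points to [96, 97] for details; the sketch below is ours, not the authors' exact code), SI is the maximum over frames of the standard deviation of the Sobel-filtered luma, and TI the maximum standard deviation of successive frame differences.

```python
import numpy as np

def _sobel(f: np.ndarray):
    """3x3 Sobel gradients of a float image (valid interior region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = f.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            patch = f[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return gx, gy

def spatial_information(luma_frames) -> float:
    """SI (ITU-T P.910): max over frames of the std of the Sobel magnitude."""
    vals = []
    for f in luma_frames:
        gx, gy = _sobel(f.astype(np.float64))
        vals.append(float(np.hypot(gx, gy).std()))
    return max(vals)

def temporal_information(luma_frames) -> float:
    """TI (ITU-T P.910): max over successive frame pairs of the std of the difference."""
    return max(float(np.std(b.astype(np.float64) - a.astype(np.float64)))
               for a, b in zip(luma_frames, luma_frames[1:]))
```

High SI indicates spatially complex content, high TI indicates strong motion; together they help verify the coverage reported in Table I.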
All 22 HD source candidates (in YUV 4:2:0 8 bit 120fps
format) came from the BVI-HFR dataset [7]. They were first
truncated to five seconds following the research study in [97].
The choice of five-second duration is further justified because
the source videos do not contain scene cuts and the featured
motion characteristics in most videos do not vary significantly,
thus allowing subjects to perceive temporal artefacts more
easily. The effectiveness of the video duration has also been
validated by high subject consistency in Section IV-C. Twelve1
sequences were selected from this pool using the algorithm
in [83] to ensure wide feature coverage range and uniform
feature distribution [96] across the whole database2. We then
followed the same procedure for the UHD-1 resolution group,
for which 27 source candidates were collected from a variety
of sources, including five from the LIVE-YT-HFR dataset [92],
six from the UVG dataset [98], and 16 captured using a
RED Epic-X video camera at the University of Bristol. All
sequences were also trimmed to five seconds (600 frames) and
converted to YUV 4:2:0 8 bit format to ensure consistency
with the above-mentioned HD sources. In addition, because
of the limited spatial resolution (up to 1920×1080) of the
high frame rate display employed in the subjective experi-
ment, we further cropped HD representations of these UHD-1
videos. Specifically, we generated 9 cropped candidates for
each UHD-1 video (x ∈ {0, 960, 1920}, y ∈ {0, 540, 1080}, where (x, y) is the top-left coordinate of the crop), and
selected the crop that has the highest content similarity to
the original UHD-1 video in terms of two feature descriptors:
MV and DTP. By doing so, we obtained 27 cropped videos,
which retain the characteristics of UHD-1 content despite their
smaller spatial extent. From these 27 sources, we selected 12
UHD-1 source videos using the same procedure as for the HD resolution group.

1 The number of source sequences for each resolution group was determined based on the trade-off between available resources for computation and subjective testing, and the optimisation of feature uniformity and range values.
2 The preliminary results based on HD reference and distorted sequences have been published in [29].

Fig. 1. Sample frames from the 36 source sequences of the BVI-VFI database. (1)-(12): sequences at 960×540. (13)-(24): sequences at 1920×1080. (25)-(36): sequences at 3840×2160.

For the resolution group of 960×540,
due to the lack of publicly available 120fps videos at this
resolution, we generated a selection pool by spatially down-
sampling (using a Lanczos3 filter) the unused 120fps videos
from the previously collected HD and UHD-1 candidates.
Then we applied again the same sequence selection process
to obtain the 12 source sequences at 960×540.
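The 3×3 crop-candidate generation and selection described above for the UHD-1 sources can be sketched as follows. The similarity measure here is a hypothetical stand-in: the paper compares MV and DTP feature values, whose implementations are not reproduced, so `features` is a user-supplied callable.

```python
import numpy as np
from itertools import product

def crop_candidates(video: np.ndarray):
    """Yield the nine 1920x1080 crops of a (T, 2160, 3840) UHD-1 video,
    with top-left corners x in {0, 960, 1920} and y in {0, 540, 1080}."""
    for x, y in product((0, 960, 1920), (0, 540, 1080)):
        yield (x, y), video[:, y:y + 1080, x:x + 1920]

def select_crop(video: np.ndarray, features):
    """Pick the crop whose feature vector (e.g. [MV, DTP]) is closest in
    Euclidean distance to that of the full UHD-1 video."""
    target = np.asarray(features(video), dtype=np.float64)
    return min(crop_candidates(video),
               key=lambda c: np.linalg.norm(
                   np.asarray(features(c[1]), dtype=np.float64) - target))
```

The selected crop then serves as an HD proxy that preserves the motion and texture characteristics of the original UHD-1 content.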
Finally, to obtain reference sequences sampled at var-
ious frame rates, the twelve 1920×1080, twelve cropped
3840×2160 and the twelve 960×540 source videos at 120fps
were temporally sub-sampled to 60fps and 30fps by frame
dropping, resulting in 108 references. An alternative sub-
sampling method, frame averaging, can create visible ghosting
artefacts in cases with small shutter angles and introduce
additional motion blur. Both of these artefacts can seriously
deteriorate the frame interpolation performance [67, 68]. In
contrast, although frame dropping reduces the original shutter
angle which may introduce motion judder, the resulting frames
provide a superior basis for VFI applications. Sample frames of all the final source sequences are shown in Fig. 1. Table I
reports the range and uniformity characteristics of the source
sequences in BVI-VFI. It can be observed that the selected
source sequences offer a wide and uniform coverage for all
the spatial and temporal features measured.
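The frame-dropping operation itself is trivial; assuming frames are indexed along the first axis, it reduces to strided slicing:

```python
def drop_frames(frames, factor: int):
    """Temporally sub-sample by keeping every `factor`-th frame,
    e.g. factor=2 turns 120fps into 60fps, factor=4 into 30fps."""
    return frames[::factor]

# 600 frames at 120fps -> 300 frames at 60fps -> 150 frames at 30fps
src = list(range(600))
print(len(drop_frames(src, 2)), len(drop_frames(src, 4)))
```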
B. Distorted Sequence Generation
To generate different distorted versions of the 108 reference
sequences, we first halved their frame rates by dropping every
second frame, and then reconstructed the dropped frames
using five VFI algorithms: frame repeating, frame averaging (where each missing frame is generated by averaging its two neighbouring frames), DVF [34], QVI [13] and ST-MFNet [14]. The
first two methods were included because they have very low
computational complexity and produce unique artefact types,
motion judder and motion blur respectively. The other three algorithms are all based on deep learning, and were chosen for the following reasons.
• DVF is one of the earliest and best-known flow-based deep learning VFI methods. It inspired a series of new VFI methods that improved upon it, and is representative of the class of VFI methods that rely on a linear motion assumption to perform frame warping.
• QVI is the first flow-based VFI method that explicitly models higher-order (second-order) motion. Many following works [18, 37, 44] have drawn from the components of QVI to develop new methods. Therefore, QVI is representative of flow-based VFI methods that assume non-linear motion.
• ST-MFNet is a state-of-the-art VFI method that is representative of kernel-based VFI approaches, where interpolation kernels are predicted and used for frame synthesis. Equipped with kernel-based warping and deep CNNs for frame processing, ST-MFNet can generate similar artefacts to many other kernel-based [45, 47, 50] and end-to-end methods [26, 58].
For the deep learning-based models, we employed the model parameters pre-trained as in [14] due to their proven performance on various challenging content. As a result, a total of
540 (108×5) distorted videos were obtained. Fig. 2 shows
example frames interpolated by all five VFI methods, where it can be seen that diverse artefact types have been generated.
It should be noted that we did not employ video compression
during the test sequence generation process, hence our content
is free from compression artefacts, ensuring a focus solely on
the perceptual quality of VFI-generated content.
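The two non-learned baselines can be sketched as follows. We assume here that each dropped frame sits between two retained neighbours; the exact indexing used in the study may differ.

```python
import numpy as np

def interpolate(kept_frames, mode: str):
    """Rebuild a double-rate sequence from `kept_frames` by inserting one
    frame between each consecutive pair: either a repeat of the previous
    frame (producing judder) or the average of the two neighbours
    (producing blur/ghosting)."""
    out = []
    for a, b in zip(kept_frames, kept_frames[1:]):
        out.append(a)
        if mode == "repeat":
            out.append(a.copy())
        elif mode == "average":
            out.append(((a.astype(np.float64) + b) / 2).astype(a.dtype))
        else:
            raise ValueError(mode)
    out.append(kept_frames[-1])
    return out

kept = [np.full((2, 2), v, dtype=np.uint8) for v in (0, 100)]
rep = interpolate(kept, "repeat")   # inserted frame duplicates the first
avg = interpolate(kept, "average")  # inserted frame is the mid-point, value 50
```

These baselines anchor the distortion range: their artefacts (judder and blur) are the characteristic failure modes that the learned methods attempt to avoid.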
C. Summary
To summarise, in total the BVI-VFI database contains
108 reference videos and 540 distorted (interpolated) videos
generated by applying five VFI algorithms to each reference
video. The 108 reference sequences correspond to 36 source
sequences at three frame rates and three spatial resolutions.
The number of sequences employed for subjective testing was