BVI-VFI: A Video Quality Database for Video
Frame Interpolation
Duolikun Danier, Student Member, IEEE, Fan Zhang, Member, IEEE, and David R. Bull, Fellow, IEEE
Abstract—Video frame interpolation (VFI) is a fundamental
research topic in video processing, which is currently attracting
increased attention across the research community. While the
development of more advanced VFI algorithms has been ex-
tensively researched, there remains little understanding of how
humans perceive the quality of interpolated content and how
well existing objective quality assessment methods perform when
measuring the perceived quality. In order to narrow this research
gap, we have developed a new video quality database named
BVI-VFI, which contains 540 distorted sequences generated by
applying five commonly used VFI algorithms to 36 diverse
source videos with various spatial resolutions and frame rates.
We collected more than 10,800 quality ratings for these videos
through a large scale subjective study involving 189 human
subjects. Based on the collected subjective scores, we further
analysed the influence of VFI algorithms and frame rates on
the perceptual quality of interpolated videos. Moreover, we
benchmarked the performance of 33 classic and state-of-the-
art objective image/video quality metrics on the new database,
and demonstrated the urgent requirement for more accurate
bespoke quality assessment methods for VFI. To facilitate further
research in this area, we have made BVI-VFI publicly available
at https://github.com/danier97/BVI-VFI-database.
Index Terms—Video quality database, subjective quality assess-
ment, video frame interpolation, perceptual quality, BVI-VFI.
I. INTRODUCTION
Video frame interpolation (VFI) is an important video
processing technique which is used to synthesise intermediate
frames between every two consecutive frames in a video
sequence. Conventionally known as motion interpolation, VFI
has been employed to increase the frame rates of content
captured at low frame rates, and to compensate for motion
blur in LCD displays [1]. Although the “soap opera effect”
(the unnaturalness in perception caused by the extra non-
real frames being displayed) can arise as a byproduct of
such a process, various previous studies [2–8] have confirmed
that high-frame-rate (HFR) formats provide improved view-
ing experience in terms of perceived motion smoothness,
perceived realism, and immersiveness. Since VFI enables
HFR formats to be generated from low-frame-rate videos,
it offers significant promise for improving perceptual video
quality. VFI also offers broader utility and has attracted increasing interest across many applications beyond HFR format generation in recent years; these include slow-motion generation [9], video compression [10], medical imaging [11] and animation production [12].
This work involved collecting data from human participants. The relevant experiments were approved by the Faculty of Engineering Research Ethics Committee of the University of Bristol (Ref 10739; PI: David R. Bull; Title: Subjective Quality Study on Video Frame Interpolation).
The authors are with the Bristol Vision Institute, University of Bristol, Bristol BS8 1TH, U.K. (e-mail: duolikun.danier@bristol.ac.uk; fan.zhang@bristol.ac.uk; dave.bull@bristol.ac.uk).
The authors acknowledge funding from the China Scholarship Council, the University of Bristol and the UKRI MyWorld Strength in Places Programme.
Other than frame repetition and averaging, early VFI meth-
ods employed in televisions and other display devices were
mainly based on motion estimation and compensation, where
motion vectors between frames were used to interpolate the
intermediate pixels [1]. Recently, driven by the development
of various deep learning techniques and more powerful com-
putational hardware, there has been a surge in the reporting of
new video frame interpolation methods, which are generally
classified into two groups: flow- and kernel-based. While flow-
based methods rely on optical flow to warp reference frames,
kernel-based approaches estimate local interpolation kernels to
synthesise output pixels. Although these VFI approaches have
delivered significant improvements in terms of interpolation
performance [13–18], challenging scenarios still exist that
cause interpolation failure; these often relate to content con-
taining large motions, dynamic textures, and occlusions [14].
While there is ongoing activity to develop new VFI methods
that tackle these challenges, the perceptual quality assessment
of frame interpolated content remains underinvestigated. Cur-
rently, the most widely adopted approach for assessing VFI
performance is to calculate the distortion between the interpo-
lated frames and their original ground-truth counterparts using
image quality assessment (IQA) models including PSNR,
SSIM [19], and LPIPS [20]. More recently, new perceptually
oriented image and video quality metrics have been developed
for other applications such as video compression, with no-
table examples including VSI [21], DISTS [22], VMAF [23],
FAST [24] and C3DVQA [25]. However, none of these models
has been fully evaluated on frame interpolated videos against
subjective ground truth. Due to this concern, in order to
accurately assess VFI performance, some researchers [14, 26]
have resorted to benchmarking based on subjective opinion
scores through psychophysical experiments; these however
are very time consuming and resource-heavy. In this context,
there is an urgent need to develop a video quality database
containing diverse VFI content alongside reliable subjective
score metadata, which can be employed to investigate the
competence of existing quality metrics for the VFI task.
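For illustration, the prevailing frame-level evaluation protocol described above can be sketched as follows. The function names are ours, and the snippet assumes frames stored as 8-bit numpy arrays; PSNR only is shown here, as SSIM and LPIPS require their respective reference implementations.

```python
import numpy as np

def psnr(ref: np.ndarray, dis: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a ground-truth and an interpolated frame."""
    mse = np.mean((ref.astype(np.float64) - dis.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def evaluate_vfi(gt_frames, interp_frames) -> float:
    """Average per-frame PSNR over all interpolated frames of a sequence."""
    scores = [psnr(g, d) for g, d in zip(gt_frames, interp_frames)]
    return float(np.mean(scores))

# Toy example: a constant offset of 1 on an 8-bit frame gives PSNR of about 48.13 dB.
gt = [np.full((4, 4), 128, dtype=np.uint8)]
pred = [np.full((4, 4), 129, dtype=np.uint8)]
print(round(evaluate_vfi(gt, pred), 2))
```

Averaging a per-frame metric in this way captures spatial fidelity but, as the text notes, says nothing about temporal artefacts such as judder.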
Although there have been few reports of research in this area [27, 28], the associated databases either contain human
opinion scores only on single interpolated images, or only
focus on slow-motion videos at a fixed frame rate. Also, the
video sequences in [28] suffer from compression artefacts in addition to VFI-related distortions, making it difficult to decouple these during assessment.
arXiv:2210.00823v3 [eess.IV] 21 Oct 2023
We have previously addressed
these issues in [29], where a small video quality database for
VFI was developed based on a limited subjective study and a
benchmark experiment only involving a few objective quality
metrics. To overcome these limits and make further progress
in understanding the perceptual quality of frame interpolated
videos, in this paper we extend our previous work [29] and
present a new video quality database, BVI-VFI, which contains
540 interpolated videos generated by various VFI algorithms,
covering different frame rates, spatial resolutions and diverse
content types. The database also includes subjective quality
scores for all videos collected through a large-scale subjective
experiment. Based on the subjective data, we performed a
much more comprehensive evaluation of existing objective
quality metrics, involving 33 conventional and learning-based
image and video quality models. This work differs from our
previous work [29] in the following aspects.
• Compared to the original 36 reference and 180 distorted sequences, the new BVI-VFI database contains 108 reference and 540 distorted sequences.
• While the study in [29] concerns only HD videos, in this work we cover three resolutions: 960×540, 1920×1080, and 3840×2160.
• The study on the subjective data in [29] was limited to the effect of frame rate. In this work we additionally analyse the impact of various video features on the perceived quality of frame interpolated videos.
• While in [29] only eight full-reference quality metrics were benchmarked, in this work we evaluate 33 image/video quality metrics, covering both full- and no-reference categories. Additionally, we perform cross-validation experiments to better reflect the performance of learning-based metrics.
The primary contributions are summarised below.
• We developed the first bespoke video quality database for frame interpolation, BVI-VFI, which covers multiple frame rates (30-120fps) and spatial resolutions (540p-2160p). It contains 540 distorted sequences generated by five different video frame interpolation methods from 36 source videos, which uniformly cover a wide range of video features.
• We conducted a large-scale laboratory-based psychophysical experiment to collect subjective quality ground truths for all the videos in BVI-VFI.
• We performed a quantitative comparison of 33 classic and state-of-the-art image/video quality assessment methods on the BVI-VFI database. Cross-validation experiments were also performed for learning-based metrics.
• The proposed new database serves as an important platform for developing and validating new quality metrics for video frame interpolation. It can also be used as a test dataset for benchmarking VFI algorithms due to its content diversity.
The rest of the paper is organised as follows. We first
briefly review the relevant literature in Section II, and then
describe the process of source video collection and test se-
quence generation in Section III. The subjective experiment,
the data processing procedures and analysis of the collected
subjective opinions are presented in Section IV. Section V
summarises the comparative study results for 33 quality as-
sessment methods. Finally, Section VI draws conclusions and
outlines potential future work.
II. RELATED WORK
In this section, we first describe previous works in video
frame interpolation, and then summarise the related research
work on objective video quality metrics and subjective quality
assessment in the context of VFI.
A. Video Frame Interpolation
Early attempts [30, 31] to perform video frame interpolation
typically used estimated optical flow maps to warp input
frames. This paradigm, referred to as flow-based VFI, was
further developed in the learning-based VFI literature [32].
These methods adopted various techniques to enhance interpo-
lation quality, including the use of contextual information [33],
designing bespoke flow estimation module [9, 15, 16, 32,
34–37], employing a coarse-to-fine refinement strategy [38–
41], developing new warping operations [17, 42, 43] and
adopting higher-order motion modelling with additional input
frames [13, 18, 44]. Some researchers argue that the imposi-
tion of a one-to-one mapping between the target and source
pixels can limit the ability of flow-based methods to handle
complex motions. This has led to the development of kernel-
based methods [45–54] that predict adaptive local interpolation kernels to synthesise the output pixels. This creates a
many-to-one mapping between the source and target pixels,
supporting additional degrees of freedom. Moreover, other
researchers reported the limitations of predicting fixed-shaped
kernels [45, 46], and introduced deformable kernels [55] to
achieve improved interpolation performance. Finally, observ-
ing that fixed kernel sizes can limit the captured motion
magnitude, some VFI methods [14, 56, 57] combine flow-
based and kernel-based approaches in a single framework to
benefit from both model types.
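To make the flow-based paradigm concrete, the core warping step, backward-warping a reference frame with an optical flow field via bilinear sampling, might be sketched as follows. This is an illustrative numpy implementation of the general technique, not code from any cited method.

```python
import numpy as np

def backward_warp(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp a greyscale frame (H, W) by an optical flow field (H, W, 2)
    using bilinear sampling; flow[y, x] points from the target pixel
    back into the source frame."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Source sampling positions, clipped to the frame border.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = sx - x0, sy - y0
    f = frame.astype(np.float64)
    # Bilinear blend of the four neighbouring source pixels.
    top = f[y0, x0] * (1 - wx) + f[y0, x1] * wx
    bot = f[y1, x0] * (1 - wx) + f[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

A kernel-based method would instead predict, per output pixel, a small local kernel that is convolved with a source patch, yielding the many-to-one mapping described above.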
Besides the flow-/kernel-based classes, other VFI paradigms
exist, for example based on pixel hallucination [26, 58], phase
information [59, 60], event cameras [61–63], unsupervised
learning [64, 65], and meta-learning [66]. More recently, the
joint problem of deblurring and interpolation has also been
addressed in [67, 68].
B. Objective Quality Assessment for VFI
In the current VFI literature, the commonly adopted ap-
proach for benchmarking interpolation performance is to mea-
sure the distortion between an interpolated video and its
ground-truth version. The most popular methods are PSNR,
SSIM [19] and LPIPS [20], all of which are applied at the level
of a single image or frame. There are also many image/video
quality metrics developed for other applications, including
approaches based on classic signal processing methods, e.g.,
MS-SSIM [69], VIF [70], VSI [21], FAST [24], SpEED [71],
VIQE [72] and ST-RRED [73].

TABLE I
THE UNIFORMITY AND RANGE CHARACTERISTICS OF THE 36 SOURCE SEQUENCES IN THE BVI-VFI DATABASE.
Feature          SI    TI    CF    DTP   MV
Uniformity (0-1) 0.93  0.85  0.95  0.87  0.87
Range (0-1)      0.88  0.97  0.74  0.99  0.99

More recently, machine learning techniques have been employed in the development of perceptual metrics including VMAF [23], C3DVQA [25] and
CONTRIQUE [74]. Alongside these generic quality metrics,
assessment methods that were designed to specifically model
the effect of frame rate/spatial resolution down-sampling or
frame interpolation have been reported, including FRQM [75],
ST-GREED [76], VSTR [77], FAVER [78] and FloLPIPS.
However, none of these methods have been rigorously bench-
marked due to the lack of databases with diverse content and
ground-truth metadata.
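Such benchmarking typically correlates metric outputs against subjective scores. For instance, the Spearman rank-order correlation coefficient (SROCC) commonly reported in these studies can be sketched as below; this is a tie-free illustration, and production code would normally use a library routine that handles tied ranks.

```python
import numpy as np

def srocc(metric_scores, mos) -> float:
    """Spearman rank-order correlation between objective scores and mean
    opinion scores, assuming no tied values (Pearson correlation of ranks)."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v), dtype=np.float64)
        r[order] = np.arange(len(v), dtype=np.float64)
        return r
    rx, ry = ranks(np.asarray(metric_scores)), ranks(np.asarray(mos))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))
```

A metric that ranks videos exactly as viewers do scores 1.0; a metric that inverts the ranking scores -1.0.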
C. Subjective Quality Assessment for VFI
Various subjective quality databases exist that support studies into how the human visual system (HVS) perceives video
quality. These include those developed in the context of video
compression, e.g., the early VQEG FR-TV Phase I [79],
LIVE VQA [80], LIVE Mobile [81], CSIQ-VQA [82], BVI-
HD [83], LIVE-SJTU [84], TVG [85], CSCVQ [86] and
LIVE Livestream [87]. Video quality databases also exist that
support investigations into distortions due to video parameter
variations (e.g., frame rate, spatial resolution and bit depth,
with or without video compression artefacts), including MCL-
V [88], MCML-4K-UHD [89], BVI-SR [7], BVI-HFR [7],
BVI-BD [90], FRD-VQA [91], LIVE-YT-HFR [92], AVT-
VQDB-UHD-1 [93] and ETRI-LIVE STSVQ [94].
There are, however, very few examples of databases that
contain content distorted by different VFI methods; the most
relevant contributions being [27] and [28]. In [27], subjective
quality scores were collected by showing viewers individual
interpolated frames instead of video sequences. This method-
ology is limited since it does not consider temporal arte-
facts which could significantly influence the perceived video
quality [95]. The later database, KosMo-1k [28], addresses
this issue by collecting subjective opinions when viewing
interpolated videos. However, the video sequences were played
in slow-motion (one specific use case of VFI) and, in addition,
all the distorted sequences were contaminated by video com-
pression artefacts, making it difficult for subjects to isolate VFI
artefacts. Given these limitations, none of these databases
can be recommended for evaluating the performance of VFI
quality metrics. Hence there is an urgent requirement for a
bespoke video quality database.
III. THE BVI-VFI DATABASE
This section describes the approach used to select the source
sequences in the BVI-VFI database, and how the test videos
were generated.
A. Reference Sequences
The BVI-VFI database contains 108 reference sequences in
total, which were generated from 36 different source videos
captured at 120fps and resampled to 60 and 30 fps. The
36 source videos have three different spatial resolutions: 12
at 3840×2160 (UHD-1), 12 at 1920×1080 (HD), and 12 at
960×540. For each resolution group, we: (i) first created a
selection pool consisting of 120fps, YUV 4:2:0 8 bit video
candidates collected from various sources; (ii) then calculated
five video features: Motion Vector (MV), Dynamic Texture Parameter (DTP), Spatial Information (SI), Temporal Information (TI) and Colourfulness (CF) for each candidate; (iii)
used the selection algorithm described in [83] to determine
the final source sequences in the BVI-VFI database. This
selection procedure ensures that extreme cases of each feature
are included, thus covering the most challenging VFI scenarios
such as large motions and dynamic textures (see Fig. 1
for examples). Among all these five features, MV and DTP
describe the overall motion magnitude and textural complexity
in the video whereas the other three features are employed to
characterise the range and diversity of spatio-temporal activity
and color in the video database [7, 92, 94]. A description of
MV, SI, TI and CF can be found in [96, 97], and DTP is
described in [83].
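Two of these features can be made concrete. Assuming the ITU-T P.910 definitions of SI and TI that are standard in such studies (the paper points to [96, 97] for details; the sketch below is ours, not the authors' exact code), SI is the maximum over frames of the standard deviation of the Sobel-filtered luma, and TI the maximum standard deviation of successive frame differences.

```python
import numpy as np

def _sobel(f: np.ndarray):
    """3x3 Sobel gradients of a float image (valid interior region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = f.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            patch = f[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return gx, gy

def spatial_information(luma_frames) -> float:
    """SI (ITU-T P.910): max over frames of the std of the Sobel magnitude."""
    vals = []
    for f in luma_frames:
        gx, gy = _sobel(f.astype(np.float64))
        vals.append(float(np.hypot(gx, gy).std()))
    return max(vals)

def temporal_information(luma_frames) -> float:
    """TI (ITU-T P.910): max over successive frame pairs of the std of the difference."""
    return max(float(np.std(b.astype(np.float64) - a.astype(np.float64)))
               for a, b in zip(luma_frames, luma_frames[1:]))
```

High SI indicates spatially complex content, high TI indicates strong motion; together they help verify the coverage reported in Table I.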
All 22 HD source candidates (in YUV 4:2:0 8 bit 120fps
format) came from the BVI-HFR dataset [7]. They were first
truncated to five seconds following the research study in [97].
The choice of five-second duration is further justified because
the source videos do not contain scene cuts and the featured
motion characteristics in most videos do not vary significantly,
thus allowing subjects to perceive temporal artefacts more
easily. The effectiveness of the video duration has also been
validated by high subject consistency in Section IV-C. Twelve1
sequences were selected from this pool using the algorithm
in [83] to ensure wide feature coverage range and uniform
feature distribution [96] across the whole database2. We then
followed the same procedure for the UHD-1 resolution group,
for which 27 source candidates were collected from a variety
of sources, including five from the LIVE-YT-HFR dataset [92],
six from the UVG dataset [98], and 16 captured using a
RED Epic-X video camera at the University of Bristol. All
sequences were also trimmed to five seconds (600 frames) and
converted to YUV 4:2:0 8 bit format to ensure consistency
with the above-mentioned HD sources. In addition, because
of the limited spatial resolution (up to 1920×1080) of the
high frame rate display employed in the subjective experi-
ment, we further cropped HD representations of these UHD-1
videos. Specifically, we generated 9 cropped candidates for
each UHD-1 video (x ∈ {0, 960, 1920}, y ∈ {0, 540, 1080}, where (x, y) is the top-left coordinate of the crop), and
selected the crop that has the highest content similarity to
the original UHD-1 video in terms of two feature descriptors:
MV and DTP. By doing so, we obtained 27 cropped videos,
which retain the characteristics of UHD-1 content despite their
smaller spatial extent. From these 27 sources, we selected 12
UHD-1 source videos using the same procedure as for the HD resolution group.

1 The number of source sequences for each resolution group was determined based on the trade-off between available resources for computation and subjective testing, and the optimisation of feature uniformity and range values.
2 The preliminary results based on HD reference and distorted sequences have been published in [29].

Fig. 1. Sample frames from the 36 source sequences of the BVI-VFI database. (1)-(12): sequences at 960×540. (13)-(24): sequences at 1920×1080. (25)-(36): sequences at 3840×2160.

For the resolution group of 960×540,
due to the lack of publicly available 120fps videos at this
resolution, we generated a selection pool by spatially down-
sampling (using a Lanczos3 filter) the unused 120fps videos
from the previously collected HD and UHD-1 candidates.
Then we applied again the same sequence selection process
to obtain the 12 source sequences at 960×540.
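The 3×3 crop-candidate generation and selection described above for the UHD-1 sources can be sketched as follows. The similarity measure here is a hypothetical stand-in: the paper compares MV and DTP feature values, whose implementations are not reproduced, so `features` is a user-supplied callable.

```python
import numpy as np
from itertools import product

def crop_candidates(video: np.ndarray):
    """Yield the nine 1920x1080 crops of a (T, 2160, 3840) UHD-1 video,
    with top-left corners x in {0, 960, 1920} and y in {0, 540, 1080}."""
    for x, y in product((0, 960, 1920), (0, 540, 1080)):
        yield (x, y), video[:, y:y + 1080, x:x + 1920]

def select_crop(video: np.ndarray, features):
    """Pick the crop whose feature vector (e.g. [MV, DTP]) is closest in
    Euclidean distance to that of the full UHD-1 video."""
    target = np.asarray(features(video), dtype=np.float64)
    return min(crop_candidates(video),
               key=lambda c: np.linalg.norm(
                   np.asarray(features(c[1]), dtype=np.float64) - target))
```

The selected crop then serves as an HD proxy that preserves the motion and texture characteristics of the original UHD-1 content.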
Finally, to obtain reference sequences sampled at var-
ious frame rates, the twelve 1920×1080, twelve cropped
3840×2160 and the twelve 960×540 source videos at 120fps
were temporally sub-sampled to 60fps and 30fps by frame
dropping, resulting in 108 references. An alternative sub-
sampling method, frame averaging, can create visible ghosting
artefacts in cases with small shutter angles and introduce
additional motion blur. Both of these artefacts can seriously
deteriorate the frame interpolation performance [67, 68]. In
contrast, although frame dropping reduces the original shutter
angle which may introduce motion judder, the resulting frames
provide a superior basis for VFI applications. Sample frames of all the final source sequences are shown in Fig. 1. Table I
reports the range and uniformity characteristics of the source
sequences in BVI-VFI. It can be observed that the selected
source sequences offer a wide and uniform coverage for all
the spatial and temporal features measured.
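The frame-dropping operation itself is trivial; assuming frames are indexed along the first axis, it reduces to strided slicing:

```python
def drop_frames(frames, factor: int):
    """Temporally sub-sample by keeping every `factor`-th frame,
    e.g. factor=2 turns 120fps into 60fps, factor=4 into 30fps."""
    return frames[::factor]

# 600 frames at 120fps -> 300 frames at 60fps -> 150 frames at 30fps
src = list(range(600))
print(len(drop_frames(src, 2)), len(drop_frames(src, 4)))
```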
B. Distorted Sequence Generation
To generate different distorted versions of the 108 reference
sequences, we first halved their frame rates by dropping every
second frame, and then reconstructed the dropped frames
using five VFI algorithms: frame repeating, frame averaging (where each missing frame is generated by averaging its two neighbouring frames), DVF [34], QVI [13] and ST-MFNet [14]. The
first two methods were included because they have very low
computational complexity and produce unique artefact types,
motion judder and motion blur respectively. The other three algorithms are all based on deep learning, and were chosen for the following reasons.
• DVF is one of the earliest and best-known flow-based deep learning VFI methods. It inspired a series of new VFI methods that improved upon it, and is representative of the class of VFI methods that rely on a linear motion assumption to perform frame warping.
• QVI is the first flow-based VFI method that explicitly models higher-order (second-order) motion. Many following works [18, 37, 44] have drawn from the components of QVI to develop new methods. Therefore, QVI is representative of flow-based VFI methods that assume non-linear motion.
• ST-MFNet is a state-of-the-art VFI method that is representative of kernel-based VFI approaches, where interpolation kernels are predicted and used for frame synthesis. Equipped with kernel-based warping and deep CNNs for frame processing, ST-MFNet can generate similar artefacts to many other kernel-based [45, 47, 50] and end-to-end methods [26, 58].
For the deep learning-based models, we employed the model parameters pre-trained as in [14] due to their proven performance on various challenging content. As a result, a total of
540 (108×5) distorted videos were obtained. Fig. 2 shows
example frames interpolated by all five VFI methods, where it can be seen that diverse artefact types have been generated.
It should be noted that we did not employ video compression
during the test sequence generation process, hence our content
is free from compression artefacts, ensuring a focus solely on
the perceptual quality of VFI-generated content.
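The two non-learned baselines can be sketched as follows. We assume here that each dropped frame sits between two retained neighbours; the exact indexing used in the study may differ.

```python
import numpy as np

def interpolate(kept_frames, mode: str):
    """Rebuild a double-rate sequence from `kept_frames` by inserting one
    frame between each consecutive pair: either a repeat of the previous
    frame (producing judder) or the average of the two neighbours
    (producing blur/ghosting)."""
    out = []
    for a, b in zip(kept_frames, kept_frames[1:]):
        out.append(a)
        if mode == "repeat":
            out.append(a.copy())
        elif mode == "average":
            out.append(((a.astype(np.float64) + b) / 2).astype(a.dtype))
        else:
            raise ValueError(mode)
    out.append(kept_frames[-1])
    return out

kept = [np.full((2, 2), v, dtype=np.uint8) for v in (0, 100)]
rep = interpolate(kept, "repeat")   # inserted frame duplicates the first
avg = interpolate(kept, "average")  # inserted frame is the mid-point, value 50
```

These baselines anchor the distortion range: their artefacts (judder and blur) are the characteristic failure modes that the learned methods attempt to avoid.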
C. Summary
To summarise, in total the BVI-VFI database contains
108 reference videos and 540 distorted (interpolated) videos
generated by applying five VFI algorithms to each reference
video. The 108 reference sequences correspond to 36 source
sequences at three frame rates and three spatial resolutions.
The number of sequences employed for subjective testing was