
TABLE I
THE UNIFORMITY AND RANGE CHARACTERISTICS OF THE 36 SOURCE
SEQUENCES IN THE BVI-VFI DATABASE.

Feature           SI     TI     CF     DTP    MV
Uniformity (0-1)  0.93   0.85   0.95   0.87   0.87
Range (0-1)       0.88   0.97   0.74   0.99   0.99
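The uniformity and range scores in Table I can be computed along the lines of [96]. The sketch below is a minimal illustration, assuming uniformity is measured as the normalised entropy of a histogram of feature values and range as the fraction of the feasible feature span covered by the selection; the exact bin count and normalisation are assumptions, and the function names (`feature_uniformity`, `feature_range`) are illustrative.

```python
import numpy as np

def feature_uniformity(values, num_bins=10):
    """Uniformity as the normalised Shannon entropy of a histogram of the
    feature values: 1.0 means the values spread evenly across all bins,
    0.0 means they collapse into a single bin."""
    hist, _ = np.histogram(values, bins=num_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins (0 * log 0 := 0)
    return float(-(p * np.log2(p)).sum() / np.log2(num_bins))

def feature_range(values, feasible_min, feasible_max):
    """Fraction of the feasible feature span covered by the selection."""
    return float((max(values) - min(values)) / (feasible_max - feasible_min))
```

For a selection whose feature values fill the bins evenly, uniformity approaches 1, matching the near-unity values reported in the table.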
perceptual metrics including VMAF [23], C3DVQA [25] and
CONTRIQUE [74]. Alongside these generic quality metrics,
assessment methods that were designed to specifically model
the effect of frame rate/spatial resolution down-sampling or
frame interpolation have been reported, including FRQM [75],
ST-GREED [76], VSTR [77], FAVER [78] and FloLPIPS.
However, none of these methods has been rigorously benchmarked due to
the lack of databases with diverse content and ground-truth metadata.
C. Subjective Quality Assessment for VFI
Various subjective quality databases exist that support studies into
how the human visual system (HVS) perceives video quality. These
include databases developed in the context of video compression,
e.g., the early VQEG FR-TV Phase I [79], LIVE VQA [80], LIVE
Mobile [81], CSIQ-VQA [82], BVI-HD [83], LIVE-SJTU [84], TVG [85],
CSCVQ [86] and LIVE Livestream [87]. Video quality databases also
exist that support investigations into distortions caused by video
parameter variations (e.g., frame rate, spatial resolution and bit
depth, with or without video compression artefacts), including
MCL-V [88], MCML-4K-UHD [89], BVI-SR [7], BVI-HFR [7], BVI-BD [90],
FRD-VQA [91], LIVE-YT-HFR [92], AVT-VQDB-UHD-1 [93] and ETRI-LIVE
STSVQ [94].
There are, however, very few examples of databases that
contain content distorted by different VFI methods; the most
relevant contributions being [27] and [28]. In [27], subjective
quality scores were collected by showing viewers individual
interpolated frames rather than video sequences. This methodology is
limited because it does not account for temporal artefacts, which can
significantly influence perceived video quality [95]. The later
database, KosMo-1k [28], addresses this issue by collecting subjective
opinions from viewers watching interpolated videos. However, the
sequences were played in slow motion (one specific use case of VFI)
and, moreover, all the distorted sequences were contaminated by video
compression artefacts, making it difficult for subjects to isolate VFI
artefacts. Given these limitations, none of these databases can be
recommended for evaluating the performance of VFI quality metrics;
hence there is an urgent need for a bespoke video quality database.
III. THE BVI-VFI DATABASE
This section describes the approach used to select the source
sequences in the BVI-VFI database, and how the test videos
were generated.
A. Reference Sequences
The BVI-VFI database contains 108 reference sequences in
total, which were generated from 36 different source videos
captured at 120fps and resampled to 60 and 30 fps. The
36 source videos have three different spatial resolutions: 12
at 3840×2160 (UHD-1), 12 at 1920×1080 (HD), and 12 at
960×540. For each resolution group, we: (i) first created a
selection pool consisting of 120fps, YUV 4:2:0 8 bit video
candidates collected from various sources; (ii) then calculated
five video features, Motion Vector (MV), Dynamic Texture
Parameter (DTP), Spatial Information (SI), Temporal Information (TI)
and Colourfulness (CF), for each candidate; and (iii)
used the selection algorithm described in [83] to determine
the final source sequences in the BVI-VFI database. This
selection procedure ensures that extreme cases of each feature
are included, thus covering the most challenging VFI scenarios
such as large motions and dynamic textures (see Fig. 1
for examples). Among these five features, MV and DTP describe the
overall motion magnitude and textural complexity of the video, whereas
the other three are employed to characterise the range and diversity
of spatio-temporal activity and colour across the video
database [7, 92, 94]. Descriptions of MV, SI, TI and CF can be found
in [96, 97], and DTP is described in [83].
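As an illustration of two of these features, the sketch below computes SI and TI following the widely used ITU-T P.910 definitions: the maximum over frames of the spatial standard deviation of the Sobel-filtered luma plane, and the maximum standard deviation of consecutive-frame differences, respectively. This is a minimal sketch assuming grayscale luma frames as NumPy arrays; the exact extractor used for BVI-VFI may differ.

```python
import numpy as np

def _sobel_magnitude(f):
    """3x3 Sobel gradient magnitude, evaluated on the interior pixels."""
    gx = ((f[:-2, 2:] + 2 * f[1:-1, 2:] + f[2:, 2:])
          - (f[:-2, :-2] + 2 * f[1:-1, :-2] + f[2:, :-2]))
    gy = ((f[2:, :-2] + 2 * f[2:, 1:-1] + f[2:, 2:])
          - (f[:-2, :-2] + 2 * f[:-2, 1:-1] + f[:-2, 2:]))
    return np.hypot(gx, gy)

def spatial_information(frames):
    """SI: max over frames of the spatial std of the Sobel-filtered luma."""
    return max(_sobel_magnitude(f.astype(np.float64)).std() for f in frames)

def temporal_information(frames):
    """TI: max over consecutive frame pairs of the std of the pixel-wise
    luma difference."""
    return max(
        (frames[i].astype(np.float64) - frames[i - 1].astype(np.float64)).std()
        for i in range(1, len(frames))
    )
```

A static, textureless clip yields SI = TI = 0, while strong edges raise SI and large inter-frame changes raise TI, which is why these two features complement the motion-specific MV and DTP descriptors.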
All 22 HD source candidates (in YUV 4:2:0 8 bit 120fps
format) came from the BVI-HFR dataset [7]. They were first
truncated to five seconds following the research study in [97].
The choice of five-second duration is further justified because
the source videos do not contain scene cuts and the featured
motion characteristics in most videos do not vary significantly,
thus allowing subjects to perceive temporal artefacts more
easily. The suitability of this duration is further supported by the
high inter-subject consistency reported in Section IV-C. Twelve¹
sequences were selected from this pool using the algorithm in [83] to
ensure a wide feature coverage range and a uniform feature
distribution [96] across the whole database². We then
followed the same procedure for the UHD-1 resolution group,
for which 27 source candidates were collected from a variety
of sources, including five from the LIVE-YT-HFR dataset [92],
six from the UVG dataset [98], and 16 captured using a
RED Epic-X video camera at the University of Bristol. All
sequences were also trimmed to five seconds (600 frames) and
converted to YUV 4:2:0 8 bit format to ensure consistency
with the above-mentioned HD sources. In addition, because the high
frame rate display employed in the subjective experiment has a limited
spatial resolution (up to 1920×1080), we further cropped HD
representations of these UHD-1 videos. Specifically, we generated nine
cropped candidates for each UHD-1 video (x ∈ {0, 960, 1920},
y ∈ {0, 540, 1080}, where (x, y) is the top-left coordinate of the
crop), and selected the crop with the highest content similarity to
the original UHD-1 video in terms of two feature descriptors, MV and
DTP. By doing so, we obtained 27 cropped videos,
which retain the characteristics of UHD-1 content despite their
smaller spatial extent. From these 27 sources, we selected 12
UHD-1 source videos using the same procedure as for the
¹The number of source sequences for each resolution group was determined
based on the trade-off between the available resources for computation and
subjective testing, and the optimisation of feature uniformity and range values.
²The preliminary results based on the HD reference and distorted sequences
have been published in [29].
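The crop-selection step described above, choosing among the nine candidate HD windows of a UHD-1 video the one most similar to the full-resolution content, can be sketched as a nearest-feature search. In this sketch, `motion_proxy` (mean absolute frame difference) is a crude, hypothetical stand-in for the paper's MV and DTP descriptors, which would be computed by dedicated extractors.

```python
import numpy as np

HD_W, HD_H = 1920, 1080  # size of each candidate crop

def crop_video(frames, x, y):
    """Extract the HD window with top-left corner (x, y) from every frame."""
    return [f[y:y + HD_H, x:x + HD_W] for f in frames]

def motion_proxy(frames):
    """Mean absolute frame difference: a crude motion-magnitude stand-in
    for the MV and DTP descriptors used in the paper."""
    return float(np.mean([
        np.abs(frames[i].astype(np.float64)
               - frames[i - 1].astype(np.float64)).mean()
        for i in range(1, len(frames))
    ]))

def select_best_crop(frames, features=(motion_proxy,)):
    """Among the 9 candidate HD crops of a UHD-1 video, pick the one whose
    feature vector is closest (Euclidean) to that of the full video."""
    target = np.array([f(frames) for f in features])
    best_xy, best_dist = None, np.inf
    for x in (0, 960, 1920):
        for y in (0, 540, 1080):
            crop = crop_video(frames, x, y)
            dist = float(np.linalg.norm(
                np.array([f(crop) for f in features]) - target))
            if dist < best_dist:
                best_xy, best_dist = (x, y), dist
    return best_xy
```

With additional descriptors, `features` simply grows to a longer tuple of extractor functions; the search over the nine fixed offsets is unchanged.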