HVS Revisited: A Comprehensive Video Quality
Assessment Framework
Ao-Xiang Zhang, Student Member, IEEE, Yuan-Gen Wang, Senior Member, IEEE,
Weixuan Tang, Member, IEEE, Leida Li, Member, IEEE, Sam Kwong, Fellow, IEEE
Abstract—Video quality is a primary concern for video service
providers. In recent years, video quality assessment (VQA) techniques
based on deep convolutional neural networks (CNNs) have developed
rapidly. Although existing works attempt to introduce knowledge of the
human visual system (HVS) into VQA, they still exhibit limitations that
prevent the full exploitation of HVS, namely an incomplete model built
from only a few characteristics and insufficient connections among
these characteristics. To overcome
these limitations, this paper revisits HVS with five representative
characteristics, and further reorganizes their connections. Based
on the revisited HVS, a no-reference VQA framework called
HVS-5M (NRVQA framework with five modules simulating HVS
with five characteristics) is proposed. It works in a domain-fusion
design paradigm with advanced network structures. In the spatial
domain, the visual saliency module applies SAMNet to obtain a saliency
map. Then, the content-dependency and edge masking modules respectively
utilize ConvNeXt to extract spatial features, which are attentively
weighted by the saliency map to highlight the regions that human beings
are likely to attend to. In the
temporal domain, the motion perception module utilizes SlowFast to
obtain dynamic temporal features that supplement the static spatial
features. In addition, the temporal hysteresis module applies TempHyst
to simulate the memory mechanism of human beings, and evaluates the
overall quality score from the fused spatial and
temporal features. Extensive experiments show that our HVS-5M
outperforms the state-of-the-art VQA methods. Ablation studies
are further conducted to verify the effectiveness of each module
in the proposed framework.
Index Terms—No-reference video quality assessment, human
visual system, visual saliency, content-dependency, edge masking,
motion perception, temporal hysteresis.
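As a rough illustration of the domain-fusion paradigm outlined in the abstract, the PyTorch-style sketch below fuses saliency-weighted spatial features with clip-level motion features and regresses a quality score. The class name DomainFusionVQA, the placeholder modules saliency_net, spatial_net, and motion_net, the feature dimensions, and the final mean pooling are assumptions for illustration only; they stand in for SAMNet-, ConvNeXt-, SlowFast-, and TempHyst-like components and are not the actual HVS-5M implementation.

import torch
import torch.nn as nn

class DomainFusionVQA(nn.Module):
    # Illustrative sketch of a saliency-weighted, two-branch VQA pipeline.
    # saliency_net, spatial_net, and motion_net are placeholders; feature
    # sizes and the final mean pooling are assumptions, not HVS-5M itself.
    def __init__(self, saliency_net, spatial_net, motion_net,
                 spatial_dim=768, motion_dim=256):
        super().__init__()
        self.saliency_net = saliency_net  # frame -> saliency map in [0, 1]
        self.spatial_net = spatial_net    # frame -> spatial feature map
        self.motion_net = motion_net      # clip  -> clip-level motion feature
        self.regressor = nn.Linear(spatial_dim + motion_dim, 1)

    def forward(self, frames, clip):
        # frames: (T, 3, H, W) sampled key frames; clip: dense clip for the motion branch
        sal = self.saliency_net(frames)              # (T, 1, h, w)
        feat = self.spatial_net(frames)              # (T, C, h, w), C == spatial_dim
        weighted = feat * sal                        # attentive weighting by saliency
        spatial_feat = weighted.flatten(2).mean(-1)  # (T, C) per-frame features
        motion_feat = self.motion_net(clip)          # (motion_dim,) per clip
        motion_feat = motion_feat.expand(spatial_feat.size(0), -1)  # (T, motion_dim)
        fused = torch.cat([spatial_feat, motion_feat], dim=-1)
        frame_scores = self.regressor(fused).squeeze(-1)             # (T,)
        # a temporal-hysteresis-style pooling would replace this simple mean
        return frame_scores.mean()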
I. INTRODUCTION
RECENT years have witnessed an explosive growth of
“we-media”. It is estimated that there are about 4 billion
video views per day on Facebook [1]. However, storing and
delivering these vast amounts of video data greatly stresses
video service providers [2]. It is necessary to apply video
coding to reduce storage requirements and to balance the tradeoff
between coding efficiency and video quality.
A.-X. Zhang and Y.-G. Wang are with the School of Computer Science
and Cyber Engineering, Guangzhou University, Guangzhou 510006, China
(e-mail: zax@e.gzhu.edu.cn; wangyg@gzhu.edu.cn).
W. Tang is with the Institute of Artificial Intelligence and
Blockchain, Guangzhou University, Guangzhou 510006, China (e-mail:
tweix@gzhu.edu.cn).
L. Li is with the School of Artificial Intelligence, Xidian University, Xi'an
710071, China (e-mail: ldli@xidian.edu.cn).
S. Kwong is with the Department of Computer Science, City University of
Hong Kong, Hong Kong (e-mail: cssamk@cityu.edu.hk).
Therefore,
video quality assessment (VQA) has become a hot research
topic [3], [4], [5], [6]. Subjective VQA relies on manual ratings by
human beings, which is time-consuming and labor-intensive
[7]. By contrast, objective VQA predicts quality automatically by
machine, and is thus more widely used in real application
scenarios. Since the scoring indicator of VQA, i.e., mean
opinion score (MOS), reflects the visual perception of human
beings, it is of great benefit to introduce the human visual system
(HVS) into VQA [8], [9].
Early HVS-based VQA methods utilized hand-crafted fea-
tures to handle synthetic distortions. In order to accurately
simulate the texture masking of HVS, Ma et al. [32] developed
a mutual masking strategy to extract the spatial information of
video. Galkandage et al. [10] incorporated binocular suppres-
sion and recurrent excitation into their model. Saad et al. [12] proposed a
non-distortion-specific evaluation model that relies on video scene statistics
in the discrete cosine transform domain, and analyzed the types of
motion occurring in the video to predict its quality.
Korhonen proposed TLVQM [70] to reduce the complexity
of feature extraction with a two-level method, which obtained
low-complexity and high-complexity features from the entire video
sequence and several representative video frames, respectively.
Wang and Li [13] proposed a statistical model of human visual
speed perception, which can estimate the motion information
and the perceptual uncertainty in the video.
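For concreteness, the snippet below sketches the flavor of such hand-crafted features: a single DCT-domain statistic computed on blocks of a frame difference. The block size, the kurtosis statistic, and the function name block_dct_kurtosis are illustrative assumptions and do not reproduce the actual feature sets of [12] or [70].

import numpy as np
from scipy.fft import dctn

def block_dct_kurtosis(frame_diff, block=16):
    # Toy hand-crafted feature: kurtosis of AC DCT coefficients over blocks
    # of a frame difference, loosely in the spirit of DCT-domain models such
    # as [12]; the statistic and block size are illustrative only.
    h, w = frame_diff.shape
    feats = []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            coeffs = dctn(frame_diff[i:i + block, j:j + block], norm="ortho")
            ac = coeffs.flatten()[1:]                 # discard the DC coefficient
            mu, sigma = ac.mean(), ac.std() + 1e-8
            feats.append(((ac - mu) ** 4).mean() / sigma ** 4)  # sample kurtosis
    return float(np.mean(feats))

# Example usage for one pair of grayscale frames (H, W) float arrays:
# feature = block_dct_kurtosis(curr_frame - prev_frame)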
Recently, with the rapid development of deep learning
techniques, convolutional neural network (CNN)-based VQA
methods have significantly improved the evaluation of in-the-wild
videos. In-the-wild videos are authentically distorted videos,
which are often hard to annotate due to the lack of pristine
references. To solve the problem
of insufficient training data, You and Korhonen [31] utilized
a long short-term memory (LSTM) network to predict the video
quality based on the features extracted from small video cube
clips by 3D-CNNs. Li et al. [35] applied the gated recurrent
unit (GRU) [36] to obtain the video quality score according to
the frame-level content features extracted from ResNet [16].
Zhang and Wang [76] utilized texture features to complement
content features, further improving upon [35]. However, none of these
methods fully exploited the temporal information
within videos, which led to poor performance on LIVE-VQC,
whose videos contain rich motion-related content. Chen et al. [34]
proposed to fuse motion information from different temporal
frequencies in an efficient manner, and further applied a
hierarchical distortion description to quantify the temporal
motion effect. Li et al. [74] proposed a model-based transfer
learning approach and applied motion perception to VQA.
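To make the frame-feature-plus-recurrent-pooling design discussed above concrete, the sketch below pairs frozen ResNet-50 content features with a GRU, in the spirit of [35]. The class name FrameGRUQualityModel, the layer sizes, and the simple mean pooling at the end are assumptions for illustration, not the exact configuration of [35] or [76].

import torch
import torch.nn as nn
from torchvision.models import resnet50

class FrameGRUQualityModel(nn.Module):
    # Sketch of a frame-feature + recurrent-pooling VQA model: per-frame
    # content features are aggregated over time by a GRU and regressed to a
    # video-level quality score. Layer sizes are illustrative only.
    def __init__(self, hidden_dim=32):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        # keep everything up to (and including) global average pooling
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.gru = nn.GRU(input_size=2048, hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, frames):
        # frames: (T, 3, H, W) for a single video
        with torch.no_grad():                          # frozen content features
            feats = self.backbone(frames).flatten(1)   # (T, 2048)
        hidden, _ = self.gru(feats.unsqueeze(0))       # (1, T, hidden_dim)
        frame_scores = self.head(hidden).squeeze()     # (T,)
        return frame_scores.mean()                     # video-level score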