HVS Revisited: A Comprehensive Video Quality
Assessment Framework
Ao-Xiang Zhang, Student Member, IEEE, Yuan-Gen Wang, Senior Member, IEEE,
Weixuan Tang, Member, IEEE, Leida Li, Member, IEEE, and Sam Kwong, Fellow, IEEE
Abstract—Video quality is a primary concern for video service providers. In recent years, techniques of video quality assessment (VQA) based on deep convolutional neural networks (CNNs) have developed rapidly. Although existing works attempt to introduce knowledge of the human visual system (HVS) into VQA, limitations remain that prevent the full exploitation of HVS, including an incomplete model covering only a few characteristics and insufficient connections among these characteristics. To overcome these limitations, this paper revisits HVS with five representative characteristics, and further reorganizes their connections. Based on the revisited HVS, a no-reference VQA framework called HVS-5M (NRVQA framework with five modules simulating HVS with five characteristics) is proposed. It works in a domain-fusion design paradigm with advanced network structures. On the side of the spatial domain, the visual saliency module applies SAMNet to obtain a saliency map. Then, the content-dependency and edge masking modules respectively utilize ConvNeXt to extract spatial features, which are attentively weighted by the saliency map to highlight those regions that human beings may be interested in. On the other side of the temporal domain, to supplement the static spatial features, the motion perception module utilizes SlowFast to obtain dynamic temporal features. Besides, the temporal hysteresis module applies TempHyst to simulate the memory mechanism of human beings, and comprehensively evaluates the quality score according to the fusion features from the spatial and temporal domains. Extensive experiments show that our HVS-5M outperforms the state-of-the-art VQA methods. Ablation studies are further conducted to verify the effectiveness of each module in the proposed framework.
Index Terms—No-reference video quality assessment, human
visual system, visual saliency, content-dependency, edge masking,
motion perception, temporal hysteresis.
I. INTRODUCTION
Recent years have witnessed an explosive growth of "we-media". It is estimated that there are about 4 billion video views per day on Facebook [1]. However, storing and delivering such vast amounts of video data greatly stresses video service providers [2]. It is necessary to apply video coding to reduce storage requirements and balance the tradeoff between coding efficiency and video quality. Therefore,
video quality assessment (VQA) has become a hot research topic [3], [4], [5], [6]. Subjective VQA is manual rating by human beings, which is time-consuming and labor-intensive [7]. By contrast, objective VQA is automatic prediction by machines, and is thus more widely used in real application scenarios. Since the scoring indicator of VQA, i.e., the mean opinion score (MOS), is related to the visual perception of human beings, it is of great benefit to introduce the human visual system (HVS) into VQA [8], [9].

A.-X. Zhang and Y.-G. Wang are with the School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China (e-mail: zax@e.gzhu.edu.cn; wangyg@gzhu.edu.cn). W. Tang is with the Institute of Artificial Intelligence and Blockchain, Guangzhou University, Guangzhou 510006, China (e-mail: tweix@gzhu.edu.cn). L. Li is with the School of Artificial Intelligence, Xidian University, Xi'an 710071, China (e-mail: ldli@xidian.edu.cn). S. Kwong is with the Department of Computer Science, City University of Hong Kong, Hong Kong (e-mail: cssamk@cityu.edu.hk).
Early HVS-based VQA methods utilized hand-crafted features to handle synthetic distortions. In order to accurately simulate the texture masking of HVS, Ma et al. [32] developed a mutual masking strategy to extract the spatial information of video. Galkandage et al. [10] incorporated binocular suppression and recurrent excitation. Saad et al. [12] proposed a non-distortion-specific evaluation model that relies on video scene statistics in the discrete cosine transform domain, and analyzed the types of motion occurring in the video to predict video quality. Korhonen proposed TLVQM [70] to reduce the complexity of feature extraction with a two-level method, which obtains low- and high-complexity features from an entire video sequence and several representative video frames, respectively. Wang and Li [13] proposed a statistical model of human visual speed perception, which can estimate the motion information and the perceptual uncertainty in a video.
Recently, with the rapid development of deep learning techniques, CNN-based VQA methods have significantly improved the evaluation of in-the-wild videos. In-the-wild videos refer to authentically distorted videos, which are often hard to annotate due to the lack of pristine references. To alleviate the problem of insufficient training data, You and Korhonen [31] utilized a long short-term memory (LSTM) network to predict video quality based on features extracted from small video cube clips by 3D-CNNs. Li et al. [35] applied the gated recurrent unit (GRU) [36] to obtain the video quality score according to frame-level content features extracted from ResNet [16]. Zhang and Wang [76] utilized texture features to complement content features, further improving [35]. However, none of these methods fully exploited the temporal information within videos, leading to poor performance on LIVE-VQC, which contains rich motion-related content. Chen et al. [34] proposed to fuse motion information from different temporal frequencies in an efficient manner, and further applied a hierarchical distortion description to quantify the temporal motion effect. Li et al. [74] proposed a model-based transfer learning approach and applied motion perception to VQA.
Fig. 1: Illustration of our revisited version of HVS. (The parvocellular pathway carries the spatial characteristics: visual saliency, content-dependency, and edge masking; the magnocellular pathway carries the temporal characteristics: motion perception and temporal hysteresis.)
Besides, some methods attempted to introduce visual saliency into VQA. For instance, Guan et al. [14] established a quality-aware visual attention module to obtain frame-level quality scores, which were integrated into the video quality score through an end-to-end structure of visual and memory attention. Varga [15] proposed a parallel CNN structure, which first extracted temporally pooled and saliency-weighted features of the video, and then independently mapped them to quality scores for further fusion.
Despite their good performance, these methods have some potential limitations. First, only a few characteristics of HVS have been exploited, so the simulated sensory function is incomplete; moreover, the connections among different characteristics have not been well organized, which hinders their application to VQA. Second, some methods have not fully considered temporal information, and the combination of spatial and temporal information remains to be explored. Third, the effect of edge masking has not been introduced into VQA. In fact, sharp edges can not only be utilized to conceal distortions through edge masking, but can also serve as effective spatial features. Fourth, although early attempts have been made, how to effectively apply visual saliency to VQA remains challenging.
To address the above problems, this paper proposes a general no-reference video quality assessment framework called HVS-5M (NRVQA framework with five modules simulating HVS with five characteristics). The foundation of HVS-5M is our revisited version of HVS, wherein the connections of the five representative characteristics of HVS are reorganized. On this basis, HVS-5M is designed in the domain-fusion paradigm. Specifically, on the side of the spatial domain, the edge masking and content-dependency modules are utilized to extract frame-level spatial features, which are then weighted by the saliency map from the visual saliency module according to the attention mechanism. On the other side of the temporal domain, to supplement the spatial features, the motion perception module is utilized to extract video-level temporal features. Furthermore, the temporal hysteresis module simulates the memory mechanism of human beings, and outputs the quality score according to the fusion features integrated from the static spatial and dynamic temporal ones. By this means, the quality of a given video can be comprehensively represented from different aspects.
The contributions of this work are summarized as follows.
• The mechanism of HVS is revisited, wherein five representative characteristics are selected to model the function of the sensory organs in a relatively simple yet comprehensive manner, and their connections are reorganized to facilitate application to VQA.
• A video quality assessment framework simulating HVS with five modules, called HVS-5M, is proposed, wherein these modules cooperatively work in the domain-fusion paradigm. In particular, to the best of our knowledge, this is the first work to introduce edge masking into VQA, together with a new scheme for applying visual saliency.
• Experimental results show that the proposed HVS-5M achieves state-of-the-art (SOTA) performance on various mainstream video datasets, including four in-the-wild ones. Ablation studies are further conducted to verify the effectiveness of its different modules.
The rest of this paper is organized as follows. HVS is first revisited from a neurophysiological perspective in Section II. Then, the proposed HVS-5M is described in Section III. Extensive experimental results are presented in Section IV. Finally, conclusions are drawn in Section V.
II. HUMAN VISUAL SYSTEM REVISITED
Human visual system (HVS) is responsible for detecting
and interpreting the perceived spectral information to build a
representation of the surrounding environment, which consists
of sensory organ and parts of the central nervous system.
Specifically, the function of sensory organ has many com-
plicated characteristics. In order to demonstrate the function of
sensory organ in a relative simple but comprehensive manner,
we revisit HVS, and formulate it with five representative
characteristics. Visual saliency is a bottom-up, stimulus-driven
signal, which indicates that a specific location is sufficiently
different from the surrounding environment and deserves hu-
man attention [26]. Content-dependency is referred to as a
phenomenon that human preference is highly dependent on
the observed content [49]. Edge masking indicates that the
effect of masking is more likely to occur at positions with
richer edge information [28]. Motion perception is the process
of inferring the speed and direction of various elements in
a dynamic scene. Temporal hysteresis [51] indicates that the
memory of elements with bad impression can last for longer
than those with good impression.
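To make the temporal hysteresis effect concrete, the sketch below pools per-frame quality scores so that bad impressions linger. It is our minimal illustration in the spirit of [51], not the paper's TempHyst module; the window length tau, the descending weights, and the blending factor alpha are assumptions.

```python
import numpy as np

def hysteresis_pool(q, tau=12, alpha=0.8):
    """Sketch of hysteresis-style temporal pooling in the spirit of [51].

    q: 1-D array of per-frame quality scores. The memory component keeps
    the worst score of the recent past (bad impressions persist), while
    the current component emphasizes the worst scores of the near future.
    tau and alpha are illustrative assumptions, not the paper's values.
    """
    q = np.asarray(q, dtype=float)
    n = len(q)
    pooled = np.empty(n)
    for t in range(n):
        memory = q[max(0, t - tau):t + 1].min()    # worst recent score
        future = np.sort(q[t:min(n, t + tau)])     # ascending: worst first
        w = np.exp(-np.arange(len(future)))        # emphasize worst scores
        current = float((future * w).sum() / w.sum())
        pooled[t] = alpha * current + (1 - alpha) * memory
    return pooled.mean()                           # overall video score
```

For a video whose quality dips briefly, this pooling yields a lower score than the plain average of per-frame scores, mirroring the longer-lasting memory of bad impressions.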
On this basis, in order to apply HVS to VQA in a more efficient manner, we establish connections among the five characteristics of HVS, as shown in Fig. 1. According to an existing study [21], the parvocellular pathway (P-path) and the magnocellular pathway (M-path) are two major pathways of the central nervous system. Specifically, the P-path can distinguish subtle spatial details [22], while the M-path is capable of detecting temporal motion information [23]. Therefore, we correspondingly divide the five characteristics into spatial and temporal ones. On the side of the spatial characteristics, human beings are attracted by salient regions (reflected in visual saliency) at first sight. Then, under the guidance of the salient regions, content (reflected in content-dependency)
and edge (reflected in edge masking) are used to represent semantics and details, respectively. On the side of the temporal characteristics, motion information (reflected in motion perception) is used to represent dynamic temporal changes. Finally, the above semantics, details, and temporal changes are fused into a sequence of elements along the temporal flow, wherein the elements leaving a bad impression are highlighted (reflected in temporal hysteresis). By this means, the five characteristics can work collaboratively in HVS. In this paper, we focus on reorganizing the connections of the five representative characteristics within our revisited version of HVS, and take it as a starting point for the research of VQA.
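For reference, the grouping above can be restated in code form (a plain restatement of Fig. 1, not part of the authors' method):

```python
# The five characteristics grouped by the two pathways (cf. Fig. 1).
HVS_CHARACTERISTICS = {
    # P-path: distinguishes subtle spatial details [22].
    "spatial (parvocellular pathway)": [
        "visual saliency",      # attracts attention at first sight
        "content-dependency",   # semantics, guided by salient regions
        "edge masking",         # details, guided by salient regions
    ],
    # M-path: detects temporal motion information [23].
    "temporal (magnocellular pathway)": [
        "motion perception",    # dynamic temporal changes
        "temporal hysteresis",  # memory over the fused sequence
    ],
}
```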
III. PROPOSED METHOD

In order to simulate the HVS mechanism, the proposed HVS-5M follows the domain-fusion paradigm and consists of five modules from two domains, as shown in Fig. 2.

Fig. 2: Illustration of the proposed HVS-5M, which includes five modules. (Spatial domain: Canny edge maps of the R, G, and B channels of each frame are concatenated, resized, and fed to ConvNeXt for edge feature maps, while the raw frame is fed to ConvNeXt for content feature maps; both are weighted by the SAMNet saliency map through the attention mechanism and pooled into GPmean/GPstd statistics. Temporal domain: SlowFast extracts motion feature maps, likewise pooled into motion features. The concatenated spatial and temporal features form the fusion features fed to TempHyst, which outputs the quality score.)
The first branch operates at the frame level in the spatial domain. The visual saliency module extracts a saliency map for each frame. Then, the content-dependency module and the edge masking module are responsible for extracting the content and edge feature maps, respectively. These two feature maps are further adjusted by the saliency map in an attention manner to highlight the key regions, and then integrated into statistics as spatial features. The second branch operates at the video level in the temporal domain, wherein the motion perception module captures the motion feature maps of a video sequence and combines them into temporal features. Finally, the temporal hysteresis module simulates the memory mechanism of human beings, and comprehensively evaluates the quality score according to the fusion features from the spatial and temporal domains. Detailed descriptions of each module are given in the following subsections, and the notations of features are given in Table I.
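A top-level sketch of this two-branch data flow may help fix ideas. The following hypothetical PyTorch-style code is our illustration only: the five callables stand in for SAMNet, the two ConvNeXt branches, SlowFast, and TempHyst, and their interfaces (e.g., the attention keyword and the feature shapes) are assumptions rather than the authors' API.

```python
import torch

def hvs5m_forward(frames, saliency_net, content_net, edge_net,
                  motion_net, hysteresis_net):
    """Hypothetical sketch of HVS-5M's domain-fusion pipeline (Fig. 2).

    frames: (N, 3, H, W) tensor of video frames. The five callables stand
    in for SAMNet, ConvNeXt (content and edge branches), SlowFast, and
    TempHyst; their interfaces here are assumptions, not the authors' API.
    """
    per_frame = []
    for frame in frames:                             # spatial branch (frame level)
        x = frame.unsqueeze(0)
        sal = saliency_net(x)                        # saliency map (cf. Eq. (1) below)
        content = content_net(x, attention=sal)      # saliency-weighted content features
        edge = edge_net(x, attention=sal)            # saliency-weighted edge features
        per_frame.append(torch.cat([content, edge], dim=-1))
    spatial = torch.stack(per_frame, dim=1)          # (1, N, D_s) static features

    motion = motion_net(frames.unsqueeze(0))         # temporal branch (video level)
    motion = motion.expand(-1, spatial.size(1), -1)  # assume (1, 1, D_t) -> (1, N, D_t)
    fusion = torch.cat([spatial, motion], dim=-1)    # fuse spatial and temporal features
    return hysteresis_net(fusion)                    # memory-aware quality score
```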
A. Visual Saliency Module
In HVS, it is widely acknowledged that human attention is attracted by visually salient regions within an image [26], [27]. To simulate HVS, the proposed HVS-5M also adopts the saliency map as an attention mask to highlight those regions that human beings may be interested in. In the remainder of this paper, a video is assumed to have N frames, and the n-th frame is denoted as I_n. Considering both accuracy and computational complexity, a lightweight network, SAMNet [38] pre-trained on ImageNet-22k [50], is adopted to extract the saliency map A_n for each frame as

    $A_n = \mathrm{SAMNet}(I_n)$.    (1)
In SAMNet, a stereoscopically attentive multi-scale module is designed for effective and efficient multi-scale learning, which enables each channel at each spatial position to adjust the weights of each branch. Based on this module, a lightweight encoder-decoder network is utilized for salient object detection. Note that ImageNet-22k is a large-scale image dataset.
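As a usage illustration, the sketch below (with assumed tensor shapes) shows how the saliency map from Eq. (1) can serve as an attention mask over a spatial feature map, followed by global mean/std pooling into the GPmean/GPstd statistics shown in Fig. 2. This is our sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (shapes are assumptions): weight a spatial feature map
# by the saliency map A_n from Eq. (1), then pool global statistics.
feat = torch.randn(1, 768, 7, 7)        # e.g., ConvNeXt feature maps
A_n = torch.rand(1, 1, 224, 224)        # saliency map A_n = SAMNet(I_n)

mask = F.interpolate(A_n, size=feat.shape[-2:], mode="bilinear",
                     align_corners=False)
weighted = feat * mask                  # highlight salient regions
gp_mean = weighted.mean(dim=(-2, -1))   # GPmean: global mean pooling
gp_std = weighted.std(dim=(-2, -1))     # GPstd: global std pooling
frame_feature = torch.cat([gp_mean, gp_std], dim=-1)  # per-frame vector
```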