HVS Revisited: A Comprehensive Video Quality
Assessment Framework
Ao-Xiang Zhang, Student Member, IEEE, Yuan-Gen Wang, Senior Member, IEEE,
Weixuan Tang, Member, IEEE, Leida Li, Member, IEEE, Sam Kwong, Fellow, IEEE
Abstract—Video quality is a primary concern for video service
providers. In recent years, video quality assessment (VQA) techniques
based on deep convolutional neural networks (CNNs) have developed
rapidly. Although existing works attempt to introduce knowledge of the
human visual system (HVS) into VQA, they still exhibit limitations that
prevent the full exploitation of HVS, namely an incomplete model built
from only a few characteristics and insufficient connections among
these characteristics. To overcome
these limitations, this paper revisits HVS with five representative
characteristics, and further reorganizes their connections. Based
on the revisited HVS, a no-reference VQA framework called
HVS-5M (NRVQA framework with five modules simulating HVS
with five characteristics) is proposed. It works in a domain-fusion
design paradigm with advanced network structures. In the spatial
domain, the visual saliency module applies SAMNet to obtain a saliency
map. Then, the content-dependency and edge masking modules respectively
utilize ConvNeXt to extract spatial features, which are attentively
weighted by the saliency map to highlight the regions that human beings
are likely to attend to. In the
temporal domain, the motion perception module utilizes SlowFast to
obtain dynamic temporal features that supplement the static spatial
features. In addition, the temporal hysteresis module applies TempHyst
to simulate the memory mechanism of human beings, and evaluates the
overall quality score from the fused spatial and
temporal features. Extensive experiments show that our HVS-5M
outperforms the state-of-the-art VQA methods. Ablation studies
are further conducted to verify the effectiveness of each module
in the proposed framework.
Index Terms—No-reference video quality assessment, human
visual system, visual saliency, content-dependency, edge masking,
motion perception, temporal hysteresis.
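As a rough illustration of the domain-fusion paradigm outlined in the abstract, the PyTorch-style sketch below fuses saliency-weighted spatial features with clip-level motion features and regresses a quality score. The class name DomainFusionVQA, the placeholder modules saliency_net, spatial_net, and motion_net, the feature dimensions, and the final mean pooling are assumptions for illustration only; they stand in for SAMNet-, ConvNeXt-, SlowFast-, and TempHyst-like components and are not the actual HVS-5M implementation.

import torch
import torch.nn as nn

class DomainFusionVQA(nn.Module):
    # Illustrative sketch of a saliency-weighted, two-branch VQA pipeline.
    # saliency_net, spatial_net, and motion_net are placeholders; feature
    # sizes and the final mean pooling are assumptions, not HVS-5M itself.
    def __init__(self, saliency_net, spatial_net, motion_net,
                 spatial_dim=768, motion_dim=256):
        super().__init__()
        self.saliency_net = saliency_net  # frame -> saliency map in [0, 1]
        self.spatial_net = spatial_net    # frame -> spatial feature map
        self.motion_net = motion_net      # clip  -> clip-level motion feature
        self.regressor = nn.Linear(spatial_dim + motion_dim, 1)

    def forward(self, frames, clip):
        # frames: (T, 3, H, W) sampled key frames; clip: dense clip for the motion branch
        sal = self.saliency_net(frames)              # (T, 1, h, w)
        feat = self.spatial_net(frames)              # (T, C, h, w), C == spatial_dim
        weighted = feat * sal                        # attentive weighting by saliency
        spatial_feat = weighted.flatten(2).mean(-1)  # (T, C) per-frame features
        motion_feat = self.motion_net(clip)          # (motion_dim,) per clip
        motion_feat = motion_feat.expand(spatial_feat.size(0), -1)  # (T, motion_dim)
        fused = torch.cat([spatial_feat, motion_feat], dim=-1)
        frame_scores = self.regressor(fused).squeeze(-1)             # (T,)
        # a temporal-hysteresis-style pooling would replace this simple mean
        return frame_scores.mean()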
I. INTRODUCTION
RECENT years have witnessed an explosive growth of
“we-media”. It is estimated that there are about 4 billion
video views per day on Facebook [1]. However, storing and
delivering these vast amounts of video data greatly stresses
video service providers [2]. It is necessary to apply video
coding to reduce storage requirements and to balance the tradeoff
between coding efficiency and video quality.
A.-X. Zhang and Y.-G. Wang are with the School of Computer Science
and Cyber Engineering, Guangzhou University, Guangzhou 510006, China
(e-mail: zax@e.gzhu.edu.cn; wangyg@gzhu.edu.cn).
W. Tang is with the Institute of Artificial Intelligence and
Blockchain, Guangzhou University, Guangzhou 510006, China (e-mail:
tweix@gzhu.edu.cn).
L. Li is with the School of Artificial Intelligence, Xidian University, Xi'an
710071, China (e-mail: ldli@xidian.edu.cn).
S. Kwong is with the Department of Computer Science, City University of
Hong Kong, Hong Kong (e-mail: cssamk@cityu.edu.hk).
Therefore,
video quality assessment (VQA) has become a hot research
topic [3], [4], [5], [6]. Subjective VQA relies on manual ratings by
human beings, which is time-consuming and labor-intensive
[7]. By contrast, objective VQA predicts quality automatically by
machine, and is thus more widely used in real application
scenarios. Since the scoring indicator of VQA, i.e., mean
opinion score (MOS), reflects the visual perception of human
beings, it is of great benefit to introduce the human visual system
(HVS) into VQA [8], [9].
Early HVS-based VQA methods utilized hand-crafted fea-
tures to handle synthetic distortions. In order to accurately
simulate the texture masking of HVS, Ma et al. [32] developed
a mutual masking strategy to extract the spatial information of
video. Galkandage et al. [10] incorporated binocular suppres-
sion and recurrent excitation into their model. Saad et al. [12] proposed a
non-distortion-specific evaluation model that relies on video scene statistics
in the discrete cosine transform domain, and analyzed the types of
motion occurring in the video to predict its quality.
Korhonen proposed TLVQM [70] to reduce the complexity
of feature extraction with a two-level method, which obtained
low-complexity and high-complexity features from the entire video
sequence and several representative video frames, respectively.
Wang and Li [13] proposed a statistical model of human visual
speed perception, which can estimate the motion information
and the perceptual uncertainty in the video.
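For concreteness, the snippet below sketches the flavor of such hand-crafted features: a single DCT-domain statistic computed on blocks of a frame difference. The block size, the kurtosis statistic, and the function name block_dct_kurtosis are illustrative assumptions and do not reproduce the actual feature sets of [12] or [70].

import numpy as np
from scipy.fft import dctn

def block_dct_kurtosis(frame_diff, block=16):
    # Toy hand-crafted feature: kurtosis of AC DCT coefficients over blocks
    # of a frame difference, loosely in the spirit of DCT-domain models such
    # as [12]; the statistic and block size are illustrative only.
    h, w = frame_diff.shape
    feats = []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            coeffs = dctn(frame_diff[i:i + block, j:j + block], norm="ortho")
            ac = coeffs.flatten()[1:]                 # discard the DC coefficient
            mu, sigma = ac.mean(), ac.std() + 1e-8
            feats.append(((ac - mu) ** 4).mean() / sigma ** 4)  # sample kurtosis
    return float(np.mean(feats))

# Example usage for one pair of grayscale frames (H, W) float arrays:
# feature = block_dct_kurtosis(curr_frame - prev_frame)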
Recently, with the rapid development of deep learning
techniques, convolutional neural network (CNN)-based VQA
methods have significantly improved the evaluation of in-the-wild
videos. In-the-wild videos are authentically distorted videos,
which are often hard to annotate due to the lack of pristine
references. To solve the problem
of insufficient training data, You and Korhonen [31] utilized
a long short-term memory (LSTM) network to predict the video
quality based on the features extracted from small video cube
clips by 3D-CNNs. Li et al. [35] applied the gated recurrent
unit (GRU) [36] to obtain the video quality score according to
the frame-level content features extracted from ResNet [16].
Zhang and Wang [76] utilized texture features to complement
content features, further improving upon [35]. However, none of these
methods fully exploited the temporal information
within videos, which led to poor performance on LIVE-VQC,
whose videos contain rich motion-related content. Chen et al. [34]
proposed to fuse motion information from different temporal
frequencies in an efficient manner, and further applied a
hierarchical distortion description to quantify the temporal
motion effect. Li et al. [74] proposed a model-based transfer
learning approach and applied motion perception to VQA.
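To make the frame-feature-plus-recurrent-pooling design discussed above concrete, the sketch below pairs frozen ResNet-50 content features with a GRU, in the spirit of [35]. The class name FrameGRUQualityModel, the layer sizes, and the simple mean pooling at the end are assumptions for illustration, not the exact configuration of [35] or [76].

import torch
import torch.nn as nn
from torchvision.models import resnet50

class FrameGRUQualityModel(nn.Module):
    # Sketch of a frame-feature + recurrent-pooling VQA model: per-frame
    # content features are aggregated over time by a GRU and regressed to a
    # video-level quality score. Layer sizes are illustrative only.
    def __init__(self, hidden_dim=32):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        # keep everything up to (and including) global average pooling
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.gru = nn.GRU(input_size=2048, hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, frames):
        # frames: (T, 3, H, W) for a single video
        with torch.no_grad():                          # frozen content features
            feats = self.backbone(frames).flatten(1)   # (T, 2048)
        hidden, _ = self.gru(feats.unsqueeze(0))       # (1, T, hidden_dim)
        frame_scores = self.head(hidden).squeeze()     # (T,)
        return frame_scores.mean()                     # video-level score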