DCVQE: A Hierarchical Transformer for Video
Quality Assessment
Zutong Li and Lei Yang
Weibo R&D Limited, USA
{zutongli0805, trilithy}@gmail.com
Abstract. The explosion of user-generated videos stimulates a great de-
mand for no-reference video quality assessment (NR-VQA). Inspired by
our observation on the actions of human annotation, we put forward a
Divide and Conquer Video Quality Estimator (DCVQE) for NR-VQA.
Starting from extracting the frame-level quality embeddings (QE), our
proposal splits the whole sequence into a number of clips and applies
Transformers to learn the clip-level QE and update the frame-level QE
simultaneously; another Transformer is introduced to combine the clip-
level QE to generate the video-level QE. We call this hierarchical combination
of Transformers a Divide and Conquer Transformer (DCTr)
layer. An accurate video quality feature extraction can be achieved by
repeating the process of this DCTr layer several times. Taking the order
relationship among the annotated data into account, we also propose a
novel correlation loss term for model training. Experiments on various
datasets confirm the effectiveness and robustness of our DCVQE model.
1 Introduction
Recent years have witnessed a significant increase in user-generated content
(UGC) on social media platforms like YouTube, TikTok, and Weibo. Watching
the UGC videos on computers or smartphones has even become part of our daily
life. This trend stimulates a great demand for automatic video quality assessment
(VQA), especially in popular video sharing/recommendation services.
UGC-VQA, also known as blind or No-Reference video quality assessment
(NR-VQA), aims to evaluate in-the-wild videos without the corresponding pris-
tine reference videos. Usually, UGC videos may suffer from complex distortions
due to the diversity of capturing devices, uncertain shooting skills, compres-
sion, and poor editing process. Although many excellent algorithms have been
proposed to evaluate video quality, it remains a challenging task to assess the
quality of UGC videos accurately and consistently.
Besides the frame-level image information, temporal information is regarded
as a critically important factor for video analysis tasks. Although many image
quality assessment (IQA) models [35,67,33,61,22,14,62,60,64] can be applied to
Work done when Z. Li was at Weibo. Z. Li is currently with Microsoft.
Corresponding author.
Fig. 1. Three consecutive frames from Video B304 in the LIVE-VQC dataset. An overall
high mean opinion score (MOS) of 91.73 was annotated.
VQA based on a simple temporal pooling process [52], these models may not
work very robustly because of the absence of proper time sequence aggregation.
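As a concrete baseline, such frame-score pooling can be as simple as the following sketch (Python/NumPy; the pooling strategies and the per-frame scores shown are illustrative assumptions, not values taken from [52] or from any dataset):

```python
import numpy as np

def pool_frame_scores(frame_scores, method="mean"):
    """Aggregate per-frame IQA scores into a single video-level score."""
    scores = np.asarray(frame_scores, dtype=np.float64)
    if method == "mean":
        return scores.mean()
    if method == "harmonic":
        # The harmonic mean penalizes low-scoring frames more heavily.
        return len(scores) / np.sum(1.0 / np.maximum(scores, 1e-6))
    raise ValueError(f"unknown pooling method: {method}")

# Hypothetical per-frame scores for three consecutive frames.
print(pool_frame_scores([55.0, 48.0, 52.0]))              # simple mean
print(pool_frame_scores([55.0, 48.0, 52.0], "harmonic"))  # harmonic mean
```

Such pooling treats every frame independently and performs no temporal aggregation at all.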
For instance, Fig. 1 shows three consecutive frames extracted from Video B304
in the LIVE-VQC [48] dataset. As can be seen, motion blur distortion appears on the
actress's hand area. These frame images are most likely to be recognized as being of
medium or even low quality when a sophisticated IQA method is applied to them
individually. However, the quality of this video was labeled as high by human
annotators, because a very smooth movement of the actress can be observed
when playing the video stream. RNNs [1,2,23] and 3D-CNNs [20,6,13,12,25] are
potential models to integrate spatial and temporal information for NR-VQA.
Though these algorithms perform well on many datasets, the difficulty of
parallelizing RNN models and the non-negligible computational cost of 3D-CNNs
make them infeasible for many internet applications that require quick responses,
such as online video sharing/recommendation services.
On the other hand, research on NR-VQA is often limited by the lack of suf-
ficient training data, due to the tedious work to label the mean opinion score
(MOS) for each video. Among the publicly available datasets, MCL-JCV [56],
VideoSet [57], UGC-VIDEO [27], CVD-2014 [38], and LIVE-Qualcomm [16] are gen-
erated in lab environments, while KoNViD-1k [18], LIVE-VQC [48], YouTube-
UGC [58] and LSVQ [63] are collected in the wild. As mentioned above, one
could consider VQA a temporal extension of the IQA task and thus apply
frame-level distortion and ranking processes [28] to augment the small video
datasets for algorithm development. However, in-the-wild videos are usually hard
to synthesize, since they may suffer from compound distortions which cannot be
exactly parameterized as a combination of certain distortion cases. Recently,
the large-scale LSVQ dataset [63], which contains 39,075 annotated videos with
authentic distortions, was released for public research. To our knowledge, not many
studies have been conducted on this new dataset so far.
Additionally, most previous works take L1 or L2 loss [64,63,70,23,54] as the
optimization criterion for model training. Since these criteria to some extent
ignore the order relationship of quality scores of the training samples, the trained
model may not be stable in quantifying the perceptual differences between the
videos with similar quality scores. For example, in our research we find that
many existing NR-VQA models work well to identify both high and low quality
videos, but struggle to distinguish the videos with middle quality scores. How
to effectively quantify the difference between samples with similar perceptual
scores therefore becomes the key to the success of an NR-VQA model.
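To make the idea of an order constraint concrete, one generic way to penalize ranking violations within a mini-batch is sketched below (Python/PyTorch; this pairwise margin formulation and the example weighting are illustrative assumptions only and are not the correlation loss proposed in this paper):

```python
import torch

def pairwise_order_loss(pred, mos, margin=0.0):
    """Generic order constraint: penalize pairs of samples whose predicted
    ordering disagrees with (or falls within `margin` of violating) the
    ordering of their annotated MOS values.
    Note: an illustration of the idea only, not the DCVQE correlation loss."""
    pred_diff = pred.unsqueeze(0) - pred.unsqueeze(1)   # pred[j] - pred[i], shape (B, B)
    mos_diff = mos.unsqueeze(0) - mos.unsqueeze(1)      # mos[j]  - mos[i],  shape (B, B)
    sign = torch.sign(mos_diff)
    violations = torch.relu(margin - sign * pred_diff)
    n = pred.numel()
    return violations.sum() / max(n * (n - 1), 1)

# Hypothetical mini-batch: an order term added to a standard L1 objective.
pred = torch.tensor([0.62, 0.55, 0.71])
mos = torch.tensor([0.60, 0.58, 0.80])
loss = torch.nn.functional.l1_loss(pred, mos) + 0.5 * pairwise_order_loss(pred, mos)
print(loss)
```

A term of this kind depends only on whether pairs of predictions are ordered consistently with their MOS values, which is precisely the information that a plain L1 or L2 objective under-weights.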
To address the above problems, in this paper we put forward a new Di-
vide and Conquer Video Quality Estimator (DCVQE) model for NR-VQA. We
summarize our contributions as follows: (1) Inspired by our observation on the
actions of human annotation, we propose a Divide and Conquer Transformer
(DCTr) architecture to extract video quality features for NR-VQA. Our algo-
rithm starts by extracting the frame-level quality representations. In a divide
process, we split the input sequence into a number of clips and apply
Transformers to learn the clip-level quality embeddings (QE) and update the
frame-level QE simultaneously. Subsequently, a conquer process is conducted by
using another Transformer to combine the clip-level QE to generate a video-level
QE. After stacking several DCTr layers and adding a linear regressor on top, our
DCVQE model can be constructed to predict the quality value of the input
video. (2) By taking the order relationship of the training samples into account,
we propose a novel correlation loss to bring an additional order constraint of
video quality to guide the training. Experiments indicate that the introduction
of this correlation loss can consistently help to improve the performance of our
DCVQE model. (3) We conduct extensive experiments on different datasets and
confirm that our DCVQE outperforms most other algorithms.
2 Related works
Traditional NR-VQA solutions: Many prior NR-VQA works are “distortion
specific” because they are designed to identify different distortion types like blur
[31], blockiness [39], or noise [49] in compressed videos or image frames. More
recent and popular models are built on natural video statistics (NVS)
features, which are created by extending the highly regular parametric band-
pass models of natural scene statistics (NSS) from IQA to VQA tasks. Among
them, successful applications have been explored in both the frequency domain (BIQI [36],
DIIVINE [37], BLINDS [44], BLINDS-II [45]) and the spatial domain (NIQE [35],
BRISQUE [33]). V-BLIINDS [46] combined spatio-temporal NSS with motion
coherency models to estimate perceptual video quality. Inter-subband correla-
tions, modeled by spatial domain statistical features in frame differences, were
used to quantify the degree of distortion in VIIDEO [34]. 3D-DCT was applied
to local space-time regions to establish quality-aware features in [26]. Based on
hand-crafted feature selection and combination, the recent algorithms VIDEVAL
[53] and TLVQM [21] demonstrated outstanding performance on many UGC
datasets. The design of these hand-crafted features is deliberate, though
they are hard to deploy in an end-to-end fashion for NR-VQA tasks.
Deep learning based NR-VQA solutions: Deep neural networks have
shown their superior abilities in many computer vision tasks. With the availability
of perceptual image quality datasets [15,19,41,64], many successful applications
have been reported in the past decade [28,64,29,50,70]. Combined with a
convolutional neural aggregation network, DeepVQA [59] utilized the advantages
of CNNs to learn spatio-temporal visual sensitivity maps for VQA. Based on a
weakly supervised learning and resampling strategy, Zhang et al. [69] proposed
a general purpose NR-VQA framework which inherited the knowledge learned
from full-reference VQA and can effectively alleviate the curse of inadequate
training data. VSFA [23] used a pretrained CNN to extract frame features,
and introduced the gated recurrent unit (GRU) to learn the temporal depen-
dencies. Following VSFA, MDTVSFA [24] proposed a mixed datasets training
method to further improve VQA performance. Although the above methods per-
formed well on synthetic distortion datasets, they may be unstable when analyzing
UGC videos with complex and diverse distortions. PVQ [63] reported a leading
performance on the large-scale dataset LSVQ. To carefully study local and global
spatio-temporal quality, spatial patches, temporal patches, and spatio-temporal
patches were introduced in [63]. RAPIQUE [54], by leveraging a set of NSS
features concatenated with learned CNN features, showed top performance on
several public datasets, including KoNViD-1k, YouTube-UGC, and
their combination.
Transformer techniques in computer vision: The self-attention-based
Transformer architecture has shown exceptional performance in natural
language processing [55,9]. Recently, many researchers have introduced Transformers
to solve computer vision problems. ViT [10] directly runs attention among image
patches with positional embeddings for image classification. Detection Trans-
former (DETR) [5] reached performance comparable to Faster-RCNN [43]
by designing a new object detection system based on Transformers and a bipartite
matching loss for direct set prediction. Through conducting contrastive learning
on 400 million image-text pairs, CLIP [42] showed impressive performance on
various zero-shot transfer learning problems. For IQA tasks, Transformers have also
proven powerful. Inspired by ViT, TRIQ [66] connected a Transformer with an
MLP head to predict perceptual image quality, where positional embeddings of
sufficient length were set to analyze images with different resolutions.
IQT [7] achieved outstanding performance by applying a Transformer encoder and
decoder to the features of reference and distorted images. By introducing a 1D CNN
and a Transformer to integrate short-term and long-term temporal information,
the recent work LSCT [65] demonstrated excellent performance on VQA. Our
proposal is also built on Transformers, inspired by our observation
on the actions of human annotation. Experiments on various datasets confirm
the effectiveness and robustness of our method.
3 Divide and Conquer Video Quality Estimator
(DCVQE)
Human judgements of video quality are usually content-dependent and affected
by the viewer's temporal memory [3,11,32,47,51,68,57]. In our investigation, we notice
that many human annotators give their opinions on the quality of a video
after the following two actions: first, they watch the video quickly (usually in
fast-forward mode) to get an overall impression of its quality; then they may
scroll the mouse forward and backward to review some specific parts of the video
before making their final decisions. Inspired by this observation, we propose a hierarchical
architecture, dubbed Divide and Conquer Video Quality Estimator (DCVQE)
for NR-VQA. Our model works by extracting three levels of video quality
representations, from frames to video clips to the whole video sequence, progressively
and repeatedly, somewhat similar to the reverse of the human annotation process.
A correlation loss term is also presented to bring an additional order
constraint of video quality to guide the training. We find that our method can
effectively improve the performance of NR-VQA. We will describe our work in
detail in the following paragraphs.
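For orientation, the following PyTorch-style sketch shows how such a frame-clip-video hierarchy can be wired together; the clip length, the mean pooling used to form clip-level and video-level QE, and all hyperparameters are placeholder assumptions and do not reproduce the exact DCTr layer described in Section 3.1:

```python
import torch
import torch.nn as nn

class DCTrLayerSketch(nn.Module):
    """Simplified divide-and-conquer layer: a per-clip Transformer refines
    frame-level quality embeddings (QE) and yields clip-level QE; a second
    Transformer merges the clip-level QE into a video-level QE."""

    def __init__(self, dim=128, heads=4, clip_len=8):
        super().__init__()
        self.clip_len = clip_len
        self.clip_transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)   # "divide": shared across clips
        self.video_transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)   # "conquer": across clip-level QE

    def forward(self, frame_qe):
        # frame_qe: (batch, num_frames, dim); num_frames assumed divisible by clip_len.
        b, t, d = frame_qe.shape
        num_clips = t // self.clip_len
        clips = frame_qe.reshape(b * num_clips, self.clip_len, d)
        refined = self.clip_transformer(clips)            # updated frame-level QE
        clip_qe = refined.mean(dim=1)                     # clip-level QE (mean-pooled here)
        clip_qe = clip_qe.reshape(b, num_clips, d)
        video_qe = self.video_transformer(clip_qe).mean(dim=1)   # video-level QE
        return refined.reshape(b, t, d), video_qe

class DCVQESketch(nn.Module):
    """Stack a few sketch layers and regress a quality score from the
    video-level QE of the last layer."""

    def __init__(self, dim=128, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([DCTrLayerSketch(dim) for _ in range(num_layers)])
        self.regressor = nn.Linear(dim, 1)

    def forward(self, frame_qe):
        video_qe = None
        for layer in self.layers:
            frame_qe, video_qe = layer(frame_qe)
        return self.regressor(video_qe).squeeze(-1)

# Example: two videos, 240 frame-level embeddings each (e.g. from a frozen CNN backbone).
model = DCVQESketch()
scores = model(torch.randn(2, 240, 128))   # -> tensor of shape (2,)
```

The structural point is that each layer returns both refined frame-level QE, which feed the next layer, and a video-level QE, mirroring the divide and conquer steps described above.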
Fig. 2. The architectures of the proposed Divide and Conquer Transformer (DCTr)
layer (left) and Divide and Conquer Video Quality Estimator (DCVQE) (right).
3.1 Overall Architecture
The left side of Fig. 2 represents the architecture of a key video quality analysis
layer in our proposal, the Divide and Conquer Transformer (DCTr). In order to
simulate the second action of human annotation mentioned above, we split the
input sequence into a number of clips, and introduce a Transformer module
$\mathrm{Transformer}_D$ to learn quality representations for each clip. As shown in Fig. 2, for the $k$-th DCTr layer, we split the whole sequence into $I$ clips, and each clip covers $J$ frame-level quality representations generated by the previous layer. For the $i$-th ($1 \leq i \leq I$) clip $C_i^k$, a module $\mathrm{Transformer}_{D_i}$ is applied to combine the two levels of quality embeddings (QE) generated by the previous layer, that