DCVQE: A Hierarchical Transformer for Video
Quality Assessment
Zutong Li and Lei Yang
Weibo R&D Limited, USA
{zutongli0805, trilithy}@gmail.com
Abstract. The explosion of user-generated videos stimulates a great de-
mand for no-reference video quality assessment (NR-VQA). Inspired by
our observation on the actions of human annotation, we put forward a
Divide and Conquer Video Quality Estimator (DCVQE) for NR-VQA.
Starting from extracting the frame-level quality embeddings (QE), our
proposal splits the whole sequence into a number of clips and applies
Transformers to learn the clip-level QE and update the frame-level QE
simultaneously; another Transformer is introduced to combine the clip-
level QE to generate the video-level QE. We call this hierarchical combination
of Transformers a Divide and Conquer Transformer (DCTr)
layer. An accurate video quality feature extraction can be achieved by
repeating the process of this DCTr layer several times. Taking the order
relationship among the annotated data into account, we also propose a
novel correlation loss term for model training. Experiments on various
datasets confirm the effectiveness and robustness of our DCVQE model.
1 Introduction
Recent years have witnessed a significant increase in user-generated content
(UGC) on social media platforms like YouTube, TikTok, and Weibo. Watching
the UGC videos on computers or smartphones has even become part of our daily
life. This trend stimulates a great demand for automatic video quality assessment
(VQA), especially in popular video sharing/recommendation services.
UGC-VQA, also known as blind or No-Reference video quality assessment
(NR-VQA), aims to evaluate in-the-wild videos without the corresponding pris-
tine reference videos. Usually, UGC videos may suffer from complex distortions
due to the diversity of capturing devices, uncertain shooting skills, compres-
sion, and poor editing process. Although many excellent algorithms have been
proposed to evaluate video quality, it remains a challenging task to assess the
quality of UGC videos accurately and consistently.
Besides the frame-level image information, temporal information is regarded
as a critically important factor for video analysis tasks. Although many image
quality assessment (IQA) models [35,67,33,61,22,14,62,60,64] can be applied to
Work done when Z. Li was at Weibo. Z. Li is currently with Microsoft.
Corresponding author.
Fig. 1. Three consecutive frames from Video B304 in the LIVE-VQC dataset. An overall
high mean opinion score (MOS) of 91.73 was annotated.
VQA based on a simple temporal pooling process [52], these models may not
work very robustly because of the absence of proper time sequence aggregation.
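As a concrete baseline, such frame-score pooling can be as simple as the following sketch (Python/NumPy; the pooling strategies and the per-frame scores shown are illustrative assumptions, not values taken from [52] or from any dataset):

```python
import numpy as np

def pool_frame_scores(frame_scores, method="mean"):
    """Aggregate per-frame IQA scores into a single video-level score."""
    scores = np.asarray(frame_scores, dtype=np.float64)
    if method == "mean":
        return scores.mean()
    if method == "harmonic":
        # The harmonic mean penalizes low-scoring frames more heavily.
        return len(scores) / np.sum(1.0 / np.maximum(scores, 1e-6))
    raise ValueError(f"unknown pooling method: {method}")

# Hypothetical per-frame scores for three consecutive frames.
print(pool_frame_scores([55.0, 48.0, 52.0]))              # simple mean
print(pool_frame_scores([55.0, 48.0, 52.0], "harmonic"))  # harmonic mean
```

Such pooling treats every frame independently and performs no temporal aggregation at all.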
For instance, Fig. 1 shows three consecutive frames extracted from Video B304
in the LIVE-VQC [48] dataset. As can be seen, motion blur distortion appears on the
actress's hand area. These frame images are most likely to be recognized as being of
medium or even low quality when a sophisticated IQA method is applied to them
individually. However, the quality of this video was labeled as high by human
annotators, because a very smooth movement of the actress can be observed
when playing the video stream. RNNs [1,2,23] and 3D-CNNs [20,6,13,12,25] are
potential models to integrate spatial and temporal information for NR-VQA.
Though these algorithms perform well on many datasets, the difficulty of
parallelizing RNN models and the non-negligible computational cost of 3D-CNNs
make them infeasible for many internet applications that require quick responses,
such as online video sharing/recommendation services.
On the other hand, research on NR-VQA is often limited by the lack of suf-
ficient training data, due to the tedious work to label the mean opinion score
(MOS) for each video. Among the publicly available datasets, MCL-JCV [56],
VideoSet [57], UGC-VIDEO [27], CVD-2014 [38], and LIVE-Qualcomm [16] are gen-
erated in lab environments, while KoNViD-1k [18], LIVE-VQC [48], YouTube-
UGC [58] and LSVQ [63] are collected in the wild. As mentioned above, one
could consider VQA a temporal extension of the IQA task and thus apply
frame-level distortion and ranking processes [28] to augment the small video
datasets for algorithm development. However, in-the-wild videos are usually hard
to synthesize, since they may suffer from compound distortions which cannot be
exactly parameterized as a combination of certain distortion cases. Recently,
the large-scale LSVQ dataset [63], which contains 39,075 annotated videos with
authentic distortions, was released for public research. To our knowledge, not many
studies have been conducted on this new dataset so far.
Additionally, most previous works take L1 or L2 loss [64,63,70,23,54] as the
optimization criterion for model training. Since these criteria to some extent
ignore the order relationship of quality scores of the training samples, the trained
model may not be stable in quantifying the perceptual differences between the
videos with similar quality scores. For example, in our research we find that
many existing NR-VQA models work well to identify both high and low quality
videos, but struggle to distinguish the videos with middle quality scores. How
to effectively quantify the difference between samples with similar perceptual
scores therefore becomes the key to the success of an NR-VQA model.
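To make the idea of an order constraint concrete, one generic way to penalize ranking violations within a mini-batch is sketched below (Python/PyTorch; this pairwise margin formulation and the example weighting are illustrative assumptions only and are not the correlation loss proposed in this paper):

```python
import torch

def pairwise_order_loss(pred, mos, margin=0.0):
    """Generic order constraint: penalize pairs of samples whose predicted
    ordering disagrees with (or falls within `margin` of violating) the
    ordering of their annotated MOS values.
    Note: an illustration of the idea only, not the DCVQE correlation loss."""
    pred_diff = pred.unsqueeze(0) - pred.unsqueeze(1)   # pred[j] - pred[i], shape (B, B)
    mos_diff = mos.unsqueeze(0) - mos.unsqueeze(1)      # mos[j]  - mos[i],  shape (B, B)
    sign = torch.sign(mos_diff)
    violations = torch.relu(margin - sign * pred_diff)
    n = pred.numel()
    return violations.sum() / max(n * (n - 1), 1)

# Hypothetical mini-batch: an order term added to a standard L1 objective.
pred = torch.tensor([0.62, 0.55, 0.71])
mos = torch.tensor([0.60, 0.58, 0.80])
loss = torch.nn.functional.l1_loss(pred, mos) + 0.5 * pairwise_order_loss(pred, mos)
print(loss)
```

A term of this kind depends only on whether pairs of predictions are ordered consistently with their MOS values, which is precisely the information that a plain L1 or L2 objective under-weights.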
To address the above problems, in this paper we put forward a new Di-
vide and Conquer Video Quality Estimator (DCVQE) model for NR-VQA. We
summarize our contributions as follows: (1) Inspired by our observation on the
actions of human annotation, we propose a Divide and Conquer Transformer
(DCTr) architecture to extract video quality features for NR-VQA. Our algo-
rithm starts by extracting the frame-level quality representations. In a divide
process, we split the input sequence into a number of clips and apply
Transformers to learn the clip-level quality embeddings (QE) and update the
frame-level QE simultaneously. Subsequently, a conquer process is conducted by
using another Transformer to combine the clip-level QE to generate a video-level
QE. After stacking several DCTr layers and adding a linear regressor on top, our
DCVQE model can be constructed to predict the quality value of the input
video. (2) By taking the order relationship of the training samples into account,
we propose a novel correlation loss to bring an additional order constraint of
video quality to guide the training. Experiments indicate that the introduction
of this correlation loss can consistently help to improve the performance of our
DCVQE model. (3) We conduct extensive experiments on different datasets and
confirm that our DCVQE outperforms most other algorithms.
2 Related works
Traditional NR-VQA solutions: Many prior NR-VQA works are “distortion
specific” because they are designed to identify different distortion types like blur
[31], blockiness [39], or noise [49] in compressed videos or image frames. More
recent and popular models are built on natural video statistics (NVS)
features, which are created by extending the highly regular parametric band-
pass models of natural scene statistics (NSS) from IQA to VQA tasks. Among
them, successful applications have been explored in both the frequency domain (BIQI [36],
DIIVINE [37], BLINDS [44], BLINDS-II [45]) and the spatial domain (NIQE [35],
BRISQUE [33]). V-BLIINDS [46] combined spatio-temporal NSS with motion
coherency models to estimate perceptual video quality. Inter-subband correla-
tions, modeled by spatial domain statistical features in frame differences, were
used to quantify the degree of distortion in VIIDEO [34]. 3D-DCT was applied
to local space-time regions to establish quality-aware features in [26]. Based on
hand-crafted feature selection and combination, the recent algorithms VIDEVAL
[53] and TLVQM [21] demonstrated outstanding performance on many UGC
datasets. The design of these hand-crafted features is deliberate, though
they are hard to deploy in an end-to-end fashion for NR-VQA tasks.
Deep learning based NR-VQA solutions: Deep neural networks have
shown their superior abilities in many computer vision tasks. With the availability
of perceptual image quality datasets [15,19,41,64], many successful applications
have been reported in the past decade [28,64,29,50,70]. Combined with a
convolutional neural aggregation network, DeepVQA [59] utilized the advantages
of CNNs to learn spatio-temporal visual sensitivity maps for VQA. Based on a
weakly supervised learning and resampling strategy, Zhang et al. [69] proposed
a general purpose NR-VQA framework which inherited the knowledge learned
from full-reference VQA and can effectively alleviate the curse of inadequate
training data. VSFA [23] used a pretrained CNN to extract frame features,
and introduced the gated recurrent unit (GRU) to learn the temporal depen-
dencies. Following VSFA, MDTVSFA [24] proposed a mixed datasets training
method to further improve VQA performance. Although the above methods per-
formed well on synthetic distortion datasets, they may be unstable when analyzing
UGC videos with complex and diverse distortions. PVQ [63] reported a leading
performance on the large-scale dataset LSVQ. To carefully study local and global
spatio-temporal quality, spatial patches, temporal patches, and spatio-temporal
patches were introduced in [63]. RAPIQUE [54], by leveraging a set of NSS
features concatenated with learned CNN features, showed top performance on
several public datasets, including KoNViD-1k, YouTube-UGC, and
their combination.
Transformer techniques in computer vision: The self-attention-based
Transformer architecture has shown exceptional performance in natural
language processing [55,9]. Recently, many researchers have introduced Transformers
to solve computer vision problems. ViT [10] directly runs attention among image
patches with positional embeddings for image classification. Detection Trans-
former (DETR) [5] reached performance comparable to Faster-RCNN [43]
by designing a new object detection system based on Transformers and a bipartite
matching loss for direct set prediction. Through conducting contrastive learning
on 400 million image-text pairs, CLIP [42] showed impressive performance on
various zero-shot transfer learning problems. For IQA tasks, Transformers have also
proven powerful. Inspired by ViT, TRIQ [66] connected a Transformer with an
MLP head to predict perceptual image quality, where positional embeddings of
sufficient length were set to analyze images with different resolutions.
IQT [7] achieved outstanding performance by applying a Transformer encoder and
decoder to the features of reference and distorted images. By introducing a 1D CNN
and a Transformer to integrate short-term and long-term temporal information,
the recent work LSCT [65] demonstrated excellent performance on VQA. Our
proposal is also built on Transformers, inspired by our observation
on the actions of human annotation. Experiments on various datasets confirm
the effectiveness and robustness of our method.
3 Divide and Conquer Video Quality Estimator
(DCVQE)
Human judgements of video quality are usually content-dependent and affected
by the viewer's temporal memory [3,11,32,47,51,68,57]. In our investigation, we notice
that many human annotators give their opinions on the quality of a video
after the following two actions: first, they watch the video quickly (usually in
fast-forward mode) to get an overall impression of its quality; then they may
scroll the mouse forward and backward to review some specific parts of the video
before making their final decisions. Inspired by this observation, we propose a hierarchical
architecture, dubbed Divide and Conquer Video Quality Estimator (DCVQE)
for NR-VQA. Our model works by extracting three levels of video quality
representations, from frames to video clips to the whole video sequence, progressively
and repeatedly, somewhat similar to the reverse of the human annotation process.
A correlation loss term is also presented to bring an additional order
constraint of video quality to guide the training. We find that our method can
effectively improve the performance of NR-VQA. We will describe our work in
detail in the following paragraphs.
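For orientation, the following PyTorch-style sketch shows how such a frame-clip-video hierarchy can be wired together; the clip length, the mean pooling used to form clip-level and video-level QE, and all hyperparameters are placeholder assumptions and do not reproduce the exact DCTr layer described in Section 3.1:

```python
import torch
import torch.nn as nn

class DCTrLayerSketch(nn.Module):
    """Simplified divide-and-conquer layer: a per-clip Transformer refines
    frame-level quality embeddings (QE) and yields clip-level QE; a second
    Transformer merges the clip-level QE into a video-level QE."""

    def __init__(self, dim=128, heads=4, clip_len=8):
        super().__init__()
        self.clip_len = clip_len
        self.clip_transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)   # "divide": shared across clips
        self.video_transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)   # "conquer": across clip-level QE

    def forward(self, frame_qe):
        # frame_qe: (batch, num_frames, dim); num_frames assumed divisible by clip_len.
        b, t, d = frame_qe.shape
        num_clips = t // self.clip_len
        clips = frame_qe.reshape(b * num_clips, self.clip_len, d)
        refined = self.clip_transformer(clips)            # updated frame-level QE
        clip_qe = refined.mean(dim=1)                     # clip-level QE (mean-pooled here)
        clip_qe = clip_qe.reshape(b, num_clips, d)
        video_qe = self.video_transformer(clip_qe).mean(dim=1)   # video-level QE
        return refined.reshape(b, t, d), video_qe

class DCVQESketch(nn.Module):
    """Stack a few sketch layers and regress a quality score from the
    video-level QE of the last layer."""

    def __init__(self, dim=128, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([DCTrLayerSketch(dim) for _ in range(num_layers)])
        self.regressor = nn.Linear(dim, 1)

    def forward(self, frame_qe):
        video_qe = None
        for layer in self.layers:
            frame_qe, video_qe = layer(frame_qe)
        return self.regressor(video_qe).squeeze(-1)

# Example: two videos, 240 frame-level embeddings each (e.g. from a frozen CNN backbone).
model = DCVQESketch()
scores = model(torch.randn(2, 240, 128))   # -> tensor of shape (2,)
```

The structural point is that each layer returns both refined frame-level QE, which feed the next layer, and a video-level QE, mirroring the divide and conquer steps described above.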
Fig. 2. The architectures of the proposed Divide and Conquer Transformer (DCTr)
layer (left) and Divide and Conquer Video Quality Estimator (DCVQE) (right).
3.1 Overall Architecture
The left side of Fig. 2 represents the architecture of a key video quality analysis
layer in our proposal, the Divide and Conquer Transformer (DCTr). In order to
simulate the second action of human annotation mentioned above, we split the
input sequence into a number of clips, and introduce a Transformer module
$\mathrm{Transformer}_D$ to learn quality representations for each clip. As shown in Fig. 2, for the $k$-th DCTr layer, we split the whole sequence into $I$ clips, and each clip covers $J$ frame-level quality representations generated by the previous layer. For the $i$-th ($1 \leq i \leq I$) clip $C_i^k$, a module $\mathrm{Transformer}_{D_i}$ is applied to combine the two levels of quality embeddings (QE) generated by the previous layer, that