
even have 4G services¹! Therefore, introducing video compression schemes to reduce the bandwidth requirement is a need of the hour.
Traditional Video Compression Techniques
Compressing video information has fascinated researchers for nearly a century. The first works dealt with analog video compression and were released in 1929 [8]. A significant breakthrough in modern video compression was achieved by [13] using a DCT-based compression technique, leading to the first practical applications. This was followed by the widely adopted H.264 [26] and H.265 [1] video codecs, which remain the most popular in industrial applications. The most recent codec to be released is H.266 [4]. However, we do not compare our work with H.266 due to the lack of open-source implementations. Deep learning-based video compression techniques like [11, 15, 18, 19] have also been prevalent in the recent past. These techniques use autoencoder-like structures to encode video frames in a bottlenecked latent space and reconstruct them on the receiver’s end. While such approaches have proven effective in multiple situations, they are generic and do not exploit the high-level semantics of the video for compression.
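As a rough, self-contained sketch of this family of approaches (not the architecture of any specific method in [11, 15, 18, 19]; all layer widths here are illustrative assumptions), such a codec pairs a convolutional encoder with a mirrored decoder around a narrow latent:

import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Toy autoencoder codec: each frame is squeezed through a small
    latent bottleneck (the 'compressed' representation that would be
    transmitted) and decoded back on the receiver's side."""
    def __init__(self, latent_channels=8):
        super().__init__()
        # Encoder: 3 x 256 x 256 -> latent_channels x 32 x 32
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_channels, 4, stride=2, padding=1),
        )
        # Decoder mirrors the encoder back to full resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frame):
        latent = self.encoder(frame)   # sent over the network
        return self.decoder(latent)    # reconstructed at the receiver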
Talking Head Video Compression
Video calls, on the other hand, encompass a specific class of videos. They primarily contain videos of speakers and are popularly known as talking head videos. The inherent semantic information present in a talking head video, involving the face structure, head movements, facial expressions, etc., has long interested researchers in developing compression schemes targeted towards such specialized videos. Techniques like [16] transmit 68 facial landmarks for each frame, which are used to synthesize the talking head at the receiver’s end. In 2021, Wang et al. [25] proposed using face reenactment for video compression. In their work, they used 10 learned 3D key points instead of pre-defined face landmarks to represent a face, leading to significant compression. Each learned key point encodes information regarding the structure of the face, rotation, translation, etc., and helps to warp a reference frame.
Our Contributions
We explore this concept further in this work and propose several novel improvements. We first send a high-resolution frame (the pivot frame) at the start of the video call. For the rest of the frames, we use a modified version of [20] to detect key points in each of them and transmit these to the receiver. The key points are then used to calculate a dense flow that warps the pivot frame to recreate the original video.
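On the receiver’s side, this warping step amounts to resampling the pivot frame along the dense flow. The sketch below is a minimal illustration using bilinear sampling in PyTorch, not our exact network; the names pivot and flow and the flow convention are assumptions made for exposition.

import torch
import torch.nn.functional as F

def warp_with_flow(pivot, flow):
    """Warp the pivot frame with a dense flow field.
    pivot: (B, 3, H, W) reference frame, sent once.
    flow:  (B, H, W, 2) per-pixel offsets in normalized [-1, 1]
           coordinates, computed from the transmitted key points."""
    B, _, H, W = pivot.shape
    # Identity sampling grid in normalized [-1, 1] coordinates
    ys = torch.linspace(-1, 1, H, device=pivot.device, dtype=pivot.dtype)
    xs = torch.linspace(-1, 1, W, device=pivot.device, dtype=pivot.dtype)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    identity = torch.stack((grid_x, grid_y), dim=-1).expand(B, H, W, 2)
    # Shift the grid by the flow and bilinearly resample the pivot
    return F.grid_sample(pivot, identity + flow, align_corners=True)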
While [20, 25] used 24 bytes to represent a single key point, we further propose to reduce this requirement to only 8 bytes.
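While we do not detail our exact encoding here, one standard way to achieve such a reduction is uniform quantization of the key-point values. The sketch below is illustrative only: the six-float (24-byte) layout, the fixed value range, and one byte per value are assumptions, not our actual scheme.

import numpy as np

def quantize_keypoint(values, lo=-1.0, hi=1.0):
    """Map key-point values (assumed six float32 numbers, i.e., 24
    bytes) to one uint8 code each; lo/hi are fixed value ranges
    known to both sender and receiver."""
    scaled = (np.asarray(values, dtype=np.float32) - lo) / (hi - lo)
    return np.clip(np.round(scaled * 255), 0, 255).astype(np.uint8)

def dequantize_keypoint(codes, lo=-1.0, hi=1.0):
    """Receiver-side inverse of quantize_keypoint."""
    return codes.astype(np.float32) / 255 * (hi - lo) + lo

Six values then occupy 6 bytes instead of 24, comfortably within an 8-byte budget per key point, at the cost of a bounded quantization error.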
Next, we use a novel talking head frame-interpolator network to generate frames at the receiver’s side. This allows us to send key points from fewer frames while rendering the rest of the frames using the interpolator network.
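To see why this saves bandwidth, suppose key points are transmitted only for every k-th frame; the receiver must then fill in the k - 1 intermediate frames. Our method uses a learned interpolator network for this; the naive linear interpolation of key points below is purely an illustrative stand-in.

import numpy as np

def interpolate_keypoints(kp_a, kp_b, k):
    """Linearly interpolate between the key points of two transmitted
    frames t and t + k; returns key points for the k - 1 frames in
    between, each of which can then be rendered by warping the pivot."""
    alphas = np.linspace(0, 1, k + 1)[1:-1]  # interior time steps only
    return [kp_a + a * (kp_b - kp_a) for a in alphas]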
We use a patch-wise super-resolution network to upsample the final outputs to arbitrary resolutions, significantly improving the quality of the generations. In a lengthy video call, sending a single pivot frame at the start may lead to inferior results under significant changes in the background and head pose. Therefore, we also propose an algorithm to adaptively select and send pivot frames, negating the effects of such changes.
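A minimal version of such a policy is sketched below; the distance measure and threshold are illustrative assumptions, and our actual selection criterion differs.

import numpy as np

def needs_new_pivot(kp_current, kp_pivot, threshold=0.3):
    """Request a fresh pivot frame when the mean key-point displacement
    from the last pivot exceeds a threshold, which loosely tracks large
    head-pose or background changes."""
    drift = np.linalg.norm(kp_current - kp_pivot, axis=-1).mean()
    return drift > threshold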
Overall, our approach allows for an unprecedentedly low Bits-per-Pixel (BPP) value (the number of bits used to represent a pixel in a video) while maintaining usable quality. We refer the reader to our project web-page for numerous example results from our approach.
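For concreteness, BPP is the total number of transmitted bits divided by the total number of pixels. The payload in the worked example below (10 key points at 8 bytes each for a 256 × 256 frame) is purely illustrative:

\[
\mathrm{BPP} = \frac{\text{bits transmitted}}{W \times H \times \#\text{frames}},
\qquad \text{e.g.,} \quad
\frac{10 \times 8 \times 8\ \text{bits}}{256 \times 256\ \text{pixels}} \approx 0.0098 .
\]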
¹ https://en.wikipedia.org/wiki/List_of_countries_by_4G_LTE_penetration