Compressing Video Calls using Synthetic
Talking Heads
Madhav Agarwal¹ (madhav.agarwal@research.iiit.ac.in)
Anchit Gupta¹ (anchit.gupta@research.iiit.ac.in)
Rudrabha Mukhopadhyay¹ (radrabha.m@research.iiit.ac.in)
Vinay P. Namboodiri² (vpn22@bath.ac.uk)
C V Jawahar¹ (jawahar@iiit.ac.in)
¹CVIT, IIIT-Hyderabad, India
²University of Bath, England
Abstract
We leverage modern advancements in talking head generation to propose an end-to-end system for talking head video compression. Our algorithm transmits pivot frames intermittently, while the rest of the talking head video is generated by animating them. We use a state-of-the-art face reenactment network to detect key points in the non-pivot frames and transmit them to the receiver. A dense flow is then calculated to warp a pivot frame and reconstruct the non-pivot ones. Transmitting key points instead of full frames leads to significant compression. We propose a novel algorithm to adaptively select the best-suited pivot frames at regular intervals to provide a smooth experience. We also propose a frame interpolator at the receiver's end to improve the compression levels further. Finally, a face enhancement network improves reconstruction quality, significantly improving aspects such as the sharpness of the generated frames. We evaluate our method both qualitatively and quantitatively on benchmark datasets and compare it with multiple compression techniques. We release a demo video and additional information at https://cvit.iiit.ac.in/research/projects/cvit-projects/talking-video-compression.
1 Introduction
As the 21st century progresses, the world grows ever more digital and connected! Video calls are a big part of this push and have become a staple form of communication. The pandemic in 2020 led to a massive reduction in social interaction and fast-tracked their adoption. Universities and schools were forced to use video calls as the primary means of teaching, while for many, video calling remained the only way to connect with friends and family. While the number of video calls will only continue to rise, increasing network bandwidth to match is a daunting task. Incidentally, over half the world's countries do not
even have 4G services¹! Therefore, introducing video compression schemes that reduce the bandwidth requirement is a need of the hour.
Traditional Video Compression Techniques
Compressing video information has fascinated researchers for nearly a century. The first works dealt with analog video compression and were released in 1929 [8]. A significant breakthrough in modern video compression was achieved by [13] using a DCT-based compression technique, leading to the first practical applications. This was followed by the widely adopted H.264 [26] and H.265 [1] video codecs, which remain the most popular in industrial applications. The most recent codec to be released is H.266 [4]; however, we do not compare our work with H.266 due to the lack of open-source implementations. Deep learning-based video compression techniques [11, 15, 18, 19] have also become prevalent in the recent past. These techniques use autoencoder-like structures to encode video frames into a bottlenecked latent space and regenerate them at the receiver's end. While such approaches have proven effective in multiple situations, they are generic and do not exploit the high-level semantics of the video for compression.
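To make the contrast with our semantics-aware approach concrete, the following is a minimal PyTorch sketch of the autoencoder-style design these generic methods share; the layer choices and sizes are illustrative assumptions, not drawn from any of the cited works.

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Illustrative autoencoder: frames are squeezed through a small
    latent bottleneck at the sender and decoded at the receiver."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(            # sender side
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(            # receiver side
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        latent = self.encoder(frame)   # only this tensor would be transmitted
        return self.decoder(latent)

# A 256x256 frame becomes a 128x32x32 latent before decoding.
recon = FrameAutoencoder()(torch.rand(1, 3, 256, 256))
```

Nothing in such a bottleneck knows that the content is a face, which is precisely the semantic prior the methods below exploit.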
Talking Head Video Compression
Video calls, on the other hand, encompass a specific class of videos. They primarily contain videos of speakers and are popularly known as talking head videos. The inherent semantic information in a talking head video, involving the face structure, head movements, facial expressions, etc., has long interested researchers in developing compression schemes targeted at such specialized videos. Techniques like [16] transmit 68 facial landmarks for each frame, which are used to synthesize the talking head at the receiver's end. In 2021, Wang et al. [25] proposed using face reenactment for video compression. They used 10 learned 3D key points instead of pre-defined face landmarks to represent a face, leading to significant compression. Each learned key point contains information regarding the structure of the face, rotation, translation, etc., and helps to warp a reference frame.
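A back-of-the-envelope comparison makes clear why such compact representations help; the float32 landmark coordinates below are our own illustrative assumption, while the 24 bytes per learned key point is the figure reported for [20, 25].

```python
# Per-frame payload of landmark- vs. learned-key-point-based schemes.
landmark_bytes = 68 * 2 * 4      # 68 (x, y) landmarks as float32  -> 544 bytes
keypoint_bytes = 10 * 24         # 10 learned key points, 24 B each -> 240 bytes
raw_frame_bytes = 256 * 256 * 3  # uncompressed 256x256 RGB frame   -> 196,608 bytes

for name, size in [("68 landmarks", landmark_bytes),
                   ("10 key points", keypoint_bytes),
                   ("raw frame", raw_frame_bytes)]:
    print(f"{name}: {size} bytes per frame")
```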
Our Contributions
We explore this concept further in this work and propose several novel improvements. We first send a high-resolution frame (the pivot frame) at the start of the video call. For the rest of the frames, we use a modified version of [20] to detect key points in each of them and transmit these to the receiver. The key points are then used to calculate a dense flow that warps the pivot frame to recreate the original video. While [20, 25] used 24 bytes to represent a single key point, we further reduce this requirement to only 8 bytes. Next, we use a novel talking head frame-interpolator network to generate frames at the receiver's side. This allows us to send key points from fewer frames while rendering the rest of the frames using the interpolator network. We use a patch-wise super-resolution network to upsample the final outputs to arbitrary resolutions, significantly improving the quality of the generations. In a lengthy video call, sending a single pivot frame at the start may lead to inferior results when the background or head pose changes significantly. Therefore, we also propose an algorithm to adaptively select and send pivot frames, negating the effects of such changes. Overall, our approach achieves an unprecedentedly low Bits-per-Pixel (BPP) value (the number of bits used to represent a pixel in a video) while maintaining usable quality. We refer the reader to our project web-page for numerous example results from our approach.
¹ https://en.wikipedia.org/wiki/List_of_countries_by_4G_LTE_penetration
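The text does not spell out the 8-byte layout at this point; the sketch below shows one hypothetical way to meet that budget, quantizing each key point's (x, y) coordinates to two int16 values and its 2×2 Jacobian to four int8 values.

```python
import numpy as np

def pack_keypoint(xy: np.ndarray, jacobian: np.ndarray) -> bytes:
    """Hypothetical 8-byte key point encoding: coordinates in [-1, 1]
    as two int16 (4 bytes) and the 2x2 Jacobian as four int8 (4 bytes)."""
    coords = np.round(np.clip(xy, -1, 1) * 32767).astype(np.int16)
    jac = np.round(np.clip(jacobian, -4, 4) / 4 * 127).astype(np.int8)
    return coords.tobytes() + jac.tobytes()

def unpack_keypoint(buf: bytes):
    coords = np.frombuffer(buf[:4], dtype=np.int16) / 32767.0
    jac = np.frombuffer(buf[4:], dtype=np.int8).reshape(2, 2) / 127.0 * 4
    return coords, jac

payload = pack_keypoint(np.array([0.31, -0.72]), np.eye(2))
assert len(payload) == 8   # 8 bytes per key point, as targeted in the paper
```

Under this hypothetical layout, 10 key points sent for alternate frames of a 256×256 video amount to roughly 320 bits per displayed frame, i.e. about 0.005 BPP, ignoring the intermittently transmitted pivot frames.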
[Figure 1: pipeline diagram. Sender: a key point extractor processes the input video; only the pivot frame and lightweight key points are transmitted (the middle frame is not transmitted at all). Receiver: a FOMM-variant decoder reconstructs frames, frame interpolation synthesizes the untransmitted middle frames, and patch-wise super-resolution generates the high-resolution output.]
Figure 1: We depict the entire pipeline used for compressing talking head videos. In our
pipeline, we detect and send key points of alternate frames over the network and regenerate
the talking heads at the receiver’s end. We then use frame interpolation to generate the rest of
the frames and use super-resolution to generate high-resolution outputs.
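In pseudocode, the pipeline of Figure 1 reduces to the sender/receiver loop below; all callables are placeholders for the networks and transport described in the text, and details such as adaptive pivot re-selection are omitted.

```python
# Sketch of the sender/receiver loop in Figure 1. The callables
# (keypoint_extractor, generator, interpolator, super_resolve, send, recv)
# stand in for the models and the network channel described in the text.

def sender(frames, keypoint_extractor, send):
    send(("pivot", frames[0]))                    # pivot frame sent in full
    for frame in frames[1::2]:                    # key points of alternate frames only
        send(("kp", keypoint_extractor(frame)))

def receiver(recv, generator, interpolator, super_resolve):
    _, pivot = recv()                             # first message carries the pivot frame
    previous = None
    while True:
        _, keypoints = recv()
        current = generator(pivot, keypoints)     # warp pivot via dense flow
        if previous is not None:                  # synthesize the skipped middle frame
            yield super_resolve(interpolator(previous, current))
        yield super_resolve(current)
        previous = current
```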
2 Background: Synthetic Talking Head Generation
Our work revolves around synthetic talking head generation. Therefore, we survey the different types of talking head generation works prevalent in the community. Talking head generation was first popularized by works like [5, 7, 10, 17, 22], which attempted to generate only the lip movements from a given speech. These works were effective for solutions that required preserving the original head movements in a talking head video while changing only the lip synchronization to a new speech. A separate class of works [24, 30, 31, 32] tried to generate the talking head video directly from speech without additional information. While these works could also potentially find use in video call compression, the head movements in the generated video do not match those of the original one, limiting their usage!
Face Reenactment
In face reenactment, a source image is animated using the motion from a driving video. The initial models for this class of works were speaker-specific [3, 27]. These models are trained on a single identity and cannot generalize to different individuals. Speaker-agnostic models [2, 20, 25, 31], on the other hand, are more robust. They require a single image of any identity and a driving video (which need not have the same identity) to generate a talking head of the source identity following the driving motion. We find face reenactment works to be well suited for talking head video compression. We propose to exploit the inherent characteristics of the problem and send a single high-quality frame that can be animated at the receiver's end using the motion of the rest of the video to generate the final output. The reenactment is driven by landmarks, feature warping, or latent embeddings. The First-Order-Motion-Model (FOMM) proposed by Siarohin et al. [20] uses self-learned key points to represent the dense motion flow of the driving video. Each key point consists of coordinates and Jacobians representing the local motion field between the source image and the driving video. A global motion field is then interpolated from the local motion fields, and the source image is warped accordingly to generate the output.