
even have 4G services¹! Therefore, introducing video compression schemes to reduce the bandwidth requirement is a need of the hour.
Traditional Video Compression Techniques
Compressing video information has fascinated researchers for nearly a century. The first works dealt with analog video compression and were released in 1929 [8]. A significant breakthrough in modern video compression was achieved by [13] using a DCT-based compression technique, leading to the first practical applications. This was followed by the widely adopted H.264 [26] and H.265 [1] video codecs, which remain the most popular in industrial applications. The most recent codec to be released is H.266 [4]. However, we do not compare our work with H.266 due to the lack of open-source implementations. Deep learning-based video compression techniques like [11, 15, 18, 19] have also been prevalent in the recent past. These techniques use autoencoder-like structures to encode video frames in a bottlenecked latent space and reconstruct them on the receiver’s end. While such approaches have proven effective in multiple situations, they are generic and do not exploit the high-level semantics of the video for compression.
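As a rough, self-contained sketch of this family of approaches (not the architecture of any specific method in [11, 15, 18, 19]; all layer widths here are illustrative assumptions), such a codec pairs a convolutional encoder with a mirrored decoder around a narrow latent:

import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Toy autoencoder codec: each frame is squeezed through a small
    latent bottleneck (the 'compressed' representation that would be
    transmitted) and decoded back on the receiver's side."""
    def __init__(self, latent_channels=8):
        super().__init__()
        # Encoder: 3 x 256 x 256 -> latent_channels x 32 x 32
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_channels, 4, stride=2, padding=1),
        )
        # Decoder mirrors the encoder back to full resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frame):
        latent = self.encoder(frame)   # sent over the network
        return self.decoder(latent)    # reconstructed at the receiver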
Talking Head Video Compression
Video calls, on the other hand, encompass a specific class of videos. They primarily contain videos of speakers and are popularly known as talking head videos. The inherent semantic information present in a talking head video, involving the face structure, head movements, facial expressions, etc., has long interested researchers in developing compression schemes targeted towards such specialized videos. Techniques like [16] transmit 68 facial landmarks for each frame, which are used to synthesize the talking head at the receiver’s end. In 2021, Wang et al. [25] proposed using face reenactment for video compression. In their work, they used 10 learned 3D key points instead of pre-defined face landmarks to represent a face, leading to significant compression. Each learned key point encodes information regarding the structure of the face, rotation, translation, etc., and helps to warp a reference frame.
Our Contributions
We explore this concept further in this work and propose several novel improvements. We first send a high-resolution frame (the pivot frame) at the start of the video call. For the rest of the frames, we use a modified version of [20] to detect key points in each of them and transmit these to the receiver. The key points are then used to calculate a dense flow that warps the pivot frame to recreate the original video.
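On the receiver’s side, this warping step amounts to resampling the pivot frame along the dense flow. The sketch below is a minimal illustration using bilinear sampling in PyTorch, not our exact network; the names pivot and flow and the flow convention are assumptions made for exposition.

import torch
import torch.nn.functional as F

def warp_with_flow(pivot, flow):
    """Warp the pivot frame with a dense flow field.
    pivot: (B, 3, H, W) reference frame, sent once.
    flow:  (B, H, W, 2) per-pixel offsets in normalized [-1, 1]
           coordinates, computed from the transmitted key points."""
    B, _, H, W = pivot.shape
    # Identity sampling grid in normalized [-1, 1] coordinates
    ys = torch.linspace(-1, 1, H, device=pivot.device, dtype=pivot.dtype)
    xs = torch.linspace(-1, 1, W, device=pivot.device, dtype=pivot.dtype)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    identity = torch.stack((grid_x, grid_y), dim=-1).expand(B, H, W, 2)
    # Shift the grid by the flow and bilinearly resample the pivot
    return F.grid_sample(pivot, identity + flow, align_corners=True)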
While [20, 25] used 24 bytes to represent a single key point, we further propose to reduce this requirement to only 8 bytes.
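While we do not detail our exact encoding here, one standard way to achieve such a reduction is uniform quantization of the key-point values. The sketch below is illustrative only: the six-float (24-byte) layout, the fixed value range, and one byte per value are assumptions, not our actual scheme.

import numpy as np

def quantize_keypoint(values, lo=-1.0, hi=1.0):
    """Map key-point values (assumed six float32 numbers, i.e., 24
    bytes) to one uint8 code each; lo/hi are fixed value ranges
    known to both sender and receiver."""
    scaled = (np.asarray(values, dtype=np.float32) - lo) / (hi - lo)
    return np.clip(np.round(scaled * 255), 0, 255).astype(np.uint8)

def dequantize_keypoint(codes, lo=-1.0, hi=1.0):
    """Receiver-side inverse of quantize_keypoint."""
    return codes.astype(np.float32) / 255 * (hi - lo) + lo

Six values then occupy 6 bytes instead of 24, comfortably within an 8-byte budget per key point, at the cost of a bounded quantization error.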
Next, we use a novel talking head frame-interpolator network to generate frames at the receiver’s side. This allows us to send key points from fewer frames while rendering the rest of the frames using the interpolator network.
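To see why this saves bandwidth, suppose key points are transmitted only for every k-th frame; the receiver must then fill in the k - 1 intermediate frames. Our method uses a learned interpolator network for this; the naive linear interpolation of key points below is purely an illustrative stand-in.

import numpy as np

def interpolate_keypoints(kp_a, kp_b, k):
    """Linearly interpolate between the key points of two transmitted
    frames t and t + k; returns key points for the k - 1 frames in
    between, each of which can then be rendered by warping the pivot."""
    alphas = np.linspace(0, 1, k + 1)[1:-1]  # interior time steps only
    return [kp_a + a * (kp_b - kp_a) for a in alphas]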
We use a patch-wise super-resolution network to upsample the final outputs to arbitrary resolutions, significantly improving the quality of the generations. In a lengthy video call, sending a single pivot frame at the start may lead to inferior results under significant changes in the background and head pose. Therefore, we also propose an algorithm to adaptively select and send pivot frames, negating the effects of such changes.
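A minimal version of such a policy is sketched below; the distance measure and threshold are illustrative assumptions, and our actual selection criterion differs.

import numpy as np

def needs_new_pivot(kp_current, kp_pivot, threshold=0.3):
    """Request a fresh pivot frame when the mean key-point displacement
    from the last pivot exceeds a threshold, which loosely tracks large
    head-pose or background changes."""
    drift = np.linalg.norm(kp_current - kp_pivot, axis=-1).mean()
    return drift > threshold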
Overall, our approach allows for an unprecedentedly low Bits-per-Pixel (BPP) value (the number of bits used to represent a pixel in a video) while maintaining usable quality. We refer the reader to our project web-page for numerous example results from our approach.
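For concreteness, BPP is the total number of transmitted bits divided by the total number of pixels. The payload in the worked example below (10 key points at 8 bytes each for a 256 × 256 frame) is purely illustrative:

\[
\mathrm{BPP} = \frac{\text{bits transmitted}}{W \times H \times \#\text{frames}},
\qquad \text{e.g.,} \quad
\frac{10 \times 8 \times 8\ \text{bits}}{256 \times 256\ \text{pixels}} \approx 0.0098 .
\]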
¹ https://en.wikipedia.org/wiki/List_of_countries_by_4G_LTE_penetration