systems or embedded platforms, our goal is to design
a lightweight architecture that solves this task while
still delivering strong performance. To keep the model
compact and simple, our network uses the ResNet18
architecture as a backbone. The contributions of our
work can be summarized as follows:
• We address a major mistake found in HopeNet
(Ruiz et al., 2018) in which annotated face boxes
are mislabeled. We show that correcting those
mislabeled boxes can significantly improve the
accuracy of the head pose estimation task.
• We propose an end-to-end deep architecture for
the head pose estimation problem. A lightweight
model is trained for this task via knowledge
distillation.
• Experiments are conducted to evaluate the
performance of our method on two challenging
head pose datasets (BIWI and AFLW-2000).
Our method achieves state-of-the-art performance
when evaluated on the head pose datasets.
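The knowledge distillation mentioned in the second contribution can be illustrated with a minimal sketch. It assumes the common soft-target formulation, in which the student matches the temperature-softened outputs of a teacher under a KL divergence. The loss actually used in this work is not specified at this point, so the function names, the temperature value, and the logits below are illustrative assumptions only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor follows the standard soft-target formulation
    (an assumption, not a detail taken from this paper).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(p_teacher, p_student)
        if p > 0.0
    )
    return (temperature ** 2) * kl


teacher = [2.0, 0.5, -1.0]  # hypothetical teacher logits
student = [1.5, 0.7, -0.8]  # hypothetical student logits
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, which is the signal that drives the lightweight model toward the larger one.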
The rest of the paper is organized as follows:
Section 2 reviews related work on the head pose
estimation problem. Section 3 presents our proposed
method. Section 4 discusses the datasets, experiments,
results, and ablation study. Finally, the conclusion
and future work are discussed in Section 5.
2 RELATED WORKS
Convolutional neural networks (CNNs) are widely
used in computer vision tasks and have gradually
replaced traditional image processing methods. A CNN
is designed to automatically learn the spatial features
of an image by using convolution kernels. With many
convolutional layers, deep networks can extract
high-level semantic features. He et al. (He et al., 2016)
propose the Residual Network (ResNet) to train much
deeper convolutional neural networks. ResNet adds a
skip connection between the current layer and a previous
layer, which makes the identity mapping easy to learn
and mitigates the vanishing gradient problem. Because
of its powerful and simple architecture, ResNet and its
variants
(Xie et al., 2017; Zhang et al., 2020; Gao et al., 2019)
are widely used in many computer vision applications
and deliver high performance.
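The skip connection can be summarized as y = x + F(x): the block only has to learn the residual F, and the identity mapping corresponds to F(x) = 0. The following plain-Python sketch illustrates this. The residual branch F here is a hypothetical single linear layer followed by ReLU, standing in for the convolutional layers of an actual ResNet block.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Multiply a square weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(weights, x):
    """Compute y = x + F(x), with F(x) = relu(W x) as a stand-in branch."""
    fx = relu(linear(weights, x))
    return [xi + fi for xi, fi in zip(x, fx)]


x = [1.0, -2.0, 3.0]

# With all-zero weights, F(x) = 0 and the block reduces to the identity.
zero_weights = [[0.0] * 3 for _ in range(3)]
y_identity = residual_block(zero_weights, x)

# With identity weights, F(x) = relu(x) is added on top of x.
eye_weights = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
y_eye = residual_block(eye_weights, x)
```

Because the input is added back unchanged, the gradient with respect to x always contains a direct path that bypasses the residual branch, which is the usual explanation for why very deep stacks of such blocks remain trainable.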
Human head pose estimation has been researched
over the past 25 years with many different
approaches. Appearance template methods (Niyogi and
Freeman, 1996; Beymer, 1994; Sherrah et al., 2001; Ng
and Gong, 2002; Sherrah et al., 1999) compare the
input image with a set of labeled templates and
assign it the pose of the most similar template.
Detector arrays (Huang et al., 1998; Zhang et al.,
2006; Jones and Viola, 2003) estimate head pose by
training multiple face detectors, one for each
discrete pose.
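The appearance template idea described above reduces to a nearest-neighbor lookup. The sketch below assumes that images are already flattened to feature vectors and compared with the Euclidean distance. The cited works use a variety of similarity measures, and the templates, pose labels, and query here are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_template(image, templates):
    """Return the pose label of the template closest to the image.

    `templates` is a list of (feature_vector, pose_label) pairs.
    """
    _, best_label = min(templates, key=lambda t: euclidean(image, t[0]))
    return best_label


# Hypothetical labeled templates, one per discrete pose.
templates = [
    ([0.9, 0.1, 0.0], "left"),
    ([0.1, 0.9, 0.1], "frontal"),
    ([0.0, 0.1, 0.9], "right"),
]
query = [0.2, 0.8, 0.2]
predicted_pose = match_template(query, templates)
```

The prediction can only be one of the template labels, so this family of methods yields discrete poses rather than continuous angles.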
Many approaches estimate the head pose from facial
landmarks detected in the input image. With the
progress of landmark detection, landmark-based
methods demonstrate superior performance. DeMenthon
and Davis (DeMenthon and Davis, 1995) proposed
Pose from Orthography and Scaling with Iterations
(POSIT), which determines the head pose from the
given 2D face landmarks using 3D computer vision
techniques. FAN (Bulat and Tzimiropoulos, 2017) uses
a deep neural network to estimate 3D face models.
EVA-GCN (Xin et al., 2021) constructs a
landmark-connection graph and leverages the Graph
Convolution Network (Yan et al., 2018) to learn the
nonlinear relationships between head poses and the
distribution of facial keypoints.
Multi-task methods combine the head pose
estimation problem with other related facial analysis
problems, such as face detection and keypoint detection.
Some works show that learning with related tasks
yields better results than learning individual tasks in-
dependently (Chen et al., 2014; Kumar et al., 2017;
Zhu and Ramanan, 2012; Ranjan et al., 2017b).
KEPLER (Kumar et al., 2017) performs face detection
and pose estimation jointly by using a Heatmap-CNN
to capture structured global and local features.
HyperFace (Ranjan et al., 2017a) presents a
convolutional neural network for simultaneous face
detection, landmark localization, pose estimation, and
gender recognition.
Gu et al. (Gu et al., 2017) proposed a dynamic
facial analysis method that uses a recurrent neural
network. They improve head pose estimation and facial
landmark localization by leveraging the temporal
information in videos instead of a single frame.
For accurate head pose estimation, some methods
utilize 3D information from depth images. Meyer et al.
(Meyer et al., 2015) perform head pose estimation by
registering 3D morphable models to depth images,
using particle swarm optimization and the iterative
closest point algorithm. Fanelli et al. (Fanelli et al.,
2011) use Random Regression Forests to regress the
head pose from depth images.
Recent works directly predict the Euler angles
from a single RGB image by using a deep neural
network and achieve strong performance. HopeNet
(Ruiz et al., 2018) proposed a multi-loss framework
that combines a binned pose classification loss and a
regression loss for each Euler angle. By using a very stable
softmax layer and cross-entropy for binned classifica-