An Effective Deep Network for Head Pose Estimation without Keypoints
Chien Thai, Viet Tran, Minh Bui, Huong Ninh and Hai Tran
Computer Vision Department, Optoelectronics Center, Viettel Aerospace Institute, Vietnam
{chientv13, vietth5, minhbq6, huongnt382, haitt27}@viettel.com.vn
Keywords: head pose estimation, knowledge distillation, convolutional neural network
Abstract: Human head pose estimation is an essential problem in facial analysis, with many computer vision applications such as gaze estimation, virtual reality, and driver assistance. Because of its importance, it is necessary to design a compact model for this task that reduces the computational cost of deployment in facial analysis-based applications, such as large camera surveillance systems and AI cameras, while maintaining accuracy. In this work, we propose a lightweight model that effectively addresses the head pose estimation problem. Our approach has two main steps. 1) We first train many teacher models on the synthetic dataset 300W-LPA to get head pose pseudo labels. 2) We design an architecture with a ResNet18 backbone and train our proposed model with the ensemble of these pseudo labels via a knowledge distillation process. To evaluate the effectiveness of our model, we use AFLW-2000 and BIWI, two real-world head pose datasets. Experimental results show that our proposed model significantly improves accuracy in comparison with state-of-the-art head pose estimation methods. Furthermore, our model runs in real time at 300 FPS when inferring on a Tesla V100.
1 INTRODUCTION
Head pose estimation (HPE) is an important problem in facial analysis that has been extensively researched in recent years. Its applications appear in many intelligent computer vision systems, including virtual reality (Kumar et al., 2017), driver assistance (Schwarz et al., 2017; Murphy-Chutorian et al., 2007), gaze estimation (Murphy-Chutorian and Trivedi, 2008), human-computer interaction (Seemann et al., 2004; Wang et al., 2019) and smart city surveillance.
The objective of head pose estimation is to accurately identify the orientation of the heads of individuals found in images. Existing methods for this problem can be divided into two primary categories: landmark-based approaches (Cao et al., 2014; Lathuilière et al., 2017; Fanelli et al., 2011; Xiong and De la Torre, 2015; Sun et al., 2013; Xin et al., 2021; Bulat and Tzimiropoulos, 2017; DeMenthon and Davis, 1995) and landmark-free approaches (Ruiz et al., 2018; Yang et al., 2019; Zhou and Gregson, 2020; Chang et al., 2017). Landmark-based methods use facial keypoints extracted by landmark detectors to regress the head pose angles. Recently, these approaches have achieved remarkable results, since deep neural networks have greatly enhanced the quality of landmark detectors. However, the problem remains challenging for two reasons: a minor error in the landmark detector may adversely affect the head pose estimate, and learning the relation between the geometric distribution of facial landmarks and head poses is not a trivial task. Furthermore, using landmark detection as a preprocessing step imposes a computational burden on the whole process of estimating head angles, which hinders its use in real-time applications. Landmark-free methods, on the other hand, directly predict the head pose from images without detecting facial keypoints, which results in fast execution.
In addition to the above approaches, some works utilize depth information from depth cameras (Meyer et al., 2015; Fanelli et al., 2011; Mukherjee and Robertson, 2015; Martin et al., 2014). Although this approach provides prominent results, it still has some limitations. Depth cameras are sensitive to illumination changes and lighting conditions, so they often yield substandard results in uncontrolled environments. Moreover, they are very expensive and require more storage and transfer time, so they are often impractical for real-time applications.
arXiv:2210.13705v1 [cs.CV] 25 Oct 2022
Because of the importance of the head pose estimation problem, and in order to minimize the processing time of the model when deploying on large systems or embedded platforms, our goal is to design a lightweight architecture that solves this task while still guaranteeing remarkable performance. To keep the model compact and simple, our network uses the ResNet18 architecture as a backbone. The contributions of our work can be summarized as follows:
• We address a major mistake found in HopeNet (Ruiz et al., 2018), in which annotated face boxes are mislabeled. We show that correcting those mislabeled boxes can significantly improve the accuracy of the head pose estimation task.
• An end-to-end deep architecture designed to solve the head pose estimation problem is proposed. A lightweight model is trained for this task via the knowledge distillation process.
• Experiments are conducted to evaluate the performance of our method on two challenging head pose datasets (BIWI and AFLW-2000). Our method achieves state-of-the-art performance when evaluated on these head pose datasets.
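The two-step idea behind the second contribution (an ensemble of teacher predictions used as pseudo labels for a lightweight student) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names are hypothetical, the ensemble is taken to be a plain average of the teachers' (yaw, pitch, roll) predictions, and the distillation loss is taken to be a mean squared error. The excerpt above does not specify the actual ensembling rule or loss.

```python
from statistics import fmean


def ensemble_pseudo_label(teacher_predictions):
    """Average the (yaw, pitch, roll) predictions of several teachers.

    Assumption: a plain mean is used as the ensembling rule.
    """
    # zip(*...) groups all yaw values together, then pitch, then roll.
    return tuple(fmean(angle_values) for angle_values in zip(*teacher_predictions))


def distillation_loss(student_prediction, pseudo_label):
    """Mean squared error between the student output and the pseudo label.

    Assumption: MSE stands in for whatever loss the student is trained with.
    """
    return fmean((s - t) ** 2 for s, t in zip(student_prediction, pseudo_label))


# Hypothetical teacher outputs (in degrees) for a single face image.
teachers = [(10.0, -5.0, 2.0), (12.0, -4.0, 1.0), (11.0, -6.0, 3.0)]
pseudo = ensemble_pseudo_label(teachers)
loss = distillation_loss((13.0, -5.0, 2.0), pseudo)
```

In this sketch the student never sees ground-truth angles; it is only pulled toward the averaged teacher output, which is the essence of training on pseudo labels.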
The rest of the paper is organized as follows. Section 2 reviews related work on the head pose estimation problem. Section 3 presents our proposed method. Section 4 discusses the datasets, experiments, results, and ablation study. Finally, the conclusion and future work are discussed in Section 5.
2 RELATED WORKS
Convolutional neural networks (CNNs) are widely used in computer vision tasks and are gradually replacing traditional image processing methods. A CNN is designed to automatically learn the spatial features of an image by using convolution kernels. With many convolutional layers, deep networks can extract high-level semantic features. He et al. (He et al., 2016) propose the Residual Network (ResNet) to train much deeper convolutional neural networks. ResNet uses a skip connection between the current layer and a previous layer, which makes the identity mapping easy to learn and mitigates the vanishing gradient problem. Because of its powerful and simple architecture, ResNet and its variants (Xie et al., 2017; Zhang et al., 2020; Gao et al., 2019) are widely used in many computer vision applications and deliver high performance.
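The skip connection can be written as y = F(x) + x, where F is the learned transformation of the block. The following minimal sketch shows only this additive structure on plain Python lists; the names `residual_block` and `zero_transform` are hypothetical, and a real ResNet block would use convolutions, normalization, and a nonlinearity for F.

```python
def residual_block(x, transform):
    """Return transform(x) + x element-wise (the skip connection)."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_transform(x):
    """A transformation that has learned nothing: it outputs all zeros."""
    return [0.0 for _ in x]


def double_transform(x):
    """A toy transformation used only to show the additive structure."""
    return [2.0 * xi for xi in x]


# With F(x) = 0 the block reduces to the identity mapping.
identity_out = residual_block([1.0, -2.0, 3.0], zero_transform)
# With F(x) = 2x the block outputs 2x + x = 3x.
tripled_out = residual_block([1.0, -2.0, 3.0], double_transform)
```

Because the input is added back unchanged, a block only needs to drive F toward zero to behave as the identity, which is one common explanation of why very deep residual networks remain trainable.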
Human head pose estimation has been researched over the past 25 years with many different approaches. Appearance template methods (Niyogi and Freeman, 1996; Beymer, 1994; Sherrah et al., 2001; Ng and Gong, 2002; Sherrah et al., 1999) compare the input image with a set of labeled templates and assign it the pose of the most similar template. Detector arrays (Huang et al., 1998; Zhang et al., 2006; Jones and Viola, 2003) estimate head pose by training multiple face detectors for different discrete poses.
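As a rough illustration of the appearance template idea, the sketch below assigns a query image the pose label of its nearest template. The similarity measure (sum of squared pixel differences), the flattened four-pixel "images", and the function name `nearest_template_pose` are all assumptions made for this example; the cited works use their own features and similarity measures.

```python
def sum_squared_diff(a, b):
    """Sum of squared differences between two flattened images."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_template_pose(image, templates):
    """Return the pose label of the template closest to the image.

    templates is a list of (pixels, pose_label) pairs.
    """
    _, best_pose = min(templates, key=lambda t: sum_squared_diff(image, t[0]))
    return best_pose


# Hypothetical labeled templates, each a flattened 2x2 image.
templates = [
    ([0.0, 0.0, 1.0, 1.0], "left"),
    ([1.0, 1.0, 0.0, 0.0], "right"),
    ([1.0, 0.0, 0.0, 1.0], "frontal"),
]
pose = nearest_template_pose([0.9, 0.8, 0.1, 0.2], templates)
```

The sketch also makes the main limitation visible: the output can only be one of the discrete poses present in the template set.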
Many approaches rely on facial landmarks from the input image to estimate the head pose. With the progress of landmark detection, landmark-based methods demonstrate superior performance. DeMenthon and Davis (DeMenthon and Davis, 1995) proposed Pose from Orthography and Scaling with Iterations (POSIT), which determines the head pose by 3D computer vision techniques for the given 2D face landmarks. FAN (Bulat and Tzimiropoulos, 2017) uses a deep neural network to estimate 3D face models. EVA-GCN (Xin et al., 2021) constructs a landmark-connection graph and leverages the Graph Convolution Network (Yan et al., 2018) to learn the nonlinear relationships between head poses and the distribution of facial keypoints.
Multi-task methods combine the head pose estimation problem with other related facial analysis problems, such as face detection and keypoint detection. Some works show that learning related tasks together yields better results than learning individual tasks independently (Chen et al., 2014; Kumar et al., 2017; Zhu and Ramanan, 2012; Ranjan et al., 2017b). KEPLER (Kumar et al., 2017) predicts face detection and pose estimation jointly by using Heatmap-CNN to capture structured global and local features. HyperFace (Ranjan et al., 2017a) presents a convolutional neural network for simultaneous face detection, landmark localization, pose estimation, and gender recognition.
Gu et al. (Gu et al., 2017) proposed a dynamic facial analysis method that uses a recurrent neural network. They improve head pose estimation and facial landmark localization by leveraging the time dimension of videos instead of a single frame.
For accurate head pose estimation, some methods utilize the 3D information of depth images. Meyer et al. (Meyer et al., 2015) perform head pose estimation by registering 3D morphable models to depth images, using particle swarm optimization and the iterative closest point algorithm. Fanelli et al. (Fanelli et al., 2011) use Random Regression Forests to regress the head pose from depth images.
Recent works directly predict the Euler angles from a single RGB image by using a deep neural network and achieve prominent performance. HopeNet (Ruiz et al., 2018) proposed a multi-loss framework that combines binned pose classification and regression loss for each Euler angle. By using a very stable softmax layer and cross-entropy for binned classifica-