

An Effective Deep Network for Head Pose Estimation without Keypoints

Chien Thai, Viet Tran, Minh Bui, Huong Ninh and Hai Tran

Computer Vision Department, Optoelectronics Center, Viettel Aerospace Institute, Vietnam

{chientv13, vietth5, minhbq6, huongnt382, haitt27}@viettel.com.vn

Keywords: head pose estimation, knowledge distillation, convolutional neural network

Abstract: Human head pose estimation is an essential problem in facial analysis, with many computer vision applications such as gaze estimation, virtual reality, and driver assistance. Given its importance, a compact model is needed to reduce the computational cost of deployment in facial analysis applications such as large camera surveillance systems and AI cameras, while maintaining accuracy. In this work, we propose a lightweight model that effectively addresses the head pose estimation problem. Our approach has two main steps: 1) we first train multiple teacher models on the synthetic dataset 300W-LPA to obtain head pose pseudo labels; 2) we design an architecture with a ResNet18 backbone and train it on the ensemble of these pseudo labels via knowledge distillation. To evaluate the effectiveness of our model, we use AFLW-2000 and BIWI, two real-world head pose datasets. Experimental results show that our proposed model significantly improves accuracy compared with state-of-the-art head pose estimation methods. Furthermore, our model runs in real time at ∼300 FPS when inferring on a Tesla V100.
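The two-step procedure can be illustrated with a minimal Python sketch. It assumes that each teacher outputs a (yaw, pitch, roll) triple in degrees, that the ensemble is a plain average, and that the student is trained with a mean squared error against the averaged pseudo label. The abstract states none of these details, so the function names `ensemble_pseudo_label` and `distillation_loss`, the averaging rule, and the loss are all illustrative assumptions rather than the authors' implementation.

```python
from statistics import fmean


def ensemble_pseudo_label(teacher_predictions):
    """Average per-teacher (yaw, pitch, roll) predictions into one pseudo label.

    Assumption: a plain arithmetic mean, which is only sensible when the
    angles are far from the wrap-around point.
    """
    # zip(*...) groups all yaw values, all pitch values and all roll values.
    return tuple(fmean(angle_values) for angle_values in zip(*teacher_predictions))


def distillation_loss(student_prediction, pseudo_label):
    """Mean squared error between the student output and the pseudo label."""
    return fmean((s - t) ** 2 for s, t in zip(student_prediction, pseudo_label))


# Hypothetical predictions from three teachers for one face image.
teachers = [(10.0, -5.0, 2.0), (12.0, -4.0, 1.0), (11.0, -6.0, 3.0)]
label = ensemble_pseudo_label(teachers)
loss = distillation_loss((11.0, -5.0, 4.0), label)
```

In this toy case the student is wrong only on roll, so the loss comes entirely from that angle.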

1 INTRODUCTION

Head pose estimation (HPE) is an important problem in facial analysis that has been extensively researched in recent years. Its applications can be found in many intelligent computer vision systems, including virtual reality (Kumar et al., 2017), driver assistance (Schwarz et al., 2017; Murphy-Chutorian et al., 2007), gaze estimation (Murphy-Chutorian and Trivedi, 2008), human-computer interaction (Seemann et al., 2004; Wang et al., 2019) and smart city surveillance.

The objective of head pose estimation is to accurately identify the orientation of the heads of individuals found in images. Existing methods can be divided into two primary categories: landmark-based approaches (Cao et al., 2014; Lathuilière et al., 2017; Fanelli et al., 2011; Xiong and De la Torre, 2015; Sun et al., 2013; Xin et al., 2021; Bulat and Tzimiropoulos, 2017; DeMenthon and Davis, 1995) and landmark-free approaches (Ruiz et al., 2018; Yang et al., 2019; Zhou and Gregson, 2020; Chang et al., 2017). Landmark-based methods use facial keypoints extracted by landmark detectors to regress the head pose angles. Recently, these approaches have achieved remarkable results, since deep neural networks have greatly improved the quality of landmark detectors. However, the problem remains challenging for two reasons: even a minor error of the landmark detector may adversely affect the head pose estimate, and learning the relation between the geometric distribution of facial landmarks and head poses is not a trivial task. Furthermore, using landmark detection as a preprocessing step imposes a computational burden on the whole pipeline, which hinders its use in real-time applications. Landmark-free methods, on the other hand, predict the head pose directly from images without detecting facial keypoints, which results in fast execution time.

In addition to the above approaches, some works utilize depth information from depth cameras (Meyer et al., 2015; Fanelli et al., 2011; Mukherjee and Robertson, 2015; Martin et al., 2014). Although this approach provides strong results, it still has some limitations. Depth cameras are sensitive to illumination changes and lighting conditions, so they often yield substandard results in uncontrolled environments. Moreover, they are very expensive and require more storage and transfer time, so they are often impractical for real-time applications.

arXiv:2210.13705v1 [cs.CV] 25 Oct 2022

Because of the importance of the head pose estimation problem, and in order to minimize the processing time of the model when deploying on large systems or embedded platforms, our goal is to design a lightweight architecture that solves this task while still guaranteeing remarkable performance. To keep the model compact and simple, our network uses the ResNet18 architecture as a backbone. The contributions of our work can be summarized as follows:

• We address a major mistake found in HopeNet (Ruiz et al., 2018), in which annotated face boxes are mislabeled. We show that correcting those mislabeled boxes can significantly improve the accuracy of the head pose estimation task.

• We propose an end-to-end deep architecture designed to solve the head pose estimation problem. A lightweight model is trained for this task via the knowledge distillation process.

• We conduct experiments to evaluate the performance of our method on two challenging head pose datasets (BIWI and AFLW-2000). Our method achieves state-of-the-art performance on these datasets.

The rest of the paper is organized as follows: Section 2 reviews related work on the head pose estimation problem. In Section 3, we present our proposed method. Section 4 discusses the datasets, experiments, results, and ablation study. Finally, the conclusion and future work are discussed in the final section.

2 RELATED WORKS

Convolutional neural networks (CNNs) are widely used in computer vision tasks and are gradually replacing traditional image processing methods. A CNN is designed to automatically learn the spatial features of an image by using convolution kernels. With many convolutional layers, deep networks can extract high-level semantic features. He et al. (2016) propose the Residual Network (ResNet) to train much deeper convolutional neural networks. ResNet adds a skip connection between the current layer and a previous layer, which makes the identity mapping easy to learn and mitigates the vanishing gradient problem. Because of their powerful and simple architecture, ResNet and its variants (Xie et al., 2017; Zhang et al., 2020; Gao et al., 2019) are widely used in many computer vision applications and deliver high performance.
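The skip connection can be illustrated with a small sketch in plain Python. It is a simplification under stated assumptions: real ResNet blocks use convolutions and batch normalization, whereas this sketch uses two small linear layers as a stand-in for the residual function F, and the helper names `relu`, `linear` and `residual_block` are hypothetical. The point it shows is that the block computes relu(F(x) + x), so when F outputs zero the block passes a non-negative input through unchanged.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def linear(weights, values):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is linear -> ReLU -> linear.

    The "+ x" term is the skip connection: the layers only have to learn
    the residual F(x) on top of the identity.
    """
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With all-zero weights F(x) = 0, so the block reduces to the identity here.
out = residual_block(x, zeros, zeros)
```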

Human head pose estimation has been researched over the past 25 years with many different approaches. Appearance template methods (Niyogi and Freeman, 1996; Beymer, 1994; Sherrah et al., 2001; Ng and Gong, 2002; Sherrah et al., 1999) compare the input image with a set of labeled templates and assign it the pose of the most similar template. Detector arrays (Huang et al., 1998; Zhang et al., 2006; Jones and Viola, 2003) estimate head pose by training multiple face detectors for different discrete poses.
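The appearance template idea amounts to a nearest-neighbor lookup, as the following sketch suggests. It assumes that images are flattened to equal-length lists of pixel intensities and that similarity is measured by squared Euclidean distance; the cited works use various similarity measures, so the metric, the function names and the toy templates here are illustrative assumptions only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length pixel vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def match_template(image, templates):
    """Return the pose label of the template closest to the image.

    `templates` is a list of (pose_label, pixel_vector) pairs.
    """
    best_label, _ = min(templates, key=lambda item: squared_distance(image, item[1]))
    return best_label


# Hypothetical 4-pixel templates for three discrete poses.
templates = [
    ("frontal", [0.5, 0.5, 0.5, 0.5]),
    ("left", [0.9, 0.1, 0.9, 0.1]),
    ("right", [0.1, 0.9, 0.1, 0.9]),
]
pose = match_template([0.8, 0.2, 0.85, 0.15], templates)
```

Because the output is restricted to the poses present in the template set, this kind of method can only return discrete pose labels.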

Many approaches estimate the head pose from facial landmarks detected in the input image. With the progress of landmark detection, landmark-based methods demonstrate superior performance. DeMenthon and Davis (1995) proposed Pose from Orthography and Scaling with Iterations (POSIT), which determines the head pose from given 2D face landmarks using 3D computer vision techniques. FAN (Bulat and Tzimiropoulos, 2017) uses a deep neural network to estimate 3D face models. EVA-GCN (Xin et al., 2021) constructs a landmark-connection graph and leverages the Graph Convolution Network (Yan et al., 2018) to learn the nonlinear relationships between head poses and the distribution of facial keypoints.

Multi-task methods combine the head pose estimation problem with other related facial analysis problems, such as face detection and keypoint detection. Some works show that learning with related tasks yields better results than learning individual tasks independently (Chen et al., 2014; Kumar et al., 2017; Zhu and Ramanan, 2012; Ranjan et al., 2017b). KEPLER (Kumar et al., 2017) predicts face detection and pose estimation jointly by using a Heatmap-CNN to capture structured global and local features. Hyperface (Ranjan et al., 2017a) presents a convolutional neural network for simultaneous face detection, landmark localization, pose estimation, and gender recognition.

Gu et al. (2017) proposed a dynamic facial analysis method that uses a recurrent neural network. They improve head pose estimation and facial landmark localization by leveraging the time dimension of videos instead of a single frame.

For accurate head pose estimation, some methods utilize the 3D information in depth images. Meyer et al. (2015) perform head pose estimation by registering 3D morphable models to depth images, using particle swarm optimization and the iterative closest point algorithm. Fanelli et al. (2011) use random regression forests to regress the head pose from depth images.

Recent works directly predict the Euler angles from a single RGB image by using a deep neural network and achieve strong performance. HopeNet (Ruiz et al., 2018) proposed a multi-loss framework that combines a binned pose classification loss and a regression loss for each Euler angle. By using a very stable softmax layer and cross-entropy for binned classifica-
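The multi-loss idea attributed to HopeNet can be sketched for a single Euler angle as follows. This is a rough illustration under stated assumptions, not the original implementation: the five bin centers, the weight `alpha`, the use of the softmax expectation as the regressed angle, and the function names are all chosen here for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_loss(logits, true_angle, bin_centers, alpha=1.0):
    """Return (loss, expected_angle) for one Euler angle.

    loss = cross-entropy over angle bins + alpha * squared regression error.
    """
    probs = softmax(logits)
    # Classification target: the bin whose center is closest to the true angle.
    target_bin = min(range(len(bin_centers)),
                     key=lambda i: abs(bin_centers[i] - true_angle))
    cross_entropy = -math.log(probs[target_bin])
    # Regression: the expectation of the bin centers gives a continuous angle.
    expected_angle = sum(p * c for p, c in zip(probs, bin_centers))
    regression = (expected_angle - true_angle) ** 2
    return cross_entropy + alpha * regression, expected_angle


# Assumed bin layout: five bins centered every 30 degrees.
centers = [-60.0, -30.0, 0.0, 30.0, 60.0]
uniform_loss, uniform_angle = multi_loss([0.0] * 5, 0.0, centers)
peaked_loss, peaked_angle = multi_loss([0.0, 0.0, 5.0, 0.0, 0.0], 0.0, centers)
```

In this sketch, logits peaked on the correct bin give a lower loss than uniform logits, while both cases yield an expected angle of roughly zero because the bins are symmetric.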
