Jitter Does Matter: Adapting Gaze Estimation to New Domains
Ruicong Liu1, Yiwei Bao1, Mingjie Xu1, Haofei Wang2, Yunfei Liu1, Feng Lu1,2,*
1State Key Laboratory of VR Technology and Systems, School of CSE, Beihang University
2Peng Cheng Laboratory, Shenzhen, China
{liuruicong, baoyiwei, xumingjie, lyunfei, lufeng}@buaa.edu.cn, wanghf@pcl.ac.cn
*Corresponding author.
Abstract
Deep neural networks have demonstrated superior performance on appearance-based gaze estimation tasks. However, due to variations in person, illumination, and background, performance degrades dramatically when the model is applied to a new domain. In this paper, we discover an interesting gaze jitter phenomenon in cross-domain gaze estimation: the gaze predictions of two similar images can deviate severely in the target domain. This is closely related to cross-domain gaze estimation tasks, yet surprisingly it has not been noticed previously. We therefore propose to use gaze jitter to analyze and optimize the gaze domain adaptation task. We find that the high-frequency component (HFC) is an important factor that leads to jitter. Based on this discovery, we add high-frequency components to input images using an adversarial attack and employ contrastive learning to encourage the model to obtain similar representations between original and perturbed data, which reduces the impact of HFC. We evaluate the proposed method on four cross-domain gaze estimation tasks, and experimental results demonstrate that it significantly reduces gaze jitter and improves gaze estimation performance in target domains.
1 Introduction
Gaze indicates the direction along which a person is looking. It has been adopted in various applications, such as semi-autonomous driving (Demiris 2007; Majaranta and Bulling 2014; Park, Jain, and Sheikh 2013) and human-robot interaction (Admoni and Scassellati 2017; Terzioğlu, Mutlu, and Şahin 2020; Wang et al. 2015). With an increasing demand for predicting user intent implicitly, appearance-based gaze estimation has attracted growing attention. To train gaze estimators with deep neural networks, a number of large-scale datasets have been proposed (Zhang et al. 2020, 2017; Funes Mora, Monay, and Odobez 2014; Kellnhofer et al. 2019).
However, due to variations in subjects, backgrounds, and illumination, the performance of deep learning-based gaze estimation algorithms deteriorates significantly when a model trained on one dataset is applied to new datasets. Recently, several techniques have been applied to this cross-domain problem, such as adversarial learning (Tzeng et al. 2017; Cui et al. 2020), few-shot learning (Park et al. 2019; Yu, Liu, and Odobez 2019), and self-training (Cai, Lu, and Sato 2020). Among them, unsupervised domain adaptation (UDA) (Wang et al. 2019; Kellnhofer et al. 2019; Liu et al. 2021c) is one of the most promising approaches and has attracted much attention. While requiring no labels makes UDA more applicable to real-world scenarios, it also makes the task more challenging.
Existing approaches usually optimize gaze accuracy during adaptation directly. Instead, we design an approach that starts from the analysis of a phenomenon we observed in cross-domain tests. From this phenomenon we can look for the factors that cause the problem, and those factors then guide us toward a more explainable solution for domain adaptation.
In this paper, we observe the gaze jitter phenomenon: the predicted gazes of two very similar images can deviate severely from each other (shown in Fig. 1), particularly when crossing domains. As shown in Fig. 1, on the test set of the source domain, the model gives similar predictions when the input images are similar. In contrast, in the target domain, even if the input images are very similar, the model may still give predictions that deviate severely. We name this phenomenon gaze jitter; we consider it a manifestation of gaze error across domains, and we use it as a starting point to find a solution for domain adaptation.
Based on the above observation, we analyze why the gaze jitter phenomenon occurs and discover an important factor, the high-frequency component (HFC), which introduces the gaze jitter problem and lowers gaze estimation accuracy. Inspired by this, we propose our gaze adaptation framework. First, the framework adds additive HFC to the input data; it then employs contrastive learning to keep the representations of the original and perturbed data consistent, making the model learn features that are less affected by the HFC (a minimal sketch of this pipeline appears at the end of this section). Our method leads to significant jitter reduction and performance improvement on various cross-domain gaze estimation tasks. The primary contributions of this paper are summarized as follows:
For the first time, we discover the gaze jitter problem on cross-domain gaze estimation tasks, and we find that the high-frequency component is an important factor introducing jitter.
We propose a framework for cross-domain gaze estimation that suppresses the influence of the high-frequency component, resulting in less jitter and better cross-domain gaze estimation accuracy.
Experimental results demonstrate that our method achieves exceptional performance on four gaze domain adaptation tasks using only a small number of target images.

[Figure 1 panels: a gaze estimation model trained on the source domain is tested on image pairs from frames T and T+1 with similar gaze labels; annotated angular differences are 0.42° and 0.85° in the source domain versus 0.49° and 6.79° in the target domain.]
Figure 1: We observe gaze jitter during cross-domain gaze estimation. Even though similar input images are expected to yield close gaze directions, the predicted output can deviate severely in the target domain (bottom right). We find such gaze jitter a good indicator to help analyze and optimize cross-domain gaze estimation.
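To make the pipeline concrete, the following is a minimal PyTorch-style sketch of one adaptation step, assuming an FGSM-style perturbation as the source of additive HFC and an InfoNCE-style contrastive loss. The names (`model.backbone`, `proj_head`, `eps`, `tau`) and the label-free attack objective are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def add_hfc(model, images, eps=0.01):
    """FGSM-style step: craft a high-frequency perturbation for unlabeled images.
    With no target labels, we use the prediction's squared norm as a surrogate
    objective; the paper's exact attack objective may differ."""
    images = images.clone().detach().requires_grad_(True)
    loss = model(images).pow(2).sum()
    grad = torch.autograd.grad(loss, images)[0]
    return (images + eps * grad.sign()).detach()

def adaptation_step(model, proj_head, images, optimizer, tau=0.1):
    """One unsupervised adaptation step on a batch of unlabeled target images."""
    perturbed = add_hfc(model, images)
    z1 = F.normalize(proj_head(model.backbone(images)), dim=1)
    z2 = F.normalize(proj_head(model.backbone(perturbed)), dim=1)
    # InfoNCE: (original, perturbed) pairs are positives; all other
    # cross-batch pairs act as negatives.
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design point is that the perturbation injects exactly the kind of HFC that causes jitter, so pulling the two representations together directly suppresses the model's sensitivity to it.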
2 Related Work
2.1 Appearance-based gaze estimation
Appearance-based gaze estimation aims to predict the human gaze from appearance. Zhang et al. proposed the first CNN-based gaze estimation method (Zhang et al. 2017), which uses eye images. With the release of many large-scale gaze datasets (Zhang et al. 2020; Funes Mora, Monay, and Odobez 2014; Kellnhofer et al. 2019; Zhang et al. 2017), appearance-based gaze estimation has attracted more and more attention, and many methods have been proposed to estimate accurate gaze on public datasets (Cheng et al. 2020; Guo et al. 2020; Shrivastava et al. 2017; Wang et al. 2019).
However, most studies focus on gaze estimation within a single dataset (Lu et al. 2014; Park et al. 2019; Yu, Liu, and Odobez 2019). Due to the diversity of datasets, almost all gaze estimation methods suffer from poor cross-domain capability (Cheng et al. 2020; Wang et al. 2019). Recent works (Liu et al. 2021c; Zhang et al. 2020) investigated the cross-domain capability of gaze estimators, which improves their applicability to real-world scenes.
2.2 Unsupervised domain adaptation
Unsupervised domain adaptation (UDA) is a transfer learning task that requires no target labels. Previous UDA approaches can be divided into three categories: discrepancy, reconstruction, and adversarial methods. Discrepancy methods aim to minimize the domain gap using distance metrics such as Maximum Mean Discrepancy (MMD) (Ghifary, Kleijn, and Zhang 2014) and Local Maximum Mean Discrepancy (LMMD) (Zhu et al. 2020). Reconstruction methods (Glorot, Bordes, and Bengio 2011; Bousmalis et al. 2016) use a reconstruction strategy that allows a model to learn features from both domains (Wang and Deng 2018). Adversarial methods are inspired by the generative adversarial network (GAN) (Goodfellow et al. 2014): in (Ganin and Lempitsky 2015; Tzeng et al. 2017; Cui et al. 2020; Yu et al. 2019), a domain discriminator and a generator play a min-max game, thereby explicitly reducing the distance between the source and target domains.
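As a concrete example of such a distance metric, a biased estimate of squared MMD with an RBF kernel can be computed as follows (a minimal sketch; the kernel choice and bandwidth `sigma` are illustrative assumptions):

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased estimate of squared MMD between feature batches x and y
    using an RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()
```

Minimizing this quantity over features from the two domains pulls their distributions together in the kernel's feature space.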
However, most existing UDA methods have been designed for classification or semantic segmentation tasks. Gaze estimation, in contrast, is a typical regression task; its continuous label space makes adaptation even more challenging.
2.3 Adversarial attack
The goal of an adversarial attack is to generate adversarial noise. Although this noise is a type of high-frequency component that usually does not affect human cognition, recent studies (Goodfellow, Shlens, and Szegedy 2014; Szegedy et al. 2013) have shown that deep neural networks are highly vulnerable to it. While a number of adversarial attack methods have been proposed in the past few years (Moosavi-Dezfooli, Fawzi, and Frossard 2016; Su, Vargas, and Sakurai 2019), they mainly follow the two ideas proposed by (Goodfellow, Shlens, and Szegedy 2014) and (Madry et al. 2017a). Recently, adversarial attacks have been applied to UDA tasks in various fields (Ma et al. 2021; Yang et al. 2021; Madry et al. 2017b; Liu et al. 2021a), which suggests their potential for gaze estimation.
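For reference, both ideas can be sketched in a few lines: FGSM (Goodfellow, Shlens, and Szegedy 2014) takes a single gradient-sign step, and PGD (Madry et al. 2017a) iterates that step with projection onto an L-infinity ball. A minimal PGD sketch follows; the step sizes are illustrative defaults, and FGSM is the special case steps=1, alpha=eps.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD: repeat gradient-sign steps, projecting back onto the L-inf
    eps-ball around x. With steps=1 and alpha=eps it reduces to FGSM."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)                # task loss to increase
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()   # gradient-sign step
        x_adv = x + (x_adv - x).clamp(-eps, eps)       # project onto eps-ball
        x_adv = x_adv.clamp(0, 1).detach()             # keep valid pixel range
    return x_adv
```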
2.4 Contrastive learning
In UDA, contrastive learning is usually used to help the model learn better representations. It encourages augmentations of the same input to have more similar representations than augmentations of different inputs. Common contrastive learning frameworks achieve this by constructing two kinds of pairs: positive pairs containing similar instances and negative pairs containing different instances. The framework then maximizes consistency over the positive pairs and pushes apart samples from the negative pairs. Recent contrastive learning studies, e.g., Memory Bank (Wu et al. 2018), MoCo (He et al. 2020), SimCLR (Chen et al. 2020), and PCL (Li et al. 2020), have achieved considerable improvements on downstream tasks.
Contrastive learning has also been used for UDA in tasks such as action recognition (Kang et al. 2020) and semantic segmentation (Liu et al. 2021b). Its significant effect shows the capability of contrastive learning to learn useful representations.
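A minimal sketch of such a contrastive objective is given below: a simplified SimCLR-style NT-Xent loss that, unlike the full formulation, treats only cross-view samples as negatives. The temperature `tau` is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """Simplified SimCLR-style loss: (z1[i], z2[i]) form positive pairs,
    all other cross-view samples in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                  # cosine similarities / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    # Symmetric: each view must identify its counterpart within the batch.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```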
3 Motivation
In this section, we show that gaze jitter is a significant phenomenon in cross-domain gaze estimation. We then identify one important factor introducing jitter: the high-frequency component (HFC).
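To make the notion of HFC concrete, one common way to isolate the high-frequency component of an image, which we assume here purely for illustration, is to zero out low frequencies in the Fourier domain (the cutoff `radius` is an arbitrary choice):

```python
import torch

def high_frequency_component(img, radius=8):
    """Isolate the HFC of an image tensor (..., H, W) by zeroing all
    frequencies within `radius` of the spectrum center."""
    freq = torch.fft.fftshift(torch.fft.fft2(img))   # centered spectrum
    h, w = img.shape[-2:]
    ys = torch.arange(h).view(-1, 1) - h // 2
    xs = torch.arange(w).view(1, -1) - w // 2
    low = (ys ** 2 + xs ** 2) <= radius ** 2         # low-frequency mask
    freq[..., low] = 0                               # discard low frequencies
    return torch.fft.ifft2(torch.fft.ifftshift(freq)).real
```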