Jitter Does Matter: Adapting Gaze Estimation to New Domains
Ruicong Liu1, Yiwei Bao1, Mingjie Xu1, Haofei Wang2, Yunfei Liu1, Feng Lu1,2,*
1State Key Laboratory of VR Technology and Systems, School of CSE, Beihang University
2Peng Cheng Laboratory, Shenzhen, China
{liuruicong, baoyiwei, xumingjie, lyunfei, lufeng}@buaa.edu.cn, wanghf@pcl.ac.cn
*Corresponding author.
Abstract
Deep neural networks have demonstrated superior performance on appearance-based gaze estimation tasks. However, due to variations in person, illumination, and background, performance degrades dramatically when the model is applied to a new domain. In this paper, we discover an interesting gaze jitter phenomenon in cross-domain gaze estimation: the gaze predictions of two similar images can deviate severely in the target domain. This is closely related to cross-domain gaze estimation tasks, yet surprisingly it has not been noticed previously. We therefore propose to use gaze jitter to analyze and optimize the gaze domain adaptation task. We find that the high-frequency component (HFC) is an important factor that leads to jitter. Based on this discovery, we add high-frequency components to input images using an adversarial attack and employ contrastive learning to encourage the model to obtain similar representations between original and perturbed data, which reduces the impact of HFC. We evaluate the proposed method on four cross-domain gaze estimation tasks, and experimental results demonstrate that it significantly reduces gaze jitter and improves gaze estimation performance in target domains.
1 Introduction
Gaze indicates the direction along which a person is looking. It has been adopted in various applications, such as semi-autonomous driving (Demiris 2007; Majaranta and Bulling 2014; Park, Jain, and Sheikh 2013) and human-robot interaction (Admoni and Scassellati 2017; Terzioğlu, Mutlu, and Şahin 2020; Wang et al. 2015). With an increasing demand for predicting user intent implicitly, appearance-based gaze estimation has attracted growing attention. To train gaze estimators with deep neural networks, a number of large-scale datasets have been proposed (Zhang et al. 2020, 2017; Funes Mora, Monay, and Odobez 2014; Kellnhofer et al. 2019).
However, due to variations in subjects, backgrounds, and illumination, the performance of deep learning-based gaze estimation algorithms deteriorates significantly when a model trained on one dataset is applied to new datasets. Recently, several techniques have been applied to this cross-domain problem, such as adversarial learning (Tzeng et al. 2017; Cui et al. 2020), few-shot learning (Park et al. 2019; Yu, Liu, and Odobez 2019), and self-training (Cai, Lu, and Sato 2020). Among them, unsupervised domain adaptation (UDA) (Wang et al. 2019; Kellnhofer et al. 2019; Liu et al. 2021c) is one of the most promising approaches and has attracted much attention. While requiring no labels makes UDA more applicable to real-world scenarios, it also makes the task more challenging.
Existing approaches usually optimize gaze accuracy during adaptation directly. Instead, we design an approach that starts from the analysis of a phenomenon we observed in cross-domain tests. From this phenomenon we can look for the factors that cause the problem, and those factors then guide us toward a more explainable solution for domain adaptation.
In this paper, we observe the gaze jitter phenomenon: the predicted gazes of two very similar images can deviate severely from each other (shown in Fig. 1), particularly when crossing domains. As shown in Fig. 1, on the test set of the source domain, the model gives similar predictions when the input images are similar. In contrast, in the target domain, even if the input images are very similar, the model may still give predictions that deviate severely. We name this phenomenon gaze jitter; we consider it a manifestation of gaze error across domains, and we use it as a starting point to find a solution for domain adaptation.
Based on the above observation, we analyze why the gaze jitter phenomenon occurs and discover an important factor, the high-frequency component (HFC), which introduces the gaze jitter problem and lowers gaze estimation accuracy. Inspired by this, we propose our gaze adaptation framework. First, the framework adds additive HFC to the input data; it then employs contrastive learning to keep the representations of the original and perturbed data consistent, making the model learn features that are less affected by the HFC (a minimal sketch of this pipeline appears at the end of this section). Our method leads to significant jitter reduction and performance improvement on various cross-domain gaze estimation tasks. The primary contributions of this paper are summarized as follows:
For the first time, we discover the gaze jitter problem on cross-domain gaze estimation tasks, and we find that the high-frequency component is an important factor introducing jitter.
We propose a framework for cross-domain gaze estimation that suppresses the influence of the high-frequency component, resulting in less jitter and better cross-domain gaze estimation accuracy.
Experimental results demonstrate that our method achieves exceptional performance on four gaze domain adaptation tasks using only a small number of target images.

[Figure 1 panels: a gaze estimation model trained on the source domain is tested on image pairs from frames T and T+1 with similar gaze labels; annotated angular differences are 0.42° and 0.85° in the source domain versus 0.49° and 6.79° in the target domain.]
Figure 1: We observe gaze jitter during cross-domain gaze estimation. Even though similar input images are expected to yield close gaze directions, the predicted output can deviate severely in the target domain (bottom right). We find such gaze jitter a good indicator to help analyze and optimize cross-domain gaze estimation.
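To make the pipeline concrete, the following is a minimal PyTorch-style sketch of one adaptation step, assuming an FGSM-style perturbation as the source of additive HFC and an InfoNCE-style contrastive loss. The names (`model.backbone`, `proj_head`, `eps`, `tau`) and the label-free attack objective are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def add_hfc(model, images, eps=0.01):
    """FGSM-style step: craft a high-frequency perturbation for unlabeled images.
    With no target labels, we use the prediction's squared norm as a surrogate
    objective; the paper's exact attack objective may differ."""
    images = images.clone().detach().requires_grad_(True)
    loss = model(images).pow(2).sum()
    grad = torch.autograd.grad(loss, images)[0]
    return (images + eps * grad.sign()).detach()

def adaptation_step(model, proj_head, images, optimizer, tau=0.1):
    """One unsupervised adaptation step on a batch of unlabeled target images."""
    perturbed = add_hfc(model, images)
    z1 = F.normalize(proj_head(model.backbone(images)), dim=1)
    z2 = F.normalize(proj_head(model.backbone(perturbed)), dim=1)
    # InfoNCE: (original, perturbed) pairs are positives; all other
    # cross-batch pairs act as negatives.
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design point is that the perturbation injects exactly the kind of HFC that causes jitter, so pulling the two representations together directly suppresses the model's sensitivity to it.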
2 Related Work
2.1 Appearance-based gaze estimation
Appearance-based gaze estimation aims to predict the human gaze from appearance. Zhang et al. proposed the first CNN-based gaze estimation method (Zhang et al. 2017), which uses eye images. With the release of many large-scale gaze datasets (Zhang et al. 2020; Funes Mora, Monay, and Odobez 2014; Kellnhofer et al. 2019; Zhang et al. 2017), appearance-based gaze estimation has attracted more and more attention, and many methods have been proposed to estimate accurate gaze on public datasets (Cheng et al. 2020; Guo et al. 2020; Shrivastava et al. 2017; Wang et al. 2019).
However, most studies focus on gaze estimation within a single dataset (Lu et al. 2014; Park et al. 2019; Yu, Liu, and Odobez 2019). Due to the diversity of datasets, almost all gaze estimation methods suffer from poor cross-domain capability (Cheng et al. 2020; Wang et al. 2019). Recent works (Liu et al. 2021c; Zhang et al. 2020) investigated the cross-domain capability of gaze estimators, which improves their applicability to real-world scenes.
2.2 Unsupervised domain adaptation
Unsupervised domain adaptation (UDA) is a transfer learning task that requires no target labels. Previous UDA approaches can be divided into three categories: discrepancy, reconstruction, and adversarial methods. Discrepancy methods aim to minimize the domain gap using distance metrics such as Maximum Mean Discrepancy (MMD) (Ghifary, Kleijn, and Zhang 2014) and Local Maximum Mean Discrepancy (LMMD) (Zhu et al. 2020). Reconstruction methods (Glorot, Bordes, and Bengio 2011; Bousmalis et al. 2016) use a reconstruction strategy that allows a model to learn features from both domains (Wang and Deng 2018). Adversarial methods are inspired by the generative adversarial network (GAN) (Goodfellow et al. 2014): in (Ganin and Lempitsky 2015; Tzeng et al. 2017; Cui et al. 2020; Yu et al. 2019), a domain discriminator and a generator play a min-max game, thereby explicitly reducing the distance between the source and target domains.
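As a concrete example of such a distance metric, a biased estimate of squared MMD with an RBF kernel can be computed as follows (a minimal sketch; the kernel choice and bandwidth `sigma` are illustrative assumptions):

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased estimate of squared MMD between feature batches x and y
    using an RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()
```

Minimizing this quantity over features from the two domains pulls their distributions together in the kernel's feature space.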
However, most existing UDA methods have been designed for classification or semantic segmentation tasks. Gaze estimation, in contrast, is a typical regression task; its continuous label space makes adaptation even more challenging.
2.3 Adversarial attack
The goal of an adversarial attack is to generate adversarial noise. Although this noise is a type of high-frequency component that usually does not affect human cognition, recent studies (Goodfellow, Shlens, and Szegedy 2014; Szegedy et al. 2013) have shown that deep neural networks are highly vulnerable to it. While a number of adversarial attack methods have been proposed in the past few years (Moosavi-Dezfooli, Fawzi, and Frossard 2016; Su, Vargas, and Sakurai 2019), they mainly follow the two ideas proposed by (Goodfellow, Shlens, and Szegedy 2014) and (Madry et al. 2017a). Recently, adversarial attacks have been applied to UDA tasks in various fields (Ma et al. 2021; Yang et al. 2021; Madry et al. 2017b; Liu et al. 2021a), which suggests their potential for gaze estimation.
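For reference, both ideas can be sketched in a few lines: FGSM (Goodfellow, Shlens, and Szegedy 2014) takes a single gradient-sign step, and PGD (Madry et al. 2017a) iterates that step with projection onto an L-infinity ball. A minimal PGD sketch follows; the step sizes are illustrative defaults, and FGSM is the special case steps=1, alpha=eps.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD: repeat gradient-sign steps, projecting back onto the L-inf
    eps-ball around x. With steps=1 and alpha=eps it reduces to FGSM."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)                # task loss to increase
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()   # gradient-sign step
        x_adv = x + (x_adv - x).clamp(-eps, eps)       # project onto eps-ball
        x_adv = x_adv.clamp(0, 1).detach()             # keep valid pixel range
    return x_adv
```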
2.4 Contrastive learning
In UDA, contrastive learning is usually used to help the model learn better representations. It encourages augmentations of the same input to have more similar representations than augmentations of different inputs. Common contrastive learning frameworks achieve this by constructing two kinds of pairs: positive pairs containing similar instances and negative pairs containing different instances. The framework then maximizes consistency over the positive pairs and pushes apart samples from the negative pairs. Recent contrastive learning studies, e.g., Memory Bank (Wu et al. 2018), MoCo (He et al. 2020), SimCLR (Chen et al. 2020), and PCL (Li et al. 2020), have achieved considerable improvements on downstream tasks.
Contrastive learning has also been used for UDA in tasks such as action recognition (Kang et al. 2020) and semantic segmentation (Liu et al. 2021b). Its significant effect shows the capability of contrastive learning to learn useful representations.
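A minimal sketch of such a contrastive objective is given below: a simplified SimCLR-style NT-Xent loss that, unlike the full formulation, treats only cross-view samples as negatives. The temperature `tau` is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """Simplified SimCLR-style loss: (z1[i], z2[i]) form positive pairs,
    all other cross-view samples in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                  # cosine similarities / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    # Symmetric: each view must identify its counterpart within the batch.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```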
3 Motivation
In this section, we show that gaze jitter is a significant phenomenon in cross-domain gaze estimation. We then identify one important factor introducing jitter: the high-frequency component (HFC).
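To make the notion of HFC concrete, one common way to isolate the high-frequency component of an image, which we assume here purely for illustration, is to zero out low frequencies in the Fourier domain (the cutoff `radius` is an arbitrary choice):

```python
import torch

def high_frequency_component(img, radius=8):
    """Isolate the HFC of an image tensor (..., H, W) by zeroing all
    frequencies within `radius` of the spectrum center."""
    freq = torch.fft.fftshift(torch.fft.fft2(img))   # centered spectrum
    h, w = img.shape[-2:]
    ys = torch.arange(h).view(-1, 1) - h // 2
    xs = torch.arange(w).view(1, -1) - w // 2
    low = (ys ** 2 + xs ** 2) <= radius ** 2         # low-frequency mask
    freq[..., low] = 0                               # discard low frequencies
    return torch.fft.ifft2(torch.fft.ifftshift(freq)).real
```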