
Jitter Does Matter: Adapting Gaze Estimation to New Domains
Ruicong Liu1, Yiwei Bao1, Mingjie Xu1, Haofei Wang2, Yunfei Liu1, Feng Lu1,2,*
1State Key Laboratory of VR Technology and Systems, School of CSE, Beihang University
2Peng Cheng Laboratory, Shenzhen, China
{liuruicong, baoyiwei, xumingjie, lyunfei, lufeng}@buaa.edu.cn wanghf@pcl.ac.cn
*Corresponding Author.
Abstract
Deep neural networks have demonstrated superior performance on appearance-based gaze estimation tasks. However, due to variations in subjects, illumination, and background, performance degrades dramatically when the model is applied to a new domain. In this paper, we identify an interesting gaze jitter phenomenon in cross-domain gaze estimation, i.e., the gaze predictions for two similar images can deviate severely in the target domain. This phenomenon is closely related to cross-domain gaze estimation, yet surprisingly it has not been noticed before. We therefore propose to use gaze jitter to analyze and optimize the gaze domain adaptation task. We find that the high-frequency component (HFC) is an important factor leading to jitter. Based on this finding, we add high-frequency components to input images via adversarial attack and employ contrastive learning to encourage the model to obtain similar representations for the original and perturbed data, which reduces the impact of HFC. We evaluate the proposed method on four cross-domain gaze estimation tasks; experimental results demonstrate that it significantly reduces gaze jitter and improves gaze estimation performance in the target domains.
1 Introduction
Gaze indicates the direction along which a person is looking. It has been adopted in various applications, such as semi-autonomous driving (Demiris 2007; Majaranta and Bulling 2014; Park, Jain, and Sheikh 2013) and human-robot interaction (Admoni and Scassellati 2017; Terzioğlu, Mutlu, and Şahin 2020; Wang et al. 2015). With an increasing demand for predicting user intent implicitly, appearance-based gaze estimation has attracted growing attention. To train gaze estimators with deep neural networks, a number of large-scale datasets have been proposed (Zhang et al. 2020, 2017; Funes Mora, Monay, and Odobez 2014; Kellnhofer et al. 2019).
However, due to variations in subjects, backgrounds, and illumination, the performance of deep learning-based gaze estimation deteriorates significantly when a model trained on one dataset is applied to new datasets. Recently, several techniques have been applied to address this cross-domain problem, such as adversarial learning (Tzeng et al. 2017; Cui et al. 2020), few-shot learning (Park et al. 2019; Yu, Liu, and Odobez 2019), and self-training (Cai, Lu, and Sato 2020). Among them, unsupervised domain adaptation (UDA) (Wang et al. 2019; Kellnhofer et al. 2019; Liu et al. 2021c) is one of the promising approaches that has attracted much attention. While requiring no labels makes it more applicable to real-world scenarios, it also makes the task more challenging.
Existing approaches usually optimize gaze accuracy directly during adaptation. Instead, we design an approach that starts from the analysis of a phenomenon we observed in cross-domain tests. From this phenomenon we can look for the factors that cause the problem, and these factors can then guide us toward a more explainable solution for domain adaptation.
In this paper, we observe the gaze jitter phenomenon: two very similar images can receive severely deviated gaze predictions (shown in Fig. 1), particularly when crossing domains. As shown in Fig. 1, on the test set of the source domain, the model gives similar predictions when the input images are similar. In contrast, in the target domain, even if the input images are very similar, the model may still give severely deviated predictions. We name this phenomenon gaze jitter; we regard it as a manifestation of cross-domain gaze error and use it as the starting point for finding a domain adaptation solution.
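To make the notion concrete, gaze jitter between two similar inputs can be quantified as the angular deviation between their predicted gaze directions. The following PyTorch sketch is our own illustration, assuming the estimator outputs 3D gaze direction vectors; it is not taken from the paper.

```python
# Illustrative only: quantifying gaze jitter as the angular deviation
# between predictions for two nearly identical images (assumes the model
# outputs 3D gaze direction vectors).
import torch
import torch.nn.functional as F

def angular_error_deg(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """Angle in degrees between two batches of 3D gaze vectors."""
    g1 = F.normalize(g1, dim=1)
    g2 = F.normalize(g2, dim=1)
    cos = (g1 * g2).sum(dim=1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos))

# e.g. jitter = angular_error_deg(model(frame_t), model(frame_t_plus_1))
# for two consecutive or otherwise very similar frames
```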
Based on the above observation, we analyze why the gaze jitter phenomenon occurs and discover an important factor, the high-frequency component (HFC), which introduces the gaze jitter problem and lowers gaze estimation accuracy. Inspired by this, we propose our gaze adaptation framework. It first adds HFC to the input data, then employs contrastive learning to keep the original and perturbed data consistent, making the model learn features that are less affected by HFC.
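The sketch below illustrates this pipeline in PyTorch. It is a minimal sketch under our own assumptions, not the authors' implementation: the model is assumed to return a (features, gaze) pair, the attack is a single FGSM-style step whose objective is a label-free placeholder (the paper's exact attack may differ), and the consistency term is a simple negative cosine similarity standing in for the contrastive loss.

```python
import torch
import torch.nn.functional as F

def add_hfc(model, images, eps=2.0 / 255.0):
    """Single FGSM-style step that injects a high-frequency perturbation.
    The objective below is a label-free placeholder (hypothetical): with no
    target-domain labels, we take the gradient of the prediction magnitude
    with respect to the input."""
    images = images.clone().detach().requires_grad_(True)
    _, gaze = model(images)                   # hypothetical (features, gaze) output
    loss = gaze.sum()                         # placeholder attack objective
    grad = torch.autograd.grad(loss, images)[0]
    perturbed = images + eps * grad.sign()    # sign(grad) is dominated by HFC
    return perturbed.clamp(0.0, 1.0).detach() # assumes inputs scaled to [0, 1]

def consistency_loss(model, images):
    """Pull together the representations of original and perturbed inputs;
    a simple stand-in for the paper's contrastive objective."""
    perturbed = add_hfc(model, images)
    f_orig, _ = model(images)
    f_pert, _ = model(perturbed)
    return -F.cosine_similarity(f_orig, f_pert, dim=1).mean()
```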
Our method leads to significant jitter reduction and performance improvement on various cross-domain gaze estimation tasks. The primary contributions of this paper are summarized as follows:
• For the first time, we identify the gaze jitter problem in cross-domain gaze estimation tasks. We find that the high-frequency component is an important factor introducing jitter.
• We propose a framework for cross-domain gaze estimation that injects high-frequency components via adversarial attack and employs contrastive learning to reduce their impact.