Contrastive Representation Learning for Gaze Estimation

Swati Jindal
University of California, Santa Cruz
swjindal@ucsc.edu

Roberto Manduchi
University of California, Santa Cruz
manduchi@soe.ucsc.edu
Abstract

Self-supervised learning (SSL) has become prevalent for learning representations in computer vision. Notably, SSL exploits contrastive learning to encourage visual representations to be invariant under various image transformations. The task of gaze estimation, on the other hand, demands not just invariance to various appearances but also equivariance to geometric transformations. In this work, we propose a simple contrastive representation learning framework for gaze estimation, named Gaze Contrastive Learning (GazeCLR). GazeCLR exploits multi-view data to promote equivariance and relies on selected data augmentation techniques that do not alter gaze directions for invariance learning. Our experiments demonstrate the effectiveness of GazeCLR for several settings of the gaze estimation task. In particular, our results show that GazeCLR improves the performance of cross-domain gaze estimation, yielding a relative improvement of up to 17.2%. Moreover, the GazeCLR framework is competitive with state-of-the-art representation learning methods for few-shot evaluation. The code and pre-trained models are available at https://github.com/jswati31/gazeclr.
1 Introduction

Gaze represents the focus of human attention and serves as an essential cue for non-verbal communication. While specialized gaze trackers can accurately measure a user's gaze direction, there is substantial interest in gaze estimation using regular cameras. However, learning gaze estimation models from images is challenging: the model must transcend multiple "nuisance" characteristics, such as facial features or head orientation, to estimate gaze accurately.

In recent years, deep learning [1, 2, 3] has shown promising results for gaze estimation. In part, this success stems from the availability of large-scale annotated datasets. To be valuable, such datasets must contain a wide range of gaze directions, appearances, and head poses, which makes their collection a laborious and time-consuming procedure. Moreover, gaze annotations are difficult to obtain, which makes the creation of large, representative datasets challenging [4]. Therefore, methods that facilitate training with limited gaze annotations are highly desirable.
Self-supervised learning (SSL) has gained tremendous success over the past few years and emerged as a powerful tool for reducing over-reliance on human annotations [5, 6, 7]. Following a generally accepted paradigm, we consider a pre-training stage that requires no labels, followed by a fine-tuning stage that uses a relatively small number of labeled samples. SSL is an effective approach for pre-training, in which semantically meaningful representations are learned that can be seamlessly adapted during the fine-tuning stage [8, 9, 10]. Specifically, a good pre-training would ensure that the embeddings of images associated with the same gaze direction are neighbors in the feature space, regardless of other non-relevant factors such as appearance. Arguably, this could accelerate the job of fine-tuning, possibly reducing the number of required labeled samples.
Figure 1: Overall idea. (a) The proposed two-stage learning framework for gaze estimation. Stage-I shows the Gaze Contrastive Learning (GazeCLR) framework, trained using only unlabeled data, which learns both invariance and equivariance properties. In Stage-II, the pre-trained encoder is employed for the gaze estimation task with small labeled data. (b) Two images (shown in red and green) captured at the same time from different camera views are used to create both invariant and equivariant positive pairs.
In this work, for SSL pre-training, we focus on contrastive representation learning (CRL), which aims to map "positive" pair samples to embeddings that are close to each other, while mapping "negative" pairs apart from each other [11]. A popular approach is to generate pairs by applying two different transformations (or augmentations) to an input image to form a positive pair, while different images form negative pairs. This method encourages invariance in representations w.r.t. the applied types of transformations, which are assumed to model "nuisance" effects. However, obtaining the necessary and sufficient set of "positive" and "negative" pairs remains a non-trivial and unanswered challenge for a given task. This work attempts to answer this question for gaze estimation. Recent CRL-based methods encourage the representations to be invariant to any image transformation, many of which are not suitable for gaze estimation. For example, geometry-based image transformations (such as rotation) change the gaze direction. In contrast, it is beneficial to have invariance to appearance, e.g., a person's identity, background, etc.
In this paper, we propose the Gaze Contrastive Learning (GazeCLR) framework, a simple CRL-based unsupervised pre-training approach for gaze estimation, i.e., a pre-training method that requires no gaze labels. Our approach relies on invariance to image transforms (e.g., color jitter) that do not alter gaze direction, and on equivariance to camera viewpoint, which requires additional multi-view geometry information, i.e., images of the same person obtained at the same time by two or more cameras at different locations.
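As a concrete illustration of gaze-preserving augmentations, the following is a minimal sketch of an appearance-only augmentation pipeline; the specific transforms and their parameters are our assumptions for illustration, not necessarily the exact set used by GazeCLR:

```python
# Sketch of an appearance-only augmentation set A for gaze-preserving views.
# Geometric transforms (rotation, flipping, translation) are deliberately
# excluded because they would change the gaze direction.
from torchvision import transforms

appearance_augmentations = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])

# Two independent draws of these augmentations applied to the same image
# yield a "single-view" positive pair whose gaze direction is unchanged.
```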
For learning equivariance, we leverage the fact that, in a common reference system, two or more synchronous images of the same person taken from different camera viewpoints are associated with the same gaze direction. Knowing the relative pose of each camera with respect to the common reference system relates the gaze directions defined in the respective camera spaces. In other words, gaze direction has an equivariant relationship to camera viewpoint. We claim that the requirement of using multiple cameras may be less onerous than obtaining gaze annotations for each image.
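To make the equivariance relation concrete, the sketch below shows how a gaze direction expressed in a camera's reference system maps to a common (e.g., screen) reference system via the camera-to-screen rotation; the function and variable names are illustrative assumptions:

```python
import numpy as np

def to_common_frame(gaze_cam: np.ndarray, R_screen_from_cam: np.ndarray) -> np.ndarray:
    """Rotate a unit gaze vector from a camera's coordinate system into the
    common (screen) coordinate system. R_screen_from_cam is the 3x3 rotation
    relating that camera to the screen, known from extrinsic calibration."""
    return R_screen_from_cam @ gaze_cam

# For synchronous views of the same person, the rotated gaze vectors agree:
# to_common_frame(g_cam1, R1) ≈ to_common_frame(g_cam2, R2),
# where g_cam1, g_cam2 are the gaze directions in camera 1 and 2 coordinates.
```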
We use an existing multi-view gaze dataset (EVE [12]), which provides video sequences captured by four calibrated and synchronized cameras and contains gaze annotations obtained using a gaze tracking device [13]. We ignore these labels during pre-training and use them only for fine-tuning and evaluation. Note that the relative camera pose information available with the EVE dataset is used only during the pre-training stage. Figure 1 presents an overview of the proposed idea.
To evaluate GazeCLR, we perform self-supervised pre-training on the EVE dataset and transfer the learned representations to the gaze estimation task in various evaluation settings. We demonstrate the effectiveness of the representations by showing that the proposed method achieves superior performance on both within-dataset and cross-dataset (MPIIGaze [2] and Columbia [14]) evaluations, using only a small number of labeled samples for fine-tuning. Our major contributions are summarized as follows:
1. We propose a simple contrastive learning method for gaze estimation that relies on the observation that gaze direction is invariant under selected appearance transformations and equivariant to any two camera viewpoints.

2. We argue for learning equivariant representations by taking advantage of multi-view data, which can be seamlessly collected using multiple cameras.

3. Our empirical evaluations show that GazeCLR yields improvements for various settings of gaze estimation and is competitive with existing supervised [15] and unsupervised state-of-the-art gaze representation learning methods [16, 17].
2 Proposed Method
2.1 Stage-I: Gaze Contrastive Learning (GazeCLR) Framework
GazeCLR is a framework to train an encoder whose embeddings exhibit the desired set of invariance and equivariance properties for the gaze estimation task. As stated earlier, the key intuition of GazeCLR is that we want to enforce invariance using selected appearance transformations (e.g., color jitter) and equivariance using synchronous images of the same person captured from multiple camera viewpoints. Similar to previous SSL approaches [6, 18], we rely on the normalized temperature-scaled cross-entropy loss (NT-Xent) [6] to encourage invariance or equivariance by maximizing the agreement between positive pairs and the disagreement between negative pairs. In particular, we devise two variants of the NT-Xent loss, namely $\mathcal{L}_I$ for invariance and $\mathcal{L}_E$ for equivariance.
The GazeCLR framework has three sub-modules: a CNN-based encoder and two MLP-based projection heads, as illustrated in Figure 2. The output of the encoder branches out into different projection heads depending on the type of input positive pair. To respect the invariance of gaze direction, we consider only appearance-based augmentations, denoted as $\mathcal{A}$.
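For concreteness, a minimal PyTorch-style sketch of an encoder with two projection heads is shown below; the backbone choice, embedding sizes, and layer widths are illustrative assumptions rather than the exact configuration used in the paper:

```python
import torch
import torch.nn as nn
import torchvision

class GazeCLRModel(nn.Module):
    """CNN encoder with two MLP projection heads: p1 for the invariance loss,
    p2 for the equivariance loss (its output is reshaped to 3 x d' so that a
    3x3 rotation matrix can act on it)."""
    def __init__(self, feat_dim: int = 128, d_equi: int = 32):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)  # assumed backbone
        backbone.fc = nn.Identity()                            # 512-d features
        self.encoder = backbone
        self.p1 = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                                nn.Linear(256, feat_dim))
        self.p2 = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                                nn.Linear(256, 3 * d_equi))
        self.d_equi = d_equi

    def forward(self, x: torch.Tensor, head: str = "p1") -> torch.Tensor:
        h = self.encoder(x)
        if head == "p1":                      # invariance branch
            return self.p1(h)
        z = self.p2(h)                        # equivariance branch
        return z.view(-1, 3, self.d_equi)     # shape (B, 3, d') for rotation
```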
Let $\{I_{v_i,t}\}_{i=1}^{|V|}$ be the synchronous frames at timestamp $t$ coming from the different camera views $\{v_i\}_{i=1}^{|V|}$. We then create the following positive pairs:
1. Single-view positive pairs: We apply two randomly sampled augmentations from $\mathcal{A}$ to create a single-view positive pair. Specifically, for any image $I_{v_i,t}$ at a given timestamp $t$ and view $v_i$, we sample two augmentations $a$ and $a'$ from $\mathcal{A}$; then $(a(I_{v_i,t}), a'(I_{v_i,t}))$ forms a positive pair for learning invariance. The left branch of Figure 2 shows one such positive pair for view $v_1$.
2. Multi-view positive pairs: We consider all unique pairs of camera viewpoints at the same timestamp $t$ and apply random augmentations from $\mathcal{A}$, i.e., $\{(a_i(I_{v_i,t}), a_j(I_{v_j,t})) \mid i, j \in \{1, \dots, |V|\},\ i \neq j\}$. The corresponding outputs of the encoder are passed through the projection head $p_2$ and multiplied by an appropriate rotation matrix to learn equivariance.
Next, to construct negative pairs, we do not sample them explicitly but instead use all other samples in the mini-batch as negative examples, similar to Chen et al. [6]. The exact formulation of the two loss functions $\mathcal{L}_I$ and $\mathcal{L}_E$ is described below. For brevity, we omit $t$ from $I_{v_i,t}$ and the augmentation $a$ in the following subsections.
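To summarize the pair construction, here is a schematic sketch of how single-view and multi-view positive pairs could be assembled from a set of synchronized frames; `build_positive_pairs`, `frames`, and `augmentations` are hypothetical names used only for illustration:

```python
import itertools
import random

def build_positive_pairs(frames, augmentations):
    """frames: dict mapping view index -> image captured at the same timestamp.
    augmentations: list of gaze-preserving (appearance-only) transforms."""
    single_view, multi_view = [], []

    # Single-view pairs: two independent augmentations of the same image.
    for v, img in frames.items():
        a, a_prime = random.choices(augmentations, k=2)
        single_view.append((a(img), a_prime(img), v))

    # Multi-view pairs: all pairs of distinct views at this timestamp.
    for vi, vj in itertools.permutations(frames.keys(), 2):
        ai, aj = random.choices(augmentations, k=2)
        multi_view.append((ai(frames[vi]), aj(frames[vj]), vi, vj))

    return single_view, multi_view
```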
Single-View Learning. Recall that the goal of single-view learning is to induce invariance among representations under various appearance transformations. Let $v_i \in V$ be any view and $b \in [1, \dots, B]$ be the batch index. Given a batch of size $B$, we apply two augmentations to each sample in the batch, yielding $2B$ augmented images; for each sample, we have one positive pair and $2(B-1)$ negative pairs stemming from the remaining samples in the batch. Our encoder $E$ extracts representations for all $2B$ augmented images, which are further mapped by the projection head $p_1(\cdot)$, yielding embeddings $\{z^{b}_{v_i}, z'^{b}_{v_i}\}_{b=1}^{B}$. With the above notation, for any view $v_i$, the proposed invariance loss $\mathcal{L}_I$ associated with a positive pair $(z^{b}_{v_i}, z'^{b}_{v_i})$ is given as follows:
$$\mathcal{L}_I(z^{b}_{v_i}, z'^{b}_{v_i}) = -\log \frac{\mathrm{sim}(z^{b}_{v_i}, z'^{b}_{v_i})}{\sum_{l=1}^{B} \mathbb{1}_{[l \neq b]}\,\mathrm{sim}(z^{b}_{v_i}, z^{l}_{v_i}) + \sum_{l=1}^{B} \mathrm{sim}(z^{b}_{v_i}, z'^{l}_{v_i})} \quad (1)$$
Figure 2: Method schematic. For synchronous view frames $\{I_{v_i}\}_{i=1}^{4}$, the figure illustrates invariant and equivariant positive pairs anchored only on view $v_1$. The left branch shows single-view learning ($\mathcal{L}_I$) and the right branch illustrates multi-view learning using four views ($\mathcal{L}_E$). All images (after augmentation, $a \in \mathcal{A}$) are passed through a shared CNN encoder network, followed by MLP projectors (either $p_1$ or $p_2$) depending on the type of input positive pair. The embeddings for multi-view learning are further multiplied by an appropriate rotation matrix. More details in Section 2.1.
where $z^{b}_{v_i} = p_1(E(I^{b}_{v_i}))$, $z'^{b}_{v_i} = p_1(E(I'^{b}_{v_i}))$, $\mathrm{sim}(r, s) = \exp\!\left(\frac{1}{\tau} \frac{r^{T} s}{\|r\| \cdot \|s\|}\right)$, $\mathbb{1}_{[l \neq b]}$ is an indicator function, and $\tau$ is the temperature parameter. It is worth noting that to minimize the loss in Eq. 1, $z^{b}_{v_i}$ and $z'^{b}_{v_i}$ must be close to each other, which aligns with our goal of learning invariance to appearance transformations. One challenge, however, is the risk of collapse (e.g., the network could simply learn each person's identity). To avoid this, we create mini-batches such that all samples in a batch are taken from a single participant.
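As an illustration of Eq. 1, the following is a minimal sketch of the invariance (NT-Xent-style) loss over a batch of paired embeddings; it assumes the similarity defined above (exponentiated, temperature-scaled cosine similarity) and is not taken verbatim from the authors' code:

```python
import torch
import torch.nn.functional as F

def invariance_loss(z: torch.Tensor, z_prime: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z, z_prime: (B, D) embeddings of the two augmented versions of each sample
    (same camera view, single participant per batch). Implements Eq. 1 averaged
    over the batch."""
    B = z.size(0)
    z = F.normalize(z, dim=1)
    z_prime = F.normalize(z_prime, dim=1)

    sim_zz = torch.exp(z @ z.t() / tau)          # sim(z^b, z^l)
    sim_zzp = torch.exp(z @ z_prime.t() / tau)   # sim(z^b, z'^l)

    pos = sim_zzp.diag()                                        # sim(z^b, z'^b)
    mask = ~torch.eye(B, dtype=torch.bool, device=z.device)     # exclude l == b
    denom = (sim_zz * mask).sum(dim=1) + sim_zzp.sum(dim=1)

    return -torch.log(pos / denom).mean()
```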
Multi-View Learning. We encourage equivariance of the gaze representations to different camera viewpoints through multi-view learning. To do so, we transform embeddings into a common reference system, chosen as the screen reference system used during the EVE data collection. Let $R^{S}_{C_{v_i}}$ be the rotation matrix relating camera viewpoint $v_i$ to the screen reference system. For each sample $I^{b}_{v_i}$ in a batch of size $B$, the positive pair is given as $(I^{b}_{v_i}, I^{b}_{v_j})$ for two distinct camera viewpoints $(v_i, v_j)$, $i \neq j$. All images for viewpoints $v_i$ and $v_j$ are first augmented, then passed through the encoder $E$ and the projection head $p_2(\cdot)$, which gives embeddings $\hat{z}^{b}_{v_i}, \hat{z}^{b}_{v_j} \in \mathbb{R}^{3 \times d'}$. These embeddings are further multiplied by the corresponding rotation matrices $R^{S}_{C_{v_i}}$ to project them into the common (screen) reference system. We denote the embeddings after rotation as $\{\bar{z}^{b}_{v_i}, \bar{z}^{b}_{v_j}\}_{b=1}^{B}$, such that $\bar{z}^{b}_{v_i} = R^{S}_{C_{v_i}} \hat{z}^{b}_{v_i}$. Therefore, for a batch of size $B$, our equivariance loss $\mathcal{L}_E$ associated with the positive pair $(\bar{z}^{b}_{v_i}, \bar{z}^{b}_{v_j})$ is as follows:
$$\mathcal{L}_E(\bar{z}^{b}_{v_i}, \bar{z}^{b}_{v_j}) = -\log \frac{\mathrm{sim}(\bar{z}^{b}_{v_i}, \bar{z}^{b}_{v_j})}{\sum_{l=1}^{B} \mathbb{1}_{[l \neq b]}\,\mathrm{sim}(\bar{z}^{b}_{v_i}, \bar{z}^{l}_{v_i}) + \sum_{l=1}^{B} \mathrm{sim}(\bar{z}^{b}_{v_i}, \bar{z}^{l}_{v_j})} \quad (2)$$
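A corresponding sketch of the equivariance loss, including the rotation of the $3 \times d'$ embeddings into the screen reference frame, might look like the following; as before, this is an illustrative assumption of one possible implementation rather than the authors' exact code:

```python
import torch

def equivariance_loss(z_hat_i: torch.Tensor, z_hat_j: torch.Tensor,
                      R_i: torch.Tensor, R_j: torch.Tensor,
                      tau: float = 0.1) -> torch.Tensor:
    """z_hat_i, z_hat_j: (B, 3, d') embeddings from projection head p2 for two
    distinct camera views. R_i, R_j: (3, 3) camera-to-screen rotation matrices.
    Implements Eq. 2 averaged over the batch."""
    B = z_hat_i.size(0)
    # Rotate into the common (screen) reference system, then flatten for cosine similarity.
    z_bar_i = torch.einsum('xy,byd->bxd', R_i, z_hat_i).reshape(B, -1)
    z_bar_j = torch.einsum('xy,byd->bxd', R_j, z_hat_j).reshape(B, -1)

    z_bar_i = torch.nn.functional.normalize(z_bar_i, dim=1)
    z_bar_j = torch.nn.functional.normalize(z_bar_j, dim=1)

    sim_ii = torch.exp(z_bar_i @ z_bar_i.t() / tau)  # sim(z̄_i^b, z̄_i^l)
    sim_ij = torch.exp(z_bar_i @ z_bar_j.t() / tau)  # sim(z̄_i^b, z̄_j^l)

    pos = sim_ij.diag()                                              # positive pair
    mask = ~torch.eye(B, dtype=torch.bool, device=z_bar_i.device)    # exclude l == b
    denom = (sim_ii * mask).sum(dim=1) + sim_ij.sum(dim=1)

    return -torch.log(pos / denom).mean()
```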
Overall loss function. Given $|V|$ camera viewpoints, we apply both the $\mathcal{L}_I$ and $\mathcal{L}_E$ loss functions to each view. Thus, our overall objective function for a batch of size $B$ becomes: