Contrastive Representation Learning for Gaze Estimation

Swati Jindal
University of California, Santa Cruz
swjindal@ucsc.edu

Roberto Manduchi
University of California, Santa Cruz
manduchi@soe.ucsc.edu
Abstract

Self-supervised learning (SSL) has become prevalent for learning representations in computer vision. Notably, SSL exploits contrastive learning to encourage visual representations to be invariant under various image transformations. The task of gaze estimation, on the other hand, demands not just invariance to various appearances but also equivariance to geometric transformations. In this work, we propose a simple contrastive representation learning framework for gaze estimation, named Gaze Contrastive Learning (GazeCLR). GazeCLR exploits multi-view data to promote equivariance and relies on selected data augmentation techniques that do not alter gaze directions for invariance learning. Our experiments demonstrate the effectiveness of GazeCLR for several settings of the gaze estimation task. In particular, our results show that GazeCLR improves the performance of cross-domain gaze estimation, yielding a relative improvement of up to 17.2%. Moreover, the GazeCLR framework is competitive with state-of-the-art representation learning methods for few-shot evaluation. The code and pre-trained models are available at https://github.com/jswati31/gazeclr.
1 Introduction

Gaze represents the focus of human attention and serves as an essential cue for non-verbal communication. While specialized gaze trackers can accurately measure a user's gaze direction, there is substantial interest in gaze estimation using regular cameras. However, learning gaze estimation models from images is challenging: the model must transcend multiple "nuisance" characteristics, such as facial features or head orientation, to estimate gaze accurately.

In recent years, deep learning [1, 2, 3] has shown promising results for gaze estimation. In part, this success stems from the availability of large-scale annotated datasets. To be valuable, such datasets must contain a wide range of gaze directions, appearances, and head poses, which makes their collection a laborious and time-consuming procedure. Moreover, gaze annotations are difficult to obtain, which makes the creation of large, representative datasets challenging [4]. Therefore, methods that facilitate training with limited gaze annotations are highly desirable.
Self-supervised learning (SSL) has gained tremendous success over the past few years and emerged as a powerful tool for reducing over-reliance on human annotations [5, 6, 7]. Following a generally accepted paradigm, we consider a pre-training stage that requires no labels, followed by a fine-tuning stage that uses a relatively small number of labeled samples. SSL is an effective approach for pre-training, in which semantically meaningful representations are learned that can be seamlessly adapted during the fine-tuning stage [8, 9, 10]. Specifically, a good pre-training would ensure that the embeddings of images associated with the same gaze direction are neighbors in the feature space, regardless of other non-relevant factors such as appearance. Arguably, this could accelerate the job of fine-tuning, possibly reducing the number of required labeled samples.
Figure 1: Overall idea. (a) The proposed two-stage learning framework for gaze estimation. Stage-I shows the Gaze Contrastive Learning (GazeCLR) framework, trained using only unlabeled data, which learns both invariance and equivariance properties. In Stage-II, the pre-trained encoder is employed for the gaze estimation task with small labeled data. (b) Two images (shown in red and green) captured at the same time from different camera views are used to create both invariant and equivariant positive pairs.
In this work, for SSL pre-training, we focus on contrastive representation learning (CRL), which aims to map "positive" pair samples to embeddings that are close to each other, while mapping "negative" pairs apart from each other [11]. A popular approach is to generate pairs by applying two different transformations (or augmentations) to an input image to form a positive pair, while different images form negative pairs. This method encourages invariance in representations w.r.t. the applied types of transformations, which are assumed to model "nuisance" effects. However, obtaining the necessary and sufficient set of "positive" and "negative" pairs remains a non-trivial and unanswered challenge for a given task. This work attempts to answer this question for gaze estimation. Recent CRL-based methods encourage the representations to be invariant to any image transformation, many of which are not suitable for gaze estimation. For example, geometry-based image transformations (such as rotation) change the gaze direction. In contrast, it is beneficial to have invariance to appearance, e.g., a person's identity, background, etc.
In this paper, we propose the Gaze Contrastive Learning (GazeCLR) framework, a simple CRL-based unsupervised pre-training approach for gaze estimation, i.e., a pre-training method that requires no gaze labels. Our approach relies on invariance to image transforms (e.g., color jitter) that do not alter gaze direction, and on equivariance to camera viewpoint, which requires additional multi-view geometry information, i.e., images of the same person obtained at the same time by two or more cameras at different locations.
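As a concrete illustration of gaze-preserving augmentations, the following is a minimal sketch of an appearance-only augmentation pipeline; the specific transforms and their parameters are our assumptions for illustration, not necessarily the exact set used by GazeCLR:

```python
# Sketch of an appearance-only augmentation set A for gaze-preserving views.
# Geometric transforms (rotation, flipping, translation) are deliberately
# excluded because they would change the gaze direction.
from torchvision import transforms

appearance_augmentations = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])

# Two independent draws of these augmentations applied to the same image
# yield a "single-view" positive pair whose gaze direction is unchanged.
```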
For learning equivariance, we leverage the fact that, in a common reference system, two or more synchronous images of the same person taken from different camera viewpoints are associated with the same gaze direction. Knowing the relative pose of each camera with respect to the common reference system relates the gaze directions defined in the respective camera spaces. In other words, gaze direction has an equivariant relationship to camera viewpoint. We claim that the requirement of using multiple cameras may be less onerous than obtaining gaze annotations for each image.
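To make the equivariance relation concrete, the sketch below shows how a gaze direction expressed in a camera's reference system maps to a common (e.g., screen) reference system via the camera-to-screen rotation; the function and variable names are illustrative assumptions:

```python
import numpy as np

def to_common_frame(gaze_cam: np.ndarray, R_screen_from_cam: np.ndarray) -> np.ndarray:
    """Rotate a unit gaze vector from a camera's coordinate system into the
    common (screen) coordinate system. R_screen_from_cam is the 3x3 rotation
    relating that camera to the screen, known from extrinsic calibration."""
    return R_screen_from_cam @ gaze_cam

# For synchronous views of the same person, the rotated gaze vectors agree:
# to_common_frame(g_cam1, R1) ≈ to_common_frame(g_cam2, R2),
# where g_cam1, g_cam2 are the gaze directions in camera 1 and 2 coordinates.
```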
We use an existing multi-view gaze dataset (EVE [12]), which provides video sequences captured by four calibrated and synchronized cameras and contains gaze annotations obtained using a gaze tracking device [13]. We ignore these labels during pre-training and use them only for fine-tuning and evaluation. Note that the relative camera pose information available with the EVE dataset is used only during the pre-training stage. Figure 1 presents an overview of the proposed idea.
To evaluate GazeCLR, we perform self-supervised pre-training on the EVE dataset and transfer the learned representations to the gaze estimation task in various evaluation settings. We demonstrate the effectiveness of the representations by showing that the proposed method achieves superior performance on both within-dataset and cross-dataset (MPIIGaze [2] and Columbia [14]) evaluations, using only a small number of labeled samples for fine-tuning. Our major contributions are summarized as follows:
1. We propose a simple contrastive learning method for gaze estimation that relies on the observation that gaze direction is invariant under selected appearance transformations and equivariant to any two camera viewpoints.

2. We argue for learning equivariant representations by taking advantage of multi-view data, which can be seamlessly collected using multiple cameras.

3. Our empirical evaluations show that GazeCLR yields improvements for various settings of gaze estimation and is competitive with existing supervised [15] and unsupervised state-of-the-art gaze representation learning methods [16, 17].
2 Proposed Method
2.1 Stage-I: Gaze Contrastive Learning (GazeCLR) Framework
GazeCLR is a framework to train an encoder whose embeddings exhibit the desired set of invariance and equivariance properties for the gaze estimation task. As stated earlier, the key intuition of GazeCLR is that we want to enforce invariance using selected appearance transformations (e.g., color jitter) and equivariance using synchronous images of the same person captured from multiple camera viewpoints. Similar to previous SSL approaches [6, 18], we rely on the normalized temperature-scaled cross-entropy loss (NT-Xent) [6] to encourage invariance or equivariance by maximizing the agreement between positive pairs and the disagreement between negative pairs. In particular, we devise two variants of the NT-Xent loss, namely $\mathcal{L}_I$ for invariance and $\mathcal{L}_E$ for equivariance.
The GazeCLR framework has three sub-modules: a CNN-based encoder and two MLP-based projection heads, as illustrated in Figure 2. The output of the encoder branches out into different projection heads depending on the type of input positive pair. To respect the invariance of gaze direction, we consider only appearance-based augmentations, denoted as $\mathcal{A}$.
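For concreteness, a minimal PyTorch-style sketch of an encoder with two projection heads is shown below; the backbone choice, embedding sizes, and layer widths are illustrative assumptions rather than the exact configuration used in the paper:

```python
import torch
import torch.nn as nn
import torchvision

class GazeCLRModel(nn.Module):
    """CNN encoder with two MLP projection heads: p1 for the invariance loss,
    p2 for the equivariance loss (its output is reshaped to 3 x d' so that a
    3x3 rotation matrix can act on it)."""
    def __init__(self, feat_dim: int = 128, d_equi: int = 32):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)  # assumed backbone
        backbone.fc = nn.Identity()                            # 512-d features
        self.encoder = backbone
        self.p1 = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                                nn.Linear(256, feat_dim))
        self.p2 = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                                nn.Linear(256, 3 * d_equi))
        self.d_equi = d_equi

    def forward(self, x: torch.Tensor, head: str = "p1") -> torch.Tensor:
        h = self.encoder(x)
        if head == "p1":                      # invariance branch
            return self.p1(h)
        z = self.p2(h)                        # equivariance branch
        return z.view(-1, 3, self.d_equi)     # shape (B, 3, d') for rotation
```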
Let $\{I_{v_i,t}\}_{i=1}^{|V|}$ be the synchronous frames at timestamp $t$ coming from the different camera views $\{v_i\}_{i=1}^{|V|}$. We then create the following positive pairs:
1. Single-view positive pairs: We apply two randomly sampled augmentations from $\mathcal{A}$ to create a single-view positive pair. Specifically, for any image $I_{v_i,t}$ at a given timestamp $t$ and view $v_i$, we sample two augmentations $a$ and $a'$ from $\mathcal{A}$; then $(a(I_{v_i,t}), a'(I_{v_i,t}))$ forms a positive pair for learning invariance. The left branch of Figure 2 shows one such positive pair for view $v_1$.
2. Multi-view positive pairs: We consider all unique pairs of camera viewpoints at the same timestamp $t$ and apply random augmentations from $\mathcal{A}$, i.e., $\{(a_i(I_{v_i,t}), a_j(I_{v_j,t})) \mid i, j \in \{1, \dots, |V|\},\ i \neq j\}$. The corresponding outputs of the encoder are passed through the projection head $p_2$ and multiplied by an appropriate rotation matrix to learn equivariance.
Next, to construct negative pairs, we do not sample them explicitly but instead use all other samples in the mini-batch as negative examples, similar to Chen et al. [6]. The exact formulation of the two loss functions $\mathcal{L}_I$ and $\mathcal{L}_E$ is described below. For brevity, we omit $t$ from $I_{v_i,t}$ and the augmentation $a$ in the following subsections.
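To summarize the pair construction, here is a schematic sketch of how single-view and multi-view positive pairs could be assembled from a set of synchronized frames; `build_positive_pairs`, `frames`, and `augmentations` are hypothetical names used only for illustration:

```python
import itertools
import random

def build_positive_pairs(frames, augmentations):
    """frames: dict mapping view index -> image captured at the same timestamp.
    augmentations: list of gaze-preserving (appearance-only) transforms."""
    single_view, multi_view = [], []

    # Single-view pairs: two independent augmentations of the same image.
    for v, img in frames.items():
        a, a_prime = random.choices(augmentations, k=2)
        single_view.append((a(img), a_prime(img), v))

    # Multi-view pairs: all pairs of distinct views at this timestamp.
    for vi, vj in itertools.permutations(frames.keys(), 2):
        ai, aj = random.choices(augmentations, k=2)
        multi_view.append((ai(frames[vi]), aj(frames[vj]), vi, vj))

    return single_view, multi_view
```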
Single-View Learning. Recall that the goal of single-view learning is to induce invariance among representations under various appearance transformations. Let $v_i \in V$ be any view and $b \in [1, \dots, B]$ be the batch index. Given a batch of size $B$, we apply two augmentations to each sample in the batch, yielding $2B$ augmented images; for each sample, we have one positive pair and $2(B-1)$ negative pairs stemming from the remaining samples in the batch. Our encoder $E$ extracts representations for all $2B$ augmented images, which are further mapped by the projection head $p_1(\cdot)$, yielding embeddings $\{z^{b}_{v_i}, z'^{b}_{v_i}\}_{b=1}^{B}$. With the above notation, for any view $v_i$, the proposed invariance loss $\mathcal{L}_I$ associated with a positive pair $(z^{b}_{v_i}, z'^{b}_{v_i})$ is given as follows:
$$\mathcal{L}_I(z^{b}_{v_i}, z'^{b}_{v_i}) = -\log \frac{\mathrm{sim}(z^{b}_{v_i}, z'^{b}_{v_i})}{\sum_{l=1}^{B} \mathbb{1}_{[l \neq b]}\,\mathrm{sim}(z^{b}_{v_i}, z^{l}_{v_i}) + \sum_{l=1}^{B} \mathrm{sim}(z^{b}_{v_i}, z'^{l}_{v_i})} \quad (1)$$
Figure 2: Method schematic. For synchronous view frames $\{I_{v_i}\}_{i=1}^{4}$, the figure illustrates invariant and equivariant positive pairs anchored only on view $v_1$. The left branch shows single-view learning ($\mathcal{L}_I$) and the right branch illustrates multi-view learning using four views ($\mathcal{L}_E$). All images (after augmentation, $a \in \mathcal{A}$) are passed through a shared CNN encoder network, followed by MLP projectors (either $p_1$ or $p_2$) depending on the type of input positive pair. The embeddings for multi-view learning are further multiplied by an appropriate rotation matrix. More details in Section 2.1.
where $z^{b}_{v_i} = p_1(E(I^{b}_{v_i}))$, $z'^{b}_{v_i} = p_1(E(I'^{b}_{v_i}))$, $\mathrm{sim}(r, s) = \exp\!\left(\frac{1}{\tau} \frac{r^{T} s}{\|r\| \cdot \|s\|}\right)$, $\mathbb{1}_{[l \neq b]}$ is an indicator function, and $\tau$ is the temperature parameter. It is worth noting that to minimize the loss in Eq. 1, $z^{b}_{v_i}$ and $z'^{b}_{v_i}$ must be close to each other, which aligns with our goal of learning invariance to appearance transformations. One challenge, however, is the risk of collapse (e.g., the network could simply learn each person's identity). To avoid this, we create mini-batches such that all samples in a batch are taken from a single participant.
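As an illustration of Eq. 1, the following is a minimal sketch of the invariance (NT-Xent-style) loss over a batch of paired embeddings; it assumes the similarity defined above (exponentiated, temperature-scaled cosine similarity) and is not taken verbatim from the authors' code:

```python
import torch
import torch.nn.functional as F

def invariance_loss(z: torch.Tensor, z_prime: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z, z_prime: (B, D) embeddings of the two augmented versions of each sample
    (same camera view, single participant per batch). Implements Eq. 1 averaged
    over the batch."""
    B = z.size(0)
    z = F.normalize(z, dim=1)
    z_prime = F.normalize(z_prime, dim=1)

    sim_zz = torch.exp(z @ z.t() / tau)          # sim(z^b, z^l)
    sim_zzp = torch.exp(z @ z_prime.t() / tau)   # sim(z^b, z'^l)

    pos = sim_zzp.diag()                                        # sim(z^b, z'^b)
    mask = ~torch.eye(B, dtype=torch.bool, device=z.device)     # exclude l == b
    denom = (sim_zz * mask).sum(dim=1) + sim_zzp.sum(dim=1)

    return -torch.log(pos / denom).mean()
```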
Multi-View Learning. We encourage equivariance of the gaze representations to different camera viewpoints through multi-view learning. To do so, we transform embeddings into a common reference system, chosen as the screen reference system used during the EVE data collection. Let $R^{S}_{C_{v_i}}$ be the rotation matrix relating camera viewpoint $v_i$ to the screen reference system. For each sample $I^{b}_{v_i}$ in a batch of size $B$, the positive pair is given as $(I^{b}_{v_i}, I^{b}_{v_j})$ for two distinct camera viewpoints $(v_i, v_j)$, $i \neq j$. All images for viewpoints $v_i$ and $v_j$ are first augmented, then passed through the encoder $E$ and the projection head $p_2(\cdot)$, which gives embeddings $\hat{z}^{b}_{v_i}, \hat{z}^{b}_{v_j} \in \mathbb{R}^{3 \times d'}$. These embeddings are further multiplied by the corresponding rotation matrices $R^{S}_{C_{v_i}}$ to project them into the common (screen) reference system. We denote the embeddings after rotation as $\{\bar{z}^{b}_{v_i}, \bar{z}^{b}_{v_j}\}_{b=1}^{B}$, such that $\bar{z}^{b}_{v_i} = R^{S}_{C_{v_i}} \hat{z}^{b}_{v_i}$. Therefore, for a batch of size $B$, our equivariance loss $\mathcal{L}_E$ associated with the positive pair $(\bar{z}^{b}_{v_i}, \bar{z}^{b}_{v_j})$ is as follows:
$$\mathcal{L}_E(\bar{z}^{b}_{v_i}, \bar{z}^{b}_{v_j}) = -\log \frac{\mathrm{sim}(\bar{z}^{b}_{v_i}, \bar{z}^{b}_{v_j})}{\sum_{l=1}^{B} \mathbb{1}_{[l \neq b]}\,\mathrm{sim}(\bar{z}^{b}_{v_i}, \bar{z}^{l}_{v_i}) + \sum_{l=1}^{B} \mathrm{sim}(\bar{z}^{b}_{v_i}, \bar{z}^{l}_{v_j})} \quad (2)$$
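A corresponding sketch of the equivariance loss, including the rotation of the $3 \times d'$ embeddings into the screen reference frame, might look like the following; as before, this is an illustrative assumption of one possible implementation rather than the authors' exact code:

```python
import torch

def equivariance_loss(z_hat_i: torch.Tensor, z_hat_j: torch.Tensor,
                      R_i: torch.Tensor, R_j: torch.Tensor,
                      tau: float = 0.1) -> torch.Tensor:
    """z_hat_i, z_hat_j: (B, 3, d') embeddings from projection head p2 for two
    distinct camera views. R_i, R_j: (3, 3) camera-to-screen rotation matrices.
    Implements Eq. 2 averaged over the batch."""
    B = z_hat_i.size(0)
    # Rotate into the common (screen) reference system, then flatten for cosine similarity.
    z_bar_i = torch.einsum('xy,byd->bxd', R_i, z_hat_i).reshape(B, -1)
    z_bar_j = torch.einsum('xy,byd->bxd', R_j, z_hat_j).reshape(B, -1)

    z_bar_i = torch.nn.functional.normalize(z_bar_i, dim=1)
    z_bar_j = torch.nn.functional.normalize(z_bar_j, dim=1)

    sim_ii = torch.exp(z_bar_i @ z_bar_i.t() / tau)  # sim(z̄_i^b, z̄_i^l)
    sim_ij = torch.exp(z_bar_i @ z_bar_j.t() / tau)  # sim(z̄_i^b, z̄_j^l)

    pos = sim_ij.diag()                                              # positive pair
    mask = ~torch.eye(B, dtype=torch.bool, device=z_bar_i.device)    # exclude l == b
    denom = (sim_ii * mask).sum(dim=1) + sim_ij.sum(dim=1)

    return -torch.log(pos / denom).mean()
```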
Overall loss function. Given $|V|$ camera viewpoints, we apply both the $\mathcal{L}_I$ and $\mathcal{L}_E$ loss functions to each view. Thus, our overall objective function for a batch of size $B$ becomes: