Learning Video-independent Eye Contact
Segmentation from In-the-Wild Videos
Tianyi Wu[0000-0001-9077-5632] and Yusuke Sugano[0000-0003-4206-710X]
Institute of Industrial Science, The University of Tokyo
{twu223, sugano}@iis.u-tokyo.ac.jp
Abstract. Human eye contact is a form of non-verbal communication
and can have a great influence on social behavior. Since the location
and size of the eye contact targets vary across different videos, learning
a generic video-independent eye contact detector is still a challenging
task. In this work, we address the task of one-way eye contact detection
for videos in the wild. Our goal is to build a unified model that can iden-
tify when a person is looking at his gaze targets in an arbitrary input
video. Considering that this requires time-series relative eye movement
information, we propose to formulate the task as a temporal segmen-
tation. Due to the scarcity of labeled training data, we further propose
a gaze target discovery method to generate pseudo-labels for unlabeled
videos, which allows us to train a generic eye contact segmentation model
in an unsupervised way using in-the-wild videos. To evaluate our pro-
posed approach, we manually annotated a test dataset consisting of 52
videos of human conversations. Experimental results show that our eye
contact segmentation model outperforms the previous video-dependent
eye contact detector and can achieve 71.88% framewise accuracy on our
annotated test set. Our code and evaluation dataset are available at
https://github.com/ut-vision/Video-Independent-ECS.
Keywords: Human gaze · Eye contact · Video segmentation
1 Introduction
Human gaze and eye contact have strong social meaning and are considered
key to understanding human dyadic interactions. Studies have shown that eye
contact functions as a signaling mechanism [6,19], indicates interest and atten-
tion [22,2], and is related to certain psychiatric conditions [3,37,35]. The impor-
tance of human gazes in general has also been well recognized in the computer
vision community, leading to a series of related research work on vision-based
gaze estimation techniques [61,47,56,7,38,8,59,62,60,57,39]. Recent advances in
vision-based gaze estimation have the potential to enable robust analyses of gaze
behavior, including one-way eye contact. However, gaze estimation is still chal-
lenging in images with extreme head poses and lighting conditions, and it is not
a trivial task to robustly detect eye contacts in in-the-wild situations.
arXiv:2210.02033v1 [cs.CV] 5 Oct 2022

Several previous studies have attempted to directly address the task of detecting one-way eye contact [58,36,46,54,9]. Given its binary classification nature,
[Fig. 1 graphic: videos collected online → gaze target discovery → video-independent segmentation model]
Fig. 1: Illustration of our proposed task of video-independent eye contact segmentation. Given a video sequence and a target person, the goal is to segment the video into fragments of the target person having and not having eye contact with his potential gaze targets.
one-way eye contact detection can be a simpler task than regressing gaze direc-
tions. However, unconstrained eye contact detection remains a challenge. Fun-
damentally speaking, one-way eye contact detection is an ill-posed task if the
gaze targets are not identified beforehand. Fully supervised approaches [46,54,9]
necessarily result in environment-dependent models that cannot be applied to
eye contact targets with different positions and sizes. Although there have been some works that address this task using unsupervised approaches that automatically detect the position of gaze targets relative to the camera [58,36], they
still require a sufficient amount of unlabeled training data from the target envi-
ronment. Learning a model that can detect one-way eye contact from arbitrary
inputs independently of the environment is still a challenging task.
This work aims to address the task of unconstrained video-independent one-
way eye contact detection. We aim to train a unified model that can be applied to
arbitrary videos in the wild to obtain one-way eye contact moments of the target
person without knowing his gaze targets beforehand. Since the position and size
of the eye contact targets vary from video to video, it is nearly impossible to
approach this task frame by frame. However, we humans can recognize when
eye contact occurs from temporal eye movements, even when the target object
is not visible in the scene. Inspired by this observation, we instead formulate
the problem as a segmentation task utilizing the target person’s temporal face
appearance information from the input video (Fig. 1). The remaining challenge
here is that this approach requires a large amount of eye contact training data.
It is undoubtedly difficult to manually annotate training videos covering a wide
variety of environmental and lighting conditions.
To train the eye contact segmentation model, we propose an unsupervised
gaze target discovery method to generate eye contact pseudo-labels from noisy
appearance-based gaze estimation results. Since online videos often contain cam-
era movements and artificial edits, it is not a trivial task to locate eye contact
targets relative to the camera. Instead of making a strong assumption about a
stationary camera, we assume only that the relative positions of the eye contact
target and the person are fixed. Our method analyzes human gazes in the body
coordinate system and treats high-density gaze point regions as positive samples.
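The density-based labeling idea can be sketched in a few lines. This is a minimal illustration, not our exact implementation: function names, the grid-based density estimate, and the thresholds are all hypothetical, and the input is assumed to be per-frame gaze points already expressed in the body coordinate system.

```python
import numpy as np

def discover_gaze_targets(gaze_points, bin_size=0.05, density_ratio=0.3):
    """Label frames whose gaze points fall in high-density regions as
    eye-contact positives. `gaze_points` is an (N, 2) array of per-frame
    gaze points in the person's body coordinate system."""
    pts = np.asarray(gaze_points, dtype=float)
    # Quantize gaze points into a coarse 2D grid.
    bins = np.floor(pts / bin_size).astype(int)
    # Count how many frames fall into each grid cell.
    cells, inverse, counts = np.unique(
        bins, axis=0, return_inverse=True, return_counts=True)
    # Cells holding at least `density_ratio` of the densest cell's mass
    # are treated as gaze target regions.
    dense = counts >= density_ratio * counts.max()
    # Map the per-cell decision back to a per-frame boolean pseudo-label.
    return dense[np.ravel(inverse)]
```

Frames mapped to dense cells receive positive pseudo-labels; isolated gaze points (e.g., saccades away from any target) become negatives, which is what lets the segmentation model be trained without manual annotation.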
By applying our gaze target discovery method to the VoxCeleb2 dataset [12],
we obtain a large-scale pseudo-labeled training dataset. Based on the initial
pseudo-labels, our segmentation model is trained iteratively using the original
facial features as input. We also manually annotated 52 videos with eye contact
segmentation labels for evaluation, and experiments show that our approach
can achieve 71.88% framewise accuracy on our test set and outperforms video-
dependent baselines.
Our contributions are threefold. First, to the best of our knowledge, we are
the first to formulate one-way eye contact detection as a segmentation task. This
formulation allows us to naturally leverage the target person’s face and gaze fea-
tures temporally, leading to a video-independent eye contact detector that can
be applied to arbitrary videos. Second, we propose a novel gaze target discov-
ery method robust to videos in the wild. This leads to high-quality eye contact
pseudo-labels that can be further used for both video-dependent eye contact de-
tection and video-independent eye contact segmentation. Finally, we create and
release a manually annotated evaluation dataset for eye contact segmentation
based on the VoxCeleb2 dataset.
2 Related work
2.1 Gaze Estimation and Analysis
Appearance-based Gaze Estimation Appearance-based gaze estimation directly regresses the gaze direction from the input image and only requires an off-the-shelf camera. Although most works take the eye region as input [61,47,56,7,38,8], some have demonstrated the advantage of using the full face as input [63,59,62,60,57,39].
When the eye region is barely visible, e.g., due to low resolution, extreme head poses, or poor lighting conditions, the full-face gaze model can be expected
to infer the direction of the human gaze from the rest of the face. Since most
gaze estimation datasets are collected in controlled laboratory settings [57,16,17],
in-the-wild appearance-based gaze estimation remains a challenge. Some recent
efforts have been made to address this issue by domain adaptation [29,44] or us-
ing synthetic data [39,55,64]. Note that eye contact detection is a different task
from gaze estimation and is still difficult even with a perfect gaze estimator due
to the unknown gaze target locations. The goal of this work is to improve the
accuracy of eye contact detection on top of the state-of-the-art appearance-based
gaze estimation method.
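Appearance-based estimators typically output gaze as a (pitch, yaw) angle pair, which downstream analyses such as eye contact detection convert into a 3D ray. One common convention (an assumption here, not prescribed by the cited works) is a camera-coordinate system whose +z axis points away from the camera, so a gaze straight into the camera is (0, 0, -1):

```python
import numpy as np

def pitchyaw_to_vector(pitch, yaw):
    """Convert gaze angles in radians to a unit 3D direction.
    Assumed convention: pitch is the vertical angle, yaw the horizontal
    angle, and (0, 0) means looking straight into the camera."""
    return np.array([
        -np.cos(pitch) * np.sin(yaw),  # x: horizontal component
        -np.sin(pitch),                # y: vertical component
        -np.cos(pitch) * np.cos(yaw),  # z: toward the camera
    ])
```

The returned vector is unit-length by construction, so it can be intersected directly with planes or target regions in later processing.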
Gaze Following and Mutual Gaze Detection First proposed by Recasens et al. [40], gaze following aims to estimate the object a person is gazing at in an image [41,49,51,14,52,10,11]. Another line of work is mutual gaze detection, which
aims to locate moments when two people are looking at each other. Mutual gaze
is an even stronger signal than one-way eye contact in reflecting the relation-
ship between two people [31,32,33]. The problem of mutual gaze detection was
first proposed by Marin-Jimenez et al. [30]. Our target differs from these tasks
in two ways. First, we are interested in finding the moments in which one-way
eye contact occurs to gaze targets, rather than determining the location of gaze
targets on a frame-by-frame basis or detecting mutual gazes. Second, since our
proposed method performs eye contact detection by segmenting the video based
on the person’s facial appearance, it can handle the cases where the gaze targets
are not visible from the scene. Although some gaze following work [10,11] can
tell when gaze targets are outside the image, most of them are designed with the
implicit assumption that the gaze target is included in the image.
2.2 Eye Contact Detection
Several previous works address the task of detecting eye contact specifically with
the camera [46,54,9]. However, such pre-trained models cannot be applied to
videos with the target person attending to gaze targets of different sizes and po-
sitions. Recent progress in appearance-based gaze estimation allows unsupervised
detection of one-way eye contacts in third-person videos using an off-the-shelf
camera [58,36]. Zhang et al. assume a setting in which the camera is placed next
to the gaze target and propose an unsupervised gaze target discovery method
to locate the gaze target region relative to the camera [58]. They first run the
appearance-based gaze estimator on all input sequences of human faces to get
3D gaze directions and then compute gaze points in the camera plane. This is
followed by density-based clustering, which identifies high-density gaze point regions as the locations of gaze targets. Based on this idea, Müller et al. study eye contact detection in groups of 3-4 people having conversations [36]. Based
on the assumption that all listeners would look at the speaker most of the time,
they use audio clues to more accurately locate gaze targets in the camera plane.
There are two major limitations that make these two approaches inapplicable
to videos in the wild. First, in many online videos, camera movements and jump
cuts are common, making the camera coordinate system inconsistent throughout
the video. Meanwhile, since gaze points are essentially the intersection between
the gaze ray and the plane z = 0 in the camera coordinate system, the gaze
points corresponding to gaze targets far from the camera will naturally be more
scattered than those corresponding to gaze targets close to the camera when
receiving the same amount of eye gazes. Consequently, density-based clustering
would fail to identify potential gaze targets far from the camera on the camera
plane. Second, both works only explored video-dependent eye contact detection,
i.e., training one model for each test video. Instead, we study the feasibility
of training a video-independent eye contact segmentation model that can be
applied to different videos in the wild.
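The scattering problem described above can be made concrete: the gaze point is the intersection of the ray from the eye position along the gaze direction with the camera plane z = 0, so a fixed angular error displaces the gaze point in proportion to the eye's distance from the plane. A minimal sketch (variable names are ours, not from the cited works):

```python
import numpy as np

def gaze_point_on_camera_plane(origin, direction):
    """Intersect the gaze ray origin + t * direction with the plane
    z = 0 in camera coordinates and return the (x, y) gaze point."""
    o = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    t = -o[2] / d[2]   # solve o_z + t * d_z = 0 for t
    return (o + t * d)[:2]
```

Perturbing the gaze direction by the same small angle moves the resulting gaze point three times farther when the eye is at z = 3 than at z = 1, which is why density-based clustering on the camera plane degrades for distant gaze targets.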
2.3 Action Segmentation
Action segmentation is the task of detecting and segmenting actions in a given
video. Various research works have focused on designing the network archi-
tecture for the task. Singh et al. [45] propose to feed spatial-temporal video
representations learned by a two-stream network to a bi-directional LSTM to