Learning Video-independent Eye Contact
Segmentation from In-the-Wild Videos
Tianyi Wu[0000-0001-9077-5632] and Yusuke Sugano[0000-0003-4206-710X]
Institute of Industrial Science, The University of Tokyo
{twu223, sugano}@iis.u-tokyo.ac.jp
Abstract. Human eye contact is a form of non-verbal communication
and can have a great influence on social behavior. Since the location
and size of the eye contact targets vary across different videos, learning
a generic video-independent eye contact detector is still a challenging
task. In this work, we address the task of one-way eye contact detection
for videos in the wild. Our goal is to build a unified model that can iden-
tify when a person is looking at his gaze targets in an arbitrary input
video. Considering that this requires time-series relative eye movement
information, we propose to formulate the task as a temporal segmen-
tation. Due to the scarcity of labeled training data, we further propose
a gaze target discovery method to generate pseudo-labels for unlabeled
videos, which allows us to train a generic eye contact segmentation model
in an unsupervised way using in-the-wild videos. To evaluate our pro-
posed approach, we manually annotated a test dataset consisting of 52
videos of human conversations. Experimental results show that our eye
contact segmentation model outperforms the previous video-dependent
eye contact detector and can achieve 71.88% framewise accuracy on our
annotated test set. Our code and evaluation dataset are available at
https://github.com/ut-vision/Video-Independent-ECS.
Keywords: Human gaze · Eye contact · Video segmentation
1 Introduction
Human gaze and eye contact have strong social meaning and are considered
key to understanding human dyadic interactions. Studies have shown that eye
contact functions as a signaling mechanism [6,19], indicates interest and atten-
tion [22,2], and is related to certain psychiatric conditions [3,37,35]. The impor-
tance of human gazes in general has also been well recognized in the computer
vision community, leading to a series of related research work on vision-based
gaze estimation techniques [61,47,56,7,38,8,59,62,60,57,39]. Recent advances in
vision-based gaze estimation have the potential to enable robust analyses of gaze
behavior, including one-way eye contact. However, gaze estimation is still chal-
lenging in images with extreme head poses and lighting conditions, and it is not
a trivial task to robustly detect eye contacts in in-the-wild situations.
arXiv:2210.02033v1 [cs.CV] 5 Oct 2022

Several previous studies have attempted to directly address the task of detecting one-way eye contact [58,36,46,54,9]. Given its binary classification nature,
[Fig. 1 graphic: videos collected online → gaze target discovery → video-independent segmentation model]
Fig. 1: Illustration of our proposed task of video-independent eye contact segmentation. Given a video sequence and a target person, the goal is to segment the video into fragments of the target person having and not having eye contact with his potential gaze targets.
one-way eye contact detection can be a simpler task than regressing gaze direc-
tions. However, unconstrained eye contact detection remains a challenge. Fun-
damentally speaking, one-way eye contact detection is an ill-posed task if the
gaze targets are not identified beforehand. Fully supervised approaches [46,54,9]
necessarily result in environment-dependent models that cannot be applied to
eye contact targets with different positions and sizes. Although there have been some works that address this task using unsupervised approaches that automatically detect the position of gaze targets relative to the camera [58,36], they
still require a sufficient amount of unlabeled training data from the target envi-
ronment. Learning a model that can detect one-way eye contact from arbitrary
inputs independently of the environment is still a challenging task.
This work aims to address the task of unconstrained video-independent one-
way eye contact detection. We aim to train a unified model that can be applied to
arbitrary videos in the wild to obtain one-way eye contact moments of the target
person without knowing his gaze targets beforehand. Since the position and size
of the eye contact targets vary from video to video, it is nearly impossible to
approach this task frame by frame. However, we humans can recognize when
eye contact occurs from temporal eye movements, even when the target object
is not visible in the scene. Inspired by this observation, we instead formulate
the problem as a segmentation task utilizing the target person’s temporal face
appearance information from the input video (Fig. 1). The remaining challenge
here is that this approach requires a large amount of eye contact training data.
It is undoubtedly difficult to manually annotate training videos covering a wide
variety of environmental and lighting conditions.
To train the eye contact segmentation model, we propose an unsupervised
gaze target discovery method to generate eye contact pseudo-labels from noisy
appearance-based gaze estimation results. Since online videos often contain cam-
era movements and artificial edits, it is not a trivial task to locate eye contact
targets relative to the camera. Instead of making a strong assumption about a
stationary camera, we assume only that the relative positions of the eye contact
target and the person are fixed. Our method analyzes human gazes in the body
coordinate system and treats high-density gaze point regions as positive samples.
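The density-based labeling idea can be sketched in a few lines. This is a minimal illustration, not our exact implementation: function names, the grid-based density estimate, and the thresholds are all hypothetical, and the input is assumed to be per-frame gaze points already expressed in the body coordinate system.

```python
import numpy as np

def discover_gaze_targets(gaze_points, bin_size=0.05, density_ratio=0.3):
    """Label frames whose gaze points fall in high-density regions as
    eye-contact positives. `gaze_points` is an (N, 2) array of per-frame
    gaze points in the person's body coordinate system."""
    pts = np.asarray(gaze_points, dtype=float)
    # Quantize gaze points into a coarse 2D grid.
    bins = np.floor(pts / bin_size).astype(int)
    # Count how many frames fall into each grid cell.
    cells, inverse, counts = np.unique(
        bins, axis=0, return_inverse=True, return_counts=True)
    # Cells holding at least `density_ratio` of the densest cell's mass
    # are treated as gaze target regions.
    dense = counts >= density_ratio * counts.max()
    # Map the per-cell decision back to a per-frame boolean pseudo-label.
    return dense[np.ravel(inverse)]
```

Frames mapped to dense cells receive positive pseudo-labels; isolated gaze points (e.g., saccades away from any target) become negatives, which is what lets the segmentation model be trained without manual annotation.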
By applying our gaze target discovery method to the VoxCeleb2 dataset [12],
we obtain a large-scale pseudo-labeled training dataset. Based on the initial
pseudo-labels, our segmentation model is trained iteratively using the original
facial features as input. We also manually annotated 52 videos with eye contact
segmentation labels for evaluation, and experiments show that our approach
can achieve 71.88% framewise accuracy on our test set and outperforms video-
dependent baselines.
Our contributions are threefold. First, to the best of our knowledge, we are
the first to formulate one-way eye contact detection as a segmentation task. This
formulation allows us to naturally leverage the target person’s face and gaze fea-
tures temporally, leading to a video-independent eye contact detector that can
be applied to arbitrary videos. Second, we propose a novel gaze target discov-
ery method robust to videos in the wild. This leads to high-quality eye contact
pseudo-labels that can be further used for both video-dependent eye contact de-
tection and video-independent eye contact segmentation. Finally, we create and
release a manually annotated evaluation dataset for eye contact segmentation
based on the VoxCeleb2 dataset.
2 Related work
2.1 Gaze Estimation and Analysis
Appearance-based Gaze Estimation Appearance-based gaze estimation directly regresses the gaze direction from the input image and only requires an off-the-shelf camera. Although most works take the eye region as input [61,47,56,7,38,8], some have demonstrated the advantage of using the full face as input [63,59,62,60,57,39].
When the eye region is barely visible, e.g., due to low resolution, extreme head poses, or poor lighting conditions, the full-face gaze model can be expected
to infer the direction of the human gaze from the rest of the face. Since most
gaze estimation datasets are collected in controlled laboratory settings [57,16,17],
in-the-wild appearance-based gaze estimation remains a challenge. Some recent
efforts have been made to address this issue by domain adaptation [29,44] or us-
ing synthetic data [39,55,64]. Note that eye contact detection is a different task
from gaze estimation and is still difficult even with a perfect gaze estimator due
to the unknown gaze target locations. The goal of this work is to improve the
accuracy of eye contact detection on top of the state-of-the-art appearance-based
gaze estimation method.
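Appearance-based estimators typically output gaze as a (pitch, yaw) angle pair, which downstream analyses such as eye contact detection convert into a 3D ray. One common convention (an assumption here, not prescribed by the cited works) is a camera-coordinate system whose +z axis points away from the camera, so a gaze straight into the camera is (0, 0, -1):

```python
import numpy as np

def pitchyaw_to_vector(pitch, yaw):
    """Convert gaze angles in radians to a unit 3D direction.
    Assumed convention: pitch is the vertical angle, yaw the horizontal
    angle, and (0, 0) means looking straight into the camera."""
    return np.array([
        -np.cos(pitch) * np.sin(yaw),  # x: horizontal component
        -np.sin(pitch),                # y: vertical component
        -np.cos(pitch) * np.cos(yaw),  # z: toward the camera
    ])
```

The returned vector is unit-length by construction, so it can be intersected directly with planes or target regions in later processing.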
Gaze Following and Mutual Gaze Detection First proposed by Recasens et al. [40], gaze following aims to estimate the object a person is gazing at in an image [41,49,51,14,52,10,11]. Another line of work is mutual gaze detection, which
aims to locate moments when two people are looking at each other. Mutual gaze
is an even stronger signal than one-way eye contact in reflecting the relation-
ship between two people [31,32,33]. The problem of mutual gaze detection was
first proposed by Marin-Jimenez et al. [30]. Our target differs from these tasks
in two ways. First, we are interested in finding the moments in which one-way
eye contact occurs to gaze targets, rather than determining the location of gaze
targets on a frame-by-frame basis or detecting mutual gazes. Second, since our
proposed method performs eye contact detection by segmenting the video based
on the person’s facial appearance, it can handle the cases where the gaze targets
are not visible from the scene. Although some gaze following work [10,11] can
tell when gaze targets are outside the image, most of them are designed with the
implicit assumption that the gaze target is included in the image.
2.2 Eye Contact Detection
Several previous works address the task of detecting eye contact specifically with
the camera [46,54,9]. However, such pre-trained models cannot be applied to
videos with the target person attending to gaze targets of different sizes and po-
sitions. Recent progress in appearance-based gaze estimation allows unsupervised
detection of one-way eye contacts in third-person videos using an off-the-shelf
camera [58,36]. Zhang et al. assume a setting in which the camera is placed next
to the gaze target and propose an unsupervised gaze target discovery method
to locate the gaze target region relative to the camera [58]. They first run the
appearance-based gaze estimator on all input sequences of human faces to get
3D gaze directions and then compute gaze points in the camera plane. This is
followed by density-based clustering, which identifies high-density gaze point regions as the locations of gaze targets. Based on this idea, Müller et al. study eye contact detection in groups of 3-4 people having conversations [36]. Based
on the assumption that all listeners would look at the speaker most of the time,
they use audio clues to more accurately locate gaze targets in the camera plane.
There are two major limitations that make these two approaches inapplicable
to videos in the wild. First, in many online videos, camera movements and jump
cuts are common, making the camera coordinate system inconsistent throughout
the video. Meanwhile, since gaze points are essentially the intersection between
the gaze ray and the plane z = 0 in the camera coordinate system, the gaze
points corresponding to gaze targets far from the camera will naturally be more
scattered than those corresponding to gaze targets close to the camera when
receiving the same amount of eye gazes. Consequently, density-based clustering
would fail to identify potential gaze targets far from the camera on the camera
plane. Second, both works only explored video-dependent eye contact detection,
i.e., training one model for each test video. Instead, we study the feasibility
of training a video-independent eye contact segmentation model that can be
applied to different videos in the wild.
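The scattering problem described above can be made concrete: the gaze point is the intersection of the ray from the eye position along the gaze direction with the camera plane z = 0, so a fixed angular error displaces the gaze point in proportion to the eye's distance from the plane. A minimal sketch (variable names are ours, not from the cited works):

```python
import numpy as np

def gaze_point_on_camera_plane(origin, direction):
    """Intersect the gaze ray origin + t * direction with the plane
    z = 0 in camera coordinates and return the (x, y) gaze point."""
    o = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    t = -o[2] / d[2]   # solve o_z + t * d_z = 0 for t
    return (o + t * d)[:2]
```

Perturbing the gaze direction by the same small angle moves the resulting gaze point three times farther when the eye is at z = 3 than at z = 1, which is why density-based clustering on the camera plane degrades for distant gaze targets.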
2.3 Action Segmentation
Action segmentation is the task of detecting and segmenting actions in a given
video. Various research works have focused on designing the network archi-
tecture for the task. Singh et al. [45] propose to feed spatial-temporal video
representations learned by a two-stream network to a bi-directional LSTM to