IMU2CLIP: MULTIMODAL CONTRASTIVE LEARNING FOR IMU MOTION SENSORS
FROM EGOCENTRIC VIDEOS AND TEXT NARRATIONS
Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Alireza Dirafzoon, Aparajita Saraf
Amy Bearman, Babak Damavandi
Meta Reality Labs
*: Joint First Authors.
ABSTRACT
We present IMU2CLIP, a novel pre-training approach to align Inertial Measurement Unit (IMU) motion sensor recordings with video and text, by projecting them into the joint representation space of Contrastive Language-Image Pre-training (CLIP). The proposed approach allows IMU2CLIP to translate human motions (as measured by IMU sensors) into their corresponding textual descriptions and videos, while preserving the transitivity across these modalities.
We explore several new IMU-based applications that IMU2CLIP enables, such as motion-based media retrieval and natural language reasoning tasks with motion data. In addition, we show that IMU2CLIP can significantly improve the downstream performance when fine-tuned for each application (e.g. activity recognition), demonstrating the universal usage of IMU2CLIP as a new pre-trained resource. Our code will be made publicly available.
Index Terms: IMU modeling, Multimodal learning
1. INTRODUCTION
With the growing popularity of smart glasses and other new-generation wearable devices, first-person (egocentric) videos have recently become much more prevalent than ever before [1, 2, 3]. These egocentric videos are often accompanied by parallel head-mounted IMU sensor readings, which record the device's linear and rotational movements and accelerations.
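For concreteness, one IMU "clip" can be pictured as a short multi-channel time series. The sketch below shows one possible raw representation; the 6-axis layout, sampling rate, and window length are our own illustrative assumptions rather than values specified in the paper.

```python
# Purely illustrative: a 5-second, 6-axis IMU window at an assumed 200 Hz sampling rate.
import numpy as np

SAMPLING_RATE_HZ = 200   # assumed device sampling rate
WINDOW_SECONDS = 5       # assumed window length

# Rows 0-2: linear acceleration (m/s^2) along x, y, z.
# Rows 3-5: angular velocity (rad/s) around x, y, z.
imu_window = np.zeros((6, SAMPLING_RATE_HZ * WINDOW_SECONDS), dtype=np.float32)
print(imu_window.shape)  # (6, 1000)
```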
Given its low power consumption and low privacy implications, IMU is regarded as an important modality for powering various on-device models that require an understanding of the device wearer's movement patterns (e.g. exercise / activity recognition for health applications). Previous work on IMU modeling typically focuses on purpose-built datasets with manual annotations [4, 5], which are limited in scale. Consequently, the utilization of IMU models in real-world scenarios has been confined to a relatively small number of use cases.
Fig. 1: Illustration of IMU2CLIP (I2C): (a) The model aligns the parallel video ↔ IMU ↔ text data in the joint space. Once trained, IMU2CLIP is used as a retriever for both (b) IMU and (c) videos, or as a classifier for downstream applications.
On the contrary, for the modalities that are widely studied (e.g. text, video), there are vast large-scale resources such as BERT [6] and GPT [7] for text, or CLIP4Clip [8] for videos. These powerful pre-trained resources have driven the development of many application-oriented models, showing significant improvements when fine-tuned for each respective task [9]. To the best of our knowledge, however, equivalent resources for encoding IMU signals have been lacking.
Inspired by the recent works that leverage large pre-trained models for other modalities, we present IMU2CLIP, a new approach to pre-train an IMU encoder by aligning the parallel Video ↔ IMU ↔ Text data in an unsupervised manner via multimodal contrastive training. Specifically, we propose to use CLIP [10], which contains a video encoder and a language model pre-trained on large parallel image-text data, from which the IMU encoder can learn a semantic representation of various scenes transferred from the other modalities.
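As a concrete, purely illustrative sketch of this training objective, the snippet below aligns a toy IMU encoder with precomputed, frozen CLIP embeddings of the parallel video or text windows via a symmetric InfoNCE loss; the encoder architecture, window length, sampling rate, and embedding dimension are our own assumptions rather than the released implementation.

```python
# Minimal sketch of the IMU-to-CLIP alignment idea (not the authors' released code).
# Assumptions: IMU windows arrive as (batch, 6, T) tensors, and the parallel video/text
# windows have already been embedded with a frozen CLIP model into `clip_dim`-sized vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMUEncoder(nn.Module):
    """Toy 1D-CNN encoder that projects an IMU window into the CLIP embedding space."""
    def __init__(self, in_channels: int = 6, clip_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # global pooling over time
        )
        self.proj = nn.Linear(128, clip_dim)  # projection into CLIP's joint space

    def forward(self, imu: torch.Tensor) -> torch.Tensor:
        return self.proj(self.conv(imu).squeeze(-1))

def clip_style_contrastive_loss(imu_emb, clip_emb, temperature: float = 0.07):
    """Symmetric InfoNCE: parallel (IMU, video/text) windows are positives,
    all other pairings in the batch act as negatives."""
    imu_emb = F.normalize(imu_emb, dim=-1)
    clip_emb = F.normalize(clip_emb, dim=-1)
    logits = imu_emb @ clip_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Usage sketch: one training step against frozen CLIP video (or text) embeddings.
encoder = IMUEncoder()
imu_windows = torch.randn(8, 6, 1000)     # batch of 5 s windows at an assumed 200 Hz
frozen_clip_embs = torch.randn(8, 512)    # precomputed by a frozen CLIP encoder
loss = clip_style_contrastive_loss(encoder(imu_windows), frozen_clip_embs)
loss.backward()
```

As in Fig. 2, the CLIP-side encoders stay frozen while only the IMU encoder is updated; the same loss is applied once against video embeddings and once against text embeddings.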
To show the efficacy of the proposed approach, we evaluate our models on several benchmark tasks as well as on new applications that IMU2CLIP enables, such as IMU-based media retrieval, leveraging the modality transitivity that IMU2CLIP exhibits (Fig. 1). Most importantly, we show that the fine-tuned IMU2CLIP can significantly improve the performance on several downstream tasks, compared to an identical IMU model trained from scratch.
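As an intuition for the retrieval application, once IMU and video share the joint space, IMU-to-video retrieval reduces to a nearest-neighbor lookup. The sketch below is a hypothetical illustration that reuses the toy encoder from the previous snippet and assumes a precomputed library of CLIP video embeddings.

```python
# Hedged sketch of IMU-based media retrieval in the joint space (cf. Fig. 1b/c),
# assuming a library of precomputed CLIP video embeddings `video_embs` of shape (N, 512)
# and a trained IMU encoder such as the toy IMUEncoder above.
import torch
import torch.nn.functional as F

def retrieve_videos(imu_window, imu_encoder, video_embs, top_k=5):
    with torch.no_grad():
        query = F.normalize(imu_encoder(imu_window.unsqueeze(0)), dim=-1)  # (1, 512)
        sims = query @ F.normalize(video_embs, dim=-1).t()                 # (1, N)
    return sims.topk(top_k, dim=-1).indices.squeeze(0)                     # indices of best-matching clips
```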
Fig. 2: Illustration of the proposed multimodal contrastive learning for IMU2CLIP. CLIP [14] is used to align IMU ↔ Video (left) and IMU ↔ Text (right). Marked encoders have their parameters frozen during training.
Our contributions are as follows: (1) We propose a novel large-scale pre-training approach for IMU sensors, and release the resulting large pre-trained IMU encoders for future research. (2) We provide an in-depth empirical analysis evaluating the pre-trained models, for both upstream and downstream fine-tuning tasks. (3) Lastly, we present novel applications that show the feasibility of a wider usage of IMU sensor signals.
2. RELATED WORK
Contrastive Learning is an efficient self-supervised framework applied across multiple domains, which learns similar/dissimilar representations from data organized into similar/dissimilar pairs. For instance, SimCLR [11] is a unimodal application of contrastive learning in the data-augmentation setting, in which the authors propose to learn a vision encoder given a set of perturbed images. As an example in multimodal settings, Contrastive Language-Image Pre-training (CLIP) [10, 12] learns visual representations from natural language supervision using image and text pairs, achieving competitive results in e.g. zero-shot image classification, image retrieval via text, and image/caption generation. Similarly, WAV2CLIP [13] proposes to learn an audio representation by distilling it from CLIP. We extend this line of work on contrastive learning to a unique multimodal setting that utilizes IMU signals, which are specific to a new generation of devices (such as smart glasses) equipped with such sensors.
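For reference, the unimodal (SimCLR-style) form of this objective, written in our own notation, scores a positive pair of embeddings (z_i, z_j) against all other items in a batch of N pairs:

\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}

where \mathrm{sim}(\cdot,\cdot) is cosine similarity and \tau is a temperature. The multimodal CLIP variant applies the analogous symmetric cross-entropy over image-text (or, in our case, IMU-video/text) pairs within a batch.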
Pre-training Resources: There are numerous pre-trained resources for well-studied modalities such as images or text. Many popular computer vision models (e.g. ResNet [15]) are typically trained on large supervised datasets such as ImageNet [16]. For language processing, the most popular language models (LMs) include BERT [6, 17], GPT-2 [18], and GPT-3 [19], which typically use self-supervision techniques such as next-word prediction or masked token prediction, thus requiring no explicit task labels. Studies report that these pre-trained resources achieve competitive zero-shot performance [10], and when fine-tuned, often outperform fully supervised models on several downstream tasks [9].
To our knowledge, an equivalent resource for encoding IMU signals has not been made publicly available. Inspired by this line of work, we propose to perform large-scale pre-training on a unique sensor (IMU) signal dataset, and show that such pre-training significantly improves the performance of downstream applications as well.
Egocentric Datasets: We are particularly interested in egocentric (first-person) datasets, for understanding users' activities from head-mounted devices. Several data collection efforts have been made to build egocentric datasets, including the Ego4D [1], Epic-Kitchens [2], and Aria [3] datasets. Using these datasets, we propose various sub-tasks that can effectively evaluate the diverse capabilities of IMU2CLIP, and demonstrate the feasibility of future applications. In addition, we implement a universal multimodal data loader to allow for easy cross-modality and cross-domain (dataset) studies.
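The sketch below illustrates one possible shape such a loader could take; the class and field names are hypothetical and are not the interface released with the paper.

```python
# A minimal sketch of a cross-dataset, cross-modality loader interface
# (our own simplification; the loader released with the paper may differ).
from typing import Dict, List
import torch
from torch.utils.data import Dataset

class AlignedWindowDataset(Dataset):
    """Serves time-aligned windows; each item is a dict keyed by the requested modalities."""
    def __init__(self, index: List[dict], modalities=("imu", "video", "text")):
        self.index = index                   # per-window metadata (arrays, narrations, timestamps)
        self.modalities = list(modalities)   # any subset of {"imu", "video", "text"}

    def __len__(self) -> int:
        return len(self.index)

    def __getitem__(self, i: int) -> Dict[str, object]:
        meta = self.index[i]
        item = {}
        if "imu" in self.modalities:
            item["imu"] = torch.as_tensor(meta["imu"], dtype=torch.float32)  # (6, T)
        if "video" in self.modalities:
            item["video"] = torch.as_tensor(meta["video"])                   # (frames, 3, H, W)
        if "text" in self.modalities:
            item["text"] = meta["narration"]                                 # str
        return item
```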
IMU Modeling: IMU signals have been widely used in various motion recognition tasks, such as pose estimation [20], walking speed estimation [20], and foot placement prediction [5]. Various deep learning architectures have been explored for modeling IMU in downstream tasks, including Transformer-CNN based IMU models [4] for gesture recognition, 1D-CNN and GRU ensemble IMU models [21] for clinical balance assessment, and Bi-LSTM IMU models [22] for human activity recognition. Our work proposes a new IMU model architecture, and conducts ablation studies over the other models above. Different from prior work that models IMU for a specific task, however, our work focuses on learning general IMU representations by aligning IMU with other modalities (e.g. images and text), which can enable wider downstream applications.
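To illustrate the downstream fine-tuning setting mentioned above (e.g. activity recognition), the sketch below adds a linear classification head on top of a pre-trained IMU encoder; the number of classes and the decision to fine-tune the full encoder are our own assumptions, not the authors' configuration.

```python
# Hedged sketch of downstream fine-tuning for activity recognition:
# a linear head on top of a pre-trained IMU encoder (e.g. the toy IMUEncoder above).
import torch
import torch.nn as nn

class ActivityClassifier(nn.Module):
    def __init__(self, imu_encoder: nn.Module, clip_dim: int = 512, num_classes: int = 20):
        super().__init__()
        self.encoder = imu_encoder            # initialized from IMU2CLIP pre-training
        self.head = nn.Linear(clip_dim, num_classes)

    def forward(self, imu: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(imu))   # logits over activity labels
```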