
Fig. 2: Illustration of the proposed multimodal contrastive learning for IMU2CLIP. CLIP [14] is used to align IMU↔Video (left) and IMU↔Text (right). Encoders marked as frozen in the figure have their parameters fixed during training.
Our contributions are as follows: (1) We propose a novel large-scale pre-training approach for IMU sensors, and release the resulting large pre-trained IMU encoders for future research. (2) We provide an in-depth empirical analysis evaluating the pre-trained models on both upstream and downstream fine-tuning tasks. (3) Lastly, we present novel applications that demonstrate the feasibility of wider usage of IMU sensor signals.
2. RELATED WORK
Contrastive Learning is an efficient self-supervised framework applied across multiple domains, which learns similar/dissimilar representations from data organized into similar/dissimilar pairs. For instance, SimCLR [11] is a unimodal application of contrastive learning in the data augmentation setting, in which the authors propose to learn a vision encoder given a set of perturbed images. As an example in multimodal settings, Contrastive Language–Image Pre-training (CLIP) [10, 12] learns visual representations from natural language supervision using image and text pairs, achieving competitive results in e.g. zero-shot image classification, image retrieval via text, and image/caption generation. Similarly, WAV2CLIP [13] proposes to learn audio representations by distilling them from CLIP. We extend this line of work on contrastive learning to a unique multimodal setting that utilizes IMU signals, which are specific to a new generation of devices (such as smart glasses) equipped with such sensors.
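To make the pairing objective concrete, the sketch below shows a generic CLIP-style symmetric contrastive (InfoNCE) loss between batched IMU embeddings and embeddings from another modality (e.g. video or text). The function and variable names (`clip_style_contrastive_loss`, `imu_emb`, `other_emb`) and the temperature value are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(imu_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    imu_emb, other_emb: (batch, dim) tensors; row i of each tensor comes
    from the same time-aligned clip (a "similar" pair), while all other
    rows in the batch serve as "dissimilar" pairs.
    """
    # Normalize so that the dot product equals cosine similarity.
    imu_emb = F.normalize(imu_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = imu_emb @ other_emb.t() / temperature

    # The matching pair for row i is column i.
    targets = torch.arange(imu_emb.size(0), device=imu_emb.device)

    # Cross-entropy in both directions (IMU -> other and other -> IMU).
    loss_i = F.cross_entropy(logits, targets)
    loss_o = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_o) / 2
```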
Pre-training Resources: There are numerous pre-trained resources for well-studied modalities such as images or text. Many popular computer vision models (e.g. ResNet [15]) are typically trained on large supervised datasets such as ImageNet [16]. For language processing, the most popular language models (LMs) include BERT [6, 17], GPT-2 [18], and GPT-3 [19], which typically use self-supervision techniques such as next-word prediction or masked token prediction, thus requiring no explicit task labels. Studies report that these pre-trained resources achieve competitive zero-shot performance [10], and when fine-tuned, often outperform fully supervised models on several downstream tasks [9].
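For readers less familiar with these objectives, the sketch below illustrates next-word prediction in its simplest form: the text itself provides the supervision, so no task labels are needed. The `lm` callable and tensor shapes are illustrative assumptions rather than any specific model's API.

```python
import torch.nn.functional as F

def next_word_prediction_loss(lm, token_ids):
    """Self-supervised LM objective: predict token t+1 from tokens <= t.

    lm: any causal language model mapping (batch, seq) token ids to
        (batch, seq, vocab) logits.
    token_ids: (batch, seq) integer tensor of tokenized text.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = lm(inputs)  # (batch, seq - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
```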
To our knowledge, no equivalent resource for encoding IMU signals has been made publicly available. Inspired by this line of work, we propose to perform large-scale pre-training on a unique sensor (IMU) signals dataset, and show that such pre-training significantly improves performance on the downstream applications as well.
Egocentric Datasets: We are particularly interested in egocentric (first-person) datasets for understanding users' activities from head-mounted devices. Several data collection efforts have been made to build egocentric datasets, including the Ego4D [1], Epic-Kitchens [2], and Aria [3] datasets. Using these datasets, we propose various sub-tasks that can effectively evaluate the diverse capabilities of IMU2CLIP, and demonstrate the feasibility of future applications. In addition, we implement a universal multimodal data loader that allows for easy cross-modality and cross-domain (dataset) studies, as sketched below.
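A minimal sketch of what such a unified loader interface could look like follows; the class and field names (`MultimodalWindow`, `EgocentricWindowDataset`) are hypothetical illustrations, not the released code.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class MultimodalWindow:
    """One time-aligned window drawn from an egocentric recording."""
    imu: torch.Tensor              # (channels, samples), e.g. 6 x T
    video: Optional[torch.Tensor]  # (frames, 3, H, W), if available
    text: Optional[str]            # narration / caption, if available

class EgocentricWindowDataset(torch.utils.data.Dataset):
    """Serves fixed-length windows from any supported dataset
    (e.g. Ego4D, Epic-Kitchens, Aria) behind one interface, so that
    models can be trained and evaluated across datasets unchanged."""

    def __init__(self, windows: list):
        self.windows = windows

    def __len__(self) -> int:
        return len(self.windows)

    def __getitem__(self, idx: int) -> MultimodalWindow:
        return self.windows[idx]
```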
IMU Modeling: IMU signals have been widely used in various motion recognition tasks, such as pose estimation [20], walking speed estimation [20], and foot placement prediction [5]. Various deep learning architectures have been explored for modeling IMU in downstream tasks, including Transformer-CNN IMU models [4] for gesture recognition, 1D-CNN and GRU ensemble IMU models [21] for clinical balance assessment, and Bi-LSTM IMU models [22] for human activity recognition. Our work proposes a new IMU model architecture, and conducts ablation studies against the models above. Unlike prior work that models IMU for a specific task, however, our work focuses on learning general IMU representations by aligning IMU with other modalities (e.g. images and text), which can enable a wider range of downstream applications.
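As a rough illustration of the kind of encoder these architectures instantiate, the sketch below maps a 6-channel IMU window (3-axis accelerometer plus 3-axis gyroscope) to a fixed-size embedding with a small 1D-CNN; the layer sizes and output dimension are illustrative assumptions, not the architecture proposed in this paper.

```python
import torch
import torch.nn as nn

class Simple1DCNNIMUEncoder(nn.Module):
    """Maps a raw IMU window (batch, 6, T) to a (batch, embed_dim) embedding."""

    def __init__(self, in_channels: int = 6, embed_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time to a single vector
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, imu: torch.Tensor) -> torch.Tensor:
        features = self.conv(imu).squeeze(-1)  # (batch, 128)
        return self.proj(features)             # (batch, embed_dim)
```

An encoder of this form can then be trained with a contrastive objective like the one sketched earlier, aligning its output space with frozen video or text encoders.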