
Fig. 2: Illustration of the proposed multimodal contrastive learning for IMU2CLIP. CLIP [14] is used to align IMU↔Video (left) and IMU↔Text (right). Encoders marked as frozen in the figure have their parameters fixed during training.
Our contributions are as follows: (1) We propose a novel large-scale pre-training approach for IMU sensors, and release the resulting large pre-trained IMU encoders for future research. (2) We provide an in-depth empirical analysis evaluating the pre-trained models on both upstream and downstream fine-tuning tasks. (3) Lastly, we present novel applications that demonstrate the feasibility of wider usage of IMU sensor signals.
2. RELATED WORK
Contrastive Learning is an efficient self-supervised framework applied across multiple domains, which learns similar/dissimilar representations from data organized into similar/dissimilar pairs. For instance, SimCLR [11] is a unimodal application of contrastive learning in the data augmentation setting, in which the authors propose to learn a vision encoder given a set of perturbed images. As an example in multimodal settings, Contrastive Language–Image Pre-training (CLIP) [10, 12] learns visual representations from natural language supervision using image and text pairs, achieving competitive results in e.g. zero-shot image classification, image retrieval via text, and image/caption generation. Similarly, WAV2CLIP [13] proposes to learn audio representations by distilling them from CLIP. We extend this line of work on contrastive learning to a unique multimodal setting that utilizes IMU signals, which are specific to a new generation of devices (such as smart glasses) equipped with such sensors.
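To make the pairing objective concrete, the sketch below shows a generic CLIP-style symmetric contrastive (InfoNCE) loss between batched IMU embeddings and embeddings from another modality (e.g. video or text). The function and variable names (`clip_style_contrastive_loss`, `imu_emb`, `other_emb`) and the temperature value are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(imu_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    imu_emb, other_emb: (batch, dim) tensors; row i of each tensor comes
    from the same time-aligned clip (a "similar" pair), while all other
    rows in the batch serve as "dissimilar" pairs.
    """
    # Normalize so that the dot product equals cosine similarity.
    imu_emb = F.normalize(imu_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = imu_emb @ other_emb.t() / temperature

    # The matching pair for row i is column i.
    targets = torch.arange(imu_emb.size(0), device=imu_emb.device)

    # Cross-entropy in both directions (IMU -> other and other -> IMU).
    loss_i = F.cross_entropy(logits, targets)
    loss_o = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_o) / 2
```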
Pre-training Resources: There are numerous pre-trained resources for well-studied modalities such as images or text. Many popular computer vision models (e.g. ResNet [15]) are typically trained on large supervised datasets such as ImageNet [16]. For language processing, the most popular language models (LMs) include BERT [6, 17], GPT-2 [18], and GPT-3 [19], which typically use self-supervision techniques such as next-word prediction or masked token prediction, thus requiring no explicit task labels. Studies report that these pre-trained resources achieve competitive zero-shot performance [10], and when fine-tuned, often outperform fully supervised models on several downstream tasks [9].
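For readers less familiar with these objectives, the sketch below illustrates next-word prediction in its simplest form: the text itself provides the supervision, so no task labels are needed. The `lm` callable and tensor shapes are illustrative assumptions rather than any specific model's API.

```python
import torch.nn.functional as F

def next_word_prediction_loss(lm, token_ids):
    """Self-supervised LM objective: predict token t+1 from tokens <= t.

    lm: any causal language model mapping (batch, seq) token ids to
        (batch, seq, vocab) logits.
    token_ids: (batch, seq) integer tensor of tokenized text.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = lm(inputs)  # (batch, seq - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
```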
To our knowledge, no equivalent resource for encoding IMU signals has been made publicly available. Inspired by this line of work, we propose to perform large-scale pre-training on a unique sensor (IMU) signals dataset, and show that such pre-training significantly improves performance on the downstream applications as well.
Egocentric Datasets: We are particularly interested in egocentric (first-person) datasets for understanding users' activities from head-mounted devices. Several data collection efforts have been made to build egocentric datasets, including the Ego4D [1], Epic-Kitchens [2], and Aria [3] datasets. Using these datasets, we propose various sub-tasks that can effectively evaluate the diverse capabilities of IMU2CLIP, and demonstrate the feasibility of future applications. In addition, we implement a universal multimodal data loader that allows for easy cross-modality and cross-domain (dataset) studies, as sketched below.
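A minimal sketch of what such a unified loader interface could look like follows; the class and field names (`MultimodalWindow`, `EgocentricWindowDataset`) are hypothetical illustrations, not the released code.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class MultimodalWindow:
    """One time-aligned window drawn from an egocentric recording."""
    imu: torch.Tensor              # (channels, samples), e.g. 6 x T
    video: Optional[torch.Tensor]  # (frames, 3, H, W), if available
    text: Optional[str]            # narration / caption, if available

class EgocentricWindowDataset(torch.utils.data.Dataset):
    """Serves fixed-length windows from any supported dataset
    (e.g. Ego4D, Epic-Kitchens, Aria) behind one interface, so that
    models can be trained and evaluated across datasets unchanged."""

    def __init__(self, windows: list):
        self.windows = windows

    def __len__(self) -> int:
        return len(self.windows)

    def __getitem__(self, idx: int) -> MultimodalWindow:
        return self.windows[idx]
```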
IMU Modeling: IMU signals have been widely used in various motion recognition tasks, such as pose estimation [20], walking speed estimation [20], and foot placement prediction [5]. Various deep learning architectures have been explored for modeling IMU in downstream tasks, including Transformer-CNN IMU models [4] for gesture recognition, 1D-CNN and GRU ensemble IMU models [21] for clinical balance assessment, and Bi-LSTM IMU models [22] for human activity recognition. Our work proposes a new IMU model architecture, and conducts ablation studies against the models above. Unlike prior work that models IMU for a specific task, however, our work focuses on learning general IMU representations by aligning IMU with other modalities (e.g. images and text), which can enable a wider range of downstream applications.
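As a rough illustration of the kind of encoder these architectures instantiate, the sketch below maps a 6-channel IMU window (3-axis accelerometer plus 3-axis gyroscope) to a fixed-size embedding with a small 1D-CNN; the layer sizes and output dimension are illustrative assumptions, not the architecture proposed in this paper.

```python
import torch
import torch.nn as nn

class Simple1DCNNIMUEncoder(nn.Module):
    """Maps a raw IMU window (batch, 6, T) to a (batch, embed_dim) embedding."""

    def __init__(self, in_channels: int = 6, embed_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time to a single vector
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, imu: torch.Tensor) -> torch.Tensor:
        features = self.conv(imu).squeeze(-1)  # (batch, 128)
        return self.proj(features)             # (batch, embed_dim)
```

An encoder of this form can then be trained with a contrastive objective like the one sketched earlier, aligning its output space with frozen video or text encoders.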