son’s face is occluded, it is possible to send alert messages
to the user’s device when an emergency occurs. Connect-
ing a pedestrian’s vision data to a wireless identifier such as
a smartphone ID necessitates an opt-in requirement for the
messaging application.
Existing approaches to associating information across
modalities rely on sources such as camera and laser ranging
[26, 23] or clothing color and motion patterns [28]. Among
multimodal methods for vision and wireless, [4] fuses cam-
era and received signal strength (RSS) data, and [29, 9] fuse
camera and WiFi channel state information (CSI). These
prior vision-and-wireless methods have limitations such as
requiring multiple WiFi access points (APs). Most similar
to our work are Vi-Fi [17] and ViTag [5], which use the
vision modality (camera and depth data) along with the
WiFi modality (FTM and inertial measurement unit (IMU)
data) to perform cross-modal association with a single AP.
However, our work compares and uses fewer features, re-
lying only on depth data and FTM information to create the
multimodal association, while using the camera data along
with off-the-shelf detected bounding boxes to extract depth.
Moreover, unlike Vi-Fi and ViTag, we make no use of
hand-labeled data.
In this paper, we present ViFiCon, a self-supervised con-
trastive learning model that makes associations between
camera and wireless modalities without hand-labeled or
ground-truth data. Hand-labeling datasets (providing
ground-truth associations between vision and wireless) is
expensive and time-consuming, so we instead create
multimodal associations without using hand-labeled data.
We leverage person-detection bounding boxes from an
off-the-shelf object detection model [1] to obtain combina-
tions of pedestrian depths from the vision modality, with-
out knowledge of which pedestrians are contributing FTM
data. We then construct positive and negative pairings of
these depth combinations from the vision domain and the
known FTM distances from the wireless domain based on
passively collected timestamp information, inspired by
the time-contrastive audio and video self-supervised syn-
chronization task proposed in [14]. To bring the two modal-
ities into a joint representation, we create a novel band im-
age representation that maps the signals into a sequence
of gray-scale bands. Not only can we represent a single
signal from each modality, but we can also combine multiple
such signals in a single image to learn both a scene-wide
temporal synchronization of the data and a downstream
signal-to-signal association for the depth and FTM data.
Because the vision domain uses unlabeled bounding boxes,
this scene-wide synchronization lets the representation
determine how many signals should be included in a single
image by referring to the fixed-size FTM image representation.
That is, the number of smartphones is
known and constrains the number of relevant bounding-box
depth signals. We then train a siamese convolutional neu-
ral network model on the scene-wide synchronization task
to embed the positive and negative pairings into a joint la-
tent space with a Euclidean distance-based contrastive loss.
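For concreteness, the following is a minimal PyTorch sketch of a siamese encoder trained with a Euclidean distance-based contrastive loss of this kind; the encoder architecture, embedding size, margin, band-image dimensions, and the use of a single weight-shared encoder for both modalities are illustrative assumptions rather than the settings used in this paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BandEncoder(nn.Module):
    # Small CNN over a single-channel band image (illustrative architecture).
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    # Euclidean contrastive loss: label = 1 for synchronized (positive)
    # pairs, 0 for negative pairs; the margin value is an assumption.
    dist = F.pairwise_distance(emb_a, emb_b)
    return 0.5 * (label * dist.pow(2)
                  + (1 - label) * F.relu(margin - dist).pow(2)).mean()

# Siamese usage: the same (weight-shared) encoder embeds both modalities.
encoder = BandEncoder()
depth_bands = torch.rand(8, 1, 64, 64)   # vision-side band images
ftm_bands = torch.rand(8, 1, 64, 64)     # wireless-side band images
labels = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(encoder(depth_bands), encoder(ftm_bands), labels)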
Once trained on this pseudo-labeled data, the model can be
applied directly to the downstream individual association
task without further training. We illustrate the motivation
for ViFiCon in Figure 1.
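To make the band-image representation and the timestamp-based pairing concrete, the sketch below shows one plausible way to quantize per-frame scalar signals (bounding-box depth and FTM distance) into gray-scale bands, stack several of them into a single image, and form a negative pair by shifting one modality in time; the band height, normalization range, and shift are illustrative assumptions, not the parameters used in this paper.

import numpy as np

def signal_to_band(signal, band_height=8, max_val=30.0):
    # Map a 1-D per-frame signal (e.g., depth or FTM distance in meters)
    # to a gray-scale band of shape (band_height, T); max_val is an
    # assumed normalization range.
    sig = np.clip(np.asarray(signal, dtype=np.float32) / max_val, 0.0, 1.0)
    return np.tile(sig * 255.0, (band_height, 1)).astype(np.uint8)

def stack_bands(signals, band_height=8, max_val=30.0):
    # Stack several signals (one band per person/phone) into a single image.
    return np.concatenate(
        [signal_to_band(s, band_height, max_val) for s in signals], axis=0)

# Positive pair: depth and FTM band images built over the same timestamps.
# Negative pair: the FTM side shifted in time (illustrative 30-frame shift).
T, shift = 100, 30
depth_signals = [np.random.uniform(2, 20, T) for _ in range(3)]
ftm_signals = [np.random.uniform(2, 20, T) for _ in range(3)]
depth_img = stack_bands([s[:T - shift] for s in depth_signals])
ftm_pos = stack_bands([s[:T - shift] for s in ftm_signals])   # aligned
ftm_neg = stack_bands([s[shift:] for s in ftm_signals])       # misaligned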
Summary of Contributions We summarize our contribu-
tions as follows:
• We introduce a novel representation of signal data
that encodes a group of signals from two modalities as
a set of gray-scale bands.
• We devise a self-supervised learning framework to
learn a multimodal latent-space representation of signal
data without the use of hand-labeling.
• We demonstrate that the convolutional neural net-
work generalizes from a global scene view of signal data
in the pretext synchronization task to a one-to-one indi-
vidual association downstream task without further training,
yielding 84.77% Identity Precision for one-to-one associa-
tion with a 10-frame temporal window (a possible matching
procedure is sketched below).
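This excerpt does not specify the downstream matching procedure; one plausible instantiation, sketched below, embeds each candidate bounding-box depth band and each FTM band over a short (e.g., 10-frame) window with the pretrained encoder and solves a one-to-one assignment on the pairwise Euclidean distances. The Hungarian solver and the distance criterion are assumptions for illustration, not necessarily the method used here.

import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def associate(encoder, depth_bands, ftm_bands):
    # One-to-one association between bounding-box depth bands (N, 1, H, W)
    # and FTM bands (M, 1, H, W) using the pretrained siamese encoder.
    d_emb = encoder(depth_bands)        # (N, D) vision-side embeddings
    f_emb = encoder(ftm_bands)          # (M, D) wireless-side embeddings
    cost = torch.cdist(d_emb, f_emb)    # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost.cpu().numpy())
    return list(zip(rows.tolist(), cols.tolist()))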
2. Background and Related Work
Multimodal Association Multimodal association attempts
to enhance a deep learning model’s performance by intro-
ducing a richer context of a scene through knowledge
shared between domains. Traditionally, these modalities
are fused by mapping the data from both into a shared
latent-space representation, where vector representations of
the data can be compared directly with one another. Multi-
modal association has been applied to the audio and video
domains to learn an association between the two, for exam-
ple determining whether an input video of an instrument
corresponds to given audio samples [14] or matching a
video of lip movements to a mel-spectrogram audio repre-
sentation [7]. Other work uses multimodal deep learning
to fuse audio and text for sentiment recognition [12], and
video and wireless sensors for human activity recognition
[29] or tracking [21, 22].
We specifically build on work associating vision-do-
main information (such as RGB or depth data) with wire-
less-domain sensor data such as WiFi Fine Time Measure-
ments (FTM) or inertial measurement unit (IMU) readings.
Because such wireless signals can overcome obstructions
such as walls [3], we can use the WiFi modality along with
the vision domain to re-identify or track individuals behind
obstructions, which is not possible with vision alone. RGB-W [4] lever-
ages captured video data and received signal strength (RSS)
data from cell phones’ WiFi or Bluetooth in the scene to
associate detected bounding boxes with cell phone MAC
addresses. Despite working indoors, an improvement over