
Figure 2: System workflow. The system is composed of a frontend client and a backend server. The frontend performs fast pose propagation with IMU data and fuses it with the visual pose estimates from the backend server. The state vector contains the device pose and motion information such as the velocity and the biases of the IMU measurements.
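To make the state vector concrete, the following minimal Python sketch shows one plausible layout; the exact parameterization (quaternion orientation, world-frame quantities) is an assumption for illustration, not a detail specified by the system.

```python
# Illustrative layout of the filter state described in Figure 2
# (the parameterization below is assumed, not given by the paper).
import numpy as np
from dataclasses import dataclass, field

@dataclass
class FilterState:
    q_wb: np.ndarray = field(default_factory=lambda: np.array([1.0, 0.0, 0.0, 0.0]))  # device orientation (unit quaternion)
    p_wb: np.ndarray = field(default_factory=lambda: np.zeros(3))  # device position in the world frame
    v_wb: np.ndarray = field(default_factory=lambda: np.zeros(3))  # velocity in the world frame
    b_g:  np.ndarray = field(default_factory=lambda: np.zeros(3))  # gyroscope bias
    b_a:  np.ndarray = field(default_factory=lambda: np.zeros(3))  # accelerometer bias
```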
art object pose estimation methods using RGB images. This pipeline runs in real time and achieves both high frame rates and accurate tracking on low-end mobile devices. The main contributions of our work are summarized as follows:
• A monocular visual-inertial-based system with a client-server architecture to track objects at flexible frame rates on mid-level or low-level mobile devices.
• A fast pose inspection algorithm (PIA) to quickly determine the correctness of the object pose during tracking.
• A bias self-correction mechanism (BSCM) to improve pose propagation accuracy.
• A lightweight object pose dataset with RGB images and IMU measurements to evaluate the quality of object tracking.
2 RELATED WORK
2.1 Object Pose Estimation
Object pose estimation has long been an open problem. Among the many studies on it, some [14, 25, 38] use depth information to address the task and indeed yield satisfactory results. Unfortunately, RGB-D input is often unavailable or impractical in real use cases. We therefore focus on methods that do not rely on depth information.
2.1.1 Classical Methods
Conventional methods that estimate the object pose from an RGB image can be classified as either feature-based or template-based. In feature-based methods [26, 32, 37], features extracted from the 2D image are matched with those on the object's 3D model. Given the 2D-3D correspondences, the object pose is estimated by solving a PnP problem [11, 22, 24]. Methods of this kind still perform well under occlusion, but fail on textureless objects that lack distinctive features. Template-based methods [15, 16, 31] can handle both textured and textureless objects. Synthetic images rendered from different camera viewpoints around the object's 3D model form a template database, and the input image is matched against these templates to find the object pose. However, such methods are sensitive to occlusion and lose robustness when objects are partially hidden.
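For illustration, a minimal sketch of the feature-based pipeline is given below, using OpenCV's ORB features and RANSAC-based PnP as stand-ins for the methods cited above; the specific detector, matcher, and solver choices are our assumptions, not those of the cited works.

```python
# Minimal feature-based pose estimation: match 2D image features to known
# 3D model points, then recover the object pose with a PnP solver.
import cv2
import numpy as np

def estimate_pose(image, model_points_3d, model_descriptors, K):
    """Estimate the object pose from one RGB image (illustrative sketch)."""
    orb = cv2.ORB_create()
    kps, desc = orb.detectAndCompute(image, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc, model_descriptors)
    pts_2d = np.float32([kps[m.queryIdx].pt for m in matches])
    pts_3d = np.float32([model_points_3d[m.trainIdx] for m in matches])
    # RANSAC rejects outlier correspondences, which is what keeps
    # feature-based methods usable under partial occlusion.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
    return (rvec, tvec) if ok else None
```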
2.1.2 Deep Learning-based Methods
Learning-based methods can also be categorized into direct and PnP-based approaches. Direct approaches regress or infer poses with feed-forward neural networks. SSD6D [20] disentangles the 6D pose into viewpoint and in-plane rotation, first estimating the rotation and then inferring the 3D translation from the estimated rotation and the 2D bounding box. PoseCNN [41] generates semantic labels and localizes the object center together with its distance to the camera via a CNN. PnP-based approaches find 2D-3D correspondences by deep learning and delegate the pose estimation itself to separate PnP solvers. PVNet [29] selects keypoints on the surface of the 3D object model based on their distance from the center. A voting-based algorithm then locates the most likely keypoint positions in the image, which allows PVNet to handle occluded objects effectively. Yu et al. [43] propose a differentiable proxy voting loss (DPVL) to reduce the search error of object keypoints. Some studies, such as RePOSE [19] and RNNPose [42], add post-refinement procedures for better pose accuracy. However, these multi-stage pipelines are too slow for real-time applications.
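To illustrate the voting idea behind PVNet-like methods, the sketch below hypothesizes keypoint locations from pairs of pixel-wise direction predictions and scores each hypothesis by how many pixels agree with it; this is a simplified reconstruction of the general RANSAC-voting scheme, not the exact published algorithm.

```python
# Simplified RANSAC-style keypoint voting: each object pixel predicts a unit
# direction toward a keypoint; pixel pairs generate hypotheses by ray
# intersection, and the hypothesis with the most supporting pixels wins.
import numpy as np

def vote_keypoint(pixels, dirs, n_hypotheses=128, cos_thresh=0.99, seed=0):
    """pixels: (N, 2) pixel coordinates; dirs: (N, 2) predicted unit vectors."""
    rng = np.random.default_rng(seed)
    best_pt, best_votes = None, -1
    for _ in range(n_hypotheses):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect rays p_i + t*d_i and p_j + s*d_j (2x2 linear system).
        A = np.column_stack([dirs[i], -dirs[j]])
        if abs(np.linalg.det(A)) < 1e-6:
            continue  # near-parallel rays give no stable intersection
        t, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        hyp = pixels[i] + t * dirs[i]
        # A pixel supports the hypothesis if its predicted direction agrees
        # with the direction from the pixel toward the hypothesized keypoint.
        to_hyp = hyp - pixels
        to_hyp /= np.linalg.norm(to_hyp, axis=1, keepdims=True) + 1e-9
        votes = int(np.sum(np.sum(to_hyp * dirs, axis=1) > cos_thresh))
        if votes > best_votes:
            best_pt, best_votes = hyp, votes
    return best_pt
```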
2.2 Object Pose Tracking
The purpose of object pose tracking is to estimate object poses in videos. Beyond a single image, temporal information between consecutive frames is also utilized to facilitate estimation. Studies such as Li et al. [23] and Weng et al. [39] use a stereo camera or LiDAR to aid tracking, but this is impractical in real use cases where only a monocular camera is available. In real-world AR applications, IMUs are a commonplace alternative to stereo or RGB-D cameras. We therefore briefly review vision-based and visual-inertial-based methods.
2.2.1 Vision-based Methods
Classical vision-based methods track features such as SIFT, SURF, and ORB and estimate the pose by solving a PnP problem. These methods can be highly accurate, but their high computational overhead and low robustness to image distortion and self-occlusion remain problems [28]. Using deep learning, Zhong et al. [44] track objects in videos effectively by segmenting them from each frame, even under heavy occlusion.
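A minimal sketch of this classical tracking loop is given below, assuming OpenCV's KLT optical flow to propagate 2D features whose 3D model associations are already known; the function and variable names are illustrative rather than taken from the cited works.

```python
# Frame-to-frame feature tracking followed by PnP: tracked 2D points stay
# associated with known 3D model points, so each frame yields a pose update.
import cv2
import numpy as np

def track_and_solve(prev_gray, gray, pts_2d, pts_3d, K):
    """pts_2d: (N, 1, 2) float32 image points; pts_3d: (N, 3) model points."""
    # KLT optical flow propagates the previous frame's feature locations.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts_2d, None)
    good = status.ravel() == 1
    pts_2d, pts_3d = nxt[good], pts_3d[good]
    if good.sum() < 6:
        return None  # too few tracked points for a reliable PnP solution
    ok, rvec, tvec = cv2.solvePnP(pts_3d, pts_2d, K, None)
    return (rvec, tvec, pts_2d, pts_3d) if ok else None
```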
2.2.2 Visual-inertial-based Methods
Conventional visual-inertial fusion using extended Kalman filters [10, 27] or nonlinear optimization [30, 34] has been deployed in AR and robotic applications. However, these approaches suffer from low frame rates and long delays due to their high computational costs. Recently, learning-based methods [4, 6, 13] have been proposed that regress fused visual and inertial features for camera and object pose estimation.
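As an illustration of the propagation step inside such filters, the sketch below integrates bias-corrected IMU measurements forward by one time step. It is a simplified first-order update under assumed conventions (rotation-matrix state, world-frame gravity), not the exact formulation of the cited works.

```python
# Simplified one-step IMU propagation for EKF-style visual-inertial fusion.
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # world-frame gravity (assumed convention)

def skew(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(phi):
    """Rodrigues' formula: rotation matrix from a rotation vector."""
    theta = np.linalg.norm(phi)
    if theta < 1e-9:
        return np.eye(3)
    K = skew(phi / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def propagate(R, p, v, b_g, b_a, gyro, accel, dt):
    """Integrate pose and velocity with bias-corrected IMU measurements."""
    omega = gyro - b_g                        # angular rate in the body frame
    a_w = R @ (accel - b_a) + GRAVITY         # acceleration in the world frame
    R_next = R @ exp_so3(omega * dt)          # rotate by omega * dt
    p_next = p + v * dt + 0.5 * a_w * dt**2   # constant-acceleration update
    v_next = v + a_w * dt
    return R_next, p_next, v_next
```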
3 PROPOSED METHOD
Compared with the many implementations targeting PCs or servers, studies targeting mobile devices are scarce. MobilePose [17] uses two