A Flexible-Frame-Rate Vision-Aided Inertial Object Tracking System
for Mobile Devices
Yo-Chung Lau*1,2, Kuan-Wei Tseng3, I-Ju Hsieh2, Hsiao-Ching Tseng2, Yi-Ping Hung2
1Chunghwa Telecom Co., Ltd 2National Taiwan University 3Tokyo Institute of Technology
Figure 1: System overview. We propose an object pose tracking system with a client-server architecture for mobile AR applications. The system takes IMU measurements and RGB image sequences as input. On the frontend side (mobile device), we perform fast pose propagation based on IMU measurements. On the backend side (server), we utilize 6DoF object pose estimation models to estimate the object pose from RGB images. A more accurate object pose from the backend is sent back to the frontend to refine the pose and calibrate the biases of the IMU measurements. Note that T_i and I_i denote the object pose and the image taken at timestamp i, respectively.
ABSTRACT
Real-time object pose estimation and tracking is challenging but essential for emerging augmented reality (AR) applications. In general, state-of-the-art methods address this problem using deep neural networks, which indeed yield satisfactory results. Nevertheless, the high computational cost of these methods makes them unsuitable for mobile devices, where real-world applications usually take place. In addition, head-mounted displays such as AR glasses require at least 90 FPS to avoid motion sickness, which further complicates the problem. We propose a flexible-frame-rate object pose estimation and tracking system for mobile devices. It is a monocular visual-inertial-based system with a client-server architecture. Inertial measurement unit (IMU) pose propagation is performed on the client side for high-speed tracking, and RGB image-based 3D pose estimation is performed on the server side to obtain accurate poses, after which the pose is sent to the client side for visual-inertial fusion, where we propose a bias self-correction mechanism to reduce drift. We also propose a pose inspection algorithm to detect tracking failures and incorrect pose estimates. Connected by high-speed networking, our system supports flexible frame rates up to 120 FPS and guarantees high precision and real-time tracking on low-end devices. Both simulations and real-world experiments show that our method achieves accurate and robust object tracking.
Index Terms: Computing methodologies—Computer vision—Tracking; Human-centered computing—Mixed/augmented reality
*Corresponding author. Email: d06944010@ntu.edu.tw.
1 INTRODUCTION
The purpose of object pose estimation and tracking is to find the relative 6DoF transformation, including translation and rotation, between the object and the camera. This important task plays a significant role in real-life applications such as adding virtual objects in augmented reality (AR) [9, 35] and robotic manipulation [5, 7, 36]. Object pose tracking, in contrast to object pose estimation, puts emphasis on tracking the object pose across consecutive frames [18, 44]. This is challenging since real-time performance is required to ensure a coherent and smooth user experience. Despite the seeming prevalence of solutions, whether vision-only [8, 44] or visual-inertial [10, 13, 34], such methods are designed to run on computers or even servers. Hou et al. [17], based on Sandler et al. [33], propose lightweight networks to track objects on mobile devices, but the hardware requirements are still significant. Moreover, with the development of head-mounted displays, frame rate demands have increased. Although 60 FPS is sufficient for smartphone-based applications, more than 90 FPS is expected for AR glasses to prevent motion sickness.
We thus propose a lightweight system for accurate object pose estimation and tracking with visual-inertial fusion. It uses a client-server architecture that performs fast pose tracking on the client side and accurate pose estimation on the server side. The accumulated error, or drift, on the client side is diminished by data exchanges with the server. Specifically, the client is composed of three modules: a pose propagation module (PPM) to calculate a rough pose estimate via inertial measurement unit (IMU) integration (a generic sketch of this step follows the contribution list below); a pose inspection module (PIM) to detect tracking failures, including lost tracking and large pose errors; and a pose refinement module (PRM) to optimize the pose and update the IMU state vector to correct the drift based on the response from the server, which runs state-of-the-art object pose estimation methods on RGB images.
Figure 2: System workflow. The system is composed of a frontend client and a backend server. The frontend performs fast pose propagation with IMU data and fuses in the visual pose estimates produced by the backend server. The state vector contains the device pose and motion information such as the velocity and the biases of the IMU measurements.
This pipeline not only runs in real time but also achieves high frame rates and accurate tracking on low-end mobile devices. The main contributions of our work are summarized as follows:
- A monocular visual-inertial-based system with a client-server architecture to track objects with flexible frame rates on mid-level or low-level mobile devices.
- A fast pose inspection algorithm (PIA) to quickly determine the correctness of the object pose during tracking.
- A bias self-correction mechanism (BSCM) to improve pose propagation accuracy.
- A lightweight object pose dataset with RGB images and IMU measurements to evaluate the quality of object tracking.
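
To make the pose propagation step concrete, below is a minimal sketch of one IMU dead-reckoning step: bias-corrected gyroscope and accelerometer readings are integrated to update orientation, velocity, and position. This is a generic illustration under standard assumptions (gravity expressed in the world frame, a small integration step), not the paper's exact PPM; all function and variable names here are our own.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # world-frame gravity (assumed z-up)

def propagate_pose(R, p, v, gyro, accel, bg, ba, dt):
    """One IMU dead-reckoning step.

    R: 3x3 world-from-body rotation, p: position, v: velocity,
    gyro/accel: raw body-frame measurements, bg/ba: current bias estimates.
    """
    w = gyro - bg    # bias-corrected angular rate (rad/s)
    a = accel - ba   # bias-corrected specific force (m/s^2)

    # Incremental rotation over dt via Rodrigues' formula (SO(3) exponential map).
    theta = w * dt
    angle = np.linalg.norm(theta)
    if angle > 1e-12:
        k = theta / angle
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])
        dR = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)
    else:
        dR = np.eye(3)

    # Rotate specific force into the world frame and add gravity back.
    a_world = R @ a + GRAVITY
    p_next = p + v * dt + 0.5 * a_world * dt**2
    v_next = v + a_world * dt
    R_next = R @ dR
    return R_next, p_next, v_next
```

Because this integration accumulates error quadratically in position, the accuracy of the bias estimates bg and ba is critical, which is what motivates correcting them whenever a reliable visual pose arrives from the server.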
2 RELATED WORK
2.1 Object Pose Estimation
Object pose estimation has long been an open issue; of the many studies on this problem, some [14, 25, 38] use depth information and indeed yield satisfactory results. Unfortunately, RGB-D images are often unavailable or impractical in real use cases. As a result, we focus on methods that do not rely on depth information.
2.1.1 Classical Methods
Conventional methods that estimate object pose from an RGB image can be classified as either feature-based or template-based. In feature-based methods [26, 32, 37], features in the 2D image are extracted and matched with those on the object's 3D model. Given the 2D-3D correspondences, the object pose is estimated by solving a PnP problem [11, 22, 24]. This kind of method still performs well under occlusion, but fails on textureless objects without distinctive features. Template-based methods [15, 16, 31] can handle both textured and textureless objects. Synthetic images rendered around the object's 3D model from different camera viewpoints are generated as a template database, and the input image is matched against the templates to find the object pose. However, these methods are not robust when objects are occluded.
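
As a concrete reference for the PnP step above, the snippet below recovers an object pose from 2D-3D correspondences with OpenCV's RANSAC-based PnP solver. This is a generic sketch of the classical pipeline rather than code from any cited method; the correspondences and camera intrinsics are placeholders standing in for the output of feature matching and calibration.

```python
import numpy as np
import cv2

# Placeholder inputs: N matched 3D model points and 2D image points (N >= 4).
# In a feature-based pipeline these come from descriptor matching against
# the textured 3D model; random values are used here only to keep the
# sketch self-contained.
object_points = np.random.rand(12, 3).astype(np.float32)
image_points = np.random.rand(12, 2).astype(np.float32)
K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0, 0, 1]], dtype=np.float32)   # assumed pinhole intrinsics
dist_coeffs = np.zeros(4, dtype=np.float32)   # assume an undistorted image

# RANSAC discards outlier matches, which is why feature-based methods
# tolerate partial occlusion.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, dist_coeffs,
    iterationsCount=100, reprojectionError=3.0)
if ok:
    R, _ = cv2.Rodrigues(rvec)   # object-to-camera rotation matrix
    print("R =\n", R, "\nt =", tvec.ravel())
```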
2.1.2 Deep Learning-based Methods
Learning-based methods can also be categorized into direct and PnP-based approaches. Direct approaches regress or infer poses with feed-forward neural networks. SSD6D [20] disentangles the 6D pose into viewpoint and in-plane rotation, first estimating the rotation and then inferring the 3D translation from the rotation and the bounding box. PoseCNN [41] generates semantic labels and localizes the object center, along with its distance to the camera, via a CNN. PnP-based approaches find 2D-3D correspondences by deep learning, and the object pose is then estimated by a PnP solver. PVNet [29] selects keypoints by the distance from the center to the surface of the 3D object model; a voting-based algorithm is also used to find the most likely keypoint locations in the image, which allows PVNet to effectively handle occluded objects. Yu et al. [43] propose a differentiable proxy voting loss (DPVL) to reduce the search error of object keypoints. Some studies, such as RePOSE [19] and RNNPose [42], add post-refinement procedures for better pose accuracy. However, these multi-stage pipelines are too slow for real-time applications.
2.2 Object Pose Tracking
The purpose of object pose tracking is to estimate object poses in videos. In addition to single images, temporal information between consecutive frames is utilized to facilitate estimation. Studies such as Li et al. [23] and Weng et al. [39] use a stereo camera or lidar to aid tracking, but this is not practical in real use cases, where only a monocular camera is available. In real-world AR applications, IMUs are a commonplace alternative to stereo or RGB-D cameras. Thus, we briefly introduce vision-based and visual-inertial-based methods.
2.2.1 Vision-based Methods
Classical vision-based methods track features such as SIFT, SURF, and ORB to estimate the pose by solving a PnP problem. These methods can be accurate, but their high computational overhead and low robustness to image distortion and self-occlusion are problematic [28]. Using deep learning, Zhong et al. [44] track objects in video effectively by segmenting them from each frame, even under heavy occlusion.
2.2.2 Visual-inertial-based Methods
Conventional visual-inertial fusion using extended Kalman filters [10, 27] or nonlinear optimization [30, 34] has been deployed in AR and robotic applications. However, these approaches suffer from low frame rates and long delays due to their high computational cost. Recently, learning-based methods [4, 6, 13] have been proposed that regress fused visual and inertial features for camera and object pose estimation.
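
To give a feel for the kind of correction such a fusion performs, the toy update below snaps an IMU-propagated pose toward a newly arrived (slower but more accurate) visual pose and attributes the rotation residual to gyroscope bias. This is a deliberately simplified first-order scheme of our own for illustration only; it is neither an EKF nor the paper's BSCM, and it assumes drift over the elapsed window is dominated by a constant bias.

```python
import numpy as np

def fuse_vision_pose(R_imu, p_imu, R_vis, p_vis, bg, dt_window, alpha=0.8):
    """Toy correction: blend poses and update the gyro bias estimate.

    R_imu/p_imu: propagated pose, R_vis/p_vis: visual pose,
    bg: current gyro bias, dt_window: time since the last correction.
    """
    # Rotation residual between propagation and the visual estimate.
    dR = R_imu.T @ R_vis
    # SO(3) log map: recover the axis-angle residual from dR.
    angle = np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0))
    if angle > 1e-12:
        axis = np.array([dR[2, 1] - dR[1, 2],
                         dR[0, 2] - dR[2, 0],
                         dR[1, 0] - dR[0, 1]]) / (2.0 * np.sin(angle))
        residual = angle * axis
    else:
        residual = np.zeros(3)
    # Attribute (part of) the residual, spread over the window, to gyro bias.
    bg_new = bg + alpha * residual / dt_window
    # Trust the visual translation more than the propagated one.
    p_new = alpha * p_vis + (1 - alpha) * p_imu
    return R_vis, p_new, bg_new
```

A real system must also handle the latency of the visual pose (it describes a past timestamp, not the present), which motivates refining and re-propagating the frontend state rather than simply overwriting it.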
3 PROPOSED METHOD
Compared with the many implementations targeting PCs or servers, there is a lack of studies for mobile devices. MobilePose [17] uses two