
Figure 2: System workflow. The system is composed of a frontend client and a backend server. The frontend performs fast pose propagation with IMU data and fuses it with the visual pose estimates from the backend server. The state vector contains the device pose and motion information such as the velocity and the biases of the IMU measurements.
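To make the state vector concrete, the following minimal Python sketch shows one plausible layout; the exact parameterization (quaternion orientation, world-frame quantities) is an assumption for illustration, not a detail specified by the system.

```python
# Illustrative layout of the filter state described in Figure 2
# (the parameterization below is assumed, not given by the paper).
import numpy as np
from dataclasses import dataclass, field

@dataclass
class FilterState:
    q_wb: np.ndarray = field(default_factory=lambda: np.array([1.0, 0.0, 0.0, 0.0]))  # device orientation (unit quaternion)
    p_wb: np.ndarray = field(default_factory=lambda: np.zeros(3))  # device position in the world frame
    v_wb: np.ndarray = field(default_factory=lambda: np.zeros(3))  # velocity in the world frame
    b_g:  np.ndarray = field(default_factory=lambda: np.zeros(3))  # gyroscope bias
    b_a:  np.ndarray = field(default_factory=lambda: np.zeros(3))  # accelerometer bias
```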
art object pose estimation methods using RGB images. This pipeline runs in real time and achieves both high frame rates and accurate tracking on low-end mobile devices. The main contributions of our work are summarized as follows:
• A monocular visual-inertial-based system with a client-server architecture to track objects at flexible frame rates on mid-level or low-level mobile devices.
• A fast pose inspection algorithm (PIA) to quickly determine the correctness of the object pose during tracking.
• A bias self-correction mechanism (BSCM) to improve pose propagation accuracy.
• A lightweight object pose dataset with RGB images and IMU measurements to evaluate the quality of object tracking.
2 RELATED WORK
2.1 Object Pose Estimation
Object pose estimation has long been an open problem. Among the many studies on it, some [14, 25, 38] use depth information to address the task and indeed yield satisfactory results. Unfortunately, RGB-D input is often unavailable or impractical in real use cases. We therefore focus on methods that do not rely on depth information.
2.1.1 Classical Methods
Conventional methods that estimate the object pose from an RGB image can be classified as either feature-based or template-based. In feature-based methods [26, 32, 37], features extracted from the 2D image are matched with those on the object's 3D model. Given the 2D-3D correspondences, the object pose is estimated by solving a PnP problem [11, 22, 24]. Methods of this kind still perform well under occlusion, but fail on textureless objects that lack distinctive features. Template-based methods [15, 16, 31] can handle both textured and textureless objects. Synthetic images rendered from different camera viewpoints around the object's 3D model form a template database, and the input image is matched against these templates to find the object pose. However, such methods are sensitive to occlusion and lose robustness when objects are partially hidden.
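For illustration, a minimal sketch of the feature-based pipeline is given below, using OpenCV's ORB features and RANSAC-based PnP as stand-ins for the methods cited above; the specific detector, matcher, and solver choices are our assumptions, not those of the cited works.

```python
# Minimal feature-based pose estimation: match 2D image features to known
# 3D model points, then recover the object pose with a PnP solver.
import cv2
import numpy as np

def estimate_pose(image, model_points_3d, model_descriptors, K):
    """Estimate the object pose from one RGB image (illustrative sketch)."""
    orb = cv2.ORB_create()
    kps, desc = orb.detectAndCompute(image, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc, model_descriptors)
    pts_2d = np.float32([kps[m.queryIdx].pt for m in matches])
    pts_3d = np.float32([model_points_3d[m.trainIdx] for m in matches])
    # RANSAC rejects outlier correspondences, which is what keeps
    # feature-based methods usable under partial occlusion.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
    return (rvec, tvec) if ok else None
```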
2.1.2 Deep Learning-based Methods
Learning-based methods can also be categorized into direct and PnP-based approaches. Direct approaches regress or infer poses with feed-forward neural networks. SSD6D [20] disentangles the 6D pose into viewpoint and in-plane rotation, first estimating the rotation and then inferring the 3D translation from the estimated rotation and the 2D bounding box. PoseCNN [41] generates semantic labels and localizes the object center together with its distance to the camera via a CNN. PnP-based approaches find 2D-3D correspondences by deep learning and delegate the pose estimation itself to separate PnP solvers. PVNet [29] selects keypoints on the surface of the 3D object model based on their distance from the center. A voting-based algorithm then locates the most likely keypoint positions in the image, which allows PVNet to handle occluded objects effectively. Yu et al. [43] propose a differentiable proxy voting loss (DPVL) to reduce the search error of object keypoints. Some studies, such as RePOSE [19] and RNNPose [42], add post-refinement procedures for better pose accuracy. However, these multi-stage pipelines are too slow for real-time applications.
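To illustrate the voting idea behind PVNet-like methods, the sketch below hypothesizes keypoint locations from pairs of pixel-wise direction predictions and scores each hypothesis by how many pixels agree with it; this is a simplified reconstruction of the general RANSAC-voting scheme, not the exact published algorithm.

```python
# Simplified RANSAC-style keypoint voting: each object pixel predicts a unit
# direction toward a keypoint; pixel pairs generate hypotheses by ray
# intersection, and the hypothesis with the most supporting pixels wins.
import numpy as np

def vote_keypoint(pixels, dirs, n_hypotheses=128, cos_thresh=0.99, seed=0):
    """pixels: (N, 2) pixel coordinates; dirs: (N, 2) predicted unit vectors."""
    rng = np.random.default_rng(seed)
    best_pt, best_votes = None, -1
    for _ in range(n_hypotheses):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect rays p_i + t*d_i and p_j + s*d_j (2x2 linear system).
        A = np.column_stack([dirs[i], -dirs[j]])
        if abs(np.linalg.det(A)) < 1e-6:
            continue  # near-parallel rays give no stable intersection
        t, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        hyp = pixels[i] + t * dirs[i]
        # A pixel supports the hypothesis if its predicted direction agrees
        # with the direction from the pixel toward the hypothesized keypoint.
        to_hyp = hyp - pixels
        to_hyp /= np.linalg.norm(to_hyp, axis=1, keepdims=True) + 1e-9
        votes = int(np.sum(np.sum(to_hyp * dirs, axis=1) > cos_thresh))
        if votes > best_votes:
            best_pt, best_votes = hyp, votes
    return best_pt
```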
2.2 Object Pose Tracking
The purpose of object pose tracking is to estimate object poses in videos. Beyond a single image, temporal information between consecutive frames is also utilized to facilitate estimation. Studies such as Li et al. [23] and Weng et al. [39] use a stereo camera or LiDAR to aid tracking, but this is impractical in real use cases where only a monocular camera is available. In real-world AR applications, IMUs are a commonplace alternative to stereo or RGB-D cameras. We therefore briefly review vision-based and visual-inertial-based methods.
2.2.1 Vision-based Methods
Classical vision-based methods track features such as SIFT, SURF, and ORB and estimate the pose by solving a PnP problem. These methods can be highly accurate, but their high computational overhead and low robustness to image distortion and self-occlusion remain problems [28]. Using deep learning, Zhong et al. [44] track objects in videos effectively by segmenting them from each frame, even under heavy occlusion.
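A minimal sketch of this classical tracking loop is given below, assuming OpenCV's KLT optical flow to propagate 2D features whose 3D model associations are already known; the function and variable names are illustrative rather than taken from the cited works.

```python
# Frame-to-frame feature tracking followed by PnP: tracked 2D points stay
# associated with known 3D model points, so each frame yields a pose update.
import cv2
import numpy as np

def track_and_solve(prev_gray, gray, pts_2d, pts_3d, K):
    """pts_2d: (N, 1, 2) float32 image points; pts_3d: (N, 3) model points."""
    # KLT optical flow propagates the previous frame's feature locations.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts_2d, None)
    good = status.ravel() == 1
    pts_2d, pts_3d = nxt[good], pts_3d[good]
    if good.sum() < 6:
        return None  # too few tracked points for a reliable PnP solution
    ok, rvec, tvec = cv2.solvePnP(pts_3d, pts_2d, K, None)
    return (rvec, tvec, pts_2d, pts_3d) if ok else None
```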
2.2.2 Visual-inertial-based Methods
Conventional visual-inertial fusion using extended Kalman filters [10, 27] or nonlinear optimization [30, 34] has been deployed in AR and robotic applications. However, these approaches suffer from low frame rates and long delays due to their high computational costs. Recently, learning-based methods [4, 6, 13] have been proposed that regress fused visual and inertial features for camera and object pose estimation.
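As an illustration of the propagation step inside such filters, the sketch below integrates bias-corrected IMU measurements forward by one time step. It is a simplified first-order update under assumed conventions (rotation-matrix state, world-frame gravity), not the exact formulation of the cited works.

```python
# Simplified one-step IMU propagation for EKF-style visual-inertial fusion.
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # world-frame gravity (assumed convention)

def skew(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(phi):
    """Rodrigues' formula: rotation matrix from a rotation vector."""
    theta = np.linalg.norm(phi)
    if theta < 1e-9:
        return np.eye(3)
    K = skew(phi / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def propagate(R, p, v, b_g, b_a, gyro, accel, dt):
    """Integrate pose and velocity with bias-corrected IMU measurements."""
    omega = gyro - b_g                        # angular rate in the body frame
    a_w = R @ (accel - b_a) + GRAVITY         # acceleration in the world frame
    R_next = R @ exp_so3(omega * dt)          # rotate by omega * dt
    p_next = p + v * dt + 0.5 * a_w * dt**2   # constant-acceleration update
    v_next = v + a_w * dt
    return R_next, p_next, v_next
```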
3 PROPOSED METHOD
Compared with the many implementations targeting PCs or servers, studies targeting mobile devices are scarce. MobilePose [17] uses two