ery possibility. This creates a vulnerability to differences be-
tween the distribution of poses in training and testing data,
as well as gaps in appearance domains. This limits the ro-
bustness and accuracy of the system.
To improve the visible feature-pose constraints, recent
methods (Mo et al. 2022; Dong et al. 2019; Zeng et al. 2022)
solve the ambiguity of multiple ground-truth poses relating
to the same visible feature by modeling symmetric objects.
After that, some methods (Sun et al. 2022; Mo et al. 2022) attempt to leverage instance segmentation to mitigate the impact of visible feature differences arising from camera intrinsic factors; however, the difference in object pose (camera extrinsics) still hampers the model's capacity to accurately regress the pose from visible features. Besides, several direct methods (Wang et al. 2020, 2021) introduce an auxiliary loss to regress intermediate geometric features, such as 2D-3D correspondences, akin to indirect methods. However, these geometric features do not constitute complete constraints that enable the network to regress the pose parameters from them.
To solve these issues, we propose a geometric constraints (Geo6D) learning approach that introduces a reformulated pose transformation to establish robust constraints in both camera and object frames via a relative offset representation. Specifically, the proposed Geo6D constraints
are built upon the pose transformation formula: the 3D coordinates of a rigid object's points in different frames can be transformed into one another via the pose. To address the distribution gap, we introduce a reference point and reformulate the pose transformation formula from a camera-frame 3D coordinate representation (the offset from the camera to each visible point) to a relative offset from each visible point to the selected reference point. To make the formula learning-friendly and mathematically correct during network fitting, we separate the variables by coordinate frame as explicit geometric constraints, as illustrated in Fig. 1. For the
camera frame variables, we supply and linearize all required variables in the camera frame and feed them to the network as input. For the object frame constraints, we introduce an additional regression output head to predict the corresponding relative offset in the object frame.
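In our own notation (the symbols below are ours, not necessarily the paper's), one consistent reading of this relative-offset reformulation can be written out explicitly. For a visible point with coordinates $p^{o}$ in the object frame and $p^{c}$ in the camera frame, and a selected reference point $p_{r}$:

```latex
% Standard rigid pose transform (object frame -> camera frame):
p^{c} = R\,p^{o} + t
% The same transform applied to the chosen reference point:
p^{c}_{r} = R\,p^{o}_{r} + t
% Subtracting cancels the translation t, leaving a constraint that
% couples the relative offsets in the two frames through R alone:
p^{c} - p^{c}_{r} = R\,\left(p^{o} - p^{o}_{r}\right)
```

Under this reading, the camera-frame offset on the left is computable from the input alone, while the object-frame offset on the right is a quantity an additional output head can regress.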
We encapsulate the Geo6D mechanism as a plugin that rebuilds the input and output targets of the network, and integrate it with two pose estimation networks. Extensive experiments demonstrate that our method enhances accuracy and stability and reduces the amount of training data required, without sacrificing training or inference efficiency. It requires only 10% of the training data to reach performance comparable to training on the full set.
Furthermore, we analyze the impact of the Geo6D mecha-
nism from the perspective of the loss function.
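As a rough illustration of such an input/target rebuild (a sketch in our own hypothetical names, not code from the paper), one could compute camera-frame relative offsets as extra network inputs and object-frame relative offsets as regression targets, using the visible points' centroid as one simple choice of reference point:

```python
# Hypothetical sketch of a Geo6D-style input/target rebuild.
# Function names and the centroid reference choice are ours.

def mat_vec(R, p):
    """Apply a 3x3 matrix (list of rows) to a 3-vector."""
    return [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]

def build_geo6d_io(points_cam, R):
    """Rebuild network inputs/targets around a reference point.

    points_cam : visible points in the camera frame.
    R          : ground-truth rotation (object -> camera), train time.
    Note the translation t never appears: it cancels in the
    relative offsets p_c - r_c = R (p_o - r_o).
    """
    n = len(points_cam)
    # Reference point: centroid of the visible points.
    ref_cam = [sum(p[i] for p in points_cam) / n for i in range(3)]
    # Camera-frame offsets are computable from the input alone.
    inputs = [[p[i] - ref_cam[i] for i in range(3)] for p in points_cam]
    # Object-frame offsets (targets): rotate by R^T, since
    # p_o - r_o = R^T (p_c - r_c).
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    targets = [mat_vec(Rt, d) for d in inputs]
    return inputs, targets
```

At test time only the camera-frame inputs are available; the network's predicted object-frame offsets stand in for the targets.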
To summarize, our main contributions are:
• Introducing a pose transformation formula in a relative offset representation to establish explicit geometric constraints for direct methods.
• Proposing the Geo6D mechanism, a plugin module that
processes input data and optimization targets to adhere to
the geometric constraints, making the network learning-
friendly and mathematically correct.
• Extensive experimental results demonstrate that the proposed Geo6D effectively improves the accuracy of existing direct pose estimation methods, achieving state-of-the-art overall results and reducing the training data requirement, thus making it more practical for real-world applications.
Related work
Indirect 6D pose estimation
Indirect methods first predict intermediate geometric information and then exploit projection constraints to estimate the 6D pose via an optimization procedure. Recent methods (Peng et al. 2019; He et al. 2020, 2021) introduce the
keypoints mechanism in 6D pose estimation and then esti-
mate the 6D pose by a least-squares fitting algorithm, which
takes advantage of the geometric constraints of rigid ob-
jects to train the keypoint prediction network. Different from
the keypoints-based methods, 2D-3D correspondence-based
methods (Su et al. 2022; Hodan, Barath, and Matas 2020;
Haugaard and Buch 2022; Li, Wang, and Ji 2019; Park, Patten, and Vincze 2019; Rad and Lepetit 2017) first establish correspondences between 2D coordinates in the image plane and 3D coordinates in the object coordinate system with a neural network and then solve the 6D pose with a PnP or RANSAC algorithm. However, these indirect methods are optimized only for the first stage rather than the final pose regression, which is suboptimal compared with direct
methods. Moreover, the optimization is time-consuming and
computationally expensive in practical applications.
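The least-squares fitting step used by keypoint-based methods can be sketched with the standard Kabsch procedure (a generic textbook version, not code from any of the cited papers): given predicted keypoints in the camera frame and their known object-frame counterparts, the rigid pose minimizing the squared residual has a closed form via SVD.

```python
# Generic Kabsch-style least-squares pose fit from 3D keypoint pairs.
import numpy as np

def fit_pose(obj_kps, cam_kps):
    """Fit R, t minimizing sum ||R @ a_i + t - b_i||^2.

    obj_kps : (N, 3) keypoints in the object frame (a_i).
    cam_kps : (N, 3) predicted keypoints in the camera frame (b_i).
    """
    mu_o = obj_kps.mean(axis=0)
    mu_c = cam_kps.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (obj_kps - mu_o).T @ (cam_kps - mu_c)
    U, _, Vt = np.linalg.svd(H)
    # Correct an improper rotation (reflection) if the determinant is -1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_c - R @ mu_o
    return R, t
```

The closed form is what makes this stage cheap; the cost criticized above comes from RANSAC-style robust variants and iterative refinement layered on top.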
Direct 6D pose estimation
To estimate 6D pose efficiently, recent approaches (Mo et al.
2022; Jiang et al. 2022; Li et al. 2018; Wang et al. 2020,
2021) directly regress the final 6D pose parameters from
the neural network instead of intermediate results. DenseFusion (Wang et al. 2019) extracts visible-region features from RGB-D images using two separate backbones for the 2D and 3D spaces and fuses them with a dense fusion network. Uni6D (Jiang et al. 2022) simplifies the architecture to a single homogeneous backbone that processes RGB-D data, introducing extra UV data into the input to preserve the projection constraints. Since the
corresponding visible features of a symmetric object are subject to visual ambiguity, multiple ground-truth poses can map to the same visible features, which confuses network fitting. ES6D individually models different types of symmetric objects to resolve this many-to-one pose mapping. Besides, the camera intrinsics are another factor shaping the visible features of the object; Uni6Dv2 adopts an instance segmentation method to mitigate the impact of visible feature differences caused by differing camera intrinsics. However, the pose parameters are unbounded, and the mapping from visible features to poses cannot be covered exhaustively, so it remains fragile for object poses unseen in the test scene.
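The extra-UV-input idea attributed to Uni6D above amounts to giving the network each pixel's image-plane coordinates as additional channels alongside RGB-D. A minimal sketch (the normalization choice and function name are ours, not Uni6D's):

```python
# Per-pixel UV coordinate channels to concatenate with RGB-D input,
# so the network retains the projection geometry of each pixel.
def uv_channels(height, width):
    """Return normalized per-pixel (u, v) coordinate maps in [0, 1]."""
    u = [[x / (width - 1) for x in range(width)] for _ in range(height)]
    v = [[y / (height - 1) for _ in range(width)] for y in range(height)]
    return u, v
```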
To enhance network training, some methods (Wang et al. 2020, 2021) leverage intermediate geometric features, i.e., 2D-3D correspondences, akin to indirect methods, as an