Generative Category-Level Shape and Pose
Estimation with Semantic Primitives
Guanglin Li1,2, Yifeng Li2, Zhichao Ye1, Qihang Zhang3,
Tao Kong2, Zhaopeng Cui1, Guofeng Zhang1
1State Key Lab of CAD&CG, Zhejiang University 2ByteDance AI Lab
3Multimedia Laboratory, The Chinese University of Hong Kong
Abstract: Empowering autonomous agents with 3D understanding of daily objects is a grand challenge in robotics applications. When exploring an unknown environment, existing methods for object pose estimation are still unsatisfactory due to the diversity of object shapes. In this paper, we propose a novel framework for category-level object shape and pose estimation from a single RGB-D image. To handle the intra-category variation, we adopt a semantic primitive representation that encodes diverse shapes into a unified latent space, which is the key to establishing reliable correspondences between observed point clouds and estimated shapes. Then, by using a SIM(3)-invariant shape descriptor, we gracefully decouple the shape and pose of an object, thus supporting latent shape optimization of target objects in arbitrary poses. Extensive experiments show that the proposed method achieves state-of-the-art pose estimation performance and better generalization on a real-world dataset. Code and video are available at https://zju3dv.github.io/gCasp.
Keywords: Category-level Pose Estimation, Shape Estimation
1 Introduction
Estimating the shape and pose of daily objects is a fundamental capability with various applications, including 3D scene understanding, robot manipulation, and autonomous warehousing [1,2,3,4,5,6]. Early works on this task focused on instance-level pose estimation [7,8,9,10,11], which aligns the observed object with a given CAD model. However, such a setting is limited in real-world scenarios, since it is hard to obtain the exact model of an arbitrary object in advance. To generalize to unseen but semantically familiar objects, category-level pose estimation is attracting more and more research attention [12,13,14,15,16,17], since it can potentially handle various instances of the same category in real scenes.
Existing works on category-level pose estimation usually predict pixel-wise normalized coordinates for instances within one class [12] or adopt a canonical prior model with shape deformations to estimate object poses [14,15]. Although great advances have been made, these one-pass prediction methods still face difficulties when large shape differences exist within the same category. To handle the variety of intra-class objects, some works [18,16] leverage a neural implicit representation [19] to fit the shape of the target object by iteratively optimizing the pose and shape in a latent space, and achieve better performance. However, in such methods, the pose and shape estimation are coupled together and rely on each other for reliable results; thus, their performance is unsatisfactory in real scenes (see Sec. 4.2). In short, huge intra-class shape differences and the coupled estimation of shape and pose are the two main challenges for existing category-level pose estimation methods.
To tackle these challenges, we propose to estimate object poses and shapes with semantic primitives from a generative perspective. The insight behind this is that the objects of a category are often composed of components with the same semantics even though their shapes vary, e.g., a cup is usually composed of a semicircular handle and a cylindrical body (see Fig. 1(a)).
Work done during Guanglin Li’s internship at ByteDance.
6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand.
arXiv:2210.01112v2 [cs.CV] 1 Feb 2023
Figure 1: (a) An example of different instances of the same category; we highlight the same semantic parts (e.g., lenses of cameras) across different instances. (b) The detailed shapes (top row) are decomposed into semantic primitives (bottom row). Different colors on the primitives indicate different semantic labels, which are consistent across different shapes. (c) An overview of the generative model, consisting of two auto-decoders (i.e., $g_f$ and $g_c$) sharing the same latent space. $g_f$ captures fine details, and $g_c$ represents the shapes by simple and semantically consistent shape primitives.
This property inspired us to handle huge differences in intra-class shapes by fusing category-level semantic information. Specifically, we propose to encode object shapes into semantic primitives, which constitute a unified semantic representation among objects of the same category (see Fig. 1(b)). Following Hao et al. [20], we map diverse shapes and the corresponding semantic primitives into the same shared latent space of a generative model. In this way, we can obtain the correspondence between the observed point cloud and the latent space by dividing the observed point cloud into semantic primitives with a part segmentation network.
To disentangle the shape and pose estimation, we further extract a novel SIM(3)-invariant descriptor from the geometric distribution of the point cloud's semantic primitives. By leveraging such SIM(3)-invariant descriptors, we can perform latent optimization of object shapes in arbitrary poses and obtain a resulting shape consistent with the observation. After obtaining the estimated shape, we use a correspondence-based algorithm to recover the size and pose by aligning the semantic primitives of the target object with those of the estimated shape.
In summary, our contributions are as follows: 1) We propose a novel category-level pose estimation method that uses semantic primitives to model diverse shapes and bridges the observed point clouds with implicitly generated shapes. 2) We propose an online shape estimation method that enables object shape optimization in arbitrary poses with a novel SIM(3)-invariant descriptor. 3) Even when trained only on synthetic data, our method shows good generalization and achieves state-of-the-art category-level pose estimation performance on a real dataset.
2 Related Works
Instance-Level Object Pose Estimation. Instance-level pose estimation assumes the CAD model of each test object is known in the training phase, and previous methods mainly fall into three categories. The first category of methods [21,8,7,22] directly regresses the 6D object pose from RGB or RGB-D images. PoseCNN [7] extends a 2D detection architecture to regress the 6D object pose from RGB images, and DenseFusion [8] combines color and depth information from the RGB-D input to make the regressed pose more accurate. The second category is correspondence-based methods [9,10,11,23], which predict correspondences between 2D images and 3D models and then recover the pose with the Perspective-n-Point algorithm [24]. For example, PVNet [9] predicts 3D keypoints on the RGB image via a voting scheme, and CDPN [11] predicts dense correspondences between image pixels and the 3D object surface. The third category is rendering-based methods [25,26], which recover the pose by minimizing the re-projection error between the posed 3D model and the 2D image through a differentiable renderer. Compared with these instance-level methods, our category-level method does not need a known CAD model as prior.
Category-Level Object Pose Estimation. Category-level methods [12,27,13] aim to estimate the pose of arbitrary instances within a category. To overcome intra-category variations, NOCS [12] introduces a normalized object canonical space, which establishes a unified coordinate space among instances of the same category. CASS [27] learns a variational auto-encoder to recover the object shape and predicts the pose in an end-to-end neural network. These methods directly regress the pose or coordinates and thus struggle to form an intra-category representation.
[Figure 2 graphic: the pipeline runs from Mask R-CNN detection and depth back-projection through (a) Semantic Primitives Extraction, (b) Generative Shape Optimization, (c) Pose and Size Estimation, and (d) Object Mesh Reconstruction.]
Figure 2: Overview of the proposed method. The input of our method is the point cloud observed from a single-view depth image. (a) Our method first extracts semantic primitives $\tilde{C}$ from the object point cloud $P_0$ with a part segmentation network $\phi$. (b) We calculate a SIM(3)-invariant shape descriptor $f(\tilde{C})$ from $\tilde{C}$ and optimize a shape embedding $\hat{z}$ in the latent space by the difference between $f(\tilde{C})$ and $f(g_c(z))$, where $g_c$ and $g_f$ are the coarse and fine branches of the generative model $g$, detailed in Sec. 3.1. (c) The similarity transformation $\{\tilde{s}, \tilde{R}, \tilde{t}\}$ is recovered through the semantic correspondences between $\tilde{C}$ and the optimized shape $g_c(\hat{z})$. (d) Further, we can simply apply the transformation $\{\tilde{s}, \tilde{R}, \tilde{t}\}$ to the fine geometry generated through $g_f(\hat{z})$ and reconstruct the object-level scene as a by-product of our main results, as detailed in Sec. B of the supp. material.
Other methods utilize a prior model of each category as the category-level representation. For example, SPD [14] and SGPA [15] predict deformations on canonical prior models and recover the 6D pose through correspondences between the deformed model and the observed point cloud. However, when facing a significant difference between the prior model and the target object, the deformation prediction tends to fail. DualPoseNet [28] takes the consistency between an implicit pose decoder and an explicit one into account and refines the predicted pose during testing. Besides, FS-Net [13] proposes a novel data augmentation mechanism to overcome intra-category variations. SAR-Net [17] recovers the shape by modeling the symmetric correspondence of objects, which makes pose and size prediction easier. Recently, CenterSnap [29] jointly optimizes detection, reconstruction, and pose during training and achieves real-time inference. To exploit the diverse and continuous shapes in generative models, iCaps [16] and ELLIPSDF [18] introduce generative models [19,20] into their shape and pose estimation, where shapes and poses are jointly optimized. Unlike these previous works, we establish a semantically consistent representation and decouple the shape and pose estimation, which makes the optimization more robust.
3 Method
Problem Formulation. Given an observed point cloud $P_0 = \{p_i \mid i = 1, \dots, N_0\}$ of an object of known category, our goal is to estimate the complete shape, scale, and 6D pose of this object. The recovered shape is represented by a shape embedding $\hat{z}$ in a latent space. The estimated scale and 6D pose are represented as a similarity transformation $\{s, R, t\} \in \mathrm{SIM}(3)$, where scale $s \in \mathbb{R}$, 3D rotation $R \in \mathrm{SO}(3)$, and 3D translation $t \in \mathbb{R}^3$; $\mathrm{SIM}(3)$ and $\mathrm{SO}(3)$ denote the Lie groups of 3D similarity transformations and 3D rotations, respectively.
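For concreteness, the following minimal sketch (ours, not part of the paper's released code) shows how such a similarity transformation acts on a point cloud; the function and variable names are illustrative assumptions.

```python
import numpy as np

def apply_sim3(points: np.ndarray, s: float, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Apply a similarity transformation {s, R, t} in SIM(3) to an (N, 3) point cloud.

    Each point p is mapped to s * R @ p + t: scaled, rotated, then translated.
    """
    assert R.shape == (3, 3) and t.shape == (3,)
    return s * points @ R.T + t

# Example: map a canonical-frame shape into the camera frame (values are arbitrary).
canonical_points = np.random.rand(100, 3) - 0.5  # stand-in for a generated shape
transformed = apply_sim3(canonical_points, s=0.12, R=np.eye(3), t=np.array([0.0, 0.0, 0.8]))
```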
Overview. As illustrated in Fig. 2, our method takes an RGB-D image as input. A Mask R-CNN [30] network is used to obtain the mask and category of each object instance. In the following stages, our method only processes the observed instance point cloud back-projected from the masked depth image. To overcome the huge intra-category variations, we utilize a unified intra-category representation based on semantic primitives (Sec. 3.1). Then, we process the observed point cloud with a part segmentation network to extract semantic primitives, which establish a connection between the observed object and the latent space (Sec. 3.2). To estimate the shape of the target with its pose unknown, we design a SIM(3)-invariant descriptor and optimize the generative shape in the latent space (Sec. 3.3). Finally, we estimate the 6D pose and scale of the object through the semantic correspondences between the observed point cloud and our recovered generative shape (Sec. 3.4).
3.1 Semantic Primitive Representation
The huge intra-category differences make it challenging to represent objects with a standard model. To overcome this problem, we utilize a semantic primitive representation. The insight behind it is that although different instances of the same category have various shapes, they tend to have similar semantic parts; e.g., each camera instance has a lens, and each laptop has four corners (see Fig. 1(a)). In this representation, each instance is decomposed into several parts, and each part always corresponds to the same semantic part across different instances (see Fig. 1(b)). To generate this representation, we learn a generative model $g$ following the network structure in [20] and map instances of the same category into a latent space.
Specifically, Fig. 1(c) gives an overview of the generative model. The generative model $g$ expresses shapes at two levels of granularity, one capturing detailed shapes and the other representing an abstracted shape using semantically consistent shape primitives. We denote the fine-scale model as $g_f$ and the coarse-scale one as $g_c$; $g_f$ and $g_c$ are two auto-decoders sharing the same latent code $z$. $g_f(z, x) = SDF(x)$, where $x$ is a 3D position in canonical space and $SDF$ denotes the signed distance function. $g_c(z) = \{\alpha_i \mid i = 1, \dots, N_c\}$, where $N_c$ is the number of primitives and $\alpha_i$ denotes the parameters of a primitive. We use sphere primitives here, with $\alpha = (c, r)$, where $c$ is the 3D center of a sphere and $r$ is its radius. Please refer to [20] and Sec. A.3 in the supp. material for more details.
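For intuition, here is a minimal PyTorch sketch of this two-branch auto-decoder structure; the layer widths, depths, and default sizes are our own assumptions for illustration, not the exact architecture of [20].

```python
import torch
import torch.nn as nn

class GenerativeShapeModel(nn.Module):
    """Sketch of the generative model g: two auto-decoders over one shared latent code z.

    g_f(z, x) -> signed distance at 3D query point x (fine branch).
    g_c(z)    -> N_c sphere primitives alpha_i = (center, radius) (coarse branch).
    Layer sizes here are illustrative, not the paper's exact architecture.
    """

    def __init__(self, latent_dim: int = 128, num_primitives: int = 256):
        super().__init__()
        self.num_primitives = num_primitives
        # Fine branch: conditioned on (z, x), predicts one SDF value.
        self.g_f = nn.Sequential(
            nn.Linear(latent_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )
        # Coarse branch: predicts (c_x, c_y, c_z, r) for every sphere primitive.
        self.g_c = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, num_primitives * 4),
        )

    def sdf(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        """g_f(z, x): z is (B, latent_dim), x is (B, 3); returns (B, 1) SDF values."""
        return self.g_f(torch.cat([z, x], dim=-1))

    def primitives(self, z: torch.Tensor) -> torch.Tensor:
        """g_c(z): returns (B, N_c, 4) sphere parameters (3D center and radius)."""
        return self.g_c(z).view(-1, self.num_primitives, 4)
```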
3.2 Extract Semantic Primitives from Point Cloud
We use a part segmentation network to obtain the semantic primitives of the observed point cloud in two stages. First, we predict a semantic label $l_i$ for each point $p_i$ in the observed point cloud $P_0$; then, a centralization operation is performed on the points to extract the center of each primitive.
Semantic Label Prediction. For each point $p_i \in P_0$, its semantic label $l_i$ is defined by the closest primitive center $c_j$:
$$l_i = \operatorname*{arg\,min}_{j=1,\dots,N_c} \lVert p_i - c_j \rVert_2. \tag{1}$$
We treat point label prediction as a classification problem and follow the part segmentation version of 3D-GCN [31] to perform point-wise classification, with the standard cross entropy as the loss function. In addition, because of imperfect segmentation masks produced by Mask R-CNN, the point cloud back-projected from the masked depth image may contain extra background points. To filter out these outlier points, we add an extra "dust" label to the point-wise classification.
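To make Eq. (1) concrete, a minimal sketch of the ground-truth label assignment is given below; the function name and tensor shapes are our assumptions.

```python
import torch

def assign_semantic_labels(points: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """Ground-truth assignment of Eq. (1): each point takes the index of its
    nearest primitive center.

    points:  (N, 3) points sampled on a training shape.
    centers: (N_c, 3) primitive centers c_j from the coarse branch g_c.
    returns: (N,) integer labels l_i in [0, N_c).
    """
    dists = torch.cdist(points, centers)  # (N, N_c) pairwise Euclidean distances
    return dists.argmin(dim=1)
```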
Symmetric Loss. Many common objects in robot manipulation have symmetric structures, and most of them (e.g., bottles) are symmetric about an axis. A unique pre-labeled ground truth causes ambiguity when the object rotates around that axis. To eliminate the influence of symmetric structures on our classification task, we introduce a symmetric loss following [12]. For each symmetric object in the dataset, we define a symmetry axis and make the loss identical when the object rotates around that axis. Specifically, we generate a separate ground truth for each possible angle of rotational symmetry. Given the rotation angle $\theta$, the rotated semantic label $l_{\theta,i}$ is defined as:
$$l_{\theta,i} = \operatorname*{arg\,min}_{j=1,\dots,N_c} \lVert p_i - c_{\theta,j} \rVert_2, \tag{2}$$
where $c_{\theta,j}$ is the position of $c_j$ after rotating $\theta$ degrees about the axis. We then define the symmetric loss $L_{sym}$ as:
$$L_{sym} = \min_{\theta \in \Theta} L(\tilde{l}_i, l_{\theta,i}), \tag{3}$$
where $L$ is the standard cross entropy and $\tilde{l}_i$ is the label predicted by the network. We set $\Theta = \{0\}$ for non-symmetric objects and $\Theta = \{i \cdot 60^{\circ} \mid i = 0, \dots, 5\}$ for symmetric objects in our experiments.
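A minimal sketch of Eqs. (2)-(3) follows; taking the minimum over the batch-averaged loss (rather than per point) is our simplification, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def symmetric_loss(logits: torch.Tensor, labels_per_theta: torch.Tensor) -> torch.Tensor:
    """Symmetric loss of Eq. (3): minimum cross entropy over rotated ground truths.

    logits:           (N, N_c) per-point classification scores from the network.
    labels_per_theta: (T, N) rotated ground-truth labels l_{theta,i}, one row per
                      angle in Theta (T = 1 for non-symmetric objects).
    """
    losses = torch.stack([
        F.cross_entropy(logits, labels_per_theta[t])
        for t in range(labels_per_theta.shape[0])
    ])
    return losses.min()
```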
Primitive Centralization. After predicting the semantic label of each 3D point, we collect the semantic labels $\tilde{L} = \{l_i \mid i = 1, \dots, \tilde{N}_c\}$ that appear in the partially observed point cloud. Then, we calculate the primitive centers $\tilde{C} = \{c_l \mid l = l_1, \dots, l_{\tilde{N}_c}\}$ from the labeled points by averaging the points sharing the same label. Specifically,
$$c_l = \frac{1}{N_l} \sum_{i=1}^{N_0} p_i \, \mathbb{I}(\tilde{l}_i = l), \tag{4}$$
where $N_l$ is the number of points with predicted label $l$ and $\mathbb{I}(\cdot)$ is the indicator function.
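A minimal sketch of Eq. (4) is given below; returning a dictionary keyed by the observed labels is our own design choice for handling primitives that are occluded in the partial view.

```python
from typing import Dict

import torch

def primitive_centers(points: torch.Tensor, pred_labels: torch.Tensor,
                      num_primitives: int) -> Dict[int, torch.Tensor]:
    """Primitive centralization of Eq. (4): average all points sharing a label.

    Only labels that actually appear in the partial observation get a center;
    primitives that are fully occluded are simply absent from the result.
    """
    centers: Dict[int, torch.Tensor] = {}
    for label in range(num_primitives):
        mask = pred_labels == label        # indicator I(l~_i = label) of Eq. (4)
        if mask.any():                     # skip labels with no observed points
            centers[label] = points[mask].mean(dim=0)
    return centers
```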