Generative Category-Level Shape and Pose
Estimation with Semantic Primitives
Guanglin Li1,2, Yifeng Li2, Zhichao Ye1, Qihang Zhang3,
Tao Kong2, Zhaopeng Cui1, Guofeng Zhang1
1State Key Lab of CAD&CG, Zhejiang University 2ByteDance AI Lab
3Multimedia Laboratory, The Chinese University of Hong Kong
Abstract: Empowering autonomous agents with 3D understanding of daily objects is a grand challenge in robotics applications. When exploring an unknown environment, existing methods for object pose estimation are still unsatisfactory due to the diversity of object shapes. In this paper, we propose a novel framework for category-level object shape and pose estimation from a single RGB-D image. To handle the intra-category variation, we adopt a semantic primitive representation that encodes diverse shapes into a unified latent space, which is the key to establishing reliable correspondences between observed point clouds and estimated shapes. Then, by using a SIM(3)-invariant shape descriptor, we gracefully decouple the shape and pose of an object, thus supporting latent shape optimization of target objects in arbitrary poses. Extensive experiments show that the proposed method achieves state-of-the-art pose estimation performance and better generalization on a real-world dataset. Code and video are available at https://zju3dv.github.io/gCasp.
Keywords: Category-level Pose Estimation, Shape Estimation
1 Introduction
Estimating the shape and pose of daily objects is a fundamental capability with various applications, including 3D scene understanding, robot manipulation, and autonomous warehousing [1,2,3,4,5,6]. Early works on this task focused on instance-level pose estimation [7,8,9,10,11], which aligns the observed object with a given CAD model. However, such a setting is limited in real-world scenarios, since it is hard to obtain the exact model of an arbitrary object in advance. To generalize to unseen but semantically familiar objects, category-level pose estimation is attracting more and more research attention [12,13,14,15,16,17], since it can potentially handle various instances of the same category in real scenes.
Existing works on category-level pose estimation usually predict pixel-wise normalized coordinates for instances within one class [12] or adopt a canonical prior model with shape deformations to estimate object poses [14,15]. Although great advances have been made, these one-pass prediction methods still face difficulties when large shape differences exist within the same category. To handle the variety of intra-class objects, some works [18,16] leverage a neural implicit representation [19] to fit the shape of the target object by iteratively optimizing the pose and shape in a latent space, and achieve better performance. However, in such methods, the pose and shape estimation are coupled together and rely on each other for reliable results; thus, their performance is unsatisfactory in real scenes (see Sec. 4.2). In short, huge intra-class shape differences and the coupled estimation of shape and pose are the two main challenges for existing category-level pose estimation methods.
To tackle these challenges, we propose to estimate object poses and shapes with semantic primitives from a generative perspective. The insight behind this is that the objects of a category are often composed of components with the same semantics even though their shapes vary, e.g., a cup is usually composed of a semicircular handle and a cylindrical body (see Fig. 1(a)).
Work done during Guanglin Li’s internship at ByteDance.
6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand.
arXiv:2210.01112v2 [cs.CV] 1 Feb 2023
Figure 1: (a) An example of different instances of the same category; we highlight the same semantic parts (e.g., lenses of cameras) across different instances. (b) The detailed shapes (top row) are decomposed into semantic primitives (bottom row). Different colors on the primitives indicate different semantic labels, which are consistent across different shapes. (c) An overview of the generative model, consisting of two auto-decoders (i.e., $g_f$ and $g_c$) sharing the same latent space. $g_f$ captures fine details, and $g_c$ represents the shapes by simple and semantically consistent shape primitives.
This property inspired us to handle huge differences in intra-class shapes by fusing category-level semantic information. Specifically, we propose to encode object shapes into semantic primitives, which constitute a unified semantic representation among objects of the same category (see Fig. 1(b)). Following Hao et al. [20], we map diverse shapes and the corresponding semantic primitives into the same shared latent space of a generative model. In this way, we can obtain the correspondence between the observed point cloud and the latent space by dividing the observed point cloud into semantic primitives with a part segmentation network.
To disentangle the shape and pose estimation, we further extract a novel SIM(3)-invariant descriptor from the geometric distribution of the point cloud's semantic primitives. By leveraging such SIM(3)-invariant descriptors, we can perform latent optimization of object shapes in arbitrary poses and obtain a resulting shape consistent with the observation. After obtaining the estimated shape, we use a correspondence-based algorithm to recover the size and pose by aligning the semantic primitives of the target object with those of the estimated shape.
In summary, our contributions are as follows: 1) We propose a novel category-level pose estimation method that uses semantic primitives to model diverse shapes and bridges the observed point clouds with implicitly generated shapes. 2) We propose an online shape estimation method that enables object shape optimization in arbitrary poses with a novel SIM(3)-invariant descriptor. 3) Even when trained only on synthetic data, our method shows good generalization and achieves state-of-the-art category-level pose estimation performance on a real dataset.
2 Related Works
Instance-Level Object Pose Estimation. Instance-level pose estimation assumes the CAD model of each test object is known in the training phase, and previous methods mainly fall into three categories. The first category of methods [21,8,7,22] directly regresses the 6D object pose from RGB or RGB-D images. PoseCNN [7] extends a 2D detection architecture to regress the 6D object pose from RGB images, and DenseFusion [8] combines color and depth information from the RGB-D input to make the regressed pose more accurate. The second category is correspondence-based methods [9,10,11,23], which predict correspondences between 2D images and 3D models and then recover the pose with the Perspective-n-Point algorithm [24]. For example, PVNet [9] predicts 3D keypoints on the RGB image via a voting scheme, and CDPN [11] predicts dense correspondences between image pixels and the 3D object surface. The third category is rendering-based methods [25,26], which recover the pose by minimizing the re-projection error between the posed 3D model and the 2D image through a differentiable renderer. Compared with these instance-level methods, our category-level method does not need a known CAD model as prior.
Category-Level Object Pose Estimation. Category-level methods [12,27,13] aim to estimate the pose of arbitrary instances within a category. To overcome intra-category variations, NOCS [12] introduces a normalized object canonical space, which establishes a unified coordinate space among instances of the same category. CASS [27] learns a variational auto-encoder to recover the object shape and predicts the pose in an end-to-end neural network. These methods directly regress the pose or coordinates and thus struggle to form an intra-category representation.
[Figure 2 graphic: the pipeline runs from Mask R-CNN detection and depth back-projection through (a) Semantic Primitives Extraction, (b) Generative Shape Optimization, (c) Pose and Size Estimation, and (d) Object Mesh Reconstruction.]
Figure 2: Overview of the proposed method. The input of our method is the point cloud observed from a single-view depth image. (a) Our method first extracts semantic primitives $\tilde{C}$ from the object point cloud $P_0$ with a part segmentation network $\phi$. (b) We calculate a SIM(3)-invariant shape descriptor $f(\tilde{C})$ from $\tilde{C}$ and optimize a shape embedding $\hat{z}$ in the latent space by the difference between $f(\tilde{C})$ and $f(g_c(z))$, where $g_c$ and $g_f$ are the coarse and fine branches of the generative model $g$, detailed in Sec. 3.1. (c) The similarity transformation $\{\tilde{s}, \tilde{R}, \tilde{t}\}$ is recovered through the semantic correspondences between $\tilde{C}$ and the optimized shape $g_c(\hat{z})$. (d) Further, we can simply apply the transformation $\{\tilde{s}, \tilde{R}, \tilde{t}\}$ to the fine geometry generated through $g_f(\hat{z})$ and reconstruct the object-level scene as a by-product of our main results, as detailed in Sec. B of the supp. material.
Other methods utilize a prior model of each category as the category-level representation. For example, SPD [14] and SGPA [15] predict deformations on canonical prior models and recover the 6D pose through correspondences between the deformed model and the observed point cloud. However, when facing a significant difference between the prior model and the target object, the deformation prediction tends to fail. DualPoseNet [28] takes the consistency between an implicit pose decoder and an explicit one into account and refines the predicted pose during testing. Besides, FS-Net [13] proposes a novel data augmentation mechanism to overcome intra-category variations. SAR-Net [17] recovers the shape by modeling the symmetric correspondence of objects, which makes pose and size prediction easier. Recently, CenterSnap [29] jointly optimizes detection, reconstruction, and pose during training and achieves real-time inference. To exploit the diverse and continuous shapes in generative models, iCaps [16] and ELLIPSDF [18] introduce generative models [19,20] into their shape and pose estimation, where shapes and poses are jointly optimized. Unlike these previous works, we establish a semantically consistent representation and decouple the shape and pose estimation, which makes the optimization more robust.
3 Method
Problem Formulation. Given an observed point cloud $P_0 = \{p_i \mid i = 1, \dots, N_0\}$ of an object of known category, our goal is to estimate the complete shape, scale, and 6D pose of this object. The recovered shape is represented by a shape embedding $\hat{z}$ in a latent space. The estimated scale and 6D pose are represented as a similarity transformation $\{s, R, t\} \in \mathrm{SIM}(3)$, where scale $s \in \mathbb{R}$, 3D rotation $R \in \mathrm{SO}(3)$, and 3D translation $t \in \mathbb{R}^3$; $\mathrm{SIM}(3)$ and $\mathrm{SO}(3)$ denote the Lie groups of 3D similarity transformations and 3D rotations, respectively.
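For concreteness, the following minimal sketch (ours, not part of the paper's released code) shows how such a similarity transformation acts on a point cloud; the function and variable names are illustrative assumptions.

```python
import numpy as np

def apply_sim3(points: np.ndarray, s: float, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Apply a similarity transformation {s, R, t} in SIM(3) to an (N, 3) point cloud.

    Each point p is mapped to s * R @ p + t: scaled, rotated, then translated.
    """
    assert R.shape == (3, 3) and t.shape == (3,)
    return s * points @ R.T + t

# Example: map a canonical-frame shape into the camera frame (values are arbitrary).
canonical_points = np.random.rand(100, 3) - 0.5  # stand-in for a generated shape
transformed = apply_sim3(canonical_points, s=0.12, R=np.eye(3), t=np.array([0.0, 0.0, 0.8]))
```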
Overview. As illustrated in Fig. 2, our method takes an RGB-D image as input. A Mask R-CNN [30] network is used to obtain the mask and category of each object instance. In the following stages, our method only processes the observed instance point cloud back-projected from the masked depth image. To overcome the huge intra-category variations, we utilize a unified intra-category representation based on semantic primitives (Sec. 3.1). Then, we process the observed point cloud with a part segmentation network to extract semantic primitives, which establish a connection between the observed object and the latent space (Sec. 3.2). To estimate the shape of the target with its pose unknown, we design a SIM(3)-invariant descriptor and optimize the generative shape in the latent space (Sec. 3.3). Finally, we estimate the 6D pose and scale of the object through the semantic correspondences between the observed point cloud and our recovered generative shape (Sec. 3.4).
3.1 Semantic Primitive Representation
The huge intra-category differences make it challenging to represent objects with a standard model. To overcome this problem, we utilize a semantic primitive representation. The insight behind it is that although different instances of the same category have various shapes, they tend to have similar semantic parts; e.g., each camera instance has a lens, and each laptop has four corners (see Fig. 1(a)). In this representation, each instance is decomposed into several parts, and each part always corresponds to the same semantic part across different instances (see Fig. 1(b)). To generate this representation, we learn a generative model $g$ following the network structure in [20] and map instances of the same category into a latent space.
Specifically, Fig. 1(c) gives an overview of the generative model. The generative model $g$ expresses shapes at two levels of granularity, one capturing detailed shapes and the other representing an abstracted shape using semantically consistent shape primitives. We denote the fine-scale model as $g_f$ and the coarse-scale one as $g_c$; $g_f$ and $g_c$ are two auto-decoders sharing the same latent code $z$. $g_f(z, x) = SDF(x)$, where $x$ is a 3D position in canonical space and $SDF$ denotes the signed distance function. $g_c(z) = \{\alpha_i \mid i = 1, \dots, N_c\}$, where $N_c$ is the number of primitives and $\alpha_i$ denotes the parameters of a primitive. We use sphere primitives here, with $\alpha = (c, r)$, where $c$ is the 3D center of a sphere and $r$ is its radius. Please refer to [20] and Sec. A.3 in the supp. material for more details.
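For intuition, here is a minimal PyTorch sketch of this two-branch auto-decoder structure; the layer widths, depths, and default sizes are our own assumptions for illustration, not the exact architecture of [20].

```python
import torch
import torch.nn as nn

class GenerativeShapeModel(nn.Module):
    """Sketch of the generative model g: two auto-decoders over one shared latent code z.

    g_f(z, x) -> signed distance at 3D query point x (fine branch).
    g_c(z)    -> N_c sphere primitives alpha_i = (center, radius) (coarse branch).
    Layer sizes here are illustrative, not the paper's exact architecture.
    """

    def __init__(self, latent_dim: int = 128, num_primitives: int = 256):
        super().__init__()
        self.num_primitives = num_primitives
        # Fine branch: conditioned on (z, x), predicts one SDF value.
        self.g_f = nn.Sequential(
            nn.Linear(latent_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )
        # Coarse branch: predicts (c_x, c_y, c_z, r) for every sphere primitive.
        self.g_c = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, num_primitives * 4),
        )

    def sdf(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        """g_f(z, x): z is (B, latent_dim), x is (B, 3); returns (B, 1) SDF values."""
        return self.g_f(torch.cat([z, x], dim=-1))

    def primitives(self, z: torch.Tensor) -> torch.Tensor:
        """g_c(z): returns (B, N_c, 4) sphere parameters (3D center and radius)."""
        return self.g_c(z).view(-1, self.num_primitives, 4)
```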
3.2 Extract Semantic Primitives from Point Cloud
We use a part segmentation network to obtain the semantic primitives of the observed point cloud in two stages. First, we predict a semantic label $l_i$ for each point $p_i$ in the observed point cloud $P_0$; then, a centralization operation is performed on the points to extract the center of each primitive.
Semantic Label Prediction. For each point $p_i \in P_0$, its semantic label $l_i$ is defined by the closest primitive center $c_j$:
$$l_i = \operatorname*{arg\,min}_{j=1,\dots,N_c} \lVert p_i - c_j \rVert_2. \tag{1}$$
We treat point label prediction as a classification problem and follow the part segmentation version of 3D-GCN [31] to perform point-wise classification, with the standard cross entropy as the loss function. In addition, because of imperfect segmentation masks produced by Mask R-CNN, the point cloud back-projected from the masked depth image may contain extra background points. To filter out these outlier points, we add an extra "dust" label to the point-wise classification.
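To make Eq. (1) concrete, a minimal sketch of the ground-truth label assignment is given below; the function name and tensor shapes are our assumptions.

```python
import torch

def assign_semantic_labels(points: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """Ground-truth assignment of Eq. (1): each point takes the index of its
    nearest primitive center.

    points:  (N, 3) points sampled on a training shape.
    centers: (N_c, 3) primitive centers c_j from the coarse branch g_c.
    returns: (N,) integer labels l_i in [0, N_c).
    """
    dists = torch.cdist(points, centers)  # (N, N_c) pairwise Euclidean distances
    return dists.argmin(dim=1)
```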
Symmetric Loss. Many common objects in robot manipulation have symmetric structures, and most of them (e.g., bottles) are symmetric about an axis. A unique pre-labeled ground truth causes ambiguity when the object rotates around that axis. To eliminate the influence of symmetric structures on our classification task, we introduce a symmetric loss following [12]. For each symmetric object in the dataset, we define a symmetry axis and make the loss identical when the object rotates around that axis. Specifically, we generate a separate ground truth for each possible angle of rotational symmetry. Given the rotation angle $\theta$, the rotated semantic label $l_{\theta,i}$ is defined as:
$$l_{\theta,i} = \operatorname*{arg\,min}_{j=1,\dots,N_c} \lVert p_i - c_{\theta,j} \rVert_2, \tag{2}$$
where $c_{\theta,j}$ is the position of $c_j$ after rotating $\theta$ degrees about the axis. We then define the symmetric loss $L_{sym}$ as:
$$L_{sym} = \min_{\theta \in \Theta} L(\tilde{l}_i, l_{\theta,i}), \tag{3}$$
where $L$ is the standard cross entropy and $\tilde{l}_i$ is the label predicted by the network. We set $\Theta = \{0\}$ for non-symmetric objects and $\Theta = \{i \cdot 60^{\circ} \mid i = 0, \dots, 5\}$ for symmetric objects in our experiments.
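A minimal sketch of Eqs. (2)-(3) follows; taking the minimum over the batch-averaged loss (rather than per point) is our simplification, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def symmetric_loss(logits: torch.Tensor, labels_per_theta: torch.Tensor) -> torch.Tensor:
    """Symmetric loss of Eq. (3): minimum cross entropy over rotated ground truths.

    logits:           (N, N_c) per-point classification scores from the network.
    labels_per_theta: (T, N) rotated ground-truth labels l_{theta,i}, one row per
                      angle in Theta (T = 1 for non-symmetric objects).
    """
    losses = torch.stack([
        F.cross_entropy(logits, labels_per_theta[t])
        for t in range(labels_per_theta.shape[0])
    ])
    return losses.min()
```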
Primitive Centralization. After predicting the semantic label of each 3D point, we collect the semantic labels $\tilde{L} = \{l_i \mid i = 1, \dots, \tilde{N}_c\}$ that appear in the partially observed point cloud. Then, we calculate the primitive centers $\tilde{C} = \{c_l \mid l = l_1, \dots, l_{\tilde{N}_c}\}$ from the labeled points by averaging the points sharing the same label. Specifically,
$$c_l = \frac{1}{N_l} \sum_{i=1}^{N_0} p_i \, \mathbb{I}(\tilde{l}_i = l), \tag{4}$$
where $N_l$ is the number of points with predicted label $l$ and $\mathbb{I}(\cdot)$ is the indicator function.
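A minimal sketch of Eq. (4) is given below; returning a dictionary keyed by the observed labels is our own design choice for handling primitives that are occluded in the partial view.

```python
from typing import Dict

import torch

def primitive_centers(points: torch.Tensor, pred_labels: torch.Tensor,
                      num_primitives: int) -> Dict[int, torch.Tensor]:
    """Primitive centralization of Eq. (4): average all points sharing a label.

    Only labels that actually appear in the partial observation get a center;
    primitives that are fully occluded are simply absent from the result.
    """
    centers: Dict[int, torch.Tensor] = {}
    for label in range(num_primitives):
        mask = pred_labels == label        # indicator I(l~_i = label) of Eq. (4)
        if mask.any():                     # skip labels with no observed points
            centers[label] = points[mask].mean(dim=0)
    return centers
```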