Let Images Give You More:
Point Cloud Cross-Modal Training for Shape Analysis
Xu Yan1,2†, Heshen Zhan1,2†, Chaoda Zheng1,2, Jiantao Gao4,
Ruimao Zhang3, Shuguang Cui2,1,5, Zhen Li2,1∗
1FNii, CUHK-Shenzhen, 2SSE, CUHK-Shenzhen,
3SDS, CUHK-Shenzhen, 4USV, Shanghai University, 5Pengcheng Lab
{xuyan1@link.,heshenzhan@link.,lizhen@}cuhk.edu.cn
Abstract
Although recent point cloud analysis achieves impressive progress, the paradigm of representation learning from a single modality gradually meets its bottleneck. In this work, we take a step towards more discriminative 3D point cloud representations by fully taking advantage of images, which inherently contain richer appearance information, e.g., texture, color, and shade. Specifically, this paper introduces a simple but effective point cloud cross-modal training (PointCMT) strategy, which utilizes view images, i.e., rendered or projected 2D images of the 3D object, to boost point cloud analysis. In practice, to effectively acquire auxiliary knowledge from view images, we develop a teacher-student framework and formulate cross-modal learning as a knowledge distillation problem. PointCMT eliminates the distribution discrepancy between different modalities through novel feature and classifier enhancement criteria and effectively avoids potential negative transfer. Note that PointCMT improves the point-only representation without any architecture modification. Extensive experiments verify significant gains on various datasets using appealing backbones, i.e., equipped with PointCMT, PointNet++ and PointMLP achieve state-of-the-art performance on two benchmarks, i.e., 94.4% and 86.7% accuracy on ModelNet40 and ScanObjectNN, respectively. Code will be made available at https://github.com/ZhanHeshen/PointCMT.
1 Introduction
As a fundamental 3D representation, point clouds have attracted increasing attention for various applications, e.g., self-driving [2, 33, 34], robotics perception [7, 13, 6], etc. Generally, a point cloud consists of sparse and unordered points in 3D space, which is significantly different from a 2D image with its dense and regular pixel array. Prior studies treat the understanding of 2D images and 3D point clouds as two separate problems, and each representation has its own merits and drawbacks. Concretely, rich color and fine-grained texture are easily obtained from 2D images, but images are ambiguous for depth and shape sensing; previous works extract features from images through convolutional neural networks (CNNs). In contrast, point clouds are superior in providing spatial and geometric information but only preserve sparse and textureless features; several prior studies extract features from unstructured point clouds through local aggregation operators [34, 43]. It is natural to raise a question: Could we use the rich information hidden in 2D images to boost 3D point cloud shape analysis?
To address the above issue, one straightforward way is to leverage the benefits of both images and point clouds, i.e., fusing information from the two complementary representations with task-specific designs [58, 24, 10, 32, 42]. However, utilizing an additional image representation requires designing a multi-modal network, which takes the extra image inputs in both the training and inference phases. Moreover, exploiting extra images is usually computation-intensive, and paired images are often unavailable during inference. Thus, multi-modal learning meets its bottleneck in many aspects.
∗Corresponding author: Zhen Li. †Equal first authorship.
Figure 1: (a) Our proposed general Cross-Modal Training (PointCMT) strategy. It introduces priors from images into point cloud shape analysis models only in the training stage, without any modification of the baseline model. (b) Classification accuracy (%) on ModelNet40 with and without our proposed PointCMT training strategy, for the DGCNN, RS-CNN, and PointNet++ backbones. Noticeable improvements can be observed.
This paper tries to ease the barrier of cross-modal learning between images and point clouds. Inspired by knowledge distillation (KD), which achieves knowledge transfer from a teacher model to a student one, we formulate cross-modal learning as a KD problem, conducting alignment between sample representations learned from images and point clouds. However, previous KD approaches usually assume that the training data used by the teacher and student come from the same distribution [17]. Since sparse and disordered point clouds represent visual information differently from images, naive feature alignment between the two representations tends to cause limited gains or negative transfer in the cross-modal scenario. To this end, we design a novel framework for cross-modal KD and propose the point cloud cross-modal training strategy, i.e., PointCMT in Figure 1(a), which distills features derived from images into the point cloud representation. Specifically, multiple view images for each 3D object can be generated by either rendering the CAD model or perspectively projecting the point cloud from different viewpoints. These free auxiliary images are fed into the image network to obtain global representations of the object. In addition, feature and classifier enhancements are conducted between the point cloud and image features, in which the newly proposed criteria effectively avoid negative transfer between different modalities, i.e., directly applying [17] hampers the performance on ModelNet40. After training, the model achieves higher performance while taking only point clouds as input, without architecture modification.
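To make this training recipe concrete, the following is a minimal sketch of how such a cross-modal objective could be assembled in PyTorch: a frozen image teacher provides global features and logits, and the point cloud student is trained with a classification loss plus alignment terms. The MSE and KL terms are generic placeholders standing in for the paper's feature and classifier enhancement criteria, and all names and weights (lambda_f, lambda_c) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_training_step(point_net, image_net, points, views, labels,
                              lambda_f=1.0, lambda_c=1.0):
    """Hypothetical PointCMT-style step: the image network (teacher) is frozen,
    and only the point cloud network (student) receives gradients."""
    with torch.no_grad():                          # teacher runs in inference mode
        img_feat, img_logits = image_net(views)    # global image feature + logits

    pts_feat, pts_logits = point_net(points)       # student feature + logits

    loss_cls = F.cross_entropy(pts_logits, labels)  # standard supervision on point clouds
    # Placeholder "feature enhancement": assumes both features share one dimension;
    # otherwise a small projection head would be inserted here.
    loss_feat = F.mse_loss(pts_feat, img_feat)
    # Placeholder "classifier enhancement": match the student's class distribution
    # to the teacher's prediction.
    loss_clsf = F.kl_div(F.log_softmax(pts_logits, dim=-1),
                         F.softmax(img_logits, dim=-1),
                         reduction="batchmean")
    return loss_cls + lambda_f * loss_feat + lambda_c * loss_clsf
```

At inference time only point_net is kept, so the deployed model is identical in architecture and cost to the point-only baseline.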
Compared with multi-modal approaches, our solution has the following preferable properties: 1) Generality: PointCMT can be integrated with arbitrary point cloud analysis models without structural modification. 2) Effectiveness: It significantly boosts the performance of several baseline approaches, e.g., PointNet++ [34] achieves state-of-the-art 94.4% overall accuracy on ModelNet40, up from 93.4%, as shown in Figure 1(b). 3) Efficiency: It only utilizes auxiliary image data in the training stage; after training, the enhanced 3D model infers without image inputs. 4) Flexibility: Extensive experiments illustrate that PointCMT performs well even without colorized and dedicatedly rendered images, i.e., it also greatly improves performance when using images directly projected from sparse and textureless point clouds. Thus, it provides an alternative solution to enhance point cloud shape analysis when additional rendered images are not accessible.
In summary, our contributions are: 1) This paper formulates cross-modal learning for point cloud analysis as a knowledge distillation problem, where we utilize the merits of texture- and color-aware 2D images to acquire a more discriminative point cloud representation. 2) We propose the point cloud cross-modal training (PointCMT) strategy with corresponding criteria to boost point cloud models during the training stage. 3) Extensive experiments on several datasets verify the effectiveness of our approach, where PointCMT greatly boosts several baseline models, even state-of-the-art ones, e.g., PointNet++ [34] trained with PointCMT gains 1.0% and 4.4% accuracy improvements on ModelNet40 [48] and ScanObjectNN, respectively. Even based upon PointMLP [31], it increases accuracy by 1% to 86.7% on the ScanObjectNN dataset.
2 Related Works
3D Shape Recognition Based on Point Clouds.
This stream of methods directly processes raw point clouds as input (also called point-based methods). They are pioneered by PointNet [33], which approximates a permutation-invariant set function using a per-point Multi-Layer Perceptron (MLP) followed by a max-pooling layer. Later point-based methods focus on designing local aggregation operators for local feature extraction. Specifically, they generally sample multiple sub-points from the original point cloud and then aggregate the neighboring features of each sub-point through local aggregation operators, for which point-wise MLPs [34, 31], adaptive weights [44, 46, 27], and pseudo-grid based methods [39, 20] have been proposed. More recently, there have been attempts to utilize non-local operators [54] or Transformers [60, 14] to mine long-range dependencies. This paper also follows the paradigm of point-based methods to conduct point cloud shape analysis.
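For concreteness, below is a minimal sketch of the permutation-invariant set function described above (a shared per-point MLP followed by global max-pooling), written in PyTorch; the layer widths and classifier head are illustrative choices rather than the exact PointNet configuration.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style encoder: shared per-point MLP + global max-pooling."""
    def __init__(self, num_classes=40, feat_dim=1024):
        super().__init__()
        # Shared MLP applied independently to every point: (B, 3, N) -> (B, feat_dim, N)
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1), nn.BatchNorm1d(feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, points):                 # points: (B, N, 3)
        x = points.transpose(1, 2)             # (B, 3, N) for Conv1d
        x = self.point_mlp(x)                  # (B, feat_dim, N)
        global_feat = x.max(dim=2).values      # symmetric (order-invariant) pooling
        return global_feat, self.classifier(global_feat)
```

Feeding a (B, N, 3) tensor of points returns an order-invariant global feature together with classification logits; real point-based backbones add hierarchical sampling and local aggregation on top of this idea.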
3D Shape Recognition Based on Images.
Since point clouds are irregular and unordered, some works consider projecting 3D shapes into multiple images from different viewpoints (also called view-based methods) and then leverage well-developed 2D CNNs to process the 3D data. One seminal work on multi-view learning is MVCNN [38]. It extracts per-view features with a shared CNN in parallel, then aggregates them via a view-level max-pooling layer. Most follow-up works propose more effective modules to aggregate the view-level features. For instance, some enhance the aggregated feature by considering similarity among views [11, 56], while others focus on viewpoint relations [45, 22]. The above methods usually utilize ad-hoc rendered images for each 3D shape, including shade and texture for the surface mesh. Therefore, they generally achieve higher performance than point-based methods that use sparse point clouds as input. Recently, [12] proposed a simple but effective method (SimpleView) that directly projects sparse point clouds onto image planes, achieving performance comparable with point-based methods. Inspired by view-based methods, this paper takes advantage of image features extracted by a view-based method, which are utilized as prior knowledge to boost point cloud shape analysis.
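The sketch below shows one plausible way such a view image can be obtained from a raw point cloud by perspective projection onto a single image plane; the pinhole intrinsics (focal length, offset), the simple z-buffer, and the inverse-depth shading are illustrative assumptions, not the exact SimpleView or PointCMT rendering pipeline.

```python
import numpy as np

def project_point_cloud(points, rotation, image_size=128, focal=1.5, offset=2.0):
    """Perspectively project an (N, 3) point cloud onto a single view plane.

    points:   (N, 3) array, assumed roughly normalized to the unit sphere.
    rotation: (3, 3) rotation matrix defining the viewpoint.
    Returns an (image_size, image_size) inverse-depth image (0 = empty pixel).
    """
    cam = points @ rotation.T                    # rotate into the camera frame
    cam[:, 2] += offset                          # push the object in front of the camera
    u = focal * cam[:, 0] / cam[:, 2]            # pinhole projection
    v = focal * cam[:, 1] / cam[:, 2]
    # Map normalized coordinates to pixel indices, clipping points outside the frame.
    px = np.clip(((u + 1) / 2 * (image_size - 1)).astype(int), 0, image_size - 1)
    py = np.clip(((v + 1) / 2 * (image_size - 1)).astype(int), 0, image_size - 1)
    image = np.zeros((image_size, image_size), dtype=np.float32)
    depth = 1.0 / cam[:, 2]                      # closer points get larger values
    for x, y, d in zip(px, py, depth):           # simple z-buffer: keep the closest point
        image[y, x] = max(image[y, x], d)
    return image
```

Repeating this for several rotation matrices yields the V textureless view images per object referred to throughout the paper.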
Knowledge Distillation.
Knowledge distillation (KD) aims at compressing a large network (teacher) into a compact and tiny one (student) while boosting the performance of the student at the same time. The concept was first introduced by Hinton et al. [17], who train a student by using the softened logits of a teacher as targets. Over the past few years, several subsequent approaches [23, 1, 5, 40, 9, 61] use different criteria to align the sample representations between the teacher and student. However, almost all existing works assume that the training data used by the teacher and student networks come from the same distribution. Our experiments illustrate that new biases and negative transfer are introduced into the distillation process if cross-modal data from different distributions (e.g., features extracted from unordered point clouds and regular grid images) are directly used with previous KD methods.
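As a reference point for the baseline discussed above, a common formulation of Hinton-style logit distillation looks like the following; the temperature and mixing weight are typical but illustrative hyperparameters.

```python
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    """Softened-logit distillation: KL divergence between temperature-scaled
    teacher and student distributions, mixed with the usual cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

It is this single-modality formulation that, applied directly across modalities, is reported to hamper performance on ModelNet40.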
Cross-Modal Knowledge Transfer.
Cross-modal knowledge transfer in computer vision is a relatively emerging field that aims to utilize additional modalities at the training stage to enhance the model's performance on the target modality at inference. Recent 3D-to-2D knowledge transfer approaches adopt geometry-aware 3D features from point clouds to enhance the performance of 2D tasks in a contrastive manner [18] or through feature alignment [30]. Later approaches attempt to transfer priors from images to enhance 3D point cloud-related tasks, some of which are designed for specific tasks. Concretely, [28, 26] propose image-to-point contrastive pre-training, [50] inflates 2D convolution kernels into 3D ones, and [57, 59, 53] independently apply cross-modal training to visual grounding, captioning, and semantic segmentation. Inspired by, but different from, the above, we are the first to conduct image-to-point knowledge distillation for point cloud analysis.
3 Methodology
3.1 Problem Statement
Let P ∈ R^{N×3} and y ∈ R^1 be the point cloud and ground-truth label of a 3D object. Its corresponding view-image counterpart can be denoted as I ∈ R^{V×H×W×3}, where N, V, and (H, W) are the number of points, the number of view images, and the image size, respectively. View images can be obtained by rendering the 3D CAD model [38] or perspectively projecting the raw point cloud [12].
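A minimal sketch of how one batch of data could be laid out in PyTorch tensors under these definitions; the batch size and concrete dimensions are arbitrary examples.

```python
import torch

B, N, V, H, W = 16, 1024, 6, 128, 128    # batch, points, views, image height/width

points = torch.randn(B, N, 3)            # P: one point cloud per object
views = torch.randn(B, V, H, W, 3)       # I: V view images per object
labels = torch.randint(0, 40, (B,))      # y: class labels (e.g., 40 ModelNet40 classes)

# The image network typically consumes the views as (B*V, 3, H, W) so that a
# shared 2D CNN processes all views in parallel, MVCNN-style.
views_flat = views.permute(0, 1, 4, 2, 3).reshape(B * V, 3, H, W)
```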
We denote T and S as the image and point cloud analysis networks, respectively, and regard them as the teacher and student in traditional knowledge distillation (KD). For these networks, we split each of them into two parts: (i) Encoders (i.e., feature extractors Enc_img(·) and Enc_pts(·)), the output of