Let Images Give You More:
Point Cloud Cross-Modal Training for Shape Analysis
Xu Yan1,2†, Heshen Zhan1,2†, Chaoda Zheng1,2, Jiantao Gao4,
Ruimao Zhang3, Shuguang Cui2,1,5, Zhen Li2,1∗
1FNii, CUHK-Shenzhen, 2SSE, CUHK-Shenzhen,
3SDS, CUHK-Shenzhen, 4USV, Shanghai University, 5Pengcheng Lab
{xuyan1@link.,heshenzhan@link.,lizhen@}cuhk.edu.cn
Abstract
Although recent point cloud analysis achieves impressive progress, the paradigm of representation learning from a single modality gradually meets its bottleneck. In this work, we take a step towards more discriminative 3D point cloud representations by fully taking advantage of images, which inherently contain richer appearance information, e.g., texture, color, and shade. Specifically, this paper introduces a simple but effective point cloud cross-modal training (PointCMT) strategy, which utilizes view images, i.e., rendered or projected 2D images of the 3D object, to boost point cloud analysis. In practice, to effectively acquire auxiliary knowledge from view images, we develop a teacher-student framework and formulate cross-modal learning as a knowledge distillation problem. PointCMT eliminates the distribution discrepancy between different modalities through novel feature and classifier enhancement criteria and effectively avoids potential negative transfer. Note that PointCMT improves the point-only representation without any architecture modification. Extensive experiments verify significant gains on various datasets using appealing backbones, i.e., equipped with PointCMT, PointNet++ and PointMLP achieve state-of-the-art performance on two benchmarks, i.e., 94.4% and 86.7% accuracy on ModelNet40 and ScanObjectNN, respectively. Code will be made available at https://github.com/ZhanHeshen/PointCMT.
1 Introduction
As a fundamental 3D representation, point clouds have attracted increasing attention for various applications, e.g., self-driving [2, 33, 34], robotics perception [7, 13, 6], etc. Generally, a point cloud consists of sparse and unordered points in 3D space, which is significantly different from a 2D image with its dense and regular pixel array. Prior studies treat the understanding of 2D images and 3D point clouds as two separate problems, and each representation has its own merits and drawbacks. Concretely, rich color and fine-grained texture are easily obtained from 2D images, but images are ambiguous for depth and shape sensing; previous works extract features from images through convolutional neural networks (CNNs). In contrast, point clouds are superior in providing spatial and geometric information but only preserve sparse and textureless features; several prior studies extract features from unstructured point clouds through local aggregation operators [34, 43]. It is natural to raise a question: Could we use the rich information hidden in 2D images to boost 3D point cloud shape analysis?
To address the above issue, one straightforward way is to leverage the benefits of both images and point clouds, i.e., fusing information from the two complementary representations with task-specific designs [58, 24, 10, 32, 42]. However, utilizing an additional image representation requires designing a multi-modal network, which takes the extra image inputs in both the training and inference phases. Moreover, exploiting extra images is usually computation-intensive, and paired images are often unavailable during inference. Thus, multi-modal learning meets its bottleneck in many aspects.
∗Corresponding author: Zhen Li. †Equal first authorship.
Figure 1: (a) Our proposed general Cross-Modal Training (PointCMT) strategy. It introduces priors from images into point cloud shape analysis models only in the training stage, without any modification of the baseline model. (b) Classification accuracy (%) on ModelNet40 with and without our proposed PointCMT training strategy, for the DGCNN, RS-CNN, and PointNet++ backbones. Noticeable improvements can be observed.
This paper tries to ease the barrier of cross-modal learning between images and point clouds. Inspired by knowledge distillation (KD), which achieves knowledge transfer from a teacher model to a student one, we formulate cross-modal learning as a KD problem, conducting alignment between sample representations learned from images and point clouds. However, previous KD approaches usually assume that the training data used by the teacher and student come from the same distribution [17]. Since sparse and disordered point clouds represent visual information differently from images, naive feature alignment between the two representations tends to cause limited gains or negative transfer in the cross-modal scenario. To this end, we design a novel framework for cross-modal KD and propose the point cloud cross-modal training strategy, i.e., PointCMT in Figure 1(a), which distills features derived from images into the point cloud representation. Specifically, multiple view images for each 3D object can be generated by either rendering the CAD model or perspectively projecting the point cloud from different viewpoints. These free auxiliary images are fed into the image network to obtain global representations of the object. In addition, feature and classifier enhancements are conducted between the point cloud and image features, in which the newly proposed criteria effectively avoid negative transfer between different modalities, i.e., directly applying [17] hampers the performance on ModelNet40. After training, the model achieves higher performance while taking only point clouds as input, without architecture modification.
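To make this training recipe concrete, the following is a minimal sketch of how such a cross-modal objective could be assembled in PyTorch: a frozen image teacher provides global features and logits, and the point cloud student is trained with a classification loss plus alignment terms. The MSE and KL terms are generic placeholders standing in for the paper's feature and classifier enhancement criteria, and all names and weights (lambda_f, lambda_c) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_training_step(point_net, image_net, points, views, labels,
                              lambda_f=1.0, lambda_c=1.0):
    """Hypothetical PointCMT-style step: the image network (teacher) is frozen,
    and only the point cloud network (student) receives gradients."""
    with torch.no_grad():                          # teacher runs in inference mode
        img_feat, img_logits = image_net(views)    # global image feature + logits

    pts_feat, pts_logits = point_net(points)       # student feature + logits

    loss_cls = F.cross_entropy(pts_logits, labels)  # standard supervision on point clouds
    # Placeholder "feature enhancement": assumes both features share one dimension;
    # otherwise a small projection head would be inserted here.
    loss_feat = F.mse_loss(pts_feat, img_feat)
    # Placeholder "classifier enhancement": match the student's class distribution
    # to the teacher's prediction.
    loss_clsf = F.kl_div(F.log_softmax(pts_logits, dim=-1),
                         F.softmax(img_logits, dim=-1),
                         reduction="batchmean")
    return loss_cls + lambda_f * loss_feat + lambda_c * loss_clsf
```

At inference time only point_net is kept, so the deployed model is identical in architecture and cost to the point-only baseline.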
Compared with multi-modal approaches, our solution has the following preferable properties: 1) Generality: PointCMT can be integrated with arbitrary point cloud analysis models without structural modification. 2) Effectiveness: It significantly boosts the performance of several baseline approaches, e.g., PointNet++ [34] achieves state-of-the-art 94.4% overall accuracy on ModelNet40, up from 93.4%, as shown in Figure 1(b). 3) Efficiency: It only utilizes auxiliary image data in the training stage; after training, the enhanced 3D model infers without image inputs. 4) Flexibility: Extensive experiments illustrate that PointCMT performs well even without colorized and dedicatedly rendered images, i.e., it also greatly improves performance when using images directly projected from sparse and textureless point clouds. Thus, it provides an alternative solution to enhance point cloud shape analysis when additional rendered images are not accessible.
In summary, our contributions are: 1) This paper formulates cross-modal learning for point cloud analysis as a knowledge distillation problem, where we utilize the merits of texture- and color-aware 2D images to acquire a more discriminative point cloud representation. 2) We propose the point cloud cross-modal training (PointCMT) strategy with corresponding criteria to boost point cloud models during the training stage. 3) Extensive experiments on several datasets verify the effectiveness of our approach, where PointCMT greatly boosts several baseline models, even state-of-the-art ones, e.g., PointNet++ [34] trained with PointCMT gains 1.0% and 4.4% accuracy improvements on ModelNet40 [48] and ScanObjectNN, respectively. Even based upon PointMLP [31], it increases accuracy by 1% to 86.7% on the ScanObjectNN dataset.
2 Related Works
3D Shape Recognition Based on Point Clouds.
This stream of methods directly processes raw point clouds as input (also called point-based methods). They are pioneered by PointNet [33], which approximates a permutation-invariant set function using a per-point Multi-Layer Perceptron (MLP) followed by a max-pooling layer. Later point-based methods focus on designing local aggregation operators for local feature extraction. Specifically, they generally sample multiple sub-points from the original point cloud and then aggregate the neighboring features of each sub-point through local aggregation operators, for which point-wise MLPs [34, 31], adaptive weights [44, 46, 27], and pseudo-grid based methods [39, 20] have been proposed. More recently, there have been attempts to utilize non-local operators [54] or Transformers [60, 14] to mine long-range dependencies. This paper also follows the paradigm of point-based methods to conduct point cloud shape analysis.
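For concreteness, below is a minimal sketch of the permutation-invariant set function described above (a shared per-point MLP followed by global max-pooling), written in PyTorch; the layer widths and classifier head are illustrative choices rather than the exact PointNet configuration.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style encoder: shared per-point MLP + global max-pooling."""
    def __init__(self, num_classes=40, feat_dim=1024):
        super().__init__()
        # Shared MLP applied independently to every point: (B, 3, N) -> (B, feat_dim, N)
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1), nn.BatchNorm1d(feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, points):                 # points: (B, N, 3)
        x = points.transpose(1, 2)             # (B, 3, N) for Conv1d
        x = self.point_mlp(x)                  # (B, feat_dim, N)
        global_feat = x.max(dim=2).values      # symmetric (order-invariant) pooling
        return global_feat, self.classifier(global_feat)
```

Feeding a (B, N, 3) tensor of points returns an order-invariant global feature together with classification logits; real point-based backbones add hierarchical sampling and local aggregation on top of this idea.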
3D Shape Recognition Based on Images.
Since point clouds are irregular and unordered, some works consider projecting 3D shapes into multiple images from different viewpoints (also called view-based methods) and then leverage well-developed 2D CNNs to process the 3D data. One seminal work on multi-view learning is MVCNN [38]. It extracts per-view features with a shared CNN in parallel, then aggregates them via a view-level max-pooling layer. Most follow-up works propose more effective modules to aggregate the view-level features. For instance, some enhance the aggregated feature by considering similarity among views [11, 56], while others focus on viewpoint relations [45, 22]. The above methods usually utilize ad-hoc rendered images for each 3D shape, including shade and texture for the surface mesh. Therefore, they generally achieve higher performance than point-based methods that use sparse point clouds as input. Recently, [12] proposed a simple but effective method (SimpleView) that directly projects sparse point clouds onto image planes, achieving performance comparable with point-based methods. Inspired by view-based methods, this paper takes advantage of image features extracted by a view-based method, which are utilized as prior knowledge to boost point cloud shape analysis.
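The sketch below shows one plausible way such a view image can be obtained from a raw point cloud by perspective projection onto a single image plane; the pinhole intrinsics (focal length, offset), the simple z-buffer, and the inverse-depth shading are illustrative assumptions, not the exact SimpleView or PointCMT rendering pipeline.

```python
import numpy as np

def project_point_cloud(points, rotation, image_size=128, focal=1.5, offset=2.0):
    """Perspectively project an (N, 3) point cloud onto a single view plane.

    points:   (N, 3) array, assumed roughly normalized to the unit sphere.
    rotation: (3, 3) rotation matrix defining the viewpoint.
    Returns an (image_size, image_size) inverse-depth image (0 = empty pixel).
    """
    cam = points @ rotation.T                    # rotate into the camera frame
    cam[:, 2] += offset                          # push the object in front of the camera
    u = focal * cam[:, 0] / cam[:, 2]            # pinhole projection
    v = focal * cam[:, 1] / cam[:, 2]
    # Map normalized coordinates to pixel indices, clipping points outside the frame.
    px = np.clip(((u + 1) / 2 * (image_size - 1)).astype(int), 0, image_size - 1)
    py = np.clip(((v + 1) / 2 * (image_size - 1)).astype(int), 0, image_size - 1)
    image = np.zeros((image_size, image_size), dtype=np.float32)
    depth = 1.0 / cam[:, 2]                      # closer points get larger values
    for x, y, d in zip(px, py, depth):           # simple z-buffer: keep the closest point
        image[y, x] = max(image[y, x], d)
    return image
```

Repeating this for several rotation matrices yields the V textureless view images per object referred to throughout the paper.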
Knowledge Distillation.
Knowledge distillation (KD) aims at compressing a large network (teacher) into a compact and tiny one (student) while boosting the performance of the student at the same time. The concept was first introduced by Hinton et al. [17], who train a student by using the softened logits of a teacher as targets. Over the past few years, several subsequent approaches [23, 1, 5, 40, 9, 61] use different criteria to align the sample representations between the teacher and student. However, almost all existing works assume that the training data used by the teacher and student networks come from the same distribution. Our experiments illustrate that new biases and negative transfer are introduced into the distillation process if cross-modal data from different distributions (e.g., features extracted from unordered point clouds and regular grid images) are directly used with previous KD methods.
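As a reference point for the baseline discussed above, a common formulation of Hinton-style logit distillation looks like the following; the temperature and mixing weight are typical but illustrative hyperparameters.

```python
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    """Softened-logit distillation: KL divergence between temperature-scaled
    teacher and student distributions, mixed with the usual cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

It is this single-modality formulation that, applied directly across modalities, is reported to hamper performance on ModelNet40.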
Cross-Modal Knowledge Transfer.
Cross-modal knowledge transfer in computer vision is a relatively emerging field that aims to utilize additional modalities at the training stage to enhance the model's performance on the target modality at inference. Recent 3D-to-2D knowledge transfer approaches adopt geometry-aware 3D features from point clouds to enhance the performance of 2D tasks in a contrastive manner [18] or through feature alignment [30]. Later approaches attempt to transfer priors from images to enhance 3D point cloud-related tasks, some of which are designed for specific tasks. Concretely, [28, 26] propose image-to-point contrastive pre-training, [50] inflates 2D convolution kernels into 3D ones, and [57, 59, 53] independently apply cross-modal training to visual grounding, captioning, and semantic segmentation. Inspired by, but different from, the above, we are the first to conduct image-to-point knowledge distillation for point cloud analysis.
3 Methodology
3.1 Problem Statement
Let P ∈ R^{N×3} and y ∈ R^1 be the point cloud and ground-truth label of a 3D object. Its corresponding view-image counterpart can be denoted as I ∈ R^{V×H×W×3}, where N, V, and (H, W) are the number of points, the number of view images, and the image size, respectively. View images can be obtained by rendering the 3D CAD model [38] or perspectively projecting the raw point cloud [12].
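A minimal sketch of how one batch of data could be laid out in PyTorch tensors under these definitions; the batch size and concrete dimensions are arbitrary examples.

```python
import torch

B, N, V, H, W = 16, 1024, 6, 128, 128    # batch, points, views, image height/width

points = torch.randn(B, N, 3)            # P: one point cloud per object
views = torch.randn(B, V, H, W, 3)       # I: V view images per object
labels = torch.randint(0, 40, (B,))      # y: class labels (e.g., 40 ModelNet40 classes)

# The image network typically consumes the views as (B*V, 3, H, W) so that a
# shared 2D CNN processes all views in parallel, MVCNN-style.
views_flat = views.permute(0, 1, 4, 2, 3).reshape(B * V, 3, H, W)
```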
We denote T and S as the image and point cloud analysis networks, respectively, and regard them as the teacher and student in traditional knowledge distillation (KD). For these networks, we split each of them into two parts: (i) Encoders (i.e., feature extractors Enc_img(·) and Enc_pts(·)), the output of