the sparsity and disorder of point cloud data result in varying depth distributions across views, further confusing CLIP's feature aggregation. Existing pre-training works focus on the domain gap [1] or multi-view consistency [46] of point clouds, while we intend to tackle similar issues based on depth maps. In addition, a mechanism for adapting the pre-trained knowledge to downstream tasks should be included in the V-L transfer.
In order to transfer CLIP to the 3D domain, we pro-
pose CLIP2Point, a pre-training scheme with two learning
mechanisms: 1) cross-modality learning for the contrastive
alignment of RGB image and depth map, 2) intra-modality
learning in the depth modality to enhance the invariance of
depth aggregation. In particular, the image encoder E_i is initialized directly from CLIP weights and is frozen during pre-training, while the depth encoder E_d is trained to 1) align depth features with CLIP image features in cross-modality learning and 2) encourage the depth aggregation to be invariant to view changes in intra-modality learning. With pre-training,
the depth features can then be well aligned with the visual
CLIP features. As for the training data, we do not adopt the depth maps in existing RGB-D datasets, as they are densely sampled and thus inconsistent with the sparsity of rendered depth maps. Instead, we reconstruct multi-view images and depth maps directly from 3D models. Specifically,
we render 10 views of RGB images from ShapeNet [4],
which covers 52,460 3D models for 55 object categories.
Meanwhile, we generate corresponding depth maps with a new rendering setting that produces better visual quality for CLIP encoding. Experiments show that our CLIP2Point can
significantly improve the performance of zero-shot point
cloud classification.
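For concreteness, the following is a minimal PyTorch-style sketch of how the two objectives could be combined, assuming an InfoNCE-style contrastive loss for both the cross-modality and intra-modality terms; the loss formulation, temperature, projection heads, and batching shown here are illustrative assumptions rather than the exact implementation details.

```python
# Minimal sketch of the two pre-training objectives, assuming a symmetric
# InfoNCE loss for both terms; details may differ from the actual method.
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of features of shape (B, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                        # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def pretrain_step(image_encoder, depth_encoder, rgb, depth_v1, depth_v2):
    """One step: rgb and the two depth views are rendered from the same
    instance. image_encoder is the frozen CLIP visual encoder E_i, and
    depth_encoder is the trainable depth encoder E_d."""
    with torch.no_grad():                           # E_i stays frozen
        img_feat = image_encoder(rgb)
    d1, d2 = depth_encoder(depth_v1), depth_encoder(depth_v2)
    loss_cross = info_nce(img_feat, d1)             # cross-modality alignment
    loss_intra = info_nce(d1, d2)                   # view-invariant depth features
    return loss_cross + loss_intra
```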
To further adapt our CLIP2Point to downstream tasks,
we propose a novel Gated Dual-Path Adapter (GDPA).
Since our pre-training aligns instance-level depth maps, it is complementary to CLIP pre-training knowledge, which focuses on category-level discrimination.
We propose a dual-path structure that utilizes both our pre-trained depth encoder E_d and the CLIP visual encoder E_i. A learnable global-view aggregator is attached to each encoder to extract an overall feature from multiple views, and the final logits are calculated by a gated fusion of the two paths.
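A minimal sketch of this dual-path idea is given below, assuming a softmax-weighted view aggregator per path and a scalar sigmoid gate; the classification heads are shown as plain linear layers for brevity, whereas the actual aggregator, gating, and (zero-shot) text-based logits may be more elaborate.

```python
# Hypothetical sketch of the Gated Dual-Path Adapter: each path aggregates
# V view-level features into a global feature with learnable view weights,
# and the two sets of class logits are fused by a learnable gate.
import torch
import torch.nn as nn

class GatedDualPathAdapter(nn.Module):
    def __init__(self, feat_dim, num_views, num_classes):
        super().__init__()
        # one learnable global-view aggregator per path (softmax view weights)
        self.view_weights_i = nn.Parameter(torch.zeros(num_views))
        self.view_weights_d = nn.Parameter(torch.zeros(num_views))
        self.head_i = nn.Linear(feat_dim, num_classes)
        self.head_d = nn.Linear(feat_dim, num_classes)
        self.gate = nn.Parameter(torch.zeros(1))    # sigmoid(0) = 0.5 at init

    def aggregate(self, feats, weights):
        # feats: (B, V, D) multi-view features -> (B, D) global feature
        w = torch.softmax(weights, dim=0).view(1, -1, 1)
        return (feats * w).sum(dim=1)

    def forward(self, feats_i, feats_d):
        # feats_i / feats_d: multi-view features from the frozen CLIP visual
        # encoder E_i and the pre-trained depth encoder E_d, shape (B, V, D)
        logits_i = self.head_i(self.aggregate(feats_i, self.view_weights_i))
        logits_d = self.head_d(self.aggregate(feats_d, self.view_weights_d))
        g = torch.sigmoid(self.gate)                # gated fusion of two paths
        return g * logits_d + (1.0 - g) * logits_i
```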
Our contributions can be summarized as follows:
• We propose a contrastive learning method dubbed
CLIP2Point, with a newly proposed pre-training
dataset that is pre-processed from ShapeNet, transfer-
ring CLIP knowledge to the 3D domain. Experiments
show that CLIP2Point significantly improves the per-
formance of zero-shot classification.
• We propose a novel Gated Dual-Path Adapter (GDPA),
a dual-path structure with global-view aggregators and
gated fusion to efficiently extend CLIP2Point to down-
stream representation learning.
• Extensive experiments are conducted on ModelNet10, ModelNet40, and ScanObjectNN. In compari-
son to 3D transfer learning and pre-training networks,
CLIP2Point achieves state-of-the-art results on zero-
shot, few-shot, and fully-supervised point cloud clas-
sification tasks.
2. Related Work
2.1. Vision-Language Pre-Training
Vision-language (V-L) pre-training has attracted growing interest in multi-modal tasks. Pre-trained on large-scale
image-text [7,5] or video-text [39] pairs, those models can
be applied to multiple downstream tasks, e.g., visual ques-
tion answering, image/video captioning, and text-to-image
generation. CLIP [35] further leverages V-L pre-training to
transfer cross-modal knowledge, allowing natural language
to understand visual concepts. Nonetheless, pre-training across 3D vision and language [50,20] is restricted by insufficient 3D-text data pairs, and 3D downstream tasks such as shape retrieval [17] and text-guided shape generation [27] consequently suffer from limited performance. Considering this gap between 3D vision and language, we attempt to transfer
CLIP pre-trained knowledge to the 3D domain, making lan-
guage applicable to point cloud classification.
2.2. Self-Supervised Pre-Training
Self-supervised pre-training has become an important topic in computer vision. Since task-related annotations are not required, it can leverage large-scale data and pretext tasks to learn general representations. In particular, contrastive learning [19,6,30,42] and masked auto-encoding [18,59,13] are two popular self-supervised
schemes. Different from directly applying masked auto-
encoding to 3D point completion [52,33], Li and Heiz-
mann [24] argue that contrastive learning in 3D vision can vary in granularity (point/instance/scene) and modality (point/depth/image). In this work, we adopt image-depth contrastive learning to bridge the domain gap between depth features and visual CLIP features, thereby allowing CLIP knowledge to be transferred to the 3D domain.
2.3. Downstream Fine-Tuning
Fine-tuning has been widely used in downstream tasks to
fit pre-trained weights to specific training datasets [53,26,
58,56]. One common practice is to update all parameters during training, but this may lead to overfitting when the scale of training data is limited. Instead, partial tuning [3,54] is a data-efficient way to fit downstream data. Recently,
prompt tuning has been applied to language [2,25] and