CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-Training
Tianyu Huang1,3, Bowen Dong1, Yunhan Yang1,4, Xiaoshui Huang2, Rynson W.H. Lau3, Wanli Ouyang2, Wangmeng Zuo1,5
1Harbin Institute of Technology  2Shanghai AI Laboratory  3City University of Hong Kong  4The University of Hong Kong  5Peng Cheng Laboratory
tyhuang0428@gmail.com, rynson.lau@cityu.edu.hk, wanli.ouyang@sydney.edu.au, wmzuo@hit.edu.cn
Abstract
Pre-training across 3D vision and language remains underdeveloped because of limited training data. Recent works attempt to transfer vision-language (V-L) pre-training methods to 3D vision. However, the domain gap between 3D data and images remains unsolved, so V-L pre-trained models are restricted in 3D downstream tasks. To address this issue, we propose CLIP2Point, an image-depth pre-training method based on contrastive learning that transfers CLIP to the 3D domain and adapts it to point cloud classification. We introduce a new depth rendering setting that produces a better visual effect, and then render 52,460 pairs of images and depth maps from ShapeNet for pre-training. The pre-training scheme of CLIP2Point combines cross-modality learning, which enforces the depth features to capture expressive visual and textual features, with intra-modality learning, which enhances the invariance of depth aggregation. Additionally, we propose a novel Gated Dual-Path Adapter (GDPA), i.e., a dual-path structure with global-view aggregators and gated fusion, for downstream representation learning. It allows an ensemble of CLIP and CLIP2Point, tuning the pre-trained knowledge to downstream tasks with efficient adaptation. Experimental results show that CLIP2Point is effective in transferring CLIP knowledge to 3D vision. CLIP2Point outperforms other 3D transfer learning and pre-training networks, achieving state-of-the-art results on zero-shot, few-shot, and fully-supervised classification. Code is available at: https://github.com/tyhuang0428/CLIP2Point.
1. Introduction
Corresponding Author: Wangmeng Zuo (wmzuo@hit.edu.cn)

Figure 1. Overall architecture of CLIP transfer learning in the 3D domain. Point clouds are first projected to multi-view depth maps, which are then encoded and aggregated by the CLIP visual encoder. Comparing the aggregated depth features with textual prompt features yields the classification prediction. However, we argue that a domain gap exists between depth maps and the images used for CLIP pre-training. To this end, we propose a depth encoder pre-trained via CLIP2Point.

Vision-language (V-L) pre-training has achieved great success in computer vision. Benefiting from large-scale data, V-L pre-trained models [35,49] transfer language knowledge to visual understanding, and can be fine-tuned for multiple downstream tasks. However, pre-training across 3D vision and language remains an open question due to the lack of sufficient training data. For example, Contrastive Language-Image Pre-training (CLIP) [35] takes more than 400M image-text pairs as training data. In contrast, few studies have addressed pre-training across 3D vision and language. Moreover, even the conventional 3D pre-training method PointContrast [46] is trained on ScanNet [11] with only 100k pairs of point clouds from 1,513 scenes. Due to this limitation of 3D pre-training, most existing 3D deep networks [34,43] are trained from scratch on specific downstream datasets.
One remedy is to leverage an existing successful V-L pre-trained model for 3D vision tasks. To this end, one may first convert the 3D point clouds to multi-view 2D depth maps [38,15,16,44]. By simply treating 2D depth maps as images, PointCLIP [55] applies CLIP to 3D tasks, providing zero-shot and few-shot settings for point cloud classification with textual prompting. However, its results are still limited, since the rendered depth maps differ greatly from the image domain of the CLIP training data, and the sparsity and disorder of point cloud data lead to varying depth distributions across views, further confusing the aggregation in CLIP. Existing pre-training works focus on the domain gap [1] or multi-view consistency [46] of point clouds, while we intend to tackle similar issues based on depth maps. In addition, a solution for adapting pre-trained knowledge to downstream tasks should be included in the V-L transfer.
To transfer CLIP to the 3D domain, we propose CLIP2Point, a pre-training scheme with two learning mechanisms: 1) cross-modality learning for the contrastive alignment of RGB images and depth maps, and 2) intra-modality learning in the depth modality to enhance the invariance of depth aggregation. In particular, the image encoder $E_i$ is initialized directly from CLIP weights and is frozen during pre-training, while the depth encoder $E_d$ is trained to 1) align depth features with CLIP image features in cross-modality learning and 2) encourage the depth aggregation to be invariant to view changes in intra-modality learning. With pre-training, the depth features can then be well aligned with the CLIP visual features. As for the training data, we do not adopt the depth maps in existing RGB-D datasets, as they are densely sampled and thus contradict the sparsity of rendered depth maps. Instead, we reconstruct multi-view images and depth maps directly from 3D models. Specifically, we render 10 views of RGB images from ShapeNet [4], which covers 52,460 3D models in 55 object categories. Meanwhile, we generate corresponding depth maps with a new rendering setting that produces a better visual effect for CLIP encoding. Experiments show that CLIP2Point significantly improves the performance of zero-shot point cloud classification.
To further adapt CLIP2Point to downstream tasks, we propose a novel Gated Dual-Path Adapter (GDPA). Since our pre-training aligns instance-level depth maps, it is complementary to the CLIP pre-trained knowledge that focuses on category-level discrimination. We therefore propose a dual-path structure in which both our pre-trained depth encoder $E_d$ and the CLIP visual encoder $E_i$ are utilized. A learnable global-view aggregator is attached to each encoder to extract an overall feature from multiple views, and the final logits are computed by a gated fusion of the two paths.
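As a rough illustration only, the sketch below shows one way this dual-path idea could be wired up in PyTorch. The text above only specifies a learnable global-view aggregator per encoder and a gated fusion of the two paths; the linear aggregators, the scalar sigmoid gate, and the tensor shapes here are assumptions, not the actual GDPA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDualPath(nn.Module):
    """Hypothetical gated dual-path adapter over multi-view features."""
    def __init__(self, num_views: int, dim: int):
        super().__init__()
        # One learnable global-view aggregator per path (assumed linear).
        self.agg_depth = nn.Linear(num_views * dim, dim)
        self.agg_image = nn.Linear(num_views * dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # assumed scalar gate

    def forward(self, depth_feats, image_feats, text_feats):
        """depth_feats, image_feats: (N, C) multi-view features from E_d and E_i;
        text_feats: (K, C) CLIP textual features. Returns fused logits (K,)."""
        g_d = F.normalize(self.agg_depth(depth_feats.flatten()), dim=-1)
        g_i = F.normalize(self.agg_image(image_feats.flatten()), dim=-1)
        txt = F.normalize(text_feats, dim=-1)
        logits_d = g_d @ txt.t()          # depth-path logits
        logits_i = g_i @ txt.t()          # image-path logits
        w = torch.sigmoid(self.gate)      # gated fusion of the two paths
        return w * logits_d + (1.0 - w) * logits_i
```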
Our contributions can be summarized as follows:
• We propose a contrastive learning method dubbed CLIP2Point, together with a new pre-training dataset pre-processed from ShapeNet, transferring CLIP knowledge to the 3D domain. Experiments show that CLIP2Point significantly improves the performance of zero-shot classification.
• We propose a novel Gated Dual-Path Adapter (GDPA), a dual-path structure with global-view aggregators and gated fusion, to efficiently extend CLIP2Point to downstream representation learning.
• Extensive experiments are conducted on ModelNet10, ModelNet40, and ScanObjectNN. In comparison to 3D transfer learning and pre-training networks, CLIP2Point achieves state-of-the-art results on zero-shot, few-shot, and fully-supervised point cloud classification tasks.
2. Related Work
2.1. Vision-Language Pre-Training
Vision-language (V-L) pre-training has attracted growing interest in multi-modal tasks. Pre-trained on large-scale image-text [7,5] or video-text [39] pairs, such models can be applied to multiple downstream tasks, e.g., visual question answering, image/video captioning, and text-to-image generation. CLIP [35] further leverages V-L pre-training to transfer cross-modal knowledge, allowing visual concepts to be understood through natural language. Nonetheless, pre-training across 3D vision and language [50,20] is restricted by insufficient 3D-text data pairs, and 3D downstream tasks like shape retrieval [17] and text-guided shape generation [27] suffer from limited performance. Considering this gap between 3D vision and language, we attempt to transfer CLIP pre-trained knowledge to the 3D domain, making language applicable to point cloud classification.
2.2. Self-Supervised Pre-Training
Self-supervised pre-training has become an important topic in computer vision. Since task-related annotations are not required, it can leverage large-scale data and pretext tasks to learn general representations. In particular, contrastive learning [19,6,30,42] and masked auto-encoding [18,59,13] are two popular self-supervised schemes. Rather than directly applying masked auto-encoding to 3D point completion [52,33], Li and Heizmann [24] argue that contrastive learning in 3D vision can vary in granularity (point/instance/scene) or modality (point/depth/image). In this work, we adopt image-depth contrastive learning to bridge the domain gap between depth features and CLIP visual features, thereby allowing CLIP knowledge to be transferred to the 3D domain.
2.3. Downstream Fine-Tuning
Fine-tuning has been widely used in downstream tasks to fit pre-trained weights to specific training datasets [53,26,58,56]. One common practice is to update all parameters during training, but this may overfit when the scale of the training data is limited. Instead, partial tuning [3,54] is a data-efficient way to fit downstream data. Recently, prompt tuning has been applied to language [2,25] and vision [14,21] models. Prompt tuning provides several learnable token sequences and specific task heads for adaptation, without fully tuning the pre-trained parameters. Note that pre-trained models in 3D vision are still in early exploration, and existing deep networks for point clouds [34,43,31] all follow a full-tuning paradigm. In contrast, we propose a novel Gated Dual-Path Adapter for lightweight fine-tuning. With CLIP textual prompts, a supervised downstream setting is available by tuning efficient adapters only.

Figure 2. Pre-training scheme of CLIP2Point. We propose a self-supervised pre-training scheme with intra-modality and cross-modality contrastive learning to align depth features with CLIP visual features. For each 3D model, we randomly choose a camera view and modify the view distance to construct a pair of rendered depth maps. We adopt one NT-Xent loss between the pair of depth features extracted by the depth encoder, and another between the image features and the averaged depth features. The image encoder is frozen during training, enforcing the depth features produced by the depth encoder to align with the image features produced by the CLIP visual encoder. Additionally, during depth rendering we consider only the nearest point (red) for each pixel instead of all projected points (blue), which improves the visual effect.
3. CLIP-Based Transfer Learning in 3D
Transfer learning works [38,15,16,44] in 3D vision are mostly built on 2D pre-trained networks, converting point clouds to 2D depth maps. Recently, the success of V-L pre-training has opened up opportunities for 3D-language transfer. PointCLIP [55] directly applies the CLIP visual encoder to projected depth maps. However, the image-depth domain gap restricts its performance. Instead, we align depth features to the CLIP domain, allowing a boost on downstream tasks.
3.1. Review of CLIP and PointCLIP
CLIP [35] is a vision-language pre-training method that matches images and texts by contrastive learning. It contains two individual encoders, a visual encoder and a language encoder, which respectively extract image features $F^I \in \mathbb{R}^{1\times C}$ and textual features $F^T \in \mathbb{R}^{1\times C}$. Here, $C$ is the embedding dimension of the encoders. For zero-shot transfer, the cosine similarity of $F^I$ and $F^T$ implies the matching result. Taking a $K$-category classification task as an example, textual prompts are generated from the category names and then encoded by CLIP, extracting a list of textual features $\{F^T_k\}_{k=1}^{K} \in \mathbb{R}^{K\times C}$. For each image feature $F^I$, we can calculate the predicted probability $p$ as follows,
$$l_k = \cos(F^I, F^T_k), \qquad p = \mathrm{softmax}([l_1, \dots, l_K]), \quad (1)$$
where $l_k$ denotes the logit for the $k$-th category.
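For concreteness, below is a minimal PyTorch sketch of Eq. (1). It assumes the image and textual features have already been extracted by the CLIP encoders, and it omits the logit temperature scaling that CLIP applies in practice.

```python
import torch
import torch.nn.functional as F

def zero_shot_probs(image_features: torch.Tensor,
                    text_features: torch.Tensor) -> torch.Tensor:
    """image_features: (1, C); text_features: (K, C). Returns (K,) probabilities."""
    # Cosine similarity l_k = cos(F^I, F^T_k) for every category k.
    img = F.normalize(image_features, dim=-1)    # (1, C)
    txt = F.normalize(text_features, dim=-1)     # (K, C)
    logits = img @ txt.t()                       # (1, K)
    # p = softmax([l_1, ..., l_K])
    return logits.softmax(dim=-1).squeeze(0)     # (K,)
```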
PointCLIP [55] applies CLIP to 3D point cloud data. It renders multi-view depth maps from point clouds, and then extracts the depth map features $\{F^D_v\}_{v=1}^{N}$ with the CLIP visual encoder, where $N$ is the number of views. Logits for zero-shot classification can be calculated similarly to Eq. (1), while multi-view features are gathered with searched weights. PointCLIP also proposes an inter-view adapter for few-shot classification. It adopts a residual form, which concatenates the multi-view features $\{F^D_v\}_{v=1}^{N}$ into a global representation $G^D \in \mathbb{R}^{1\times C}$ and then adds $G^D$ back to extract adapted features $\hat{F}^D_v \in \mathbb{R}^{1\times C}$. The adapter can be formulated as
$$G^D = f_2(\mathrm{ReLU}(f_1(\mathrm{concat}(\{F^D_v\}_{v=1}^{N})))), \quad (2)$$
$$\hat{F}^D_v = \mathrm{ReLU}(G^D W_v^{T}), \quad (3)$$
$$l_k = \sum_{v=1}^{N} \alpha_v \cos(F^D_v + \hat{F}^D_v,\; F^T_k), \quad (4)$$
where $\mathrm{concat}(\cdot)$ denotes concatenation along the channel dimension, $f_1$ and $f_2$ are two-layer MLPs, and $W_v \in \mathbb{R}^{C\times C}$ and $\alpha_v$ denote the view transformation and the summation weight of the $v$-th view. $f_1$, $f_2$, and $W_v$ are learnable during few-shot learning, while $\alpha_v$ is post-searched.
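The sketch below illustrates the inter-view adapter of Eqs. (2)-(4). It is not the official PointCLIP code: $f_1$ and $f_2$ are modeled here as single linear layers, and the hidden width, parameter initialization, and tensor layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterViewAdapter(nn.Module):
    """Sketch of the PointCLIP-style inter-view adapter, Eqs. (2)-(4)."""
    def __init__(self, num_views: int, dim: int, dim_mid: int = 256):
        super().__init__()
        self.f1 = nn.Linear(num_views * dim, dim_mid)   # f1 in Eq. (2)
        self.f2 = nn.Linear(dim_mid, dim)               # f2 in Eq. (2)
        # One view transformation W_v per view for Eq. (3).
        self.W = nn.Parameter(torch.randn(num_views, dim, dim) * 0.02)

    def forward(self, view_feats: torch.Tensor, text_feats: torch.Tensor,
                alphas: torch.Tensor) -> torch.Tensor:
        """view_feats: (N, C) depth features; text_feats: (K, C); alphas: (N,)."""
        n, c = view_feats.shape
        # Eq. (2): G^D = f2(ReLU(f1(concat({F^D_v}))))
        g = self.f2(F.relu(self.f1(view_feats.reshape(1, n * c)))).squeeze(0)  # (C,)
        # Eq. (3): adapted features hat{F}^D_v = ReLU(G^D W_v^T)
        adapted = F.relu(torch.einsum('c,vjc->vj', g, self.W))                 # (N, C)
        # Eq. (4): l_k = sum_v alpha_v * cos(F^D_v + hat{F}^D_v, F^T_k)
        fused = F.normalize(view_feats + adapted, dim=-1)                      # (N, C)
        txt = F.normalize(text_feats, dim=-1)                                  # (K, C)
        per_view = fused @ txt.t()                                             # (N, K)
        return (alphas.unsqueeze(1) * per_view).sum(dim=0)                     # (K,)
```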
However, depth maps are representations of geometric information and lack natural texture information. Therefore, it is inappropriate to directly apply the CLIP visual encoder to extract depth features, which leaves leeway for boosting point cloud classification.
3.2. Aligning with CLIP Visual Features
Instead of directly applying the CLIP visual encoder to depth maps, we suggest learning a depth encoder that aligns depth features with CLIP visual features. In other words, we expect the extracted features of a rendered depth map to be consistent with the CLIP visual features of the corresponding image. Then, CLIP textual prompts can be directly matched with the depth features. Moreover, since depth maps are rendered from multiple views, the consistency of the depth distribution needs to be maintained as well.
Contrastive learning is a self-supervised pre-training method that aligns the features of each sample with its positive samples, and it satisfies our goals of minimizing the distance between image and depth features as well as enhancing the consistency of multi-view depth features. We reconstruct a pre-training dataset from ShapeNet that contains pairs of rendered RGB images and corresponding depth maps, and propose a pre-training scheme with intra-modality and cross-modality contrastive learning. The pre-trained depth encoder can then adapt well to CLIP prompts. To further generate depth maps with a better visual effect for CLIP encoding, a new depth rendering setting is adopted.
3.2.1 Pre-Training Scheme
As shown in Fig. 2, our pre-training network includes a depth encoder $E_d$ and an image encoder $E_i$. Given the input dataset $S = \{I_i\}_{i=1}^{|S|}$, where $I_i \in \mathbb{R}^{3\times H\times W}$ is the $i$-th image rendered under a random camera view, we render the corresponding depth maps $D_{i,d_1}$ and $D_{i,d_2}$ at the same view angle but with different distances $d_1$ and $d_2$. We first adopt an intra-modality aggregation among $\{(D_{i,d_1}, D_{i,d_2})\}_{i=1}^{|S|}$ with $E_d$, and then extract image features from $\{I_i\}_{i=1}^{|S|}$ with $E_i$, enforcing $E_d$ to stay consistent with $E_i$ in a cross-modality sense. $E_d$ and $E_i$ are both initialized with the weights of the CLIP visual encoder. We freeze the parameters of $E_i$ during training, while $E_d$ is learnable.
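A minimal sketch of this encoder setup, assuming the open-source OpenAI clip package, might look as follows; the backbone name "ViT-B/32" is an illustrative choice rather than the paper's stated configuration.

```python
import copy
import clip
import torch

# Load pre-trained CLIP; model.visual is the CLIP visual encoder.
clip_model, _ = clip.load("ViT-B/32")

image_encoder = clip_model.visual                 # E_i, initialized from CLIP
depth_encoder = copy.deepcopy(clip_model.visual)  # E_d, same initialization

for p in image_encoder.parameters():
    p.requires_grad_(False)   # E_i is frozen during pre-training
for p in depth_encoder.parameters():
    p.requires_grad_(True)    # only E_d is updated
```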
Intra-Modality Learning. Considering the sparsity and disorder of point clouds in 3D space, even when depth maps are rendered at the same distance, the distributions of depth values vary greatly across views. To keep the aggregation in $E_d$ invariant to distance, intra-modality contrastive learning is adopted. For each input depth map $D_i$, we randomly modify the distance of the camera view while keeping the view angle, generating two augmented depth maps $D_{i,d_1}$ and $D_{i,d_2}$. $D_{i,d_1}$ and $D_{i,d_2}$ are then fed into $E_d$, extracting depth features $F^D_{i,d_1}, F^D_{i,d_2} \in \mathbb{R}^{1\times C}$. Following the NT-Xent loss in SimCLR [6], the intra-modality contrastive loss $\mathcal{L}_{intra}$ can be formulated as
$$\mathcal{L}_{intra} = \frac{1}{2N} \sum_{i=1}^{N} \left( l^i_{intra}(d_1, d_2) + l^i_{intra}(d_2, d_1) \right), \quad (5)$$
where $N$ denotes the batch size and $l^i_{intra}(\cdot)$ is based on the InfoNCE [32] loss; please refer to the supplementary material for more details. The final depth feature $F^D_i$ is the mean of $F^D_{i,d_1}$ and $F^D_{i,d_2}$.
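The sketch below shows a simplified version of this intra-modality term. It treats the other distance's features in the batch as negatives (a CLIP-style symmetric InfoNCE), whereas the full NT-Xent of SimCLR additionally draws negatives from the same view; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def intra_modality_loss(feats_d1: torch.Tensor, feats_d2: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """feats_d1, feats_d2: (N, C) depth features of the same batch at two distances."""
    n = feats_d1.shape[0]
    z1 = F.normalize(feats_d1, dim=-1)
    z2 = F.normalize(feats_d2, dim=-1)
    # Pairwise cosine similarities between the two augmented batches.
    logits_12 = z1 @ z2.t() / temperature        # (N, N)
    logits_21 = z2 @ z1.t() / temperature        # (N, N)
    targets = torch.arange(n, device=z1.device)  # positives lie on the diagonal
    # Symmetric InfoNCE: l_i(d1, d2) + l_i(d2, d1), averaged over the batch.
    return 0.5 * (F.cross_entropy(logits_12, targets) +
                  F.cross_entropy(logits_21, targets))
```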
Cross-Modality Learning. For a set of rendered RGB-D data, cross-modality contrastive learning aims to minimize the distance between rendered images and depth maps in the same pair, while maximizing the distance to the others. For each input image $I_i$, we extract the image features $F^I_i \in \mathbb{R}^{1\times C}$, which are exactly the CLIP visual features. Together with the depth features $F^D_i$, we obtain the cross-modality contrastive loss $\mathcal{L}_{cross}$ as follows,
$$\mathcal{L}_{cross} = \frac{1}{2N} \sum_{i=1}^{N} \left( l^i_{cross}(D, I) + l^i_{cross}(I, D) \right). \quad (6)$$
Similarly, $l^i_{cross}(\cdot)$ is based on the InfoNCE [32] loss.
$\mathcal{L}_{intra}$ and $\mathcal{L}_{cross}$ are propagated independently, and $\mathcal{L}_{intra}$ drops much faster than $\mathcal{L}_{cross}$ during our pre-training. Thus, we adopt a multi-task loss [22] to balance the two terms. The overall loss function $\mathcal{L}$ is formulated as
$$\mathcal{L} = \frac{1}{\sigma^2}\mathcal{L}_{intra} + \mathcal{L}_{cross} + \log(\sigma + 1), \quad (7)$$
where $\sigma$ is a learnable balance parameter.
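As a minimal sketch of the balancing term in Eq. (7), the module below combines the two scalar losses with a learnable $\sigma$; the initial value and the clamping of $\sigma$ are assumptions.

```python
import torch
import torch.nn as nn

class BalancedLoss(nn.Module):
    """L = (1 / sigma^2) * L_intra + L_cross + log(sigma + 1), with learnable sigma."""
    def __init__(self, sigma_init: float = 1.0):
        super().__init__()
        self.sigma = nn.Parameter(torch.tensor(sigma_init))

    def forward(self, loss_intra: torch.Tensor, loss_cross: torch.Tensor) -> torch.Tensor:
        sigma = self.sigma.clamp(min=1e-4)  # keep the weight finite
        return loss_intra / sigma.pow(2) + loss_cross + torch.log(sigma + 1.0)
```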
3.2.2 Depth Rendering
To convert point cloud data into rendered depth images, we need to project 3D coordinates $(X, Y, Z) \in \mathbb{R}^3$ to 2D coordinates $(\hat{X}, \hat{Y}) \in \mathbb{Z}^2$ in a specific view. Taking rendering from the front view as an example, a point at $(x, y, z)$ can simply be matched to the pixel at $(x/z, y/z)$ by perspective projection. However, there are still two issues: 1) multiple points can be projected to the same pixel in a given plane; 2) a large area of the rendered depth map remains blank, since points are sparsely distributed in 3D space. For the first issue, existing works [15,55] prefer a weighted summation over the matching points,
$$d(\hat{x}, \hat{y}) = \frac{\sum_{(x,y,z)} z/(z+\epsilon)}{\sum_{(x,y,z)} 1/z}, \quad (8)$$
where $(x, y, z)$ ranges over the set of points matching $(\hat{x}, \hat{y})$, and $\epsilon$ denotes a minimal value, e.g., $10^{-12}$. We argue that taking the minimum depth value of those points is more intuitive in 2D vision, as we cannot see through an object with the naked eye. For the second issue, few pixels can be covered
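As a hedged illustration of the minimum-depth rule argued for above (not the paper's actual rendering code), the sketch below projects a point cloud to a front-view depth map and keeps only the nearest point per pixel; the image size, the coordinate normalization, and the background value are assumptions, and points are assumed to lie in front of the camera (z > 0).

```python
import numpy as np

def render_min_depth(points: np.ndarray, size: int = 224) -> np.ndarray:
    """points: (P, 3) array of (x, y, z). Returns a (size, size) depth map."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Perspective projection: a point (x, y, z) maps to (x/z, y/z).
    u = x / z
    v = y / z
    # Normalize projected coordinates into pixel indices (assumed mapping).
    px = ((u - u.min()) / (u.max() - u.min() + 1e-12) * (size - 1)).astype(int)
    py = ((v - v.min()) / (v.max() - v.min() + 1e-12) * (size - 1)).astype(int)
    depth = np.full((size, size), np.inf)
    # Keep only the nearest point per pixel instead of the weighted sum of Eq. (8).
    np.minimum.at(depth, (py, px), z)
    depth[np.isinf(depth)] = 0.0  # uncovered pixels stay blank
    return depth
```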