CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-Training
Tianyu Huang1,3, Bowen Dong1, Yunhan Yang1,4, Xiaoshui Huang2, Rynson W.H. Lau3, Wanli Ouyang2, Wangmeng Zuo1,5
1Harbin Institute of Technology  2Shanghai AI Laboratory  3City University of Hong Kong  4The University of Hong Kong  5Peng Cheng Laboratory
tyhuang0428@gmail.com, rynson.lau@cityu.edu.hk, wanli.ouyang@sydney.edu.au, wmzuo@hit.edu.cn
Abstract
Pre-training across 3D vision and language remains underdeveloped because of limited training data. Recent works attempt to transfer vision-language (V-L) pre-training methods to 3D vision. However, the domain gap between 3D data and images remains unsolved, so V-L pre-trained models are restricted in 3D downstream tasks. To address this issue, we propose CLIP2Point, an image-depth pre-training method based on contrastive learning that transfers CLIP to the 3D domain and adapts it to point cloud classification. We introduce a new depth rendering setting that produces a better visual effect, and then render 52,460 pairs of images and depth maps from ShapeNet for pre-training. The pre-training scheme of CLIP2Point combines cross-modality learning, which enforces the depth features to capture expressive visual and textual features, with intra-modality learning, which enhances the invariance of depth aggregation. Additionally, we propose a novel Gated Dual-Path Adapter (GDPA), i.e., a dual-path structure with global-view aggregators and gated fusion, for downstream representation learning. It allows an ensemble of CLIP and CLIP2Point, tuning the pre-trained knowledge to downstream tasks with efficient adaptation. Experimental results show that CLIP2Point is effective in transferring CLIP knowledge to 3D vision. CLIP2Point outperforms other 3D transfer learning and pre-training networks, achieving state-of-the-art results on zero-shot, few-shot, and fully-supervised classification. Code is available at: https://github.com/tyhuang0428/CLIP2Point.
1. Introduction
Corresponding Author: Wangmeng Zuo (wmzuo@hit.edu.cn)

Figure 1. Overall architecture of CLIP transfer learning in the 3D domain. Point clouds are first projected to multi-view depth maps, which are then encoded and aggregated by the CLIP visual encoder. Comparing the aggregated depth features with textual prompt features yields the classification prediction. However, we argue that a domain gap exists between depth maps and the images used for CLIP pre-training. To this end, we propose a depth encoder pre-trained via CLIP2Point.

Vision-language (V-L) pre-training has achieved great success in computer vision. Benefiting from large-scale data, V-L pre-trained models [35,49] transfer language knowledge to visual understanding, and can be fine-tuned for multiple downstream tasks. However, pre-training across 3D vision and language remains an open question due to the lack of sufficient training data. For example, Contrastive Language-Image Pre-training (CLIP) [35] takes more than 400M image-text pairs as training data. In contrast, few studies have addressed pre-training across 3D vision and language. Moreover, even the conventional 3D pre-training method PointContrast [46] is trained on ScanNet [11] with only 100k pairs of point clouds from 1,513 scenes. Due to this limitation of 3D pre-training, most existing 3D deep networks [34,43] are trained from scratch on specific downstream datasets.
One remedy is to leverage an existing successful V-L pre-trained model for 3D vision tasks. To this end, one may first convert the 3D point clouds to multi-view 2D depth maps [38,15,16,44]. By simply treating 2D depth maps as images, PointCLIP [55] applies CLIP to 3D tasks, providing zero-shot and few-shot settings for point cloud classification with textual prompting. However, its results are still limited, since the rendered depth maps differ greatly from the image domain of the CLIP training data, and the sparsity and disorder of point cloud data lead to varying depth distributions across views, further confusing the aggregation in CLIP. Existing pre-training works focus on the domain gap [1] or multi-view consistency [46] of point clouds, while we intend to tackle similar issues based on depth maps. In addition, a solution for adapting pre-trained knowledge to downstream tasks should be included in the V-L transfer.
To transfer CLIP to the 3D domain, we propose CLIP2Point, a pre-training scheme with two learning mechanisms: 1) cross-modality learning for the contrastive alignment of RGB images and depth maps, and 2) intra-modality learning in the depth modality to enhance the invariance of depth aggregation. In particular, the image encoder $E_i$ is initialized directly from CLIP weights and is frozen during pre-training, while the depth encoder $E_d$ is trained to 1) align depth features with CLIP image features in cross-modality learning and 2) encourage the depth aggregation to be invariant to view changes in intra-modality learning. With pre-training, the depth features can then be well aligned with the CLIP visual features. As for the training data, we do not adopt the depth maps in existing RGB-D datasets, as they are densely sampled and thus contradict the sparsity of rendered depth maps. Instead, we reconstruct multi-view images and depth maps directly from 3D models. Specifically, we render 10 views of RGB images from ShapeNet [4], which covers 52,460 3D models in 55 object categories. Meanwhile, we generate corresponding depth maps with a new rendering setting that produces a better visual effect for CLIP encoding. Experiments show that CLIP2Point significantly improves the performance of zero-shot point cloud classification.
To further adapt CLIP2Point to downstream tasks, we propose a novel Gated Dual-Path Adapter (GDPA). Since our pre-training aligns instance-level depth maps, it is complementary to the CLIP pre-trained knowledge that focuses on category-level discrimination. We therefore propose a dual-path structure in which both our pre-trained depth encoder $E_d$ and the CLIP visual encoder $E_i$ are utilized. A learnable global-view aggregator is attached to each encoder to extract an overall feature from multiple views, and the final logits are computed by a gated fusion of the two paths.
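As a rough illustration only, the sketch below shows one way this dual-path idea could be wired up in PyTorch. The text above only specifies a learnable global-view aggregator per encoder and a gated fusion of the two paths; the linear aggregators, the scalar sigmoid gate, and the tensor shapes here are assumptions, not the actual GDPA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDualPath(nn.Module):
    """Hypothetical gated dual-path adapter over multi-view features."""
    def __init__(self, num_views: int, dim: int):
        super().__init__()
        # One learnable global-view aggregator per path (assumed linear).
        self.agg_depth = nn.Linear(num_views * dim, dim)
        self.agg_image = nn.Linear(num_views * dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # assumed scalar gate

    def forward(self, depth_feats, image_feats, text_feats):
        """depth_feats, image_feats: (N, C) multi-view features from E_d and E_i;
        text_feats: (K, C) CLIP textual features. Returns fused logits (K,)."""
        g_d = F.normalize(self.agg_depth(depth_feats.flatten()), dim=-1)
        g_i = F.normalize(self.agg_image(image_feats.flatten()), dim=-1)
        txt = F.normalize(text_feats, dim=-1)
        logits_d = g_d @ txt.t()          # depth-path logits
        logits_i = g_i @ txt.t()          # image-path logits
        w = torch.sigmoid(self.gate)      # gated fusion of the two paths
        return w * logits_d + (1.0 - w) * logits_i
```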
Our contributions can be summarized as follows:
• We propose a contrastive learning method dubbed CLIP2Point, together with a new pre-training dataset pre-processed from ShapeNet, transferring CLIP knowledge to the 3D domain. Experiments show that CLIP2Point significantly improves the performance of zero-shot classification.
• We propose a novel Gated Dual-Path Adapter (GDPA), a dual-path structure with global-view aggregators and gated fusion, to efficiently extend CLIP2Point to downstream representation learning.
• Extensive experiments are conducted on ModelNet10, ModelNet40, and ScanObjectNN. In comparison to 3D transfer learning and pre-training networks, CLIP2Point achieves state-of-the-art results on zero-shot, few-shot, and fully-supervised point cloud classification tasks.
2. Related Work
2.1. Vision-Language Pre-Training
Vision-language (V-L) pre-training has attracted growing interest in multi-modal tasks. Pre-trained on large-scale image-text [7,5] or video-text [39] pairs, such models can be applied to multiple downstream tasks, e.g., visual question answering, image/video captioning, and text-to-image generation. CLIP [35] further leverages V-L pre-training to transfer cross-modal knowledge, allowing visual concepts to be understood through natural language. Nonetheless, pre-training across 3D vision and language [50,20] is restricted by insufficient 3D-text data pairs, and 3D downstream tasks like shape retrieval [17] and text-guided shape generation [27] suffer from limited performance. Considering this gap between 3D vision and language, we attempt to transfer CLIP pre-trained knowledge to the 3D domain, making language applicable to point cloud classification.
2.2. Self-Supervised Pre-Training
Self-supervised pre-training has become an important topic in computer vision. Since task-related annotations are not required, it can leverage large-scale data and pretext tasks to learn general representations. In particular, contrastive learning [19,6,30,42] and masked auto-encoding [18,59,13] are two popular self-supervised schemes. Rather than directly applying masked auto-encoding to 3D point completion [52,33], Li and Heizmann [24] argue that contrastive learning in 3D vision can vary in granularity (point/instance/scene) or modality (point/depth/image). In this work, we adopt image-depth contrastive learning to bridge the domain gap between depth features and CLIP visual features, thereby allowing CLIP knowledge to be transferred to the 3D domain.
2.3. Downstream Fine-Tuning
Fine-tuning has been widely used in downstream tasks to fit pre-trained weights to specific training datasets [53,26,58,56]. One common practice is to update all parameters during training, but this may overfit when the scale of the training data is limited. Instead, partial tuning [3,54] is a data-efficient way to fit downstream data. Recently, prompt tuning has been applied to language [2,25] and vision [14,21] models. Prompt tuning provides several learnable token sequences and specific task heads for adaptation, without fully tuning the pre-trained parameters. Note that pre-trained models in 3D vision are still in early exploration, and existing deep networks for point clouds [34,43,31] all follow a full-tuning paradigm. In contrast, we propose a novel Gated Dual-Path Adapter for lightweight fine-tuning. With CLIP textual prompts, a supervised downstream setting is available by tuning efficient adapters only.

Figure 2. Pre-training scheme of CLIP2Point. We propose a self-supervised pre-training scheme with intra-modality and cross-modality contrastive learning to align depth features with CLIP visual features. For each 3D model, we randomly choose a camera view and modify the view distance to construct a pair of rendered depth maps. We adopt one NT-Xent loss between the pair of depth features extracted by the depth encoder, and another between the image features and the averaged depth features. The image encoder is frozen during training, enforcing the depth features produced by the depth encoder to align with the image features produced by the CLIP visual encoder. Additionally, during depth rendering we consider only the nearest point (red) for each pixel instead of all projected points (blue), which improves the visual effect.
3. CLIP-Based Transfer Learning in 3D
Transfer learning works [38,15,16,44] in 3D vision are mostly built on 2D pre-trained networks, converting point clouds to 2D depth maps. Recently, the success of V-L pre-training has opened up opportunities for 3D-language transfer. PointCLIP [55] directly applies the CLIP visual encoder to projected depth maps. However, the image-depth domain gap restricts its performance. Instead, we align depth features to the CLIP domain, allowing a boost on downstream tasks.
3.1. Review of CLIP and PointCLIP
CLIP [35] is a vision-language pre-training method that matches images and texts by contrastive learning. It contains two individual encoders, a visual encoder and a language encoder, which respectively extract image features $F^I \in \mathbb{R}^{1\times C}$ and textual features $F^T \in \mathbb{R}^{1\times C}$. Here, $C$ is the embedding dimension of the encoders. For zero-shot transfer, the cosine similarity of $F^I$ and $F^T$ implies the matching result. Taking a $K$-category classification task as an example, textual prompts are generated from the category names and then encoded by CLIP, extracting a list of textual features $\{F^T_k\}_{k=1}^{K} \in \mathbb{R}^{K\times C}$. For each image feature $F^I$, we can calculate the predicted probability $p$ as follows,
$$l_k = \cos(F^I, F^T_k), \qquad p = \mathrm{softmax}([l_1, \dots, l_K]), \quad (1)$$
where $l_k$ denotes the logit for the $k$-th category.
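For concreteness, below is a minimal PyTorch sketch of Eq. (1). It assumes the image and textual features have already been extracted by the CLIP encoders, and it omits the logit temperature scaling that CLIP applies in practice.

```python
import torch
import torch.nn.functional as F

def zero_shot_probs(image_features: torch.Tensor,
                    text_features: torch.Tensor) -> torch.Tensor:
    """image_features: (1, C); text_features: (K, C). Returns (K,) probabilities."""
    # Cosine similarity l_k = cos(F^I, F^T_k) for every category k.
    img = F.normalize(image_features, dim=-1)    # (1, C)
    txt = F.normalize(text_features, dim=-1)     # (K, C)
    logits = img @ txt.t()                       # (1, K)
    # p = softmax([l_1, ..., l_K])
    return logits.softmax(dim=-1).squeeze(0)     # (K,)
```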
PointCLIP [55] applies CLIP to 3D point cloud data. It renders multi-view depth maps from point clouds, and then extracts the depth map features $\{F^D_v\}_{v=1}^{N}$ with the CLIP visual encoder, where $N$ is the number of views. Logits for zero-shot classification can be calculated similarly to Eq. (1), while multi-view features are gathered with searched weights. PointCLIP also proposes an inter-view adapter for few-shot classification. It adopts a residual form, which concatenates the multi-view features $\{F^D_v\}_{v=1}^{N}$ into a global representation $G^D \in \mathbb{R}^{1\times C}$ and then adds $G^D$ back to extract adapted features $\hat{F}^D_v \in \mathbb{R}^{1\times C}$. The adapter can be formulated as
$$G^D = f_2(\mathrm{ReLU}(f_1(\mathrm{concat}(\{F^D_v\}_{v=1}^{N})))), \quad (2)$$
$$\hat{F}^D_v = \mathrm{ReLU}(G^D W_v^{T}), \quad (3)$$
$$l_k = \sum_{v=1}^{N} \alpha_v \cos(F^D_v + \hat{F}^D_v,\; F^T_k), \quad (4)$$
where $\mathrm{concat}(\cdot)$ denotes concatenation along the channel dimension, $f_1$ and $f_2$ are two-layer MLPs, and $W_v \in \mathbb{R}^{C\times C}$ and $\alpha_v$ denote the view transformation and the summation weight of the $v$-th view. $f_1$, $f_2$, and $W_v$ are learnable during few-shot learning, while $\alpha_v$ is post-searched.
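The sketch below illustrates the inter-view adapter of Eqs. (2)-(4). It is not the official PointCLIP code: $f_1$ and $f_2$ are modeled here as single linear layers, and the hidden width, parameter initialization, and tensor layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterViewAdapter(nn.Module):
    """Sketch of the PointCLIP-style inter-view adapter, Eqs. (2)-(4)."""
    def __init__(self, num_views: int, dim: int, dim_mid: int = 256):
        super().__init__()
        self.f1 = nn.Linear(num_views * dim, dim_mid)   # f1 in Eq. (2)
        self.f2 = nn.Linear(dim_mid, dim)               # f2 in Eq. (2)
        # One view transformation W_v per view for Eq. (3).
        self.W = nn.Parameter(torch.randn(num_views, dim, dim) * 0.02)

    def forward(self, view_feats: torch.Tensor, text_feats: torch.Tensor,
                alphas: torch.Tensor) -> torch.Tensor:
        """view_feats: (N, C) depth features; text_feats: (K, C); alphas: (N,)."""
        n, c = view_feats.shape
        # Eq. (2): G^D = f2(ReLU(f1(concat({F^D_v}))))
        g = self.f2(F.relu(self.f1(view_feats.reshape(1, n * c)))).squeeze(0)  # (C,)
        # Eq. (3): adapted features hat{F}^D_v = ReLU(G^D W_v^T)
        adapted = F.relu(torch.einsum('c,vjc->vj', g, self.W))                 # (N, C)
        # Eq. (4): l_k = sum_v alpha_v * cos(F^D_v + hat{F}^D_v, F^T_k)
        fused = F.normalize(view_feats + adapted, dim=-1)                      # (N, C)
        txt = F.normalize(text_feats, dim=-1)                                  # (K, C)
        per_view = fused @ txt.t()                                             # (N, K)
        return (alphas.unsqueeze(1) * per_view).sum(dim=0)                     # (K,)
```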
However, depth maps are representations of geometric information and lack natural texture information. Therefore, it is inappropriate to directly apply the CLIP visual encoder to extract depth features, which leaves leeway for boosting point cloud classification.
3.2. Aligning with CLIP Visual Features
Instead of directly applying the CLIP visual encoder to depth maps, we suggest learning a depth encoder that aligns depth features with CLIP visual features. In other words, we expect the extracted features of a rendered depth map to be consistent with the CLIP visual features of the corresponding image. Then, CLIP textual prompts can be directly matched with the depth features. Moreover, since depth maps are rendered from multiple views, the consistency of the depth distribution needs to be maintained as well.
Contrastive learning is a self-supervised pre-training method that aligns the features of each sample with its positive samples, and it satisfies our goals of minimizing the distance between image and depth features as well as enhancing the consistency of multi-view depth features. We reconstruct a pre-training dataset from ShapeNet that contains pairs of rendered RGB images and corresponding depth maps, and propose a pre-training scheme with intra-modality and cross-modality contrastive learning. The pre-trained depth encoder can then adapt well to CLIP prompts. To further generate depth maps with a better visual effect for CLIP encoding, a new depth rendering setting is adopted.
3.2.1 Pre-Training Scheme
As shown in Fig. 2, our pre-training network includes a depth encoder $E_d$ and an image encoder $E_i$. Given the input dataset $S = \{I_i\}_{i=1}^{|S|}$, where $I_i \in \mathbb{R}^{3\times H\times W}$ is the $i$-th image rendered under a random camera view, we render the corresponding depth maps $D_{i,d_1}$ and $D_{i,d_2}$ at the same view angle but with different distances $d_1$ and $d_2$. We first adopt an intra-modality aggregation among $\{(D_{i,d_1}, D_{i,d_2})\}_{i=1}^{|S|}$ with $E_d$, and then extract image features from $\{I_i\}_{i=1}^{|S|}$ with $E_i$, enforcing $E_d$ to stay consistent with $E_i$ in a cross-modality sense. $E_d$ and $E_i$ are both initialized with the weights of the CLIP visual encoder. We freeze the parameters of $E_i$ during training, while $E_d$ is learnable.
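A minimal sketch of this encoder setup, assuming the open-source OpenAI clip package, might look as follows; the backbone name "ViT-B/32" is an illustrative choice rather than the paper's stated configuration.

```python
import copy
import clip
import torch

# Load pre-trained CLIP; model.visual is the CLIP visual encoder.
clip_model, _ = clip.load("ViT-B/32")

image_encoder = clip_model.visual                 # E_i, initialized from CLIP
depth_encoder = copy.deepcopy(clip_model.visual)  # E_d, same initialization

for p in image_encoder.parameters():
    p.requires_grad_(False)   # E_i is frozen during pre-training
for p in depth_encoder.parameters():
    p.requires_grad_(True)    # only E_d is updated
```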
Intra-Modality Learning. Considering the sparsity and disorder of point clouds in 3D space, even when depth maps are rendered at the same distance, the distributions of depth values vary greatly across views. To keep the aggregation in $E_d$ invariant to distance, intra-modality contrastive learning is adopted. For each input depth map $D_i$, we randomly modify the distance of the camera view while keeping the view angle, generating two augmented depth maps $D_{i,d_1}$ and $D_{i,d_2}$. $D_{i,d_1}$ and $D_{i,d_2}$ are then fed into $E_d$, extracting depth features $F^D_{i,d_1}, F^D_{i,d_2} \in \mathbb{R}^{1\times C}$. Following the NT-Xent loss in SimCLR [6], the intra-modality contrastive loss $\mathcal{L}_{intra}$ can be formulated as
$$\mathcal{L}_{intra} = \frac{1}{2N} \sum_{i=1}^{N} \left( l^i_{intra}(d_1, d_2) + l^i_{intra}(d_2, d_1) \right), \quad (5)$$
where $N$ denotes the batch size and $l^i_{intra}(\cdot)$ is based on the InfoNCE [32] loss; please refer to the supplementary material for more details. The final depth feature $F^D_i$ is the mean of $F^D_{i,d_1}$ and $F^D_{i,d_2}$.
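The sketch below shows a simplified version of this intra-modality term. It treats the other distance's features in the batch as negatives (a CLIP-style symmetric InfoNCE), whereas the full NT-Xent of SimCLR additionally draws negatives from the same view; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def intra_modality_loss(feats_d1: torch.Tensor, feats_d2: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """feats_d1, feats_d2: (N, C) depth features of the same batch at two distances."""
    n = feats_d1.shape[0]
    z1 = F.normalize(feats_d1, dim=-1)
    z2 = F.normalize(feats_d2, dim=-1)
    # Pairwise cosine similarities between the two augmented batches.
    logits_12 = z1 @ z2.t() / temperature        # (N, N)
    logits_21 = z2 @ z1.t() / temperature        # (N, N)
    targets = torch.arange(n, device=z1.device)  # positives lie on the diagonal
    # Symmetric InfoNCE: l_i(d1, d2) + l_i(d2, d1), averaged over the batch.
    return 0.5 * (F.cross_entropy(logits_12, targets) +
                  F.cross_entropy(logits_21, targets))
```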
Cross-Modality Learning. For a set of rendered RGB-D data, cross-modality contrastive learning aims to minimize the distance between rendered images and depth maps in the same pair, while maximizing the distance to the others. For each input image $I_i$, we extract the image features $F^I_i \in \mathbb{R}^{1\times C}$, which are exactly the CLIP visual features. Together with the depth features $F^D_i$, we obtain the cross-modality contrastive loss $\mathcal{L}_{cross}$ as follows,
$$\mathcal{L}_{cross} = \frac{1}{2N} \sum_{i=1}^{N} \left( l^i_{cross}(D, I) + l^i_{cross}(I, D) \right). \quad (6)$$
Similarly, $l^i_{cross}(\cdot)$ is based on the InfoNCE [32] loss.
$\mathcal{L}_{intra}$ and $\mathcal{L}_{cross}$ are propagated independently, and $\mathcal{L}_{intra}$ drops much faster than $\mathcal{L}_{cross}$ during our pre-training. Thus, we adopt a multi-task loss [22] to balance the two terms. The overall loss function $\mathcal{L}$ is formulated as
$$\mathcal{L} = \frac{1}{\sigma^2}\mathcal{L}_{intra} + \mathcal{L}_{cross} + \log(\sigma + 1), \quad (7)$$
where $\sigma$ is a learnable balance parameter.
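As a minimal sketch of the balancing term in Eq. (7), the module below combines the two scalar losses with a learnable $\sigma$; the initial value and the clamping of $\sigma$ are assumptions.

```python
import torch
import torch.nn as nn

class BalancedLoss(nn.Module):
    """L = (1 / sigma^2) * L_intra + L_cross + log(sigma + 1), with learnable sigma."""
    def __init__(self, sigma_init: float = 1.0):
        super().__init__()
        self.sigma = nn.Parameter(torch.tensor(sigma_init))

    def forward(self, loss_intra: torch.Tensor, loss_cross: torch.Tensor) -> torch.Tensor:
        sigma = self.sigma.clamp(min=1e-4)  # keep the weight finite
        return loss_intra / sigma.pow(2) + loss_cross + torch.log(sigma + 1.0)
```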
3.2.2 Depth Rendering
To convert point cloud data into rendered depth images, we need to project 3D coordinates $(X, Y, Z) \in \mathbb{R}^3$ to 2D coordinates $(\hat{X}, \hat{Y}) \in \mathbb{Z}^2$ in a specific view. Taking rendering from the front view as an example, a point at $(x, y, z)$ can simply be matched to the pixel at $(x/z, y/z)$ by perspective projection. However, there are still two issues: 1) multiple points can be projected to the same pixel in a given plane; 2) a large area of the rendered depth map remains blank, since points are sparsely distributed in 3D space. For the first issue, existing works [15,55] prefer a weighted summation over the matching points,
$$d(\hat{x}, \hat{y}) = \frac{\sum_{(x,y,z)} z/(z+\epsilon)}{\sum_{(x,y,z)} 1/z}, \quad (8)$$
where $(x, y, z)$ ranges over the set of points matching $(\hat{x}, \hat{y})$, and $\epsilon$ denotes a minimal value, e.g., $10^{-12}$. We argue that taking the minimum depth value of those points is more intuitive in 2D vision, as we cannot see through an object with the naked eye. For the second issue, few pixels can be covered
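As a hedged illustration of the minimum-depth rule argued for above (not the paper's actual rendering code), the sketch below projects a point cloud to a front-view depth map and keeps only the nearest point per pixel; the image size, the coordinate normalization, and the background value are assumptions, and points are assumed to lie in front of the camera (z > 0).

```python
import numpy as np

def render_min_depth(points: np.ndarray, size: int = 224) -> np.ndarray:
    """points: (P, 3) array of (x, y, z). Returns a (size, size) depth map."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Perspective projection: a point (x, y, z) maps to (x/z, y/z).
    u = x / z
    v = y / z
    # Normalize projected coordinates into pixel indices (assumed mapping).
    px = ((u - u.min()) / (u.max() - u.min() + 1e-12) * (size - 1)).astype(int)
    py = ((v - v.min()) / (v.max() - v.min() + 1e-12) * (size - 1)).astype(int)
    depth = np.full((size, size), np.inf)
    # Keep only the nearest point per pixel instead of the weighted sum of Eq. (8).
    np.minimum.at(depth, (py, px), z)
    depth[np.isinf(depth)] = 0.0  # uncovered pixels stay blank
    return depth
```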