Coded Residual Transform for Generalizable Deep
Metric Learning
Shichao Kan1, Yixiong Liang1, Min Li1, Yigang Cen2,3,*, Jianxin Wang1, Zhihai He4,5,*
1School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083
2Institute of Information Science, School of Computer and Information Technology,
Beijing Jiaotong University, Beijing 100044, China
3Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China
4Department of Electrical and Electronic Engineering, Southern University of Science and Technology,
Shenzhen, China
5Pengcheng Lab, Shenzhen, 518066, China
kanshichao@csu.edu.cn, yxliang@csu.edu.cn, limin@mail.csu.edu.cn
ygcen@bjtu.edu.cn, jxwang@mail.csu.edu.cn, hezh@sustech.edu.cn
Abstract
A fundamental challenge in deep metric learning is the generalization capability of the feature embedding network model, since the embedding network learned on training classes needs to be evaluated on new test classes. To address this challenge, in this paper, we introduce a new method called coded residual transform (CRT) for deep metric learning to significantly improve its generalization capability. Specifically, we learn a set of diversified prototype features, project the feature map onto each prototype, and then encode its features using their projection residuals weighted by their correlation coefficients with each prototype. The proposed CRT method has the following two unique characteristics. First, it represents and encodes the feature map from a set of complementary perspectives based on projections onto diversified prototypes. Second, unlike existing transformer-based feature representation approaches, which encode the original values of features based on global correlation analysis, the proposed coded residual transform encodes the relative differences between the original features and their projected prototypes. Embedding space density and spectral decay analysis show that this multi-perspective projection onto diversified prototypes and coded residual representation are able to achieve significantly improved generalization capability in metric learning. Finally, to further enhance the generalization performance, we propose to enforce consistency on the feature similarity matrices between coded residual transforms with different sizes of projection prototypes and embedding dimensions. Our extensive experimental results and ablation studies demonstrate that the proposed CRT method outperforms state-of-the-art deep metric learning methods by large margins, improving upon the current best method by up to 4.28% on the CUB dataset.
1 Introduction
Deep metric learning (DML) aims to learn effective features to characterize or represent images, which has important applications in image retrieval [1; 2], image recognition [3], person re-identification [4], image segmentation [5], and tracking [6]. Successful metric learning needs to achieve the following two objectives: (1) Discriminative. In the embedded feature space, image features with the same
*Corresponding authors
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04180v1 [cs.CV] 9 Oct 2022
semantic labels should be aggregated into compact clusters while those from different classes should be well separated from each other. (2) Generalizable. The learned features should generalize well from the training images to test images of new classes which have not been seen before. Over the past few years, methods based on deep neural networks, such as metric loss function design [7; 8; 9; 10; 11], embedding transfer [12; 13; 14; 15; 16], structural matching [17], graph neural networks [2; 18], language guidance [19], and vision transformers [1; 20; 21], have achieved remarkable progress on learning discriminative features. However, generalization to unseen new classes remains a significant challenge for existing deep metric learning methods.
In the literature, to improve deep metric learning performance and alleviate the generalization problem on unseen classes, regularization techniques [22; 15], language-guided DML [19], and feature fusion [2; 8; 18] methods have been developed. Existing approaches to the generalization challenge in metric learning focus on the robustness of linear or kernel-based distance metrics [23; 24], analysis of error bounds of the generalization process [25], and correlation analysis between generalization and structural characteristics of the learned embedding space [22]. It should be noted that in existing methods, the input image is analyzed and transformed as a whole into an embedded feature. In other words, the image is represented and projected globally from a single perspective. We recognize that this single-perspective projection is not able to represent and encode the highly complex and dynamic correlation structures in the high-dimensional feature space, since they are collapsed and globally projected onto one single perspective and the local correlation dynamics are suppressed. According to our experiments, this single-perspective global projection increases the marginal variance [26] and consequently degrades the generalization capability of the deep metric learning method.
Furthermore, we observe that existing deep metric learning methods attempt to transform and encode the original features. From the generalization point of view, we find that it is more effective to learn the embedding based on the relative differences between features, since the absolute values of features may vary significantly from the training to new test classes, while the relative change patterns between features may remain largely invariant. To further understand this idea, consider the following toy example: a face in the daytime may appear very different from a face at night due to changes in lighting conditions. However, an effective face detector with sufficient generalization power will not focus on the absolute pixel values of the face image. Instead, it detects the face based on the relative change patterns between neighboring regions inside the face image. Motivated by this observation, in this work, to address the generalization challenge, we propose to learn the embedded feature from the projection residuals of the feature map, instead of its absolute features.
The above two ideas, namely multi-perspective projection and residual encoding, lead to our proposed method of coded residual transform for deep metric learning. Specifically, we propose to learn a set of diversified prototype features, project the features onto every prototype, and then encode the features using the projection residuals weighted by their correlation coefficients with the target prototype. Unlike existing transformer-based feature representation approaches, which encode the original values of features based on global correlation analysis [27; 28; 29], the proposed coded residual transform encodes the relative differences between original features and their projected prototypes. Our extensive experimental results and ablation studies demonstrate that the proposed CRT method is able to improve the generalization performance of deep metric learning, outperforming state-of-the-art methods by large margins and improving upon the current best method by up to 4.28%.
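The CRT pipeline is described here only in prose, so the following is a minimal numerical sketch of one plausible instantiation. The softmax normalization of the correlation coefficients and the feature-minus-prototype residual definition are our assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def coded_residual_transform(features, prototypes):
    """Sketch of a coded residual transform.

    features:   (N, D) array of local features from a feature map.
    prototypes: (K, D) array of learned prototype features.
    Returns a (K, D) encoding: for each prototype, the
    correlation-weighted sum of residuals between the features
    and that prototype.
    """
    # Correlation coefficients between each feature and each prototype,
    # normalized across prototypes with a softmax (an assumption).
    sim = features @ prototypes.T                      # (N, K)
    sim = sim - sim.max(axis=1, keepdims=True)         # numerical stability
    w = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

    # Residuals between every feature and every prototype: (N, K, D).
    residuals = features[:, None, :] - prototypes[None, :, :]

    # Weighted aggregation of the residuals per prototype.
    encoded = (w[:, :, None] * residuals).sum(axis=0)  # (K, D)
    return encoded

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4))    # 8 local features, 4-d
protos = rng.standard_normal((3, 4))   # 3 prototypes
code = coded_residual_transform(feats, protos)
print(code.shape)  # (3, 4)
```

Because the softmax weights concentrate on the prototypes most correlated with each feature, residuals relative to nearby prototypes dominate the encoding, which is the behavior the correlation-based weighting discussed below relies on.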
We learn the projection prototypes on the training classes and transfer them to the test classes. Although the training and test classes share the same set of prototypes, the actual distributions of projection prototypes during training and testing can differ substantially due to the distribution shift between the training and test classes. During the coded residual transform, we assign different weights to different projection residuals based on the correlation between the feature and the corresponding prototype. Therefore, for the training classes, the subset of prototypes close to the training images will have larger weights; similarly, for the test classes, the subset of prototypes close to the test images will have larger weights. This correlation-based weighting of the projection residuals contributes significantly to the overall performance gain.
2 Related Work and Unique Contributions
This work is related to deep metric learning, transformer-based learning methods, and residual
encoding. In this section, we review the existing methods on these topics and discuss the unique
novelty of our approach.
(1) Deep metric learning.
Deep metric learning aims to learn discriminative features with the goal of minimizing intra-class sample distances and maximizing inter-class sample distances in a contrastive manner. The contrastive loss [30] was successfully used in early methods of deep metric learning, aiming to optimize the pairwise distance between samples. By exploring more sophisticated relationships between samples, a variety of metric loss functions, such as the triplet loss [31], lifted structured loss [32], proxy-anchor loss [9], and multi-similarity (MS) loss [11], have been developed. According to the studies in [22] and [33], the MS loss was verified to be one of the most effective metric loss functions. Some recent methods explore how to use multiple features to learn robust feature embeddings. Kan et al. [2] and Seidenschwarz et al. [18] adopt the K-nearest neighbors (k-NN) of an anchor image to build a local graph neural network (GNN) and refine embedding vectors based on message exchanges between the graph nodes. Zhao et al. [17] proposed a structural matching method to learn a metric function between feature maps based on optimal transport theory. Based on the softmax form, margin-based softmax loss functions [34; 35; 36] were also proposed to learn discriminative features. Sohn [37] improved the contrastive loss and triplet loss by introducing N negative examples and proposed the N-pair loss function to speed up model convergence during training.
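The family of losses above can be illustrated with the simplest member; the sketch below is the standard hinge-based triplet loss, written from its usual textbook definition rather than taken from any of the cited papers.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: penalize the anchor-positive distance
    unless it is at least `margin` smaller than the anchor-negative
    distance."""
    d_ap = np.linalg.norm(anchor - positive)   # intra-class distance
    d_an = np.linalg.norm(anchor - negative)   # inter-class distance
    return max(0.0, d_ap - d_an + margin)

a = np.array([1.0, 0.0])    # anchor embedding
p = np.array([0.9, 0.1])    # same-class sample
n = np.array([-1.0, 0.0])   # different-class sample
print(triplet_loss(a, p, n))  # 0.0: the negative is already far enough
```

Losses such as lifted structured, MS, and N-pair generalize this idea by weighting or aggregating over many positive and negative pairs per anchor instead of a single triplet.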
(2) Transformer-based learning methods.
This work is related to transformer-based learning methods since our method also analyzes the correlation between features and uses this correlation information to aggregate features. The original transformer [27] aims to learn a self-attention function and a feed-forward transformation network for natural language processing. Recently, it has been successfully applied to computer vision and image processing. ViT [38] demonstrates that a pure transformer can achieve state-of-the-art performance in image classification. ViT treats each image as a sequence of tokens and then feeds them to multiple transformer layers to perform the classification. Subsequently, DeiT [39] further explores a data-efficient training strategy and a distillation approach for ViT. More recent methods such as T2T-ViT [29], TNT [40], CrossViT [41], and LocalViT [42] further improve the ViT method for image classification. PVT [43] incorporates a pyramid structure into the transformer for dense prediction tasks. After that, methods such as Swin [28], CvT [44], CoaT [45], LeViT [46], Twins [47], and MiT [48] enhance the local continuity of features and remove the fixed-size position embedding to improve the performance of transformers for dense prediction tasks. For deep metric learning, El-Nouby et al. [1] and Ermolov et al. [20] adopt the DeiT-S network [39] as a backbone to extract features, achieving impressive performance.
(3) Residual encoding.
Residual encoding was first proposed by Jégou et al. [49], where the vector of locally aggregated descriptors (VLAD) algorithm is used to aggregate the residuals between features and their best-matching codewords. Based on the VLAD method, VLAD-CNN [50] developed residual encoders for visual recognition and understanding tasks. NetVLAD [51] and Deep-TEN [52] extend this idea and develop an end-to-end learnable residual encoder based on soft assignment. It should be noted that features learned by these methods typically have very large sizes, for example, 16k, 32k, and 4096 dimensions for AlexNet [53], VGG-16 [54], and ResNet-50 [55], respectively.
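As a concrete illustration of VLAD-style residual encoding, here is a simplified hard-assignment sketch (the learnable NetVLAD and Deep-TEN variants replace the argmin with a differentiable soft assignment); it also shows why the resulting codes grow so large: the output dimension is the codebook size times the descriptor dimension.

```python
import numpy as np

def vlad_encode(descriptors, codewords):
    """VLAD: assign each descriptor to its nearest codeword,
    accumulate the residuals per codeword, then flatten and
    L2-normalize."""
    # Squared distances between descriptors (N, D) and codewords (K, D).
    d2 = ((descriptors[:, None, :] - codewords[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                 # hard assignment, shape (N,)

    K, D = codewords.shape
    v = np.zeros((K, D))
    for i, k in enumerate(assign):
        v[k] += descriptors[i] - codewords[k]  # accumulate the residual

    v = v.ravel()                              # flatten to (K * D,)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

rng = np.random.default_rng(1)
desc = rng.standard_normal((100, 8))   # 100 local descriptors, 8-d
cw = rng.standard_normal((16, 8))      # codebook of 16 codewords
code = vlad_encode(desc, cw)
print(code.shape)  # (128,): codebook size x descriptor dimension
```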
(4) Unique Contributions.
Compared to the above existing methods, the unique contributions of this paper can be summarized as follows: (1) We introduce a new CRT method which learns a set of prototype features, projects the feature map onto each prototype, and then encodes its features using their projection residuals weighted by their correlation coefficients with each prototype. (2) We introduce a diversity constraint for the set of prototype features so that the CRT method can represent and encode the feature map from a set of complementary perspectives. Unlike existing transformer-based feature representation approaches which encode the original values of features based on global correlation analysis, the proposed coded residual transform encodes the relative differences between original features and their projected prototypes. (3) To further enhance the generalization performance, we propose to enforce feature distribution consistency between coded residual transforms with different sizes of projection prototypes and embedding dimensions. (4) We demonstrate that this multi-perspective projection with diversified prototypes and coded residual representation based on relative differences is able to achieve significantly improved generalization capability.
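Contribution (3) can be illustrated with a toy consistency objective. The mean-squared penalty between cosine-similarity matrices below is an assumed instantiation for illustration, not the paper's exact loss; it only shows how two branches with different embedding dimensions can still be compared through their batch similarity structure.

```python
import numpy as np

def similarity_consistency_loss(emb_a, emb_b):
    """Consistency between the cosine-similarity matrices of two
    embedding branches computed on the same batch (e.g. CRTs with
    different prototype counts / embedding dimensions)."""
    def cosine_sim_matrix(e):
        e = e / np.linalg.norm(e, axis=1, keepdims=True)
        return e @ e.T                       # (B, B), dimension-agnostic

    s_a = cosine_sim_matrix(emb_a)
    s_b = cosine_sim_matrix(emb_b)
    # Mean squared difference between the two similarity matrices.
    return ((s_a - s_b) ** 2).mean()

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 16))   # branch A: 16-d embeddings
y = rng.standard_normal((4, 32))   # branch B: 32-d embeddings
print(similarity_consistency_loss(x, x))  # 0.0: a branch agrees with itself
loss = similarity_consistency_loss(x, y)
```

Because the similarity matrices are both B x B regardless of embedding width, minimizing this penalty pushes the two branches toward the same relational structure over the batch even though their embedding spaces differ.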