2 Related Work and Unique Contributions
This work is related to deep metric learning, transformer-based learning methods, and residual encoding. In this section, we review existing methods on these topics and discuss the unique contributions of our approach.
(1) Deep metric learning. Deep metric learning aims to learn discriminative features that minimize intra-class sample distances and maximize inter-class sample distances in a contrastive manner. Contrastive loss [30] was successfully used in early deep metric learning methods to optimize the pairwise distances between samples. By exploring more sophisticated relationships between samples, a variety of metric loss functions have been developed, such as triplet loss [31], lifted structured loss [32], proxy-anchor loss [9], and multi-similarity (MS) loss [11]. According to the studies in [22] and [33], the MS loss is one of the most effective metric loss functions.
Some recent methods explore how to use multiple features to learn robust feature embeddings. Kan et al. [2] and Seidenschwarz et al. [18] adopt the K-nearest neighbors (k-NN) of an anchor image to build a local graph neural network (GNN) and refine embedding vectors through message exchanges between graph nodes. Zhao et al. [17] proposed a structural matching method that learns a metric between feature maps based on optimal transport theory. Building on the softmax formulation, margin-based softmax loss functions [34; 35; 36] have also been proposed to learn discriminative features. Sohn [37] improved the contrastive loss and triplet loss by introducing N negative examples and proposed the N-pair loss function to speed up model convergence during training.
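To make the pairwise and triplet objectives above concrete, the following is a minimal NumPy sketch of the contrastive loss and triplet loss on precomputed distances (function names and margin values are illustrative, not taken from the cited works):

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    # d: distance between a pair; y: 1 if same class, 0 otherwise.
    # Pulls positive pairs together, pushes negatives beyond the margin.
    return y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2

def triplet_loss(d_ap, d_an, margin=0.2):
    # d_ap: anchor-positive distance; d_an: anchor-negative distance.
    # Loss is zero once the negative is farther than the positive by the margin.
    return np.maximum(d_ap - d_an + margin, 0.0)
```

The triplet loss refines the pairwise formulation by comparing relative rather than absolute distances, which is the direction the more sophisticated losses above continue to push.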
(2) Transformer-based learning methods. This work is related to transformer-based learning methods since our method also analyzes the correlation between features and uses this correlation information to aggregate features. The original transformer [27] learns a self-attention function and a feed-forward transformation network for natural language processing. Recently, it has been successfully applied to computer vision and image processing. ViT [38] demonstrates that a pure transformer can achieve state-of-the-art performance in image classification. ViT treats each image as a sequence of tokens and feeds them to multiple transformer layers to perform classification. Subsequently, DeiT [39] further explores a data-efficient training strategy and a distillation approach for ViT. More recent methods such as T2T-ViT [29], TNT [40], CrossViT [41], and LocalViT [42] further improve ViT for image classification. PVT [43] incorporates a pyramid structure into the transformer for dense prediction tasks. After that, methods such as Swin [28], CvT [44], CoaT [45], LeViT [46], Twins [47], and MiT [48] enhance the local continuity of features and remove fixed-size position embeddings to improve the performance of transformers on dense prediction tasks. For deep metric learning, El-Nouby et al. [1] and Ermolov et al. [20] adopt the DeiT-S network [39] as a backbone to extract features, achieving impressive performance.
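The self-attention operation at the core of these transformer methods can be sketched as a single-head scaled dot-product attention in NumPy (dimensions and names here are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model); Wq, Wk, Wv: (d_model, d_head) projections.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # scaled dot-product
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # row-wise softmax
    return w @ V                                    # correlation-weighted sum

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

Each output token is thus a sum of value vectors weighted by pairwise token correlations, which is the aggregation mechanism our method also builds on.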
(3) Residual encoding. Residual encoding was first proposed by Jégou et al. [49], where the vector of locally aggregated descriptors (VLAD) algorithm aggregates the residuals between features and their best-matching codewords. Based on the VLAD method, VLAD-CNN [50] developed residual encoders for visual recognition and understanding tasks. NetVLAD [51] and Deep-TEN [52] extend this idea and develop end-to-end learnable residual encoders based on soft assignment. It should be noted that features learned by these methods typically have very large sizes, for example, 16k, 32k, and 4096 dimensions for AlexNet [53], VGG-16 [54], and ResNet-50 [55], respectively.
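The VLAD-style residual aggregation described above can be sketched as follows, using hard nearest-codeword assignment for simplicity (NetVLAD and Deep-TEN replace this with learnable soft assignment; variable names are our own):

```python
import numpy as np

def vlad_encode(features, codebook):
    # features: (n, d) local descriptors; codebook: (k, d) codewords.
    k, d = codebook.shape
    # Assign each feature to its nearest codeword.
    dists = ((features[:, None, :] - codebook[None, :, :])**2).sum(-1)
    assign = dists.argmin(axis=1)
    # Accumulate residuals (feature - codeword) per codeword.
    enc = np.zeros((k, d))
    for i, a in enumerate(assign):
        enc[a] += features[i] - codebook[a]
    return enc.ravel()  # final k*d-dimensional descriptor
```

The concatenated per-codeword residual sums explain the very large output sizes noted above: the descriptor dimension grows as the number of codewords times the feature dimension.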
(4) Unique Contributions. Compared to the above existing methods, the unique contributions of this paper can be summarized as follows: (1) We introduce a new CRT method which learns a set of prototype features, projects the feature map onto each prototype, and then encodes its features using their projection residuals weighted by their correlation coefficients with each prototype. (2) We introduce a diversity constraint on the set of prototype features so that the CRT method can represent and encode the feature map from a set of complementary perspectives. Unlike existing transformer-based feature representation approaches, which encode the original values of features based on global correlation analysis, the proposed coded residual transform encodes the relative differences between the original features and their projected prototypes. (3) To further enhance generalization performance, we propose to enforce feature distribution consistency between coded residual transforms with different sizes of projection prototypes and embedding dimensions. (4) We demonstrate that this multi-perspective projection with diversified prototypes and coded residual representation based on relative differences is able to achieve significantly improved generalization