2 Related Work and Unique Contributions
This work is related to deep metric learning, transformer-based learning methods, and residual encoding. In this section, we review existing methods on these topics and discuss the unique contributions of our approach.
(1) Deep metric learning. Deep metric learning aims to learn discriminative features that minimize intra-class sample distances and maximize inter-class sample distances in a contrastive manner. Contrastive loss [30] was successfully used in early deep metric learning methods to optimize the pairwise distances between samples. By exploring more sophisticated relationships between samples, a variety of metric loss functions have been developed, such as triplet loss [31], lifted structured loss [32], proxy-anchor loss [9], and multi-similarity (MS) loss [11]. According to the studies in [22] and [33], the MS loss is one of the most effective metric loss functions.
Some recent methods explore how to use multiple features to learn robust feature embeddings. Kan et al. [2] and Seidenschwarz et al. [18] adopt the K-nearest neighbors (k-NN) of an anchor image to build a local graph neural network (GNN) and refine embedding vectors through message exchanges between graph nodes. Zhao et al. [17] proposed a structural matching method that learns a metric between feature maps based on optimal transport theory. Building on the softmax formulation, margin-based softmax loss functions [34; 35; 36] have also been proposed to learn discriminative features. Sohn [37] improved the contrastive loss and triplet loss by introducing N negative examples and proposed the N-pair loss function to speed up model convergence during training.
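To make the pairwise and triplet objectives above concrete, the following is a minimal NumPy sketch of the contrastive loss and triplet loss on precomputed distances (function names and margin values are illustrative, not taken from the cited works):

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    # d: distance between a pair; y: 1 if same class, 0 otherwise.
    # Pulls positive pairs together, pushes negatives beyond the margin.
    return y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2

def triplet_loss(d_ap, d_an, margin=0.2):
    # d_ap: anchor-positive distance; d_an: anchor-negative distance.
    # Loss is zero once the negative is farther than the positive by the margin.
    return np.maximum(d_ap - d_an + margin, 0.0)
```

The triplet loss refines the pairwise formulation by comparing relative rather than absolute distances, which is the direction the more sophisticated losses above continue to push.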
(2) Transformer-based learning methods. This work is related to transformer-based learning methods since our method also analyzes the correlation between features and uses this correlation information to aggregate features. The original transformer [27] learns a self-attention function and a feed-forward transformation network for natural language processing. Recently, it has been successfully applied to computer vision and image processing. ViT [38] demonstrates that a pure transformer can achieve state-of-the-art performance in image classification. ViT treats each image as a sequence of tokens and feeds them to multiple transformer layers to perform classification. Subsequently, DeiT [39] further explores a data-efficient training strategy and a distillation approach for ViT. More recent methods such as T2T-ViT [29], TNT [40], CrossViT [41], and LocalViT [42] further improve ViT for image classification. PVT [43] incorporates a pyramid structure into the transformer for dense prediction tasks. After that, methods such as Swin [28], CvT [44], CoaT [45], LeViT [46], Twins [47], and MiT [48] enhance the local continuity of features and remove fixed-size position embeddings to improve the performance of transformers on dense prediction tasks. For deep metric learning, El-Nouby et al. [1] and Ermolov et al. [20] adopt the DeiT-S network [39] as a backbone to extract features, achieving impressive performance.
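The self-attention operation at the core of these transformer methods can be sketched as a single-head scaled dot-product attention in NumPy (dimensions and names here are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model); Wq, Wk, Wv: (d_model, d_head) projections.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # scaled dot-product
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # row-wise softmax
    return w @ V                                    # correlation-weighted sum

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

Each output token is thus a sum of value vectors weighted by pairwise token correlations, which is the aggregation mechanism our method also builds on.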
(3) Residual encoding. Residual encoding was first proposed by Jégou et al. [49], where the vector of locally aggregated descriptors (VLAD) algorithm aggregates the residuals between features and their best-matching codewords. Based on the VLAD method, VLAD-CNN [50] developed residual encoders for visual recognition and understanding tasks. NetVLAD [51] and Deep-TEN [52] extend this idea and develop end-to-end learnable residual encoders based on soft assignment. It should be noted that features learned by these methods typically have very large sizes, for example, 16k, 32k, and 4096 dimensions for AlexNet [53], VGG-16 [54], and ResNet-50 [55], respectively.
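The VLAD-style residual aggregation described above can be sketched as follows, using hard nearest-codeword assignment for simplicity (NetVLAD and Deep-TEN replace this with learnable soft assignment; variable names are our own):

```python
import numpy as np

def vlad_encode(features, codebook):
    # features: (n, d) local descriptors; codebook: (k, d) codewords.
    k, d = codebook.shape
    # Assign each feature to its nearest codeword.
    dists = ((features[:, None, :] - codebook[None, :, :])**2).sum(-1)
    assign = dists.argmin(axis=1)
    # Accumulate residuals (feature - codeword) per codeword.
    enc = np.zeros((k, d))
    for i, a in enumerate(assign):
        enc[a] += features[i] - codebook[a]
    return enc.ravel()  # final k*d-dimensional descriptor
```

The concatenated per-codeword residual sums explain the very large output sizes noted above: the descriptor dimension grows as the number of codewords times the feature dimension.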
(4) Unique Contributions. Compared to the above existing methods, the unique contributions of this paper can be summarized as follows: (1) We introduce a new CRT method which learns a set of prototype features, projects the feature map onto each prototype, and then encodes its features using their projection residuals weighted by their correlation coefficients with each prototype. (2) We introduce a diversity constraint on the set of prototype features so that the CRT method can represent and encode the feature map from a set of complementary perspectives. Unlike existing transformer-based feature representation approaches, which encode the original values of features based on global correlation analysis, the proposed coded residual transform encodes the relative differences between the original features and their projected prototypes. (3) To further enhance generalization performance, we propose to enforce feature distribution consistency between coded residual transforms with different sizes of projection prototypes and embedding dimensions. (4) We demonstrate that this multi-perspective projection with diversified prototypes and coded residual representation based on relative differences is able to achieve significantly improved generalization