Large-to-small Image Resolution Asymmetry in Deep Metric Learning
Pavel Suma Giorgos Tolias
Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague
{sumapave, toliageo}@fel.cvut.cz
Abstract
Deep metric learning for vision is trained by optimizing a representation network to map (non-)matching image pairs to (non-)similar representations. During testing, which typically corresponds to image retrieval, both database and query examples are processed by the same network to obtain the representation used for similarity estimation and ranking. In this work, we explore an asymmetric setup by light-weight processing of the query at a small image resolution to enable fast representation extraction. The goal is to obtain a network for database examples that is trained to operate on large resolution images and benefits from fine-grained image details, and a second network for query examples that operates on small resolution images but preserves a representation space aligned with that of the database network. We achieve this with a distillation approach that transfers knowledge from a fixed teacher network to a student via a loss that operates per image and solely relies on coupled augmentations without the use of any labels. In contrast to prior work that explores such asymmetry from the point of view of different network architectures, this work uses the same architecture but modifies the image resolution. We conclude that resolution asymmetry is a better way to optimize the performance/efficiency trade-off than architecture asymmetry. Evaluation is performed on three standard deep metric learning benchmarks, namely CUB200, Cars196, and SOP.
Code: https://github.com/pavelsuma/raml
1. Introduction
The performance of deep learning models typically increases with their size and computational complexity. Most work focuses on improving recognition performance and therefore relies on models that are expensive to train and deploy. Standard deep network architectures [21, 18] are available in different variants that cover a range of the trade-off between performance and efficiency. Optimizing this trade-off forms a particular line of research that attracts a lot of attention, since efficient and lightweight deep models allow deployment on mobile and low-resource devices or enable real-time execution. Therefore, powerful yet efficient models are desirable. One of the standard practices is to initially train a large model that is then used to obtain a smaller one, with weight pruning [17] and network distillation [20] being two dominant approaches to achieve that.

Figure 1. Retrieval performance (mAP) vs. extraction cost of the query representation (GFLOPs) for the CUB200 dataset. The notation format is "database setup"→"query setup", where R50 and R18 are two variants of the ResNet architecture. M is equal to 448 and indicates the width and height of images. Contrary to standard symmetric retrieval (circle), the query in the asymmetric setting is processed by a lighter network (triangle) or at a smaller resolution (diamond, pentagon). The highlighted points are networks trained with the proposed distillation approach to achieve resolution asymmetry (the focus of this work) and also network asymmetry for comparison.
[Plot placeholder: x-axis "query extraction cost (GFLOPs)", y-axis "performance (mAP)"; series: teacher symmetric (1), teacher-student asymmetric (net) (2), teacher asymmetric (res.) (3), teacher-student asymmetric (res.) (4); labeled points include R50:M, R50:0.5M, R18:M, R18:0.5M, R50:M→0.35M, R50:M→0.5M, R50:M→0.7M, and R50→R18:M.]
Network distillation uses the large model as a teacher that guides the training of a smaller student model. Most work in distillation is related to classification tasks, where the teacher logits act as supervision [20]. Still, some work focuses on metric learning and image retrieval, where either the underlying vector representation [33] or pairwise scalar similarity values [30] function as supervision for distillation. In classification, once the small model is obtained, the large one is no longer used. However, due to the pairwise nature of the retrieval task, a possible asymmetry emerges: the query and database examples are processed by two different networks, with the network of the former being lightweight to reduce the extraction cost during query time. For small to medium size databases, or with the use of fast nearest neighbor search methods [24, 2, 25], the extraction cost of the representation can be the test-time bottleneck. In the asymmetric setup, the two representation spaces corresponding to each of the two networks need to be aligned and compatible. This is the objective of asymmetric metric learning (AML) as introduced by Budnik and Avrithis [6].
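To make the asymmetric setup concrete, the following minimal PyTorch sketch (our illustration, not code from the paper; query_net and db_net are hypothetical stand-ins for the two networks) shows how retrieval proceeds when queries and database items are embedded by different models that share a representation space:

```python
import torch
import torch.nn.functional as F

def rank_asymmetric(query_net, db_net, query_images, db_images):
    """Embed queries and database items with different networks, then rank
    database items for each query by cosine similarity."""
    with torch.no_grad():
        q = F.normalize(query_net(query_images), dim=-1)  # lightweight query network
        d = F.normalize(db_net(db_images), dim=-1)        # heavy database network
    sims = q @ d.T                                # cosine similarity (unit-norm vectors)
    return sims.argsort(dim=-1, descending=True)  # ranked database indices per query
```

The similarity q @ d.T is only meaningful if the two networks map matching images to nearby points in the same space, which is exactly the alignment that AML targets.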
AML has been studied under the lens of asymmetric network architectures: the student model is a pruned variant of the teacher, or a different but lighter architecture, possibly discovered by neural architecture search. All these aspects reduce the query time. The resolution of the input images is an important aspect that is overlooked. The use of fully convolutional architectures allows any input resolution, while the representation extraction cost grows roughly quadratically with the image side. Metric learning tasks that focus on instance-level recognition are known to benefit from the use of a large image resolution [31, 4]. The same holds for fine-grained recognition, where object details matter [35], as also shown in this work. Therefore, input resolution is a critical parameter for the performance/efficiency trade-off.
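Since the backbones are fully convolutional, the same weights accept either resolution and produce a fixed-size output after global pooling; only the cost changes. A minimal sketch with torchvision (illustrative, not from the paper):

```python
import torch
from torchvision.models import resnet50

net = resnet50().eval()  # fully convolutional up to the global average pooling
for side in (224, 448):
    x = torch.randn(1, 3, side, side)
    with torch.no_grad():
        out = net(x)             # output dimensionality is independent of input size
    print(side, tuple(out.shape))
# Convolutional FLOPs scale with the spatial area of the feature maps, so
# doubling the image side (224 -> 448) costs roughly 4x the computation.
```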
This work focuses on AML where the asymmetry comes at the level of input resolution between the database network and the query network. The two architectures are the same, but the query network is trained at a low resolution to match the representation of the database network at a high resolution. This is performed with a distillation process that transfers knowledge from a teacher (database network) to a student (query network). The contributions of this work are summarized as follows:
• Asymmetry in the form of input image resolution is explored for the first time in deep metric learning.
• A distillation approach is proposed to align the representation (absolute distillation) and the pairwise similarities (relational distillation) between student and teacher across task-tailored augmentations of the same image (sketched below).
• We conclude that resolution asymmetry is a better way to optimize the performance vs. efficiency trade-off compared to network asymmetry.
• As a side-effect, the student obtained with distillation noticeably outperforms the deep metric learning baselines in conventional/symmetric retrieval.
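As a rough sketch of the two loss components named in the second contribution (our own paraphrase under simplifying assumptions, not the paper's exact formulation): the student sees a low-resolution augmentation, the fixed teacher sees a coupled high-resolution augmentation of the same image, absolute distillation aligns the paired embeddings, and relational distillation matches pairwise similarities within the batch.

```python
import torch
import torch.nn.functional as F

def distill_loss(student, teacher, small_views, large_views, w_rel=1.0):
    """Label-free distillation over coupled augmentations of the same images."""
    s = F.normalize(student(small_views), dim=-1)      # low-resolution student input
    with torch.no_grad():                              # the teacher stays fixed
        t = F.normalize(teacher(large_views), dim=-1)  # coupled high-resolution views
    abs_loss = (1 - (s * t).sum(-1)).mean()   # absolute: align paired embeddings
    rel_loss = F.mse_loss(s @ s.T, t @ t.T)   # relational: match pairwise similarities
    return abs_loss + w_rel * rel_loss
```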
A performance vs. efficiency comparison is shown in Figure 1. Compared to the baselines where a single network extracts both database and query examples at the same resolution (circle) or at different resolutions (diamond), distillation performs much better. The superiority of resolution asymmetry over network asymmetry is evident too. More details about these experiments are discussed in Section 4.
2. Related work
Asymmetric embedding compatibility. In image retrieval, embedding compatibility has to be ensured when the database examples are processed by a different network than the query examples. To this end, AML [6] redefines standard metric learning losses in an asymmetric way, i.e. the anchor example is processed by the query network, while the database network processes the corresponding positive and negative examples. However, for the objective of representation space alignment, these losses are outperformed by a simple unsupervised regression loss on the embeddings, a form of knowledge distillation between teacher and student. Using the supervised losses, on the other hand, boosts the performance of symmetric retrieval with the student, which can even surpass its teacher. Following the paradigm of unsupervised distillation, another recent approach compels the student to mimic the contextual similarities of image neighbors in the teacher embedding space [50]. Beyond optimizing the weights of the student network, a generalization additionally optimizes the network architecture with neural architecture search [11], although that work trains for classification rather than in a metric learning manner.
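For concreteness, an asymmetric redefinition of a contrastive loss could look like the sketch below (our paraphrase of the idea in [6]; the names and margin value are hypothetical). The anchor passes through the student (query network), while the positive and negative pass through the frozen teacher (database network):

```python
import torch
import torch.nn.functional as F

def asymmetric_contrastive(student, teacher, anchor, positive, negative, margin=0.5):
    """Contrastive loss with the anchor on the student side only."""
    a = F.normalize(student(anchor), dim=-1)
    with torch.no_grad():                        # database network is fixed
        p = F.normalize(teacher(positive), dim=-1)
        n = F.normalize(teacher(negative), dim=-1)
    pos_term = 1 - (a * p).sum(-1)                       # pull matching pairs together
    neg_term = ((a * n).sum(-1) - margin).clamp(min=0)   # push non-matching pairs apart
    return (pos_term + neg_term).mean()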
Classification-based training is the dominant approach in a task relevant to ours, called backward compatible training (BCT) [36]. However, the underlying task assumptions are different. Its objective is to add new data processed by a stronger backbone version without back-filling the current database. Compatibility is established with a cross-entropy loss of the old classifier applied to old and new embeddings of the same input image. This is extended to compatibility across multiple embedding versions [23] or to tackling open-set backward compatibility with a continual learning approach [42]. Similarly, forward compatible learning [32] stores side information during training, which is leveraged in the future to transfer the old embeddings to another task. Other methods for fixing inconsistent representation spaces include class prototype alignment [3, 52] and transformation of both spaces rather than a single one [43, 22].
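A minimal sketch of the BCT compatibility term (our illustration, not the paper's code; old_head stands for the frozen classifier of the old model):

```python
import torch.nn.functional as F

def bct_loss(new_backbone, old_head, images, labels):
    """Backward compatibility: the frozen old classifier must still recognize
    the classes from embeddings produced by the new backbone."""
    emb_new = new_backbone(images)      # embeddings from the new model
    logits = old_head(emb_new)          # old class prototypes, kept fixed
    return F.cross_entropy(logits, labels)
```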
Asymmetry also emerges when embeddings are collected from diverse devices that use different models, e.g. in the domain of faces, where the recognition should be compatible with all models [8], or in localization and mapping tasks with multiple agents [12].
Distillation and small image resolution. Image downsampling remains the primary pre-processing step even today, when typical GPU memory allows processing larger resolutions during training and testing. It is observed that using larger images reliably translates to higher performance regardless of the objective or dataset [35]. Yet there are still many valid use cases where the inference has to be done with limited resources. In this context, distillation is used to align embeddings of high- and low-resolution images.