former being lightweight to reduce the extraction cost dur-
ing query time. For small to medium-size databases, or with
the use of fast nearest neighbor search methods [24, 2, 25],
the extraction cost of the representation can be the test-time
bottleneck. In the asymmetric setup, the two representation
spaces corresponding to each of the two networks need to
be aligned and compatible. This is the objective of asym-
metric metric learning (AML) as introduced by Budnik and
Avrithis [6].
AML has been studied under the lens of asymmetric network architectures: the student model is a pruned variant of the teacher, or a different but lighter architecture, possibly discovered by neural architecture search. All these choices reduce the query time. The resolution of the input images is an important yet overlooked aspect. Fully convolutional architectures accept inputs of any resolution, but the representation extraction cost grows roughly quadratically with that resolution. Metric learning tasks that focus on instance-level
recognition are known to benefit from the use of a large
image resolution [31, 4]. The same holds for fine-grained
recognition where object details matter [35], as also shown
in this work. Therefore, input resolution is a critical param-
eter for the performance/efficiency trade-off.
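As a rough illustration of this quadratic dependence, the following sketch (assuming a torchvision ResNet-50 as the fully convolutional backbone; it is not a model from this work) times the forward pass at two resolutions:

```python
import time
import torch
import torchvision.models as models

# Minimal sketch (not from the paper): time the forward pass of a fully
# convolutional backbone at two resolutions. Every feature map scales with
# H x W, so the cost is expected to grow roughly quadratically with the
# input side length.
model = models.resnet50(weights=None).eval()

def forward_time(resolution, repeats=5):
    x = torch.randn(1, 3, resolution, resolution)
    with torch.no_grad():
        model(x)                                   # warm-up
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
    return (time.perf_counter() - start) / repeats

t_small, t_large = forward_time(224), forward_time(448)
# Doubling the side length quadruples the pixel count, so the ratio below
# should be roughly 4 (extraction cost roughly quadratic in resolution).
print(f"224px: {t_small:.3f}s  448px: {t_large:.3f}s  ratio: {t_large / t_small:.1f}")
```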
This work focuses on AML, where the asymmetry comes
at the level of input resolution between the database net-
work and the query network. The two architectures are the
same, but the query network is trained at a low resolution to
match the representation of the database network at a high
resolution, which is performed with a distillation process
that transfers knowledge from a teacher (database network)
to the student (query network). The contributions of this
work are summarized as follows:
• Asymmetry in the form of input image resolution is ex-
plored for the first time in deep metric learning.
• A distillation approach is proposed to align the representation (absolute distillation) and the pairwise similarities (relational distillation) between student and teacher across task-tailored augmentations of the same image (sketched after this list).
• We conclude that resolution asymmetry is a better way
to optimize the performance vs. efficiency trade-off com-
pared to network asymmetry.
• As a side-effect, the student obtained with distillation no-
ticeably outperforms the deep metric learning baselines
in conventional/symmetric retrieval.
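To make the two distillation terms concrete, here is a minimal PyTorch sketch; the function name, interface, and the particular relational variant are ours, not the paper's implementation. The student embeds a low-resolution view and the teacher a high-resolution view of the same image.

```python
import torch
import torch.nn.functional as F

def distillation_losses(student_emb, teacher_emb):
    """Sketch of absolute + relational distillation (hypothetical interface).

    student_emb: (B, D) embeddings of low-resolution views (query network).
    teacher_emb: (B, D) embeddings of high-resolution views (database network).
    """
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1).detach()      # teacher is frozen

    # Absolute distillation: pull each student embedding onto the teacher
    # embedding of the same image.
    abs_loss = (s - t).pow(2).sum(dim=1).mean()

    # Relational distillation (one plausible variant): make the cross
    # student-teacher similarities match the teacher-teacher similarities.
    rel_loss = (s @ t.T - t @ t.T).pow(2).mean()

    return abs_loss, rel_loss
```

In practice the two terms would presumably be weighted and summed, with only the student updated by gradient descent while the teacher stays fixed.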
A performance vs. efficiency comparison is shown in
Figure 1. Compared to baselines where a single network extracts both database and query examples at the same resolution (circle) or at different resolutions (diamond), distillation performs much better. The superiority of resolution asymmetry over network asymmetry is also evident. More details
about these experiments are discussed in Section 4.
2. Related work
Asymmetric embedding compatibility. In image re-
trieval, embedding compatibility has to be ensured when
the database examples are processed by a different net-
work than the query examples. To this end, AML [6] re-
defines standard metric learning losses in an asymmetric
way, i.e. the anchor example is processed by the query net-
work, while the database network processes the correspond-
ing positive and negative examples. However, for the ob-
jective of representation space alignment, these losses are
outperformed by a simple unsupervised regression loss on
the embeddings, a form of knowledge distillation between
teacher and student. Using the supervised losses, on the
other hand, boosts the performance of symmetric retrieval
with the student, which even surpasses its teacher. Follow-
ing the paradigm of unsupervised distillation, another re-
cent approach compels the student to mimic the contextual
similarities of image neighbors in the teacher embedding
space [50]. Beyond optimizing only the weights of the student network, a generalization additionally optimizes the network architecture via neural architecture search [11], although that work trains for classification rather than with a metric learning objective.
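As a concrete illustration of the asymmetric formulation and of the regression alternative discussed above, here is a minimal sketch; the function names, the margin value, and the use of a triplet loss in particular are our own simplification, not the code of [6].

```python
import torch
import torch.nn.functional as F

# Sketch of an asymmetric triplet loss in the spirit of AML [6]: the anchor
# is embedded by the student (query) network, while the positive and
# negative are embedded by the frozen teacher (database) network.
def asymmetric_triplet(student, teacher, anchor, positive, negative, margin=0.1):
    a = F.normalize(student(anchor), dim=1)
    with torch.no_grad():                              # database side is fixed
        p = F.normalize(teacher(positive), dim=1)
        n = F.normalize(teacher(negative), dim=1)
    d_pos = (a - p).pow(2).sum(dim=1)
    d_neg = (a - n).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

# The simpler unsupervised alternative reported to align the spaces better:
# plain regression of the student embedding onto the teacher embedding.
def embedding_regression(student, teacher, image):
    with torch.no_grad():
        t = F.normalize(teacher(image), dim=1)
    s = F.normalize(student(image), dim=1)
    return (s - t).pow(2).sum(dim=1).mean()
```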
Classification-based training is the dominant approach in
a task related to ours, called backward compatible learning
(BCT) [36]. However, the underlying task assumptions are
different. Its objective is to add new data processed with a
stronger backbone version without back-filling the current
database. Compatibility is established with cross-entropy
loss of the old classifier on old and new embeddings of
the same input image. This is extended to the compati-
bility of multiple embedding versions [23] or to tackling
open-set backward compatibility with a continual learning
approach [42]. Similarly, forward compatible learning [32]
stores side information during training, which is leveraged
in the future to transfer the old embeddings to another task.
Other methods for fixing inconsistent representation spaces
include class prototype alignment [3, 52] and transformation of both spaces rather than a single one [43, 22].
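For reference, the backward-compatibility objective described above can be sketched as follows; this is our reading of the idea with hypothetical module names, not the BCT code. The new backbone is trained so that its embeddings remain classifiable by the frozen classifier head of the old model, so the database need not be re-extracted.

```python
import torch
import torch.nn.functional as F

def bct_loss(new_backbone, new_head, old_head, image, label):
    """Sketch of backward-compatible training (hypothetical interface).

    old_head is the classifier of the previous model and is kept frozen
    (its parameters have requires_grad=False); gradients still flow
    through it into the new embedding.
    """
    emb = new_backbone(image)
    new_ce = F.cross_entropy(new_head(emb), label)   # regular training of the new model
    old_ce = F.cross_entropy(old_head(emb), label)   # compatibility: new embedding, old classifier
    return new_ce + old_ce
```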
Asymmetry also emerges when embeddings are col-
lected from diverse devices that use different models, e.g. in the domain of faces, where recognition should be compatible across all models [8], or in localization and mapping tasks with multiple agents [12].
Distillation and small image resolution. Image downsampling remains the primary pre-processing step even today, when typical GPU memory allows processing larger resolutions during training and testing. It is observed that using larger images reliably translates to higher performance regardless of the objective or dataset [35]. Yet there are still many valid use cases where inference has to be performed with limited resources. In this
context, distillation is used to align embeddings of high- and