General Image Descriptors for Open World Image Retrieval using ViT CLIP

Marcos V. Conde1, Ivan Aerlic2, Simon Jégou3

1H2O.ai and Computer Vision Lab, CAIDAS, University of Würzburg, Germany

2Independent researcher and Team Leader, Australia

3Independent researcher, France

marcos.conde-osorio@uni-wuerzburg.de

https://github.com/IvanAer/G-Universal-CLIP

Abstract

The Google Universal Image Embedding (GUIE) Challenge is one of the first competitions in multi-domain image representations in the wild, covering a wide distribution of objects: landmarks, artwork, food, etc. This is a fundamental computer vision problem with notable applications in image retrieval, search engines and e-commerce.

In this work, we explain our 4th place solution to the GUIE Challenge, and our "bag of tricks" to fine-tune zero-shot Vision Transformers (ViT) pre-trained using CLIP.

1. Introduction

Image representations are a critical building block of computer vision applications [11]. Traditionally, research on image embedding learning has been conducted with a focus on per-domain models [18,20,23]. Generally, solutions are based on generic embedding learning techniques which are applied to different domains separately, rather than developing generic embedding models which could be applied to all domains combined.

In the Google Universal Image Embedding (GUIE) Challenge, the proposed models are expected to retrieve the index database images relevant to a given query image (i.e. images containing the same object as the query) across a great variety of domains. Our proposed solution has real-world visual search applications, such as organizing photos, improving search engines, and visual e-commerce.

Problem definition. We seek a function $\phi$ such that:

$$\phi : \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{64}, \qquad \phi(x) = q \in \mathbb{R}^{64} \tag{1}$$

Given an input 3-channel RGB image $x$ of dimension $H \times W$, our model $\phi$ extracts a compact 64-dimensional (64D) image descriptor or embedding $\phi(x)$.

The image retrieval task [1,4,20] then considers an index-reference database of images $Z = \{z_1, z_2, \ldots, z_n\}$. Given a query image $x$, we compute

$$\operatorname*{argmin}_{z_i \in Z} \; \lVert \phi(x) - \phi(z_i) \rVert_2^2 \tag{2}$$

and finally retrieve the top-$k$ most similar images (i.e. those with the smallest squared distance to the query embedding).
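As a rough illustration of the retrieval step in Equation 2, the following Python sketch ranks index embeddings by squared Euclidean distance to a query embedding and keeps the top-k. The function names `squared_l2` and `retrieve_top_k` and the toy 3-D vectors are assumptions made for this example only; they are not taken from the challenge code, which operates on 64-D descriptors.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def retrieve_top_k(query_emb, index_embs, k=5):
    """Return the positions of the k index embeddings closest to the query.

    Results are ordered from most to least similar (smallest distance first).
    """
    scored = [(squared_l2(query_emb, z), i) for i, z in enumerate(index_embs)]
    return [i for _, i in heapq.nsmallest(k, scored)]


# Toy 3-D embeddings standing in for 64-D descriptors (hypothetical data).
index_embs = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [5.0, 5.0, 5.0],
]
query_emb = [0.9, 0.1, 0.0]

top2 = retrieve_top_k(query_emb, index_embs, k=2)
print(top2)
```

A brute-force scan like this is enough to show the idea; a large index would typically use a vectorized or approximate nearest-neighbour search instead.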

Evaluation. Methods are evaluated according to the mean Precision at $k = 5$ (abbreviated as mP@5):

$$\text{mP@5} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{\min(n_q, 5)} \sum_{j=1}^{\min(n_q, 5)} \mathrm{rel}_q(j) \tag{3}$$

where $Q$ is the number of query images and $n_q$ is the number of index images containing an object in common with the query image $q$. Note that $n_q > 0$ for any query image $q$. The term $\mathrm{rel}_q(j)$ denotes the relevance of prediction $j$ for the $q$-th query: $\mathrm{rel}_q(j) = 1$ if the $j$-th prediction is correct, and 0 otherwise.

Participants must submit a model file (e.g. .pt). The model must take an image as input and return a float vector (i.e. the image embedding) as output. The challenge platform, Kaggle, uses the submitted model to:

1. Extract embeddings for the private test dataset (query and index images).

2. Create a kNN ($k = 5$) lookup for each test sample, using the Euclidean distance between test and index embeddings. See Equation 2.

3. Score the quality of the lookups using Equation 3.
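The scoring step can be sketched in a few lines of Python. This is a minimal reading of Equation 3 and not the official Kaggle scorer: the function name `mean_precision_at_k`, the input layout (one ranked list of index ids and one set of relevant ids per query), and the example ids are all assumptions made for illustration.

```python
def mean_precision_at_k(predictions, relevant, k=5):
    """Mean Precision at k, following Equation 3.

    predictions: list of ranked index-id lists, one per query.
    relevant:    list of sets of relevant index ids, one per query.
    """
    if len(predictions) != len(relevant) or not predictions:
        raise ValueError("need one non-empty prediction list per query")
    total = 0.0
    for ranked, rel in zip(predictions, relevant):
        n_q = len(rel)
        if n_q == 0:
            raise ValueError("every query must have at least one relevant image")
        cutoff = min(n_q, k)  # only the first min(n_q, k) predictions count
        hits = sum(1 for idx in ranked[:cutoff] if idx in rel)
        total += hits / cutoff
    return total / len(predictions)


# Hypothetical example with two queries.
predictions = [
    [1, 2, 3, 4, 5],  # query 1: ids 1 and 3 are relevant
    [9, 8, 7, 6, 5],  # query 2: the single relevant id (8) is ranked second
]
relevant = [
    {1, 3, 10, 11, 12, 13},  # n_q = 6, so the cutoff is 5
    {8},                     # n_q = 1, so only the top prediction counts
]

score = mean_precision_at_k(predictions, relevant)
print(score)
```

In this example the first query scores 2/5 and the second scores 0/1, because a query with a single relevant index image is judged on its top prediction alone.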

In Figure 1 we provide an illustrative example of a real-world image retrieval system, similar to the one employed in this challenge for evaluating the quality of the produced image descriptors.

arXiv:2210.11141v1 [cs.CV] 20 Oct 2022
