NA ’22, NA Prajit Nadkarni and Narendra Varma Dasararaju
a combination of attribute classification and triplet ranking loss.
We design a multi-task model that learns from three different tasks:
attribute classification, triplet ranking, and a variational autoencoder.
Finally, we highlight our production constraints and build an end-to-end
visual search system for our use case.
Our key contributions can be summarized as follows:
• We build a visual search system for reseller commerce and
highlight challenges in this domain like image compression,
cropping, scribbling on the image, etc.
• We present a triplet mining technique that uses information
from multiple attributes to capture relative order within the
data. It gives us twice the performance of the traditional
triplet mining technique, which uses a single label/attribute
and which we use as a baseline.
• We build a multi-task model to learn high-quality visual
embeddings and attain a 4% incremental gain over the best
individual task.
• We highlight the business requirements and infrastructure
constraints of our reseller commerce environment, and
demonstrate an end-to-end visual search system that offers
high precision and low latency, while considering our
catalogue size and the data update rate.
• We present the experiments and choices made in selecting an
appropriate Approximate Nearest Neighbor (ANN) index for
our production use case.
2 RELATED WORKS
Large-scale visual search systems have been built across many
companies [12, 13, 24, 33–36], demonstrating large-scale indexing
over massive, continuously updating data. There has also been research
in domain-specific image retrieval systems designed for fashion
products [2, 6, 20, 29, 37]. These leverage the product attribute
information available in the e-commerce domain to build high-quality
visual embeddings. Other works that focus on extracting visual
attributes for e-commerce [1, 7, 23] demonstrate multi-class
classification techniques. Parekh et al. [23] employ a masking
technique to handle missing attribute values, a practical approach
when dealing with products across different verticals. We use the
same masking technique and build a multi-task learning approach with
attribute classification and triplet ranking loss.
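The masking idea reduces to dropping the loss term for any attribute head whose label is missing, so unannotated attributes contribute no gradient. A minimal NumPy sketch (the inputs and the missing-label convention of -1 are illustrative, not the cited implementation):

```python
import numpy as np

def masked_attribute_loss(logits, labels):
    """Cross-entropy averaged over attribute heads; a label of -1
    marks a missing attribute value and is masked out of the loss."""
    total, count = 0.0, 0
    for head_logits, label in zip(logits, labels):
        if label < 0:   # attribute not annotated for this product
            continue    # masked: this head contributes no loss
        # numerically stable log-softmax over the head's classes
        z = head_logits - head_logits.max()
        log_probs = z - np.log(np.exp(z).sum())
        total += -log_probs[label]
        count += 1
    return total / max(count, 1)

# e.g. a product with a known color (class 2) but an unknown pattern (-1)
logits = [np.array([0.1, 0.2, 2.0]), np.array([1.0, -1.0])]
labels = [2, -1]
loss = masked_attribute_loss(logits, labels)
```

With both heads masked the function returns zero, so products missing all attribute annotations simply do not drive the classification objective.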
Distance metric learning techniques are primarily designed for
image retrieval systems, with seminal works like contrastive
loss [5] and triplet loss [27]. Triplet loss considers a data point
as an anchor and associates it with a positive and a negative data
point, constraining the distance of the anchor-positive pair to be
smaller than that of the anchor-negative pair. These methods have
evolved over time, with early generations like Schroff et al. [27],
who introduced a semi-hard negative mining approach. This is an
online triplet mining technique which computes useful triplets on
the fly by sampling hard positives/negatives from within a mini-batch.
Later, techniques evolved to incorporate information beyond
a single triplet, like Lifted Structured loss [31], N-Pair loss [30], etc.
These losses associate an anchor point with a single positive and
multiple negative points, and consider their relative hardness while
pushing or pulling these points. The above losses consider rich
data-to-data relations and are able to learn fine-grained relations
between data points. However, they suffer from high training
complexity, O(M²) or O(M³) where M is the number of data points,
and thus slow convergence. Recent works like Proxy-NCA [21], Proxy
Anchor [16], etc., resolve this complexity issue by introducing
proxies, thus aiding faster convergence.
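The triplet constraint and the semi-hard mining rule described above can be sketched as follows (a NumPy illustration with squared Euclidean distances and an arbitrary margin, not the FaceNet implementation):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(a,p) - d(a,n) + margin: pushes the anchor-positive
    distance below the anchor-negative distance by at least the margin."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(d_ap - d_an + margin, 0.0)

def semi_hard_negative(anchor, positive, candidates, margin=0.2):
    """Semi-hard mining: pick the closest negative that is farther than
    the positive but still inside the margin, i.e.
    d(a,p) < d(a,n) < d(a,p) + margin."""
    d_ap = np.sum((anchor - positive) ** 2)
    best, best_d = None, np.inf
    for neg in candidates:
        d_an = np.sum((anchor - neg) ** 2)
        if d_ap < d_an < d_ap + margin and d_an < best_d:
            best, best_d = neg, d_an
    return best  # None if no semi-hard negative exists in the mini-batch

a, p = np.array([0.0, 0.0]), np.array([0.1, 0.0])
cands = [np.array([0.05, 0.0]), np.array([0.12, 0.0]), np.array([2.0, 0.0])]
n = semi_hard_negative(a, p, cands)  # skips the too-hard and too-easy ones
```

Negatives already beyond the margin yield zero loss, which is why mining inside the margin keeps the gradients informative.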
In all of the above losses, pair-based or proxy-based, the positives
and negatives are chosen based on the class label, i.e., positives are
from the same class as the anchor and negatives from a different class.
For instance, in the face-recognition setting, to ensure enough
positives in each mini-batch, Schroff et al. [27] used a mini-batch
of 1800 such that around 40 faces are selected per identity per
mini-batch. In the case of proxy-based losses, all proxies are part of
the model and are kept in memory. Since each proxy represents a class,
this puts a limit on the number of classes. Applying these techniques
to e-commerce is challenging, where the possible class labels could
be a product-id or a product-vertical (e.g. t-shirt, shoe, watch, etc.).
In e-commerce, we have millions of products with only 3–4 images per
product, which appear on its product page, while the total number of
verticals ranges only in the few hundreds. Choosing the class label as
product-id can be too restrictive as there are only a few positives to
learn from, and in the proxy-based setting it would lead to millions of
proxies. Choosing the product-vertical as class label makes the relation
between data points too slack, and we thus lose the fine-grained
intra-vertical details (e.g. discriminating one t-shirt pattern from
another). Thus, applications in the e-commerce domain resort to using
product attributes for mining the triplets.
Ak et al. and others [2, 6] choose triplets such that the anchor
and the positive must share the same attribute value, whereas the
negative is chosen with a different attribute value. For instance,
given an anchor of 'blue' color, the positive can be any image
with 'blue' color. Serra et al. [29] use images with noisy tags (e.g.
red-sweater, red-tshirt) and compute an 'intersection over union'
similarity score between the tags. They then choose positives that
have a similarity score above a threshold and negatives with a score
below the threshold. Shankar et al. [28] prepare triplets with three
levels (positive, in-class-negative, out-of-class-negative), and use
a 'basic image similarity scorer' (e.g. pretrained AlexNet,
color histogram, PatternNet) for selecting candidates across levels.
Drawing ideas from the above works, we define an offline triplet
mining technique that prepares candidates across multiple levels,
such that it captures the relative order within the data. We sample
the candidates under each level based on the percentage of attributes
matched.
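Assigning candidates to levels by the fraction of matched attributes might look like the following sketch (the attribute names, thresholds, and level names are illustrative placeholders, not the values used in our system):

```python
def attribute_match(a, b):
    """Fraction of shared attribute keys whose values are equal."""
    keys = set(a) & set(b)
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)

def triplet_level(anchor_attrs, candidate_attrs):
    """Map a candidate to a mining level by the percentage of
    attributes matched; thresholds here are hypothetical."""
    score = attribute_match(anchor_attrs, candidate_attrs)
    if score >= 0.8:
        return "positive"
    if score >= 0.4:
        return "in-class-negative"
    return "out-of-class-negative"

anchor = {"vertical": "t-shirt", "color": "blue", "pattern": "striped"}
cand   = {"vertical": "t-shirt", "color": "blue", "pattern": "solid"}
level = triplet_level(anchor, cand)  # 2/3 matched -> "in-class-negative"
```

Graded levels like these give the loss a relative order to respect: a same-vertical, same-color candidate should land closer to the anchor than a candidate from a different vertical entirely.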
Another technique that has been used for image retrieval applications
is the autoencoder [11, 19]. An autoencoder is a type of artificial
neural network whose output is the same as its input. It has
an encoder, a decoder, and a bottleneck layer in the middle which
captures the latent representation of the data. Thus, the bottleneck
layer learns the most important characteristics of the image in
an unsupervised way. A Variational Autoencoder (VAE) [18] has
the same structure as an autoencoder but uses a probabilistic
approach to learn the latent representation. Unlike an autoencoder,
a VAE learns a disentangled embedding representation [10], i.e. one
where a single latent dimension is affected by only one generative
factor and is invariant to changes in other factors. Thus, the
underlying embedding spaces have a smooth continuous transformation
over a latent dimension. For instance, a latent dimension which
captures color variations arranges the red t-shirt closer to maroon