Visually Similar Products Retrieval for Shopsy

Prajit Nadkarni
prajit.pn@flipkart.com
Flipkart Internet Pvt. Ltd.
Bengaluru, India
Narendra Varma Dasararaju
narendra.varma@flipkart.com
Flipkart Internet Pvt. Ltd.
Bengaluru, India
ABSTRACT
Visual search is of great assistance in reseller commerce, especially
for non-tech-savvy users with an affinity for regional languages. It
allows resellers to accurately locate the products they seek, unlike
textual search, which recommends products from head brands.
Product attributes available in e-commerce have great potential
for building better visual search systems [2, 20, 29], as they capture
fine-grained relations between data points. In this work, we design
a visual search system for reseller commerce using a multi-task
learning approach. We also highlight and address challenges faced
in reseller commerce, such as image compression, cropping, and
scribbling on the image.
Our model consists of three different tasks: attribute classification,
triplet ranking and variational autoencoder (VAE). A masking
technique [23] is used for designing the attribute classification. Next,
we introduce an offline triplet mining technique which utilizes
information from multiple attributes to capture relative order within
the data. This technique performs better than the traditional
triplet mining [27] baseline, which uses single label/attribute
information. We also compare and report the incremental
gain achieved by our unified multi-task model over each individual
task separately. The effectiveness of our method is demonstrated
using an in-house dataset of product images from the Lifestyle
business unit of Flipkart, India’s largest e-commerce company. To
efficiently retrieve the images in production, we use an Approximate
Nearest Neighbor (ANN) index. Finally, we highlight our
production environment constraints and present the design choices
and experiments conducted to select a suitable ANN index.
KEYWORDS
content-based image retrieval, visual search, multi-task learning,
triplet loss, variational autoencoder
ACM Reference Format:
Prajit Nadkarni and Narendra Varma Dasararaju. 2022. Visually Similar
Products Retrieval for Shopsy. In ,. ACM, New York, NY, USA, 10 pages.
https://doi.org/00.0000/0000000.0000000
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
NA ’22, NA
©2022 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/00.0000/0000000.0000000
1 INTRODUCTION
Reseller commerce in India is a continuously growing multi-billion
dollar market, which helps resellers utilize social platforms like
Facebook and WhatsApp to bring commerce to the Next 500 Million
customers. Resellers influence and assist their customers by curating
the right products and doing order management, thus building
a layer of trust and assistance for users to perform online shopping
via social platforms. Shopsy is an app by Flipkart that allows
resellers to share products with ease to their end customers and
earn money by enabling commerce. Resellers communicate with
the end user over social platforms in a way that keeps all the Shopsy
constructs hidden, thus building their business with no intermediary.
Therefore, reseller communication with the end user does not
include any product links, but is done entirely using images and
written descriptions.
Reseller commerce covers the following use cases: (1) The reseller
promotes products by sharing images or descriptions of the
product with their end user. The user then shows interest in buying
a specific product and shares back the product image. (2) The reseller
wants to check whether a product that the user found interesting
on social media is available in the Shopsy catalog. It should be noted
that images shared by the users with the reseller may be cropped
or contain additional markings to highlight specific aspects of the
product. Images also undergo compression while being shared over
chat. The reseller may also add their logo to the image to promote
their business. Over time, as the reseller promotes multiple products
to multiple users, it becomes hard to locate the requested product
using textual search. It is difficult to describe the visual characteristics
of a product using words. Searching for products using text tends to
surface products from head brands and does not guarantee retrieval
of the required item. Visual search tackles these limitations as it
captures the exact visual patterns of the query image and retrieves
the best-matched item. Therefore, we build a visual search system
for reseller commerce to handle the above-mentioned use cases.
In recent years, visual search has been built across many companies,
including Alibaba’s Pailitao [36], Pinterest Flashlight and
Lens [13, 34, 35], Google Lens [24], Microsoft’s Visual Search [12],
etc. These applications demonstrate large-scale visual search systems
for massive, continuously updating data. They focus on the quality of
recommendations to improve user engagement. The Flipkart catalog also
contains millions of products, and our primary aim is to assist the
reseller in retrieving the exact item. The images in our catalog
update upon the introduction of new products. Therefore, we design a
system that offers high precision and low latency, while considering
the size of our catalog and the update rate.
In this work, we consider products only in the fashion category,
as reseller commerce in India is currently focused on fashion.
Recent works in fashion [2, 20, 29] have demonstrated the use of
product attributes to build high-quality visual embeddings using
a combination of attribute classification and triplet ranking loss.
We design a multi-task model that learns from three different tasks:
attribute classification, triplet ranking and variational autoencoder.
Finally, we highlight our production constraints and build an
end-to-end visual search system for our use case.
arXiv:2210.04560v1 [cs.CV] 10 Oct 2022
Our key contributions can be summarized as follows:
• We build a visual search system for reseller commerce and
  highlight challenges in this domain like image compression,
  cropping, scribbling on the image, etc.
• We present a triplet mining technique that uses information
  from multiple attributes to capture relative order within the
  data. It gives us twice as good performance as the traditional
  triplet mining technique, which uses a single label/attribute
  and which we have used as a baseline.
• We build a multi-task model to learn high-quality visual
  embeddings and attain a 4% incremental gain over the best
  individual task.
• We highlight the business requirements and infrastructure
  constraints for our reseller commerce environment, and
  demonstrate an end-to-end visual search system that offers
  high precision and low latency, while considering our
  catalogue size and the data update rate.
• We present experiments and choices made for selecting an
  appropriate Approximate Nearest Neighbor (ANN) index for
  our production use case.
2 RELATED WORKS
Large-scale visual search systems have been built across many
companies [12, 13, 24, 33–36], demonstrating large-scale indexing
for massive, continuously updating data. There has also been research in
domain-specific image retrieval systems designed for fashion
products [2, 6, 20, 29, 37]. They leverage the product attribute
information available in the e-commerce domain to build high-quality
visual embeddings. Other works that focus on extracting visual
attributes for e-commerce [1, 7, 23] demonstrate multi-class
classification techniques. Parekh et al. [23] employ a masking technique to
handle missing attribute values, a practical approach when dealing
with products across different verticals. We use the same masking
technique and build a multi-task learning approach with attribute
classification and triplet ranking loss.
Distance metric learning techniques are primarily designed for
image retrieval systems, with seminal works such as contrastive
loss [5] and triplet loss [27]. Triplet loss considers a data point
as an anchor and associates it with a positive and a negative data
point, and constrains the distance of the anchor-positive pair to be
smaller than that of the anchor-negative pair. These methods have evolved
over time, starting with early works like Schroff et al. [27], who
introduced a semi-hard negative mining approach. This is an
online triplet mining technique which computes useful triplets on
the fly by sampling hard positives/negatives from within a mini-batch.
Later, techniques evolved to incorporate information beyond
a single triplet, like Lifted Structured loss [31], N-Pair loss [30], etc.
These losses associate an anchor point with a single positive and
multiple negative points, and consider their relative hardness while
pushing or pulling these points. The above losses consider the rich
data-to-data relations and are able to learn fine-grained relations
between them. However, these losses suffer from high training
complexity, $O(M^2)$ or $O(M^3)$ where $M$ is the number of data points,
and thus from slow convergence. Recent works like Proxy-NCA [21], Proxy
Anchor [16], etc., resolve the above complexity issue by introducing
proxies, thus aiding faster convergence.
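The triplet constraint and the semi-hard condition described above can be sketched as follows. This is an illustrative sketch rather than our training code: it assumes squared Euclidean distances on plain Python lists, and the margin value of 0.2 is an assumption chosen only for the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the negative is farther from the
    anchor than the positive by at least `margin`."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def is_semi_hard(anchor, positive, negative, margin=0.2):
    """A semi-hard negative is farther than the positive, but still
    inside the margin, so it yields a non-zero loss."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return d_ap < d_an < d_ap + margin


anchor, positive = [0.0, 0.0], [0.1, 0.0]
semi_hard_negative = [0.3, 0.0]   # farther than the positive, within the margin
easy_negative = [2.0, 0.0]        # already far away, contributes zero loss
```

Online mining in the style of Schroff et al. [27] would apply a check like is_semi_hard to candidate negatives within each mini-batch. Enumerating all triplets over $M$ points is what gives rise to the $O(M^3)$ cost noted above.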
In all of the above losses, pair-based or proxy-based, the positives
and negatives are chosen based on the class label, i.e. positives are
from the same class as the anchor and negatives from a different class.
For instance, in the face-recognition setting, to ensure enough
positives in each mini-batch, Schroff et al. [27] used a mini-batch
of 1800 such that around 40 faces are selected per identity per
mini-batch. In the case of proxy-based losses, all proxies are part of the
model and are kept in memory. Since each proxy represents a class,
this puts a limit on the number of classes. Applying these techniques
to e-commerce is challenging, as the possible class labels could
be a product-id or a product-vertical (e.g. t-shirt, shoe, watch, etc.).
In e-commerce, we have millions of products with only 3–4
images per product, which appear on its product page, and the total
number of verticals is only in the few hundreds. Choosing the
product-id as the class label can be too restrictive, as there are only a
few positives to learn from, and in the proxy-based setting it would
lead to millions of proxies. Choosing the product-vertical as the class
label makes the relation between data points too slack, and thus we
lose the fine-grained intra-vertical details (e.g. discriminating one
t-shirt pattern from another). Thus, applications in the e-commerce
domain resort to using product attributes for mining the triplets.
Ak et al. and others [2, 6] choose triplets such that the anchor
and the positive have the same attribute value, whereas the
negative is chosen with a different attribute value. For instance,
given that the anchor has the color ‘blue’, the positive can be any image
with the color ‘blue’. Serra et al. [29] use images with noisy tags (e.g.
red-sweater, red-tshirt) and compute an ‘intersection over
union’ similarity score between the tags. They then choose positives that have a
similarity score above a threshold and negatives with a score below
the threshold. Shankar et al. [28] prepare triplets with three levels
(positive, in-class-negative, out-of-class-negative), and use a ‘basic
image similarity scorer’ (e.g. pretrained AlexNet, color-histogram,
PatternNet) for selecting candidates across levels. Drawing ideas
from the above works, we define an offline triplet mining technique
that prepares candidates across multiple levels, such that it captures
the relative order within the data. We sample the candidates under
each level based on the percentage of attributes matched.
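The idea of grouping candidates into levels by the percentage of attributes matched can be sketched as below. This is a hedged illustration and not our exact mining procedure: the level thresholds (0.75 and 0.4), the rule that any candidate from a closer level may act as the positive for any candidate from a farther level, and all function names are assumptions made for the example.

```python
from itertools import combinations


def match_fraction(a, b):
    """Fraction of attributes present in both products that have equal values."""
    shared = [key for key in a if key in b]
    if not shared:
        return 0.0
    return sum(a[key] == b[key] for key in shared) / len(shared)


def assign_level(fraction, thresholds=(0.75, 0.4)):
    """Level 0 is the closest bucket; the thresholds are illustrative only."""
    for level, threshold in enumerate(thresholds):
        if fraction >= threshold:
            return level
    return len(thresholds)


def mine_triplets(anchor_id, catalog):
    """Build (anchor, positive, negative) triplets so that the positive
    always comes from a closer level than the negative."""
    anchor = catalog[anchor_id]
    levels = {}
    for product_id, attributes in catalog.items():
        if product_id == anchor_id:
            continue
        level = assign_level(match_fraction(anchor, attributes))
        levels.setdefault(level, []).append(product_id)
    triplets = []
    for near, far in combinations(sorted(levels), 2):
        for positive in levels[near]:
            for negative in levels[far]:
                triplets.append((anchor_id, positive, negative))
    return triplets


catalog = {
    "a": {"color": "blue", "pattern": "striped", "sleeve": "short", "neck": "round"},
    "b": {"color": "blue", "pattern": "striped", "sleeve": "short", "neck": "v"},
    "c": {"color": "blue", "pattern": "solid", "sleeve": "long", "neck": "v"},
    "d": {"color": "blue", "pattern": "striped", "sleeve": "long", "neck": "v"},
}
triplets = mine_triplets("a", catalog)
```

Because a mid-level candidate serves as a negative for closer items and as a positive for farther ones, the resulting triplets encode a relative order rather than a single same/different split.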
Another technique that has been used for image retrieval applications
is the autoencoder [11, 19]. An autoencoder is a type of artificial
neural network trained to reproduce its input at its output. It has
an encoder, a decoder and a bottleneck layer in the middle which
captures the latent representation of the data. Thus, the bottleneck
layer learns the most important characteristics of the image in
an unsupervised way. A Variational Autoencoder (VAE) [18] has
the same structure as an autoencoder but uses a probabilistic
approach to learn the latent representation. Unlike an autoencoder,
a VAE learns a disentangled embedding representation [10], i.e.
one where a single latent dimension is affected by only one generative
factor and is invariant to changes in other factors. Thus, the
underlying embedding spaces have a smooth continuous transformation
over a latent dimension. For instance, a latent dimension which
captures color variations, arranges the red t-shirt closer to maroon