NA ’22, NA Prajit Nadkarni and Narendra Varma Dasararaju
a combination of attribute classification and triplet ranking loss.
We design a multi-task model that learns from three different tasks:
attribute classification, triplet ranking, and a variational autoencoder.
Finally, we highlight our production constraints and build an end-to-end
visual search system for our use case.
Our key contributions can be summarized as follows:
• We build a visual search system for reseller commerce and
highlight challenges in this domain like image compression,
cropping, scribbling on the image, etc.
• We present a triplet mining technique that uses information
from multiple attributes to capture relative order within the
data. It gives us twice the performance of the traditional
triplet mining technique, which uses a single label/attribute
and which we use as a baseline.
• We build a multi-task model to learn high-quality visual
embeddings and attain a 4% incremental gain over the best
individual task.
• We highlight the business requirements and infrastructure
constraints of our reseller commerce environment, and
demonstrate an end-to-end visual search system that offers
high precision and low latency, while considering our
catalogue size and the data update rate.
• We present the experiments and choices made in selecting an
appropriate Approximate Nearest Neighbor (ANN) index for
our production use case.
2 RELATED WORKS
Large-scale visual search systems have been built across many
companies [12, 13, 24, 33–36], demonstrating large-scale indexing
over massive, continuously updating data. There has also been research
in domain-specific image retrieval systems designed for fashion
products [2, 6, 20, 29, 37]. These leverage the product attribute
information available in the e-commerce domain to build high-quality
visual embeddings. Other works that focus on extracting visual
attributes for e-commerce [1, 7, 23] demonstrate multi-class
classification techniques. Parekh et al. [23] employ a masking
technique to handle missing attribute values, a practical approach
when dealing with products across different verticals. We use the
same masking technique and build a multi-task learning approach with
attribute classification and triplet ranking loss.
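The masking idea reduces to dropping the loss term for any attribute head whose label is missing, so unannotated attributes contribute no gradient. A minimal NumPy sketch (the inputs and the missing-label convention of -1 are illustrative, not the cited implementation):

```python
import numpy as np

def masked_attribute_loss(logits, labels):
    """Cross-entropy averaged over attribute heads; a label of -1
    marks a missing attribute value and is masked out of the loss."""
    total, count = 0.0, 0
    for head_logits, label in zip(logits, labels):
        if label < 0:   # attribute not annotated for this product
            continue    # masked: this head contributes no loss
        # numerically stable log-softmax over the head's classes
        z = head_logits - head_logits.max()
        log_probs = z - np.log(np.exp(z).sum())
        total += -log_probs[label]
        count += 1
    return total / max(count, 1)

# e.g. a product with a known color (class 2) but an unknown pattern (-1)
logits = [np.array([0.1, 0.2, 2.0]), np.array([1.0, -1.0])]
labels = [2, -1]
loss = masked_attribute_loss(logits, labels)
```

With both heads masked the function returns zero, so products missing all attribute annotations simply do not drive the classification objective.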
Distance metric learning techniques are primarily designed for
image retrieval systems, with seminal works like contrastive
loss [5] and triplet loss [27]. Triplet loss considers a data point
as an anchor and associates it with a positive and a negative data
point, constraining the distance of the anchor-positive pair to be
smaller than that of the anchor-negative pair. These methods have
evolved over time, with early generations like Schroff et al. [27],
who introduced a semi-hard negative mining approach. This is an
online triplet mining technique which computes useful triplets on
the fly by sampling hard positives/negatives from within a mini-batch.
Later, techniques evolved to incorporate information beyond
a single triplet, like Lifted Structured loss [31], N-Pair loss [30], etc.
These losses associate an anchor point with a single positive and
multiple negative points, and consider their relative hardness while
pushing or pulling these points. The above losses consider rich
data-to-data relations and are able to learn fine-grained relations
between data points. However, they suffer from high training
complexity, O(M²) or O(M³) where M is the number of data points,
and thus slow convergence. Recent works like Proxy-NCA [21], Proxy
Anchor [16], etc., resolve this complexity issue by introducing
proxies, thus aiding faster convergence.
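The triplet constraint and the semi-hard mining rule described above can be sketched as follows (a NumPy illustration with squared Euclidean distances and an arbitrary margin, not the FaceNet implementation):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(a,p) - d(a,n) + margin: pushes the anchor-positive
    distance below the anchor-negative distance by at least the margin."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(d_ap - d_an + margin, 0.0)

def semi_hard_negative(anchor, positive, candidates, margin=0.2):
    """Semi-hard mining: pick the closest negative that is farther than
    the positive but still inside the margin, i.e.
    d(a,p) < d(a,n) < d(a,p) + margin."""
    d_ap = np.sum((anchor - positive) ** 2)
    best, best_d = None, np.inf
    for neg in candidates:
        d_an = np.sum((anchor - neg) ** 2)
        if d_ap < d_an < d_ap + margin and d_an < best_d:
            best, best_d = neg, d_an
    return best  # None if no semi-hard negative exists in the mini-batch

a, p = np.array([0.0, 0.0]), np.array([0.1, 0.0])
cands = [np.array([0.05, 0.0]), np.array([0.12, 0.0]), np.array([2.0, 0.0])]
n = semi_hard_negative(a, p, cands)  # skips the too-hard and too-easy ones
```

Negatives already beyond the margin yield zero loss, which is why mining inside the margin keeps the gradients informative.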
In all of the above losses, pair-based or proxy-based, the positives
and negatives are chosen based on the class label, i.e., positives are
from the same class as the anchor and negatives from a different class.
For instance, in the face-recognition setting, to ensure enough
positives in each mini-batch, Schroff et al. [27] used a mini-batch
of 1800 such that around 40 faces are selected per identity per
mini-batch. In the case of proxy-based losses, all proxies are part of
the model and are kept in memory. Since each proxy represents a class,
this puts a limit on the number of classes. Applying these techniques
to e-commerce is challenging, where the possible class labels could
be a product-id or a product-vertical (e.g. t-shirt, shoe, watch, etc.).
In e-commerce, we have millions of products with only 3–4 images per
product, which appear on its product page, while the total number of
verticals ranges only in the few hundreds. Choosing the class label as
product-id can be too restrictive as there are only a few positives to
learn from, and in the proxy-based setting it would lead to millions of
proxies. Choosing the product-vertical as class label makes the relation
between data points too slack, and we thus lose the fine-grained
intra-vertical details (e.g. discriminating one t-shirt pattern from
another). Thus, applications in the e-commerce domain resort to using
product attributes for mining the triplets.
Ak et al. and others [2, 6] choose triplets such that the anchor
and the positive must share the same attribute value, whereas the
negative is chosen with a different attribute value. For instance,
given an anchor of 'blue' color, the positive can be any image
with 'blue' color. Serra et al. [29] use images with noisy tags (e.g.
red-sweater, red-tshirt) and compute an 'intersection over union'
similarity score between the tags. They then choose positives that
have a similarity score above a threshold and negatives with a score
below the threshold. Shankar et al. [28] prepare triplets with three
levels (positive, in-class-negative, out-of-class-negative), and use
a 'basic image similarity scorer' (e.g. pretrained AlexNet,
color histogram, PatternNet) for selecting candidates across levels.
Drawing ideas from the above works, we define an offline triplet
mining technique that prepares candidates across multiple levels,
such that it captures the relative order within the data. We sample
the candidates under each level based on the percentage of attributes
matched.
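Assigning candidates to levels by the fraction of matched attributes might look like the following sketch (the attribute names, thresholds, and level names are illustrative placeholders, not the values used in our system):

```python
def attribute_match(a, b):
    """Fraction of shared attribute keys whose values are equal."""
    keys = set(a) & set(b)
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)

def triplet_level(anchor_attrs, candidate_attrs):
    """Map a candidate to a mining level by the percentage of
    attributes matched; thresholds here are hypothetical."""
    score = attribute_match(anchor_attrs, candidate_attrs)
    if score >= 0.8:
        return "positive"
    if score >= 0.4:
        return "in-class-negative"
    return "out-of-class-negative"

anchor = {"vertical": "t-shirt", "color": "blue", "pattern": "striped"}
cand   = {"vertical": "t-shirt", "color": "blue", "pattern": "solid"}
level = triplet_level(anchor, cand)  # 2/3 matched -> "in-class-negative"
```

Graded levels like these give the loss a relative order to respect: a same-vertical, same-color candidate should land closer to the anchor than a candidate from a different vertical entirely.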
Another technique that has been used for image retrieval applications
is the autoencoder [11, 19]. An autoencoder is a type of artificial
neural network whose output is the same as its input. It has
an encoder, a decoder, and a bottleneck layer in the middle which
captures the latent representation of the data. Thus, the bottleneck
layer learns the most important characteristics of the image in
an unsupervised way. A Variational Autoencoder (VAE) [18] has
the same structure as an autoencoder but uses a probabilistic
approach to learn the latent representation. Unlike an autoencoder,
a VAE learns a disentangled embedding representation [10], i.e. one
where a single latent dimension is affected by only one generative
factor and is invariant to changes in other factors. Thus, the
underlying embedding spaces have a smooth continuous transformation
over a latent dimension. For instance, a latent dimension which
captures color variations arranges the red t-shirt closer to maroon