
TABLE I: Frequently-used symbols.

Notation                  Definition
S = {s_1, ..., s_{|S|}}   set of sets
E = {e_1, ..., e_{|E|}}   set of entities
B = (c, f)                a box with center c and offset f
V(B)                      volume of box B
T^+ and T^-               sets of positive & negative samples
Q_c ∈ R^{|E|×d}           center embedding matrix of entities
Q_f ∈ R_+^{|E|×d}         offset embedding matrix of entities
D                         number of subspaces
K                         number of key boxes in each subspace
scheme. We summarize our contributions as follows:
• Accurate & Versatile Algorithm: We propose SET2BOX, a set representation learning method that accurately preserves similarity between sets in terms of four measures.
• Concise & Effective Algorithm: We devise SET2BOX+ to enhance SET2BOX through an end-to-end box quantization scheme. It yields up to 40.8× more accurate similarity estimation while requiring 60% fewer bits than its competitors.
• Extensive Experiments: Using 8 real-world datasets, we validate the advantages of SET2BOX+ over its competitors and the effectiveness of each of its components.
For reproducibility, the code and data are available at https://github.com/geon0325/Set2Box.
In Section II, we review related work. In Section III, we
define the problem of similarity-preserving set embedding
and discuss intuitive approaches. In Section IV, we present
SET2BOX and SET2BOX+. In Section V, we provide ex-
perimental results. In Section VI, we analyze the considered
methods. Lastly, we offer conclusions in Section VII.
II. RELATED WORK
Here, we review previous studies related to our work.
Similarity-Preserving Embedding: Representation learning for preserving similarities between instances has been studied for graphs [11]–[14], images [15]–[17], and texts [18]. These methods aim to yield high-quality embeddings by minimizing the information loss of the original data. However, most of them are designed to preserve a predetermined similarity matrix, which is not extensible to new measures [13], [14]. In this paper, we focus on the problem of learning similarity-preserving representations for sets, aiming at a versatile representation from which various similarity measures (e.g., the Jaccard Index and the Dice Index) can be estimated.
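For reference, both measures are defined directly on set overlaps: J(s, s′) = |s ∩ s′| / |s ∪ s′| and Dice(s, s′) = 2|s ∩ s′| / (|s| + |s′|). A minimal Python illustration with toy sets (values chosen only for this example):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard Index: |a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    """Dice Index: 2|a ∩ b| / (|a| + |b|)."""
    return 2 * len(a & b) / (len(a) + len(b))

# Toy sets, for illustration only.
s1, s2 = {1, 2, 3}, {2, 3, 4, 5}
print(jaccard(s1, s2))  # 2 / 5 = 0.4
print(dice(s1, s2))     # 4 / 7 ≈ 0.571
```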
Box Embedding: Boxes [19] are useful abstractions for expressing high-order information in data. Thanks to their expressive power, they have been used in diverse applications, including knowledge bases [20]–[25], word embedding [26], image embedding [27], and recommender systems [28], [29]. For instance, Query2Box [24] uses boxes to embed queries with logical conjunctions (∧) or disjunctions (∨), and Zhang et al. [28] represent users as boxes to accurately model their preferences for items. In this work, in contrast, we embed sets as boxes to accurately preserve their structural relationships and the similarities between them. From an algorithmic perspective, methods for improving the optimization of boxes have been presented; examples include smoothing hard box edges with Gaussian convolutions [30] and improving the parameter identifiability of boxes using Gumbel random variables [31].
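To make the box abstraction concrete, a hard axis-aligned box can be parameterized by a center c and a nonnegative offset f, spanning [c − f, c + f] in each dimension, following the (c, f) notation in Table I. The sketch below computes box volumes and intersection volumes under this parameterization; it is a minimal illustration only, and the smooth relaxations used for training (e.g., [30], [31]) are omitted.

```python
import numpy as np

def volume(c: np.ndarray, f: np.ndarray) -> float:
    """Volume of the box [c - f, c + f]: the product of its side lengths."""
    return float(np.prod(2.0 * f))

def intersection_volume(c1, f1, c2, f2) -> float:
    """Volume of the (possibly empty) intersection of two boxes."""
    lower = np.maximum(c1 - f1, c2 - f2)      # per-dimension max of lower corners
    upper = np.minimum(c1 + f1, c2 + f2)      # per-dimension min of upper corners
    side = np.clip(upper - lower, 0.0, None)  # empty overlap -> zero side length
    return float(np.prod(side))

# Toy 2-d boxes (illustrative values only).
c1, f1 = np.array([0.0, 0.0]), np.array([1.0, 1.0])  # [-1,1] x [-1,1]
c2, f2 = np.array([1.0, 0.0]), np.array([1.0, 1.0])  # [ 0,2] x [-1,1]
print(volume(c1, f1))                        # 4.0
print(intersection_volume(c1, f1, c2, f2))   # 2.0
```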
Set Embedding: The problem of embedding sets has attracted much attention due to its unique requirements of permutation invariance and size flexibility. For example, DeepSets [32] uses simple symmetric functions over input features, Set2Set [33] is based on an LSTM-based pooling function, and Set Transformer [34] uses an attention-based pooling function to aggregate the information of the entities. Despite their promising results in some predictive tasks, these methods suffer from several limitations. First, they require attribute information of entities, which largely affects the quality of the resulting set embeddings. In addition, the set representations are trained specifically for downstream tasks, so they may lose the explicit similarity information of sets, which we aim to preserve in this paper. Alternatively, sets can be represented as compact binary vectors by hashing or sketching [4], [10], without requiring attribute information. Such binary vectors are used by Locality Sensitive Hashing (LSH) and its variants [35]–[37] for rapid search of similar sets based on a predefined similarity measure (e.g., the Jaccard Index), as sketched below. Refer to Section III for further discussion of set embedding methods.
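As a concrete point of reference for the hashing-based approach above, the classic MinHash technique produces signatures whose positions agree with probability equal to the Jaccard Index. The sketch below is a minimal version assuming simple linear hash functions; it is not the construction of any specific cited method.

```python
import random

def minhash_signature(s: set, num_hashes: int = 64, seed: int = 0) -> list:
    """MinHash signature: one minimum per random hash function.
    Two signatures agree at a position with probability equal to the
    Jaccard Index of the underlying sets."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large prime for hashing h(x) = (a*x + b) mod p
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(e) + b) % p for e in s) for a, b in params]

def estimate_jaccard(sig1: list, sig2: list) -> float:
    """Fraction of matching positions approximates the Jaccard Index."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

# Toy usage: true Jaccard is 50 / 150 ≈ 0.33.
a, b = set(range(0, 100)), set(range(50, 150))
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```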
Differentiable Product Quantization: Product quantization
[38], [39] is an effective strategy for vector compression.
Recently, deep learning methods for learning discrete codes in an end-to-end manner have been proposed [40], [41], and they have been applied to knowledge graphs [42] and image
retrieval [43]–[45]. In this paper, we propose a novel box
quantization method for compressing boxes while preserving
their original geometric properties.
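For intuition, classical product quantization splits a d-dimensional vector into D subvectors and stores, for each, the index of the nearest of K codewords, so a vector costs D·⌈log₂ K⌉ bits instead of d floats. The sketch below is a minimal, non-learned version of this idea; the end-to-end box quantization proposed in Section IV is related but operates on boxes rather than plain vectors.

```python
import numpy as np

def pq_compress(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Encode x with product quantization.
    codebooks: (D, K, d/D) array holding K codewords per subspace.
    Returns D integer codes, i.e., D * log2(K) bits instead of d floats."""
    D, K, sub = codebooks.shape
    parts = x.reshape(D, sub)  # split x into D subvectors
    dists = np.linalg.norm(codebooks - parts[:, None, :], axis=-1)  # (D, K)
    return dists.argmin(axis=1)  # nearest codeword per subspace

def pq_decompress(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Reconstruct x by concatenating the selected codewords."""
    return np.concatenate([codebooks[d, c] for d, c in enumerate(codes)])

# Toy usage: d = 8, D = 4 subspaces, K = 16 codewords each (illustrative).
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(4, 16, 2))
x = rng.normal(size=8)
codes = pq_compress(x, codebooks)  # 4 codes of 4 bits each = 16 bits
print(codes, pq_decompress(codes, codebooks))
```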
III. PRELIMINARIES
In this section, we introduce notations and define the prob-
lem. Then, we review some intuitive methods for the problem.
Notations: Consider a set S = {s_1, ..., s_{|S|}} of sets and a set E = {e_1, ..., e_{|E|}} of entities. Each set s ∈ S is a non-empty subset of E, and its size (i.e., cardinality) is denoted by |s|. The representation of a set s is denoted by z_s, and its encoding cost, i.e., the number of bits required to encode z_s, is denoted by Cost(z_s). Refer to Table I for frequently-used notations.
Problem Definition: The problem of learning similarity-preserving set representations, on which we focus in this work, is formulated as follows:
Problem 1 (Similarity-Preserving Set Embedding).
• Given: (1) a set S of sets and (2) a budget b
• Find: a latent representation z_s of each set s ∈ S
• to Minimize: the difference between (1) the similarity between s and s′ and (2) the similarity between z_s and z_{s′}, for all s ≠ s′ ∈ S
• Subject to: the total encoding cost Cost({z_s : s ∈ S}) ≤ b.
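To make the objective concrete, the sketch below measures the distortion that Problem 1 asks to minimize, instantiating the 'difference' as the squared error against the exact Jaccard Index; this choice of measure and error is only one assumed instantiation, and the estimators actually used are described later in the paper.

```python
import itertools

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard Index of two sets."""
    return len(a & b) / len(a | b)

def distortion(sets, estimate):
    """Sum of squared gaps between true and estimated pairwise similarities.
    `estimate` stands for any similarity estimator computed from the learned
    representations z_s (a hypothetical interface, not the paper's API)."""
    return sum((jaccard(a, b) - estimate(a, b)) ** 2
               for a, b in itertools.combinations(sets, 2))

# Toy usage: a constant estimator incurs nonzero distortion (illustration only).
toy_sets = [{1, 2, 3}, {2, 3, 4}, {5, 6}]
print(distortion(toy_sets, lambda a, b: 0.5))  # 0.0 + 0.25 + 0.25 = 0.5
```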