Clustering the Sketch: Dynamic Compression for
Embedding Tables
Henry Ling-Hei Tsang∗
Meta
henrylhtsang@meta.com
Thomas Dybdahl Ahle∗
Meta†
Normal Computing
thomas@ahle.dk
Abstract
Embedding tables are used by machine learning systems to work with categorical
features. In modern Recommendation Systems, these tables can be very large,
necessitating the development of new methods for fitting them in memory, even
during training. We suggest Clustered Compositional Embeddings (CCE), which combines
clustering-based compression, like quantization to codebooks, with dynamic methods, like the
Hashing Trick and Compositional Embeddings [Shi et al., 2020]. Experimentally, CCE achieves
the best of both worlds: the high compression rate of codebook-based quantization, but with the
dynamic nature of hashing-based methods, so it can be used during training. Theoretically, we
prove that CCE is guaranteed to converge to the optimal codebook, and we give a tight bound on
the number of iterations required.
1 Introduction
Neural networks can efficiently handle various data types, including continuous, sparse, and sequential
features. However, categorical features present a unique challenge as they require embedding a
typically vast vocabulary into a smaller vector space for further calculations. Examples of these
features include user IDs, post IDs on social networks, video IDs, and IP addresses commonly
encountered in Recommendation Systems.
In some domains where embeddings are employed, such as Natural Language Processing [Mikolov
et al., 2013], the vocabulary can be significantly reduced by considering “subwords” or “byte pair
encodings”. In Recommendation Systems like Matrix Factorization or DLRM (see Figure 2) it is
typically not possible to factorize the vocabulary this way, leading to large embedding tables that
demand hundreds of gigabytes of GPU memory [Naumov et al., 2019]. This necessitates splitting
models across multiple GPUs, increasing cost and creating a communication bottleneck during both
training and inference.
The traditional solution is to hash the IDs down to a manageable size using the Hashing Trick [Wein-
berger et al., 2009], accepting the possibility that unrelated IDs may share the same representation.
Excessively aggressive hashing can impair the model’s ability to distinguish its inputs, as it may mix
up unrelated concepts, ultimately reducing model performance.
Another option for managing embedding tables is quantization. This typically involves reducing the
precision to 4 or 8 bits or using multi-dimensional methods like Product Quantization and Residual
Vector Quantization, which rely on clustering (e.g., K-means) to identify representative “code words”
for each original ID. (See Gray and Neuhoff [1998] for a survey of quantization methods.) For
instance, vectors representing “red”, “orange”, and “blue” may be stored as simply “dark orange”
and “blue”, with the first two concepts pointing to the same average embedding. See Section 1 for an example.
∗Equal contribution. †Work done mostly at Probability at Meta.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
arXiv:2210.05974v3 [cs.LG] 22 Oct 2023
[Figure 1(a) diagram: Start → Hash → Train → Cluster → Retrain → ...]
(a) A single iteration of CCE: (a) Starting from a random embedding table, each ID is hashed to a vector in each of two small tables. (b) During training, the embedding of an ID is taken to be the mean of the two referenced code words. (c) After training for an epoch, the vectors for all (or a sample of) the IDs are computed and clustered. This leaves a new small table in which similar IDs are represented by the same vector. (d) We can choose to combine the cluster centers with a new random table (and a new hash function), after which the process can be repeated for an increasingly better understanding of which IDs should be combined.
[Figure 1(b) plot: MSE (y-axis, 300–900) versus the number of cluster + learn iterations (x-axis, 10^0–10^4), comparing CCE with the optimal 1-sparse and 2-sparse solutions.]
(b) CCE for Least Squares: The least squares problem is to find the matrix T that minimizes ‖XT − Y‖_F. To save space we may use codebook quantization, that is, factorize T ≈ HM, where H is a sparse Boolean matrix and M is a small dense matrix. If we already know T this is easy to do using the K-means algorithm, but if T is too large to store in memory, we can find the compressed form directly using CCE, which we prove finds the optimal H and M. For the plot, we sampled X ∈ R^(10^4×10^3) and Y ∈ R^(10^4×10). We ran CCE and compared it with the optimal solution of first solving for T and then factorizing it with either one or two 1s per row in H.
Figure 1: Clustered Compositional Embeddings (CCE) is a simple algorithm that you run interspersed with your normal model training, for example once every epoch of SGD. While the theoretical bound (and the least squares setting shown here) requires many iterations for perfect convergence, in practice we get substantial gains from running just 1–6 iterations.
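To make the "easy" direction from the caption of Figure 1(b) concrete: when T is already known, the factorization T ≈ HM is just k-means on the rows of T. The following is a minimal sketch of that step only (not the full CCE loop), assuming numpy and scikit-learn; the sizes d1, d2, and k are illustrative.

import numpy as np
from sklearn.cluster import KMeans

d1, d2, k = 10_000, 16, 256              # vocabulary size, embedding dim, codebook size
T = np.random.randn(d1, d2)              # the (known) table we want to compress

km = KMeans(n_clusters=k, n_init=10).fit(T)
M = km.cluster_centers_                  # (k, d2): the small dense codebook
assign = km.labels_                      # (d1,): index of the code word for each ID

# H is the sparse Boolean matrix with a single 1 per row (at column assign[i]);
# it never needs to be materialized, since row i of H @ M is simply M[assign[i]].
T_hat = M[assign]                        # (d1, d2): the compressed reconstruction
print(np.linalg.norm(T - T_hat))         # Frobenius error ||T - HM||_F

The point of CCE is that a comparable factorization can be reached without ever storing T, by clustering the partially trained sketch-based embeddings instead.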
Clustering also plays a crucial role in the theoretical literature on vector compression [Indyk and Wagner, 2022]. However, a significant drawback of these quantization methods is that the model is only quantized after training, leaving memory utilization during training unaffected.³

³Reducing the precision to 16-bit floats is often feasible during training, but this work aims for memory reductions much larger than that.
Recent authors have explored more advanced uses of hashing to address this challenge: Tito Svenstrup
et al. [2017], Shi et al. [2020], Desai et al. [2022], Yin et al. [2021], Kang et al. [2021]. A common
theme is to employ multiple hash functions, enabling features to have unique representations, while
still mapping into a small shared parameter table. Although these methods outperform the traditional
Hashing Trick in certain scenarios, they still enforce random sharing of parameters between unrelated
concepts, introducing substantial noise that the subsequent machine learning model has to overcome.
Clearly, there is an essential difference between “post-training” compression methods like Product
Quantization which can utilize similarities between concepts and “during training” techniques
based on hashing, which are forced to randomly mix up concepts. This paper’s key contribution
is to bridge that gap: We present a novel compression approach we call “Clustered Compositional
Embeddings” (or CCE for short) that combines hashing and clustering while retaining the benefits
of both methods. By continuously interleaving clustering with training, we train recommendation
models with performance matching post-training quantization, while using a fixed parameter count
and computational cost throughout training, matching hashing-based methods.
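To sketch what "interleaving clustering with training" looks like in code, the following is a simplified, illustrative version of one CCE round in the style of Figure 1(a), using numpy and scikit-learn's KMeans. It is not the reference implementation (linked below); the table sizes, the sampling strategy, and the plain global-variable structure are our own simplifications.

import numpy as np
from sklearn.cluster import KMeans

d1, dim, k = 100_000, 16, 1024             # vocabulary size, embedding dim, rows per table
rng = np.random.default_rng(0)

h0 = rng.integers(0, k, size=d1)           # hash of each ID into table 0
h1 = rng.integers(0, k, size=d1)           # independent hash into table 1
M0 = 0.1 * rng.normal(size=(k, dim))       # the two small trainable tables
M1 = 0.1 * rng.normal(size=(k, dim))

def embed(ids):
    # During training, an ID's embedding is the mean of its two code words.
    return 0.5 * (M0[h0[ids]] + M1[h1[ids]])

def cce_round(sample_ids):
    """One cluster + retrain step; sample_ids should contain at least k IDs."""
    global h0, h1, M0, M1
    # (c) Cluster the current embeddings of a sample of IDs.
    km = KMeans(n_clusters=k, n_init=4).fit(embed(sample_ids))
    # Assign every ID to its nearest cluster center, using the *old* tables
    # (a real system does this in chunks, never materializing all d1 vectors).
    new_h0 = km.predict(embed(np.arange(d1)))
    # (d) Table 0 becomes the codebook; table 1 is re-randomized with a fresh hash.
    M0, h0 = km.cluster_centers_.copy(), new_h0
    M1, h1 = 0.1 * rng.normal(size=(k, dim)), rng.integers(0, k, size=d1)

# Usage: train the model with embed(...) for an epoch, call cce_round(sample), repeat.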
In spirit, our effort can be likened to methods like RigL [Evci et al., 2020], which discovers the wiring
of a sparse neural network during training rather than pruning a dense network post-training. Our
work can also be seen as a form of “Online Product Quantization” [Xu et al., 2018], though prior work focused only on updating code words already assigned to the concept. Our goal is more ambitious:
We want to learn which concepts to group together without ever knowing the “true” embedding for
the concepts.
Why is this hard? Imagine you are training your model and at some point decide to use the same vector for IDs i and j. For the remaining duration of the training, you can never distinguish the two IDs again, and thus any decision you make is permanent. The more you cluster, the smaller your table gets. But we are interested in keeping a constant number of parameters throughout training, while continuously improving the clustering.
In summary, our main contributions are:
• A new dynamic quantization algorithm (CCE) that combines clustering and sketching, and is particularly well suited to compressing embedding tables in Recommendation Systems.
• We use CCE to train the Deep Learning Recommendation Model (DLRM) to baseline performance with less than 50% of the table parameters required by the previous state of the art, and a whopping 11,000 times fewer parameters than the baseline models without compression.
• Using slightly more parameters, but still significantly fewer than the baseline, we can also improve the Binary Cross Entropy by 0.66%, showing that CCE helps combat overfitting.
• We prove theoretically that a version of our method for the linear least-squares problem always succeeds in finding the optimal embedding table in a number of steps logarithmic in the desired approximation accuracy.
An implementation of our methods and related work is available at github.com/thomasahle/cce.
2 Background and Related Work
Figure 2: Typical Recommendation System Architecture: The DLRM model [Naumov et al., 2019] embeds each categorical feature separately and combines the resulting vectors with pair-wise dot products. Other architectures use different interaction layers or a single embedding table for all categorical features, but the central role of the embedding table is universal. (Picture credit: Nvidia.)
We show how most previous work on table compression can
be seen in the theoretical framework of linear dimensionality
reduction. This allows us to generalize many techniques and
guide our intuition on how to choose the quality and number
of hash functions in the system.
We omit common preprocessing tricks, such as weighting entities by frequency, using separate tables and precisions for common vs. uncommon elements, or completely pruning
rare entities. We also don’t cover the background of “post-
training” quantization, but refer to the survey by Gray and
Neuhoff [1998].
Theoretical work by Li et al. [2023] suggests “Learning to
Count sketch”, but these methods require a very large number of
full training runs of the model. We only consider methods that
are practical to scale to very large Recommendation Systems.
See also Indyk and Wagner [2022] on metric compression.
2.1 Embedding Tables as Linear Maps
An embedding table is typically expressed as a tall skinny matrix T ∈ R^(d1×d2), where each ID i ∈ [d1] is mapped to the i-th row, T[i]. Alternatively, i can be expressed as a one-hot row vector e_i ∈ {0,1}^(d1), in which case T[i] = e_i T ∈ R^(d2).

Most previous work in the area of table compression is based on the idea of sketching: We introduce a (typically sparse) matrix H ∈ {0,1}^(d1×k) and a dense matrix M ∈ R^(k×d2), where k ≪ d1, and take T = HM. In other words, to compute T[i] we compute (e_i H)M. Since H and e_i are both sparse, this requires very little memory and takes only constant time. The vector e_i H ∈ R^k is called “the sketch of i” and M is the “compressed embedding table” that is trained with gradient descent.
In this framework, we can also express most other approaches to training-time table compression.
Some previous work has focused on the “problem” of avoiding hash collisions, which intuitively
makes sense as they make the model completely blind to differences in the colliding concepts.
However, in our experiments hashing does nearly as well as the proposed methods, suggesting
that a different approach is needed. Sketching is a more general way to understand this.
The Hashing Trick [Weinberger et al., 2009] is normally described by a hash function h: [d1] → [k], such that i is given the vector M[h(i)], where M is a table with just k ≪ d1 rows. Alternatively, we can think of this trick as multiplying e_i with a random matrix H ∈ {0,1}^(d1×k) which has exactly one 1 in each row. Then the embedding of i is M[h(i)] = e_i HM, where HM ∈ R^(d1×d2).
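The equivalence between the lookup view and the matrix view is easy to check directly. The snippet below is a small illustration in numpy, with a random assignment standing in for the hash function and arbitrary sizes; H is only written out for the check, since real implementations store just h and M.

import numpy as np

d1, k, d2 = 1_000, 64, 8
rng = np.random.default_rng(1)

h = rng.integers(0, k, size=d1)          # stands in for a hash function [d1] -> [k]
M = rng.normal(size=(k, d2))             # the small trainable table

H = np.zeros((d1, k))                    # the matrix view: exactly one 1 per row
H[np.arange(d1), h] = 1.0

i = 42
e_i = np.zeros(d1)
e_i[i] = 1.0                             # one-hot row vector for ID i
assert np.allclose(M[h[i]], e_i @ H @ M) # lookup and sketch give the same embedding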
Hash Embeddings [Tito Svenstrup et al., 2017] map each ID i ∈ V to the sum of a few table rows. For example, if i is mapped to two rows, then its embedding vector is v = M[h1(i)] + M[h2(i)]. Using the notation H ∈ {0,1}^(d1×k) from above, one can check that this corresponds to each row having exactly two 1s. In the paper, the authors also consider weighted combinations, which simply means that the non-zero entries of H can be arbitrary real numbers.
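In code, the two-hash variant amounts to summing two referenced rows; a minimal sketch with illustrative sizes, where the fixed weights mark the spot the weighted variant would make learnable:

import numpy as np

d1, k, d2 = 1_000, 64, 8
rng = np.random.default_rng(2)
h1, h2 = rng.integers(0, k, size=(2, d1))    # two independent "hash functions"
M = rng.normal(size=(k, d2))                 # shared parameter table
w1, w2 = 1.0, 1.0                            # the weighted variant learns these instead

def hash_embedding(i):
    # Corresponds to a row of H with exactly two non-zero entries (w1 and w2).
    return w1 * M[h1[i]] + w2 * M[h2[i]]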
Compositional Embeddings (CE or “Quotient-Remainder”, Shi et al. [2020]) define h1(i) = ⌊i/p⌋ and h2(i) = i mod p for an integer p, and then combine T[h1(i)] and T[h2(i)] in various ways. As mentioned by the authors, this particular choice of hash functions is not of great importance, and more general hash functions can also be used, which allows for more flexibility in the size and number of tables. Besides using sums, like Hash Embeddings, the authors propose element-wise multiplication⁴ and concatenation. Concatenation [T[h1(i)], T[h2(i)]] can again be described with a matrix H ∈ {0,1}^(d1×k) where each row has exactly one 1 in the top half of H and one in the bottom half of H, together with a block diagonal matrix M. While this restricts the variations in embedding matrices T that are allowed, we usually compensate by picking a larger table, so the difference in entropy compared to Hash Embeddings is small, and the practical results are very similar as well.
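A minimal sketch of the quotient-remainder construction with the concatenation variant follows; it assumes the vocabulary size is at most p^2 and that the embedding dimension is split evenly between the two tables (both choices are illustrative, not prescribed by the method).

import numpy as np

p, d2 = 1_000, 16                      # p * p must cover the vocabulary; d2 is the final dim
rng = np.random.default_rng(3)
M_q = rng.normal(size=(p, d2 // 2))    # "quotient" table
M_r = rng.normal(size=(p, d2 // 2))    # "remainder" table

def qr_embedding(i):
    h1, h2 = i // p, i % p             # h1(i) = floor(i / p), h2(i) = i mod p
    return np.concatenate([M_q[h1], M_r[h2]])   # concatenation variant

The sum variant would instead keep both tables at the full width d2 and add the two rows, and random hash functions can replace the quotient/remainder split as noted above.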
ROBE embeddings [Desai et al., 2022] are essentially Compositional Embeddings with concatenation as described above, but add some more flexibility in the indexing by allowing the pieces to “wrap around” in the embedding table. In our experiments, ROBE was nearly indistinguishable from CE with concatenation for large models, though it did give some measurable improvements for very small tables.
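The wrap-around indexing can be illustrated in a few lines; the chunk size, array size, and the use of Python's built-in hash as a stand-in for a proper hash function are all illustrative:

import numpy as np

size, chunk, pieces = 1_000, 4, 4      # one flat weight array; 4 pieces of dim 4 per ID
rng = np.random.default_rng(4)
weights = rng.normal(size=size)

def robe_embedding(i):
    # Each piece is a contiguous slice starting at a hashed offset, taken modulo
    # the array length, so pieces may wrap around and may overlap between IDs.
    starts = [hash((i, j)) % size for j in range(pieces)]
    return np.concatenate([weights[(s + np.arange(chunk)) % size] for s in starts])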
Deep Hashing Embeddings (DHE, Kang et al. [2021]) picks 1024 hash functions h1, . . . , h1024: [d1] → [−1, 1] and feeds the vector (h1(i), . . . , h1024(i)) into a multi-layer perceptron. While the idea of using an MLP to save memory at the cost of more compute is novel and departs from the sketching framework, the first hashing step of DHE is just sketching with a dense random matrix H ∈ [−1, 1]^(d1×1024). While this is less efficient than a sparse matrix, it can still be applied efficiently to sparse inputs e_i and stored in a small amount of memory. Indeed, in our experiments, for a fixed parameter budget, the fewer layers in the MLP, the better DHE performed. This indicates to us that the sketching part of DHE is still the most important part.
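A stripped-down sketch of this two-stage structure follows; seeding a random generator with the ID stands in for the 1024 hash functions, and the single hidden ReLU layer and its width are arbitrary illustrative choices:

import numpy as np

n_hashes, hidden, dim = 1024, 64, 16
rng = np.random.default_rng(5)
W1 = 0.03 * rng.normal(size=(n_hashes, hidden))   # the MLP replaces the embedding table
W2 = 0.1 * rng.normal(size=(hidden, dim))

def dhe_embedding(i):
    # First stage: a dense "sketch" of the ID, one coordinate in [-1, 1] per hash function.
    code = np.random.default_rng(i).uniform(-1.0, 1.0, size=n_hashes)
    # Second stage: a small MLP refines the code into the final embedding.
    return np.maximum(code @ W1, 0.0) @ W2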
Tensor Train [Yin et al., 2021] doesn’t use hashing, but like CE it splits the input in a deterministic
way that can be generalized to a random hash function if so inclined. Instead of adding or concatenat-
ing chunks, Tensor Train multiplies them together as matrices, which makes it not strictly a linear
operation. However, like DHE, the first step in reducing the input size is sketching.
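As a rough illustration of "multiplying chunks as matrices" (a drastic simplification of the actual tensor-train decomposition in Yin et al. [2021]), one can split the ID into two digits and combine a per-digit matrix and a per-digit vector by matrix multiplication:

import numpy as np

p, r, d2 = 1_000, 4, 16                 # split i = i1 * p + i2; r plays the role of the rank
rng = np.random.default_rng(6)
A = rng.normal(size=(p, d2, r))         # one (d2 x r) matrix per first digit
B = rng.normal(size=(p, r))             # one length-r vector per second digit

def tt_embedding(i):
    i1, i2 = i // p, i % p
    # The two chunks are combined multiplicatively rather than by sum or concatenation.
    return A[i1] @ B[i2]                # shape (d2,)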
Learning to Collide In recent parallel work, Ghaemmaghami et al. [2022] propose an alternate
method for learning a clustering based on a low-dimensional embedding table. This is like a horizontal sketch rather than a vertical one, which unfortunately means the potential parameter savings are substantially smaller.
Our method introduces a novel approach to dynamic compression by shifting from random sketching to learned sketching. This process can be represented as e_i HM, where H is a sparse matrix and M is a small dense matrix. The distinguishing factor is that we derive H from the data, instead of relying on a random or fixed matrix. This adaptation, both theoretically and empirically, allows learning the same model using less memory.
⁴While combining vectors with element-wise multiplication is not a linear operation, we also learned from personal communication with the authors that such embeddings are unfortunately hard to train in practice. Hence we focus on the two linear variants.
(a) The Hashing Trick: Also known as Count Sketch, each ID is hashed to one location in a table (here with 1000 rows) and is assigned the embedding vector stored at that location. Many IDs are likely to share the same vector.

(b) Hash Embeddings: Each ID is hashed to two rows, one per table, and its embedding vector is the sum of those two vectors. Here we use two separate tables, unlike in Tito Svenstrup et al. [2017].

(c) ROBE: Similar to CE with concatenation, but instead of using separate columns, it uses one continuous array from which the (in this case dim-4) pieces are picked. The pieces may even overlap.

(d) Deep Hash Embeddings: Computes a weighted sum of all the table entries, where the weights come from hashing the input 1000 times to [−1, 1]. The produced (hidden) vector is then sent through an MLP for further refinement.

(e) CE with concatenation: In the hashing version of compositional embeddings (CE), each ID is hashed to a location in each of, say, 4 different tables. The four vectors stored there are concatenated into the final embedding vector. Given the large number of possible combinations (here 1000^4), it is unlikely that two IDs get assigned the exact same embedding vector, even if they may share each part with some other IDs.

(f) Clustered CE: We can combine the sum hashing method of Tito Svenstrup et al. [2017] with the concatenation method of Shi et al. [2020]. Each ID then gets assigned a vector that is the concatenation of smaller sums. This is enhanced with the clustering idea shown in Section 1: In each of the four blocks, we apply clustering every epoch, storing the results in the green tables and replacing the hash functions in the blue tables with new random values.
Figure 3: The evolution of hashing-based methods for embedding tables. The Hashing Trick and Hash Embeddings are shown at the top, side by side, with an equal number of parameters. Next, we introduce the idea of splitting the space into multiple concatenated subspaces. This is a classic idea from product quantization and is reminiscent of multi-head attention in transformers. Finally, in CCE we combine both methods in a way that allows iterative improvement using clustering.