Clustering the Sketch: Dynamic Compression for
Embedding Tables
Henry Ling-Hei Tsang∗
Meta
henrylhtsang@meta.com
Thomas Dybdahl Ahle∗
Meta†
Normal Computing
thomas@ahle.dk
Abstract
Embedding tables are used by machine learning systems to work with categorical
features. In modern Recommendation Systems, these tables can be very large,
necessitating the development of new methods for fitting them in memory, even
during training. We suggest Clustered Compositional Embeddings (CCE), which combines
clustering-based compression, like quantization to codebooks, with dynamic methods, like the
Hashing Trick and Compositional Embeddings [Shi et al., 2020]. Experimentally, CCE achieves
the best of both worlds: the high compression rate of codebook-based quantization, but with the
dynamic nature of hashing-based methods, so it can be used during training. Theoretically, we
prove that CCE is guaranteed to converge to the optimal codebook, and we give a tight bound on
the number of iterations required.
1 Introduction
Neural networks can efficiently handle various data types, including continuous, sparse, and sequential
features. However, categorical features present a unique challenge as they require embedding a
typically vast vocabulary into a smaller vector space for further calculations. Examples of these
features include user IDs, post IDs on social networks, video IDs, and IP addresses commonly
encountered in Recommendation Systems.
In some domains where embeddings are employed, such as Natural Language Processing [Mikolov
et al., 2013], the vocabulary can be significantly reduced by considering “subwords” or “byte pair
encodings”. In Recommendation Systems like Matrix Factorization or DLRM (see Figure 2) it is
typically not possible to factorize the vocabulary this way, leading to large embedding tables that
demand hundreds of gigabytes of GPU memory [Naumov et al., 2019]. This necessitates splitting
models across multiple GPUs, increasing cost and creating a communication bottleneck during both
training and inference.
The traditional solution is to hash the IDs down to a manageable size using the Hashing Trick [Wein-
berger et al., 2009], accepting the possibility that unrelated IDs may share the same representation.
Excessively aggressive hashing can impair the model’s ability to distinguish its inputs, as it may mix
up unrelated concepts, ultimately reducing model performance.
Another option for managing embedding tables is quantization. This typically involves reducing the
precision to 4 or 8 bits or using multi-dimensional methods like Product Quantization and Residual
Vector Quantization, which rely on clustering (e.g., K-means) to identify representative “code words”
for each original ID. (See Gray and Neuhoff [1998] for a survey of quantization methods.) For
instance, vectors representing “red”, “orange”, and “blue” may be stored as simply “dark orange”
and “blue”, with the first two concepts pointing to the same average embedding. See Section 1 for an example.
∗Equal contribution. †Work done mostly at Probability at Meta.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
arXiv:2210.05974v3 [cs.LG] 22 Oct 2023
[Figure 1(a) diagram: Start → Hash → Train → Cluster → Retrain → ...]
(a) A single iteration of CCE: (a) Starting from a random embedding table, each ID is hashed to a vector in each of two small tables. (b) During training, the embedding of an ID is taken to be the mean of the two referenced code words. (c) After training for an epoch, the vectors for all (or a sample of) the IDs are computed and clustered. This leaves a new small table in which similar IDs are represented by the same vector. (d) We can choose to combine the cluster centers with a new random table (and a new hash function), after which the process can be repeated for an increasingly better understanding of which IDs should be combined.
[Figure 1(b) plot: MSE (y-axis, 300–900) versus the number of cluster + learn iterations (x-axis, 10^0–10^4), comparing CCE with the optimal 1-sparse and 2-sparse solutions.]
(b) CCE for Least Squares: The least squares problem is to find the matrix T that minimizes ‖XT − Y‖_F. To save space we may use codebook quantization, that is, factorize T ≈ HM, where H is a sparse Boolean matrix and M is a small dense matrix. If we already know T this is easy to do using the K-means algorithm, but if T is too large to store in memory, we can find the compressed form directly using CCE, which we prove finds the optimal H and M. For the plot, we sampled X ∈ R^(10^4×10^3) and Y ∈ R^(10^4×10). We ran CCE and compared it with the optimal solution of first solving for T and then factorizing it with either one or two 1s per row in H.
Figure 1: Clustered Compositional Embeddings (CCE) is a simple algorithm that you run interspersed with your normal model training, for example once every epoch of SGD. While the theoretical bound (and the least squares setting shown here) requires many iterations for perfect convergence, in practice we get substantial gains from running just 1–6 iterations.
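To make the "easy" direction from the caption of Figure 1(b) concrete: when T is already known, the factorization T ≈ HM is just k-means on the rows of T. The following is a minimal sketch of that step only (not the full CCE loop), assuming numpy and scikit-learn; the sizes d1, d2, and k are illustrative.

import numpy as np
from sklearn.cluster import KMeans

d1, d2, k = 10_000, 16, 256              # vocabulary size, embedding dim, codebook size
T = np.random.randn(d1, d2)              # the (known) table we want to compress

km = KMeans(n_clusters=k, n_init=10).fit(T)
M = km.cluster_centers_                  # (k, d2): the small dense codebook
assign = km.labels_                      # (d1,): index of the code word for each ID

# H is the sparse Boolean matrix with a single 1 per row (at column assign[i]);
# it never needs to be materialized, since row i of H @ M is simply M[assign[i]].
T_hat = M[assign]                        # (d1, d2): the compressed reconstruction
print(np.linalg.norm(T - T_hat))         # Frobenius error ||T - HM||_F

The point of CCE is that a comparable factorization can be reached without ever storing T, by clustering the partially trained sketch-based embeddings instead.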
Clustering also plays a crucial role in the theoretical literature on vector compression [Indyk and Wagner, 2022]. However, a significant drawback of these quantization methods is that the model is only quantized after training, leaving memory utilization during training unaffected.³

³Reducing the precision to 16-bit floats is often feasible during training, but this work aims for memory reductions much larger than that.
Recent authors have explored more advanced uses of hashing to address this challenge: Tito Svenstrup
et al. [2017], Shi et al. [2020], Desai et al. [2022], Yin et al. [2021], Kang et al. [2021]. A common
theme is to employ multiple hash functions, enabling features to have unique representations, while
still mapping into a small shared parameter table. Although these methods outperform the traditional
Hashing Trick in certain scenarios, they still enforce random sharing of parameters between unrelated
concepts, introducing substantial noise that the subsequent machine learning model has to overcome.
Clearly, there is an essential difference between “post-training” compression methods like Product
Quantization which can utilize similarities between concepts and “during training” techniques
based on hashing, which are forced to randomly mix up concepts. This paper’s key contribution
is to bridge that gap: We present a novel compression approach we call “Clustered Compositional
Embeddings” (or CCE for short) that combines hashing and clustering while retaining the benefits
of both methods. By continuously interleaving clustering with training, we train recommendation
models with performance matching post-training quantization, while using a fixed parameter count
and computational cost throughout training, matching hashing-based methods.
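To sketch what "interleaving clustering with training" looks like in code, the following is a simplified, illustrative version of one CCE round in the style of Figure 1(a), using numpy and scikit-learn's KMeans. It is not the reference implementation (linked below); the table sizes, the sampling strategy, and the plain global-variable structure are our own simplifications.

import numpy as np
from sklearn.cluster import KMeans

d1, dim, k = 100_000, 16, 1024             # vocabulary size, embedding dim, rows per table
rng = np.random.default_rng(0)

h0 = rng.integers(0, k, size=d1)           # hash of each ID into table 0
h1 = rng.integers(0, k, size=d1)           # independent hash into table 1
M0 = 0.1 * rng.normal(size=(k, dim))       # the two small trainable tables
M1 = 0.1 * rng.normal(size=(k, dim))

def embed(ids):
    # During training, an ID's embedding is the mean of its two code words.
    return 0.5 * (M0[h0[ids]] + M1[h1[ids]])

def cce_round(sample_ids):
    """One cluster + retrain step; sample_ids should contain at least k IDs."""
    global h0, h1, M0, M1
    # (c) Cluster the current embeddings of a sample of IDs.
    km = KMeans(n_clusters=k, n_init=4).fit(embed(sample_ids))
    # Assign every ID to its nearest cluster center, using the *old* tables
    # (a real system does this in chunks, never materializing all d1 vectors).
    new_h0 = km.predict(embed(np.arange(d1)))
    # (d) Table 0 becomes the codebook; table 1 is re-randomized with a fresh hash.
    M0, h0 = km.cluster_centers_.copy(), new_h0
    M1, h1 = 0.1 * rng.normal(size=(k, dim)), rng.integers(0, k, size=d1)

# Usage: train the model with embed(...) for an epoch, call cce_round(sample), repeat.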
In spirit, our effort can be likened to methods like RigL [Evci et al., 2020], which discovers the wiring
of a sparse neural network during training rather than pruning a dense network post-training. Our
work can also be seen as a form of “Online Product Quantization” [Xu et al., 2018], though prior work focused only on updating code words already assigned to the concept. Our goal is more ambitious:
We want to learn which concepts to group together without ever knowing the “true” embedding for
the concepts.
Why is this hard? Imagine you are training your model and at some point decide to use the same vector for IDs i and j. For the remaining duration of the training, you can never distinguish the two IDs again, and thus any decision you make is permanent. The more you cluster, the smaller your table gets. But we are interested in keeping a constant number of parameters throughout training, while continuously improving the clustering.
In summary, our main contributions are:
• A new dynamic quantization algorithm (CCE) that combines clustering and sketching, and is particularly well suited to compressing embedding tables in Recommendation Systems.
• We use CCE to train the Deep Learning Recommendation Model (DLRM) to baseline performance with less than 50% of the table parameters required by the previous state of the art, and a whopping 11,000 times fewer parameters than the baseline models without compression.
• Using slightly more parameters, but still significantly fewer than the baseline, we can also improve the Binary Cross Entropy by 0.66%, showing that CCE helps combat overfitting.
• We prove theoretically that a version of our method for the linear least-squares problem always succeeds in finding the optimal embedding table in a number of steps logarithmic in the desired approximation accuracy.
An implementation of our methods and related work is available at github.com/thomasahle/cce.
2 Background and Related Work
Figure 2: Typical Recommendation System Architecture: The DLRM model [Naumov et al., 2019] embeds each categorical feature separately and combines the resulting vectors with pair-wise dot products. Other architectures use different interaction layers or a single embedding table for all categorical features, but the central role of the embedding table is universal. (Picture credit: Nvidia.)
We show how most previous work on table compression can
be seen in the theoretical framework of linear dimensionality
reduction. This allows us to generalize many techniques and
guide our intuition on how to choose the quality and number
of hash functions in the system.
We omit common preprocessing tricks, such as weighting entities by frequency, using separate tables and precisions for common vs. uncommon elements, or completely pruning
rare entities. We also don’t cover the background of “post-
training” quantization, but refer to the survey by Gray and
Neuhoff [1998].
Theoretical work by Li et al. [2023] suggests “Learning to
Count sketch”, but these methods require a very large number of
full training runs of the model. We only consider methods that
are practical to scale to very large Recommendation Systems.
See also Indyk and Wagner [2022] on metric compression.
2.1 Embedding Tables as Linear Maps
An embedding table is typically expressed as a tall skinny matrix T ∈ R^(d1×d2), where each ID i ∈ [d1] is mapped to the i-th row, T[i]. Alternatively, i can be expressed as a one-hot row vector e_i ∈ {0,1}^(d1), in which case T[i] = e_i T ∈ R^(d2).

Most previous work in the area of table compression is based on the idea of sketching: We introduce a (typically sparse) matrix H ∈ {0,1}^(d1×k) and a dense matrix M ∈ R^(k×d2), where k ≪ d1, and take T = HM. In other words, to compute T[i] we compute (e_i H)M. Since H and e_i are both sparse, this requires very little memory and takes only constant time. The vector e_i H ∈ R^k is called “the sketch of i” and M is the “compressed embedding table” that is trained with gradient descent.
In this framework, we can also express most other approaches to training-time table compression.
Some previous work has focused on the “problem” of avoiding hash collisions, which intuitively
makes sense as they make the model completely blind to differences in the colliding concepts.
However, in our experiments hashing does nearly as well as the proposed methods, suggesting
that a different approach is needed. Sketching is a more general way to understand this.
The Hashing Trick [Weinberger et al., 2009] is normally described by a hash function h: [d1] → [k], such that i is given the vector M[h(i)], where M is a table with just k ≪ d1 rows. Alternatively, we can think of this trick as multiplying e_i with a random matrix H ∈ {0,1}^(d1×k) which has exactly one 1 in each row. Then the embedding of i is M[h(i)] = e_i HM, where HM ∈ R^(d1×d2).
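The equivalence between the lookup view and the matrix view is easy to check directly. The snippet below is a small illustration in numpy, with a random assignment standing in for the hash function and arbitrary sizes; H is only written out for the check, since real implementations store just h and M.

import numpy as np

d1, k, d2 = 1_000, 64, 8
rng = np.random.default_rng(1)

h = rng.integers(0, k, size=d1)          # stands in for a hash function [d1] -> [k]
M = rng.normal(size=(k, d2))             # the small trainable table

H = np.zeros((d1, k))                    # the matrix view: exactly one 1 per row
H[np.arange(d1), h] = 1.0

i = 42
e_i = np.zeros(d1)
e_i[i] = 1.0                             # one-hot row vector for ID i
assert np.allclose(M[h[i]], e_i @ H @ M) # lookup and sketch give the same embedding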
Hash Embeddings [Tito Svenstrup et al., 2017] map each ID i ∈ V to the sum of a few table rows. For example, if i is mapped to two rows, then its embedding vector is v = M[h1(i)] + M[h2(i)]. Using the notation H ∈ {0,1}^(d1×k) from above, one can check that this corresponds to each row having exactly two 1s. In the paper, the authors also consider weighted combinations, which simply means that the non-zero entries of H can be arbitrary real numbers.
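In code, the two-hash variant amounts to summing two referenced rows; a minimal sketch with illustrative sizes, where the fixed weights mark the spot the weighted variant would make learnable:

import numpy as np

d1, k, d2 = 1_000, 64, 8
rng = np.random.default_rng(2)
h1, h2 = rng.integers(0, k, size=(2, d1))    # two independent "hash functions"
M = rng.normal(size=(k, d2))                 # shared parameter table
w1, w2 = 1.0, 1.0                            # the weighted variant learns these instead

def hash_embedding(i):
    # Corresponds to a row of H with exactly two non-zero entries (w1 and w2).
    return w1 * M[h1[i]] + w2 * M[h2[i]]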
Compositional Embeddings (CE or “Quotient-Remainder”, Shi et al. [2020]) define h1(i) = ⌊i/p⌋ and h2(i) = i mod p for an integer p, and then combine T[h1(i)] and T[h2(i)] in various ways. As mentioned by the authors, this particular choice of hash functions is not of great importance, and more general hash functions can also be used, which allows for more flexibility in the size and number of tables. Besides using sums, like Hash Embeddings, the authors propose element-wise multiplication⁴ and concatenation. Concatenation [T[h1(i)], T[h2(i)]] can again be described with a matrix H ∈ {0,1}^(d1×k) where each row has exactly one 1 in the top half of H and one in the bottom half of H, together with a block diagonal matrix M. While this restricts the variations in embedding matrices T that are allowed, we usually compensate by picking a larger table, so the difference in entropy compared to Hash Embeddings is small, and the practical results are very similar as well.
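A minimal sketch of the quotient-remainder construction with the concatenation variant follows; it assumes the vocabulary size is at most p^2 and that the embedding dimension is split evenly between the two tables (both choices are illustrative, not prescribed by the method).

import numpy as np

p, d2 = 1_000, 16                      # p * p must cover the vocabulary; d2 is the final dim
rng = np.random.default_rng(3)
M_q = rng.normal(size=(p, d2 // 2))    # "quotient" table
M_r = rng.normal(size=(p, d2 // 2))    # "remainder" table

def qr_embedding(i):
    h1, h2 = i // p, i % p             # h1(i) = floor(i / p), h2(i) = i mod p
    return np.concatenate([M_q[h1], M_r[h2]])   # concatenation variant

The sum variant would instead keep both tables at the full width d2 and add the two rows, and random hash functions can replace the quotient/remainder split as noted above.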
ROBE embeddings [Desai et al., 2022] are essentially Compositional Embeddings with concatenation as described above, but add some more flexibility in the indexing by allowing the pieces to “wrap around” in the embedding table. In our experiments, ROBE was nearly indistinguishable from CE with concatenation for large models, though it did give some measurable improvements for very small tables.
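The wrap-around indexing can be illustrated in a few lines; the chunk size, array size, and the use of Python's built-in hash as a stand-in for a proper hash function are all illustrative:

import numpy as np

size, chunk, pieces = 1_000, 4, 4      # one flat weight array; 4 pieces of dim 4 per ID
rng = np.random.default_rng(4)
weights = rng.normal(size=size)

def robe_embedding(i):
    # Each piece is a contiguous slice starting at a hashed offset, taken modulo
    # the array length, so pieces may wrap around and may overlap between IDs.
    starts = [hash((i, j)) % size for j in range(pieces)]
    return np.concatenate([weights[(s + np.arange(chunk)) % size] for s in starts])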
Deep Hashing Embeddings (DHE, Kang et al. [2021]) picks 1024 hash functions h1, . . . , h1024: [d1] → [−1, 1] and feeds the vector (h1(i), . . . , h1024(i)) into a multi-layer perceptron. While the idea of using an MLP to save memory at the cost of more compute is novel and departs from the sketching framework, the first hashing step of DHE is just sketching with a dense random matrix H ∈ [−1, 1]^(d1×1024). While this is less efficient than a sparse matrix, it can still be applied efficiently to sparse inputs e_i and stored in a small amount of memory. Indeed, in our experiments, for a fixed parameter budget, the fewer layers in the MLP, the better DHE performed. This indicates to us that the sketching part of DHE is still the most important part.
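A stripped-down sketch of this two-stage structure follows; seeding a random generator with the ID stands in for the 1024 hash functions, and the single hidden ReLU layer and its width are arbitrary illustrative choices:

import numpy as np

n_hashes, hidden, dim = 1024, 64, 16
rng = np.random.default_rng(5)
W1 = 0.03 * rng.normal(size=(n_hashes, hidden))   # the MLP replaces the embedding table
W2 = 0.1 * rng.normal(size=(hidden, dim))

def dhe_embedding(i):
    # First stage: a dense "sketch" of the ID, one coordinate in [-1, 1] per hash function.
    code = np.random.default_rng(i).uniform(-1.0, 1.0, size=n_hashes)
    # Second stage: a small MLP refines the code into the final embedding.
    return np.maximum(code @ W1, 0.0) @ W2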
Tensor Train [Yin et al., 2021] doesn’t use hashing, but like CE it splits the input in a deterministic
way that can be generalized to a random hash function if so inclined. Instead of adding or concatenat-
ing chunks, Tensor Train multiplies them together as matrices, which makes it not strictly a linear
operation. However, like DHE, the first step in reducing the input size is sketching.
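As a rough illustration of "multiplying chunks as matrices" (a drastic simplification of the actual tensor-train decomposition in Yin et al. [2021]), one can split the ID into two digits and combine a per-digit matrix and a per-digit vector by matrix multiplication:

import numpy as np

p, r, d2 = 1_000, 4, 16                 # split i = i1 * p + i2; r plays the role of the rank
rng = np.random.default_rng(6)
A = rng.normal(size=(p, d2, r))         # one (d2 x r) matrix per first digit
B = rng.normal(size=(p, r))             # one length-r vector per second digit

def tt_embedding(i):
    i1, i2 = i // p, i % p
    # The two chunks are combined multiplicatively rather than by sum or concatenation.
    return A[i1] @ B[i2]                # shape (d2,)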
Learning to Collide In recent parallel work, Ghaemmaghami et al. [2022] propose an alternate
method for learning a clustering based on a low-dimensional embedding table. This is like a horizontal sketch rather than a vertical one, which unfortunately means the potential parameter savings are substantially smaller.
Our method introduces a novel approach to dynamic compression by shifting from random sketching to learned sketching. This process can be represented as e_i HM, where H is a sparse matrix and M is a small dense matrix. The distinguishing factor is that we derive H from the data, instead of relying on a random or fixed matrix. This adaptation, both theoretically and empirically, allows learning the same model using less memory.
⁴While combining vectors with element-wise multiplication is not a linear operation, we also learned from personal communication with the authors that such embeddings are unfortunately hard to train in practice. Hence we focus on the two linear variants.
(a) The Hashing Trick: Also known as Count Sketch, each ID is hashed to one location in a table (here with 1000 rows) and is assigned the embedding vector stored at that location. Many IDs are likely to share the same vector.

(b) Hash Embeddings: Each ID is hashed to two rows, one per table, and its embedding vector is the sum of those two vectors. Here we use two separate tables, unlike in Tito Svenstrup et al. [2017].

(c) ROBE: Similar to CE with concatenation, but instead of using separate columns, it uses one continuous array from which the (in this case dim-4) pieces are picked. The pieces may even overlap.

(d) Deep Hash Embeddings: Computes a weighted sum of all the table entries, where the weights come from hashing the input 1000 times to [−1, 1]. The produced (hidden) vector is then sent through an MLP for further refinement.

(e) CE with concatenation: In the hashing version of compositional embeddings (CE), each ID is hashed to a location in each of, say, 4 different tables. The four vectors stored there are concatenated into the final embedding vector. Given the large number of possible combinations (here 1000^4), it is unlikely that two IDs get assigned the exact same embedding vector, even if they may share each part with some other IDs.

(f) Clustered CE: We can combine the sum hashing method of Tito Svenstrup et al. [2017] with the concatenation method of Shi et al. [2020]. Each ID then gets assigned a vector that is the concatenation of smaller sums. This is enhanced with the clustering idea shown in Section 1: In each of the four blocks, we apply clustering every epoch, storing the results in the green tables and replacing the hash functions in the blue tables with new random values.
Figure 3: The evolution of hashing-based methods for embedding tables. The Hashing Trick and Hash Embeddings are shown at the top, side by side, with an equal number of parameters. Next, we introduce the idea of splitting the space into multiple concatenated subspaces. This is a classic idea from product quantization and is reminiscent of multi-head attention in transformers. Finally, in CCE we combine both methods in a way that allows iterative improvement using clustering.