
In this framework, we can also express most other approaches to training-time table compression.
Some previous work has focused on the “problem” of avoiding hash collisions, which intuitively makes sense, since collisions make the model completely blind to differences between the colliding concepts. However, in our experiments plain hashing does nearly as well as these proposed methods, suggesting that a different approach is needed. Sketching offers a more general way to understand this.
The Hashing Trick [Weinberger et al., 2009] is normally described by a hash function $h : [d_1] \to [k]$, such that ID $i$ is given the vector $M[h(i)]$, where $M$ is a table with just $k \ll d_1$ rows. Alternatively, we can think of this trick as multiplying $e_i$ with a random matrix $H \in \{0,1\}^{d_1 \times k}$ which has exactly one 1 in each row. The embedding of $i$ is then $M[h(i)] = e_i H M$, where $HM \in \mathbb{R}^{d_1 \times d_2}$.
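To make the equivalence concrete, here is a minimal NumPy sketch of the two views; the sizes ($d_1 = 1000$, $d_2 = 16$, $k = 50$) and the explicitly stored hash array are illustrative choices of ours, not values from the cited paper.

```python
import numpy as np

d1, d2, k = 1000, 16, 50          # vocabulary size, embedding dim, compressed rows (illustrative)
rng = np.random.default_rng(0)

h = rng.integers(0, k, size=d1)   # a random hash h : [d1] -> [k]
M = rng.normal(size=(k, d2))      # small table M with k << d1 rows

i = 42

# Table-lookup view: ID i gets the row M[h(i)].
v_lookup = M[h[i]]

# Sketch view: multiply the one-hot vector e_i by a 0/1 matrix H with one 1 per row.
H = np.zeros((d1, k))
H[np.arange(d1), h] = 1.0
e_i = np.zeros(d1)
e_i[i] = 1.0
v_sketch = e_i @ H @ M            # equals M[h(i)]

assert np.allclose(v_lookup, v_sketch)
```

Note that only the $k \times d_2$ entries of $M$ need to be learned and stored; $H$ is never materialized in practice, since $h$ can be computed on the fly.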
Hash Embeddings [Tito Svenstrup et al., 2017] map each ID $i \in V$ to the sum of a few table rows. For example, if $i$ is mapped to two rows, its embedding vector is $v = M[h_1(i)] + M[h_2(i)]$. In the matrix notation $H \in \{0,1\}^{d_1 \times k}$ from above, one can check that this corresponds to each row of $H$ having exactly two 1s. In the paper, the authors also consider weighted combinations, which simply means that the non-zero entries of $H$ can be arbitrary real numbers.
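A hedged illustration, with the same illustrative sizes as the previous sketch; the per-ID weight matrix below is a placeholder of ours, not the exact parametrization from the paper.

```python
import numpy as np

d1, d2, k = 1000, 16, 50           # illustrative sizes as before
rng = np.random.default_rng(0)
M = rng.normal(size=(k, d2))       # shared small table

h1 = rng.integers(0, k, size=d1)   # two independent hash functions into [k]
h2 = rng.integers(0, k, size=d1)
i = 42

# Unweighted Hash Embedding: row i of H has exactly two 1s, at columns h1(i) and h2(i).
v = M[h1[i]] + M[h2[i]]

# Weighted variant: the two non-zero entries of H become (learnable) real numbers.
w = rng.normal(size=(d1, 2))       # per-ID importance weights (illustrative placeholder)
v_weighted = w[i, 0] * M[h1[i]] + w[i, 1] * M[h2[i]]
```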
Compositional Embeddings (CE or “Quotient Remainder”, Shi et al., 2020) define $h_1(i) = \lfloor i/p \rfloor$ and $h_2(i) = i \bmod p$ for an integer $p$, and then combine $T[h_1(i)]$ and $T[h_2(i)]$ in various ways. As the authors mention, this particular choice of hash functions is not of great importance, and more general hash functions can also be used, which allows more flexibility in the size and number of tables. Besides sums, as in Hash Embeddings, the authors propose element-wise multiplication⁴ and concatenation. Concatenation $[T[h_1(i)], T[h_2(i)]]$ can again be described by a matrix $H \in \{0,1\}^{d_1 \times k}$ in which each row has exactly one 1 among its first $k/2$ entries and one among its last $k/2$ entries, together with a block-diagonal matrix $M$. While this restricts the embedding matrices $T$ that can be expressed, we usually compensate by picking a larger $k$, so the difference in entropy from Hash Embeddings is small, and the practical results are very similar as well.
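The following NumPy sketch illustrates the quotient-remainder split with concatenation and its matrix form; the sizes ($p = 32$, $d_2 = 16$) and the assumption $d_1 \le p^2$ are illustrative.

```python
import numpy as np

d1, d2 = 1000, 16                    # vocabulary size and embedding dim (illustrative)
p = 32                               # quotient/remainder modulus; tables have p rows of dim d2 // 2
rng = np.random.default_rng(0)

T1 = rng.normal(size=(p, d2 // 2))   # "quotient" table
T2 = rng.normal(size=(p, d2 // 2))   # "remainder" table

def qr_concat(i):
    h1, h2 = i // p, i % p
    return np.concatenate([T1[h1], T2[h2]])

# Equivalent sketch view: each row of H has one 1 in each half of its columns, M is block diagonal.
H = np.zeros((d1, 2 * p))
ids = np.arange(d1)
H[ids, ids // p] = 1.0               # requires d1 <= p * p so the quotient stays in range
H[ids, p + ids % p] = 1.0
M = np.zeros((2 * p, d2))
M[:p, : d2 // 2] = T1                # block-diagonal M: T1 feeds the first half of the output
M[p:, d2 // 2 :] = T2                # T2 feeds the second half

i = 42
e_i = np.zeros(d1)
e_i[i] = 1.0
assert np.allclose(qr_concat(i), e_i @ H @ M)
```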
ROBE embeddings [Desai et al., 2022] are essentially Compositional Embeddings with concatenation as described above, but add some extra flexibility in the indexing by allowing the pieces to “wrap around” in the embedding table. In our experiments, ROBE was nearly indistinguishable from
CE with concatenation for large models, though it did give some measurable improvements for very
small tables.
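A rough sketch of the wrap-around idea follows; the chunk length, the flat-array size, and the use of Python's built-in hash as a stand-in for a universal hash are our illustrative simplifications, not the exact ROBE indexing scheme.

```python
import numpy as np

size = 1000                         # one flat parameter array (illustrative size)
chunk = 8                           # each ID reads contiguous chunks of this length
rng = np.random.default_rng(0)
memory = rng.normal(size=size)

def robe_like_lookup(i, d2=16):
    """Concatenate chunks that may wrap around the end of the flat array."""
    pieces = []
    for j in range(d2 // chunk):
        start = hash((i, j)) % size               # stand-in for a universal hash
        idx = (start + np.arange(chunk)) % size   # wrap-around indexing
        pieces.append(memory[idx])
    return np.concatenate(pieces)

v = robe_like_lookup(42)            # a d2-dimensional embedding built from two chunks
```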
Deep Hashing Embeddings (DHE, Kang et al., 2021) pick 1024 hash functions $h_1, \ldots, h_{1024} : [d_1] \to [-1,1]$ and feed the vector $(h_1(i), \ldots, h_{1024}(i))$ into a multi-layer perceptron. While the idea of using an MLP to save memory at the cost of extra compute is novel and departs from the sketching framework, the first hashing step of DHE is just sketching with a dense random matrix $H \in [-1,1]^{d_1 \times 1024}$. While this is less efficient than a sparse matrix, it can still be applied efficiently to sparse inputs $e_i$ and stored in a small amount of memory. Indeed, in our experiments, for a fixed parameter budget, the fewer layers the MLP had, the better DHE performed. This indicates to us that the sketching part of DHE is still its most important part.
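The sketch below illustrates this reading of DHE; the MLP depth and width, and the explicit storage of the hash values as a matrix (in practice they would be computed on the fly), are illustrative simplifications of ours.

```python
import numpy as np

d1, d2, n_hash = 1000, 16, 1024    # illustrative sizes; DHE uses 1024 hash functions
rng = np.random.default_rng(0)

# The first DHE step is equivalent to a dense random matrix H in [-1, 1]^{d1 x n_hash}:
# row i of H is the vector (h_1(i), ..., h_1024(i)).
H = rng.uniform(-1.0, 1.0, size=(d1, n_hash))

# A small MLP as the decoder (depth and width are illustrative, not the paper's setup).
W1 = rng.normal(size=(n_hash, 64)) / np.sqrt(n_hash)
W2 = rng.normal(size=(64, d2)) / np.sqrt(64)

def dhe_like_embedding(i):
    x = H[i]                        # = e_i @ H, i.e. sketching with a dense matrix
    x = np.maximum(x @ W1, 0.0)     # ReLU hidden layer
    return x @ W2

v = dhe_like_embedding(42)
```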
Tensor Train [Yin et al., 2021] does not use hashing, but like CE it splits the input IDs in a deterministic way that could be generalized to a random hash function if so inclined. Instead of adding or concatenating the chunks, Tensor Train multiplies them together as matrices, which makes it not strictly a linear operation. However, like DHE, the first step in reducing the input size is sketching.
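A two-core sketch of this idea, with illustrative factorizations $d_1 = 40 \cdot 25$ and $d_2 = 4 \cdot 4$ and TT-rank 2; the actual method may use more cores and different shapes.

```python
import numpy as np

n1, n2 = 40, 25                     # d1 = n1 * n2 = 1000 IDs (illustrative factorization)
m1, m2, r = 4, 4, 2                 # d2 = m1 * m2 = 16, TT-rank r
rng = np.random.default_rng(0)

G1 = rng.normal(size=(n1, m1, r))   # first TT core
G2 = rng.normal(size=(n2, r, m2))   # second TT core

def tt_embedding(i):
    i1, i2 = divmod(i, n2)          # deterministic split of the ID, as in CE
    return (G1[i1] @ G2[i2]).reshape(-1)   # matrix product of the chunks, flattened to d2

v = tt_embedding(42)                # 16-dim embedding from 520 instead of 16000 parameters
```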
Learning to Collide In recent parallel work, Ghaemmaghami et al. [2022] propose an alternative method for learning a clustering based on a low-dimensional embedding table. This is like a horizontal sketch rather than a vertical one, which unfortunately means the potential parameter savings are substantially smaller.
Our method introduces a novel approach to dynamic compression by shifting from random sketching to learned sketching. This process can be represented as $e_i H M$, where $H$ is a sparse matrix and $M$ is a small dense matrix. The distinguishing factor is that we derive $H$ from the data, instead of relying on a random or fixed matrix. This adaptation, both theoretically and empirically, allows learning the same model using less memory.
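As a schematic only, the forward pass $e_i H M$ with a sparse $H$ might look as follows; the sparsity pattern and weights here are random placeholders, whereas in our method they are derived from the data, and the per-row sparsity $s = 2$ is an illustrative choice.

```python
import numpy as np

d1, d2, k, s = 1000, 16, 50, 2     # s non-zeros per row of H (illustrative sparsity)
rng = np.random.default_rng(0)

# A sparse H stored as, per ID, s column indices and s real weights.  These are random
# placeholders here, purely to show the forward pass.
cols = rng.integers(0, k, size=(d1, s))
vals = rng.normal(size=(d1, s))
M = rng.normal(size=(k, d2))       # small dense table

def embed(i):
    # e_i @ H @ M with a sparse row of H: a weighted sum of a few rows of M.
    return vals[i] @ M[cols[i]]

v = embed(42)
```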
⁴While combining vectors with element-wise multiplication is not a linear operation, we learned from personal communication with the authors that such embeddings are unfortunately hard to train in practice. Hence we focus on the two linear variants.