
Figure 1: Morphemes of "unkindly" and "unkindness".
Table 1: Phenomena in word formation.
Phenomenon    Example
Inflection    cook+s, cook+ed, cook+ing; cold+er, cold+est
Derivation    un+like, un+like+ly; im+poss+ible, im+poss+ibly
Compounding   police+man, post+man; cuttle+fish, gold+fish
Word2ket represents a word embedding as an entangled tensor of multiple small vectors (tensors) via the tensor product. The entangled form is essentially consistent with Canonical Polyadic decomposition [16]. Decomposition-based methods approximate the original large word embedding matrix with multiple small matrices and tensors. However, these small matrices or tensors have no specific meaning and lack interpretability, and approximating the original embeddings with them often hurts model performance in complicated NLP tasks such as machine translation [15, 27].
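To make the entangled-tensor idea concrete, the following is a minimal NumPy sketch (our own illustration, not Word2ket's released code): a d-dimensional embedding is assembled as a rank-r sum of Kronecker (tensor) products of n small vectors of size q, so that d = q^n. The values of q, n, and r below are illustrative assumptions.

```python
import numpy as np

q, n, r = 8, 3, 2            # small-vector size, tensor order, rank (illustrative)
d = q ** n                   # resulting embedding size: 8^3 = 512

rng = np.random.default_rng(0)
# r * n small vectors replace one length-d vector: r*n*q = 48 stored parameters
small_vectors = rng.normal(size=(r, n, q))

def entangled_embedding(small_vectors):
    """Sum over ranks of the Kronecker product of the n small vectors."""
    out = np.zeros(d)
    for rank_vectors in small_vectors:          # shape (n, q)
        kron = rank_vectors[0]
        for v in rank_vectors[1:]:
            kron = np.kron(kron, v)             # tensor product, flattened
        out += kron
    return out

e = entangled_embedding(small_vectors)
print(e.shape)               # (512,) built from 48 stored parameters
```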
In this study, we focus on high-quality compressed word embeddings. To this end, we propose Morphologically-enhanced Tensorized Embeddings (MorphTE), which injects morphological knowledge into tensorized embeddings. Specifically, MorphTE models the embedding of a word as the entangled form of its morpheme vectors via tensor products. Notably, the quality of word embeddings can be improved by fine-grained morphemes, which has been verified in the literature [4, 5]. The benefits of introducing the morphology of morphemes in MorphTE can be summarized in two points.
(1) A word consists of morphemes, which are considered the smallest meaning-bearing or grammatical units of a language [24]. As shown in Figure 1, the root ‘kind’ determines the underlying meaning of ‘unkindly’ and ‘unkindness’. The affixes ‘un’, ‘ly’, and ‘ness’ grammatically mark negation, adverbs, and nouns, respectively. In MorphTE, using these meaningful morphemes to generate word embeddings explicitly injects prior semantic and grammatical knowledge into the learning of word embeddings.
(2) As shown in Table 1, linguistic phenomena such as inflection and derivation in word formation mean that morphologically similar words are often semantically related. In MorphTE, such similar words can be connected by sharing the same morpheme vectors.
MorphTE only needs to train and store morpheme vectors, which are smaller in both embedding size and vocabulary size than the original word embeddings, leading to fewer parameters. For example, a word embedding of size 512 can be generated from three morpheme vectors of size 8 via tensor products. In addition, since morphemes are the basic units of words, the morpheme vocabulary is smaller than the word vocabulary. To sum up, MorphTE can learn high-quality and space-efficient word embeddings, combining the prior knowledge of morphology with the compression ability of tensor products.
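The following is a minimal NumPy sketch of this construction (an illustration under our own assumptions, not the actual MorphTE implementation): each word embedding is obtained as the Kronecker product of its morpheme vectors, and morphologically related words such as ‘unkindly’ and ‘unkindness’ reuse the shared vectors for ‘un’ and ‘kind’. The morpheme segmentation, the ‘<pad>’ entry, and the vector size are illustrative.

```python
import numpy as np

q = 8                                  # morpheme vector size (illustrative)
morpheme_vocab = ["un", "kind", "ly", "ness", "<pad>"]   # "<pad>" for words with fewer morphemes
rng = np.random.default_rng(0)
# Only |morpheme vocabulary| x q parameters are stored and trained.
morpheme_vectors = {m: rng.normal(size=q) for m in morpheme_vocab}

def word_embedding(morphemes):
    """Tensor (Kronecker) product of the word's morpheme vectors."""
    emb = morpheme_vectors[morphemes[0]]
    for m in morphemes[1:]:
        emb = np.kron(emb, morpheme_vectors[m])
    return emb

# 'unkindly' and 'unkindness' share the vectors for 'un' and 'kind'.
e1 = word_embedding(["un", "kind", "ly"])      # shape (512,) = 8*8*8
e2 = word_embedding(["un", "kind", "ness"])    # shape (512,)
print(e1.shape, e2.shape)
```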
We conducted comparative experiments on machine translation, retrieval-based question answering, and natural language inference tasks. Our proposed MorphTE achieves better model performance on these tasks compared to related word embedding compression methods. Compared with Word2ket, MorphTE achieves improvements of 0.7, 0.6, and 0.6 BLEU scores on the De-En, En-It, and En-Ru datasets, respectively. In addition, on 4 translation datasets in different languages, our method maintains the original performance while compressing the number of parameters of word embeddings by more than 20 times and reducing the proportion of word embeddings in the total parameters from approximately 30% to 2%, whereas other compression methods hurt performance.
The main contributions of our work can be summarized as follows:
• We propose MorphTE, a novel compression method for word embeddings using the form of entangled tensors with morphology. The combination of morphemes and tensor products can compress word embeddings in terms of both vocabulary size and embedding size.
• MorphTE introduces prior semantic knowledge into the learning of word embeddings from a fine-grained morpheme perspective, and explicitly models the connections between words by sharing morpheme vectors. These properties enable it to learn high-quality compressed embeddings.