MorphTE: Injecting Morphology in Tensorized Embeddings

Guobing Gan1, Peng Zhang1*, Sunzhu Li1, Xiuqing Lu1, and Benyou Wang2
1College of Intelligence and Computing, Tianjin University, Tianjin, China
2School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
{ganguobing,pzhang,lisunzhu,lvxiuqing}@tju.edu.cn, wangbenyou@cuhk.edu.cn
Abstract
In the era of deep learning, word embeddings are essential when dealing with text tasks. However, storing and accessing these embeddings requires a large amount of space. This is not conducive to the deployment of these models on resource-limited devices. Combining the powerful compression capability of tensor products, we propose a word embedding compression method with morphological augmentation, Morphologically-enhanced Tensorized Embeddings (MorphTE). A word consists of one or more morphemes, the smallest units that bear meaning or have a grammatical function. MorphTE represents a word embedding as an entangled form of its morpheme vectors via the tensor product, which injects prior semantic and grammatical knowledge into the learning of embeddings. Furthermore, the dimensionality of the morpheme vector and the number of morphemes are much smaller than those of words, which greatly reduces the parameters of the word embeddings. We conduct experiments on tasks such as machine translation and question answering. Experimental results on four translation datasets of different languages show that MorphTE can compress word embedding parameters by about 20 times without performance loss and significantly outperforms related embedding compression methods.
1 Introduction
The word embedding layer is a key component of neural network models in natural language processing (NLP). It uses an embedding matrix to map each word into a dense real-valued vector. However, when the vocabulary size and word embedding size (dimensionality) are large, the word embedding matrix requires a large number of parameters. For example, the One Billion Word language modeling task [8] has a vocabulary size (|V|) of around 800K, and the embedding size (d) can range from 300 to 1024 [30, 11, 22]. Storing and accessing the |V| × d embedding matrix requires a large amount of disk and memory space, which limits the deployment of these models on devices with limited resources. To resolve this issue, many studies compress embedding layers [32, 15, 27]. They can be roughly divided into two lines: product quantization-based and decomposition-based methods. The product quantization-based methods [32, 20, 36] mainly utilize compositional coding to construct word embeddings with fewer parameters, and they need to introduce an additional task to learn a compact code for each word.
The decomposition-based word embedding compression methods are mostly based on low-rank matrix factorization [18, 9] and tensor decomposition [15, 27]. Utilizing low-rank matrix factorization, ALBERT [18] replaces the embedding matrix with the product of two small matrices. Inspired by quantum entanglement, Word2ket and Word2ketXS embeddings were proposed [27].
*Corresponding Author
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.15379v1 [cs.CL] 27 Oct 2022
Figure 1: Morphemes of "unkindly" and "unkindness" (both share the root "kind", with affixes "un", "ly", and "ness").
Table 1: Phenomena in word formation.

Phenomenon    Example
Inflection    cook+s, cook+ed, cook+ing; cold+er, cold+est
Derivation    un+like, un+like+ly; im+poss+ible, im+poss+ibly
Compounding   police+man, post+man; cuttle+fish, gold+fish
Specifically, Word2ket represents a word embedding as an entangled tensor of multiple small vectors (tensors) via the tensor product. The entangled form is essentially consistent with Canonical Polyadic decomposition [16]. Decomposition-based methods approximate the original large word embedding matrix with multiple small matrices and tensors. However, these small matrices or tensors have no specific meaning and lack interpretability, and the approximate substitution with them often hurts model performance in complicated NLP tasks such as machine translation [15, 27].
In this study, we focus on high-quality compressed word embeddings. To this end, we propose Morphologically-enhanced Tensorized Embeddings (MorphTE), which injects morphological knowledge into tensorized embeddings. Specifically, MorphTE models the embedding of a word as the entangled form of its morpheme vectors via tensor products. Notably, the quality of word embeddings can be improved by fine-grained morphemes, which has been verified in the literature [4, 5]. The benefits of introducing the morphology of morphemes in MorphTE can be summed up in two points. (1) A word consists of morphemes, which are considered to be the smallest meaning-bearing or grammatical units of a language [24]. As shown in Figure 1, the root ‘kind’ determines the underlying meanings of ‘unkindly’ and ‘unkindness’. The affixes ‘un’, ‘ly’, and ‘ness’ grammatically refer to negation, adverbs, and nouns, respectively. In MorphTE, using these meaningful morphemes to generate word embeddings explicitly injects prior semantic and grammatical knowledge into the learning of word embeddings. (2) As shown in Table 1, linguistic phenomena such as inflection and derivation in word formation make morphologically similar words often semantically related. In MorphTE, these similar words can be connected by sharing the same morpheme vectors.
MorphTE only needs to train and store morpheme vectors, which are smaller in embedding size and vocabulary size than original word embeddings, leading to fewer parameters. For example, a word embedding of size 512 can be generated using three morpheme vectors of size 8 via tensor products. In addition, since morphemes are the basic units of words, the size of the morpheme vocabulary is smaller than the size of the word vocabulary. To sum up, MorphTE can learn high-quality and space-efficient word embeddings, combining the prior knowledge of morphology and the compression ability of tensor products.
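As a back-of-the-envelope check of this example (simple arithmetic; the per-word count ignores any rank factor that tensorized embeddings may additionally use):

$$8 \times 8 \times 8 = 512, \qquad 3 \times 8 = 24,$$

so three 8-dimensional morpheme vectors span a 512-dimensional embedding, and per-word storage drops from 512 to 24 coefficients, with further savings because morpheme vectors are shared across words and the morpheme vocabulary is smaller than the word vocabulary.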
We conducted comparative experiments on machine translation, retrieval-based question answering, and natural language inference tasks. Our proposed MorphTE achieves better model performance on these tasks compared to related word embedding compression methods. Compared with Word2ket, MorphTE achieves improvements of 0.7, 0.6, and 0.6 BLEU scores on the De-En, En-It, and En-Ru datasets, respectively. In addition, on 4 translation datasets in different languages, our method maintains the original performance while compressing the number of word embedding parameters by more than 20 times and reducing the proportion of word embeddings in the total parameters from approximately 30% to 2%, whereas other compression methods hurt performance.
The main contributions of our work can be summarized as follows:

• We propose MorphTE, a novel compression method for word embeddings using the form of entangled tensors with morphology. The combination of morphemes and the tensor product can compress word embeddings in terms of both vocabulary and embedding size.

• MorphTE introduces prior semantic knowledge into the learning of word embeddings from a fine-grained morpheme perspective, and explicitly models the connections between words by sharing morpheme vectors. These enable it to learn high-quality compressed embeddings.

• Experiments on multiple languages and tasks show that MorphTE can compress word embedding parameters over 20 times without hurting the original performance.
2 Related Work
Morphologically-augmented Embeddings. Related works [23, 4, 5, 29, 2, 10] propose to improve the quality of word embeddings by integrating morphological information. Representing word embeddings as the sum of morpheme and surface-form vectors has been employed in several studies [5, 29, 2]. Morphological RNNs [5] learn word representations using morphemes as the units of recursive neural networks [33]. Our proposed MorphTE also utilizes morpheme information and is a decomposition-based word embedding compression method, similar to Word2ket [27].
Decomposition-based Compression Embeddings. Decomposition-based methods are either based on low-rank matrix factorization [18, 9, 1, 21, 19] or tensor decomposition [15, 41, 27, 34]. Based on low-rank matrix factorization, ALBERT [18] simply approximates the embedding matrix by the product of two small matrices. GroupReduce [9] and DiscBlock [19] perform a more fine-grained matrix factorization: they first block the word embedding matrix according to word frequency and then approximate each block. Notably, methods based on matrix factorization have a low-rank bottleneck, and their expressive ability is limited under high compression ratios [35]. As for tensor decomposition, TT embeddings [15] uses the Tensor Train decomposition [25] to approximate the embedding matrix with several 2-order and 3-order tensors. TT-Rec [41] improves TT embeddings in terms of implementation and initialization to fit the recommendation scenario. Word2ket [27] represents a word embedding as an entangled tensor via multiple small vectors; it essentially exploits Canonical Polyadic decomposition [16, 17]. Word2ketXS [27] is similar to Word2ket, but it compresses embeddings from the perspective of all words rather than individual words. In addition, KroneckerBERT [34] uses Kronecker decomposition to compress the word embeddings, and the form of Kronecker embeddings is consistent with Word2ket [27] with an order of 2. Unlike these compression methods, our MorphTE utilizes meaningful morphemes as the basic units for generating word embeddings, rather than vectors or tensors with no specific meaning.
3 Preliminary
3.1 Tensor Product Space and Entangled Tensors
A tensor product space of two separable Hilbert spaces $V$ and $W$ is also a separable Hilbert space $H$, which is denoted as $H = V \otimes W$. Suppose $\{\psi_1, \ldots, \psi_g\}$ and $\{\phi_1, \ldots, \phi_h\}$ are the orthonormal bases of $V$ and $W$, respectively. The tensor product of the vectors $c = \sum_{j=1}^{g} c_j \psi_j \in V$ and $e = \sum_{k=1}^{h} e_k \phi_k \in W$ is defined as follows:

$$c \otimes e = \Big(\sum_{j=1}^{g} c_j \psi_j\Big) \otimes \Big(\sum_{k=1}^{h} e_k \phi_k\Big) = \sum_{j=1}^{g} \sum_{k=1}^{h} c_j e_k\, \psi_j \otimes \phi_k \qquad (1)$$

The set $\{\psi_j \otimes \phi_k\}_{jk}$ forms an orthonormal basis of $H$, and the dimensionality of $H$ is the product ($gh$) of the dimensionalities of $V$ and $W$. This tensor product operation can be simplified as the product of the corresponding coefficients as follows:

$$c \otimes e = [c_1, c_2, \ldots, c_g] \otimes [e_1, e_2, \ldots, e_h] = [c_1 e_1, c_1 e_2, \ldots, c_1 e_h, \ldots, c_g e_1, c_g e_2, \ldots, c_g e_h] \qquad (2)$$
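To make the coefficient form of Eq. (2) concrete, the following minimal NumPy sketch (ours, not from the paper; the sizes g = 3 and h = 4 are arbitrary) checks that it coincides with the Kronecker product of the coefficient vectors:

```python
import numpy as np

g, h = 3, 4
c = np.random.randn(g)        # coefficients of c in the basis {psi_j} of V
e = np.random.randn(h)        # coefficients of e in the basis {phi_k} of W

# Coefficient form of the tensor product, Eq. (2): the Kronecker product
ce = np.kron(c, e)            # [c1*e1, c1*e2, ..., c1*eh, ..., cg*e1, ..., cg*eh]
assert ce.shape == (g * h,)   # dim(V ⊗ W) = g * h

# Explicit double sum over basis coefficients, as in Eq. (1)
manual = np.array([c[j] * e[k] for j in range(g) for k in range(h)])
assert np.allclose(ce, manual)
```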
The cumulative tensor product space of the following form is said to have a tensor order of $n$, and the dimensionality of the cumulative tensor product space is the cumulative product of its subspace dimensionalities. See Appendix B for concrete examples of the cumulative tensor product of multiple vectors.

$$\bigotimes_{j=1}^{n} H_j = H_1 \otimes H_2 \otimes \cdots \otimes H_n \qquad (3)$$

Considering the $n$-order tensor product space $\bigotimes_{j=1}^{n} H_j$, vectors of the form $v = \bigotimes_{j=1}^{n} v_j$, where $v_j \in H_j$, are called simple tensors. In addition, vectors that need to be represented as the sum of multiple simple tensors are called entangled tensors. The tensor rank of a vector $v$ is the smallest number of simple tensors that sum up to $v$.
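As a standard illustration (ours, not from the paper), take $H_1 = H_2 = \mathbb{R}^2$ with orthonormal basis $\{\psi_1, \psi_2\}$. Then

$$\psi_1 \otimes \psi_1 + \psi_1 \otimes \psi_2 = \psi_1 \otimes (\psi_1 + \psi_2)$$

is a simple tensor (rank 1), whereas

$$\psi_1 \otimes \psi_1 + \psi_2 \otimes \psi_2$$

cannot be rewritten as a single tensor product and is therefore an entangled tensor of rank 2.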
3.2 Tensorized Embeddings with Tensor Product
Figure 2: Word2ket embedding with a rank of r = 2 and an order of n = 3 (the word embedding v is the sum of the tensor products v11 ⊗ v21 ⊗ v31 and v12 ⊗ v22 ⊗ v32).
Tensor products have been introduced to learn parameter-efficient word embeddings in KroneckerBERT [34] and Word2ket [27]. As shown in Figure 2, Word2ket [27] represents the embedding $v \in \mathbb{R}^d$ of a word as an entangled tensor of rank $r$ and order $n$ as follows:

$$v = \sum_{k=1}^{r} \bigotimes_{j=1}^{n} v_{jk} \qquad (4)$$

where $v_{jk} \in \mathbb{R}^q$ and $v \in \mathbb{R}^{q^n}$. Word2ket only needs to store and use these small vectors $v_{jk}$ to generate a large word embedding. If $q^n > d$, the excess part of the generated embedding is cut off. Therefore, setting $q^n = d$ avoids the waste of parameters caused by clipping, and the number of embedding parameters for a word is reduced from $d$ to $rn\sqrt[n]{d}$. For example, when $d = 512$, $q = 8$, $n = 3$, and $r = 2$, the number of parameters of a word embedding can be reduced from 512 to 48.
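The following minimal NumPy sketch (ours, not the authors' implementation; the random initialization is an assumption) reproduces the parameter count of this example by generating a 512-dimensional embedding from r·n·q = 48 stored values per word via Eq. (4):

```python
import numpy as np

d, q, n, r = 512, 8, 3, 2                  # q**n == d, so no clipping is needed
params = np.random.randn(r, n, q) * 0.1    # r * n * q = 48 stored values per word

def word2ket_embedding(p):
    """Sum of r cumulative tensor products of n small q-dim vectors, Eq. (4)."""
    v = np.zeros(p.shape[2] ** p.shape[1])
    for k in range(p.shape[0]):            # sum over rank r
        cum = p[k, 0]
        for j in range(1, p.shape[1]):     # cumulative tensor product over order n
            cum = np.kron(cum, p[k, j])
        v += cum
    return v

v = word2ket_embedding(params)
print(v.shape, params.size)                # (512,) 48 -> roughly 10.7x fewer parameters
```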
4 Methodology: Morphologically-enhanced Tensorized Embeddings
In this section, we first discuss the rationale for introducing morphology into embedding compression. Then, we propose MorphTE, a morphologically-enhanced word embedding compression method based on the tensor product. Finally, we show the detailed workflow of MorphTE.
4.1 Motivation to Introduce Morphology
To achieve compression, existing decomposition-based word embedding compression methods [18, 15] use a series of small vectors or tensors to generate large word embeddings, as shown in Figure 2. These methods are not only uninterpretable, as their small tensors have no specific meaning [15, 27], but also lack lexical knowledge. We argue that in resource-limited scenarios like compression, knowledge injection is much more critical than in common scenarios. With a significant number of parameters, models in common scenarios can more easily learn such knowledge implicitly in a data-driven way, which is one of the objectives of neural networks. In compression, however, it is more beneficial to inject explicit knowledge to compensate for the inferiority in parameter scale, underscoring the importance of knowledge injection in compression.

From a reductionist point of view, words might not be the smallest units for some languages; for example, unfeelingly can be separated into four meaningful parts [un, feel, ing, ly], a.k.a. morphemes². By using a limited number of morphemes, one can extend a given core vocabulary exponentially by composing morphemes into new words according to the rules of word formation in Table 1. The adoption of morphemes largely reduces the memory burden and therefore facilitates the learning of words for humans. We hypothesize that morphology also helps word representation in neural networks, especially in resource-limited scenarios like compression.
4.2 Definition of MorphTE
Considering the above analysis, we propose to inject morphological knowledge to achieve high-quality and space-efficient word embeddings. Suppose a word is segmented into $l$ morphemes $[m_1, m_2, \ldots, m_l]$ in their natural order. For example, the four-morpheme word unfeelingly is segmented as [un, feel, ing, ly]. We refer to $f_1(\cdot), f_2(\cdot), \cdots, f_r(\cdot): \mathbb{N} \rightarrow \mathbb{R}^q$ as $r$ different representation functions of morphemes³, selected from a parametric family $\mathcal{F} = \{f: \mathbb{N} \rightarrow \mathbb{R}^q\}$.
² This also holds for logograms, whose written symbols represent words instead of sounds. For example, the Chinese language has character components, a.k.a. radicals.
³ For example, such a function could be a morpheme embedding table, each vector of which is a $q$-sized vector, similar to word embeddings.
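To make the role of these representation functions concrete, here is a hedged sketch of how a MorphTE-style embedding might be assembled from the pieces defined so far: each $f_i$ is realized as a morpheme embedding table of $q$-dimensional vectors, and a word's morpheme vectors are entangled via tensor products and summed over the $r$ functions. The toy morpheme vocabulary, the padding of short words to a fixed order n, and the initialization are our assumptions, not the paper's exact procedure (which continues beyond this excerpt).

```python
import numpy as np

q, r, n = 8, 2, 3                              # morpheme vector size, rank, order (q**n = 512)
morpheme_vocab = {"un": 0, "kind": 1, "ly": 2, "ness": 3, "<pad>": 4}        # toy vocabulary
tables = [np.random.randn(len(morpheme_vocab), q) * 0.1 for _ in range(r)]   # f_1, ..., f_r

def morphte_embedding(morphemes):
    # Pad or truncate the morpheme sequence to a fixed order n (an assumption).
    ids = [morpheme_vocab[m] for m in morphemes][:n]
    ids += [morpheme_vocab["<pad>"]] * (n - len(ids))
    v = np.zeros(q ** n)
    for i in range(r):                         # sum over the r representation functions
        cum = tables[i][ids[0]]
        for idx in ids[1:]:                    # entangle morpheme vectors via tensor products
            cum = np.kron(cum, tables[i][idx])
        v += cum
    return v

print(morphte_embedding(["un", "kind", "ly"]).shape)   # (512,)
```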