MorphTE: Injecting Morphology in Tensorized Embeddings

Guobing Gan1, Peng Zhang1*, Sunzhu Li1, Xiuqing Lu1, and Benyou Wang2
1College of Intelligence and Computing, Tianjin University, Tianjin, China
2School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
{ganguobing,pzhang,lisunzhu,lvxiuqing}@tju.edu.cn, wangbenyou@cuhk.edu.cn
Abstract
In the era of deep learning, word embeddings are essential when dealing with text tasks. However, storing and accessing these embeddings requires a large amount of space. This is not conducive to the deployment of these models on resource-limited devices. Combining the powerful compression capability of tensor products, we propose a word embedding compression method with morphological augmentation, Morphologically-enhanced Tensorized Embeddings (MorphTE). A word consists of one or more morphemes, the smallest units that bear meaning or have a grammatical function. MorphTE represents a word embedding as an entangled form of its morpheme vectors via the tensor product, which injects prior semantic and grammatical knowledge into the learning of embeddings. Furthermore, the dimensionality of the morpheme vector and the number of morphemes are much smaller than those of words, which greatly reduces the parameters of the word embeddings. We conduct experiments on tasks such as machine translation and question answering. Experimental results on four translation datasets of different languages show that MorphTE can compress word embedding parameters by about 20 times without performance loss and significantly outperforms related embedding compression methods.
1 Introduction
The word embedding layer is a key component of neural network models in natural language processing (NLP). It uses an embedding matrix to map each word into a dense real-valued vector. However, when the vocabulary size and word embedding size (dimensionality) are large, the word embedding matrix requires a large number of parameters. For example, the One Billion Word language modeling task [8] has a vocabulary size (|V|) of around 800K, and the embedding size (d) can range from 300 to 1024 [30, 11, 22]. Storing and accessing the |V| × d embedding matrix requires a large amount of disk and memory space, which limits the deployment of these models on devices with limited resources. To resolve this issue, many studies compress embedding layers [32, 15, 27]. They can be roughly divided into two lines: product quantization-based and decomposition-based methods. The product quantization-based methods [32, 20, 36] mainly utilize compositional coding to construct word embeddings with fewer parameters, and they need to introduce an additional task to learn a compact code for each word.
The decomposition-based word embedding compression methods are mostly based on low-rank matrix factorization [18, 9] and tensor decomposition [15, 27]. Utilizing low-rank matrix factorization, ALBERT [18] replaces the embedding matrix with the product of two small matrices. Inspired by quantum entanglement, Word2ket and Word2ketXS embeddings were proposed [27].
*Corresponding Author
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.15379v1 [cs.CL] 27 Oct 2022
Figure 1: Morphemes of "unkindly" and "unkindness" (both share the root "kind", with affixes "un", "ly", and "ness").
Table 1: Phenomena in word formation.

Phenomenon    Example
Inflection    cook+s, cook+ed, cook+ing; cold+er, cold+est
Derivation    un+like, un+like+ly; im+poss+ible, im+poss+ibly
Compounding   police+man, post+man; cuttle+fish, gold+fish
Specifically, Word2ket represents a word embedding as an entangled tensor of multiple small vectors (tensors) via the tensor product. The entangled form is essentially consistent with Canonical Polyadic decomposition [16]. Decomposition-based methods approximate the original large word embedding matrix with multiple small matrices and tensors. However, these small matrices or tensors have no specific meaning and lack interpretability, and the approximate substitution with them often hurts model performance in complicated NLP tasks such as machine translation [15, 27].
In this study, we focus on high-quality compressed word embeddings. To this end, we propose Morphologically-enhanced Tensorized Embeddings (MorphTE), which injects morphological knowledge into tensorized embeddings. Specifically, MorphTE models the embedding of a word as the entangled form of its morpheme vectors via tensor products. Notably, the quality of word embeddings can be improved by fine-grained morphemes, which has been verified in the literature [4, 5]. The benefits of introducing the morphology of morphemes in MorphTE can be summed up in two points. (1) A word consists of morphemes, which are considered to be the smallest meaning-bearing or grammatical units of a language [24]. As shown in Figure 1, the root ‘kind’ determines the underlying meanings of ‘unkindly’ and ‘unkindness’. The affixes ‘un’, ‘ly’, and ‘ness’ grammatically refer to negation, adverbs, and nouns, respectively. In MorphTE, using these meaningful morphemes to generate word embeddings explicitly injects prior semantic and grammatical knowledge into the learning of word embeddings. (2) As shown in Table 1, linguistic phenomena such as inflection and derivation in word formation make morphologically similar words often semantically related. In MorphTE, these similar words can be connected by sharing the same morpheme vectors.
MorphTE only needs to train and store morpheme vectors, which are smaller in embedding size and vocabulary size than original word embeddings, leading to fewer parameters. For example, a word embedding of size 512 can be generated using three morpheme vectors of size 8 via tensor products. In addition, since morphemes are the basic units of words, the size of the morpheme vocabulary is smaller than the size of the word vocabulary. To sum up, MorphTE can learn high-quality and space-efficient word embeddings, combining the prior knowledge of morphology and the compression ability of tensor products.
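As a back-of-the-envelope check of this example (simple arithmetic; the per-word count ignores any rank factor that tensorized embeddings may additionally use):

$$8 \times 8 \times 8 = 512, \qquad 3 \times 8 = 24,$$

so three 8-dimensional morpheme vectors span a 512-dimensional embedding, and per-word storage drops from 512 to 24 coefficients, with further savings because morpheme vectors are shared across words and the morpheme vocabulary is smaller than the word vocabulary.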
We conducted comparative experiments on machine translation, retrieval-based question answering, and natural language inference tasks. Our proposed MorphTE achieves better model performance on these tasks compared to related word embedding compression methods. Compared with Word2ket, MorphTE achieves improvements of 0.7, 0.6, and 0.6 BLEU scores on the De-En, En-It, and En-Ru datasets, respectively. In addition, on 4 translation datasets in different languages, our method maintains the original performance while compressing the number of word embedding parameters by more than 20 times and reducing the proportion of word embeddings in the total parameters from approximately 30% to 2%, whereas other compression methods hurt performance.
The main contributions of our work can be summarized as follows:

• We propose MorphTE, a novel compression method for word embeddings using the form of entangled tensors with morphology. The combination of morphemes and the tensor product can compress word embeddings in terms of both vocabulary and embedding size.

• MorphTE introduces prior semantic knowledge into the learning of word embeddings from a fine-grained morpheme perspective, and explicitly models the connections between words by sharing morpheme vectors. These enable it to learn high-quality compressed embeddings.

• Experiments on multiple languages and tasks show that MorphTE can compress word embedding parameters over 20 times without hurting the original performance.
2 Related Work
Morphologically-augmented Embeddings. Related works [23, 4, 5, 29, 2, 10] propose to improve the quality of word embeddings by integrating morphological information. Representing word embeddings as the sum of morpheme and surface-form vectors has been employed in several studies [5, 29, 2]. Morphological RNNs [5] learn word representations using morphemes as the units of recursive neural networks [33]. Our proposed MorphTE also utilizes morpheme information and is a decomposition-based word embedding compression method, similar to Word2ket [27].
Decomposition-based Compression Embeddings. Decomposition-based methods are either based on low-rank matrix factorization [18, 9, 1, 21, 19] or tensor decomposition [15, 41, 27, 34]. Based on low-rank matrix factorization, ALBERT [18] simply approximates the embedding matrix by the product of two small matrices. GroupReduce [9] and DiscBlock [19] perform a more fine-grained matrix factorization: they first block the word embedding matrix according to word frequency and then approximate each block. Notably, methods based on matrix factorization have a low-rank bottleneck, and their expressive ability is limited under high compression ratios [35]. As for tensor decomposition, TT embeddings [15] uses the Tensor Train decomposition [25] to approximate the embedding matrix with several 2-order and 3-order tensors. TT-Rec [41] improves TT embeddings in terms of implementation and initialization to fit the recommendation scenario. Word2ket [27] represents a word embedding as an entangled tensor via multiple small vectors; it essentially exploits Canonical Polyadic decomposition [16, 17]. Word2ketXS [27] is similar to Word2ket, but it compresses embeddings from the perspective of all words rather than individual words. In addition, KroneckerBERT [34] uses Kronecker decomposition to compress the word embeddings, and the form of Kronecker embeddings is consistent with Word2ket [27] with an order of 2. Unlike these compression methods, our MorphTE utilizes meaningful morphemes as the basic units for generating word embeddings, rather than vectors or tensors with no specific meaning.
3 Preliminary
3.1 Tensor Product Space and Entangled Tensors
A tensor product space of two separable Hilbert spaces $V$ and $W$ is also a separable Hilbert space $H$, which is denoted as $H = V \otimes W$. Suppose $\{\psi_1, \ldots, \psi_g\}$ and $\{\phi_1, \ldots, \phi_h\}$ are the orthonormal bases of $V$ and $W$, respectively. The tensor product of the vectors $c = \sum_{j=1}^{g} c_j \psi_j \in V$ and $e = \sum_{k=1}^{h} e_k \phi_k \in W$ is defined as follows:

$$c \otimes e = \Big(\sum_{j=1}^{g} c_j \psi_j\Big) \otimes \Big(\sum_{k=1}^{h} e_k \phi_k\Big) = \sum_{j=1}^{g} \sum_{k=1}^{h} c_j e_k\, \psi_j \otimes \phi_k \qquad (1)$$

The set $\{\psi_j \otimes \phi_k\}_{jk}$ forms an orthonormal basis of $H$, and the dimensionality of $H$ is the product ($gh$) of the dimensionalities of $V$ and $W$. This tensor product operation can be simplified as the product of the corresponding coefficients as follows:

$$c \otimes e = [c_1, c_2, \ldots, c_g] \otimes [e_1, e_2, \ldots, e_h] = [c_1 e_1, c_1 e_2, \ldots, c_1 e_h, \ldots, c_g e_1, c_g e_2, \ldots, c_g e_h] \qquad (2)$$
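To make the coefficient form of Eq. (2) concrete, the following minimal NumPy sketch (ours, not from the paper; the sizes g = 3 and h = 4 are arbitrary) checks that it coincides with the Kronecker product of the coefficient vectors:

```python
import numpy as np

g, h = 3, 4
c = np.random.randn(g)        # coefficients of c in the basis {psi_j} of V
e = np.random.randn(h)        # coefficients of e in the basis {phi_k} of W

# Coefficient form of the tensor product, Eq. (2): the Kronecker product
ce = np.kron(c, e)            # [c1*e1, c1*e2, ..., c1*eh, ..., cg*e1, ..., cg*eh]
assert ce.shape == (g * h,)   # dim(V ⊗ W) = g * h

# Explicit double sum over basis coefficients, as in Eq. (1)
manual = np.array([c[j] * e[k] for j in range(g) for k in range(h)])
assert np.allclose(ce, manual)
```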
The cumulative tensor product space of the following form is said to have a tensor order of $n$, and the dimensionality of the cumulative tensor product space is the cumulative product of its subspace dimensionalities. See Appendix B for concrete examples of the cumulative tensor product of multiple vectors.

$$\bigotimes_{j=1}^{n} H_j = H_1 \otimes H_2 \otimes \cdots \otimes H_n \qquad (3)$$

Considering the $n$-order tensor product space $\bigotimes_{j=1}^{n} H_j$, vectors of the form $v = \bigotimes_{j=1}^{n} v_j$, where $v_j \in H_j$, are called simple tensors. In addition, vectors that need to be represented as the sum of multiple simple tensors are called entangled tensors. The tensor rank of a vector $v$ is the smallest number of simple tensors that sum up to $v$.
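As a standard illustration (ours, not from the paper), take $H_1 = H_2 = \mathbb{R}^2$ with orthonormal basis $\{\psi_1, \psi_2\}$. Then

$$\psi_1 \otimes \psi_1 + \psi_1 \otimes \psi_2 = \psi_1 \otimes (\psi_1 + \psi_2)$$

is a simple tensor (rank 1), whereas

$$\psi_1 \otimes \psi_1 + \psi_2 \otimes \psi_2$$

cannot be rewritten as a single tensor product and is therefore an entangled tensor of rank 2.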
3.2 Tensorized Embeddings with Tensor Product
Figure 2: Word2ket embedding with a rank of r = 2 and an order of n = 3 (the word embedding v is the sum of the tensor products v11 ⊗ v21 ⊗ v31 and v12 ⊗ v22 ⊗ v32).
Tensor products have been introduced to learn parameter-efficient word embeddings in KroneckerBERT [34] and Word2ket [27]. As shown in Figure 2, Word2ket [27] represents the embedding $v \in \mathbb{R}^d$ of a word as an entangled tensor of rank $r$ and order $n$ as follows:

$$v = \sum_{k=1}^{r} \bigotimes_{j=1}^{n} v_{jk} \qquad (4)$$

where $v_{jk} \in \mathbb{R}^q$ and $v \in \mathbb{R}^{q^n}$. Word2ket only needs to store and use these small vectors $v_{jk}$ to generate a large word embedding. If $q^n > d$, the excess part of the generated embedding is cut off. Therefore, setting $q^n = d$ avoids the waste of parameters caused by clipping, and the number of embedding parameters for a word is reduced from $d$ to $rn\sqrt[n]{d}$. For example, when $d = 512$, $q = 8$, $n = 3$, and $r = 2$, the number of parameters of a word embedding can be reduced from 512 to 48.
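The following minimal NumPy sketch (ours, not the authors' implementation; the random initialization is an assumption) reproduces the parameter count of this example by generating a 512-dimensional embedding from r·n·q = 48 stored values per word via Eq. (4):

```python
import numpy as np

d, q, n, r = 512, 8, 3, 2                  # q**n == d, so no clipping is needed
params = np.random.randn(r, n, q) * 0.1    # r * n * q = 48 stored values per word

def word2ket_embedding(p):
    """Sum of r cumulative tensor products of n small q-dim vectors, Eq. (4)."""
    v = np.zeros(p.shape[2] ** p.shape[1])
    for k in range(p.shape[0]):            # sum over rank r
        cum = p[k, 0]
        for j in range(1, p.shape[1]):     # cumulative tensor product over order n
            cum = np.kron(cum, p[k, j])
        v += cum
    return v

v = word2ket_embedding(params)
print(v.shape, params.size)                # (512,) 48 -> roughly 10.7x fewer parameters
```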
4 Methodology: Morphologically-enhanced Tensorized Embeddings
In this section, we first discuss the rationale for introducing morphology into embedding compression. Then, we propose MorphTE, a morphologically-enhanced word embedding compression method based on the tensor product. Finally, we show the detailed workflow of MorphTE.
4.1 Motivation to Introduce Morphology
To achieve compression, existing decomposition-based word embedding compression methods [18, 15] use a series of small vectors or tensors to generate large word embeddings, as shown in Figure 2. These methods are not only uninterpretable, as their small tensors have no specific meaning [15, 27], but also lack lexical knowledge. We argue that in resource-limited scenarios like compression, knowledge injection is much more critical than in common scenarios. With a significant number of parameters, models in common scenarios can more easily learn such knowledge implicitly in a data-driven way, which is one of the objectives of neural networks. In compression, however, it is more beneficial to inject explicit knowledge to compensate for the inferiority in parameter scale, underscoring the importance of knowledge injection in compression.

From a reductionist point of view, words might not be the smallest units for some languages; for example, unfeelingly can be separated into four meaningful parts [un, feel, ing, ly], a.k.a. morphemes². By using a limited number of morphemes, one can extend a given core vocabulary exponentially by composing morphemes into new words according to the rules of word formation in Table 1. The adoption of morphemes largely reduces the memory burden and therefore facilitates the learning of words for humans. We hypothesize that morphology also helps word representation in neural networks, especially in resource-limited scenarios like compression.
4.2 Definition of MorphTE
Considering the above analysis, we propose to inject morphological knowledge to achieve high-quality and space-efficient word embeddings. Suppose a word is segmented into $l$ morphemes $[m_1, m_2, \ldots, m_l]$ in their natural order. For example, the four-morpheme word unfeelingly is segmented as [un, feel, ing, ly]. We refer to $f_1(\cdot), f_2(\cdot), \cdots, f_r(\cdot): \mathbb{N} \rightarrow \mathbb{R}^q$ as $r$ different representation functions of morphemes³, selected from a parametric family $\mathcal{F} = \{f: \mathbb{N} \rightarrow \mathbb{R}^q\}$.
² This also holds for logograms, whose written symbols represent words instead of sounds. For example, the Chinese language has character components, a.k.a. radicals.
³ For example, such a function could be a morpheme embedding table, each vector of which is a $q$-sized vector, similar to word embeddings.
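To make the role of these representation functions concrete, here is a hedged sketch of how a MorphTE-style embedding might be assembled from the pieces defined so far: each $f_i$ is realized as a morpheme embedding table of $q$-dimensional vectors, and a word's morpheme vectors are entangled via tensor products and summed over the $r$ functions. The toy morpheme vocabulary, the padding of short words to a fixed order n, and the initialization are our assumptions, not the paper's exact procedure (which continues beyond this excerpt).

```python
import numpy as np

q, r, n = 8, 2, 3                              # morpheme vector size, rank, order (q**n = 512)
morpheme_vocab = {"un": 0, "kind": 1, "ly": 2, "ness": 3, "<pad>": 4}        # toy vocabulary
tables = [np.random.randn(len(morpheme_vocab), q) * 0.1 for _ in range(r)]   # f_1, ..., f_r

def morphte_embedding(morphemes):
    # Pad or truncate the morpheme sequence to a fixed order n (an assumption).
    ids = [morpheme_vocab[m] for m in morphemes][:n]
    ids += [morpheme_vocab["<pad>"]] * (n - len(ids))
    v = np.zeros(q ** n)
    for i in range(r):                         # sum over the r representation functions
        cum = tables[i][ids[0]]
        for idx in ids[1:]:                    # entangle morpheme vectors via tensor products
            cum = np.kron(cum, tables[i][idx])
        v += cum
    return v

print(morphte_embedding(["un", "kind", "ly"]).shape)   # (512,)
```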