from high computational costs to fit large amounts
of high-quality data, which may prevent their use
in broader downstream scenarios.
In this paper, we propose a set of concepts and
similarities to exploit phrase semantics in the
unsupervised setup. Our contributions are fourfold:
Unified formulation
We unify three types of unsupervised STS models (AC (Arora et al., 2017), OT (Yokoi et al., 2020), and TK (Le et al., 2018)) by the EC similarity in Section 3. The EC similarity uncovers the strengths and weaknesses of the three approaches.
Phrase vectors and their alignment
We generalize the idea of word alignment to phrase alignment in Section 4. After formally defining the Recursive Phrase Partition (RPP), we compose phrase weights and vectors from those of finer-grained partitions under the invariant additive phrase composition (a minimal sketch of this composition follows the contribution list), and thereby generalize word alignment to phrase alignment. Empirical observations show that the EC similarity is an effective formulation to interpolate between existing unsupervised STS approaches and yields better performance.
Recursive Optimal Transport
We propose the Recursive Optimal Transport Similarity (ROTS) in Section 5, based on the phrase alignment introduced in Section 4. ROTS computes the EC similarity at each phrase partition level and ensembles them. Notably, Prior Optimal Transport (Prior OT) is adopted to guide the finer-grained phrase alignment by the coarser-grained phrase alignment at each expectation step of the EC similarity.
Extensive experiments
We show the comprehensive performance of ROTS on a wide spectrum of experimental settings in Section 6 and the Appendix, including 29 STS tasks, five types of word vectors, and three typical preprocessing setups. Specifically, ROTS outperforms all other unsupervised approaches, including BERT-based STS, in terms of both effectiveness and efficiency. Detailed ablation studies also show that our constructive definitions are essential and that the hyper-parameters can be easily chosen to obtain the new SOTA performance.
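For illustration, the following minimal sketch (our own toy example, not the exact formulation of Sections 4 and 5) shows the additive phrase composition referenced above: phrase vectors and weights at a coarser partition level are composed by summing those of the words (or finer phrases) they cover. The function and variable names, the partition spans, and the uniform word weights are placeholder assumptions.

```python
# Minimal illustrative sketch of additive phrase composition over a recursive
# phrase partition. All names (compose_level, levels, weights) are hypothetical
# placeholders, not the paper's implementation.
import numpy as np

def compose_level(word_vecs, word_weights, partition):
    """Compose phrase vectors/weights for one partition level.

    partition: list of (start, end) index spans covering the sentence,
               e.g. [(0, 2), (2, 4)] splits a 4-word sentence into 2 phrases.
    Phrase vector = sum of word vectors in the span (additive composition);
    phrase weight = sum of the corresponding word weights.
    """
    phrase_vecs, phrase_weights = [], []
    for start, end in partition:
        phrase_vecs.append(word_vecs[start:end].sum(axis=0))
        phrase_weights.append(word_weights[start:end].sum())
    return np.stack(phrase_vecs), np.asarray(phrase_weights)

# Toy example: a 4-word sentence with 3-dimensional random "embeddings".
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(4, 3))
word_weights = np.ones(4) / 4.0

# A toy recursive phrase partition: the whole sentence at the coarsest level,
# then a finer level with two phrases.
levels = [[(0, 4)], [(0, 2), (2, 4)]]
for partition in levels:
    vecs, weights = compose_level(word_vecs, word_weights, partition)
    print(partition, vecs.shape, weights)
```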
2 Related Work
Embedding symbolic words into a continuous space to represent their semantics (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) is one of the breakthroughs of modern NLP. Notably, it has been shown that the vector (or semantics) of a phrase can be approximated by the additive composition of the vectors of its constituent words (Mikolov et al., 2013). Thus, word embeddings can be further utilized to describe the semantics of texts beyond the word level. Several strategies have been proposed to provide sentence embeddings.
Additive Composition.
Additive composition of word vectors (Arora et al., 2017) forms effective sentence embeddings. The cosine similarity between such sentence embeddings has been shown to be a stronger STS measure, under both transfer (Wieting et al., 2016; Wieting and Gimpel, 2018) and unsupervised settings (Arora et al., 2017; Ethayarajh, 2018), than most deep learning approaches (Socher et al., 2013; Le and Mikolov, 2014; Kiros et al., 2015; Tai et al., 2015).
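As a concrete reference, the sketch below computes an additive-composition similarity: each sentence embedding is a weighted average of word vectors with the smooth inverse frequency weighting a/(a + p(w)) of Arora et al. (2017), and the similarity is the cosine between the two embeddings. The common-component removal step of the original method is omitted, and the toy vectors and unigram probabilities are placeholders.

```python
# Sketch of an additive-composition STS baseline in the spirit of
# Arora et al. (2017); word vectors and unigram probabilities are toy inputs,
# and the common-component removal of the original method is omitted.
import numpy as np

def sif_embedding(words, vecs, unigram_prob, a=1e-3):
    """Weighted average of word vectors with weight a / (a + p(w))."""
    weights = np.array([a / (a + unigram_prob.get(w, 1e-5)) for w in words])
    stacked = np.stack([vecs[w] for w in words])
    return (weights[:, None] * stacked).sum(0) / weights.sum()

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vocabulary with random 5-dimensional vectors and made-up frequencies.
rng = np.random.default_rng(1)
vecs = {w: rng.normal(size=5) for w in ["a", "dog", "runs", "cat", "sleeps"]}
p = {"a": 0.05, "dog": 0.001, "runs": 0.002, "cat": 0.001, "sleeps": 0.002}

s1, s2 = ["a", "dog", "runs"], ["a", "cat", "sleeps"]
print(cosine(sif_embedding(s1, vecs, p), sif_embedding(s2, vecs, p)))
```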
Optimal Transport.
By viewing sentences as distributions over word embeddings, the similarity between a pair of sentences is obtained from the optimal transport between the two distributions (Kusner et al., 2015; Huang et al., 2016; Wu et al., 2018; Yokoi et al., 2020). OT models find the optimal alignment with respect to word semantics via their embeddings and achieve SOTA performance (Yokoi et al., 2020).
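The sketch below illustrates this OT view: word-vector norms serve as transport mass and cosine distance as transport cost, in the spirit of Word Rotator's Distance (Yokoi et al., 2020). The small Sinkhorn solver, the regularization strength, and the random inputs are our own simplifications, not the cited implementations.

```python
# Illustrative OT-based sentence similarity: mass from vector norms, cost from
# cosine distance (in the spirit of Yokoi et al., 2020), solved approximately
# with a few Sinkhorn iterations. A simplification, not the cited code.
import numpy as np

def sinkhorn_plan(a, b, cost, reg=0.1, n_iter=200):
    """Entropy-regularized OT plan between histograms a and b."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def ot_similarity(X, Y):
    """X, Y: word-embedding matrices of the two sentences."""
    a = np.linalg.norm(X, axis=1); a = a / a.sum()   # mass: vector norms
    b = np.linalg.norm(Y, axis=1); b = b / b.sum()
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cost = 1.0 - Xn @ Yn.T                           # cosine distance
    plan = sinkhorn_plan(a, b, cost)
    return 1.0 - float((plan * cost).sum())          # higher = more similar

rng = np.random.default_rng(2)
print(ot_similarity(rng.normal(size=(3, 5)), rng.normal(size=(4, 5))))
```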
Syntax Information.
One possible way to integrate contextual information in a sentence is to explicitly employ syntactic information. Recursive neural networks (Socher et al., 2013) were proposed to exploit tree structures in the supervised setting but performed worse than AC-based STS.
Meanwhile, tree kernels (Moschitti, 2006; Croce et al., 2011) can measure the similarity between parse trees. Most recently, ACV-tree kernels (Le et al., 2018) combine word embedding similarities with parsed constituency labels. However, tree kernels compare all the sub-trees and suffer from high computational complexity.
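To show why comparing all sub-trees is costly, the sketch below runs a classic convolution-style subtree-matching recursion over every pair of nodes in two toy parse trees. It is only a schematic illustration of the general idea, not the partial tree kernel of Moschitti (2006) or the ACV-tree kernel of Le et al. (2018); the tree construction and node type are our own.

```python
# Schematic subtree-matching recursion over toy parse trees, illustrating the
# pairwise sub-tree comparison behind tree kernels. Simplified for exposition;
# not the kernels of Moschitti (2006) or Le et al. (2018).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def production(n: Node):
    return (n.label, tuple(c.label for c in n.children))

def match(n1: Node, n2: Node) -> float:
    """Count common subtrees rooted at n1 and n2 (simplified recursion)."""
    if production(n1) != production(n2):
        return 0.0
    if not n1.children:          # matching leaves
        return 1.0
    score = 1.0
    for c1, c2 in zip(n1.children, n2.children):
        score *= 1.0 + match(c1, c2)
    return score

def tree_kernel(t1: Node, t2: Node) -> float:
    """Sum match scores over all node pairs: O(|t1| * |t2|) comparisons."""
    def nodes(t):
        yield t
        for c in t.children:
            yield from nodes(c)
    return sum(match(a, b) for a in nodes(t1) for b in nodes(t2))

# Toy constituency-like trees: (S (NP dog) (VP runs)) vs. (S (NP cat) (VP runs))
t1 = Node("S", [Node("NP", [Node("dog")]), Node("VP", [Node("runs")])])
t2 = Node("S", [Node("NP", [Node("cat")]), Node("VP", [Node("runs")])])
print(tree_kernel(t1, t2))
```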
Pretrained Language Models.
This paradigm produces contextualized sentence embeddings by repeatedly aggregating word embeddings with deep neural networks (Vaswani et al., 2017) trained on large corpora (Devlin et al., 2019). In the unsupervised setting, PLMs are sub-optimal compared to SOTA OT-based models (Yokoi et al., 2020).
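For completeness, a common way to obtain such unsupervised PLM sentence embeddings is mean pooling over BERT's last hidden layer followed by cosine similarity, as sketched below. The model name and pooling choice are our assumptions about a typical baseline, not necessarily the exact configuration evaluated in this paper.

```python
# Sketch of an unsupervised BERT-based STS baseline: mean pooling over the
# last hidden layer, then cosine similarity. An assumed typical setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bert_sentence_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)    # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)      # mean pooling

def cosine(u: torch.Tensor, v: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(u, v).item()

u = bert_sentence_embedding("a dog runs in the park")
v = bert_sentence_embedding("a cat sleeps on the sofa")
print(cosine(u, v))
```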