Unsupervised Sentence Textual Similarity
with Compositional Phrase Semantics
Zihao Wang
Department of CSE
HKUST
Hong Kong, China
zwanggc@cse.ust.hk
Jiaheng Dou and Yong Zhang
BNRist, RIIT, Institute of Internet Industry
Department of Computer Science and Technology
Tsinghua University, Beijing, China
djh19@mails.tsinghua.edu.cn
zhangyong05@tsinghua.edu.cn
Abstract
Measuring Sentence Textual Similarity (STS) is a classic task that can be applied to many downstream NLP applications such as text generation and retrieval. In this paper, we focus on unsupervised STS that works on various domains but only requires minimal data and computational resources. Theoretically, we propose a lightweight Expectation-Correction (EC) formulation for STS computation. The EC formulation unifies unsupervised STS approaches including the cosine similarity of Additively Composed (AC) sentence embeddings (Arora et al., 2017), Optimal Transport (OT) (Kusner et al., 2015), and Tree Kernels (TK) (Le et al., 2018). Moreover, we propose the Recursive Optimal Transport Similarity (ROTS) algorithm to capture compositional phrase semantics by composing multiple recursive EC formulations. ROTS runs in linear time and is both faster and empirically more effective and scalable than its predecessors. Extensive experiments on 29 STS tasks under various settings show the clear advantage of ROTS over existing approaches.¹ Detailed ablation studies demonstrate the effectiveness of our approaches.
1 Introduction
Sentence Textual Similarity (STS) measures the semantic equivalence between a pair of sentences, which is supposed to be consistent with human evaluation (Agirre et al., 2012). STS is also an effective sentence-level semantic measure for many downstream tasks such as text generation and retrieval (Wieting et al., 2019; Zhao et al., 2019; Nikolentzos et al., 2020; Çelikyilmaz et al., 2020). In this paper, we focus on unsupervised STS, which is expected to compare texts from various domains but only requires minimal data and computational resources.

*Corresponding author.
¹Our code can be found at https://github.com/zihao-wang/rots.
There are several typical ways to compute unsupervised STS: 1) treat each sentence as an embedding obtained by the Additive Composition (AC) (Arora et al., 2017) of word vectors, then estimate the STS of two sentences by their cosine similarity; 2) treat each sentence as a probability distribution over word vectors, then measure the distance between the distributions. Notably, Optimal Transport (OT) (Peyré and Cuturi, 2019)² is adopted to compute the STS (Kusner et al., 2015). OT-based approaches search for the best alignment with respect to word-level semantics and yield state-of-the-art solutions (Yokoi et al., 2020).

In this paper, we argue that phrase-level semantics should also be exploited to fully understand the sentences. For example, "optimal transport" should be considered as a mathematical term rather than two independent words. Specifically, a phrase chunk is composed of lower-level chunks and is usually represented as a node in a tree structure. The aforementioned AC and OT-based STS methods are too shallow to include such structures. Tree Kernels (TK) (Le et al., 2018) consider the parsed syntax labels. However, under our comparison experiments, they boil down to a syntax-based but sub-optimal word alignment.

Recent advances in Pretrained Language Models (PLMs) also demonstrate the importance of contextualization (Peters et al., 2018; Devlin et al., 2019; Ethayarajh, 2019). PLMs can be further adapted to STS tasks by supervised fine-tuning (Devlin et al., 2019), carefully designed transfer learning (Reimers and Gurevych, 2019), or domain adaptation (Li et al., 2020; Gao et al., 2021). Without those treatments, the performance of PLM-based STS is observed to be very poor (Yokoi et al., 2020). Meanwhile, PLM-based STS suffers from the high computational cost of fitting large amounts of high-quality data, which might prevent it from broader downstream scenarios.

²OT-based distance reflects the dissimilarity between sentences and can also be used as STS.
In this paper, we propose a set of concepts and similarities to exploit phrase semantics in the unsupervised setup. Our contributions are fourfold:

Unified formulation. We unify three types of unsupervised STS models (AC (Arora et al., 2017), OT (Yokoi et al., 2020), and TK (Le et al., 2018)) by the EC similarity in Section 3. EC similarity uncovers the strengths and weaknesses of the three approaches.

Phrase vectors and their alignment. We generalize the idea of word alignment to phrase alignment in Section 4. After the formal definition of Recursive Phrase Partition (RPP), we compose the phrase weights and vectors from those of finer-grained partitions under the invariant additive phrase composition and generalize word alignment to phrase alignment. Empirical observations show that EC similarity is an effective formulation to interpolate the existing unsupervised STS approaches and yields better performance.

Recursive Optimal Transport. We propose the Recursive Optimal Transport Similarity (ROTS) in Section 5 based on the phrase alignment introduced in Section 4. ROTS computes the EC similarity at each phrase partition level and ensembles them. Notably, Prior Optimal Transport (Prior OT) is adopted to guide the finer-grained phrase alignment by the coarser-grained phrase alignment at each expectation step of EC similarity.

Extensive experiments. We show the comprehensive performance of ROTS over a wide spectrum of experimental settings in Section 6 and the Appendix, including 29 STS tasks, five types of word vectors, and three typical preprocessing setups. Specifically, ROTS is shown to be better than all other unsupervised approaches, including BERT-based STS, in terms of both effectiveness and efficiency. Detailed ablation studies also show that our constructive definitions are important and that the hyper-parameters can be easily chosen to obtain the new SOTA performance.
2 Related Work
Embedding symbolic words into a continuous space to represent their semantics (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) is one of the breakthroughs of modern NLP. Notably, the vector (or semantics) of a phrase can be approximated by the additive composition of the vectors of its constituent words (Mikolov et al., 2013). Thus, word embeddings can be further utilized to describe the semantics of texts beyond the word level. Several strategies have been proposed to provide sentence embeddings.
Additive Composition. Additive composition of word vectors (Arora et al., 2017) forms effective sentence embeddings. The cosine similarity between such sentence embeddings has been shown to be a stronger STS, under both transferred (Wieting et al., 2016; Wieting and Gimpel, 2018) and unsupervised settings (Arora et al., 2017; Ethayarajh, 2018), than most deep learning approaches (Socher et al., 2013; Le and Mikolov, 2014; Kiros et al., 2015; Tai et al., 2015).
Optimal Transport. By treating sentences as distributions over embeddings, the similarity between a sentence pair follows from the optimal transport between the two distributions (Kusner et al., 2015; Huang et al., 2016; Wu et al., 2018; Yokoi et al., 2020). OT models find the optimal alignment with respect to word semantics via their embeddings and achieve SOTA performance (Yokoi et al., 2020).
Syntax Information. One possible way to integrate contextual information in a sentence is to explicitly employ syntactic information. Recursive neural networks (Socher et al., 2013) were proposed to exploit tree structures in the supervised setting but were sub-optimal compared to AC-based STS. Meanwhile, tree kernels (Moschitti, 2006; Croce et al., 2011) can measure the similarity between parse trees. Most recently, ACV-tree kernels (Le et al., 2018) combine word embedding similarities with parsed constituency labels. However, tree kernels compare all the sub-trees and suffer from high computational complexity.
Pretrained Language Models. This paradigm produces contextualized sentence embeddings by repeatedly aggregating word embeddings with deep neural networks (Vaswani et al., 2017) trained on large corpora (Devlin et al., 2019). In the unsupervised setting, PLMs are sub-optimal compared to SOTA OT-based models (Yokoi et al., 2020). One common strategy to improve performance is to adjust the PLM-generated embeddings according to large amounts of external data, e.g., by transfer learning (Reimers and Gurevych, 2019), flows (Li et al., 2020), whitening (Su et al., 2021), or contrastive learning (Gao et al., 2021). However, this domain adaptation paradigm requires a complex training process, and the performance is highly affected by the similarity between the target test data and the external data (Li et al., 2020; Gao et al., 2021).
3 Unification of Unsupervised STS Methods

Given a pair of sentences $(s^{(1)}, s^{(2)})$, we are expected to estimate their similarity score $s \in [0, 1]$. For sentence $s^{(1)}$ (or $s^{(2)}$), we have vectors $\{v^{(1)}_i\}_{i=1}^{m}$ (or $\{v^{(2)}_j\}_{j=1}^{n}$) and weights $\{w^{(1)}_i\}_{i=1}^{m}$ (or $\{w^{(2)}_j\}_{j=1}^{n}$). We quickly review three types of unsupervised STS in Section 3.1 (see Figure 1(a-c)), then unify them by the Expectation-Correction similarity in Section 3.2.
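To make the notation concrete, the following minimal sketch (Python with NumPy) shows one way to turn a tokenized sentence into the vectors $\{v_i\}$ and weights $\{w_i\}$ used below. The embedding table and the SIF-style weighting $a/(a + p(w))$ are illustrative assumptions, not part of the formulation itself; any word vectors and weighting scheme can be plugged in.

```python
import numpy as np

# Hypothetical resources: a word -> vector table and unigram probabilities.
# In practice these would come from e.g. GloVe/word2vec/fastText and corpus counts.
rng = np.random.default_rng(0)
EMB = {w: rng.normal(size=300) for w in ["optimal", "transport", "is", "a", "distance"]}
UNIGRAM_P = {"optimal": 1e-5, "transport": 2e-5, "is": 1e-2, "a": 2e-2, "distance": 5e-5}

def sentence_to_arrays(tokens, a=1e-3):
    """Return (vectors, weights) for one sentence.

    Weights follow the SIF heuristic w_i = a / (a + p(word_i)) (Arora et al., 2017);
    uniform weights w_i = 1 / len(tokens) are an equally valid choice.
    """
    vectors = np.stack([EMB[t] for t in tokens])              # shape (len(tokens), d)
    weights = np.array([a / (a + UNIGRAM_P[t]) for t in tokens])
    return vectors, weights

v1, w1 = sentence_to_arrays(["optimal", "transport", "is", "a", "distance"])
print(v1.shape, w1.round(3))
```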
3.1 Review of Three Types of STS

Additive Composition (AC). AC methods (Arora et al., 2017; Ethayarajh, 2018) firstly compute the sentence embedding $x^{(\cdot)} = \sum_i w^{(\cdot)}_i v^{(\cdot)}_i$, then estimate the similarity by the cosine similarity $s_{AC} = \cos(x^{(1)}, x^{(2)})$, see Figure 1(a).
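As a hedged illustration, the snippet below computes $s_{AC}$ for two sentences given their word vectors and weights (e.g., produced as in the earlier sketch). It implements only the additive composition and cosine step; Arora et al. (2017) additionally remove a common principal component estimated over a corpus, which is omitted here.

```python
import numpy as np

def ac_similarity(v1, w1, v2, w2):
    """Cosine similarity of additively composed sentence embeddings.

    v1: (m, d) word vectors, w1: (m,) word weights for sentence 1; likewise v2, w2.
    """
    x1 = w1 @ v1                      # x^(1) = sum_i w_i v_i, shape (d,)
    x2 = w2 @ v2
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))

# Toy usage with random vectors (stand-ins for pretrained embeddings).
rng = np.random.default_rng(0)
v1, w1 = rng.normal(size=(4, 50)), np.ones(4) / 4
v2, w2 = rng.normal(size=(6, 50)), np.ones(6) / 6
print(ac_similarity(v1, w1, v2, w2))
```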
Optimal Transport (OT). Given a pairwise word distance matrix $D = (D_{ij})$ and two marginal distributions $\mu_i$ and $\nu_j$, the optimal transport alignment $\Gamma^{OT}$ is computed by solving the following minimization problem (Kusner et al., 2015):
$$\Gamma^{OT} = \arg\min_{\Gamma_{ij} \ge 0} \sum_{ij} \Gamma_{ij} D_{ij}, \quad \text{s.t.} \;\; \sum_j \Gamma_{ij} = \mu_i, \;\; \sum_i \Gamma_{ij} = \nu_j. \tag{1}$$
A higher $\Gamma^{OT}_{ij}$ means that the alignment from the $i$-th word in $s^{(1)}$ to the $j$-th word in $s^{(2)}$ is preferred, because those two words are semantically closer, see Figure 1(c). Different choices of $D, \mu, \nu$ lead to different distances. The SOTA OT-based STS is the Word Rotator's Distance (WRD)³ (Yokoi et al., 2020), which solves Problem (1) with $D_{ij} = 1 - \cos(v^{(1)}_i, v^{(2)}_j)$ and
$$\mu_i = \frac{w^{(1)}_i \|v^{(1)}_i\|_2}{\sum_k w^{(1)}_k \|v^{(1)}_k\|_2}, \qquad \nu_j = \frac{w^{(2)}_j \|v^{(2)}_j\|_2}{\sum_k w^{(2)}_k \|v^{(2)}_k\|_2}. \tag{2}$$
The similarity is
$$s_{OT} = \sum_{ij} \Gamma^{OT}_{ij} \cos(v^{(1)}_i, v^{(2)}_j). \tag{3}$$
WRD is equivalent to AC if and only if each sentence contains one word (Yokoi et al., 2020).

³Without further specification, OT refers to WRD.
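The following sketch computes a WRD-style similarity from Equations (1)-(3). For brevity, it approximates the exact transport plan of Problem (1) with a few Sinkhorn iterations on an entropy-regularized objective (a standard substitute; an exact LP solver such as POT's `ot.emd` could be used instead), so the numbers are an approximation rather than the exact optimum.

```python
import numpy as np

def sinkhorn(mu, nu, D, reg=0.05, n_iter=200):
    """Entropy-regularized approximation of the OT plan in Problem (1)."""
    K = np.exp(-D / reg)                       # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]         # Gamma with marginals ~ (mu, nu)

def wrd_similarity(v1, w1, v2, w2, reg=0.05):
    """Word Rotator's Distance-style similarity (Eqs. (2)-(3)), approximately."""
    n1 = np.linalg.norm(v1, axis=1)            # ||v_i||_2
    n2 = np.linalg.norm(v2, axis=1)
    mu = w1 * n1 / (w1 * n1).sum()             # Eq. (2)
    nu = w2 * n2 / (w2 * n2).sum()
    cos = (v1 / n1[:, None]) @ (v2 / n2[:, None]).T   # pairwise cosines
    gamma = sinkhorn(mu, nu, 1.0 - cos, reg)   # D_ij = 1 - cos
    return float((gamma * cos).sum())          # Eq. (3)

rng = np.random.default_rng(0)
v1, w1 = rng.normal(size=(4, 50)), np.ones(4) / 4
v2, w2 = rng.normal(size=(6, 50)), np.ones(6) / 6
print(wrd_similarity(v1, w1, v2, w2))
```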
Tree Kernel (TK). General tree kernels compare syntactic parsing information (Moschitti, 2006; Croce et al., 2011). Recently, the ACV-Tree (Le et al., 2018) combines word-level semantics with syntax information through a simplified partial tree kernel (Moschitti, 2006), see Figure 1(b). Word similarities from the same structure, e.g., an NP, are counted repeatedly and are thus more important. The similarity score can then be re-written as
$$s_{TK} = \sum_{ij} \Gamma^{TK}_{ij} \cos(v^{(1)}_i, v^{(2)}_j), \tag{4}$$
where $\Gamma^{TK}$ is the normalized weight matrix generated by the tree kernel.⁴
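The exact ACV-Tree weighting comes from the simplified partial tree kernel and is not reproduced here. Purely as a schematic of the "same-structure pairs count more" idea, the sketch below builds a hypothetical normalized weight matrix that doubles the weight of word pairs whose covering constituents share a label, and then aggregates cosines as in Equation (4); it is not the actual ACV-Tree kernel.

```python
import numpy as np

def schematic_tk_similarity(v1, labels1, v2, labels2):
    """Schematic syntax-weighted aggregation (NOT the actual ACV-Tree kernel).

    labels1[i] is the constituent label (e.g. 'NP', 'VP') covering word i.
    Pairs whose constituents share a label receive double weight, mimicking the
    repeated counting of same-structure pairs described for ACV-Tree.
    """
    cos = (v1 / np.linalg.norm(v1, axis=1, keepdims=True)) @ \
          (v2 / np.linalg.norm(v2, axis=1, keepdims=True)).T
    weight = np.ones_like(cos)
    for i, a in enumerate(labels1):
        for j, b in enumerate(labels2):
            if a == b:
                weight[i, j] += 1.0              # same-structure pairs count more
    gamma = weight / weight.sum()                # normalized weight matrix
    return float((gamma * cos).sum())            # Eq. (4) with this Gamma

rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=(4, 50)), rng.normal(size=(5, 50))
print(schematic_tk_similarity(v1, ["NP", "NP", "VP", "VP"],
                              v2, ["NP", "VP", "VP", "NP", "NP"]))
```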
3.2 Expectation Correction (EC)

The three approaches discussed above, though motivated in different ways, can all be seen as a linear aggregation of pairwise cosine similarities of words. We unify them into the following EC similarity with two steps, called expectation and correction.

Expectation. Both ACV-Tree (see Equation (4)) and OT (see Equation (3)) aggregate pairwise word similarities by the alignment matrices $\Gamma^{TK}$ and $\Gamma^{OT}$. AC also implies an implicit word alignment $\Gamma^{AC}$: the cosine similarity can be decomposed by plugging in the sentence vectors,
$$\cos(x^{(1)}, x^{(2)}) = \frac{\langle \sum_i w^{(1)}_i v^{(1)}_i, \; \sum_j w^{(2)}_j v^{(2)}_j \rangle}{\|x^{(1)}\| \, \|x^{(2)}\|} = C \sum_{ij} \Gamma^{AC}_{ij} \cos(v^{(1)}_i, v^{(2)}_j), \tag{5}$$
⁴In this paper, TK indicates the ACV-Tree kernel.
[Figure 1: Different unsupervised STS methods, with blue elements for $s^{(1)}$ and orange elements for $s^{(2)}$. (a) AC (Arora et al., 2017): cosine similarity between additively composed sentence embeddings. (b) ACV-Tree (Le et al., 2018): weighted average of pairwise word similarities; similarities from $v^{(1)}_i$ to the vectors in $s^{(2)}$ are shown, with more weight assigned to pairs contained in the same constituency structure (thicker arrows). (c) OT (Yokoi et al., 2020): the optimal transport alignment of words obtained by solving Problem (1). (d) ROTS at the coarser hierarchy: the OT alignment of phrase vectors and weights. (e) ROTS at the finer hierarchy: fine-level OT alignment based on the prior of the coarse-level alignment in (d).]
Table 1: Comparison of different approaches. The first three columns constitute the inter-sentence expectation.

Method                                        | Word Semantics | Phrase Semantics | Syntax | Intra-sentence Correction | Time Complexity
AC (Arora et al., 2017; Ethayarajh, 2018)     | ✗              | ✗                | ✗      | ✓                         | O(m+n)
OT (Kusner et al., 2015; Yokoi et al., 2020)  | ✓              | ✗                | ✗      | ✗                         | O(mn)
TK (Le et al., 2018)                          | ✗              | ✗                | ✓      | ✗                         | O(mn)
ROTS (ours)                                   | ✓              | ✓                | ✓      | ✓                         | O(m+n)
where $\Gamma^{AC}_{ij} = \mu_i \nu_j$, with $\mu$ and $\nu$ defined in Equation (2). This observation connects AC to the expectation of word similarities.⁵ Hence, the key of the expectation step is to compute the inter-sentence word alignment matrix $\Gamma$. Specifically, $\Gamma^{AC}$ is implicitly induced by weights and vector norms without considering the semantics or syntax between words, $\Gamma^{TK}$ is constructed by comparing node labels in syntax trees, and $\Gamma^{OT}$ is obtained by optimizing word semantics (see Table 1).

⁵Equation (5) motivates the marginal conditions of WRD in a different way.
Correction. In Equation (5), the coefficient
$$C = \frac{\sum_k w^{(2)}_k \|v^{(2)}_k\|}{\|\sum_k w^{(2)}_k v^{(2)}_k\|} \cdot \frac{\sum_k w^{(1)}_k \|v^{(1)}_k\|}{\|\sum_k w^{(1)}_k v^{(1)}_k\|} = \sqrt{K_1 K_2}$$
also has a special interpretation. For a specific sentence $i = 1, 2$, the coefficient $K_i$ can be rewritten as
$$K_i - 1 = \frac{\left(\sum_k w^{(i)}_k \|v^{(i)}_k\|\right)^2}{\|\sum_k w^{(i)}_k v^{(i)}_k\|^2} - 1 = \sum_{k \neq m} \frac{w^{(i)}_k w^{(i)}_m \|v^{(i)}_k\| \|v^{(i)}_m\|}{\|\sum_k w^{(i)}_k v^{(i)}_k\|^2} \left[1 - \cos(v^{(i)}_k, v^{(i)}_m)\right].$$
We have $K_i \ge 1$, and the equality holds if and only if all word vectors point in the same direction, i.e., they are semantically close. $K_i$ increases as the semantics of the words in a sentence become more diverse. In the latter situation, the sentence similarity tends to be underestimated, since unnecessary alignments are forced by the joint distribution. The coefficient $C$ corrects this intra-sentence semantics. This correction step distinguishes AC from the OT and TK approaches (see Table 1).
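A short numerical check (a sketch, with randomly generated vectors standing in for word embeddings) makes the correction concrete: it verifies that the decomposition in Equation (5) holds with $\Gamma^{AC}_{ij} = \mu_i \nu_j$ and $C = \sqrt{K_1 K_2}$, and that $K_i \ge 1$ when word directions are diverse.

```python
import numpy as np

def ac_correction_terms(v, w):
    """Return (mu, K) for one sentence: mu from Eq. (2), K = (sum w||v||)^2 / ||sum w v||^2."""
    norms = np.linalg.norm(v, axis=1)
    mu = w * norms / (w * norms).sum()
    K = (w * norms).sum() ** 2 / np.linalg.norm(w @ v) ** 2
    return mu, K

rng = np.random.default_rng(0)
v1, w1 = rng.normal(size=(4, 50)), np.ones(4) / 4
v2, w2 = rng.normal(size=(6, 50)), np.ones(6) / 6

mu, K1 = ac_correction_terms(v1, w1)
nu, K2 = ac_correction_terms(v2, w2)
assert K1 >= 1.0 and K2 >= 1.0                      # diverse word directions => K_i > 1

# Equation (5): cos(x1, x2) = sqrt(K1 K2) * sum_ij mu_i nu_j cos(v1_i, v2_j)
x1, x2 = w1 @ v1, w2 @ v2
lhs = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))
cos = (v1 / np.linalg.norm(v1, axis=1, keepdims=True)) @ \
      (v2 / np.linalg.norm(v2, axis=1, keepdims=True)).T
rhs = np.sqrt(K1 * K2) * (np.outer(mu, nu) * cos).sum()
print(np.allclose(lhs, rhs))                        # True
```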
We then introduce the EC similarity by combining the E-step and the C-step as follows:

Definition 1 (EC similarity). The EC similarity of STS is defined by
$$\tilde{C} \sum_{ij} \Gamma_{ij} \cos(v^{(1)}_i, v^{(2)}_j), \tag{6}$$
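As a hedged sketch of how Definition 1 unifies the three methods, the function below evaluates Equation (6) for any supplied alignment matrix $\Gamma$ and correction coefficient $\tilde{C}$. Passing $\Gamma^{AC} = \mu\nu^{\top}$ with $\tilde{C} = \sqrt{K_1 K_2}$ recovers the AC score, while passing an OT plan or a tree-kernel weight matrix with $\tilde{C} = 1$ recovers the OT and TK scores of Equations (3) and (4).

```python
import numpy as np

def ec_similarity(v1, v2, gamma, c_tilde=1.0):
    """Equation (6): s = c_tilde * sum_ij gamma_ij * cos(v1_i, v2_j)."""
    cos = (v1 / np.linalg.norm(v1, axis=1, keepdims=True)) @ \
          (v2 / np.linalg.norm(v2, axis=1, keepdims=True)).T
    return float(c_tilde * (gamma * cos).sum())

rng = np.random.default_rng(0)
v1, w1 = rng.normal(size=(4, 50)), np.ones(4) / 4
v2, w2 = rng.normal(size=(6, 50)), np.ones(6) / 6

# AC as a special case: Gamma = outer(mu, nu), c_tilde = sqrt(K1 K2).
n1, n2 = np.linalg.norm(v1, axis=1), np.linalg.norm(v2, axis=1)
mu, nu = w1 * n1 / (w1 * n1).sum(), w2 * n2 / (w2 * n2).sum()
K1 = (w1 * n1).sum() ** 2 / np.linalg.norm(w1 @ v1) ** 2
K2 = (w2 * n2).sum() ** 2 / np.linalg.norm(w2 @ v2) ** 2
print(ec_similarity(v1, v2, np.outer(mu, nu), np.sqrt(K1 * K2)))  # equals cos(x1, x2)
```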