from high computational costs to fit large amounts
of high-quality data, which may prevent their use
in broader downstream scenarios.
In this paper, we propose a set of concepts and
similarities to exploit phrase semantics in the
unsupervised setup. Our contributions are fourfold:
Unified formulation
We unify three types of unsupervised STS models (AC (Arora et al., 2017), OT (Yokoi et al., 2020), and TK (Le et al., 2018)) by the EC similarity in Section 3. The EC similarity uncovers the strengths and weaknesses of the three approaches.
Phrase vectors and their alignment
We generalize the idea of word alignment to phrase alignment in Section 4. After formally defining the Recursive Phrase Partition (RPP), we compose phrase weights and vectors from those of finer-grained partitions under the invariant additive phrase composition (a minimal sketch of this composition follows the contribution list), and thereby generalize word alignment to phrase alignment. Empirical observations show that the EC similarity is an effective formulation to interpolate between existing unsupervised STS approaches and yields better performance.
Recursive Optimal Transport
We propose the Recursive Optimal Transport Similarity (ROTS) in Section 5, based on the phrase alignment introduced in Section 4. ROTS computes the EC similarity at each phrase partition level and ensembles them. Notably, Prior Optimal Transport (Prior OT) is adopted to guide the finer-grained phrase alignment by the coarser-grained phrase alignment at each expectation step of the EC similarity.
Extensive experiments
We show the comprehensive performance of ROTS on a wide spectrum of experimental settings in Section 6 and the Appendix, including 29 STS tasks, five types of word vectors, and three typical preprocessing setups. Specifically, ROTS outperforms all other unsupervised approaches, including BERT-based STS, in terms of both effectiveness and efficiency. Detailed ablation studies also show that our constructive definitions are essential and that the hyper-parameters can be easily chosen to obtain the new SOTA performance.
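For illustration, the following minimal sketch (our own toy example, not the exact formulation of Sections 4 and 5) shows the additive phrase composition referenced above: phrase vectors and weights at a coarser partition level are composed by summing those of the words (or finer phrases) they cover. The function and variable names, the partition spans, and the uniform word weights are placeholder assumptions.

```python
# Minimal illustrative sketch of additive phrase composition over a recursive
# phrase partition. All names (compose_level, levels, weights) are hypothetical
# placeholders, not the paper's implementation.
import numpy as np

def compose_level(word_vecs, word_weights, partition):
    """Compose phrase vectors/weights for one partition level.

    partition: list of (start, end) index spans covering the sentence,
               e.g. [(0, 2), (2, 4)] splits a 4-word sentence into 2 phrases.
    Phrase vector = sum of word vectors in the span (additive composition);
    phrase weight = sum of the corresponding word weights.
    """
    phrase_vecs, phrase_weights = [], []
    for start, end in partition:
        phrase_vecs.append(word_vecs[start:end].sum(axis=0))
        phrase_weights.append(word_weights[start:end].sum())
    return np.stack(phrase_vecs), np.asarray(phrase_weights)

# Toy example: a 4-word sentence with 3-dimensional random "embeddings".
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(4, 3))
word_weights = np.ones(4) / 4.0

# A toy recursive phrase partition: the whole sentence at the coarsest level,
# then a finer level with two phrases.
levels = [[(0, 4)], [(0, 2), (2, 4)]]
for partition in levels:
    vecs, weights = compose_level(word_vecs, word_weights, partition)
    print(partition, vecs.shape, weights)
```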
2 Related Work
Embedding symbolic words into a continuous space to represent their semantics (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) is one of the breakthroughs of modern NLP. Notably, it has been shown that the vector (or semantics) of a phrase can be approximated by the additive composition of the vectors of its constituent words (Mikolov et al., 2013). Thus, word embeddings can be further utilized to describe the semantics of texts beyond the word level. Several strategies have been proposed to provide sentence embeddings.
Additive Composition.
Additive composition of word vectors (Arora et al., 2017) forms effective sentence embeddings. The cosine similarity between such sentence embeddings has been shown to be a stronger STS measure, under both transfer (Wieting et al., 2016; Wieting and Gimpel, 2018) and unsupervised settings (Arora et al., 2017; Ethayarajh, 2018), than most deep learning approaches (Socher et al., 2013; Le and Mikolov, 2014; Kiros et al., 2015; Tai et al., 2015).
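As a concrete reference, the sketch below computes an additive-composition similarity: each sentence embedding is a weighted average of word vectors with the smooth inverse frequency weighting a/(a + p(w)) of Arora et al. (2017), and the similarity is the cosine between the two embeddings. The common-component removal step of the original method is omitted, and the toy vectors and unigram probabilities are placeholders.

```python
# Sketch of an additive-composition STS baseline in the spirit of
# Arora et al. (2017); word vectors and unigram probabilities are toy inputs,
# and the common-component removal of the original method is omitted.
import numpy as np

def sif_embedding(words, vecs, unigram_prob, a=1e-3):
    """Weighted average of word vectors with weight a / (a + p(w))."""
    weights = np.array([a / (a + unigram_prob.get(w, 1e-5)) for w in words])
    stacked = np.stack([vecs[w] for w in words])
    return (weights[:, None] * stacked).sum(0) / weights.sum()

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vocabulary with random 5-dimensional vectors and made-up frequencies.
rng = np.random.default_rng(1)
vecs = {w: rng.normal(size=5) for w in ["a", "dog", "runs", "cat", "sleeps"]}
p = {"a": 0.05, "dog": 0.001, "runs": 0.002, "cat": 0.001, "sleeps": 0.002}

s1, s2 = ["a", "dog", "runs"], ["a", "cat", "sleeps"]
print(cosine(sif_embedding(s1, vecs, p), sif_embedding(s2, vecs, p)))
```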
Optimal Transport.
By viewing sentences as distributions over word embeddings, the similarity between a pair of sentences is obtained from the optimal transport between the two distributions (Kusner et al., 2015; Huang et al., 2016; Wu et al., 2018; Yokoi et al., 2020). OT models find the optimal alignment with respect to word semantics via their embeddings and achieve SOTA performance (Yokoi et al., 2020).
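The sketch below illustrates this OT view: word-vector norms serve as transport mass and cosine distance as transport cost, in the spirit of Word Rotator's Distance (Yokoi et al., 2020). The small Sinkhorn solver, the regularization strength, and the random inputs are our own simplifications, not the cited implementations.

```python
# Illustrative OT-based sentence similarity: mass from vector norms, cost from
# cosine distance (in the spirit of Yokoi et al., 2020), solved approximately
# with a few Sinkhorn iterations. A simplification, not the cited code.
import numpy as np

def sinkhorn_plan(a, b, cost, reg=0.1, n_iter=200):
    """Entropy-regularized OT plan between histograms a and b."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def ot_similarity(X, Y):
    """X, Y: word-embedding matrices of the two sentences."""
    a = np.linalg.norm(X, axis=1); a = a / a.sum()   # mass: vector norms
    b = np.linalg.norm(Y, axis=1); b = b / b.sum()
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cost = 1.0 - Xn @ Yn.T                           # cosine distance
    plan = sinkhorn_plan(a, b, cost)
    return 1.0 - float((plan * cost).sum())          # higher = more similar

rng = np.random.default_rng(2)
print(ot_similarity(rng.normal(size=(3, 5)), rng.normal(size=(4, 5))))
```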
Syntax Information.
One possible way to integrate contextual information in a sentence is to explicitly employ syntactic information. Recursive neural networks (Socher et al., 2013) were proposed to exploit tree structures in the supervised setting but performed worse than AC-based STS.
Meanwhile, tree kernels (Moschitti, 2006; Croce et al., 2011) can measure the similarity between parse trees. Most recently, ACV-tree kernels (Le et al., 2018) combine word embedding similarities with parsed constituency labels. However, tree kernels compare all the sub-trees and suffer from high computational complexity.
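To show why comparing all sub-trees is costly, the sketch below runs a classic convolution-style subtree-matching recursion over every pair of nodes in two toy parse trees. It is only a schematic illustration of the general idea, not the partial tree kernel of Moschitti (2006) or the ACV-tree kernel of Le et al. (2018); the tree construction and node type are our own.

```python
# Schematic subtree-matching recursion over toy parse trees, illustrating the
# pairwise sub-tree comparison behind tree kernels. Simplified for exposition;
# not the kernels of Moschitti (2006) or Le et al. (2018).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def production(n: Node):
    return (n.label, tuple(c.label for c in n.children))

def match(n1: Node, n2: Node) -> float:
    """Count common subtrees rooted at n1 and n2 (simplified recursion)."""
    if production(n1) != production(n2):
        return 0.0
    if not n1.children:          # matching leaves
        return 1.0
    score = 1.0
    for c1, c2 in zip(n1.children, n2.children):
        score *= 1.0 + match(c1, c2)
    return score

def tree_kernel(t1: Node, t2: Node) -> float:
    """Sum match scores over all node pairs: O(|t1| * |t2|) comparisons."""
    def nodes(t):
        yield t
        for c in t.children:
            yield from nodes(c)
    return sum(match(a, b) for a in nodes(t1) for b in nodes(t2))

# Toy constituency-like trees: (S (NP dog) (VP runs)) vs. (S (NP cat) (VP runs))
t1 = Node("S", [Node("NP", [Node("dog")]), Node("VP", [Node("runs")])])
t2 = Node("S", [Node("NP", [Node("cat")]), Node("VP", [Node("runs")])])
print(tree_kernel(t1, t2))
```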
Pretrained Language Models.
This paradigm produces contextualized sentence embeddings by repeatedly aggregating word embeddings with deep neural networks (Vaswani et al., 2017) trained on large corpora (Devlin et al., 2019). In the unsupervised setting, PLMs are sub-optimal compared to SOTA OT-based models (Yokoi et al., 2020).
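For completeness, a common way to obtain such unsupervised PLM sentence embeddings is mean pooling over BERT's last hidden layer followed by cosine similarity, as sketched below. The model name and pooling choice are our assumptions about a typical baseline, not necessarily the exact configuration evaluated in this paper.

```python
# Sketch of an unsupervised BERT-based STS baseline: mean pooling over the
# last hidden layer, then cosine similarity. An assumed typical setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bert_sentence_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)    # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)      # mean pooling

def cosine(u: torch.Tensor, v: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(u, v).item()

u = bert_sentence_embedding("a dog runs in the park")
v = bert_sentence_embedding("a cat sleeps on the sofa")
print(cosine(u, v))
```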