Subspace Representations for Soft Set Operations and Sentence Similarities
Yoichi Ishibashi1,2∗  Sho Yokoi3,4  Katsuhito Sudoh1  Satoshi Nakamura1
1Nara Institute of Science and Technology 2Kyoto University
3Tohoku University 4RIKEN
{ishibashi.yoichi.ir3, sudoh, s-nakamura}@is.naist.jp
yokoi@tohoku.ac.jp
Abstract
In the field of natural language processing (NLP), continuous vector representations are crucial for capturing the semantic meanings of individual words. Yet, when it comes to the representations of sets of words, the conventional vector-based approaches often struggle with expressiveness and lack essential set operations such as union, intersection, and complement. Inspired by quantum logic, we realize the representation of word sets and corresponding set operations within pre-trained word embedding spaces. By grounding our approach in linear subspaces, we enable efficient computation of various set operations and facilitate the soft computation of membership functions within continuous spaces. Moreover, we allow for the computation of the F-score directly within word vectors, thereby establishing a direct link to the assessment of sentence similarity. In experiments with widely used pre-trained embeddings and benchmarks, we show that our subspace-based set operations consistently outperform vector-based ones in both sentence similarity and set retrieval tasks.1
1 Introduction
Embedding-based word representations have become fundamental in the field of natural language processing (NLP). Models like word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), along with recent Transformer-based architectures (Vaswani et al., 2017; Devlin et al., 2019), have underscored the significance of embeddings in capturing the complexities of linguistic semantics.
∗ Work done while at Nara Institute of Science and Technology.

1 Our code is publicly available at https://github.com/yoichi1484/subspace

[Figure 1: Superiority of subspace representations. Our subspace representation (blue) surpasses the traditional vector set representation (gray) in both text similarity (Spearman's ρ on STS-B: BERTScore vs. SubspaceBERTScore) and text concept set retrieval (Recall@100: fuzzy set vs. subspace set).]
The importance of representing collections of words is pivotal in understanding concepts and relationships within language contexts (Zaheer et al., 2017; Zhelezniak et al., 2019). For instance, while words like “apple” and “orange” each carry their distinct meanings, together they represent the broader concept of fruits. Another important application is sentence representation (Zaheer et al., 2017): the set of words in a sentence captures its overall meaning, allowing for computations such as text similarity (Agirre et al., 2012).

Against this backdrop, our research recognizes the significance of applying set operations in NLP and explores a new approach. Set operations enable a richer representation of relationships between collections of words, leading to more accurate semantic analysis based on context. For example, employing set operations allows for a clearer understanding of shared semantic features and differences among word groups within a text. This directly benefits tasks like determining semantic similarity and expanding word sets.
In response to these challenges, our study introduces a novel methodology that exploits the principles of quantum logic (Birkhoff and Von Neumann, 1936), applied within embedding spaces to define set operations. Our proposed framework adopts a subspace-based approach for representing word sets, aiming to maintain the intricate semantic relationships within these sets. We represent a word set as a subspace spanned by pre-trained embeddings. Additionally, our formulation adheres to the foundational laws of set theory as delineated in the framework of quantum logic. This compliance ensures that our set operations, such as union, intersection, and complement, are not only mathematically robust but also linguistically meaningful when applied in a pre-trained embedding space.
We first introduce a subspace set representation along with basic operations ($\cap$, $\cup$, and complement). Subsequently, to highlight the usefulness of our proposed framework, we introduce two core set computations: text similarity and set membership. The empirical results consistently point towards the notable superiority of our approach: our straightforward strategy of spanning subspaces with pre-trained embedding sets enables a rich set representation, and we demonstrate its consistent performance enhancement in downstream tasks (Figure 1). Our research contributions include:
1. The introduction of continuous set representations and a framework for set operations, enabling more effective manipulation of word embedding sets (§4).

2. We propose SubspaceBERTScore, an extension of the embedding set-based text similarity method BERTScore (Zhang et al., 2020). By simply transitioning from a vector set representation to a subspace, and incorporating a subspace-based indicator function, we observe a salient improvement in performance across all text similarity benchmarks (§5).

3. We apply the subspace-based basic operations ($\cap$, $\cup$, and complement) to the set expansion task and achieve high performance (§6).
2 Preliminaries
To make the following discussion clear, we define several symbols. The sets of tokens in two sentences ($A$ and $B$) are denoted as $A = \{a_1, a_2, \dots\}$ and $B = \{b_1, b_2, \dots\}$, respectively. The sets of contextualized token vectors are denoted as $\mathcal{A} = \{\boldsymbol{a}_1, \boldsymbol{a}_2, \dots\}$ and $\mathcal{B} = \{\boldsymbol{b}_1, \boldsymbol{b}_2, \dots\}$, where $\boldsymbol{a}$ and $\boldsymbol{b}$ are token vectors generated by a pre-trained embedding model such as BERT. The subspace spanned by $\mathcal{A}$ is denoted as $S_{\mathcal{A}} = \mathrm{span}(\boldsymbol{a}_1, \boldsymbol{a}_2, \dots)$. Note that the bases of the subspace are orthonormalized.
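To make the notation concrete, here is a minimal sketch of how the contextualized token vectors in $\mathcal{A}$ might be obtained with the HuggingFace transformers library; the checkpoint name bert-base-uncased is an assumption chosen for illustration, not necessarily the model used in our experiments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint, chosen only for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def token_vectors(sentence: str) -> torch.Tensor:
    """Return contextualized token vectors, one row per token (a_1, a_2, ...)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden[0]

A = token_vectors("A boy walks in this park")  # rows span the subspace S_A
```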
3 Symbolic Set Operations
We first formulate various set operations in a pre-trained embedding space. Among the many types of operations for practical NLP applications, this work focuses on set similarity:

$$A = \{\text{A}, \text{boy}, \text{walks}, \text{in}, \text{this}, \text{park}\},\quad B = \{\text{The}, \text{kid}, \text{runs}, \text{in}, \text{the}, \text{square}\},\quad \mathrm{Similarity}(A, B) \tag{1}$$
set membership ($\in$) and basic operations ($\cap$, $\cup$):

$$\mathrm{Color} = \{\text{red}, \text{blue}, \text{green}, \text{orange}, \dots\},\quad \mathrm{Fruit} = \{\text{apple}, \text{orange}, \text{peach}, \dots\},\quad \text{orange} \in \mathrm{Color} \cap \mathrm{Fruit} \tag{2}$$
(2)
For this purpose, we need following representa-
tions on a pre-trained embedding space2:
An element and a set of elements: The representations of an element and a set of elements are the most basic ones. To exploit word embeddings, we represent a word (e.g., orange) as an element and a group of words (e.g., {red, blue, green, orange, ...}) as a word set.
Quantification of set membership (indicator function): Membership denotes the relation in which word $w$ is an element of set $A$, i.e., $w \in A$. We quantify it based on vector representations. Although membership is typically a binary decision, identical to that in a symbolic space, it can also be measured by the degree of closeness in a continuous vector space. Membership can be computed as an indicator function: the indicator function $\mathbb{1}_{\mathrm{set}}$ quantifies whether the word $w$ is included ($1$) or not ($0$) in the set in a discrete manner:

$$\mathbb{1}_{\mathrm{set}}[w \in A] = \begin{cases} 1 & \text{if } w \in A, \\ 0 & \text{if } w \notin A. \end{cases} \tag{3}$$
Similarity between discrete symbol sets: Set similarity, such as recall and precision, is an essential operation when calculating the similarity of texts. Despite its simplicity, word overlap-based sentence similarity serves as a remarkably effective approximation and has found widespread practical application, as evidenced by numerous studies (Bojar et al., 2018; Zhang et al., 2020; Cer et al., 2017; Zhelezniak et al., 2019). Overlap-based measures also stand out as excellent similarity metrics when grounded in embeddings: BERTScore (Zhang et al., 2020), which utilizes embeddings for its computation, is grounded in recall and precision.3 The typical computations for recall ($R$) and precision ($P$) are as follows4:

$$R = \frac{1}{|A|} \sum_{a_i \in A} \mathbb{1}_{\mathrm{set}}[a_i \in B], \tag{4}$$

$$P = \frac{1}{|B|} \sum_{b_i \in B} \mathbb{1}_{\mathrm{set}}[b_i \in A]. \tag{5}$$

3 Unlike symbolic set similarity, which does not consider word order, contextualized embeddings enable the capture of word-sequence information.

4 For simplicity of explanation, we present $A$ and $B$ in Eq. (4) and Eq. (5) as sets of tokens.
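As a concrete illustration of Eqs. (3)–(5), here is a minimal sketch of discrete recall and precision over token sets in plain Python (no embeddings involved yet); the token sets are the hypothetical examples from Eq. (1), lowercased so that overlap is exact string match:

```python
def indicator(w: str, S: set) -> int:
    """Indicator function of Eq. (3): 1 if w is in S, else 0."""
    return 1 if w in S else 0

def recall_precision(A: set, B: set) -> tuple:
    """Word-overlap recall (Eq. 4) and precision (Eq. 5)."""
    R = sum(indicator(a, B) for a in A) / len(A)
    P = sum(indicator(b, A) for b in B) / len(B)
    return R, P

A = {"a", "boy", "walks", "in", "this", "park"}
B = {"the", "kid", "runs", "in", "square"}
print(recall_precision(A, B))  # only "in" overlaps: R = 1/6, P = 1/5
```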
Basic set operations: We need three basic set operations: intersection ($A \cap B$), union ($A \cup B$), and complement ($\overline{A}$). They allow us to represent various sets using different combinations of these operations, such as $\mathrm{Color} \cap \mathrm{Fruit}$.
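These symbolic operations correspond directly to Python's built-in set algebra; a toy check of the Color/Fruit example from Eq. (2):

```python
Color = {"red", "blue", "green", "orange"}
Fruit = {"apple", "orange", "peach"}

assert "orange" in Color & Fruit  # membership in an intersection
assert (Color | Fruit) >= Color   # the union contains each operand
# The complement requires fixing a universe U and taking U - Color.
```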
4 Subspace-based Set Representations
We propose representations of a word set and of set operations based on quantum logic (Birkhoff and Von Neumann, 1936). They inherit the geometric properties of an embedding space, and the set operations are guaranteed to satisfy the laws of sets defined in quantum logic.
4.1 Quantum logic
While word embeddings represent a word's meaning as a vector in a linear space, quantum mechanics similarly represents a quantum state as a vector in a linear space. These two intuitively different fields are very close to each other in terms of how they represent and operate on information.
Quantum logic (Birkhoff and Von Neumann, 1936) is a theory describing quantum mechanical phenomena. Intuitively, it is a framework for set operations in a vector space. In quantum logic, a set of vectors is represented as a linear subspace of a Hilbert space, and set operations such as union, intersection, and complement are defined as operations on subspaces. Quantum logic, which employs a complete orthomodular lattice as its system of truth values, guarantees that various set-theoretic laws hold, such as De Morgan's laws $\overline{A \cup B} = \overline{A} \cap \overline{B}$ and $\overline{A \cap B} = \overline{A} \cup \overline{B}$, the idempotent law $A \cap A = A$, and the double complement law $\overline{\overline{A}} = A$.
Algorithm 1 Computing the basis of a subspace
Input: $\{\boldsymbol{v}^{(1)}, \dots, \boldsymbol{v}^{(k)}\} \subseteq \mathbb{R}^{1 \times d}$: word embeddings to span subspace $S_{\mathcal{A}}$
Output: $\mathbf{S}_{\mathcal{A}} \in \mathbb{R}^{r \times d}$: bases of $S_{\mathcal{A}}$
  $\mathbf{A} \in \mathbb{R}^{k \times d} \leftarrow \mathrm{STACK\_ROWS}(\boldsymbol{v}^{(1)}, \dots, \boldsymbol{v}^{(k)})$
  $\mathbf{S}_{\mathcal{A}} \in \mathbb{R}^{r \times d} \leftarrow \mathrm{ORTHO\_NORMAL}(\mathbf{A})$   ▷ Orthonormalize the bases; $r$ is the rank of $\mathbf{A}$
  return $\mathbf{S}_{\mathcal{A}}$
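A minimal NumPy sketch of Algorithm 1, assuming the ORTHO_NORMAL step is realized with an SVD (any orthonormalization would do); the rank tolerance is an implementation detail chosen here, not specified by the algorithm:

```python
import numpy as np

def subspace_basis(vectors: np.ndarray) -> np.ndarray:
    """Algorithm 1: orthonormal basis of the subspace spanned by row vectors.

    vectors: (k, d) array whose rows are the word embeddings v^(1), ..., v^(k).
    Returns an (r, d) array of orthonormal rows, where r is the rank.
    """
    A = np.atleast_2d(vectors)                        # STACK_ROWS
    _, s, vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > 1e-10 * s[0]))                 # numerical rank of A
    return vt[:r]                                     # ORTHO_NORMAL: orthonormal basis rows
```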
4.2 Set Operations in an Embedding Space
The representations of an element, a set, and such set operations as union, intersection, and complement in quantum logic can be applied directly in a word embedding space, because it is a Euclidean space and therefore also a Hilbert space. However, since set similarity and set membership for a word embedding space are still missing in quantum logic, we propose a novel formulation of those operations using subspace-based representations that is consistent with quantum logic. The correspondence between symbolic and subspace-based set operations is shown in Table 1.
Set and elements: Let $\mathbb{R}^n$ be an $n$-dimensional embedding space (Euclidean space), let $A = \{w_1, w_2, \dots\}$ be a set of words, and let $\boldsymbol{v}_w \in \mathbb{R}^n$ be the word (token) vector corresponding to $w$. As discussed in §3, we first formulate the representation of a word and a word set. In quantum logic, an element is represented by a vector, and a set is represented by the subspace spanned by the vectors corresponding to its elements. Here we assume that an element, i.e., word $w$, is represented by vector $\boldsymbol{v}_w$, and that a word set is represented by the linear subspace $S_{\mathcal{A}} \subseteq \mathbb{R}^n$ spanned by the word vectors:

$$S_{\mathcal{A}} := \mathrm{span}(\mathcal{A}) := \mathrm{span}(\boldsymbol{a}_1, \boldsymbol{a}_2, \dots). \tag{6}$$

Hereinafter we simply refer to a linear subspace as a subspace. Algorithm 1 is the pseudocode for computing the basis of the subspace.
Basic set operations: The complement of set $A$, denoted by $\overline{A}$, is represented by the orthogonal complement of subspace $S_{\mathcal{A}}$:

$$S_{\overline{A}} := (S_{\mathcal{A}})^{\perp} = \{\boldsymbol{v} \mid \forall \boldsymbol{a} \in S_{\mathcal{A}},\ \boldsymbol{v} \cdot \boldsymbol{a} = 0\}. \tag{7}$$
The union of two sets, $A$ and $B$, denoted by $A \cup B$, is represented by the sum space of the two subspaces.
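A sketch of these two subspace operations in NumPy, reusing subspace_basis from the Algorithm 1 sketch above; representing the orthogonal complement via the full SVD and the sum space by re-orthonormalizing the stacked bases are our implementation choices, not prescriptions from the text:

```python
import numpy as np

def complement_basis(S: np.ndarray, dim: int) -> np.ndarray:
    """Orthogonal complement (Eq. 7) of the subspace with orthonormal basis rows S."""
    if S.size == 0:
        return np.eye(dim)  # the complement of the zero subspace is the whole space
    _, s, vt = np.linalg.svd(S, full_matrices=True)   # vt has shape (dim, dim)
    r = int(np.sum(s > 1e-10 * s[0]))
    return vt[r:]           # directions orthogonal to every vector in span(S)

def union_basis(S_A: np.ndarray, S_B: np.ndarray) -> np.ndarray:
    """Sum space S_A + S_B: the span of both bases stacked together."""
    return subspace_basis(np.vstack([S_A, S_B]))
```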