Subspace Representations for Soft Set Operations and Sentence Similarities
Yoichi Ishibashi1,2∗  Sho Yokoi3,4  Katsuhito Sudoh1  Satoshi Nakamura1
1Nara Institute of Science and Technology 2Kyoto University
3Tohoku University 4RIKEN
{ishibashi.yoichi.ir3, sudoh, s-nakamura}@is.naist.jp
yokoi@tohoku.ac.jp
Abstract
In the field of natural language processing (NLP), continuous vector representations are crucial for capturing the semantic meanings of individual words. Yet, when it comes to the representations of sets of words, the conventional vector-based approaches often struggle with expressiveness and lack essential set operations such as union, intersection, and complement. Inspired by quantum logic, we realize the representation of word sets and corresponding set operations within pre-trained word embedding spaces. By grounding our approach in linear subspaces, we enable efficient computation of various set operations and facilitate the soft computation of membership functions within continuous spaces. Moreover, we allow for the computation of the F-score directly within word vectors, thereby establishing a direct link to the assessment of sentence similarity. In experiments with widely used pre-trained embeddings and benchmarks, we show that our subspace-based set operations consistently outperform vector-based ones in both sentence similarity and set retrieval tasks.1
1 Introduction
Embedding-based word representations have become fundamental in the field of natural language processing (NLP). Models like word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), along with recent Transformer-based architectures (Vaswani et al., 2017; Devlin et al., 2019), have underscored the significance of embeddings in capturing the complexities of linguistic semantics.
∗ Work done while at Nara Institute of Science and Technology.

1 Our code is publicly available at https://github.com/yoichi1484/subspace

[Figure 1: Superiority of subspace representations. Our subspace representation (blue) surpasses the traditional vector set representation (gray) in both text similarity (Spearman's ρ on STS-B: BERTScore vs. SubspaceBERTScore) and text concept set retrieval (Recall@100: fuzzy set vs. subspace set).]
The importance of representing collections of words is pivotal in understanding concepts and relationships within language contexts (Zaheer et al., 2017; Zhelezniak et al., 2019). For instance, while words like “apple” and “orange” each carry their distinct meanings, together they represent the broader concept of fruits. Another important application is sentence representation (Zaheer et al., 2017): the set of words in a sentence captures its overall meaning, allowing for computations such as text similarity (Agirre et al., 2012).

Against this backdrop, our research recognizes the significance of applying set operations in NLP and explores a new approach. Set operations enable a richer representation of relationships between collections of words, leading to more accurate semantic analysis based on context. For example, employing set operations allows for a clearer understanding of shared semantic features and differences among word groups within a text. This directly benefits tasks like determining semantic similarity and expanding word sets.
In response to these challenges, our study introduces a novel methodology that exploits the principles of quantum logic (Birkhoff and Von Neumann, 1936), applied within embedding spaces to define set operations. Our proposed framework adopts a subspace-based approach for representing word sets, aiming to maintain the intricate semantic relationships within these sets. We represent a word set as a subspace spanned by pre-trained embeddings. Additionally, our formulation adheres to the foundational laws of set theory as delineated in the framework of quantum logic. This compliance ensures that our set operations, such as union, intersection, and complement, are not only mathematically robust but also linguistically meaningful when applied in a pre-trained embedding space.
We first introduce a subspace set representation along with basic operations ($\cap$, $\cup$, and complement). Subsequently, to highlight the usefulness of our proposed framework, we introduce two core set computations: text similarity and set membership. The empirical results consistently point towards the notable superiority of our approach: our straightforward strategy of spanning subspaces with pre-trained embedding sets enables a rich set representation, and we demonstrate its consistent performance enhancement in downstream tasks (Figure 1). Our research contributions include:
1. The introduction of continuous set representations and a framework for set operations, enabling more effective manipulation of word embedding sets (§4).

2. We propose SubspaceBERTScore, an extension of the embedding set-based text similarity method BERTScore (Zhang et al., 2020). By simply transitioning from a vector set representation to a subspace, and incorporating a subspace-based indicator function, we observe a salient improvement in performance across all text similarity benchmarks (§5).

3. We apply the subspace-based basic operations ($\cap$, $\cup$, and complement) to the set expansion task and achieve high performance (§6).
2 Preliminaries
To make the following discussion clear, we define several symbols. The sets of tokens in two sentences ($A$ and $B$) are denoted as $A = \{a_1, a_2, \dots\}$ and $B = \{b_1, b_2, \dots\}$, respectively. The sets of contextualized token vectors are denoted as $\mathcal{A} = \{\boldsymbol{a}_1, \boldsymbol{a}_2, \dots\}$ and $\mathcal{B} = \{\boldsymbol{b}_1, \boldsymbol{b}_2, \dots\}$, where $\boldsymbol{a}$ and $\boldsymbol{b}$ are token vectors generated by a pre-trained embedding model such as BERT. The subspace spanned by $\mathcal{A}$ is denoted as $S_{\mathcal{A}} = \mathrm{span}(\boldsymbol{a}_1, \boldsymbol{a}_2, \dots)$. Note that the bases of the subspace are orthonormalized.
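To make the notation concrete, here is a minimal sketch of how the contextualized token vectors in $\mathcal{A}$ might be obtained with the HuggingFace transformers library; the checkpoint name bert-base-uncased is an assumption chosen for illustration, not necessarily the model used in our experiments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint, chosen only for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def token_vectors(sentence: str) -> torch.Tensor:
    """Return contextualized token vectors, one row per token (a_1, a_2, ...)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden[0]

A = token_vectors("A boy walks in this park")  # rows span the subspace S_A
```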
3 Symbolic Set Operations
We first formulate various set operations in a pre-trained embedding space. Among the many types of operations for practical NLP applications, this work focuses on set similarity:

$$A = \{\text{A}, \text{boy}, \text{walks}, \text{in}, \text{this}, \text{park}\},\quad B = \{\text{The}, \text{kid}, \text{runs}, \text{in}, \text{the}, \text{square}\},\quad \mathrm{Similarity}(A, B) \tag{1}$$
set membership ($\in$) and basic operations ($\cap$, $\cup$):

$$\mathrm{Color} = \{\text{red}, \text{blue}, \text{green}, \text{orange}, \dots\},\quad \mathrm{Fruit} = \{\text{apple}, \text{orange}, \text{peach}, \dots\},\quad \text{orange} \in \mathrm{Color} \cap \mathrm{Fruit} \tag{2}$$
(2)
For this purpose, we need following representa-
tions on a pre-trained embedding space2:
An element and a set of elements: The representations of an element and a set of elements are the most basic ones. To exploit word embeddings, we represent a word (e.g., orange) as an element and a group of words (e.g., {red, blue, green, orange, ...}) as a word set.
Quantification of set membership (indicator function): Membership denotes the relation in which word $w$ is an element of set $A$, i.e., $w \in A$. We quantify it based on vector representations. Although membership is typically a binary decision, identical to that in a symbolic space, it can also be measured by the degree of closeness in a continuous vector space. Membership can be computed as an indicator function: the indicator function $\mathbb{1}_{\mathrm{set}}$ quantifies whether the word $w$ is included ($1$) or not ($0$) in the set in a discrete manner:

$$\mathbb{1}_{\mathrm{set}}[w \in A] = \begin{cases} 1 & \text{if } w \in A, \\ 0 & \text{if } w \notin A. \end{cases} \tag{3}$$
Similarity between discrete symbol sets: Set similarity, such as recall and precision, is an essential operation when calculating the similarity of texts. Despite its simplicity, word overlap-based sentence similarity serves as a remarkably effective approximation and has found widespread practical application, as evidenced by numerous studies (Bojar et al., 2018; Zhang et al., 2020; Cer et al., 2017; Zhelezniak et al., 2019). Overlap-based measures also stand out as excellent similarity metrics when grounded in embeddings: BERTScore (Zhang et al., 2020), which utilizes embeddings for its computation, is grounded in recall and precision.3 The typical computations for recall ($R$) and precision ($P$) are as follows4:

$$R = \frac{1}{|A|} \sum_{a_i \in A} \mathbb{1}_{\mathrm{set}}[a_i \in B], \tag{4}$$

$$P = \frac{1}{|B|} \sum_{b_i \in B} \mathbb{1}_{\mathrm{set}}[b_i \in A]. \tag{5}$$

3 Unlike symbolic set similarity, which does not consider word order, contextualized embeddings enable the capture of word-sequence information.

4 For simplicity of explanation, we present $A$ and $B$ in Eq. (4) and Eq. (5) as sets of tokens.
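As a concrete illustration of Eqs. (3)–(5), here is a minimal sketch of discrete recall and precision over token sets in plain Python (no embeddings involved yet); the token sets are the hypothetical examples from Eq. (1), lowercased so that overlap is exact string match:

```python
def indicator(w: str, S: set) -> int:
    """Indicator function of Eq. (3): 1 if w is in S, else 0."""
    return 1 if w in S else 0

def recall_precision(A: set, B: set) -> tuple:
    """Word-overlap recall (Eq. 4) and precision (Eq. 5)."""
    R = sum(indicator(a, B) for a in A) / len(A)
    P = sum(indicator(b, A) for b in B) / len(B)
    return R, P

A = {"a", "boy", "walks", "in", "this", "park"}
B = {"the", "kid", "runs", "in", "square"}
print(recall_precision(A, B))  # only "in" overlaps: R = 1/6, P = 1/5
```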
Basic set operations: We need three basic set operations: intersection ($A \cap B$), union ($A \cup B$), and complement ($\overline{A}$). They allow us to represent various sets using different combinations of these operations, such as $\mathrm{Color} \cap \mathrm{Fruit}$.
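These symbolic operations correspond directly to Python's built-in set algebra; a toy check of the Color/Fruit example from Eq. (2):

```python
Color = {"red", "blue", "green", "orange"}
Fruit = {"apple", "orange", "peach"}

assert "orange" in Color & Fruit  # membership in an intersection
assert (Color | Fruit) >= Color   # the union contains each operand
# The complement requires fixing a universe U and taking U - Color.
```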
4 Subspace-based Set Representations
We propose representations of a word set and of set operations based on quantum logic (Birkhoff and Von Neumann, 1936). They inherit the geometric properties of an embedding space, and the set operations are guaranteed to satisfy the laws of sets defined in quantum logic.
4.1 Quantum logic
While word embeddings represent a word's meaning as a vector in a linear space, quantum mechanics similarly represents a quantum state as a vector in a linear space. These two intuitively different fields are very close to each other in terms of how they represent and operate on information.
Quantum logic (Birkhoff and Von Neumann, 1936) is a theory describing quantum mechanical phenomena. Intuitively, it is a framework for set operations in a vector space. In quantum logic, a set of vectors is represented as a linear subspace of a Hilbert space, and set operations such as union, intersection, and complement are defined as operations on subspaces. Quantum logic, which employs a complete orthomodular lattice as its system of truth values, guarantees that various set-theoretic laws hold, such as De Morgan's laws $\overline{A \cup B} = \overline{A} \cap \overline{B}$ and $\overline{A \cap B} = \overline{A} \cup \overline{B}$, the idempotent law $A \cap A = A$, and the double complement law $\overline{\overline{A}} = A$.
Algorithm 1 Computing the basis of a subspace
Input: $\{\boldsymbol{v}^{(1)}, \dots, \boldsymbol{v}^{(k)}\} \subseteq \mathbb{R}^{1 \times d}$: word embeddings to span subspace $S_{\mathcal{A}}$
Output: $\mathbf{S}_{\mathcal{A}} \in \mathbb{R}^{r \times d}$: bases of $S_{\mathcal{A}}$
  $\mathbf{A} \in \mathbb{R}^{k \times d} \leftarrow \mathrm{STACK\_ROWS}(\boldsymbol{v}^{(1)}, \dots, \boldsymbol{v}^{(k)})$
  $\mathbf{S}_{\mathcal{A}} \in \mathbb{R}^{r \times d} \leftarrow \mathrm{ORTHO\_NORMAL}(\mathbf{A})$   ▷ Orthonormalize the bases; $r$ is the rank of $\mathbf{A}$
  return $\mathbf{S}_{\mathcal{A}}$
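A minimal NumPy sketch of Algorithm 1, assuming the ORTHO_NORMAL step is realized with an SVD (any orthonormalization would do); the rank tolerance is an implementation detail chosen here, not specified by the algorithm:

```python
import numpy as np

def subspace_basis(vectors: np.ndarray) -> np.ndarray:
    """Algorithm 1: orthonormal basis of the subspace spanned by row vectors.

    vectors: (k, d) array whose rows are the word embeddings v^(1), ..., v^(k).
    Returns an (r, d) array of orthonormal rows, where r is the rank.
    """
    A = np.atleast_2d(vectors)                        # STACK_ROWS
    _, s, vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > 1e-10 * s[0]))                 # numerical rank of A
    return vt[:r]                                     # ORTHO_NORMAL: orthonormal basis rows
```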
4.2 Set Operations in an Embedding Space
The representations of an element, a set, and such set operations as union, intersection, and complement in quantum logic can be applied directly in a word embedding space, because it is a Euclidean space and therefore also a Hilbert space. However, since set similarity and set membership for a word embedding space are still missing in quantum logic, we propose a novel formulation of those operations using subspace-based representations that is consistent with quantum logic. The correspondence between symbolic and subspace-based set operations is shown in Table 1.
Set and elements: Let $\mathbb{R}^n$ be an $n$-dimensional embedding space (Euclidean space), let $A = \{w_1, w_2, \dots\}$ be a set of words, and let $\boldsymbol{v}_w \in \mathbb{R}^n$ be the word (token) vector corresponding to $w$. As discussed in §3, we first formulate the representation of a word and a word set. In quantum logic, an element is represented by a vector, and a set is represented by the subspace spanned by the vectors corresponding to its elements. Here we assume that an element, i.e., word $w$, is represented by vector $\boldsymbol{v}_w$, and that a word set is represented by the linear subspace $S_{\mathcal{A}} \subseteq \mathbb{R}^n$ spanned by the word vectors:

$$S_{\mathcal{A}} := \mathrm{span}(\mathcal{A}) := \mathrm{span}(\boldsymbol{a}_1, \boldsymbol{a}_2, \dots). \tag{6}$$

Hereinafter we simply refer to a linear subspace as a subspace. Algorithm 1 is the pseudocode for computing the basis of the subspace.
Basic set operations: The complement of set $A$, denoted by $\overline{A}$, is represented by the orthogonal complement of subspace $S_{\mathcal{A}}$:

$$S_{\overline{A}} := (S_{\mathcal{A}})^{\perp} = \{\boldsymbol{v} \mid \forall \boldsymbol{a} \in S_{\mathcal{A}},\ \boldsymbol{v} \cdot \boldsymbol{a} = 0\}. \tag{7}$$
The union of two sets, $A$ and $B$, denoted by $A \cup B$, is represented by the sum space of the two subspaces.
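A sketch of these two subspace operations in NumPy, reusing subspace_basis from the Algorithm 1 sketch above; representing the orthogonal complement via the full SVD and the sum space by re-orthonormalizing the stacked bases are our implementation choices, not prescriptions from the text:

```python
import numpy as np

def complement_basis(S: np.ndarray, dim: int) -> np.ndarray:
    """Orthogonal complement (Eq. 7) of the subspace with orthonormal basis rows S."""
    if S.size == 0:
        return np.eye(dim)  # the complement of the zero subspace is the whole space
    _, s, vt = np.linalg.svd(S, full_matrices=True)   # vt has shape (dim, dim)
    r = int(np.sum(s > 1e-10 * s[0]))
    return vt[r:]           # directions orthogonal to every vector in span(S)

def union_basis(S_A: np.ndarray, S_B: np.ndarray) -> np.ndarray:
    """Sum space S_A + S_B: the span of both bases stacked together."""
    return subspace_basis(np.vstack([S_A, S_B]))
```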