
define set operations. Our proposed framework
adopts a subspace-based approach for representing
word sets, aiming to maintain the intricate seman-
tic relationships within these sets. We represent a
word set as a subspace which is spanned by pre-
trained embeddings. Additionally, it adheres to
the foundational laws of set theory as delineated
in the framework of quantum logic. This compli-
ance ensures that our set operations, such as union,
intersection, and complement, are not only mathe-
matically robust but also linguistically meaningful
when applied in a pre-trained embedding space.
We first introduce a subspace set representation
along with basic operations (∩, ∪, and ∈). Subsequently,
to highlight the usefulness of our proposed
framework, we introduce two core set computa-
tions: text similarity and set membership. The
empirical results consistently favor our approach: simply
spanning subspaces with pre-trained embedding sets yields
a rich set representation, and this representation
consistently improves performance in downstream tasks
(Figure 1). Our research contributions include:
1. We introduce continuous set representations and a
framework for set operations, enabling more effective
manipulation of word embedding sets (§4).
2. We propose SubspaceBERTScore, an extension of the
embedding set-based text similarity method BERTScore
(Zhang et al., 2020). By simply transitioning from a
vector set representation to a subspace and incorporating
a subspace-based indicator function, we observe a salient
improvement in performance across all text similarity
benchmarks (§5).
3. We apply the subspace-based basic operations (∩, ∪,
and ∈) to the set expansion task and achieve high
performance (§6).
2 Preliminaries
To make the following discussion clear, we define several
symbols. The sets of tokens in two sentences ($A$ and $B$)
are denoted as $A = \{a_1, a_2, \dots\}$ and
$B = \{b_1, b_2, \dots\}$, respectively. The sets of
contextualized token vectors are denoted as
$\mathbf{A} = \{\mathbf{a}_1, \mathbf{a}_2, \dots\}$ and
$\mathbf{B} = \{\mathbf{b}_1, \mathbf{b}_2, \dots\}$, where
$\mathbf{a}$ and $\mathbf{b}$ are token vectors generated by
a pre-trained embedding model such as BERT. The subspace
spanned by $\mathbf{A}$ is denoted as
$S_{\mathbf{A}} = \mathrm{span}(\mathbf{a}_1, \mathbf{a}_2, \dots)$.
Note that the basis of the subspace is orthonormalized.
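As a rough illustration of the notation above, the following sketch builds an orthonormal basis for the subspace spanned by a set of token vectors. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the helper name `orthonormal_basis`, the tolerance `tol`, and the random vectors standing in for BERT embeddings are all hypothetical.

```python
import numpy as np


def orthonormal_basis(vectors: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Return an orthonormal basis (as rows) of the span of the row vectors.

    Hypothetical helper: takes an SVD and keeps only the directions whose
    singular value exceeds `tol` times the largest singular value.
    """
    # Rows of vt are orthonormal; the first `rank` rows span the row space.
    _, s, vt = np.linalg.svd(vectors, full_matrices=False)
    rank = int(np.sum(s > tol * s[0]))
    return vt[:rank]


# Random stand-ins for contextualized token vectors a_1, a_2, a_3 (assumption).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))
# A linearly dependent fourth vector does not enlarge the spanned subspace.
A_dep = np.vstack([A, A[0] + A[1]])

basis = orthonormal_basis(A_dep)
print(basis.shape)  # (3, 8)
```

Keeping the basis orthonormal means that projecting a vector `v` onto the subspace reduces to `basis.T @ (basis @ v)`, with no matrix inverse required.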
3 Symbolic Set Operations
We first formulate various set operations in a pre-
trained embedding space. Among many types of
operations for practical NLP applications, this work
focuses on set similarity:
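$$
\begin{aligned}
A &= \{\text{A}, \text{boy}, \text{walks}, \text{in}, \text{this}, \text{park}\},\\
B &= \{\text{The}, \text{kid}, \text{runs}, \text{in}, \text{the}, \text{square}\},\\
&\mathrm{Similarity}(A, B),
\end{aligned}
\tag{1}
$$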
set membership (∈) and basic operations (∩,∪):
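$$
\begin{aligned}
\mathrm{Color} &= \{\text{red}, \text{blue}, \text{green}, \text{orange}, \dots\},\\
\mathrm{Fruit} &= \{\text{apple}, \text{orange}, \text{peach}, \dots\},\\
&\text{orange} \in \mathrm{Color} \cap \mathrm{Fruit}.
\end{aligned}
\tag{2}
$$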
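For reference, the symbolic version of Eq. (2) can be written directly with Python's built-in `set` type. This is only a sketch of the discrete operations that the framework aims to carry over to an embedding space. The variable names are illustrative, and the members elided by ". . ." in Eq. (2) are left out rather than guessed.

```python
# Symbolic word sets from Eq. (2); the elided members are omitted.
color = {"red", "blue", "green", "orange"}
fruit = {"apple", "orange", "peach"}

both = color & fruit     # intersection
either = color | fruit   # union

print("orange" in both)  # True
print("red" in both)     # False
print(len(either))       # 6
```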
For this purpose, we need the following representations
in a pre-trained embedding space²:
An element and a set of elements. The representations of
an element and a set of elements are the most basic ones.
To exploit word embeddings, we represent a word (e.g.,
orange) as an element and a group of words (e.g.,
{red, blue, green, orange, . . . }) as a word set.
Quantification of set membership (indicator function).
Membership denotes a relation in which word $w$ is an
element of set $A$, i.e., $w \in A$. We quantify it based
on vector representations. Although membership is
typically a binary decision, as in a symbolic space, it
can also be measured by the degree of closeness in a
continuous vector space. Membership can be computed as an
indicator function. The indicator function
$\mathbb{1}_{\mathrm{set}}$ quantifies whether the word $w$
is included (1) or not (0) in the set in a discrete manner:
$$
\mathbb{1}_{\mathrm{set}}[w \in A] =
\begin{cases}
1 & \text{if } w \in A,\\
0 & \text{if } w \notin A.
\end{cases}
\tag{3}
$$
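The discrete indicator of Eq. (3) can be sketched in a few lines of Python. The function name `indicator_set` is hypothetical, and the example reuses the token set of sentence A from Eq. (1). It also shows why a binary decision can be too strict: a word that is close in meaning to a member still scores 0.

```python
def indicator_set(word: str, word_set: set) -> int:
    """Discrete indicator of Eq. (3): 1 if the word is in the set, else 0."""
    return 1 if word in word_set else 0


# Token set of sentence A from Eq. (1).
A = {"A", "boy", "walks", "in", "this", "park"}

print(indicator_set("in", A))   # 1
print(indicator_set("kid", A))  # 0, although "kid" is close in meaning to "boy"
```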
Similarity between discrete symbol sets. Set
similarity, such as recall and precision, is an es-
sential operation when calculating the similarity
of texts. Despite its simplicity, the word overlap-
based sentence similarity serves as a remarkably
effective approximation and has found widespread
² These operations do not include some set operations, such
as cardinality, but are sufficient for expressing practical
forms of sets such as Eq. (1).