Embeddings as Epistemic States: Limitations on the Use of Pooling
Operators for Accumulating Knowledge
Steven Schockaert
Cardiff University, School of Computer Science & Informatics,
Abacws building, Senghennydd Road, CF24 4AG, Cardiff, UK
Abstract
Various neural network architectures rely on pooling operators to aggregate information coming from dif-
ferent sources. It is often implicitly assumed in such contexts that vectors encode epistemic states, i.e. that
vectors capture the evidence that has been obtained about some properties of interest, and that pooling
these vectors yields a vector that combines this evidence. We study, for a number of standard pooling
operators, under what conditions they are compatible with this idea, which we call the epistemic pooling
principle. While we find that all the considered pooling operators can satisfy the epistemic pooling principle,
this only holds when embeddings are sufficiently high-dimensional and, for most pooling operators, when the
embeddings satisfy particular constraints (e.g. having non-negative coordinates). We furthermore show that
these constraints have important implications on how the embeddings can be used in practice. In particular,
we find that when the epistemic pooling principle is satisfied, in most cases it is impossible to verify the
satisfaction of propositional formulas using linear scoring functions, with two exceptions: (i) max-pooling
with embeddings that are upper-bounded and (ii) Hadamard pooling with non-negative embeddings. This
finding helps to clarify, among others, why Graph Neural Networks sometimes under-perform in reasoning
tasks. Finally, we also study an extension of the epistemic pooling principle to weighted epistemic states,
which are important in the context of non-monotonic reasoning, where max-pooling emerges as the most
suitable operator.
Keywords: Learning and Reasoning, Knowledge Representation, Propositional Logic
1. Introduction
One of the key challenges in many sub-areas of Machine Learning is to learn suitable vector space
embeddings of the objects of interest (e.g. graphs, images or sentences). A question which is usually left
implicit is what the embedding of an object represents. We can take at least two different views on this.
First, we may consider that embeddings essentially serve as a compact representation of a distance metric,
which intuitively captures some form of similarity. What matters, then, is that objects which are similar,
in some sense, are represented by similar vectors, while objects which are dissimilar are not. This intuition
provides the foundation, for instance, for the use of contrastive pre-training strategies [1, 2]. Second, we
may consider that embeddings are essentially compact encodings of epistemic states. In other words, the
embedding of an object encodes what we know about that object. Embeddings then essentially play a similar
role as formulas in propositional logic. This view implicitly underpins most strategies that combine neural
network learning with aspects of symbolic reasoning, e.g. when using a semantic loss function to encourage
neural network predictions to satisfy certain constraints [3] or when using neural network predictions as
input to a probabilistic logic program [4]. In this paper, we focus on this second view.
Email address: schockaerts1@cardiff.ac.uk (Steven Schockaert)
In practice, the embedding of an object is often obtained by combining the embeddings of related objects
using some kind of pooling operator. For instance, in the context of Computer Vision, convolutional feature
extractors such as ResNet [5] provide an embedding for each sub-region of the image. An embedding of the
overall image is then typically obtained by averaging these sub-region embeddings. Along similar lines, a
common setting in Natural Language Processing consists in using a transformer-based language model such
as BERT [6] to obtain paragraph embeddings, and to average these embeddings to obtain an embedding for
a full document. In multi-modal settings, it is common to obtain embeddings for the individual modalities
first, and to subsequently aggregate these embeddings [7]. Graph Neural Networks [8, 9] also crucially rely
on pooling operators, learning node representations by aggregating embeddings derived from neighbouring
nodes. Essentially, in all these cases we have an embedding e which is obtained by aggregating embeddings e1, ..., ek using some pooling operator ⊕:

e = ⊕(e1, ..., ek)    (1)
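To make this concrete, the short sketch below (our own illustration in Python/NumPy; the function name and interface are not from the paper) implements the standard pooling operators analysed in the remainder of the paper: averaging, summation, component-wise maximum and the Hadamard (component-wise) product.

```python
import numpy as np

def pool(embeddings, op="mean"):
    """Aggregate equally sized vectors with one of the standard pooling operators."""
    E = np.stack(embeddings)           # shape (k, d): one row per input embedding
    if op == "mean":
        return E.mean(axis=0)
    if op == "sum":
        return E.sum(axis=0)
    if op == "max":
        return E.max(axis=0)           # component-wise maximum
    if op == "hadamard":
        return np.prod(E, axis=0)      # component-wise (Hadamard) product
    raise ValueError(f"unknown pooling operator: {op}")

e1, e2, e3 = np.array([1.0, 0.0, 2.0]), np.array([0.0, 3.0, 1.0]), np.array([2.0, 1.0, 0.0])
print(pool([e1, e2, e3], "max"))       # -> [2. 3. 2.]
```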
Under the epistemic¹ view, this pooling operator is implicitly assumed to aggregate the knowledge that is captured by the embeddings e1, ..., ek. For instance, the embeddings e1, ..., ek may encode which objects are present in different parts of the image. After pooling these embeddings, we should end up with an embedding e that captures which objects are present throughout the entire image. Let us write Γ(ei) for the knowledge that is captured by the embedding ei. More precisely, we will think of Γ(ei) as a set of properties that are known to be satisfied. If we view pooling as the process of accumulating knowledge from different sources (e.g. from different regions of the image, or different neighbours in a graph neural network), then we would expect the following to be true: Γ(e) = Γ(e1) ∪ ... ∪ Γ(ek). We will refer to this principle as the epistemic pooling principle. As an important special case, we will consider the case where the properties of interest correspond to possible worlds (i.e. propositional interpretations). The set Γ(ei) then contains the set of possible worlds that can be excluded based on the knowledge encoded in the embedding ei. In other words, Γ(ei) can then be characterised as a propositional formula. The epistemic pooling principle then states that the formula corresponding to ⊕(e1, ..., ek) should be equivalent to the conjunction of the formulas corresponding to e1, ..., ek.
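As a minimal illustration of the epistemic pooling principle, the following sketch assumes a deliberately simple encoding (our own choice, not the paper's general setting) in which embeddings are binary and Γ(e) is the set of coordinates equal to 1; with component-wise max-pooling, the principle then holds by construction.

```python
import numpy as np

def gamma(e):
    """Knowledge captured by e: here, the set of coordinates (read: excluded worlds) equal to 1."""
    return {j for j, v in enumerate(e) if v == 1}

e1 = np.array([1, 0, 0, 1])        # excludes worlds 0 and 3
e2 = np.array([0, 0, 1, 1])        # excludes worlds 2 and 3

pooled = np.maximum(e1, e2)        # component-wise max-pooling

# Epistemic pooling principle: Gamma(pooled) = Gamma(e1) ∪ Gamma(e2)
assert gamma(pooled) == gamma(e1) | gamma(e2)
print(sorted(gamma(pooled)))       # -> [0, 2, 3]
```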
The main aim of this paper is to study under which conditions the epistemic pooling principle can be
satisfied. Analysing a number of standard pooling operators, we find that the epistemic pooling principle
can be satisfied for all of them, but with several important caveats:
• We need at least as many dimensions as there are properties. In settings where we want to model propositional formulas, the properties of interest correspond to possible worlds. Without further restrictions, this means that we need embeddings with as many dimensions as there are possible worlds.

• For most of the pooling operators, we find that embeddings need to be constrained in a particular way, e.g. by only allowing vectors with non-negative coordinates.

• We also identify important restrictions on how embeddings can be linked to the formulas they capture. In particular, when summation or averaging is used for pooling, we find that the satisfaction of propositional formulas can, in general, not be predicted by a linear classifier when the epistemic pooling principle is satisfied.
The fundamental question which we want to answer is whether a vector-based representation can act as
a formal knowledge representation framework, or whether reasoning with neural networks is inevitably
approximate in nature. While we focus on a theoretical analysis of pooling operators in this paper, our
results provide a number of important insights for the design of neural network models for tasks that
require reasoning. For instance, two operators emerge from our analysis as being particularly suitable for
¹ The word epistemic, in this paper, merely refers to the idea that vectors capture what we know about some properties of interest. In particular, note that our setting is not directly related to epistemic logics.
Figure 1: A graph containing four types of nodes. The nodes c1, c2 represent committees, r1, r2, r3 represent researchers, a1, a2, a3 represent articles, and t1, t2, t3, t4, t5 represent research topics. Topic nodes are connected to the articles that discuss them. Article nodes are connected to their authors. Researchers are connected to the committees they belong to.
applications where we need to reason about propositional formulas: max-pooling, with the constraint that
the coordinates of all embeddings are upper-bounded by some constant z, and the Hadamard operator (i.e.
the component-wise product), with the constraint that all coordinates are non-negative.
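The suitability of the Hadamard operator can be previewed with a small sketch: for vectors with non-negative coordinates, a coordinate of the component-wise product is 0 exactly when at least one of the inputs is 0 there, so zero coordinates accumulate under pooling. Reading a zero coordinate as an excluded possible world is our own illustrative choice here, not necessarily the construction developed later in the paper.

```python
import numpy as np

def zero_coords(e):
    """Coordinates where e is exactly 0 (read here, for illustration, as excluded worlds)."""
    return {j for j, v in enumerate(e) if v == 0.0}

e1 = np.array([0.0, 2.0, 1.0, 0.5])
e2 = np.array([1.5, 0.0, 3.0, 2.0])

pooled = e1 * e2                   # Hadamard (component-wise) product

# With non-negative coordinates, the zero coordinates of the product are exactly
# the union of the zero coordinates of the inputs.
assert zero_coords(pooled) == zero_coords(e1) | zero_coords(e2)
print(pooled)                      # -> [0. 0. 3. 1.]
```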
Our results also help to explain why neuro-symbolic methods [10] outperform Graph Neural Networks (GNNs) in certain applications. GNNs are, in theory, capable of reasoning in the two-variable
fragment of first-order logic with counting quantifiers (FOC2) [11]. Nonetheless, in practice, GNNs often
perform relatively poorly in tasks that require reasoning, even when staying within FOC2. Crucially, GNNs
typically use averaging (or summation) for pooling the messages coming from adjacent nodes. As we will see
in Section 5.4, in such cases, GNNs can only capture logical reasoning if the node embeddings are essentially
binary. The discrete nature of the required embeddings makes them difficult to learn, which helps to explain
why GNNs are often outperformed by models that are more specifically tailored towards reasoning [12, 13].
Furthermore, our lower bounds on the required dimensionality of epistemic embeddings suggest that,
beyond toy problems, using neural networks for reasoning inevitably requires some kind of modularity.
For instance, an important challenge in Natural Language Processing is to design models that can reason
about information that comes from different sources (e.g. different news sources). Combining different text
fragments by pooling their embeddings would require a prohibitively high dimensionality, if we want these
embeddings to capture epistemic states in a faithful way. Instead, it is more common to use vectors to encode
what we know about a given entity, or about the relationship between two entities. Reasoning about the
overall story then requires us to combine these different entity and relation vectors, by using the structure
of the problem domain in some way, e.g. by using GNNs or neuro-symbolic approaches. By exploiting this
structure, embeddings can be used to capture more focused knowledge, which means that the prohibitively
high dimensionality that is otherwise needed can be avoided.
The remainder of this paper is structured as follows. In the next section, we first discuss a number of
simple examples to illustrate the considered setting. Subsequently, in Section 3, we formalise the problem
setting and introduce the notations that will be used throughout the paper. In Section 4, we then analyse
under which conditions the epistemic pooling principle can be satisfied. One of the main findings from this
section is that the requirement to satisfy the epistemic pooling principle fundamentally constrains which
embeddings can be allowed, and how these embeddings encode knowledge. In Section 5, we then investigate
how this impacts our ability to use vector embeddings for propositional reasoning. In particular, we focus
on the problem of verifying whether a given propositional formula is satisfied in the epistemic state encoded
by a given vector. Subsequently, in Section 6, we look at a generalisation of the epistemic pooling principle
for dealing with weighted epistemic states. In this case, vectors encode the strength with which we believe a
given property to be satisfied. Section 7 presents a discussion of our results in the context of related work,
after which we summarise our conclusions.
2. Illustrative examples
In this section, we consider a number of toy examples to illustrate our problem setting. These examples
involve message-passing GNNs, where pooling arises because we need to aggregate the messages coming from
adjacent nodes. In Section 2.1, we discuss a simple example to illustrate the basic principles. This example
involves aggregating sets of properties, without any further reasoning. We then build on this example in
Section 2.2 to illustrate how propositional reasoning can be carried out in this framework. Section 2.3
illustrates a setting where we need to reason in the presence of background knowledge. Finally, Section 2.4
shows how this can be extended to situations involving non-monotonic reasoning with defeasible knowledge.
2.1. Basic setting
Figure 1 displays a graph with four types of nodes, referring to research topics (t1, t2, t3, t4, t5), scientific
articles (a1, a2, a3), researchers (r1, r2, r3) and committees (c1, c2), respectively. In this graph, research topics
are connected to the articles that discuss them, articles are connected to their authors and researchers are
connected to the committees they belong to. We say that a researcher is an expert on a topic if they have
published at least one article which discusses this topic. We say that a committee is complete if it contains
an expert on each of the five topics. For instance, in the case of Figure 1, we can see that researcher r1
is an expert on topics t1, t2; researcher r2 is an expert on topics t1, t2, t3, t4; and researcher r3 is an expert on topics t4, t5. It follows that committee c2 is complete, as each of the five topics is covered by its two members, but committee c1 is not. We are now interested in designing a message-passing GNN which can
predict whether a given committee is complete or not. A message-passing GNN learns embeddings of the
nodes of a given graph, by iteratively updating each node’s embedding based on the embeddings of its
neighbouring nodes. To specify a message-passing GNN, we thus need to specify (i) the initial embedding
of each node and (ii) the update mechanism which is used.
A straightforward solution is as follows. Let us write x^(l) ∈ R^5 for the representation of a node x in layer l of the GNN. In particular, x^(0) represents the input embedding of node x, which we define as follows:

t1^(0) = (1, 0, 0, 0, 0)   t2^(0) = (0, 1, 0, 0, 0)   t3^(0) = (0, 0, 1, 0, 0)   t4^(0) = (0, 0, 0, 1, 0)   t5^(0) = (0, 0, 0, 0, 1)

In other words, we use a one-hot encoding for representing the different topics. The input embeddings for all the other nodes are set to x^(0) = (0, 0, 0, 0, 0). In the subsequent layers, the node representations are updated as follows:

x^(l+1) = max({x^(l)} ∪ {y^(l) | (y, x) ∈ E})    (2)
where E represents the set of edges, i.e. (y, x) ∈ E if there is an edge from node y to node x, and the maximum is applied component-wise. It is easy to verify that an article ai covers topic tj if the jth coordinate of ai^(l) is 1, for l ≥ 1. Similarly, a researcher ri is an expert on topic tj if the jth coordinate of ri^(l) is 1, for l ≥ 2; and a committee ci contains expertise on topic tj if the jth coordinate of ci^(l) is 1, for l ≥ 3. We thus have that the committee ci is complete iff ci^(3) = (1, 1, 1, 1, 1). Note that the latter condition can be checked using a linear scoring function, since it is equivalent with ci^(3) · 1 ≥ 5, where 1 = (1, 1, 1, 1, 1).
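The construction above can be spelled out in a few lines of code. The sketch below (our own Python/NumPy illustration) uses a hypothetical edge set that is consistent with the expertise described in the text, rather than the exact edges of Figure 1, applies the max-pooling update (2) for three layers, and checks completeness with the linear scoring function ci^(3) · 1 ≥ 5.

```python
import numpy as np

# A hypothetical edge set, consistent with the expertise described in the text
# (not read off Figure 1).  An edge (y, x) means that messages flow from y to x:
# topics -> articles -> researchers -> committees.
edges = [
    ("t1", "a1"), ("t2", "a1"), ("t3", "a2"), ("t4", "a2"), ("t4", "a3"), ("t5", "a3"),
    ("a1", "r1"), ("a1", "r2"), ("a2", "r2"), ("a3", "r3"),
    ("r1", "c1"), ("r2", "c1"), ("r2", "c2"), ("r3", "c2"),
]
nodes = {n for edge in edges for n in edge}

# Input embeddings: one-hot vectors for the five topic nodes, zero vectors otherwise.
emb = {n: np.zeros(5) for n in nodes}
for i in range(5):
    emb[f"t{i + 1}"][i] = 1.0

# Update rule (2): component-wise maximum over a node's own embedding and those of
# its in-neighbours.  Three layers propagate topics to articles, researchers, committees.
for _ in range(3):
    emb = {x: np.max([emb[x]] + [emb[y] for (y, z) in edges if z == x], axis=0)
           for x in nodes}

# A committee is complete iff its embedding is (1,1,1,1,1), i.e. iff  emb . 1 >= 5.
ones = np.ones(5)
for c in ("c1", "c2"):
    print(c, "complete" if emb[c] @ ones >= 5 else "incomplete")
# -> c1 incomplete, c2 complete
```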
In the proposed construction, a number of particular design decisions were made. For instance, we used
5-dimensional embeddings for representing nodes and we used the maximum for aggregating the evidence
from neighbouring nodes. Our analysis in this paper is aimed at studying the importance of such decisions.
For instance, we may wonder whether it is possible to devise an encoding that relies on four-dimensional
embeddings. If we are allowed to replace the maximum with an arbitrarily complex function (and we allow
non-linear scoring functions), then the answer is clearly positive². However, as we will see in Section 4,
² In fact, we can even use one-dimensional embeddings. Let n be a number which is larger than the number of topics k. Then we can initialise topic i as n^(i-1). To aggregate two node representations x and y, we first decompose them as x = x_0 + x_1 n + x_2 n^2 + ... + x_{k-1} n^{k-1} and y = y_0 + y_1 n + y_2 n^2 + ... + y_{k-1} n^{k-1}, with x_0, ..., x_{k-1}, y_0, ..., y_{k-1} ∈ {0, 1}. Then the aggregated embedding can be defined as max(x_0, y_0) + max(x_1, y_1) n + ... + max(x_{k-1}, y_{k-1}) n^{k-1}.
when we restrict ourselves to standard pooling operators, such as max-pooling, summation or averaging,
the answer is negative. Another question is whether a variant of the construction above can be found which
relies on summation or averaging, rather than max-pooling. Here, the answer depends on what assumptions
we make on the final node classifier, i.e. the function that maps the embedding c^(l) of a committee onto a
decision. The most common approach is to rely on a linear classifier to make such predictions. In that case,
we can show that no encoding of the problem can be found that relies on summation or averaging, as we
will see in Section 5.
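The one-dimensional encoding from footnote 2 can also be made concrete. The sketch below (our own illustration; the helper names are hypothetical) encodes topic i as n^(i-1) and aggregates two codes by taking the digit-wise maximum of their base-n expansions.

```python
def digits(x, n, k):
    """Base-n expansion of x, least significant digit first, padded to k digits."""
    return [(x // n ** i) % n for i in range(k)]

def aggregate(x, y, n, k):
    """Digit-wise maximum of the base-n expansions of x and y (footnote 2)."""
    return sum(max(dx, dy) * n ** i
               for i, (dx, dy) in enumerate(zip(digits(x, n, k), digits(y, n, k))))

k = 5                                              # number of topics
n = k + 1                                          # any base larger than the number of topics
topics = [n ** (i - 1) for i in range(1, k + 1)]   # topic i is encoded as n**(i-1)

print(aggregate(topics[0], topics[3], n, k))       # encodes {t1, t4}: 1 + 6**3 = 217
```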
2.2. Propositional reasoning
The aforementioned example simply required us to aggregate sets of features, which corresponded to
research topics in that example. Throughout this paper, we will refer to such features as properties. At first
glance, it may seem like this setting only involves rather basic forms of reasoning. However, by identifying
the considered properties with possible worlds, we can in fact design GNNs which perform propositional
reasoning. We illustrate this with an example. Let us consider the same setting as before, but now we
only focus on research topics, articles and researchers. We say that a researcher is a generalist if they have
worked on at least two sufficiently distinct topics. Let us write ti to denote that some researcher has worked on topic ti. Then we define:

generalist ≡ (t1 ∧ t4) ∨ (t1 ∧ t5) ∨ (t2 ∧ t5)
where the idea is that the other topic combinations are too similar to each other (e.g. working on t1 and t2
would not make someone a generalist because these are related research topics). We want to predict whether
a given researcher is a generalist by applying a linear classifier to the corresponding node embedding. As
before, we assume that available knowledge about researchers is encoded as a graph with topic, article and
researcher nodes (although we no longer consider committees). Let ω1, ..., ω32 be an enumeration of all the
possible worlds of the propositional logic over the set of atoms {t1, ..., t5}. The input embedding ti^(0) of the node ti is now defined as a 32-dimensional vector, where the jth component is 0 if ωj ⊨ ¬ti and −1 otherwise. Note how each component now corresponds to a possible world. The jth coordinate in the embedding of node n is 0 if that node captures the knowledge that the world ωj can be excluded. The input embeddings of the article and researcher nodes are initialised as (−1, ..., −1). The embeddings of the nodes in the subsequent layers are again computed using (2). For l ≥ 1 we then have that ai^(l) = (x1, ..., x32) where xj = 0 iff article ai has some topic tk such that ωj ⊨ ¬tk, and xj = −1 otherwise. For l ≥ 2, we have that ri^(l) = (x1, ..., x32) where xj = 0 iff researcher ri has written an article that has some topic tk such that ωj ⊨ ¬tk. In other words, we have that xj = 0 if we can exclude ωj as a possible representation of the expertise of ri based on the knowledge encoded in the graph. To test whether we can entail that ri is a generalist, it suffices to check whether every countermodel of (t1 ∧ t4) ∨ (t1 ∧ t5) ∨ (t2 ∧ t5) can be excluded. Let c ∈ R^32 be the vector whose jth component is 0 if ωj is a model of (t1 ∧ t4) ∨ (t1 ∧ t5) ∨ (t2 ∧ t5) and 1 otherwise. Then we have that researcher ri is a generalist iff

ri^(2) · c ≥ 0
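To make this construction concrete, the sketch below (our own Python/NumPy illustration) enumerates the 32 possible worlds, builds the 0/−1 topic embeddings, pools them with the component-wise maximum for a hypothetical researcher who has written one article on t1 and one on t5 (this researcher and the enumeration order of the worlds are our own choices), and applies the linear test ri^(2) · c ≥ 0.

```python
import itertools
import numpy as np

atoms = ["t1", "t2", "t3", "t4", "t5"]
worlds = list(itertools.product([False, True], repeat=len(atoms)))  # the 32 possible worlds

def topic_embedding(i):
    """jth coordinate is 0 if world j makes atom i false (it can be excluded), -1 otherwise."""
    return np.array([0.0 if not w[i] else -1.0 for w in worlds])

unknown = np.full(len(worlds), -1.0)   # initial embedding of article/researcher nodes

# Hypothetical researcher with one article on t1 and one article on t5; max-pooling
# through the article layer (together with the node's own -1 initialisation)
# accumulates the excluded worlds.
article1 = np.maximum(unknown, topic_embedding(0))   # article covering t1
article2 = np.maximum(unknown, topic_embedding(4))   # article covering t5
researcher = np.maximum(unknown, np.maximum(article1, article2))

def generalist(w):
    """(t1 and t4) or (t1 and t5) or (t2 and t5)."""
    t1, t2, _, t4, t5 = w
    return (t1 and t4) or (t1 and t5) or (t2 and t5)

# Linear scoring vector c: 1 for countermodels of the generalist formula, 0 for its models.
c = np.array([0.0 if generalist(w) else 1.0 for w in worlds])

print(researcher @ c >= 0)   # -> True: the researcher is provably a generalist
```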
By associating properties with possible worlds, we can thus study settings in which GNNs need to aggregate
knowledge encoded using propositional formulas. While the kind of knowledge that we had to consider
in this example was rather simple, in general the formulas that need to be aggregated can be arbitrarily
complex propositional formulas (including e.g. formulas with negation). The topic of propositional reasoning
by pooling embeddings, as illustrated in the aforementioned example, will be studied in Section 5. Among
others, we will see that the choice of the pooling operator plays an important role, where faithful reasoning
is, in general, not possible with averaging and summation, when linear scoring functions are used.
2.3. Background knowledge
Background knowledge can straightforwardly be taken into account by restricting the set of possible
worlds to the models of a given knowledge base. Embeddings then capture which models of the knowledge
base are still possible. To illustrate this, let us consider a similar setting as before, but with four topics: