Embeddings as Epistemic States: Limitations on the Use of Pooling
Operators for Accumulating Knowledge
Steven Schockaert
Cardiff University, School of Computer Science & Informatics,
Abacws building, Senghennydd Road, CF24 4AG, Cardiff, UK
Abstract
Various neural network architectures rely on pooling operators to aggregate information coming from dif-
ferent sources. It is often implicitly assumed in such contexts that vectors encode epistemic states, i.e. that
vectors capture the evidence that has been obtained about some properties of interest, and that pooling
these vectors yields a vector that combines this evidence. We study, for a number of standard pooling
operators, under what conditions they are compatible with this idea, which we call the epistemic pooling
principle. While we find that all the considered pooling operators can satisfy the epistemic pooling principle,
this only holds when embeddings are sufficiently high-dimensional and, for most pooling operators, when the
embeddings satisfy particular constraints (e.g. having non-negative coordinates). We furthermore show that
these constraints have important implications on how the embeddings can be used in practice. In particular,
we find that when the epistemic pooling principle is satisfied, in most cases it is impossible to verify the
satisfaction of propositional formulas using linear scoring functions, with two exceptions: (i) max-pooling
with embeddings that are upper-bounded and (ii) Hadamard pooling with non-negative embeddings. This
finding helps to clarify, among others, why Graph Neural Networks sometimes under-perform in reasoning
tasks. Finally, we also study an extension of the epistemic pooling principle to weighted epistemic states,
which are important in the context of non-monotonic reasoning, where max-pooling emerges as the most
suitable operator.
Keywords: Learning and Reasoning, Knowledge Representation, Propositional Logic
1. Introduction
One of the key challenges in many sub-areas of Machine Learning is to learn suitable vector space
embeddings of the objects of interest (e.g. graphs, images or sentences). A question which is usually left
implicit is what the embedding of an object represents. We can take at least two different views on this.
First, we may consider that embeddings essentially serve as a compact representation of a distance metric,
which intuitively captures some form of similarity. What matters, then, is that objects which are similar,
in some sense, are represented by similar vectors, while objects which are dissimilar are not. This intuition
provides the foundation, for instance, for the use of contrastive pre-training strategies [1, 2]. Second, we
may consider that embeddings are essentially compact encodings of epistemic states. In other words, the
embedding of an object encodes what we know about that object. Embeddings then essentially play a similar
role as formulas in propositional logic. This view implicitly underpins most strategies that combine neural
network learning with aspects of symbolic reasoning, e.g. when using a semantic loss function to encourage
neural network predictions to satisfy certain constraints [3] or when using neural network predictions as
input to a probabilistic logic program [4]. In this paper, we focus on this second view.
Email address: schockaerts1@cardiff.ac.uk (Steven Schockaert)
In practice, the embedding of an object is often obtained by combining the embeddings of related objects
using some kind of pooling operator. For instance, in the context of Computer Vision, convolutional feature
extractors such as ResNet [5] provide an embedding for each sub-region of the image. An embedding of the
overall image is then typically obtained by averaging these sub-region embeddings. Along similar lines, a
common setting in Natural Language Processing consists in using a transformer-based language model such
as BERT [6] to obtain paragraph embeddings, and to average these embeddings to obtain an embedding for
a full document. In multi-modal settings, it is common to obtain embeddings for the individual modalities
first, and to subsequently aggregate these embeddings [7]. Graph Neural Networks [8, 9] also crucially rely
on pooling operators, learning node representations by aggregating embeddings derived from neighbouring
nodes. Essentially, in all these cases we have an embedding e which is obtained by aggregating embeddings e1, ..., ek using some pooling operator ⊕:

e = ⊕(e1, ..., ek)    (1)
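To make this concrete, the short sketch below (our own illustration in Python/NumPy; the function name and interface are not from the paper) implements the standard pooling operators analysed in the remainder of the paper: averaging, summation, component-wise maximum and the Hadamard (component-wise) product.

```python
import numpy as np

def pool(embeddings, op="mean"):
    """Aggregate equally sized vectors with one of the standard pooling operators."""
    E = np.stack(embeddings)           # shape (k, d): one row per input embedding
    if op == "mean":
        return E.mean(axis=0)
    if op == "sum":
        return E.sum(axis=0)
    if op == "max":
        return E.max(axis=0)           # component-wise maximum
    if op == "hadamard":
        return np.prod(E, axis=0)      # component-wise (Hadamard) product
    raise ValueError(f"unknown pooling operator: {op}")

e1, e2, e3 = np.array([1.0, 0.0, 2.0]), np.array([0.0, 3.0, 1.0]), np.array([2.0, 1.0, 0.0])
print(pool([e1, e2, e3], "max"))       # -> [2. 3. 2.]
```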
Under the epistemic¹ view, this pooling operator is implicitly assumed to aggregate the knowledge that is captured by the embeddings e1, ..., ek. For instance, the embeddings e1, ..., ek may encode which objects are present in different parts of the image. After pooling these embeddings, we should end up with an embedding e that captures which objects are present throughout the entire image. Let us write Γ(ei) for the knowledge that is captured by the embedding ei. More precisely, we will think of Γ(ei) as a set of properties that are known to be satisfied. If we view pooling as the process of accumulating knowledge from different sources (e.g. from different regions of the image, or different neighbours in a graph neural network), then we would expect the following to be true: Γ(e) = Γ(e1) ∪ ... ∪ Γ(ek). We will refer to this principle as the epistemic pooling principle. As an important special case, we will consider the case where the properties of interest correspond to possible worlds (i.e. propositional interpretations). The set Γ(ei) then contains the set of possible worlds that can be excluded based on the knowledge encoded in the embedding ei. In other words, Γ(ei) can then be characterised as a propositional formula. The epistemic pooling principle then states that the formula corresponding to ⊕(e1, ..., ek) should be equivalent to the conjunction of the formulas corresponding to e1, ..., ek.
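As a minimal illustration of the epistemic pooling principle, the following sketch assumes a deliberately simple encoding (our own choice, not the paper's general setting) in which embeddings are binary and Γ(e) is the set of coordinates equal to 1; with component-wise max-pooling, the principle then holds by construction.

```python
import numpy as np

def gamma(e):
    """Knowledge captured by e: here, the set of coordinates (read: excluded worlds) equal to 1."""
    return {j for j, v in enumerate(e) if v == 1}

e1 = np.array([1, 0, 0, 1])        # excludes worlds 0 and 3
e2 = np.array([0, 0, 1, 1])        # excludes worlds 2 and 3

pooled = np.maximum(e1, e2)        # component-wise max-pooling

# Epistemic pooling principle: Gamma(pooled) = Gamma(e1) ∪ Gamma(e2)
assert gamma(pooled) == gamma(e1) | gamma(e2)
print(sorted(gamma(pooled)))       # -> [0, 2, 3]
```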
The main aim of this paper is to study under which conditions the epistemic pooling principle can be
satisfied. Analysing a number of standard pooling operators, we find that the epistemic pooling principle
can be satisfied for all of them, but with several important caveats:
• We need at least as many dimensions as there are properties. In settings where we want to model propositional formulas, the properties of interest correspond to possible worlds. Without further restrictions, this means that we need embeddings with as many dimensions as there are possible worlds.

• For most of the pooling operators, we find that embeddings need to be constrained in a particular way, e.g. by only allowing vectors with non-negative coordinates.

• We also identify important restrictions on how embeddings can be linked to the formulas they capture. In particular, when summation or averaging is used for pooling, we find that the satisfaction of propositional formulas can, in general, not be predicted by a linear classifier when the epistemic pooling principle is satisfied.
The fundamental question which we want to answer is whether a vector-based representation can act as
a formal knowledge representation framework, or whether reasoning with neural networks is inevitably
approximate in nature. While we focus on a theoretical analysis of pooling operators in this paper, our
results provide a number of important insights for the design of neural network models for tasks that
require reasoning. For instance, two operators emerge from our analysis as being particularly suitable for
¹ The word epistemic, in this paper, merely refers to the idea that vectors capture what we know about some properties of interest. In particular, note that our setting is not directly related to epistemic logics.
Figure 1: A graph containing four types of nodes. The nodes c1, c2 represent committees, r1, r2, r3 represent researchers, a1, a2, a3 represent articles, and t1, t2, t3, t4, t5 represent research topics. Topic nodes are connected to the articles that discuss them. Article nodes are connected to their authors. Researchers are connected to the committees they belong to.
applications where we need to reason about propositional formulas: max-pooling, with the constraint that
the coordinates of all embeddings are upper-bounded by some constant z, and the Hadamard operator (i.e.
the component-wise product), with the constraint that all coordinates are non-negative.
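The suitability of the Hadamard operator can be previewed with a small sketch: for vectors with non-negative coordinates, a coordinate of the component-wise product is 0 exactly when at least one of the inputs is 0 there, so zero coordinates accumulate under pooling. Reading a zero coordinate as an excluded possible world is our own illustrative choice here, not necessarily the construction developed later in the paper.

```python
import numpy as np

def zero_coords(e):
    """Coordinates where e is exactly 0 (read here, for illustration, as excluded worlds)."""
    return {j for j, v in enumerate(e) if v == 0.0}

e1 = np.array([0.0, 2.0, 1.0, 0.5])
e2 = np.array([1.5, 0.0, 3.0, 2.0])

pooled = e1 * e2                   # Hadamard (component-wise) product

# With non-negative coordinates, the zero coordinates of the product are exactly
# the union of the zero coordinates of the inputs.
assert zero_coords(pooled) == zero_coords(e1) | zero_coords(e2)
print(pooled)                      # -> [0. 0. 3. 1.]
```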
Our results also help to explain why neuro-symbolic methods [10] outperform Graph Neural Networks (GNNs) in certain applications. GNNs are, in theory, capable of reasoning in the two-variable
fragment of first-order logic with counting quantifiers (FOC2) [11]. Nonetheless, in practice, GNNs often
perform relatively poorly in tasks that require reasoning, even when staying within FOC2. Crucially, GNNs
typically use averaging (or summation) for pooling the messages coming from adjacent nodes. As we will see
in Section 5.4, in such cases, GNNs can only capture logical reasoning if the node embeddings are essentially
binary. The discrete nature of the required embeddings makes them difficult to learn, which helps to explain
why GNNs are often outperformed by models that are more specifically tailored towards reasoning [12, 13].
Furthermore, our lower bounds on the required dimensionality of epistemic embeddings suggest that,
beyond toy problems, using neural networks for reasoning inevitably requires some kind of modularity.
For instance, an important challenge in Natural Language Processing is to design models that can reason
about information that comes from different sources (e.g. different news sources). Combining different text
fragments by pooling their embeddings would require a prohibitively high dimensionality, if we want these
embeddings to capture epistemic states in a faithful way. Instead, it is more common to use vectors to encode
what we know about a given entity, or about the relationship between two entities. Reasoning about the
overall story then requires us to combine these different entity and relation vectors, by using the structure
of the problem domain in some way, e.g. by using GNNs or neuro-symbolic approaches. By exploiting this
structure, embeddings can be used to capture more focused knowledge, which means that the prohibitively
high dimensionality that is otherwise needed can be avoided.
The remainder of this paper is structured as follows. In the next section, we first discuss a number of
simple examples to illustrate the considered setting. Subsequently, in Section 3, we formalise the problem
setting and introduce the notations that will be used throughout the paper. In Section 4, we then analyse
under which conditions the epistemic pooling principle can be satisfied. One of the main findings from this
section is that the requirement to satisfy the epistemic pooling principle fundamentally constrains which
embeddings can be allowed, and how these embeddings encode knowledge. In Section 5, we then investigate
how this impacts our ability to use vector embeddings for propositional reasoning. In particular, we focus
on the problem of verifying whether a given propositional formula is satisfied in the epistemic state encoded
by a given vector. Subsequently, in Section 6, we look at a generalisation of the epistemic pooling principle
for dealing with weighted epistemic states. In this case, vectors encode the strength with which we believe a
given property to be satisfied. Section 7 presents a discussion of our results in the context of related work,
after which we summarise our conclusions.
2. Illustrative examples
In this section, we consider a number of toy examples to illustrate our problem setting. These examples
involve message-passing GNNs, where pooling arises because we need to aggregate the messages coming from
adjacent nodes. In Section 2.1, we discuss a simple example to illustrate the basic principles. This example
involves aggregating sets of properties, without any further reasoning. We then build on this example in
Section 2.2 to illustrate how propositional reasoning can be carried out in this framework. Section 2.3
illustrates a setting where we need to reason in the presence of background knowledge. Finally, Section 2.4
shows how this can be extended to situations involving non-monotonic reasoning with defeasible knowledge.
2.1. Basic setting
Figure 1 displays a graph with four types of nodes, referring to research topics (t1, t2, t3, t4, t5), scientific
articles (a1, a2, a3), researchers (r1, r2, r3) and committees (c1, c2), respectively. In this graph, research topics
are connected to the articles that discuss them, articles are connected to their authors and researchers are
connected to the committees they belong to. We say that a researcher is an expert on a topic if they have
published at least one article which discusses this topic. We say that a committee is complete if it contains
an expert on each of the five topics. For instance, in the case of Figure 1, we can see that researcher r1
is an expert on topics t1, t2; researcher r2 is an expert on topics t1, t2, t3, t4; and researcher r3 is an expert on topics t4, t5. It follows that committee c2 is complete, as each of the five topics is covered by its two members, but committee c1 is not. We are now interested in designing a message-passing GNN which can
predict whether a given committee is complete or not. A message-passing GNN learns embeddings of the
nodes of a given graph, by iteratively updating each node’s embedding based on the embeddings of its
neighbouring nodes. To specify a message-passing GNN, we thus need to specify (i) the initial embedding
of each node and (ii) the update mechanism which is used.
A straightforward solution is as follows. Let us write x^(l) ∈ R^5 for the representation of a node x in layer l of the GNN. In particular, x^(0) represents the input embedding of node x, which we define as follows:

t1^(0) = (1, 0, 0, 0, 0)   t2^(0) = (0, 1, 0, 0, 0)   t3^(0) = (0, 0, 1, 0, 0)   t4^(0) = (0, 0, 0, 1, 0)   t5^(0) = (0, 0, 0, 0, 1)

In other words, we use a one-hot encoding for representing the different topics. The input embeddings for all the other nodes are set to x^(0) = (0, 0, 0, 0, 0). In the subsequent layers, the node representations are updated as follows:

x^(l+1) = max({x^(l)} ∪ {y^(l) | (y, x) ∈ E})    (2)
where E represents the set of edges, i.e. (y, x) ∈ E if there is an edge from node y to node x, and the maximum is applied component-wise. It is easy to verify that an article ai covers topic tj if the jth coordinate of ai^(l) is 1, for l ≥ 1. Similarly, a researcher ri is an expert on topic tj if the jth coordinate of ri^(l) is 1, for l ≥ 2; and a committee ci contains expertise on topic tj if the jth coordinate of ci^(l) is 1, for l ≥ 3. We thus have that the committee ci is complete iff ci^(3) = (1, 1, 1, 1, 1). Note that the latter condition can be checked using a linear scoring function, since it is equivalent with ci^(3) · 1 ≥ 5, where 1 = (1, 1, 1, 1, 1).
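The construction above can be spelled out in a few lines of code. The sketch below (our own Python/NumPy illustration) uses a hypothetical edge set that is consistent with the expertise described in the text, rather than the exact edges of Figure 1, applies the max-pooling update (2) for three layers, and checks completeness with the linear scoring function ci^(3) · 1 ≥ 5.

```python
import numpy as np

# A hypothetical edge set, consistent with the expertise described in the text
# (not read off Figure 1).  An edge (y, x) means that messages flow from y to x:
# topics -> articles -> researchers -> committees.
edges = [
    ("t1", "a1"), ("t2", "a1"), ("t3", "a2"), ("t4", "a2"), ("t4", "a3"), ("t5", "a3"),
    ("a1", "r1"), ("a1", "r2"), ("a2", "r2"), ("a3", "r3"),
    ("r1", "c1"), ("r2", "c1"), ("r2", "c2"), ("r3", "c2"),
]
nodes = {n for edge in edges for n in edge}

# Input embeddings: one-hot vectors for the five topic nodes, zero vectors otherwise.
emb = {n: np.zeros(5) for n in nodes}
for i in range(5):
    emb[f"t{i + 1}"][i] = 1.0

# Update rule (2): component-wise maximum over a node's own embedding and those of
# its in-neighbours.  Three layers propagate topics to articles, researchers, committees.
for _ in range(3):
    emb = {x: np.max([emb[x]] + [emb[y] for (y, z) in edges if z == x], axis=0)
           for x in nodes}

# A committee is complete iff its embedding is (1,1,1,1,1), i.e. iff  emb . 1 >= 5.
ones = np.ones(5)
for c in ("c1", "c2"):
    print(c, "complete" if emb[c] @ ones >= 5 else "incomplete")
# -> c1 incomplete, c2 complete
```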
In the proposed construction, a number of particular design decisions were made. For instance, we used
5-dimensional embeddings for representing nodes and we used the maximum for aggregating the evidence
from neighbouring nodes. Our analysis in this paper is aimed at studying the importance of such decisions.
For instance, we may wonder whether it is possible to devise an encoding that relies on four-dimensional
embeddings. If we are allowed to replace the maximum with an arbitrarily complex function (and we allow
non-linear scoring functions), then the answer is clearly positive². However, as we will see in Section 4,
² In fact, we can even use one-dimensional embeddings. Let n be a number which is larger than the number of topics k. Then we can initialise topic i as n^(i-1). To aggregate two node representations x and y, we first decompose them as x = x_0 + x_1 n + x_2 n^2 + ... + x_{k-1} n^{k-1} and y = y_0 + y_1 n + y_2 n^2 + ... + y_{k-1} n^{k-1}, with x_0, ..., x_{k-1}, y_0, ..., y_{k-1} ∈ {0, 1}. Then the aggregated embedding can be defined as max(x_0, y_0) + max(x_1, y_1) n + ... + max(x_{k-1}, y_{k-1}) n^{k-1}.
when we restrict ourselves to standard pooling operators, such as max-pooling, summation or averaging,
the answer is negative. Another question is whether a variant of the construction above can be found which
relies on summation or averaging, rather than max-pooling. Here, the answer depends on what assumptions
we make on the final node classifier, i.e. the function that maps the embedding c^(l) of a committee onto a
decision. The most common approach is to rely on a linear classifier to make such predictions. In that case,
we can show that no encoding of the problem can be found that relies on summation or averaging, as we
will see in Section 5.
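The one-dimensional encoding from footnote 2 can also be made concrete. The sketch below (our own illustration; the helper names are hypothetical) encodes topic i as n^(i-1) and aggregates two codes by taking the digit-wise maximum of their base-n expansions.

```python
def digits(x, n, k):
    """Base-n expansion of x, least significant digit first, padded to k digits."""
    return [(x // n ** i) % n for i in range(k)]

def aggregate(x, y, n, k):
    """Digit-wise maximum of the base-n expansions of x and y (footnote 2)."""
    return sum(max(dx, dy) * n ** i
               for i, (dx, dy) in enumerate(zip(digits(x, n, k), digits(y, n, k))))

k = 5                                              # number of topics
n = k + 1                                          # any base larger than the number of topics
topics = [n ** (i - 1) for i in range(1, k + 1)]   # topic i is encoded as n**(i-1)

print(aggregate(topics[0], topics[3], n, k))       # encodes {t1, t4}: 1 + 6**3 = 217
```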
2.2. Propositional reasoning
The aforementioned example simply required us to aggregate sets of features, which corresponded to
research topics in that example. Throughout this paper, we will refer to such features as properties. At first
glance, it may seem like this setting only involves rather basic forms of reasoning. However, by identifying
the considered properties with possible worlds, we can in fact design GNNs which perform propositional
reasoning. We illustrate this with an example. Let us consider the same setting as before, but now we
only focus on research topics, articles and researchers. We say that a researcher is a generalist if they have
worked on at least two sufficiently distinct topics. Let us write ti to denote that some researcher has worked on topic ti. Then we define:

generalist ≡ (t1 ∧ t4) ∨ (t1 ∧ t5) ∨ (t2 ∧ t5)
where the idea is that the other topic combinations are too similar to each other (e.g. working on t1 and t2
would not make someone a generalist because these are related research topics). We want to predict whether
a given researcher is a generalist by applying a linear classifier to the corresponding node embedding. As
before, we assume that available knowledge about researchers is encoded as a graph with topic, article and
researcher nodes (although we no longer consider committees). Let ω1, ..., ω32 be an enumeration of all the
possible worlds of the propositional logic over the set of atoms {t1, ..., t5}. The input embedding ti^(0) of the node ti is now defined as a 32-dimensional vector, where the jth component is 0 if ωj ⊨ ¬ti and −1 otherwise. Note how each component now corresponds to a possible world. The jth coordinate in the embedding of node n is 0 if that node captures the knowledge that the world ωj can be excluded. The input embeddings of the article and researcher nodes are initialised as (−1, ..., −1). The embeddings of the nodes in the subsequent layers are again computed using (2). For l ≥ 1 we then have that ai^(l) = (x1, ..., x32) where xj = 0 iff article ai has some topic tk such that ωj ⊨ ¬tk, and xj = −1 otherwise. For l ≥ 2, we have that ri^(l) = (x1, ..., x32) where xj = 0 iff researcher ri has written an article that has some topic tk such that ωj ⊨ ¬tk. In other words, we have that xj = 0 if we can exclude ωj as a possible representation of the expertise of ri based on the knowledge encoded in the graph. To test whether we can entail that ri is a generalist, it suffices to check whether every countermodel of (t1 ∧ t4) ∨ (t1 ∧ t5) ∨ (t2 ∧ t5) can be excluded. Let c ∈ R^32 be the vector whose jth component is 0 if ωj is a model of (t1 ∧ t4) ∨ (t1 ∧ t5) ∨ (t2 ∧ t5) and 1 otherwise. Then we have that researcher ri is a generalist iff

ri^(2) · c ≥ 0
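To make this construction concrete, the sketch below (our own Python/NumPy illustration) enumerates the 32 possible worlds, builds the 0/−1 topic embeddings, pools them with the component-wise maximum for a hypothetical researcher who has written one article on t1 and one on t5 (this researcher and the enumeration order of the worlds are our own choices), and applies the linear test ri^(2) · c ≥ 0.

```python
import itertools
import numpy as np

atoms = ["t1", "t2", "t3", "t4", "t5"]
worlds = list(itertools.product([False, True], repeat=len(atoms)))  # the 32 possible worlds

def topic_embedding(i):
    """jth coordinate is 0 if world j makes atom i false (it can be excluded), -1 otherwise."""
    return np.array([0.0 if not w[i] else -1.0 for w in worlds])

unknown = np.full(len(worlds), -1.0)   # initial embedding of article/researcher nodes

# Hypothetical researcher with one article on t1 and one article on t5; max-pooling
# through the article layer (together with the node's own -1 initialisation)
# accumulates the excluded worlds.
article1 = np.maximum(unknown, topic_embedding(0))   # article covering t1
article2 = np.maximum(unknown, topic_embedding(4))   # article covering t5
researcher = np.maximum(unknown, np.maximum(article1, article2))

def generalist(w):
    """(t1 and t4) or (t1 and t5) or (t2 and t5)."""
    t1, t2, _, t4, t5 = w
    return (t1 and t4) or (t1 and t5) or (t2 and t5)

# Linear scoring vector c: 1 for countermodels of the generalist formula, 0 for its models.
c = np.array([0.0 if generalist(w) else 1.0 for w in worlds])

print(researcher @ c >= 0)   # -> True: the researcher is provably a generalist
```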
By associating properties with possible worlds, we can thus study settings in which GNNs need to aggregate
knowledge encoded using propositional formulas. While the kind of knowledge that we had to consider
in this example was rather simple, in general the formulas that need to be aggregated can be arbitrarily
complex propositional formulas (including e.g. formulas with negation). The topic of propositional reasoning
by pooling embeddings, as illustrated in the aforementioned example, will be studied in Section 5. Among
others, we will see that the choice of the pooling operator plays an important role, where faithful reasoning
is, in general, not possible with averaging and summation, when linear scoring functions are used.
2.3. Background knowledge
Background knowledge can straightforwardly be taken into account by restricting the set of possible
worlds to the models of a given knowledge base. Embeddings then capture which models of the knowledge
base are still possible. To illustrate this, let us consider a similar setting as before, but with four topics: