Theorem 3.1: For every predicate $P(x)$ in Eq. 2, there exists a predicate $P(y_0)$ which has the following formulation and is logically equivalent to $P(x)$:
\[
P(y_0) = \exists y_1 \exists y_2 \cdots \exists y_T \; \bigvee_{m=1}^{M} \Big[ \bigwedge_{t \in N_m} P^m_t(y_t) \;\wedge \bigwedge_{(t_p, t_c) \in E_m} W^m_{(t_p, t_c)}(y_{t_p}, y_{t_c}) \;\wedge\; Q_m \Big], \tag{3}
\]
where each $N_m \subset \{0, 1, 2, \cdots, T\}$ is a set of indices with $0 \in N_m$, and $E_m$ is a set of index pairs. Each $N_m$ and $E_m$ form a graph $G_m = (N_m, E_m)$.
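For intuition, a minimal illustrative instance of formulation (3) (our own toy case, not the example in Eq. 1) takes $M = 1$, $T = 1$, $N_1 = \{0, 1\}$, and $E_1 = \{(0, 1)\}$:
\[
P(y_0) = \exists y_1 \; P^1_0(y_0) \wedge P^1_1(y_1) \wedge W^1_{(0,1)}(y_0, y_1) \wedge Q_1,
\]
i.e., a single join of a binary predicate with a unary predicate, rooted at $y_0$.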
We then provide the following definitions and an assumption to characterize the logical expressiveness of our join-chain network.
Definition 3.1 ($\text{FOE}^T(\mathcal{I}_1, \mathcal{I}_2)$): $\text{FOE}^T(\mathcal{I}_1, \mathcal{I}_2)$ is the set of all predicates $P(y_0)$ which can be transformed into the formulation in Eq. (3) and satisfy the following requirements.
• Each $W^m_{(t_p, t_c)}(y_{t_p}, y_{t_c}) \in \mathcal{I}_2$ and $P^m_t(y_t) = P^m_{j_1}(y_t) \wedge P^m_{j_2}(y_t) \wedge \cdots \wedge P^m_{j_t}(y_t)$, where $\{P^m_{j_1}, P^m_{j_2}, \cdots, P^m_{j_t}\} \subset \mathcal{I}_1$.
• Each graph $G_m = (N_m, E_m)$ is a tree.
Definition 3.2: Each graph $G_m = (N_m, E_m)$, $m \in \{1, 2, \cdots, M\}$, is a tree. We denote the heights of these trees as $L_1, L_2, \cdots, L_M$ respectively, with $L_{\max} = \max\{L_1, L_2, \cdots, L_M\}$, and the numbers of leaf nodes as $H_1, H_2, \cdots, H_M$ respectively, with $H_{\text{sum}} = H_1 + H_2 + \cdots + H_M$. We define the height of the predicate as $L = L_{\max}$ and the width of the predicate as $H = H_{\text{sum}}$.
Definition 3.3 ($\text{FOE}^T_{\bar{L}, \bar{H}}(\mathcal{I}_1, \mathcal{I}_2)$): $\text{FOE}^T_{\bar{L}, \bar{H}}(\mathcal{I}_1, \mathcal{I}_2)$ is the set of predicates $P(x) \in \text{FOE}^T(\mathcal{I}_1, \mathcal{I}_2)$ with height $L \le \bar{L}$ and width $H \le \bar{H}$.
Assumption 3.1: For any $W \in \mathcal{I}_2$, there exists some function $f$ such that $W = f(\mathcal{I}_s)$, where $\mathcal{I}_s \subset \mathcal{I}_1$.
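One illustrative way to realize Assumption 3.1 (our own sketch, not prescribed by the assumption itself) is to compute a binary predicate from the unary-predicate profiles of its two arguments; here $\varphi$ and the threshold rule are hypothetical notation:
\[
\varphi(x) = \big(P_{i_1}(x), \ldots, P_{i_k}(x)\big) \text{ with } \{P_{i_1}, \ldots, P_{i_k}\} = \mathcal{I}_s \subset \mathcal{I}_1, \qquad W(x, y) = \mathbb{1}\big[\langle \varphi(x), \varphi(y) \rangle > 0\big],
\]
so that $W$ is a function of $\mathcal{I}_s$ only, as the assumption requires.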
Now we can state the theorem on logical expressiveness.
Theorem 3.2: Under Assumption 3.1, a join-chain network with an $\bar{H}$-head, $\bar{L}$-layer self-attention block can express all the predicates in $\text{FOE}^T_{\bar{L}, \bar{H}}(\mathcal{I}_1, \mathcal{I}_2)$ if the input to the join-chain network is $\mathcal{I}_1$.
We revisit the example in Eq. 1 to illustrate the definitions and the theorem. There is one tree, i.e., $M = 1$. The node set of the graph is $N = \{0, 1, 2, 3, 4\}$ and the edge set is $E = \{(0, 1), (0, 2), (1, 3), (1, 4)\}$. The graph $(N, E)$ is a tree, visualized in Figure 2. The height of $P(y_0)$ is 2. The width is 3, since there are 3 leaf nodes, i.e., $y_3$, $y_4$, and $y_2$. According to Theorem 3.2, $P(x)$ can be expressed by a join-chain network with 3 heads and 2 layers.
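To make the height and width bookkeeping concrete, the following small Python sketch (our own illustration) computes the tree height and the number of leaves for the edge set of this example:

```python
# Height L and width H (number of leaves) for the example tree of Figure 2.
# Edges are (parent, child) pairs; node 0 is the root y0.
from collections import defaultdict

edges = [(0, 1), (0, 2), (1, 3), (1, 4)]

children = defaultdict(list)
for parent, child in edges:
    children[parent].append(child)

def height(node):
    """Number of edges on the longest path from `node` down to a leaf."""
    if not children[node]:
        return 0
    return 1 + max(height(c) for c in children[node])

nodes = {n for edge in edges for n in edge}
leaves = [n for n in nodes if not children[n]]

print("height L =", height(0))    # 2
print("width  H =", len(leaves))  # 3 (leaves y2, y3, y4)
```

Consistent with Theorem 3.2, this predicate calls for a join-chain network with 3 heads and 2 layers.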
IV. RETHINK MULTI-HEAD ATTENTION AS JOIN OPERATION
To design a join operator with differentiable learning capability, we study various neural operators for approximating the symbolic join operation. Interestingly, we find that the widely adopted multi-head attention mechanism can be understood as a join operator.
First, we denote the domain of all the predicates, including all the binary predicates $W(x, y)$ and unary predicates $P(y)$, as $\{x_1, x_2, x_3, \ldots, x_S\}$, which means $x, y \in \{x_1, x_2, x_3, \ldots, x_S\}$. In the multi-head attention mechanism, the core part is the product between the self-attention matrix $A$ and the value tensor $V$,
\[
Z = AV. \tag{4}
\]
If we want to calculate the join operation between $W(x, y)$ and $P(y)$, we know
\[
\exists y\, W(x, y) \wedge P(y) = \bigvee_{s=1}^{S} W(x, x_s) \wedge P(x_s). \tag{5}
\]
Then, if multiplication is understood as the conjunction operation $\wedge$ and addition as the disjunction $\vee$, we can consider the value tensor $V$ as $[P(x_1), P(x_2), \ldots, P(x_S)]$ and the self-attention matrix $A$ as $\{W(x_s, x_{s'})\}$. Based on this, the $s$-th element $z_s$ of the tensor $Z$ is the value of $\exists y\, W(x_s, y) \wedge P(y)$. Hence the tensor $Z$ gives the join of $W(x, y)$ and $P(y)$. Generally speaking, the self-attention matrix $A$ learns all the values of the binary predicate $W(x, y)$, and the value tensor $V$ learns all the values of the unary predicate $P(y)$. For each attention head in each layer, the self-attention matrix $A$ learns one binary predicate $W(x, y)$. Hence, the number of leaf nodes in Figure 2 reflects the importance of multiple heads for self-attention.
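As a sanity check of this reading, the following minimal NumPy sketch (our own illustration, with Boolean-valued predicate groundings; real attention uses softmax weights rather than 0/1 entries) compares the product $AV$ evaluated with multiplication-as-$\wedge$ and addition-as-$\vee$ against the explicit join $\exists y\, W(x, y) \wedge P(y)$:

```python
import numpy as np

S = 5  # domain size: {x_1, ..., x_S}
rng = np.random.default_rng(0)

# Boolean groundings: A[s, t] = W(x_s, x_t), V[t] = P(x_t)
A = rng.integers(0, 2, size=(S, S)).astype(bool)
V = rng.integers(0, 2, size=S).astype(bool)

# Explicit join: z_s = OR over t of (W(x_s, x_t) AND P(x_t))
join = np.array([np.any(A[s] & V) for s in range(S)])

# Matrix product A V over the Boolean semiring
# (multiplication read as AND, addition read as OR).
product = (A.astype(int) @ V.astype(int)) > 0

assert np.array_equal(join, product)
print(join.astype(int))  # value of ∃y W(x_s, y) ∧ P(y) for each x_s
```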
V. DISCUSSION
Our work provides a novel understanding of multi-head attention from a logical reasoning view. Based on previous work on semantic parsing, such as dependency trees and lambda dependency-based compositional semantics [9], [10], most sentences can be represented in logical forms that have a tree structure similar to the example in Eq. 1. Hence, our work offers a new explanation, from this perspective, of why multi-head attention has achieved such great success in the recent development of NLP.

Furthermore, the logical reasoning view provides suggestions on how to improve the design of transformers. Since the logical expressions of most sentences in NLP have tree structures similar to the example in Eq. 1, shown in Figure 2, the number of join operations decreases as the computation proceeds. This means the number of heads could decrease in the layers farther from the input, which provides a new insight into how to compress the multi-head attention blocks in transformers. Besides, it is worth noting that skip connections are also heavily used in transformers. Our work provides a new interpretation of the use of skip connections in transformers, different from their original motivation in residual learning. Moreover, Assumption 3.1 also suggests a potential way to augment the transformer with external knowledge: we could incorporate additional commonsense knowledge into the self-attention block to boost the logical reasoning as well as the inference ability of transformers.
Our work suggests many interesting future directions. One is to design more efficient neural operators for the join