A Logic for Expressing Log-Precision Transformers
William Merrill
New York University
willm@nyu.edu
Ashish Sabharwal
Allen Institute for AI
ashishs@allenai.org
Abstract
One way to interpret the reasoning power of transformer-based language models
is to describe the types of logical rules they can resolve over some input text.
Recently, Chiang et al. (2023) showed that finite-precision transformer classifiers
can be equivalently expressed in a generalization of first-order logic. However,
finite-precision transformers are a weak transformer variant because, as we show,
a single head can only attend to a constant number of tokens and, in particular,
cannot represent uniform attention. Since attending broadly is a core capability for
transformers, we ask whether a minimally more expressive model that can attend
universally can also be characterized in logic. To this end, we analyze transformers
whose forward pass is computed in log n precision on contexts of length n. We prove any log-precision transformer classifier can be equivalently expressed as a first-order logic sentence that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. This is the tightest known upper bound and first logical characterization of log-precision transformers.
Any log-precision transformer can be re-expressed as a sentence in FO(M) logic, e.g.:

    Mi. a(i) ∧ Mj. b(j) ∧ ¬∃k,ℓ. (a(k) ∧ b(ℓ) ∧ ℓ < k)

(m a's followed by m b's, i.e., a^m b^m). Example strings from the figure: aaaabbbb (accepted), aaabbbbb and baaaabbb (rejected).

Figure 1: A first-order logic with majority (FO(M)) sentence for a^m b^m. In addition to standard ∀ and ∃ quantifiers over string indices, FO(M) allows majority quantifiers (M) that take a majority vote across indices. a(i) indicates whether token i is a (and analogously for b). We prove FO(M) can express any function computed by a log-precision transformer.
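To make the figure concrete, here is a small hand-written check (our illustration, not part of the paper) that evaluates Figure 1's sentence directly, using the quantifier semantics formalized later in Section 2.2: M is a majority vote over positions, and ∃ ranges over positions 1 to n.

```python
# Hand-coded evaluation of Figure 1's FO(M) sentence (illustrative; 1-indexed positions).
def a(w, i): return w[i - 1] == "a"
def b(w, i): return w[i - 1] == "b"

def figure1(w):
    n = len(w)
    pos = range(1, n + 1)
    maj_a = 2 * sum(a(w, i) for i in pos) >= n            # Mi. a(i)
    maj_b = 2 * sum(b(w, j) for j in pos) >= n            # Mj. b(j)
    ordered = not any(a(w, k) and b(w, l) and l < k       # ¬∃k,ℓ. (a(k) ∧ b(ℓ) ∧ ℓ < k)
                      for k in pos for l in pos)
    return maj_a and maj_b and ordered

for w in ["aaaabbbb", "aaabbbbb", "baaaabbb"]:
    print(w, figure1(w))   # True, False, False
```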
1 Introduction
The incredible success of deep learning models, especially very large language and vision transformers
with hundreds of billions of parameters (Brown et al., 2020; Thoppilan et al., 2022), has come at the
cost of increasingly limited understanding of how these models actually work and when they might
fail. This raises many concerns, such as around their safe deployment, fairness, and accountability.
Does the inner working of a transformer defy description in a simpler symbolic system that we
can better understand? Or can transformer computation be described using a familiar symbolic
formalism? Understanding how to view the reasoning process of a transformer in terms of logic could
potentially expand our ability to formally reason about their behavior over large domains of inputs.
Chiang et al. (2023) provide a partial answer to this question, showing that any finite-precision
transformer classifier can be expressed as a sentence in a variant of first-order logic with counting
quantifiers and modular arithmetic over input position indices. Specifically, counting quantifiers take the form ∃^{=x} i : ϕ(i), where x is a count variable and i is a position index. They show that there exists
a single sentence in this logic that computes the output of the transformer for any input string of any
length. This is a powerful result because it shows that a simple logical formalism is fully sufficient to
describe all the complexity of a massive finite-precision transformer. It also provides an upper bound
on finite-precision transformers: any function that cannot be defined in first-order counting logic with
modular indexing cannot be expressed by the transformer.
However, Chiang et al.’s result is not fully general because it relies on the transformer precision being
fixed with respect to the transformer’s context length. More generally, as we will demonstrate in
Section 3, finite-precision transformers are a fundamentally weak variant of transformers: crucially, they cannot express uniform attention patterns, which are a core algorithmic primitive of transformers (Weiss et al., 2021). In fact, we show that they can only attend to a constant number of input positions, which may be seen as a rather limited generalization of hard attention.[1] For example, Chiang et al. show that their logic for finite-precision transformers cannot recognize a^m b^m, whereas in practice, transformers can (Bhattamishra et al., 2020).[2] This motivates studying a formal model of transformers where precision grows with context length (which we formalize as log-precision), making it possible to capture uniform attention as well as other broad attention patterns. This is useful both for recognizing a^m b^m and more generally for reasoning globally over the input.
We demonstrate that log-precision transformer classifiers can also be expressed as sentences in a simple logic: first-order logic with majority, or FO(M), over input strings (Barrington et al., 1990). In addition to standard existential and universal quantifiers, FO(M) has majority quantifiers that return true iff at least half the propositions they quantify are true. It also allows comparing input positions (e.g., ℓ < k in Figure 1) and accessing their individual bits. Our main result is as follows:

Theorem 1 (Informal version of Theorem 2). For any log-precision transformer T, there exists an FO(M) sentence ϕ that computes the same function as T, i.e., ϕ(x) = T(x) for any input string x.
Upper bound. Theorem 2 shows transformers with more than finite precision can also be expressed
in a simple extension of first-order logic, going beyond Chiang et al. (2023)’s result. On the other
hand, FO(M) is a strict superset of Chiang et al.'s counting logic; it can simulate counting quantifiers
(see Section 2.2) and allows non-modular position comparisons. Thus, handling a more general class
of transformers powerful enough to express uniform attention slightly weakens the bound.
Still, our result constitutes (to our knowledge) the tightest upper bound on log-precision transformers
and the first defined in terms of logic, building on a line of complexity-theoretic work analyzing the
power of transformers (Hahn, 2020; Merrill et al., 2022; Liu et al., 2023; Merrill & Sabharwal, 2023).
In particular, FO(M) strengthens the upper bound of log-space-uniform TC^0 by Merrill & Sabharwal (2023). The refined bound adds to the limitations of transformers identified by Merrill & Sabharwal (2023): for example, it establishes unconditionally that log-precision transformers cannot compute boolean matrix permanents, and shows that, in a certain formal sense, integer division and matching parentheses are among the formally hardest problems that transformers can solve (see Section 4).[3]
Mechanistic interpretability. Beyond providing an upper bound on the reasoning problems solvable by transformers, we believe Theorem 1 could guide the design of "transformer-complete" programming languages similar in spirit to RASP (Weiss et al., 2021). RASP is a declarative programming language designed to capture transformer computation, and Lindner et al. (2023) implement a compiler from RASP into transformers. Unlike RASP, FO(M) can provably express any transformer (Theorem 1), which we believe justifies using it (or an equivalent but more user-friendly variant) as a target language for programs extracted from transformers.
Similar to a decision tree, an FO(M) sentence has the interpretable property that each sub-sentence corresponds to a constraint on the input (see Figure 1). In contrast, the internal modules of a transformer or circuit do not satisfy this since they map between arbitrary latent spaces. We speculate this property
could facilitate interpreting models by translating them to FO(M), though a careful exploration of the algorithmic and HCI aspects of this idea lies outside the current paper's theoretical scope.

[1] Hard attention is provably substantially weaker than general attention (Hao et al., 2022; Merrill et al., 2022).
[2] Technically, the empirical results of Bhattamishra et al. (2020) are for a^m b^m c^m, a harder variant of a^m b^m.
[3] To be clear, Theorem 1 is one-sided: every transformer can be expressed as an FO(M) sentence, but not necessarily the other way. Moreover, we believe that many FO(M) sentences cannot be expressed by transformers. An exact logical characterization of transformers remains an open problem.
Contributions. Our results shed new light on how to view the computation inside transformers in
terms of logic. Specifically, our main contributions are to prove the following:
1. Fixed-precision transformers can only attend to a fixed number of tokens, and those with precision less than log log n cannot uniformly attend over length-n contexts (Proposition 1).
2. Log-precision transformer classifiers can be expressed as sentences in FO(M) (Theorem 2).
2 Preliminaries: Transformers and FO(M)
Let Σ be a finite alphabet. We denote by ∗ the Kleene star operator, i.e., for a set X, X^∗ = ⋃_{n=0}^∞ X^n. We will view transformers and FO(M) sentences both as functions from Σ^∗ → {0,1}, and show that any function a transformer computes can also be computed by an FO(M) sentence.
2.1 Transformers
We view the transformer precision p as a function of the context length n, writing p(n) where appropriate. Let D_p be the datatype of p-precision floats, i.e., tuples ⟨m, e⟩ where m, e are signed integers together taking p bits. Using |x| to mean the size of integer x, a float represents the value m · 2^{e−|m|+1}.[4] Following Appendix A of Merrill & Sabharwal (2023), we define p-truncated addition (+, Σ), multiplication (·), and division (/) over D_p. We now define a transformer encoder binary classifier over D_p, largely adopting Merrill & Sabharwal's notation.[5]
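As an unofficial illustration of this datatype, the sketch below implements the ⟨m, e⟩ representation and Footnote 4's example value; the actual truncated operations are those of Appendix A of Merrill & Sabharwal (2023) and are not reproduced here.

```python
# Illustrative sketch (not the paper's Appendix-A definitions) of the D_p datatype:
# a p-precision float is a pair <m, e> of signed integers taking p bits in total,
# representing the value m * 2^(e - |m| + 1), where |m| is the bit size of m.

def bitsize(x: int) -> int:
    """The paper's |x|: number of bits in the magnitude of x."""
    return abs(x).bit_length()

def value(m: int, e: int) -> float:
    return m * 2.0 ** (e - bitsize(m) + 1)

# Footnote 4's example: <101_2, 010_2> = <5, 2> represents 1.01_2 x 2^(10_2) = 5.0.
assert value(5, 2) == 5.0

# With m and e together limited to p bits, only finitely many values are representable;
# p-truncated +, Σ, ·, and / round exact results back into this set.
```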
Definition 1. A p-precision transformer T with h heads, d layers, model dimension m (divisible by h), and feedforward width w is specified by:
1. An embedding function ϕ : Σ × N → D_p^m whose form is defined in Appendix C.1;[6]
2. For each 1 ≤ ℓ ≤ d and 1 ≤ k ≤ h, a head similarity function s_k^ℓ : D_p^m × D_p^m → D_p whose form is defined in Appendix C.2;
3. For each 1 ≤ ℓ ≤ d and 1 ≤ k ≤ h, a head value function v_k^ℓ : D_p^m → D_p^{m/h} whose form is defined in Appendix C.2;
4. For each 1 ≤ ℓ ≤ d, an activation function f^ℓ : (D_p^{m/h})^h × D_p^m → D_p^m whose form is defined in Appendix C.3 and implicitly uses the feedforward dimension w;
5. An output classifier head κ : D_p^m → {0,1} whose form is defined in Appendix C.4.
Definition 2. We define the transformer computation and output as a function of an input x ∈ Σ^n.
1. Embeddings: For 1 ≤ i ≤ n, h_i^0 = ϕ(x_i, i).[6]
2. Self Attention: For 0 ≤ ℓ ≤ d − 1, (multihead) self-attention block ℓ + 1 computes h attention heads:
$$a^{\ell+1}_{i,k} = \sum_{j=1}^{n} \frac{s^{\ell+1}_k(h^\ell_i, h^\ell_j)}{Z_{i,k}} \cdot v^{\ell+1}_k(h^\ell_j), \quad \text{where } Z_{i,k} = \sum_{j=1}^{n} s^{\ell+1}_k(h^\ell_i, h^\ell_j).$$
3. Activation Block: For 0 ≤ ℓ ≤ d − 1, activation block ℓ + 1 aggregates the head outputs to produce h^{ℓ+1}:
$$h^{\ell+1}_i = f^{\ell+1}(a^{\ell+1}_{i,1}, \ldots, a^{\ell+1}_{i,h}, h^\ell_i).$$
4. Classifier Head: The network prediction on x ∈ Σ^n is κ(h_n^d).
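The sketch below (ours, not the paper's) mirrors Definition 2 step by step, using ordinary floating-point arithmetic in place of D_p and treating the core functions ϕ, s, v, f, κ as caller-supplied Python callables, so Appendix C's constraints on their form are not modeled.

```python
# Illustrative forward pass following Definition 2 (exact floats instead of D_p;
# phi, s, v, f, kappa are caller-supplied stand-ins for the core functions).
import numpy as np

def forward(x, phi, s, v, f, kappa, d, h):
    """x: token sequence; s[l][k], v[l][k]: similarity/value functions of head k in block l+1;
    f[l]: activation function of block l+1; kappa: output classifier head."""
    n = len(x)
    H = [phi(x[i - 1], i) for i in range(1, n + 1)]                 # step 1: h_i^0
    for l in range(d):                                              # blocks l+1 = 1..d
        heads = []
        for k in range(h):                                          # step 2: self attention
            S = np.array([[s[l][k](H[i], H[j]) for j in range(n)] for i in range(n)])
            Z = S.sum(axis=1, keepdims=True)                        # Z_{i,k}
            V = np.stack([v[l][k](H[j]) for j in range(n)])         # v_k^{l+1}(h_j^l)
            heads.append((S / Z) @ V)                               # rows are a_{i,k}^{l+1}
        H = [f[l](*[A[i] for A in heads], H[i]) for i in range(n)]  # step 3: activation
    return kappa(H[n - 1])                                          # step 4: kappa(h_n^d)
```

Uniform attention corresponds to the special case where the similarity function is constant, making each a_{i,k}^{ℓ+1} a plain average of the value vectors over all positions; Section 3 shows this is exactly the pattern finite-precision transformers cannot realize for large n.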
[4] ⟨101, 010⟩ represents 1.01_2 × 2^{10_2}. This is closer to the IEEE standard than the m · 2^e semantics used in Merrill & Sabharwal (2023), letting us define the minimum representable float more realistically in Proposition 1.
[5] Increasing the classifier's output space arity (e.g., a transformer that predicts the next token) or switching to causal attention of a decoder-only model would not change our results. However, our proof no longer goes through if the decoder can generate tokens that get added to the input at the next step (cf. Pérez et al., 2019).
[6] ϕ, like p, is actually a function of the context length n, and Appendix C.1 enforces that ϕ is computable in O(log n) time, as standard choices of positional embeddings would satisfy.
We say T(x) = κ(h_{|x|}^d) and L_T is the language of x ∈ Σ^∗ such that T(x) = 1. We refer to ϕ, s_k^ℓ, v_k^ℓ, f^ℓ, and κ as the core functions in T, and to embeddings, self attention, activation, and the classifier head as the components of T. We write θ_T for the concatenated vector of parameters for the functions ϕ, s_k^ℓ, v_k^ℓ, f^ℓ, and κ, for all 1 ≤ ℓ ≤ d and 1 ≤ k ≤ h.
We define a log-precision transformer as one where p is at most O(log n) and is a "simple" function, i.e., computable in O(log n) time. In our model, the weights θ_T defining T are fixed, but the precision p used to compute the forward pass can depend on n (see Footnote 13 for a generalization).
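As a rough, unofficial sanity check tying this to Proposition 1, the snippet below uses the ⟨m, e⟩ semantics above to estimate how many bits the uniform attention weight 1/n requires: its exponent is about −log₂ n, so roughly log₂ log₂ n bits are needed, which a fixed precision eventually cannot supply while p(n) = O(log n) always can.

```python
# Estimate (under the <m, e> float semantics above) of the bits needed to represent 1/n.
import math

def bits_for_inverse(n: int) -> int:
    e = -math.floor(math.log2(n))        # 1/n is roughly value(1, e) with m = 1
    return 1 + 1 + abs(e).bit_length()   # 1 mantissa bit + sign + bits of the exponent

for n in [16, 1024, 2**20, 2**64]:
    print(n, bits_for_inverse(n), math.ceil(math.log2(n)))
# The bit requirement grows like log log n, while log precision p(n) = O(log n) keeps pace.
```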
2.2 First-Order Logic with Majority
As we will show, transformers can be translated into sentences in FO(M). But what do such sentences look like? Informally, FO(M) is first-order logic extended to also have majority (M) quantifiers. Following Barrington et al. (1990), our sense of FO(M) takes strings in Σ^∗ as input and returns 0 or 1 to define a formal language. In this setting, quantifiers range over indices (positions) into the string. Predicates can be applied to the variables introduced by these quantifiers.
Definition 3 (FO(M) index). Indices in FO(M) are integers denoting positions in the input string:
1. The constant 1, representing the first token's position.
2. The constant n, representing the last token's position.
3. Strings (e.g., i, j, k) representing variables ranging over positions 1 to n.
4. Any index built by applying addition or subtraction to other indices.[7]
Definition 4 (FO(M) formula). Formulas in FO(M) are constructed as follows:[8]
1. Let Σ be a finite alphabet. For each σ ∈ Σ and any index i, σ(i), e.g., a(i), is a formula that is true if the i-th input token is σ.[9]
2. For any indices i, j, the formula bit(i, j) returns the j-th bit of the binary expansion of i.[10]
3. For two indices i, j, i = j, i ≤ j, and i ≥ j are formulas with their conventional semantics.
4. For two formulas ϕ, ψ, ϕ ∧ ψ and ϕ ∨ ψ are formulas with their conventional semantics.
5. For any formula ϕ (which may refer to i), the following are valid formulas:
(a) ∃i. ϕ means some value of i in [1, n] makes ϕ true.
(b) ∀i. ϕ means all values of i in [1, n] make ϕ true.
(c) Mi. ϕ means at least n/2 values of i in [1, n] make ϕ true.
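To make Definition 4's semantics concrete, here is a small interpreter (our illustrative sketch, not part of the paper) that represents formulas as Python closures over the input string and a variable assignment; the bit predicate uses one possible indexing convention, and Figure 1's sentence is rebuilt from the combinators as a check.

```python
# Illustrative FO(M) interpreter: a formula is a function (w, env) -> bool, where w is
# the input string and env maps variable names to 1-based positions.
def tok(sigma, i):   # sigma(i): the i-th token is sigma
    return lambda w, env: w[env[i] - 1] == sigma

def bit(i, j):       # bit(i, j): j-th bit of index i (one possible convention: j-th LSB)
    return lambda w, env: ((env[i] >> (env[j] - 1)) & 1) == 1

def lt(i, j):   return lambda w, env: env[i] < env[j]
def land(p, q): return lambda w, env: p(w, env) and q(w, env)
def lnot(p):    return lambda w, env: not p(w, env)

def exists(i, p):   # ∃i. p
    return lambda w, env: any(p(w, {**env, i: v}) for v in range(1, len(w) + 1))

def forall(i, p):   # ∀i. p
    return lambda w, env: all(p(w, {**env, i: v}) for v in range(1, len(w) + 1))

def major(i, p):    # Mi. p: at least half of the positions 1..n satisfy p
    return lambda w, env: 2 * sum(p(w, {**env, i: v})
                                  for v in range(1, len(w) + 1)) >= len(w)

# Figure 1's sentence, assembled from these combinators:
fig1 = land(land(major("i", tok("a", "i")), major("j", tok("b", "j"))),
            lnot(exists("k", exists("l", land(land(tok("a", "k"), tok("b", "l")),
                                              lt("l", "k"))))))
print(fig1("aaabbb", {}), fig1("abab", {}))   # True False
```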
We use parentheses where necessary to disambiguate the order of operations. General formulas may contain free (i.e., unbound) variables: e.g., ∀i. i = j. A sentence is an FO(M) formula ϕ with no free variables. Sentences represent functions from Σ^∗ to {0,1} and thus define a formal language.[11]
Extensions. Beyond Definition 4, FO(M) can express counting and threshold quantifiers in terms of majority quantifiers (Barrington et al., 1990). Given a formula ϕ, a counting quantifier creates a new formula ∃^k i : ϕ that is true iff ϕ is true across exactly k values of i. Threshold quantifiers ∃^{≥k} and ∃^{≤k} work similarly but check if ϕ is true for at least or at most k values of i. In addition, we show in Appendix A that FO(M) can express conditional majority quantifiers, which create a formula Mi : ϕ [ψ] that is true iff ψ is true for at least half the values of i that make ϕ true.
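Appendix A and Barrington et al. (1990) give the actual constructions; the snippet below (ours) only evaluates the derived quantifiers' semantics directly by counting, in the same closure style as the interpreter above, rather than reducing them to M.

```python
# Direct-counting evaluators for the derived quantifiers (semantics only, not the
# reduction to majority quantifiers):
def count_exact(k, i, p):   # ∃^k i : p -- exactly k positions satisfy p
    return lambda w, env: sum(p(w, {**env, i: v}) for v in range(1, len(w) + 1)) == k

def at_least(k, i, p):      # ∃^{≥k} i : p
    return lambda w, env: sum(p(w, {**env, i: v}) for v in range(1, len(w) + 1)) >= k

def cond_major(i, p, q):    # Mi : p [q] -- q holds for at least half the i satisfying p
    def ev(w, env):
        sat = [v for v in range(1, len(w) + 1) if p(w, {**env, i: v})]
        return 2 * sum(q(w, {**env, i: v}) for v in sat) >= len(sat)
    return ev

# E.g., "exactly two a's": count_exact(2, "i", tok("a", "i"))("aab", {}) -> True
```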
2.2.1 Examples
To illustrate the formalism, we provide example languages definable in FO(M) with Σ = {a, b}. First, we show two languages that do not require majority quantifiers to express:
Example 1 (Bigram matching). Strings containing the bigram ab: ∃i [a(i) ∧ b(i + 1)].
Example 2 (Skip-bigram matching). Strings containing the long-distance pattern a . . . b (cf. "induction heads" of Elhage et al. 2021): ∃i [b(i) ∧ ∃j [j ≤ i ∧ a(j)]].
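Both examples can be checked mechanically; the short script below (our illustration) evaluates them directly, with the same 1-based index semantics as in Definition 3.

```python
# Direct evaluations of Examples 1 and 2 (1-based positions).
def example1(w):   # ∃i [a(i) ∧ b(i+1)]: contains the bigram "ab"
    n = len(w)
    return any(w[i - 1] == "a" and i + 1 <= n and w[i] == "b" for i in range(1, n + 1))

def example2(w):   # ∃i [b(i) ∧ ∃j [j ≤ i ∧ a(j)]]: some a occurs before some b
    n = len(w)
    return any(w[i - 1] == "b" and any(w[j - 1] == "a" for j in range(1, i + 1))
               for i in range(1, n + 1))

print(example1("bba"), example1("aab"))    # False True
print(example2("bbaa"), example2("abba"))  # False True
```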
[7] Barrington et al. (1990) did not introduce this as a primitive, but it can be simulated using the ≤ predicate.
[8] We write parentheses to indicate the order of operations.
[9] Barrington et al. (1990) define Q_b(i) for b ∈ {0,1}. We generalize this to an arbitrary vocabulary Σ by assuming each token is one-hot-encoded: σ(i) = Q_1(|Σ| · i + s), where s is the index of σ in the vocabulary.
[10] This predicate is included in the logic for technical reasons; see Barrington et al. (1990).
[11] One can also take multiple sub-sentences within ϕ to be labeled as ordered outputs, thus allowing ϕ to be a function from Σ^∗ to {0,1}^k for some fixed constant k.