A Logic for Expressing Log-Precision Transformers
William Merrill
New York University
willm@nyu.edu
Ashish Sabharwal
Allen Institute for AI
ashishs@allenai.org
Abstract
One way to interpret the reasoning power of transformer-based language models
is to describe the types of logical rules they can resolve over some input text.
Recently, Chiang et al. (2023) showed that finite-precision transformer classifiers
can be equivalently expressed in a generalization of first-order logic. However,
finite-precision transformers are a weak transformer variant because, as we show,
a single head can only attend to a constant number of tokens and, in particular,
cannot represent uniform attention. Since attending broadly is a core capability for
transformers, we ask whether a minimally more expressive model that can attend
universally can also be characterized in logic. To this end, we analyze transformers
whose forward pass is computed in log n precision on contexts of length n. We prove any log-precision transformer classifier can be equivalently expressed as a first-order logic sentence that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. This is the tightest known upper bound and first logical characterization of log-precision transformers.
Any log-precision transformer can be re-expressed as a sentence in FO(M) logic, e.g.:

    Mi. a(i) ∧ Mj. b(j) ∧ ¬∃k,ℓ. (a(k) ∧ b(ℓ) ∧ ℓ < k)

(m a's followed by m b's, i.e., a^m b^m). Example strings from the figure: aaaabbbb (accepted), aaabbbbb and baaaabbb (rejected).

Figure 1: A first-order logic with majority (FO(M)) sentence for a^m b^m. In addition to standard ∀ and ∃ quantifiers over string indices, FO(M) allows majority quantifiers (M) that take a majority vote across indices. a(i) indicates whether token i is a (and analogously for b). We prove FO(M) can express any function computed by a log-precision transformer.
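To make the figure concrete, here is a small hand-written check (our illustration, not part of the paper) that evaluates Figure 1's sentence directly, using the quantifier semantics formalized later in Section 2.2: M is a majority vote over positions, and ∃ ranges over positions 1 to n.

```python
# Hand-coded evaluation of Figure 1's FO(M) sentence (illustrative; 1-indexed positions).
def a(w, i): return w[i - 1] == "a"
def b(w, i): return w[i - 1] == "b"

def figure1(w):
    n = len(w)
    pos = range(1, n + 1)
    maj_a = 2 * sum(a(w, i) for i in pos) >= n            # Mi. a(i)
    maj_b = 2 * sum(b(w, j) for j in pos) >= n            # Mj. b(j)
    ordered = not any(a(w, k) and b(w, l) and l < k       # ¬∃k,ℓ. (a(k) ∧ b(ℓ) ∧ ℓ < k)
                      for k in pos for l in pos)
    return maj_a and maj_b and ordered

for w in ["aaaabbbb", "aaabbbbb", "baaaabbb"]:
    print(w, figure1(w))   # True, False, False
```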
1 Introduction
The incredible success of deep learning models, especially very large language and vision transformers
with hundreds of billions of parameters (Brown et al., 2020; Thoppilan et al., 2022), has come at the
cost of increasingly limited understanding of how these models actually work and when they might
fail. This raises many concerns, such as around their safe deployment, fairness, and accountability.
Does the inner working of a transformer defy description in a simpler symbolic system that we
can better understand? Or can transformer computation be described using a familiar symbolic
formalism? Understanding how to view the reasoning process of a transformer in terms of logic could
potentially expand our ability to formally reason about their behavior over large domains of inputs.
Chiang et al. (2023) provide a partial answer to this question, showing that any finite-precision
transformer classifier can be expressed as a sentence in a variant of first-order logic with counting
quantifiers and modular arithmetic over input position indices. Specifically, counting quantifiers take the form ∃^{=x} i : ϕ(i), where x is a count variable and i is a position index. They show that there exists
a single sentence in this logic that computes the output of the transformer for any input string of any
length. This is a powerful result because it shows that a simple logical formalism is fully sufficient to
describe all the complexity of a massive finite-precision transformer. It also provides an upper bound
on finite-precision transformers: any function that cannot be defined in first-order counting logic with
modular indexing cannot be expressed by the transformer.
However, Chiang et al.’s result is not fully general because it relies on the transformer precision being
fixed with respect to the transformer’s context length. More generally, as we will demonstrate in
Section 3, finite-precision transformers are a fundamentally weak variant of transformers: crucially, they cannot express uniform attention patterns, which are a core algorithmic primitive of transformers (Weiss et al., 2021). In fact, we show that they can only attend to a constant number of input positions, which may be seen as a rather limited generalization of hard attention.[1] For example, Chiang et al. show that their logic for finite-precision transformers cannot recognize a^m b^m, whereas in practice, transformers can (Bhattamishra et al., 2020).[2] This motivates studying a formal model of transformers where precision grows with context length (which we formalize as log-precision), making it possible to capture uniform attention as well as other broad attention patterns. This is useful both for recognizing a^m b^m and more generally for reasoning globally over the input.
We demonstrate that log-precision transformer classifiers can also be expressed as sentences in a simple logic: first-order logic with majority, or FO(M), over input strings (Barrington et al., 1990). In addition to standard existential and universal quantifiers, FO(M) has majority quantifiers that return true iff at least half the propositions they quantify are true. It also allows comparing input positions (e.g., ℓ < k in Figure 1) and accessing their individual bits. Our main result is as follows:

Theorem 1 (Informal version of Theorem 2). For any log-precision transformer T, there exists an FO(M) sentence ϕ that computes the same function as T, i.e., ϕ(x) = T(x) for any input string x.
Upper bound. Theorem 2 shows transformers with more than finite precision can also be expressed
in a simple extension of first-order logic, going beyond Chiang et al. (2023)’s result. On the other
hand, FO(M) is a strict superset of Chiang et al.'s counting logic; it can simulate counting quantifiers
(see Section 2.2) and allows non-modular position comparisons. Thus, handling a more general class
of transformers powerful enough to express uniform attention slightly weakens the bound.
Still, our result constitutes (to our knowledge) the tightest upper bound on log-precision transformers
and the first defined in terms of logic, building on a line of complexity-theoretic work analyzing the
power of transformers (Hahn, 2020; Merrill et al., 2022; Liu et al., 2023; Merrill & Sabharwal, 2023).
In particular, FO(M) strengthens the upper bound of log-space-uniform TC^0 by Merrill & Sabharwal (2023). The refined bound adds to the limitations of transformers identified by Merrill & Sabharwal (2023): for example, it establishes unconditionally that log-precision transformers cannot compute boolean matrix permanents, and shows that, in a certain formal sense, integer division and matching parentheses are among the formally hardest problems that transformers can solve (see Section 4).[3]
Mechanistic interpretability. Beyond providing an upper bound on the reasoning problems solvable by transformers, we believe Theorem 1 could guide the design of "transformer-complete" programming languages similar in spirit to RASP (Weiss et al., 2021). RASP is a declarative programming language designed to capture transformer computation, and Lindner et al. (2023) implement a compiler from RASP into transformers. Unlike RASP, FO(M) can provably express any transformer (Theorem 1), which we believe justifies using it (or an equivalent but more user-friendly variant) as a target language for programs extracted from transformers.
Similar to a decision tree, an FO(M) sentence has the interpretable property that each sub-sentence corresponds to a constraint on the input (see Figure 1). In contrast, the internal modules of a transformer or circuit do not satisfy this since they map between arbitrary latent spaces. We speculate this property
could facilitate interpreting models by translating them to FO(M), though a careful exploration of the algorithmic and HCI aspects of this idea lies outside the current paper's theoretical scope.

[1] Hard attention is provably substantially weaker than general attention (Hao et al., 2022; Merrill et al., 2022).
[2] Technically, the empirical results of Bhattamishra et al. (2020) are for a^m b^m c^m, a harder variant of a^m b^m.
[3] To be clear, Theorem 1 is one-sided: every transformer can be expressed as an FO(M) sentence, but not necessarily the other way. Moreover, we believe that many FO(M) sentences cannot be expressed by transformers. An exact logical characterization of transformers remains an open problem.
Contributions. Our results shed new light on how to view the computation inside transformers in
terms of logic. Specifically, our main contributions are to prove the following:
1. Fixed-precision transformers can only attend to a fixed number of tokens, and those with precision less than log log n cannot uniformly attend over length-n contexts (Proposition 1).
2. Log-precision transformer classifiers can be expressed as sentences in FO(M) (Theorem 2).
2 Preliminaries: Transformers and FO(M)
Let Σ be a finite alphabet. We denote by ∗ the Kleene star operator, i.e., for a set X, X^∗ = ⋃_{n=0}^∞ X^n. We will view transformers and FO(M) sentences both as functions from Σ^∗ → {0,1}, and show that any function a transformer computes can also be computed by an FO(M) sentence.
2.1 Transformers
We view the transformer precision p as a function of the context length n, writing p(n) where appropriate. Let D_p be the datatype of p-precision floats, i.e., tuples ⟨m, e⟩ where m, e are signed integers together taking p bits. Using |x| to mean the size of integer x, a float represents the value m · 2^{e−|m|+1}.[4] Following Appendix A of Merrill & Sabharwal (2023), we define p-truncated addition (+, Σ), multiplication (·), and division (/) over D_p. We now define a transformer encoder binary classifier over D_p, largely adopting Merrill & Sabharwal's notation.[5]
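As an unofficial illustration of this datatype, the sketch below implements the ⟨m, e⟩ representation and Footnote 4's example value; the actual truncated operations are those of Appendix A of Merrill & Sabharwal (2023) and are not reproduced here.

```python
# Illustrative sketch (not the paper's Appendix-A definitions) of the D_p datatype:
# a p-precision float is a pair <m, e> of signed integers taking p bits in total,
# representing the value m * 2^(e - |m| + 1), where |m| is the bit size of m.

def bitsize(x: int) -> int:
    """The paper's |x|: number of bits in the magnitude of x."""
    return abs(x).bit_length()

def value(m: int, e: int) -> float:
    return m * 2.0 ** (e - bitsize(m) + 1)

# Footnote 4's example: <101_2, 010_2> = <5, 2> represents 1.01_2 x 2^(10_2) = 5.0.
assert value(5, 2) == 5.0

# With m and e together limited to p bits, only finitely many values are representable;
# p-truncated +, Σ, ·, and / round exact results back into this set.
```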
Definition 1. A p-precision transformer T with h heads, d layers, model dimension m (divisible by h), and feedforward width w is specified by:
1. An embedding function ϕ : Σ × N → D_p^m whose form is defined in Appendix C.1;[6]
2. For each 1 ≤ ℓ ≤ d and 1 ≤ k ≤ h, a head similarity function s_k^ℓ : D_p^m × D_p^m → D_p whose form is defined in Appendix C.2;
3. For each 1 ≤ ℓ ≤ d and 1 ≤ k ≤ h, a head value function v_k^ℓ : D_p^m → D_p^{m/h} whose form is defined in Appendix C.2;
4. For each 1 ≤ ℓ ≤ d, an activation function f^ℓ : (D_p^{m/h})^h × D_p^m → D_p^m whose form is defined in Appendix C.3 and implicitly uses the feedforward dimension w;
5. An output classifier head κ : D_p^m → {0,1} whose form is defined in Appendix C.4.
Definition 2. We define the transformer computation and output as a function of an input x ∈ Σ^n.
1. Embeddings: For 1 ≤ i ≤ n, h_i^0 = ϕ(x_i, i).[6]
2. Self Attention: For 0 ≤ ℓ ≤ d − 1, (multihead) self-attention block ℓ + 1 computes h attention heads:
$$a^{\ell+1}_{i,k} = \sum_{j=1}^{n} \frac{s^{\ell+1}_k(h^\ell_i, h^\ell_j)}{Z_{i,k}} \cdot v^{\ell+1}_k(h^\ell_j), \quad \text{where } Z_{i,k} = \sum_{j=1}^{n} s^{\ell+1}_k(h^\ell_i, h^\ell_j).$$
3. Activation Block: For 0 ≤ ℓ ≤ d − 1, activation block ℓ + 1 aggregates the head outputs to produce h^{ℓ+1}:
$$h^{\ell+1}_i = f^{\ell+1}(a^{\ell+1}_{i,1}, \ldots, a^{\ell+1}_{i,h}, h^\ell_i).$$
4. Classifier Head: The network prediction on x ∈ Σ^n is κ(h_n^d).
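The sketch below (ours, not the paper's) mirrors Definition 2 step by step, using ordinary floating-point arithmetic in place of D_p and treating the core functions ϕ, s, v, f, κ as caller-supplied Python callables, so Appendix C's constraints on their form are not modeled.

```python
# Illustrative forward pass following Definition 2 (exact floats instead of D_p;
# phi, s, v, f, kappa are caller-supplied stand-ins for the core functions).
import numpy as np

def forward(x, phi, s, v, f, kappa, d, h):
    """x: token sequence; s[l][k], v[l][k]: similarity/value functions of head k in block l+1;
    f[l]: activation function of block l+1; kappa: output classifier head."""
    n = len(x)
    H = [phi(x[i - 1], i) for i in range(1, n + 1)]                 # step 1: h_i^0
    for l in range(d):                                              # blocks l+1 = 1..d
        heads = []
        for k in range(h):                                          # step 2: self attention
            S = np.array([[s[l][k](H[i], H[j]) for j in range(n)] for i in range(n)])
            Z = S.sum(axis=1, keepdims=True)                        # Z_{i,k}
            V = np.stack([v[l][k](H[j]) for j in range(n)])         # v_k^{l+1}(h_j^l)
            heads.append((S / Z) @ V)                               # rows are a_{i,k}^{l+1}
        H = [f[l](*[A[i] for A in heads], H[i]) for i in range(n)]  # step 3: activation
    return kappa(H[n - 1])                                          # step 4: kappa(h_n^d)
```

Uniform attention corresponds to the special case where the similarity function is constant, making each a_{i,k}^{ℓ+1} a plain average of the value vectors over all positions; Section 3 shows this is exactly the pattern finite-precision transformers cannot realize for large n.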
[4] ⟨101, 010⟩ represents 1.01_2 × 2^{10_2}. This is closer to the IEEE standard than the m · 2^e semantics used in Merrill & Sabharwal (2023), letting us define the minimum representable float more realistically in Proposition 1.
[5] Increasing the classifier's output space arity (e.g., a transformer that predicts the next token) or switching to causal attention of a decoder-only model would not change our results. However, our proof no longer goes through if the decoder can generate tokens that get added to the input at the next step (cf. Pérez et al., 2019).
[6] ϕ, like p, is actually a function of the context length n, and Appendix C.1 enforces that ϕ is computable in O(log n) time, as standard choices of positional embeddings would satisfy.
We say T(x) = κ(h_{|x|}^d) and L_T is the language of x ∈ Σ^∗ such that T(x) = 1. We refer to ϕ, s_k^ℓ, v_k^ℓ, f^ℓ, and κ as the core functions in T, and to embeddings, self attention, activation, and the classifier head as the components of T. We write θ_T for the concatenated vector of parameters for the functions ϕ, s_k^ℓ, v_k^ℓ, f^ℓ, and κ, for all 1 ≤ ℓ ≤ d and 1 ≤ k ≤ h.
We define a log-precision transformer as one where p is at most O(log n) and is a "simple" function, i.e., computable in O(log n) time. In our model, the weights θ_T defining T are fixed, but the precision p used to compute the forward pass can depend on n (see Footnote 13 for a generalization).
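As a rough, unofficial sanity check tying this to Proposition 1, the snippet below uses the ⟨m, e⟩ semantics above to estimate how many bits the uniform attention weight 1/n requires: its exponent is about −log₂ n, so roughly log₂ log₂ n bits are needed, which a fixed precision eventually cannot supply while p(n) = O(log n) always can.

```python
# Estimate (under the <m, e> float semantics above) of the bits needed to represent 1/n.
import math

def bits_for_inverse(n: int) -> int:
    e = -math.floor(math.log2(n))        # 1/n is roughly value(1, e) with m = 1
    return 1 + 1 + abs(e).bit_length()   # 1 mantissa bit + sign + bits of the exponent

for n in [16, 1024, 2**20, 2**64]:
    print(n, bits_for_inverse(n), math.ceil(math.log2(n)))
# The bit requirement grows like log log n, while log precision p(n) = O(log n) keeps pace.
```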
2.2 First-Order Logic with Majority
As we will show, transformers can be translated into sentences in FO(M). But what do such sentences look like? Informally, FO(M) is first-order logic extended to also have majority (M) quantifiers. Following Barrington et al. (1990), our sense of FO(M) takes strings in Σ^∗ as input and returns 0 or 1 to define a formal language. In this setting, quantifiers range over indices (positions) into the string. Predicates can be applied to the variables introduced by these quantifiers.
Definition 3 (FO(M) index). Indices in FO(M) are integers denoting positions in the input string:
1. The constant 1, representing the first token's position.
2. The constant n, representing the last token's position.
3. Strings (e.g., i, j, k) representing variables ranging over positions 1 to n.
4. Any index built by applying addition or subtraction to other indices.[7]
Definition 4 (FO(M) formula). Formulas in FO(M) are constructed as follows:[8]
1. Let Σ be a finite alphabet. For each σ ∈ Σ and any index i, σ(i), e.g., a(i), is a formula that is true if the i-th input token is σ.[9]
2. For any indices i, j, the formula bit(i, j) returns the j-th bit of the binary expansion of i.[10]
3. For two indices i, j, i = j, i ≤ j, and i ≥ j are formulas with their conventional semantics.
4. For two formulas ϕ, ψ, ϕ ∧ ψ and ϕ ∨ ψ are formulas with their conventional semantics.
5. For any formula ϕ (which may refer to i), the following are valid formulas:
(a) ∃i. ϕ means some value of i in [1, n] makes ϕ true.
(b) ∀i. ϕ means all values of i in [1, n] make ϕ true.
(c) Mi. ϕ means at least n/2 values of i in [1, n] make ϕ true.
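To make Definition 4's semantics concrete, here is a small interpreter (our illustrative sketch, not part of the paper) that represents formulas as Python closures over the input string and a variable assignment; the bit predicate uses one possible indexing convention, and Figure 1's sentence is rebuilt from the combinators as a check.

```python
# Illustrative FO(M) interpreter: a formula is a function (w, env) -> bool, where w is
# the input string and env maps variable names to 1-based positions.
def tok(sigma, i):   # sigma(i): the i-th token is sigma
    return lambda w, env: w[env[i] - 1] == sigma

def bit(i, j):       # bit(i, j): j-th bit of index i (one possible convention: j-th LSB)
    return lambda w, env: ((env[i] >> (env[j] - 1)) & 1) == 1

def lt(i, j):   return lambda w, env: env[i] < env[j]
def land(p, q): return lambda w, env: p(w, env) and q(w, env)
def lnot(p):    return lambda w, env: not p(w, env)

def exists(i, p):   # ∃i. p
    return lambda w, env: any(p(w, {**env, i: v}) for v in range(1, len(w) + 1))

def forall(i, p):   # ∀i. p
    return lambda w, env: all(p(w, {**env, i: v}) for v in range(1, len(w) + 1))

def major(i, p):    # Mi. p: at least half of the positions 1..n satisfy p
    return lambda w, env: 2 * sum(p(w, {**env, i: v})
                                  for v in range(1, len(w) + 1)) >= len(w)

# Figure 1's sentence, assembled from these combinators:
fig1 = land(land(major("i", tok("a", "i")), major("j", tok("b", "j"))),
            lnot(exists("k", exists("l", land(land(tok("a", "k"), tok("b", "l")),
                                              lt("l", "k"))))))
print(fig1("aaabbb", {}), fig1("abab", {}))   # True False
```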
We use parentheses where necessary to disambiguate the order of operations. General formulas may contain free (i.e., unbound) variables: e.g., ∀i. i = j. A sentence is an FO(M) formula ϕ with no free variables. Sentences represent functions from Σ^∗ to {0,1} and thus define a formal language.[11]
Extensions. Beyond Definition 4, FO(M) can express counting and threshold quantifiers in terms of majority quantifiers (Barrington et al., 1990). Given a formula ϕ, a counting quantifier creates a new formula ∃^k i : ϕ that is true iff ϕ is true across exactly k values of i. Threshold quantifiers ∃^{≥k} and ∃^{≤k} work similarly but check if ϕ is true for at least or at most k values of i. In addition, we show in Appendix A that FO(M) can express conditional majority quantifiers, which create a formula Mi : ϕ [ψ] that is true iff ψ is true for at least half the values of i that make ϕ true.
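Appendix A and Barrington et al. (1990) give the actual constructions; the snippet below (ours) only evaluates the derived quantifiers' semantics directly by counting, in the same closure style as the interpreter above, rather than reducing them to M.

```python
# Direct-counting evaluators for the derived quantifiers (semantics only, not the
# reduction to majority quantifiers):
def count_exact(k, i, p):   # ∃^k i : p -- exactly k positions satisfy p
    return lambda w, env: sum(p(w, {**env, i: v}) for v in range(1, len(w) + 1)) == k

def at_least(k, i, p):      # ∃^{≥k} i : p
    return lambda w, env: sum(p(w, {**env, i: v}) for v in range(1, len(w) + 1)) >= k

def cond_major(i, p, q):    # Mi : p [q] -- q holds for at least half the i satisfying p
    def ev(w, env):
        sat = [v for v in range(1, len(w) + 1) if p(w, {**env, i: v})]
        return 2 * sum(q(w, {**env, i: v}) for v in sat) >= len(sat)
    return ev

# E.g., "exactly two a's": count_exact(2, "i", tok("a", "i"))("aab", {}) -> True
```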
2.2.1 Examples
To illustrate the formalism, we provide example languages definable in FO(M) with Σ = {a, b}. First, we show two languages that do not require majority quantifiers to express:
Example 1 (Bigram matching). Strings containing the bigram ab: ∃i [a(i) ∧ b(i + 1)].
Example 2 (Skip-bigram matching). Strings containing the long-distance pattern a . . . b (cf. "induction heads" of Elhage et al. 2021): ∃i [b(i) ∧ ∃j [j ≤ i ∧ a(j)]].
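Both examples can be checked mechanically; the short script below (our illustration) evaluates them directly, with the same 1-based index semantics as in Definition 3.

```python
# Direct evaluations of Examples 1 and 2 (1-based positions).
def example1(w):   # ∃i [a(i) ∧ b(i+1)]: contains the bigram "ab"
    n = len(w)
    return any(w[i - 1] == "a" and i + 1 <= n and w[i] == "b" for i in range(1, n + 1))

def example2(w):   # ∃i [b(i) ∧ ∃j [j ≤ i ∧ a(j)]]: some a occurs before some b
    n = len(w)
    return any(w[i - 1] == "b" and any(w[j - 1] == "a" for j in range(1, i + 1))
               for i in range(1, n + 1))

print(example1("bba"), example1("aab"))    # False True
print(example2("bbaa"), example2("abba"))  # False True
```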
[7] Barrington et al. (1990) did not introduce this as a primitive, but it can be simulated using the ≤ predicate.
[8] We write parentheses to indicate the order of operations.
[9] Barrington et al. (1990) define Q_b(i) for b ∈ {0,1}. We generalize this to an arbitrary vocabulary Σ by assuming each token is one-hot-encoded: σ(i) = Q_1(|Σ| · i + s), where s is the index of σ in the vocabulary.
[10] This predicate is included in the logic for technical reasons; see Barrington et al. (1990).
[11] One can also take multiple sub-sentences within ϕ to be labeled as ordered outputs, thus allowing ϕ to be a function from Σ^∗ to {0,1}^k for some fixed constant k.