
Idiom                 | Matched phrase              | Syntactic pattern                         | Log frequency
Devil’s advocate      | Baker’s town                | JJ/dep/2 NN/pobj/0                        | 2.398
Act of darkness       | Abandonment of institution  | NN/dobj/0 IN/prep/1 NN/pobj/2             | 4.304
School of hard knocks | Field of social studies     | NN/pobj/0 IN/prep/1 JJ/amod/4 NNS/pobj/2  | 6.690

Table 1: Examples of idioms with their matched phrases, selected based on having the same syntactic pattern and
most similar log frequency in the Syntactic Ngrams dataset. Examples depicted here have the same log frequency.
Note that the frequency is based on the most common dependency and constituency pattern found in Syntactic
Ngrams. Humans were asked to rate each phrase for its compositionality.
remaining 10% were divided into a test set (5%)
and dev set (5%).5
To fairly compare probes, we used minimum description length probing (Voita and Titov, 2020). This approximates the length of the online code needed to transmit both the model and the data, which is related to the area under the learning curve. Specifically, we recorded the average cosine similarity between the predicted vector and the actual vector on the test set while varying the size of the training set from 0.005% to 100% of the original.6 We compare the AUC of each probe under these conditions to select the most parsimonious approximation for each model.
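As a rough illustration, the following sketch (not the authors' code) shows how such a learning-curve comparison could be computed: a fresh probe is trained at each data fraction, test-set cosine similarity is recorded, and probes are ranked by the area under the curve. The `probe_factory`, `fit`, and `predict` names, the array inputs, and the choice of a trapezoidal AUC over log-spaced fractions are illustrative assumptions.

```python
import numpy as np

# Milestones from 0.005% to 100% of the training data (see footnote 6).
FRACTIONS = [0.00005, 0.0001, 0.001, 0.005, 0.01, 0.10, 1.0]

def avg_cosine(pred, gold):
    """Mean cosine similarity between corresponding rows of two (n, d) arrays."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gold = gold / np.linalg.norm(gold, axis=1, keepdims=True)
    return float(np.mean(np.sum(pred * gold, axis=1)))

def learning_curve_auc(probe_factory, train_data, test_inputs, test_gold):
    """Train a fresh probe on each data fraction and return the area under
    the resulting test-set cosine-similarity curve (trapezoidal rule over
    the log-spaced fractions)."""
    scores = []
    for frac in FRACTIONS:
        n = max(1, int(frac * len(train_data)))
        probe = probe_factory()            # same seed/initialization each time
        probe.fit(train_data[:n])          # hypothetical training interface
        scores.append(avg_cosine(probe.predict(test_inputs), test_gold))
    return float(np.trapz(scores, x=np.log10(FRACTIONS)))
```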
4.2 Results
We find that affine probes are best able to capture the composition of phrase embeddings from their left and right subphrases. Figure 2 depicts probe performance at approximating representations across models and representation types. However, we note that scores for most models are very high, due to the anisotropy phenomenon: the tendency for most embeddings from pretrained language models to be clustered in a narrow cone, rather than distributed evenly in all directions (Li et al., 2020; Ethayarajh, 2019). We note that this holds for both word and phrase embeddings.
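For intuition, anisotropy is often quantified (in the spirit of Ethayarajh, 2019) as the expected cosine similarity between randomly paired embeddings. The sketch below is an illustrative implementation of that estimate, not code from this work; the function name and sampling scheme are assumptions.

```python
import numpy as np

def anisotropy_estimate(embeddings, n_pairs=10_000, seed=0):
    """Average cosine similarity between randomly paired embeddings.
    Roughly isotropic vectors give values near 0; embeddings clustered
    in a narrow cone give values well above 0."""
    rng = np.random.default_rng(seed)
    idx_a = rng.integers(0, len(embeddings), size=n_pairs)
    idx_b = rng.integers(0, len(embeddings), size=n_pairs)
    a = embeddings[idx_a] / np.linalg.norm(embeddings[idx_a], axis=1, keepdims=True)
    b = embeddings[idx_b] / np.linalg.norm(embeddings[idx_b], axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))
```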
5 The learned probes were trained with early stopping on the dev set with a patience of 2 epochs, up to a maximum of 20 epochs. The Adam optimizer was used, with a batch size of 512 and a learning rate of 0.512.

6 We look at milestones of 0.005%, 0.01%, 0.1%, 0.5%, 1%, 10% and 100% specifically. This was because initial experimentation showed that probes tended to converge at or before 10% of the training data. Models were trained separately (with the same seed and initialization) for each percentage of the training data, and trained until convergence for each data percentage condition.

Since we are comparing the probes to each other relative to the same anisotropic vectors, this is not necessarily a problem.
However, in order to compare each probe’s performance to chance, we correct for anisotropy using a control task. In this task, the trained probe makes its usual prediction, but we record the distance between the compositional probe’s prediction and a random phrase embedding drawn from the set of treebank phrase embeddings for that model. This allows us to calculate an error ratio dist_probe / dist_control, where dist_probe represents the original average distance from the true representation, and dist_control is the average distance on the control task. This quantifies how much the probe improves over a random baseline that takes anisotropy into account; a smaller value is better. These results can be found in Appendix E, and the results without anisotropy correction can be found in Appendix G. In most cases, the affine probe still performs best, so we continue to use it for consistency across all model and representation types.
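A minimal sketch of this correction is given below, assuming precomputed arrays of probe predictions, gold phrase embeddings, and the pool of treebank phrase embeddings for the model; all names are placeholders, and the use of cosine distance mirrors the cosine-based probe metric but is an assumption.

```python
import numpy as np

def _cosine_distance(a, b):
    """Row-wise cosine distance (1 - cosine similarity) between (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=1)

def error_ratio(predictions, gold, phrase_pool, seed=0):
    """dist_probe / dist_control; smaller is better.

    dist_probe:   average distance between probe predictions and gold vectors.
    dist_control: average distance between the same predictions and randomly
                  drawn phrase embeddings from the treebank pool."""
    rng = np.random.default_rng(seed)
    dist_probe = _cosine_distance(predictions, gold).mean()
    random_idx = rng.integers(0, len(phrase_pool), size=len(predictions))
    dist_control = _cosine_distance(predictions, phrase_pool[random_idx]).mean()
    return float(dist_probe / dist_control)
```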
We also compare the AUC of training curves for each probe and find that the affine probe remains the best in most cases, except for RoBERTa_CLS and DeBERTa_CLS. Training curves are depicted in Appendix C; AUC values are listed in Appendix H.
Interestingly, there was a trend of the right child
being weighted more heavily than the left child,
and each model/representation type combination
had its own characteristic ratio of the left child to
the right child. For instance, in BERT, the weight
on the left child was 12, whereas it was 20 for the
right child.
For example, the approximation for the phrase "green eggs and ham" with BERT [CLS] embeddings would be:

r_CLS("green eggs and ham") = 12 · r_CLS("green eggs") + 20 · r_CLS("and ham") + β.
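As an illustration only, this scalar-weighted form could be applied as in the sketch below; the helper name, the zero default bias, and the NumPy inputs are assumptions, and in the actual probe the weights and the bias β are learned parameters.

```python
import numpy as np

def affine_approximation(r_left, r_right, w_left=12.0, w_right=20.0, beta=None):
    """Scalar-weighted affine approximation of a parent phrase embedding:
        r(parent) ≈ w_left * r(left) + w_right * r(right) + beta
    The default weights are the BERT [CLS] values quoted above; beta stands
    in for the probe's learned bias vector."""
    if beta is None:
        beta = np.zeros_like(r_left)   # placeholder; learned in practice
    return w_left * r_left + w_right * r_right + beta

# e.g. approximating r_CLS("green eggs and ham") from its two subphrase vectors:
# approx = affine_approximation(r_cls_green_eggs, r_cls_and_ham, beta=learned_bias)
```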