Are Representations Built from the Ground Up?
An Empirical Examination of Local Composition in Language Models
Emmy Liu and Graham Neubig
Language Technologies Institute
Carnegie Mellon University
{mengyan3, gneubig}@cs.cmu.edu
Abstract

Compositionality, the phenomenon where the meaning of a phrase can be derived from its constituent parts, is a hallmark of human language. At the same time, many phrases are non-compositional, carrying a meaning beyond that of each part in isolation. Representing both of these types of phrases is critical for language understanding, but it is an open question whether modern language models (LMs) learn to do so; in this work we examine this question. We first formulate a problem of predicting the LM-internal representations of longer phrases given those of their constituents. We find that the representation of a parent phrase can be predicted with some accuracy given an affine transformation of its children. While we would expect the predictive accuracy to correlate with human judgments of semantic compositionality, we find this is largely not the case, indicating that LMs may not accurately distinguish between compositional and non-compositional phrases. We perform a variety of analyses, shedding light on when different varieties of LMs do and do not generate compositional representations, and discuss implications for future modeling work.1
1 Introduction
Compositionality is argued to be a hallmark of linguistic generalization (Szabó, 2020). However, some phrases are non-compositional, and cannot be reconstructed from individual constituents (Dankers et al., 2022a). Intuitively, a phrase like "I own cats and dogs" is locally compositional, whereas "It's raining cats and dogs" is not. Therefore, any representation of language must be easily composable, but it must also correctly handle cases that deviate from compositional rules.
Both lack (Hupkes et al., 2020; Lake and Baroni, 2017) and excess (Dankers et al., 2022b) of compositionality have been cited as common sources of errors in NLP models, indicating that models may handle phrase composition in an unexpected way.

1 Code and data available at https://github.com/nightingal3/lm-compositionality

Figure 1: An illustration of the local composition prediction problem with [CLS] representations. The representation X of the parent phrase "[CLS] the dog sits on the sofa [SEP]" is predicted from the representations A and B of its children, "[CLS] the dog [SEP]" and "[CLS] sits on the sofa [SEP]".
In general form, the compositionality principle is simply "the meaning of an expression is a function of the meanings of its parts and of the way they are syntactically combined" (Pelletier, 1994). However, this definition is underspecified (Partee, 1984). Recent efforts to evaluate the compositional abilities of neural networks have resulted in several testable definitions of compositionality (Hupkes et al., 2020).
Previous work on compositionality in natural language focuses largely on the definition of substitutivity, examining how changes to the constituents of a complex phrase change its representation (Dankers et al., 2022a; Garcia et al., 2021; Yu and Ettinger, 2020). The definition we examine is localism: whether or not the representation of a complex phrase is derivable only from its local structure and the representations of its immediate "children" (Hupkes et al., 2020). A similar concept has been proposed separately to measure the compositionality of learned representations, which we use in this work (Andreas, 2019). We focus on localism because it is a more direct definition and does not rely on the collection of contrastive pairs of phrases. This allows us to examine a wider range of phrases of different types and lengths.
In this paper, we ask whether reasonable compositional probes can predict an LM's representation of a phrase from its children in a syntax tree, and if so, which kinds of phrases are more or less compositional. We also ask whether this corresponds to human judgments of compositionality.
We first establish a method to examine local compositionality of phrases through probes that try to predict the representation of a parent given its children (Section 2). We create two English-language datasets upon which to experiment: a large-scale dataset of 823K phrases mined from the Penn Treebank, and a new dataset of idioms and paired non-idiomatic phrases for which we elicit human compositionality judgments, which we call the Compositionality of Human-annotated Idiomatic Phrases (CHIP) dataset (Section 3).
Across multiple models, representation types, and phrase types, we find that phrase embeddings have a fairly predictable affine compositional structure based on the embeddings of their constituents (Section 4). We find significant differences in compositionality across phrase types, and analyze these trends in detail, contributing to an understanding of how LMs represent phrases (Section 5). Interestingly, we find that human judgments do not generally align well with the compositionality level of model representations (Section 6). This implies there is still work to be done at the language modeling level to capture a proper level of compositionality in representations.
2 Methods and Experimental Details
2.1 Tree Reconstruction Error
We follow Andreas (2019) in defining deviance from compositionality as tree reconstruction error. Consider a phrase x = [a][b], where a and b can be any length > 0. Assume we always have some way of knowing how x should be divided into a and b. Assume we also have some way of producing representations for x, a, and b, which we represent as a function r. Given representations r(x), r(a) and r(b), we wish to find the function which most closely approximates how r(x) is constructed from r(a) and r(b):

$$\hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \delta_{x,ab} \tag{1}$$

$$\delta_{x,ab} = d(r(x), f(r(a), r(b))) \tag{2}$$

where $\mathcal{X}$ is the set of possible phrases in the language that can be decomposed into two parts, $\mathcal{F}$ is the set of functions under consideration, and $d$ is a distance function. An example scenario is depicted in Figure 1.
For d, we use cosine distance as this is the most common function used to compare semantic vectors. The division of x into a and b is specified by syntactic structure (Chomsky, 1959). Namely, we use a phrase's annotated constituency structure and convert its constituency tree to a binary tree with the right-factored Chomsky Normal Form conversion included in NLTK (Bird and Loper, 2004).
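As a concrete illustration, the per-phrase reconstruction error of Equations (1)-(2) with cosine distance can be computed as in the following sketch; the random vectors and the additive composition function below are placeholders, not the trained probes described later:

```python
import numpy as np

def cosine_distance(u, v):
    """d(u, v) = 1 - cos(u, v): the distance used to compare representations."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def reconstruction_error(r_x, r_a, r_b, f):
    """delta_{x,ab} = d(r(x), f(r(a), r(b))), as in Equation (2)."""
    return cosine_distance(r_x, f(r_a, r_b))

# Toy example: random 768-dimensional "embeddings" and additive composition.
rng = np.random.default_rng(0)
r_x, r_a, r_b = rng.normal(size=(3, 768))
print(reconstruction_error(r_x, r_a, r_b, f=lambda a, b: a + b))
```

Equation (1) then amounts to choosing, within a function family, the f that minimizes this error averaged over all decomposable phrases.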
2.2 Language Models
We study representations produced by a variety of widely used language models, specifically the base (uncased) variants of Transformer-based models: BERT, RoBERTa, DeBERTa, and GPT-2 (Devlin et al., 2019; Liu et al., 2019; He et al., 2021; Radford et al., 2019).
2.2.1 Representation extraction
Let $[x_0, \ldots, x_N]$ be a sequence of $N+1$ input tokens, where $x_0$ is the [CLS] token if applicable, and $x_N$ is the end token if applicable. Let $[h_0^{(i)}, \ldots, h_N^{(i)}]$ be the embeddings of the input tokens after the $i$-th layer.

For models with the [CLS] beginning-of-sequence token (BERT, RoBERTa, and DeBERTa), we extracted the embedding of the [CLS] token from the last layer, which we refer to as the CLS representation. For GPT-2, we extracted the last token, which serves a similar purpose. This corresponds to $h_0^{(12)}$ and $h_N^{(12)}$ respectively.

Alternately, we also averaged all embeddings from the last layer, including special tokens. We refer to this as the AVG representation:

$$\frac{1}{N+1} \sum_{i=0}^{N} h_i^{(12)} \tag{3}$$
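The paper's exact extraction code is not reproduced here, but a minimal sketch of obtaining both representations with the HuggingFace transformers library might look as follows; the model name and helper function are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def extract(phrase: str):
    """Return the (CLS, AVG) last-layer representations of a phrase."""
    inputs = tokenizer(phrase, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    cls_rep = hidden[0]       # h_0^{(12)}: embedding of the [CLS] token
    avg_rep = hidden.mean(0)  # mean over all tokens, incl. special tokens
    return cls_rep, avg_rep

cls_rep, avg_rep = extract("the dog sits on the sofa")
```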
2.3 Approximating a Composition Function
To use this definition, we need a composition function $\hat{f}$. We examine the choices detailed in this section.

For parameterized probes, we follow the probing literature in training several probes to predict a property of the phrase given a representation of the phrase. However, in this case, we are not predicting a categorical attribute such as part of speech. Instead, the probes that we use aim to predict the parent representation r(x) based on the child representations r(a) and r(b). We call this an approximative probe to distinguish it from the usual use of the word probe.
2.3.1 Arithmetic Probes
In the simplest probes, the phrase representation r(x) is computed by a single arithmetic operation on r(a) and r(b). We consider three arithmetic probes:2

$$\mathrm{ADD}(r(a), r(b)) = r(a) + r(b) \tag{4}$$
$$\mathrm{W1}(r(a), r(b)) = r(a) \tag{5}$$
$$\mathrm{W2}(r(a), r(b)) = r(b) \tag{6}$$
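These probes are parameter-free; as a sketch, they can be written directly as functions from child representations to a predicted parent representation:

```python
# The three arithmetic probes of Equations (4)-(6).
def ADD(ra, rb): return ra + rb  # Eq. (4)
def W1(ra, rb): return ra        # Eq. (5): left child only
def W2(ra, rb): return rb        # Eq. (6): right child only
```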
2.3.2 Learned Probes
We consider three types of learned probes. The linear probe expresses r(x) as a linear combination of r(a) and r(b). The affine probe adds a bias term. The MLP probe is a simple feedforward neural network with 3 layers, using the ReLU activation.

$$\mathrm{LIN}(r(a), r(b)) = \alpha_1 r(a) + \alpha_2 r(b) \tag{7}$$
$$\mathrm{AFF}(r(a), r(b)) = \alpha_1 r(a) + \alpha_2 r(b) + \beta \tag{8}$$
$$\mathrm{MLP}(r(a), r(b)) = W_3 h_2 \tag{9}$$

where $h_1 = \sigma(W_1 [r(a); r(b)])$ and $h_2 = \sigma(W_2 h_1)$, with $W_1$ of size $(300 \times 1536)$, $W_2$ of size $(768 \times 300)$, and $W_3$ of size $(768 \times 768)$. We do not claim that this is the best MLP possible, but use it as a simple architecture to contrast with the linear models.
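As a sketch of how such a probe could be implemented and trained, the PyTorch code below fits an affine probe by minimizing the mean cosine distance to the true parent representation, per Equations (1)-(2). Based on the per-child weights reported in Section 4.2, it assumes the α coefficients are scalars; all hyperparameters and names are illustrative:

```python
import torch
import torch.nn as nn

class AffineProbe(nn.Module):
    """AFF(r(a), r(b)) = alpha_1 * r(a) + alpha_2 * r(b) + beta (Eq. 8)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.alpha1 = nn.Parameter(torch.ones(1))
        self.alpha2 = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, ra, rb):
        return self.alpha1 * ra + self.alpha2 * rb + self.beta

probe = AffineProbe()
optimizer = torch.optim.Adam(probe.parameters())
cos = nn.CosineSimilarity(dim=-1)

def train_step(r_x, r_a, r_b):
    """One gradient step on the mean cosine distance of Eqs. (1)-(2)."""
    optimizer.zero_grad()
    loss = (1 - cos(probe(r_a, r_b), r_x)).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch of random "embeddings" in place of real model representations.
r_x, r_a, r_b = torch.randn(3, 32, 768)
print(train_step(r_x, r_a, r_b))
```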
3 Data and Compositionality Judgments
3.1 Treebank
To collect a large set of phrases with syntactic structure annotations, we collected all unique subphrases (≥ 2 words) from the WSJ and Brown sections of the Penn Treebank (v3) (Marcus et al., 1993).3 The final dataset consists of 823K phrases after excluding null values and duplicates. We collected the length of the left child in words, the length of the right child in words, and the tree's production rule, which we refer to as tree type. There were 50,260 tree types in total, but many of these are unique. Examples and the phrase length distribution can be found in Appendix A and Appendix B.

2 Initially, we considered the elementwise product $\mathrm{PROD}(r(a), r(b)) = r(a) \odot r(b)$, but found that it was an extremely poor approximation.

3 We converted the trees to Chomsky Normal Form with right-branching using NLTK (Bird and Loper, 2004). We note that not all subtrees are syntactically meaningful; however, we used this conversion to standardize the number of children and formatting. We exclude phrases with a null value for the left or right branch (Bies et al., 1995).
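A minimal sketch of this preprocessing step with NLTK is shown below; the example tree is illustrative, and the paper's exact extraction script may differ:

```python
from nltk.tree import Tree

t = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBZ sits)"
    " (PP (IN on) (NP (DT the) (NN sofa)))))")
t.chomsky_normal_form(factor="right")  # right-factored CNF, as in the paper

# Collect every binary subphrase with its production rule ("tree type")
# and the word lengths of its left and right children.
for sub in t.subtrees(lambda s: len(s) == 2):
    left, right = sub
    print({
        "phrase": " ".join(sub.leaves()),
        "tree_type": f"{sub.label()} -> {left.label()} {right.label()}",
        "len_left": len(left.leaves()),
        "len_right": len(right.leaves()),
    })
```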
3.2 English Idioms and Matched Phrase Set
Previous datasets center around notable bigrams, some of which are compositional and some of which are non-compositional (Ramisch et al., 2016b; Reddy et al., 2011). However, there is a positive correlation between bigram frequency and human compositionality scores in these datasets, which means that it is unclear whether models are capturing compositionality or merely frequency effects if they correlate well with the human scores. Because models are likely more sensitive to surface features of language than humans, we gathered a more controlled set of phrases to compare with human judgments.
Since non-compositional phrases are somewhat rare, we began with a set of seed idioms and bigrams from previous studies (Jhamtani et al., 2021; Ramisch et al., 2016b; Reddy et al., 2011). We used idioms because they are a common source of non-compositional phrases. Duplicates after lemmatization were removed.

For each idiom, we used Google Syntactic Ngrams to find three phrases with a part-of-speech and dependency structure identical to that idiom's, and with a frequency in Syntactic Ngrams as close as possible to the idiom's (Goldberg and Orwant, 2013).4 For example, the idiom "sail under false colors" was matched with "distribute among poor parishioners". More examples can be found in Table 1. An author of this paper inspected the idioms and removed those that were syntactically analyzed incorrectly or offensive.
4 Approximating a Composition Function
4.1 Methods
To approximate the composition functions of models, we extract the CLS and AVG representations from each model on the Treebank dataset. We used 10-fold cross-validation and trained the learned probes on the 90% training set in each fold. The remaining 10% were divided into a test set (5%) and a dev set (5%).5

4 The part-of-speech/dependency pattern for each idiom was taken to be the most common pattern for that phrase in the dataset.

5 The learned probes were trained with early stopping on the dev set with a patience of 2 epochs, up to a maximum of 20 epochs. The Adam optimizer was used, with a batch size of 512 and a learning rate of 0.512.
Idiom                  Matched phrase              Syntactic pattern                         Log frequency
Devil’s advocate       Baker’s town                JJ/dep/2 NN/pobj/0                        2.398
Act of darkness        Abandonment of institution  NN/dobj/0 IN/prep/1 NN/pobj/2             4.304
School of hard knocks  Field of social studies     NN/pobj/0 IN/prep/1 JJ/amod/4 NNS/pobj/2  6.690

Table 1: Examples of idioms with their matched phrases, selected based on having the same syntactic pattern and the most similar log frequency in the Syntactic Ngrams dataset. Examples depicted here have the same log frequency. Note that the frequency is based on the most common dependency and constituency pattern found in Syntactic Ngrams. Humans were asked to rate each phrase for its compositionality.
To fairly compare probes, we used minimum description length probing (Voita and Titov, 2020). This approximates the length of the online code needed to transmit both the model and data, which is related to the area under the learning curve. Specifically, we recorded the average cosine similarity of the predicted vector and the actual vector on the test set while varying the size of the training set from 0.005% to 100% of the original.6 We compare the AUC of each probe under these conditions to select the most parsimonious approximation for each model.

6 We look at milestones of 0.005%, 0.01%, 0.1%, 0.5%, 1%, 10% and 100% specifically. This was because initial experimentation showed that probes tended to converge at or before 10% of the training data. Models were trained separately (with the same seed and initialization) for each percentage of the training data, and trained until convergence for each data percentage condition.
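A sketch of the learning-curve AUC computation over these milestones is shown below; the similarity values are hypothetical, purely to illustrate the comparison:

```python
import numpy as np

# Training-set milestones from footnote 6, as fractions of the full set.
fractions = np.array([0.005, 0.01, 0.1, 0.5, 1, 10, 100]) / 100

def learning_curve_auc(similarities):
    """Trapezoidal area under the (training fraction, cosine similarity)
    curve; a higher AUC means a good approximation with less data."""
    x, y = fractions, np.asarray(similarities)
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2))

# Hypothetical curves: a probe that converges early scores a higher AUC.
print(learning_curve_auc([0.80, 0.88, 0.94, 0.95, 0.95, 0.96, 0.96]))
print(learning_curve_auc([0.55, 0.65, 0.85, 0.92, 0.93, 0.95, 0.96]))
```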
4.2 Results
We find that affine probes are best able to capture the composition of phrase embeddings from their left and right subphrases. A depiction of probe performance at approximating representations across models and representation types is in Figure 2. However, we note that scores for most models are very high, due to the anisotropy phenomenon: the tendency for most embeddings from pretrained language models to be clustered in a narrow cone, rather than distributed evenly in all directions (Li et al., 2020; Ethayarajh, 2019). We note that this is true for both word and phrase embeddings.

Since we are comparing the probes to each other relative to the same anisotropic vectors, this is not necessarily a problem. However, in order to compare each probe's performance to chance, we correct for anisotropy using a control task.
In this task, we use the trained probe to predict a random phrase embedding from the set of treebank phrase embeddings for that model, and record the distance between the compositional probe's prediction and the random embedding. This allows us to calculate an error ratio $dist_{probe}/dist_{control}$, where $dist_{probe}$ is the original average distance from the true representation, and $dist_{control}$ is the average distance on the control task. This quantifies how much the probe improves over a random baseline that takes anisotropy into account, where a smaller value is better. These results can be found in Appendix E; the results without anisotropy correction can be found in Appendix G. In most cases, the affine probe still performs best, so we continue to use it for consistency across all model and representation types.
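A minimal sketch of this error-ratio computation under the definitions above (function and variable names are our own):

```python
import numpy as np

def cosine_dist(u, v):
    return 1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def error_ratio(preds, trues, pool, seed=0):
    """dist_probe / dist_control: average distance to the true parent
    embeddings, normalized by the average distance to random phrase
    embeddings drawn from the same model's treebank pool."""
    rng = np.random.default_rng(seed)
    dist_probe = np.mean([cosine_dist(p, t) for p, t in zip(preds, trues)])
    dist_control = np.mean(
        [cosine_dist(p, pool[rng.integers(len(pool))]) for p in preds])
    return dist_probe / dist_control

# Toy data: predictions slightly perturbed from their true embeddings.
pool = np.random.default_rng(1).normal(size=(1000, 768))
preds, trues = pool[:10] + 0.1, pool[:10]
print(error_ratio(preds, trues, pool))  # < 1 means better than chance
```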
We also compare the AUC of training curves for each probe and find that the affine probe remains the best in most cases, except RoBERTa-CLS and DeBERTa-CLS. Training curves are depicted in Appendix C; AUC values are listed in Appendix H.

Interestingly, there was a trend of the right child being weighted more heavily than the left child, and each model/representation type combination had its own characteristic ratio of the left child to the right child. For instance, in BERT, the weight on the left child was 12, whereas it was 20 for the right child.
For example, the approximation for the phrase "green eggs and ham" with BERT [CLS] embeddings would be:

$$r_{CLS}(\text{"green eggs and ham"}) = 12\, r_{CLS}(\text{"green eggs"}) + 20\, r_{CLS}(\text{"and ham"}) + \beta$$
Figure 2: Mean compositionality score (cosine similarity) and standard deviation of each approximative probe across 10 folds. Error bars indicate 95% CIs.
5 Examining Compositionality across Phrase Types
5.1 Methods
Intuitively, we expect the phrases whose representations are close to their predicted representations to be more compositional. We call the similarity to the expected representation, $\mathrm{sim}(r(x), \hat{f}(r(a), r(b)))$, the compositionality score of a phrase.

We record the mean reconstruction error for each tree type and report the results. In addition to comparing tree types to each other, we also examine the treatment of named entities in Section 5.2.1, and the relationship between the length of a phrase in words and its compositionality score in Section 5.2.2.
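As a sketch of this aggregation step, per-phrase scores can be grouped by tree type; the records below are hypothetical:

```python
import pandas as pd

# Hypothetical per-phrase records; "score" is the cosine similarity between
# the actual parent representation and the affine probe's prediction.
df = pd.DataFrame([
    {"tree_type": "NP -> DT NN", "score": 0.91},
    {"tree_type": "NP -> DT NN", "score": 0.88},
    {"tree_type": "NP -> NNP NNP", "score": 0.74},
    {"tree_type": "NP -> NNP NNP", "score": 0.71},
])
print(df.groupby("tree_type")["score"].agg(["mean", "std"]))
```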
5.2 Results
There is a significant difference between the mean compositionality scores of phrase types. In particular, the AVG representation assigns a lower compositionality score to NP → NNP NNP phrases, which is expected since this phrase type often corresponds to named entities. By contrast, the CLS representation assigns a low compositionality score to NP → DT NN, which is unexpected given that such phrases are generally seen as compositional. The reconstruction error for the most common phrase types is shown in Figure 5.
Because different phrase types may be treated differently by the model, we examine the relative compositionality of phrases within each phrase type. Examples of the most and least compositional phrases from several phrase types are shown in Table 2 for RoBERTa-CLS. Patterns vary across model and representation types, but long phrases are generally represented more compositionally.
5.2.1 Named Entities
We used SpaCy to tag and examine named entities (Honnibal and Montani, 2017), as they are expected to be less compositional. We find that named entities indeed have a lower compositionality score in all cases except RoBERTa-CLS, indicating that they are correctly represented as less compositional. A representative example is shown in Figure 3; full results can be found in Appendix J. We break down the compositionality scores of named entities by type and find surprising variation within categories of named entities. For numerical examples, this often depends on the unit used. For example, in RoBERTa-AVG representations, numbers with "million" and "billion" are grouped together as compositional, whereas numbers with quantifiers ("about", "more than", "some") are grouped together as not compositional. The compositionality score distributions for types of named entities are presented in Figure 4.
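A minimal sketch of the SpaCy tagging step; the sentence and model size are illustrative, as the paper does not specify which SpaCy model was used:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple bought the startup for about 2 billion dollars.")
for ent in doc.ents:
    # e.g. "Apple" tagged ORG, "about 2 billion dollars" tagged MONEY
    print(ent.text, ent.label_)
```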
5.2.2 Examining Compositionality and Phrase Length
There is no consistent relationship between phrase length and compositionality score across models and representation types. However, CLS and AVG representations show divergent trends: there is a strong positive correlation between phrase length and compositionality score in the AVG representations, while no consistent trend exists for the CLS representations. This indicates that longer phrases are better approximated as an affine transformation of their subphrase representations. This trend is summarized in Appendix D. All correlations are highly significant.
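As a sketch, such a correlation can be computed per model/representation type. The paper does not specify the correlation coefficient used, so Spearman's rho here is an assumption, and the data are hypothetical:

```python
from scipy.stats import spearmanr

# Hypothetical (phrase length, compositionality score) pairs for one
# model's AVG representations.
lengths = [2, 3, 4, 5, 6, 7, 8]
scores = [0.80, 0.83, 0.86, 0.88, 0.90, 0.91, 0.93]
rho, p = spearmanr(lengths, scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
```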