
each embedding we generate a random norm value and then scale the vector's original dimension values to match the new norm. This randomises vector magnitudes, while the relative sizes of the dimensions remain unchanged. In other words, all vectors keep pointing in the same directions, but any information encoded by differences in magnitude is removed.2

2 We are conscious that vectors have more than one kind of norm, so choosing which norm to scale to might not be trivial. We have explored this in supplementary experiments and found that in our framework there is no significant difference between scaling to the L1 norm vs. the L2 norm.
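As a minimal sketch, this norm-randomising step could be implemented as follows; the uniform sampling range and the use of the L2 norm are illustrative assumptions rather than a description of our exact setup:

```python
import numpy as np

def randomise_norms(embeddings: np.ndarray, low: float = 0.1, high: float = 10.0,
                    seed: int = 0) -> np.ndarray:
    """Rescale each embedding to a randomly drawn norm.

    Directions are preserved and only magnitudes change, so any information
    carried by differences in norm is destroyed. The uniform range given by
    `low`/`high` is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)        # original L2 norms
    targets = rng.uniform(low, high, size=(embeddings.shape[0], 1))  # random new norms
    return embeddings / norms * targets
```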
Ablating Both Containers: The two approaches are not mutually exclusive: applying both noising functions should have a compounding effect and ablate both information containers simultaneously, essentially generating a completely random vector with none of the original information.
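The composition of the two noise functions can be sketched as below, reusing `randomise_norms` from the previous snippet; the `randomise_dimensions` counterpart shown here (a random direction rescaled to the original norm) is only an illustrative stand-in for the dimension-ablating function described earlier:

```python
import numpy as np

def randomise_dimensions(embeddings: np.ndarray, seed: int = 0) -> np.ndarray:
    """Illustrative dimension-container ablation: replace the dimension values
    with a random direction while keeping each vector's original norm."""
    rng = np.random.default_rng(seed)
    random_vecs = rng.normal(size=embeddings.shape)
    random_vecs /= np.linalg.norm(random_vecs, axis=1, keepdims=True)
    return random_vecs * np.linalg.norm(embeddings, axis=1, keepdims=True)

def ablate_both(embeddings: np.ndarray, seed: int = 0) -> np.ndarray:
    """Apply both noise functions: random direction and random norm,
    i.e. a completely random vector with none of the original information."""
    return randomise_norms(randomise_dimensions(embeddings, seed), seed=seed + 1)
```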
2.2 Random Baselines
Even when no information is encoded in an embedding, the train set may contain class imbalance, and the probe can learn the distribution of classes. To account for this, as well as for the possibility of a powerful probe detecting an empty signal (Zhang and Bowman, 2018), we need to establish informative random baselines against which we can compare the probe's performance.
We employ two such baselines: (a) we assign random predictions to the test set, negating any information that a classifier could have learned, class distributions included; and (b) we train the probe on randomly generated vectors, establishing a baseline with access only to the class distributions.
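A rough sketch of the two baselines is given below; the logistic-regression probe and the vector dimensionality are placeholders rather than our actual probing configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def random_prediction_baseline(y_test, n_classes: int, seed: int = 0):
    """Baseline (a): random predictions on the test set, negating anything
    a classifier could have learned, class distributions included."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_classes, size=len(y_test))

def random_vector_baseline(y_train, y_test, dim: int = 768, seed: int = 0):
    """Baseline (b): train the probe on randomly generated vectors, so it
    only has access to the class distributions (the probe and `dim` are
    stand-ins, not our exact configuration)."""
    rng = np.random.default_rng(seed)
    X_train = rng.normal(size=(len(y_train), dim))
    X_test = rng.normal(size=(len(y_test), dim))
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.predict(X_test)
```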
2.3 Confidence Intervals
Finally, we must account for the degrees of randomness, which stem from two sources: (1) the probe may contain a stochastic component, e.g. a random weight initialisation; (2) the noise functions are highly stochastic (i.e. they sample random norm/dimension values). Hence, evaluation scores will differ each time the probe is trained, making relative comparisons of scores problematic. To mitigate this, we retrain and evaluate each model 50 times and report the average score over all runs, essentially bootstrapping over the random seeds.
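Schematically, this amounts to collecting one score per random seed; `train_and_evaluate` is a hypothetical helper that applies the noise function, trains the probe and returns a single evaluation score:

```python
import numpy as np

def bootstrap_scores(train_and_evaluate, n_runs: int = 50) -> np.ndarray:
    """Retrain and evaluate the probe once per random seed and collect
    the resulting scores; the reported figure is their mean."""
    return np.array([train_and_evaluate(seed) for seed in range(n_runs)])

# scores = bootstrap_scores(train_and_evaluate)
# reported = scores.mean()
```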
To obtain statistical significance for the averages, we calculate a 99% confidence interval (CI) to confirm that observed differences in the average scores of different models are significant. We use the CI range when comparing the evaluation scores of probes on any two noise models to determine whether they come from the same distribution: if the CI ranges of two averages overlap, the averages might belong to the same distribution and there is no statistically significant difference between them. Using CIs in this way gives us a clearly defined decision criterion for whether any two model performances are different.
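This decision criterion can be made concrete as in the sketch below; the t-based interval is one standard way of computing a 99% CI over the 50 run scores and is shown as an assumption, not necessarily our exact formula:

```python
import numpy as np
from scipy import stats

def confidence_interval(scores: np.ndarray, level: float = 0.99):
    """A 99% confidence interval around the mean score (t-based; an
    illustrative choice, the exact CI computation may differ)."""
    mean, sem = scores.mean(), stats.sem(scores)
    return stats.t.interval(level, df=len(scores) - 1, loc=mean, scale=sem)

def significantly_different(scores_a: np.ndarray, scores_b: np.ndarray) -> bool:
    """Non-overlapping CIs count as a significant difference; overlapping CIs
    mean the averages might come from the same distribution."""
    lo_a, hi_a = confidence_interval(scores_a)
    lo_b, hi_b = confidence_interval(scores_b)
    return hi_a < lo_b or hi_b < lo_a
```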
3 Data
In our experiments we use 10 established probing task datasets for the English language introduced by Conneau et al. (2018). The goal of the multi-class Sentence Length (SL) probing task is to predict the length of a sentence, binned into 6 possible categories, while Word Content (WC) is a task with 1000 words as targets, predicting which of the target words appears in a given sentence. The Subject and Object Number tasks (SN and ON) are binary classification tasks that predict the grammatical number of the subject/object of the main clause as being singular or plural, while the Tense (TE) task predicts whether the main verb of the sentence is in the present or past tense. The Coordination Inversion (CIN) task distinguishes sentences in which the order of two coordinated clausal conjoints has been inverted from those in which it has not. Parse Tree Depth (TD) is a multi-class prediction task where the goal is to predict the maximum depth of the sentence's syntactic tree, while Top Constituents (TC) predicts one of 20 classes of the most common syntactic top-constituent sequences. In the Bigram Shift (BS) task, the goal is to predict whether two consecutive tokens in the sentence have been inverted, and Semantic Odd Man Out (SOMO) is a task predicting whether a noun or verb was replaced with a different noun or verb. We use these datasets as published, in their totality, with no modifications.3

We also consider these tasks to represent examples of different language domains: surface information (SL, WC), morphology (SN, ON, TE), syntax (TD, TC, CIN) and contextual incongruity (BS, SOMO). This level of abstraction can lend itself to interpreting the experimental results, as there may be similarities across tasks in the same domain (note that Durrani et al. (2020) follow a similar line of reasoning).
3 https://github.com/facebookresearch/SentEval/tree/master/data/probing
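For reference, a published probing file can be read as sketched below; this assumes the tab-separated layout of the released data (partition tag, label, sentence per line) and a hypothetical local path:

```python
import csv
from collections import defaultdict

def load_probing_task(path: str):
    """Load one SentEval probing file, assuming tab-separated lines of the
    form: partition (tr/va/te), label, sentence."""
    splits = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for partition, label, sentence in csv.reader(f, delimiter="\t",
                                                     quoting=csv.QUOTE_NONE):
            splits[partition].append((sentence, label))
    return splits

# e.g. data = load_probing_task("data/probing/sentence_length.txt")
```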