Probing with Noise:
Unpicking the Warp and Weft of Embeddings
Filip Klubička and John D. Kelleher
ADAPT Centre, Technological University Dublin, Ireland
{filip.klubicka,john.kelleher}@adaptcentre.ie
Abstract
Improving our understanding of how information is encoded in vector space can yield valuable interpretability insights. Alongside vector dimensions, we argue that it is possible for the vector norm to also carry linguistic information. We develop a method to test this: an extension of the probing framework which allows for relative intrinsic interpretations of probing results. It relies on introducing noise that ablates information encoded in embeddings, grounded in random baselines and confidence intervals. We apply the method to well-established probing tasks and find evidence that confirms the existence of separate information containers in English GloVe and BERT embeddings. Our correlation analysis aligns with the experimental findings that different encoders use the norm to encode different kinds of information: GloVe stores syntactic and sentence length information in the vector norm, while BERT uses it to encode contextual incongruity.
1 Introduction
Probing in NLP, as defined by Conneau et al. (2018), is a classification problem that predicts linguistic properties using dense embeddings as training data. The framework rests on the assumption that the probe’s success at a given task indicates that the encoder is storing information on the pertinent linguistic properties. Probing has quickly become an essential tool for encoder interpretability, providing valuable insights into embeddings.
In essence, embeddings are vectors positioned in a shared multidimensional vector space, and vectors are geometrically defined by two aspects: having both a direction and a magnitude (Hefferon, 2018, page 36). Direction is the position in the space that the vector points towards (expressed by its dimension values), while magnitude is a vector’s length, defined as its distance from the origin (expressed by the vector norm) (Anton and Rorres, 2013, page 131). It is understood that information contained in a vector is encoded in the dimension values, which are most often studied in NLP research (see §6). However, information can be encoded in a representational vector space in more implicit ways, and relations can be inferred from more than just vector dimension values.
We hypothesise that it is possible for the vector magnitude, the norm, to carry information as well. Though it is a distributed property of a vector’s dimensions, the norm not only relates the distance of a vector from the origin, but indirectly also its distance from other vectors. Two vectors could be pointing in the exact same direction, but their distance from the origin might differ dramatically.[1] A similar effect has been observed in the literature: for many word embedding algorithms, the norm of the word vector correlates with the word’s frequency (Schakel and Wilson, 2015). E.g. in fastText embeddings the vectors of stop words (the most frequent words in English) are positioned closer to the origin than content words (Balodis and Deksne, 2018); and Goldberg (2017) notes that for many embeddings normalising the vectors removes word frequency information. Additionally, the norm plays an integral part in BERT’s attention layer, controlling the levels of contribution from frequent, less informative words by controlling the norms of their vectors (Kobayashi et al., 2020). It stands to reason that the norm could be leveraged by embedding models to encode other linguistic information as well. Hence, we argue that a vector representation has two information containers: vector dimensions and the vector norm (the titular warp and weft). In this paper, we test the assumption that these two components can be used to encode different types of information.
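As a toy numeric illustration of the two geometric aspects (the values here are arbitrary and not from the paper):

```python
import numpy as np

# Two vectors sharing a direction but differing in magnitude.
v = np.array([3.0, 4.0])
w = 2.5 * v  # points the same way, but is 2.5 times longer

unit_v = v / np.linalg.norm(v)  # the direction container
unit_w = w / np.linalg.norm(w)  # identical to unit_v

norm_v = np.linalg.norm(v)  # 5.0  (the magnitude container)
norm_w = np.linalg.norm(w)  # 12.5
```

Any probe that only consults dimension values scaled to unit length would treat `v` and `w` as identical, even though their norms differ by a factor of 2.5.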
To this end, we need a probing method that provides an intrinsic evaluation of any given embedding representation, for which the typical probing pipeline is not suited. We thus extend the existing probing framework by introducing random noise into the embeddings. This enables us to do an intrinsic evaluation of a single encoder by testing whether the noise disrupted the information in the embedding being tested. The right application of noise enables us to determine which embedding component the relevant information is encoded in, by ablating that component’s information. In turn, this can inform our understanding of how certain linguistic properties are encoded in vector space. We call the method probing with noise and demonstrate its generalisability to both contextual and static encoders by using it to intrinsically evaluate English GloVe and BERT embeddings on a number of established probing tasks.

[1] Mathematically, two vectors can only be considered equal if both their direction and magnitude are equal (Anton and Rorres, 2013, page 137).

arXiv:2210.12206v1 [cs.CL] 21 Oct 2022
This paper’s main contributions are: (a) a methodological extension of the probing framework: probing with noise; (b) an array of experiments demonstrating the method on a range of probing tasks; and (c) an exploration of the importance of the vector norm in encoding linguistic phenomena in different embedding models.
2 Method: Probing With Noise
Our method is an extension of the typical probing
pipeline (steps 1-6), incorporated as steps 7 and 8:
1. Choose a probing task
2. Choose or design an appropriate dataset
3. Choose a word/sentence representation
4. Choose a probing classifier (the probe)
5. Train the probe on the embeddings as input
6. Evaluate the probe’s performance on the task
7. Introduce systematic noise in the embedding
8. Repeat training, evaluate and compare
Usually, the evaluation score from step 6 is used as a basis to make inferences regarding the presence of the probed information in embeddings. Different encoders are compared based on their evaluation score and the probe’s relative performance can inform which model stores the information more saliently. Though ours may seem like a minor addition, it changes the approach conceptually. Now, rather than providing the final score, the output of step 6 establishes an intrinsic, vanilla baseline. Embeddings with noise injections can then be compared against it in steps 7 and 8, offering a relative intrinsic interpretation of the evaluation. In other words, using relative information between a vector representation and targeted ablations of itself allows for inferences to be made on where information is encoded in embeddings.
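The extended pipeline can be sketched end to end. Everything below is a stand-in, not the paper’s setup: synthetic embeddings with an artificially separable label, a nearest-centroid classifier in place of a real probe, and a noising step that simply replaces each vector with pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 2-3 (stand-ins): synthetic binary labels and "embeddings"
# that are cleanly separated by class.
n, d = 200, 8
y = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, d)) + 2.0 * y[:, None]

def probe_accuracy(Xtr, ytr, Xte, yte):
    # Steps 4-6: a nearest-centroid classifier stands in for the probe.
    cents = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
    preds = np.array([min(cents, key=lambda c: np.linalg.norm(x - cents[c]))
                      for x in Xte])
    return (preds == yte).mean()

vanilla = probe_accuracy(X[:150], y[:150], X[150:], y[150:])  # vanilla baseline

def noised(X):
    # Step 7 (crudest variant): replace every embedding with pure noise,
    # ablating all of its information.
    return rng.standard_normal(X.shape)

# Step 8: retrain on the noised embeddings and compare to the baseline.
ablated = probe_accuracy(noised(X[:150]), y[:150], noised(X[150:]), y[150:])
```

On this synthetic data the vanilla probe scores near-perfectly while the ablated probe drops to chance, which is the kind of gap the relative evaluation is designed to surface.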
The method relies on three supporting pillars: (a) random baselines, which in tandem with the vanilla baseline provide the basis for a relative evaluation; (b) statistical significance derived from confidence intervals, which informs the inferences we make based on the relative evaluation; and (c) targeted noise, which enables us to examine where the information is encoded. We describe them in the following subsections, starting with the noise.
2.1 Choosing the Noise
The nature of the noise is crucial for our method, as the goal is to systematically disrupt the content of the information containers in order to identify whether a container encodes the information. We use an ablation method to do this: by introducing noise into either container we “sabotage” the representation, in turn identifying whether the information we are probing for has been removed. Though we introduce random noise, our choice of how to apply it is systematic, as it is important that the noising function applied to one container leaves the information in the remaining container intact, otherwise the results will not offer relevant insight.
Ablating the Dimension Container: The noise function for ablating the dimensions needs to remove its information completely, while leaving the norm intact. It should also not change the dimensionality of the vector, given that a change in the dimensionality of a feature also changes the chance of the probe finding a random or spurious hyperplane that performs well on the data sample. Maintaining the dimensionality thus ensures that the probability of the model finding such a lucky split in the feature space remains unchanged.

Our noise function satisfies these constraints: for each embedding in a dataset, we generate a new, random vector of the same dimensionality, then scale the new dimension values to match the norm of the original vector. This invalidates any semantics assigned to a particular dimension as the values are replaced with meaningless noise, while retaining the original vector’s norm values.
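A minimal sketch of this dimension-ablation function follows. The paper does not specify the sampling distribution for the noise, so a standard Gaussian is assumed here:

```python
import numpy as np

def ablate_dimensions(x, rng):
    """Replace a vector's dimension values with random noise
    (assumed Gaussian) while preserving its original L2 norm."""
    noise = rng.standard_normal(x.shape)
    # Rescale the noise vector so its norm matches the original's.
    return noise * (np.linalg.norm(x) / np.linalg.norm(noise))

rng = np.random.default_rng(0)
v = np.array([1.0, -2.0, 3.0])
v_ablated = ablate_dimensions(v, rng)
# Norm preserved, direction randomised.
```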
Ablating the Norm Container: To remove information potentially carried by a vector’s norm while retaining dimension information, we apply a noising function analogous to the previous one: for each embedding we generate a random norm value, and then scale the vector’s original dimension values to match the new norm. This randomises vector magnitudes, while the relative sizes of the dimensions remain unchanged. In other words, all vectors will keep pointing in the same directions, but any information encoded by differences in magnitude is removed.[2]
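The analogous norm-ablation function can be sketched the same way. The sampling range for the new norm is an assumption here, as the paper does not specify it:

```python
import numpy as np

def ablate_norm(x, rng, low=0.0, high=10.0):
    """Rescale a vector to a random L2 norm, keeping its direction.
    The uniform sampling range [low, high) is an assumed choice."""
    new_norm = rng.uniform(low, high)
    return x * (new_norm / np.linalg.norm(x))

rng = np.random.default_rng(0)
v = np.array([1.0, -2.0, 3.0])
v_ablated = ablate_norm(v, rng)
# Direction preserved, magnitude randomised.
```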
Ablating Both Containers: The two approaches are not mutually exclusive: applying both noising functions should have a compounding effect and ablate both information containers simultaneously, essentially generating a completely random vector with none of the original information.
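The composition can be written directly, again as a sketch with assumed sampling distributions:

```python
import numpy as np

def ablate_both(x, rng, max_norm=10.0):
    # A fresh random direction scaled to a fresh random norm:
    # equivalent to applying the dimension and norm ablations in sequence.
    noise = rng.standard_normal(x.shape)
    return noise * (rng.uniform(0.0, max_norm) / np.linalg.norm(noise))

rng = np.random.default_rng(0)
v = np.array([1.0, -2.0, 3.0])
fully_random = ablate_both(v, rng)
```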
2.2 Random Baselines
Even when no information is encoded in an embedding, the train set may contain class imbalance, and the probe can learn the distribution of classes. To account for this, as well as the possibility of a powerful probe detecting an empty signal (Zhang and Bowman, 2018), we need to establish informative random baselines against which we can compare the probe’s performance.

We employ two such baselines: (a) we assert a random prediction onto the test set, negating any information that a classifier could have learned, class distributions included; and (b) we train the probe on randomly generated vectors, establishing a baseline with access only to class distributions.
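A sketch of the two baselines on hypothetical imbalanced labels. For baseline (b), a majority-class prediction stands in for an actually trained probe, since on random vectors the class distribution is all it can learn:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical imbalanced binary labels (roughly a 70/30 split).
y_test = (rng.random(n) < 0.3).astype(int)

# Baseline (a): random predictions asserted onto the test set;
# even the class distribution is negated.
acc_a = (rng.integers(0, 2, size=n) == y_test).mean()

# Baseline (b): a probe trained on random vectors can still exploit
# class imbalance; predicting the majority class stands in for it here.
majority = np.bincount(y_test).argmax()
acc_b = (np.full(n, majority) == y_test).mean()
```

With imbalanced classes, baseline (b) sits above baseline (a), which is exactly why both are needed to interpret a probe’s score.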
2.3 Confidence Intervals
Finally, we must account for the degrees of randomness, which stem from two sources: (1) the probe may contain a stochastic component, e.g. a random weight initialisation; (2) the noise functions are highly stochastic (i.e. sampling random norm/dimension values). Hence, evaluation scores will differ each time the probe is trained, making relative comparisons of scores problematic. To mitigate this, we retrain and evaluate each model 50 times, reporting the average score of all runs, essentially bootstrapping over the random seeds.
To obtain statistical significance for the averages, we calculate a 99% confidence interval (CI) to confirm that observed differences in the averages of different model scores are significant. We use the CI range when comparing evaluation scores of probes on any two noise models to determine whether they come from the same distribution: if there is overlap in the range of two possible averages they might belong to the same distribution and there is no statistically significant difference between them. Using CIs in this way gives us a clearly defined decision criterion on whether any model performances are different.

[2] We are conscious that vectors have more than one kind of norm, so choosing which norm to scale to might not be trivial. We have explored this in supplementary experiments and found that in our framework there is no significant difference between scaling to the L1 norm vs. the L2 norm.
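The overlap criterion can be sketched as follows, using a normal-approximation 99% CI over the 50 run scores (the exact CI construction is an assumption; the run scores below are synthetic):

```python
import numpy as np

def confidence_interval(scores, z=2.576):  # z-value for a 99% CI
    scores = np.asarray(scores)
    mean = scores.mean()
    half = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean - half, mean + half

def significantly_different(scores_a, scores_b):
    # The decision criterion: non-overlapping 99% CIs.
    lo_a, hi_a = confidence_interval(scores_a)
    lo_b, hi_b = confidence_interval(scores_b)
    return hi_a < lo_b or hi_b < lo_a

rng = np.random.default_rng(0)
# Hypothetical accuracies from 50 retrainings of two probe variants:
vanilla_runs = rng.normal(0.80, 0.01, size=50)
noised_runs = rng.normal(0.55, 0.01, size=50)
```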
3 Data
In our experiments we use 10 established probing task datasets for the English language introduced by Conneau et al. (2018). The goal of the multi-class Sentence Length (SL) probing task is to predict the length of the sentence as binned in 6 possible categories, while Word Content (WC) is a task with 1000 words as targets, predicting which of the target words appears in a given sentence. The Subject and Object Number tasks (SN and ON) are binary classification tasks that predict the grammatical number of the subject/object of the main clause as being singular or plural, while the Tense (TE) task predicts whether the main verb of the sentence is in the present or past tense. The Coordination Inversion (CIN) task distinguishes between a sentence where the order of two coordinated clausal conjoints has been inverted or not. Parse Tree Depth (TD) is a multi-class prediction task where the goal is to predict the maximum depth of the sentence’s syntactic tree, while Top Constituents (TC) predicts one of 20 classes of the most common syntactic top-constituent sequences. In the Bigram Shift (BS) task, the goal is to predict whether two consecutive tokens in the sentence have been inverted, and Semantic Odd Man Out (SOMO) is a task predicting whether a noun or verb was replaced with a different noun or verb. We use these datasets as published in their totality, with no modifications.[3] We also consider these tasks to represent examples of different language domains: surface information (SL, WC), morphology (SN, ON, TE), syntax (TD, TC, CIN) and contextual incongruity (BS, SOMO). This level of abstraction can lend itself to interpreting the experimental results, as there may be similarities across tasks in the same domain (note that Durrani et al. (2020) follow a similar line of reasoning).

[3] https://github.com/facebookresearch/SentEval/tree/master/data/probing