
each embedding we generate a random norm value and then scale the vector's original dimension values to match the new norm. This randomises vector magnitudes, while the relative sizes of the dimensions remain unchanged. In other words, all vectors keep pointing in the same directions, but any information encoded by differences in magnitude is removed.2

2 We are conscious that vectors have more than one kind of norm, so choosing which norm to scale to might not be trivial. We have explored this in supplementary experiments and found that in our framework there is no significant difference between scaling to the L1 norm vs. the L2 norm.
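As a minimal sketch, this norm-randomising step could be implemented as follows; the uniform sampling range and the use of the L2 norm are illustrative assumptions rather than a description of our exact setup:

```python
import numpy as np

def randomise_norms(embeddings: np.ndarray, low: float = 0.1, high: float = 10.0,
                    seed: int = 0) -> np.ndarray:
    """Rescale each embedding to a randomly drawn norm.

    Directions are preserved and only magnitudes change, so any information
    carried by differences in norm is destroyed. The uniform range given by
    `low`/`high` is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)        # original L2 norms
    targets = rng.uniform(low, high, size=(embeddings.shape[0], 1))  # random new norms
    return embeddings / norms * targets
```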
Ablating Both Containers: The two approaches are not mutually exclusive: applying both noising functions should have a compounding effect and ablate both information containers simultaneously, essentially generating a completely random vector with none of the original information.
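The composition of the two noise functions can be sketched as below, reusing `randomise_norms` from the previous snippet; the `randomise_dimensions` counterpart shown here (a random direction rescaled to the original norm) is only an illustrative stand-in for the dimension-ablating function described earlier:

```python
import numpy as np

def randomise_dimensions(embeddings: np.ndarray, seed: int = 0) -> np.ndarray:
    """Illustrative dimension-container ablation: replace the dimension values
    with a random direction while keeping each vector's original norm."""
    rng = np.random.default_rng(seed)
    random_vecs = rng.normal(size=embeddings.shape)
    random_vecs /= np.linalg.norm(random_vecs, axis=1, keepdims=True)
    return random_vecs * np.linalg.norm(embeddings, axis=1, keepdims=True)

def ablate_both(embeddings: np.ndarray, seed: int = 0) -> np.ndarray:
    """Apply both noise functions: random direction and random norm,
    i.e. a completely random vector with none of the original information."""
    return randomise_norms(randomise_dimensions(embeddings, seed), seed=seed + 1)
```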
2.2 Random Baselines
Even when no information is encoded in an embedding, the train set may contain class imbalance, and the probe can learn the distribution of classes. To account for this, as well as for the possibility of a powerful probe detecting an empty signal (Zhang and Bowman, 2018), we need to establish informative random baselines against which we can compare the probe's performance.
We employ two such baselines: (a) we assign random predictions to the test set, negating any information that a classifier could have learned, class distributions included; and (b) we train the probe on randomly generated vectors, establishing a baseline with access only to the class distributions.
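A rough sketch of the two baselines is given below; the logistic-regression probe and the vector dimensionality are placeholders rather than our actual probing configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def random_prediction_baseline(y_test, n_classes: int, seed: int = 0):
    """Baseline (a): random predictions on the test set, negating anything
    a classifier could have learned, class distributions included."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_classes, size=len(y_test))

def random_vector_baseline(y_train, y_test, dim: int = 768, seed: int = 0):
    """Baseline (b): train the probe on randomly generated vectors, so it
    only has access to the class distributions (the probe and `dim` are
    stand-ins, not our exact configuration)."""
    rng = np.random.default_rng(seed)
    X_train = rng.normal(size=(len(y_train), dim))
    X_test = rng.normal(size=(len(y_test), dim))
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.predict(X_test)
```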
2.3 Confidence Intervals
Finally, we must account for the degrees of randomness, which stem from two sources: (1) the probe may contain a stochastic component, e.g. a random weight initialisation; (2) the noise functions are highly stochastic (i.e. they sample random norm/dimension values). Hence, evaluation scores will differ each time the probe is trained, making relative comparisons of scores problematic. To mitigate this, we retrain and evaluate each model 50 times and report the average score over all runs, essentially bootstrapping over the random seeds.
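Schematically, this amounts to collecting one score per random seed; `train_and_evaluate` is a hypothetical helper that applies the noise function, trains the probe and returns a single evaluation score:

```python
import numpy as np

def bootstrap_scores(train_and_evaluate, n_runs: int = 50) -> np.ndarray:
    """Retrain and evaluate the probe once per random seed and collect
    the resulting scores; the reported figure is their mean."""
    return np.array([train_and_evaluate(seed) for seed in range(n_runs)])

# scores = bootstrap_scores(train_and_evaluate)
# reported = scores.mean()
```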
To obtain statistical significance for the averages, we calculate a 99% confidence interval (CI) to confirm that observed differences in the average scores of different models are significant. We use the CI range when comparing the evaluation scores of probes on any two noise models to determine whether they come from the same distribution: if the CI ranges of two averages overlap, the averages might belong to the same distribution and there is no statistically significant difference between them. Using CIs in this way gives us a clearly defined decision criterion for whether any two model performances are different.
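This decision criterion can be made concrete as in the sketch below; the t-based interval is one standard way of computing a 99% CI over the 50 run scores and is shown as an assumption, not necessarily our exact formula:

```python
import numpy as np
from scipy import stats

def confidence_interval(scores: np.ndarray, level: float = 0.99):
    """A 99% confidence interval around the mean score (t-based; an
    illustrative choice, the exact CI computation may differ)."""
    mean, sem = scores.mean(), stats.sem(scores)
    return stats.t.interval(level, df=len(scores) - 1, loc=mean, scale=sem)

def significantly_different(scores_a: np.ndarray, scores_b: np.ndarray) -> bool:
    """Non-overlapping CIs count as a significant difference; overlapping CIs
    mean the averages might come from the same distribution."""
    lo_a, hi_a = confidence_interval(scores_a)
    lo_b, hi_b = confidence_interval(scores_b)
    return hi_a < lo_b or hi_b < lo_a
```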
3 Data
In our experiments we use 10 established probing task datasets for the English language introduced by Conneau et al. (2018). The goal of the multi-class Sentence Length (SL) probing task is to predict the length of a sentence, binned into 6 possible categories, while Word Content (WC) is a task with 1000 words as targets, predicting which of the target words appears in a given sentence. The Subject and Object Number tasks (SN and ON) are binary classification tasks that predict the grammatical number of the subject/object of the main clause as being singular or plural, while the Tense (TE) task predicts whether the main verb of the sentence is in the present or past tense. The Coordination Inversion (CIN) task distinguishes sentences in which the order of two coordinated clausal conjoints has been inverted from those in which it has not. Parse Tree Depth (TD) is a multi-class prediction task where the goal is to predict the maximum depth of the sentence's syntactic tree, while Top Constituents (TC) predicts one of 20 classes of the most common syntactic top-constituent sequences. In the Bigram Shift (BS) task, the goal is to predict whether two consecutive tokens in the sentence have been inverted, and Semantic Odd Man Out (SOMO) is a task predicting whether a noun or verb was replaced with a different noun or verb. We use these datasets as published, in their totality, with no modifications.3

We also consider these tasks to represent examples of different language domains: surface information (SL, WC), morphology (SN, ON, TE), syntax (TD, TC, CIN) and contextual incongruity (BS, SOMO). This level of abstraction can lend itself to interpreting the experimental results, as there may be similarities across tasks in the same domain (note that Durrani et al. (2020) follow a similar line of reasoning).
3 https://github.com/facebookresearch/SentEval/tree/master/data/probing
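For reference, a published probing file can be read as sketched below; this assumes the tab-separated layout of the released data (partition tag, label, sentence per line) and a hypothetical local path:

```python
import csv
from collections import defaultdict

def load_probing_task(path: str):
    """Load one SentEval probing file, assuming tab-separated lines of the
    form: partition (tr/va/te), label, sentence."""
    splits = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for partition, label, sentence in csv.reader(f, delimiter="\t",
                                                     quoting=csv.QUOTE_NONE):
            splits[partition].append((sentence, label))
    return splits

# e.g. data = load_probing_task("data/probing/sentence_length.txt")
```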