
Idiom                 | Matched phrase              | Syntactic pattern                         | Log frequency
Devil’s advocate      | Baker’s town                | JJ/dep/2 NN/pobj/0                        | 2.398
Act of darkness       | Abandonment of institution  | NN/dobj/0 IN/prep/1 NN/pobj/2             | 4.304
School of hard knocks | Field of social studies     | NN/pobj/0 IN/prep/1 JJ/amod/4 NNS/pobj/2  | 6.690

Table 1: Examples of idioms with their matched phrases, selected based on having the same syntactic pattern and
most similar log frequency in the Syntactic Ngrams dataset. Examples depicted here have the same log frequency.
Note that the frequency is based on the most common dependency and constituency pattern found in Syntactic
Ngrams. Humans were asked to rate each phrase for its compositionality.
remaining 10% were divided into a test set (5%)
and dev set (5%).5
To fairly compare probes, we used minimum description length probing (Voita and Titov, 2020). This approximates the length of the online code needed to transmit both the model and the data, which is related to the area under the learning curve. Specifically, we recorded the average cosine similarity between the predicted vector and the actual vector on the test set while varying the size of the training set from 0.005% to 100% of the original.6 We compare the AUC of each probe under these conditions to select the most parsimonious approximation for each model.
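As a rough illustration, the following sketch (not the authors' code) shows how such a learning-curve comparison could be computed: a fresh probe is trained at each data fraction, test-set cosine similarity is recorded, and probes are ranked by the area under the curve. The `probe_factory`, `fit`, and `predict` names, the array inputs, and the choice of a trapezoidal AUC over log-spaced fractions are illustrative assumptions.

```python
import numpy as np

# Milestones from 0.005% to 100% of the training data (see footnote 6).
FRACTIONS = [0.00005, 0.0001, 0.001, 0.005, 0.01, 0.10, 1.0]

def avg_cosine(pred, gold):
    """Mean cosine similarity between corresponding rows of two (n, d) arrays."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gold = gold / np.linalg.norm(gold, axis=1, keepdims=True)
    return float(np.mean(np.sum(pred * gold, axis=1)))

def learning_curve_auc(probe_factory, train_data, test_inputs, test_gold):
    """Train a fresh probe on each data fraction and return the area under
    the resulting test-set cosine-similarity curve (trapezoidal rule over
    the log-spaced fractions)."""
    scores = []
    for frac in FRACTIONS:
        n = max(1, int(frac * len(train_data)))
        probe = probe_factory()            # same seed/initialization each time
        probe.fit(train_data[:n])          # hypothetical training interface
        scores.append(avg_cosine(probe.predict(test_inputs), test_gold))
    return float(np.trapz(scores, x=np.log10(FRACTIONS)))
```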
4.2 Results
We find that affine probes are best able to capture the composition of phrase embeddings from their left and right subphrases. Figure 2 depicts probe performance at approximating representations across models and representation types. However, we note that scores for most models are very high, due to the anisotropy phenomenon: the tendency for most embeddings from pretrained language models to be clustered in a narrow cone, rather than distributed evenly in all directions (Li et al., 2020; Ethayarajh, 2019). We note that this holds for both word and phrase embeddings.
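For intuition, anisotropy is often quantified (in the spirit of Ethayarajh, 2019) as the expected cosine similarity between randomly paired embeddings. The sketch below is an illustrative implementation of that estimate, not code from this work; the function name and sampling scheme are assumptions.

```python
import numpy as np

def anisotropy_estimate(embeddings, n_pairs=10_000, seed=0):
    """Average cosine similarity between randomly paired embeddings.
    Roughly isotropic vectors give values near 0; embeddings clustered
    in a narrow cone give values well above 0."""
    rng = np.random.default_rng(seed)
    idx_a = rng.integers(0, len(embeddings), size=n_pairs)
    idx_b = rng.integers(0, len(embeddings), size=n_pairs)
    a = embeddings[idx_a] / np.linalg.norm(embeddings[idx_a], axis=1, keepdims=True)
    b = embeddings[idx_b] / np.linalg.norm(embeddings[idx_b], axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))
```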
5 The learned probes were trained with early stopping on the dev set with a patience of 2 epochs, up to a maximum of 20 epochs. The Adam optimizer was used, with a batch size of 512 and a learning rate of 0.512.

6 We look at milestones of 0.005%, 0.01%, 0.1%, 0.5%, 1%, 10% and 100% specifically. This was because initial experimentation showed that probes tended to converge at or before 10% of the training data. Models were trained separately (with the same seed and initialization) for each percentage of the training data, and trained until convergence for each data percentage condition.

Since we are comparing the probes to each other relative to the same anisotropic vectors, this is not necessarily a problem.
However, in order to compare each probe’s performance to chance, we correct for anisotropy using a control task. In this task, the trained probe makes its usual prediction, but we record the distance between the compositional probe’s prediction and a random phrase embedding drawn from the set of treebank phrase embeddings for that model. This allows us to calculate an error ratio dist_probe / dist_control, where dist_probe represents the original average distance from the true representation, and dist_control is the average distance on the control task. This quantifies how much the probe improves over a random baseline that takes anisotropy into account; a smaller value is better. These results can be found in Appendix E, and the results without anisotropy correction can be found in Appendix G. In most cases, the affine probe still performs best, so we continue to use it for consistency across all model and representation types.
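A minimal sketch of this correction is given below, assuming precomputed arrays of probe predictions, gold phrase embeddings, and the pool of treebank phrase embeddings for the model; all names are placeholders, and the use of cosine distance mirrors the cosine-based probe metric but is an assumption.

```python
import numpy as np

def _cosine_distance(a, b):
    """Row-wise cosine distance (1 - cosine similarity) between (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=1)

def error_ratio(predictions, gold, phrase_pool, seed=0):
    """dist_probe / dist_control; smaller is better.

    dist_probe:   average distance between probe predictions and gold vectors.
    dist_control: average distance between the same predictions and randomly
                  drawn phrase embeddings from the treebank pool."""
    rng = np.random.default_rng(seed)
    dist_probe = _cosine_distance(predictions, gold).mean()
    random_idx = rng.integers(0, len(phrase_pool), size=len(predictions))
    dist_control = _cosine_distance(predictions, phrase_pool[random_idx]).mean()
    return float(dist_probe / dist_control)
```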
We also compare the AUC of training curves for each probe and find that the affine probe remains the best in most cases, except for RoBERTa_CLS and DeBERTa_CLS. Training curves are depicted in Appendix C; AUC values are listed in Appendix H.
Interestingly, there was a trend of the right child
being weighted more heavily than the left child,
and each model/representation type combination
had its own characteristic ratio of the left child to
the right child. For instance, in BERT, the weight
on the left child was 12, whereas it was 20 for the
right child.
For example, the approximation for the phrase "green eggs and ham" with BERT [CLS] embeddings would be:

r_CLS("green eggs and ham") = 12 · r_CLS("green eggs") + 20 · r_CLS("and ham") + β.
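As an illustration only, this scalar-weighted form could be applied as in the sketch below; the helper name, the zero default bias, and the NumPy inputs are assumptions, and in the actual probe the weights and the bias β are learned parameters.

```python
import numpy as np

def affine_approximation(r_left, r_right, w_left=12.0, w_right=20.0, beta=None):
    """Scalar-weighted affine approximation of a parent phrase embedding:
        r(parent) ≈ w_left * r(left) + w_right * r(right) + beta
    The default weights are the BERT [CLS] values quoted above; beta stands
    in for the probe's learned bias vector."""
    if beta is None:
        beta = np.zeros_like(r_left)   # placeholder; learned in practice
    return w_left * r_left + w_right * r_right + beta

# e.g. approximating r_CLS("green eggs and ham") from its two subphrase vectors:
# approx = affine_approximation(r_cls_green_eggs, r_cls_and_ham, beta=learned_bias)
```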