
probing results with thousands of experiments
for each model.
• We illustrate the possibilities of the framework with the example of the mBERT model, demonstrating new insights and corroborating the results of previous studies conducted on narrower data.
Performing probing studies at such a large scale addresses the vision outlined in Nichols (2007) and contributes a new dimension to linguistic typology research, as the revealed structures are encapsulated in tools and data inseparably tied to the nature of present-day language. Our framework provides users from different fields, including linguists, with a new perspective on the typological proximity of languages and categories.
2 Related Work
Various attempts have been made to interpret the behavior and hidden learned representations of language models. For example, Hoover et al. (2020) investigated the attention heads of the BERT model at the level of word-token connectivity. Wallace et al. (2019) presented an interpretation framework that improves the visual component of the model prediction process on several NLP tasks for the end user.
Flourishing after the ACL debates on semantic parsing (https://aclanthology.org/volumes/W14-24/), the probing methodology has developed its own model interpretation tools. Thus, the SentEval framework (Conneau and Kiela, 2018) includes various types of linguistically-motivated tasks: surface tasks probe for sentence length (SentLen) and for the presence of words in the sentence (WC); syntactic tasks test for sensitivity to word order (BShift), the depth of the syntactic tree (TreeDepth), and the sequence of top-level constituents in the syntax tree (TopConst); semantic tasks check for the tense (Tense), the subject (resp. direct object) number in the main clause (SubjNum, resp. ObjNum), the sensitivity to random replacement of a noun/verb (SOMO), and the random swapping of coordinated clausal conjuncts (CoordInv).
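All of these tasks share the same underlying recipe: sentence representations are extracted from a frozen encoder, and a lightweight classifier is trained to predict the labeled property, with its accuracy read as evidence of how strongly the property is encoded. A minimal sketch in Python, assuming scikit-learn as the probe implementation and using random arrays in place of real encoder outputs (the function name and data are illustrative, not part of SentEval):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def run_probe(train_emb, train_labels, test_emb, test_labels):
    """Fit a linear probe on frozen sentence embeddings.

    The embeddings are (n_sentences, hidden_dim) arrays produced by a
    frozen encoder; only the lightweight classifier is trained, so the
    score reflects what the representations already encode.
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_emb, train_labels)
    return accuracy_score(test_labels, probe.predict(test_emb))

# Illustrative run with random arrays standing in for real encoder
# outputs and SentLen-style bucketed labels (6 length bins).
rng = np.random.default_rng(0)
acc = run_probe(
    rng.normal(size=(1000, 768)), rng.integers(0, 6, size=1000),
    rng.normal(size=(200, 768)), rng.integers(0, 6, size=200),
)
print(f"probe accuracy: {acc:.3f}")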
Linspector (Şahin et al., 2019) includes 15 probing tasks for 24 languages that take morphosyntactic language properties into account, including case, verb mood, and tense, syntactic correctness, and the semantic impossibility of an example. While lacking the simplicity of the SentEval approach, the framework provides both a linguistically-grounded and multilingual setup. We significantly expand both the list of languages and the set of properties being examined.
Probe-X (Ravishankar et al., 2019b) expanded the SentEval setup with 5 additional languages, while the NeuroX framework (Dalvi et al., 2019) enriched the methodology to allow for cross-model analysis of the results, supporting neuron-level inspection.
2.1 Probing Critique
We outline several reasons why some probing practices are methodologically problematic.
First, the interpretation of probing results can differ from paper to paper, with different authors drawing divergent conclusions from comparable numbers. While Jawahar et al. (2019) achieve 69.5-96.2% accuracy on the SentLen SentEval probing task (BERT model), they state only that this information is "somehow" represented at the bottom layers. Ravishankar et al. (2019b) achieve 38-51% accuracy on SentLen (RNN encoder), yet state that "recurrent encoders show solid performance on certain tasks, such as sentence length." This drastic difference in result interpretation ("somehow" vs. "solid") misrepresents the factual results. Conflicting evidence within the field of BERTology can be found in Rogers et al. (2020), see Sections 3.1 and 4.3.
Second, results on similar tasks can vary unpredictably if the hyperparameters are not fixed or exhaustively described: for example, Jawahar et al. (2019) find that "BERT's intermediate layers encode a rich hierarchy of linguistic information, with surface features at the bottom, syntactic features in the middle and semantic features at the top," while the work by Tikhonova et al. (2022) on mBERT shows that the model does not learn this linguistic information. More meta-research is needed to explore the contradictory results obtained by the community.
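One low-cost mitigation is to pin and report every degree of freedom of the probe itself, so that accuracy differences can be attributed to the representations rather than to the setup. A minimal sketch of such bookkeeping (the configuration fields and file name are illustrative, not taken from any of the cited works):

import json
import random

import numpy as np

# Hypothetical probe configuration: every setting that can change the
# outcome is pinned and serialized alongside the reported scores.
PROBE_CONFIG = {
    "seed": 42,
    "classifier": "logistic_regression",
    "max_iter": 1000,
    "C": 1.0,            # inverse regularization strength
    "layer": 8,          # encoder layer the representations come from
    "pooling": "mean",   # how token states become a sentence vector
}

def set_seed(seed: int) -> None:
    """Pin the sources of randomness used by the probing pipeline."""
    random.seed(seed)
    np.random.seed(seed)

set_seed(PROBE_CONFIG["seed"])

# Store the exact configuration next to the scores it produced, so a
# reported accuracy can always be traced back to a concrete setup.
with open("probe_run_config.json", "w") as f:
    json.dump(PROBE_CONFIG, f, indent=2)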
2.2 Task Representation
In the survey of post-hoc language model interpretation by Madsen et al. (2021), the linguistic-information-based tasks fall into the group with the highest abstraction and the most informative properties used. This group of projects includes tasks based on the various theoretical language levels: from part-of-speech tagging to discourse.