Universal and Independent: Multilingual Probing Framework for
Exhaustive Model Interpretation and Evaluation
Oleg Serikov‡♥, Vitaly Protasov‡, Ekaterina Voloshina$, Viktoria Knyazkova♥,
Tatiana Shavrina‡$
‡Artificial Intelligence Research Institute, $SberDevices,
♥HSE University, DeepPavlov lab, MIPT
Abstract
Linguistic analysis of language models is one of the ways to explain and describe their reasoning, weaknesses, and limitations. In the probing branch of model interpretability research, studies concern individual languages as well as individual linguistic structures. The question arises: are the detected regularities linguistically coherent, or, on the contrary, do they clash at the typological scale? Moreover, the majority of studies address an inherited set of languages and linguistic structures, leaving the actual typological diversity out of scope. In this paper, we present and apply a GUI-assisted framework that allows us to easily probe a massive number of languages for all the morphosyntactic features present in the Universal Dependencies data. We show that, reflecting the Anglo-centric trend in NLP over the past years, most of the regularities revealed in the mBERT model are typical of Western European languages. Our framework can be integrated with existing probing toolboxes, model cards, and leaderboards, allowing practitioners to use and share their standard probing methods to interpret multilingual models. We thus propose a toolkit to systematize the multilingual flaws of multilingual models, providing a reproducible experimental setup for 104 languages and 80 morphosyntactic features. The framework is available on GitHub.
1 Introduction
Probing methods shed light on the black box of neural models by unearthing the linguistic features encoded in them. Probing sets up a standard pipeline that takes various internal representations from the model and uses an auxiliary classifier to predict the linguistic information captured in those representations.
As probing research has produced contradictory results across languages and language models, there appears to be a methodological need for a meta-study of the accumulated knowledge and a need to standardize the experimental setup. At the same time, fixing the setup and hyperparameters should still allow the reproduction of a wide range of experiments, such as multilingual probing, as in X-Probe (Ravishankar et al., 2019a) and Linspector (Şahin et al., 2020), layer-wise probing (Fayyaz et al., 2021), and chronological probing (Voloshina et al., 2022).
Often, data for probing experiments is based on already known competition data, benchmarks, and gold standards. To obtain consistent results, such data must be of high quality, manually validated, and careful to include multiple languages. For this reason, in this work we use the Universal Dependencies data (de Marneffe et al., 2021) as a source of multilingual data with validated, standardized, and complete morphological and syntactic annotation, which allows us to study the assimilation of specific linguistic phenomena in many languages at once. Probing these languages on the respective annotated linguistic categories reveals how well models capture the typological proximity of languages.
Therefore, the general probing methodology should include (according to Conneau and Kiela (2018)): 1) a fixed set of evaluations based on what appears to be community consensus; 2) a fixed evaluation pipeline with standard hyperparameters; 3) a straightforward Python interface.
This paper aims to extend the proven and tested SentEval-like methodology to the full breadth of multilingual linguistic diversity.
We state our contributions as follows:
• We develop a framework for exhaustive multilingual probing of language models, with a complete enumeration of all grammatical characteristics and all languages available in Universal Dependencies, while maintaining the standard SentEval format.
• We provide a setup for better and explanatory aggregation and exploration of the massive probing results, with thousands of experiments for each model.
• We illustrate the possibilities of the framework on the example of the mBERT model, demonstrating new insights and confirming the results of previous studies on narrower data.
arXiv:2210.13236v1 [cs.CL] 24 Oct 2022
Performing probing studies on such a large scale addresses the vision outlined in Nichols (2007) and contributes a new dimension to linguistic typology research, as the revealed structures are encapsulated in tools and data inseparably tied to languages as they are used today. Our framework provides users from different fields, including linguists, with a new point of view on the typological proximity of languages and categories.
2 Related Work
Different attempts have been made to interpret the behavior and hidden learned representations of language models. For example, Hoover et al. (2020) investigated the attention heads of the BERT model at the level of word-token connectivity. Wallace et al. (2019) presented an interpretation framework that improved the visual component of the model prediction process on several NLP tasks for the end user.
Flourishing after the ACL debates on semantic parsing1, the probing methodology has developed its own model interpretation tools. Thus, the SentEval framework (Conneau and Kiela, 2018) includes various types of linguistically motivated tasks: surface tasks probe for sentence length (SentLen) and for the presence of words in the sentence (WC); syntactic tasks test for sensitivity to word order (BShift), the depth of the syntactic tree (TreeDepth), and the sequence of top-level constituents in the syntax tree (TopConst); semantic tasks check for the tense (Tense), the subject (resp. direct object) number in the main clause (SubjNum, resp. ObjNum), the sensitivity to random replacement of a noun/verb (SOMO), and the random swapping of coordinated clausal conjuncts (CoordInv).
Linspector (Şahin et al., 2019) includes 15 probing tasks for 24 languages, taking morphosyntactic language properties into account, including case, verb mood, and tense, syntactic correctness, and the semantic impossibility of an example. While lacking the simplicity of the SentEval approach, the framework provides both a linguistically grounded and multilingual setup. We significantly expand both the list of languages and the set of properties examined.
1 https://aclanthology.org/volumes/W14-24/
Probe-X (Ravishankar et al., 2019b) expanded the SentEval setup with 5 additional languages, while the NeuroX framework (Dalvi et al., 2019) enriched the methodology to allow for cross-model analysis of the results, supporting neuron-level inspection.
2.1 Probing Critique
We now state a few reasons why some of the probing practices are methodologically problematic.
First, the interpretation of probing results can differ from paper to paper, leading different authors to different conclusions. While Jawahar et al. (2019) achieve 69.5-96.2% accuracy on the SentLen SentEval probing task (BERT model), they state that this information is only somehow represented at the bottom layers. Ravishankar et al. (2019b) achieve 38-51% accuracy on SentLen (RNN encoder) and state that "recurrent encoders show solid performance on certain tasks, such as sentence length." This drastic difference in result interpretation ("somehow" vs. "extremely strong") leads to misrepresentation of the factual results. Conflicting evidence within the field of BERTology can be found in Rogers et al. (2020), see Sec. 3.1 and 4.3.
Secondly, results on similar tasks can vary unpredictably if the hyperparameters are not fixed or exhaustively described: for example, Jawahar et al. (2019) find that "BERT's intermediate layers encode a rich hierarchy of linguistic information, with surface features at the bottom, syntactic features in the middle and semantic features at the top," while the work by Tikhonova et al. (2022) on mBERT shows that the model does not learn this linguistic information. More meta-research is needed to explore the contradictory results obtained by the community.
2.2 Task Representation
In the survey of post-hoc language model interpretation (Madsen et al., 2021), the linguistic-information-based tasks fall into the group with the highest abstraction and the most informative properties used. This group of projects includes tasks based on the various theoretical language levels, from part-of-speech tagging to discourse.
Languages While most tasks are English-based, non-English monolingual frameworks have appeared: French-based probing (Merlo, 2019), Russian-based SentEval (Mikhailov et al., 2021), and Chinese word-masking probing (Cui et al., 2021). Multilingual benchmarks have paved the way for multilingual probing studies by collecting the necessary data.
Linguistic features Most language-based tasks tend to be based on morphology or syntax, deriving from the SentEval methodology. Higher-level tasks can concentrate both on monolingual discourse evaluation (Koto et al., 2021) (mostly English-based so far) and on multilingual discourse probing based on the conversion of existing multilingual benchmarks (Kurfalı and Östling, 2021) (XNLI, XQUAD).
3 Framework Design
This section describes the probing framework and the experimental setup.
The main goal is to probe how well a model assimilates language constructions during training. We want the framework to be an end-to-end solution that can be applied to different models, work on diverse data, and simplify the process of extracting insights from the results.
Based on that, we face the following challenges:
1. The data we use for training and evaluation must be in a standard format, no matter what language we deal with.
2. The probing process should be universal across different models. We also need to collect detailed results for further analysis.
3. Since we aim to work with diverse data, the framework should contain instruments that simplify the process of extracting insights from the results. Without them, we would end up with piles of results that are difficult to interpret and draw findings from.
Thus, our framework can be represented as a tool with three instruments. The first pre-processes data for probing, which is commonly framed as a classification task. The second is a probing engine supporting popular probing techniques such as diagnostic classification. The last is a visualization instrument that eases the interpretation of findings.
3.1 SentEval Format Converter
We found the SentEval format to be generally good and universal for composing classification-task data. Since we have such a vast resource as Universal Dependencies for different languages, we can transform the data into the SentEval format and compose different classification tasks based on the language categories it provides.
UD annotation consists of several parts: lemmas, parts of speech, morphological features, and universal dependency relations. The converter to the SentEval format focuses on morphological features. As Table 1 illustrates, morphological categories are written in the sixth column, with category values separated by the equals sign; for example, in Number=Sing, Number is a category and Sing is a category value. Processing absolutely all archives with the SentEval converter took 8 hours on 96 CPUs.
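The Name=Value structure of the FEATS column can be parsed with a few lines of Python. This helper is an illustration of the format, not code from the framework; per the CoNLL-U specification, an underscore marks an empty feature column.

```python
# Parse a CONLL-U FEATS string such as "Tense=Past|VerbForm=Part|Voice=Pass"
# into a {category: value} dict. "_" denotes an empty feature column.
def parse_feats(feats: str) -> dict:
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

print(parse_feats("Tense=Past|VerbForm=Part|Voice=Pass"))
# {'Tense': 'Past', 'VerbForm': 'Part', 'Voice': 'Pass'}
print(parse_feats("_"))
# {}
```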
For each morphological category found in a given file, the converter generates a new file in SentEval format according to the following steps:

Data: CONLLU files, or a directory of such files, for one language
Result: a file in SentEval format
read files;
find all morphological categories;
foreach category do
    foreach sentence do
        if category is in sentence then
            get the category value
        end
    end
    stratified split into three samples;
    write to a file
end
Algorithm 1: The conversion process
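The loop of Algorithm 1 can be sketched in Python. The data structures here are hypothetical simplifications (each sentence as a list of (form, feats) pairs); the actual converter reads CONLLU files, and the split/write steps are omitted.

```python
# Sketch of Algorithm 1's core loop: discover all morphological categories in
# the data, then emit one (category value, sentence text) example per sentence
# that contains the category. Sentences are lists of (form, feats_dict) pairs.
def convert(sentences):
    datasets = {}  # category -> list of (value, sentence text)
    categories = {c for sent in sentences for _, feats in sent for c in feats}
    for category in sorted(categories):
        for sent in sentences:
            for form, feats in sent:
                if category in feats:
                    text = " ".join(f for f, _ in sent)
                    datasets.setdefault(category, []).append((feats[category], text))
                    break  # at most one example per sentence and category
    return datasets

toy = [[("That", {}), ("was", {"Tense": "Past"}), ("stopped", {"Tense": "Past"})]]
print(convert(toy)["Tense"])  # [('Past', 'That was stopped')]
```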
If the UD data is already split into train, validation, and test sets, we do not change this split. Otherwise, we split the data into three sets so that the distribution of category values in the original text is kept in each set.
If a sentence contains several words with the same morphological category, the word closest to the sentence's root node is taken, preventing one sentence from being repeated several times. Table 1 depicts an example for the Tense category: the value of the word stopped is taken, as it is the root of the sentence.
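One way to realize "closest to the root" uses the HEAD column of the CONLL-U annotation: the root has head index 0, and a token's depth is the number of head links up to the root. This is a sketch under that assumption, not the framework's actual implementation.

```python
# Pick, among candidate tokens carrying a category, the one nearest the root.
# `heads` holds the CONLL-U HEAD column (1-based token ids; 0 marks the root).
def depth(idx, heads):
    """Number of head links from token `idx` (1-based) up to the root."""
    d = 0
    while heads[idx - 1] != 0:
        idx = heads[idx - 1]
        d += 1
    return d

# Toy sentence: token 4 is the root; tokens 3 and 4 both carry Tense.
heads = [4, 4, 4, 0, 4]
candidates = [3, 4]
print(min(candidates, key=lambda i: depth(i, heads)))  # 4, the root itself
```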
Figure 1: The example of UD annotation
Format	Data entry
Conll-U
# sent_id = weblog-typepad.com_ripples_20040407125600_ENG_20040407_125
# text = That too was stopped.
1	That	that	PRON	DT	Number=Sing|PronType=Dem	4	nsubj:pass	4:nsubj:pass	_
2	too	too	ADV	RB	_	4	advmod	4:advmod	_
3	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	4	aux:pass	4:aux:pass	_
4	stopped	stop	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	0	root	0:root	SpaceAfter=No
5	.	.	PUNCT	.	_	4	punct	4:punct	_
SentEval
tr	Past	That too was stopped.
Table 1: Example of CONLL-U format and its conversion to SentEval: Tense classification, train set.
3.2 Multilingual Data
We take 289 repositories, including the data of 172 languages, available on the GitHub of Universal Dependencies2, updated in May 2022.
While parsing the files, we faced several problems inherited from UD. 71 of the repositories do not contain any CONLLU files. Three Japanese repositories, as well as the Korean and Frisian Dutch repositories, use annotations that differ from the standard UD annotation. The data from 16 repositories (Akkadian, Cantonese, Chinese (2), German, Japanese, Hindi, Irish, Kangri, Maltese, Neapolitan, South Levantine Arabic, Swedish Sign Language, Swiss German, Old Turkish, Tagalog) do not contain morphological annotation. Also, some repositories include correctly annotated data but are not suitable for classification because all the examples contain only one value for every category; for example, only examples with the class Plural are left for the category Number (Cantonese, Chukchi, Frisian Dutch, Hindi English, Japanese, Kangri, Khunsari, Makurap, Maltese, Nayini, Neapolitan, Old Turkish, Soi, South Levantine Arabic, Swedish Sign Language, Swiss German, Telugu, Vietnamese).
After filtering, we have data from 104 languages from 194 repositories (see Appendix A.1). From the typological point of view, these languages belong to 20 language families, and the Basque language is an isolate. Although almost half of the languages are from the Indo-European family, the data include several under-studied language families.
2 https://github.com/UniversalDependencies
Many of the languages in our data are endangered or even extinct. The UD data is distributed under Creative Commons and GNU-based licenses, varying from language to language3. Extracting the tasks for every grammatical category results in 1927 probing datasets.
3.3 Probing Engine
3.3.1 Encoders
In the experiments, we consider the layers of encoder-based models and their ability to acquire language data and perform well on probing tasks. Using the output of the model's layers, we can get contextualized token embeddings for the elements of the input text. We consider several options for aggregating these embeddings: CLS, where the text is represented by the embedding of the "[CLS]" token, and SUM and AVG, where the sentence vector is the sum or the average of the embeddings of all text tokens.
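The three aggregation options can be sketched over a toy matrix of per-token embeddings. The values are random stand-ins for a layer's hidden states; how special tokens are handled in SUM/AVG is left to the framework's own definition.

```python
# The three sentence-embedding aggregation options over per-token vectors.
# Rows are token embeddings from one encoder layer; row 0 plays [CLS].
import numpy as np

tokens = np.array([[1.0, 2.0],   # [CLS] embedding
                   [3.0, 4.0],
                   [5.0, 6.0]])

cls_vec = tokens[0]              # CLS: take the first token's embedding
sum_vec = tokens.sum(axis=0)     # SUM: element-wise sum  -> values [9, 12]
avg_vec = tokens.mean(axis=0)    # AVG: element-wise mean -> values [3, 4]

print(cls_vec, sum_vec, avg_vec)
```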
3.3.2 Classifiers and Metrics
After the embeddings are obtained, we train a simple classification model on the encoder layers' representations and the task data labels. We consider linear (Logistic Regression) and non-linear (MLP) classifiers. As metrics for performance evaluation, we use the accuracy score, and the weighted F1 score in the case of unbalanced classes.
3 https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.1
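As a sketch of the evaluation step (toy labels, not the paper's data), both metrics can be computed with scikit-learn; `average="weighted"` weights each class's F1 by its support, which is what makes the score robust to unbalanced category values.

```python
# Accuracy and weighted F1 on a deliberately unbalanced toy label set.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 1, 1]   # 4 examples of class 0, 2 of class 1
y_pred = [0, 0, 0, 1, 1, 1]   # one class-0 example misclassified

print(accuracy_score(y_true, y_pred))                  # 5 of 6 labels match
print(f1_score(y_true, y_pred, average="weighted"))    # per-class F1, support-weighted
```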