
probing results with thousands of experiments
for each model.
• We illustrate the possibilities of the framework with the example of the mBERT model, demonstrating new insights and corroborating the results of previous studies conducted on narrower data.
Performing probing studies at such a large scale addresses the vision outlined in Nichols (2007) and contributes a new dimension to linguistic typology research, as the revealed structures are encapsulated in tools and data inseparably tied to the nature of present-day language. Our framework provides users from different fields, including linguists, with a new perspective on the typological proximity of languages and categories.
2 Related Work
Various attempts have been made to interpret the behavior and hidden learned representations of language models. For example, Hoover et al. (2020) investigated the attention heads of the BERT model at the level of word-token connectivity. Wallace et al. (2019) presented an interpretation framework that improves the visual component of the model prediction process on several NLP tasks for the end user.
Flourishing after the ACL debates on semantic parsing (https://aclanthology.org/volumes/W14-24/), the probing methodology has developed its own model interpretation tools. Thus, the SentEval framework (Conneau and Kiela, 2018) includes various types of linguistically-motivated tasks: surface tasks probe for sentence length (SentLen) and for the presence of words in the sentence (WC); syntactic tasks test for sensitivity to word order (BShift), the depth of the syntactic tree (TreeDepth), and the sequence of top-level constituents in the syntax tree (TopConst); semantic tasks check for the tense (Tense), the subject (resp. direct object) number in the main clause (SubjNum, resp. ObjNum), the sensitivity to random replacement of a noun/verb (SOMO), and the random swapping of coordinated clausal conjuncts (CoordInv).
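All of these tasks share the same underlying recipe: sentence representations are extracted from a frozen encoder, and a lightweight classifier is trained to predict the labeled property, with its accuracy read as evidence of how strongly the property is encoded. A minimal sketch in Python, assuming scikit-learn as the probe implementation and using random arrays in place of real encoder outputs (the function name and data are illustrative, not part of SentEval):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def run_probe(train_emb, train_labels, test_emb, test_labels):
    """Fit a linear probe on frozen sentence embeddings.

    The embeddings are (n_sentences, hidden_dim) arrays produced by a
    frozen encoder; only the lightweight classifier is trained, so the
    score reflects what the representations already encode.
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_emb, train_labels)
    return accuracy_score(test_labels, probe.predict(test_emb))

# Illustrative run with random arrays standing in for real encoder
# outputs and SentLen-style bucketed labels (6 length bins).
rng = np.random.default_rng(0)
acc = run_probe(
    rng.normal(size=(1000, 768)), rng.integers(0, 6, size=1000),
    rng.normal(size=(200, 768)), rng.integers(0, 6, size=200),
)
print(f"probe accuracy: {acc:.3f}")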
Linspector (Şahin et al., 2019) includes 15 probing tasks for 24 languages that take morphosyntactic language properties into account, including case, verb mood, and tense, syntactic correctness, and the semantic impossibility of an example. While lacking the simplicity of the SentEval approach, the framework provides both a linguistically-grounded and multilingual setup. We significantly expand both the list of languages and the set of properties being examined.
Probe-X (Ravishankar et al., 2019b) expanded the SentEval setup with 5 additional languages, while the NeuroX framework (Dalvi et al., 2019) enriched the methodology to allow for cross-model analysis of the results, supporting neuron-level inspection.
2.1 Probing Critique
We outline several reasons why some probing practices are methodologically problematic.
First, the interpretation of probing results can differ from paper to paper, with different authors drawing divergent conclusions from comparable numbers. While Jawahar et al. (2019) achieve 69.5-96.2% accuracy on the SentLen SentEval probing task (BERT model), they state only that this information is "somehow" represented at the bottom layers. Ravishankar et al. (2019b) achieve 38-51% accuracy on SentLen (RNN encoder), yet state that "recurrent encoders show solid performance on certain tasks, such as sentence length." This drastic difference in result interpretation ("somehow" vs. "solid") misrepresents the factual results. Conflicting evidence within the field of BERTology can be found in Rogers et al. (2020), see Sections 3.1 and 4.3.
Second, results on similar tasks can vary unpredictably if the hyperparameters are not fixed or exhaustively described: for example, Jawahar et al. (2019) find that "BERT's intermediate layers encode a rich hierarchy of linguistic information, with surface features at the bottom, syntactic features in the middle and semantic features at the top," while the work by Tikhonova et al. (2022) on mBERT shows that the model does not learn this linguistic information. More meta-research is needed to explore the contradictory results obtained by the community.
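One low-cost mitigation is to pin and report every degree of freedom of the probe itself, so that accuracy differences can be attributed to the representations rather than to the setup. A minimal sketch of such bookkeeping (the configuration fields and file name are illustrative, not taken from any of the cited works):

import json
import random

import numpy as np

# Hypothetical probe configuration: every setting that can change the
# outcome is pinned and serialized alongside the reported scores.
PROBE_CONFIG = {
    "seed": 42,
    "classifier": "logistic_regression",
    "max_iter": 1000,
    "C": 1.0,            # inverse regularization strength
    "layer": 8,          # encoder layer the representations come from
    "pooling": "mean",   # how token states become a sentence vector
}

def set_seed(seed: int) -> None:
    """Pin the sources of randomness used by the probing pipeline."""
    random.seed(seed)
    np.random.seed(seed)

set_seed(PROBE_CONFIG["seed"])

# Store the exact configuration next to the scores it produced, so a
# reported accuracy can always be traced back to a concrete setup.
with open("probe_run_config.json", "w") as f:
    json.dump(PROBE_CONFIG, f, indent=2)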
2.2 Task Representation
In the survey of post-hoc language model interpretation by Madsen et al. (2021), the linguistic-information-based tasks fall into the group with the highest abstraction and the most informative properties used. This group of projects includes tasks based on the various theoretical language levels: from part-of-speech tagging to discourse.