Post-hoc analysis of Arabic transformer models
Ahmed Abdelali, Nadir Durrani, Fahim Dalvi, Hassan Sajjad
Qatar Computing Research Institute, Hamad Bin Khalifa University, Qatar
Faculty of Computer Science, Dalhousie University, Canada
{aabdelali, ndurrani, faimaduddin}@hbku.edu.qa, hsajjad@dal.ca
(The work was done while the author was at QCRI.)
Abstract
Arabic is a Semitic language which is
widely spoken with many dialects. Given
the success of pre-trained language models,
many transformer models trained on Arabic
and its dialects have surfaced. While
there has been extrinsic evaluation of
these models with respect to downstream
NLP tasks, no work has been carried
out to analyze and compare their internal
representations. We probe how linguistic
information is encoded in transformer models trained on different Arabic dialects.
We perform a layer and neuron analysis
on the models using morphological tagging
tasks for different dialects of Arabic and a
dialectal identification task. Our analysis yields interesting findings, such as: i) word morphology is learned at the lower and middle layers, ii) syntactic dependencies are predominantly captured at the higher layers, iii) despite a large overlap in their vocabulary, MSA-based models fail to capture the nuances of Arabic dialects, and iv) neurons in the embedding layer are polysemous in nature, while neurons in the middle layers are exclusive to specific properties.
1 Introduction
Arabic is a linguistically rich language, with its
structures realized using both concatenative and
templatic morphology. The agglutinating aspect
of the language adds to the complexity where
a given word could be formed using multiple
morphemes. For example, the word فأسقيناكموه (fOsqynAkmwh in Safe Buckwalter Arabic (SBA) encoding, "and we gave it to you to drink") combines a conjunction, a verb, and three pronouns.
Along another dimension, Arabic has three variants: Classical Arabic (CA), Modern Standard Arabic (MSA) and Dialectal Arabic (DA). While MSA is traditionally considered the de facto standard in the written medium and DA its predominantly spoken counterpart, this has changed recently (Mubarak and Darwish, 2014; Zaidan and Callison-Burch, 2014; Durrani et al., 2014). Due to the recent influx of social media platforms, dialectal Arabic also enjoys a significant presence in the written medium.

Figure 1: Data regimes of various pre-trained Transformer models of Arabic
Transfer learning using contextualized representations from pre-trained language models has revolutionized downstream NLP tasks. A plethora of transformer-based language models, trained on dozens of languages, are now uploaded every day. Arabic is no
different. Several researchers have released and
benchmarked pre-trained Arabic transformer
models such as AraBERT (Antoun et al., 2020), ArabicBERT (Safaya et al., 2020), CAMeLBERT (Inoue et al., 2021), MARBERT (Abdul-Mageed et al., 2020) and QARIB (Abdelali et al., 2021).
These models have demonstrated state-of-the-art
performance on many tasks as well as their
ability to learn salient features for Arabic. One
of the main differences among these models is
the genre and amount of Arabic data they are
trained on. For example, AraBERT was trained
only on the MSA (Modern Standard Arabic),
ArabicBERT additionally used DA during training, and CAMeLBERT-Mix used a combination of all types of Arabic text for training. Multilingual models such as mBERT and XLM are mostly trained on Wikipedia and CommonCrawl data, which is predominantly MSA (Suwaileh et al., 2016). Figure 1 summarizes the training data regimes of these models.
This large variety of Arabic pre-trained models motivates us to ask how their representations encode various linguistic concepts. To this end, we present the first work on interpreting deep Arabic models. We experiment with nine transformer models: five Arabic BERT models, Arabic ALBERT, Arabic ELECTRA, and two multilingual models (mBERT and XLM). We analyze their representations using MSA and dialectal parts-of-speech tagging and dialect identification tasks. This allows us to compare the representations of Arabic transformer models using tasks involving different varieties of Arabic dialects.
We analyze representations of the network at the layer level and at the neuron level using the diagnostic classifier framework (Belinkov et al., 2017; Hupkes et al., 2018). The overall idea is to extract feature vectors from the learned representations and train probing classifiers towards the auxiliary tasks under study (predicting morphology or identifying the dialect). We additionally use the Linguistic Correlation Analysis method (Dalvi et al., 2019a; Durrani et al., 2020) to identify salient neurons with respect to a downstream task. Our results show that:
Network and Layer Analysis
• Lower and middle layers capture word morphology.
• The long-range contextual knowledge required to solve dialect identification is preserved in the higher layers.

Neuron Analysis
• Salient neurons with respect to a property are well distributed across the network.
• The first (embedding) and last layers of the models contribute a substantial number of salient neurons for any downstream task.
• Neurons of the embedding layer are polysemous in nature, while neurons of the middle layers specialize in specific properties.

MSA vs. Dialect
• Although dialects of Arabic are closely related to MSA, pre-trained models trained on MSA only do not implicitly learn the nuances of dialectal Arabic.
2 Methodology
Our methodology is based on the class of interpretation methods known as Probing Classifiers. The central idea is to extract activation vectors from a pre-trained language model as static features. A classifier is then trained on these activation vectors to predict a property of interest, i.e., a linguistic task that we would like to probe the representation against. The underlying assumption is that if the classifier can predict the property, the representations implicitly encode this information. We train layer (Belinkov et al., 2020) and neuron probes (Durrani et al., 2022) using logistic-regression classifiers.
Formally, consider a pre-trained neural language model $\mathbf{M}$ with $L$ layers: $\{l_1, l_2, \ldots, l_L\}$. Given a dataset $D = \{w_1, w_2, \ldots, w_N\}$ with a corresponding set of linguistic annotations $T = \{t_{w_1}, t_{w_2}, \ldots, t_{w_N}\}$, we map each word $w_i$ in the data $D$ to a sequence of latent representations: $D \xrightarrow{\mathbf{M}} \mathbf{z} = \{z_1, \ldots, z_N\}$. The layer-wise probing classifier is trained by minimizing the following loss function:

$$\mathcal{L}(\theta) = -\sum_i \log P_\theta(t_{w_i} \mid w_i)$$

where $P_\theta(t_{w_i} \mid w_i) = \frac{\exp(\theta_l \cdot z_i)}{\sum_{l'} \exp(\theta_{l'} \cdot z_i)}$ is the probability that word $i$ is assigned property $t_{w_i}$.
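To make the probing setup concrete, below is a minimal sketch of layer-wise probing in Python using the HuggingFace transformers library and scikit-learn. The checkpoint name, the first-sub-token word pooling, and the helper functions are our own illustrative assumptions and do not correspond to the authors' released code.

```python
# Minimal layer-wise probing sketch (illustrative; not the authors' code).
# Assumes a BERT-style Arabic checkpoint and word-level labels aligned to
# the first sub-token of every word.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # assumption: any Arabic BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def word_activations(words):
    """Return one [num_words, hidden_dim] matrix per layer for a pre-split sentence."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**enc).hidden_states  # (L+1) tensors of [1, seq, dim]
    # represent each word by its first sub-token
    first_subtoken = [enc.word_ids().index(i) for i in range(len(words))]
    return [layer[0, first_subtoken].numpy() for layer in hidden_states]

def probe_each_layer(train_feats, train_tags, test_feats, test_tags):
    """train_feats/test_feats: per-layer feature matrices stacked over the whole corpus."""
    accuracies = []
    for layer in range(len(train_feats)):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_feats[layer], train_tags)
        accuracies.append(accuracy_score(test_tags, clf.predict(test_feats[layer])))
    return accuracies
```

In this sketch, the per-layer accuracies directly indicate which layers encode the probed property most strongly.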
For neuron analysis, we use Linguistic Correlation Analysis (LCA) as described in Dalvi et al. (2019a). LCA is also based on the probing classifier paradigm, but it uses elastic-net regularization (Zou and Hastie, 2005), which enables the selection of both focused and distributed neurons. The loss function is as follows:
$$\mathcal{L}(\theta) = -\sum_i \log P_\theta(t_{w_i} \mid w_i) + \lambda_1 \lVert\theta\rVert_1 + \lambda_2 \lVert\theta\rVert_2^2$$
The regularization parameters $\lambda_1$ and $\lambda_2$ are tuned using a grid-search algorithm. The classifier assigns a weight to each feature (neuron), which serves as its importance with respect to a class such as Noun. We rank the neurons based on their absolute weights for every class, and select salient neurons for a task such as POS by iteratively selecting the top neurons of every class.
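As an illustration, the elastic-net probe and the per-class neuron ranking could be implemented along the following lines. Note that scikit-learn parameterizes the elastic-net penalty through C and l1_ratio rather than separate $\lambda_1$ and $\lambda_2$, and the values shown are placeholders, not the grid-searched settings of the paper.

```python
# Illustrative elastic-net probe and neuron ranking (not the LCA reference code).
# Assumes a multi-class property set (e.g. POS tags).
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_neurons_per_class(features, labels, l1_ratio=0.5, C=1.0):
    """features: [num_words, num_neurons]; labels: [num_words] property tags.
    Returns the trained probe and, per class, neuron indices sorted by the
    absolute weight the probe assigns to them."""
    probe = LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=l1_ratio, C=C, max_iter=2000)
    probe.fit(features, labels)
    # probe.coef_ has shape [num_classes, num_neurons]
    ranking = {label: np.argsort(-np.abs(probe.coef_[k]))
               for k, label in enumerate(probe.classes_)}
    return probe, ranking
```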
A minimal set of neurons is identified by iteratively selecting top neurons until the selected set achieves classification performance comparable (within a certain threshold) to the Oracle, i.e., the accuracy of the classifier trained using all the features in the network.
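One possible realization of this selection loop is sketched below; the accuracy threshold and the strategy of adding one neuron per class per iteration are assumptions for illustration rather than the paper's exact procedure.

```python
# Illustrative minimal-neuron-set search: grow the selected set with the next
# top-ranked neuron of every class until a probe trained on the selected
# neurons alone reaches the Oracle accuracy minus a threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def minimal_neuron_set(X_train, y_train, X_test, y_test, ranking, threshold=0.01):
    oracle = LogisticRegression(max_iter=2000).fit(X_train, y_train)
    oracle_acc = accuracy_score(y_test, oracle.predict(X_test))

    selected, rank_pos = [], 0
    while rank_pos < X_train.shape[1]:
        for class_ranking in ranking.values():     # next best neuron of each class
            neuron = int(class_ranking[rank_pos])
            if neuron not in selected:
                selected.append(neuron)
        rank_pos += 1

        probe = LogisticRegression(max_iter=2000).fit(X_train[:, selected], y_train)
        accuracy = accuracy_score(y_test, probe.predict(X_test[:, selected]))
        if accuracy >= oracle_acc - threshold:
            break
    return selected, oracle_acc
```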
Model | Size | Tokens | Vocab | Type
AraBERT | 23GB | 2.7B | 64K | MSA
ArabicBERT | 95GB | 8.2B | 32K | MSA
CAMeLBERT | 167GB | 17.3B | 30K | MSA/CA/DA
MARBERT | 128GB | 15.6B | 100K | MSA/DA
mBERT | - | 1.5B | 110K | MSA
QARiB | 127GB | 14.0B | 64K | MSA/DA
AraELECTRA | 77GB | 8.6B | 64K | MSA
ALBERT | - | 4.4B | 30K | MSA
XLM | 2.5TB | - | 250K | MSA

Table 1: Pre-trained models, training data and statistics.
3 Experimental Setup
In this section, we describe our experimental setup, including the Arabic transformer models, the probing tasks that we use to carry out the analysis, and the classifier settings.
3.1 Pre-trained Models
We select a number of Arabic transformer models, trained on different varieties of Arabic and based on different architectures. Table 1 provides a summary of these models. In the following, we describe each model and the dataset used for its training.
AraBERT
was trained using a combination of 70 million sentences from Arabic Wikipedia dumps, the 1.5B words Arabic Corpus (El-khair, 2016) and the Open Source International Arabic News Corpus (OSIAN) (Zeroual et al., 2019). The final corpus contained mostly MSA news from different Arab regions.
ArabicBERT
Safaya et al. (2020) pretrained a BERT model using a concatenation of the Arabic version of OSCAR (Ortiz Suárez et al., 2019), a filtered subset of Common Crawl and a dump of Arabic Wikipedia, totalling 8.2B words.
CAMeLBERT
Inoue et al. (2021) combined a mixed collection of MSA, Dialectal and Classical Arabic texts totalling 17.3B tokens, and used this data to pre-train the CAMeLBERT-Mix model.
MARBERT
Abdul-Mageed et al. (2020) combined a dataset of 1B tweets, covering mostly Arabic dialects, with Arabic Gigaword 5th Edition,² OSCAR (Ortiz Suárez et al., 2019), OSIAN (Zeroual et al., 2019) and a Wikipedia dump, totalling 15.6B tokens.
QARIB
Abdelali et al. (2021) combined Arabic Gigaword Fourth Edition,³ the 1.5B words Arabic Corpus (El-khair, 2016), the Arabic part of OpenSubtitles (Lison and Tiedemann, 2016) and 440M tweets collected between 2012 and 2020. The data was processed using Farasa (Abdelali et al., 2016).
ALBERT
used a subset of OSCAR (Ortiz Suárez et al., 2019) and a dump of Wikipedia, selecting around 4.4 billion words (Safaya, 2020). The model differs from BERT in its use of factorized embeddings and cross-layer parameter sharing, which results in a smaller memory footprint (Lan et al., 2020).
AraELECTRA
ELECTRA (Clark et al., 2020) is trained to distinguish "real" vs. "fake" input tokens generated by another neural network. The Arabic ELECTRA was trained on 77GB of data combining the OSCAR dataset, an Arabic Wikipedia dump, the 1.5B words Arabic Corpus, the OSIAN corpus and Assafir news articles (Antoun et al., 2021). Unlike the other models, AraELECTRA uses a hidden layer size of 256, while all the other models have 768 neurons per layer.
Multilingual BERT
Google Research released the multilingual BERT base model, pretrained on the concatenation of monolingual Wikipedia corpora from 104 languages with a shared WordPiece vocabulary of 110K.
XLM
Conneau et al. (2020) is a multilingual version of RoBERTa, trained on 2.5TB of CommonCrawl data covering 100 different languages.
3.2 Probing Tasks
We consider morphological tagging on a variety of Arabic dialects, as well as a dialect identification task, to analyze and compare the models. Below we describe the task details.
POS Tagging on Arabic Treebank (ATB)
The Arabic Treebank Part 1 v2.0 and Part 3 v1.0, with a total of 515k tokens labeled at the segment level with POS tags. The data is a combination of
² LDC Catalogue LDC2011T11
³ LDC Catalogue LDC2009T30