Post-hoc analysis of Arabic transformer models
Ahmed Abdelali, Nadir Durrani, Fahim Dalvi, Hassan Sajjad
Qatar Computing Research Institute, Hamad Bin Khalifa University, Qatar
Faculty of Computer Science, Dalhousie University, Canada
{aabdelali, ndurrani, faimaduddin}@hbku.edu.qa, hsajjad@dal.ca
(The work was done while the author was at QCRI.)
Abstract
Arabic is a Semitic language which is
widely spoken with many dialects. Given
the success of pre-trained language models,
many transformer models trained on Arabic
and its dialects have surfaced. While
there has been extrinsic evaluation of
these models with respect to downstream
NLP tasks, no work has been carried
out to analyze and compare their internal
representations. We probe how linguistic
information is encoded in transformer models trained on different Arabic dialects.
We perform a layer and neuron analysis
on the models using morphological tagging
tasks for different dialects of Arabic and a
dialectal identification task. Our analysis yields interesting findings, such as: i) word morphology is learned at the lower and middle layers, ii) syntactic dependencies are predominantly captured at the higher layers, iii) despite a large overlap in their vocabulary, MSA-based models fail to capture the nuances of Arabic dialects, and iv) neurons in the embedding layer are polysemous in nature, while neurons in the middle layers are exclusive to specific properties.
1 Introduction
Arabic is a linguistically rich language, with its
structures realized using both concatenative and
templatic morphology. The agglutinating aspect
of the language adds to the complexity where
a given word could be formed using multiple
morphemes. For example, the word فأسقيناكموه (fOsqynAkmwh in Safe Buckwalter Arabic (SBA) encoding, "and we gave it to you to drink") combines a conjunction, a verb, and three pronouns.
Along another dimension, Arabic has three variants: Classical Arabic (CA), Modern Standard Arabic (MSA) and Dialectal Arabic (DA). While MSA is traditionally considered the de facto standard in the written medium and DA its predominantly spoken counterpart, this has changed recently (Mubarak and Darwish, 2014; Zaidan and Callison-Burch, 2014; Durrani et al., 2014). Due to the recent influx of social media platforms, dialectal Arabic also enjoys a significant presence in the written medium.

Figure 1: Data regimes of various pre-trained Transformer models of Arabic
Transfer learning using contextualized representations from pre-trained language models has revolutionized downstream NLP tasks. A plethora of transformer-based language models, trained on dozens of languages, are now uploaded every day. Arabic is no
different. Several researchers have released and
benchmarked pre-trained Arabic transformer
models such as AraBERT (Antoun et al., 2020), ArabicBERT (Safaya et al., 2020), CAMeLBERT (Inoue et al., 2021), MARBERT (Abdul-Mageed et al., 2020) and QARIB (Abdelali et al., 2021).
These models have demonstrated state-of-the-art
performance on many tasks as well as their
ability to learn salient features for Arabic. One
of the main differences among these models is
the genre and amount of Arabic data they are
trained on. For example, AraBERT was trained
only on the MSA (Modern Standard Arabic),
ArabicBERT additionally used DA during training, and CAMeLBERT-Mix used a combination of all types of Arabic text for training. Multilingual models such as mBERT and XLM are mostly trained on Wikipedia and CommonCrawl data, which is predominantly MSA (Suwaileh et al., 2016). Figure 1 summarizes the training data regimes of these models.
This large variety of Arabic pre-trained models motivates us to ask how their representations encode various linguistic concepts. To this end, we present the first work on interpreting deep Arabic models. We experiment with nine transformer models: five Arabic BERT models, Arabic ALBERT, Arabic ELECTRA, and two multilingual models (mBERT and XLM). We analyze their representations using MSA and dialectal parts-of-speech tagging and dialect identification tasks. This allows us to compare the representations of Arabic transformer models using tasks involving different varieties of Arabic dialects.
We analyze representations of the network at the layer level and at the neuron level using the diagnostic classifier framework (Belinkov et al., 2017; Hupkes et al., 2018). The overall idea is to extract feature vectors from the learned representations and train probing classifiers towards the auxiliary tasks under study (predicting morphology or identifying the dialect). We additionally use the Linguistic Correlation Analysis method (Dalvi et al., 2019a; Durrani et al., 2020) to identify salient neurons with respect to a downstream task. Our results show that:
Network and Layer Analysis
• Lower and middle layers capture word morphology.
• The long-range contextual knowledge required to solve dialect identification is preserved in the higher layers.

Neuron Analysis
• Salient neurons with respect to a property are well distributed across the network.
• The first (embedding) and last layers of the models contribute a substantial number of salient neurons for any downstream task.
• Neurons of the embedding layer are polysemous in nature, while neurons of the middle layers specialize in specific properties.

MSA vs. Dialect
• Although dialects of Arabic are closely related to MSA, pre-trained models trained on MSA only do not implicitly learn the nuances of dialectal Arabic.
2 Methodology
Our methodology is based on the class of interpretation methods known as Probing Classifiers. The central idea is to extract activation vectors from a pre-trained language model as static features. A classifier is then trained on these activation vectors to predict a property of interest, i.e., a linguistic task that we would like to probe the representation against. The underlying assumption is that if the classifier can predict the property, the representations implicitly encode this information. We train layer (Belinkov et al., 2020) and neuron probes (Durrani et al., 2022) using logistic-regression classifiers.
Formally, consider a pre-trained neural language model $\mathbf{M}$ with $L$ layers: $\{l_1, l_2, \ldots, l_L\}$. Given a dataset $D = \{w_1, w_2, \ldots, w_N\}$ with a corresponding set of linguistic annotations $T = \{t_{w_1}, t_{w_2}, \ldots, t_{w_N}\}$, we map each word $w_i$ in the data $D$ to a sequence of latent representations: $D \xrightarrow{\mathbf{M}} \mathbf{z} = \{z_1, \ldots, z_N\}$. The layer-wise probing classifier is trained by minimizing the following loss function:

$$\mathcal{L}(\theta) = -\sum_i \log P_\theta(t_{w_i} \mid w_i)$$

where $P_\theta(t_{w_i} \mid w_i) = \frac{\exp(\theta_l \cdot z_i)}{\sum_{l'} \exp(\theta_{l'} \cdot z_i)}$ is the probability that word $i$ is assigned property $t_{w_i}$.
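To make the probing setup concrete, below is a minimal sketch of layer-wise probing in Python using the HuggingFace transformers library and scikit-learn. The checkpoint name, the first-sub-token word pooling, and the helper functions are our own illustrative assumptions and do not correspond to the authors' released code.

```python
# Minimal layer-wise probing sketch (illustrative; not the authors' code).
# Assumes a BERT-style Arabic checkpoint and word-level labels aligned to
# the first sub-token of every word.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # assumption: any Arabic BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def word_activations(words):
    """Return one [num_words, hidden_dim] matrix per layer for a pre-split sentence."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**enc).hidden_states  # (L+1) tensors of [1, seq, dim]
    # represent each word by its first sub-token
    first_subtoken = [enc.word_ids().index(i) for i in range(len(words))]
    return [layer[0, first_subtoken].numpy() for layer in hidden_states]

def probe_each_layer(train_feats, train_tags, test_feats, test_tags):
    """train_feats/test_feats: per-layer feature matrices stacked over the whole corpus."""
    accuracies = []
    for layer in range(len(train_feats)):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_feats[layer], train_tags)
        accuracies.append(accuracy_score(test_tags, clf.predict(test_feats[layer])))
    return accuracies
```

In this sketch, the per-layer accuracies directly indicate which layers encode the probed property most strongly.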
For neuron analysis, we use Linguistic Correlation Analysis (LCA) as described in Dalvi et al. (2019a). LCA is also based on the probing classifier paradigm, but it uses elastic-net regularization (Zou and Hastie, 2005), which enables the selection of both focused and distributed neurons. The loss function is as follows:
$$\mathcal{L}(\theta) = -\sum_i \log P_\theta(t_{w_i} \mid w_i) + \lambda_1 \lVert\theta\rVert_1 + \lambda_2 \lVert\theta\rVert_2^2$$
The regularization parameters $\lambda_1$ and $\lambda_2$ are tuned using a grid-search algorithm. The classifier assigns a weight to each feature (neuron), which serves as its importance with respect to a class such as Noun. We rank the neurons based on their absolute weights for every class, and select salient neurons for a task such as POS by iteratively selecting the top neurons of every class.
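As an illustration, the elastic-net probe and the per-class neuron ranking could be implemented along the following lines. Note that scikit-learn parameterizes the elastic-net penalty through C and l1_ratio rather than separate $\lambda_1$ and $\lambda_2$, and the values shown are placeholders, not the grid-searched settings of the paper.

```python
# Illustrative elastic-net probe and neuron ranking (not the LCA reference code).
# Assumes a multi-class property set (e.g. POS tags).
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_neurons_per_class(features, labels, l1_ratio=0.5, C=1.0):
    """features: [num_words, num_neurons]; labels: [num_words] property tags.
    Returns the trained probe and, per class, neuron indices sorted by the
    absolute weight the probe assigns to them."""
    probe = LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=l1_ratio, C=C, max_iter=2000)
    probe.fit(features, labels)
    # probe.coef_ has shape [num_classes, num_neurons]
    ranking = {label: np.argsort(-np.abs(probe.coef_[k]))
               for k, label in enumerate(probe.classes_)}
    return probe, ranking
```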
A minimal set of neurons is identified by iteratively selecting top neurons until the selected set achieves classification performance comparable (within a certain threshold) to the Oracle, i.e., the accuracy of the classifier trained using all the features in the network.
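One possible realization of this selection loop is sketched below; the accuracy threshold and the strategy of adding one neuron per class per iteration are assumptions for illustration rather than the paper's exact procedure.

```python
# Illustrative minimal-neuron-set search: grow the selected set with the next
# top-ranked neuron of every class until a probe trained on the selected
# neurons alone reaches the Oracle accuracy minus a threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def minimal_neuron_set(X_train, y_train, X_test, y_test, ranking, threshold=0.01):
    oracle = LogisticRegression(max_iter=2000).fit(X_train, y_train)
    oracle_acc = accuracy_score(y_test, oracle.predict(X_test))

    selected, rank_pos = [], 0
    while rank_pos < X_train.shape[1]:
        for class_ranking in ranking.values():     # next best neuron of each class
            neuron = int(class_ranking[rank_pos])
            if neuron not in selected:
                selected.append(neuron)
        rank_pos += 1

        probe = LogisticRegression(max_iter=2000).fit(X_train[:, selected], y_train)
        accuracy = accuracy_score(y_test, probe.predict(X_test[:, selected]))
        if accuracy >= oracle_acc - threshold:
            break
    return selected, oracle_acc
```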
Model | Size | Tokens | Vocab | Type
AraBERT | 23GB | 2.7B | 64K | MSA
ArabicBERT | 95GB | 8.2B | 32K | MSA
CAMeLBERT | 167GB | 17.3B | 30K | MSA/CA/DA
MARBERT | 128GB | 15.6B | 100K | MSA/DA
mBERT | - | 1.5B | 110K | MSA
QARiB | 127GB | 14.0B | 64K | MSA/DA
AraELECTRA | 77GB | 8.6B | 64K | MSA
ALBERT | - | 4.4B | 30K | MSA
XLM | 2.5TB | - | 250K | MSA

Table 1: Pre-trained models, training data and statistics.
3 Experimental Setup
In this section, we describe our experimental setup, including the Arabic transformer models, the probing tasks that we use to carry out the analysis, and the classifier settings.
3.1 Pre-trained Models
We select a number of Arabic transformer models, trained on different varieties of Arabic and based on different architectures. Table 1 provides a summary of these models. In the following, we describe each model and the dataset used for its training.
AraBERT
was trained using a combination of 70 million sentences from Arabic Wikipedia dumps, the 1.5B words Arabic Corpus (El-khair, 2016) and the Open Source International Arabic News Corpus (OSIAN) (Zeroual et al., 2019). The final corpus contained mostly MSA news from different Arab regions.
ArabicBERT
Safaya et al. (2020) pretrained a BERT model using a concatenation of the Arabic version of OSCAR (Ortiz Suárez et al., 2019), a filtered subset of Common Crawl and a dump of Arabic Wikipedia, totalling 8.2B words.
CAMeLBERT
Inoue et al. (2021) combined a mixed collection of MSA, Dialectal and Classical Arabic texts totalling 17.3B tokens, and used this data to pre-train the CAMeLBERT-Mix model.
MARBERT
Abdul-Mageed et al. (2020) combined a dataset of 1B tweets, covering mostly Arabic dialects, with Arabic Gigaword 5th Edition,² OSCAR (Ortiz Suárez et al., 2019), OSIAN (Zeroual et al., 2019) and a Wikipedia dump, totalling 15.6B tokens.
QARIB
Abdelali et al. (2021) combined Arabic Gigaword Fourth Edition,³ the 1.5B words Arabic Corpus (El-khair, 2016), the Arabic part of OpenSubtitles (Lison and Tiedemann, 2016) and 440M tweets collected between 2012 and 2020. The data was processed using Farasa (Abdelali et al., 2016).
ALBERT
used a subset of OSCAR (Ortiz Suárez et al., 2019) and a dump of Wikipedia, selecting around 4.4 billion words (Safaya, 2020). The model differs from BERT in its use of factorized embeddings and cross-layer parameter sharing, which results in a smaller memory footprint (Lan et al., 2020).
AraELECTRA
ELECTRA (Clark et al., 2020) is trained to distinguish "real" vs. "fake" input tokens generated by another neural network. The Arabic ELECTRA was trained on 77GB of data combining the OSCAR dataset, an Arabic Wikipedia dump, the 1.5B words Arabic Corpus, the OSIAN corpus and Assafir news articles (Antoun et al., 2021). Unlike the other models, AraELECTRA uses a hidden layer size of 256, while all the other models have 768 neurons per layer.
Multilingual BERT
Google Research released the multilingual BERT base model, pretrained on the concatenation of monolingual Wikipedia corpora from 104 languages with a shared WordPiece vocabulary of 110K.
XLM
Conneau et al. (2020) is a multilingual version of RoBERTa, trained on 2.5TB of CommonCrawl data covering 100 different languages.
3.2 Probing Tasks
We consider morphological tagging on a variety of Arabic dialects, as well as a dialect identification task, to analyze and compare the models. Below we describe the task details.
POS Tagging on Arabic Treebank (ATB)
The Arabic Treebank Part 1 v2.0 and Part 3 v1.0, with a total of 515k tokens labeled at the segment level with POS tags. The data is a combination of
² LDC Catalogue LDC2011T11
³ LDC Catalogue LDC2009T30