RuCoLA: Russian Corpus of Linguistic Acceptability

Vladislav Mikhailov1, Tatiana Shamardina2, Max Ryabinin3,4, Alena Pestova3, Ivan Smurov2, Ekaterina Artemova5,6
1 SberDevices, 2 ABBYY, 3 HSE University, 4 Yandex, 5 Huawei Noah's Ark Lab, 6 Center for Information and Language Processing (CIS), MaiNLP lab, LMU Munich, Germany
Correspondence: vmikhailovhse@gmail.com
Abstract

Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard1 to assess the linguistic competence of language models for Russian.
1 Introduction
Recent NLP research has approached the linguistic competence of language models (LMs) with acceptability judgments, which reflect a sentence's well-formedness and naturalness from the perspective of native speakers (Chomsky, 1965). These judgments have formed an empirical foundation in generative linguistics for evaluating humans' grammatical knowledge and language acquisition (Schütze, 1996; Sprouse, 2018).
Borrowing conventions from linguistic theory, the community has put much effort into creating linguistic acceptability (LA) resources to explore whether LMs acquire grammatical concepts pivotal to human linguistic competence (Kann et al., 2019; Warstadt et al., 2019, 2020). Lately, similar non-English resources have been proposed to address this question in typologically diverse languages (Trotta et al., 2021; Volodina et al., 2021; Hartmann et al., 2021; Xiang et al., 2021). However, the ability of LMs to perform acceptability judgments in Russian remains understudied.

Equal contribution.
1 Available at rucola-benchmark.com

Corpus | Language | Size | %
CoLA | English | 10.6k | 70.5
ItaCoLA | Italian | 9.7k | 85.4
RuCoLA | Russian | 13.4k | 71.8

Table 1: Comparison of RuCoLA with related binary acceptability classification benchmarks: CoLA (Warstadt et al., 2019) and ItaCoLA (Trotta et al., 2021). % = percentage of acceptable sentences.
To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), a novel benchmark of 13.4k sentences labeled as acceptable or not. In contrast to the related binary acceptability classification benchmarks in Table 1, RuCoLA combines in-domain sentences manually collected from linguistic literature and out-of-domain sentences produced by nine machine translation and paraphrase generation models. The motivation behind the out-of-domain set is to facilitate the practical use of acceptability judgments for improving language generation (Kane et al., 2020; Batra et al., 2021). Furthermore, each unacceptable sentence is additionally labeled with one of four standard and machine-specific coarse-grained categories: morphology, syntax, semantics, and hallucinations (Raunak et al., 2021).
The main contributions of this paper are the following: (i) We create RuCoLA, the first large-scale acceptability classification resource in Russian. (ii) We present a detailed analysis of acceptability classification experiments with a broad range of baselines, including monolingual and cross-lingual Transformer (Vaswani et al., 2017) LMs, statistical approaches, acceptability measures from pretrained LMs, and human judgments. (iii) We release RuCoLA, the code of experiments2, and a leaderboard to test the linguistic competence of modern and upcoming LMs for the Russian language.
2 Related Work

2.1 Acceptability Judgments

Acceptability Datasets The design of existing LA datasets is based on standard practices in linguistics (Myers, 2017; Scholz et al., 2021): binary acceptability classification (Warstadt et al., 2019; Kann et al., 2019), magnitude estimation (Vázquez Martínez, 2021), gradient judgments (Lau et al., 2017; Sprouse et al., 2018), Likert scale scoring (Brunato et al., 2020), and a forced choice between minimal pairs (Marvin and Linzen, 2018; Warstadt et al., 2020). Recent studies have extended the research to languages other than English: Italian (Trotta et al., 2021), Swedish (Volodina et al., 2021), French (Feldhausen and Buchczyk, 2020), Chinese (Xiang et al., 2021), Bulgarian and German (Hartmann et al., 2021). Following the motivation and methodology of Warstadt et al. (2019), this paper focuses on the binary acceptability classification approach for the Russian language.
Applications of Acceptability Acceptability judgments have been broadly applied in NLP. In particular, they are used to test LMs' robustness (Yin et al., 2020) and probe their acquisition of grammatical phenomena (Warstadt and Bowman, 2019; Choshen et al., 2022; Zhang et al., 2021). LA has also stimulated the development of acceptability measures based on pseudo-perplexity (Lau et al., 2020), which correlate well with human judgments (Lau et al., 2017) and show benefits in scoring generated hypotheses in downstream tasks (Salazar et al., 2020). Another application includes evaluating the grammatical and semantic correctness in language generation (Kane et al., 2020; Harkous et al., 2020; Bakshi et al., 2021; Batra et al., 2021).

2 Both RuCoLA and the code of our experiments are available at github.com/RussianNLP/RuCoLA
Source | Size | % | Content
rusgram | 563 | 49.7 | Corpus grammar
Testelets (2001) | 1335 | 73.9 | General syntax
Lutikova (2010) | 193 | 75.6 | Syntactic structures
Mitrenina et al. (2017) | 54 | 57.4 | Generative grammar
Paducheva (2010) | 1308 | 84.3 | Semantics of tense
Paducheva (2004) | 1374 | 90.8 | Lexical semantics
Paducheva (2013) | 1462 | 89.5 | Aspects of negation
Seliverstova (2004) | 2104 | 80.8 | Semantics
Shavrina et al. (2020) | 1444 | 36.6 | Grammar exam tasks
In-domain | 9837 | 74.5 |
Machine Translation | 1286 | 72.8 | English translations
Paraphrase Generation | 2322 | 59.9 | Automatic paraphrases
Out-of-domain | 3608 | 64.6 |
Total | 13445 | 71.8 |

Table 2: RuCoLA statistics by source. The number of in-domain sentences is similar to that of CoLA and ItaCoLA. % = percentage of acceptable sentences.
2.2 Evaluation of Text Generation
Machine translation (or MT) is one of the first sub-fields which has established diagnostic evaluation of neural models (Dong et al., 2021). Diagnostic datasets can be constructed by automatic generation of contrastive pairs (Burlot and Yvon, 2017), crowdsourcing annotations of generated sentences (Lau et al., 2014), and native speaker data (Anastasopoulos, 2019). Various phenomena have been analyzed, to name a few: morphology (Burlot et al., 2018), syntactic properties (Sennrich, 2017; Wei et al., 2018), commonsense (He et al., 2020), anaphoric pronouns (Guillou et al., 2018), and cohesion (Bawden et al., 2018).

Recent research has shifted towards overcoming limitations in language generation, such as copying inputs (Liu et al., 2021), distorting facts (Santhanam et al., 2021), and generating hallucinated content (Zhou et al., 2021). Maynez et al. (2020) and Liu et al. (2022) propose datasets on hallucination detection. SCARECROW (Dou et al., 2022) and TGEA (He et al., 2021) focus on taxonomies of text generation errors. Drawing inspiration from these works, we create the machine-generated out-of-domain set to foster text generation evaluation with acceptability.
3 RuCoLA
3.1 Design
RuCoLA consists of in-domain and out-of-domain subsets, as outlined in Table 2. Below, we describe the data collection procedures for each subset.
Label | Set | Category | Sentence | Source
✓ | In-domain | — | Ya obnaruzhil ego lezhaschego odnogo na krovati. (I found him lying in the bed alone.) | Testelets (2001)
* | In-domain | SYNTAX | Ivan prileg, chtoby on otdokhnul. (Ivan laid down in order that he has a rest.) | Testelets (2001)
✓ | Out-of-domain | — | Ja ne chital ni odnogo iz ego romanov. (I have not read any of his novels.) | Artetxe and Schwenk (2019)
* | Out-of-domain | HALLUCINATION | Ljuk ostanavlivaet udachu ot etogo. (Luke stops luck from doing this.) | Schwenk et al. (2021)

Table 3: A sample of RuCoLA. * = unacceptable sentences. ✓ = acceptable sentences. The examples are translated for illustration purposes.
In-domain Set Here, the data collection method is analogous to CoLA. The in-domain sentences and the corresponding authors' acceptability judgments3 are drawn from fundamental linguistic textbooks, academic publications, and methodological materials4. The works are focused on various linguistic phenomena, including but not limited to general syntax (Testelets, 2001), the syntactic structure of noun phrases (Lutikova, 2010), negation (Paducheva, 2013), predicate ellipsis, and subordinate clauses (rusgram5). Shavrina et al. (2020) introduce a dataset on the Unified State Exam in the Russian language, which serves as school finals and university entry examinations in Russia. The dataset includes standardized tests on high school curriculum topics made by methodologists. We extract sentences from the tasks on Russian grammar, which require identifying incorrect word derivation and syntactic violations.
Out-of-domain Set The out-of-domain sentences are produced by nine open-source MT and paraphrase generation models using subsets of four datasets from different domains: Tatoeba (Artetxe and Schwenk, 2019), WikiMatrix (Schwenk et al., 2021), TED (Qi et al., 2018), and Yandex Parallel Corpus (Antonova and Misyurev, 2011). We use cross-lingual MT models released as a part of the EasyNMT library6: OPUS-MT (Tiedemann and Thottingal, 2020), M-BART50 (Tang et al., 2020), and M2M-100 (Fan et al., 2021) of 418M and 1.2B parameters. Russian WikiMatrix sentences are paraphrased via the russian-paraphrasers library (Fenogenova, 2021) with the following models and nucleus sampling strategy: ruGPT2-Large7 (760M), ruT5 (244M)8, and mT5 (Xue et al., 2021) of Small (300M), Base (580M), and Large (1.2B) versions. The annotation procedure of the generated sentences is documented in §3.3.

3 We keep unacceptable sentences marked with the "*", "*?" and "??" labels.
4 The choice is also based on the ease of manual example collection, e.g., high digital quality of the sources and no need for manual transcription.
5 A collection of materials written by linguists for a corpus-based description of Russian grammar. Available at rusgram.ru
6 github.com/UKPLab/EasyNMT
7 hf.co/sberbank-ai/rugpt2large
8 hf.co/cointegrated/rut5-base-paraphraser
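As an illustration only, the snippet below sketches how translation candidates could be produced with the EasyNMT library; the exact model identifiers, source corpora, and decoding settings are assumptions rather than the released generation pipeline.

```python
# Minimal sketch: translating English source sentences into Russian with EasyNMT.
# The model names below are assumed EasyNMT identifiers for OPUS-MT, mBART50 and M2M-100.
from easynmt import EasyNMT

english_sentences = [
    "I have not read any of his novels.",
    "Luke stops luck from doing this.",
]

candidates = []
for model_name in ["opus-mt", "mbart50_m2m", "m2m_100_418M", "m2m_100_1.2B"]:
    model = EasyNMT(model_name)
    # Each MT system contributes its own Russian candidates for later annotation.
    russian = model.translate(english_sentences, source_lang="en", target_lang="ru")
    candidates.extend(
        (model_name, src, tgt) for src, tgt in zip(english_sentences, russian)
    )
```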
3.2 Violation Categories
Each unacceptable sentence is additionally labeled with one of the four violation categories: morphology, syntax, semantics, and hallucinations. The annotation for the in-domain set is obtained through manual work with the sources. The categories are manually defined based on the interpretation of examples provided by the experts, topics covered by chapters, and the general content of a linguistic source. The out-of-domain sentences are annotated as described in §3.3.
Phenomena The phenomena covered by RuCoLA are well represented in Russian theoretical and corpus linguistics and peculiar to modern generative models. We briefly summarize our informal categorization and list examples of the phenomena below:

1. SYNTAX: agreement violations, corruption of word order, misconstruction of syntactic clauses and phrases, incorrect use of appositions, violations of verb transitivity or argument structure, ellipsis, missing grammatical constituents or words.
2. MORPHOLOGY: incorrect derivation or word building, non-existent words.
3. SEMANTICS: incorrect use of negation, violation of the verb's semantic argument structure.
4. HALLUCINATION: text degeneration, nonsensical sentences, irrelevant repetitions, decoding confusions, incomplete translations, hallucinated content.

Table 3 provides a sample of several RuCoLA sentences, and examples for each violation category can be found in Appendix A.
3.3 Annotation of Machine-Generated Sentences
The machine-generated sentences undergo a two-stage annotation procedure on Toloka (Pavlichenko et al., 2021), a crowdsourcing platform for data labeling9. Each stage includes an unpaid training phase with explanations, control tasks for tracking annotation quality10, and the main annotation task. Before starting, the worker is given detailed instructions describing the task, explaining the labels, and showing plenty of examples. The instruction is available at any time during both the training and main annotation phases. To get access to the main phase, the worker should first complete the training phase by labeling more than 70% of its examples correctly (Nangia and Bowman, 2019). Each trained worker receives a page with five sentences, one of which is a control one.

We collect the majority vote labels via a dynamic overlap11 from three to five workers after filtering them by response time and performance on control tasks. Appendix B.2 contains a detailed description of the annotation protocol, including response statistics and the agreement rates.
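For illustration, a minimal sketch of such a filter-then-aggregate step is given below; the field names, thresholds, and the handling of ties are assumptions, not the exact Toloka-side logic.

```python
# Hypothetical vote aggregation: drop low-quality or too-fast responses,
# then take a strict majority over the remaining labels.
from collections import Counter

def aggregate_votes(votes, min_control_accuracy=0.5, min_response_time_sec=30):
    kept = [
        v["label"]
        for v in votes
        if v["control_accuracy"] > min_control_accuracy
        and v["response_time_sec"] >= min_response_time_sec
    ]
    if not kept:
        return None  # no reliable votes; request more annotations (dynamic overlap)
    label, count = Counter(kept).most_common(1)[0]
    # Keep the label only if it is supported by a strict majority of the kept votes.
    return label if count > len(kept) / 2 else None
```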
Stage 1: Acceptability Judgments The first annotation stage defines whether a given sentence is acceptable or not. Access to the project is granted to workers certified as native speakers of Russian by Toloka and ranked among the top 60% of workers according to the Toloka rating system. Each worker answers 30 examples in the training phase. Each training example is accompanied by an explanation that appears in the case of an incorrect answer. The main annotation phase comprises 3.6k machine-generated sentences. The pay rate is on average $2.55/hr, which is twice the hourly minimum wage in Russia. Each of the 1.3k trained workers gets paid, but we keep votes only from the 960 workers whose annotation quality rate on the control sentences is more than 50%. We provide a shortened translated instruction and an example of the web interface in Table 6 (see Appendix B.1).

9 toloka.ai
10 Control tasks are used on Toloka as common practice for discarding results from bots or workers whose quality on these tasks is unsatisfactory. In our annotation projects, the tasks are manually selected or annotated by a few authors: about 200 and 500 sentences for Stages 1 and 2, respectively.
11 toloka.ai/docs/dynamic-overlap
Stage 2: Violation Categories The second stage includes validation and annotation of the sentences labeled as unacceptable at Stage 1 according to five answer options: "Morphology", "Syntax", "Semantics", "Hallucinations", and "Other". The task is framed as multi-label classification, i.e., a sentence may contain more than one violation in some rare cases or be re-labeled as acceptable. We create a team of 30 annotators who are BA and MA students in philology and linguistics from several Russian universities. The students are asked to study the works on CoLA (Warstadt et al., 2019), TGEA (He et al., 2021), and hallucinations (Zhou et al., 2021). We also hold an online seminar to discuss the works and clarify the task specifics. Each student undergoes platform-based training on 15 examples before moving on to the main phase of 1.3k sentences. The students are paid on average $5.42/hr and are eligible to get credits for an academic course or an internship. Similar to one of the data collection protocols by Parrish et al. (2021), this stage provides direct interaction between the authors and the students in a group chat. We keep submissions with more than 30 seconds of response time per page and collect the majority vote labels for each answer independently. Sentences having more than one violation category or labeled as "Other" by the majority are filtered out. The shortened instruction is presented in Table 7 (see Appendix B.1).
3.4 General Statistics
Length and Frequency The sentences in RuCoLA are filtered by the 4–30 token range with razdel12, a rule-based Russian tokenizer. There are 11 tokens in each sentence on average. We estimate the number of high-frequency tokens in each sentence according to the Russian National Corpus (RNC)13 to control the word frequency distribution. It is computed as the number of frequently used tokens (i.e., the number of instances per million in the RNC is higher than 1) divided by the number of tokens in a sentence. We use a moderate frequency threshold t > 0.6 to keep sentences containing rare token units typical for some violations: non-existent or misderived words, incomplete translations, and others. The sentences contain on average 92% of high-frequency tokens.

12 github.com/natasha/razdel
13 ruscorpora.ru/new/en

Figure 1: Distribution of violation categories in RuCoLA's unacceptable sentences.
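A rough sketch of the length and frequency filters described above is shown below; the RNC frequency table is not part of RuCoLA's release, so rnc_ipm is a hypothetical dictionary mapping a lowercased token to its instances-per-million frequency.

```python
# Minimal sketch of the token-range and word-frequency filters using razdel.
from razdel import tokenize

def passes_filters(sentence, rnc_ipm, min_len=4, max_len=30, freq_threshold=0.6):
    tokens = [t.text for t in tokenize(sentence)]
    if not (min_len <= len(tokens) <= max_len):
        return False
    # Share of tokens with an RNC frequency above 1 instance per million.
    frequent = sum(rnc_ipm.get(tok.lower(), 0.0) > 1.0 for tok in tokens)
    return frequent / len(tokens) > freq_threshold
```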
Category Distribution Figure 1 shows the distribution of violation categories in RuCoLA. Syntactic violations are the most common (53.3% and 40.8% in the in-domain and out-of-domain sets, respectively). The in-domain set includes 40.2% of semantic and 6.6% of morphological violations, while the out-of-domain set accounts for 11.9% and 9.8%, respectively. Model hallucinations make up 12.7% of the total number of unacceptable sentences.
Splits The in-domain set of RuCoLA is split into train, validation, and private test splits in the standard 80/10/10 ratio (7.9k/1k/1k examples). The out-of-domain set is divided into validation and private test splits in a 50/50 ratio (1.8k/1.8k examples). Each split is balanced by the number of examples per target class, the source type, and the violation category.
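A minimal sketch of such a stratified split with scikit-learn follows; the column names (label, source, category) are illustrative rather than the released field names, and the published splits themselves are fixed.

```python
# Stratify on a composite key so each split keeps the balance of target class,
# source type, and violation category.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_in_domain(df: pd.DataFrame, seed: int = 0):
    strata = (
        df["label"].astype(str)
        + "_" + df["source"].astype(str)
        + "_" + df["category"].fillna("acceptable").astype(str)
    )
    train, rest = train_test_split(df, test_size=0.2, stratify=strata, random_state=seed)
    val, test = train_test_split(
        rest, test_size=0.5, stratify=strata.loc[rest.index], random_state=seed
    )
    return train, val, test  # roughly 80/10/10
```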
4 Experiments
We evaluate several methods for acceptability classification, ranging from simple non-neural approaches to state-of-the-art cross-lingual models.
4.1 Performance Metrics
Following Warstadt et al. (2019), the performance is measured by the accuracy score (Acc.) and the Matthews Correlation Coefficient (MCC; Matthews, 1975). MCC on the validation set is used as the target metric for hyperparameter tuning and early stopping. We report the results averaged over ten restarts from different random seeds.
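Both metrics are available in scikit-learn; a short sketch with toy labels (1 = acceptable, 0 = unacceptable):

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_true = [1, 1, 0, 1, 0]  # gold acceptability labels (toy example)
y_pred = [1, 0, 0, 1, 1]  # classifier predictions

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)  # target metric for model selection
print(f"Acc. = {acc:.3f}, MCC = {mcc:.3f}")
```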
4.2 Models
Non-neural Models We use two models from the scikit-learn library (Pedregosa et al., 2011) as simple non-neural baselines: a majority vote classifier, and a logistic regression classifier over tf-idf (Salton and Yang, 1973) features computed on word n-grams with the n-gram range [1; 3], which results in a total of 2509 features. For the linear model, we tune the ℓ2 regularization coefficient C ∈ {0.01, 0.1, 1.0} based on the validation set performance.
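A minimal sketch of the linear baseline under these settings; data loading is omitted, and the default tokenization of TfidfVectorizer is an assumption rather than the exact preprocessing used in the paper.

```python
# Tf-idf over word 1-3-grams with a logistic regression on top;
# C is selected on the validation set by MCC, as described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.pipeline import make_pipeline

def fit_tfidf_baseline(train_texts, train_labels, val_texts, val_labels):
    best_mcc, best_model = -1.0, None
    for C in (0.01, 0.1, 1.0):
        model = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 3)),
            LogisticRegression(C=C, max_iter=1000),  # L2 penalty by default
        )
        model.fit(train_texts, train_labels)
        mcc = matthews_corrcoef(val_labels, model.predict(val_texts))
        if mcc > best_mcc:
            best_mcc, best_model = mcc, model
    return best_model
```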
Acceptability Measures Probabilistic measures allow evaluating the acceptability of a sentence while taking its length and lexical frequency into account (Lau et al., 2020). There exist several different acceptability measures, such as PenLP, MeanLP, NormLP, and SLOR (Lau et al., 2020); we use PenLP due to its results in our preliminary experiments. We obtain the PenLP measure for each sentence by computing its log-probability (the sum of token log-probabilities) with the ruGPT3-medium14 model. PenLP normalizes the log-probability of a sentence P(s) by the sentence length with a scaling factor α:

PenLP(s) = P(s) / ((5 + |s|) / (5 + 1))^α.    (1)
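A minimal sketch of how PenLP could be computed with a Hugging Face causal LM; the checkpoint identifier and the value of α are assumptions (the footnote points to ruGPT3-medium, and Lau et al. (2020) use α = 0.8).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint id for ruGPT3-medium; substitute the actual one.
MODEL_NAME = "sberbank-ai/rugpt3medium_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def penlp(sentence: str, alpha: float = 0.8) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the loss is the mean negative log-likelihood over
        # the |s| - 1 predicted tokens, so the summed log-probability is:
        loss = model(ids, labels=ids).loss
    log_prob = -loss.item() * (ids.size(1) - 1)
    length_penalty = ((5 + ids.size(1)) / (5 + 1)) ** alpha  # Eq. (1)
    return log_prob / length_penalty
```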
After we compute the PenLP value of a sentence, we can predict its acceptability by comparing it with a specified threshold. To find this threshold, we run 10-fold cross-validation on the train set: for each fold, we obtain the candidate thresholds on 90% of the data by taking 100 points that evenly split the range between the minimum and maximum PenLP values. After that, we get the best threshold per fold by evaluating each threshold on the remaining 10% of the training data. Finally, we obtain the best threshold across folds by computing the MCC metric for each of them on the validation set. Figure 3 in Appendix D shows the distribution of scores for acceptable and unacceptable sentences.

14 hf.co/sberbank-ai/rugpt3medium
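A sketch of this threshold search, assuming PenLP scores and binary labels (1 = acceptable) are already computed; using MCC for the per-fold selection is our assumption, since the per-fold metric is not stated above.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import KFold

def select_threshold(train_scores, train_labels, val_scores, val_labels, n_folds=10):
    train_scores = np.asarray(train_scores)
    train_labels = np.asarray(train_labels)
    fold_thresholds = []
    for fit_idx, eval_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(train_scores):
        # 100 candidate thresholds spanning the score range of the 90% part.
        candidates = np.linspace(train_scores[fit_idx].min(), train_scores[fit_idx].max(), 100)
        fold_mcc = [
            matthews_corrcoef(train_labels[eval_idx], (train_scores[eval_idx] >= t).astype(int))
            for t in candidates
        ]
        fold_thresholds.append(candidates[int(np.argmax(fold_mcc))])
    # Pick the per-fold threshold that works best on the validation set.
    val_scores = np.asarray(val_scores)
    val_mcc = [
        matthews_corrcoef(val_labels, (val_scores >= t).astype(int)) for t in fold_thresholds
    ]
    return fold_thresholds[int(np.argmax(val_mcc))]
```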