RuCoLA: Russian Corpus of Linguistic Acceptability

Vladislav Mikhailov1, Tatiana Shamardina2, Max Ryabinin3,4, Alena Pestova3, Ivan Smurov2, Ekaterina Artemova5,6
1 SberDevices, 2 ABBYY, 3 HSE University, 4 Yandex, 5 Huawei Noah's Ark Lab, 6 Center for Information and Language Processing (CIS), MaiNLP lab, LMU Munich, Germany
Correspondence: vmikhailovhse@gmail.com
Abstract

Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard1 to assess the linguistic competence of language models for Russian.
1 Introduction
Recent NLP research has approached the linguistic competence of language models (LMs) with acceptability judgments, which reflect a sentence's well-formedness and naturalness from the perspective of native speakers (Chomsky, 1965). These judgments have formed an empirical foundation in generative linguistics for evaluating humans' grammatical knowledge and language acquisition (Schütze, 1996; Sprouse, 2018).
Borrowing conventions from linguistic theory, the community has put much effort into creating linguistic acceptability (LA) resources to explore whether LMs acquire grammatical concepts pivotal to human linguistic competence (Kann et al., 2019; Warstadt et al., 2019, 2020). Lately, similar non-English resources have been proposed to address this question in typologically diverse languages (Trotta et al., 2021; Volodina et al., 2021; Hartmann et al., 2021; Xiang et al., 2021). However, the ability of LMs to perform acceptability judgments in Russian remains understudied.

Equal contribution.
1 Available at rucola-benchmark.com

Corpus | Language | Size | %
CoLA | English | 10.6k | 70.5
ItaCoLA | Italian | 9.7k | 85.4
RuCoLA | Russian | 13.4k | 71.8

Table 1: Comparison of RuCoLA with related binary acceptability classification benchmarks: CoLA (Warstadt et al., 2019) and ItaCoLA (Trotta et al., 2021). % = percentage of acceptable sentences.
To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), a novel benchmark of 13.4k sentences labeled as acceptable or not. In contrast to the related binary acceptability classification benchmarks in Table 1, RuCoLA combines in-domain sentences manually collected from linguistic literature and out-of-domain sentences produced by nine machine translation and paraphrase generation models. The motivation behind the out-of-domain set is to facilitate the practical use of acceptability judgments for improving language generation (Kane et al., 2020; Batra et al., 2021). Furthermore, each unacceptable sentence is additionally labeled with one of four standard and machine-specific coarse-grained categories: morphology, syntax, semantics, and hallucinations (Raunak et al., 2021).
The main contributions of this paper are the following: (i) We create RuCoLA, the first large-scale acceptability classification resource in Russian. (ii) We present a detailed analysis of acceptability classification experiments with a broad range of baselines, including monolingual and cross-lingual Transformer (Vaswani et al., 2017) LMs, statistical approaches, acceptability measures from pretrained LMs, and human judgments. (iii) We release RuCoLA, the code of experiments2, and a leaderboard to test the linguistic competence of modern and upcoming LMs for the Russian language.
2 Related Work

2.1 Acceptability Judgments

Acceptability Datasets The design of existing LA datasets is based on standard practices in linguistics (Myers, 2017; Scholz et al., 2021): binary acceptability classification (Warstadt et al., 2019; Kann et al., 2019), magnitude estimation (Vázquez Martínez, 2021), gradient judgments (Lau et al., 2017; Sprouse et al., 2018), Likert scale scoring (Brunato et al., 2020), and a forced choice between minimal pairs (Marvin and Linzen, 2018; Warstadt et al., 2020). Recent studies have extended the research to languages other than English: Italian (Trotta et al., 2021), Swedish (Volodina et al., 2021), French (Feldhausen and Buchczyk, 2020), Chinese (Xiang et al., 2021), Bulgarian and German (Hartmann et al., 2021). Following the motivation and methodology of Warstadt et al. (2019), this paper focuses on the binary acceptability classification approach for the Russian language.
Applications of Acceptability Acceptability judgments have been broadly applied in NLP. In particular, they are used to test LMs' robustness (Yin et al., 2020) and probe their acquisition of grammatical phenomena (Warstadt and Bowman, 2019; Choshen et al., 2022; Zhang et al., 2021). LA has also stimulated the development of acceptability measures based on pseudo-perplexity (Lau et al., 2020), which correlate well with human judgments (Lau et al., 2017) and show benefits in scoring generated hypotheses in downstream tasks (Salazar et al., 2020). Another application includes evaluating the grammatical and semantic correctness in language generation (Kane et al., 2020; Harkous et al., 2020; Bakshi et al., 2021; Batra et al., 2021).

2 Both RuCoLA and the code of our experiments are available at github.com/RussianNLP/RuCoLA
Source | Size | % | Content
rusgram | 563 | 49.7 | Corpus grammar
Testelets (2001) | 1335 | 73.9 | General syntax
Lutikova (2010) | 193 | 75.6 | Syntactic structures
Mitrenina et al. (2017) | 54 | 57.4 | Generative grammar
Paducheva (2010) | 1308 | 84.3 | Semantics of tense
Paducheva (2004) | 1374 | 90.8 | Lexical semantics
Paducheva (2013) | 1462 | 89.5 | Aspects of negation
Seliverstova (2004) | 2104 | 80.8 | Semantics
Shavrina et al. (2020) | 1444 | 36.6 | Grammar exam tasks
In-domain | 9837 | 74.5 |
Machine Translation | 1286 | 72.8 | English translations
Paraphrase Generation | 2322 | 59.9 | Automatic paraphrases
Out-of-domain | 3608 | 64.6 |
Total | 13445 | 71.8 |

Table 2: RuCoLA statistics by source. The number of in-domain sentences is similar to that of CoLA and ItaCoLA. % = percentage of acceptable sentences.
2.2 Evaluation of Text Generation
Machine translation (or MT) is one of the first sub-fields which has established diagnostic evaluation of neural models (Dong et al., 2021). Diagnostic datasets can be constructed by automatic generation of contrastive pairs (Burlot and Yvon, 2017), crowdsourcing annotations of generated sentences (Lau et al., 2014), and native speaker data (Anastasopoulos, 2019). Various phenomena have been analyzed, to name a few: morphology (Burlot et al., 2018), syntactic properties (Sennrich, 2017; Wei et al., 2018), commonsense (He et al., 2020), anaphoric pronouns (Guillou et al., 2018), and cohesion (Bawden et al., 2018).

Recent research has shifted towards overcoming limitations in language generation, such as copying inputs (Liu et al., 2021), distorting facts (Santhanam et al., 2021), and generating hallucinated content (Zhou et al., 2021). Maynez et al. (2020) and Liu et al. (2022) propose datasets on hallucination detection. SCARECROW (Dou et al., 2022) and TGEA (He et al., 2021) focus on taxonomies of text generation errors. Drawing inspiration from these works, we create the machine-generated out-of-domain set to foster text generation evaluation with acceptability.
3 RuCoLA
3.1 Design
RuCoLA consists of in-domain and out-of-domain subsets, as outlined in Table 2. Below, we describe the data collection procedures for each subset.
Label | Set | Category | Sentence | Source
✓ | In-domain | — | Ya obnaruzhil ego lezhaschego odnogo na krovati. (I found him lying in the bed alone.) | Testelets (2001)
* | In-domain | SYNTAX | Ivan prileg, chtoby on otdokhnul. (Ivan laid down in order that he has a rest.) | Testelets (2001)
✓ | Out-of-domain | — | Ja ne chital ni odnogo iz ego romanov. (I have not read any of his novels.) | Artetxe and Schwenk (2019)
* | Out-of-domain | HALLUCINATION | Ljuk ostanavlivaet udachu ot etogo. (Luke stops luck from doing this.) | Schwenk et al. (2021)

Table 3: A sample of RuCoLA. * = unacceptable sentences. ✓ = acceptable sentences. The examples are translated for illustration purposes.
In-domain Set Here, the data collection method is analogous to CoLA. The in-domain sentences and the corresponding authors' acceptability judgments3 are drawn from fundamental linguistic textbooks, academic publications, and methodological materials4. The works are focused on various linguistic phenomena, including but not limited to general syntax (Testelets, 2001), the syntactic structure of noun phrases (Lutikova, 2010), negation (Paducheva, 2013), predicate ellipsis, and subordinate clauses (rusgram5). Shavrina et al. (2020) introduce a dataset on the Unified State Exam in the Russian language, which serves as school finals and university entry examinations in Russia. The dataset includes standardized tests on high school curriculum topics made by methodologists. We extract sentences from the tasks on Russian grammar, which require identifying incorrect word derivation and syntactic violations.
Out-of-domain Set The out-of-domain sentences are produced by nine open-source MT and paraphrase generation models using subsets of four datasets from different domains: Tatoeba (Artetxe and Schwenk, 2019), WikiMatrix (Schwenk et al., 2021), TED (Qi et al., 2018), and Yandex Parallel Corpus (Antonova and Misyurev, 2011). We use cross-lingual MT models released as a part of the EasyNMT library6: OPUS-MT (Tiedemann and Thottingal, 2020), M-BART50 (Tang et al., 2020), and M2M-100 (Fan et al., 2021) of 418M and 1.2B parameters. Russian WikiMatrix sentences are paraphrased via the russian-paraphrasers library (Fenogenova, 2021) with the following models and nucleus sampling strategy: ruGPT2-Large7 (760M), ruT5 (244M)8, and mT5 (Xue et al., 2021) of Small (300M), Base (580M), and Large (1.2B) versions. The annotation procedure of the generated sentences is documented in §3.3.

3 We keep unacceptable sentences marked with the "*", "*?" and "??" labels.
4 The choice is also based on the ease of manual example collection, e.g., high digital quality of the sources and no need for manual transcription.
5 A collection of materials written by linguists for a corpus-based description of Russian grammar. Available at rusgram.ru
6 github.com/UKPLab/EasyNMT
7 hf.co/sberbank-ai/rugpt2large
8 hf.co/cointegrated/rut5-base-paraphraser
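As an illustration only, the snippet below sketches how translation candidates could be produced with the EasyNMT library; the exact model identifiers, source corpora, and decoding settings are assumptions rather than the released generation pipeline.

```python
# Minimal sketch: translating English source sentences into Russian with EasyNMT.
# The model names below are assumed EasyNMT identifiers for OPUS-MT, mBART50 and M2M-100.
from easynmt import EasyNMT

english_sentences = [
    "I have not read any of his novels.",
    "Luke stops luck from doing this.",
]

candidates = []
for model_name in ["opus-mt", "mbart50_m2m", "m2m_100_418M", "m2m_100_1.2B"]:
    model = EasyNMT(model_name)
    # Each MT system contributes its own Russian candidates for later annotation.
    russian = model.translate(english_sentences, source_lang="en", target_lang="ru")
    candidates.extend(
        (model_name, src, tgt) for src, tgt in zip(english_sentences, russian)
    )
```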
3.2 Violation Categories
Each unacceptable sentence is additionally labeled with one of the four violation categories: morphology, syntax, semantics, and hallucinations. The annotation for the in-domain set is obtained through manual work with the sources. The categories are manually defined based on the interpretation of examples provided by the experts, topics covered by chapters, and the general content of a linguistic source. The out-of-domain sentences are annotated as described in §3.3.
Phenomena The phenomena covered by RuCoLA are well represented in Russian theoretical and corpus linguistics and peculiar to modern generative models. We briefly summarize our informal categorization and list examples of the phenomena below:

1. SYNTAX: agreement violations, corruption of word order, misconstruction of syntactic clauses and phrases, incorrect use of appositions, violations of verb transitivity or argument structure, ellipsis, missing grammatical constituents or words.
2. MORPHOLOGY: incorrect derivation or word building, non-existent words.
3. SEMANTICS: incorrect use of negation, violation of the verb's semantic argument structure.
4. HALLUCINATION: text degeneration, nonsensical sentences, irrelevant repetitions, decoding confusions, incomplete translations, hallucinated content.

Table 3 provides a sample of several RuCoLA sentences, and examples for each violation category can be found in Appendix A.
3.3 Annotation of Machine-Generated Sentences
The machine-generated sentences undergo a two-stage annotation procedure on Toloka (Pavlichenko et al., 2021), a crowdsourcing platform for data labeling9. Each stage includes an unpaid training phase with explanations, control tasks for tracking annotation quality10, and the main annotation task. Before starting, the worker is given detailed instructions describing the task, explaining the labels, and showing plenty of examples. The instruction is available at any time during both the training and main annotation phases. To get access to the main phase, the worker should first complete the training phase by labeling more than 70% of its examples correctly (Nangia and Bowman, 2019). Each trained worker receives a page with five sentences, one of which is a control one.

We collect the majority vote labels via a dynamic overlap11 from three to five workers after filtering them by response time and performance on control tasks. Appendix B.2 contains a detailed description of the annotation protocol, including response statistics and the agreement rates.
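For illustration, a minimal sketch of such a filter-then-aggregate step is given below; the field names, thresholds, and the handling of ties are assumptions, not the exact Toloka-side logic.

```python
# Hypothetical vote aggregation: drop low-quality or too-fast responses,
# then take a strict majority over the remaining labels.
from collections import Counter

def aggregate_votes(votes, min_control_accuracy=0.5, min_response_time_sec=30):
    kept = [
        v["label"]
        for v in votes
        if v["control_accuracy"] > min_control_accuracy
        and v["response_time_sec"] >= min_response_time_sec
    ]
    if not kept:
        return None  # no reliable votes; request more annotations (dynamic overlap)
    label, count = Counter(kept).most_common(1)[0]
    # Keep the label only if it is supported by a strict majority of the kept votes.
    return label if count > len(kept) / 2 else None
```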
Stage 1: Acceptability Judgments The first annotation stage defines whether a given sentence is acceptable or not. Access to the project is granted to workers certified as native speakers of Russian by Toloka and ranked among the top 60% of workers according to the Toloka rating system. Each worker answers 30 examples in the training phase. Each training example is accompanied by an explanation that appears in the case of an incorrect answer. The main annotation phase comprises 3.6k machine-generated sentences. The pay rate is on average $2.55/hr, which is twice the hourly minimum wage in Russia. Each of the 1.3k trained workers gets paid, but we keep votes only from the 960 workers whose annotation quality rate on the control sentences is more than 50%. We provide a shortened translated instruction and an example of the web interface in Table 6 (see Appendix B.1).

9 toloka.ai
10 Control tasks are used on Toloka as common practice for discarding results from bots or workers whose quality on these tasks is unsatisfactory. In our annotation projects, the tasks are manually selected or annotated by a few authors: about 200 and 500 sentences for Stages 1 and 2, respectively.
11 toloka.ai/docs/dynamic-overlap
Stage 2: Violation Categories The second stage includes validation and annotation of the sentences labeled as unacceptable at Stage 1 according to five answer options: "Morphology", "Syntax", "Semantics", "Hallucinations", and "Other". The task is framed as multi-label classification, i.e., a sentence may contain more than one violation in some rare cases or be re-labeled as acceptable. We create a team of 30 annotators who are BA and MA students in philology and linguistics from several Russian universities. The students are asked to study the works on CoLA (Warstadt et al., 2019), TGEA (He et al., 2021), and hallucinations (Zhou et al., 2021). We also hold an online seminar to discuss the works and clarify the task specifics. Each student undergoes platform-based training on 15 examples before moving on to the main phase of 1.3k sentences. The students are paid on average $5.42/hr and are eligible to get credits for an academic course or an internship. Similar to one of the data collection protocols by Parrish et al. (2021), this stage provides direct interaction between the authors and the students in a group chat. We keep submissions with more than 30 seconds of response time per page and collect the majority vote labels for each answer independently. Sentences having more than one violation category or labeled as "Other" by the majority are filtered out. The shortened instruction is presented in Table 7 (see Appendix B.1).
3.4 General Statistics
Length and Frequency The sentences in RuCoLA are filtered by the 4–30 token range with razdel12, a rule-based Russian tokenizer. There are 11 tokens in each sentence on average. We estimate the number of high-frequency tokens in each sentence according to the Russian National Corpus (RNC)13 to control the word frequency distribution. It is computed as the number of frequently used tokens (i.e., the number of instances per million in the RNC is higher than 1) divided by the number of tokens in a sentence. We use a moderate frequency threshold t > 0.6 to keep sentences containing rare token units typical for some violations: non-existent or misderived words, incomplete translations, and others. The sentences contain on average 92% of high-frequency tokens.

12 github.com/natasha/razdel
13 ruscorpora.ru/new/en

Figure 1: Distribution of violation categories in RuCoLA's unacceptable sentences.
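A rough sketch of the length and frequency filters described above is shown below; the RNC frequency table is not part of RuCoLA's release, so rnc_ipm is a hypothetical dictionary mapping a lowercased token to its instances-per-million frequency.

```python
# Minimal sketch of the token-range and word-frequency filters using razdel.
from razdel import tokenize

def passes_filters(sentence, rnc_ipm, min_len=4, max_len=30, freq_threshold=0.6):
    tokens = [t.text for t in tokenize(sentence)]
    if not (min_len <= len(tokens) <= max_len):
        return False
    # Share of tokens with an RNC frequency above 1 instance per million.
    frequent = sum(rnc_ipm.get(tok.lower(), 0.0) > 1.0 for tok in tokens)
    return frequent / len(tokens) > freq_threshold
```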
Category Distribution Figure 1 shows the distribution of violation categories in RuCoLA. Syntactic violations are the most common (53.3% and 40.8% in the in-domain and out-of-domain sets, respectively). The in-domain set includes 40.2% of semantic and 6.6% of morphological violations, while the out-of-domain set accounts for 11.9% and 9.8%, respectively. Model hallucinations make up 12.7% of the total number of unacceptable sentences.
Splits The in-domain set of RuCoLA is split into train, validation, and private test splits in the standard 80/10/10 ratio (7.9k/1k/1k examples). The out-of-domain set is divided into validation and private test splits in a 50/50 ratio (1.8k/1.8k examples). Each split is balanced by the number of examples per target class, the source type, and the violation category.
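A minimal sketch of such a stratified split with scikit-learn follows; the column names (label, source, category) are illustrative rather than the released field names, and the published splits themselves are fixed.

```python
# Stratify on a composite key so each split keeps the balance of target class,
# source type, and violation category.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_in_domain(df: pd.DataFrame, seed: int = 0):
    strata = (
        df["label"].astype(str)
        + "_" + df["source"].astype(str)
        + "_" + df["category"].fillna("acceptable").astype(str)
    )
    train, rest = train_test_split(df, test_size=0.2, stratify=strata, random_state=seed)
    val, test = train_test_split(
        rest, test_size=0.5, stratify=strata.loc[rest.index], random_state=seed
    )
    return train, val, test  # roughly 80/10/10
```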
4 Experiments
We evaluate several methods for acceptability classification, ranging from simple non-neural approaches to state-of-the-art cross-lingual models.
4.1 Performance Metrics
Following Warstadt et al. (2019), the performance is measured by the accuracy score (Acc.) and the Matthews Correlation Coefficient (MCC; Matthews, 1975). MCC on the validation set is used as the target metric for hyperparameter tuning and early stopping. We report the results averaged over ten restarts from different random seeds.
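Both metrics are available in scikit-learn; a short sketch with toy labels (1 = acceptable, 0 = unacceptable):

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_true = [1, 1, 0, 1, 0]  # gold acceptability labels (toy example)
y_pred = [1, 0, 0, 1, 1]  # classifier predictions

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)  # target metric for model selection
print(f"Acc. = {acc:.3f}, MCC = {mcc:.3f}")
```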
4.2 Models
Non-neural Models We use two models from the scikit-learn library (Pedregosa et al., 2011) as simple non-neural baselines: a majority vote classifier, and a logistic regression classifier over tf-idf (Salton and Yang, 1973) features computed on word n-grams with the n-gram range [1; 3], which results in a total of 2509 features. For the linear model, we tune the ℓ2 regularization coefficient C ∈ {0.01, 0.1, 1.0} based on the validation set performance.
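A minimal sketch of the linear baseline under these settings; data loading is omitted, and the default tokenization of TfidfVectorizer is an assumption rather than the exact preprocessing used in the paper.

```python
# Tf-idf over word 1-3-grams with a logistic regression on top;
# C is selected on the validation set by MCC, as described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.pipeline import make_pipeline

def fit_tfidf_baseline(train_texts, train_labels, val_texts, val_labels):
    best_mcc, best_model = -1.0, None
    for C in (0.01, 0.1, 1.0):
        model = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 3)),
            LogisticRegression(C=C, max_iter=1000),  # L2 penalty by default
        )
        model.fit(train_texts, train_labels)
        mcc = matthews_corrcoef(val_labels, model.predict(val_texts))
        if mcc > best_mcc:
            best_mcc, best_model = mcc, model
    return best_model
```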
Acceptability Measures Probabilistic measures allow evaluating the acceptability of a sentence while taking its length and lexical frequency into account (Lau et al., 2020). There exist several different acceptability measures, such as PenLP, MeanLP, NormLP, and SLOR (Lau et al., 2020); we use PenLP due to its results in our preliminary experiments. We obtain the PenLP measure for each sentence by computing its log-probability (the sum of token log-probabilities) with the ruGPT3-medium14 model. PenLP normalizes the log-probability of a sentence P(s) by the sentence length with a scaling factor α:

PenLP(s) = P(s) / ((5 + |s|) / (5 + 1))^α.    (1)
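A minimal sketch of how PenLP could be computed with a Hugging Face causal LM; the checkpoint identifier and the value of α are assumptions (the footnote points to ruGPT3-medium, and Lau et al. (2020) use α = 0.8).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint id for ruGPT3-medium; substitute the actual one.
MODEL_NAME = "sberbank-ai/rugpt3medium_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def penlp(sentence: str, alpha: float = 0.8) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the loss is the mean negative log-likelihood over
        # the |s| - 1 predicted tokens, so the summed log-probability is:
        loss = model(ids, labels=ids).loss
    log_prob = -loss.item() * (ids.size(1) - 1)
    length_penalty = ((5 + ids.size(1)) / (5 + 1)) ** alpha  # Eq. (1)
    return log_prob / length_penalty
```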
After we compute the PenLP value of a sentence, we can predict its acceptability by comparing it with a specified threshold. To find this threshold, we run 10-fold cross-validation on the train set: for each fold, we obtain the candidate thresholds on 90% of the data by taking 100 points that evenly split the range between the minimum and maximum PenLP values. After that, we get the best threshold per fold by evaluating each threshold on the remaining 10% of the training data. Finally, we obtain the best threshold across folds by computing the MCC metric for each of them on the validation set. Figure 3 in Appendix D shows the distribution of scores for acceptable and unacceptable sentences.

14 hf.co/sberbank-ai/rugpt3medium
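A sketch of this threshold search, assuming PenLP scores and binary labels (1 = acceptable) are already computed; using MCC for the per-fold selection is our assumption, since the per-fold metric is not stated above.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import KFold

def select_threshold(train_scores, train_labels, val_scores, val_labels, n_folds=10):
    train_scores = np.asarray(train_scores)
    train_labels = np.asarray(train_labels)
    fold_thresholds = []
    for fit_idx, eval_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(train_scores):
        # 100 candidate thresholds spanning the score range of the 90% part.
        candidates = np.linspace(train_scores[fit_idx].min(), train_scores[fit_idx].max(), 100)
        fold_mcc = [
            matthews_corrcoef(train_labels[eval_idx], (train_scores[eval_idx] >= t).astype(int))
            for t in candidates
        ]
        fold_thresholds.append(candidates[int(np.argmax(fold_mcc))])
    # Pick the per-fold threshold that works best on the validation set.
    val_scores = np.asarray(val_scores)
    val_mcc = [
        matthews_corrcoef(val_labels, (val_scores >= t).astype(int)) for t in fold_thresholds
    ]
    return fold_thresholds[int(np.argmax(val_mcc))]
```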