
sian. (ii) We present a detailed analysis of acceptability classification experiments with a broad range of baselines, including monolingual and cross-lingual Transformer (Vaswani et al., 2017) LMs, statistical approaches, acceptability measures from pretrained LMs, and human judgments. (iii) We release RuCoLA, the code of our experiments2, and a leaderboard to test the linguistic competence of modern and upcoming LMs for the Russian language.
2 Related Work
2.1 Acceptability Judgments
Acceptability Datasets The design of existing LA datasets is based on standard practices in linguistics (Myers, 2017; Scholz et al., 2021): binary acceptability classification (Warstadt et al., 2019; Kann et al., 2019), magnitude estimation (Vázquez Martínez, 2021), gradient judgments (Lau et al., 2017; Sprouse et al., 2018), Likert scale scoring (Brunato et al., 2020), and a forced choice between minimal pairs (Marvin and Linzen, 2018; Warstadt et al., 2020). Recent studies have extended the research to languages other than English: Italian (Trotta et al., 2021), Swedish (Volodina et al., 2021), French (Feldhausen and Buchczyk, 2020), Chinese (Xiang et al., 2021), and Bulgarian and German (Hartmann et al., 2021). Following the motivation and methodology of Warstadt et al. (2019), this paper focuses on the binary acceptability classification approach for the Russian language.
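To make the binary format concrete, the sketch below shows how a CoLA-style example pairs a sentence with a 0/1 acceptability label. The AcceptabilityExample class and the two Russian sentences are our own toy illustrations, not items drawn from RuCoLA.

```python
from dataclasses import dataclass

@dataclass
class AcceptabilityExample:
    sentence: str
    label: int  # 1 = acceptable, 0 = unacceptable

# Toy illustrations of the binary format (not corpus items):
examples = [
    # "Mom washed the frame." -- well-formed
    AcceptabilityExample("Мама мыла раму.", 1),
    # Infinitive in place of a finite past-tense verb -- ill-formed
    AcceptabilityExample("Мама мыть раму вчера.", 0),
]
```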
Applications of Acceptability Acceptability judgments have been broadly applied in NLP. In particular, they are used to test LMs' robustness (Yin et al., 2020) and probe their acquisition of grammatical phenomena (Warstadt and Bowman, 2019; Choshen et al., 2022; Zhang et al., 2021). LA has also stimulated the development of acceptability measures based on pseudo-perplexity (Lau et al., 2020), which correlate well with human judgments (Lau et al., 2017) and are beneficial for scoring generated hypotheses in downstream tasks (Salazar et al., 2020). Other applications include evaluating the grammatical and semantic correctness of language generation (Kane et al., 2020; Harkous et al., 2020; Bakshi et al., 2021; Batra et al., 2021).
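As an illustration of such measures, the following is a minimal sketch of pseudo-log-likelihood scoring in the spirit of Salazar et al. (2020): each token is masked in turn and scored by a masked LM, and pseudo-perplexity is the exponentiated negative per-token average. The HuggingFace transformers API and the multilingual BERT checkpoint are our assumptions for illustration, not the exact setup of the cited work.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint for illustration; any masked LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log P(token | all other tokens), masking one position at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for pos in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        total += torch.log_softmax(logits, dim=-1)[ids[pos]].item()
    return total

def pseudo_perplexity(sentence: str) -> float:
    """Lower is better: exp of the negative per-token pseudo-log-likelihood."""
    n = len(tokenizer(sentence)["input_ids"]) - 2  # real (non-special) tokens
    return math.exp(-pseudo_log_likelihood(sentence) / n)
```

An unsupervised acceptability judgment can then be obtained by thresholding this score, possibly after normalizing for sentence length and word frequency as proposed by Lau et al. (2017).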
2 Both RuCoLA and the code of our experiments are available at github.com/RussianNLP/RuCoLA
Source                   Size     %    Content
rusgram                   563   49.7   Corpus grammar
Testelets (2001)         1335   73.9   General syntax
Lutikova (2010)           193   75.6   Syntactic structures
Mitrenina et al. (2017)    54   57.4   Generative grammar
Paducheva (2010)         1308   84.3   Semantics of tense
Paducheva (2004)         1374   90.8   Lexical semantics
Paducheva (2013)         1462   89.5   Aspects of negation
Seliverstova (2004)      2104   80.8   Semantics
Shavrina et al. (2020)   1444   36.6   Grammar exam tasks
In-domain                9837   74.5
Machine Translation      1286   72.8   English translations
Paraphrase Generation    2322   59.9   Automatic paraphrases
Out-of-domain            3608   64.6
Total                   13445   71.8

Table 2: RuCoLA statistics by source. The number of in-domain sentences is similar to that of CoLA and ItaCoLA. % = percentage of acceptable sentences.
2.2 Evaluation of Text Generation
Machine translation (MT) is one of the first subfields to establish diagnostic evaluation of neural models (Dong et al., 2021). Diagnostic datasets can be constructed through automatic generation of contrastive pairs (Burlot and Yvon, 2017), crowdsourced annotation of generated sentences (Lau et al., 2014), and native speaker data (Anastasopoulos, 2019). Various phenomena have been analyzed, including morphology (Burlot et al., 2018), syntactic properties (Sennrich, 2017; Wei et al., 2018), commonsense (He et al., 2020), anaphoric pronouns (Guillou et al., 2018), and cohesion (Bawden et al., 2018).
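As a minimal sketch of the contrastive-pair methodology (in the spirit of Burlot and Yvon, 2017, not their exact setup), the snippet below checks whether an MT model assigns a higher probability to a correct translation than to a minimally corrupted one that violates subject-verb agreement. The checkpoint and the example pair are our own assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint for illustration; any encoder-decoder MT model works.
tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ru")
mt = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-ru")
mt.eval()

def target_log_prob(source: str, target: str) -> float:
    """Log-probability the model assigns to `target` given `source`."""
    batch = tok(source, text_target=target, return_tensors="pt")
    with torch.no_grad():
        out = mt(input_ids=batch["input_ids"], labels=batch["labels"])
    # `loss` is the mean per-token negative log-likelihood; undo the mean.
    return -out.loss.item() * batch["labels"].shape[1]

# A contrastive pair probing number agreement:
# "Кошки спят." (plural verb, correct) vs. "Кошки спит." (singular verb).
src = "The cats are sleeping."
print(target_log_prob(src, "Кошки спят.") > target_log_prob(src, "Кошки спит."))
# The model "passes" the pair if this prints True.
```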
Recent research has shifted towards overcoming limitations in language generation, such as copying inputs (Liu et al., 2021), distorting facts (Santhanam et al., 2021), and generating hallucinated content (Zhou et al., 2021). Maynez et al. (2020) and Liu et al. (2022) propose datasets for hallucination detection. SCARECROW (Dou et al., 2022) and TGEA (He et al., 2021) focus on taxonomies of text generation errors. Drawing inspiration from these works, we create the machine-generated out-of-domain set to foster the evaluation of text generation with acceptability.
3 RuCoLA
3.1 Design
RuCoLA consists of in-domain and out-of-domain
subsets, as outlined in Table 2. Below, we describe
the data collection procedures for each subset.